|  | Apache Spark | Smile |
|---|---|---|
| Mentions | 121 | 10 |
| Stars | 41,083 | 6,174 |
| Stars growth (month over month) | 0.6% | 0.4% |
| Activity | 10.0 | 9.9 |
| Latest commit | 5 days ago | 7 days ago |
| Language | Scala | Java |
| License | Apache License 2.0 | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Apache Spark
- Every Database Will Support Iceberg — Here's Why
Apache Iceberg defines a table format that separates how data is stored from how data is queried. Any engine that implements the Iceberg integration — Spark, Flink, Trino, DuckDB, Snowflake, RisingWave — can read and/or write Iceberg data directly.
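As a rough, spark-shell-style sketch of what that looks like from Spark's side (the catalog name `local`, the warehouse path, and the table name are illustrative assumptions, and the iceberg-spark-runtime package is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: reading an Iceberg table from Spark. Configures a Hadoop-style
// Iceberg catalog named "local"; names and paths are illustrative only.
val spark = SparkSession.builder()
  .appName("iceberg-read-sketch")
  .master("local[*]")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
  .getOrCreate()

// Because the table format is engine-neutral, Flink, Trino, or DuckDB
// could read the same files; here Spark does.
spark.table("local.db.events").show()
```

The separation the article describes is visible here: Spark only needs a catalog pointer, while the table layout on disk is owned by Iceberg, not by Spark.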
- How to Reduce Big Data Analytics Costs by 90% with Karpenter and Spark
Apache Spark powers large-scale data analytics and machine learning, but as workloads grow exponentially, traditional static resource allocation leads to 30–50% resource waste due to idle Executors and suboptimal instance selection.
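One of the levers behind numbers like that is Spark's dynamic allocation, which releases idle executors instead of holding a static fleet. A minimal sketch, with illustrative (untuned) values:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: scale executors with load rather than pinning static capacity.
// The min/max/timeout values below are illustrative, not recommendations.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Spark 3+: track shuffle files so executors can be released without
  // an external shuffle service.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()
```

On Kubernetes, a node autoscaler such as Karpenter can then reclaim the underlying instances once executors are released.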
- Apache Spark VS cocoindex - a user-suggested alternative
2 projects | 1 Apr 2025
- Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom
One of the key attributes of Apache License 2.0 is its flexible nature. Permitting use in both proprietary and open source environments, it has become the go-to choice for innovative projects ranging from the Apache HTTP Server to large-scale initiatives like Apache Spark and Hadoop. This flexibility is not solely legal; it is also philosophical. The license is designed to encourage transparency and maintain a healthy balance between freedom and accountability, ultimately making it easier for developers to adapt and contribute without restrictive legal barriers.

Another modern twist discussed in the article is the concept of dual licensing. Dual licensing can offer an attractive method for additional commercial exploitation while still upholding open source principles. However, as the article cautions, dual licensing involves legal intricacy and demands rigor in managing Contributor License Agreements (CLAs), a challenge that the open source community navigates with ongoing debates. For developers looking to understand similar innovative approaches to licensing, further information can be explored at License Token.
- The Application of Java Programming In Data Analysis and Artificial Intelligence
References cited in the article:
[1] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson, 2020.
[2] F. Chollet, Deep Learning with Python. Manning Publications, 2018.
[3] C. C. Aggarwal, Data Mining: The Textbook. Springer, 2015.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[5] Apache Software Foundation, "Apache Spark: Lightning-Fast Unified Analytics Engine," Available: https://spark.apache.org/.
[6] Java Community Process, "Java Machine Learning Libraries and Frameworks," Available: https://www.oracle.com/java/.
- Apache Spark: Revolutionizing Big Data with Sustainable Open Source Funding
Apache Spark isn’t just a framework for distributed data processing; it’s a rich ecosystem that includes libraries for machine learning, stream processing, and graph processing. A key aspect of Spark’s ecosystem is its reliance on community contributions. Developers from around the world collaborate on its GitHub repository, ensuring that Spark remains at the cutting edge of technology. The governance process, characterized by transparency and meritocracy, builds trust among contributors and sponsors alike.

An essential component of Apache Spark’s model is its use of the Apache 2.0 license. This permissive license not only shields contributors with patent protection but also allows enterprises to integrate Spark into proprietary systems without legal hurdles. The license enables a free flow of innovation: companies can both use and contribute to Spark’s codebase, leading to enhancements that benefit the entire community.

The funding mechanisms sustaining Apache Spark are as diverse as they are innovative. Corporate sponsorships play a significant role, with companies dedicating resources and finances to support ongoing development. Additionally, grant programs and community donations help maintain an ecosystem where improvements and new features are continuously shared with users worldwide. These sustainable funding practices ensure that Apache Spark can meet the demands of real-time analytics and high-volume data processing.
- Automating Enhanced Due Diligence in Regulated Applications
If you're designing an event-based pipeline, you can use a data streaming tool like Kafka to process data as it's collected by the pipeline. For a setup that already has data stored, you can use tools like Apache Spark to batch process and clean it before moving ahead with the pipeline.
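A minimal sketch of that batch clean-up step in Spark (the paths and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("batch-clean-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input: raw records already landed as Parquet.
val raw = spark.read.parquet("/data/raw/customers")

val cleaned = raw
  .dropDuplicates("customer_id")           // collapse duplicate records
  .na.drop(Seq("customer_id", "country"))  // require key fields to be present
  .filter(col("risk_score") >= 0)          // drop malformed scores

cleaned.write.mode("overwrite").parquet("/data/clean/customers")
```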
- Run PySpark Local Python Windows Notebook
PySpark is the Python API for Apache Spark, an open-source distributed computing system that enables fast, scalable data processing. PySpark allows Python developers to leverage the powerful capabilities of Spark for big data analytics, machine learning, and data engineering tasks without needing to delve into the complexities of Java or Scala.
- Infrastructure for data analysis with Jupyter, Cassandra, PySpark, and Docker
- His Startup Is Now Worth $62B. It Gave Away Its First Product Free
Smile
- Smile 4.0
- The Current State of Clojure's Machine Learning Ecosystem
> I don't think it's right to recommend that new users move away from the package because of licensing issues
I was going to chime in to agree, but then I saw how this was done - a completely innocuous-looking commit:
https://github.com/haifengl/smile/commit/6f22097b233a3436519...
And literally no mention in the release notes:
https://github.com/haifengl/smile/releases/tag/v3.0.0
I think if you are going to change a license, especially in a way that makes it less permissive, you need to be super open and clear about both the fact that you are doing it and your reasons for it. This was done so silently that it looks like an intentional attempt to mislead and trick people.
So maybe I wouldn't say to move away because of the specific license, but it's legitimate to avoid something when it's so clearly driven by a single entity and that entity acts in a way that isn't trustworthy.
- Need statistic test library for Spark Scala
Check out Smile too.
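For example, a two-sample t-test with Smile's hypothesis-testing package, sketched in Scala with made-up samples (this assumes the smile.stat.hypothesis API of Smile 2.x/3.x):

```scala
import smile.stat.hypothesis.TTest

// Sketch: two-sample t-test on toy measurements from two groups.
val a = Array(5.1, 4.9, 5.3, 5.0, 5.2)
val b = Array(5.8, 6.1, 5.9, 6.0, 5.7)

val result = TTest.test(a, b)
println(s"t = ${result.t}, p-value = ${result.pvalue}")
```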
- Just want to vent a bit
Although it may be a bit more work, you can do both machine learning and AI in Java. If you are doing deep learning, you can use DeepJavaLibrary (I do work on this one at Amazon). If you are looking for other ML algorithms, I have seen Smile, Tribuo, and some built around Spark.
- Anybody here using Java for machine learning?
For deploying a trained model there are a bunch of options that use Java on top of some native runtime, like TF-Java (which I co-lead) and ONNX Runtime; PyTorch has inference for TorchScript models. Training deep learning models is harder, though you can do it for some of them in DJL. Training more standard ML models is much simpler, either via Tribuo, or using things like LibSVM and XGBoost directly, or other libraries like Smile or Weka.
- What libraries do you use for machine learning and data visualizing in scala?
I use Smile (https://github.com/haifengl/smile) with Ammonite and it feels pretty easy/good to work with. Of course, for purely looking at data and exploration, you're not going to beat Python.
- Python VS Scala
Actually, it does. Scala has Spark for data science and some ML libs like Smile.
- [R] NLP Machine Learning with low RAM
I guess I must have a mistake somewhere. It's not much code; it's written in Kotlin with Smile. My dataset is only about 32MB. I load the dataset into memory, then use 80% of the data for training and the rest for later testing. I get just the columns I need and store them in the variable dataset.
- Kotlin with Random Forest Classifier
I've heard good things about Smile; it probably beats libs like Weka by far. I'm not sure you can load a scikit-learn model, though, so you might need to retrain the model in Kotlin.
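As a sketch of that retraining route, here is a random forest fit with Smile, written in Scala against Smile's Java API (the same calls work from Kotlin). The file name and label column are hypothetical, and the Smile 2.x/3.x API is assumed:

```scala
import org.apache.commons.csv.CSVFormat
import smile.classification.RandomForest
import smile.data.formula.Formula
import smile.io.Read

// Sketch: retrain in Smile instead of importing a scikit-learn model.
// "train.csv" and the "label" column are hypothetical.
val data  = Read.csv("train.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader())
val model = RandomForest.fit(Formula.lhs("label"), data)

// Per-feature importance accumulated over the ensemble.
println(model.importance.mkString(", "))
```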
- Machine learning on JVM
I was using Smile for some period - https://haifengl.github.io/ - it's quite a small and lightweight Java lib with some very basic algorithms; I was using it in particular for clustering. Along with this it provides a Scala API.
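A minimal clustering sketch with Smile's k-means, in Scala (the toy points and k = 2 are illustrative; this assumes the Smile 2.x/3.x API, which may differ in Smile 4):

```scala
import smile.clustering.KMeans

// Sketch: k-means over a toy 2-D dataset; a real job would pass its
// feature matrix. k = 2 is an arbitrary illustrative choice.
val points = Array(
  Array(1.0, 1.1), Array(0.9, 1.0), Array(1.2, 0.8),
  Array(8.0, 8.2), Array(8.1, 7.9), Array(7.8, 8.1)
)

val model = KMeans.fit(points, 2)
println(model.y.mkString(", "))  // cluster label assigned to each point
```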
What are some alternatives?
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Weka
Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL
Deeplearning4j - Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular and tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff, a PyTorch/TensorFlow-like library for running deep learning models.
Scalding - A Scala API for Cascading
Breeze - Breeze is/was a numerical processing library for Scala.