Top 23 Spark Open-Source Projects
-
Project mention: is anyone want to join maintaining spark java framework? | reddit.com/r/java | 2022-06-21
Wow, this has nothing to do with Apache Spark (https://spark.apache.org/), the wildly popular JVM-based data processing framework.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
Redash
Make Your Company Data Driven. Connect to any data source, then easily visualize, dashboard, and share your data.
Project mention: Show HN: DataStation – App to easily query, script, and visualize data | news.ycombinator.com | 2022-05-31
How does it compare to Redash (now Databricks SQL): https://github.com/getredash/redash?
-
cube.js
const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

// Build one `max` measure per column name.
const measures = measureNames.reduce(
  (result, sqlName) => ({
    ...result,
    [sqlName]: { sql: () => sqlName, type: `max` },
  }),
  {}
);

cube('errorpercentiles', {
  sql: `with sagemaker as (
      select
        model_name,
        variant_name,
        cast(json_extract(FROM_UTF8(from_base64(capturedata.endpointinput.data)),
          '$.correlation_id') as varchar) as correlation_id,
        cast(json_extract(FROM_UTF8(from_base64(capturedata.endpointoutput.data)),
          '$.prediction') as double) as prediction
      from s3.sagemaker_logs.logs
    ),
    actual as (
      select correlation_id, actual_value
      from postgresql.public.actual_values
    ),
    logs as (
      select
        model_name,
        variant_name as model_variant,
        sagemaker.correlation_id,
        prediction,
        actual_value as actual
      from sagemaker
      left outer join actual
        on sagemaker.correlation_id = actual.correlation_id
    ),
    errors as (
      select abs(prediction - actual) as abs_err, model_name, model_variant
      from logs
    ),
    percentiles as (
      select
        approx_percentile(abs_err, 0.1) as perc_10,
        approx_percentile(abs_err, 0.2) as perc_20,
        approx_percentile(abs_err, 0.3) as perc_30,
        approx_percentile(abs_err, 0.4) as perc_40,
        approx_percentile(abs_err, 0.5) as perc_50,
        approx_percentile(abs_err, 0.6) as perc_60,
        approx_percentile(abs_err, 0.7) as perc_70,
        approx_percentile(abs_err, 0.8) as perc_80,
        approx_percentile(abs_err, 0.9) as perc_90,
        approx_percentile(abs_err, 1.0) as perc_100,
        model_name,
        model_variant
      from errors
      group by model_name, model_variant
    )
    select
      count(*) filter (where e.abs_err <= perc_10) as perc_10,
      max(perc_10) as perc_10_value,
      count(*) filter (where e.abs_err > perc_10 and e.abs_err <= perc_20) as perc_20,
      max(perc_20) as perc_20_value,
      count(*) filter (where e.abs_err > perc_20 and e.abs_err <= perc_30) as perc_30,
      max(perc_30) as perc_30_value,
      count(*) filter (where e.abs_err > perc_30 and e.abs_err <= perc_40) as perc_40,
      max(perc_40) as perc_40_value,
      count(*) filter (where e.abs_err > perc_40 and e.abs_err <= perc_50) as perc_50,
      max(perc_50) as perc_50_value,
      count(*) filter (where e.abs_err > perc_50 and e.abs_err <= perc_60) as perc_60,
      max(perc_60) as perc_60_value,
      count(*) filter (where e.abs_err > perc_60 and e.abs_err <= perc_70) as perc_70,
      max(perc_70) as perc_70_value,
      count(*) filter (where e.abs_err > perc_70 and e.abs_err <= perc_80) as perc_80,
      max(perc_80) as perc_80_value,
      count(*) filter (where e.abs_err > perc_80 and e.abs_err <= perc_90) as perc_90,
      max(perc_90) as perc_90_value,
      count(*) filter (where e.abs_err > perc_90 and e.abs_err <= perc_100) as perc_100,
      max(perc_100) as perc_100_value,
      p.model_name,
      p.model_variant
    from percentiles p, errors e
    group by p.model_name, p.model_variant`,

  preAggregations: {
    // Pre-aggregation definitions go here.
    // Learn more: https://cube.dev/docs/caching/pre-aggregations/getting-started
  },

  joins: {},

  measures: measures,

  dimensions: {
    modelVariant: { sql: `model_variant`, type: 'string' },
    modelName: { sql: `model_name`, type: 'string' },
  },
});
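For intuition, the bucketing that this SQL performs can be sketched in plain Python: compute decile cut points over the absolute errors, then count how many errors fall between consecutive cuts. This is only an illustration with made-up data; `statistics.quantiles` is a rough stand-in for Trino's `approx_percentile`.

```python
# Sketch of the cube's decile bucketing over absolute prediction errors.
from statistics import quantiles

abs_errors = [0.5, 1.2, 0.3, 2.8, 0.9, 1.7, 0.1, 3.4, 2.1, 0.6]

# Nine decile cut points, analogous to approx_percentile(abs_err, 0.1 .. 0.9);
# the 100th percentile is simply the maximum error.
cuts = quantiles(abs_errors, n=10) + [max(abs_errors)]

buckets = {}
lower = float('-inf')
for i, upper in enumerate(cuts, start=1):
    name = f'perc_{i * 10}'
    # count(*) FILTER (WHERE abs_err > lower AND abs_err <= upper)
    buckets[name] = sum(1 for e in abs_errors if lower < e <= upper)
    buckets[f'{name}_value'] = upper
    lower = upper
```

Every error lands in exactly one bucket, so the per-bucket counts sum back to the number of samples, which is what the `FILTER` clauses in the outer `SELECT` rely on.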
-
horovod
Project mention: Anyone know of any papers or models for segmenting satellite images of a city into things like roads, buildings, parks, etc? | reddit.com/r/MLQuestions | 2022-04-25
Training is not the same as inference (doing the segmentation), so that scale is probably off by a lot, perhaps one or two orders of magnitude depending on the hardware you're running on, and your training and eval datasets would be several orders of magnitude smaller. FAANGs would parallelize that training as well (I don't remember whether UNet is inherently parallelizable for training) via their internal equivalent of Horovod, so they can do a GPU-month's worth of training in less than a day.
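The data-parallel idea behind Horovod can be illustrated with a toy sketch (this is plain Python, not Horovod's API): each worker computes gradients on its own data shard, and an allreduce averages them so every worker applies the same synchronized update.

```python
# Toy illustration of allreduce-style gradient averaging.
def allreduce_mean(per_worker_grads):
    """Average gradients element-wise across workers."""
    n_workers = len(per_worker_grads)
    return [sum(g) / n_workers for g in zip(*per_worker_grads)]

# Four hypothetical workers, each holding gradients for the same three parameters.
worker_grads = [
    [0.2, -0.4, 0.1],
    [0.4, -0.2, 0.3],
    [0.0, -0.6, 0.1],
    [0.2, -0.4, 0.1],
]
avg = allreduce_mean(worker_grads)  # one shared update for all workers
```

In real Horovod the averaging runs as a ring-allreduce over GPUs, but the contract is the same: after the step, every worker holds identical averaged gradients.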
-
Deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular, tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff, a PyTorch/TensorFlow-like library for running deep learning with automatic differentiation.
DL4J
-
ds-cheatsheets
Project mention: ⚙️ Data Science Cheat Sheets: A collection of cheat sheets for #DataScience and problem solving. h/t @Sauain | reddit.com/r/policerewired | 2021-10-01
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
Something like this at least is the most direct answer to your question, as opposed to "you're doing it wrong" which unfortunately seems to be more upvoted. An example of something like this might be https://github.com/donnemartin/dev-setup
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Have you tried Apache Zeppelin? I remember that you can pretty-print Spark dataframes directly in it with z.show(df).
-
Alluxio (formerly Tachyon)
Alluxio, data orchestration for analytics and machine learning in the cloud
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python. (by delta-io)
Project mention: Databricks platform for small data, is it worth it? | reddit.com/r/dataengineering | 2022-06-29
Currently the infrastructure we have is some custom-made pipelines that load the data onto S3, and I use Delta Tables here and there for their convenience: ACID, time travel, merges, CDC, etc.
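The time travel mentioned here rests on Delta's append-only transaction log: every write commits a new table version, and reading "as of" an earlier version replays the log only up to that point. A toy sketch of the idea (plain Python, not the Delta API):

```python
# Minimal illustration of log-based time travel.
class VersionedTable:
    def __init__(self):
        self._log = []  # append-only list of committed row batches

    def commit(self, rows):
        """Append a new version and return its version number."""
        self._log.append(list(rows))
        return len(self._log) - 1

    def read(self, as_of=None):
        """Return all rows visible at version `as_of` (latest when None)."""
        end = len(self._log) if as_of is None else as_of + 1
        return [row for commit in self._log[:end] for row in commit]

t = VersionedTable()
v0 = t.commit([{'id': 1}])
v1 = t.commit([{'id': 2}])
```

In Delta itself the equivalent read is expressed with a reader option such as `versionAsOf`; the sketch only shows why old versions stay queryable: commits are never rewritten in place.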
-
TensorFlowOnSpark
Currently I use the TensorFlowOnSpark framework to train and predict with my model. At prediction time, I have billions of samples to score, which is time-consuming. I wonder if there are good practices for this.
-
SynapseML
Project mention: [P] Microsoft releases SynapseML v0.9.5 with support for speech synthesis, anomaly detection, and geospatial analytics on large-scale data | reddit.com/r/MachineLearning | 2022-03-08
Link to Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v0.9.5
-
spark-nlp
Project mention: Spark-NLP 4.0.0 🚀: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ SOTA models | reddit.com/r/apachespark | 2022-06-15
-
deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Project mention: Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines? | reddit.com/r/dataengineering | 2022-06-28
GE (Great Expectations) is arguably the most well-known OSS alternative to Soda Core. The third option is deequ, originally developed and released as OSS by AWS. Our community has told us that Soda Core is different because it's easy to get going and embed into data pipelines, and it also allows some of the check-authoring work to be moved to other members of the data team. I'm sure there are also scenarios where Soda Core is not the best option, for example when you only use Pandas dataframes or develop in Scala.
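The "unit tests for data" idea that deequ (and tools like Soda Core) implement can be sketched in a few lines of plain Python; this is an illustration of the pattern, not either library's API, and the dataset and check names are made up: declare named checks against a dataset and collect pass/fail results instead of asserting mid-pipeline.

```python
# Declarative data-quality checks over a small in-memory dataset.
def run_checks(rows, checks):
    """Run each named predicate over the whole dataset, returning pass/fail."""
    return {name: check(rows) for name, check in checks.items()}

orders = [
    {'order_id': 1, 'amount': 25.0},
    {'order_id': 2, 'amount': 99.5},
    {'order_id': 3, 'amount': 10.0},
]

results = run_checks(orders, {
    'is_complete(order_id)': lambda rs: all(r['order_id'] is not None for r in rs),
    'is_unique(order_id)': lambda rs: len({r['order_id'] for r in rs}) == len(rs),
    'is_non_negative(amount)': lambda rs: all(r['amount'] >= 0 for r in rs),
})
```

In deequ the same checks would run as Spark aggregations over a large DataFrame (completeness, uniqueness, range constraints), so the dataset never has to fit in memory; the declarative shape is the point.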
-
Quill
I think Quill is the closest to your request: https://github.com/zio/zio-quill
Spark-related posts
- Databricks platform for small data, is it worth it?
- Delta Lake 2.0.0 Preview
- Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines?
- Project Nessie: Transactional Catalog for Data Lakes with Git-Like Semantics
- Does anyone actually use ML.NET?
- Introduction to The World of Data - (OLTP, OLAP, Data Warehouses, Data Lakes and more)
- Data point versioning infrastructure for time traveling to a precise point in time?
Index
What are some of the best open-source Spark projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Spark | 33,221 |
2 | data-science-ipython-notebooks | 23,245 |
3 | Redash | 21,321 |
4 | cube.js | 13,251 |
5 | horovod | 12,530 |
6 | Deeplearning4j | 12,509 |
7 | ds-cheatsheets | 10,649 |
8 | H2O | 5,869 |
9 | dev-setup | 5,754 |
10 | Zeppelin | 5,717 |
11 | Alluxio (formerly Tachyon) | 5,700 |
12 | delta | 4,473 |
13 | BigDL | 3,959 |
14 | TensorFlowOnSpark | 3,799 |
15 | SynapseML | 3,340 |
16 | HELK | 3,231 |
17 | koalas | 3,143 |
18 | Spark Notebook | 3,124 |
19 | spark-nlp | 2,785 |
20 | dpark | 2,679 |
21 | RoaringBitmap | 2,653 |
22 | deequ | 2,302 |
23 | Quill | 2,059 |