|1 day ago||8 days ago|
|Apache License 2.0||GNU General Public License v3.0 or later|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
What do I need to know about distributed algorithms and systems?
1 project | reddit.com/r/AskProgramming | 22 May 2022
You generally want to keep your data in memory, rather than disk, to keep things reasonably fast. A system like Apache Spark tries to do this for you, spilling to disk when needed. In general, I'd recommend researching Spark, since it will cover a lot of the concepts you care about.
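The "keep it in memory, spill to disk when needed" idea above can be sketched in plain Python. This is only a toy illustration, not how Spark implements spilling; the `SpillingBuffer` class and its `max_in_memory` parameter are hypothetical names invented for this sketch.

```python
import pickle
import tempfile


class SpillingBuffer:
    """Toy buffer: keeps items in memory and spills full batches to disk.

    Hypothetical illustration only; Spark's real spilling logic is far
    more involved.
    """

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.in_memory = []
        self.spill_files = []  # one temporary file per spilled batch

    def append(self, item):
        self.in_memory.append(item)
        # Once the in-memory batch is full, move it to disk.
        if len(self.in_memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        spill_file = tempfile.TemporaryFile()
        pickle.dump(self.in_memory, spill_file)
        self.spill_files.append(spill_file)
        self.in_memory = []

    def __iter__(self):
        # Read spilled batches back in insertion order, then the live batch.
        for spill_file in self.spill_files:
            spill_file.seek(0)
            yield from pickle.load(spill_file)
        yield from self.in_memory


buffer = SpillingBuffer(max_in_memory=3)
for value in range(10):
    buffer.append(value)
items = list(buffer)
```

Reading from the spilled files is slower than reading the in-memory list, which is why a larger memory budget generally means faster processing.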
How to use Spark and Pandas to prepare big data
3 projects | dev.to | 10 May 2022
Apache Spark is one of the most actively developed open-source projects in big data. The following code examples require that you have Spark set up and can execute Python code using the PySpark library. The examples also require that you have your data in Amazon S3 (Simple Storage Service). All this is set up on AWS EMR (Elastic MapReduce).
AWS Glue: what is it and how does it work?
1 project | dev.to | 5 May 2022
With Glue, Apache Spark runs in the background. But if this is the first time you’ve heard of the popular open-source analytics engine, it may take you a while to familiarize yourself with the cloud software.
Real-time Open Source Indexes: Databases, Headless CMSs and Static Site Generators
7 projects | dev.to | 4 May 2022
Spark SQL (302 active contributors).
Top Responsibilities of a Data Engineering Manager
1 project | reddit.com/r/dataengineering | 2 May 2022
What’s more, the right choice of technology is always evolving. New tools come out all the time, often with different functionality than existing tools. So it’s important that you stay up-to-date on what technologies are available and their latest features. For example, four years ago Apache Spark was completely unknown, but today it is quickly becoming the de facto standard for stream processing.
Apache Spark, Hive, and Spring Boot — Testing Guide
6 projects | dev.to | 22 Apr 2022
In this article, I'm showing you how to create a Spring Boot app that loads data from Apache Hive via Apache Spark to the Aerospike Database. More than that, I'm giving you a recipe for writing integration tests for such scenarios that can be run either locally or during the CI pipeline execution. The code examples are taken from this repository.
Cannot find col function in pyspark
1 project | reddit.com/r/codehunter | 22 Apr 2022
I use from pyspark.sql.functions import col, but when I look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?
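One likely explanation, assuming the poster was reading an older PySpark release, is that such functions were not written out with def but generated in a loop and placed into the module's namespace at import time. The sketch below shows that general Python technique in isolation; the `_functions` table and `_create_function` helper here are simplified stand-ins, and the generated functions only build strings rather than real Spark columns.

```python
# Hypothetical stand-in for a table mapping function names to docstrings.
_functions = {
    "col": "Returns a column reference based on the given name.",
    "lit": "Wraps a literal value.",
}


def _create_function(name, doc):
    """Build a function at runtime; here it just formats a string."""
    def _(value):
        return f"{name}({value!r})"
    _.__name__ = name
    _.__doc__ = doc
    return _


# Register every generated function as a module-level name.
# No "def col" appears anywhere in this file, yet col exists after the loop.
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

result = col("age")  # noqa: F821 (defined dynamically above)
```

Because the names are created while the module runs, a text search for "def col" finds nothing, and static tools may flag the import even though it works at runtime.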
How To Start Your Next Data Engineering Project
6 projects | dev.to | 16 Apr 2022
Big Data Processing, EMR with Spark and Hadoop | Python, PySpark
2 projects | dev.to | 27 Mar 2022
Apache Spark is an open-source, distributed processing system used for big data workloads. Want to dig deeper?
1 project | reddit.com/r/196 | 24 Mar 2022
What libraries do you use for machine learning and data visualizing in scala?
5 projects | reddit.com/r/scala | 27 Nov 2021
I use smile https://github.com/haifengl/smile with ammonite and it feels pretty easy/good to work with. Of course for pure looking at data, and exploration, you're not going to beat python.
Python VS Scala
2 projects | reddit.com/r/scala | 2 Jul 2021
Actually, it does. Scala has Spark for data science and some ML libs like Smile.
[R] NLP Machine Learning with low RAM
1 project | reddit.com/r/MachineLearning | 2 Jun 2021
I guess I must have a mistake somewhere. It's not much code. It's written in Kotlin with Smile. My dataset is only about 32MB. I load the dataset into memory. I then use 80% of the data for training and the other 20% for later testing. I get just the columns I need and store them in the variable dataset.
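The 80/20 split described above can be sketched as follows. This is a plain-Python illustration, not the poster's Kotlin code or a Smile API; the `train_test_split` helper, its `train_fraction` and `seed` parameters, and the sample `rows` are all assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # placeholder data standing in for dataset rows
train, test = train_test_split(rows)
```

Note that the shuffled copy temporarily doubles the memory held for the rows, which is negligible for a 32MB dataset but worth keeping in mind when RAM is tight.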
Kotlin with Random Forest Classifier
1 project | reddit.com/r/Kotlin | 19 Apr 2021
I've heard good things about Smile, probably beats libs like Weka by far. I'm not sure if you can load a scikit-learn model though, so you might need to retrain the model in Kotlin.
Machine learning on JVM
6 projects | reddit.com/r/scala | 5 Apr 2021
I used Smile for a while (https://haifengl.github.io/). It's a fairly small, lightweight Java library with some very basic algorithms; I used it for clustering in particular. It also provides a Scala API.
What are some alternatives?
Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Scalding - A Scala API for Cascading
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Deeplearning4j - Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation.
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Apache Calcite - Apache Calcite
Scio - A Scala API for Apache Beam and Google Cloud Dataflow.