|almost 3 years ago||about 14 hours ago|
|GNU General Public License v3.0 only||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
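As a rough illustration of how a relative score like this can be read, the sketch below maps a raw activity value to a 0-10 score by percentile rank. The scoring scheme, the function name `activity_score`, and the sample values are all assumptions made for illustration; the site's actual formula (including how recent commits are weighted) is not given here.

```python
from bisect import bisect_left


def activity_score(raw_value, all_raw_values):
    """Map a raw activity value to a 0-10 score by percentile rank.

    Assumed scheme for illustration only, not the site's real formula.
    """
    ranked = sorted(all_raw_values)
    # Number of tracked projects that are strictly less active.
    below = bisect_left(ranked, raw_value)
    return round(10 * below / len(ranked), 1)


# Hypothetical raw activity values for ten tracked projects.
tracked = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(activity_score(10, tracked))  # 9.0: nine of the ten projects are less active
```

Under this assumed scheme, a score of 9.0 means 90% of tracked projects are less active, which matches the "top 10%" reading above.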
We haven't tracked posts mentioning hypothesis-testing yet.
Tracking mentions began in Dec 2020.
Integrate Pyspark Structured Streaming with confluent-kafka
2 projects | dev.to | 12 Aug 2023
Apache Spark - https://spark.apache.org/
Rest in Peas: The Unrecognized Death of Speech Recognition (2010)
4 projects | news.ycombinator.com | 4 May 2023
Gotta write this on my resume
2 projects | /r/ProgrammerHumor | 2 Apr 2023
So, for example, contributing to, say, Spark may be better for experience (and resume) than Twitter-the algorithm.
Query Real Time Data in Kafka Using SQL
7 projects | dev.to | 23 Mar 2023
Additionally, one of the challenges of working with Kafka is how to efficiently analyze and extract insights from the large volumes of data stored in Kafka topics. Traditional batch processing approaches, such as Hadoop MapReduce or Apache Spark, can be slow and expensive, and may not be suitable for real-time analytics. To address this challenge, you can use SQL queries with Kafka to analyze and extract insights from the data in real time.
Unveiling the Analytics Industry in Bangalore
3 projects | /r/u_Khushisondhi7 | 23 Mar 2023
Apache Iceberg as storage for on-premise data store (cluster)
3 projects | /r/dataengineering | 16 Mar 2023
Spark for your transformation compute engine. Get Spark to talk to Nessie.
5 Best Practices For Data Integration To Boost ROI And Efficiency
3 projects | /r/ReviewNPrep | 12 Mar 2023
There are different ways to implement parallel dataflows, such as using parallel data processing frameworks like Apache Hadoop, Apache Spark, and Apache Flink, or using cloud-based services like Amazon EMR and Google Cloud Dataflow. It is also possible to use parallel dataflow frameworks to handle big data and distributed computing, like Apache NiFi and Apache Kafka.
Forward Compatible Enum Values in API with Java Jackson
5 projects | dev.to | 11 Feb 2023
We’re not discussing the technical details behind the deduplication process. It could be Apache Flink, Apache Spark, or Kafka Streams. Anyway, it’s out of the scope of this article.
Uber Interview Experience/Asking Suggestions
4 projects | /r/dataengineering | 1 Feb 2023
One place to look is the projects' repos and docs. Once you have a good idea of how the system is architected, poking around pieces of the codebase can help you really understand their internals. I personally enjoy going through the Spark repo and the Trino repo; the documentation for both projects is decent and can answer many of your questions.
DataOps 101: An Introduction to the Essential Approach of Data Management Operations and Observability
3 projects | dev.to | 22 Jan 2023
DataOps is a collaborative effort within an organization, with many different teams working together to ensure that DataOps functions properly and delivers data value. So, before the data is delivered to end users, it is subjected to a number of treatments and refinements from multiple teams. Data scientists first use techniques such as machine learning and deep learning to build models using software stacks such as Python or R and tools such as Spark or TensorFlow, among others. The models are then transferred to data engineers, who collect and manage the data used to train and evaluate these models, while data developers and data architects create complete applications that include the models. The data governance team then implements data access controls for training and benchmarking purposes, while the operations team ("Ops") is in charge of putting everything together and making it available to end users.
What are some alternatives?
Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Scalding - A Scala API for Cascading
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Smile - Statistical Machine Intelligence & Learning Engine
Apache Calcite - Apache Calcite
Scio - A Scala API for Apache Beam and Google Cloud Dataflow.
Deeplearning4j - Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.