|7 days ago||4 days ago|
|Apache License 2.0||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
The Evolution of the Data Engineer Role
2 projects | news.ycombinator.com | 24 Oct 2022
FACT table (sale $, order quantity, order ID, product ID)
customer, account_types etc. are dimensions to filter your low-level transactional data. The schema looks like a snowflake when you add enough dimensions, hence the name.
The FACT table makes "measures" available to the user. Example: Count of Orders. These are based on the values in the FACT table (your big table of IDs that link to dimensions and low-level transactional data).
You can then slice and dice your count of orders by fields in the dimensions.
You could then add Sum of Sale ($) as an additional measure. "Abstract" measures like Average Sale ($) per Order can also be added in the OLAP backend engine.
End users will often be using Excel or Tableau to create their own dashboards / graphs / reports. This pattern makes sense in that case --> user can explore the heavily structured business data according to all the pre-existing business rules.
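The fact/dimension pattern above can be sketched with an in-memory SQLite database. This is only an illustration: the table names (`fact_sales`, `dim_customer`), the `customer_id` key and the sample rows are assumptions, not part of the original comment.

```python
import sqlite3

# Hypothetical star schema: one fact table plus one dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, account_type TEXT);
CREATE TABLE fact_sales (
    order_id INTEGER, product_id INTEGER, customer_id INTEGER,
    order_quantity INTEGER, sale_amount REAL
);
INSERT INTO dim_customer VALUES (1, 'business'), (2, 'personal');
INSERT INTO fact_sales VALUES
    (100, 7, 1, 2, 50.0),
    (101, 8, 1, 1, 30.0),
    (102, 7, 2, 3, 20.0);
""")

# Measures (count of orders, sum of sale, average sale per order),
# sliced by a field that lives in the dimension table.
measures = conn.execute("""
SELECT d.account_type,
       COUNT(DISTINCT f.order_id) AS count_of_orders,
       SUM(f.sale_amount) AS sum_of_sale,
       SUM(f.sale_amount) / COUNT(DISTINCT f.order_id) AS avg_sale_per_order
FROM fact_sales AS f
JOIN dim_customer AS d ON d.customer_id = f.customer_id
GROUP BY d.account_type
ORDER BY d.account_type
""").fetchall()

for row in measures:
    print(row)
```

A pivot table in Excel or a Tableau sheet is effectively issuing this kind of grouped query for the end user.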
Pros:
- Great for enterprise businesses with existing application databases
- Highly structured, with transaction support (ACID compliance)
- Ease of use for end business user (create a new pivot table in Excel)
- Easy to query (basically a bunch of SQL queries)
- Encapsulates all your business rules in one place -- a.k.a. single source of truth.
Cons:
- Massive start up cost (have to work out the schema before you even write any code)
- Slow to change (imagine if the raw transaction amounts suddenly changed to £ after a certain date!)
- Massive nightly ETL jobs (these break fairly often)
- Usually proprietary tooling / storage (think MS SQL Server)
2. Data Lake
Throw everything into an S3 bucket. Database table? Throw it into the S3 bucket. Image data? Throw it into the S3 bucket. Kitchen sink? Throw it into the S3 bucket.
Process your data when you're ready to process it. Read in your data from S3, process it, write back to S3 as an "output file" for downstream consumption.
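A rough sketch of that read, process, write-back loop. To keep it runnable offline, a local temporary directory stands in for the S3 bucket; the key names (`raw/orders.csv`, `output/order_summary.csv`) and the `process` helper are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Assumption: a temp directory plays the role of the S3 bucket.
bucket = Path(tempfile.mkdtemp())

# "Throw it into the bucket": land a raw file under a key-like path.
raw_key = bucket / "raw" / "orders.csv"
raw_key.parent.mkdir(parents=True)
raw_key.write_text("order_id,sale\n100,50.0\n101,30.0\n")


def process(in_path, out_path):
    """Read a raw file, summarise it, and write an output file for downstream use."""
    with in_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    total = sum(float(r["sale"]) for r in rows)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_count", "total_sale"])
        writer.writerow([len(rows), total])
    return out_path


# Process only when you are ready to, then write back to the "bucket".
out_key = process(raw_key, bucket / "output" / "order_summary.csv")
print(out_key.read_text())
```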
Pros:
- Easy to set up
- Fairly simple and standardised i/o (S3 apis work with pandas and pyspark dataframes etc)
- Can store data remotely until ready to process it
- Highly flexible as mostly unstructured (create new S3 keys -- a.k.a. directories -- on the fly)
- Cheap storage
Cons:
- Doesn't scale -- turns into a "data swamp"
- Not always ACID compliant (looking at you Digital Ocean)
- Very easy to duplicate data
3. Data Lakehouse
Essentially a data lake with some additional bits.
A. Delta Lake Storage Format a.k.a. Delta Tables
Versioned files acting like versioned tables. Writing to a file will create a new version of the file, with previous versions stored for a set number of updates. Appending to the file creates a new version of the file in the same way (e.g. add a new order streamed in from the ordering application).
Every file -- a.k.a. delta table -- becomes ACID compliant. You can rollback the table to last week and replay e.g. because change X caused bug Y to happen.
AWS does allow you to do this, but it was a right ol' pain in the arse whenever I had to deal with massively partitioned parquet files. Delta Lake makes versioning the outputs much easier and it is much easier to rollback.
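The versioning behaviour described above can be illustrated with a toy in-memory table. This is not how Delta Lake is implemented (the real format keeps a transaction log alongside Parquet files) and `VersionedTable` is a hypothetical class; it only shows the idea that every write produces a new readable version and that a rollback restores an old one.

```python
class VersionedTable:
    """Toy table where every write creates a new version (illustration only)."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._versions) - 1

    def append(self, rows):
        # Appending never mutates an old snapshot; it adds a new one.
        self._versions.append(self._versions[-1] + list(rows))
        return self.version

    def read(self, version=None):
        # Read the latest snapshot, or "time travel" to an older one.
        snapshot = self._versions[-1 if version is None else version]
        return list(snapshot)

    def rollback(self, version):
        # Restore an old snapshot as a new version so history is kept.
        self._versions.append(list(self._versions[version]))
        return self.version


orders = VersionedTable()
v1 = orders.append([{"order_id": 100, "sale": 50.0}])
v2 = orders.append([{"order_id": 101, "sale": -999.0}])  # change X caused bug Y

orders.rollback(v1)      # go back to the last good version
print(orders.read())     # latest data no longer contains the bad row
print(orders.read(v2))   # the bad version is still there to inspect
```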
B. Data Storage Layout
Enforce a schema based on processing stages to get some performance & data governance benefits.
Example processing stage schema: DATA IN -> EXTRACT -> TRANSFORM -> AGGREGATE -> REPORTABLE
Or the "medallion" schema: Bronze -> Silver -> Gold.
Write out the data at each processing stage to a delta lake table/file. You can now query 5x data sources instead of 2x. The table's rarity indicates the degree of "data enrichment" you have performed -- i.e. how useful you have made the data. Want to update the codebase for the AGGREGATE stage? Just rerun from the TRANSFORM table (rather than running it all from scratch). This also acts as a caching layer. In a Data Warehouse, the entire query needs to be run from scratch each time you change a field. Here, you could just deliver the REPORTABLE tables as artefacts whenever you change them.
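A minimal sketch of the "rerun from the previous stage" idea, with a dict standing in for the stored tables. The stage functions, the sample rows and the `run` helper are all hypothetical; a real setup would read and write tables rather than Python objects.

```python
STAGES = ["EXTRACT", "TRANSFORM", "AGGREGATE", "REPORTABLE"]


def extract(raw):
    return [line.split(",") for line in raw]


def transform(rows):
    return [(name, float(amount)) for name, amount in rows]


def aggregate(rows):
    totals = {}
    for name, amount in rows:
        totals[name] = totals.get(name, 0.0) + amount
    return totals


def reportable(totals):
    return [f"{name}: {total:.2f}" for name, total in sorted(totals.items())]


steps = {"EXTRACT": extract, "TRANSFORM": transform,
         "AGGREGATE": aggregate, "REPORTABLE": reportable}

# Each stage's output is persisted here, so it doubles as a cache.
tables = {"DATA IN": ["widgets,10.0", "gadgets,5.5", "widgets,2.5"]}
calls = []  # records which stages actually ran


def run(from_stage="EXTRACT"):
    start = STAGES.index(from_stage)
    for i in range(start, len(STAGES)):
        previous = "DATA IN" if i == 0 else STAGES[i - 1]
        tables[STAGES[i]] = steps[STAGES[i]](tables[previous])
        calls.append(STAGES[i])
    return tables["REPORTABLE"]


first_report = run()  # full run from scratch


def aggregate_max(rows):
    # Updated AGGREGATE logic: largest single amount instead of the sum.
    best = {}
    for name, amount in rows:
        best[name] = max(best.get(name, amount), amount)
    return best


steps["AGGREGATE"] = aggregate_max
second_report = run(from_stage="AGGREGATE")  # reuses the stored TRANSFORM table
print(first_report, second_report, calls)
```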
C. "Metadata" Tracking
See AWS Glue Data Catalog.
Index files that match a specific S3 key pattern and/or file format and/or AWS S3 tag etc. throughout your S3 bucket. Store the results in a publicly accessible table. Now you can perform SQL queries against the metadata of your data. Want to find that file you were working on last week? Run a query based on last modified time. Want to find files that contain a specific column name? Run a query based on column names.
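As a rough local analogue of that kind of catalog (not the AWS Glue API), the sketch below crawls CSV files under a directory and stores their path, modified time and column names in SQLite. The temp directory standing in for the bucket, the `files` table and the two query helpers are assumptions made for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Assumption: a temp directory stands in for the S3 bucket.
bucket = Path(tempfile.mkdtemp())
(bucket / "sales").mkdir()
(bucket / "sales" / "orders.csv").write_text("order_id,sale\n100,50.0\n")
(bucket / "crm").mkdir()
(bucket / "crm" / "customers.csv").write_text("customer_id,account_type\n1,business\n")

catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE files (path TEXT, modified REAL, column_name TEXT)")

# "Crawl" every file matching a pattern and index one row per column.
for path in sorted(bucket.rglob("*.csv")):
    with path.open(newline="") as f:
        header = next(csv.reader(f))
    for column in header:
        catalog.execute(
            "INSERT INTO files VALUES (?, ?, ?)",
            (path.relative_to(bucket).as_posix(), path.stat().st_mtime, column),
        )


def files_with_column(column):
    rows = catalog.execute(
        "SELECT DISTINCT path FROM files WHERE column_name = ? ORDER BY path",
        (column,),
    )
    return [r[0] for r in rows]


def modified_since(timestamp):
    rows = catalog.execute(
        "SELECT DISTINCT path FROM files WHERE modified >= ? ORDER BY path",
        (timestamp,),
    )
    return [r[0] for r in rows]


print(files_with_column("sale"))
print(modified_since(0))
```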
Pros:
- transactional versioning -- ACID compliance and the ability to rollback data over time (I accidentally deleted an entire column of data / calculated VAT wrong yesterday)
- processing-stage schema storage layout acts as a caching layer (only process from the stage where you need to)
- no need for humans to remember the specific path to the files they were working on as files are all indexed
- less chance of creating a "data swamp"
- changes become easier to audit as you can track the changes between versions
Cons:
- Delta lake table format is only really available with Apache Spark / Databricks processing engines (mostly, for now)
- Requires enforcement of the processing-stage schema (your data scientists will just ignore you when you request they start using it)
- More setup cost than a simple data lake
- Basically a move back towards proprietary tooling (some FOSS libs are starting to pop up for it)
4. Data Mesh
geoduck14's answer on this was pretty good. Basically, have a data infrastructure team, and then domain-specific teams that spring up as needed (like an infra team looking after your k8s clusters, and application teams that use the clusters). The domain-specific data teams use the data platform provided by the data infrastructure team.
Previously worked somewhere in a "product" team which basically performed this function. They just didn't call it a "data mesh".
5 Reasons Your Data Lakehouse should Embrace Dremio Cloud
2 projects | dev.to | 9 Aug 2022
You can query data organized in many open table formats like Apache Iceberg and Delta Lake. (Here is a good article on what is a table format and the differences between different ones)
Delta 2.0 - The Foundation of your Data Lakehouse is Open
2 projects | reddit.com/r/apachespark | 5 Aug 2022
Note that the roadmap can be found at https://github.com/delta-io/delta/issues/1307 and we’re actively asking for feedback so we can prioritize the remaining items. Please chime in there so we can track and re-prioritize! Thanks!
2 projects | reddit.com/r/apachespark | 5 Aug 2022
Still not quite completely on par with the Databricks version, with missing features like GENERATED ALWAYS AS IDENTITY, but this is getting good.
Databricks platform for small data, is it worth it?
3 projects | reddit.com/r/dataengineering | 29 Jun 2022
Currently the infrastructure we have is some custom made pipelines that load the data on S3, and I use Delta Tables here and there for its convenience: ACID, time travel, merges, CDC etc...
Data point versioning infrastructure for time traveling to a precise point in time?
2 projects | reddit.com/r/dataengineering | 18 Jun 2022
I've been playing around a bit with Delta (Table/Lake) whatever you want to call it. It has time travel so you can look back and see what the data looked like at a particular point in time. https://delta.io/
How-to-Guide: Contributing to Open Source
19 projects | reddit.com/r/dataengineering | 11 Jun 2022
What companies/startups are using Scala (open source projects on github)?
13 projects | reddit.com/r/scala | 24 May 2022
There are so many of them in big data, e.g. Kafka, Spark, Flink, Delta, Snowplow, Finagle, Deequ, CMAK, OpenWhisk, Snowflake, TheHive, TVM-VTA, etc.
What is a Delta Table?
2 projects | reddit.com/r/dataengineering | 20 May 2022
Ah. I believe you are correct. As I look at the examples section for python in the github repo these are looking almost identical to what I was seeing. https://github.com/delta-io/delta/blob/master/examples/python/quickstart.py
2 projects | reddit.com/r/dataengineering | 20 May 2022
It is a specific table format: https://delta.io/. It’s an open source project; just read their website, which will have way more info than these comments.
What is the separation of storage and compute in data platforms and why does it matter?
3 projects | dev.to | 29 Nov 2022
However, once your data reaches a certain size or you reach the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, or scale horizontally. This is where distributed query engines like Trino and Spark come in. Distributed query engines make use of a coordinator to plan the query and multiple worker nodes to execute them in parallel.
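The coordinator/worker split can be sketched on a single machine, with threads standing in for worker nodes. This is a toy under that assumption and not how Trino or Spark are implemented; `coordinator_average` and `worker_partial_sum` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial_sum(partition):
    # Each worker computes a partial result over its own slice of the data.
    return sum(partition), len(partition)


def coordinator_average(values, num_workers=3):
    # "Plan" the query: split the data into one partition per worker.
    partitions = [values[i::num_workers] for i in range(num_workers)]
    # Execute the partitions in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_partial_sum, partitions))
    # Merge the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


average = coordinator_average(list(range(1, 11)))
print(average)
```

Returning a (sum, count) pair rather than a per-partition average is what lets the coordinator merge the results correctly when partitions differ in size.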
Deequ for generating data quality reports
3 projects | dev.to | 24 Nov 2022
aws documentation — Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.
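The "describe how your data should look" idea can be illustrated in plain Python. This is not Deequ's API (Deequ is a Scala library on top of Spark); the helpers `is_complete` and `is_unique` and the sample rows are hypothetical.

```python
rows = [
    {"order_id": 1, "sale": 50.0},
    {"order_id": 2, "sale": None},
    {"order_id": 2, "sale": 30.0},
]


def is_complete(column):
    # Constraint: the column has no missing values.
    return lambda data: all(r[column] is not None for r in data)


def is_unique(column):
    # Constraint: no value in the column is repeated.
    return lambda data: len({r[column] for r in data}) == len(data)


# Declare the constraints, then verify them all in one pass.
checks = {
    "order_id is complete": is_complete("order_id"),
    "order_id is unique": is_unique("order_id"),
    "sale is complete": is_complete("sale"),
}

report = {name: check(rows) for name, check in checks.items()}
for name, passed in report.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```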
In One Minute : Hadoop
10 projects | dev.to | 21 Nov 2022
Spark, a fast and general engine for large-scale data processing.
Machine Learning Pipelines with Spark: Introductory Guide (Part 1)
5 projects | dev.to | 23 Oct 2022
Apache Spark is a fast and general open-source engine for large-scale, distributed data processing. Its flexible in-memory framework allows it to handle batch and real-time analytics alongside distributed data processing.
A peek into Location Data Science at Ola
6 projects | dev.to | 26 Sep 2022
This requires the use of distributed computation tools such as Spark, Hadoop, Flink and Kafka. But for occasional experimentation, Pandas, Geopandas and Dask are some of the commonly used tools.
System Design: Uber
4 projects | dev.to | 21 Sep 2022
Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to increase data points within our data.
System Design: Twitter
5 projects | dev.to | 21 Sep 2022
Recording analytics and metrics is one of our extended requirements. As we will be using Apache Kafka to publish all sorts of events, we can process these events and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing.
How the world caught up with Apache Cassandra
4 projects | dev.to | 15 Sep 2022
Cassandra survived its adolescent years by retaining its position as the database that scales more reliably than anything else, with a continual pursuit of operational simplicity at scale. It demonstrated its value even further by integrating with a broader data infrastructure stack of open source components, including the analytics engine Apache Spark, stream-processing platform Apache Kafka, and others.
Why we don’t use Spark
2 projects | dev.to | 7 Sep 2022
Most people working in big data know Spark (if you don't, check out their website) as the standard tool to Extract, Transform & Load (ETL) their heaps of data. Spark, the successor of Hadoop & MapReduce, works a lot like Pandas, a data science package where you run operators over collections of data. These operators then return new data collections, which allows the chaining of operators in a functional way while keeping scalability in mind.
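The chaining style described here can be shown with a small immutable collection class. It is a toy, not Spark's or Pandas' API; the `Collection` class and its methods are hypothetical.

```python
class Collection:
    """Immutable collection whose operators return new collections."""

    def __init__(self, items):
        self._items = tuple(items)

    def map(self, fn):
        return Collection(fn(x) for x in self._items)

    def filter(self, predicate):
        return Collection(x for x in self._items if predicate(x))

    def collect(self):
        return list(self._items)


sales = Collection([50.0, 30.0, 20.0])

# Each operator returns a new collection, so calls chain naturally.
doubled_large_sales = (
    sales
    .filter(lambda s: s >= 30.0)
    .map(lambda s: s * 2)
    .collect()
)
print(doubled_large_sales)
print(sales.collect())  # the original collection is untouched
```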
Tracking Aircraft in Real-Time With Open Source
17 projects | dev.to | 1 Sep 2022
What are some alternatives?
Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
Scalding - A Scala API for Cascading
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Smile - Statistical Machine Intelligence & Learning Engine
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Apache Calcite - Apache Calcite
Scio - A Scala API for Apache Beam and Google Cloud Dataflow.
Apache Flink - Apache Flink