eventsim
corp
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
eventsim
-
What are some good publicly available real-time data sources?
In my last real-time side project, I used an open source music app simulator called Eventsim. The original doesn't work anymore, but this repo still worked when I used it around Nov/Dec.
-
Mock Data Stream for learning purposes
Perhaps you can use this event simulator repo? It has a Dockerfile as well. Hope this helps.
-
Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!
Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.
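To make the idea of a fake event stream concrete, here is a minimal Python sketch of a generator that emits page-request events for a small set of users. It is not Eventsim's implementation or schema: the page names, the field names (`ts`, `user_id`, `page`), and the `generate_events` function are all assumptions made up for illustration.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical page names for illustration; not Eventsim's actual schema.
PAGES = ["Home", "NextSong", "Login", "Logout"]


def generate_events(n_users, n_events, seed=0, start=None):
    """Yield n_events fake page-request events in chronological order."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    ts = start or datetime(2022, 1, 1, tzinfo=timezone.utc)
    for _ in range(n_events):
        # Advance the clock by a random gap so timestamps always increase.
        ts += timedelta(seconds=rng.randint(1, 30))
        yield {
            "ts": ts.isoformat(),
            "user_id": rng.randint(1, n_users),
            "page": rng.choice(PAGES),
        }


if __name__ == "__main__":
    # Print a handful of events as JSON lines, as a stream consumer might read them.
    for event in generate_events(n_users=3, n_events=5):
        print(json.dumps(event))
```

A real simulator would also model sessions and user state; this sketch only shows the basic shape of a pseudo-random, time-ordered event stream.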
corp
-
Are there database design Standards out there? As in, formal documents listing exact best practices for OLTP database design?
Here's one that covers some of your points and that I like in general: https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md Except instead of prefixing my table names with the processing stage, I keep them in schemas by processing stage (source, staging, analytics). So, I can tell my analysts to look into the analytics schema for all the final tables, and they won't be bothered by intermediate models. The table names also have a precise structure that corresponds to our specific subject.
-
Looking to understand why the dbt style guide recommends to use *all lower case* for keywords, field names, and function names?
-
Best practices for data modeling with SQL and dbt
I find the content more or less ripped from dbt's own style guide
-
SQL Code Style Properties Questions
For anyone wondering, this is the dbt style guide I am referencing.
-
A modern data stack for startups
While the tool choice is obvious, how to use dbt is going to be more controversial. There's a load of great resources on dbt best practices, but as you can see from my Slack questions, there's enough ambiguity to tie you up.
-
Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!
Just a slight critique, but I noticed some of the dbt models are a bit hard to read. Especially your dim_users SCD2 model, which uses lots of nested subqueries and multiple columns on the same line. You may want to refer to this style guide from dbt Labs. I find CTEs are a lot easier to parse and read.
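As a rough illustration of the readability point, the sketch below runs the same query twice against an in-memory SQLite database, once with nested subqueries and once with CTEs. The `events` table, its columns, and the sample rows are made up for this example and are not taken from the project's `dim_users` model; the lowercase, one-column-per-line layout is only an approximation of the style being recommended.

```python
import sqlite3

# Hypothetical table and rows, used only to compare the two query shapes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table events (user_id integer, page text);
    insert into events values
        (1, 'NextSong'), (1, 'NextSong'), (1, 'Home'),
        (2, 'NextSong'), (2, 'Home');
""")

# Nested subqueries: the logic has to be read from the inside out.
nested_sql = """
    select user_id, plays
    from (
        select user_id, count(*) as plays
        from (select user_id from events where page = 'NextSong') as songs
        group by user_id
    ) as counts
    where plays > 1
"""

# CTEs: each step is named and reads top to bottom.
cte_sql = """
    with songs as (
        select user_id
        from events
        where page = 'NextSong'
    ),

    play_counts as (
        select
            user_id,
            count(*) as plays
        from songs
        group by user_id
    )

    select
        user_id,
        plays
    from play_counts
    where plays > 1
"""

nested_rows = conn.execute(nested_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(nested_rows == cte_rows)  # both forms return the same rows
```

Both queries return the same result; the difference is only in how easy each step is to follow and to test in isolation.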
-
What are some good resources for learning to write clean, production-quality code?
I really like this SQL style guide, and if you use dbt, the dbt style guide.
-
How do you format your SQL queries?
I like this one from dbt very much.
-
Where do you like to do the L of ELT? Python or DBT?
I recommend you write one. You can take inspiration from dbt's or GitLab's.
-
Confused about benefits of CTE
I've seen the Fishtown Analytics coding conventions recommended a lot around here, but there are a few things about their recommendations of CTE use that confuse me.
What are some alternatives?
streamify - A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
nodejs-bigquery - Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
spark-bigquery-connector - BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
sql-style-guide - An opinionated guide for writing clean, maintainable SQL.
RedfinScraper - Scrapes Redfin data.
terraform - Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
mockingbird - Mockingbird is a mock streaming data generator
pgsink - Logically replicate data out of Postgres into sinks (files, Google BigQuery, etc)
eventsim - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
awesome-public-real-time-datasets - A list of publicly available datasets with real-time data maintained by the team at bytewax.io