|6 days ago||10 months ago|
|GNU General Public License v3.0 or later||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Metadata Store - Which one to Choose? OpenMetadata vs Datahub?
5 projects | /r/dataengineering | 17 Nov 2022
We use Kubernetes as our deployment platform. Any feedback on one of these open source data catalogs?
- https://atlas.apache.org/#/
- https://opendatadiscovery.org/
- https://open-metadata.org/
- https://marquezproject.github.io/marquez/
- https://datahubproject.io/
- https://www.amundsen.io/
- https://ckan.org/
- https://magda.io/
How to start Data Science and Machine Learning Career?
2 projects | /r/ReviewNPrep | 24 Nov 2021
We are digitisers at the Natural History Museum in London, on a mission to digitise 80 million specimens and free their data to the world. Ask us anything!
4 projects | /r/datasets | 8 Mar 2021
We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.
What are the 5 hottest dbt Repositories one should star on GitHub 2022?
4 projects | news.ycombinator.com | 15 Jun 2022
dbt is a software framework that sits in the middle of the ELT process. It is the transformation layer applied after data has been loaded from its original source. dbt combines SQL with software engineering principles.
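The "transform after load" idea can be illustrated with a minimal sketch. This is not dbt itself: it uses Python's built-in `sqlite3` module as a stand-in warehouse, and the table name `raw_orders`, the view name `customer_revenue`, and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for a data warehouse (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# "Load" step: raw data lands in the warehouse untransformed.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "a", 1250), (2, "b", 800), (3, "a", 450)],
)

# "Transform" step: a SQL SELECT defines a derived model on top of the raw table,
# similar in spirit to how a transformation layer builds models from loaded data.
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(rows)
```

The point of the sketch is only the ordering: the raw table is loaded first, and the transformation is expressed afterwards as SQL against data that is already in the database.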
Here are my top 5!
- Lightdash (https://github.com/lightdash/lightdash): Lightdash converts dbt models and makes it possible to define and easily visualize additional metrics via a visual interface.
- re_data (https://github.com/re-data/re-data): Re-Data is an abstraction layer that helps users monitor dbt projects and their underlying data. For example, you get alerts when a test fails or a data anomaly occurs in a dbt project.
- evidence (https://github.com/evidence-dev/evidence): Evidence is another tool for lightweight BI reporting. With Evidence, you can build simple reports in "medium style" using SQL queries and Markdown.
- Kuwala (https://github.com/kuwala-io/kuwala): With Kuwala, a BI analyst can intuitively build advanced data workflows using a drag-and-drop interface on top of the modern data stack without coding. Behind the scenes, the dbt models are generated so that a more experienced engineer can customize the pipelines at any time.
- fal ai (https://github.com/fal-ai/fal): Fal helps to run Python scripts directly from the dbt project. For example, you can load dbt models directly into the Python context, which makes it possible to apply data science libraries like scikit-learn and Prophet to the dbt models.
What are the hottest dbt Repositories you should star on Github 2022? - Here are mine.
5 projects | dev.to | 8 Jun 2022
Kuwala (https://github.com/kuwala-io/kuwala): Kuwala is a data workspace that consolidates the Modern Data Stack and makes it usable for BI analysts and engineers. Even though dbt was originally targeted at BI analysts, it is mainly used by engineers. This shifts a large amount of pipeline engineering effort to the IT department. With Kuwala, a BI analyst can intuitively build advanced data workflows using a drag-and-drop interface on top of the modern data stack without coding. Consequently, the BI analyst can work more iteratively and maintain the complete workflow from source to metrics in a dashboard. Behind the scenes, the dbt models are generated so that a more experienced engineer can customize the pipelines at any time. In addition, engineers can easily convert dbt models into reusable "drag and drop" components.
What are your hottest dbt repositories in 2022 so far? Here are mine!
5 projects | /r/dataengineering | 7 Jun 2022
- 🧱 Kuwala: With Kuwala, a BI analyst can intuitively build advanced data workflows using a drag-and-drop interface on top of the modern data stack without coding. Behind the scenes, the dbt models are generated so that a more experienced engineer can customize the pipelines at any time.
BECOME A PRODUCT DESIGNER & STRATEGIST AT KUWALA! (Intern / Working Student)
2 projects | dev.to | 20 Oct 2021
In 2020, data teams spent 79% of their time on data gathering and cleaning. The enormous amount of data and the shortage of data talent lead to a failure rate of 85% in data science projects. Kuwala wants to provide a tool that combines the best open source projects in a UI that is easy for analysts to handle and flexible enough for engineers to adjust to edge cases and internal needs. We believe in open source, and we are open source ourselves! Everything we do is published in our GitHub repo: https://github.com/kuwala-io/kuwala or https://kuwala.io. And now we are looking for you to help us build out our hosted version of Kuwala.
Graph databases for data engineering
2 projects | /r/dataengineering | 21 Jun 2021
We're currently using it as a PoC to combine POI with population data. You can find it on our GitHub.
Yes, exactly. Looking at a graph schema is very intuitive. We're currently using Neo4j to model different POI sources (OSM, Google, etc.) with population and other sources. So we're trying to model real-world entities. E.g., you can express "this shop is in that shopping mall in this district where 3214 people between 14-25 are living." It's open-source on GitHub. Would be happy to get your feedback. :) https://github.com/kuwala-io/kuwala
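The "shop in a mall in a district" example can be sketched as a tiny property graph. This is a plain-Python illustration, not Kuwala's actual schema and not Neo4j code: the node ids, the `IN` relationship name, and the `population_14_25` property are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

# Nodes with properties. Names and labels are assumptions for illustration only.
nodes = {
    "shop": {"label": "Shop"},
    "mall": {"label": "ShoppingMall"},
    "district": {"label": "District", "population_14_25": 3214},
}

# Directed edges keyed by (source node, relationship type) -> list of targets.
edges = defaultdict(list)
edges[("shop", "IN")].append("mall")
edges[("mall", "IN")].append("district")


def containing_chain(node_id):
    """Follow IN relationships upward and return every enclosing node id."""
    chain = []
    current = node_id
    # .get() avoids inserting empty entries into the defaultdict while walking.
    while edges.get((current, "IN")):
        current = edges[(current, "IN")][0]
        chain.append(current)
    return chain


def population_around(node_id, key):
    """Return the first value of `key` found on an enclosing node, else None."""
    for enclosing in containing_chain(node_id):
        if key in nodes[enclosing]:
            return nodes[enclosing][key]
    return None


print(containing_chain("shop"))
print(population_around("shop", "population_14_25"))
```

Walking the `IN` relationships from the shop reaches the mall and then the district, which is where the population figure from the quoted sentence is stored in this sketch.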
[OC] High-Resolution Demographics Data (visualized for Malta but scalable worldwide)
2 projects | /r/dataisbeautiful | 26 May 2021
Also, you can easily run it locally for any country you like. It's just a few commands: https://github.com/kuwala-io/kuwala/tree/master/kuwala-pipelines/population-density
ETL Pipeline for Gigabytes of Demographics Data
3 projects | /r/dataengineering | 26 May 2021
Here's the link to the README issue: https://github.com/kuwala-io/kuwala/issues/20
👾 https://github.com/kuwala-io/kuwala/tree/master/kuwala-pipelines/population-density 📖 https://medium.com/kuwala-io/querying-the-most-granular-demographics-dataset-62da16b441a8 💬 https://join.slack.com/t/kuwala-community/shared_invite/zt-l5b2yjfp-pXKFBjbnl7_P3nXtwca5ag
What are some alternatives?
ArchivesSpace - The ArchivesSpace archives management tool
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Archivematica - Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.
Access to Memory (AtoM) - Open-source, web application for archival description and public access.
Collective Access: Providence - Cataloguing and data/media management application
Activeloop Hub - Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake]
uawardata - The data behind uawardata.com
CKAN-meta - Metadata files for the CKAN
mara-pipelines - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
dwc - Darwin Core standard for sharing of information about biological diversity.
evidence - Business intelligence as code: build polished data products with SQL and markdown