Show HN: Data load tool (dlt) – Python library to automate the creation of datasets

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

  • Hi HN,

    We're Anna, Adrian, Marcin and Matt, developers of dlt. dlt is an open source library to automatically create datasets out of messy, unstructured data sources. You can use the library to move data from almost anywhere into most well-known SQL and vector stores, data lakes, storage buckets, or local engines like DuckDB. It automates many cumbersome data engineering tasks and can be handled by anyone who knows Python.

    Here's our GitHub: https://github.com/dlt-hub/dlt

    Here's our Colab demo: https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-...

    — — —

    In the past we wrote hundreds of Python scripts to fit messy data sources into something you can work with in Python - a database, a Pandas frame, or just a Python list. We were solving the same problems and making the same mistakes again and again.

    This is why we built an easy-to-use Python library called dlt that automates most data engineering tasks. It hides the complexities of data loading and automatically generates structured, clean datasets for immediate querying and sharing.

    — — —

    At its core, dlt removes the need to create dataset schemas, react to changing data, generate append or merge statements, and move the data in a transactional and idempotent manner. Those things are automated and can be declared right in the Python code, just by decorating functions.

    Add the @dlt.resource decorator, give it a few hints, and convert any data into a simple pipeline that creates and updates datasets.

    dlt gets the details out of your way:

    1. You do not need to worry about the structure of a database or Parquet files

    dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put data contracts and Pydantic models on top to keep your data clean.

    2. You do not need to write any INSERT/UPDATE or data copy statements

    dlt will push the data to DuckDB, Weaviate, storage buckets and many popular SQL stores. It will align the data types, file formats, and identifier names automatically.

    3. You do not need to worry about adding new data or updating existing records.

    dlt lets you declare how to load the data and how to load it incrementally, and it keeps the loading state together with the data so they are always in sync.
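
A sketch of declarative incremental loading, assuming a hypothetical "issues" resource: rows are merged on a primary key, and dlt remembers the highest "updated_at" value it has seen in the pipeline state, so reruns do not duplicate data:

```python
import dlt

# merge on "id"; load only rows at or past the last saved "updated_at"
@dlt.resource(primary_key="id", write_disposition="merge")
def issues(updated_at=dlt.sources.incremental("updated_at", initial_value="2023-01-01")):
    rows = [
        {"id": 1, "title": "first", "updated_at": "2023-05-01"},
        {"id": 2, "title": "second", "updated_at": "2023-06-01"},
    ]
    for row in rows:
        # in a real source you would pass updated_at.start_value to the API
        if row["updated_at"] >= updated_at.start_value:
            yield row

pipeline = dlt.pipeline(
    pipeline_name="incr_demo", destination="duckdb", dataset_name="demo"
)
pipeline.run(issues())  # first run loads both rows
pipeline.run(issues())  # rerun: merge plus saved state means no duplicates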

    4. You keep your existing way of developing and testing code

    Iterate and test quickly on your laptop or in a dev container. Run locally on DuckDB and just swap the destination name to go to the cloud - your code, schema and data will stay the same.

    5. You can work with data on your laptop.

    Combine dlt with other tools and libraries to process data locally. DuckDB, Pandas, Arrow tables and Rust-based loading libraries like ConnectorX work nicely with dlt and process data blazingly fast compared to the cloud.

    6. You do not need to worry if your pipeline will work when you deploy it.

    dlt is a minimalistic Python library; it requires no backend and works wherever Python works. You can fine-tune it to run in constrained environments like AWS Lambda, or run it with Airflow, GitHub Actions or Dagster.

    dlt has an Apache 2.0 license. We plan to make money by offering organizations a paid control plane, where dlt users can track and govern what every pipeline does, manage schemas and contracts across the organization, create data catalogues, and share them with team members and customers.

  • verified-sources

    Contribute to dlt verified sources 🔥

  • - get data from any storage bucket: https://github.com/dlt-hub/verified-sources/tree/master/sour...



Related posts

  • Data load tool (dlt) – open-source Python library that makes data loading easy

    1 project | news.ycombinator.com | 17 Oct 2023
  • [Discussion] How to implement Data Contracts generically? Seeking advice from data contract users.

    1 project | /r/MachineLearning | 6 Sep 2023
  • Ask HN: Freelancer? Seeking freelancer? (December 2023)

    3 projects | news.ycombinator.com | 3 Dec 2023
  • Can we take a moment to appreciate how much of dataengineering is open source?

    8 projects | /r/dataengineering | 23 Nov 2022
  • What is a personality type of a Data Engineer?

    2 projects | /r/dataengineering | 26 Oct 2022