A Polars exploration into Kedro

This page summarizes the projects mentioned and recommended in the original post on dev.to

Our great sponsors
  • InfluxDB - Collect and Analyze Billions of Data Points in Real Time
  • Mergify - Tired of breaking your main and manually rebasing outdated pull requests?
  • Sonar - Write Clean Python Code. Always.
  • Kedro

    Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

    # pyproject.toml [project] dependencies = [ "kedro @ git+https://github.com/kedro-org/kedro@3ea7231", "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets", ]

  • cudf

    cuDF - GPU DataFrame Library

    The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

  • InfluxDB

    Collect and Analyze Billions of Data Points in Real Time. Manage all types of time series data in a single, purpose-built database. Run at any scale in any environment in the cloud, on-premises, or at the edge.

  • kedro-plugins

    First-party plugins maintained by the Kedro team.

    # pyproject.toml [project] dependencies = [ "kedro @ git+https://github.com/kedro-org/kedro@3ea7231", "kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets", ]

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Traditionally Kedro has favoured pandas as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to the catalog:

  • modin

    Modin: Scale your Pandas workflows by changing a single line of code

    The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

    Polars is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the Apache Arrow columnar data format (you can read more about Arrow on my earlier blog post “Demystifying Apache Arrow”), and it is optimised to be blazing fast.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts