Working with more than 10 GB of CSV data

This page summarizes the projects mentioned and recommended in the original post on /r/datascience

  • sqlitestudio

    A free, open source, multi-platform SQLite database manager.

    SQLiteStudio (https://sqlitestudio.pl) is awesome: super easy to set up and to pull CSVs into. (A scripted equivalent of the CSV import is sketched after this list.)

  • modin

    Modin: Scale your Pandas workflows by changing a single line of code

    Modin should fit: it implements the Pandas API with, e.g., Ray as the backend (see the sketch after this list). https://github.com/modin-project/modin

  • spyql

    Query data on the command line with SQL-like SELECTs powered by Python expressions

    You can import the data into a PostgreSQL/MySQL/SQLite/... database and then query the database. However, even with the right choice of indexes, it might take a while to run queries on a table with hundreds of millions of records. You can easily import your data (here a hypothetical my_data.csv) into these databases with SpyQL:

        $ spyql "SELECT * FROM csv TO sql(table=my_table_name)" < my_data.csv | sqlite3 my.db

    You would need to create the table my_table_name before running the command; see the sketch after this list.
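
SQLiteStudio itself is a GUI, but the CSV-into-SQLite step it handles can also be scripted. A minimal sketch using pandas' chunked CSV reader with Python's built-in sqlite3 module, so a 10+ GB file never has to fit in memory; the file, database, and table names are placeholders:

    import sqlite3

    import pandas as pd

    # Stream the CSV in modest chunks instead of loading it all at once.
    # "data.csv", "my.db", and "my_table" are placeholder names.
    con = sqlite3.connect("my.db")
    for chunk in pd.read_csv("data.csv", chunksize=100_000):
        chunk.to_sql("my_table", con, if_exists="append", index=False)
    con.close()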
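
For Modin, the documented usage really is a single changed line. A minimal sketch, assuming the Ray backend is installed (e.g. pip install "modin[ray]"); the file path is a placeholder:

    # The only change from plain pandas is the import line.
    import modin.pandas as pd

    # Modin partitions the frame and runs the read in parallel on the
    # Ray backend, which it initializes automatically if needed.
    df = pd.read_csv("data.csv")  # "data.csv" is a placeholder path
    print(df.shape)
    print(df.describe())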
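
As the SpyQL quote notes, the target table must exist before the import. A minimal sketch that creates it, plus an index for later queries, with Python's built-in sqlite3 module; the three-column schema is purely hypothetical:

    import sqlite3

    con = sqlite3.connect("my.db")
    # Hypothetical schema: adjust the columns to match your CSV header.
    con.execute("""
        CREATE TABLE IF NOT EXISTS my_table_name (
            id INTEGER,
            category TEXT,
            amount REAL
        )
    """)
    # An index on the column you filter by keeps queries on hundreds of
    # millions of rows from degenerating into full table scans.
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_category ON my_table_name (category)"
    )
    con.commit()
    con.close()

With the table in place, the quoted spyql command can populate it, and queries that filter on the indexed column avoid full scans.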

