This can be done as a batch job. 350MB is not really that big, and it may be even smaller if you only need a subset of the columns. You would basically loop through the archive, process each file individually, and append the results, if I understood you correctly. My initial implementation would use a combination of the zipfile, StringIO (from io), and csv modules to process the zip file in memory, since it should fit comfortably in RAM; see the first sketch below.

The harder issue is having a fault-tolerant process that does this continuously and reliably. For that I would use a general-purpose scheduler. If you're stuck on Windows, I highly recommend Dagster, as it now comes with an awesome general-purpose scheduler that works on Windows; there's a rough sketch of that below too. Otherwise, I would look into Airflow or Prefect, with Prefect being easier to use than Airflow. Ideally you would use cloud resources, but this can all be done locally with a VM.

But more importantly, where do you intend the final resting place of the data to be? I would recommend a database.
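A minimal sketch of the in-memory batch step, assuming a local `data.zip` containing CSVs that share a schema; the file name and the `timestamp`/`value` columns are placeholders for whatever your actual subset is:

```python
import csv
import io
import zipfile


def process_zip(zip_path: str, wanted_columns: list[str]) -> list[dict]:
    """Loop through every CSV inside the zip in memory, keeping only the wanted columns."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            # archive.open() yields bytes; decode and wrap in StringIO so csv can parse it as text.
            with archive.open(name) as raw:
                text = io.StringIO(raw.read().decode("utf-8"))
                for record in csv.DictReader(text):
                    rows.append({col: record.get(col) for col in wanted_columns})
    return rows


if __name__ == "__main__":
    # Placeholder file and columns; swap in your real ones.
    subset = process_zip("data.zip", ["timestamp", "value"])
    print(f"collected {len(subset)} rows")
```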
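And a rough sketch of the scheduling side in Dagster, not a definitive setup: the op body, job name, and cron string are all placeholders, and the exact API may differ slightly depending on your Dagster version.

```python
from dagster import Definitions, ScheduleDefinition, job, op


@op
def ingest_zip_op():
    # Placeholder: call the batch logic from the sketch above, e.g.
    # process_zip("data.zip", ["timestamp", "value"]) and load it somewhere.
    ...


@job
def ingest_job():
    ingest_zip_op()


# Run every day at 2am; the Dagster daemon/webserver (e.g. `dagster dev`)
# picks this up and gives you run history and failure visibility.
defs = Definitions(
    jobs=[ingest_job],
    schedules=[ScheduleDefinition(job=ingest_job, cron_schedule="0 2 * * *")],
)
```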