Data cleaning/analysis of 100-200 million rows of data. Is this doable in R, or is there another program I should try instead?

This page summarizes the projects mentioned and recommended in the original post on /r/rstats

  • disk.frame

    Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data

  • It depends on your hardware, but it should not be a problem. You might look into the disk.frame package (https://diskframe.com) or similar packages (a sketch of a disk.frame workflow is included after this list).

  • db-benchmark

    reproducible benchmark of database-like ops

  • Yes, data.table can handle this, but your limiting factor might be RAM. This benchmark shows that data.table can load a billion rows (9 columns) into RAM faster than other solutions. (Source). They ran their benchmark on a machine with 50 GB of RAM. A minimal data.table sketch follows this list.
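
As a rough illustration of the data.table route mentioned above, here is a minimal sketch. It assumes the data sits in a single hypothetical CSV file ("big_file.csv") with made-up column names, and that the machine has enough RAM to hold the parsed table with some headroom.

    library(data.table)

    # fread() is multi-threaded; select= limits parsing to the columns you need,
    # which cuts both load time and memory use
    dt <- fread(
      "big_file.csv",                               # hypothetical input file
      select = c("id", "group_col", "value_col"),   # hypothetical column names
      nThread = getDTthreads()
    )

    # data.table modifies by reference, so cleaning steps do not copy the table
    dt[is.na(value_col), value_col := 0]

    # grouped summary over 100-200 million rows stays fast and memory-friendly
    res <- dt[, .(mean_val = mean(value_col)), by = group_col]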

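If the data does not fit in RAM, a disk.frame workflow along the lines below may help. This is only a sketch under assumptions: a hypothetical CSV ("big_file.csv"), made-up column names, and disk.frame's dplyr-style verbs, which process the data chunk by chunk on disk and only bring the small summarised result into memory with collect().

    library(disk.frame)
    library(dplyr)

    # use several background workers for chunk-wise processing
    setup_disk.frame(workers = 4)
    # allow large objects to be shipped between workers
    options(future.globals.maxSize = Inf)

    # one-off conversion of the CSV into an on-disk, chunked disk.frame
    big_df <- csv_to_disk.frame(
      "big_file.csv",           # hypothetical input file
      outdir = "big_file.df",   # directory holding the chunks
      overwrite = TRUE
    )

    # dplyr verbs are applied chunk by chunk; collect() materialises the result
    res <- big_df %>%
      group_by(group_col) %>%                                  # hypothetical columns
      summarise(mean_val = mean(value_col, na.rm = TRUE)) %>%
      collect()
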
NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
