disk.frame
janitor
Metric | disk.frame | janitor
---|---|---
Mentions | 5 | 2
Stars | 592 | 1,337
Growth | 0.5% | -
Activity | 0.0 | 6.2
Last commit | 2 months ago | about 2 months ago
Language | R | R
License | GNU General Public License v3.0 or later | GNU General Public License v3.0 or later
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
disk.frame
- Do you code from memory? Or do you reference things?
  Say hello to disk.frame.
- How can I read in only two columns from a massive 10+ GB tab file?
- Data cleaning/analysis of 100-200 million rows of data. Is this doable in R, or is there another program I should try instead?
  It depends on your hardware, but it should not be a problem. You might look into disk.frame (https://diskframe.com) or similar packages.
- Is it possible to have my environment objects and work with them on my local drive instead of RAM?
  If that doesn't work, the disk.frame package might help. It is new-ish and not common, but it does seem to work with data on disk rather than in memory.
- We Test PCIe 4.0 Storage: The AnandTech 2021 SSD Benchmark Suite
  > The speeds were just stunning to say the least at 15GB/s.

  That is amazing. That is around DDR4-1866 speeds, and not far from DDR4-2666 (~21 GB/s). At those speeds I would happily work with dataframes sitting on the disk rather than in memory [1, 2]. Did you benchmark RAID 0 with fewer than four disks?
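For the question above about reading only two columns from a very large tab-separated file, here is a minimal sketch of one possible approach. It uses `fread()` from the data.table package rather than disk.frame, which is an assumption on our part; the answer given in the original thread is not shown here. The file contents and the column names (`id`, `year`, `note`) are hypothetical stand-ins.

```r
library(data.table)

# Write a tiny tab-separated file to stand in for the large one (hypothetical data).
path <- tempfile(fileext = ".tsv")
fwrite(
  data.table(id = 1:3, year = c(2019L, 2020L, 2021L), note = c("a", "b", "c")),
  path,
  sep = "\t"
)

# select= names the columns to keep; the remaining columns are not loaded into the result.
two_cols <- fread(path, sep = "\t", select = c("id", "year"))
print(names(two_cols))
```

Only the selected columns end up in memory, which is usually what matters when the full file is larger than available RAM.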
janitor
- Working with column names that are numbers (in this case, years)
  I would just clean the names and work with those; then there is no need to use backticks. Read about the clean_names function in the janitor vignette: https://github.com/sfirke/janitor
- R Libraries Every Data Scientist Should Know - Pyoflife
  I just stumbled across janitor, which can help you clean column names easily.
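As a minimal sketch of the column-name cleaning mentioned in the posts above, the example below runs `clean_names()` from janitor on a small data frame whose column names are years. The data frame and its values are hypothetical and exist only for illustration.

```r
library(janitor)

# Hypothetical data frame whose column names are years.
# check.names = FALSE keeps the numeric names as they are.
df <- data.frame(
  country = c("A", "B"),
  `2019` = c(1, 2),
  `2020` = c(3, 4),
  check.names = FALSE
)

# clean_names() rewrites the names into syntactic, snake_case form.
cleaned <- clean_names(df)
print(names(cleaned))
```

Names that start with a digit get an `x` prefix, so the year columns can then be referenced without backticks, for example `cleaned$x2019`.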
What are some alternatives?
db-benchmark - reproducible benchmark of database-like ops
tidyverse - Easily install and load packages from the tidyverse
drake - An R-focused pipeline toolkit for reproducibility and high-performance computing
IntRo - Introduction to R for health data
police-settlements - A FiveThirtyEight/The Marshall Project effort to collect comprehensive data on police misconduct settlements from 2010-19.
Practical-Applications-in-R-for-Psychologists - Lesson files for Practical Applications in R for Psychologists.