Data Science at the Command Line, 2nd Edition (2021)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • fzf-tool-launcher

    Browse & preview contents of files, and launch tools/pipelines to process a selected file.

    I wrote a couple of shell scripts that allow you to build command pipelines one step at a time, choosing a tool from a menu at each step, and with the ability to preview the results while tweaking the command line flags at each step. At any point you can go back to the previous step and continue: https://github.com/vapniks/fzf-tool-launcher

  • fzfrepl

    Edit commands/pipelines with fzf, and view live output while editing

  • makefile2graph

    Creates a graph of dependencies from GNU Make; output is a Graphviz dot file or a GEXF XML file.

    Yes, that's extremely important. I've had great success replacing Airflow, Luigi, and friends with a cron-ed Makefile target that refreshes some database tables (usually materialized views).

    I've then used this tool to visualize the execution graph. https://github.com/lindenb/makefile2graph

    The result looks like this https://tselai.com/data/graph.png
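
To make the idea in the comment above concrete, here is a minimal Python sketch that turns simple Makefile rules into a Graphviz dot graph. It is an illustration only and is not how makefile2graph itself works: the `makefile_to_dot` function, the `RULE` pattern, and the target names in `SAMPLE_MAKEFILE` are all hypothetical, and the parser only handles explicit `target: prerequisites` lines (no variables, pattern rules, or includes).

```python
import re

# Hypothetical, simplified pattern for explicit rules such as "target: dep1 dep2".
# It deliberately skips comments and variable assignments like "VAR := value".
RULE = re.compile(r"^([^\s:#=]+)\s*:(?!=)\s*(.*)$")

# Made-up example: a cron-ed "refresh_views" target that depends on two load steps.
SAMPLE_MAKEFILE = (
    "refresh_views: load_orders load_users\n"
    "\tpsql -f refresh.sql\n"
    "load_orders: orders.csv\n"
    "\tpsql -f load_orders.sql\n"
)


def makefile_to_dot(text: str) -> str:
    """Return a dot digraph with one edge per (prerequisite -> target) pair."""
    edges = []
    for line in text.splitlines():
        if line.startswith("\t"):  # recipe lines are commands, not rules
            continue
        match = RULE.match(line)
        if not match:
            continue
        target = match.group(1)
        for dep in match.group(2).split():
            edges.append((dep, target))
    body = "\n".join(f'  "{dep}" -> "{target}";' for dep, target in edges)
    return "digraph G {\n" + body + "\n}"


if __name__ == "__main__":
    print(makefile_to_dot(SAMPLE_MAKEFILE))
```

The printed text can be saved to a file and rendered with any Graphviz front end. For real Makefiles, a tool that relies on Make's own view of the rules will be far more reliable than this line-based sketch.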

  • visidata

    A terminal spreadsheet multitool for discovering and arranging data

    I'd like to call out one of my favorite pieces of software from the past 10 years: VisiData [1] has completely changed the way I do ad-hoc data processing, and is now my go-to for pretty much all use cases that I previously used spreadsheets for, and about half of those I previously used databases for.

    It's a TUI application, not strictly CLI, but scriptable, and I figure anyone building pipelines using tools like jq, q, awk, grep, etc. to process tabular data will find it extremely useful.

    ----

    [1]: https://visidata.org

  • jq

    Command-line JSON processor

    Thanks. If anyone else is interested, there is an explanation of this feature here: https://subtxt.in/library-data/2016/03/28/json_stream_jq and here: https://github.com/jqlang/jq/wiki/FAQ#streaming-json-parser

    The last time I tried, I think the reason I gave up on jq for large inputs was that throughput would max out at 7 MB/s, whereas the same thing with Spark SQL on the same hardware (a MacBook) would max out at 250 MB/s. So I started looking into other solutions for big data, while still using jq in parallel for small data spread across multiple files.

    I will test it out again, because it was 4-5 years ago that I last tested it, but I believe jaq is still preferred for large inputs. Still, for big data I prefer to use Spark, Polars, ClickHouse, etc.
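
The trade-off discussed in the jq comments (processing large JSON inputs without holding everything in memory) can be illustrated with a small Python sketch. This is not jq's streaming parser, which emits path and leaf events for a single large document. It assumes instead that the input is newline-delimited JSON, one object per line, and the names `iter_records` and `count_where` are hypothetical helpers introduced for this example.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line.

    Only a single line is held in memory at a time, so the size of the
    input does not determine peak memory use.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_where(stream: TextIO, key: str, value: Any) -> int:
    """Count the records whose `key` field equals `value`."""
    return sum(1 for record in iter_records(stream) if record.get(key) == value)


if __name__ == "__main__":
    # Made-up sample data standing in for a large file opened with open().
    sample = io.StringIO('{"lang": "c"}\n{"lang": "go"}\n\n{"lang": "c"}\n')
    print(count_where(sample, "lang", "c"))
```

Because each file is read independently, the same function can be run over many small files in parallel, which matches the "jq in parallel for small data in multiple files" workflow the commenter describes.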



Related posts

  • Plugin for pretty rendering of data?

    1 project | /r/neovim | 24 Dec 2022
  • Hanukkah of Data: Advent of Code for Data Nerds

    1 project | /r/learndatascience | 3 Dec 2022
  • Visidata - work with CSV / SQLlite / xls and other data files from the CLI

    1 project | /r/LinuxShell | 12 Oct 2022
  • Visidata 2.9.1 Released– Excel/CSV/JSON/SQLite/Parquet in the Terminal (TUI/CLI)

    1 project | news.ycombinator.com | 12 Aug 2022
  • I made RemoteZipFile to download individual files from INSIDE a .zip

    3 projects | /r/Python | 13 Jul 2022
