What's the best tool to build pipelines from REST APIs?

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering

  • memphis

    Memphis.dev is a highly scalable and effortless data streaming platform

    Great recommendations from the rest of the members here. I would love to learn more about your use case if possible, as we are adding native REST, WebSocket, and gRPC support to our message broker (Memphis). Let's chat if possible; I would love to work on this together.

  • xidel

    Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

    Xidel for extraction and pagination
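
    To make the pagination pattern concrete: you repeatedly request a page and follow the response's "next page" pointer until there is none. A minimal Python sketch of that extract-and-paginate loop, assuming a hypothetical JSON API at https://api.example.com/items that returns an "items" array and a "next" URL:

    ```python
    import requests  # third-party HTTP client (pip install requests)

    def fetch_all_items(start_url: str) -> list[dict]:
        """Follow the API's pagination links until no 'next' URL remains."""
        items: list[dict] = []
        url = start_url
        while url:
            response = requests.get(url, timeout=30)
            response.raise_for_status()             # fail fast on HTTP errors
            payload = response.json()
            items.extend(payload.get("items", []))  # hypothetical field name
            url = payload.get("next")               # None ends the loop
        return items

    if __name__ == "__main__":
        rows = fetch_all_items("https://api.example.com/items")
        print(f"Fetched {len(rows)} records")
    ```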

  • gnu-parallel

    A clone of GNU Parallel (git://git.savannah.gnu.org/parallel.git)

    GNU Parallel for parallelism, retry, and resumption
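
    GNU Parallel itself covers retries and resumption with flags such as --retries, --joblog, and --resume. If you would rather stay in Python, here is a minimal standard-library sketch of the same parallel-fetch-with-retries idea, using hypothetical page URLs:

    ```python
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Hypothetical list of paginated API URLs to pull in parallel
    URLS = [f"https://api.example.com/items?page={n}" for n in range(1, 11)]

    def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
        """Fetch one URL, retrying with exponential backoff on failure."""
        for attempt in range(1, attempts + 1):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    return resp.read()
            except Exception:
                if attempt == attempts:
                    raise
                time.sleep(backoff ** attempt)

    def main() -> None:
        results: dict[str, bytes] = {}
        # 8 worker threads, roughly analogous to parallel -j8
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = {pool.submit(fetch_with_retry, url): url for url in URLS}
            for future in as_completed(futures):
                results[futures[future]] = future.result()
        print(f"Fetched {len(results)} pages")

    if __name__ == "__main__":
        main()
    ```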

  • jq

    Command-line JSON processor. The original repository (by stedolan) is discontinued; development moved to https://github.com/jqlang/jq.

    JQ for JSON processing
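
    jq is normally used straight from the shell, but it can also be driven from Python via subprocess (assuming the jq binary is installed and on PATH), or swapped for the standard json module for simple filters. A small sketch with an example payload:

    ```python
    import json
    import subprocess

    raw = '{"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}'  # example payload

    # Option 1: shell out to jq (assumes the jq binary is installed and on PATH)
    completed = subprocess.run(
        ["jq", "-r", ".items[].id"],   # -r prints raw values, one per line
        input=raw, capture_output=True, text=True, check=True,
    )
    ids_via_jq = completed.stdout.split()

    # Option 2: plain standard-library json for simple filters
    ids_via_json = [item["id"] for item in json.loads(raw)["items"]]

    print(ids_via_jq, ids_via_json)
    ```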

  • aws_lambda_reddit_api

    Batch data processing project using data from the Reddit API.

    I have a small project like this that I did before, which I am going to shamelessly plug lol: https://github.com/PanzerFlow/aws_lambda_reddit_api

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    On AWS, deploy it using the maintained Terraform scripts.
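
    As a rough illustration of what reading a REST API looks like inside a Mage pipeline, here is a sketch in the style of Mage's generated data-loader templates; the decorator import path and the example URL are assumptions, so check them against the version you deploy:

    ```python
    import io

    import pandas as pd
    import requests

    # Mage injects the decorator at runtime; this guard mirrors its generated templates
    if 'data_loader' not in globals():
        from mage_ai.data_preparation.decorators import data_loader

    @data_loader
    def load_data_from_api(*args, **kwargs) -> pd.DataFrame:
        """Pull CSV data from a hypothetical REST endpoint into a DataFrame."""
        url = 'https://api.example.com/export.csv'   # placeholder endpoint
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        return pd.read_csv(io.StringIO(response.text))
    ```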

  • CoinCap-firehose-s3-DynamicPartitioning

    AWS CDK project using TypeScript. Services: Lambda, Kinesis Firehose, Glue, QuickSight.

    I agree with the cron-triggered Lambda approach. For inspiration, I have a small project where a Lambda pulls data from a public API and writes it to a Firehose, which buffers the data and writes it to S3. There is also a cron job on Glue which catalogues the data. https://github.com/TrygviZL/CoinCap-firehose-s3-DynamicPartitioning
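
    The Lambda-to-Firehose step in that architecture reduces to a few boto3 calls. A minimal sketch, assuming a hypothetical public API and an existing delivery stream named api-raw-stream:

    ```python
    import json
    import urllib.request

    import boto3

    firehose = boto3.client("firehose")
    STREAM_NAME = "api-raw-stream"               # placeholder delivery stream
    API_URL = "https://api.example.com/prices"   # placeholder public API

    def handler(event, context):
        """Cron-triggered Lambda: pull from the API and push records to Firehose."""
        with urllib.request.urlopen(API_URL, timeout=30) as resp:
            records = json.loads(resp.read())
        for record in records:
            firehose.put_record(
                DeliveryStreamName=STREAM_NAME,
                Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
            )
        return {"records_written": len(records)}
    ```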

  • astro-sdk

    Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.

    I have an example here using COVID data. Basically, you just write a Python function that reads the API and returns a DataFrame (or any number of DataFrames), and downstream tasks can then read the output as either a DataFrame or a SQL table.
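
    A sketch of that pattern with the Astro SDK's dataframe decorator is below; the endpoint and DAG parameters are placeholders, and decorator details may vary between SDK and Airflow versions:

    ```python
    import pandas as pd
    import requests
    from airflow.decorators import dag
    from astro import sql as aql
    from pendulum import datetime

    @aql.dataframe
    def load_covid_data() -> pd.DataFrame:
        """Read a hypothetical REST endpoint and return it as a DataFrame."""
        response = requests.get("https://api.example.com/covid/daily", timeout=60)
        response.raise_for_status()
        return pd.DataFrame(response.json())

    @aql.dataframe
    def summarize(df: pd.DataFrame) -> pd.DataFrame:
        """Downstream task receives the upstream output as a DataFrame."""
        return df.describe()

    @dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
    def covid_pipeline():
        summarize(load_covid_data())

    covid_pipeline()
    ```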

  • Rudderstack

    Privacy- and security-focused Segment alternative, written in Golang and React

    RudderStack is an open-source tool to build data pipelines with high availability and high-precision event ordering. It is suitable for your use case.
