-
Great recommendations from the other members here. I would love to learn more about your use case if possible, as we are adding native REST, WebSocket, and gRPC support to our message broker (Memphis). Let's chat if possible; I would love to work on this together.
-
xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Xidel for extraction and pagination
-
GNU Parallel for parallelism, retry, and resumption
-
jq for JSON processing
-
I have a small project like this that I did before, which I'm going to shamelessly plug: https://github.com/PanzerFlow/aws_lambda_reddit_api
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
AWS: deploy using maintained Terraform scripts.
-
CoinCap-firehose-s3-DynamicPartitioning
AWS CDK project using TypeScript. Services: Lambda, Kinesis Firehose, Glue, QuickSight.
I agree with the cron-triggered Lambda approach. For inspiration, I have a small project where a Lambda pulls data from a public API and writes it to a Firehose, which buffers the data and writes it to S3. There is also a cron job on Glue that catalogues the data. https://github.com/TrygviZL/CoinCap-firehose-s3-DynamicPartitioning
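To make the cron-triggered Lambda idea a bit more concrete, here is a minimal Python sketch of the "pull from an API, push to Firehose" step. It is not taken from the linked repository (which uses TypeScript and CDK); the URL `https://example.com/api/items`, the stream name `example-delivery-stream`, and the assumption that the API returns a JSON array are all placeholders made up for illustration. The only AWS call used is boto3's `put_record_batch`, which accepts at most 500 records per request.

```python
import json
import urllib.request

# Hypothetical values: replace with your own API endpoint and delivery stream.
API_URL = "https://example.com/api/items"
STREAM_NAME = "example-delivery-stream"
MAX_BATCH = 500  # Firehose PutRecordBatch accepts up to 500 records per call


def fetch_items(url=API_URL):
    """Fetch the API response; assumes the body is a JSON array of objects."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def to_firehose_records(items):
    """Encode each item as one newline-delimited JSON record."""
    return [{"Data": (json.dumps(item) + "\n").encode("utf-8")} for item in items]


def chunked(seq, size):
    """Yield consecutive slices of seq with at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def handler(event, context, fetch=fetch_items, client=None):
    """Lambda entry point; `fetch` and `client` are injectable for local testing."""
    if client is None:
        import boto3  # imported lazily so the module loads without boto3 installed
        client = boto3.client("firehose")

    records = to_firehose_records(fetch())
    failed = 0
    for batch in chunked(records, MAX_BATCH):
        response = client.put_record_batch(
            DeliveryStreamName=STREAM_NAME, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return {"sent": len(records), "failed": failed}
```

The schedule itself (for example an EventBridge rule) and the Firehose-to-S3 buffering live in infrastructure config rather than in the handler. A real deployment would probably also retry the records counted in `FailedPutCount`, which this sketch only reports.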
-
astro-sdk
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
I have an example here using COVID data. Basically, you just write a Python function that reads the API and returns a dataframe (or any number of dataframes), and downstream tasks can then read the output as either a dataframe or a SQL table.
-
RudderStack is an open-source tool to build data pipelines with high-availability and high-precision event ordering. It is suitable for your use case as