Top 15 Python data-extraction Projects
-
Optimus
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
As part of work on my open-source project ContextGem, I've built a native, zero-dependency DOCX converter that transforms Word documents into LLM-ready data.
This custom-built converter processes Word XML directly and provides comprehensive content extraction, covering what other open-source tools often miss or do not support:
- Rich paragraph and sentence metadata for enhanced context
- Misaligned tables
- Comments, footnotes, and textboxes
- Embedded images
The converted document can then be easily used in ContextGem's LLM extraction workflows.
Perfect for developers building contract intelligence applications where precision matters. The converter preserves document structure and relationships, empowering LLMs to better understand and analyze document content.
Try it or share it with your dev team today and see the difference in your document processing pipeline!
GitHub: https://github.com/shcherbak-ai/contextgem
All DocxConverter features: https://contextgem.dev/converters/docx.html
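To make "directly processes Word XML" concrete, here is a minimal standard-library sketch of the general idea. It is not ContextGem's implementation or API: the function names `extract_paragraphs` and `make_sample_docx` are hypothetical, and the sample document is a stripped-down archive built in memory purely for illustration. It relies only on the facts that a DOCX file is a ZIP archive and that the main body lives in `word/document.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_bytes: bytes) -> list[str]:
    """Return the text of each non-empty paragraph in the main document body.

    Hypothetical helper for illustration; not part of ContextGem.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs (<w:r>), each holding <w:t> nodes.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a tiny in-memory archive with just enough XML for the demo.

    A real DOCX contains more parts (content types, relationships, styles).
    """
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Master Services Agreement</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Term: </w:t></w:r><w:r><w:t>12 months</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


if __name__ == "__main__":
    for line in extract_paragraphs(make_sample_docx()):
        print(line)
```

Comments and footnotes are stored in separate parts of the archive (`word/comments.xml` and `word/footnotes.xml`), which is one reason a body-only reader like this sketch misses them.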
-
Project mention: llm-scraper VS parsera - a user suggested alternative | libhunt.com/r/llm-scraper | 2024-10-16
-
Project mention: HN Summary: Let ChatGPT Summarize Hacker News for You | news.ycombinator.com | 2024-09-02
-
sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
-
pinterest-scrapy-scraper
Production-ready Pinterest scraper built with Python Scrapy. Extract pins, boards, and search data with ScrapeOps proxy integration. Features 3 specialized spiders, JavaScript rendering, and CSV export. Perfect for content marketing, market research, and social media analytics.
Project mention: Pinterest Scraping in 2025: Why I Built a Production-Ready Scrapy Spider (And You Should Too) | dev.to | 2025-07-09
I've open-sourced the complete Pinterest scraper on GitHub: pinterest-scrapy-scraper
-
youtube-scrapy-scraper
Professional YouTube data scraper built with Scrapy framework. Extract video search results, channel analytics, subscriber counts, view metrics, and trending content. Features anti-bot protection, proxy rotation, CSV/JSON export, and comprehensive data validation. Perfect for content research, competitor analysis, and YouTube API alternatives.
Project mention: Building a YouTube Scraper with Scrapy: A Developer's Journey | dev.to | 2025-07-10
The complete code is available on GitHub: youtube-scrapy-scraper
Python data-extraction discussion
Python data-extraction related posts
- Transform DOCX into LLM-ready data
- I Built an Open-Source Framework to Make LLM Data Extraction Dead Simple
- Scrapegraph-ai VS parsera - a user suggested alternative (2 projects | 16 Oct 2024)
- HN Summary: Let ChatGPT Summarize Hacker News for You
- Lightweight library for scraping web-sites with LLMs
- What's the fun in writing on the internet anymore?
- Made an app that summarizes recent popular stories from Hacker News
A note from our sponsor - Stream
getstream.io | 19 Jul 2025
Index
What are some of the best open-source data-extraction projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | flashtext | 5,662 |
2 | Optimus | 1,515 |
3 | contextgem | 1,248 |
4 | parsera | 1,125 |
5 | hacker-news-digest | 723 |
6 | PlotDigitizer | 145 |
7 | sayn | 120 |
8 | superpipe | 110 |
9 | tinvois-parser | 44 |
10 | JSONPATH | 41 |
11 | Data Extractor | 28 |
12 | Webtap.ai | 12 |
13 | sdk | 5 |
14 | pinterest-scrapy-scraper | 0 |
15 | youtube-scrapy-scraper | 0 |