Python data-extraction

Open-source Python projects categorized as data-extraction

Top 15 Python data-extraction Projects

  1. flashtext

    Extract keywords from sentences or replace keywords in sentences.
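To illustrate what "extract or replace keywords" means in practice, here is a minimal standard-library sketch. It is not flashtext's implementation or API: the function names (`build_pattern`, `extract_keywords`, `replace_keywords`) and the dictionary-based keyword map are assumptions made for this example, and it uses a plain regular expression rather than whatever data structure flashtext uses internally.

```python
import re


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    # Longest keywords first so a longer phrase wins over a shorter prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def extract_keywords(text, keyword_map):
    """Return the canonical name of every keyword found in text, in order."""
    if not keyword_map:
        return []
    lookup = {k.lower(): v for k, v in keyword_map.items()}
    pattern = build_pattern(lookup)
    return [lookup[m.group(1).lower()] for m in pattern.finditer(text)]


def replace_keywords(text, keyword_map):
    """Replace every keyword found in text with its canonical name."""
    if not keyword_map:
        return text
    lookup = {k.lower(): v for k, v in keyword_map.items()}
    pattern = build_pattern(lookup)
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


if __name__ == "__main__":
    # Hypothetical keyword map: surface form -> canonical name.
    keyword_map = {"big apple": "New York", "bay area": "Bay Area"}
    print(extract_keywords("I love Big Apple and the Bay Area.", keyword_map))
    print(replace_keywords("I love Big Apple.", keyword_map))
```

A single compiled alternation keeps the sketch short, but it can slow down as the keyword list grows, which is the kind of workload a dedicated library is meant for.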

  3. Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  4. contextgem

    ContextGem: Effortless LLM extraction from documents

    Project mention: Transform DOCX into LLM-ready data | news.ycombinator.com | 2025-05-04

    As part of work on my open-source project ContextGem, I've built a native, zero-dependency DOCX converter that transforms Word documents into LLM-ready data.

    This custom-built converter processes Word XML directly and provides comprehensive content extraction, covering what other open-source tools often miss or do not support:

    - Rich paragraph and sentence metadata for enhanced context

    - Misaligned tables

    - Comments, footnotes, and textboxes

    - Embedded images

    The converted document can then be easily used in ContextGem's LLM extraction workflows.

    Perfect for developers building contract intelligence applications where precision matters. The converter preserves document structure and relationships, empowering LLMs to better understand and analyze document content.

    Try it / share with your dev team today and see the difference in your document processing pipeline!

    GitHub: https://github.com/shcherbak-ai/contextgem

    All DocxConverter features: https://contextgem.dev/converters/docx.html
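For readers unfamiliar with what "directly processes Word XML" involves, the following is a rough standard-library sketch. It is not ContextGem's DocxConverter and does not use its API: `docx_paragraphs` and `make_minimal_docx` are hypothetical helpers written for this example. It relies only on the fact that a DOCX file is a ZIP archive whose main text lives in `word/document.xml`, and it ignores tables, comments, footnotes, textboxes, and images.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_bytes):
    """Return the text of each non-empty paragraph in a DOCX file's main body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(paragraph_texts):
    """Build an in-memory archive holding only word/document.xml (demo only).

    The result is enough for docx_paragraphs, but it is not a complete
    DOCX package and a word processor may refuse to open it.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraph_texts
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_minimal_docx(["First paragraph.", "Second paragraph."])
    for line in docx_paragraphs(data):
        print(line)
```

Working at the XML level like this is what makes it possible to reach parts of a document that higher-level tools skip, at the cost of handling each element type explicitly.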

  5. parsera

    Lightweight library for scraping websites with LLMs

    Project mention: llm-scraper VS parsera - a user suggested alternative | libhunt.com/r/llm-scraper | 2024-10-16
  6. hacker-news-digest

    Let ChatGPT Summarize Hacker News for You

    Project mention: HN Summary: Let ChatGPT Summarize Hacker News for You | news.ycombinator.com | 2024-09-02
  7. PlotDigitizer

    A Python utility to digitize plots.

  8. sayn

    Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).

  10. superpipe

    Superpipe - optimized LLM pipelines for structured data

  11. tinvois-parser

    Extract receipt info

  12. JSONPATH

    A query expression for extracting data from JSON. (by linw1995)
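As a rough illustration of what a JSON query expression does, the sketch below evaluates a very small JSONPath-like subset using only the standard library. It is not the JSONPATH project's API: the `find` function and the supported syntax (`$`, `.key`, `[index]`, and `[*]` on lists) are assumptions made for this example.

```python
import re

# One path step: .key, [index], or the list wildcard [*].
STEP = re.compile(r"\.(\w+)|\[(\d+)\]|\[\*\]")


def find(path, data):
    """Return every value in data matched by a simplified JSONPath-like path."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    results = [data]
    pos = 1
    while pos < len(path):
        match = STEP.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        key, index = match.group(1), match.group(2)
        next_results = []
        for node in results:
            if key is not None:
                # .key selects a member of an object; missing keys match nothing.
                if isinstance(node, dict) and key in node:
                    next_results.append(node[key])
            elif index is not None:
                # [n] selects one list element; out-of-range matches nothing.
                if isinstance(node, list) and int(index) < len(node):
                    next_results.append(node[int(index)])
            elif isinstance(node, list):
                # [*] selects every element of a list.
                next_results.extend(node)
        results = next_results
        pos = match.end()
    return results


if __name__ == "__main__":
    doc = {"store": {"books": [{"title": "A"}, {"title": "B"}]}}
    print(find("$.store.books[*].title", doc))
    print(find("$.store.books[1].title", doc))
```

Returning a list of matches, rather than a single value or an error, means a path that matches nothing simply yields an empty list.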

  13. Data Extractor

    Combine XPath, CSS Selectors and JSONPath for web data extraction.

  14. Webtap.ai

    AI web scraping Python library for efficient and reliable web scraping.

  15. sdk

    Lightfeed SDK to search and filter web data (by lightfeed)

  16. pinterest-scrapy-scraper

    Production-ready Pinterest scraper built with Python Scrapy. Extract pins, boards, and search data with ScrapeOps proxy integration. Features 3 specialized spiders, JavaScript rendering, and CSV export. Perfect for content marketing, market research, and social media analytics.

    Project mention: Pinterest Scraping in 2025: Why I Built a Production-Ready Scrapy Spider (And You Should Too) | dev.to | 2025-07-09

    I've open-sourced the complete Pinterest scraper on GitHub: pinterest-scrapy-scraper

  17. youtube-scrapy-scraper

    Professional YouTube data scraper built with Scrapy framework. Extract video search results, channel analytics, subscriber counts, view metrics, and trending content. Features anti-bot protection, proxy rotation, CSV/JSON export, and comprehensive data validation. Perfect for content research, competitor analysis, and YouTube API alternatives.

    Project mention: Building a YouTube Scraper with Scrapy: A Developer's Journey | dev.to | 2025-07-10

    The complete code is available on GitHub: youtube-scrapy-scraper

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python data-extraction related posts

  • Transform DOCX into LLM-ready data

    2 projects | news.ycombinator.com | 4 May 2025
  • I Built an Open-Source Framework to Make LLM Data Extraction Dead Simple

    1 project | dev.to | 2 May 2025
  • Scrapegraph-ai VS parsera - a user suggested alternative

    2 projects | 16 Oct 2024
  • HN Summary: Let ChatGPT Summarize Hacker News for You

    1 project | news.ycombinator.com | 2 Sep 2024
  • Lightweight library for scraping web-sites with LLMs

    1 project | news.ycombinator.com | 17 Aug 2024
  • What's the fun in writing on the internet anymore?

    1 project | news.ycombinator.com | 17 Feb 2024
  • Made an app that summarizes recent popular stories from Hacker News

    2 projects | news.ycombinator.com | 16 Nov 2023

Index

What are some of the best open-source data-extraction projects in Python? This list will help you:

#   Project                    Stars
1   flashtext                  5,662
2   Optimus                    1,515
3   contextgem                 1,248
4   parsera                    1,125
5   hacker-news-digest           723
6   PlotDigitizer                145
7   sayn                         120
8   superpipe                    110
9   tinvois-parser                44
10  JSONPATH                      41
11  Data Extractor                28
12  Webtap.ai                     12
13  sdk                            5
14  pinterest-scrapy-scraper       0
15  youtube-scrapy-scraper         0
