text-extraction

Open-source projects categorized as text-extraction

Top 12 text-extraction Open-Source Projects

  • sumy

    Module for automatic summarization of text documents and HTML pages.

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

  • Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

    The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

    Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.

  • unipdf

    Golang PDF library for creating and processing PDF files (pure go)

  • Project mention: PDF Annotations and Collaboration with Golang PDF Library | dev.to | 2023-09-01

    But how do we elevate our interaction with these files? How do we make annotations, edits, and collaboration more seamless? Here's where the Golang PDF Library swoops in like a superhero for your PDF woes.

  • tika-python

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

  • datashare

    A self-hosted search engine for documents.

  • pdftools

    Text Extraction, Rendering and Converting of PDF Documents

  • srt

    A simple library and set of tools for parsing, modifying, and composing SRT files. (by cdown)

  • hotpdf

    hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

  • Project mention: Show HN: Hotpdf – Search and Extract text within PDFs | news.ycombinator.com | 2024-02-27
  • PDFIO.jl

    PDF Reader Library for Native Julia.

  • cat

    Extract text from plaintext, .docx, .odt and .rtf files. Pure go.

  • docwire

    DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality

  • Project mention: DocWire SDK: Award-winning modern data processing in C++17/20 | news.ycombinator.com | 2024-01-04
  • ApyHub

    ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.

  • Project mention: Exploring the Top SERP (Search Engine Result Pages) APIs | dev.to | 2024-04-03

    Most of the providers let you test the APIs right from a terminal in the platform. Brightdata also provides an interactive API Playground for developers to test and check SERP Ranking for multiple search engines. ApyHub also has an API Playground where you can test the APIs in the UI plus you can see the snippet of the provided input in cURL. DataForSEO and Oxylabs don’t provide any internal tool for API tests but provide cURL request code to test on API Clients

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

text-extraction related posts

  • Trafilatura: Python tool to gather text on the Web

    3 projects | news.ycombinator.com | 14 Aug 2023
  • I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)

    2 projects | /r/SideProject | 6 Mar 2023
  • Testing fast installation in tear-down environment

    1 project | /r/learnpython | 6 Jul 2022
  • Advice on standard design pattern for comparison test script

    1 project | /r/learnpython | 24 May 2022
  • Leaks! how to organize them?

    4 projects | /r/OSINT | 2 May 2022
  • Journalism/research/information linking/graph db

    4 projects | /r/selfhosted | 21 Apr 2022
  • Automate dependency installation

    1 project | /r/learnpython | 9 Apr 2022

Index

What are some of the best open-source text-extraction projects? This list will help you:

#   Project      Stars
1   sumy         3,428
2   trafilatura  2,898
3   unipdf       2,375
4   tika-python  1,420
5   datashare      547
6   pdftools       499
7   srt            429
8   hotpdf         167
9   PDFIO.jl       122
10  cat             88
11  docwire         51
12  ApyHub           5
