extraction

Open-source projects categorized as extraction
Topics: Python, NLP, PDF, Images, C#

Top 23 extraction Open-Source Projects

  • Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

  • Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

    I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->Structured Text extractors (I have built several in the past, including https://github.com/axa-group/Parsr).

    For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), then uses a mixture of heuristics and machine learning models to reconstruct the document.

    Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).

    AMA

  • adversarial-robustness-toolbox

    Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

  • mtail

    extract internal monitoring data from application logs for collection in a timeseries database

  • Project mention: i need to visualize all logs from remote dir | /r/sysadmin | 2023-05-19

    You can do that with something like mtail. Basically write expressions that match your logs and produce metrics.
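
    The idea in that comment (expressions that match log lines and produce metrics) can be sketched in plain Python. This is only an illustration of the concept, not mtail's own program syntax; the metric names, patterns, and sample log lines below are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical rules: (metric name, pattern). Neither the names nor the
# patterns come from mtail; they only illustrate "match a line, bump a metric".
RULES = [
    ("http_requests_total",
     re.compile(r'"(?:GET|POST|PUT|DELETE) \S+ HTTP/[\d.]+" \d{3}')),
    ("http_errors_total", re.compile(r'" 5\d{2} ')),
    ("login_failures_total", re.compile(r"authentication failure", re.IGNORECASE)),
]


def count_metrics(lines):
    """Return a Counter mapping metric name -> number of matching log lines."""
    metrics = Counter()
    for line in lines:
        for name, pattern in RULES:
            if pattern.search(line):
                metrics[name] += 1
    return metrics


# Made-up sample lines in a common access-log / sshd style.
SAMPLE_LOG = [
    '10.0.0.1 - - [19/May/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [19/May/2023:10:00:01 +0000] "POST /api HTTP/1.1" 503 0',
    'sshd[42]: pam_unix(sshd:auth): authentication failure; user=root',
]

if __name__ == "__main__":
    for name, value in sorted(count_metrics(SAMPLE_LOG).items()):
        print(f"{name} {value}")
```

    A real deployment would tail the log files continuously and expose the counters to a time-series database, which is the part a tool like mtail takes care of.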

  • aubio

    a library for audio and music analysis

  • Project mention: Doing a project on an Audio to MIDI Converter, any help is appreciated | /r/learnprogramming | 2023-05-28

    Aubio is a good library for working with audio and MIDI: https://aubio.org/

  • GARbro

    Visual Novels resource browser

  • Project mention: Is there a way to download the Nekopara background pictures? | /r/NEKOPARAGAME | 2023-06-08
  • unblob

    Extract files from any kind of container formats

  • Project mention: Reverse-engineering an encrypted IoT protocol | news.ycombinator.com | 2024-02-14
  • tika-python

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

  • stanford-openie-python

    Stanford Open Information Extraction made simple!

  • thepipe

    Feed PDFs, URLs, Slides, YouTube, and more into GPT-4-Vision with one line of code ⚡

  • Project mention: Show HN: I just open sourced my document/website extractor for Vision-LLMs | news.ycombinator.com | 2024-04-02
  • unrpa

    A program to extract files from the RPA archive format.

  • File-Injector

    File Injector is a script that allows you to store any file in an image using steganography

  • OWLib

    Toolchain that lets you interact with the Overwatch files and extract models and stuff.

  • Project mention: over 300 hours on lucio and I've never heard this voice line | /r/luciomains | 2023-06-08

    There are tons of voicelines that you rarely hear. Overtools lets you rip not just all unlocked voicelines, but all "hero interactions" as well. It's super awesome: https://github.com/overtools/OWLib. You'd be surprised how many you don't catch!

  • SurvivCheatInjector

    An actual, updated, surviv.io cheat. Works great and we reply fast.

  • android-otp-extractor

    Extracts OTP tokens from rooted Android devices

  • jarchivelib

    A simple archiving and compression library for Java

  • tabula-sharp

    Extract tables from PDF files (port of tabula-java)

  • Project mention: What is the best library for processing table data contained within a PDF? | /r/dotnet | 2023-06-23

    This looks promising: https://github.com/BobLd/tabula-sharp

  • stegextract

    Detect hidden files and text in images

  • XADMaster

    Objective-C library for archive and file unarchiving and extraction

  • rakun2

    RaKUn 2.0 - A fast keyword detection algorithm

  • pydoxtools

    Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

  • Project mention: What is the most cost-efficient way to have an embedding generator endpoint that is using an open-source embedding model? [D] | /r/MachineLearning | 2023-06-01
  • doctor

    A microservice for document conversion at scale (by freelawproject)

  • chatnoir-resiliparse

    A robust web archive analytics toolkit

  • Project mention: Selenium over scrapy | /r/Python | 2023-05-04

    bs4 is a little slow; try https://github.com/chatnoir-eu/chatnoir-resiliparse. It's faster for working with the DOM, is written in Cython, and is based on lexbor (written in C and very fast).

  • twitter-emotions

    NLP tool to extract emotional phrase from tweets 🤩

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

extraction related posts

  • Show HN: I just open sourced my document/website extractor for Vision-LLMs

    2 projects | news.ycombinator.com | 2 Apr 2024
  • How are zlib, gzip and zip related?

    3 projects | news.ycombinator.com | 27 Nov 2023
  • Issue getting Parsr GUI up and running

    1 project | /r/docker | 13 Sep 2023
  • Is there a way to download the Nekopara background pictures?

    1 project | /r/NEKOPARAGAME | 8 Jun 2023
  • What is the most cost-efficient way to have an embedding generator endpoint that is using an open-source embedding model? [D]

    1 project | /r/MachineLearning | 1 Jun 2023
  • Anyone knows how to open .isa files in a visual novel game?

    1 project | /r/visualnovels | 4 Apr 2023
  • Inspiring OOP examples?

    1 project | /r/java | 12 Mar 2023

Index

What are some of the best open-source extraction projects? This list will help you:

 #  Project                         Stars
 1  Parsr                           5,656
 2  adversarial-robustness-toolbox  4,460
 3  mtail                           3,747
 4  aubio                           3,177
 5  GARbro                          2,106
 6  unblob                          2,054
 7  tika-python                     1,418
 8  stanford-openie-python            616
 9  thepipe                           556
10  unrpa                             547
11  File-Injector                     418
12  OWLib                             341
13  SurvivCheatInjector               228
14  android-otp-extractor             211
15  jarchivelib                       198
16  tabula-sharp                      136
17  stegextract                       107
18  XADMaster                          98
19  rakun2                             61
20  pydoxtools                         55
21  doctor                             50
22  chatnoir-resiliparse               42
23  twitter-emotions                   40
