Top 23 extraction Open-Source Projects
-
adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
-
mtail
extract internal monitoring data from application logs for collection in a timeseries database
-
tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
-
File-Injector
File Injector is a script that allows you to store any file in an image using steganography
-
pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->structured text extractors (I have built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction). It then uses a mixture of heuristics and machine learning models to reconstruct the document.
Once plugged into a recursive retrieval strategy, it lets you get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA
You can do that with something like mtail. Basically write expressions that match your logs and produce metrics.
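To make the idea in that comment concrete, here is a minimal Python sketch of "expressions that match your logs and produce metrics". It is not mtail's own program syntax and not how mtail is implemented; the metric names, the `RULES` table, the `count_metrics` helper and the sample log lines are all hypothetical and only illustrate the pattern-to-counter approach.

```python
import re
from collections import Counter

# Hypothetical rules: each metric name maps to a regex tested against one log line.
RULES = {
    "http_requests_total": re.compile(r'"(?:GET|POST|PUT|DELETE) \S+ HTTP/[\d.]+" \d{3}'),
    "http_errors_total": re.compile(r'HTTP/[\d.]+" 5\d{2}'),
}

def count_metrics(lines):
    """Return a Counter mapping metric name -> number of matching log lines."""
    metrics = Counter()
    for line in lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                metrics[name] += 1
    return metrics

# Made-up sample input in a common access-log shape (an assumption, not real data).
SAMPLE_LOG = [
    '127.0.0.1 - - [28/May/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '127.0.0.1 - - [28/May/2023:10:00:01 +0000] "POST /api HTTP/1.1" 500 87',
    'worker started',
]

if __name__ == "__main__":
    # Print one "name value" pair per metric, similar in spirit to a metrics dump.
    for name, value in sorted(count_metrics(SAMPLE_LOG).items()):
        print(f"{name} {value}")
```

A real deployment would tail the log file continuously and expose the counters to a time series database, which is the part a dedicated tool such as mtail handles for you.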
Project mention: Doing a project on an Audio to MIDI Converter, any help is appreciated | /r/learnprogramming | 2023-05-28
Aubio is a good library for working with audio and MIDI: https://aubio.org/
Project mention: Is there a way to download the Nekopara background pictures? | /r/NEKOPARAGAME | 2023-06-08
Project mention: Show HN: I just open sourced my document/website extractor for Vision-LLMs | news.ycombinator.com | 2024-04-02
Project mention: over 300 hours on lucio and I've never heard this voice line | /r/luciomains | 2023-06-08
There are tons of voicelines that you rarely hear. Overtools lets you rip not just all unlocked voicelines, but all "hero interactions" as well. It's super awesome: https://github.com/overtools/OWLib. You'd be surprised how many you don't catch!
Project mention: What is the best library for processing table data contained within a PDF? | /r/dotnet | 2023-06-23
This looks promising: https://github.com/BobLd/tabula-sharp
Project mention: What is the most cost-efficient way to have an embedding generator endpoint that is using an open-source embedding model? [D] | /r/MachineLearning | 2023-06-01
bs4 is a little slow; try https://github.com/chatnoir-eu/chatnoir-resiliparse. It's faster for working with the DOM, as it is written in Cython and based on lexbor (written in C and very fast).
extraction related posts
-
Show HN: I just open sourced my document/website extractor for Vision-LLMs
-
How are zlib, gzip and zip related?
-
Issue getting Parsr GUI up and running
-
Is there a way to download the Nekopara background pictures?
-
What is the most cost-efficient way to have an embedding generator endpoint that is using an open-source embedding model? [D]
-
Anyone knows how to open .isa files in a visual novel game?
-
Inspiring OOP examples?
-
Index
What are some of the best open-source extraction projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Parsr | 5,656 |
2 | adversarial-robustness-toolbox | 4,460 |
3 | mtail | 3,747 |
4 | aubio | 3,177 |
5 | GARbro | 2,106 |
6 | unblob | 2,054 |
7 | tika-python | 1,418 |
8 | stanford-openie-python | 616 |
9 | thepipe | 556 |
10 | unrpa | 547 |
11 | File-Injector | 418 |
12 | OWLib | 341 |
13 | SurvivCheatInjector | 228 |
14 | android-otp-extractor | 211 |
15 | jarchivelib | 198 |
16 | tabula-sharp | 136 |
17 | stegextract | 107 |
18 | XADMaster | 98 |
19 | rakun2 | 61 |
20 | pydoxtools | 55 |
21 | doctor | 50 |
22 | chatnoir-resiliparse | 42 |
23 | twitter-emotions | 40 |