Top 23 Extract Open-Source Projects
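- video-subtitle-extractor: A GUI tool for extracting hard-coded subtitles (hardsub) from videos and generating srt files. No third-party API is required; text recognition runs locally. It is a deep-learning-based video subtitle extraction framework that covers subtitle region detection and subtitle content extraction.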
- SwiftSoup: Pure Swift HTML parser, with the best of DOM, CSS, and jQuery (supports Linux, iOS, Mac, tvOS, watchOS)
- sling-cli: Sling is a CLI tool that extracts data from a source storage/database and loads it into a target storage/database.
Project mention: Does iOS application development platform support HTML rendering? | /r/iOSProgramming | 2023-05-30
For parsing there is this amazing library, but again, that's only for parsing HTML, not rendering anything: https://github.com/scinfu/SwiftSoup
Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23
Project mention: pdfsam VS cpdf-binaries - a user suggested alternative | libhunt.com/r/pdfsam | 2023-08-18
Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | news.ycombinator.com | 2023-12-03
SEEKING FREELANCER | REMOTE | GERMANY
dltHub is looking for freelance help in the following repos:
- https://github.com/dlt-hub/dlt
Project mention: Ask HN: What's a good library/command line tool to extract tables from PDFs? | news.ycombinator.com | 2023-06-10
Have not tried it, but this has been in my bookmarks a while: https://github.com/camelot-dev/excalibur
Project mention: ScrapeGraphAI: Web scraping using LLM and direct graph logic | news.ycombinator.com | 2024-05-07
Agreed!
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
[0] https://apify.com/apify/website-content-crawler
[1] https://github.com/extractus/article-extractor
[2] https://magicloops.dev/
[3] https://reworkd.ai/
Extract related posts
- Improve your productivity with regexp-it-cli
- Jailer – open-source database client
- Show HN: Jailer is a unique open-source database client tool
- Database browsing tool with sophisticated and animated Java Swing UI
- Jailer is a unique open-source database client tool
- How can I scrape every .sensorpanel attachment from this thread?
- Jailer, a unique open-source database tool
Index
What are some of the best open-source Extract projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | video-subtitle-extractor | 4,943 |
2 | SwiftSoup | 4,341 |
3 | archiver | 4,244 |
4 | camelot | 3,553 |
5 | pdfsam | 3,121 |
6 | Jailer | 2,715 |
7 | UtinyRipper | 2,703 |
8 | Python | 2,069 |
9 | dlt | 1,792 |
10 | excalibur | 1,483 |
11 | vscode-glean | 1,455 |
12 | article-extractor | 1,416 |
13 | download | 1,273 |
14 | lessmsi | 1,221 |
15 | extrakto | 814 |
16 | extract-xiso | 607 |
17 | datashare | 549 |
18 | decompress | 409 |
19 | extract-loader | 316 |
20 | color.js | 278 |
21 | libarchivejs | 277 |
22 | MFT_Browser | 280 |
23 | sling-cli | 273 |