article-extractor
tarsier
| | article-extractor | tarsier |
|---|---|---|
| Mentions | 4 | 13 |
| Stars | 1,895 | 1,761 |
| Growth | 1.2% | 0.6% |
| Activity | 6.2 | 8.6 |
| Last commit | about 1 month ago | over 1 year ago |
| Language | JavaScript | Jupyter Notebook |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
article-extractor
-
Show HN: I built an AI satirical news site because news was depressing me
Actually, I kept it simple - I use the original images from the news articles! When I fetch an article through RSS and extract its content using the @extractus/article-extractor library, it pulls the main image along with the content.
https://github.com/extractus/article-extractor
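The comment above says the extractor returns the article's main image along with its content. The snippet below is not the library's code. It is a minimal Python sketch of one common heuristic for finding a lead image: prefer the Open Graph `og:image` meta tag and fall back to the first `<img>`. The names `MainImageParser` and `find_main_image` and the sample HTML are hypothetical, and which heuristic article-extractor actually uses is an assumption.

```python
from html.parser import HTMLParser


class MainImageParser(HTMLParser):
    """Collects the Open Graph image and the first <img> as a fallback."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.first_img = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            # Keep only the first og:image declared on the page.
            if self.og_image is None:
                self.og_image = attrs.get("content")
        elif tag == "img" and self.first_img is None:
            self.first_img = attrs.get("src")


def find_main_image(html):
    """Return the og:image URL if present, else the first <img> src, else None."""
    parser = MainImageParser()
    parser.feed(html)
    parser.close()
    return parser.og_image or parser.first_img


# Hypothetical page used only for illustration.
sample_html = """
<html><head>
<meta property="og:image" content="https://example.com/lead.jpg">
</head><body><img src="/logo.png"><p>Story text</p></body></html>
"""
print(find_main_image(sample_html))
```

Preferring the meta tag over the first `<img>` avoids picking up logos or tracking pixels that appear early in the markup.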
-
ScrapeGraphAI: Web scraping using LLM and direct graph logic
Agreed!
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
[0] https://apify.com/apify/website-content-crawler
[1] https://github.com/extractus/article-extractor
[2] https://magicloops.dev/
[3] https://reworkd.ai/
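The comment above notes that raw HTML is often too large for an LLM context window. One simple mitigation, sketched here with Python's standard library only, is to drop script and style content and keep just the visible text before truncating to a budget. This is an illustrative assumption, not what Apify, Magic Loops, or Reworkd actually do. `shrink_html` and `max_chars` are hypothetical names, and a character count is only a rough stand-in for a token limit.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside script/style/noscript."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def shrink_html(html, max_chars):
    """Return the page's visible text, one chunk per line, cut to max_chars."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]


# Hypothetical page used only for illustration.
page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>First paragraph.</p></body></html>"
)
print(shrink_html(page, 1000))
```

Hard truncation can cut off the content the user wanted, which is consistent with the long-tail difficulty the comment describes.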
-
How do Instapaper and Pocket apps extract the content of the articles?
Edit: I found this Node.js library useful for article extraction. Anyone looking for something like this can take a look: https://github.com/extractus/article-extractor
- How to get the main topic of a Web article?
tarsier
-
Ask HN: Who is hiring? (November 2024)
Reworkd | Backend / Infrastructure | ONSITE San Francisco
At https://reworkd.ai/, we're building application layer LLM agents to extract web data at scale. We are foundational data infrastructure for startups today that are fine tuning models or building some web data constrained product. We're backed by YC, Paul Graham, AI grant, and many others.
We're looking for backend/infrastructure/full stack engineers to:
-
Show HN: Finic – open-source platform for building browser automations
https://github.com/reworkd/tarsier/pull/115/files represents someone who does not know what git is used for
Cloning into 'tarsier'...
-
A single ChatGPT mistake cost us $10k
Yes, thank you, I had the exact same experience. The actual project is probably https://reworkd.ai/
-
Ask HN: Who is hiring? (June 2024)
Reworkd (https://reworkd.ai/) | San Francisco (In-person) | Full-time
We're hiring a founding backend engineer to help us build infrastructure to run web agents at scale.
We're a super small, scrappy team of four that's been working on the application layer of web agents since its inception. Our projects have over 30k stars on GitHub, we're backed by PG himself + a bunch of great investors, and we have 3+ years of runway. Join us if you want to grind and want a lot of ownership. (More info about the role in our job posting)
You can either apply through bookface (https://www.ycombinator.com/companies/reworkd/jobs/4f6BHpT-f...) or email me directly (asim@reworkd.ai) with the following:
- FLaNK-AIM: 20 May 2024 Weekly
-
Show HN: Tarsier – vision for text-only LLM web agents that beats GPT-4o
We run OCR on the screenshot and convert it to whitespace-structured text, which is passed to the LLM. The images below might make it clearer for you:
[1] https://github.com/reworkd/tarsier/blob/main/.github/assets/...
[2] https://github.com/reworkd/tarsier/blob/main/.github/assets/...
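To make "whitespace-structured text" concrete, here is a minimal Python sketch of the general idea: each OCR word is placed on a character grid according to its pixel position, so the horizontal and vertical layout survives as spaces and blank lines. This is not Tarsier's implementation. The function `boxes_to_text`, the `(text, x, y)` input format, and the `char_width` and `line_height` values are all assumptions made for illustration.

```python
def boxes_to_text(boxes, char_width=10, line_height=20):
    """Render OCR words as plain text that preserves their rough layout.

    boxes: iterable of (text, x, y) with x, y as top-left pixel coordinates.
    char_width / line_height: assumed pixel size of one character cell.
    """
    if not boxes:
        return ""

    # Bucket each word into a grid row, remembering its target column.
    rows = {}
    for text, x, y in boxes:
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                # Words would collide: keep a single separating space.
                line += " "
            else:
                # Pad with spaces up to the word's target column.
                line += " " * (col - len(line))
            line += text
        lines.append(line)
    return "\n".join(lines)


# Hypothetical OCR output for a small login form.
boxes = [("Login", 0, 0), ("Sign", 300, 0), ("up", 350, 0), ("Email:", 0, 40)]
print(boxes_to_text(boxes))
```

In this sketch the gap between "Login" and "Sign up" and the blank line before "Email:" are the only layout cues a text-only model receives, which is the point of structuring the text with whitespace.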
-
ScrapeGraphAI: Web scraping using LLM and direct graph logic
Agreed!
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
[0] https://apify.com/apify/website-content-crawler
[1] https://github.com/extractus/article-extractor
[2] https://magicloops.dev/
[3] https://reworkd.ai/
- Control the browser using GPT-4 vision by AgentGPT team
- Show HN: GPT-4 vision utilities to browse the web
What are some alternatives?
readability-extractor - Javascript/Node wrapper around Mozilla's Readability library so that ArchiveBox can call it as a oneshot CLI command to extract each page's article text.
skyvern - Automate browser based workflows with AI
penthouse - Generate critical css for your web pages
dude - dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
threadRoll-frontend - Roll your articles to a Twitter thread
RDKit - The official sources for the RDKit library