Trafilatura Alternatives

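Similar projects and alternatives to trafilatura. Each entry below gives the number of mentions, stars, activity score, and primary language, followed by a one-line description.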

Bitwarden

1,055 14,286 9.8 C# trafilatura VS Bitwarden

The core infrastructure backend (API, database, Docker, etc). (by bitwarden)
PhotoPrism

510 32,525 9.9 Go trafilatura VS PhotoPrism

AI-Powered Photos App for the Decentralized Web 🌈💎✨
Invidious

421 14,937 9.5 Crystal trafilatura VS Invidious

Invidious is an alternative front-end to YouTube
Tautulli

419 5,354 8.6 Python trafilatura VS Tautulli

A Python based monitoring and tracking tool for Plex Media Server.
restic

357 23,706 9.7 Go trafilatura VS restic

Fast, secure, efficient backup program
filemanager

304 23,611 8.7 Go trafilatura VS filemanager

📂 Web File Browser
libreddit

283 4,996 5.2 Rust trafilatura VS libreddit

Private front-end for Reddit
docker

263 5,615 8.5 Shell trafilatura VS docker

⛴ Docker image of Nextcloud (by nextcloud)
ArchiveBox

248 19,737 9.7 Python trafilatura VS ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
docker-minecraft-server

211 8,340 9.4 Shell trafilatura VS docker-minecraft-server

Docker image that provides a Minecraft Server that will automatically download selected version at startup
whoogle-search

146 8,789 8.2 Python trafilatura VS whoogle-search

A self-hosted, ad-free, privacy-respecting metasearch engine
floccus

98 4,986 9.3 JavaScript trafilatura VS floccus

☁️ Sync your bookmarks privately across browsers and devices
ERPNext

80 16,847 10.0 Python trafilatura VS ERPNext

Free and Open Source Enterprise Resource Planning (ERP)
readability

51 8,056 6.3 JavaScript trafilatura VS readability

A standalone version of the readability lib
Pinry

27 2,996 0.0 Python trafilatura VS Pinry

Pinry, a tiling image board system for people who want to save, tag, and share images, videos and webpages in an easy to skim through format. It's open-source and self-hosted.
bookmarks

16 958 9.4 JavaScript trafilatura VS bookmarks

🔖 Bookmark app for Nextcloud
parser

12 5,245 1.1 JavaScript trafilatura VS parser

📜 Extract meaningful content from the chaos of a web page
docker-languagetool

10 400 5.9 Shell trafilatura VS docker-languagetool

Dockerfile for LanguageTool server - configurable
newspaper

13 13,703 0.0 Python trafilatura VS newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3. Advanced docs:
python-goose

0 3,942 0.0 HTML trafilatura VS python-goose

HTML content / article extractor, web scraping lib in Python

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number indicates a better trafilatura alternative or higher similarity.

trafilatura reviews and mentions

Posts with mentions or reviews of trafilatura. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-08-14.

Trafilatura: Python tool to gather text on the Web
3 projects | news.ycombinator.com | 14 Aug 2023

The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
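As a rough illustration of the "main article content" item in that list, here is a minimal sketch built on `trafilatura.extract`. The helper name `main_text` and its injectable `extract` argument are assumptions made for this example, so that it can be tried without the library installed; they are not part of Trafilatura itself.

```python
from typing import Callable, Optional


def main_text(html: str, extract: Optional[Callable[..., Optional[str]]] = None) -> str:
    """Return the main article text of an HTML document, or "" if none is found."""
    if extract is None:
        # Imported lazily so the helper also works with a stand-in extractor.
        import trafilatura
        extract = trafilatura.extract
    # trafilatura.extract returns None when it finds no usable content.
    text = extract(html, include_comments=False)
    return text or ""
```

With the library installed, the download step could be covered by `trafilatura.fetch_url(url)`, which returns `None` on failure, so a real script should check for that before calling `main_text`.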
Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
2 projects | news.ycombinator.com | 21 Apr 2023

The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
Powerful and free scraper with a headless browser under the hood and Readability for parsing
2 projects | /r/webscraping | 18 Mar 2023

I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
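The Playwright-then-Trafilatura idea could look roughly like the sketch below. It assumes Playwright's sync API with a Chromium build installed; the function names `render_html` and `scrape_article` are hypothetical, and both steps are passed in as parameters so that either can be swapped or stubbed.

```python
from typing import Callable, Optional


def render_html(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright  # lazy: optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Wait for network activity to settle so client-side content is present.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def scrape_article(
    url: str,
    render: Callable[[str], str] = render_html,
    extract: Optional[Callable[..., Optional[str]]] = None,
) -> str:
    """Render a page with `render`, then extract its main text with `extract`."""
    if extract is None:
        import trafilatura  # lazy: optional dependency
        extract = trafilatura.extract
    html = render(url)
    # Passing the URL lets the extractor use it for metadata; None means no content.
    return extract(html, url=url) or ""
```

Keeping rendering and extraction separate means static pages can skip the browser entirely by passing a plain HTTP fetcher as `render`.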
I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
2 projects | /r/SideProject | 6 Mar 2023

Cool! If you care to explain further... :) I tried parsing a page using https://github.com/adbar/trafilatura, JSON-stringifying the result, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
Testing fast installation in tear-down environment
1 project | /r/learnpython | 6 Jul 2022

I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
Advice on standard design pattern for comparison test script
1 project | /r/learnpython | 24 May 2022
Automate dependency installation
1 project | /r/learnpython | 9 Apr 2022
Issue with sklearn
2 projects | /r/learnpython | 8 Apr 2022
Questions about some code
1 project | /r/learnpython | 4 Apr 2022
How does Firefox's Reader View work?
15 projects | news.ycombinator.com | 30 Mar 2022

Stats

Basic trafilatura repo stats

Mentions

Stars: 2,740
Activity: 8.4
Last commit: 7 days ago

adbar/trafilatura is an open source project licensed under the Apache License 2.0, which is an OSI-approved license.

The primary programming language of trafilatura is Python.
