trafilatura
bookmarks
| | trafilatura | bookmarks |
|---|---|---|
| | 13 | 16 |
| Stars | 2,853 | 964 |
| Growth | - | 1.5% |
| Activity | 8.7 | 9.4 |
| | 2 days ago | 4 days ago |
| Language | Python | JavaScript |
| License | Apache License 2.0 | GNU Affero General Public License v3.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
trafilatura
-
Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
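To give a sense of the "extra code" involved, here is a minimal sketch of just one item from that list, URL de-duplication, written with the standard library only. It is not Trafilatura's implementation; the normalization rules and the `TRACKING_PARAMS` set are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters treated as tracking noise (an assumption for this sketch).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop a trailing slash, but keep the root path as "/".
    path = parts.path.rstrip("/") or "/"
    # Remove tracking parameters and sort the rest for a stable order.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, query, ""))


def deduplicate(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Even this small piece needs decisions about case, trailing slashes, query order, and fragments, which is the point the comment makes about doing everything by hand.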
-
Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
-
Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
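The "Playwright to scrape, then Trafilatura to parse" idea could look roughly like the sketch below. It assumes both packages (and a Playwright browser) are installed; the function names `render_with_playwright`, `parse_with_trafilatura`, and `scrape` are hypothetical, and the two steps are passed in as parameters so either one can be swapped out.

```python
from typing import Callable, Optional


def render_with_playwright(url: str) -> str:
    """Load `url` in headless Chromium and return the rendered HTML."""
    # Imported lazily so this module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


def parse_with_trafilatura(html: str) -> Optional[str]:
    """Extract the main text from HTML; None if nothing usable is found."""
    # Imported lazily for the same reason as above.
    import trafilatura

    return trafilatura.extract(html)


def scrape(
    url: str,
    render: Callable[[str], str] = render_with_playwright,
    parse: Callable[[str], Optional[str]] = parse_with_trafilatura,
) -> Optional[str]:
    """Render the page first, then hand the resulting HTML to the extractor."""
    return parse(render(url))
```

Calling `scrape("https://example.com")` would run both steps with the defaults. Keeping rendering and extraction separate means a plain HTTP download can replace the browser for pages that do not need JavaScript.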
-
I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! If you care to explain further... :) ... I tried parsing a page using https://github.com/adbar/trafilatura, JSON-stringifying the result, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
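One common way to use embedding vectors later (not necessarily what the extension in this thread does) is to store one vector per text chunk, embed the question the same way, and rank the chunks by cosine similarity. The sketch below uses tiny hand-written vectors in place of real API responses; `rank_chunks` and the sample data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, stored):
    """Return the stored texts ordered from most to least similar to the query.

    `stored` is a list of (text, vector) pairs saved from earlier embedding calls.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Toy two-dimensional vectors standing in for real embeddings.
stored = [("cats", [1.0, 0.0]), ("dogs", [0.0, 1.0]), ("pets", [1.0, 1.0])]
best_first = rank_chunks([1.0, 0.1], stored)
```

The top-ranked chunks would then be pasted into the prompt as context for the question, rather than feeding the vectors themselves back to the model.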
-
Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
bookmarks
-
Useful browser extensions and their associated selfhosted services.
I don't know why he didn't just point you to the API link on their GitHub page.
-
What's a software you searched to selfhost but is still missing to you ?
There's Nextcloud Bookmarks: https://github.com/nextcloud/bookmarks
-
Searching for a tag based bookmark manager
I might try https://github.com/nextcloud/bookmarks from there, thanks.
-
A good self hosted bookmark service
The nextcloud bookmarks app works great for me: https://apps.nextcloud.com/apps/bookmarks
-
Alternatives to nextcloud? Is this abandonware?
example: https://github.com/nextcloud/bookmarks/issues/1305
- Self hosted app with web clipper feature
-
NextCloud overwrote ALL data on SMB/CIFS share
And this kind of thing (loss or undesired modification of data) unfortunately seems to be recurring? This issue is a bit older.
- Looking for a tool that generates reader mode of articles
- I centralize and distribute my bookmarks
-
Offline bookmark manager
NextCloud has a Bookmark app: https://apps.nextcloud.com/apps/bookmarks
What are some alternatives?
newspaper - newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.
Pinry - Pinry, a tiling image board system for people who want to save, tag, and share images, videos and webpages in an easy to skim through format. It's open-source and self-hosted.
python-goose - HTML Content / Article Extractor, web scraping lib in Python
floccus - :cloud: Sync your bookmarks privately across browsers and devices
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
docker-ubooquity
html2text - Convert HTML to Markdown-formatted text.
LinkAce - LinkAce is a self-hosted archive to collect links of your favorite websites.
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
docker-booksonic-air
textract - extract text from any document. no muss. no fuss.
Bitwarden - The core infrastructure backend (API, database, Docker, etc).