Metric | trafilatura | floccus
---|---|---
Mentions | 13 | 98
Stars | 2,853 | 5,047
Growth | - | 2.6%
Activity | 8.7 | 9.4
Last commit | 2 days ago | 5 days ago
Language | Python | JavaScript
License | Apache License 2.0 | Mozilla Public License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
trafilatura
- Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
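To give a sense of how much code even one item on that list takes, here is a naive sketch of URL de-duplication using only the standard library. The normalization rules (lowercase host, strip trailing slash, drop fragment) are assumptions chosen for illustration; this is not how Trafilatura implements it, and `normalize_url` and `deduplicate` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form for duplicate detection (naive rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Treat "/a" and "/a/" as the same page; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def deduplicate(urls):
    """Return URLs in first-seen order, skipping ones that normalize identically."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

A real crawler would also need to handle tracking parameters, redirects, and `www.` variants, which is part of why the comment above calls it an enormous amount of extra code.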
- Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
- Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
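The "Playwright to scrape, then Trafilatura to parse" idea could look roughly like the sketch below. It assumes both packages are installed (plus a Chromium build via `playwright install`); the function names are hypothetical, and the two steps are passed in as parameters so either one can be swapped out.

```python
from typing import Callable, Optional


def render_with_playwright(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so this module loads even where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


def extract_with_trafilatura(html: str) -> Optional[str]:
    """Return the main text of the page, or None if nothing was extracted."""
    import trafilatura

    return trafilatura.extract(html)


def scrape(
    url: str,
    render: Callable[[str], str] = render_with_playwright,
    extract: Callable[[str], Optional[str]] = extract_with_trafilatura,
) -> Optional[str]:
    """Render first, then parse; the two steps stay independent."""
    return extract(render(url))
```

For pages that do not need JavaScript, the `render` step could be replaced by a plain HTTP download without touching the extraction side.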
- I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! If you care to explain further... :) I tried parsing a page using https://github.com/adbar/trafilatura, JSON-stringifying it, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
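One common way to use an embeddings response later is to store the returned vector next to each chunk of page text, embed the question the same way, and rank chunks by cosine similarity before handing the best ones to a chat model. The sketch below shows only the ranking step; the three-dimensional vectors and the sample texts are made up for illustration (real embeddings are much longer), and `top_chunks` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_chunks(question_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are closest to the question.

    indexed_chunks is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data standing in for stored embeddings of page chunks.
index = [
    ("Pricing starts at $10 per month.", [0.9, 0.1, 0.0]),
    ("The team is based in Berlin.", [0.1, 0.9, 0.0]),
    ("Refunds are issued within 14 days.", [0.7, 0.0, 0.7]),
]
question = [1.0, 0.0, 0.1]  # stand-in for the embedded question
best = top_chunks(question, index)
```

The selected chunks would then be pasted into the prompt as context for the question.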
- Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
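A throwaway environment for this kind of install test can be sketched with the standard library alone: create a virtual environment in a temporary directory, install the package with its extras, run a check, and let the directory be deleted afterwards. The helper names are hypothetical, and the requirement string you pass (for example an extras name) is an assumption to verify against the package's own documentation.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Path of the interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, requirement: str) -> list:
    """pip invocation that installs `requirement` into the environment."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", requirement]


def try_install(requirement: str, check_code: str) -> bool:
    """Install into a fresh venv, run a check, then delete everything."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        venv.create(env_dir, with_pip=True)
        # Raises CalledProcessError if the installation itself fails.
        subprocess.run(install_command(env_dir, requirement), check=True)
        result = subprocess.run([str(venv_python(env_dir)), "-c", check_code])
        return result.returncode == 0
```

A call such as `try_install("trafilatura[all]", "import trafilatura")` would need network access, and the temporary directory is removed whether or not the check succeeds.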
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
floccus
- ⟳ 2 apps added, 13 updated at apt.izzysoft.de
floccus bookmark sync (version 5000002): Sync your bookmarks privately across browsers and devices
- Tab Sync between Browsers
- Floccus – Sync Bookmarks Privately
- Can Chrome Sync or Firefox Sync be trusted with sensitive data?
There are solutions external to the browsers that work pretty well and give you control over your data:
Floccus for bookmarks (https://floccus.org/): it also works on mobile devices, a great plus! You only need a WebDAV server (or a Nextcloud account); I use Dave (https://github.com/micromata/dave).
Vaultwarden for passwords (https://github.com/dani-garcia/vaultwarden).
A huge advantage of this solution is that you can also synchronize between different browsers and on mobile devices.
- Discount for bookmarks app: Bookmarks - Read Later ($8.99 -> $0.99)
I have used things like xsync, raindrop and others over the years and recently started using Floccus (https://floccus.org/), which is free and open source; it just does not support Safari. It keeps private bookmarks on my own sync system and can keep the bookmarks and tabs of any Chromium- or Firefox-based browser synced.
- Extension - Open Source Bookmark Sync
xBrowserSync and Floccus.
- Safari retakes second place in global browser market share, but Edge is close behind
Try floccus if you want to sync between different browsers.
- Looking for a selfhosted tool to store/sync/backup URLs using a Firefox extension
maybe this: floccus
- Looking for recommendations (Bookmarks/Links)
I've got floccus running between browsers for the bookmarks I use more often, and benotes for the ones I want to keep for reference or for later.
- Best cross-platform bookmark tracker/manager?
Floccus https://floccus.org/
What are some alternatives?
newspaper - newspaper3k is a news, full-text, and article metadata extraction library in Python 3.
Firefox Sync Server - Run-Your-Own Firefox Sync Server
python-goose - HTML Content / Article Extractor, web scraping lib in Python
synology-download-manager - An open source browser extension for adding/managing download tasks to your Synology DiskStation.
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
nightTab - A neutral new tab page accented with a chosen colour. Customise the layout, style, background and bookmarks with nightTab.
html2text - Convert HTML to Markdown-formatted text.
linkding - Self-hosted bookmark manager that is designed to be minimal, fast, and easy to set up using Docker.
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
api-docker - xBrowserSync API for Docker
textract - extract text from any document. no muss. no fuss.
SyncMarks-Extension - Browser Webextension for Firefox, Edge or Chromium derivatives to sync your bookmarks with a private backend.