trafilatura
docker
| Metric | trafilatura | docker |
|---|---|---|
| Mentions | 13 | 263 |
| Stars | 2,853 | 5,635 |
| Growth | - | 1.4% |
| Activity | 8.7 | 8.5 |
| Last commit | 2 days ago | 8 days ago |
| Language | Python | Shell |
| License | Apache License 2.0 | GNU Affero General Public License v3.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
trafilatura
-
Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
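To give a sense of how much code even one item on that list takes, here is a minimal sketch of URL de-duplication using only the Python standard library. It is not how Trafilatura does it. The helper names `normalize_url` and `deduplicate` are hypothetical, and the normalization rules (lowercase scheme and host, drop the trailing slash and the fragment) are deliberately simplistic assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "/a" and "/a/" as the same page; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate(urls):
    """Return URLs in first-seen order, skipping normalized repeats."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

A real crawler would also have to decide how to treat tracking parameters, redirects, and `www.` prefixes, which is part of the extra code the comment refers to.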
-
Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
-
Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
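The "Playwright to scrape, then Trafilatura to parse" idea could look roughly like the sketch below. It assumes both packages are installed along with a Chromium build for Playwright. The function names and the `networkidle` wait strategy are illustrative choices, not something the commenter specified.

```python
from typing import Callable, Optional


def render_with_playwright(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: waiting for network idle is enough for the JS to finish.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def extract_with_trafilatura(html: str) -> Optional[str]:
    """Extract the main text from HTML; returns None if nothing is found."""
    import trafilatura

    return trafilatura.extract(html)


def scrape(
    url: str,
    render: Callable[[str], str] = render_with_playwright,
    extract: Callable[[str], Optional[str]] = extract_with_trafilatura,
) -> Optional[str]:
    """Render first, then extract; both steps can be swapped out."""
    return extract(render(url))
```

Keeping the two steps as separate callables means the renderer can be replaced (for example with a plain HTTP fetch for static pages) without touching the extraction side.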
-
I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! If you care to explain further... :) ... I tried parsing a page using https://github.com/adbar/trafilatura, JSON-stringifying it, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
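One common way to use embeddings later (not necessarily what the extension does) is to store each text chunk next to its vector, embed the question the same way, and rank chunks by cosine similarity. The sketch below shows only the ranking step in plain Python. The function names are hypothetical, and the vectors passed in are assumed to come from the embeddings endpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, stored):
    """stored is a list of (text, vector) pairs; most similar text first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]
```

The top-ranked chunks would then typically be pasted into the prompt as context for the question.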
-
Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
docker
- Has Anyone Created A Working Docker Container?
-
NextCloud Docker
Am I better off using this container: https://hub.docker.com/_/nextcloud/
-
Issues with urandom + Docker due to DSM kernel
It looks like I'm not the only person who has faced this. apache-based images require buster, for instance, and some docker images that rely on Ruby face issues too (for example, I decided to try setting up Postal but it looks like it's facing the same issues).
-
Complete noob, hit a wall trying to get nextcloud working
If you look at the info on https://hub.docker.com/_/nextcloud, you'll see that you ought to be specifying at least one volume, so that the data you want to keep does not disappear on restart and you can access the config files.
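As a rough illustration of the volume advice, the sketch below builds a `docker run` command that mounts a named volume at `/var/www/html`, the data directory used in the image's documentation. The volume name, host port, and helper names are arbitrary assumptions, and a real setup would more likely live in a compose file.

```python
import subprocess


def nextcloud_run_command(volume="nextcloud", host_port=8080, image="nextcloud"):
    """Build the argv for a detached Nextcloud container with persistent data."""
    return [
        "docker", "run", "-d",
        "-p", f"{host_port}:80",          # expose the web UI on the host
        "-v", f"{volume}:/var/www/html",  # named volume survives container restarts
        image,
    ]


def start_nextcloud(**kwargs):
    """Run the command; requires a working Docker installation."""
    return subprocess.run(nextcloud_run_command(**kwargs), check=True)
```

Without the `-v` pair, everything written inside the container is lost when the container is removed.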
-
Still issues with max file upload size after following instructions
Yes, and this is what I passed through as the env variables in the docker compose file, as the docs state: https://hub.docker.com/_/nextcloud
-
Memories no preview or thumbnails
You can't map binaries from host to container like this. What you need is a custom Dockerfile that installs ffmpeg. https://github.com/nextcloud/docker/blob/master/.examples/dockerfiles/full/fpm/Dockerfile
-
VPS + CF Tunnel + docker
A good example to start with is at https://github.com/nextcloud/docker/tree/master/.examples/docker-compose/insecure/postgres/apache
- Docker Compose for NextCloudPi?
- Run docker inside docker for Nextcloud AiO?
-
My first mini-lab
Nextcloud is free and open source. The easiest way to install it is via docker containers.
What are some alternatives?
newspaper - newspaper3k is a news, full-text, and article metadata extraction library in Python 3.
all-in-one - 📦 The official Nextcloud installation method. Provides easy deployment and maintenance with most features included in this one Nextcloud instance.
python-goose - HTML Content / Article Extractor, web scraping lib in Python
nextcloud-snap - ☁️📦 Nextcloud packaged as a snap [Moved to: https://github.com/nextcloud-snap/nextcloud-snap]
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
NextCloudPi - 📦 Build code for NextcloudPi: Raspberry Pi, Odroid, Rock64, curl installer...
html2text - Convert HTML to Markdown-formatted text.
Invidious - Invidious is an alternative front-end to YouTube
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
Navidrome Music Server - 🎧☁️ Modern Music Server and Streamer compatible with Subsonic/Airsonic
textract - extract text from any document. no muss. no fuss.
Nextcloud - ☁️ Nextcloud server, a safe home for all your data