trafilatura
Pinry
Metric | trafilatura | Pinry
---|---|---
Mentions | 13 | 27
Stars | 2,853 | 3,007
Growth | - | 0.6%
Activity | 8.7 | 0.0
Last commit | 2 days ago | 19 days ago
Language | Python | Python
License | Apache License 2.0 | BSD 2-clause "Simplified" License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
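The exact formula behind the activity number is not given here, so the following is only a sketch of the idea described above: commits are weighted by recency, and the resulting score is ranked against all tracked projects. The exponential decay, the 30-day half-life, and both function names are assumptions made for illustration.

```python
from datetime import date


def recency_weighted_commits(commit_dates, today, half_life_days=30.0):
    """Sum per-commit weights that halve every `half_life_days` (assumed decay)."""
    return sum(0.5 ** ((today - d).days / half_life_days) for d in commit_dates)


def activity_scores(raw_scores):
    """Map raw scores to a 0-10 scale by percentile rank among tracked projects."""
    total = len(raw_scores)
    scores = {}
    for name, score in raw_scores.items():
        # Count how many tracked projects have a strictly lower raw score.
        below = sum(1 for other in raw_scores.values() if other < score)
        scores[name] = round(10.0 * below / total, 1)
    return scores
```

Under this scheme a project that outranks 90% of the tracked projects gets 9.0, which matches the "top 10%" reading above.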
trafilatura
- Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
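To give a feel for the "extra code" point, here is a minimal standard-library sketch of just one item on that list, URL de-duplication. It is not how Trafilatura does it; the normalization rules and the function names are assumptions chosen to keep the example short.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a canonical form so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Simplification: treat "/a" and "/a/" as the same page (not true on every server).
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, since it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def deduplicate(urls):
    """Return the URLs with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Even this toy version has to make judgment calls (trailing slashes, query strings, tracking parameters), and that is before crawling policies, feeds, or content heuristics enter the picture.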
- Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
- Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
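The "Playwright to scrape, then Trafilatura to parse" idea could look roughly like the sketch below. It assumes both packages are installed and that a Chromium build has been set up for Playwright; the three function names are made up for this example, and the imports are deferred so either half can be swapped out.

```python
def fetch_rendered_html(url):
    """Load the page in headless Chromium and return the DOM after scripts have run."""
    from playwright.sync_api import sync_playwright  # assumed to be installed

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html


def extract_main_text(html):
    """Hand the rendered HTML to Trafilatura; returns None if nothing is extracted."""
    import trafilatura  # assumed to be installed

    return trafilatura.extract(html)


def scrape(url, render=fetch_rendered_html, extract=extract_main_text):
    """Render first, then extract; both steps are replaceable callables."""
    return extract(render(url))
```

Keeping rendering and extraction as separate callables means a plain HTTP download can replace the headless browser for static pages without touching the parsing step.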
- I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! If you care to explain further... :) ... I tried parsing a page using https://github.com/adbar/trafilatura, JSON-stringifying the result, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
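The thread excerpt does not include an answer, so the following is one common pattern rather than what the extension actually does: keep each text chunk next to its embedding vector, embed the question with the same model later, and rank the stored chunks by cosine similarity. The short vectors below are hand-written stand-ins for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_relevant_chunks(question_vector, indexed_chunks, top_k=2):
    """Rank (text, vector) pairs stored at indexing time against the question vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# Stand-in data: in practice each vector would come from the embeddings response.
index = [
    ("pricing section", [1.0, 0.0]),
    ("install guide", [0.0, 1.0]),
    ("billing FAQ", [0.9, 0.1]),
]
top = most_relevant_chunks([1.0, 0.0], index, top_k=2)
```

The top-ranked chunks are then pasted into the prompt as context for the question, so the vectors themselves are never sent back to the chat model.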
- Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
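One way to approach this with only the standard library is a throwaway virtual environment that is deleted after the check. This is a sketch under assumptions: the helper names are invented here, and both the requirement string and the check code passed in are placeholders, since the post does not say which extras or which script are meant.

```python
import os
import subprocess
import tempfile
import venv


def env_python(env_dir):
    """Path to the interpreter inside a virtual environment."""
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


def install_command(env_dir, requirement):
    """argv for installing `requirement` with the environment's own pip."""
    return [env_python(env_dir), "-m", "pip", "install", requirement]


def check_fresh_install(requirement, check_code):
    """Install into a temporary venv, run `check_code` there, then discard the venv."""
    with tempfile.TemporaryDirectory() as env_dir:
        venv.EnvBuilder(with_pip=True).create(env_dir)
        subprocess.run(install_command(env_dir, requirement), check=True)
        result = subprocess.run([env_python(env_dir), "-c", check_code])
        return result.returncode == 0
```

A call such as `check_fresh_install("trafilatura", "import trafilatura")` would need network access; to cover optional dependencies, pass the extras name from the project's installation docs in the requirement string.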
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
Pinry
- Is there a Fediverse alternative to Pinterest?
I don't think there's anything around yet. A self-hosted Pinterest alternative would be Pinry. It doesn't have federation functionality, IIRC, but it's open source, so it could be a starting point.
- I am looking for a software that works kind of like Pinterest, so that I can just grab stuff, upload it, and categorize it on a backup server
Pinry is what you need. I think it also has a browser plugin to save images and whole webpages while browsing anything.
- This Week in Self-Hosted (7 April 2023)
This week we're featuring several updates from the LinuxServer.io team as well as a few unique software launches (some created by members of this sub!). We've also spotlighted a self-hosted Pinterest-like application called Pinry - check it out!
- Pinterest alternative?
Hi there - PikaPods (https://www.pikapods.com/apps) has a service called Pinry on there. It costs a small amount a month but should do what you want: https://docs.getpinry.com/. I use it to bookmark images!
- Does anything like Pintrest exist, but open source?
- What Are Your Most Used Self Hosted Applications?
- Pinry: A Selfhosted Alternative to Pinterest
- Pinry Docs
What are some alternatives?
newspaper - newspaper3k is a news, full-text, and article metadata extraction in Python 3.
Myyna
python-goose - Html Content / Article Extractor, web scrapping lib in Python
Shaarli - The personal, minimalist, super-fast, database free, bookmarking service - community repo
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
bookmarks - 🔖 Bookmark app for Nextcloud
html2text - Convert HTML to Markdown-formatted text.
Scuttle - Web-based social bookmarking system. Allows multiple users to store, share and tag their favourite links online.
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
Lobsters - Computing-focused community centered around link aggregation and discussion
textract - extract text from any document. no muss. no fuss.
dyu/bookmarks - a simple self-hosted bookmarking app that can import bookmarks from delicious and chrome