trafilatura
whoogle-search
Metric | trafilatura | whoogle-search
---|---|---
Mentions | 13 | 146
Stars | 2,853 | 8,815
Growth | - | -
Activity | 8.7 | 8.1
Latest commit | 2 days ago | 9 days ago
Language | Python | Python
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
trafilatura
- Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
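To give a sense of the "extra code" mentioned above, here is a minimal sketch of just one item from that list, a polite crawling policy, written with only the Python standard library. It is not how Trafilatura implements the feature; the robots.txt content, the `example-bot` user agent, and the helper names are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(policy: RobotFileParser, url: str, agent: str = "example-bot") -> bool:
    """Return True if the given user agent may fetch the URL."""
    return policy.can_fetch(agent, url)


policy = build_policy(ROBOTS_TXT)
blog_ok = is_allowed(policy, "https://example.com/blog/post")
private_ok = is_allowed(policy, "https://example.com/private/page")
delay = policy.crawl_delay("example-bot")  # seconds to wait between requests
```

A real crawler would also have to fetch and cache robots.txt per host and actually sleep for the crawl delay, and that is still only one of the features listed above.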
- Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
- Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
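The "Playwright to scrape, then Trafilatura to parse" idea could look roughly like the sketch below. It assumes both the `playwright` and `trafilatura` packages are installed (plus a Playwright browser); the function names and the choice to wait for `networkidle` are illustrative assumptions, not something taken from either project's recommendations.

```python
from typing import Callable, Optional


def render_with_playwright(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: waiting for network idle is enough for JS-heavy pages.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def parse_with_trafilatura(html: str) -> Optional[str]:
    """Extract the main text from HTML; returns None if nothing is found."""
    import trafilatura

    return trafilatura.extract(html)


def scrape_then_parse(
    url: str,
    render: Callable[[str], str] = render_with_playwright,
    parse: Callable[[str], Optional[str]] = parse_with_trafilatura,
) -> Optional[str]:
    """Render first, then hand the resulting HTML to the parser."""
    return parse(render(url))
```

Keeping the two steps as separate, swappable functions means the parser never needs to know how the HTML was obtained, so a plain HTTP download could replace the headless browser for static pages.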
- I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! Could you explain a bit further? :) I tried parsing a page with https://github.com/adbar/trafilatura, JSON-stringifying the result, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as input later? <3
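One common way to use an embeddings response later is to store each text chunk next to its vector, embed the question the same way, and rank the chunks by cosine similarity. The sketch below shows only that ranking step. The three-dimensional vectors are made-up stand-ins for real embeddings (which the API returns under `data[0].embedding`), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(
    question_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k chunk texts whose vectors are closest to the question."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy index: (chunk text, embedding). Real vectors would come from the API.
INDEX = [
    ("Pricing starts at $5 per month.", [0.9, 0.1, 0.0]),
    ("The team is based in Berlin.", [0.1, 0.9, 0.0]),
    ("Refunds are available within 30 days.", [0.7, 0.2, 0.1]),
]
QUESTION_VEC = [1.0, 0.0, 0.0]  # stand-in for the embedded question

best = top_chunks(QUESTION_VEC, INDEX, k=2)
```

The selected chunks can then be pasted into the prompt of a chat or completion request together with the question, so the model answers from the page content rather than from memory.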
- Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
whoogle-search
- So I deployed Whoogle on my NAS....
- Whoogle Search: Self-hosted, ad-free, privacy-respecting metasearch engine
- All my Open Source App Alternatives
Google Search → Whoogle
- Ask HN: Best search engine alternatives to Google?
- Redirect google searches to whoogle
I deployed Whoogle Search at home. On laptops it is fairly easy to set it as the default search for Alfred and the respective browser, but on iOS devices I have not found a way (except some rather clunky and intrusive extensions) to change the default search engine beyond the options Apple offers. Is there a way to do this via pfSense or Pi-hole, so that whenever someone searches from the address bar in Safari on iOS, pfSense redirects that search to the Whoogle URL?
- Do search parameters no longer work? (Google, Bing, DuckDuckGo...)
I host my own instance of Whoogle Search and have never looked back. It's like someone brought Google back to its golden age. No (direct) tracking, no ads, complete control over the results.
- Whoogle, Open Source Search Engine That Proxies Google Results
Source Code: https://github.com/benbusby/whoogle-search
- Best alternative to duckduckgo?
- Where to start?
What are some alternatives?
newspaper - newspaper3k: news, full-text, and article metadata extraction in Python 3.
Searx - Privacy-respecting metasearch engine
python-goose - HTML content / article extractor, web scraping lib in Python
searxng - SearXNG is a free internet metasearch engine which aggregates results from various search services and databases. Users are neither tracked nor profiled.
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
Nginx Proxy Manager - Docker container for managing Nginx proxy hosts with a simple, powerful interface
html2text - Convert HTML to Markdown-formatted text.
Piped - An alternative privacy-friendly YouTube frontend which is efficient by design.
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
awesome-selfhosted - A list of Free Software network services and web applications which can be hosted on your own servers
textract - extract text from any document. no muss. no fuss.
pi-hosted - Raspberry Pi Self Hosted Server Based on Docker / Portainer.io