PHP Scraper vs trafilatura
| | PHP Scraper | trafilatura |
|---|---|---|
| Mentions | 1 | 13 |
| Stars | 497 | 2,853 |
| Growth | - | - |
| Activity | 5.3 | 8.7 |
| Last commit | 23 days ago | about 14 hours ago |
| Language | PHP | Python |
| License | GNU General Public License v3.0 or later | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
PHP Scraper
- Suggestions for PHP web scraper libraries?
I am familiar with Goutte, and when I searched "goutte alternatives" I found links to Embed, PHP Scraper, and PHP Spider. But I haven't tried them and don't know if there's any reason to prefer one over the others.
trafilatura
- Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
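To give a sense of the "extra code" the comment refers to, here is a minimal standard-library sketch of just two items from that list: sitemap parsing and URL de-duplication. It is not Trafilatura's implementation; the function names (`normalize_url`, `sitemap_urls`, `deduplicate`) and the normalization rules are assumptions chosen for illustration.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url>, <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash.

    These rules are an illustrative assumption, not a complete canonicalizer.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def sitemap_urls(xml_text: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def deduplicate(urls: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Even this covers only a small part of the list; polite crawling, feed parsing, main-content heuristics, and language detection would each need comparable work.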
- Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
- Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
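The "Playwright to scrape, then Trafilatura to parse" idea from the comment above could be sketched roughly as follows. This assumes both packages are installed (plus a Chromium build via `playwright install chromium`); the function names `render_with_playwright`, `extract_main_text`, and `scrape` are hypothetical, and the sketch has not been tuned for any particular site.

```python
from typing import Callable, Optional


def render_with_playwright(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # HTML after JavaScript has run
        browser.close()
    return html


def extract_main_text(html: str) -> Optional[str]:
    """Pass already-downloaded HTML to Trafilatura for main-text extraction."""
    import trafilatura

    # extract() returns None when no main content is found.
    return trafilatura.extract(html)


def scrape(
    url: str,
    render: Callable[[str], str] = render_with_playwright,
    extract: Callable[[str], Optional[str]] = extract_main_text,
) -> Optional[str]:
    """Render first, then extract; both steps can be swapped out."""
    return extract(render(url))
```

Keeping the two steps as separate callables means the headless browser is only needed for JavaScript-heavy pages; for static pages, `render` could be replaced with a plain HTTP download.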
- I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! If you care to explain further... :) I tried parsing a page with https://github.com/adbar/trafilatura, JSON-stringifying the result, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
- Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
What are some alternatives?
Goutte - Goutte, a simple PHP Web Scraper
newspaper - newspaper3k is a news, full-text, and article metadata extraction library in Python 3.
PHP Spider - A configurable and extensible PHP web spider
python-goose - HTML Content / Article Extractor, web scraping lib in Python
Embed - Get info from any web service or page
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
twitteroauth - The most popular PHP library for use with the Twitter OAuth REST API.
html2text - Convert HTML to Markdown-formatted text.
hashids - A small PHP library to generate YouTube-like ids from numbers. Use it when you don't want to expose your database ids to the user.
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
PHPMailer - The classic email sending library for PHP
textract - extract text from any document. no muss. no fuss.