|15 days ago||6 days ago|
|MIT License||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
A simple solution to rotate proxies or how to spin up your own rotation proxy server with Puppeteer and only a few lines of JS code
1 project | reddit.com/r/webscraping | 5 Mar 2021
I'm currently implementing concurrency conditions at project/proxy/domain/session level in https://github.com/get-set-fetch/scraper . On each level you can define the maximum number of requests and the delay between two consecutive requests.
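The per-level limits described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the get-set-fetch implementation: `ConcurrencyLimit` and `can_start_request` are hypothetical names, and the caller is assumed to supply the current time (for example from `time.monotonic()`).

```python
from dataclasses import dataclass


@dataclass
class ConcurrencyLimit:
    """Hypothetical limit for one level (project, proxy, domain or session)."""
    max_requests: int                 # maximum number of requests in flight at once
    delay: float                      # minimum seconds between two consecutive request starts
    in_flight: int = 0                # requests started but not yet finished
    last_start: float = float("-inf")  # start time of the most recent request

    def can_start(self, now: float) -> bool:
        # Both conditions must hold: a free slot and enough time since the last start.
        return self.in_flight < self.max_requests and now - self.last_start >= self.delay

    def start(self, now: float) -> None:
        self.in_flight += 1
        self.last_start = now

    def finish(self) -> None:
        self.in_flight -= 1


def can_start_request(limits, now):
    """A request may start only if every level it belongs to allows it."""
    return all(limit.can_start(now) for limit in limits)
```

A scheduler built on this would call `can_start_request` with the limits that apply to a given URL, then `start` on each limit before fetching and `finish` on each once the response arrives.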
Web scraping content into postgresql? Scheduling web scrapers into a pipeline with airflow?
1 project | reddit.com/r/webscraping | 15 Feb 2021
If you're familiar with nodejs give https://github.com/get-set-fetch/scraper a try. Scraped content can be stored in sqlite, mysql or postgresql. It also supports puppeteer, playwright, cheerio or jsdom for the actual content extraction. No scheduler though.
Web Scraping 101 with Python
5 projects | news.ycombinator.com | 10 Feb 2021
I'm using this exact strategy to scrape content directly from DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over document API). Depending on the way a web page was fetched (opened in a browser tab or fetched via nodejs http/https requests), ExtractHtmlContentPlugin, ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.
What is your “I don't care if this succeeds” project?
42 projects | news.ycombinator.com | 1 Feb 2021
https://github.com/get-set-fetch/scraper - I've been working (intermittently :) ) on a nodejs or browser extension scraper for the last 3 years; see the other projects under the get-set-fetch umbrella. I've been putting in a lot more effort lately, as I really want to do those Alexa top 1 million analyses, like top js libraries, certificate authorities and so on. A few weeks back I posted it on Show HN, as you can already do basic/intermediate(?) scraping with it.
It is not capable of handling 1 mil+ pages, as it is still limited to puppeteer or playwright. Working on adding cheerio/jsdom support right now.
Show HN: Plugin Based, Batteries Included, Web Scraper
1 project | news.ycombinator.com | 19 Jan 2021
Scraping Product Info from an ecommerce website
1 project | reddit.com/r/webscraping | 8 Nov 2021
Web Scraping: Intercepting XHR Requests
2 projects | dev.to | 27 Oct 2021
Microsoft has released a Playwright-Python
2 projects | dev.to | 17 Oct 2021
GitHub link: https://github.com/microsoft/playwright-python Portal: https://playwright.dev/ Open source organization: Microsoft
need help with form to pdf
1 project | reddit.com/r/django | 7 Oct 2021
You could look into Playwright (python) - If you have a view that outputs the form data as you want, playwright can run a headless task to "print from the browser" for that particular view.
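A minimal sketch of that "print from the browser" approach, assuming the playwright package and its Chromium build are installed. The function names, the A4 page format and the example URL are illustrative assumptions, not part of the original answer. PDF generation in Playwright is only supported in headless Chromium.

```python
def build_pdf_kwargs(output_path, page_format="A4"):
    """Collect the keyword arguments passed to page.pdf() (hypothetical helper)."""
    return {"path": output_path, "format": page_format, "print_background": True}


def render_view_to_pdf(url, output_path):
    """Open the view in headless Chromium and save it as a PDF."""
    # Imported lazily so the module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        # Wait until the network is idle so the form data has rendered.
        page.goto(url, wait_until="networkidle")
        page.pdf(**build_pdf_kwargs(output_path))
        browser.close()
```

For example, `render_view_to_pdf("http://localhost:8000/forms/1/print/", "form.pdf")` would save that view as `form.pdf`; the URL here is a made-up Django route.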
Tried scraping this table without success
1 project | reddit.com/r/webscraping | 4 Oct 2021
Blocking Resources with Playwright
1 project | dev.to | 29 Sep 2021
Playwright "is a Python library to automate Chromium, Firefox, and WebKit browsers with a single API." It allows us to browse the Internet with a headless browser programmatically.
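As a rough sketch of what blocking resources can look like with the synchronous Playwright API: every request is routed through a handler, which aborts the ones whose resource type is in a block list. The block list, the function names and the choice of Chromium are assumptions made for illustration, not taken from the linked post.

```python
# Resource types to skip; this particular set is an assumption for the example.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type):
    """Return True if requests of this resource type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def handle_route(route):
    """Route handler: abort blocked resource types, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def fetch_title(url):
    """Load a page with blocking enabled and return its title."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        page.route("**/*", handle_route)  # intercept every request
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

Skipping images, media and fonts usually reduces bandwidth and load time when only the HTML content is needed.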
Using 🎭Playwright in Ruby/Rails
7 projects | dev.to | 10 Aug 2021
Using 🎭Playwright in python:alpine
2 projects | dev.to | 10 Aug 2021
playwright-python is a really awesome library for browser automation.
Saturday Daily Thread: Resource Request and Sharing!
2 projects | reddit.com/r/Python | 26 Jun 2021
You might try Playwright for Python. It's a browser automation tool that supports interactive websites. I haven't tested it yet, so I cannot vouch for its speed, but it is being built by some of the people who built Puppeteer, which is also a super solid tool for this sort of thing.
Is there anything for Python that compares to Puppeteer Stealth Plugin for Node?
3 projects | reddit.com/r/webscraping | 10 Apr 2021
Pyppeteer is okay: it has a lot of bugs and is not very active, but it is open source and works in most cases. You can take a look at https://github.com/microsoft/playwright-python, which has the advantage of not being locked to Chrome/Chromium.
What are some alternatives?
Playwright - Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
playwright-dotnet - .NET version of the Playwright testing and automation library.
playwright-java - Java version of the Playwright testing and automation library
playwright-python-remote - Enables us to use playwright-python on pure-Python environment
vopono - Run applications through VPN tunnels with temporary network namespaces
playwright-ruby-client - Playwright client for Ruby
capybara-playwright-driver - Playwright driver for Capybara
Capybara - Acceptance test framework for web applications
Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.