puppeteer-extra
node
Metric | puppeteer-extra | node
---|---|---
Mentions | 28 | 904
Stars | 5,970 | 102,694
Growth | - | 1.3%
Activity | 0.0 | 9.9
Last commit | 18 days ago | 7 days ago
Language | JavaScript | JavaScript
License | MIT License | GNU General Public License v3.0 or later
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
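As a rough illustration of the "month over month growth" figure described above, here is a minimal sketch. The exact formula the tracker uses is not stated, so a simple percentage change between two monthly snapshots is assumed; `monthOverMonthGrowth` is a hypothetical helper and the numbers are illustrative, not taken from the table.

```javascript
// Hypothetical helper: percentage change in star count between two monthly snapshots.
// Assumption: growth = (current - previous) / previous, expressed as a percentage.
function monthOverMonthGrowth(previousStars, currentStars) {
  if (!(previousStars > 0)) {
    throw new RangeError('previousStars must be a positive number');
  }
  return ((currentStars - previousStars) / previousStars) * 100;
}

// Illustrative numbers only.
const growth = monthOverMonthGrowth(1000, 1013);
console.log(`${growth.toFixed(1)}%`); // prints "1.3%"
```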
puppeteer-extra
-
How can I bypass 403 forbidden?
There is a good chance that the website is using Cloudflare to block web scrapers, which will require you to use a fortified headless browser to solve the JS challenges. Your options include the Puppeteer stealth plugin and Selenium undetected-chromedriver.
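For readers who want to see what enabling the Puppeteer stealth plugin looks like, the following is a minimal sketch rather than a guaranteed bypass. It assumes the `puppeteer`, `puppeteer-extra`, and `puppeteer-extra-plugin-stealth` packages are installed; `getStealthPuppeteer` and `fetchTitle` are hypothetical helper names introduced for illustration.

```javascript
// Assumed setup: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
// The packages are loaded lazily, on first use, and the plugin is registered only once.
let stealthPuppeteer = null;

function getStealthPuppeteer() {
  if (!stealthPuppeteer) {
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    puppeteer.use(StealthPlugin()); // register the stealth evasions
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

// Hypothetical helper: open a page with the stealth plugin enabled and return its title.
async function fetchTitle(url) {
  const browser = await getStealthPuppeteer().launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    // Always release the browser, even if navigation fails.
    await browser.close();
  }
}
```

A call such as `fetchTitle('https://example.com').then(console.log)` would exercise it. Whether a given site still returns 403 depends on the protection in place.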
-
New headless Chrome has been released and has a near-perfect browser fingerprint
There are even Puppeteer plugins that will do it for you. [1]
The best detection I've come across so far (i.e. before this release) has just required I run headless Chrome in headed mode. Granted, I don't do a ton of scraping -- mostly just pulling data out of websites so that I can play with it in aggregate using more civilized tools.
[1]: https://github.com/berstend/puppeteer-extra/tree/master/pack...
- Using selenium with proxy still hit bot detection
-
Is there an easy way to tell if a website will allow scrapers or not?
Fortified Headless Browser: Depending on the anti-bot protection the website is using, you may need to use a fortified headless browser that can solve its JS challenges without giving its identity away. Your options include the Puppeteer stealth plugin and Selenium undetected-chromedriver.
-
Is Selenium still a good choice?
That being said, if you're a beginner, Selenium is a much more mature package, so it has significantly more resources on StackOverflow and elsewhere, while Puppeteer has a bigger community around avoiding web scraper detection (plugins like puppeteer-extra-plugin-stealth).
-
Show HN: Browser extension that spoofs your location data to match your VPN
This extension takes an interesting approach to spoofing data, which is nice!
In my case, I’m interested in doing the same thing inside of Puppeteer for web scraping. Unfortunately, it seems like the only possible approach is similar to content scripts (for example https://github.com/berstend/puppeteer-extra/tree/master/pack...), which makes it easy to detect. Are there any similar approaches that can be used for Puppeteer?
-
Avoiding Bot Detection
> I'm a noob and using python with selenium to do some basic scraping on StockX
Basic scraping of a protected website like StockX, which uses PerimeterX, is not possible. It's all about reverse engineering, browser introspection, and fingerprints (from hardware to software canvas). You then still need tons of IPs to rotate and cool down, and finally the protection evolves over time, so you have to redo most of the work to pass again. A company like Scrapfly exists because it's more expensive to build and maintain such a solution internally; look at their public repositories on GitHub: low-level stuff, network spoofing stacks, packet manipulation, custom ANGLE libs. It takes a long time to learn compared with something like `asp=true` from their docs: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
If you have time and are more interested in this side, you could start by reading https://github.com/prescience-data/dark-knowledge and looking at the https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth project to see how it works. Do not expect the stealth project to help you bypass at scale: it's public, and anti-bot companies are aware of it and spot it easily. Most of the time they don't block directly; they use the bad fingerprint it generates to recognize bots, map the proxy IPs, collect them, and deduce the subnet or residential network.
> My main question is, would it be better to try and make my script act "more human"
It's a myth that anti-bot systems use or detect "human" behavior; this signal is not very important. Randomly moving the mouse or similar is fine. Having zero input events is suspect, but not that much in fact: touch systems do not trigger any events until you touch, so it can't be a strong signal because of false positives. "Behavioral detection" is a big lie in the industry; you can experiment by doing dumb things and it still passes, even at scale. When they say "machine learning", it's just basic stats, like a throttle does, but based on browser fingerprints rather than IPs. If you hit certain paths, like login, registration, and payment, they can use some very heavy systems with GPU canvas and the like, but these are not used for scraping yet.
> are other methods like switching drivers and using proxies the way to go?
Using proxies, yes, but not with wrong fingerprints (Chrome headless, a browser running on server hardware, a browser in Docker, and so on). In fact, there is no magic: switching drivers changes nothing, because they still drive a browser that has been spotted. Some are just more flexible than others at spoofing certain parts correctly, such as JS worker interception to inject scripts and hook correctly, but that's all.
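The "IPs to rotate and cool down" point can be sketched as a small round-robin pool with a per-proxy cooldown. This is only an illustration of the bookkeeping: `ProxyPool`, the proxy URLs, and the 30-second cooldown are all hypothetical. As the comment stresses, rotation alone does not help if the browser fingerprint is already flagged.

```javascript
// Hypothetical proxy pool: round-robin rotation with a per-proxy cooldown.
class ProxyPool {
  constructor(proxies, cooldownMs) {
    this.cooldownMs = cooldownMs;
    // Track when each proxy becomes usable again (0 = available immediately).
    this.entries = proxies.map((url) => ({ url, availableAt: 0 }));
    this.cursor = 0;
  }

  // Return the next proxy that is not cooling down, or null if all are resting.
  next(now = Date.now()) {
    for (let i = 0; i < this.entries.length; i++) {
      const index = (this.cursor + i) % this.entries.length;
      const entry = this.entries[index];
      if (entry.availableAt <= now) {
        this.cursor = (index + 1) % this.entries.length;
        entry.availableAt = now + this.cooldownMs; // start this proxy's cooldown
        return entry.url;
      }
    }
    return null;
  }
}

// Placeholder proxy addresses, for illustration only.
const pool = new ProxyPool(
  ['http://proxy-a.example:8080', 'http://proxy-b.example:8080'],
  30000
);
console.log(pool.next()); // first proxy in the list
```

Passing `now` explicitly keeps the pool easy to test. A caller that receives `null` would wait before retrying rather than reuse a resting proxy.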
-
How I met your...Scraper?
The page scraped for this post behaves in an "interesting" way: sometimes the reCAPTCHA is ignored, other times it appears right after submitting the login, so the scraper fails randomly. I opened an issue in puppeteer-extra, an npm library that extends Puppeteer and works hand in hand with 2captcha. I'm watching the issue closely; if a fix for the random failure lands, I'll edit the post.
-
Why mimicking a device is becoming almost impossible
The stealth plugin for Puppeteer Extra gives a pretty good idea of what you need to cover today. Maybe it's not rocket science, but it's not trivial either.
https://github.com/berstend/puppeteer-extra/tree/master/pack...
node
- Ask HN: Anyone looking for contributors for their open source projects
-
I Deployed My Own Cute Lil’ Private Internet (a.k.a. VPC)
Each app’s front end is built with Qwik and uses Tailwind for styling. The server-side is powered by Qwik City (Qwik’s official meta-framework) and runs on Node.js hosted on a shared Linode VPS. The apps also use PM2 for process management and Caddy as a reverse proxy and SSL provisioner. The data is stored in a PostgreSQL database that also runs on a shared Linode VPS. The apps interact with the database using Drizzle, an Object-Relational Mapper (ORM) for JavaScript. The entire infrastructure for both apps is managed with Terraform using the Terraform Linode provider, which was new to me, but made provisioning and destroying infrastructure really fast and easy (once I learned how it all worked).
-
You're Installing Node.js Wrong. That's OK, Here Is How To Fix It 🙌
I have always installed Node either from the installer provided by the Node.js website or via Brew on macOS. I have also used nvm in the past but did not know that there was a best practice to guide us.
-
How to use ApyHub to Build a Serverless Function in NodeJs?
Node.js v16+: To check your current Node.js version or install Node.js, visit the official Node.js website. Follow the installation instructions for your operating system to get the required version.
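A quick way to check the v16+ requirement from inside a script is to read `process.versions.node`. The sketch below assumes only the major version matters; `parseMajor` and `meetsRequirement` are hypothetical helper names.

```javascript
// Minimum major version taken from the requirement above (Node.js v16+).
const REQUIRED_MAJOR = 16;

// Parse the major component from a version string such as "18.17.1" or "v18.17.1".
function parseMajor(version) {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) {
    throw new Error(`Unrecognized version string: ${version}`);
  }
  return Number(match[1]);
}

// True when the given version's major number is at least the required one.
function meetsRequirement(version, requiredMajor = REQUIRED_MAJOR) {
  return parseMajor(version) >= requiredMajor;
}

const current = process.versions.node; // e.g. "18.17.1" (no leading "v")
console.log(`Running Node.js ${current}; v16+ satisfied: ${meetsRequirement(current)}`);
```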
-
The easiest way to add a license file to your project
Before you begin, you need to have Node.js and npm or yarn installed on your system. If you don't have them installed, you can download and install them from the official website: Node.js.
-
Hosting an Angular application in a Docker container on Amazon EC2 deployed by Amazon ECS
Node.js and npm: Node.js is a JavaScript code runtime software based on Google's V8 engine. npm is a package manager for Node.js (Node.js Package Manager). They will be used to build and run the Angular application and install the libraries.
-
I have created a small anti-depression script
Install Node.js (or Bun, or Deno, or whatever JS runtime you prefer) if it's not there
-
How to access Neon Postgres from AWS Lambda functions via serverless driver
Node.js installed on your PC. It comes with npm, which you will use to add Neon’s serverless driver to your project.
-
Release Radar • February 2024 Edition
The Electron framework lets you write cross-platform desktop applications using JavaScript, HTML and CSS. It is based on Node.js and Chromium and is used by Visual Studio Code and many other apps.
-
Angular 17 Upgrade Guide with SSR
If you need to upgrade your Node.js version, be sure to go to Nodejs.org and download the recommended version or use the NVM instead:
What are some alternatives?
puppeteer - Node.js API for Chrome
Svelte - Cybernetically enhanced web apps
widevine-l3-decryptor - A Chrome extension that demonstrates bypassing Widevine L3 DRM
dark-knowledge - 😈📚 A curated library of research papers and presentations for counter-detection and web privacy enthusiasts.
sharp-libvips - Packaging scripts to prebuild libvips and its dependencies - you're probably looking for https://github.com/lovell/sharp
source-map-resolve - [DEPRECATED] Resolve the source map and/or sources for a generated file.
nodejs.dev - A redesign of Nodejs.org built using Gatsby.js with React.js, TypeScript, and Remark.
fakebrowser - 🤖 Fake fingerprints to bypass anti-bot systems. Simulate mouse and keyboard operations to make behavior like a real person.
hashlips_art_engine - HashLips Art Engine is a tool used to create multiple different instances of artworks based on provided layers.
Hugo - The world’s fastest framework for building websites.
leakgirls-camsite-downloader - LeakGirls is a computer application that is capable of easily downloading videos from any cam site.
Nim - Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).