Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Secutils-web-scraper Alternatives
Similar projects and alternatives to secutils-web-scraper based on common topics and language
-
Playwright
Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
-
SurveyJS
Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App. With SurveyJS form UI libraries, you can build and style forms in a fully-integrated drag & drop form builder, render them in your JS app, and store form submission data in any backend, inc. PHP, ASP.NET Core, and Node.js.
-
secutils
Secutils.dev is an open-source, versatile, yet simple security toolbox for engineers and researchers (by secutils-dev)
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
secutils-web-scraper reviews and mentions
-
How to track anything on the internet or use Playwright for fun and profit
To begin, all functionality related to browser automation and web scraping lives in a dedicated service — Web Scraper. The primary rationale is that dealing with browsers and arbitrary user scripts is tricky from a security standpoint, and it's always a good idea to isolate such functionality as much as possible. You can read more about the security aspects of web scraping in the "Running web scraping service securely" post.
-
Running web scraping service securely
When it comes to web page resource scraping, Secutils.dev relies on a separate component - secutils-dev/secutils-web-scraper. I've built it on top of Playwright since I need to handle both resources that are statically defined in the HTML and those that are loaded dynamically. Leveraging Playwright, backed by a real browser, instead of parsing the static HTML opens up a ton of opportunities to turn a simple web resource scraper into a much more intelligent tool capable of handling all sorts of use cases: recording and replaying HARs, imitating user activity, and more.
-
Detecting changes in JavaScript and CSS isn't an easy task, Part 1
While both Puppeteer and Playwright have their own advantages and disadvantages, I have chosen Playwright for Secutils.dev. Playwright not only allows us to access all browser APIs within the web page context to easily detect and extract inline resources, but also enables us to intercept all external dynamically loaded web page resources. Here's an example of the code (full code can be found here):
-
A note from our sponsor - InfluxDB
www.influxdata.com | 5 May 2024
Stats
secutils-dev/secutils-web-scraper is an open source project licensed under GNU Affero General Public License v3.0 which is an OSI approved license.
The primary programming language of secutils-web-scraper is TypeScript.
Sponsored