got-scraping
header-generator
| | got-scraping | header-generator |
|---|---|---|
| Mentions | 3 | 1 |
| Stars | 397 | 58 |
| Growth | 9.9% | - |
| Activity | 6.5 | 0.0 |
| Last commit | 25 days ago | over 1 year ago |
| Language | TypeScript | TypeScript |
| License | - | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
got-scraping
- How do I scrape external web pages and then insert them as records into a KB table?
You could do the scraping yourself by hosting your own ServiceNow MID Server, writing a bespoke scraping script on top of an existing library (for example, got-scraping), and then calling the scraper script via IntegrationHub and a Script Step.
- How to Crawl the Web with Scrapy
While I agree that Scrapy is a great tool for beginner tutorials and an easy entry into scraping, it's becoming difficult to use in real-world scenarios because almost all the large players now employ some anti-bot or anti-scraping protection.
A great example above all is Cloudflare. You simply can't convince Cloudflare you're a human with Scrapy alone. Scrapy has only experimental support for HTTP/2 and does not support proxies over HTTP/2 (https://github.com/scrapy/scrapy/issues/5213). Yet all browsers use HTTP/2 now, which means all normal users use HTTP/2... You get the point.
What we use now is Got Scraping (https://github.com/apify/got-scraping). It's a special-purpose extension of Got (an HTTP client with 18 million weekly downloads) that masks its HTTP communication as if it were coming from a real browser. Of course, this will not get you as far as Puppeteer or Playwright (headless browsers), but it improved our scraping tremendously. If you need a full crawling library, see the Apify SDK (https://sdk.apify.com), which uses Got Scraping under the hood.
- Show HN: Web scraping focused HTTP client for Node.js
header-generator
- Show HN: Web scraping focused HTTP client for Node.js
Hey everyone,
we built a special-purpose web scraping client for Node.js. When scraping with pure HTTP clients, you want to blend in with regular traffic as much as you can, which means your request signature needs to look like a browser's.
With got-scraping, we developed a special-purpose header generator (https://github.com/apify/header-generator) that uses a Bayesian network and real browser headers to make your headers indistinguishable from a real browser's.
We also override Node.js ciphers with browser ones and simplify the use of proxies. HTTP protocol versions are auto-detected for both the target website and the proxy, so you can have a proper HTTP/2 connection even through an HTTP(S) proxy.
It's always a work in progress, so we would be grateful for any comments or tips on how to make the requests even more stealthy!
Thanks!
What are some alternatives?
google-search-results-php - Google Search Results PHP API via Serp Api
Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
colly - Elegant Scraper and Crawler Framework for Golang
puppeteer - Node.js API for Chrome
rtila-releases
parsel - Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors