x-crawl vs billboard-json

| | x-crawl | billboard-json |
|---|---|---|
| Mentions | 8 | 4 |
| Stars | 1,293 | 30 |
| Growth | - | - |
| Activity | 9.3 | 9.4 |
| Last Commit | 7 days ago | 3 days ago |
| Language | TypeScript | TypeScript |
| License | MIT License | GNU General Public License v3.0 only |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
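The exact scoring formula is not published here; purely as an illustration of "recent commits have higher weight than older ones", the sketch below computes a hypothetical recency-weighted commit score with an assumed 30-day half-life. Every name and constant in it is an assumption for demonstration, not the site's actual metric.

```js
// Illustrative only: a recency-weighted commit score in the spirit of the
// "Activity" metric described above. The half-life and scaling are assumptions.
const HALF_LIFE_DAYS = 30

function activityScore(commitDates, now = new Date()) {
  // Each commit contributes a weight that decays exponentially with age,
  // so recent commits count more than older ones.
  return commitDates.reduce((sum, date) => {
    const ageDays = (now - date) / (1000 * 60 * 60 * 24)
    return sum + Math.pow(0.5, ageDays / HALF_LIFE_DAYS)
  }, 0)
}

// Example: two commits, one 4 days old and one 16 days old.
console.log(
  activityScore([new Date('2024-04-01'), new Date('2024-03-20')], new Date('2024-04-05'))
)
```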
x-crawl
- Flexible Node.js AI-assisted crawler library
- Traditional crawler or AI-assisted crawler? How to choose?

The crawler uses x-crawl. The crawled websites are real; to avoid disputes, https://www.example.com is used in their place.
- AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?
AI combined with Node.js x-crawl crawler

```js
import { createXCrawlOpenAI } from 'x-crawl'

const xCrawlOpenAIApp = createXCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

xCrawlOpenAIApp.help('What is x-crawl').then((res) => {
  console.log(res)
  /*
    res:
    x-crawl is a flexible Node.js AI-assisted web crawling library. It offers powerful
    AI-assisted features that make web crawling more efficient, intelligent, and convenient.
    You can find more information and the source code on x-crawl's GitHub page:
    https://github.com/coder-hxl/x-crawl.
  */
})

xCrawlOpenAIApp
  .help('Three major things to note about crawlers')
  .then((res) => {
    console.log(res)
    /*
      res:
      There are several important aspects to consider when working with crawlers:

      1. **Robots.txt:** It's important to respect the rules set in a website's robots.txt
         file. This file specifies which parts of a website can be crawled by search engines
         and other bots. Not following these rules can lead to your crawler being blocked or
         even legal issues.

      2. **Crawl Delay:** It's a good practice to implement a crawl delay between your
         requests to a website. This helps to reduce the load on the server and also shows
         respect for the server resources.

      3. **User-Agent:** Always set a descriptive User-Agent header for your crawler. This
         helps websites identify your crawler and allows them to contact you if there are any
         issues. Using a generic or misleading User-Agent can also lead to your crawler being
         blocked.

      By keeping these points in mind, you can ensure that your crawler operates efficiently
      and ethically.
    */
  })
```
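As a concrete illustration of points 2 and 3 in the answer above (crawl delay and a descriptive User-Agent), here is a minimal plain-Node.js sketch that is independent of x-crawl and uses only the built-in fetch from Node.js 18+. The URLs, delay value, and User-Agent string are placeholders, not values from either project.

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function politeCrawl(urls, delayMs) {
  for (const url of urls) {
    const res = await fetch(url, {
      headers: {
        // A descriptive User-Agent lets site owners identify the crawler and contact you.
        'User-Agent': 'my-example-crawler/1.0 (contact: you@example.com)'
      }
    })
    console.log(url, res.status)
    // Fixed pause between requests to reduce load on the server.
    await sleep(delayMs)
  }
}

// Placeholder targets and a 2-second delay between requests.
politeCrawl(['https://www.example.com/page/1', 'https://www.example.com/page/2'], 2000)
```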
- Recommend a flexible Node.js multi-functional crawler library —— x-crawl
If you also like x-crawl, you can give the x-crawl repository a star on GitHub to support it. Thank you for your support!
- A flexible nodejs crawler library —— x-crawl
If you like it, you can give the x-crawl repository a star to support it; your star will be the motivation for further updates.
billboard-json
What are some alternatives?
wranglebot - Decentralized MAM Platform
scraper - All In One API to easily scrape data from any website, without worrying about captchas and bot detection mechanisms.
prray - "Promisified" Array; it is compatible with the original Array but comes with async versions of native Array methods.
crawlee - Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
payload - The best way to build a modern backend + admin UI. No black magic, all TypeScript, and fully open-source, Payload is both an app framework and a headless CMS.
maestro-express-async-errors - Maestro is a layer of code that acts as a wrapper, without any dependencies, for async middlewares.