Crawling

Open-source projects categorized as Crawling

Top 23 Crawling Open-Source Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

  • Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
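
Most of the frameworks on this list automate the same basic loop that Scrapy popularized: fetch a page, extract its links, and enqueue the ones you have not seen yet. A minimal, stdlib-only sketch of that loop (the function and class names here are illustrative, not Scrapy's API):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl. `fetch` is any callable mapping url -> html,
    so it can wrap urllib, requests, or a test stub."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Real frameworks add the hard parts on top of this loop: scheduling, politeness delays, retries, deduplication at scale, and robust HTML handling.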
  • colly

    Elegant Scraper and Crawler Framework for Golang

  • Project mention: Scraping the full snippet from Google search result | dev.to | 2024-01-01

    SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use the GoColly package.

  • newspaper

    newspaper3k is a news, full-text, and article-metadata extraction library for Python 3.
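
As a rough illustration of the kind of extraction newspaper3k automates (the real library goes much further: authors, publish dates, full article text), here is a toy title-and-description extractor using only the stdlib:

```python
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    """Pull the <title> text and the description <meta> tag from an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "description": parser.description}
```

With newspaper3k itself, the usual pattern per its docs is `article = Article(url)`, then `article.download()` and `article.parse()`, after which fields like `article.title` are populated.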

  • crawlee

    Crawlee: a web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

  • Project mention: How to scrape Amazon products | dev.to | 2024-04-01

    In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

  • awesome-web-scraping

    List of libraries, tools and APIs for web scraping and data processing.

  • Ferret

    Declarative web scraping

  • rod

    A Devtools driver for web automation and scraping

  • Project mention: Need help authenticating to Okta programmatically. | /r/okta | 2023-07-03

    I have tried the following. 1. Log in to Okta via browser programmatically using go-rod, which I managed to do successfully, but I'm failing to load up Slack as it's stuck in the browser loader screen for Slack. 2. I tried to authenticate via the Okta RESTful API. So far, I have managed to authenticate using {{domain}}/api/v1/authn, and then subsequently complete MFA via the verify endpoint {{domain}}/api/v1/authn/factors/{{factorID}}/verify, which returns a sessionToken. From here, I can successfully create a sessionCookie, which has proven quite useless to me. Perhaps I am doing it wrong.

  • hakrawler

    Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

  • PuppeteerSharp

    Headless Chrome .NET API

  • Project mention: What do .NET devs use for web scraping these days? | /r/dotnet | 2023-06-13

    PuppeteerSharp

  • Apache Nutch

    Apache Nutch is an extensible and scalable web crawler

  • Grab

    Web Scraping Framework

  • awesome-puppeteer

    A curated list of awesome puppeteer resources.

  • cariddi

    Take a list of domains, crawl URLs, and scan for endpoints, secrets, API keys, file extensions, tokens, and more

  • core

    The complete web scraping toolkit for PHP. (by roach-php)

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

  • botasaurus

    The All in One Framework to build Awesome Scrapers.

  • Project mention: This Week In Python | dev.to | 2024-04-05

    botasaurus – The All in One Framework to build Awesome Scrapers

  • Crawly

    Crawly, a high-level web crawling & scraping framework for Elixir.

  • Project mention: Crawly – Elixir web scraping framework | news.ycombinator.com | 2023-08-26
  • scrapyrt

    HTTP API for Scrapy spiders
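
Assuming scrapyrt's documented defaults (an HTTP server on port 9080 with a `/crawl.json` endpoint taking `spider_name` and `url` parameters), a client request can be sketched as:

```python
from urllib.parse import urlencode

def crawl_request_url(spider_name, target_url, host="localhost", port=9080):
    """Build the GET URL that asks a running scrapyrt instance to execute
    a spider against one target page and return the scraped items as JSON."""
    query = urlencode({"spider_name": spider_name, "url": target_url})
    return f"http://{host}:{port}/crawl.json?{query}"

# Sending it requires a running scrapyrt server, so it is only sketched here:
#   import json, urllib.request
#   response = urllib.request.urlopen(crawl_request_url("myspider", "https://example.com"))
#   items = json.load(response)["items"]
```

`myspider` above is a placeholder; it must match the name of a spider in the Scrapy project that scrapyrt was started in.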

  • Dataflow kit

    Extract structured data from websites.

  • isp-data-pollution

    ISP Data Pollution to Protect Private Browsing History with Obfuscation

  • spidermon

    Scrapy Extension for monitoring spiders execution.

  • WarcDB

    WarcDB: Web crawl data as SQLite databases.

  • Project mention: An Introduction to the WARC File | news.ycombinator.com | 2024-01-29
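
The appeal of WarcDB is that crawl data becomes queryable with ordinary SQL. A sketch using Python's built-in sqlite3 (the table name and columns below are invented for illustration; inspect a real WarcDB file's schema before querying):

```python
import sqlite3

# A real WarcDB file would be opened with sqlite3.connect("crawl.warcdb");
# an in-memory database with a mock schema stands in for it here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE response (url TEXT, status INTEGER, payload BLOB)")
con.executemany(
    "INSERT INTO response VALUES (?, ?, ?)",
    [("http://example.com/", 200, b"<html>ok</html>"),
     ("http://example.com/missing", 404, b"")],
)

# Ordinary SQL then works over the crawl data:
ok_urls = [row[0] for row in con.execute("SELECT url FROM response WHERE status = 200")]
print(ok_urls)  # ['http://example.com/']
```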
  • LinkedInDumper

    Python 3 script to dump/scrape/extract company employees via the LinkedIn API

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Crawling projects? This list will help you:

Project Stars
1 Scrapy 50,824
2 colly 22,120
3 newspaper 13,703
4 crawlee 12,044
5 awesome-web-scraping 6,299
6 Ferret 5,616
7 rod 4,750
8 hakrawler 4,225
9 PuppeteerSharp 3,149
10 Apache Nutch 2,807
11 Grab 2,353
12 awesome-puppeteer 2,318
13 cariddi 1,351
14 core 1,319
15 mlscraper 1,219
16 botasaurus 870
17 Crawly 839
18 scrapyrt 816
19 Dataflow kit 636
20 isp-data-pollution 566
21 spidermon 510
22 WarcDB 383
23 LinkedInDumper 362
