Web Crawling

Open-source projects categorized as Web Crawling

Top 23 Web Crawling Open-Source Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

  • pyspider

    A powerful spider (web crawler) system in Python.

  • requests-html

    Pythonic HTML Parsing for Humans™

  • crawlee

    Crawlee: a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.

    Project mention: Automating Data Collection with Apify: From Script to Deployment | dev.to | 2024-03-17

    Previously, the Apify SDK offered a blend of crawling functionalities and Actor building features. However, a recent update separated these functionalities into two distinct libraries: Crawlee and Apify SDK v3. Crawlee now houses the web scraping and crawling tools, while Apify SDK v3 focuses solely on features specific to building Actors for the Apify platform. This distinction allows for a clear separation of concerns and enhances the development experience for various use cases.

  • webmagic

    A scalable web crawler framework for Java.

  • jsoup

    jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

    Project mention: FLaNK Stack Weekly for 20 June 2023 | dev.to | 2023-06-20

  • portia

    Visual scraping for Scrapy

  • MechanicalSoup

    A Python library for automating interaction with websites.

    Project mention: How to scrape a website with Python (Beginner tutorial) | dev.to | 2024-02-22

    MechanicalSoup is a Python library for web scraping that combines the simplicity of Requests with the convenience of BeautifulSoup. It's particularly useful for interacting with web forms, like login pages. Here's a basic example to illustrate how you can use MechanicalSoup for web scraping:
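
    The tutorial's own example is not reproduced in this listing. As a stand-in, here is a minimal sketch of the form-filling workflow described above. The target URL (httpbin.org's demo form), the `custname` field, the user agent string, and the helper name `submit_pizza_order` are assumptions chosen for illustration, not taken from the mentioned post.

```python
def submit_pizza_order(browser, form_url, customer_name):
    """Open a page, fill in one form field, submit it, and return the response.

    `browser` is expected to behave like mechanicalsoup.StatefulBrowser.
    """
    browser.open(form_url)                        # fetch and parse the page
    browser.select_form('form[action="/post"]')   # choose the form via CSS selector
    browser["custname"] = customer_name           # set the input named "custname"
    return browser.submit_selected()              # submit the form, get the response


def main():
    # Third-party dependency (assumed installed): pip install MechanicalSoup
    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser(user_agent="example-bot/0.1")
    # Assumed demo endpoint; replace with the form you actually need to submit.
    response = submit_pizza_order(browser, "https://httpbin.org/forms/post", "Ada")
    print(response.status_code)
```

    Calling `main()` performs a real network request. Passing the browser in as an argument keeps the form logic separate from the network setup, so it can be exercised with a stand-in object.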

  • Crawler4j

    Open Source Web Crawler for Java

  • Mechanize

    Mechanize is a Ruby library that makes automated web interaction easy.

  • RoboBrowser

  • Apache Nutch

    Apache Nutch is an extensible and scalable web crawler

  • Grab

    Web Scraping Framework

  • gain

    Web crawling framework based on asyncio.

  • Scrapely

    A pure-Python HTML screen-scraping library

  • feedparser

    Parse feeds in Python

    Project mention: RSS can be used to distribute all sorts of information | news.ycombinator.com | 2023-11-20

    There is JSON Feed¹ already. One of the spec writers is behind micro.blog, which is the first place I saw it (and also one of the few places I've seen it). I don't think it is a bad idea, and it doesn't take all that long to implement it.

    I have long hoped it would pick up with the JSON-ify everything crowd, just so I'd never see a non-Atom feed again. We perhaps wouldn't need so much of the magic that is wrapped up in packages like feedparser² to deal with all the brokenness of RSS in the wild then.

    ¹ https://www.jsonfeed.org/

    ² https://github.com/kurtmckee/feedparser
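
    To illustrate the commenter's point that JSON Feed takes little effort to implement, here is a rough sketch of a reader that uses only the standard library. The sample document, the `parse_json_feed` name, and the shape of the returned entries are assumptions for illustration; field names follow the published JSON Feed layout (`version`, `title`, `items`, and per-item `id`, `url`, `title`, `content_text`, `date_published`).

```python
import json

# Hypothetical sample document laid out like a JSON Feed 1.1 file.
SAMPLE_FEED = """
{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Example Blog",
  "items": [
    {"id": "2", "url": "https://example.org/second", "title": "Second post",
     "date_published": "2023-11-20T09:00:00Z"},
    {"id": "1", "url": "https://example.org/first", "content_text": "Untitled note"}
  ]
}
"""


def parse_json_feed(text):
    """Return (feed_title, entries) extracted from a JSON Feed document."""
    feed = json.loads(text)
    version = str(feed.get("version", ""))
    if not version.startswith("https://jsonfeed.org/version/"):
        raise ValueError("not a JSON Feed document")

    entries = []
    for item in feed.get("items", []):
        entries.append({
            "id": item["id"],  # required on every item
            "url": item.get("url"),
            # Item titles are optional, so fall back to the plain-text content.
            "title": item.get("title") or item.get("content_text", ""),
            "published": item.get("date_published"),
        })
    return feed.get("title", ""), entries
```

    A call such as `title, entries = parse_json_feed(SAMPLE_FEED)` yields the feed title and a list of plain dictionaries. Real-world RSS and Atom feeds need far more defensive handling, which is the work a package like feedparser takes on.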

  • PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  • anemone

    Anemone web-spider framework

  • Upton

    A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)

  • cola

    A high-level distributed crawling framework.

  • FastImage

    FastImage finds the size or type of an image given its uri by fetching as little as needed

  • Wombat

    Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

  • MetaInspector

    Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-17.

Index

What are some of the best open-source Web Crawling projects? This list will help you:

Rank Project Stars
1 Scrapy 50,603
2 pyspider 16,273
3 requests-html 13,553
4 crawlee 11,796
5 webmagic 11,203
6 jsoup 10,565
7 portia 9,142
8 MechanicalSoup 4,533
9 Crawler4j 4,469
10 Mechanize 4,349
11 RoboBrowser 3,689
12 Apache Nutch 2,792
13 Grab 2,350
14 gain 2,031
15 Scrapely 1,845
16 feedparser 1,813
17 PSpider 1,811
18 anemone 1,617
19 Upton 1,614
20 cola 1,485
21 FastImage 1,351
22 Wombat 1,301
23 MetaInspector 1,025