Python web-scraping

Open-source Python projects categorized as web-scraping

Top 23 Python web-scraping Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

  • changedetection.io

    A free, open-source website change detection, website watcher, restock monitor and notification service. Designed for simplicity: monitor which websites had a text change. Also covers website defacement monitoring and price change notification.

    Project mention: Google have removed RSS support from their developer blogs | news.ycombinator.com | 2023-12-11

    I use ChangeDetection:

    - https://changedetection.io/#features

    - https://github.com/dgtlmoon/changedetection.io
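
    The idea of monitoring a page for a text change can be sketched in a few lines. This is a minimal illustration using only the standard library, not how changedetection.io is implemented; the function names `text_fingerprint` and `has_changed` are hypothetical, and fetching the page text is left out.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """Return a stable hash of the text, ignoring whitespace-only differences."""
    # Collapse runs of whitespace so reflowed markup does not count as a change.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """Compare a stored fingerprint with the fingerprint of freshly fetched text."""
    return previous_fingerprint != text_fingerprint(current_text)


# Hypothetical usage: store the fingerprint after one check, compare on the next.
stored = text_fingerprint("Price: 10 EUR")
print(has_changed(stored, "Price:   10\nEUR"))  # whitespace-only difference
print(has_changed(stored, "Price: 12 EUR"))     # actual text change
```

    Storing only a fingerprint keeps the state small, at the cost of not being able to show what changed; a real monitor would likely keep the previous text as well.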

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: What are the best tools for web scraping and analysis of natural language to populate a dataset? | /r/datasets | 2023-04-12

    See if something like autoscraper or mlscraper suits your needs.

  • Douyin_TikTok_Download_API

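    🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls as well as online batch parsing and downloading.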

    Project mention: TikTok video scraper | /r/webscraping | 2023-05-23

    I am currently working on a web scraper for TikTok. So far, I am able to retrieve data about the first 16 videos from a channel. I achieved this by making requests to an unofficial API (https://github.com/Evil0ctal/Douyin_TikTok_Download_API). My problem is that the requirements for this project do not allow me to use any package that extracts data from TikTok. How should I go about this task? I already tried getting the data from the HTML, but it is not sufficient, since most of it is not present in the response when I use requests.get(URL). Could you please recommend some repositories that could help, or some other way of extracting the data? Thank you!

  • snoop

    Snoop: a reconnaissance tool based on open data (OSINT world)

    Project mention: Osint update of the Snoop Project tool search for user by nickname | news.ycombinator.com | 2024-01-02

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

    Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

    The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

    Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
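
    To give a sense of the extra code the comment refers to, here is a rough sketch of just one item from that list, URL de-duplication, written with the standard library only. It is not trafilatura's implementation; `normalize_url` and `deduplicate` are hypothetical helpers, and the normalization rules are assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # hostname is lowercased by urlsplit; userinfo, if any, is dropped here.
    host = parts.hostname or ""
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so that ?a=1&b=2 and ?b=2&a=1 match.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment never reaches the server, so it is discarded.
    return urlunsplit((scheme, host, path, query, ""))


def deduplicate(urls):
    """Return the URLs in their original order, skipping normalized duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


print(normalize_url("HTTPS://Example.com:443/a?b=2&a=1#frag"))
# prints https://example.com/a?a=1&b=2
```

    Even this small piece needs decisions (trailing slashes, tracking parameters, percent-encoding) that the sketch ignores, which is the point the comment makes about building everything on top of BeautifulSoup.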

  • Grab

    Web Scraping Framework

  • curl_cffi

    Python binding for curl-impersonate via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.

    Project mention: Found a way to bypass Cloudflare 403 forbidden in cURL, fetch | /r/webscraping | 2023-07-02

    requests-like lib that uses curl-impersonate behind the scenes: https://github.com/yifeikong/curl_cffi

  • scrapy-fake-useragent

    Random User-Agent middleware based on fake-useragent

  • web-scraping

    Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

    Project mention: web-scraping: NEW Data - star count:554.0 | /r/algoprojects | 2023-09-25

  • google-search-results-python

    Google Search Results via SERP API pip Python Package

    Project mention: Make Direct Async Requests to SerpApi with Python | dev.to | 2023-05-24

    In this blog post we'll cover how to make direct requests to serpapi.com/search.json without using SerpApi's google-search-results Python client.

  • botasaurus

    The All in One Framework to build Awesome Scrapers.

    Project mention: Reached 100 ⭐ for my Bot Development Framework. | /r/webscraping | 2023-07-04

    Do check it out at https://github.com/omkarcloud/bose

  • dude

    dude (uncomplicated data extraction): a simple framework for writing web scrapers using Python decorators

    Project mention: Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next | /r/webscraping | 2023-05-08

    Please check https://github.com/roniemartinez/dude

  • wayback-machine-scraper

    A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

    Project mention: wayback-machine-scraper: NEW Data - star count:380.0 | /r/algoprojects | 2023-12-10

  • twitter-scraper-selenium

    Python package to scrape Twitter's front end easily

  • letterboxd_recommendations

    Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username

    Project mention: Self Hosted Content Recommender? | /r/selfhosted | 2023-06-20

    Use an existing self-hostable tool for getting recommendations from there, such as letterboxd_recommendations

  • facebook_page_scraper

    Scrapes the front end of Facebook pages with no limitations and provides a feature to turn the data into structured JSON or CSV

  • saveddit

    Bulk Downloader for Reddit

  • rymscraper

    Python library to extract data from rateyourmusic.com.

    Project mention: Local app for filtering music by genre (and more)! | /r/lastfm | 2023-05-27

    True. I use a scraper that someone built (https://github.com/dbeley/rymscraper) + rate limiting, it seems to work just fine without getting my IP banned, although it is necessarily slow. I would just run it overnight or so.
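
    The rate limiting mentioned in the comment can be approximated with a small helper that enforces a minimum delay between requests. This is a generic sketch, not part of rymscraper; the `RateLimiter` class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until min_interval has passed since the last call.

        Returns the number of seconds slept.
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Hypothetical usage: call wait() before every request in a scraping loop.
limiter = RateLimiter(min_interval=0.05)
for page in ["page-1", "page-2", "page-3"]:
    limiter.wait()
    # a real scraper would fetch `page` here
```

    A fixed interval is the simplest policy; as the comment notes, it makes a large scrape slow, so running it unattended (for example overnight) is a reasonable trade-off.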

  • stock_screener

    Picking stocks through various screening methods. Focus on Northern Europe.

  • GoodreadsScraper

    Scrape data from Goodreads using Scrapy and Selenium

  • htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line

  • web-poet

    Web scraping Page Objects core library

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-16.

Index

What are some of the best open-source web-scraping projects in Python? This list will help you:

#   Project                        Stars
1   Scrapy                         50,198
2   changedetection.io             14,294
3   autoscraper                    5,812
4   Douyin_TikTok_Download_API     5,457
5   snoop                          2,596
6   trafilatura                    2,539
7   Grab                           2,343
8   curl_cffi                      1,099
9   scrapy-fake-useragent          673
10  web-scraping                   617
11  google-search-results-python   499
12  botasaurus                     425
13  dude                           401
14  wayback-machine-scraper        393
15  twitter-scraper-selenium       255
16  letterboxd_recommendations     196
17  facebook_page_scraper          176
18  saveddit                       158
19  rymscraper                     138
20  stock_screener                 124
21  GoodreadsScraper               114
22  htmldate                       104
23  web-poet                       88