Python Scraping

Open-source Python projects categorized as Scraping

Top 23 Python Scraping Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Legalität von Web scraping (Legality of web scraping) | reddit.com/r/de_EDV | 2022-01-22
  • requests-html

    Pythonic HTML Parsing for Humans™

    Project mention: How to make all https traffic in program go through a specific proxy? | reddit.com/r/learnpython | 2021-12-24
  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Turn Any Website Into An API with AutoScraper and FastAPI | dev.to | 2021-04-24

    In this article, we will learn how to create a simple e-commerce search API with multiple platform support: eBay and Amazon. AutoScraper and FastAPI provide the ability to create a powerful JSON API for the data. With Playwright's help, we'll extend our scraper and avoid blocking by using ScrapingAnt's web scraping API.

  • undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / Datadome / CloudFlare IUAM)

    Project mention: attributeerror: module 'undetected_chromedriver' has no attribute 'install' | reddit.com/r/learnpython | 2021-12-28
  • snoop

    Snoop: an open-source intelligence gathering tool (OSINT world) (by snooppr)

    Project mention: FOSS News International #2: November 8-14, 2021 | reddit.com/r/fossnews | 2021-11-15

    Snoop 1.3.1

  • facebook-scraper

    Scrape Facebook public pages without an API key

    Project mention: Reddit, Twitter and Instagram downloader. Grand update | reddit.com/r/DataHoarder | 2021-12-26

    I have been using kevinzg/facebook-scraper for some time, but (1) it is just a library and would require some programming work; and (2) you WILL walk into Facebook's rate limit, and nothing is downloaded for older posts (download order is from newest to oldest).

  • parsel

    Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

    Project mention: How to Crawl the Web with Scrapy | news.ycombinator.com | 2021-09-13
  • gazpacho

    🥫 The simple, fast, and modern web scraping library

    Project mention: Ask HN: What are some tools / libraries you built yourself? | news.ycombinator.com | 2021-05-16

    I've been working on gazpacho [1] for the last two years.

    It's a general purpose web scraping library for Python that replaces BeautifulSoup + requests for most projects.

    Just surpassed ~2K downloads every week!

    [1] https://github.com/maxhumber/gazpacho

  • Edu-Mail-Generator

    Generate Free Edu Mail(s) within minutes

    Project mention: Ilpt Another Way Of Getting An Edu Mail | reddit.com/r/IllegalLifeProTips | 2021-02-17

    Had the same "waiting" issue. Found a similar script here https://github.com/AmmeySaini/Edu-Mail-Generator which worked out well. The fake email generating part isn't automated though.

  • lookyloo

    Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.

    Project mention: Lookyloo/lookyloo - Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other | reddit.com/r/bag_o_news | 2021-05-03
  • loconotion

    📄 Python tool to turn Notion.so pages into lightweight, customizable static websites

    Project mention: Build unlimited free-forever static sites without a single line of code | dev.to | 2021-04-04

    As of now, the best free way is using loconotion, which is an open-source Python tool written by Leonardo Cavaletti.

  • spidermon

    Scrapy Extension for monitoring spiders execution.

    Project mention: spidermon: Scrapy Extension for monitoring spiders execution | news.ycombinator.com | 2021-02-16
  • trafilatura

    Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

    Project mention: Most universal way to grab relevant webpage content | reddit.com/r/webscraping | 2021-11-30

    Check out this Python package: https://github.com/adbar/trafilatura. It seems to do what you are looking for. You may need to use an automated browser like Selenium to load the page content first if you want it to work on all sites, even those that use JavaScript.

  • DataEngineeringProject

    Example end to end data engineering project.

    Project mention: Is it me or are beginner-friendly ETL pipeline guides that explain from the ground-up how to incorporate the use of various technologies notoriously difficult to find. | reddit.com/r/dataengineering | 2021-07-23
  • mlscraper

    🤖 Scrape data from HTML websites automatically with Machine Learning

    Project mention: mlscraper: Scrape data from HTML pages automatically with Machine Learning | news.ycombinator.com | 2021-07-05
  • wikipedia_ql

    Query language for efficient data extraction from Wikipedia

    Project mention: WikipediaQL: Query language for efficient data extraction from Wikipedia (early | news.ycombinator.com | 2021-07-05
  • Search Engine Parser

    Lightweight package to query popular search engines and scrape for result titles, links and descriptions

  • Scweet

    A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...

    Project mention: Scraping the entirety of a private Twitter account. | reddit.com/r/DataHoarder | 2021-11-05

    Learn some Python and check this out: https://github.com/Altimis/Scweet

  • scan-for-webcams

    scan for webcams on the internet

    Project mention: JettChenT/scan-for-webcams - scan for webcams on the internet | reddit.com/r/GithubSecurityTools | 2021-08-09
  • languagepod101-scraper

    Python scraper for Language Pods such as Japanesepod101.com. Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨

  • blinkist-scraper

    📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

    Project mention: Has anyone used the Blinkist scraper | reddit.com/r/webscraping | 2021-12-12
  • arxiv-miner

    arxiv_miner is a toolkit for mining research papers on CS ArXiv.

    Project mention: ArXiv_miner: A toolkit for mining research papers on CS ArXiv | reddit.com/r/CKsTechNews | 2021-05-29
  • 4cat

    The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

    Project mention: What's a way to scrape all posts of a sub-reddit and make a dataset of it? | reddit.com/r/datasets | 2021-10-03
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-22.
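
Several entries above (parsel, for example) describe extracting data from XML/HTML documents with XPath or CSS selectors. As a rough illustration of what selector-based extraction looks like, here is a minimal sketch that uses only the Python standard library (`xml.etree.ElementTree` and its limited XPath subset) rather than any of the listed projects. The sample markup, the `extract_links` helper, and the example.com URLs are all hypothetical, and the approach assumes well-formed markup, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any listed project).
SAMPLE = """
<html>
  <body>
    <ul id="projects">
      <li class="project"><a href="https://example.com/a">alpha</a></li>
      <li class="project"><a href="https://example.com/b">beta</a></li>
      <li class="ad"><a href="https://example.com/ad">sponsored</a></li>
    </ul>
  </body>
</html>
"""


def extract_links(markup):
    """Return (link text, href) pairs for every <li class="project"> item."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a small XPath subset: descendant search (.//),
    # attribute predicates ([@class='project']) and child steps (/a).
    anchors = root.findall(".//li[@class='project']/a")
    return [(a.text, a.get("href")) for a in anchors]


if __name__ == "__main__":
    for name, url in extract_links(SAMPLE):
        print(name, url)
```

The dedicated libraries in this list generally go further than this sketch, for instance by tolerating malformed HTML and supporting CSS selectors, so treat it only as a picture of the idea.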

Index

What are some of the best open-source Scraping projects in Python? This list will help you:

 #  Project                   Stars
 1  Scrapy                   42,622
 2  requests-html            12,324
 3  autoscraper               4,196
 4  undetected-chromedriver   1,447
 5  snoop                     1,261
 6  facebook-scraper            993
 7  parsel                      737
 8  gazpacho                    614
 9  Edu-Mail-Generator          549
10  lookyloo                    477
11  loconotion                  423
12  spidermon                   385
13  trafilatura                 368
14  DataEngineeringProject      362
15  mlscraper                   360
16  wikipedia_ql                328
17  Search Engine Parser        304
18  Scweet                      289
19  scan-for-webcams            146
20  languagepod101-scraper      113
21  blinkist-scraper            105
22  arxiv-miner                  87
23  4cat                         74