Python Scraping

Open-source Python projects categorized as Scraping

Top 23 Python Scraping Projects

  • GitHub repo Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Wanting to build a web scraper with no prior coding knowledge. Where do I start as fast as possible? | reddit.com/r/webscraping | 2021-10-23

    Check out https://scrapy.org/, a Python framework for web scraping, and then look at the https://youtube.com/c/JohnWatsonRooney channel to learn the syntax. Finally, go to https://www.zyte.com/scrapy-cloud/ to deploy your crawler to the cloud!

  • GitHub repo requests-html

    Pythonic HTML Parsing for Humans™

    Project mention: Requests html: Directly downloading pyppeteer chrome, not by script run | reddit.com/r/learnpython | 2021-08-18

    This issue is asking for the same thing. Seems like they've implemented a simple fix in this Pull Request. But it looks like it never made it to the Master branch. Maybe you can extend the class and make necessary changes if you know what you're doing, otherwise you're out of luck.

  • GitHub repo autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Turn Any Website Into An API with AutoScraper and FastAPI | dev.to | 2021-04-24

    In this article, we will learn how to create a simple e-commerce search API with multiple platform support: eBay and Amazon. AutoScraper and FastAPI provide the ability to create a powerful JSON API for the data. With Playwright's help, we'll extend our scraper and avoid blocking by using ScrapingAnt's web scraping API.

  • GitHub repo undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / Datadome / CloudFlare IUAM)

    Project mention: Is there a way to use the 'real' chrome instead of chromedriver? | reddit.com/r/selenium | 2021-10-08

  • GitHub repo facebook-scraper

    Scrape Facebook public pages without an API key

    Project mention: Legality of scraper code on Github | reddit.com/r/webscraping | 2021-10-19

    I see a lot of projects like this https://github.com/kevinzg/facebook-scraper, are they just YOLO-ing it, or is it perfectly in line with Github rules to have these types of things open and public?

  • GitHub repo parsel

    Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

    Project mention: How to Crawl the Web with Scrapy | news.ycombinator.com | 2021-09-13

  • GitHub repo gazpacho

    🥫 The simple, fast, and modern web scraping library

    Project mention: Ask HN: What are some tools / libraries you built yourself? | news.ycombinator.com | 2021-05-16

    I've been working on gazpacho [1] for the last two years.

    It's a general purpose web scraping library for Python that replaces BeautifulSoup + requests for most projects.

    Just surpassed ~2K downloads every week!

    [1] https://github.com/maxhumber/gazpacho

  • GitHub repo Edu-Mail-Generator

    Generate Free Edu Mail(s) within minutes

    Project mention: Ilpt Another Way Of Getting An Edu Mail | reddit.com/r/IllegalLifeProTips | 2021-02-17

    Had the same "waiting" issue. Found a similar script here https://github.com/AmmeySaini/Edu-Mail-Generator which worked out well. The fake email generating part isn't automated though.

  • GitHub repo lookyloo

    Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.

    Project mention: Lookyloo/lookyloo - Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other | reddit.com/r/bag_o_news | 2021-05-03

  • GitHub repo loconotion

    📄 Python tool to turn Notion.so pages into lightweight, customizable static websites

    Project mention: Build unlimited free-forever static sites without a single line of code | dev.to | 2021-04-04

    As of now, the best free way is using loconotion, which is an open-source Python tool written by Leonardo Cavaletti.

  • GitHub repo spidermon

    Scrapy Extension for monitoring spiders execution.

    Project mention: spidermon: Scrapy Extension for monitoring spiders execution | news.ycombinator.com | 2021-02-16

  • GitHub repo mlscraper

    🤖 Scrape data from HTML websites automatically with Machine Learning

    Project mention: mlscraper: Scrape data from HTML pages automatically with Machine Learning | news.ycombinator.com | 2021-07-05

  • GitHub repo trafilatura

    Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

    Project mention: What's something self hosted everyone needs to run ? | reddit.com/r/selfhosted | 2021-09-02

    The only RSS reader (not an aggregator like freshrss) with a parser included that I know of is RSSowlNix (which still works, but it's a bit old and not very lightweight). I wrote myself a little dirty workaround in Python to solve this problem (since I'm a noob in PHP). It reads the SQLite database of freshrss, adds a new column to every article to mark articles that are already parsed, and then I use trafilatura to give full-text parsing a try for every newly added article. To be on the safe side, I use it with a proxy. Not all pages work (some JavaScript GDPR popups are a nightmare), and you cannot jump over paywalls, but it works most of the time.

  • GitHub repo DataEngineeringProject

    Example end to end data engineering project.

    Project mention: Is it me or are beginner-friendly ETL pipeline guides that explain from the ground-up how to incorporate the use of various technologies notoriously difficult to find. | reddit.com/r/dataengineering | 2021-07-23

  • GitHub repo scan-for-webcams

    scan for webcams on the internet

    Project mention: JettChenT/scan-for-webcams - scan for webcams on the internet | reddit.com/r/GithubSecurityTools | 2021-08-09

  • GitHub repo languagepod101-scraper

    Python scraper for Language Pods such as Japanesepod101.com 👹 🗾 🍣 Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨

    Project mention: Has anyone been through all five levels/pathways of JapanesePod101.com? | reddit.com/r/LearnJapanese | 2021-01-07

    A python scraper to download everything: https://github.com/nedlir/languagepod101-scraper

  • GitHub repo blinkist-scraper

    📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

    Project mention: Please help me run this project | reddit.com/r/learnpython | 2021-02-24

  • GitHub repo arxiv-miner

    arxiv_miner is a toolkit for mining research papers on CS ArXiv.

    Project mention: ArXiv_miner: A toolkit for mining research papers on CS ArXiv | reddit.com/r/CKsTechNews | 2021-05-29

  • GitHub repo 4cat

    The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

    Project mention: What's a way to scrape all posts of a sub-reddit and make a dataset of it? | reddit.com/r/datasets | 2021-10-03

  • GitHub repo activesoup

    A headless pure-python browser for the web

    Project mention: Sunday Daily Thread: What's everyone working on this week? | reddit.com/r/Python | 2021-05-30

    I dusted off an old web-scraping project (activesoup) that I wrote a few years ago and don't really use much myself anymore, but I think it gets a little bit of usage by others, since every few months I see a new star on github, or a small issue or feature request is filed. This week, it was a small feature request (with a PR too - great!).

  • GitHub repo etf4u

    📊 Python tool to scrape real-time information about ETFs from the web and mix them together by proportionally distributing their asset allocations

    Project mention: Investing Advice: Are Stocks better than ETFs in Ireland? | reddit.com/r/irishpersonalfinance | 2021-06-10

  • GitHub repo nintendo-games-ratings

    Dataset and visualizations of Nintendo Games and ratings, scraped from metacritic.com

    Project mention: [OC] Critic to User Score Delta Analysis for Nintendo Games | reddit.com/r/dataisbeautiful | 2021-01-09

    Source: https://github.com/yaylinda/nintendo-games-ratings

  • GitHub repo fbspider

    Scraping Facebook information

    Project mention: #fbspider: Scraping información Facebook | reddit.com/r/u_esgeeks | 2021-02-12

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-10-23.
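
The workaround quoted in the trafilatura mention above (read the feed reader's SQLite database, add a column that marks already-parsed articles, then try full-text extraction for each new one) could look roughly like the sketch below. This is not the poster's script: the table name `entry`, the columns `id`, `link` and `content`, the marker column `fulltext_done`, and both function names are assumptions made for illustration, and the proxy setup mentioned in the post is left out. Only `trafilatura.fetch_url` and `trafilatura.extract` are calls from the library itself.

```python
import sqlite3
from typing import Callable, Optional


def mark_and_extract(conn: sqlite3.Connection,
                     fetch_text: Callable[[str], Optional[str]]) -> int:
    """Fill in full text for unparsed articles; return how many were updated.

    Assumes a hypothetical table entry(id, link, content).
    """
    # Add the marker column once; existing rows get the default value 0.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(entry)")}
    if "fulltext_done" not in columns:
        conn.execute(
            "ALTER TABLE entry ADD COLUMN fulltext_done INTEGER DEFAULT 0")

    pending = conn.execute(
        "SELECT id, link FROM entry WHERE fulltext_done = 0").fetchall()

    updated = 0
    for entry_id, link in pending:
        text = fetch_text(link)
        if text:
            # Store the extracted text and mark the article as parsed.
            conn.execute(
                "UPDATE entry SET content = ?, fulltext_done = 1 WHERE id = ?",
                (text, entry_id))
            updated += 1
        # Pages that yield nothing stay unmarked and are retried next run.
    conn.commit()
    return updated


def trafilatura_fetch(url: str) -> Optional[str]:
    """Download a page and return its main text, or None on failure."""
    import trafilatura  # third-party: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None
```

A run against a real database would be `mark_and_extract(sqlite3.connect("path/to/db.sqlite"), trafilatura_fetch)`, where the path is a placeholder. Passing the extractor in as a function keeps the database logic usable without network access.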

Index

What are some of the best open-source Scraping projects in Python? This list will help you:

Rank Project Stars
1 Scrapy 41,921
2 requests-html 12,166
3 autoscraper 3,987
4 undetected-chromedriver 1,029
5 facebook-scraper 847
6 parsel 709
7 gazpacho 599
8 Edu-Mail-Generator 486
9 lookyloo 452
10 loconotion 378
11 spidermon 352
12 mlscraper 331
13 trafilatura 287
14 DataEngineeringProject 266
15 scan-for-webcams 140
16 languagepod101-scraper 111
17 blinkist-scraper 93
18 arxiv-miner 85
19 4cat 60
20 activesoup 33
21 etf4u 21
22 nintendo-games-ratings 15
23 fbspider 15