Python Scraping

Open-source Python projects categorized as Scraping

Top 23 Python Scraping Projects

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Is there a program available for bulk image reverse searching? | reddit.com/r/AskProgramming | 2023-02-02

    In the past I used stuff like beautifulsoup for webscraping but I’ve heard good things about https://scrapy.org/

  • requests-html

    Pythonic HTML Parsing for Humans™

    Project mention: 8 Most Popular Python HTML Web Scraping Packages with Benchmarks | dev.to | 2023-02-01

    requests-html

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: A Smart, Automatic, Fast and Lightweight Web Scraper for Python | reddit.com/r/webdev | 2022-12-02

  • undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / Datadome / CloudFlare IUAM)

    Project mention: Bot detection on google? | reddit.com/r/selenium | 2023-01-15

    Try using undetected-chromedriver; it should make your scraper harder to detect. If that doesn't work, you will need to use proxies.

  • Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

    Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!

  • Grab

    Web Scraping Framework

  • snoop

    Snoop: an open-source intelligence (OSINT) tool (OSINT world)

    Project mention: A tool that lists all accounts linked to an email address? | reddit.com/r/de_EDV | 2022-06-22

  • facebook-scraper

    Scrape Facebook public pages without an API key

    Project mention: Facebook Scraper to access public data | dev.to | 2023-01-19

    Facebook Scraper project github link: GitHub – kevinzg/facebook-scraper: Scrape Facebook public pages without an API key

  • parsel

    Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

    Project mention: 13 ways to scrape any public data from any website | dev.to | 2022-10-07

    variable.css(".X5PpBb::text").get()  # returns a text value
    variable.css(".gs_a").xpath("normalize-space()").get()  # https://github.com/scrapy/parsel/issues/192#issuecomment-1042301716
    variable.css(".gSGphe img::attr(srcset)").get()  # returns an attribute value
    variable.css(".I9Jtec::text").getall()  # returns a list of string values
    variable.xpath('th/text()').get()  # returns a text value using XPath
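    The XPath idea behind the last snippet can be tried without installing anything. The sketch below is only a stand-in: it does not use parsel, but the standard library's xml.etree.ElementTree, which supports a limited XPath subset and needs well-formed markup. The sample table and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree parses XML, not loose HTML).
SAMPLE = """
<table>
  <tr><th>Project</th><th>Stars</th></tr>
  <tr><td class="name">Scrapy</td><td class="stars">46,031</td></tr>
  <tr><td class="name">parsel</td><td class="stars">874</td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# Roughly what an XPath such as 'th/text()' expresses: the text of each <th> cell.
headers = [th.text for th in root.findall("./tr/th")]

# Attribute predicates are part of the XPath subset ElementTree understands.
names = [td.text for td in root.findall(".//td[@class='name']")]
stars = [int(td.text.replace(",", "")) for td in root.findall(".//td[@class='stars']")]

print(headers)
print(dict(zip(names, stars)))
```

    With real-world HTML, which is rarely well-formed, a dedicated selector library such as parsel is likely the more practical choice; the stand-in only shows the shape of the queries.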

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

    Project mention: Pre-trained Webscraping Models | reddit.com/r/webscraping | 2022-12-04

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

    Project mention: Testing fast installation in tear-down environment | reddit.com/r/learnpython | 2022-07-06

    I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura

  • Edu-Mail-Generator

    Generate Free Edu Mail(s) within minutes

  • gazpacho

    🥫 The simple, fast, and modern web scraping library

  • DataEngineeringProject

    Example end to end data engineering project.

    Project mention: What are your favourite GitHub repos that shows how data engineering should be done? | reddit.com/r/dataengineering | 2022-11-18

  • Scweet

    A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...

    Project mention: How do I scrape Twitter media along with text of all the accounts that I'm following without using API? | reddit.com/r/DataHoarder | 2023-02-03

  • loconotion

    📄 Python tool to turn Notion.so pages into lightweight, customizable static websites

    Project mention: Looking for site generator | reddit.com/r/Notion | 2022-07-14

    Most of services tried to take money from me for custom domains, sitemap, theme etc. Sorry, I don't have enough money for that.

    2. nextjs-notion-starter-kit
       This was the second for me.
       ① Cannot handle path well. It is OK to map your page to /one but it throws 404 error when path is /one/two or /one/two/three. Since I have to map URL like /category1/subcategory2/sub-subcategory3/sub-sub-subcategory4/page5, this is serious problem. Other than that, it seems quite nice.

    3. loconotion
       ① Does not sort pages in directories. I tested this without mapping pages to path, but it seems it doesn't sort pages in directories when processing Notion pages. I found python script that sort those pages, but it doesn't support Windows, the only OS I use for general use.
       ② Slow generation. If my site have few pages, it might be good choice, but my site has ~100 pages and a lot of pages will be added in near future. If it converts every pages whenever I launches it, it would take a lot of time.

    4. Fruition
       I think it is just converting Notion page URL into mapped URL. Kind of nice, but still slow as Notion's shared page is. But I think it is doing some 'magic'. I didn't know that CF Worker can do such job. I'm currently using this.

  • lookyloo

    Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.

  • comic-dl

    Comic-dl is a command line tool to download manga and comics from various comic and manga sites. Supported sites : readcomiconline.to, mangafox.me, comic naver and many more.

    Project mention: Working on a frontend for Comic-dl. Would there be any interest in this? | reddit.com/r/selfhosted | 2022-07-03

    Recently came back to comic-dl when I started setting up my homelab and selfhosting more. Wanted a way to automate/semi-automate my comic consumption.

  • spidermon

    Scrapy extension for monitoring spider execution.

    Project mention: Automated testing the scraping output | reddit.com/r/scrapy | 2022-02-20

    This is what Spidermon does.

  • Search Engine Parser

    Lightweight package to query popular search engines and scrape for result titles, links and descriptions

  • wikipedia_ql

    Query language for efficient data extraction from Wikipedia

  • dude

    dude (uncomplicated data extraction): A simple framework for writing web scrapers using Python decorators

    Project mention: Why do you use python for web scraping? | reddit.com/r/webscraping | 2022-10-11

    I also built a framework so I can easily switch between these libraries with less code change (still on hiatus for a few months before going back to it): https://github.com/roniemartinez/dude

  • scan-for-webcams

    scan for webcams on the internet

    Project mention: Scan For Webcams on the internet easily, with automatic checking of accessibility, text descriptions of the footage, timezone of the camera and more. | reddit.com/r/SideProject | 2022-05-17

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-02-03.

Index

What are some of the best open-source Scraping projects in Python? This list will help you:

Rank | Project | Stars
1 | Scrapy | 46,031
2 | requests-html | 12,923
3 | autoscraper | 4,905
4 | undetected-chromedriver | 3,848
5 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 2,820
6 | Grab | 2,259
7 | snoop | 1,735
8 | facebook-scraper | 1,554
9 | parsel | 874
10 | mlscraper | 818
11 | trafilatura | 741
12 | Edu-Mail-Generator | 725
13 | gazpacho | 694
14 | DataEngineeringProject | 684
15 | Scweet | 636
16 | loconotion | 619
17 | lookyloo | 522
18 | comic-dl | 476
19 | spidermon | 447
20 | Search Engine Parser | 371
21 | wikipedia_ql | 347
22 | dude | 330
23 | scan-for-webcams | 184