Python web-scraping

Open-source Python projects categorized as web-scraping

Top 23 Python web-scraping Projects

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Scrapping - How to deal with page changes Ai | reddit.com/r/webscraping | 2022-03-25

    It depends on the website, but autoscraper was used to calculate similar nodes given the text to search. Not sure how it works now but it's open source.
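
    The "similar nodes given the text" idea mentioned above can be illustrated with a small standard-library sketch. This is not autoscraper's actual algorithm or API; the names `find_similar` and `_NodeCollector` are hypothetical. It assumes that "similar" means "shares the same chain of tags and class attributes as the element containing the sample text".

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _NodeCollector(HTMLParser):
    """Record a (signature, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as (tag, class) pairs
        self.nodes = []   # collected (signature, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed ones.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.nodes.append((tuple(self._stack), text))


def find_similar(html, sample_text):
    """Return the text of every node whose signature matches the sample's."""
    collector = _NodeCollector()
    collector.feed(html)
    collector.close()
    wanted = {sig for sig, text in collector.nodes if text == sample_text}
    return [text for sig, text in collector.nodes if sig in wanted]
```

    Given one known value from a page, `find_similar(html, "Apple")` would return the text of every sibling-like element, which is roughly the behaviour the comment describes.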

  • selenium-python-helium

    Selenium-python but lighter: Helium is the best Python library for web automation.

    Project mention: Automating some process with PyAutoGui? | reddit.com/r/automation | 2022-07-21

    You can, though it might not be the best tool for this. Automation of web entry is better done with selenium, or my favorite variation helium.

  • Grab

    Web Scraping Framework

  • snoop

    Snoop: a reconnaissance tool based on open-source data (OSINT world)

    Project mention: Tool that lists all accounts linked to an e-mail address? | reddit.com/r/de_EDV | 2022-06-22

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

    Project mention: Testing fast installation in tear-down environment | reddit.com/r/learnpython | 2022-07-06

    I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura

  • scrapy-fake-useragent

    Random User-Agent middleware based on fake-useragent

    Project mention: Looking for suggestions for a web scraper | reddit.com/r/learnpython | 2022-09-01

    User-Agents: Your user-agent list is pretty small, and you aren't adding the other headers that real browsers typically have. For a bigger list of user-agents you could use the scrapy-fake-user-agent middleware.
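
    The advice above (a larger user-agent pool plus the other headers a real browser sends) might look like the following minimal sketch. It uses only the standard library rather than the scrapy-fake-useragent middleware; the `USER_AGENTS` values and the `build_headers` helper are illustrative assumptions, not part of any project listed here.

```python
import random

# Illustrative values only; a real pool would be larger and kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


def build_headers(rng=random):
    """Return browser-like request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
    }
```

    The resulting dictionary can be passed to any HTTP client that accepts a headers mapping, for example `urllib.request.Request(url, headers=build_headers())`.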

  • web-scraping

    Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

    Project mention: web-scraping: NEW Data - star count:324.0 | reddit.com/r/algoprojects | 2022-08-13

  • wayback-machine-scraper

    A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

    Project mention: Anyone have a simple useful guide so I can get this scraper working . | reddit.com/r/github | 2022-11-06

  • saveddit

    Bulk Downloader for Reddit

  • sillynium

    Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements

    Project mention: Can anyone share some cool projects done with Python? | reddit.com/r/Python | 2022-02-13

  • letterboxd_recommendations

    Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username

    Project mention: Any tools for adding a list of recommendations? | reddit.com/r/radarr | 2022-07-17

  • stock_screener

    Picking stocks through various screening methods. Focus on Northern Europe.

  • leetcode-compensation

    Compensation analysis on the posts scraped from leetcode.com/discuss/compensation. At present, the reports have been generated only for Indian cities.

  • GoodreadsScraper

    Scrape data from Goodreads using Scrapy and Selenium

    Project mention: [OC] Top 10 Fantasy Books of the 21st Century According to GoodReads | reddit.com/r/dataisbeautiful | 2022-05-20

    Used this code to get the data: https://github.com/havanagrawal/GoodreadsScraper

  • twitter-scraper-selenium

    Python package to scrape Twitter's front end easily

  • facebook_page_scraper

    Scrapes the front end of Facebook pages with no limitations and provides a feature to turn the data into structured JSON or CSV

  • htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line

    Project mention: How does Firefox's Reader View work? | news.ycombinator.com | 2022-03-30

  • web-poet

    Web scraping Page Objects core library

    Project mention: Is there a method to web scrape similar type of information from hundreds of websites with a single code or application? | reddit.com/r/webscraping | 2022-07-21

    Check out the web-poet pattern: https://github.com/scrapinghub/web-poet

  • Mobile-Phone-Dataset-GSMArena

    Python script for creating a mobile phone dataset from the GSMArena website.

    Project mention: Need web scraping script for phones specifications and phone dataset | reddit.com/r/learnpython | 2022-04-11

    but searching GitHub for python specific scrapers for GSMARENA look at gsmarena-scraper and Mobile-Phone-Dataset-GSMArena for reference how to scrape.

  • parsel-cli

    CLI for evaluating CSS and XPath selectors

    Project mention: Web Scraping With Python (An Ultimate Guide) | reddit.com/r/Python | 2022-09-15

    I like it so much that I even wrote a REPL for it parsel-cli :) (it's a bit of a Frankenstein though as I'm working on a 2.0 release)

  • Neural-Scam-Artist

    Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

  • tagalog-dictionary-scraper

    Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com

  • microwler

    A micro-framework for asynchronous deep crawls and web scraping with Python

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-11-06.

Index

What are some of the best open-source web-scraping projects in Python? This list will help you:

 #  Project                       Stars
 1  autoscraper                   4,739
 2  selenium-python-helium        3,189
 3  Grab                          2,231
 4  snoop                         1,615
 5  trafilatura                     666
 6  scrapy-fake-useragent           615
 7  web-scraping                    376
 8  wayback-machine-scraper         316
 9  saveddit                        127
10  sillynium                       114
11  letterboxd_recommendations      112
12  stock_screener                  104
13  leetcode-compensation            92
14  GoodreadsScraper                 68
15  twitter-scraper-selenium         67
16  facebook_page_scraper            66
17  htmldate                         60
18  web-poet                         57
19  Mobile-Phone-Dataset-GSMArena    46
20  parsel-cli                       19
21  Neural-Scam-Artist               17
22  tagalog-dictionary-scraper       17
23  microwler                        11