Python Webscraping

Open-source Python projects categorized as Webscraping

Top 23 Python Webscraping Projects

  • GitHub repo autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Turn Any Website Into An API with AutoScraper and FastAPI | dev.to | 2021-04-24

    In this article, we will learn how to create a simple e-commerce search API with support for multiple platforms: eBay and Amazon. AutoScraper and FastAPI provide the ability to create a powerful JSON API for the data. With Playwright's help, we'll extend our scraper and avoid blocking by using ScrapingAnt's web scraping API.

  • GitHub repo gazpacho

    🥫 The simple, fast, and modern web scraping library

    Project mention: Ask HN: What are some tools / libraries you built yourself? | news.ycombinator.com | 2021-05-16

    I've been working on gazpacho [1] for the last two years.

    It's a general purpose web scraping library for Python that replaces BeautifulSoup + requests for most projects.

    Just surpassed ~2K downloads every week!

    [1] https://github.com/maxhumber/gazpacho

  • GitHub repo instascrape

    Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

    Project mention: Fetching all comments from an Instagram post? | reddit.com/r/webscraping | 2021-06-16

    There are Instagram scrapers that you can use for this if you don't want to use the Instagram API. They can fetch a lot of posts, but you should rate-limit them.
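    The advice to rate-limit a scraper can be sketched as follows. This is a minimal, standard-library illustration and not code from instascrape: the names `RateLimiter` and `fetch_all` are hypothetical, and `fetch` stands for whatever function actually downloads a page.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, limiter):
    """Call fetch(url) for each URL, pausing between requests (hypothetical helper)."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

    A caller would create one limiter, for example `RateLimiter(2.0)`, and reuse it for every request to the same site. The interval is an assumption to tune per site.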

  • GitHub repo TikTokBot

    A TikTokBot that downloads trending tiktok videos and compiles them using FFmpeg

    Project mention: [OFFER] CHEAP and High-Quality Programming | reddit.com/r/slavelabour | 2021-03-27

    TikTokBot automatically makes TikTok compilations

  • GitHub repo CoWin-Vaccine-Notifier

    Automated Python Script to retrieve vaccine slots availability and get notified when a slot is available.

    Project mention: Automate Cowin Vaccine slots Availablity using Python | dev.to | 2021-05-17

    And with that, it's a wrap! You can host the script at the server to get notified every time a slot is available near you. You can find all the code at my GitHub Repository. Drop a star if you find it useful.

  • GitHub repo zimit

    Make a ZIM file from any Web site and surf offline!

    Project mention: Reading from the web offline and distraction-free | news.ycombinator.com | 2021-10-10

    which worked quite well for most sites, but still very far from a general-purpose solution.

    There is also a more powerful, general-purpose scraper that generates a ZIM file here: https://github.com/openzim/zimit

    It would be really nice to have a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)

  • GitHub repo web_check

    Script for checking changes in webpages

    Project mention: Help me automate a boring task. [Print TO HTML] | reddit.com/r/learnpython | 2021-10-10

    Sure, in this project https://github.com/Jaime-alv/web_check. Look at checker.py inside web_check folder, line 37 onwards.
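    One common way to check a webpage for changes is to compare a hash of its content against the hash stored on the previous run. The sketch below illustrates that idea with the standard library only. It is not the code in checker.py: `fingerprint`, `has_changed`, and the in-memory `store` dictionary are hypothetical names chosen for illustration.

```python
import hashlib


def fingerprint(html):
    """Return a SHA-256 hex digest of the page, ignoring whitespace differences."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(url, html, store):
    """Compare the page with its stored fingerprint, then update the store.

    Returns False the first time a URL is seen, because there is
    nothing to compare against yet.
    """
    new = fingerprint(html)
    old = store.get(url)
    store[url] = new
    return old is not None and old != new
```

    In a real script the `store` mapping would be saved to disk between runs, for example as a JSON file.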

  • GitHub repo newsemble

    API for fetching data from news websites.

    Project mention: Newsemble: An API to fetch current news data | reddit.com/r/Python | 2021-07-18

    I read through the documentation and tinkered around with it -- great work! One recommendation I would make, particularly if you're hoping that this will be useful long-term for NLP, is not to delete the previously scraped data. For instance, http://www.newsemble.ml/news only contains 129 results, which is nowhere near comprehensive enough to ensure any kind of statistically significant NLP.

  • GitHub repo htmldate

    Fast and robust date extraction from web pages, from the command-line or within Python

  • GitHub repo redditsfinder

    Archive a reddit user's post history. Formatted overview of a profile, JSON containing every post, and picture downloads. Uses the pushshift API.

    Project mention: How to get list of subreddits where user has commented? | reddit.com/r/redditdev | 2021-08-06

  • GitHub repo extractnet

    A Dragnet that also extract author, headline, date, keywords from context

    Project mention: ExtractNet - a dragnet that also extract author, headline, date, keywords from web page in ML fashion | reddit.com/r/Python | 2021-02-10

  • GitHub repo scrapingant-client-python

    ScrapingAnt API client for Python.

    Project mention: We've updated our scraping API docs. Do you like or hate it? Thanks in advance for the comments! | reddit.com/r/webscraping | 2021-02-21

    Hello. Following up on our previous conversation: we've created a Python client library for our API: https://github.com/ScrapingAnt/scrapingant-client-python so you can check it out. The Scrapy plugin is still under development. I'll keep you posted. Thanks for the interest :-)

  • GitHub repo udemyscraper

    A Udemy Course Scraper built with bs4 and selenium, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file, without authentication!

    Project mention: What cool projects have you make with BeautifulSoup to make your life easier? | reddit.com/r/Python | 2021-09-15

  • GitHub repo pycraigslist

    A fast and expressive Craigslist API wrapper

    Project mention: Web Scraping Used Car data via Carfax.com | reddit.com/r/webscraping | 2021-08-30

    I understand your concerns, as the quality of posts can vary greatly on Craigslist. But the plus side is that there is a vast number of used cars on Craigslist. If you know Python, try pycraigslist.

  • GitHub repo HackerNEWS-Simplified

    A more simplified, straightforward, and plain version of Hacker News.

    Project mention: HackerNews Simplified with Python3 and BeautifulSoup | reddit.com/r/learnprogramming | 2021-10-16

    HackerNews Simplified is a simpler, more straightforward version of HackerNews. The project uses BeautifulSoup, a Python web scraping library, to scrape data from the HackerNews website, keep only the most relevant news, and display it in your terminal.
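    The scrape-then-print approach described above can be sketched roughly as follows. To stay dependency-free, this illustration swaps in the standard library's `html.parser` where the project uses BeautifulSoup, and it runs on a made-up HTML sample rather than the live site. The `storylink` class name, `TitleParser`, and `top_titles` are all assumptions for illustration, not the project's code.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect (title, href) pairs from <a> tags carrying a given CSS class."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class
        self.titles = []
        self._href = None  # href of the matching <a> being read, if any
        self._text = []    # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.link_class in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.titles.append(("".join(self._text).strip(), self._href))
            self._href = None


def top_titles(html, link_class="storylink", limit=5):
    """Return up to `limit` (title, href) pairs found in the HTML."""
    parser = TitleParser(link_class)
    parser.feed(html)
    return parser.titles[:limit]


# Made-up sample markup; the real page structure may differ.
SAMPLE = """
<table>
  <tr><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr><td><a href="/login">login</a></td></tr>
  <tr><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
</table>
"""

for title, href in top_titles(SAMPLE):
    print(f"{title} ({href})")
```

    Filtering on a class keeps navigation links such as the login link out of the output, and `limit` caps how many headlines reach the terminal.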

  • GitHub repo Amazon-Product-Information-Scraper

    This is a Python web-scraping project to get the names, prices, review stars, and review counts of all products in a particular category

    Project mention: Program to get all the information about a product on Amazon | reddit.com/r/indiasocial | 2021-05-26

    All you have to do is go to this link, click on Code in the top right corner, and click on Download Zip. Once you have extracted it, simply click on AmazonProductScraper.exe and type the name of the product to be searched. Once the browser closes automatically, all the information is available in the same folder as an Excel sheet.

  • GitHub repo Google-Image-Scraper

    This program uses web scraping to download images from google image search instantly and can be helpful in making image datasets.

    Project mention: help with google image scraper script | reddit.com/r/DataHoarder | 2021-06-12

    I'm running a Google image scraper Python script found here: https://github.com/NIKHILDUGAR/Google-Image-Scraper. Everything looks like it's running smoothly, but then it runs into the following error before it ever downloads any images:

  • GitHub repo PublicToiletDataFetcher

    Various data fetching, cleaning and processing scripts to collect and process data on public toilets in London

    Project mention: Why doesn’t the U.K. have more public toilets? | reddit.com/r/AskUK | 2021-05-02

    If you're in London, there's this app - https://www.toilets4london.com/

  • GitHub repo getFunds

    ISIN code to Price

    Project mention: This Script Searchs the ISIN code and finds the fund price. | reddit.com/r/Python | 2021-05-10

    Check it here : GitHub Link

  • GitHub repo webscraping_lyrics

    Functions for scraping AZLyrics.com and downloading songs lyrics in txt format

    Project mention: Most frequent words in Springsteen's Lyrics | reddit.com/r/BruceSpringsteen | 2021-01-23

    Actually I did the web-scraping with Python and analyzed the lyrics with R because I’m a bit more confident with it. If you’re interested I have uploaded the code I wrote for the web-scraping here. Feel absolutely free to use it!

  • GitHub repo downdrag

    Webscraping done thoroughly

    Project mention: downdrag, python web scraping utility with extensible features | reddit.com/r/webscraping | 2021-05-05

  • GitHub repo acgn-bot

    Telegram bot: Check anime/comic/game/novel websites update

    Project mention: I need an app to get notifications from webnovels from websites such as syosetu. | reddit.com/r/noveltranslations | 2021-08-10

    You could spin up this Telegram bot for it: https://github.com/nonjosh/acgn-bot. I don’t know if there’s a public one already set up.

  • GitHub repo random_spanish

    Using Selenium, generates a set of words to study by translating to and from English and Spanish.

    Project mention: I made a tool to revise Spanish using randomly selected words | reddit.com/r/Python | 2021-02-03

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-10-16.

Index

What are some of the best open-source Webscraping projects in Python? This list will help you:

 #  Project                             Stars
 1  autoscraper                         3,987
 2  gazpacho                              599
 3  instascrape                           345
 4  TikTokBot                             163
 5  CoWin-Vaccine-Notifier                101
 6  zimit                                  52
 7  web_check                              47
 8  newsemble                              40
 9  htmldate                               31
10  redditsfinder                          29
11  extractnet                             28
12  scrapingant-client-python              13
13  udemyscraper                           11
14  pycraigslist                           10
15  HackerNEWS-Simplified                   5
16  Amazon-Product-Information-Scraper      3
17  Google-Image-Scraper                    2
18  PublicToiletDataFetcher                 2
19  getFunds                                2
20  webscraping_lyrics                      1
21  downdrag                                1
22  acgn-bot                                1
23  random_spanish                          0