Python Webscraping

Open-source Python projects categorized as Webscraping

Top 23 Python Webscraping Projects

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: What are the best tools for web scraping and analysis of natural language to populate a dataset? | reddit.com/r/datasets | 2023-04-12

    See if something like autoscraper or mlscraper suits your needs.

  • Webscraping Open Project

    The web scraping open project repository aims to share knowledge and experiences about web scraping with Python

    Project mention: What are your thoughts on scrapy | reddit.com/r/webscraping | 2022-08-30

  • requests-cache

    Transparent persistent cache for python requests

    Project mention: Web Scraping with Python: from Fundamentals to Practice | reddit.com/r/Python | 2022-06-23

    For anyone who goes with requests as your HTTP client, I would highly recommend adding requests-cache for a nice performance boost.

  • scrapeghost

    👻 Experimental library for scraping websites using OpenAI's GPT API.

    Project mention: Those of you who have developed product features using GPT4 API (or failed to do so), how did it go? | reddit.com/r/ExperiencedDevs | 2023-04-15

    Not my project but an ex-colleague has been having some success in this direction: https://jamesturk.github.io/scrapeghost/

  • CrossLinked

    LinkedIn enumeration tool to extract valid employee names from an organization through search engine scraping

    Project mention: What do with email and names? | reddit.com/r/OSINT | 2023-03-22

    I used https://github.com/m8sec/CrossLinked. It takes a domain as input and gives this as output:

    Datetime, Search, First, Last, Title, URL, rawText
    "03-22-2023 08:36:54","google","Elon","Musc","manager operational unit","https://be.linkedin.com/in/elon-musc-b8464215","Elon Musc - manager operational unit - LinkedIn linkedin.comhttps://be.linkedin.com > elon-m",

  • gazpacho

    🥫 The simple, fast, and modern web scraping library

  • dude

    dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

    Project mention: Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next | reddit.com/r/webscraping | 2023-05-08

    Please check https://github.com/roniemartinez/dude

  • TikTokBot

    A TikTokBot that downloads trending TikTok videos and compiles them using FFmpeg

  • Dorkify

    Perform Google Dork search with Dorkify

  • phomber

    [PH0MBER]: An open source information gathering & reconnaissance framework!

    Project mention: ChatGPT is falsely recommending us for a service we don't provide | news.ycombinator.com | 2023-02-23

    Phomber is not the best example. Ed contacted the developer of that tool over a year ago about the issue, asking to remove mentions of OpenCage, and as far as I can see the author removed them: https://github.com/s41r4j/phomber/issues/4

  • ebayScraper

    Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and export to Excel

  • zimit

    Make a ZIM file from any Web site and surf offline!

    Project mention: Zim vs WARC ? | reddit.com/r/Kiwix | 2022-11-15

    There are clearly similarities between the two, given that Kiwix put resources into making WARC content available in ZIM archives (i.e. Zimit-style ZIMs, created with the Zimit scraper and warc2zim backend). But as u/IMayBeABitShy said, the ZIM specification focuses on providing a highly compressed container that is readable on-the-fly (i.e. by decompressing only the needed content to show an article), whereas WARC, or rather the compressed version WACZ, is merely a zipped version of the WARC data (request headers and responses). It is also readable on-the-fly, but compression will not be as optimal as the zstandard compression used by modern ZIM archives.

  • CoWin-Vaccine-Notifier

    Automated Python script to retrieve vaccine slot availability and get notified when a slot is available.

  • htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line

  • web_check

    Script for checking changes in webpages

  • redditsfinder

    Archive a reddit user's post history. Formatted overview of a profile, JSON containing every post, and picture downloads. Uses the pushshift API.

    Project mention: How to download all the Threads i created? | reddit.com/r/redditdev | 2022-09-08

    Using this

  • newsemble

    API for fetching data from news websites.

  • SearchifyX

    Fast flashcard searcher

    Project mention: FULL GUIDE FOR EDGENUITY | reddit.com/r/edgenuity | 2023-02-02

    There's a program called SearchifyX that scrapes Brainly, Quizizz, and Quizlet to find answers: https://github.com/daijro/SearchifyX

  • linkedin-comments-scraper

    Script to scrape comments (including name, profile link, profile picture, designation, email if present, and comment text) from a LinkedIn post, given the post's URL.

    Project mention: Linkedin Comments Scraper - Script to scrape comments (including name, profile picture, designation, email(if present), and comment) from a LinkedIn post from the URL of the post. | reddit.com/r/webscraping | 2023-01-05

  • web_to_obsidian

    A Python 3 script that scrapes an HTML/XML page to extract text, then creates Markdown files for Obsidian and the Dataview plugin

    Project mention: I got the job! Now using Obsidian to help create databases for my company | reddit.com/r/ObsidianMD | 2023-03-25

    Made a Github page last time I posted this: https://github.com/Flybell/web_to_obsidian

  • scrape-google-scholar-py

    Extract data from all Google Scholar pages from a single Python module.

    Project mention: Scrape Google Scholar in R | dev.to | 2023-05-06

    scrape-google-scholar-py is an open-source project of mine that aims to extract all the possible data from Google Scholar. In the future I'll port it to R.

  • pycraigslist

    A quick Craigslist API wrapper

  • botcity-framework-web-python

    BotCity Framework Web - Python

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-05-08.
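
The requests-cache mention in the list above recommends caching HTTP responses for a performance boost. As a rough illustration of that idea only, the sketch below keeps response bodies in SQLite using nothing but the standard library. It is not the requests-cache API: `CachedFetcher`, its `fetch` callable, and the `expire_after` parameter are hypothetical names chosen for this example, and a real scraper would more likely use the library itself.

```python
import sqlite3
import time
from typing import Callable


class CachedFetcher:
    """Toy persistent cache keyed by URL (hypothetical, not requests-cache)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        db_path: str = ":memory:",
        expire_after: float = 3600.0,
    ) -> None:
        # `fetch` is any function that downloads a URL and returns its body.
        self._fetch = fetch
        self._expire_after = expire_after
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(url TEXT PRIMARY KEY, body TEXT, stored_at REAL)"
        )

    def get(self, url: str) -> str:
        # Serve from the cache when a fresh enough copy exists.
        row = self._db.execute(
            "SELECT body, stored_at FROM responses WHERE url = ?", (url,)
        ).fetchone()
        if row is not None and time.time() - row[1] < self._expire_after:
            return row[0]

        # Otherwise fetch, store, and return the new body.
        body = self._fetch(url)
        self._db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (url, body, time.time()),
        )
        self._db.commit()
        return body
```

Passing a file path as `db_path` instead of `":memory:"` would let the cache survive between runs, so repeated scraping sessions avoid re-downloading pages that have not expired.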

Index

What are some of the best open-source Webscraping projects in Python? This list will help you:

#   Project                         Stars
1   autoscraper                     5,206
2   Webscraping Open Project        1,279
3   requests-cache                  1,111
4   scrapeghost                     1,087
5   CrossLinked                       779
6   gazpacho                          703
7   dude                              362
8   TikTokBot                         281
9   Dorkify                           182
10  phomber                           153
11  ebayScraper                       146
12  zimit                             144
13  CoWin-Vaccine-Notifier            102
14  htmldate                           71
15  web_check                          52
16  redditsfinder                      48
17  newsemble                          44
18  SearchifyX                         42
19  linkedin-comments-scraper          39
20  web_to_obsidian                    31
21  scrape-google-scholar-py           29
22  pycraigslist                       28
23  botcity-framework-web-python       26