Top 23 Python Webscraping Projects
-
Project mention: What are the best tools for web scraping and analysis of natural language to populate a dataset? | reddit.com/r/datasets | 2023-04-12
See if something like autoscraper or mlscraper suits your needs.
-
Webscraping Open Project
The web scraping open project repository aims to share knowledge and experiences about web scraping with Python
-
Project mention: Web Scraping with Python: from Fundamentals to Practice | reddit.com/r/Python | 2022-06-23
For anyone who goes with requests as your HTTP client, I would highly recommend adding requests-cache for a nice performance boost.
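The performance boost comes from not repeating a network round trip for a page that was already fetched. In requests-cache the documented entry point is `requests_cache.CachedSession`. The sketch below is not that library's implementation, only a minimal stdlib illustration of the idea. `CachedFetcher`, `fake_fetch`, and the `expire_after` default are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Wrap a fetch function so repeated requests for a URL reuse the stored body."""

    def __init__(self, fetch: Callable[[str], str], expire_after: float = 300.0):
        self._fetch = fetch                # hypothetical: any function mapping URL -> body
        self._expire_after = expire_after  # seconds a cached body stays fresh
        self._store: Dict[str, Tuple[float, str]] = {}
        self.misses = 0                    # counts calls that reached the fetch function

    def get(self, url: str) -> str:
        now = time.monotonic()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._expire_after:
            return cached[1]               # cache hit: no round trip
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP call so the sketch runs offline.
    return f"<html>{url}</html>"


fetcher = CachedFetcher(fake_fetch)
first = fetcher.get("https://example.com/page")
second = fetcher.get("https://example.com/page")  # served from the in-memory store
```

In a real scraper the `fetch` argument would be a function that performs the HTTP request; the second call above never reaches it.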
-
Project mention: Those of you who have developed product features using GPT4 API (or failed to do so), how did it go? | reddit.com/r/ExperiencedDevs | 2023-04-15
Not my project but an ex-colleague has been having some success in this direction: https://jamesturk.github.io/scrapeghost/
-
CrossLinked
LinkedIn enumeration tool to extract valid employee names from an organization through search engine scraping
I used https://github.com/m8sec/CrossLinked. It takes a domain as input and gives this as output:
Datetime, Search, First, Last, Title, URL, rawText
"03-22-2023 08:36:54","google","Elon","Musc","manager operational unit","https://be.linkedin.com/in/elon-musc-b8464215","Elon Musc - manager operational unit - LinkedIn linkedin.comhttps://be.linkedin.com > elon-m",
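Output in this shape can be read with Python's standard `csv` module. The sketch below is a rough illustration that assumes the file looks exactly like the header and row quoted in the mention above (including the spaces after the commas in the header and the trailing comma on the row); the real CrossLinked output may differ.

```python
import csv
import io

# Header and row copied from the mention above; the layout is an assumption.
sample = (
    "Datetime, Search, First, Last, Title, URL, rawText\n"
    '"03-22-2023 08:36:54","google","Elon","Musc","manager operational unit",'
    '"https://be.linkedin.com/in/elon-musc-b8464215",'
    '"Elon Musc - manager operational unit - LinkedIn '
    'linkedin.comhttps://be.linkedin.com > elon-m",\n'
)

# skipinitialspace drops the blank after each comma in the header line.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)

# zip() stops at the shorter sequence, so the empty field produced by the
# trailing comma on the data row is ignored.
rows = [dict(zip(header, row)) for row in reader]

full_names = [f"{row['First']} {row['Last']}" for row in rows]
```

Each entry in `rows` is a dictionary keyed by column name, so fields such as `Title` or `URL` can be looked up directly.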
-
dude
dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
Project mention: Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next | reddit.com/r/webscraping | 2023-05-08
Please check https://github.com/roniemartinez/dude
-
Project mention: ChatGPT is falsely recommending us for a service we don't provide | news.ycombinator.com | 2023-02-23
Phomber is not the best example. Ed contacted the developer of that tool over a year ago about the issue and to remove mentions of OpenCage and as far as I see the author removed it https://github.com/s41r4j/phomber/issues/4
-
ebayScraper
Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and extract to excel
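The average/median step can be sketched with Python's `statistics` module. This is not ebayScraper's own code, and the prices below are made-up placeholder values used only to show why the two figures can differ.

```python
from statistics import mean, median

# Hypothetical sold prices; a real run would take these from scraped listings.
sold_prices = [12.50, 15.00, 15.00, 18.25, 40.00]

average_price = mean(sold_prices)   # pulled upward by the single 40.00 outlier
median_price = median(sold_prices)  # middle value, less sensitive to outliers
```

Reporting both is useful for sold listings because a few unusually high or low sales shift the average much more than the median.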
-
There are clearly similarities between the two, given that Kiwix put resources into making WARC content available in ZIM archives (i.e. Zimit-style ZIMs, created with the Zimit scraper and warc2zim backend). But as u/IMayBeABitShy said, the ZIM specification focuses on providing a highly compressed container that is readable on-the-fly (i.e. by decompressing only the needed content to show an article), whereas WARC, or rather the compressed version WACZ, is merely a zipped version of the WARC data (request headers and responses). It is also readable on-the-fly, but compression will not be as optimal as the zstandard compression used by modern ZIM archives.
-
CoWin-Vaccine-Notifier
Automated Python Script to retrieve vaccine slots availability and get notified when a slot is available.
-
redditsfinder
Archive a reddit user's post history. Formatted overview of a profile, JSON containing every post, and picture downloads. Uses the pushshift API.
Using this
-
There's a program called SearchifyX that scrapes Brainly, Quizizz, and Quizlet to find answers: https://github.com/daijro/SearchifyX
-
linkedin-comments-scraper
Script to scrape comments (including name, profile link, pfp, designation, email(if present), and comment) from a LinkedIn post from the URL of the post.
Project mention: Linkedin Comments Scraper - Script to scrape comments (including name, profile picture, designation, email(if present), and comment) from a LinkedIn post from the URL of the post. | reddit.com/r/webscraping | 2023-01-05
-
web_to_obsidian
A Python 3 script that scrapes an html/xml page to extract text, then creates markdown files for Obsidian & the dataview plugin
Project mention: I got the job! Now using Obsidian to help create databases for my company | reddit.com/r/ObsidianMD | 2023-03-25
Made a Github page last time I posted this: https://github.com/Flybell/web_to_obsidian
-
scrape-google-scholar-py is an open-source project of mine that aims to extract all the possible data from Google Scholar. In the future I'll port it to R.
-
CodiumAI
TestGPT | Generating meaningful tests for busy devs. Get non-trivial tests (and trivial, too!) suggested right inside your IDE, so you can code smart, create more value, and stay confident when you push.
Python Webscraping related posts
- I have made a simple webscraper in python.pls checkout this github project.
- Gambling addiction?
- Educational Resources
- I have an idea that would be somewhat cool to do with a sub of the most idiotic people ever.
- ChatGPT converted my trivia book into a daily Podcast
- Scrape Google Scholar in R
- ChatGPT Driven AI News Podcasts now rival the realism of actual news segments
A note from our sponsor - CodiumAI
codium.ai | 30 May 2023
Index
What are some of the best open-source Webscraping projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | autoscraper | 5,206 |
2 | Webscraping Open Project | 1,279 |
3 | requests-cache | 1,111 |
4 | scrapeghost | 1,087 |
5 | CrossLinked | 779 |
6 | gazpacho | 703 |
7 | dude | 362 |
8 | TikTokBot | 281 |
9 | Dorkify | 182 |
10 | phomber | 153 |
11 | ebayScraper | 146 |
12 | zimit | 144 |
13 | CoWin-Vaccine-Notifier | 102 |
14 | htmldate | 71 |
15 | web_check | 52 |
16 | redditsfinder | 48 |
17 | newsemble | 44 |
18 | SearchifyX | 42 |
19 | linkedin-comments-scraper | 39 |
20 | web_to_obsidian | 31 |
21 | scrape-google-scholar-py | 29 |
22 | pycraigslist | 28 |
23 | botcity-framework-web-python | 26 |