Top 23 Python Webscraping Projects
-
Project mention: CyberScraper 2077 – A OpenAI / Gemini Based Web Scraper | news.ycombinator.com | 2024-09-08
Info about CyberScraper-2077:
Rip data from the net, leaving no trace. Welcome to the future of web scraping.
About
CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI to slice through the web's defenses, extracting the data you need with unparalleled precision and style.
Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered.
Features
AI-Powered Extraction: Utilizes cutting-edge AI models to understand and parse web content intelligently.
Sleek Streamlit Interface: User-friendly GUI that even a chrome-armed street samurai could navigate.
Multi-Format Support: Export your data in JSON, CSV, HTML, SQL or Excel – whatever fits your cyberdeck.
Stealth Mode: Stealth mode parameters that help it avoid being detected as a bot.
Ollama Support: Use a huge library of open source LLMs.
Async Operations: Lightning-fast scraping that would make a Trauma Team jealous.
Smart Parsing: Structures scraped content as if it was extracted straight from the engram of a master netrunner.
Ethical Scraping: Respects robots.txt and site policies. We may be in 2077, but we still have standards.
Caching: We implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.
Upload to Google Sheets: Now you can easily upload your extracted CSV data to Google Sheets with one click.
Proxy Mode (Coming Soon): Built-in proxy support to keep you ghosting through the net.
Navigate through the Pages: Navigate through the website and scrape data from different pages.
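The "Ethical Scraping" feature above mentions respecting robots.txt. CyberScraper's own implementation is not shown on this page; as a minimal sketch of the general idea, Python's standard-library `urllib.robotparser` can check whether a URL may be fetched. The robots.txt content, the `example-bot` user agent, and the `is_allowed` helper are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real scraper the rules would normally be loaded from the site itself (for example with `set_url()` followed by `read()`), and `is_allowed` would be called before each request.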
If you are unable to scrape a website and you are getting blocked, try out the current browser features:
Github: https://github.com/itsOwen/CyberScraper-2077
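The "Caching" feature above describes content-based and query-based caching built on an LRU cache and a custom dictionary. The project's actual code is not reproduced here; the following is only a sketch of how such a scheme might look, assuming a hypothetical `expensive_extract` function that stands in for the LLM/API call. All names (`extract`, `CALLS`, `_content_store`) are made up for this example.

```python
import hashlib
from functools import lru_cache

# Counts how often the stand-in "expensive" call really runs.
CALLS = {"count": 0}

# Custom dictionary: maps a content hash to the page content it was built from.
_content_store: dict[str, str] = {}


def expensive_extract(content: str, query: str) -> str:
    """Hypothetical stand-in for a slow or paid API call."""
    CALLS["count"] += 1
    return f"{query}: {len(content)} chars"


def _content_key(content: str) -> str:
    """Content-based key: identical page text always yields the same hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@lru_cache(maxsize=128)
def _extract_cached(content_key: str, query: str) -> str:
    # The LRU cache is keyed on (content hash, normalized query), so a repeated
    # question about unchanged content never reaches expensive_extract.
    return expensive_extract(_content_store[content_key], query)


def extract(content: str, query: str) -> str:
    """Answer query about content, reusing a cached result when possible."""
    key = _content_key(content)
    _content_store.setdefault(key, content)
    # Query-based part: normalize so trivially different queries share an entry.
    return _extract_cached(key, query.strip().lower())
```

Calling `extract(page, "Titles")` and then `extract(page, " titles ")` would trigger only one underlying call. Hashing the content keeps the cache key small even for large pages.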
-
CrossLinked
LinkedIn enumeration tool to extract valid employee names from an organization through search engine scraping
-
EasyApplyJobsBot
A Python bot that automatically applies to all LinkedIn, Glassdoor, etc. Easy Apply jobs based on your preferences. Auto login, auto-fill additional questions, apply automatically!
-
At Scrapfly, we are dedicated to providing developers with all the resources they need to reach their scraping goals. Check out our comprehensive guide on scraping TikTok as well as our example TikTok scraper using Scrapfly's APIs on GitHub.
-
AutoScraper
Official implementation of the paper "AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation" [EMNLP '24] (by EZ-hwh)
-
ChatGPT-OpenAI-Smart-Speaker
This AI Smart Speaker uses speech recognition, TTS (text-to-speech), and STT (speech-to-text) to enable voice and vision-driven conversations, with additional web search capabilities via OpenAI and Langchain agents.
-
ebayScraper
Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and export to Excel
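As a rough illustration of the average/median pricing step described above, and not ebayScraper's actual code, the summary could be computed with Python's standard `statistics` module once sold prices have been collected. The `summarize_prices` helper and its input list are hypothetical.

```python
from statistics import mean, median


def summarize_prices(prices: list[float]) -> dict[str, float]:
    """Return the count, average, and median of a list of sold prices."""
    if not prices:
        raise ValueError("no prices to summarize")
    return {
        "count": len(prices),
        "average": round(mean(prices), 2),
        "median": round(median(prices), 2),
    }
```

The median is usually the more useful figure for sold listings, since a single unusually expensive sale pulls the average up but barely moves the median.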
-
llm_osint
LLM OSINT is a proof-of-concept method of using LLMs to gather information from the internet and then perform a task with this information.
-
Project mention: Show HN: Domain Public Data Collection Service Got Updated | news.ycombinator.com | 2024-06-14
Hello everyone! A few days ago I published a post here about DPULSE, a domain OSINT tool. For those who haven't seen it, here is the GitHub link (https://github.com/OSINT-TECHNOLOGIES/dpulse). In general, this tool lets you gather a lot of open-source information about a domain, such as WHOIS information, subdomains, mentions of the domain owner's organization on social networks (plus organization profiles on social networks), IP addresses, public documents and more using custom Google Dorking queries, InternetDB search results (possible vulnerabilities, open ports and so on), web technologies in use, sitemap and robots.txt files, and SSL certificate info. In the v1.0.2 update I added the option to export results as an XLSX file in addition to a PDF file. I also added a roadmap on the GitHub page (Projects tab), so anyone interested can see how development is going.
I'd be glad to see everyone in my repository and to hear your opinions and tips!
-
seo-keyword-research-tool
Python SEO keywords suggestion tool. Google Autocomplete, People Also Ask and Related Searches.
-
CoWin-Vaccine-Notifier
Automated Python script to retrieve vaccine slot availability and get notified when a slot is available.
-
scrape-google-scholar-py
Extract data from all Google Scholar pages from a single Python module. NOTE: I'm no longer maintaining this repo. Chrome driver/selectors might need an update.
Python Webscraping discussion
Python Webscraping related posts
-
A Comprehensive Guide to TikTok API
-
Ask HN: Who is hiring? (November 2024)
-
CyberScraper 2077 – A OpenAI / Gemini Based Web Scraper
-
CyberScraper-2077 – A LLM Based Web Scraper
-
How To Scrape TikTok in 2024
-
Direction Of The Stock Market
-
And I thought amazing fics suddenly being deleted was a myth
-
Index
What are some of the best open-source Webscraping projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | autoscraper | 6,788 |
2 | CyberScraper-2077 | 1,712 |
3 | scrapeghost | 1,436 |
4 | requests-cache | 1,426 |
5 | CrossLinked | 1,386 |
6 | mov-cli | 937 |
7 | gazpacho | 769 |
8 | EasyApplyJobsBot | 625 |
9 | zimit | 535 |
10 | scrapfly-scrapers | 533 |
11 | AutoScraper | 468 |
12 | TikTokBot | 377 |
13 | ChatGPT-OpenAI-Smart-Speaker | 290 |
14 | ebayScraper | 223 |
15 | llm_osint | 207 |
16 | Dorkify | 202 |
17 | blackmaria | 150 |
18 | htmldate | 129 |
19 | dpulse | 124 |
20 | seo-keyword-research-tool | 119 |
21 | CoWin-Vaccine-Notifier | 107 |
22 | scrape-google-scholar-py | 106 |
23 | python-libzim | 88 |