Python Webscraping

Open-source Python projects categorized as Webscraping

Top 23 Python Webscraping Projects

  1. autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

  2. InfluxDB

    InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.

  3. CyberScraper-2077

    A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

    Project mention: CyberScraper 2077 – A OpenAI / Gemini Based Web Scraper | news.ycombinator.com | 2024-09-08

    Info about CyberScraper-2077:

    Rip data from the net, leaving no trace. Welcome to the future of web scraping.

    About

    CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI to slice through the web's defenses, extracting the data you need with unparalleled precision and style.

    Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered.

    Features

    AI-Powered Extraction: Utilizes cutting-edge AI models to understand and parse web content intelligently.

    Sleek Streamlit Interface: User-friendly GUI that even a chrome-armed street samurai could navigate.

    Multi-Format Support: Export your data in JSON, CSV, HTML, SQL or Excel – whatever fits your cyberdeck.

    Stealth Mode: Stealth mode parameters help it avoid being detected as a bot.

    Ollama Support: Use a huge library of open-source LLMs.

    Async Operations: Lightning-fast scraping that would make a Trauma Team jealous.

    Smart Parsing: Structures scraped content as if it was extracted straight from the engram of a master netrunner.

    Ethical Scraping: Respects robots.txt and site policies. We may be in 2077, but we still have standards.

    Caching: We implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.

    Upload to Google Sheets: Upload your extracted CSV data to Google Sheets with one click.

    Proxy Mode (Coming Soon): Built-in proxy support to keep you ghosting through the net.

    Navigate through the Pages: Navigate through the website and scrape data from different pages.

    If you are unable to scrape a website and you are getting blocked, try out the current browser features:

    GitHub: https://github.com/itsOwen/CyberScraper-2077
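
    The caching feature described above (content-based and query-based caching with an LRU cache and a custom dictionary) could look roughly like the sketch below. This is not CyberScraper-2077's actual code: `fetch_page`, `expensive_extract`, `answer_query`, the `PAGES` data and the `stats` counter are all hypothetical names made up for illustration, and the "API call" is simulated locally.

```python
import hashlib
from functools import lru_cache

# Hypothetical canned pages standing in for real HTTP responses.
PAGES = {
    "https://example.com/a": "<html><h1>Widget</h1></html>",
    "https://example.com/b": "<html><h1>Widget</h1></html>",  # same content as /a
}

# Counts how often the simulated expensive API call actually runs.
stats = {"calls": 0}


def fetch_page(url):
    """Hypothetical fetcher; a real scraper would issue an HTTP request here."""
    return PAGES[url]


def expensive_extract(content, query):
    """Stand-in for a costly LLM/API call (assumption, not a real API)."""
    stats["calls"] += 1
    return f"{query}: {len(content)} chars"


# Content-based cache: a plain dict keyed by (hash of page content, query).
_content_cache = {}


def extract_cached(content, query):
    key = (hashlib.sha256(content.encode("utf-8")).hexdigest(), query)
    if key not in _content_cache:
        _content_cache[key] = expensive_extract(content, query)
    return _content_cache[key]


# Query-based cache: an LRU cache keyed by the (url, query) arguments.
@lru_cache(maxsize=128)
def answer_query(url, query):
    return extract_cached(fetch_page(url), query)
```

    In this sketch, repeating the same `answer_query(url, query)` call is served by the LRU cache, while a different URL that returns identical content is served by the dictionary, so the simulated API call runs once for both.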

  4. scrapeghost

    👻 Experimental library for scraping websites using OpenAI's GPT API.

  5. requests-cache

    Transparent persistent cache for python requests

  6. CrossLinked

    LinkedIn enumeration tool to extract valid employee names from an organization through search engine scraping

  7. mov-cli

    Watch everything from your terminal.

  8. gazpacho

    🥫 The simple, fast, and modern web scraping library

  9. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  10. EasyApplyJobsBot

    A Python bot that automatically applies to all LinkedIn, Glassdoor, etc. Easy Apply jobs based on your preferences. Auto login, auto fill additional questions, apply automatically!

  11. zimit

    Make a ZIM file from any Web site and surf offline!

  12. scrapfly-scrapers

    Scalable Python web scraping scripts for 40+ popular domains

    Project mention: A Comprehensive Guide to TikTok API | dev.to | 2025-03-20

    At Scrapfly, we are dedicated to providing developers with all the resources they need to reach their scraping goals. Check out our comprehensive guide on scraping TikTok as well as our example TikTok scraper using Scrapfly's APIs on GitHub.

  13. AutoScraper

    Official implementation of the paper "AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation" [EMNLP '24] (by EZ-hwh)

  14. TikTokBot

    A TikTokBot that downloads trending TikTok videos and compiles them using FFmpeg

  15. ChatGPT-OpenAI-Smart-Speaker

    This AI Smart Speaker uses speech recognition, TTS (text-to-speech), and STT (speech-to-text) to enable voice and vision-driven conversations, with additional web search capabilities via OpenAI and Langchain agents.

  16. ebayScraper

    Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and export to Excel

  17. llm_osint

    LLM OSINT is a proof-of-concept method of using LLMs to gather information from the internet and then perform a task with this information.

  18. Dorkify

    Perform Google Dork search with Dorkify

  19. blackmaria

    Python package for webscraping in Natural language

  20. htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line

  21. dpulse

    DPULSE - Tool for a comprehensive approach to domain OSINT

    Project mention: Show HN: Domain Public Data Collection Service Got Updated | news.ycombinator.com | 2024-06-14

    Hello everyone! A few days ago I published a post here about DPULSE, a domain OSINT tool. For those who haven't seen it, here is the GitHub link (https://github.com/OSINT-TECHNOLOGIES/dpulse). In general, this tool lets you gather a lot of open-source information about a domain, such as WHOIS information, subdomains, mentions of the domain owner's organization on some social networks (plus organization profiles on social networks), IP addresses, public documents and much more, using custom Google Dorking queries, InternetDB search results (possible vulnerabilities, open ports and so on), web technologies in use, sitemap and robots.txt files, and SSL certificate info. In the v1.0.2 update I added the option to export results as an XLSX file in addition to a PDF file. I also added a roadmap on the GitHub page (Projects tab), so anyone interested can follow how development is going.

    I'll be glad to see everyone in my repository and, moreover, to hear your opinions and tips!

  22. seo-keyword-research-tool

    Python SEO keywords suggestion tool. Google Autocomplete, People Also Ask and Related Searches.

  23. CoWin-Vaccine-Notifier

    Automated Python Script to retrieve vaccine slots availability and get notified when a slot is available.

  24. scrape-google-scholar-py

    Extract data from all Google Scholar pages from a single Python module. NOTE: I'm no longer maintaining this repo. Chrome driver/selectors might need an update.

  25. python-libzim

    Libzim binding for Python: read/write ZIM files in Python

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Webscraping related posts

  • A Comprehensive Guide to TikTok API

    1 project | dev.to | 20 Mar 2025
  • Ask HN: Who is hiring? (November 2024)

    20 projects | news.ycombinator.com | 1 Nov 2024
  • CyberScraper 2077 – A OpenAI / Gemini Based Web Scraper

    1 project | news.ycombinator.com | 8 Sep 2024
  • CyberScraper-2077 – A LLM Based Web Scraper

    2 projects | news.ycombinator.com | 6 Sep 2024
  • How To Scrape TikTok in 2024

    1 project | dev.to | 8 Mar 2024
  • Direction Of The Stock Market

    1 project | /r/StockMarket | 6 Dec 2023
  • And I thought amazing fics suddenly being deleted was a myth

    1 project | /r/AO3 | 18 Nov 2023

Index

What are some of the best open-source Webscraping projects in Python? This list will help you:

#   Project                         Stars
1   autoscraper                     6,788
2   CyberScraper-2077               1,712
3   scrapeghost                     1,436
4   requests-cache                  1,426
5   CrossLinked                     1,386
6   mov-cli                           937
7   gazpacho                          769
8   EasyApplyJobsBot                  625
9   zimit                             535
10  scrapfly-scrapers                 533
11  AutoScraper                       468
12  TikTokBot                         377
13  ChatGPT-OpenAI-Smart-Speaker      290
14  ebayScraper                       223
15  llm_osint                         207
16  Dorkify                           202
17  blackmaria                        150
18  htmldate                          129
19  dpulse                            124
20  seo-keyword-research-tool         119
21  CoWin-Vaccine-Notifier            107
22  scrape-google-scholar-py          106
23  python-libzim                      88

Did you know that Python is the 2nd most popular programming language based on number of references?