Python Scraper

Open-source Python projects categorized as Scraper

Top 23 Python Scraper Projects

  • newspaper

    newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.

  • chinese-xinhua

    Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

  • Douyin_TikTok_Download_API

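    🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls and online batch parsing and downloading.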

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

  • myGPTReader

    A community-driven way to read and chat with AI bots - powered by chatGPT.

  • snscrape

    A social networking service scraper in Python

  • Project mention: Can someone walk me through this? | /r/learnpython | 2023-11-21

    Here's what I'm trying to use: https://github.com/JustAnotherArchivist/snscrape What do I need to open/run any of this? My goal with this is to extract my follower list off Twitter, and I'd very much like to know how to run it on my machine instead of having someone run it for me on theirs. I can't even figure out what I need to open the Readme file.

  • Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

    Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!

  • bulk-downloader-for-reddit

    Downloads and archives content from reddit

  • Project mention: BDFR skipping Reddit hosted videos | /r/DataHoarder | 2023-10-21
  • linkedin_scraper

    A library that scrapes Linkedin for user data

  • JobFunnel

    Scrape job websites into a single spreadsheet with no duplicates.

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

  • animdl

    A highly efficient, fast, powerful and light-weight anime downloader and streamer for your favorite anime.

  • Project mention: Does anyone have anime websites recommendations without ads? | /r/anime | 2023-07-10

    If you're tech-savvy, you can try animdl, which is a command line tool. No browser, no ads. To directly stream with the default provider AllAnime: animdl stream 'Your Anime Name'

  • cinemagoer

    Cinemagoer is a Python package for retrieving and managing data from the IMDb movie database (with which we are not affiliated in any way) about movies, people, characters, and companies.

  • RedditDownloader

    Scrapes Reddit to download media of your choice.

  • Project mention: Reddit downloader that works after API upgrade | /r/DataHoarder | 2023-10-10

    Is there a functioning tool to download the saved posts / upvotes that you do on reddit? This tool: https://github.com/shadowmoose/RedditDownloader was perfect, but it got rekt by the API changes and has been discontinued.

  • finviz

    Unofficial API for finviz.com

  • Scweet

    A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...

  • GramAddict bot

    Completely free and open-source human-like Instagram bot. Powered by UIAutomator2 and compatible with basically any Android device 5.0+ that can run Instagram - real or emulated. (by GramAddict)

  • scrapyrt

    HTTP API for Scrapy spiders

  • thepipe

    Feed PDFs, URLs, Slides, YouTube, and more into Vision-Language models with one line of code⚡

  • Project mention: Web Extraction with Vision-LLMs Done the Right Way: Structured Data From Any URL with GPT-4o | dev.to | 2024-05-22

    You've successfully scraped a webpage and extracted visually complicated unstructured data from it. This process can be extended to other types of content, such as PDFs, videos, and more, using The Pipe API. For more details on GPT-4o, check out the OpenAI announcement. If you're a developer, feel free to contribute to The Pipe on GitHub!

  • bookcorpus

    Crawl BookCorpus

  • Project mention: The Internet Archive is under a DDoS attack | news.ycombinator.com | 2024-05-28

    It is apparently widely suspected that a certain "Books2" dataset mentioned by OpenAI is basically just LibGen:

    https://blusharkmedia.medium.com/the-ongoing-battle-against-...

    https://techhq.com/2023/09/can-libgen-shadow-library-survive...

    https://www.twitter.com/theshawwn/status/1320282152689336320

    https://qz.com/openai-books-piracy-microsoft-meta-google-cha...

    https://qz.com/shadow-libraries-are-at-the-heart-of-the-moun...

    https://goodereader.com/blog/e-book-news/authors-file-lawsui...

    When asked about whether this was true, they refused to answer based on confidentiality concerns, then said they had deleted all copies of the dataset, stopped using it, and no longer employed the individuals that compiled it:

    https://www.businessinsider.com/openai-destroyed-ai-training...

    We do know for a fact that the (non-OpenAI-controlled) "books3" dataset is just "all of bibliotik":

    https://www.twitter.com/theshawwn/status/1320282149329784833

    https://github.com/soskek/bookcorpus/issues/27

    And we also apparently know for a fact that this was included in the datasets used to train LLAMA:

    https://en.wikipedia.org/wiki/The_Pile_(dataset)

    https://aicopyright.substack.com/p/the-books-used-to-train-l...

    https://aicopyright.substack.com/p/has-your-book-been-used-t...

  • TikTokLive

    Python library to receive live stream events (comments, gifts, etc.) in realtime from TikTok LIVE.

  • Project mention: Can someone help me modify this project | /r/learnpython | 2023-09-22
  • twscrape

    2024! Twitter API scraper with authorization support. Allows you to scrape search results, user profiles (followers/following), tweets (favoriters/retweeters), and more.

  • URS

    Universal Reddit Scraper - A comprehensive Reddit scraping/archival command-line tool.

  • Project mention: Nitter Shutting Down | news.ycombinator.com | 2024-01-27

    If they don't want you to use their API just respect their wishes and scrape Reddit. https://github.com/JosephLai241/URS it's the only moral thing we can do.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Scraper related posts

  • Can someone walk me through this?

    1 project | /r/learnpython | 21 Nov 2023
  • What’s the coolest things you’ve done with python?

    3 projects | /r/Python | 15 Nov 2023
  • BDFR skipping Reddit hosted videos

    1 project | /r/DataHoarder | 21 Oct 2023
  • Updated Drexel Scheduler to Winter Quarter

    1 project | /r/Drexel | 19 Oct 2023
  • Show HN: New AI Dataset Based on LibGen and Sci-Hub

    2 projects | news.ycombinator.com | 8 Sep 2023
  • Exporting a telegram chat without Telegram Desktop?

    1 project | /r/DataHoarder | 17 Aug 2023
  • cryptoCMD: NEW Data - star count:456.0

    1 project | /r/algoprojects | 8 Aug 2023

Index

What are some of the best open-source Scraper projects in Python? This list will help you:

# Project Stars
1 newspaper 13,835
2 chinese-xinhua 10,727
3 Douyin_TikTok_Download_API 7,526
4 autoscraper 6,027
5 myGPTReader 4,412
6 snscrape 4,286
7 Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE 3,080
8 bulk-downloader-for-reddit 2,226
9 linkedin_scraper 1,809
10 JobFunnel 1,740
11 mlscraper 1,242
12 animdl 1,241
13 cinemagoer 1,203
14 RedditDownloader 1,106
15 finviz 1,028
16 Scweet 986
17 GramAddict bot 968
18 scrapyrt 820
19 thepipe 794
20 bookcorpus 788
21 TikTokLive 777
22 twscrape 763
23 URS 744
