Open-source projects categorized as Webscraping

Top 23 Webscraping Open-Source Projects

  • Huginn

    Create agents that monitor and act on your behalf. Your agents are standing by!

    Project mention: Organice: An implementation of Org mode without the dependency of Emacs | news.ycombinator.com | 2022-05-26

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Scrapping - How to deal with page changes Ai | reddit.com/r/webscraping | 2022-03-25

    It depends on the website, but autoscraper was used to calculate similar nodes given the text to search. Not sure how it works now but it's open source.

  • browser-fingerprinting

    Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot systems 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

    Project mention: Hey! Another FREE guide for you! | reddit.com/r/proxies | 2022-04-18

    So here is a link to the original, true GitHub repo, so at least the author might get reimbursed via his affiliate links and get rightfully paid rather than just plagiarized by other proxy blogs - https://github.com/niespodd/browser-fingerprinting

  • soup

    Web Scraper in Go, similar to BeautifulSoup

    Project mention: Web Scraping in Golang | dev.to | 2022-01-17

    Web scraping is a handy tool to have in a data scientist's skill set. It can be useful in a variety of situations to gather data, such as when a website does not provide an API. We will be using the Go package github.com/anaskhan96/soup, which works much like Python's BeautifulSoup. This is the webpage we are going to be scraping.

  • gazpacho

    🥫 The simple, fast, and modern web scraping library

  • xidel

    Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

    Project mention: How to make http request with curl on certain page after being authenticated? | reddit.com/r/commandline | 2021-10-14

    I built Xidel for such authenticated requests:

  • NYTimes-App

    🗽 A Simple Demonstration of the New York Times App 📱 using Jsoup web crawler with MVVM Architecture 🔥

    Project mention: NY Times app on the origin? | reddit.com/r/mobiscribe | 2021-11-21

    Darn, it appears that it doesn't work. However, I looked to see if there was a third-party app that works, and it appears that the following one works well: https://github.com/TheCodeMonks/NYTimes-App

  • morph

    Take the hassle out of web scraping (by openaustralia)

    Project mention: Web Scraping Open Knowledge | news.ycombinator.com | 2022-05-27

    This is the one I know about: https://morph.io/ and https://github.com/openaustralia/morph#readme (AGPLv3) -- they used to be at the intersection of "heroku for scrapers" and DoltHub (e.g. https://www.dolthub.com/repositories/dolthub/us-businesses/d...) since the scrapers would run but then make their data available as CSV or sqlite or whatever. But, when I just tried to load one of the morph.io scrapers, the page just said "creating new template" so I'm guessing they've gone the way of the ScraperWiki.com that preceded them: turns out, hosted compute for free isn't free

  • instascrape

    Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

    Project mention: Question about Instagram scraping problem for Thesis (Too big size of data to scrape) | reddit.com/r/learnpython | 2021-11-09

    Link to the open source package I used: https://github.com/chris-greening/instascrape

  • Rcrawler

    An R web crawler and scraper

  • r-web-scraping-cheat-sheet

    Guide, reference and cheatsheet on web scraping using rvest, httr and RSelenium.

  • TikTokBot

    A TikTokBot that downloads trending TikTok videos and compiles them using FFmpeg

  • ebayScraper

    Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and export to Excel

    Project mention: I wrote a python program for scraping Ebay to find a cheap used espresso machines under $200. | reddit.com/r/Python | 2021-12-11

    If you ever want to expand on this project more, you might enjoy looking at my implementation of an eBay scraper I made last year: https://github.com/driscoll42/ebayMarketAnalyzer You can see the code I used to specify a search to scrape eBay, instead of needing to supply the specific search URL; it also filters based on price. The main issue you'll run into sooner or later is the CAPTCHAs eBay added earlier this year.

  • CoWin-Vaccine-Notifier

    Automated Python script to retrieve vaccine slot availability and get notified when a slot is available.

  • Cascadia.jl

    A CSS Selector library in Julia

    Project mention: I Need to Convert HTML Files to CSV | reddit.com/r/Julia | 2022-05-12

    https://github.com/Algocircle/Cascadia.jl is a Julia library for CSS-style queries on Gumbo.jl-parsed HTML.

  • zimit

    Make a ZIM file from any Web site and surf offline!

    Project mention: Reading from the web offline and distraction-free | news.ycombinator.com | 2021-10-10

    which worked quite well for most sites, but still very far from a general-purpose solution.

    There is also a more powerful, general-purpose scraper that generates a ZIM file here: https://github.com/openzim/zimit

    It would be really nice to have a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)

  • newspaperjs

    News extraction and scraping. Article Parsing

    Project mention: Creating news section for site | reddit.com/r/HTML | 2022-01-03

    This doesn't have much to do with HTML, but you should consider newspaperJS if iframe doesn't work.

  • web_check

    Script for checking changes in webpages

    Project mention: Help me automate a boring task. [Print TO HTML] | reddit.com/r/learnpython | 2021-10-10

    Sure, in this project https://github.com/Jaime-alv/web_check. Look at checker.py inside web_check folder, line 37 onwards.

  • htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line

    Project mention: How does Firefox's Reader View work? | news.ycombinator.com | 2022-03-30

  • extractnet

    A Dragnet that also extracts author, headline, date, and keywords from context

  • newsemble

    API for fetching data from news websites.

    Project mention: Newsemble: An API to fetch current news data | reddit.com/r/Python | 2021-07-18

    I read through the documentation and tinkered around with it -- great work! One recommendation I would make, particularly if you're hoping that this will be useful long-term for NLP, is not to delete the previously scraped data. For instance, http://www.newsemble.ml/news only contains 129 results, which is nowhere near comprehensive enough to ensure any kind of statistically significant NLP.

  • redditsfinder

    Archive a reddit user's post history. Formatted overview of a profile, JSON containing every post, and picture downloads. Uses the pushshift API.

    Project mention: Is it possible to fetch entire history of comments by a user? | reddit.com/r/redditdev | 2022-03-05

  • iSubRip

    A Python package for scraping and downloading subtitles from iTunes movie pages.

    Project mention: iSubRip: A Python package for scraping and downloading subtitles from iTunes movie pages | reddit.com/r/trackers | 2022-03-27

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-05-27.

What are some of the best open-source Webscraping projects? This list will help you:

#   Project                      Stars
1   Huginn                      35,634
2   autoscraper                  4,367
3   browser-fingerprinting       2,907
4   soup                         1,798
5   gazpacho                       641
6   xidel                          471
7   NYTimes-App                    456
8   morph                          448
9   instascrape                    446
10  Rcrawler                       309
11  r-web-scraping-cheat-sheet     304
12  TikTokBot                      219
13  ebayScraper                    108
14  CoWin-Vaccine-Notifier         100
15  Cascadia.jl                     95
16  zimit                           90
17  newspaperjs                     57
18  web_check                       52
19  htmldate                        47
20  extractnet                      46
21  newsemble                       42
22  redditsfinder                   32
23  iSubRip                         24