Top 23 Webscraping Open-Source Projects
-
Huginn
Project mention: Organice: An implementation of Org mode without the dependency of Emacs | news.ycombinator.com | 2022-05-26
-
autoscraper
Project mention: Scrapping - How to deal with page changes Ai | reddit.com/r/webscraping | 2022-03-25
It depends on the website, but autoscraper was used to calculate similar nodes given the text to search. Not sure how it works now but it's open source.
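The idea of finding "similar nodes given the text to search" can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not autoscraper's actual algorithm or API: it assumes that "similar" means "sharing the same chain of tags and class attributes", and the names `PathCollector`, `similar_texts`, and `SAMPLE_HTML` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record every text node together with its chain of (tag, class) ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, class) pairs
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class") or ""
        self.stack.append((tag, css_class))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def similar_texts(html, sample):
    """Return the text of every node whose path matches a node containing `sample`."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    wanted_paths = {path for path, text in parser.nodes if text == sample}
    return [text for path, text in parser.nodes if path in wanted_paths]


# Hypothetical page fragment used only for illustration.
SAMPLE_HTML = (
    '<ul>'
    '<li class="item"><a class="title">Alpha</a><span class="price">$1</span></li>'
    '<li class="item"><a class="title">Beta</a><span class="price">$2</span></li>'
    '</ul>'
)

if __name__ == "__main__":
    print(similar_texts(SAMPLE_HTML, "Alpha"))
```

Given one known title, the sketch returns the other titles and skips the prices, because the prices sit under a different tag and class. A page redesign that renames classes would break the match, which is the page-change problem the thread asks about.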
-
browser-fingerprinting
Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?
So here is a link to the original, true GitHub repo, so at least the author might get reimbursed via his affiliate links and get rightfully paid rather than just plagiarized by other proxy blogs - https://github.com/niespodd/browser-fingerprinting
-
soup
Web scraping is a handy tool to have in a data scientist's skill set. It is useful for gathering data in a variety of situations, such as when a website does not provide an API. We will be using the Go package github.com/anaskhan96/soup, which works much like Python's BeautifulSoup. This is the webpage we are going to be scraping.
-
xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Project mention: How to make http request with curl on certain page after being authenticated? | reddit.com/r/commandline | 2021-10-14
I built Xidel for such authenticated requests:
-
NYTimes-App
🗽 A Simple Demonstration of the New York Times App 📱 using Jsoup web crawler with MVVM Architecture 🔥
Darn, it appears that it doesn't work. However, I looked to see if there was a third-party app that works, and it appears that the following one works well: https://github.com/TheCodeMonks/NYTimes-App
-
morph
This is the one I know about: https://morph.io/ and https://github.com/openaustralia/morph#readme (AGPLv3) -- they used to be at the intersection of "heroku for scrapers" and DoltHub (e.g. https://www.dolthub.com/repositories/dolthub/us-businesses/d...) since the scrapers would run but then make their data available as CSV or sqlite or whatever. But, when I just tried to load one of the morph.io scrapers, the page just said "creating new template" so I'm guessing they've gone the way of the ScraperWiki.com that preceded them: turns out, hosted compute for free isn't free
-
instascrape
Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
Project mention: Question about Instagram scraping problem for Thesis (Too big size of data to scrape) | reddit.com/r/learnpython | 2021-11-09
Link to the open source package I used: https://github.com/chris-greening/instascrape
-
r-web-scraping-cheat-sheet
Guide, reference and cheatsheet on web scraping using rvest, httr and Rselenium.
-
ebayScraper
Scrape all eBay sold listings to determine average/median pricing, plot listings over time with trend lines, and export to Excel
Project mention: I wrote a python program for scraping Ebay to find a cheap used espresso machines under $200. | reddit.com/r/Python | 2021-12-11
If you ever want to expand on this project more, you might enjoy looking at my implementation of an eBay scraper I made last year: https://github.com/driscoll42/ebayMarketAnalyzer. You can see the code I used to specify a search to scrape eBay for, instead of needing to put in the specific search URL; it also filters based on price. The main issue you'll run into sooner or later is the CAPTCHAs eBay added earlier this year.
-
CoWin-Vaccine-Notifier
Automated Python Script to retrieve vaccine slots availability and get notified when a slot is available.
-
Cascadia.jl
https://github.com/Algocircle/Cascadia.jl is a Julia library for CSS-style queries on Gumbo.jl-parsed HTML.
-
zimit
Project mention: Reading from the web offline and distraction-free | news.ycombinator.com | 2021-10-10
which worked quite well for most sites, but still very far from a general-purpose solution.
There is also a more powerful, general-purpose scraper that generates a ZIM file here: https://github.com/openzim/zimit
It would be really nice to have a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)
-
newspaperjs
This doesn't have much to do with HTML, but you should consider newspaperJS if iframe doesn't work.
-
web_check
Project mention: Help me automate a boring task. [Print TO HTML] | reddit.com/r/learnpython | 2021-10-10
Sure, in this project https://github.com/Jaime-alv/web_check. Look at checker.py inside web_check folder, line 37 onwards.
-
newsemble
I read through the documentation and tinkered around with it -- great work! One recommendation I would make, particularly if you're hoping that this will be useful long-term for NLP, is not to delete the previously scraped data. For instance, http://www.newsemble.ml/news only contains 129 results, which is nowhere near comprehensive enough to ensure any kind of statistically significant NLP.
-
redditsfinder
Archive a reddit user's post history. Formatted overview of a profile, JSON containing every post, and picture downloads. Uses the pushshift API.
Project mention: Is it possible to fetch entire history of comments by a user? | reddit.com/r/redditdev | 2022-03-05
-
iSubRip
Project mention: iSubRip: A Python package for scraping and downloading subtitles from iTunes movie pages | reddit.com/r/trackers | 2022-03-27
Index
What are some of the best open-source Webscraping projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Huginn | 35,634 |
2 | autoscraper | 4,367 |
3 | browser-fingerprinting | 2,907 |
4 | soup | 1,798 |
5 | gazpacho | 641 |
6 | xidel | 471 |
7 | NYTimes-App | 456 |
8 | morph | 448 |
9 | instascrape | 446 |
10 | Rcrawler | 309 |
11 | r-web-scraping-cheat-sheet | 304 |
12 | TikTokBot | 219 |
13 | ebayScraper | 108 |
14 | CoWin-Vaccine-Notifier | 100 |
15 | Cascadia.jl | 95 |
16 | zimit | 90 |
17 | newspaperjs | 57 |
18 | web_check | 52 |
19 | htmldate | 47 |
20 | extractnet | 46 |
21 | newsemble | 42 |
22 | redditsfinder | 32 |
23 | iSubRip | 24 |