Top 23 Crawler Open-Source Projects
Scrapy, a fast high-level web crawling & scraping framework for Python.
Project mention: Why is Python popular despite being accused of being slow? | reddit.com/r/programming | 2021-04-16
I use it regularly for things like web scraping (Scrapy is a joy) and data manipulation. For instance just wrote some fairly complicated scripts for doing address matching to pair up a couple of UK datasets without a common identity field. Human-entered addresses are decidedly fuzzy so you end up with a lot of arbitrary rules and Python is just fast to develop against. I don't really care if the script takes a couple of hours to run on the full datasets (35 million addresses) as opposed to half that time in something else more of a pain to tweak around with.
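To make the "Scrapy is a joy" point concrete, here is a minimal spider sketch. The target site and CSS selectors are the ones from Scrapy's own tutorial demo (quotes.toscrape.com); swap in your own URL and selectors:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal Scrapy spider: scrape a listing page and follow pagination."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # demo site from the Scrapy tutorial

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next" link until pagination runs out
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` and Scrapy handles request scheduling, deduplication and throttling for you.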
A Powerful Spider(Web Crawler) System in Python.
👾 Fast, simple and clean video downloader.
Project mention: Question about annie | reddit.com/r/youtubedl | 2021-03-03
Elegant Scraper and Crawler Framework for Golang.
Project mention: Any decent web crawler? | reddit.com/r/rust | 2021-03-02
I love the language, but it seems to lack some of the libraries I need. A good example is this one in Go: github.com/gocolly/colly
News, full-text, and article metadata extraction in Python 3.
Project mention: Scrapers for replacing RSS 2.0 articles with their full source articles | reddit.com/r/rss | 2021-04-19
These Python scrapers take raw RSS 2.0 XML data as input and replace each item's `description` with the full source article, fetched and parsed via Newspaper (first scraper) or Article-Parser (second scraper).
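For reference, the Newspaper side of that is only a few lines; the URL below is a placeholder:

```python
from newspaper import Article

# Download and parse a single article; the URL is a placeholder
url = "https://example.com/some-news-story"
article = Article(url)
article.download()
article.parse()

print(article.title)    # extracted headline
print(article.authors)  # best-effort author list
print(article.text)     # full plain-text body, suitable for an RSS description

# Optional NLP pass (requires nltk's punkt tokenizer) for keywords/summary
article.nlp()
print(article.keywords)
```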
A scalable web crawler framework for Java.
Pholcus is a distributed, high-concurrency crawler written in pure Go.
Declarative web scraping
A Smart, Automatic, Fast and Lightweight Web Scraper for Python.
Project mention: Do I really need to customize a scraper for every input page URL? | reddit.com/r/webscraping | 2021-01-04
Maybe you could try out this tool (never used it myself): https://github.com/alirezamika/autoscraper. Or you could train a neural network that does the parsing for you, though the accuracy won't be 100% with this method: https://github.com/scrapy/scrapely. To answer your question: there is no way to write one program that scrapes different pages correctly, because their structure is not the same, and some of them will probably have protection that blocks your scraper, so that would need extra attention or additional code.
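The AutoScraper suggestion above works by example rather than by selectors: you hand it a page plus sample values you want from it, and it infers extraction rules it can reapply to structurally similar pages. A minimal sketch, following the pattern in the project's README (the URLs and wanted values are just examples):

```python
from autoscraper import AutoScraper

# Teach the scraper with one example page and a sample value visible on it
url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

# Reuse the learned rules on a structurally similar page
similar = scraper.get_result_similar(
    "https://stackoverflow.com/questions/606191/convert-bytes-to-a-string"
)
print(similar)
```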
Every web site provides APIs.
All-in-one tool for Information Gathering, Vulnerability Scanning and Crawling. A must-have tool for all penetration testers.
Project mention: Top 10 Information Gathering Tools in Termux | dev.to | 2021-03-24
Download comics and novels to EPUB (a comic/novel downloader). Supported comic sites include: 腾讯漫画, 大角虫漫画, 有妖气, 知音漫客, 咪咕, SF漫画, 哦漫画, 看漫画, 漫画柜, 汗汗酷漫, 動漫伊甸園, 快看漫画, 微博动漫, 733动漫网, 大古漫画网, 漫画DB, 無限動漫, 動漫狂, 卡推漫画, 动漫之家, 动漫屋, 古风漫画网, 36漫画网, 亲亲漫画网, 乙女漫画, comico, webtoons, 咚漫, ニコニコ静画, ComicWalker, ヤングエースUP, モアイ, pixivコミック, サイコミ; novel sites include: アルファポリス, カクヨム, ハーメルン, 小説家になろう, 起点中文网, 八一中文网, 顶点小说, 落霞小说网, 努努书坊, 笔趣阁.
Project mention: What's my best option for archiving every webtoon from webtoons.com? | reddit.com/r/DataHoarder | 2021-04-03
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
It makes a preview from a URL, grabbing information such as the title, relevant text and images.
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Elasticsearch File System Crawler (FS Crawler).
Project mention: How to index PDF files with App Search | reddit.com/r/elasticsearch | 2021-01-09
Did you give FSCrawler a shot, sir? It's also a file system crawler that automatically ingests files into an Elasticsearch index, and it is built on top of Tika, so it performs much the same (multiple file formats, multiple languages, ...). It can also perform OCR while indexing (using Tesseract) just by toggling it in its config (I didn't try the OCR much, for lack of need and because of the ridiculous amount of time it added to the indexing process on my small environment).
A high performance web crawler in Elixir.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns.
Project mention: #grab-site: A web crawler designed to back up websites | reddit.com/r/u_esgeeks | 2021-02-24
Crawl BookCorpus.
Project mention: On the Dangers of Stochastic Parrots [pdf] | news.ycombinator.com | 2021-03-01
The GPT-3 paper (section 2.2) mentions using two datasets referred to as "books1" and "books2", which are 12B and 55B byte-pair-encoded tokens, respectively.
Project Gutenberg has about 3B word tokens, I believe, so it could be one of them, assuming the ratio of word tokens to byte-pair tokens is somewhere between 3:12 and 3:55.
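A quick back-of-the-envelope on that ratio, using only the figures quoted above:

```python
# Sizes quoted above: books1/books2 from the GPT-3 paper,
# Gutenberg word count as estimated in the comment.
gutenberg_words = 3e9
books1_bpe = 12e9
books2_bpe = 55e9

# Implied BPE tokens per word if Gutenberg were books1 vs. books2
print(books1_bpe / gutenberg_words)  # 4.0
print(books2_bpe / gutenberg_words)  # ~18.3
```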
Another likely candidate alongside Gutenberg is libgen, apparently, and it looks like there have been successful efforts to create a similar dataset called bookcorpus: https://github.com/soskek/bookcorpus/issues/27. The discussion on that GitHub issue suggests bookcorpus is very similar to "books2", which would make Gutenberg "books1"?
This might be why the paper is intentionally vague about the books used?
Crawly, a high-level web crawling & scraping framework for Elixir.
What do people have in their dotfiles?
Project mention: Key mappings everyone uses? | reddit.com/r/vim | 2020-12-30
That's the repository, but it doesn't crawl for mappings. It's simple to modify, though, so maybe you can add the pattern to look for: https://github.com/Kharacternyk/dotcommon#neovim
What are some of the best open-source Crawler projects? This list should help you find them.