# Crawler

Open-source projects categorized as Crawler

## Top 23 Crawler Open-Source Projects

  • GitHub repo Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Why is Python popular despite being accused of being slow? | reddit.com/r/programming | 2021-04-16

    I use it regularly for things like web scraping (Scrapy is a joy) and data manipulation. For instance, I just wrote some fairly complicated scripts for address matching, pairing up a couple of UK datasets without a common identity field. Human-entered addresses are decidedly fuzzy, so you end up with a lot of arbitrary rules, and Python is just fast to develop against. I don't really care if the script takes a couple of hours to run on the full datasets (35 million addresses), as opposed to half that time in something else that's more of a pain to tweak.
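
    The mention above is about why Python fits this kind of work; Scrapy itself needs very little code to get going. A minimal spider sketch, modeled on Scrapy's own tutorial (the site and CSS selectors are illustrative, not from the post):

    ```python
    # A minimal Scrapy spider; run it with: scrapy runspider quotes_spider.py
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one item per quote block found on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the pagination link, reusing this callback.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
    ```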

  • GitHub repo pyspider

    A powerful spider (web crawler) system in Python.

  • GitHub repo annie

    👾 Fast, simple and clean video downloader

    Project mention: Question about annie | reddit.com/r/youtubedl | 2021-03-03

    https://github.com/iawia002/annie#download-a-video

  • GitHub repo colly

    Elegant Scraper and Crawler Framework for Golang

    Project mention: Any decent web crawler? | reddit.com/r/rust | 2021-03-02

    I love the language, but it seems to lack some of the libraries I need. A good example is this one in Go: github.com/gocolly/colly

  • GitHub repo newspaper

    News, full-text, and article metadata extraction in Python 3.

    Project mention: Scrapers for replacing RSS 2.0 articles with their full source articles | reddit.com/r/rss | 2021-04-19

    These Python scrapers take raw RSS 2.0 XML data as input and replace the description elements with full source articles parsed/fetched via Newspaper (first scraper) or Article-Parser (second scraper).
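
    For reference, the core Newspaper workflow those scrapers build on is just download-then-parse. A minimal sketch (the URL is a placeholder):

    ```python
    # Minimal newspaper3k usage: fetch a story and extract its full text.
    from newspaper import Article

    article = Article("https://example.com/some-news-story")  # placeholder URL
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, body text, etc.

    print(article.title)
    print(article.text[:500])  # first 500 characters of the article body
    ```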

  • GitHub repo webmagic

    A scalable web crawler framework for Java.

  • GitHub repo Pholcus

    Pholcus is a distributed, high-concurrency crawler written in pure Go.

  • GitHub repo Ferret

    Declarative web scraping

  • GitHub repo autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Do I really need to customize a scaper for every input page URL? | reddit.com/r/webscraping | 2021-01-04

    Maybe you could try out this tool (never used it myself): https://github.com/alirezamika/autoscraper. Or you could train a neural network that does the parsing for you, but the accuracy won't be 100% with this method: https://github.com/scrapy/scrapely. To answer your question: there is no way to write one program that scrapes different pages correctly, because their structure is not the same, and some of them will probably have protection that blocks your scraper, so that would need extra attention or additional code.
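
    autoscraper's trick is learning extraction rules from an example of the data you want. A minimal sketch, following the example in its README:

    ```python
    # autoscraper infers scraping rules from one sample value.
    from autoscraper import AutoScraper

    url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"
    # A sample of the data we want from this page; the rules are inferred.
    wanted_list = ["What are metaclasses in Python?"]

    scraper = AutoScraper()
    print(scraper.build(url, wanted_list))

    # The learned rules generalize to structurally similar pages.
    print(scraper.get_result_similar(
        "https://stackoverflow.com/questions/606191/convert-bytes-to-a-string"
    ))
    ```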

  • GitHub repo toapi

    Every web site provides APIs.

  • GitHub repo Rendora

    Dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern JavaScript websites.

  • GitHub repo RED_HAWK

    All-in-one tool for information gathering, vulnerability scanning, and crawling. A must-have tool for all penetration testers.

    Project mention: Top 10 Information Gathering Tools in Termux | dev.to | 2021-03-24

  • GitHub repo PSpider

    A simple, easy-to-use Python crawler framework (QQ discussion group: 597510560).

  • GitHub repo work_crawler

    Download comics and novels to EPUB (小说漫画下载工具 / 小説漫画のダウンローダ / 小說漫畫下載). Supported sources include: 腾讯漫画 大角虫漫画 有妖气 知音漫客 咪咕 SF漫画 哦漫画 看漫画 漫画柜 汗汗酷漫 動漫伊甸園 快看漫画 微博动漫 733动漫网 大古漫画网 漫画DB 無限動漫 動漫狂 卡推漫画 动漫之家 动漫屋 古风漫画网 36漫画网 亲亲漫画网 乙女漫画 comico webtoons 咚漫 ニコニコ静画 ComicWalker ヤングエースUP モアイ pixivコミック サイコミ アルファポリス カクヨム ハーメルン 小説家になろう 起点中文网 八一中文网 顶点小说 落霞小说网 努努书坊 笔趣阁.

    Project mention: What's my best option for archiving every webtoon from webtoons.com? | reddit.com/r/DataHoarder | 2021-04-03

  • GitHub repo Geziyor

    Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.

  • GitHub repo SwiftLinkPreview

    It makes a preview from a URL, grabbing all the information such as title, relevant text, and images.

  • GitHub repo Wombat

    Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

  • GitHub repo fscrawler

    Elasticsearch File System Crawler (FS Crawler)

    Project mention: How to index PDF files with App Search | reddit.com/r/elasticsearch | 2021-01-09

    Have you given FSCrawler a shot? It's also a file system crawler that automatically ingests files into an Elasticsearch index, and it's built on top of Tika, so it typically performs the same (multiple file formats, multiple languages, ...). It can also perform OCR while indexing (using Tesseract), toggled on in its config (I didn't try the OCR much, for lack of need and because of the ridiculous amount of time it added to the indexing process on my small environment).
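
    FSCrawler itself is configured through a YAML job file and runs as a separate process; once it has ingested your files, you query the resulting index like any other. A hedged sketch using the official Python client (the index name "mydocs" is an assumption; by default FSCrawler names the index after the crawl job):

    ```python
    # Hedged sketch: searching documents FSCrawler has ingested into Elasticsearch.
    # "mydocs" is an assumed index name -- FSCrawler defaults to the job name.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # FSCrawler stores the Tika-extracted text in the "content" field.
    resp = es.search(index="mydocs", query={"match": {"content": "invoice"}})

    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("file", {}).get("filename"))
    ```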

  • GitHub repo Crawler

    A high performance web crawler in Elixir.

  • GitHub repo grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    Project mention: #grab-site: Web crawler designed to make backup copies of websites | reddit.com/r/u_esgeeks | 2021-02-24

  • GitHub repo bookcorpus

    Crawl BookCorpus

    Project mention: On the Danger of Stochastic Parrots [pdf] | news.ycombinator.com | 2021-03-01

    The GPT-3 paper (section 2.2) mentions using two datasets referred to as "books1" and "books2", which are 12B and 55B byte-pair-encoded tokens, respectively.

    Project Gutenberg has 3B word tokens, I believe, so it seems like it could be one of them, assuming the ratio of word tokens to byte-pair tokens is something like 3:12 to 3:55.

    Another likely candidate alongside Gutenberg is libgen, apparently, and it looks like there have been successful efforts to create a similar dataset called bookcorpus: https://github.com/soskek/bookcorpus/issues/27. The discussion on that GitHub issue suggests bookcorpus is very similar to "books2", which would make Gutenberg "books1"?

    This might be why the paper is intentionally vague about the books used?
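
    Spelling out the arithmetic behind the comment's ratio argument (all figures are the ones quoted above):

    ```python
    # Ratios implied by the figures quoted above (GPT-3 paper, section 2.2).
    gutenberg_word_tokens = 3e9  # ~3B word tokens in Project Gutenberg
    books1_bpe_tokens = 12e9     # "books1": 12B byte-pair-encoded tokens
    books2_bpe_tokens = 55e9     # "books2": 55B byte-pair-encoded tokens

    # Byte-pair tokens per word token, if Gutenberg were the whole dataset:
    print(books1_bpe_tokens / gutenberg_word_tokens)  # 4.0   (the 3:12 ratio)
    print(books2_bpe_tokens / gutenberg_word_tokens)  # ~18.3 (the 3:55 ratio)
    ```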

  • GitHub repo Crawly

    Crawly, a high-level web crawling & scraping framework for Elixir.

  • GitHub repo dotcommon

    What do people have in their dotfiles?

    Project mention: Key mappings everyone uses? | reddit.com/r/vim | 2020-12-30

    That's the repository, but it doesn't crawl for mappings. It's simple to modify, though, so maybe you can add the pattern to look for: https://github.com/Kharacternyk/dotcommon#neovim
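
    If you do want to add that, here is a hypothetical sketch of what scanning a dotfile for Vim mappings could look like (the regex and path are illustrative, not dotcommon's actual code):

    ```python
    # Hypothetical sketch: scan a vimrc for key mappings.
    # Not dotcommon's code; the regex and path are illustrative only.
    import re
    from pathlib import Path

    # Matches lines like "nnoremap <leader>w :w<CR>" or "map <C-n> :NERDTreeToggle<CR>".
    MAPPING_RE = re.compile(r"^\s*([nvxsoic]?(?:nore)?map)\s+(\S+)\s+(.+)$")

    def find_mappings(path: Path):
        """Yield (map_command, keys, rhs) for each mapping line in the file."""
        for line in path.read_text(errors="ignore").splitlines():
            m = MAPPING_RE.match(line)
            if m:
                yield m.group(1), m.group(2), m.group(3)

    vimrc = Path.home() / ".vimrc"
    if vimrc.exists():
        for cmd, keys, rhs in find_mappings(vimrc):
            print(f"{cmd} {keys} -> {rhs}")
    ```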

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates how often a repo was mentioned in the last 12 months, or since we started tracking (Dec 2020). The latest post mention was on 2021-04-19.

## Index

What are some of the best open-source Crawler projects? This list will help you:

| #  | Project          | Stars  |
|----|------------------|--------|
| 1  | Scrapy           | 40,335 |
| 2  | pyspider         | 14,954 |
| 3  | annie            | 14,534 |
| 4  | colly            | 13,668 |
| 5  | newspaper        | 10,925 |
| 6  | webmagic         | 9,734  |
| 7  | Pholcus          | 6,767  |
| 8  | Ferret           | 4,455  |
| 9  | autoscraper      | 3,453  |
| 10 | toapi            | 3,125  |
| 11 | Rendora          | 1,797  |
| 12 | RED_HAWK         | 1,660  |
| 13 | PSpider          | 1,553  |
| 14 | work_crawler     | 1,271  |
| 15 | Geziyor          | 1,271  |
| 16 | SwiftLinkPreview | 1,225  |
| 17 | Wombat           | 1,221  |
| 18 | fscrawler        | 924    |
| 19 | Crawler          | 784    |
| 20 | grab-site        | 690    |
| 21 | bookcorpus       | 453    |
| 22 | Crawly           | 451    |
| 23 | dotcommon        | 429    |