Python Crawler

Open-source Python projects categorized as Crawler

Top 23 Python Crawler Projects

  • GitHub repo Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    Project mention: Konohagakure Search | dev.to | 2022-01-09

  • GitHub repo pyspider

    A Powerful Spider(Web Crawler) System in Python.

  • GitHub repo newspaper

    News, full-text, and article metadata extraction in Python 3. Advanced docs:

    Project mention: Is there a web text extraction library for reader mode written in Java/Kotlin? | reddit.com/r/webdev | 2021-12-20

    I have searched the web but the library I have found was for Python only. I need a library written in Java or Kotlin so that I could use it on Android. Is there any library for that? If you know that there is no such Java library, please let me know that so that I could stop searching.

  • GitHub repo Photon

    Incredibly fast crawler designed for OSINT. (by s0md3v)

    Project mention: How is ArchiveBox? | reddit.com/r/selfhosted | 2021-12-27

    If you need more advanced recursive spider/crawling ability beyond --depth=1, check out Browsertrix, Photon, or Scrapy and pipe the outputted URLs into ArchiveBox.
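
    The recursive crawling beyond --depth=1 mentioned above can be sketched as a depth-limited breadth-first traversal. This is a minimal standard-library sketch, not how Browsertrix, Photon, Scrapy, or ArchiveBox work internally. The names crawl, LinkParser, and SITE and the example.test URLs are hypothetical, and pages are served from an in-memory dictionary so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth):
    """Breadth-first crawl, following links at most max_depth hops from start_url.

    fetch is any callable mapping a URL to an HTML string (or None if missing).
    Returns the discovered URLs in the order they were first seen.
    """
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                order.append(absolute)
                queue.append((absolute, depth + 1))
    return order


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/c">C</a>',
    "https://example.test/b": '<a href="/a">A again</a>',
    "https://example.test/c": '<a href="/d">D</a>',
}

if __name__ == "__main__":
    # One URL per line, a form that is easy to pipe into another tool.
    for url in crawl("https://example.test/", SITE.get, max_depth=2):
        print(url)
```

    With max_depth=1 the sketch returns only the start page and the pages it links to directly; raising the limit lets it follow links found on those pages as well.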

  • GitHub repo autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    Project mention: Turn Any Website Into An API with AutoScraper and FastAPI | dev.to | 2021-04-24

    In this article, we will learn how to create a simple e-commerce search API with multiple platform support: eBay and Amazon. AutoScraper and FastAPI provide the ability to create a powerful JSON API for the data. With Playwright's help, we'll extend our scraper and avoid blocking by using ScrapingAnt's web scraping API.

  • GitHub repo toapi

    Every web site provides APIs.

  • GitHub repo instagram-scraper

    Scrapes media, likes, followers, tags, and all metadata. Inspired by instagram-php-scraper, bot (by realsirjoe)

    Project mention: Help in scraping Instagram | reddit.com/r/webscraping | 2021-10-09

    Yes: https://github.com/realsirjoe/instagram-scraper

  • GitHub repo PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

  • GitHub repo grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    Project mention: is there something like DownThemAll! that can be run without installing anything? | reddit.com/r/Archiveteam | 2022-01-16

  • GitHub repo XSRFProbe

    The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.

    Project mention: Write-Up: TryHackMe Web Fundamentals - ZTH: Obscure Web Vulns | dev.to | 2022-01-05

    This section does not have an actual challenge, but it requires you to get familiar with the xsrfprobe library. You could either install the library to your device and use xsrfprobe --help to find the right argument for generating a POC, or use the official documentation from the web. Hint: Find out which command would let you craft an actual, malicious request.

  • GitHub repo bookcorpus

    Crawl BookCorpus

    Project mention: On the Danger of Stochastic Parrots [pdf] | news.ycombinator.com | 2021-03-01

    The GPT-3 paper (section 2.2) mentions using two datasets referred to as "books1" and "books2", which are 12B and 55B byte pair encoded tokens each.

    Project Gutenberg has 3B word tokens I believe, so it seems like it could be one of them, assuming the ratio of word tokens to byte-pair tokens is something like 3:12 to 3:55.

    Another likely candidate alongside Gutenberg is libgen, apparently, and it looks like there have been successful efforts to create a similar dataset called bookcorpus (https://github.com/soskek/bookcorpus/issues/27). The discussion on that GitHub issue suggests bookcorpus is very similar to "books2", which would make Gutenberg "books1"?

    This might be why the paper is intentionally vague about the books used?
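
    The ratio estimate in the comment above can be worked through directly. This is a minimal sketch using only the figures quoted there; the 3B Project Gutenberg word count is the commenter's own estimate, not a verified number, and the name bpe_per_word is a hypothetical helper introduced for illustration.

```python
# Figures as quoted in the comment above. The Gutenberg word count is the
# commenter's estimate ("I believe"), not a verified number.
GUTENBERG_WORD_TOKENS = 3e9
BPE_TOKENS = {"books1": 12e9, "books2": 55e9}


def bpe_per_word(bpe_tokens, word_tokens):
    """Byte-pair tokens per word implied if the two counts describe the same corpus."""
    return bpe_tokens / word_tokens


if __name__ == "__main__":
    for name, bpe in BPE_TOKENS.items():
        ratio = bpe_per_word(bpe, GUTENBERG_WORD_TOKENS)
        print(f"{name}: {ratio:.1f} byte-pair tokens per word")
```

    Each ratio is what the corresponding hypothesis would require: matching Gutenberg to "books1" implies 4 byte-pair tokens per word, while matching it to "books2" implies roughly 18.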

  • GitHub repo dotcommon

    What do people have in their dotfiles?

  • GitHub repo freshonions-torscraper

    Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion

    Project mention: How do I explore sites that aren't in lists? | reddit.com/r/TOR | 2021-08-11

  • GitHub repo trafilatura

    Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

    Project mention: Most universal way to grab relevant webpage content | reddit.com/r/webscraping | 2021-11-30

    Check out this Python package: https://github.com/adbar/trafilatura. It seems to do what you are looking for. You may need to use an automated browser like Selenium to first load the content of the page if you want it to work on all sites, even those that use JS.

  • GitHub repo mlscraper

    🤖 Scrape data from HTML websites automatically with Machine Learning

    Project mention: mlscraper: Scrape data from HTML pages automatically with Machine Learning | news.ycombinator.com | 2021-07-05

  • GitHub repo google-play-scraper

    Google Play scraper for Python inspired by <facundoolano/google-play-scraper> (by JoMingyu)

    Project mention: Some CN players are conducting data analysis on Genshin’s reviews on Google Play | reddit.com/r/Genshin_Impact | 2021-09-30

  • GitHub repo spidy Web Crawler

    The simple, easy to use command line web crawler.

  • GitHub repo Moodle-Downloader-2

    A Moodle downloader that downloads course content fast from Moodle (e.g. lecture PDFs)

    Project mention: Losing participation in an ONLINE ONLY CLASS. Why do professors act like their class is the only one? I expected this behavior in undergrad but not in grad school... | reddit.com/r/GradSchool | 2021-03-01

    That sounds terrible. From reading what you’ve written, the following may not be something you can do on your own, but maybe you have a technically inclined friend that can help you with it. There is a Linux based script that can be configured to automatically log on to Moodle and check it at specific intervals (and download the content). I suspect that running this script twice a day will help you ace those stats. https://github.com/C0D3D3V/Moodle-Downloader-2/issues

  • GitHub repo bitextor

    Bitextor generates translation memories from multilingual websites

    Project mention: Bitextor: a tool for generating translation memories from multilingual websites | reddit.com/r/LanguageTechnology | 2021-04-20

  • GitHub repo TorCrawl.py

    Crawl and extract (regular or onion) webpages through TOR network

  • GitHub repo Python Testing Crawler

    A crawler for automated functional testing of a web application

  • GitHub repo nebula-crawler

    🌌 A libp2p DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.

    Project mention: Nebula - A libp2p DHT crawler. | reddit.com/r/libp2p | 2021-07-17

  • GitHub repo deepweb-scappering

    Discover hidden deepweb pages

    Project mention: DeepWeb Scapper: Descubre las páginas ocultas de #deepweb 👀 | reddit.com/r/u_esgeeks | 2021-04-07

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-16.

What are some of the best open-source Crawler projects in Python? This list will help you:

#   Project                  Stars
1   Scrapy                   42,525
2   pyspider                 15,274
3   newspaper                11,599
4   Photon                    8,388
5   autoscraper               4,184
6   toapi                     3,216
7   instagram-scraper         2,239
8   PSpider                   1,618
9   grab-site                   803
10  XSRFProbe                   722
11  bookcorpus                  516
12  dotcommon                   509
13  freshonions-torscraper      392
14  trafilatura                 359
15  mlscraper                   353
16  google-play-scraper         284
17  spidy Web Crawler           276
18  Moodle-Downloader-2         210
19  bitextor                    199
20  TorCrawl.py                  82
21  Python Testing Crawler       68
22  nebula-crawler               65
23  deepweb-scappering           63