Top 13 Python Crawling Projects
Scrapy, a fast high-level web crawling & scraping framework for Python. Project mention: Show HN: SiteGPT – Create ChatGPT-like chatbots trained on your website content | news.ycombinator.com | 2023-04-01
Not to go full "Dropbox in a weekend", but if you're technical enough to self-host, this is something you can build for yourself.
Everyone is going straight to embeddings, but it'd be easy enough to use old-school NLP summarization from NLTK (https://www.nltk.org/).
Hook that up to a web scraping library like https://scrapy.org/ and get a summary of each page.
Then embed a site map in your system prompt and use langchain (https://github.com/hwchase17/langchain) to allow GPT to query for a specific page's summary.
The point of this isn't to say that's how OP did it, but there might be people seeing stuff like this and wondering how on earth to get into it: this is something you could build in a weekend with pretty much no understanding of AI.
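To make the summarization step above concrete, here is a minimal frequency-based extractive summarizer in plain Python. It is a sketch of the "old-school NLP" idea only: NLTK's `sent_tokenize` and stopword lists would do the splitting and filtering more robustly, and the sample text stands in for whatever page content the crawler fetched.

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Frequency-based extractive summary: score each sentence by the
    corpus frequency of its words, then keep the top n in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # Take the n highest-scoring sentences, then restore document order.
    top = sorted(sorted(scored, reverse=True)[:n_sentences], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)

page = ("Scrapy crawls sites fast. Scrapy exports structured data. "
        "Cats are nice. Scrapy scales to large crawls.")
print(summarize(page))
```

Feed each crawled page through something like this, keep the summaries keyed by URL, and the site-map-in-the-system-prompt trick described above has its data.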
News, full-text, and article metadata extraction in Python 3. Advanced docs: Project mention: Gathering News Headlines | reddit.com/r/Automate | 2022-08-08
Web Scraping Framework
🤖 Scrape data from HTML websites automatically by just providing examples. Project mention: Experimental library for scraping websites using OpenAI's GPT API | news.ycombinator.com | 2023-03-25
Why GPT-based then? There are libraries that do this: You give examples, they generate the rules for you and give you a scraper object that takes any html and returns the scraped data.
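A toy sketch of the rule-inference idea those libraries are built on: learn the tag path at which an example value appears, then extract every text node sitting at the same path. Real libraries such as autoscraper and mlscraper generalize far beyond this (attributes, fuzzy matching, multiple examples); the HTML and names here are illustrative only.

```python
from html.parser import HTMLParser

class PathRecorder(HTMLParser):
    """Record (tag-path, text) pairs for every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append((tuple(self.stack), text))

def learn_rule(html, example):
    """'Training': find the tag path where the example value lives."""
    p = PathRecorder()
    p.feed(html)
    for path, text in p.texts:
        if text == example:
            return path
    return None

def apply_rule(html, path):
    """'Scraping': return every text node at the learned path."""
    p = PathRecorder()
    p.feed(html)
    return [t for pth, t in p.texts if pth == path]

train = "<ul><li><span>Alice</span></li><li><span>Bob</span></li></ul>"
rule = learn_rule(train, "Alice")     # one example in
print(apply_rule(train, rule))        # all matching values out
```

The payoff is the same as with the GPT-based approach, but the learned rule is deterministic and costs nothing to re-run against fresh HTML.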
HTTP API for Scrapy spiders. Project mention: New to python and scrapy stuff but need this project to work so that I can do my data research and stuff easily in the future. | reddit.com/r/scrapy | 2022-04-19
ISP Data Pollution to Protect Private Browsing History with Obfuscation. Project mention: A browser extension that sends periodical random search requests, from a large distributed list, to confuse the data broker industry. | reddit.com/r/CrazyIdeas | 2022-04-12
I’ve seen multiple implementations of this idea. Here’s the first that came up in a quick search: https://github.com/essandess/isp-data-pollution/
Scrapy extension for monitoring spiders' execution.
WarcDB: Web crawl data as SQLite databases. Project mention: GitHub - Florents-Tselai/WarcDB: WarcDB: Web crawl data as SQLite databases. | reddit.com/r/sqlite | 2022-06-23
spidy Web Crawler
The simple, easy to use command line web crawler.
🕷 Automatically detect changes made to the official Telegram sites, clients and servers. Project mention: ☎️ Fragment is getting ready to sell virtual numbers | reddit.com/r/TONPlus | 2022-12-03
According to Telegram Info, changes have been made to the source code of the Fragment platform, suggesting that virtual numbers will soon be available for sale.
estela, an elastic web scraping cluster 🕸 Project mention: How to run web scraping script every 15 minutes | reddit.com/r/webscraping | 2023-02-13
You may want to check out [estela](https://estela.bitmaker.la/docs/), which is a spider management solution, developed by [Bitmaker](https://bitmaker.la) that allows you to run [Scrapy](https://scrapy.org) spiders.
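For the every-15-minutes part specifically, a plain crontab entry also works if you are self-hosting rather than using a managed cluster. The project path, virtualenv location, and spider name below are hypothetical:

```shell
# Run the hypothetical "quotes" Scrapy spider every 15 minutes,
# appending output and errors to a log file.
*/15 * * * * cd /home/user/myproject && /home/user/.venv/bin/scrapy crawl quotes >> /var/log/quotes.log 2>&1
```

Note the absolute path to the `scrapy` binary: cron runs with a minimal environment, so commands that rely on your shell's PATH or an activated virtualenv will otherwise fail silently.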
Python 3 script to dump company employees from XING API. Project mention: OSINT: Enumerating Employees on LinkedIn and Xing | reddit.com/r/redteamsec | 2023-02-16
estela Command Line Client 🕸 Project mention: Deploying Scrapy Projects on the Cloud | reddit.com/r/webscraping | 2022-12-14
We are currently running a closed beta of Bitmaker Cloud (free and unlimited). Bitmaker Cloud gives you easy management of scraping workloads via a web dashboard and API. Only Scrapy spiders are supported at the moment (additional languages/frameworks are on the roadmap). Bitmaker Cloud is powered by estela, an elastic web scraping cluster running on Kubernetes. estela is a modern alternative to proprietary platforms such as Scrapy Cloud, as well as OSS projects such as scrapyd. The source code of estela and estela-cli is available on GitHub.
Python Crawling related posts
Could someone recommend me a library for C# like one of these two (they are for Python): mlscraper and autoscraper
2 projects | reddit.com/r/learnprogramming | 19 Mar 2023
Is there a way to scrape ALL images from a website (not a single page, the whole website)?
1 project | reddit.com/r/DataHoarder | 18 Mar 2023
Celery lock a variable between two processes
1 project | reddit.com/r/learnpython | 9 Mar 2023
Scrapy extension blocking login to AVer PTZ Camera (CAM520 Pro)
1 project | reddit.com/r/scrapy | 7 Mar 2023
What steps do you apply in the "L" when doing ELT?
1 project | reddit.com/r/dataengineering | 24 Feb 2023
1 project | reddit.com/r/webscraping | 14 Feb 2023
How to run web scraping script every 15 minutes
2 projects | reddit.com/r/webscraping | 13 Feb 2023
What are some of the best open-source Crawling projects in Python? This list will help you:
| # | Project | Stars |
|---|---------|-------|
| 9 | spidy Web Crawler | 306 |