Top 13 Python Crawling Projects
-
Project mention: Show HN: SiteGPT – Create ChatGPT-like chatbots trained on your website content | news.ycombinator.com | 2023-04-01
Not to go full "Dropbox in a weekend", but if you're technical enough to self-host, this is something you can build for yourself
Everyone is going straight to embeddings, but it'd be easy enough to use old school NLP summarization from NLTK (https://www.nltk.org/)
Hook that up to a web scraping library like https://scrapy.org/ and get a summary of each page.
Then embed a site map in your system prompt and use langchain (https://github.com/hwchase17/langchain) to allow GPT to query for a specific page's summary.
The point of this isn't to say that's how OP did it, but there might be people seeing stuff like this and wondering how on earth to get into it: This is something you could build in a weekend with pretty much no understanding of AI
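The summarize-then-site-map idea in the comment above can be sketched roughly as follows. This is a minimal illustration that uses only the standard library: the word-frequency scoring is a stand-in for the NLTK summarization the commenter mentions, not NLTK's API, and the names `summarize`, `build_site_map`, and `STOP_WORDS` are hypothetical. Fetching pages (for example with Scrapy) and wiring the result into a langchain prompt are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "with", "that", "this"}


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    def score(sentence):
        # A sentence scores the summed document frequency of its non-stop words.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOP_WORDS)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


def build_site_map(pages):
    """pages maps URL -> page text; returns one 'url: summary' line per page."""
    return "\n".join(f"{url}: {summarize(text, 1)}" for url, text in pages.items())
```

The string returned by `build_site_map` is the kind of compact site map that could be pasted into a system prompt, with the full per-page summaries looked up on demand.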
-
Project mention: Experimental library for scraping websites using OpenAI's GPT API | news.ycombinator.com | 2023-03-25
Why GPT-based then? There are libraries that do this: You give examples, they generate the rules for you and give you a scraper object that takes any html and returns the scraped data.
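To make the "give examples, get a scraper" idea concrete, here is a toy sketch built on the standard library's `html.parser`. It is not the API of mlscraper or autoscraper; the names `train`, `scrape`, and `text_paths` are hypothetical, and the learned "rule" is simply the tag/class path leading to the example value, which is far cruder than what those libraries do.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _TextPaths(HTMLParser):
    """Collect (path, text) pairs; path is a tuple of (tag, class) from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.found


def train(html, example):
    """Learn the path of the first text node equal to the example value."""
    for path, text in text_paths(html):
        if text == example:
            return path
    raise ValueError("example value not found in sample page")


def scrape(html, rule):
    """Apply a learned path to another page with the same layout."""
    return [text for path, text in text_paths(html) if path == rule]
```

A rule learned from one page, e.g. `train(sample_html, "$5")`, can then be applied to other pages that share the layout with `scrape(other_html, rule)`.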
-
Project mention: New to python and scrapy stuff but need this project to work so that I can do my data research and stuff easily in the future. | reddit.com/r/scrapy | 2022-04-19
-
Project mention: A browser extension that sends periodical random search requests, from a large distributed list, to confuse the data broker industry. | reddit.com/r/CrazyIdeas | 2022-04-12
I’ve seen multiple implementations of this idea. Here’s the first that came up in a quick search: https://github.com/essandess/isp-data-pollution/
-
Project mention: GitHub - Florents-Tselai/WarcDB: WarcDB: Web crawl data as SQLite databases. | reddit.com/r/sqlite | 2022-06-23
-
-
telegram-crawler
🕷 Automatically detect changes made to the official Telegram sites, clients and servers.
Project mention: ☎️ Fragment is getting ready to sell virtual numbers | reddit.com/r/TONPlus | 2022-12-03
According to the "Telegram Info", changes have been made to the source code of the Fragment platform, demonstrating the appearance of virtual numbers for sale soon!
-
Project mention: How to run webs scraping script every 15 minutes | reddit.com/r/webscraping | 2023-02-13
You may want to check out [estela](https://estela.bitmaker.la/docs/), a spider management solution developed by [Bitmaker](https://bitmaker.la) that allows you to run [Scrapy](https://scrapy.org) spiders.
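For a single script, a plain loop may be enough before reaching for a spider management platform. The sketch below is a generic standard-library approach, not part of estela or Scrapy; `run_every` and its parameters are hypothetical names, and `sleep` and `clock` are injectable only so the loop can be exercised without waiting.

```python
import time


def run_every(job, interval_seconds=15 * 60, max_runs=None,
              sleep=time.sleep, clock=time.monotonic):
    """Call job() repeatedly, aiming for one start per interval."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = clock()
        try:
            job()
        except Exception as exc:  # keep the loop alive if one run fails
            print(f"run failed: {exc}")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleep only for the rest of the interval, so the job's own
        # duration is not added on top of the 15 minutes.
        sleep(max(0.0, interval_seconds - (clock() - started)))
    return runs
```

Calling `run_every(my_scrape_function)` runs until the process is stopped. A system scheduler such as cron is the usual alternative when the script should survive reboots.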
-
Project mention: OSINT: Enumerating Employees on LinkedIn and Xing | reddit.com/r/redteamsec | 2023-02-16
-
We are currently running a closed beta of Bitmaker Cloud (free and unlimited). Bitmaker Cloud gives you easy management of scraping workloads via a web dashboard and API. Only Scrapy spiders are supported at the moment (additional languages/frameworks are on the roadmap). Bitmaker Cloud is powered by estela, an elastic web scraping cluster running on Kubernetes. estela is a modern alternative to proprietary platforms such as Scrapy Cloud, as well as OSS projects such as scrapyd. The source code of estela and estela-cli is available on GitHub.
Python Crawling related posts
- Could someone recommend me a library for c# like one of these two (they are for python) : mlscraper and autoscraper
- Is there a way to scrape ALL images from a website (not a single page, the whole website)?
- Celery lock a variable between two processes
- Scrapy extension blocking login to AVer PTZ Camera (CAM520 Pro)
- What steps do you apply in the "L" when doing ELT?
- Smart Scraper
- How to run webs scraping script every 15 minutes
Index
What are some of the best open-source Crawling projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 46,621 |
2 | newspaper | 12,585 |
3 | Grab | 2,268 |
4 | mlscraper | 901 |
5 | scrapyrt | 767 |
6 | isp-data-pollution | 530 |
7 | spidermon | 453 |
8 | WarcDB | 369 |
9 | spidy Web Crawler | 306 |
10 | telegram-crawler | 129 |
11 | estela | 110 |
12 | XingDumper | 16 |
13 | estela-cli | 3 |