Top 23 Python Scraping Projects
-
Project mention: Is there a program available for bulk image reverse searching? | reddit.com/r/AskProgramming | 2023-02-02
In the past I used tools like BeautifulSoup for web scraping, but I’ve heard good things about https://scrapy.org/
-
Project mention: 8 Most Popular Python HTML Web Scraping Packages with Benchmarks | dev.to | 2023-02-01
requests-html
-
Project mention: A Smart, Automatic, Fast and Lightweight Web Scraper for Python | reddit.com/r/webdev | 2022-12-02
-
undetected-chromedriver
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
Try using undetected-chromedriver; it should make your scraper harder to detect. If that doesn't work, you will need to use proxies.
-
Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE
Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!
-
-
Project mention: Tool that lists all accounts linked to an email address? | reddit.com/r/de_EDV | 2022-06-22
-
Facebook Scraper project github link: GitHub – kevinzg/facebook-scraper: Scrape Facebook public pages without an API key
-
variable.css(".X5PpBb::text").get()  # returns a text value
variable.css(".gs_a").xpath("normalize-space()").get()  # https://github.com/scrapy/parsel/issues/192#issuecomment-1042301716
variable.css(".gSGphe img::attr(srcset)").get()  # returns an attribute value
variable.css(".I9Jtec::text").getall()  # returns a list of string values
variable.xpath('th/text()').get()  # returns a text value using XPath
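The selector calls above return four kinds of result: a single text value, a whitespace-normalized string, an attribute value, and a list of strings. The sketch below reproduces those four results with Python's standard-library `xml.etree.ElementTree` rather than parsel itself, so it runs without extra installs. The markup and helper names are hypothetical; only the class names are taken from the snippet. Note that ElementTree needs well-formed markup and matches the `class` attribute as an exact string, which is stricter than a CSS class selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; only the class names come from the snippet above.
HTML = """
<div>
  <span class="X5PpBb">Example title</span>
  <div class="gs_a">  A. Author,
      B. Author  </div>
  <div class="gSGphe"><img srcset="a.png 1x, b.png 2x" /></div>
  <p class="I9Jtec">first</p>
  <p class="I9Jtec">second</p>
  <table><tr><th>Header</th><td>cell</td></tr></table>
</div>
"""

root = ET.fromstring(HTML)


def first_text(cls):
    """Rough analogue of .css(".cls::text").get(): text of the first match, or None."""
    node = root.find(f".//*[@class='{cls}']")
    return node.text if node is not None else None


def all_text(cls):
    """Rough analogue of .css(".cls::text").getall(): text of every match."""
    return [node.text for node in root.findall(f".//*[@class='{cls}']")]


# Single text value.
title = first_text("X5PpBb")

# Whitespace-normalized text, similar in spirit to xpath("normalize-space()").
byline = " ".join(first_text("gs_a").split())

# Attribute value, similar to "img::attr(srcset)".
srcset = root.find(".//*[@class='gSGphe']/img").get("srcset")

# List of strings.
items = all_text("I9Jtec")

# Relative lookup from a row, similar to row.xpath('th/text()').get().
row = root.find(".//tr")
header = row.find("th").text

print(title, byline, srcset, items, header, sep="\n")
```

With parsel, the same lookups would go through a `Selector` built from the page text. The difference to remember either way is that a single-value lookup returns one result or `None`, while a "get all" lookup always returns a list.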
-
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Project mention: Testing fast installation in tear-down environment | reddit.com/r/learnpython | 2022-07-06
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
-
-
-
Project mention: What are your favourite GitHub repos that shows how data engineering should be done? | reddit.com/r/dataengineering | 2022-11-18
-
Scweet
A simple and unlimited twitter scraper : scrape tweets, likes, retweets, following, followers, user info, images...
Project mention: How do I scrape Twitter media along with text of all the accounts that I'm following without using API? | reddit.com/r/DataHoarder | 2023-02-03
-
Most services tried to charge me for custom domains, sitemaps, themes, etc. Sorry, I don't have enough money for that.
2. nextjs-notion-starter-kit: This was the second one I tried.
① Cannot handle paths well. It is fine to map a page to /one, but it throws a 404 error when the path is /one/two or /one/two/three. Since I have to map URLs like /category1/subcategory2/sub-subcategory3/sub-sub-subcategory4/page5, this is a serious problem. Other than that, it seems quite nice.
3. loconotion
① Does not sort pages into directories. I tested this without mapping pages to paths, but it seems it doesn't sort pages into directories when processing Notion pages. I found a Python script that sorts those pages, but it doesn't support Windows, the only OS I use for general use.
② Slow generation. If my site had only a few pages, it might be a good choice, but my site has ~100 pages and many more will be added in the near future. If it converts every page whenever I launch it, it would take a lot of time.
4. Fruition: I think it just converts the Notion page URL into a mapped URL. Kind of nice, but still as slow as Notion's shared pages are. But I think it is doing some 'magic'; I didn't know that a CF Worker could do such a job. I'm currently using this.
-
lookyloo
Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.
-
comic-dl
Comic-dl is a command line tool to download manga and comics from various comic and manga sites. Supported sites : readcomiconline.to, mangafox.me, comic naver and many more.
Project mention: Working on a frontend for Comic-dl. Would there be any interest in this? | reddit.com/r/selfhosted | 2022-07-03
Recently came back to comic-dl when I started setting up my homelab and selfhosting more. Wanted a way to automate/semi-automate my comic consumption.
-
This is what Spidermon does.
-
Search Engine Parser
Lightweight package to query popular search engines and scrape for result titles, links and descriptions
-
-
dude
dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
I also built a framework so I can easily switch between these libraries with less code change (still on hiatus for a few months before going back to it): https://github.com/roniemartinez/dude
-
Project mention: Scan For Webcams on the internet easily, with automatic checking of accessibility, text descriptions of the footage, timezone of the camera and more. | reddit.com/r/SideProject | 2022-05-17
Index
What are some of the best open-source Scraping projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 46,031 |
2 | requests-html | 12,923 |
3 | autoscraper | 4,905 |
4 | undetected-chromedriver | 3,848 |
5 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 2,820 |
6 | Grab | 2,259 |
7 | snoop | 1,735 |
8 | facebook-scraper | 1,554 |
9 | parsel | 874 |
10 | mlscraper | 818 |
11 | trafilatura | 741 |
12 | Edu-Mail-Generator | 725 |
13 | gazpacho | 694 |
14 | DataEngineeringProject | 684 |
15 | Scweet | 636 |
16 | loconotion | 619 |
17 | lookyloo | 522 |
18 | comic-dl | 476 |
19 | spidermon | 447 |
20 | Search Engine Parser | 371 |
21 | wikipedia_ql | 347 |
22 | dude | 330 |
23 | scan-for-webcams | 184 |