Scrapy vs colly
| | Scrapy | colly |
|---|---|---|
| Mentions | 96 | 14 |
| Stars | 43,565 | 16,631 |
| Growth | 0.9% | 2.4% |
| Activity | 8.8 | 5.3 |
| Latest commit | 3 days ago | 29 days ago |
| Language | Python | Go |
| License | BSD 3-clause "New" or "Revised" License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Scrapy
-
Level up your Python today with open-source contributions
You could also visit the project's GitHub repository and add "/contribute" to the end of the URL. For example, visiting https://github.com/scrapy/scrapy/contribute will show you relatively approachable tasks for first-time contributors to the Scrapy project.
-
Python Projects to Improve?
I used Scrapy for those, but you can also learn how to scrape and parse HTML with requests and something like Beautiful Soup.
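For reference, a minimal sketch of that requests + Beautiful Soup approach; the URL and the `a[href]` selector are placeholders, not anything from the projects mentioned above:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- point this at the page you actually want to scrape.
resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()

# Parse the HTML and pull out every link's text and target.
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a[href]"):
    print(link.get_text(strip=True), link["href"])
```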
-
Django custom management command running Scrapy: How to include Scrapy's options?
I want to be able to run the Scrapy web crawling framework from within Django. Scrapy itself only provides a command line tool, scrapy, to execute its commands; i.e., the tool was not written with the intention of being called from an external program.
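That said, Scrapy does expose a programmatic entry point, scrapy.crawler.CrawlerProcess, which a custom management command can wrap. A rough sketch, assuming a standard Scrapy project layout; the command module path, the spider import, and the --set flag (mirroring the scrapy CLI's -s NAME=VALUE overrides) are illustrative, not from the original question:

```python
# myapp/management/commands/crawl.py  (hypothetical location)
from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # placeholder spider import


class Command(BaseCommand):
    help = "Run a Scrapy spider from within Django"

    def add_arguments(self, parser):
        # Mirror scrapy's -s NAME=VALUE setting overrides.
        parser.add_argument("--set", action="append", default=[],
                            metavar="NAME=VALUE")

    def handle(self, *args, **options):
        settings = get_project_settings()  # needs scrapy.cfg / SCRAPY_SETTINGS_MODULE
        for override in options["set"]:
            name, _, value = override.partition("=")
            settings.set(name, value)
        process = CrawlerProcess(settings)
        process.crawl(MySpider)
        process.start()  # blocks until the crawl finishes
```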
-
Is it possible on Python?
Yeah, my hunch is that a combination of nltk, python-Levenshtein, and numpy for language processing, pandas for gathering results, and scrapy for web scraping should make it possible. Sadly, such a project probably requires at least a month or two's worth of training in Python to prototype. Good luck OP.
-
Web-scraping without much Python knowledge
The Scrapy framework would be a good fit for what you want to do (especially if you plan on hosting the crawler); however, the learning curve is quite steep, especially if you don't have much experience using Python.
-
Seeking a modern, streamlined framework for my one-page web app.
I had a cool idea recently and coded up a scraper with Scrapy. Decided it was a cool dataset, so I scraped it all into Postgres with SQLAlchemy. Built out a nice Streamlit project so other users could interact with it as well. A core component of the project involved using Spotify's API. Long story short, Streamlit isn't totally mature, and something like capturing the auth token after a redirect is a PITA. Plus, you are sort of locked into the visual design of Streamlit to begin with.
-
Looking for a Selenium alternative
Look into Scrapy
-
Bulk Image Downloading (and Renaming)
Don't know about any non-scripting way, but I would use https://scrapy.org/ for this.
-
Intermediate & Advanced Python Projects
Web Crawler
-
Script to pull structured data from websites?
Check out Scrapy - https://scrapy.org/
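For a first look at Scrapy, a minimal spider, closely modeled on the official tutorial's quotes.toscrape.com example, looks roughly like this:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Crawl a site, yield structured items, and follow pagination."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Queue the "next page" link and parse it the same way.
        yield from response.follow_all(css="li.next a", callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -O quotes.json` to dump the scraped items as JSON.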
colly
-
First project - The API for engineering multiple choice questions
There is a scraper struct that spawns goroutines. Each goroutine then scrapes its web URL using go-colly, and the structured MCQ questions are served from API endpoints. Users can choose between a persistent Postgres model or an in-memory data structure to store the questions and retrieve them from the API. I have used dependency injection to inject the models via an interface into my application struct.
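As a rough illustration of that fan-out pattern (one Colly collector per goroutine, results funneled through a channel), with placeholder URLs and an h2 selector standing in for the poster's actual targets:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/gocolly/colly/v2"
)

// scrapeTitles visits one URL with its own collector and sends
// extracted headings to the shared channel.
func scrapeTitles(url string, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	c := colly.NewCollector()
	c.OnHTML("h2", func(e *colly.HTMLElement) { // placeholder selector
		out <- e.Text
	})
	if err := c.Visit(url); err != nil {
		fmt.Println("visit failed:", url, err)
	}
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"} // placeholders
	out := make(chan string)
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go scrapeTitles(u, out, &wg)
	}
	// Close the channel once every scraper goroutine has finished.
	go func() { wg.Wait(); close(out) }()
	for title := range out {
		fmt.Println(title)
	}
}
```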
-
How I Scraped Michelin Guide Using Go
What follows is my thought process on how I collect all restaurant details from the Michelin Guide using Go with Colly. The final dataset is available to download for free here.
-
Show HN: The Brutalist Report – A rolling snapshot of the day’s headlines
The whole thing is written in Go on my end. Ingesting new headlines is handled in a goroutine that spawns within the process every 30 mins using a combo of the wonderful gofeed (https://github.com/mmcdole/gofeed) and colly (https://github.com/gocolly/colly) libraries.
When loading the front page, you're loading a 1-minute-cached HTML page that was constructed from headlines already in my PostgreSQL database, put there by the ingestion goroutine (a sketch of that ingestion loop follows this excerpt).
I like the idea of word clouds, actually; I think you're on to something there. I think you just need to pre-generate them rather than doing it ad hoc (if that's what you're doing here) for speed. Additionally, perhaps consider using sentiment in a way that orients stories based on positive and negative sentiment. Right now I am not seeing how I, as a visitor/user, can act on the sentiment analysis as it is presented.
It would be neat to see a collection of uplifting stories grouped together through the sentiment analysis.
Anyway, food for thought. I hope you keep hacking away on it as it's just good fun to build things.
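A minimal sketch of the ingestion loop described above, using gofeed on a 30-minute ticker; the feed list is a placeholder and the PostgreSQL write is stubbed out with a log line:

```go
package main

import (
	"log"
	"time"

	"github.com/mmcdole/gofeed"
)

// ingest pulls headlines from each feed; persisting them (the poster
// uses PostgreSQL) is stubbed out with a log line here.
func ingest(fp *gofeed.Parser, feeds []string) {
	for _, url := range feeds {
		feed, err := fp.ParseURL(url)
		if err != nil {
			log.Println("fetch failed:", url, err)
			continue
		}
		for _, item := range feed.Items {
			log.Println(feed.Title, "-", item.Title, item.Link)
		}
	}
}

func main() {
	feeds := []string{"https://hnrss.org/frontpage"} // placeholder feed list
	fp := gofeed.NewParser()
	ingest(fp, feeds)
	// Re-ingest every 30 minutes, as described above.
	for range time.Tick(30 * time.Minute) {
		ingest(fp, feeds)
	}
}
```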
-
Web Crawling Libraries which allow me to view the network requests made from a site?
I've been looking at various scraping/crawling libraries like colly, but I'm not exactly sure if any of them can handle a specific use case I have.
-
How to do web scraping in Go with the Colly framework?
-
Problem trying to insert data received from Colly web scraper lib into DB via pgx library
While trying to get better with Go, I have been working on a program that will scrape a subreddit and then save some info into a Postgres table. In this case, I decided to use r/frugalmalefashion and save the title and flair of recent posts into the table. I am using Colly to scrape data, and pgx as the database driver.
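One way to wire those two libraries together is to perform the pgx insert inside Colly's OnHTML callback (Colly runs callbacks synchronously unless the collector is made async). The old.reddit.com selectors and the posts table schema below are assumptions for illustration, not the poster's actual code:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	// e.g. DATABASE_URL=postgres://user:pass@localhost:5432/scraper
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	c := colly.NewCollector()
	// Assumed old-reddit listing markup -- adjust the selectors to
	// whatever the page actually serves.
	c.OnHTML("div.thing", func(e *colly.HTMLElement) {
		title := e.ChildText("a.title")
		flair := e.ChildText("span.linkflairlabel")
		// Assumed schema: posts(title text, flair text).
		if _, err := conn.Exec(ctx,
			"INSERT INTO posts (title, flair) VALUES ($1, $2)",
			title, flair); err != nil {
			log.Println("insert failed:", err)
		}
	})

	if err := c.Visit("https://old.reddit.com/r/frugalmalefashion/"); err != nil {
		log.Fatal(err)
	}
}
```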
-
The State of Web Scraping in 2021
If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:
https://gotripod.com/insights/super-simple-site-crawling-and...
-
How to Crawl the Web with Scrapy
What's your problem with Colly? [0]
-
Mastering Web Scraping in Python: Crawling from Scratch
If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.
Colly [1] is an all-batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.
rod [2] is a browser automation library based on the DevTools Protocol. It adheres to the Go-ish way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser even if it's headless.
-
Mastering Web Scraping in Python: Crawling from Scratch
Before you write your own library for crawling, try some of the options out there. Many great open-source libraries can achieve it: Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). There are also many companies and services that provide scraping and crawling solutions.
What are some alternatives?
requests-html - Pythonic HTML Parsing for Humans™
pyspider - A Powerful Spider(Web Crawler) System in Python.
GoQuery - A little like that j-thing, only in Go.
MechanicalSoup - A Python library for automating interaction with websites.
xpath - XPath package for Golang, supports HTML, XML, JSON document query.
RoboBrowser
Grab - Web Scraping Framework
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
playwright-python - Python version of the Playwright testing and automation library.
portia - Visual scraping for Scrapy
Ferret - Declarative web scraping
feedparser - Parse feeds in Python