Ferret vs colly
| | Ferret | colly |
|---|---|---|
| Mentions | 0 | 14 |
| Stars | 4,981 | 16,585 |
| Growth | 1.1% | 2.1% |
| Activity | 7.3 | 5.3 |
| Last commit | 5 days ago | 26 days ago |
| Language | Go | Go |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Ferret
We haven't tracked posts mentioning Ferret yet.
Tracking mentions began in Dec 2020.
colly
- First project - The API for engineering multiple choice questions
There is a scraper struct that spawns goroutines. Each goroutine scrapes a web URL using go-colly, and the structured MCQ questions are served from API endpoints. Users can choose between a persistent Postgres model or an in-memory data structure to store the questions and retrieve them through the API. I have used dependency injection to inject the models via an interface into my application struct.
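A rough Go sketch of that layout. The `QuestionModel` interface, the type and field names, and the CSS selectors are all hypothetical stand-ins for illustration, not details taken from the post:

```go
package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/gocolly/colly"
)

// Question is one scraped multiple-choice question.
type Question struct {
	Prompt  string
	Choices []string
}

// QuestionModel abstracts storage so a Postgres-backed or an in-memory
// implementation can be injected into the application struct.
type QuestionModel interface {
	Insert(q Question) error
	All() ([]Question, error)
}

// MemoryModel is the in-memory variant.
type MemoryModel struct {
	mu        sync.Mutex
	questions []Question
}

func (m *MemoryModel) Insert(q Question) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.questions = append(m.questions, q)
	return nil
}

func (m *MemoryModel) All() ([]Question, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]Question(nil), m.questions...), nil
}

// Scraper spawns one goroutine per URL and stores results through the injected model.
type Scraper struct {
	Model QuestionModel
}

func (s *Scraper) Run(urls []string) {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			c := colly.NewCollector()
			// The selectors are illustrative; a real target site needs its own.
			c.OnHTML(".question", func(e *colly.HTMLElement) {
				q := Question{
					Prompt:  e.ChildText(".prompt"),
					Choices: e.ChildTexts("li.choice"),
				}
				if err := s.Model.Insert(q); err != nil {
					log.Println("insert:", err)
				}
			})
			if err := c.Visit(url); err != nil {
				log.Println("visit:", err)
			}
		}(u)
	}
	wg.Wait()
}

func main() {
	model := &MemoryModel{} // swap in a Postgres-backed model here
	s := &Scraper{Model: model}
	s.Run([]string{"https://example.com/quiz"})
	qs, _ := model.All()
	fmt.Println("scraped", len(qs), "questions")
}
```

An HTTP handler built on `Model.All()` would then expose the questions as the API endpoints the post mentions.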
- How I Scraped Michelin Guide Using Go
What follows is my thought process on how I collected all restaurant details from the Michelin Guide using Go with Colly. The final dataset is available to download for free here.
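As a hedged illustration of what such a Colly crawl could look like (the start URL, domain, and CSS selectors below are placeholders, not the author's actual code):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("guide.michelin.com"),
	)

	// Selectors are placeholders; the real markup has to be inspected in the browser.
	c.OnHTML("div.card__menu", func(e *colly.HTMLElement) {
		name := e.ChildText("h3")
		link := e.Request.AbsoluteURL(e.ChildAttr("a", "href"))
		fmt.Println(name, link)
	})

	// Follow pagination links so every results page gets visited.
	c.OnHTML("a.pagination__link", func(e *colly.HTMLElement) {
		if err := e.Request.Visit(e.Attr("href")); err != nil && err != colly.ErrAlreadyVisited {
			log.Println("paginate:", err)
		}
	})

	if err := c.Visit("https://guide.michelin.com/en/restaurants"); err != nil {
		log.Fatal(err)
	}
}
```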
- Show HN: The Brutalist Report – A rolling snapshot of the day’s headlines
The whole thing is written in Go on my end. Ingesting new headlines is handled in a goroutine that spawns within the process every 30 mins using a combo of the wonderful gofeed (https://github.com/mmcdole/gofeed) and colly (https://github.com/gocolly/colly) libraries.
When you load the front page, you get a 1-minute-cached HTML page built from headlines already sitting in my PostgreSQL database, put there by that ingestion goroutine.
I like the idea of word clouds actually, I think you're on to something there. I think you just need to pre-generate them rather than doing it ad hoc (if that's what you're doing here) for speed. Additionally, perhaps consider using sentiment in a way that orients stories based on positive and negative sentiment. Right now I am not seeing how I, as a visitor/user, can act on the sentiment analysis as it is presented.
It would be neat to see a collection of uplifting stories grouped together through the sentiment analysis.
Anyway, food for thought. I hope you keep hacking away on it as it's just good fun to build things.
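The ingestion pattern the Brutalist Report post describes, a background goroutine on a 30-minute ticker that parses feeds with gofeed and hands headlines to storage, might look roughly like this sketch. The feed list and the `store` callback are placeholders for the real PostgreSQL logic:

```go
package main

import (
	"log"
	"time"

	"github.com/mmcdole/gofeed"
)

// ingest parses each feed and hands new items to store; store stands in for
// the PostgreSQL insert the post describes.
func ingest(feeds []string, store func(title, link string) error) {
	parser := gofeed.NewParser()
	for _, url := range feeds {
		feed, err := parser.ParseURL(url)
		if err != nil {
			log.Println("parse:", url, err)
			continue
		}
		for _, item := range feed.Items {
			if err := store(item.Title, item.Link); err != nil {
				log.Println("store:", err)
			}
		}
	}
}

func main() {
	feeds := []string{"https://example.com/rss"} // illustrative feed list

	// Run the ingestion loop in a background goroutine every 30 minutes,
	// mirroring the pattern described in the post.
	go func() {
		ticker := time.NewTicker(30 * time.Minute)
		defer ticker.Stop()
		for {
			ingest(feeds, func(title, link string) error {
				log.Println("new headline:", title, link)
				return nil
			})
			<-ticker.C
		}
	}()

	select {} // keep the process alive; a real server would serve HTTP here
}
```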
- Web Crawling Libraries which allow me to view the network requests made from a site?
I've been looking at various scraping/crawling libraries like colly, but I'm not exactly sure if any of them can handle a specific use case I have.
- How to do web scraping in Go with the Colly framework?
- Problem trying to insert data received from Colly web scraper lib into DB via pgx library
In the process of trying to get better with Go, I have been building a program that scrapes a subreddit and saves some info into a Postgres table. In this case, I decided to use r/frugalmalefashion and save the title and flair of recent posts into the table. I am using Colly to scrape the data and pgx as the database driver.
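A minimal sketch of how Colly and pgx (v4) can be wired together for that kind of job. The connection string, table name, and selectors are assumptions for illustration, not the poster's code:

```go
package main

import (
	"context"
	"log"

	"github.com/gocolly/colly"
	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	ctx := context.Background()

	// Connection string and table name are placeholders.
	pool, err := pgxpool.Connect(ctx, "postgres://user:pass@localhost:5432/scraper")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	c := colly.NewCollector(colly.UserAgent("mfm-scraper/0.1"))

	// Selectors assume old.reddit.com's markup and are illustrative; inspect the page first.
	c.OnHTML("div.thing", func(e *colly.HTMLElement) {
		title := e.ChildText("a.title")
		flair := e.ChildText("span.linkflairlabel")
		if _, err := pool.Exec(ctx,
			"INSERT INTO posts (title, flair) VALUES ($1, $2)",
			title, flair); err != nil {
			log.Println("insert:", err)
		}
	})

	if err := c.Visit("https://old.reddit.com/r/frugalmalefashion/"); err != nil {
		log.Fatal(err)
	}
}
```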
- The State of Web Scraping in 2021
If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:
https://gotripod.com/insights/super-simple-site-crawling-and...
- How to Crawl the Web with Scrapy
What's your problem with Colly? [0]
- Mastering Web Scraping in Python: Crawling from Scratch
If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.
Colly [1] is an all-batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.
rod [2] is a browser automation library based on the DevTools Protocol; it adheres to the Go way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser even if it's headless.
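On the concurrency point, Colly's async mode plus a limit rule is the usual pattern for fetching pages in parallel. A small sketch with placeholder URLs:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Async mode lets colly fetch pages concurrently; the limit rule caps parallelism.
	c := colly.NewCollector(colly.Async(true))
	if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4}); err != nil {
		log.Fatal(err)
	}

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println(e.Request.URL, "->", e.Text)
	})

	// Illustrative URL list.
	for _, u := range []string{"https://example.com", "https://example.org"} {
		if err := c.Visit(u); err != nil {
			log.Println("visit:", err)
		}
	}
	c.Wait() // block until all async requests finish
}
```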
- Mastering Web Scraping in Python: Crawling from Scratch
Before you write your own library for crawling, try some of the options out there. Many great open-source libraries can achieve it: Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). And there are many companies and services that provide scraping and crawling solutions.
What are some alternatives?
GoQuery - A little like that j-thing, only in Go.
xpath - XPath package for Golang, supports HTML, XML, JSON document query.
Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
rod - A Devtools driver for web automation and scraping
Geziyor - Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.
Pholcus - Pholcus is a distributed high-concurrency crawler software written in pure golang
Playwright - Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
rtila-releases
google-search-results-golang - Google Search Results GoLang API
Dataflow kit - Extract structured data from web sites. Web sites scraping.
gichidan - Gichidan - CLI wrapper for Ichidan deep-web search engine.