Spidey
A multithreaded web crawler library that is generic enough to allow different engines to be swapped in.
This may be overkill, but I have a library out there for building web crawlers: Spidey. I'm not suggesting you use it, but you could look at it for ideas. It uses a multithreaded producer/consumer approach that avoids recursion and the stack overflow issues that come with it: use a queue, pull a URL from the queue for each page, and push any new URLs you find back onto it. I still need to optimize my code a bit, but maybe it helps. Your issue, though, is most likely that you're finding a link to the page you are currently on; a HashSet or List of already-found URLs would solve that. A rough sketch of the loop is below.
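Here is a minimal sketch of that queue-plus-visited-set pattern in Python. This is not Spidey's code or API; the HTTP client (requests), the HTML parser (BeautifulSoup), the worker count, and `START_URL` are all hypothetical choices for illustration.

```python
import threading
from queue import Queue, Empty
from urllib.parse import urljoin, urlparse

import requests                      # assumed available; any HTTP client works
from bs4 import BeautifulSoup        # assumed available for link extraction

START_URL = "https://example.com/"   # hypothetical starting point

frontier = Queue()                   # URLs waiting to be crawled
visited = set()                      # URLs already seen (the "HashSet" the comment mentions)
visited_lock = threading.Lock()      # guards the shared set across workers

def worker():
    while True:
        try:
            url = frontier.get(timeout=5)   # pull the next URL off the queue
        except Empty:
            return                          # queue drained: worker exits
        try:
            html = requests.get(url, timeout=10).text
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).scheme not in ("http", "https"):
                    continue
                with visited_lock:
                    if link in visited:     # skip links we've already queued,
                        continue            # including links back to this page
                    visited.add(link)
                frontier.put(link)          # push newly found URLs onto the queue
        except requests.RequestException:
            pass                            # skip unreachable pages
        finally:
            frontier.task_done()

visited.add(START_URL)
frontier.put(START_URL)
threads = [threading.Thread(target=worker, daemon=True) for _ in range(8)]
for t in threads:
    t.start()
frontier.join()                             # block until every queued URL is processed
```

Checking the visited set before enqueueing is what prevents the self-link problem: a link back to the current page is already in the set, so it never re-enters the queue.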
Related posts
- Can R do recursive web crawling?
- I need data from a website. Is it viable to create an API that scrapes the website and returns the data on an endpoint?
- Download all ~19,000 municipal election candidates' answers from YLE's election compass in JSON format
- Increasing scraping speed
- AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?