url-collector vs webmagic
| | url-collector | webmagic |
|---|---|---|
| Mentions | 2 | 1 |
| Stars | 0 | 11,265 |
| Growth | - | - |
| Activity | 5.1 | 8.4 |
| Last commit | over 2 years ago | 28 days ago |
| Language | Java | Java |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
url-collector
240 million URLs for PDF and DOC files
Well, I used Java. The app is still somewhat under construction, but it is available here: https://github.com/bottomless-archive-project/url-collector
webmagic
What are some alternatives?
fscrawler - Elasticsearch File System Crawler (FS Crawler)
library-of-alexandria - Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.
Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
SpotifyDiscoveryBot - A Java-based bot that automatically crawls for new releases by your followed artists on Spotify. Never miss a release again!
google-search-results-java - Google Search Results JAVA API via SerpApi
Flowable (V6) - A compact and highly efficient workflow and Business Process Management (BPM) platform for developers, system admins and business users.
spring-cloud-config - External configuration (server and client) for Spring Cloud
ActiveJ - ActiveJ is an alternative Java platform built from the ground up. ActiveJ redefines core, web and high-load programming in Java, providing simplicity, maximum performance and scalability
Arthas - Alibaba Java Diagnostic Tool Arthas
TestFX - Simple and clean testing for JavaFX.
ServiceTalk - A networking framework that evolves with your application