Apache Nutch

Apache Nutch is an extensible and scalable web crawler (by apache)

Apache Nutch Alternatives

Similar projects and alternatives to Apache Nutch based on common topics and language

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a better Apache Nutch alternative or higher similarity.

Apache Nutch reviews and mentions

Posts with mentions or reviews of Apache Nutch. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-12-14.
  • Distributed Web Crawler
    1 project | /r/webdev | 31 Dec 2022
  • How impossible is this task that's been assigned to my coworkers and I?
    2 projects | /r/cscareerquestions | 14 Dec 2022
    Hi, I have read a few comments under the post. There are great suggestions, and your questions regarding the task are on point. But I believe handling this with a script might not be easy. If I were you, I would use Apache Nutch or a similar open source software/library. I used Nutch for my thesis for a similar task, where I had to scrape a lot of blog pages and the other pages they referenced. You can configure all the points in your questions, like how deep you want to scrape and what kind of content you want to extract. There are also places where you can extend or modify the behavior, so you can implement your custom logic to parse the HTML. https://nutch.apache.org

Stats

Basic Apache Nutch repo stats
Mentions: 3
Stars: 2,818
Activity: 8.0
Last commit: 10 days ago

apache/nutch is an open source project licensed under Apache License 2.0 which is an OSI approved license.

The primary programming language of Apache Nutch is Java.

