java-warc
Read Web ARChive (WARC) files in Java. (by laxika)
mixnode-warcreader-java
Read Web ARChive (WARC) files in Java. (by Mixnode)
| | java-warc | mixnode-warcreader-java |
|---|---|---|
| Mentions | 1 | 1 |
| Stars | 3 | 9 |
| Growth | - | - |
| Activity | 10.0 | 10.0 |
| Last commit | over 4 years ago | about 7 years ago |
| Language | Java | Java |
| License | Apache License 2.0 | GNU General Public License v3.0 or later |
Mentions - the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
java-warc
Posts with mentions or reviews of java-warc.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2023-01-11.
- How I archived 100 million PDF documents... (Part 1)
  I found one Java library on GitHub (thanks Mixnode) that was able to read these files. Unfortunately, it had not been maintained for the past couple of years. I picked it up and forked it to make it a little easier to use. (A couple of years later this repo was moved under the Bottomless Archive project as well.)
mixnode-warcreader-java
Posts with mentions or reviews of mixnode-warcreader-java.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2023-01-11.
- How I archived 100 million PDF documents... (Part 1)
  I found one Java library on GitHub (thanks Mixnode) that was able to read these files. Unfortunately, it had not been maintained for the past couple of years. I picked it up and forked it to make it a little easier to use. (A couple of years later this repo was moved under the Bottomless Archive project as well.)
What are some alternatives?
When comparing java-warc and mixnode-warcreader-java, you can also consider the following projects:
jsoup - jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
java-warc - Read Web ARChive (WARC) files in Java.
library-of-alexandria - Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.
Apache PDFBox - Mirror of Apache PDFBox