Top 23 Archiving Open-Source Projects

paperless-ngx

212 16,754 9.9 Python

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

Project mention: I accidentally built a meme search engine | news.ycombinator.com | 2024-04-13

I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.
https://github.com/paperless-ngx/paperless-ngx

nb

48 6,294 9.3 Shell

CLI and local web plain text note‑taking, bookmarking, and archiving with linking, tagging, filtering, search, Git versioning & syncing, Pandoc conversion, + more, in a single portable script.

Project mention: Nb – note taking and archiving on the command line | news.ycombinator.com | 2024-02-03

wal-e

7 3,423 3.0 Python

Continuous Archiving for Postgres

Project mention: Run PostgreSQL. The Kubernetes Way | news.ycombinator.com | 2023-09-22

See the GitHub: https://github.com/wal-e/wal-e
“Unmaintained” would’ve made more sense to say, but the maintainer chose the word “obsolete”, so I used it. :)
It seems to be obsolete due to a lack of interest and contributions.

wal-g

13 3,038 9.0 Go

Archival and Restoration for databases in the Cloud

Project mention: WAL-G 3.0.0 – fast disaster recovery for Postgres | news.ycombinator.com | 2024-03-17

libarchive

33 2,870 8.8 C

Multi-format archive and compression library

Project mention: The XZ attack and timeline | dev.to | 2024-04-17

29 October 2021: At this point Jia Tan pops up, and the first thing we see from him is an innocuous patch to the xz repository. While a lot of people believe he started out trying his luck with another library, libarchive, this is not the case; judging by the dates, I would bet it was more of a backup, since there are a few days in between, as shown in this commit.

LinkAce

48 2,426 7.7 PHP

LinkAce is a self-hosted archive to collect links of your favorite websites.

Project mention: Linkhut: A Social Bookmarking Site | news.ycombinator.com | 2024-01-09

pgBackRest

13 2,194 9.2 C

Reliable PostgreSQL Backup & Restore

Project mention: pgBackRest: PostgreSQL S3 backups | dev.to | 2023-08-10

This tutorial explains how to back up a PostgreSQL database using pgBackRest and S3.

itext-java

2 1,841 9.5 Java

iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.

Project mention: FastPDF Service API (Java) VS itext7 - a user suggested alternative | libhunt.com/r/fastpdf-java | 2023-12-07

dwarfs

21 1,860 9.9 C++

A fast high compression read-only file system for Linux, Windows and macOS

Project mention: DwarFS – The Deduplicating Warp-Speed Advanced Read-Only File System | news.ycombinator.com | 2024-04-11

https://github.com/mhx/dwarfs/blob/main/doc/mkdwarfs.md#nils...

itext-dotnet

5 1,548 9.5 C#

iText for .NET is the .NET version of the iText library, formerly known as iTextSharp, which it replaces. iText represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enha

Project mention: FastPDF Service API (C# .NET) VS itext7-dotnet - a user suggested alternative | libhunt.com/r/fastpdf-csharp | 2023-12-07

grab-site

30 1,260 3.8 Python

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29

The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.

sleek

8 1,199 9.2 TypeScript

todo.txt manager for Linux, Windows and MacOS, free and open-source (FOSS)

Project mention: Feature request: sync to text file (todo.txt syntax) | /r/tasks | 2023-12-08

It would make it possible to use tasks.org on Android and another app like Sleek on Windows (or any of the other todo.txt clients).

Bareos

0 933 9.9 C++

Bareos is a cross-network Open Source backup solution (licensed under AGPLv3) which preserves, archives, and recovers data from all major operating systems.

URS

11 727 7.5 Python

Universal Reddit Scraper - A comprehensive Reddit scraping/archival command-line tool.

Project mention: Nitter Shutting Down | news.ycombinator.com | 2024-01-27

If they don't want you to use their API just respect their wishes and scrape Reddit. https://github.com/JosephLai241/URS it's the only moral thing we can do.

squashfs-tools

2 708 8.8 C

Tools to create and extract Squashfs filesystems

gwern.net

16 434 9.9 Haskell

Site infrastructure for gwern.net (CSS/JS/HS/images/icons). Custom Hakyll website with unique automatic link archiving, recursive tooltip popup UX, dark mode, and typography (sidenotes+dropcaps+admonitions+inflation-adjuster).

Project mention: Show HN: My related-posts finder script (with LLM and GPT4 enhancement) | news.ycombinator.com | 2023-12-08

I do something similar on my website ( https://www.gwern.net ; crummy code at https://github.com/gwern/gwern.net/ ) for the 'similar' feature: call OA API with embedding, nearest-neighbor via cosine, list of links for suggested further reading.
Because it's a static site, managing the similar links poses the difficulties OP mentions: where do you store & update it? In the raw original Markdown? We solve it by transclusion: the list of 'similar' links is stored in a separate HTML snippet, which is just transcluded into the web page on demand. The snippets can be arbitrarily updated without affecting the Markdown essay source. We do this for other things too, it's a handy design pattern for static sites, to make things more compositional (allowing one HTML snippet to be reused in arbitrarily many places or allowing 'extremely large' pages) at the cost of some client-side work doing the transclusion.
I refine it in a couple ways: I don't need to call GPT-4 for summarization because the links all have abstracts/excerpts; I usually write abstracts for my own essays/posts (which everyone should do, and if the summaries are good enough to embed, why not just use them yourself for your posts? would also help your cache & cost issues, and be more useful than the 'explanation'). Then I also throw in the table of contents (which is implicitly an abstract), available metadata like tags & authors, and I further throw into the embeddings a list of the parsed links as well as reverse citations/backlinks. My assumption is that these improve the embedding by explicitly listing the URLs/titles of references, and what other pages find a given thing worth linking.
Parsing the links means I can improve the list of suggestions by deleting anything already linked in the article. OP has so few posts this may not be a problem for him, but if you are heavily hyperlinking and also have good embeddings (like I do), this will happen a lot, and it is annoying to a reader to be suggested links he has already seen and either looked at or ignored. This also means that it's easy to provide a curated 'see also' list: simply dump the similar list at the beginning, and keep the ones you like. They will be filtered out of the generated list automatically, so you can present known-good ones upfront and then the similars provide a regularly updated list of more. (Which helps handle the tension he notes between making a static list up front while new links regularly enter the system.)
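To make the two steps described so far concrete, here is a minimal sketch of cosine nearest-neighbor lookup followed by removal of already-linked pages. It is an illustration only, not code from the gwern.net repository: the function names, the toy two-dimensional vectors, and the URL paths are all hypothetical, and a real site would get its vectors from an embedding API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_links(target, embeddings, already_linked, top_n=5):
    """Return the top_n most similar URLs, skipping the target itself
    and anything the article already links to."""
    candidates = [
        (cosine_similarity(embeddings[target], vec), url)
        for url, vec in embeddings.items()
        if url != target and url not in already_linked
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [url for _, url in candidates[:top_n]]


# Toy data (hypothetical): "/a" is the closest page but is already linked.
embeddings = {
    "/essay": [1.0, 0.0],
    "/a": [0.9, 0.1],
    "/b": [0.0, 1.0],
    "/c": [0.7, 0.7],
}
suggestions = similar_links("/essay", embeddings, already_linked={"/a"})
print(suggestions)
```

Because the filter runs before ranking, a hand-curated 'see also' entry that is linked in the article never reappears in the generated list.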
One neat thing you can do with a list of hits, that I haven't seen anyone else do, is sort them by distance. The default presentation everyone does is to simply present them in order of distance to the target. This is sorta sensible because you at least see the 'closest' first, but the more links you have, the smaller the difference is, and the more that sorting looks completely arbitrary. What you can do instead is sort them by their distance to each other: if you do that, even in a simple greedy way, you get what is a list which automatically clusters by the internal topics. (Imagine there are two 'clusters' of topics equidistant to the current article; the default distance sort would give you something random-looking like A/B/B/A/B/A/A/A/B/B/A, which is painful to read, but if you sort by distance to each other to minimize the total distance, you'd get something more like B/B/B/B/B/B/A/A/A/A/A/A.) I call this 'sort by magic' or 'sort by semantic similarity': https://gwern.net/design#future-tag-features
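One simple way to read "sort by distance to each other, even in a simple greedy way" is a nearest-neighbor chain: start from the closest hit, then repeatedly append whichever remaining hit is nearest to the one just placed. The sketch below assumes that interpretation; the function names and toy vectors are hypothetical, and the approach is a heuristic that does not guarantee the minimum total distance.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sort_by_mutual_distance(hits, embeddings):
    """Greedy chain: keep the first hit, then always append the
    remaining hit closest to the most recently placed one."""
    if not hits:
        return []
    ordered = [hits[0]]
    remaining = list(hits[1:])
    while remaining:
        last_vec = embeddings[ordered[-1]]
        nearest = min(
            remaining,
            key=lambda url: cosine_distance(last_vec, embeddings[url]),
        )
        remaining.remove(nearest)
        ordered.append(nearest)
    return ordered


# Toy data (hypothetical): two topic clusters, A and B, interleaved
# the way a plain distance-to-target sort might present them.
embeddings = {
    "A1": [1.0, 0.0],
    "A2": [0.99, 0.1],
    "A3": [0.95, 0.2],
    "B1": [0.0, 1.0],
    "B2": [0.1, 0.99],
}
hits = ["A1", "B1", "A2", "B2", "A3"]
clustered = sort_by_mutual_distance(hits, embeddings)
print(clustered)
```

On this toy input the interleaved A/B/A/B/A order comes out grouped by cluster, which is the effect the comment describes.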
Additional notes: I would not present 'Similarity score: 79% match' because I assume this is just the cosine distance, which is equal for both suggestions (and therefore not helpful) and also is completely embedding dependent and basically arbitrary. (A good heuristic is: would it mean anything to the reader if the number were smaller, larger, or has one less digit? A 'similarity score' of 89%, or 7.9, or 70%, would all mean the same thing to the reader - nothing.)
> Complex or not, calculating cosine similarity is a lot less work than creating a fully-fledged search algorithm, and the results will be of similar quality. In fact, I'd be willing to bet that the embedding-based search would win a head-to-head comparison most of the time.
You are probably wrong. The full search algorithm, using exact word count indexes of everything, is highly competitive with embedding search. If you are interested, the baseline you're looking for in research papers on retrieval is 'BM25'.
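For readers unfamiliar with the baseline named here, the following is a minimal sketch of Okapi BM25 scoring over whitespace-tokenized text. The parameter defaults (k1=1.5, b=0.75), the "+1" inside the IDF logarithm, the function name, and the sample documents are choices made for this illustration; production search engines add proper tokenization, stemming, and an inverted index.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            freq = counts[term]
            if freq == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term frequency saturates and is normalized by document length.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical sample documents.
docs = [
    "archiving web pages as warc files",
    "postgres continuous archiving with wal",
    "todo list manager",
]
scores = bm25_scores("postgres archiving", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The score depends only on exact word counts, which is why it is cheap to compute and a common baseline in retrieval comparisons.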
> For each post, the script then finds the top two most-similar posts based on the cosine similarity of the embedding vectors.
Why only top two? It's at the bottom of the page, you're hardly hurting for space.

wikipedia-mirror

2 324 1.8 Shell

🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump

PDF-Archiver

1 286 8.0 Swift

A tool for tagging files and archiving tasks.

UnifiedArchive

0 273 5.6 PHP

UnifiedArchive - an archive manager with unified interface for different formats (bundled with cli utility). Supports all formats with basic operations (reading, extracting and creation) and popular formats specific features (compression level, password-protection, comment)

Golty

1 247 9.5 Go

A selfhostable service for automatically downloading YouTube channels, playlists and videos. It's like Sonarr, but for YouTube.

itext-pdfhtml-java

2 212 8.3 HTML

pdfHTML is an iText add-on for Java that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.

jarchivelib

1 198 0.0 Java

A simple archiving and compression library for Java

archiveis

1,477 170 0.0 Python

A simple Python wrapper for the archive.is capturing service

Project mention: Ask HN: Comments requesting paywall bypass links | news.ycombinator.com | 2024-04-18

I frequently see comments from people explicitly or implicitly asking for links to bypass the paywall on submitted articles. I'm confused by this, since it takes about the same amount of effort to generate your own paywall bypassing link as it does to post a comment asking for someone else to do it. Going further and posting this link for others to use does add a step, but doesn't seem like a lot to ask.
What's happening here?
Do these posters think some special magic is required? Are they not aware that creating such a link just involves going to the top level domain of one of the services (eg, http://archive.is) and pasting the URL into a form?
Are they opposed to the idea of creating such a link themselves, either due to moral qualms or legal fears, but willing to follow a link that someone else has created?
Are they using a handheld device that makes it so hard to copy a URL and open a new page that they don't know how to start, whereas they know how to write a comment?
Or are they just so entitled that they think someone else should provide for them at all times, and don't want to demean themselves helping others?
Can anyone who has posted such requests tell me what they were thinking? Can others who post bypass links tell me other explanations? General discussion on what the HN etiquette on paywall bypass links should be is welcomed as well.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Archiving related posts

Ask HN: Comments requesting paywall bypass links
1 project | news.ycombinator.com | 18 Apr 2024
Feathers Are One of Evolution's Cleverest Inventions
1 project | news.ycombinator.com | 18 Apr 2024
The XZ attack and timeline
1 project | dev.to | 17 Apr 2024
What will humans do if technology solves everything?
1 project | news.ycombinator.com | 14 Apr 2024
Building an AI Coach to Tame My Monkey Mind
3 projects | news.ycombinator.com | 11 Apr 2024
Zip entry size unset now honors user requested compression level
1 project | news.ycombinator.com | 31 Mar 2024
Suspicious libarchive pull request
1 project | news.ycombinator.com | 29 Mar 2024

Index

What are some of the best open-source Archiving projects? This list will help you:

#	Project	Stars
1	paperless-ngx	16,754
2	nb	6,294
3	wal-e	3,423
4	wal-g	3,038
5	libarchive	2,870
6	LinkAce	2,426
7	pgBackRest	2,194
8	itext-java	1,841
9	dwarfs	1,860
10	itext-dotnet	1,548
11	grab-site	1,260
12	sleek	1,199
13	Bareos	933
14	URS	727
15	squashfs-tools	708
16	gwern.net	434
17	wikipedia-mirror	324
18	PDF-Archiver	286
19	UnifiedArchive	273
20	Golty	247
21	itext-pdfhtml-java	212
22	jarchivelib	198
23	archiveis	170