Fun with File Formats

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • DistorteD

    Ruby multimedia toolkit with deep Jekyll integration 🧪

    In addition to this resource and the UK's equivalent (PRONOM/DROID, also mentioned in the linked post), I've found ArchiveTeam's wiki to be very useful for obscure file format details: http://fileformats.archiveteam.org/

    The `shared-mime-info` database from freedesktop-dot-org is probably more worthy of contribution than these government-backed databases, at least in terms of number-of-end-users. New type definitions in their database will improve the entire Linux/BSD ecosystem (both desktop and server!) because it's consumed not only by fd.o's own `update-mime-database` utility but by many language-specific type-identification libraries too https://gitlab.freedesktop.org/xdg/shared-mime-info/-/blob/m...

    …including (shameless plug) the new Ractor-based Ruby type library I've been working on in the wake of the `mimemagic` drama earlier this year: https://github.com/okeeblow/DistorteD/tree/NEW%E2%80%85SENSA...
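
    As a rough illustration of how a type-identification library might consume such a database, here is a minimal Python sketch of filename matching against weighted glob rules laid out as `weight:mime-type:pattern`. The sample rules, the function names, and the tie-breaking (highest weight, then longest pattern) are assumptions made for this sketch; it is not the code of `update-mime-database` or of any particular library, and it ignores content sniffing and per-rule flags.

```python
from fnmatch import fnmatchcase

# Illustrative rules in a weight:mime-type:pattern layout.
# These lines are assumptions for the sketch, not copied from the real database.
SAMPLE_GLOBS = """\
# weight:mime-type:glob
50:application/pdf:*.pdf
50:application/gzip:*.gz
50:application/x-compressed-tar:*.tar.gz
50:image/png:*.png
"""


def parse_globs(text):
    """Return a list of (weight, mime_type, pattern) tuples."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        weight, mime_type, pattern = line.split(":", 3)[:3]
        rules.append((int(weight), mime_type, pattern))
    return rules


def guess_type_by_name(filename, rules):
    """Pick the matching rule with the highest weight, then the longest pattern."""
    name = filename.lower()  # simplification: always match case-insensitively
    matches = [rule for rule in rules if fnmatchcase(name, rule[2].lower())]
    if not matches:
        return None
    best = max(matches, key=lambda rule: (rule[0], len(rule[2])))
    return best[1]


RULES = parse_globs(SAMPLE_GLOBS)
```

    With these sample rules, `guess_type_by_name("backup.tar.gz", RULES)` prefers the longer `*.tar.gz` pattern over `*.gz`, and a name that matches nothing returns `None`.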

  • SheetJS js-xlsx

    📗 SheetJS Community Edition -- Spreadsheet Data Toolkit

    One of the biggest challenges when navigating file formats and public research is the legal landscape. The LZW patent and its effect on GIF is arguably one of the best-known examples of legal issues spilling into the file format landscape. Even today, the various clauses of the Microsoft "Open Specifications Promise" present a challenge.

    Some formats like the original Lotus 1-2-3 format (PRONOM x-fmt/117) have surprisingly complete documentation that was explicitly released into the public domain.

    Other formats like the Excel 2007+ XLSB format (PRONOM fmt/595) have some documentation. Despite its length, that documentation leaves many implementation gaps. These formats are covered under the "Open Specifications Promise", effectively a covenant not to sue.

    Many formats, like the various Apple Numbers formats (e.g. PRONOM fmt/825) and the Quattro Pro 5 for DOS format (PRONOM x-fmt/122), are not well documented at all. For some formats, the community knowledge has been derived by inspecting sample files and reverse-engineering the frameworks and binaries.

    Shameless plug: We're deeply interested in spreadsheets, and our main open source library (https://github.com/SheetJS/sheetjs) is primarily focused on spreadsheet data processing across a number of legacy and modern spreadsheet formats, from Lotus WKS and VisiCalc DIF to XLSX and XLSB. If file formats, data processing and bit-twiddling are interesting to you, we're looking to hire!
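
    Inspecting sample files for an undocumented format usually starts with looking at the raw bytes. The following Python sketch is one way to do that; the `hexdump` helper is hypothetical, written for this illustration, and is not part of SheetJS or any tool mentioned here.

```python
def hexdump(data, width=16):
    """Format bytes as rows of offset, hex columns, and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "xx " groups without the trailing space
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Non-printable bytes are shown as "." in the ASCII column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)


def hexdump_file(path, limit=256):
    """Dump the first `limit` bytes of a file; headers usually live there."""
    with open(path, "rb") as handle:
        return hexdump(handle.read(limit))
```

    Comparing dumps of two sample files that differ in a single cell value is a common way to work out where a field is stored.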

  • file

    Read-only mirror of the file CVS repository, updated every half hour. NOTE: do not make pull requests here or comment on commits; submit them the usual way, to the bug tracker or the mailing list. The maintainers do not track this git mirror.

    Also the magic number database for guessing the format of a file:

    https://www.darwinsys.com/file/
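
    To make the idea of magic numbers concrete, here is a minimal Python sketch that guesses a format from the leading bytes of a file. The signature table is a tiny hand-written assumption for illustration; the real magic database is far larger and supports offsets, masks, and nested tests, none of which this sketch models.

```python
# A few well-known leading byte signatures (a toy table, not the magic database).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),  # also the container used by XLSX and XLSB
]


def sniff(data):
    """Return a label for the first signature that prefixes `data`."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_file(path):
    """Read only the first few bytes of a file and sniff them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))
```

    A ZIP match only identifies the container; telling XLSX apart from any other ZIP-based format requires looking inside the archive.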

  • tika-docker

    Convenience Docker images for Apache Tika Server

  • tablib

    Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.

    Two problems lead to the decision to accept only public domain information: licensing and provenance.

    "Licensing" is hard. The "Open Specifications Promise" [1], which covers a number of Microsoft-designed file formats, is merely a covenant not to sue.

    "Provenance" is tricky. For example, much of the knowledge of the Apple iWork formats was derived by reverse-engineering the source programs and extracting protobuf definitions. Many open source projects have freely copied from each other, which makes it hard to trace where a given piece of format knowledge came from [2].

    [1] https://en.wikipedia.org/wiki/Microsoft_Open_Specification_P...

    [2] https://github.com/jazzband/tablib/issues/114

  • feather

    Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow (by wesm)

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
