Fun with File Formats

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • DistorteD

    Ruby multimedia toolkit with deep Jekyll integration 🧪

  • In addition to this resource and the UK's equivalent (PRONOM/DROID, also mentioned in the linked post), I've found ArchiveTeam's wiki to be very useful for obscure file format details: http://fileformats.archiveteam.org/

    The `shared-mime-info` database from freedesktop-dot-org is probably more worthy of contribution than these government-backed databases, at least in terms of number-of-end-users. New type definitions in their database will improve the entire Linux/BSD ecosystem (both desktop and server!) because it's consumed not only by fd.o's own `update-mime-database` utility but by many language-specific type-identification libraries too https://gitlab.freedesktop.org/xdg/shared-mime-info/-/blob/m...

    …including (shameless plug) the new Ractor-based Ruby type library I've been working on in the wake of the `mimemagic` drama earlier this year: https://github.com/okeeblow/DistorteD/tree/NEW%E2%80%85SENSA...
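To make the "consumed by many type-identification libraries" point concrete, here is a minimal Python sketch of how a consumer might read definitions written in the shared-mime-info XML style and use them to guess a type. The two definitions in `SAMPLE_XML` are hand-written, simplified stand-ins rather than entries copied from the real database, and `load_rules` and `guess_type` are hypothetical helper names. The sketch only handles `string` matches and tries content before filename; real consumers also weigh priorities, nested matches, and other match types.

```python
import fnmatch
import xml.etree.ElementTree as ET

# XML namespace used by shared-mime-info documents.
NS = {"m": "http://www.freedesktop.org/standards/shared-mime-info"}

# Hand-written, simplified definitions in the shared-mime-info style.
# Illustrative only; not copied from the real database.
SAMPLE_XML = """<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="image/gif">
    <magic priority="50">
      <match type="string" value="GIF8" offset="0"/>
    </magic>
    <glob pattern="*.gif"/>
  </mime-type>
  <mime-type type="application/pdf">
    <magic priority="50">
      <match type="string" value="%PDF-" offset="0:1024"/>
    </magic>
    <glob pattern="*.pdf"/>
  </mime-type>
</mime-info>
"""


def load_rules(xml_text):
    """Return a list of (mime_type, glob_patterns, string_matches)."""
    root = ET.fromstring(xml_text)
    rules = []
    for mime_type in root.findall("m:mime-type", NS):
        globs = [g.get("pattern") for g in mime_type.findall("m:glob", NS)]
        matches = []
        for match in mime_type.findall("m:magic/m:match", NS):
            if match.get("type") != "string":
                continue  # this sketch ignores numeric match types
            # An offset is either "N" or a range "N:M" of start positions.
            start, _, end = match.get("offset").partition(":")
            value = match.get("value").encode("latin-1")
            matches.append((int(start), int(end or start), value))
        rules.append((mime_type.get("type"), globs, matches))
    return rules


def guess_type(rules, filename, head):
    """Guess a MIME type from leading bytes, falling back to the filename."""
    for mime, _, matches in rules:
        for start, end, value in matches:
            # The value may begin anywhere from `start` to `end` inclusive.
            if head.find(value, start, end + len(value)) != -1:
                return mime
    for mime, globs, _ in rules:
        if any(fnmatch.fnmatch(filename.lower(), g) for g in globs):
            return mime
    return None


rules = load_rules(SAMPLE_XML)
print(guess_type(rules, "mystery.bin", b"GIF89a\x01\x00"))
print(guess_type(rules, "report.PDF", b""))
```

Keeping the rules in data rather than in code is what lets one new definition benefit every consumer of the database at once.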

  • SheetJS js-xlsx

    📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

  • One of the biggest challenges when navigating file formats and public research is the legal landscape. GIF and the LZW patent are arguably the best-known example of legal issues spilling into the file format landscape. Even today, the various clauses around the Microsoft "Open Specifications Promise" present a challenge.

    Some formats like the original Lotus 1-2-3 format (PRONOM x-fmt/117) have surprisingly complete documentation that was explicitly released into the public domain.

    Other formats like the Excel 2007+ XLSB format (PRONOM fmt/595) have some documentation. Despite its length, that documentation leaves many implementation gaps. These formats are covered under the "Open Specifications Promise", effectively a covenant not to sue.

    Many formats like the various Apple Numbers formats (e.g. PRONOM fmt/825) and Quattro Pro 5 for DOS format (PRONOM x-fmt/122) are not well-documented at all. For some formats, the community knowledge has been derived by inspecting sample files and reverse-engineering the frameworks and binaries.

    Shameless plug: We're deeply interested in spreadsheets, and our main open source library (https://github.com/SheetJS/sheetjs) is primarily focused on spreadsheet data processing across a number of legacy and modern spreadsheet formats from Lotus WKS and Visicalc DIF to XLSX and XLSB. If file formats, data processing and bit-twiddling are interesting to you, we're looking to hire!

  • file

    Read-only mirror of the file CVS repository, updated every half hour. NOTE: do not make pull requests here or comment on any commits; submit them the usual way, to the bug tracker or to the mailing list. The maintainer(s) are not tracking this git mirror.

  • Also the magic number database for guessing the format of a file:

    https://www.darwinsys.com/file/
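As a rough illustration of what guessing a format from a magic number means, the Python sketch below compares the first bytes of a stream against a small table of well-known signatures. The table and the `sniff` helper are assumptions made for this example; they are not taken from the `file` magic database, which is far larger and supports offsets, numeric tests, and nested rules.

```python
import io

# A few widely documented leading-byte signatures (illustrative subset).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (XLSX, ODS, and others)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy XLS, DOC)"),
]


def sniff(stream):
    """Return a label for the stream based on its first bytes."""
    head = stream.read(16)  # long enough for every signature in the table
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


print(sniff(io.BytesIO(b"PK\x03\x04\x14\x00")))
print(sniff(io.BytesIO(b"plain text")))
```

Note that leading bytes alone cannot tell an XLSX workbook from any other ZIP archive; telling them apart requires looking inside the container, which is one reason real detectors layer further checks on top of magic numbers.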

  • tika-docker

    Convenience Docker images for Apache Tika Server

  • tablib

    Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.

  • There are two problems leading to the decision of only accepting public domain info: licensing and provenance.

    "Licensing" is hard. The "Open Specifications Promise" [1], which covers a bunch of Microsoft-designed file formats, is merely a covenant not to sue.

    "Provenance" is tricky. For example, much of the knowledge of the Apple iWork formats was derived by reverse-engineering the source programs and extracting protobuf definitions. Many open source projects have freely copied from each other, making detailed analysis tricky [2].

    [1] https://en.wikipedia.org/wiki/Microsoft_Open_Specification_P...

    [2] https://github.com/jazzband/tablib/issues/114

  • feather

    Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow (by wesm)

