Parsing an Undocumented File Format

This page summarizes the projects mentioned and recommended in the original post on Hacker News (news.ycombinator.com).

  • test_files

    :books: SheetJS Test Files (XLS/XLSX/XLSB and other spreadsheet formats)

  • 3) try to reproduce the artifacts with a NodeJS script.

    A simple `xlsx2csv` NodeJS script generates CSV text from an XLSX file, and a simple `diff` would reveal any deviations from the expected result.

    No one in the preceding 30 years thought to do the same! Many open source projects including OpenOffice/LibreOffice had huge collections of sample XLS and XLSX artifacts without accompanying plaintext artifacts.

    We wrote a series of scripts to automate Excel and generate the desired artifacts. https://github.com/SheetJS/test_files/blob/master/tests/txt.... is an AppleScript automation script for Excel 2011 for Mac.

    Those tests have revealed a number of unexpected bugs in third-party tools and regressions in Excel itself. For example, Excel 5.0 introduced the datetime format `yyyy-mm-dd [hh]:mm:ss`. The value 0.001 is expected to be rendered as "1900-01-00 00:01:26", and Excel 5.0 through 2003 rendered it that way. Excel 2007 changed number formatting, and newer versions show the nonsensical result "1900-01-00 645:01:26" (nonsensical because 0.001 of a day is about 86 seconds, well under one hour).
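
The baseline-and-diff idea and the 0.001 example above can be illustrated with a minimal Python sketch. This is not the SheetJS tooling: the function names `elapsed_hms` and `diff_lines` are hypothetical, the sketch assumes a serial value is a fraction of a 24-hour day, and it truncates to whole seconds, which may not match Excel's rounding in every case.

```python
import difflib

SECONDS_PER_DAY = 24 * 60 * 60


def elapsed_hms(serial):
    """Render a day-fraction serial value as hh:mm:ss (whole seconds, truncated)."""
    total_seconds = int(serial * SECONDS_PER_DAY)  # 0.001 day is 86.4 s, truncated to 86
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def diff_lines(expected, actual):
    """Return unified-diff lines between two texts; an empty list means they match."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Plaintext baseline, as quoted in the comment for Excel 5.0 through 2003.
    expected = "1900-01-00 " + elapsed_hms(0.001)
    # Rendering attributed in the comment to Excel 2007 and later.
    actual = "1900-01-00 645:01:26"
    for line in diff_lines(expected, actual):
        print(line)
```

An empty diff means the tool under test reproduced the baseline. Any output line starting with `-` or `+` marks a deviation to investigate.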

  • noaccess

  • I am trying to parse the data tables from the undocumented MS Access file format using NodeJS/JavaScript. I last tried about 3 years ago and it was really tough going, with a lot of trial and error spread out over several months. I managed to parse some basic MS Access files, but I still need to figure out a way to read the whole database more reliably. My effort was here:

    https://github.com/yazz/noaccess

  • byter

    Discontinued Python binary object reader/writer

  • Kaitai Struct

    Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby

  • - ImHex [2], which has a pattern language [3] that allows parsing and seems more powerful than what Kaitai offers. I stumbled upon some limitations with it, but it was still useful.

    [1]: https://kaitai.io/

  • ImHex

    🔍 A Hex Editor for Reverse Engineers, Programmers and people who value their retinas when working at 3 AM.

  • mdbtools

    MDB Tools - Read Access databases on *nix

