Kaitai Struct: A new way to develop parsers for binary structures

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Kaitai Struct

    Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby

  • I contributed a number of file formats a few years ago (and attempted numerous others), but ran into several problems with certain formats:

    1. It's not possible to read from the file until a multi-byte termination sequence is detected. [1]

    2. You can't read a section of a file whose termination condition is the presence of a byte sequence marking the start of the next, unrelated section (bytes you don't want to consume). [2]

    3. The WebIDE at the time couldn't handle very large file format specifications such as Photoshop (PSD). [3]

    4. Files containing compressed or encrypted sections require the compression/encryption algorithm to be hardcoded into the Kaitai Struct libraries for each programming language it can output to.

    I particularly liked the WebIDE, as it makes it easy to get started and share results. I also liked how Kaitai Struct lets you define constraints (simple ones at least) in the file format specification, so that you can say "this section of the file shall have a size not exceeding header.length * 2 bytes".

    For those interested, here are some alternative approaches to specifying binary file formats, each with its own pros and cons:

    1. 010 Editor [4]

    2. Synalysis [5]

    3. hachoir [6]

    4. DFDL [7]

    [1] https://github.com/kaitai-io/kaitai_struct/issues/158

    [2] https://github.com/kaitai-io/kaitai_struct/issues/156

    [3] https://raw.githubusercontent.com/davidhicks/kaitai_struct_f...

    [4] https://www.sweetscape.com/010editor/repository/templates/

    [5] https://github.com/synalysis/Grammars

    [6] https://github.com/vstinner/hachoir/tree/main/hachoir/parser

    [7] https://github.com/DFDLSchemas/
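
To make problems 1 and 2 above concrete, here is a minimal hand-written Python sketch of the behaviour being asked for: read up to a multi-byte terminator, and optionally leave the terminator unread so the next section can start with it. This is not Kaitai Struct code; `read_until` and its `consume` flag are hypothetical names chosen for illustration, and the stream is assumed to be seekable.

```python
import io


def read_until(stream, terminator, consume=True):
    """Return the bytes that precede a multi-byte terminator.

    With consume=False the stream is rewound so the terminator is the
    next thing read (hypothetical helper, not part of any library).
    """
    if not terminator:
        raise ValueError("terminator must be non-empty")
    buf = bytearray()
    while not buf.endswith(terminator):
        byte = stream.read(1)
        if not byte:
            raise EOFError("terminator not found before end of stream")
        buf += byte
    if not consume:
        # Step back so the terminator stays available to the next reader.
        stream.seek(-len(terminator), io.SEEK_CUR)
    return bytes(buf[:-len(terminator)])


stream = io.BytesIO(b"hello\r\n\r\nBODYnext")
header = read_until(stream, b"\r\n\r\n")            # terminator is consumed
body = read_until(stream, b"next", consume=False)   # terminator is left in place
print(header, body, stream.read())
```

Reading one byte at a time keeps the sketch short; a real implementation would probably buffer.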

  • nom

    Rust parser combinator framework

  • As a code generator, I guess this may be nice. A DSL like the nom [0] API seems more natural and expressive, though. I imagine you can hit the limits of expressiveness in YAML pretty quickly.

    [0] https://github.com/Geal/nom
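
For readers unfamiliar with the combinator style the comment refers to, here is a rough Python sketch of the idea: small parser functions composed into larger ones in ordinary code. The helper names (`tag`, `le_u16`, `length_data`, `sequence`) echo nom's vocabulary, but they are illustrative stand-ins defined here, not nom's actual Rust API.

```python
import struct


def le_u16(data, pos):
    """Parse a little-endian unsigned 16-bit integer."""
    if pos + 2 > len(data):
        raise ValueError("unexpected end of input")
    return struct.unpack_from("<H", data, pos)[0], pos + 2


def tag(expected):
    """Build a parser that matches a fixed byte sequence."""
    def parse(data, pos):
        end = pos + len(expected)
        if data[pos:end] != expected:
            raise ValueError(f"expected {expected!r} at offset {pos}")
        return expected, end
    return parse


def length_data(length_parser):
    """Build a parser that reads a length, then that many bytes."""
    def parse(data, pos):
        n, pos = length_parser(data, pos)
        if pos + n > len(data):
            raise ValueError("payload runs past end of input")
        return data[pos:pos + n], pos + n
    return parse


def sequence(*parsers):
    """Build a parser that runs each parser in order and collects the values."""
    def parse(data, pos):
        values = []
        for parser in parsers:
            value, pos = parser(data, pos)
            values.append(value)
        return values, pos
    return parse


# A made-up record layout: magic "REC", a u16 version, a u16-prefixed payload.
record = sequence(tag(b"REC"), le_u16, length_data(le_u16))
values, end = record(b"REC\x07\x00\x03\x00abcXYZ", 0)
print(values, end)  # [b'REC', 7, b'abc'] 10
```

Because the format description is plain code, arbitrary logic can be mixed in wherever the format needs it, which is the expressiveness argument made above.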

  • wuffs

    Wrangling Untrusted File Formats Safely

  • I agree that they're similar, but ultimately they address different problems. Copy/pasting from https://github.com/google/wuffs/blob/main/doc/related-work.m... gives:

    > Kaitai Struct is in a similar space, generating safe parsers for multiple target programming languages from one declarative specification. Again, Wuffs differs in that it is a complete (and performant) end to end implementation, not just for the structured parts of a file format. Repeating a point in the previous paragraph, the difficulty in decoding the GIF format isn't in the regularly-expressible part of the format, it's in the LZW compression. Kaitai's GIF parser returns the compressed LZW data as an opaque blob.
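
To illustrate what the "structured part" of GIF looks like in practice, here is a minimal Python sketch that decodes only the fixed 13-byte header (signature plus logical screen descriptor) and hands back everything after it undecoded. It is neither Kaitai's nor Wuffs's code; `parse_gif_header` and the dictionary keys are names invented for this example.

```python
import struct


def parse_gif_header(data):
    """Decode the GIF signature and logical screen descriptor only.

    Everything after the first 13 bytes is returned as-is; no LZW
    decoding is attempted (hypothetical helper for illustration).
    """
    if len(data) < 13:
        raise ValueError("too short to be a GIF")
    signature, width, height, flags, bg_index, aspect = struct.unpack_from(
        "<6sHHBBB", data, 0
    )
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    has_table = bool(flags & 0x80)
    return {
        "version": signature[3:].decode("ascii"),
        "width": width,
        "height": height,
        "has_global_color_table": has_table,
        # The low three bits encode the table size as 2 ** (n + 1) entries.
        "global_color_table_entries": 2 ** ((flags & 0x07) + 1) if has_table else 0,
        "background_color_index": bg_index,
        "pixel_aspect_ratio": aspect,
        "rest": data[13:],  # opaque, still-undecoded remainder
    }


sample = b"GIF89a" + struct.pack("<HHBBB", 320, 200, 0x91, 0, 0) + b"\x00\x01"
print(parse_gif_header(sample))
```

The header falls out of a single `struct` format string; turning `rest` into pixels is where the LZW work mentioned in the quote begins.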

  • Grammars

    Grammars for Synalyze It! and Hexinator

  • hachoir

    Hachoir is a Python library to view and edit a binary stream field by field

  • restruct

    Rich binary (de)serialization library for Golang

  • I’m a big fan of Kaitai Struct, to the point where I’ve even contributed a few small improvements to its Go support, and I use it in a handful of small projects. It’s indispensable for spelunking blobs of binary data.

    I’ve also taken some inspiration from it for a Go library I wrote, restruct:

    https://github.com/go-restruct/restruct

    … which is a bit like Go’s JSON encoding/decoding library, but with kaitai-like annotations for binary encoding. (Check the PNG example to see some of what can be done with it.)

  • binrw

    A Rust crate for helping parse and rebuild binary data using ✨macro magic✨.

  • smm2-documentation

    Documentation for the game Super Mario Maker 2.

  • I can't claim credit for the API and reverse engineering of the data model (https://github.com/0Liam/smm2-documentation); I'm standing on the shoulders of giants here. But the original level viewer was a Windows app (there have since been a handful of web ports of it), and it seemed like an opportunity to learn a few technologies (Rust, WASM). There were a lot of interesting problems; I'll do a proper write-up one day.

  • FortniteReplayDecompressor

    Read Fortnite replay files

  • https://fortnitereplaydecompressor.readthedocs.io/en/latest/...

    Unreal Engine encodes network packets as a bit stream: when it wants to write a boolean, for example, it writes a single bit, so the integer that follows won't be byte-aligned.
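
A small Python sketch of what reading such a stream involves: a reader that tracks its position in bits rather than bytes, so a 32-bit integer can start in the middle of a byte. The bit order (least significant bit of each byte first) is an assumption made for this example, and `BitReader` is a hypothetical class, not taken from the Fortnite replay project.

```python
class BitReader:
    """Read values from a byte string at arbitrary bit offsets.

    Assumes bits are stored least-significant-bit first within each byte
    (an assumption for this sketch).
    """

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position, counted in bits

    def read_bit(self):
        if self.pos >= len(self.data) * 8:
            raise EOFError("read past end of bit stream")
        byte = self.data[self.pos >> 3]
        bit = (byte >> (self.pos & 7)) & 1
        self.pos += 1
        return bit

    def read_bool(self):
        return self.read_bit() == 1

    def read_uint(self, bits=32):
        # Assemble the value bit by bit, lowest bit first.
        value = 0
        for i in range(bits):
            value |= self.read_bit() << i
        return value


# One boolean bit followed by an unaligned 32-bit integer: 33 bits in 5 bytes.
packed = ((0xDEADBEEF << 1) | 1).to_bytes(5, "little")
reader = BitReader(packed)
print(reader.read_bool(), hex(reader.read_uint(32)), reader.pos)
```

Byte-oriented format descriptions struggle with layouts like this, because no field after the first bit starts on a byte boundary.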
