Magika: AI powered fast and efficient file type identification

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • file

    Read-only mirror of the file CVS repository, updated every half hour. NOTE: do not open pull requests here or comment on commits; submit them the usual way, to the bug tracker or the mailing list. The maintainers do not track this git mirror.

  • As someone who has worked in a space that has to deal with uploaded files for the last few years, and who maintains a WASM libmagic Node package (https://github.com/moshen/wasmagic), I have to say I really love seeing new entries into the file type detection space.

    Though I have to say when looking at the Node module, I don't understand why they released it.

    Their docs say it's slow:

    https://github.com/google/magika/blob/120205323e260dad4e5877...

    It loads the model at runtime:

    https://github.com/google/magika/blob/120205323e260dad4e5877...

    They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

    Also, as others have mentioned, the model appears to detect only 116 file types:

    https://github.com/google/magika/blob/120205323e260dad4e5877...

    Where libmagic detects... a lot. Over 1600 last time I checked:

    https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

    I guess I'm confused by this release. Sure, it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.

  • magika

    Detect file content types with deep learning

  • noseyparker

    Nosey Parker is a command-line program that finds secrets and sensitive information in textual data and Git history.

  • Yes!

    Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.

    This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.

    I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.

    I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.

    [1] https://github.com/praetorian-inc/noseyparker
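
To make the "unnamed bytestring" case above concrete, here is a minimal sketch of signature-based sniffing in plain Python. It is not how Nosey Parker, libmagic, or Magika work internally: the `SIGNATURES` table and the `sniff` helper are hypothetical names invented for illustration, and the table covers only a handful of well-known magic prefixes.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table: (magic prefix, label).
# Real databases such as libmagic's hold far more rules than this.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
    (b"\x7fELF", "elf"),
]


def sniff(data: bytes) -> Optional[str]:
    """Guess a content type from the leading bytes alone.

    No file name or extension is needed, so this also works for
    anonymous blobs. Returns None when no signature matches.
    """
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


if __name__ == "__main__":
    print(sniff(b"%PDF-1.7\n..."))      # matches the PDF prefix
    print(sniff(b"just some words"))    # no signature matches
```

A prefix table like this is fast but says nothing about text formats, which have no fixed header; that gap is roughly where a learned model could plausibly help.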

  • wasmagic

    A WebAssembly compiled version of libmagic with a simple API for Node. WASMagic provides accurate filetype detection with zero prod dependencies

  • KeenWrite

  • Space-Maker

  • mimemagic

    Mime type detection in ruby via file extension or file content

  • If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20
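
The overlay idea in the comment above might look roughly like the following sketch. This is not the code from the linked pull request and it is written in Python rather than Ruby; `BASE_RULES`, `OVERLAY_RULES`, `match`, and `detect` are hypothetical names, and the two-entry tables only stand in for a real magic database. The point is the lookup order: overlay rules are consulted first, and the base database is the fallback.

```python
from typing import List, Optional, Tuple

# A rule is (offset, magic bytes, MIME type).
Rule = Tuple[int, bytes, str]

# Stand-in for a base database: generic container-level matches.
BASE_RULES: List[Rule] = [
    (0, b"RIFF", "application/x-riff"),
    (0, b"PK\x03\x04", "application/zip"),
]

# Stand-in for an overlay: more specific rules that take priority.
# RIFF files carry a four-byte form type at offset 8. A careful
# version would also require the "RIFF" prefix at offset 0.
OVERLAY_RULES: List[Rule] = [
    (8, b"WEBP", "image/webp"),
    (8, b"WAVE", "audio/x-wav"),
]


def match(rules: List[Rule], data: bytes) -> Optional[str]:
    """Return the MIME type of the first rule whose bytes match."""
    for offset, magic, mime in rules:
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


def detect(data: bytes) -> Optional[str]:
    """Check the overlay first, then fall back to the base rules."""
    return match(OVERLAY_RULES, data) or match(BASE_RULES, data)
```

Keeping the overlay as a separate table means the base database can be updated wholesale without losing local corrections.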

  • kaitai_struct_formats

    Kaitai Struct: library of binary file formats (.ksy)

  • hachoir

    Hachoir is a Python library to view and edit a binary stream field by field

  • https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

    File signature:

  • clamav

    ClamAV - Documentation is here: https://docs.clamav.net

  • osv.dev

    Open source vulnerability DB and triage service.

  • Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

    https://github.com/google/osv.dev/blob/master/README.md#usin... :

    > We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ... With package metadata, not (a file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

    Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

    Add'l useful formats:

    > Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories

  • magic

    Racket implementation of the Unix file command's magic language (by jjsimpso)

  • I wrote an implementation of libmagic in Racket a few years ago (https://github.com/jjsimpso/magic). File type identification is a pretty interesting topic.

    As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
