Magika: AI powered fast and efficient file type identification

file

14 1,171 9.2 C

Read-only mirror of the file CVS repository, updated every half hour. NOTE: do not make pull requests here or comment on any commits; submit them the usual way, to the bug tracker or to the mailing list. The maintainer(s) are not tracking this git mirror.

As someone who has worked for the last few years in a space that has to deal with uploaded files, and who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ), I have to say I really love seeing new entries into the file type detection space.
Though I have to say, when looking at the Node module, I don't understand why they released it.
Their docs say it's slow:
https://github.com/google/magika/blob/120205323e260dad4e5877...
It loads the model at runtime:
https://github.com/google/magika/blob/120205323e260dad4e5877...
They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.
Also, as others have mentioned, the model appears to detect only 116 file types:
https://github.com/google/magika/blob/120205323e260dad4e5877...
Whereas libmagic detects... a lot. Over 1600 last time I checked:
https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...
I guess I'm confused by this release. Sure, it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.

magika

4 7,344 9.8 Python

Detect file content types with deep learning

noseyparker

13 1,511 9.4 Rust

Nosey Parker is a command-line program that finds secrets and sensitive information in textual data and Git history.

Yes!
Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.
This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.
I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.
I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.
[1] https://github.com/praetorian-inc/noseyparker
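The three cases above (no extension, a lying extension, an unnamed bytestring) can be illustrated with a minimal content-sniffing sketch. This is not how Nosey Parker, libmagic, or Magika work internally; the signature table, the `sniff` and `extension_lies` helpers, and the extension sets are all hypothetical and kept tiny on purpose.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical, deliberately tiny table: (magic bytes at offset 0, label, usual extensions).
# Real detectors know far more formats; ZIP-based containers such as .docx or .jar
# would be reported as mismatches by this naive table.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png", {"png"}),
    (b"%PDF-", "pdf", {"pdf"}),
    (b"PK\x03\x04", "zip", {"zip"}),
    (b"\x1f\x8b", "gzip", {"gz", "tgz"}),
]


def sniff(data: bytes) -> Optional[str]:
    """Return a label for an unnamed bytestring, or None if nothing matches."""
    for magic, label, _extensions in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def extension_lies(name: str, data: bytes) -> bool:
    """True when the content is recognised and the extension disagrees with it."""
    ext = PurePath(name).suffix.lstrip(".").lower()
    if not ext:
        return False  # no extension, so there is nothing to contradict
    for magic, _label, extensions in SIGNATURES:
        if data.startswith(magic):
            return ext not in extensions
    return False  # unknown content: this sketch cannot tell either way
```

For example, `sniff(b"%PDF-1.7")` labels a nameless blob as `pdf`, and `extension_lies("report.png", b"%PDF-1.7")` flags the mismatch.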

wasmagic

1 35 6.3 TypeScript

A WebAssembly compiled version of libmagic with a simple API for Node. WASMagic provides accurate filetype detection with zero prod dependencies

KeenWrite

6 - -

Space-Maker

1 0 - C#

mimemagic

18 416 0.0 Ruby

Mime type detection in ruby via file extension or file content

If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20
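The overlay idea can be sketched roughly as follows. This is only an illustration of layering extra magic-byte entries over a base table; it is not the code from the linked pull request, the two tables are made-up stand-ins (the real freedesktop.org database is much larger), and `detect_mime` is a hypothetical name.

```python
from collections import ChainMap
from typing import Mapping, Optional

# Stand-in for entries taken from a base database (assumption: simplified to prefix -> MIME type).
BASE_DB = {
    b"PK\x03\x04": "application/zip",
    b"%PDF-": "application/pdf",
}

# Hypothetical overlay: adds a missing entry and overrides one base entry.
OVERLAY = {
    b"PK\x03\x04": "application/x-example-archive",  # made-up override for illustration
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def detect_mime(
    data: bytes,
    base: Mapping[bytes, str] = BASE_DB,
    overlay: Mapping[bytes, str] = OVERLAY,
) -> Optional[str]:
    """Match magic bytes at offset 0; overlay entries shadow base entries."""
    merged = ChainMap(dict(overlay), dict(base))  # lookups hit the overlay first
    # Try longer signatures first so the most specific prefix wins.
    for magic in sorted(merged, key=len, reverse=True):
        if data.startswith(magic):
            return merged[magic]
    return None
```

Passing an empty overlay (`detect_mime(data, overlay={})`) falls back to the base table alone, which makes it easy to see what the overlay changes.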

kaitai_struct_formats

3 682 6.3 Kaitai Struct

Kaitai Struct: library of binary file formats (.ksy)

hachoir

3 586 6.4 Python

Hachoir is a Python library to view and edit a binary stream field by field

https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...
File signature:

clamav

39 3,751 9.1 C

ClamAV - Documentation is here: https://docs.clamav.net

osv.dev

19 1,405 9.7 Python

Open source vulnerability DB and triage service.

Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?
https://github.com/google/osv.dev/blob/master/README.md#usin... :
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.
So it works with package metadata, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifests of already-calculated checksums.
Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.
Add'l useful formats:
> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories
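As a rough illustration of the (file hash, package) lookup floated above, here is a minimal sketch using SHA-256 from the Python standard library. It is not part of OSV or its scanner; `sha256_stream` and `build_index` are hypothetical helpers, and the package contents are passed in as in-memory bytes purely to keep the example self-contained.

```python
import hashlib
import io
from typing import BinaryIO, Dict


def sha256_stream(stream: BinaryIO, chunk_size: int = 1 << 16) -> str:
    """Hash a binary stream in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def build_index(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map content hash -> package name (assumption: one file per package)."""
    return {
        sha256_stream(io.BytesIO(content)): package
        for package, content in files.items()
    }
```

With such an index, a file found on disk could be hashed with `sha256_stream(open(path, "rb"))` and looked up directly, rather than trusting a manifest of precomputed checksums.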

magic

1 7 - Racket

Racket implementation of the Unix file command's magic language (by jjsimpso)

I wrote an implementation of libmagic in Racket a few years ago ( https://github.com/jjsimpso/magic ). File type identification is a pretty interesting topic.
As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.
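To give a feel for what a magic-style rule looks like when evaluated, here is a small sketch that checks "offset, type, expected value" tests against a byte buffer. It is a simplification written for illustration, not the Racket implementation or libmagic itself: only three type names are handled, integers are treated as unsigned, and the rules and messages are illustrative rather than copied from the real magic database.

```python
import struct
from typing import NamedTuple, Optional, Sequence, Union


class Rule(NamedTuple):
    offset: int                   # where in the buffer to look
    kind: str                     # "string", "belong" (big-endian u32) or "leshort" (little-endian u16)
    expected: Union[bytes, int]   # value the bytes at `offset` must equal
    message: str                  # description reported on a match


# struct format strings for the integer kinds supported by this sketch.
_INT_FORMATS = {"belong": ">I", "leshort": "<H"}


def matches(rule: Rule, data: bytes) -> bool:
    """Evaluate a single rule against the buffer."""
    if rule.kind == "string":
        end = rule.offset + len(rule.expected)
        return data[rule.offset:end] == rule.expected
    fmt = _INT_FORMATS[rule.kind]
    size = struct.calcsize(fmt)
    chunk = data[rule.offset:rule.offset + size]
    if len(chunk) < size:
        return False  # buffer too short for this test
    return struct.unpack(fmt, chunk)[0] == rule.expected


# Illustrative rules only.
RULES = [
    Rule(0, "string", b"PK\x03\x04", "Zip archive data"),
    Rule(0, "belong", 0xCAFEBABE, "Java class or Mach-O universal binary"),
    Rule(0, "leshort", 0x5A4D, "MS-DOS executable"),
]


def describe(data: bytes, rules: Sequence[Rule] = RULES) -> Optional[str]:
    """Return the message of the first matching rule, or None."""
    for rule in rules:
        if matches(rule, data):
            return rule.message
    return None
```

Rules like these work well for rigid binary headers; free-form text has no fixed offsets to test, which is the gap the comment above suggests a learned model might fill.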

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Security security-tools Clamav kaitai-struct Credentials
Post date: 15 Feb 2024
