Magika: AI powered fast and efficient file type identification

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. file

    Read-only mirror of the file CVS repository, updated every half hour. NOTE: do not make pull requests here or comment on commits; submit them the usual way, to the bug tracker or the mailing list. The maintainers are not tracking this git mirror.

    As someone who has worked in a space that has to deal with uploaded files for the last few years, and someone who maintains a WASM libmagic Node package (https://github.com/moshen/wasmagic), I have to say I really love seeing new entries into the file type detection space.

    Though I have to say when looking at the Node module, I don't understand why they released it.

    Their docs say it's slow:

    https://github.com/google/magika/blob/120205323e260dad4e5877...

    It loads the model at runtime:

    https://github.com/google/magika/blob/120205323e260dad4e5877...

    They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

    Also, as others have mentioned, the model appears to detect only 116 file types:

    https://github.com/google/magika/blob/120205323e260dad4e5877...

    Where libmagic detects... a lot. Over 1600 last time I checked:

    https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

    I guess I'm confused by this release. Sure it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.
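
    ZIP archives are a common source of misidentification because many formats (for example .docx and .jar) are ZIP containers that share the same leading bytes. The following is a minimal sketch, using only the Python standard library, of how a detector might tell them apart by looking at member names. It is not how Magika or libmagic work internally; `classify_zip`, `make_zip`, and the labels are hypothetical names chosen for illustration.

```python
import io
import zipfile

# Local file header signature at the start of every non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def classify_zip(data: bytes) -> str:
    """Return a coarse label for ZIP-based content by inspecting member names."""
    if not data.startswith(ZIP_MAGIC):
        return "not-zip"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "corrupt-zip"
    # Office Open XML packages carry a [Content_Types].xml member.
    if "[Content_Types].xml" in names:
        if any(name.startswith("word/") for name in names):
            return "docx"
        if any(name.startswith("xl/") for name in names):
            return "xlsx"
        return "ooxml"
    # Java archives carry a manifest under META-INF/.
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive (helper for trying the sketch out)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()
```

    Calling `classify_zip(make_zip({"notes.txt": b"hello"}))` yields the generic `"zip"` label, while an archive containing `[Content_Types].xml` and a `word/` member is labelled `"docx"`. The first four bytes alone cannot separate these cases.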

  3. magika

    Detect file content types with deep learning

  4. noseyparker

    Nosey Parker is a command-line tool that finds secrets and sensitive information in textual data and Git history.

    Yes!

    Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.

    This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.

    I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.

    I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.

    [1] https://github.com/praetorian-inc/noseyparker
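
    The three situations described above (no extension, a misleading extension, an unnamed bytestring) can all be handled by looking at the content rather than the name. Below is a minimal sketch of prefix-based ("magic byte") sniffing in Python. The signature table is deliberately tiny, and `sniff` and `extension_lies` are hypothetical helpers, not part of Nosey Parker, libmagic, or Magika.

```python
from typing import Optional

# Hypothetical, deliberately small signature table: (byte prefix, label).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),
    (b"\x7fELF", "elf"),
]


def sniff(data: bytes) -> Optional[str]:
    """Return a label for an unnamed bytestring, or None if nothing matches."""
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    return None


def extension_lies(filename: str, data: bytes) -> bool:
    """True when the content is recognized and contradicts the extension."""
    detected = sniff(data)
    if detected is None or "." not in filename:
        # Unknown content or no extension: nothing to contradict.
        return False
    extension = filename.rsplit(".", 1)[1].lower()
    return extension != detected
```

    For example, `extension_lies("report.pdf", b"\x7fELF\x02\x01")` returns `True`, because the bytes start with the ELF signature. A fixed table like this only covers formats with a stable prefix, which is one reason larger rule sets or learned models exist.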

  5. wasmagic

    A WebAssembly compiled version of libmagic with a simple API for Node. WASMagic provides accurate filetype detection with zero prod dependencies

  6. KeenWrite

  7. Space-Maker

  8. mimemagic

    Mime type detection in ruby via file extension or file content

    If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20

  10. kaitai_struct_formats

    Kaitai Struct: library of binary file formats (.ksy)

  11. hachoir

    Hachoir is a Python library to view and edit a binary stream field by field

    https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

    File signature:

  12. clamav

    ClamAV - Documentation is here: https://docs.clamav.net

  13. osv.dev

    Open source vulnerability DB and triage service.

    Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

    https://github.com/google/osv.dev/blob/master/README.md#usin... :

    > We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ... With package metadata, not (a file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

    Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

    Add'l useful formats:

    > Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories

  14. magic

    Racket implementation of the Unix file command's magic language (by jjsimpso)

    I wrote an implementation of libmagic in Racket a few years ago (https://github.com/jjsimpso/magic). File type identification is a pretty interesting topic.

    As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.
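
    Text files have no magic number to match, so a signature-based tool can only fall back on heuristics. As a rough illustration of that limit, here is a minimal Python sketch that decides only "text or not"; `looks_like_text` and its `sample_size` parameter are hypothetical, and this is not the heuristic used by libmagic or Magika.

```python
import codecs


def looks_like_text(data: bytes, sample_size: int = 4096) -> bool:
    """Heuristically decide whether a byte string is UTF-8 text."""
    sample = data[:sample_size]
    # NUL bytes are rare in human-written text and common in binary formats.
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off by the sample.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

    This says nothing about whether the text is Python, Markdown, or a shell script. Telling those apart requires looking at the content's structure, which is where a learned model may have an advantage over fixed rules.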

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
