Revng translates (i386, x86-64, MIPS, ARM, AArch64, s390x) binaries to LLVM IR

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • rizin

    UNIX-like reverse engineering framework and command-line toolset.

  • Rizin[1] is also able to uplift native code to the new RzIL, which is based on the BAP Core Theory[2] and is essentially an extension of SMT theories of bitvectors, bitvector-indexed arrays of bitvectors and effects[3].

    [1] https://rizin.re/

    [2] https://binaryanalysisplatform.github.io/bap/api/master/bap-...

    [3] https://github.com/rizinorg/rizin/blob/dev/doc/rzil.md

  • revng

    revng: the core repository of the rev.ng project

  • remill

    Library for lifting machine code to LLVM bitcode

  • Usually such things are called lifters. Wonder how this tool compares to other existing LLVM IR lifters, such as remill[0] and rellume[1].

    0: https://github.com/lifting-bits/remill

  • rellume

    Lift machine code to performant LLVM IR

  • revng-qa

    Source for rev.ng test cases

  • > the binary code to LLVM IR uplifting loses a lot of context

    Losing context is actually good: it ensures the frontend is properly decoupled from the rest of the pipeline.

    We don't even keep track of what a "call" instruction is; we re-detect it on the LLVM IR.

    One reason you may want to preserve context is to let the user know where a specific piece of lifted code originated from. In order to preserve this information, we exploit LLVM's debugging metadata and it works pretty well. There's some loss there, but LLVM transformations strive to preserve it.

    After all, imagine you have `add rax, 4; add rax, 4`: you'll want to optimize it to a single +8, and then you have to decide whether to associate the +8 operation with the first or the second instruction.
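The attribution ambiguity described above can be sketched concretely. The following is a toy model, not rev.ng code: `LiftedOp`, `merge_adds`, and the addresses are all invented for illustration.

```python
# Hypothetical sketch (not rev.ng's implementation): merge two lifted
# immediate adds into one, as a peephole pass would, and pick which
# source address the merged operation reports as its debug location.

from dataclasses import dataclass

@dataclass
class LiftedOp:
    opcode: str    # e.g. "add"
    reg: str       # destination register
    imm: int       # immediate operand
    src_addr: int  # address of the original guest instruction

def merge_adds(ops):
    """Fold consecutive immediate adds on the same register into one op.

    The merged op keeps the *first* instruction's address as its debug
    location -- an arbitrary but consistent policy; keeping the second
    would be equally valid, which is exactly the ambiguity in question.
    """
    out = []
    for op in ops:
        prev = out[-1] if out else None
        if (prev is not None and prev.opcode == op.opcode == "add"
                and prev.reg == op.reg):
            prev.imm += op.imm  # fold +4 and +4 into +8
        else:
            out.append(op)
    return out

# `add rax, 4; add rax, 4` lifted from two (made-up) guest addresses:
ops = [LiftedOp("add", "rax", 4, 0x1000),
       LiftedOp("add", "rax", 4, 0x1003)]
merged = merge_adds(ops)
print(merged)  # a single op: add rax, 8, attributed to 0x1000
```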

    > the binary code to LLVM IR uplifting loses a lot of [...] semantics information

    Not sure what you mean here, we use QEMU as a lifter and that's very accurate in terms of semantics.

    I'm not sure what MIR and Swift IR have to do with the discussion; those are higher-level IRs for specific languages. LLVM is rather low-level and language-agnostic.

    However, for going beyond lifting, i.e., decompilation, it's true that LLVM shows some significant limitations. That's why we're rolling our own MLIR dialect, but we can still benefit from all the MLIR/LLVM infrastructure, optimizations, and analyses. We're not starting from scratch.

    > emulating pieces of the code sparsely to figure out indirect jumps and so on

    It's hard to emulate without starting from the beginning. Maybe you're thinking about symbolic execution?

    In any case, rev.ng does not emulate and does not do any symbolic execution: we have a data-flow analysis that detects the destinations of indirect jumps, and it's pretty scalable and effective. An example of the things we handle: https://github.com/revng/revng-qa/blob/master/share/revng/te...
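As a rough illustration of how a data-flow analysis can recover indirect-jump destinations without emulation or symbolic execution, here is a toy value-set propagation over an invented mini-IR (this is not rev.ng's actual pass; the instruction names and addresses are made up):

```python
# Illustrative sketch only: propagate *sets* of possible values through
# a toy instruction stream and thereby recover the destinations of an
# indirect jump through a masked jump table -- a common switch-lowering
# pattern in compiled code.

def possible_targets(program, table):
    """The value set reaching the indirect jump is the recovered
    list of destinations."""
    idx_values = None            # None means "unknown / any value"
    targets = set()
    for op, *args in program:
        if op == "and":          # masking bounds the index register
            mask = args[1]
            idx_values = set(range(mask + 1))
        elif op == "jmp_table":  # jmp table[idx]
            if idx_values is not None:
                targets = {table[i] for i in idx_values if i in table}
    return sorted(targets)

# Toy lifted code: `and idx, 3; jmp table[idx]` with a 4-entry table.
program = [("and", "idx", 3), ("jmp_table", "idx")]
table = {0: 0x400100, 1: 0x400200, 2: 0x400300, 3: 0x400400}
print([hex(t) for t in possible_targets(program, table)])
# → ['0x400100', '0x400200', '0x400300', '0x400400']
```

The point of the sketch: because the mask bounds the index, the analysis can enumerate every reachable table slot without ever executing the code.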

  • QEMU

    Official QEMU mirror. Please see https://www.qemu.org/contribute/ for how to submit changes to QEMU. Pull Requests are ignored. Please only use release tarballs from the QEMU website.

  • > architectural registers are always updated

    In tiny code, the guest registers (global TCG variables) are kept in the host's registers until you either call a helper that can access the CPU state or you return (`git grep la_global_sync`). This is why QEMU is not terribly slow.

    But after a check, this also happens when you access the guest memory address space! https://github.com/qemu/qemu/blob/master/include/tcg/tcg-opc... (TCG_OPF_SIDE_EFFECTS is what matters)

    But still, in the end, it's the same problem. What QEMU does can be done in LLVM too. You could probably be more efficient in LLVM by using the exception-handling mechanism (`invoke` and friends) to serialize back to memory only when there's an actual exception, at the cost of higher register pressure. That's more or less what we do here: https://rev.ng/downloads/bar-2019-paper.pdf
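The `invoke`-based idea can be mimicked in miniature: keep guest registers in fast locals on the hot path, and write them back to the memory-backed CPU state only on the exceptional path (or at block exit). Everything below is invented for illustration; it is not QEMU or rev.ng code.

```python
# Hedged sketch of lazy state serialization: locals stand in for host
# registers, the CPUState object stands in for the in-memory guest
# state, and the except handler plays the role of an LLVM landing pad.

class CPUState:
    def __init__(self):
        self.rax = 0
        self.rbx = 0

class GuestFault(Exception):
    """Stands in for a faulting guest memory access."""

def run_block(cpu, trip_count, fault_at=None):
    # "Host registers": plain locals, cheap to update.
    rax, rbx = cpu.rax, cpu.rbx
    try:
        for i in range(trip_count):
            if i == fault_at:
                raise GuestFault(i)  # exceptional path
            rax += 4                 # hot path: no CPUState traffic
            rbx ^= rax
    except GuestFault:
        # "Landing pad": serialize only now, when the architectural
        # state must actually be observable.
        cpu.rax, cpu.rbx = rax, rbx
        raise
    # Normal exit: a single write-back at the end of the block.
    cpu.rax, cpu.rbx = rax, rbx

cpu = CPUState()
run_block(cpu, 10)
print(cpu.rax)  # → 40
```

The trade-off mentioned above shows up even here: the live values `rax`/`rbx` must stay available across the whole block, which in real code translates to higher register pressure.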

