Hunting down a C memory leak in a Go program

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • kafka-go

    Kafka library in Go

  • Segment learned quite some time ago that librdkafka-go has problems like these (and doesn’t support Contexts either), so they wrote a pure Go replacement instead. https://github.com/segmentio/kafka-go

  • Confluent Kafka Golang Client

    Confluent's Apache Kafka Golang client

  • So, in the interests of full transparency - we at Zendesk are actually running a fork of confluent-kafka-go, which I forked to add, amongst other things, context support: https://github.com/confluentinc/confluent-kafka-go/pull/626

    This bug actually happened because I mis-merged upstream into our fork and missed an important call to rd_kafka_poll_set_consumer: https://github.com/zendesk/confluent-kafka-go/commit/6e2d889...

  • confluent-kafka-go

    Confluent's Apache Kafka Golang client (by zendesk)

  • jemalloc

  • Nice write-up! Using BPF to trace malloc/free is a good example of the tool’s power. Unfortunately, IME, this approach doesn’t scale to very high-load services. Once you’re calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

    It would be great if one could configure the uprobes for malloc/free to trigger one in N times but when I last looked they were unconditional. It didn’t help to have the BPF probe just return early, either — the cost is in getting into the kernel to start with.

    However, jemalloc itself has great support for producing heap profiles with low overhead. Allocations are sampled and the stacks leading to them are recorded in much the same way as the linked BPF approach:

    https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Pro...
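
To make the sampling idea concrete, here is a minimal Go sketch that pays for a stack walk on only one event in N and then scales the sampled counts back up. It is an illustration only: the `sampler` type and its fixed every-Nth rule are hypothetical, and this is not how jemalloc or BPF implement sampling (jemalloc's profiler, as far as I know, samples by bytes allocated rather than by call count).

```go
package main

import (
	"fmt"
	"runtime"
)

// sampler records the caller of every Nth event it is shown.
// Hypothetical type for illustration; not part of jemalloc or any BPF tool.
type sampler struct {
	every   int            // sample one event in `every`
	seen    int            // total events observed
	samples map[string]int // calling function name -> number of samples
}

func newSampler(every int) *sampler {
	return &sampler{every: every, samples: make(map[string]int)}
}

// observe stands in for an allocation hook. Most calls only bump a counter;
// one in `every` also walks the stack, which is the expensive part.
func (s *sampler) observe() {
	s.seen++
	if s.seen%s.every != 0 {
		return
	}
	pcs := make([]uintptr, 16)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and observe itself
	frame, _ := runtime.CallersFrames(pcs[:n]).Next()
	s.samples[frame.Function]++
}

// busyPath simulates a hot code path that "allocates" on every iteration.
func busyPath(s *sampler, iterations int) {
	for i := 0; i < iterations; i++ {
		s.observe()
	}
}

func main() {
	s := newSampler(100)
	busyPath(s, 50000)
	for fn, count := range s.samples {
		// Each sample stands for roughly `every` real events.
		fmt.Printf("%s: %d samples, about %d events\n", fn, count, count*s.every)
	}
}
```

The trade-off is the usual one for sampling profilers: rare call sites may be missed entirely, but a leak that dominates the heap will dominate the samples too.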

  • bytehound

    A memory profiler for Linux.

  • > Once you’re calling malloc/free hundreds of thousands of times a second the overhead of jumping into the kernel every time cripples performance.

    Shameless plug in case you (or anyone else) are interested, I wrote a memory profiler for exactly this use case:

    https://github.com/koute/bytehound

    It's definitely not perfect, but it's relatively fast, has an okay-ish GUI, and it's even scriptable: https://koute.github.io/bytehound/memory_leak_analysis.html

  • leakdice

    Monte Carlo leak diagnostic for Linux binaries

  • (or there's a Rust rewrite https://github.com/tialaramex/leakdice-rust because I was learning Rust)

    leakdice is not a clever, sophisticated tool like valgrind, or eBPF programming, but that's fine because this isn't a subtle problem - it's very blatant - and running leakdice takes seconds so if it wasn't helpful you've lost very little time.

    Here's what leakdice does: It picks a random heap page of a running process, which you suspect is leaking, and it displays that page as ASCII + hex.

    That's all, and that might seem completely useless, unless you either read Raymond Chen's "The Old New Thing" or you paid attention in statistics class.

    Because your program is leaking so badly, the vast majority of heap pages (leakdice counts any pages which are writable and anonymous) are leaked. Any random heap page, therefore, is probably leaked. Now, if that page is full of zero bytes you don't learn very much; it's just leaking blank pages, which is hard to diagnose. But most often you're leaking (as was happening here) something with structure, and very often the sort of engineer assigned to investigate a leak can look at a 4 KB page of structure and go "Oh, I know what that is" from staring at the output in hex + ASCII.

    This isn't a silver bullet, but it's very easy and you can try it in like an hour (not days, or a week) including writing up something like "Alas the leaked pages are empty" which isn't a solution but certainly clarifies future results.
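
The approach described above can be sketched in Go roughly as follows. This is not leakdice's actual code: the names `parseMaps`, `countPages` and `pageAt` are made up for illustration, the 4096-byte page size is an assumption, and the "writable and anonymous" filter is a simplified reading of `/proc/<pid>/maps`. Reading `/proc/<pid>/mem` generally needs ptrace-level permission on the target (same user or root, depending on system settings).

```go
package main

import (
	"encoding/hex"
	"fmt"
	"math/rand"
	"os"
	"strconv"
	"strings"
	"time"
)

const pageSize = 4096 // assumed; os.Getpagesize() reports the real value

// region is one address range taken from /proc/<pid>/maps.
type region struct{ start, end uint64 }

// parseMaps keeps writable mappings that have no backing path, plus [heap].
func parseMaps(maps string) []region {
	var out []region
	for _, line := range strings.Split(maps, "\n") {
		f := strings.Fields(line)
		if len(f) < 5 || len(f[1]) < 2 || f[1][1] != 'w' {
			continue // malformed line or not writable
		}
		if len(f) > 5 && f[5] != "[heap]" {
			continue // file-backed or special mapping such as [stack]
		}
		parts := strings.SplitN(f[0], "-", 2)
		if len(parts) != 2 {
			continue
		}
		start, err1 := strconv.ParseUint(parts[0], 16, 64)
		end, err2 := strconv.ParseUint(parts[1], 16, 64)
		if err1 == nil && err2 == nil && end > start {
			out = append(out, region{start, end})
		}
	}
	return out
}

// countPages returns how many whole pages the regions cover.
func countPages(regions []region) uint64 {
	var total uint64
	for _, r := range regions {
		total += (r.end - r.start) / pageSize
	}
	return total
}

// pageAt returns the address of the n-th page across all regions.
func pageAt(regions []region, n uint64) (uint64, bool) {
	for _, r := range regions {
		pages := (r.end - r.start) / pageSize
		if n < pages {
			return r.start + n*pageSize, true
		}
		n -= pages
	}
	return 0, false
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: pagepeek <pid>")
		return
	}
	pid := os.Args[1]
	maps, err := os.ReadFile("/proc/" + pid + "/maps")
	if err != nil {
		fmt.Println("read maps:", err)
		return
	}
	regions := parseMaps(string(maps))
	total := countPages(regions)
	if total == 0 {
		fmt.Println("no writable anonymous pages found")
		return
	}
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	addr, _ := pageAt(regions, uint64(rng.Int63n(int64(total))))
	mem, err := os.Open("/proc/" + pid + "/mem")
	if err != nil {
		fmt.Println("open mem:", err)
		return
	}
	defer mem.Close()
	buf := make([]byte, pageSize)
	if _, err := mem.ReadAt(buf, int64(addr)); err != nil {
		fmt.Println("read page:", err)
		return
	}
	fmt.Printf("page at %#x\n%s", addr, hex.Dump(buf)) // hex + ASCII columns
}
```

Picking a page index uniformly over all candidate pages is what makes the statistics work: if most pages are leaked, most draws land on a leaked page, and running it a few times is cheap.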

  • leakdice-rust

    Rust re-implementation of leakdice

  • librdkafka

    The Apache Kafka C/C++ library

  • I wonder if statistics provided by librdkafka (available also with confluent-kafka-go) could have been used to solve the issue with less effort.

    https://github.com/edenhill/librdkafka/blob/master/STATISTIC...
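
As a rough sketch of that suggestion: librdkafka emits its statistics as a JSON document at the interval set by the `statistics.interval.ms` property, and confluent-kafka-go hands it to the application as a `kafka.Stats` event. The Go below only parses such a document and checks whether a counter keeps climbing, so it runs without a broker. The sample payloads are invented, only a few top-level fields (`name`, `replyq`, `msg_cnt`, `msg_size`) are read, and whether watching `replyq` would actually have exposed this particular leak is an assumption, not something the original post confirms.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsSnapshot picks out a few top-level fields of the statistics JSON.
// Fields not listed here are ignored by encoding/json.
type statsSnapshot struct {
	Name    string `json:"name"`
	ReplyQ  int64  `json:"replyq"`   // ops waiting for the application to poll
	MsgCnt  int64  `json:"msg_cnt"`  // messages currently queued
	MsgSize int64  `json:"msg_size"` // bytes currently queued
}

// parseStats decodes one statistics document.
func parseStats(payload string) (statsSnapshot, error) {
	var s statsSnapshot
	err := json.Unmarshal([]byte(payload), &s)
	return s, err
}

// steadilyGrowing reports whether the values never decrease and end higher
// than they started, a crude signal that something is never being drained.
func steadilyGrowing(values []int64) bool {
	if len(values) < 2 {
		return false
	}
	for i := 1; i < len(values); i++ {
		if values[i] < values[i-1] {
			return false
		}
	}
	return values[len(values)-1] > values[0]
}

func main() {
	// Invented payloads standing in for successive kafka.Stats events.
	payloads := []string{
		`{"name":"consumer-1","replyq":120,"msg_cnt":0,"msg_size":0}`,
		`{"name":"consumer-1","replyq":4800,"msg_cnt":0,"msg_size":0}`,
		`{"name":"consumer-1","replyq":19500,"msg_cnt":0,"msg_size":0}`,
	}
	var replyq []int64
	for _, p := range payloads {
		snap, err := parseStats(p)
		if err != nil {
			fmt.Println("bad stats payload:", err)
			return
		}
		replyq = append(replyq, snap.ReplyQ)
	}
	if steadilyGrowing(replyq) {
		fmt.Println("replyq keeps growing; is something not being polled?")
	}
}
```

In a real service the same check would run inside the event loop that receives the statistics events, typically exporting the values as metrics rather than printing them.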

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
