Golang is evil on shitty networks

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • go

    The Go programming language

    If you trace this all the way back, it's been in the Go networking stack since the beginning, with the simple commit message "preliminary network - just Dial for now" [0]. You can see the exact line in the 2008 repository here [1].

    As an aside, it was interesting to chase the history of this line of code: it was first made with a public SetNoDelay function, then with a direct system call, then back to an abstract call. Along the way it was also broken out into a platform-specific library, then moved back into a general library and gone over with a pass from gofmt, all over a "short" 14 years.

    0 - https://github.com/golang/go/commit/e8a02230f215efb075cccd41...

  • git-lfs

    Git extension for versioning large files

    > And that pattern is the one that is used by GOs http libraries

    I don't think that is correct. In https://news.ycombinator.com/item?id=34213383, I noticed that Go's HTTP/2 library would write the HEADERS frame, the DATA frame, and the terminal HEADERS frame in 3 different syscalls. In a sample application using Go's HTTP/2 library, a gRPC response without Nagle's algorithm would transmit 497 bytes over 6 packets, while a gRPC response with Nagle's algorithm would transmit 275 bytes over 2 packets.

    With a starting point where both Nagle's algorithm and delayed ack are enabled, I guess this is the order of preference:

    1. delayed ack disabled, applications do the right thing by buffering accordingly - ideal performance, but it is difficult to disable delayed ack, and it may require a lot of work to fix the applications.

    2a. Nagle's algorithm disabled, applications do the right thing by buffering accordingly - almost ideal performance (may perform worse than #1 over a bad connection), but it may require a lot of work to fix the applications.

    2b. delayed ack disabled, real world applications - almost ideal performance (may have higher syscall overhead than #1), but it is difficult to disable delayed ack.

    3. Nagle's algorithm disabled, real world applications - not ideal as some applications can suffer from high packet overhead, e.g. git-lfs, and this is where we are at with Go.

    4. baseline - far from ideal as many applications can suffer from high latency due to bad interaction between Nagle's algorithm and delayed ack.

    I would say Go has made the right trade-off, albeit with a slight hint of "we know better than you". Going forward, it is probably cheaper for the Linux kernel to come up with a better API to disable delayed ack (i.e. to achieve #2b) than to get the affected applications to do the right thing by buffering accordingly (i.e. to achieve #1 or #2a). We will see how soon https://github.com/git-lfs/git-lfs/issues/5242 can be resolved.

    In the meantime, #2b can actually be achieved with an "SRE approach" by patching the kernel to remove delayed ack and patching the Go library to remove the `setNoDelay` call. Something for OP to try?
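
    For code that owns its own connections, a less invasive option than patching the Go library is to turn Nagle's algorithm back on per connection with the public SetNoDelay method on net.TCPConn. The following is a minimal sketch, not what git-lfs or Go's HTTP libraries do: the helper name enableNagle and the loopback listener are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
)

// enableNagle (hypothetical helper) turns Nagle's algorithm back on for a
// TCP connection. Go's net package sets TCP_NODELAY on new TCP connections
// by default; SetNoDelay(false) reverses that for this one connection.
func enableNagle(conn net.Conn) error {
	tcp, ok := conn.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("not a TCP connection: %T", conn)
	}
	return tcp.SetNoDelay(false)
}

func main() {
	// A loopback listener stands in for a real server (illustrative only).
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	if err := enableNagle(conn); err != nil {
		fmt.Println("enableNagle:", err)
		return
	}
	fmt.Println("Nagle's algorithm re-enabled on", conn.RemoteAddr())
}
```

    This only changes the sending side of one connection; it does nothing about delayed ack on the peer, so it corresponds to moving back toward the baseline (#4) rather than to #2b.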

  • Caddy

    Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS

    I ran into a similar phantom-traffic problem from Go ignoring the Linux default for TCP keepalives and sending them every 15 seconds, which is very wasteful for mobile devices. While I quite like the rest of Go, I don't see why they have to be so opinionated and ignore the OS in their network defaults.

    My PR fixing that in Caddy: https://github.com/caddyserver/caddy/pull/4865
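
    As a rough illustration of how an application can override that default (this is not the change made in the Caddy PR), the KeepAlive field on net.Dialer controls the probe interval: zero selects Go's default, a positive value sets the interval, and a negative value disables the probes. The helper names and the chosen durations below are assumptions for the sketch.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialerWithKeepAlive (hypothetical helper) returns a net.Dialer whose TCP
// keep-alive probe interval is set explicitly instead of relying on Go's
// default.
func dialerWithKeepAlive(interval time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout:   10 * time.Second, // arbitrary value for this sketch
		KeepAlive: interval,
	}
}

// dialerWithoutKeepAlive (hypothetical helper) returns a net.Dialer with
// keep-alive probes disabled; a negative KeepAlive turns them off.
func dialerWithoutKeepAlive() *net.Dialer {
	return dialerWithKeepAlive(-1)
}

func main() {
	slow := dialerWithKeepAlive(2 * time.Hour) // arbitrary long interval
	off := dialerWithoutKeepAlive()
	fmt.Println("probe interval:", slow.KeepAlive)
	fmt.Println("probes disabled:", off.KeepAlive < 0)
}
```

    On the server side, net.ListenConfig has a KeepAlive field with the same meaning for accepted connections.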

  • rke2

    Golang has burned me more than once with bizarre design decisions that break things in a user hostile way.

    The last one we ran into was a change in Go 1.15 where servers that presented a TLS certificate with the hostname encoded into the CN field instead of the more appropriate SAN field always fail validation.

    The new behavior could be disabled at first, but that option was removed in 1.18, leaving no way to opt back into the old behavior. I understand why SAN is the right way to do it, but in this case I didn't control the server.

    Developers at Google probably never have to deal with 3rd parties with shitty infrastructure but a lot of us do.

    Here's a related bug in rke2: https://github.com/rancher/rke2/issues/775

  • certificate-transparency-go

    Auditing for TLS certificates (Go code)

    The x509 package has unfortunately burned me several times, this one included. It is so strict about non-fatal errors that Google themselves forked it (and asn1) to improve usability.

    https://github.com/google/certificate-transparency-go

  • plan9port

    Plan 9 from User Space

    That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message "various tweaks". If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!

    I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn't understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle's algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. It was a terrible default for trying to do anything that cared about latency.

    When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.

    As others have pointed out at length elsewhere in these comments, it's a completely reasonable default.

    I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.

    And to answer the question in the article:

    > Much (all?) of Kubernetes is written Go, and how has this default affected that?

    I'm quite confident that this default has greatly improved the default server latency in all the various kinds of servers Kubernetes has. It was the right choice for Go, and it still is.

    [1] https://github.com/9fans/plan9port/commit/d51419bf4397cf13d0...

    [2] https://news.ycombinator.com/item?id=34180239

    [3] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-65...
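
    The point above about larger writes can be sketched with a bufio.Writer placed in front of the connection. In this minimal example a counting writer stands in for the socket, so each Write call it sees represents one system call; the type and function names, the 32 KiB buffer size, and the 1000 x 50-byte workload are all illustrative assumptions, not git-lfs code.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts Write calls; each call stands in for one write
// system call on a real socket.
type countingWriter struct {
	calls int
}

func (w *countingWriter) Write(p []byte) (int, error) {
	w.calls++
	return len(p), nil
}

// writeChunks writes n chunks of size bytes to dst, one Write per chunk.
func writeChunks(dst io.Writer, n, size int) error {
	chunk := make([]byte, size)
	for i := 0; i < n; i++ {
		if _, err := dst.Write(chunk); err != nil {
			return err
		}
	}
	return nil
}

// countWrites reports how many writes reach the underlying writer when n
// chunks are sent either directly or through a bufio.Writer.
func countWrites(n, size int, buffered bool) (int, error) {
	sink := &countingWriter{}
	if !buffered {
		err := writeChunks(sink, n, size)
		return sink.calls, err
	}
	bw := bufio.NewWriterSize(sink, 32*1024)
	if err := writeChunks(bw, n, size); err != nil {
		return sink.calls, err
	}
	// Flush pushes whatever is still buffered to the underlying writer.
	err := bw.Flush()
	return sink.calls, err
}

func main() {
	direct, _ := countWrites(1000, 50, false)
	viaBuffer, _ := countWrites(1000, 50, true)
	fmt.Println("direct writes:", direct)
	fmt.Println("buffered writes:", viaBuffer)
}
```

    With a real net.Conn in place of the counting writer, the buffered path hands the TCP stack a few large writes instead of many tiny ones, so packet sizes no longer depend on whether Nagle's algorithm is enabled. The caller has to remember to call Flush at message boundaries.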

  • libnodelay

    A small wrapper library that adds the TCP_NODELAY option for all sockets.

    > not to mention nearly 50% of every packet was literally packet headers

    I was just looking at a similar issue with grpc-go, where it would somehow send a HEADERS frame, a DATA frame, and a terminal HEADERS frame in 3 different packets. The grpc server is a golang binary (lightstep collector), which definitely disables Nagle's algorithm as shown by strace output, and the flag can't be flipped back via the LD_PRELOAD trick (e.g. with a flipped version of https://github.com/sschroe/libnodelay) as the binary is statically linked.

    I can't reproduce this with a dummy grpc-go server, where all 3 frames would be sent in the same packet. So I can't blame Nagle's algorithm, but I am still not sure why the lightstep collector behaves differently.

  • .NET Runtime

    .NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

    > I've seen people writing C# applications and wondering why stuff is taking 200ms

    I observe that in the most recent generation of its HTTP client (SocketsHttpHandler), .NET also sets NoDelay by default.

    https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

  • s2n

    An implementation of the TLS/SSL protocols

    > The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]

    This is correct. And in the end it means more or less that setting the socket option is more of a way of sending an explicit ACK from userspace than a real setting.

    It's not great for common use-cases, because making userspace care about ACKs will obviously degrade efficiency (more syscalls).

    However it can make sense for some use-cases. E.g. I saw the s2n TLS library using QUICKACK to avoid the TLS handshake being stuck [1]. Maybe also worthwhile to be set in some specific RPC scenarios where the server might not immediately send a response on receiving the request, and where the client could send additional frames (e.g. gRPC client side streaming, or in pipelined HTTP requests if the server would really process those in parallel and not just let them sit in socket buffers).

    [1] https://github.com/aws/s2n-tls/blob/46c47a71e637cabc312ce843...

  • kubernetes

    Production-Grade Container Scheduling and Management

    There's been a highly annoying kubectl port-forward heisenbug open for several years which smells an awful lot like one of these dark Go network layer corners. You get a good connection established and some data flows, but at some random point it decides to drop. It's not annoying enough for any wizards to fix.

    https://github.com/kubernetes/kubernetes/issues/74551

  • grpc-go

    The Go language implementation of gRPC. HTTP/2 based RPC

    Found the root cause from https://github.com/grpc/grpc-go/commit/383b1143 (original issue: https://github.com/grpc/grpc-go/issues/75):

        // Note that ServeHTTP uses Go's HTTP/2 server implementation which is
