The technology behind GitHub’s new code search

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • sourcegraph

    Code AI platform with Code Search & Cody

  • Re: Sourcegraph, we're working on improving that, and sorry you couldn't get the results you wanted. We primarily build for our customers' own code, where this particular problem is less common than it is across all open-source repositories. But we want it to work really well in every case.

    Our new ranking (https://about.sourcegraph.com/blog/new-search-ranking) should help a lot here, and it's live on https://sourcegraph.com. Can you share some of the queries you tried so we can see how much ranking helps and how to handle them better?

    Search is a fascinating topic because it's such a fundamental problem, and every search engine is built around the same extremely simple data structure: the posting list, a.k.a. the inverted index. Despite that, search isn't easy, and every search engine seems to be quite unique. It also seems to get dramatically harder with scale.

    You can write your own search engine that will perform very well on a surprisingly large amount of data, even doing naive full-text search. A search tool I came across a while back is a great example of something at that scale: https://pagefind.app/.

    For anyone who doesn't know anything about search, I highly recommend reading this (it's mentioned in the blog post as well): https://swtch.com/~rsc/regexp/regexp4.html.
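
    To make the trigram idea from that article concrete, here is a toy sketch of my own (not GitHub's or Google's actual implementation): keep a posting list of document IDs per trigram, answer a substring query by intersecting the posting lists of the query's trigrams, then verify candidates, since the intersection can over-approximate.

```python
from collections import defaultdict

def trigrams(text):
    """All distinct overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, query):
        """Docs containing query as a substring (query must be >= 3 chars)."""
        grams = trigrams(query)
        # A true match must contain every trigram of the query.
        candidates = set.intersection(*(self.postings[g] for g in grams))
        # The intersection is only a candidate set, so verify each hit.
        return sorted(d for d in candidates if query in self.docs[d])

index = TrigramIndex()
index.add(1, "def drawPoint(self):")
index.add(2, "class DeleteView(View):")
index.add(3, "point = draw(x, y)")
print(index.search("drawPoint"))  # [1]
```

    The final verification step is the key trick: the index never needs to be exact, only to shrink the candidate set enough that scanning the survivors is cheap.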

    Algolia also has a series of blog posts describing how their search engine works: https://www.algolia.com/blog/engineering/inside-the-algolia-....

    ---

    It's interesting that GitHub seems to have quite a few shards. Algolia, by contrast, has a basically monolithic architecture with three hosts that replicate data, and they embed their search engine directly in Nginx:

    "Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client."

    I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:

    "Each index is a binary file in our own format. We put the information in a specific order so that it is very fast to perform queries on it."

    "Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure."

    https://stackshare.io/posts/how-algolia-built-their-realtime...

    100ms p99 seems pretty good, but I'm curious what the p50 is and how much time is spent searching vs. ranking. I've seen Dan Luu say that the majority of time should be spent ranking rather than searching, and when I've snooped on https://hn.algolia.com I've seen single-digit-millisecond search times in the responses, which seems to corroborate this.

    I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude, and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though; I have to assume that's a bigger win than 50% faster ingestion.

    Since they're indexing by language, I wonder if they have custom indexing/searching for each language, or if their ngram strategy is generic over all languages. Perhaps their "sparse grams" naturally tokenize differently for every language. Hard to tell when they leave out the juiciest part of the strategy, though: "Assume you have some function that given a bigram gives a weight".
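
    The blog post only hints at the sparse-gram construction, so the following is purely a guess at one plausible reading, not GitHub's algorithm: given some weight function over bigrams (the frequency table below is entirely made up), cut the string at bigrams whose weight is a local maximum, yielding variable-length grams instead of fixed trigrams. Both the weights and the splitting rule here are hypothetical.

```python
# Hypothetical bigram weights; a real system would presumably derive
# these from corpus statistics (e.g. rarer bigrams weigh more).
WEIGHTS = {"th": 1, "he": 1, "in": 2, "er": 2, "dr": 7, "aw": 6, "Po": 9}

def weight(bigram):
    return WEIGHTS.get(bigram, 5)  # arbitrary default for unseen bigrams

def sparse_grams(s):
    """Split s into variable-length grams, cutting where the bigram
    weight is a local maximum (one guess at a 'sparse gram' scheme)."""
    if len(s) < 3:
        return [s]
    w = [weight(s[i:i + 2]) for i in range(len(s) - 1)]
    cuts = [0]
    for i in range(1, len(w) - 1):
        if w[i] > w[i - 1] and w[i] >= w[i + 1]:  # local maximum
            cuts.append(i + 1)
    cuts.append(len(s))
    return [s[a:b] for a, b in zip(cuts, cuts[1:])]
```

    Whatever the real rule is, the appeal is clear: the gram boundaries depend only on local bigram weights, so both the indexer and the query side can compute the same grams independently, and the weights can differ per language without any language-specific tokenizer.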

    Search is so cool. I could talk about it all day.

  • Django

    The Web framework for perfectionists with deadlines.

  • And yet, if I search [0] the Django repo for a class that definitely exists [1] in Django, there are 0 code results. Zero. GitHub search is mystifyingly bad.

    [0] https://github.com/django/django/search?q=DeleteView&type=co...

    [1] https://github.com/django/django/blob/main/django/views/gene...

  • imdb-rename

    A command line tool to rename media files based on titles from IMDb.

  • What a shit take. The article itself is perhaps a nice light overview of 101-ish level concepts, although knowing how and when to apply them in a real engineering context is not something I would consider 101 level. And certainly, building something that is actually at the scale of GitHub Search is nowhere near 101 level.

    This is what a 101-level inverted index implementation looks like: https://github.com/BurntSushi/imdb-rename
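
    For contrast, a genuinely 101-level inverted index fits in a few lines (this toy is mine, not imdb-rename's code): map each word to the set of documents containing it, and intersect those sets at query time.

```python
from collections import defaultdict

# Toy word-level inverted index: word -> set of document ids.
index = defaultdict(set)
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs bark",
}
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(*words):
    """Documents containing every query word (exact token match only)."""
    return sorted(set.intersection(*(index[w] for w in words)))

print(search("quick", "brown"))  # [1, 3]
```

    Everything hard about real search engines (tokenization, ranking, compression, incremental updates, sharding) is exactly what this toy leaves out.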

    In other words, absolutely nothing like what GitHub built. Nowhere close.

  • stack-graphs

    Rust implementation of stack graphs

  • > It doesn't have the faintest idea where the name is defined, or if there's even a difference between a function name, a parameter name, or a word in a comment.

    I don't think what you are saying is actually true for stack-graphs[0][1].

    [0]: https://github.com/github/stack-graphs

    [1]: https://github.blog/2021-12-09-introducing-stack-graphs/

  • pagefind

    Static low-bandwidth search at scale

  • bar

  • Yes, just change the URL from https://github.com/foo/bar to https://sourcegraph.com/github.com/foo/bar to be dropped in to a code search for that GH repo.

  • lsif-clang

    Discontinued Language Server Indexing Format (LSIF) generator for C, C++ and Objective C

  • In the top right corner of the tooltip it will say either "Search-based" or "Precise" - in this case, you're right, we don't have the abseil-cpp repo indexed so it falls back to search-based as you describe.

    We do have a C++ code indexer in beta, https://github.com/sourcegraph/lsif-clang - it is based on clang, but C++ indexing is notably harder to do automatically (without setup) because of the varying build systems that need to be understood in order to invoke the compiler.

  • gitlab

  • GitLab team member. Thanks for the question.

    Our Code Search team is currently working on moving to Zoekt[0] which is expected to be a significant improvement as it is purpose-built for code search.

    We also shipped an improvement[1] to our existing search functionality at the end of last year. If you haven't used it recently, I'd encourage you to check out code search again to see if the quality has been improved for you.

    [0] - https://gitlab.com/groups/gitlab-org/-/epics/9404

    [1] - https://gitlab.com/gitlab-org/gitlab/-/issues/346914

  • scip

    SCIP Code Intelligence Protocol

  • This is pretty much exactly what we've built at Sourcegraph. Microsoft had introduced (but pretty much abandoned before it even started) LSIF, a static index format for LSP server requests/responses.

    We took that torch and carried it forward, building the spiritual successor called SCIP[0]. It's language agnostic, we have indexers for quite a few languages already, and we genuinely intend for it to be vendor neutral / a proper OSS project[1].

    [0] https://about.sourcegraph.com/blog/announcing-scip

    [1] https://github.com/sourcegraph/scip

  • llvm-project

    The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

  • This is exciting! I see a lot of familiar pieces here that propagated from Google's Code Search, and I know a few people from Code Search went to GitHub, probably specifically to work on this. I always wondered why GitHub didn't invest in decent code search features, but I'm happy it's finally getting to the state of the art one step at a time. Some of the folks who went to GitHub to work on this are just incredible, and I have no doubt GitHub's code search will be amazing.

    I also worked on something similar to the search engine described here, for the purpose of making auto-complete fast for C++ in Clangd. That was my intern project back in 2018, and it was very successful in reducing latency in the auto-complete pipeline. That project was a lot of fun and was also based on Russ Cox's original Google Code Search trigram index. My implementation of the index is still largely untouched and is on a hot path in Clangd. I made a huge effort to document it as much as I could, and the code is, I believe, very readable (although I'm obviously biased because I spent a lot of time with it).
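
    As a rough illustration of the general idea behind such an identifier index (trigram posting lists over symbol names for fast fuzzy lookup), here is a toy sketch; the scoring rule is made up and the real Clangd matcher is far more elaborate.

```python
from collections import defaultdict

def grams(s, n=3):
    """Distinct lowercase n-grams of s (whole string if shorter than n)."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

# Toy symbol index: trigram -> identifiers containing it.
symbols = ["drawPoint", "drawPolygon", "DeleteView", "setPoint"]
postings = defaultdict(set)
for sym in symbols:
    for g in grams(sym):
        postings[g].add(sym)

def complete(query, limit=3):
    """Rank identifiers by how many query trigrams they share
    (a made-up score; ties broken alphabetically)."""
    scores = defaultdict(int)
    for g in grams(query):
        for sym in postings[g]:
            scores[sym] += 1
    return sorted(scores, key=lambda s: (-scores[s], s))[:limit]

print(complete("drawpnt"))  # ['drawPoint', 'drawPolygon']
```

    The point of the posting lists is that only symbols sharing at least one trigram with the query are ever scored, which is what keeps per-keystroke latency low over a large symbol table.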

    Here is the implementation:

    https://github.com/llvm/llvm-project/tree/main/clang-tools-e...

    I also wrote a... very long design document about how exactly this works, so if you're interested in understanding the internals of a code search engine, you can check it out:

    https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG...

  • scip-clang

  • (I work on C++ indexing at Sourcegraph.)

    As my colleague mentioned in a sibling comment, we have an existing indexer, lsif-clang, which supports C++. I just added a Chromium example to the lsif-clang README: (direct link) https://sourcegraph.com/github.com/chromium/chromium@cab0660...

    We are also actively working on a new SCIP indexer which should support features like cross-repo references in the future. https://github.com/sourcegraph/scip-clang

    Right now, Abseil doesn't have precise code navigation because no one has uploaded an index for it. In an ideal world, we would automatically have precise indexes for all the C++ code on Sourcegraph, but that's a hard problem because of the large variety in build systems, build configurations, and system dependencies that are often specified outside the build system.

  • scip-zig

    SCIP indexer for Zig!

  • Highcharts JS

    Highcharts JS, the JavaScript charting framework

  • I am searching this repository

    https://github.com/highcharts/highcharts

    for Series.drawPoint and expecting a direct hit for

    https://github.com/highcharts/highcharts/blob/29d2a83a5a997b...

    I also tried just "Series" and "drawPoint".

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts