How SerpApi sped up data extraction from HTML from 3s to 800ms (or How to profile and optimize Ruby code and C extension)

This page summarizes the projects mentioned and recommended in the original post on

  • flamescope

    FlameScope is a visualization tool for exploring different time ranges as Flame Graphs.

    I searched the web for how to profile C extensions for Ruby, and C code in general, and found Brendan Gregg’s tutorial on Linux perf. That was my first time using the Linux perf profiler. I also tried gperftools and pprof, because I had seen them used, and flamescope, because it was made by Brendan Gregg. There are many similar tools, and for two weeks or so it was hard to figure out which one to use.

  • flamegraph

    Easy flamegraphs for Rust projects and everything else, without Perl or pipes <3 (by flamegraph-rs)

    flamescope shows the same information as flamegraph. Both tools probably rely on the same underlying tooling to generate the chart.
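
    As background for the flame graph tools listed here: generators such as Brendan Gregg's flamegraph.pl typically read "folded" stacks, one line per unique call stack with frames joined by semicolons and followed by a sample count. The sketch below only illustrates that format. The function name fold_stacks and the sample frames are hypothetical and do not come from the original post or from any of these tools.

```ruby
# Collapse sampled call stacks into the "folded" text format:
# frames joined by ";", then a space and the number of samples.
def fold_stacks(samples)
  counts = Hash.new(0)
  samples.each { |frames| counts[frames.join(";")] += 1 }
  counts.sort.map { |stack, count| "#{stack} #{count}" }
end

# Hypothetical samples, ordered from the outermost frame to the innermost one.
SAMPLES = [
  %w[main parse_html at_css],
  %w[main parse_html at_css],
  %w[main parse_html text],
  %w[main render]
].freeze

puts fold_stacks(SAMPLES)
```

    Each output line corresponds to one tower in the chart, and the count determines its width.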

  • perf_data_converter

    Tool to convert Linux perf files to the profile.proto format used by pprof

    I installed perf_data_converter so that I could view the perf report with pprof.

  • nokogiri-rust

    Ruby FFI wrapper around scraper crate to be used instead of Nokogiri. Status: proof of concept.

    As an experiment, I made an FFI wrapper around the Rust scraper crate. In the proof of concept, at_css.text calls are 60 times faster than the Nokogiri ones.
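
    A speedup figure like the one above could be measured with a small harness along these lines. This is a minimal sketch using Ruby's standard Benchmark module, not the author's benchmark: the two lambdas are hypothetical stand-ins that use regular expressions, and a real comparison would wrap the Nokogiri call and the FFI wrapper call for the same selector instead.

```ruby
require "benchmark"

# Time `iterations` calls of a callable and return the elapsed seconds.
def time_calls(callable, iterations)
  Benchmark.realtime { iterations.times { callable.call } }
end

# How many times faster `candidate` is than `baseline`
# (values above 1.0 mean the candidate is faster).
def speedup(baseline, candidate, iterations: 1_000)
  time_calls(baseline, iterations) / time_calls(candidate, iterations)
end

HTML = "<div><h1 class=\"title\">Hello</h1></div>".freeze

# Hypothetical stand-ins for the two extraction paths being compared.
SLOW_EXTRACT = -> { HTML.dup.scan(/<h1[^>]*>(.*?)<\/h1>/).flatten.first }
FAST_EXTRACT = -> { HTML[/<h1[^>]*>(.*?)<\/h1>/, 1] }

ratio = speedup(SLOW_EXTRACT, FAST_EXTRACT)
puts format("candidate runs at %.1fx the speed of the baseline", ratio)
```

    Both callables should return the same value for the comparison to be meaningful, and the ratio will vary from run to run.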

  • bcc

    BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more

    I haven’t tried using bcc for tracing and profiling.

  • lexbor

    Lexbor is development of an open source HTML Renderer library.

    I’m glad to have the opportunity to contribute to an open-source project that is used by thousands of people. Hopefully, we will speed up Nokogiri (or the XML parser it uses) to match the performance of html5ever or lexbor at some point in the future. 800 ms to extract data from HTML is still too much.

  • rbspy

    Sampling CPU profiler for Ruby

    c function is not very helpful to find the performance problem, so we dug deeper.

  • Nokogiri

    Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

    It worked because CFLAGS are passed here and there in ext/nokogiri/extconf.rb.

  • linux

    Linux kernel source tree

    I haven’t read the entire perf documentation.

  • oga

    Julien Khaleghy also tried the Oga gem instead of Nokogiri. It was about six times faster than Nokogiri.

  • ruby-ll

    But some tests were failing with LL::ParserError from ruby-ll, which Oga uses.
