How SerpApi sped up data extraction from HTML from 3s to 800ms (or How to profile and optimize Ruby code and C extension)

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • flamescope

    FlameScope is a visualization tool for exploring different time ranges as Flame Graphs.

    I searched the web for how to profile C extensions for Ruby, and C code in general, and found Brendan Gregg’s tutorial on Linux perf. That was my first time using the Linux perf profiler. I also tried gperftools and pprof, because I had seen them in use, and flamescope, because it was made by Brendan Gregg. There are many similar tools, and for two weeks or so it was hard to figure out which one to use.

  • flamegraph

    Easy flamegraphs for Rust projects and everything else, without Perl or pipes <3 (by flamegraph-rs)

    flamescope shows the same information as flamegraph. Both probably rely on the same underlying tooling to generate the chart.

  • perf_data_converter

    Tool to convert Linux perf files to the profile.proto format used by pprof

    I installed perf_data_converter so that I could use the perf.data report with pprof.

  • nokogiri-rust

    Ruby FFI wrapper around scraper crate to be used instead of Nokogiri. Status: proof of concept.

    As an experiment, I made an FFI wrapper around the Rust scraper crate. In the proof of concept, at_css.text calls are 60 times faster than the equivalent Nokogiri calls.

  • bcc

    BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more

    I haven’t tried using bcc for tracing and profiling.

  • lexbor

    Lexbor is development of an open source HTML Renderer library. https://lexbor.com

    I’m glad to have the opportunity to contribute to an open-source project that is used by thousands of people. Hopefully, we will speed up Nokogiri (or the XML parser it uses) to match the performance of html5ever or lexbor at some point in the future. 800 ms to extract data from HTML is still too much.

  • rbspy

    Sampling CPU profiler for Ruby

    <c function> is not very helpful for finding the performance problem, so we dug deeper.

  • Nokogiri

    Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

    It worked because CFLAGS are passed here and there in ext/nokogiri/extconf.rb.

  • linux

    Linux kernel source tree

    I haven’t read the entire perf documentation.

  • oga

    Julien Khaleghy also tried the Oga gem instead of Nokogiri. It was about six times faster than Nokogiri.

  • ruby-ll

    But some tests were failing with LL::ParserError from ruby-ll that is used in Oga.
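
The "60 times faster" and "six times faster" figures quoted above are per-call comparisons between two HTML lookup implementations. As a minimal sketch of how such a ratio could be measured, the Ruby snippet below times two callables with the standard Benchmark module. The helper names (seconds_per_call, speedup) and both lookups are hypothetical placeholders written with plain String methods; they are not Nokogiri, Oga, or nokogiri-rust calls, and the post does not say how its own numbers were obtained.

```ruby
require 'benchmark'

# Run the given block `iterations` times and return the average
# wall-clock seconds spent per call.
def seconds_per_call(iterations)
  elapsed = Benchmark.realtime do
    iterations.times { yield }
  end
  elapsed / iterations
end

# Ratio of baseline time to candidate time.
# A value above 1.0 means the candidate is faster.
def speedup(baseline_seconds, candidate_seconds)
  baseline_seconds / candidate_seconds
end

# Placeholder document and lookups (hypothetical stand-ins for calls
# such as at_css(...).text in the libraries being compared).
html = "<div><span class='title'>Hello</span></div>" * 100
pattern = /<span class='title'>(.*?)<\/span>/

baseline_lookup  = -> { html.scan(pattern).first.first } # scans every match
candidate_lookup = -> { html[pattern, 1] }               # stops at the first match

baseline  = seconds_per_call(1_000, &baseline_lookup)
candidate = seconds_per_call(1_000, &candidate_lookup)
ratio     = speedup(baseline, candidate)

puts format('candidate is %.1fx the speed of the baseline', ratio)
```

To compare real libraries, the two lambdas would be replaced with the actual calls under test, and both should be run against the same parsed document so that only the lookup itself is timed.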
