Parsing URLs in Python

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • can_ada

    Python bindings for Ada, a fast and spec-compliant URL parser.

  • I apologize for the misjudgment. I just followed the link to can_ada and saw really minimal tests, e.g. https://github.com/TkTech/can_ada/blob/main/tests/test_parsi...

    I didn't understand that can_ada is not where the parser is developed.

  • furl

    🌐 URL parsing and manipulation made easy.

  • yarl

    Yet another URL library

  • universal_pathlib

    pathlib api extended to use fsspec backends

  • You might be interested in https://github.com/fsspec/universal_pathlib

  • ada

    WHATWG-compliant and fast URL parser written in modern C++

  • ...

    can_ada is just the python bindings, largely generated via pybind11.

    The actual project is at https://github.com/ada-url/ada

  • w3lib

    Python library of web-related functions

  • A great initiative!

    We need a better URL parser in Scrapy, for similar reasons. Speed and WHATWG standard compliance (i.e. do the same as web browsers) are the main things.

    It's possible to get closer to WHATWG behavior by using urllib and some hacks. This is what https://github.com/scrapy/w3lib does, which Scrapy currently uses. But it's still not quite compliant.

    Also, surprisingly, on some crawls URL parsing can take CPU amounts similar to HTML parsing.

    Ada / can_ada look very promising!

  • url

    Python bindings to the Rust url crate (by crate-py)

  • Nice.

    I'll also throw in that I recently wrote bindings to Mozilla's servo URL library.

    Those live at https://github.com/crate-py/url

    They're not complete yet (meaning only the parsing bits are exposed, not URL modification) but I too was frustrated with the state of URL parsing.

  • rust-url

    URL parser for Rust

  • IMO that URL crate is not especially high quality. I barely work with URLs and I quickly found an embarrassingly trivial bug:

    https://github.com/servo/rust-url/issues/864#issuecomment-16...
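
The Scrapy comment above mentions getting closer to WHATWG behavior "by using urllib and some hacks". As a rough illustration of what that can mean, the sketch below shows that the standard library's `urlsplit` leaves the host case and dot segments untouched, and then applies a small normalization step on top. The helpers `remove_dot_segments` and `rough_normalize` are hypothetical names invented for this example; this is not how w3lib does it, and it is far from full WHATWG compliance.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Resolve "." and ".." path segments (hypothetical helper, simplified)."""
    out = []
    segments = path.split("/")
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            if is_last:
                out.append("")  # keep the trailing slash
        elif seg == "..":
            if len(out) > 1:  # never pop the leading empty segment
                out.pop()
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def rough_normalize(url):
    """Lowercase the host and resolve dot segments (hypothetical helper).

    Assumptions: userinfo is dropped and IPv6 literals are not handled,
    so this is only a demonstration, not a compliant parser.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # .hostname is already lowercased
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, remove_dot_segments(parts.path),
         parts.query, parts.fragment)
    )


raw = urlsplit("HTTP://Example.COM/a/../b?x=1")
print(raw.netloc, raw.path)  # host case and dot segments are left as-is
print(rough_normalize("HTTP://Example.COM/a/../b?x=1"))
```

A WHATWG-compliant parser such as Ada performs this kind of normalization, and much more, as part of parsing itself, which is part of why the commenters are interested in bindings rather than layering fixes over urllib.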

