I apologize for the misjudgment. I just followed the link to can_ada and saw really minimal tests, e.g. https://github.com/TkTech/can_ada/blob/main/tests/test_parsi...
I didn't understand that can_ada is not where the parser is developed.
You might be interested in https://github.com/fsspec/universal_pathlib
...
can_ada is just the python bindings, largely generated via pybind11.
The actual project is at https://github.com/ada-url/ada
A great initiative!
We need a better URL parser in Scrapy, for similar reasons. Speed and WHATWG standard compliance (i.e. do the same as web browsers) are the main things.
It's possible to get closer to WHATWG behavior by using urllib and some hacks. This is what https://github.com/scrapy/w3lib does, which Scrapy currently uses. But it's still not quite compliant.
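As a rough illustration of the gap (a minimal sketch using only the standard library; the `describe` helper and the sample URL are made up for this example and are not part of w3lib or Scrapy), `urllib.parse.urlsplit` keeps the host casing in `netloc` and leaves `..` path segments untouched, while a WHATWG-style parser such as a browser would lowercase the host and resolve the dot segments:

```python
from urllib.parse import urlsplit


def describe(url):
    """Hypothetical helper: return the pieces urllib extracts from a URL."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # urllib lowercases the scheme
        "netloc": parts.netloc,      # host casing is preserved here
        "hostname": parts.hostname,  # the .hostname accessor is lowercased
        "path": parts.path,          # dot segments are NOT resolved
    }


result = describe("HTTP://Example.COM/a/../b")
print(result)
# A WHATWG-compliant parser would serialize this URL as
# http://example.com/b (lowercased host, "/a/.." collapsed).
```

Closing this kind of gap on top of urllib is the sort of "hack" the comment refers to; the exact normalizations w3lib applies are not shown here.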
Also, surprisingly, on some crawls URL parsing can take about as much CPU time as HTML parsing.
Ada / can_ada look very promising!
Nice.
I'll also throw in that I recently wrote bindings to Mozilla's Servo URL library.
Those live at https://github.com/crate-py/url
They're not complete yet (only the parsing bits are exposed, not URL modification), but I too was frustrated with the state of URL parsing.
IMO that URL crate is not especially high quality. I barely work with URLs and I quickly found an embarrassingly trivial bug:
https://github.com/servo/rust-url/issues/864#issuecomment-16...