HTML Sanitizer API

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • content

    The content behind MDN Web Docs

  • bluemonday

    bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

  • My thoughts as a maintainer of a HTML sanitizer https://github.com/microcosm-cc/bluemonday

    1. Sanitizing is not difficult, defining the policy/config is difficult as your need is not someone else's. First glance of this proposal is that this needs a lot more work to cover people's needs. It's good enough, but will have a lot of edges and will need to evolve.

    2. If you allow a blocklist then people will use that by default as it's easier to say "I don't want " than it is to say "I only accept 3. Even if you sanitize something you should keep the raw input... you should store the raw input alongside the sanitized (in fact the sanitized is merely a cached version of the raw input having been sanitized). The reason for this is you will have issues you need to debug (and can't without the input) and you will have round-trip edits you should support (but it's not round-trippable when everything you return is different from the input, do not punish a user who pasted HTML thinking it was safe by then not allowing them to edit it out because you threw everything away). Additionally if you want to ever report on the input, i.e. topK values, and you've modified the input and not kept raw, then you can never do this.

    4. Provide a sane default. Most engineers simply do not know what is safe or not. I ship a policy in bluemonday for user generated content... it is safe by default and good enough for most people, and it can be taken and extended due to the way the API is structured so can cover other scenarios as a foundation policy.

    I think the proposal in general: specify a standard for a sanitization API has merit. But mostly it has merit if it specifies a standard for defining sanitization policies/configuration, allowing them to be portable across different languages and systems.

    The one I wrote is very heavily inspired by https://github.com/owasp/java-html-sanitizer which is the OWASP project one maintained by Mike Samuel. When I did my research before writing the Go one, this was far and away the best way to construct the policy/config and I already saw that this perspective was more valuable than whether it's a token based parser (GIGO but low memory) or a DOM builder (more memory)... no-one cares about the internals, they care about expressing what safe means to them.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • java-html-sanitizer

    Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.

  • My thoughts as a maintainer of a HTML sanitizer https://github.com/microcosm-cc/bluemonday

    1. Sanitizing is not difficult, defining the policy/config is difficult as your need is not someone else's. First glance of this proposal is that this needs a lot more work to cover people's needs. It's good enough, but will have a lot of edges and will need to evolve.

    2. If you allow a blocklist then people will use that by default as it's easier to say "I don't want " than it is to say "I only accept 3. Even if you sanitize something you should keep the raw input... you should store the raw input alongside the sanitized (in fact the sanitized is merely a cached version of the raw input having been sanitized). The reason for this is you will have issues you need to debug (and can't without the input) and you will have round-trip edits you should support (but it's not round-trippable when everything you return is different from the input, do not punish a user who pasted HTML thinking it was safe by then not allowing them to edit it out because you threw everything away). Additionally if you want to ever report on the input, i.e. topK values, and you've modified the input and not kept raw, then you can never do this.

    4. Provide a sane default. Most engineers simply do not know what is safe or not. I ship a policy in bluemonday for user generated content... it is safe by default and good enough for most people, and it can be taken and extended due to the way the API is structured so can cover other scenarios as a foundation policy.

    I think the proposal in general: specify a standard for a sanitization API has merit. But mostly it has merit if it specifies a standard for defining sanitization policies/configuration, allowing them to be portable across different languages and systems.

    The one I wrote is very heavily inspired by https://github.com/owasp/java-html-sanitizer which is the OWASP project one maintained by Mike Samuel. When I did my research before writing the Go one, this was far and away the best way to construct the policy/config and I already saw that this perspective was more valuable than whether it's a token based parser (GIGO but low memory) or a DOM builder (more memory)... no-one cares about the internals, they care about expressing what safe means to them.

  • big-list-of-naughty-strings

    The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.

  • I wonder how it handles cases like this:

    ript>alert('XSS')ript>

    ...and other strings from https://github.com/minimaxir/big-list-of-naughty-strings

  • eslint-plugin-no-unsanitized

    Custom ESLint rule to disallows unsafe innerHTML, outerHTML, insertAdjacentHTML and alike

  • Great point!

    It wanted to edit the comment to change (1) to (server/client) but I passed my edit timeout.

    I would include your (5) within (1). `textContent` and other DOM methods like `setAttribute` are effectively secure output-escaping on the client.

    Your (5a) is an excellent extra measure. In this area, I'd also add security-focused linting for (1) and (5)–e.g. for (5), to ensure secure DOM methods are used, I use Mozilla's `eslint-plugin-no-unsanitized`[0] plugin for all my personal & work projects.

    [0] https://github.com/mozilla/eslint-plugin-no-unsanitized/

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts