
Chainable filters, flexible HTML


Garen Torikian

Co-Founder, CTO

Engineering

In the old days, there was just plaintext. And this was good, but not great. What about all the neat formatting that everyone wants to use? Sure, one could use HTML tags, but that's so cumbersome, and not very nice to look at, much less write.

So, writer-engineers began standardizing on ways to create plaintext formats that supported styling—like CommonMark (née Markdown). And this was good, but not great, either. Writers began wanting to represent various document signifiers in plaintext, like callouts, executable code blocks, and math equations.

In my past life building technical writing tools, my engineering philosophy towards these requests was always "the computer can do it." If the writer (who might not know HTML) wants their text to come alive, it's my job to make it happen the way they expect it to. Unfortunately, the web is a dark and dangerous place. Although we can give users tooling to make generating HTML easier, more often than not, we shouldn't.

Allowing your input fields to accept arbitrary HTML from your users exposes you to vulnerabilities like cross-site scripting. We could deny HTML tags outright, but what if we want to support some niceties, like details disclosures? We could make a list of "safe" HTML tags to support, but how do we predict them all, and then ensure the output stays reliable and readable? Maybe we could build a conduit, passing text from one source to another, modifying it as it goes...a pipeline for HTML, if you will.

There just so happens to be a Ruby library written to do just that, called html-pipeline. (Bet you didn't see that one coming!)

What is HTML-Pipeline?

For nearly twenty years, across several companies, I've been working on variations of the same problem: converting regular markup text written by humans into presentable HTML rendered by browsers. I say this not to brag, but quite the opposite: it's an admission of being tormented by the back-and-forth struggle of building a writing system that's easy to use and still capable of generating interactive content.

Most authoring systems make the assumption that you only want to convert from one plaintext format into another, or they'll provide a limited set of permissible tags in a templating language—but that's it. If you want to write something in a way the tool authors don't allow, you're out of luck.

HTML-Pipeline rectifies this in part by separating each phase of the conversion into a distinct step:

  • Cleaning the original source text. An example of this might be to remove excessive newlines from the start and end of a document.
  • Converting the original text into HTML. This is the process of turning your CommonMark/AsciiDoc/reStructuredText source into markup a browser can render.
  • Running node filters to manipulate the resultant HTML.
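Conceptually, the three phases compose like a chain of functions, each feeding the next. Here's a toy sketch in plain Ruby, with lambdas standing in for html-pipeline's real cleaner, converter, and filter objects:

```ruby
# Toy sketch of the three phases; these lambdas stand in for html-pipeline's
# real filter objects and converter.
clean   = ->(text) { text.strip }                    # 1. tidy the source text
convert = ->(text) { "<p>#{text}</p>" }              # 2. stand-in for a real converter
filters = ->(html) { html.gsub("http:", "https:") }  # 3. a node-filter stand-in

[clean, convert, filters].reduce("  visit http://example.com  ") do |doc, step|
  step.call(doc)
end
# => "<p>visit https://example.com</p>"
```

The output of each step becomes the input of the next, which is exactly why a change made early in the pipeline carries through to the end.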

It's that final step where all the joy is manifested and the dangers of accepting user input are banished. A node filter is composed of two parts:

  • A CSS selector (to match HTML elements)
  • A method to call on each matched element

For example, suppose we wanted to make sure that every HTTP link in our final document was set to HTTPS. A node filter to do that might look like this:

class HTMLPipeline
  class NodeFilter
    # HTML Filter for replacing http references to :http_url with https versions.
    # Subdomain references are not rewritten.
    class HttpsFilter < NodeFilter
      def selector
        Selma::Selector.new(match_element: %(a[href^="http:"]))
      end

      def handle_element(element)
        element["href"] = element["href"].sub(/^http:/, "https:")
      end
    end
  end
end

In other words:

  • We look for any anchor tag with an href attribute starting with http:
  • We replace that http: protocol with https:
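The heart of the filter is that one-line substitution. Here it is in plain Ruby, showing that an href already on https: passes through unchanged (the selector would never match it in the first place, and the anchored regex leaves it alone anyway):

```ruby
# The substitution HttpsFilter performs on each matched href attribute.
["http://example.com/page", "https://secure.example.com"].map do |href|
  href.sub(/^http:/, "https:")  # only a leading "http:" protocol is rewritten
end
# => ["https://example.com/page", "https://secure.example.com"]
```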

Sure, that might not look like much—a simple find-replace would do the same trick. But what if you wanted to do something more complicated, like convert all @mention phrases to actual links—while ignoring those found in pre and code blocks? Doing that in a regular expression is hard.

# frozen_string_literal: true

MENTION_PATTERNS = Hash.new do |hash, key|
  hash[key] = %r{
    (?:^|\W)               # beginning of string or non-word char
    @((?>#{key}))          # @username
    (?!/)                  # without a trailing slash
    (?=
      \.+[ \t\W]|          # dots followed by space or non-word character
      \.+$|                # dots at end of line
      [^0-9a-zA-Z_.]|      # non-word character except dot
      $                    # end of line
    )
  }ix
end

# Don't look for mentions in text nodes that are children of these elements
IGNORE_PARENTS = ["pre", "code", "a", "style", "script"]

class MentionFilter < HTMLPipeline::NodeFilter
  def selector
    Selma::Selector.new(match_text_within: "*", ignore_text_within: IGNORE_PARENTS)
  end

  def handle_text_chunk(text)
    content = text.to_s
    return unless content.include?("@")

    html = mention_link_filter(content)

    text.replace(html, as: :html)
  end
end
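To get a feel for what that pattern actually matches, here's a standalone check in plain Ruby, with \w+ standing in for the username pattern that the real filter interpolates as key:

```ruby
# Standalone demo of the mention pattern above, with \w+ standing in for
# the interpolated username pattern (#{key} in the real filter).
mention = /
  (?:^|\W)                              # beginning of string or non-word char
  @((?>\w+))                            # @username (atomic: no backtracking)
  (?!\/)                                # without a trailing slash
  (?=\.+[ \t\W]|\.+$|[^0-9a-zA-Z_.]|$)  # followed by punctuation or end of line
/x

"hello @alice, ping @bob.".scan(mention).flatten
# => ["alice", "bob"]
"see @user/repo for details".scan(mention).flatten
# => []
```

Note how the trailing-slash lookahead disqualifies @user/repo entirely; that kind of contextual rule is exactly what's painful to bolt onto a simple find-and-replace.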

Or, what if you wanted every single h2 header to have a new class added?

class TableOfContentsFilter < HTMLPipeline::NodeFilter
  def selector
    Selma::Selector.new(
      match_element: "h2",
    )
  end

  # since it's just a Ruby object, we can call any method outside the filter
  # to pass data into the filter
  def classes
    context[:classes] || "anchor"
  end

  def handle_element(element)
    # append our classes, accounting for headers with no class attribute
    existing = element["class"]
    element["class"] = existing.nil? ? classes : "#{existing} #{classes}"
  end
end

These nodes stack together, so that any change made at the beginning of the pipeline is carried through to the end. This is particularly important when dealing with sanitization. Rather than creating a list of tags which you don't want, HTML-Pipeline asks you to list the tags you do want:

ALLOWLIST = {
  elements: ["p", "pre", "code"]
}

pipeline = HTMLPipeline.new \
  convert_filter: [HTMLPipeline::ConvertFilter::MarkdownFilter.new],
  sanitization_config: ALLOWLIST

In this example, we know that only p, pre, and code tags will pass through; everything else will be stripped away:

result = pipeline.call <<-CODE
This is **great**:

    some_code(:first)

CODE
result[:output].to_s
# =>
# <p>This is great:</p>
# <pre><code>some_code(:first)
# </code></pre>

Notice in the example above that great is not made bold, the way one might expect from a word surrounded by **double asterisks**: the strong tag the converter produces isn't in our allowlist, so the tag is stripped while its text survives.
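That behavior (dropping the tag but keeping its text) is worth seeing in isolation. Here's a toy allowlist "unwrapper" built on Ruby's standard-library REXML to show the semantics; html-pipeline's actual sanitizer is far more thorough, and handles real-world HTML rather than the well-formed XML this sketch requires:

```ruby
require "rexml/document"

# Toy allowlist "unwrapper": a disallowed element is removed, but its
# text content survives in place. Not html-pipeline's real sanitizer.
ALLOWED = ["p", "pre", "code"]

def unwrap_disallowed(node)
  node.elements.to_a.each do |child|
    unwrap_disallowed(child)
    next if ALLOWED.include?(child.name)

    # replace the disallowed element with its text content
    parent = child.parent
    parent.insert_before(child, REXML::Text.new(child.text.to_s))
    parent.delete_element(child)
  end
end

doc = REXML::Document.new("<p>This is <strong>great</strong>:</p>")
unwrap_disallowed(doc)
doc.to_s
# => "<p>This is great:</p>"
```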

How does it work?

Past incarnations of HTML-Pipeline used Nokogiri, the venerable HTML parser. However, that design had one major flaw: every filter had to query the document all over again. For example, the earlier example of replacing HTTP links with HTTPS ones looked like this:

doc.css(%(a[href^="#{http_url}"])).each do |element|
  element['href'] = element['href'].sub(/^http:/, 'https:')
end

The problem here is that any subsequent filter which might want to operate on a tags has to search the document again. If you have several filters operating on the same nodes, those repeated traversals become a real performance cost.
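A toy illustration of that cost, with an array standing in for a parsed document (these are not Nokogiri's or lol-html's real APIs):

```ruby
# Toy cost model: F filters over N nodes. Per-filter queries visit
# F * N nodes in total; a single streaming pass visits each node once.
nodes   = %w[a p a code a]                # stand-in for a parsed document
filters = [->(n) {}, ->(n) {}, ->(n) {}]  # three stand-in node filters

per_filter_visits = 0
filters.each do |f|                       # Nokogiri style: one traversal per filter
  nodes.each { |n| per_filter_visits += 1; f.call(n) }
end

single_pass_visits = 0
nodes.each do |n|                         # streaming style: one traversal total
  single_pass_visits += 1
  filters.each { |f| f.call(n) }
end

[per_filter_visits, single_pass_visits]
# => [15, 5]
```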

Nowadays, under the hood, HTML-Pipeline is just a thin wrapper around lol-html, a project by the good folks at Cloudflare. It was designed to run filters sequentially, yet match all of the nodes in a single pass. In other words, if several filters need to operate on a tags, all of those matching elements are found first, and then sent to each filter in turn. Each modified a tag is handed off to the next filter in line. (If you want to learn more about this project, you can read about it on Cloudflare's blog!)

Best of all, lol-html is written in Rust, a low-level memory-safe language. That means the possibility of data corruption, segmentation faults, and other common CVEs plaguing C projects (like the C-backed Nokogiri) vanishes. HTML-Pipeline gets to be a Ruby wrapper around a Rust project through the magic of magnus, but that's a blog post for another time!

What are y'all doing with it?

You might be thinking: big deal. Plaintext in, HTML out, delete scary-looking content. What's so special about that?

First: ouch. But second, we here at Yetto have the firm opinion that our users should be able to do whatever they want to (within reason). Practically speaking, that means using OpenAPI schemas for our JSON configurations and sensible constraints on our database. Yetto's big secret (that we can't wait to tell you, obviously) is that every plug's settings page is dynamically created. We can hand-wave the details for now (sorry!), but what that essentially means is that we will accept text from a server we don't control and render it to the user.

How do we do this safely? With HTML-Pipeline of course! Here's what our pipeline for rendering this section of the app looks like:

SettingsPipeline = ::HTMLPipeline.new(
  convert_filter: ::HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: Yetto::HTMLPipeline::SanitizationFilter::PRESENTATION_ONLY,
)

So, we'll take plaintext, convert it from CommonMark (née Markdown) into HTML, and sanitize the result to do...what? It's clear if we look at the definition of PRESENTATION_ONLY:

PRESENTATION_ONLY = {
  elements: ["em", "strong", "code"],
}

Ta-da! The response from the server can contain whatever text the server wants to send. But we'll strip away everything except a few decorative elements. That's the power of HTML-Pipeline.

By the way, we define pipelines for several parts of the app; for example, here's how we render outgoing messages:

MessagePipeline = ::HTMLPipeline.new(
  convert_filter: ::HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [::HTMLPipeline::NodeFilter::ImageMaxWidthFilter.new, ::HTMLPipeline::NodeFilter::MentionFilter.new],
)

Reading is fundamental

HTML-Pipeline has had a long history, having been born at GitHub and raised by a single maintainer (me). Its backend swap over to Rust two years ago vastly improved its performance, and we've been running it in production at Yetto without a hitch ever since.

It's important that the engineers crafting tools for writers understand not only their needs, but the needs of their readers, too. When it comes to plaintext-to-HTML conversion, it's not appropriate to say "we can't!" just because sanitization is hard. Yes, it is hard: so is writing an idea with one hand tied behind your back. Sometimes it's worse to be safe than sorry, if being safe means doing nothing at all. HTML-Pipeline grants writers the full ability to express themselves, while allowing the systems that translate their bytes and the readers who receive their ideas to do so safely.

In fact, there's even more we could say about the writing process—particularly around our editor, Stheno. But that's yet another post for yet another day!