# Regex and You: Matching an HTML Tag

Regular expressions, ever versatile, will help us locate HTML tags in a string today.

## Summary

Pattern matching HTML strings serves at least one crucial function in web dev: sanitizing user input. Accepting user-submitted strings opens one's application to significant vulnerability. Suppose, for example, that some ne'er-do-well on the internet submitted a comment that includes a `<script>` tag.

Regular expressions allow us to match HTML tags in a string, because HTML tags conform to a certain pattern. A tag must:

- begin and end with angle brackets (`<>`)
- contain a tag name consisting of one or more lowercase letters, like `p`, `a`, `div`, `strong`, `script`
- contain zero or more attributes, such as `class="btn"`, `src="/steal_your_data.js"`, or `href="https://github.com/gavin-asay"`
- be accompanied by a closing tag in brackets with a slash and its tag name, e.g., `</p>` or `</div>`, or
- be a self-closing tag, which has one or more whitespace characters, then a slash before the closing bracket (`>`).

So, to pick out an HTML tag, we write a regex that can account for these various possibilities. Consider this regex:

`/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/`

If that looks like gibberish, that's because a regex often does at first glance. It takes some time to break down a lengthy regex and make sense of its pattern. Let's break this regex down piece by piece. Look in the table of contents for an explanation of each part.

## Table of Contents

- [/](#slash)
- [^<(\[a-z\]+)](#capture1)
  - [^](#carat)
  - [<](#openbracket)
  - [\[a-z\]](#class)
  - [+](#plus)
  - [( ... )](#capturing)
- [(\[^>\]+)\*](#capture2)
  - [\[^>\]](#class2)
  - [+](#plus2)
  - [( ... )\*](#asterisk)
- [(?: ... )](#noncapture)
  - [>(.\*)](#period)
  - [<\/\1>](#escape)
  - [|](#pipe)
  - [\s+\/>](#short)
- [$](#dollar)

## / {#slash}

In JavaScript (and a number of other languages), a regex literal is enclosed in forward slashes. The slashes tell the language that everything between them is a regular expression.

## ^<([a-z]+) {#capture1}

### ^ {#carat}

When you see a **caret** ^ at the beginning of the regex, it means the beginning of the string we're comparing. Thus, only an HTML tag found immediately at the start of our string will fit the pattern. (Note that we also have a character that matches the end of the string, which we'll discuss later.)

### < {#openbracket}

This single character < stands alone, not enclosed in any parentheses or brackets. This means that the pattern will match exactly one opening bracket, as we would expect from an HTML tag.

### [a-z] {#class}

Square brackets [] mark a **character class**. Any single character listed within the brackets will match the pattern. In this case, we match any lowercase letter from a to z. Note that for letters, regex is case sensitive by default. If we wanted to match capital letters as well, our character class would be [A-Za-z].
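
To see the caret, the literal bracket, and the character class working together, here is a minimal JavaScript sketch. The sample strings and the variable names `startsLikeTag` and `anyCase` are made up for illustration; only the `^`, `<`, and `[a-z]` pieces come from the regex we're dissecting.

```javascript
// Hypothetical test patterns built from the pieces discussed so far:
// ^ (start of string), a literal <, and the character class [a-z].
const startsLikeTag = /^<[a-z]/;

console.log(startsLikeTag.test('<p>Hello</p>'));       // true
console.log(startsLikeTag.test('Hello <p>there</p>')); // false: the < is not at the start
console.log(startsLikeTag.test('<P>Hello</P>'));       // false: [a-z] only matches lowercase

// Widening the class to [A-Za-z] accepts capital letters as well.
const anyCase = /^<[A-Za-z]/;
console.log(anyCase.test('<P>Hello</P>'));             // true
```
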
If we only wanted to match a handful of characters, we could use [abc123] to match only lowercase a, b, c, or the digits 1, 2, and 3.

### + {#plus}

The plus sign + is a **quantifier**. It describes how many times the preceding element (here, the character class) can be repeated. Plus means one or more times. That means we must have at least one character that matches [a-z], but two or any quantity beyond that will also match. Other quantifiers include the asterisk \*, meaning zero or more times (essentially making the character class optional), and the question mark ?, meaning zero or one time.

### ( ... ) {#capturing}

Finally, you'll notice that this segment is enclosed in parentheses ( ). Parentheses mark a **capturing group**. This means that the regex will remember the part of the string that matched everything inside those parentheses. We can refer back to this capturing group later. JavaScript will also keep track of the contents of this capturing group.

Still with me? Have you figured out what this first part matches? An opening HTML bracket <, followed by one or more lowercase letters. That's the start of an HTML tag: segments like `<p` or `<div`.

## ([^>]+)* {#capture2}

You'll notice that we're isolating a second capturing group.

### [^>] {#class2}

Last time we saw a **caret** ^, it denoted the start of the string. As the first character inside a character class, however, ^ has a different meaning: it *excludes* the characters that follow it. We're excluding > here, and that's the only thing this class defines. When a character class consists only of exclusions, any character EXCEPT the excluded ones will match. So any character that isn't >, including letters, digits, symbols, and whitespace, matches this character class.

### + {#plus2}

As before, + means one or more times. Here it matches one or more non-> characters.

### ( ... )* {#asterisk}

Like we mentioned above, the **asterisk** \* matches zero or more times. Thus, our second capturing group ([^>]+)\* is optional and will include any run of one or more non-> characters.
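
To watch both capturing groups at work, here is a minimal JavaScript sketch that tries just the first half of the regex on its own. The sample strings and the names `tagStart`, `withAttrs`, and `noAttrs` are made up for illustration.

```javascript
// Only the first two capturing groups of the full regex, tried in isolation.
const tagStart = /^<([a-z]+)([^>]+)*/;

// A tag with attributes: group 1 holds the tag name, group 2 everything up to the >.
const withAttrs = '<a href="https://github.com/gavin-asay" class="btn">Profile</a>'.match(tagStart);
console.log(withAttrs[1]); // prints: a
console.log(withAttrs[2]); // prints:  href="https://github.com/gavin-asay" class="btn" (with a leading space)

// A tag without attributes: the optional second group matched zero times,
// so JavaScript reports it as undefined.
const noAttrs = '<p>Hi</p>'.match(tagStart);
console.log(noAttrs[1]); // prints: p
console.log(noAttrs[2]); // prints: undefined
```

Note that `match` returns the whole matched text at index 0 and the capturing groups at indexes 1 and up, which is how JavaScript "keeps track" of each group's contents.
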
What is this very flexible pattern looking for? Anything that comes after the tag name and before the closing bracket >. That includes the tag's attributes: anything like classes or ids, href, src, or flags like selected or disabled. Let's look at an example: `