Regex Tutorial: Matching an Email

This tutorial explains how a specific regular expression works by breaking it down into its parts and describing what each one does. A regular expression, or regex, is a sequence of characters that defines a search pattern. In code or search tools, regular expressions are used to find patterns of characters within a string, to find and replace a character or sequence of characters, and, frequently, to validate input.

Summary

The regex examined here matches an email address. Each section below covers one regex concept and shows where it appears in the expression.

email regex:

/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/
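To see the expression in action, here is a minimal sketch that tests it against a few strings. It assumes the dot before the domain is meant as a literal dot (written \.), and the names emailRegex, samples, and results as well as the sample addresses are made up for illustration.

```javascript
// The tutorial's pattern, with the dot before the domain escaped (\.)
// so that it matches a literal dot (an assumption about the intent).
const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "jane.doe@example.com",   // well-formed address
  "no-at-sign.example.com", // missing the @
  "user@host",              // missing the dot and domain
  "",                       // empty string
];

// Pair each sample with the result of testing it.
const results = samples.map((s) => [s, emailRegex.test(s)]);

for (const [s, ok] of results) {
  console.log(JSON.stringify(s), ok);
}
// "jane.doe@example.com" true
// "no-at-sign.example.com" false
// "user@host" false
// "" true
```

Note that the empty string passes: the * after the closing parenthesis lets the whole group repeat zero times. Code that must reject empty input would need a separate check.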

Anchors
Quantifiers
OR Operator
Character Classes
Flags
Grouping and Capturing
Bracket Expressions
Greedy and Lazy Match
Boundaries
Back-references
Look-ahead and Look-behind

Regex Components

Anchors

A regex anchor ensures that a matched expression is anchored to a certain position in the string.

Types of Anchors

The caret symbol (^) is the start-of-string anchor and ensures that a match is positioned at the start of a string.

/^([a-zA-Z0-9._%-]+@

The dollar sign ($) is the end-of-string anchor and ensures that a match is positioned at the end of a string.

@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/

The word boundary anchor (\b) matches a position where a word character sits next to a non-word character or the edge of the string, that is, at the beginning or end of a word. Note that “words” in regex consist of the characters [a-z], [A-Z], [0-9], and the underscore _. The email regex does not use \b; it relies on ^ and $ alone:

/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/

The non-word boundary anchor (\B) is the inverse of \b: it matches any position that is not a word boundary.
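The effect of the start and end anchors can be sketched with a much shorter pattern. The patterns and test words below are hypothetical and only illustrate how ^ and $ pin a match to the edges of the string.

```javascript
// Hypothetical patterns built around the literal text "cat".
const startsWithCat = /^cat/;  // must begin the string
const endsWithCat = /cat$/;    // must end the string
const onlyCat = /^cat$/;       // must be the whole string

const anchorChecks = {
  start: startsWithCat.test("catfish"),    // true
  startMiss: startsWithCat.test("tomcat"), // false
  end: endsWithCat.test("tomcat"),         // true
  whole: onlyCat.test("cat"),              // true
  wholeMiss: onlyCat.test("cat food"),     // false
};

console.log(anchorChecks);
```

The email regex uses both anchors together, like onlyCat here, so that extra text before or after the address causes the match to fail.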

Quantifiers

Quantifiers specify the number of consecutive occurrences of the character or expression directly preceding them. They can specify zero or more (*), one or more (+), zero or one (?), an exact quantity such as three {3}, three or more {3,}, or between one and three {1,3}. A ? added after any quantifier makes it lazy, so it matches as few characters as possible.

Quantifiers in username: /^([a-zA-Z0-9._%-]+

Quantifiers in email host name: [a-zA-Z0-9.-]+

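Quantifiers in domain: [a-zA-Z]{2,6})*$/

Here + requires at least one character in the username and in the host name, {2,6} requires two to six letters in the domain, and the final * lets the whole group repeat zero or more times, which also allows an empty string to match.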
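The quantifier forms can be tried one at a time. This is a small sketch using hypothetical patterns and inputs; only domainPart is taken from the email regex.

```javascript
// The domain portion of the email regex, anchored so it can be tested alone.
const domainPart = /^[a-zA-Z]{2,6}$/;

const quantifierChecks = {
  zeroOrMore: /^ab*$/.test("a"),             // true: zero b's is allowed
  oneOrMore: /^ab+$/.test("a"),              // false: at least one b is required
  optional: /^colou?r$/.test("color"),       // true: the u may be absent
  exact: /^\d{3}$/.test("123"),              // true: exactly three digits
  domainOk: domainPart.test("com"),          // true: three letters
  domainTooShort: domainPart.test("c"),      // false: fewer than two letters
  domainTooLong: domainPart.test("example"), // false: more than six letters
};

console.log(quantifierChecks);
```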

OR Operator

The | operator acts like a boolean OR: it matches either the expression before it or the expression after it. It can operate within a group or on a whole expression. The email regex in this tutorial does not use it.
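The difference between using | inside a group and on a whole expression is easy to miss, so here is a short sketch. Both patterns and the test words are hypothetical.

```javascript
// Without a group, | splits the entire expression in two:
// "starts with cat" OR "ends with dog".
const wholeExpression = /^cat|dog$/;

// Inside a group, | only chooses between the alternatives in the group:
// exactly "cat" OR exactly "dog".
const insideGroup = /^(cat|dog)$/;

const orChecks = {
  loose: wholeExpression.test("catalog"), // true: it starts with "cat"
  strict: insideGroup.test("catalog"),    // false: not exactly cat or dog
  strictDog: insideGroup.test("dog"),     // true
};

console.log(orChecks);
```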

Character Classes

A character class allows us to match several possible characters. A character class is enclosed in square brackets.

To match any single character from the group x, y, or z, use: [xyz]
To match any single character except x, y, or z, add a caret symbol after the opening bracket: [^xyz]
To specify a range of characters in one go, use the dash symbol: [a-z]
The expression above matches any lowercase letter from a to z. Ranges can also be strung together: [a-zA-Z0-9]
Partial ranges can be used as well: [j-m6-8]

Characters inside the brackets are not separated by commas; a comma in a character class is matched literally.
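The following sketch exercises each kind of character class. The test strings are hypothetical; the last check shows that a comma written inside the brackets is treated as one more character to match, not as a separator.

```javascript
const anyOfXyz = /[xyz]/;          // one of x, y, z
const noneOfXyz = /[^xyz]/;        // any character other than x, y, z
const combined = /^[a-zA-Z0-9]+$/; // only letters and digits
const withComma = /[x,y,z]/;       // x, y, z, or a literal comma

const classChecks = {
  listed: anyOfXyz.test("y"),          // true
  negated: noneOfXyz.test("xyz"),      // false: every character is x, y, or z
  negatedHit: noneOfXyz.test("xya"),   // true: "a" is outside the set
  ranges: combined.test("Regex101"),   // true
  commaIsLiteral: withComma.test(","), // true
};

console.log(classChecks);
```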

Flags

Flags alter the behavior of the entire expression. Flags follow the closing forward slash of the expression. The email regex in this tutorial does not set any flags.

Some common flags:

i ignore case: makes the entire expression case-insensitive.
g global search: finds all matches instead of stopping after the first, and keeps track of where the last match ended.
m multiline: causes the beginning and end anchors to match the start and end of a line instead of the whole string.
s dotall: causes the dot (.) to match any character, including newlines.
y sticky: matches only from its last index position and ignores the global search flag.
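Each flag can be observed in isolation. The snippet below is a sketch in JavaScript, whose regex literals take these flags after the closing slash; all patterns and inputs are hypothetical.

```javascript
// i: ignore case
const caseSensitive = /hello/.test("HELLO"); // false
const ignoreCase = /hello/i.test("HELLO");   // true

// g: find every match, not just the first
const firstDigit = "a1b2c3".match(/\d/)[0];  // "1"
const allDigits = "a1b2c3".match(/\d/g);     // ["1", "2", "3"]

// m: ^ and $ also match at the start and end of each line
const lines = "one\ntwo".match(/^\w+$/gm);   // ["one", "two"]

// s: the dot also matches a newline
const dotPlain = /a.b/.test("a\nb");         // false
const dotAll = /a.b/s.test("a\nb");          // true

// y: match only at the position stored in lastIndex
const sticky = /foo/y;
sticky.lastIndex = 3;
const stickyAtThree = sticky.test("barfoo"); // true: "foo" starts at index 3
sticky.lastIndex = 0;
const stickyAtZero = sticky.test("barfoo");  // false: "foo" is not at index 0

console.log({ ignoreCase, allDigits, lines, dotAll, stickyAtThree, stickyAtZero });
```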

Grouping and Capturing

Groups allow several tokens to be combined into one unit. A capturing group, such as (ABC), uses parentheses to group multiple tokens together so that the matched substring can be extracted. In this tutorial's regex, a single pair of parentheses wraps the whole address, so the entire email is captured as one group:

/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/

The three logical parts inside it are not captured individually. With their own parentheses they would look like this:

username: ([a-zA-Z0-9._%-]+)
email host name: ([a-zA-Z0-9.-]+\.)
domain: ([a-zA-Z]{2,6})

The reason I did not give each part its own parentheses is that these three parts can already be distinguished and separated by the @ and the dot (.).
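If the parts do need to be extracted separately, each one can get its own capturing group. The pattern below is a variant written for illustration, not the tutorial's regex: it drops the trailing * and keeps the dot outside the host group. The sample address and variable names are hypothetical.

```javascript
// Variant with one capturing group per part (an illustrative assumption).
const grouped = /^([a-zA-Z0-9._%-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,6})$/;

const groupMatch = "jane.doe@mail.example.com".match(grouped);

// Index 0 holds the full match; indexes 1 to 3 hold the captured groups.
const [, username, host, domain] = groupMatch;

console.log(username); // "jane.doe"
console.log(host);     // "mail.example"
console.log(domain);   // "com"
```

Because the host part is greedy, it takes everything up to the last dot, which leaves only the final label for the domain group.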

Bracket Expressions

Each set of square brackets is a bracket expression that matches a single character from the set it lists. The quantifier that follows (+ or {2,6}) allows that match to repeat, as long as every character meets the criteria within the square brackets. The email regex contains three bracket expressions:

[a-zA-Z0-9._%-]
[a-zA-Z0-9.-]
[a-zA-Z]
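A short sketch makes the division of labor visible: the brackets decide which characters are allowed, and the quantifier after them decides how many. The anchors and test strings are added here for illustration only.

```javascript
// The username bracket expression on its own matches exactly one character.
const oneChar = /^[a-zA-Z0-9._%-]$/;

// Followed by +, the same set may repeat one or more times.
const manyChars = /^[a-zA-Z0-9._%-]+$/;

const bracketChecks = {
  single: oneChar.test("j"),             // true
  singleTooLong: oneChar.test("jane"),   // false: only one character allowed
  repeated: manyChars.test("jane_doe%"), // true: every character is in the set
  outsider: manyChars.test("jane doe"),  // false: a space is not in the set
};

console.log(bracketChecks);
```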

Greedy and Lazy Match

A greedy match consumes as much as possible. For example, if the pattern <.+> is applied to the text Ello World wrapped in an opening and a closing tag, it returns everything from the first < to the last >, not just the opening tag.

Making it lazy (<.+?>) prevents this. Adding the ? after the + tells it to repeat as few times as possible, so the match stops at the first > it comes across.
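The two behaviors can be compared side by side. The markup below is hypothetical; the em tag is an assumption chosen only so that the string contains more than one > character.

```javascript
// Hypothetical input: text wrapped in an opening and a closing tag.
const html = "<em>Ello World</em>";

// Greedy: .+ runs to the end and backs up to the last ">".
const greedy = html.match(/<.+>/)[0];

// Lazy: .+? stops at the first ">" it can.
const lazy = html.match(/<.+?>/)[0];

console.log(greedy); // "<em>Ello World</em>"
console.log(lazy);   // "<em>"
```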

Boundaries

Word boundary: \b

The word boundary matches a position where one side is a word character, usually a letter, digit, or underscore, and the other side is not.

\bcat\b would match cat in black cat, but would not match in catatonic, tomcat, or certificate.
\bcat would match cat in catfish.
cat\b would match cat in tomcat.
Both would match cat on its own.
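The cat examples can be checked directly. This sketch uses the words from the paragraph above and adds \B, the non-word boundary, for contrast; the variable names are hypothetical.

```javascript
const wholeWord = /\bcat\b/;  // cat as a complete word
const wordStart = /\bcat/;    // cat at the start of a word
const wordEnd = /cat\b/;      // cat at the end of a word
const insideWord = /\Bcat\B/; // cat with word characters on both sides

const boundaryChecks = {
  blackCat: wholeWord.test("black cat"),       // true
  catatonic: wholeWord.test("catatonic"),      // false
  tomcat: wholeWord.test("tomcat"),            // false
  certificate: wholeWord.test("certificate"),  // false
  catfish: wordStart.test("catfish"),          // true
  tomcatEnd: wordEnd.test("tomcat"),           // true
  buried: insideWord.test("certificate"),      // true
  buriedAlone: insideWord.test("cat"),         // false
};

console.log(boundaryChecks);
```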

Back-references

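Back-references match the same text as previously matched by a capturing group. They are written as a backslash followed by the group number, such as \1 for the first group. The email regex does not use them.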
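A common use of a back-reference is spotting a word that was typed twice in a row. The pattern and sentences below are hypothetical and serve only as a sketch of the idea.

```javascript
// (\w+) captures a word; \1 then requires the very same text again.
const doubledWord = /\b(\w+) \1\b/;

const hit = "this is is a test".match(doubledWord);
const repeated = hit[0]; // "is is": the full match
const captured = hit[1]; // "is": the text held by group 1

const noRepeat = doubledWord.test("this is a test"); // false

// In a replacement string, $1 refers to the same captured group.
const fixed = "this is is a test".replace(doubledWord, "$1");

console.log(repeated, captured, noRepeat, fixed);
// is is is false this is a test
```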

Look-ahead and Look-behind

Look-aheads can be used to ensure that a match is or isn’t followed by some pattern, without actually including this pattern in the match. A negative look-ahead declares that the match is not followed by a specific pattern.

Syntax for positive look-ahead: (?=...)
    \w+(?=\.com)

Syntax for negative look-ahead: (?!...)
    \w+(?!\.com)

Look-behinds can be used to test whether a match is or isn’t preceded by some specified pattern, without including that pattern in the match.

Syntax for positive look-behind: (?<=...)
    (?<=Mr\. )\w+

Syntax for negative look-behind: (?<!...)
    (?<!Mr\. )\w+
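The four forms can be tried on short strings. This is a sketch with hypothetical inputs; it writes the dots as \. so they match a literal dot, and the negative look-behind check uses a simplified pattern (a fixed name instead of \w+) to keep the result easy to read.

```javascript
// Positive look-ahead: the word must be followed by ".com".
const positiveAhead = "example.com".match(/\w+(?=\.com)/)[0]; // "example"

// Negative look-ahead: the word must not be followed by ".com".
const negativeAhead = "example.org".match(/\w+(?!\.com)/)[0]; // "example"

// Caveat: on "example.com" the engine backtracks one character and
// still finds a match, because "exampl" is not followed by ".com".
const backtracked = "example.com".match(/\w+(?!\.com)/)[0];   // "exampl"

// Positive look-behind: the word must be preceded by "Mr. ".
const positiveBehind = "Mr. Smith".match(/(?<=Mr\. )\w+/)[0]; // "Smith"

// Negative look-behind (simplified): "Smith" not preceded by "Mr. ".
const afterMr = /(?<!Mr\. )Smith/.test("Mr. Smith"); // false
const afterDr = /(?<!Mr\. )Smith/.test("Dr. Smith"); // true

console.log(positiveAhead, negativeAhead, backtracked, positiveBehind, afterMr, afterDr);
```

In every case the text inside the look-ahead or look-behind is checked but left out of the returned match.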

Author

Cade Wilson - Full Stack Developer - https://github.com/M8MBA
