This tutorial explains how a specific regular expression (regex) works by breaking it into parts and describing what each part does. A regex is a sequence of characters that defines a search pattern. In code or search algorithms, regular expressions can find patterns of characters within a string, or find and replace a character or sequence of characters within a string. They are also frequently used to validate input.
The expression examined here matches an email address, and each section below describes how one of its components operates.
email regex:
/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Back-references
- Look-ahead and Look-behind
A regex anchor ensures that a matched expression is anchored to a certain position in the string.
Types of Anchors
- The caret symbol (^) is the start-of-string anchor and ensures that a match is positioned at the start of a string:
/^([a-zA-Z0-9._%-]+@
- The dollar sign ($) is the end-of-string anchor and ensures that a match is positioned at the end of a string:
@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/
- The word boundary anchor (\b) matches a position where a word begins or ends, that is, where a word character sits next to a non-word character or the edge of the string. Note that "words" in regex consist of the characters [a-z], [A-Z], [0-9], and the underscore _. The email regex relies on ^ and $ rather than on word boundaries:
/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/
- The non-word boundary anchor (\B) is the inverse of the word boundary: it matches any position that is not a word boundary.
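The effect of the ^ and $ anchors can be seen by comparing the email regex with an unanchored copy of it. This is a minimal sketch: the names emailPattern, unanchoredPattern, and anchorSample are hypothetical, and the dot before the domain is assumed to be escaped.

```javascript
// Anchored: the whole string must be an email address.
const emailPattern = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/;
// Unanchored copy (illustration only): an address anywhere in the string is enough.
const unanchoredPattern = /[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}/;

const anchorSample = "contact: jane@example.com today";

console.log(emailPattern.test("jane@example.com")); // true
console.log(emailPattern.test(anchorSample));       // false, extra text before and after
console.log(unanchoredPattern.test(anchorSample));  // true, the address is found inside
```

For input validation the anchored form is usually the one you want, since it rejects strings that merely contain an address.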
Quantifiers specify the number of consecutive occurrences of the character or expression directly preceding them. A quantifier can specify zero or more (*), one or more (+), zero or one (?), an exact quantity such as three {3}, three or more {3,}, or between one and three {1,3}. A lazy modifier (?) added after any quantifier makes it match as few characters as possible.
Quantifiers in username: /^([a-zA-Z0-9._%-]+
Quantifiers in email host name: [a-zA-Z0-9.-]+
Quantifiers in domain: [a-zA-Z]{2,6})*$/
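The quantifiers described above can be compared side by side on a short string. This is a sketch with hypothetical names (quantSample, quantResults, domainPart, wholeEmail); the last check assumes the email regex with its dot escaped and shows that the trailing * on the group also lets an empty string pass.

```javascript
const quantSample = "aaaa";

// Each entry is the text matched by the first match of the pattern.
const quantResults = {
  zeroOrMore: quantSample.match(/a*/)[0],     // "aaaa"
  oneOrMore: quantSample.match(/a+/)[0],      // "aaaa"
  zeroOrOne: quantSample.match(/a?/)[0],      // "a"
  exactlyThree: quantSample.match(/a{3}/)[0], // "aaa"
  threeOrMore: quantSample.match(/a{3,}/)[0], // "aaaa"
  oneToThree: quantSample.match(/a{1,3}/)[0], // "aaa"
  lazyOneOrMore: quantSample.match(/a+?/)[0], // "a"
};
console.log(quantResults);

// The domain part of the email regex accepts 2 to 6 letters.
const domainPart = /^[a-zA-Z]{2,6}$/;
console.log(domainPart.test("c"));       // false, too short
console.log(domainPart.test("com"));     // true
console.log(domainPart.test("example")); // false, 7 letters

// The * after the group means "zero or more addresses", so "" is accepted.
const wholeEmail = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/;
console.log(wholeEmail.test("")); // true
```

If an empty input should be rejected, dropping the trailing * from the group is one option.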
| acts like a boolean OR. It matches the expression before or after the |. It can operate within a group, or on a whole expression.
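The difference between | acting on a whole expression and | acting inside a group is easy to miss, so here is a small sketch. The patterns and names (wholeAlternation, groupedAlternation) are illustrative and are not part of the email regex.

```javascript
// On the whole expression: "starts with cat" OR "ends with dog".
const wholeAlternation = /^cat|dog$/;
// Inside a group: the entire string is exactly "cat" or exactly "dog".
const groupedAlternation = /^(cat|dog)$/;

console.log(wholeAlternation.test("catalog"));   // true, it starts with "cat"
console.log(groupedAlternation.test("catalog")); // false
console.log(groupedAlternation.test("dog"));     // true
```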
A character class allows us to match several possible characters. A character class is enclosed in square brackets.
- To match any single character from the set x, y, or z, use: [xyz]
- To match any single character except x, y, or z, use the same expression with a leading caret: [^xyz]
- Commas are not separators inside a character class; [x,y,z] would also match a literal comma.
- To specify a range of characters in one go, use the dash symbol: [a-z]. This expression matches any lowercase letter from a to z.
- Ranges can also be strung together: [a-zA-Z0-9]
- Subsets of ranges can be used as well, as long as each range runs from low to high: [j-m6-8]
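A short sketch of how character classes behave in practice. The sample strings and the names classChecks and digitsFound are made up for illustration.

```javascript
const classChecks = {
  inSet: /[xyz]/.test("y"),           // true, "y" is in the set
  negatedSet: /[^xyz]/.test("y"),     // false, "y" is excluded
  commaIsLiteral: /[x,y,z]/.test(","), // true, the comma is just another member
  subsetRange: /^[j-m6-8]+$/.test("km7"), // true, every character is in j-m or 6-8
};
console.log(classChecks);

// With the g flag, match() returns every character that fits the class.
const digitsFound = "Room 42B".match(/[0-9]/g);
console.log(digitsFound); // [ '4', '2' ]
```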
Flags alter the behavior of the entire expression. Flags follow the closing forward slash of the expression.
The most common flags are:
i ignore case: makes the entire expression case-insensitive.
g global search: finds every match instead of stopping at the first one, and stores the index of the last match.
m multiline: causes the beginning and end anchors to match the start and end of a line instead of the whole string.
s dotall: causes the dot (.) to match any character, including newlines.
y sticky: matches only at its last index position and ignores the global search flag.
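Each flag listed above can be tried on a small input. The following is a sketch with hypothetical names and sample strings; none of these flags appear on the email regex itself.

```javascript
// i: case-insensitive matching.
const ignoreCaseHit = /hello/i.test("HELLO"); // true

// g: match() returns every match instead of only the first.
const allDigits = "a1b2c3".match(/\d/g); // [ '1', '2', '3' ]

// m: ^ and $ match at line breaks, not only at the ends of the string.
const perLine = "one\ntwo".match(/^\w+$/gm);   // [ 'one', 'two' ]
const wholeOnly = "one\ntwo".match(/^\w+$/g);  // null

// s: the dot also matches a newline.
const dotWithoutS = /a.b/.test("a\nb"); // false
const dotWithS = /a.b/s.test("a\nb");   // true

// y: the match must start exactly at lastIndex.
const stickyFoo = /foo/y;
stickyFoo.lastIndex = 3;
const stickyHit = stickyFoo.test("barfoo");  // true, "foo" starts at index 3
stickyFoo.lastIndex = 1;
const stickyMiss = stickyFoo.test("barfoo"); // false, no "foo" at index 1

console.log(ignoreCaseHit, allDigits, perLine, wholeOnly, dotWithoutS, dotWithS, stickyHit, stickyMiss);
```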
Groups allow several tokens to be combined. A capturing group, such as (ABC), uses parentheses to group multiple tokens together so that the matched substring can be extracted. The email regex wraps the whole address in a single pair of parentheses rather than capturing each part separately; even so, it contains three logical groups, as shown below:
/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/
username: ([a-zA-Z0-9._%-]+)
email host name: ([a-zA-Z0-9.-]+\.)
domain: ([a-zA-Z]{2,6})
The reason I did not put parentheses around each of these parts in this example is that the 3 groups can already be distinguished and separated by the @ and . characters.
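If the three parts do need to be extracted, one option is to give each its own capturing group. This is a variation on the tutorial's regex, not the original expression; the names groupedEmail and emailParts and the sample address are hypothetical.

```javascript
// Variation: one capturing group per part, no outer group and no trailing *.
const groupedEmail = /^([a-zA-Z0-9._%-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,6})$/;

const emailParts = groupedEmail.exec("jane.doe@mail.example.com");
if (emailParts !== null) {
  // Index 0 is the full match; indexes 1..3 are the capturing groups.
  const [, username, host, domain] = emailParts;
  console.log(username); // "jane.doe"
  console.log(host);     // "mail.example"
  console.log(domain);   // "com"
}
```

Because the host class also allows dots, the host group takes everything up to the last dot and the domain group takes only the final label.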
The expression within square brackets can be considered a sub-group. Because of the quantifier that follows each bracket expression, more than one character is allowed, as long as every character meets the criteria within the square brackets.
[a-zA-Z0-9._%-] [a-zA-Z0-9.-] [a-zA-Z]
A greedy match consumes as much as possible. For example, if the pattern <.+> is applied to text that contains several tags, it matches from the first < all the way to the last >, including everything in between.
Making it lazy (<.+?>) prevents this. Adding the ? after the + tells it to repeat as few times as possible, so the match stops at the first > it comes across.
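The difference is easiest to see on a concrete string. In this sketch the sample markup and the names tagSample, greedyTag, and lazyTag are made up for illustration.

```javascript
const tagSample = "<em>Hello</em>";

// Greedy: .+ runs to the end and backs up only as far as the last ">".
const greedyTag = tagSample.match(/<.+>/)[0];
// Lazy: .+? stops at the first ">" that lets the pattern succeed.
const lazyTag = tagSample.match(/<.+?>/)[0];

console.log(greedyTag); // "<em>Hello</em>"
console.log(lazyTag);   // "<em>"
```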
Word Boundary: \b
The word boundary matches positions where one side is a word character (a letter, digit, or underscore) and the other side is not. \bcat\b would match cat in black cat, but would not match in catatonic, tomcat, or certificate. \bcat would match cat in catfish, cat\b would match cat in tomcat, and both would match cat on its own.
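The cat examples above can be checked directly. This sketch uses a hypothetical boundaryChecks object and adds one \B case to show the inverse.

```javascript
const boundaryChecks = {
  wholeWord: /\bcat\b/.test("black cat"),     // true
  insideStart: /\bcat\b/.test("catatonic"),   // false
  insideEnd: /\bcat\b/.test("tomcat"),        // false
  insideMiddle: /\bcat\b/.test("certificate"), // false
  leadingOnly: /\bcat/.test("catfish"),       // true
  trailingOnly: /cat\b/.test("tomcat"),       // true
  // \B on both sides: "cat" must be surrounded by word characters.
  nonBoundary: /\Bcat\B/.test("certificate"), // true
};
console.log(boundaryChecks);
```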
Back-references match the same text as previously matched by a capturing group. They are written as a backslash followed by the group number, such as \1.
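A common use of a back-reference is spotting a word that was typed twice in a row. The pattern and the names doubledWord and doubledMatch below are illustrative assumptions, not part of the email regex.

```javascript
// (\w+) captures a word; \1 then requires the exact same text again.
const doubledWord = /\b(\w+) \1\b/;

const doubledMatch = doubledWord.exec("this is is a test");
console.log(doubledMatch[0]); // "is is"
console.log(doubledMatch[1]); // "is"

console.log(doubledWord.test("this is a test")); // false, no repeated word
```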
Look-aheads can be used to ensure that a match is/isn’t followed by some pattern, without actually including this pattern in a match. A negative look ahead is used to declare that the match is not followed by a specific pattern.
syntax for positive look ahead: (?=...)
\w+(?=\.com)
syntax for negative look ahead: (?!…)
\w+(?!\.com)
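The two look-ahead patterns above can be run against sample host names. This is a sketch with hypothetical names, assuming the dot in .com is escaped so it matches a literal dot. Note the last result: a negative look-ahead after a greedy \w+ can still succeed by giving back one character.

```javascript
// Positive look-ahead: the word must be followed by ".com", which is not part of the match.
const beforeCom = "example.com".match(/\w+(?=\.com)/)[0]; // "example"
const noCom = "example.org".match(/\w+(?=\.com)/);        // null

// Negative look-ahead: the word must not be followed by ".com".
const notBeforeCom = "example.org".match(/\w+(?!\.com)/)[0]; // "example"
// \w+ backtracks by one letter so that what follows is "e.com", not ".com".
const backtracked = "example.com".match(/\w+(?!\.com)/)[0];  // "exampl"

console.log(beforeCom, noCom, notBeforeCom, backtracked);
```

Adding a word boundary, as in \w+\b(?!\.com), is one way to stop the pattern from settling for a shortened word.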
Look-behinds can be used to test whether a match is preceded by some specified pattern.
syntax for positive look behind: (?<=…)
(?<=Mr\. )\w+
syntax for negative look behind: (?<!…)
(?<!Mr\. )\w+
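The look-behind patterns above behave as follows on two sample names. This sketch assumes a JavaScript engine with look-behind support (ES2018 or later); the variable names and sample strings are hypothetical.

```javascript
// Positive look-behind: the word must be preceded by "Mr. ".
const afterMr = "Mr. Smith".match(/(?<=Mr\. )\w+/)[0]; // "Smith"
const noMr = "Dr. Jones".match(/(?<=Mr\. )\w+/);       // null

// Negative look-behind: the word must not be preceded by "Mr. ".
// The first word in the string qualifies, so the match is "Mr" itself.
const notAfterMr = "Mr. Smith".match(/(?<!Mr\. )\w+/)[0]; // "Mr"

console.log(afterMr, noMr, notAfterMr);
```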
Cade Wilson - Full Stack Developer - https://github.com/M8MBA