substring(*string* from *pattern*) | returns NULL if no match, the match text for the first capture group (if any), or the entire matched substring |
| | regexp_replace(*source*, *pattern*, *replacement* [, *flags* ])| returns *source*, transformed per *pattern* and *replacement* (and *flags*, which can include `g`) |
| | regexp_matches(*string*, *pattern* [, *flags* ]) | returns no rows (if no match); a row containing an array of capture matches (if the pattern uses capture groups), or a row containing the entire matched substring. (The `g` flag, if supplied, results in one returned row for each match in *string*.) |
| | regexp_split_to_table(*string*, *pattern* [, *flags* ]) | splits *string* using *pattern* as a delimiter, returning a row for each fragment (ignoring zero-length fragments) |
| | regexp_split_to_array(*string*, *pattern* [, *flags* ]) | same as `regexp_split_to_table` except it returns an array of strings rather than rows |
## Alternation
| Ruby | Postgres | Explanation |
|-|-|-|
| \| | \| | combines two expressions into a single one that matches either of the expressions; each expression is an alternative |
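
As a small illustration of alternation on the Ruby side, the sketch below uses two hypothetical helper methods (`first_animal` and `grey?` are names invented for this example). It assumes nothing beyond core Ruby.

```ruby
# Return the first animal named in the text, or nil when none is present.
# The | makes the whole pattern match either "cat" or "dog".
def first_animal(text)
  text[/cat|dog/]
end

# Grouping limits how far the | reaches: only the middle letter varies here.
def grey?(word)
  word.match?(/\Agr(?:a|e)y\z/)
end

first_animal("hotdog stand")  # => "dog"
first_animal("bird")          # => nil
grey?("grey")                 # => true
grey?("groy")                 # => false
```

Without the group, `/gra|ey/` would match either "gra" or "ey", because alternation binds more loosely than concatenation.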
## Atoms
Atoms match a sequence of one or more characters (or *zero* or more characters, in the case of subpatterns).
| Ruby | Postgres | Explanation |
|-|-|-|
| (*re*) | (*re*) | a sub-pattern (capturing) |
| (?<*name*>*re*) | | a sub-pattern (named capture) |
| (?'*name*'*re*) | | a sub-pattern (named capture) |
| (?:*re*) | (?:*re*) | a sub-pattern (non-capturing) |
| (?>*re*) | | a sub-pattern (atomic, non-capturing) |
| . | . | any single character |
| [*chars*] | [*chars*] | a character class |
| [^*chars*] | [^*chars*] | a negated character class |
| \\*k* | \\*k* | where *k* is non-alphanumeric: matches *k* |
| \\*c* | \\*c* | where *c* is alphanumeric: an escape |
| { | { | if followed by a digit, introduces a bound quantifier; otherwise matches { |
| *x* | *x* | other characters match themselves |
(Additionally, backreferences and escapes function as atoms.)
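
The following sketch shows how the Ruby group forms in the table behave. The method names `capture_demo` and `atomic_demo` are hypothetical and exist only for this example.

```ruby
# Read the same date through numbered, named, and non-capturing groups.
def capture_demo(text)
  numbered = text.match(/(\d+)-(\d+)-(\d+)/)
  named    = text.match(/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)/)
  # (?:...) groups without capturing, so the month becomes capture 1.
  skipped  = text.match(/(?:\d+)-(\d+)-(\d+)/)
  {
    numbered: numbered[1],
    named: named[:year],
    after_non_capturing: skipped[1],
  }
end

# An atomic group never gives back characters it has matched.
# /a+ab/ can backtrack so the final "a" is left for the literal "a";
# /(?>a+)ab/ cannot, so it fails on the same input.
def atomic_demo(text)
  {
    backtracking: text.match?(/a+ab/),
    atomic: text.match?(/(?>a+)ab/),
  }
end

capture_demo("released 2024-03-15")
# => { numbered: "2024", named: "2024", after_non_capturing: "03" }
atomic_demo("aaab")
# => { backtracking: true, atomic: false }
```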
## Quantifiers
Quantifiers can follow atoms, and they change the number of occurrences of the atom that can be matched. (Without a subsequent quantifier, an atom will always match *exactly one* consecutive occurrence.)
| Ruby | Postgres | Explanation |
|-|-|-|
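| * | * | zero or more occurrences (greedy) |
| + | + | one or more occurrences (greedy) |
| ? | ? | zero or one occurrence (greedy) |
| {*m*} | {*m*} | exactly *m* occurrences |
| {*m*,} | {*m*,} | *m* or more occurrences (greedy) |
| {,*m*} | | at most *m* occurrences (greedy) |
| {*m*,*n*} | {*m*,*n*} | between *m* and *n* occurrences, inclusive (greedy) |
| *? | *? | zero or more occurrences (non-greedy) |
| +? | +? | one or more occurrences (non-greedy) |
| ?? | ?? | zero or one occurrence (non-greedy) |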
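| {*m*}? | {*m*}? | exactly *m* occurrences, non-greedy (Ruby instead reads this as an optional {*m*}, that is, zero or *m* occurrences) |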
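| {*m*,}? | {*m*,}? | *m* or more occurrences (non-greedy) |
| {,*m*}? | | at most *m* occurrences (non-greedy) |
| {*m*,*n*}? | {*m*,*n*}? | between *m* and *n* occurrences, inclusive (non-greedy) |
| *+ | | zero or more occurrences (possessive: greedy, and never gives characters back) |
| ++ | | one or more occurrences (possessive) |
| ?+ | | zero or one occurrence (possessive) |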
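| {*m*}+ | | exactly *m* occurrences; possessive in Perl- and Java-style syntaxes, but Ruby reads this as {*m*} followed by `+` |
| {*m*,}+ | | *m* or more occurrences; possessive in Perl- and Java-style syntaxes, but Ruby reads this as {*m*,} followed by `+` |
| {,*m*}+ | | at most *m* occurrences; Ruby reads this as {,*m*} followed by `+` |
| {*m*,*n*}+ | | between *m* and *n* occurrences; possessive in Perl- and Java-style syntaxes, but Ruby reads this as {*m*,*n*} followed by `+` |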
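
To make the difference between greedy, non-greedy, and possessive quantifiers concrete, here is a minimal Ruby sketch. The method name `quantifier_demo` and the sample strings are invented for illustration.

```ruby
def quantifier_demo
  html   = "<b>bold</b> and <i>italic</i>"
  digits = "a1234567b"
  {
    # Greedy: .+ runs to the end, then backs up to the last ">".
    greedy: html[/<.+>/],
    # Non-greedy: .+? stops at the first ">" it can.
    lazy: html[/<.+?>/],
    # Possessive: .++ runs to the end and never backs up, so ">" cannot match.
    possessive: html[/<.++>/],
    # Bounds: greedy takes the maximum, non-greedy the minimum.
    bounded: digits[/\d{2,4}/],
    bounded_lazy: digits[/\d{2,4}?/],
    # Ruby-only form with the lower bound omitted.
    at_most: "xxxxx"[/x{,3}/],
  }
end

quantifier_demo
# => { greedy: "<b>bold</b> and <i>italic</i>", lazy: "<b>", possessive: nil,
#      bounded: "1234", bounded_lazy: "12", at_most: "xxx" }
```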
## Constraints / Anchors
| Ruby | Postgres | Explanation |
|-|-|-|
| ^ | ^ | beginning of line |
| \\A | \\A | beginning of string |
| $ | $ | end of line |
| \\Z | | end of string, or just before a terminating newline, if any |
| \\z | \\Z | end of string |
| (?=*re*) | (?=*re*) | empty string when following characters match *re* |
| (?!*re*) | (?!*re*) | empty string when following characters do not match *re* |
| (?<=*re*) | | empty string when preceding characters match *re* |
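| (?\<!*re*) | | empty string when preceding characters do not match *re* |
## Backreferences
| Ruby | Postgres | Explanation |
|-|-|-|
| \\k<*name*> | | reference to a named capture |
| \\g<*name*> | | reference to a named subpattern (re-evaluates the subpattern, rather than matching the same text) |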
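
The Ruby sketch below exercises the anchors, lookaround constraints, and named references described above. All three method names are hypothetical and the sample strings are made up for the example.

```ruby
def anchor_demo
  text = "first\nsecond\n"
  {
    line_starts: text.scan(/^\w+/),        # ^ matches at the start of every line
    string_start: text.scan(/\A\w+/),      # \A matches only at the start of the string
    before_final_newline: text.match?(/second\Z/),  # \Z tolerates one trailing newline
    very_end: text.match?(/second\z/),     # \z does not
  }
end

def lookaround_demo
  price = "cost: $42 or 17 EUR"
  {
    after_dollar: price[/(?<=\$)\d+/],     # digits preceded by "$"
    before_eur: price[/\d+(?=\sEUR)/],     # digits followed by " EUR"
    not_after_dollar: price.scan(/(?<![\$\d])\d+/),  # runs not preceded by "$" or a digit
  }
end

def reference_demo
  {
    # \k<word> must match the same text that the group captured.
    same_text: "it was was fine"[/\b(?<word>\w+) \k<word>\b/],
    # \g<num> re-runs the subpattern, so different digits are accepted.
    same_pattern: "12-34"[/(?<num>\d+)-\g<num>/],
    same_text_fails: "12-34"[/(?<num>\d+)-\k<num>/],
  }
end

anchor_demo
# => { line_starts: ["first", "second"], string_start: ["first"],
#      before_final_newline: true, very_end: false }
lookaround_demo
# => { after_dollar: "42", before_eur: "17", not_after_dollar: ["17"] }
reference_demo
# => { same_text: "was was", same_pattern: "12-34", same_text_fails: nil }
```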
### Substitution Backreferences
In Ruby, information from the match may be used in any of three ways: as global variables (`$something`), as substitution references in the second argument of `sub` or `gsub` (`\\something`) or by calling into the MatchData object returned from `match` (`md.something` or `md[:something]`).
In Postgres, these can be used in the third argument (*replacement*) of `regexp_replace`.
| Ruby | Postgres | Explanation |
|-|-|-|
| `$~`, `Regexp.last_match` | | the MatchData object from the most recent match |
| `$&`, `\&`, `md[0]` | `\&` | the complete matched text |
| `` $` ``, `` \` ``, `md.pre_match` | | the text of the string preceding the match |
| `$'`, `\'`, `md.post_match` | | the text of the string after the match |
| `$1`, `\1`, `md[1]` | `\1` | the first capture group (and so on for other numbered capture groups) |
| `$+`, `\+` | | the last capture group |
| \k<*name*>, md[:*name*] | | the named capture group with name *name* |
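
A short Ruby sketch of the three access styles follows: MatchData methods, substitution references in `sub`/`gsub`, and the global variables. The method names are hypothetical. The replacement strings are single-quoted so that the backslashes reach `sub` and `gsub` unchanged.

```ruby
def match_data_demo
  md = "say hello world now".match(/(?<greeting>hello) (?<target>\w+)/)
  {
    whole: md[0],
    first: md[1],
    named: md[:target],
    before: md.pre_match,
    after: md.post_match,
  }
end

def substitution_demo
  name = "John Smith"
  {
    numbered: name.sub(/(\w+) (\w+)/, '\2, \1'),
    named: name.sub(/(?<first>\w+) (?<last>\w+)/, '\k<last>, \k<first>'),
    whole: name.gsub(/\w+/, '<\&>'),
  }
end

def global_demo
  # =~ sets $~ and the variables derived from it for the current method.
  "John Smith" =~ /(\w+) (\w+)/
  { last_match: $~[0], first_group: $1, last_group: $+ }
end

match_data_demo
# => { whole: "hello world", first: "hello", named: "world", before: "say ", after: " now" }
substitution_demo
# => { numbered: "Smith, John", named: "Smith, John", whole: "<John> <Smith>" }
global_demo
# => { last_match: "John Smith", first_group: "John", last_group: "Smith" }
```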
## Escapes
### Pattern Escapes
| Ruby | Postgres | Explanation |
|-|-|-|
| \\d | \\d | [[:digit:]] |
| \\D | \\D | [^[:digit:]] |
| \\h | | [[:xdigit:]] |
| \\H | | [^[:xdigit:]] |
| \\s | \\s | [[:space:]] |
| \\S | \\S | [^[:space:]] |
| \\w | \\w | [[:alnum:]_] (note underscore is included) |
| \\W | \\W | [^[:alnum:]_] (note underscore is not matched) |
| \p{Alnum} | | Alphabetic and numeric character |
| \p{Alpha} | | Alphabetic character |
| \p{Blank} | | Space or tab |
| \p{Cntrl} | | Control character |
| \p{Digit} | | Digit |
| \p{Graph} | | Non-blank character (excludes spaces, control characters, and similar) |
| \p{Lower} | | Lowercase alphabetical character |
| \p{Print} | | Like \p{Graph}, but includes the space character |
| \p{Punct} | | Punctuation character |
| \p{Space} | | Whitespace character ([:blank:], newline, carriage return, etc.) |
| \p{Upper} | | Uppercase alphabetical character |
| \p{XDigit} | | Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F) |
| \p{Word} | | A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation |
| \p{ASCII} | | A character in the ASCII character set |
| \p{Any} | | Any Unicode character (including unassigned characters) |
| \p{Assigned} | | An assigned character |
| \p{L} | | Any character with Unicode *General Category* 'Letter' |
| \p{Ll} | | Any character with Unicode *General Category* 'Letter: Lowercase' |
| \p{Lm} | | Any character with Unicode *General Category* 'Letter: Modifier' |
| \p{Lo} | | Any character with Unicode *General Category* 'Letter: Other' |
| \p{Lt} | | Any character with Unicode *General Category* 'Letter: Titlecase' |
| \p{Lu} | | Any character with Unicode *General Category* 'Letter: Uppercase' |
| \p{M} | | Any character with Unicode *General Category* 'Mark' |
| \p{Mn} | | Any character with Unicode *General Category* 'Mark: Nonspacing' |
| \p{Mc} | | Any character with Unicode *General Category* 'Mark: Spacing Combining' |
| \p{Me} | | Any character with Unicode *General Category* 'Mark: Enclosing' |
| \p{N} | | Any character with Unicode *General Category* 'Number' |
| \p{Nd} | | Any character with Unicode *General Category* 'Number: Decimal Digit' |
| \p{Nl} | | Any character with Unicode *General Category* 'Number: Letter' |
| \p{No} | | Any character with Unicode *General Category* 'Number: Other' |
| \p{P} | | Any character with Unicode *General Category* 'Punctuation' |
| \p{Pc} | | Any character with Unicode *General Category* 'Punctuation: Connector' |
| \p{Pd} | | Any character with Unicode *General Category* 'Punctuation: Dash' |
| \p{Ps} | | Any character with Unicode *General Category* 'Punctuation: Open' |
| \p{Pe} | | Any character with Unicode *General Category* 'Punctuation: Close' |
| \p{Pi} | | Any character with Unicode *General Category* 'Punctuation: Initial Quote' |
| \p{Pf} | | Any character with Unicode *General Category* 'Punctuation: Final Quote' |
| \p{Po} | | Any character with Unicode *General Category* 'Punctuation: Other' |
| \p{S} | | Any character with Unicode *General Category* 'Symbol' |
| \p{Sm} | | Any character with Unicode *General Category* 'Symbol: Math' |
| \p{Sc} | | Any character with Unicode *General Category* 'Symbol: Currency' |
| \p{Sk} | | Any character with Unicode *General Category* 'Symbol: Modifier' |
| \p{So} | | Any character with Unicode *General Category* 'Symbol: Other' |
| \p{Z} | | Any character with Unicode *General Category* 'Separator' |
| \p{Zs} | | Any character with Unicode *General Category* 'Separator: Space' |
| \p{Zl} | | Any character with Unicode *General Category* 'Separator: Line' |
| \p{Zp} | | Any character with Unicode *General Category* 'Separator: Paragraph' |
| \p{C} | | Any character with Unicode *General Category* 'Other' |
| \p{Cc} | | Any character with Unicode *General Category* 'Other: Control' |
| \p{Cf} | | Any character with Unicode *General Category* 'Other: Format' |
| \p{Cn} | | Any character with Unicode *General Category* 'Other: Not Assigned' |
| \p{Co} | | Any character with Unicode *General Category* 'Other: Private Use' |
| \p{Cs} | | Any character with Unicode *General Category* 'Other: Surrogate' |
| \p{*script*} | | Any character from the Unicode *script*, where *script* is one of *Arabic*, *Armenian*, *Balinese*, *Bengali*, *Bopomofo*, *Braille*, *Buginese*, *Buhid*, *Canadian_Aboriginal*, *Carian*, *Cham*, *Cherokee*, *Common*, *Coptic*, *Cuneiform*, *Cypriot*, *Cyrillic*, *Deseret*, *Devanagari*, *Ethiopic*, *Georgian*, *Glagolitic*, *Gothic*, *Greek*, *Gujarati*, *Gurmukhi*, *Han*, *Hangul*, *Hanunoo*, *Hebrew*, *Hiragana*, *Inherited*, *Kannada*, *Katakana*, *Kayah_Li*, *Kharoshthi*, *Khmer*, *Lao*, *Latin*, *Lepcha*, *Limbu*, *Linear_B*, *Lycian*, *Lydian*, *Malayalam*, *Mongolian*, *Myanmar*, *New_Tai_Lue*, *Nko*, *Ogham*, *Ol_Chiki*, *Old_Italic*, *Old_Persian*, *Oriya*, *Osmanya*, *Phags_Pa*, *Phoenician*, *Rejang*, *Runic*, *Saurashtra*, *Shavian*, *Sinhala*, *Sundanese*, *Syloti_Nagri*, *Syriac*, *Tagalog*, *Tagbanwa*, *Tai_Le*, *Tamil*, *Telugu*, *Thaana*, *Thai*, *Tibetan*, *Tifinagh*, *Ugaritic*, *Vai*, and *Yi* |
(Any of the above escapes of the form \p{*something*} can be negated by using the `^` character, as \p{^*something*}.)
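
The sketch below tries a few of the Ruby-only escapes on short sample strings. The method name `property_demo` is hypothetical, and the non-ASCII characters are written as `\u` escapes so the example does not depend on the source file's encoding. The last two entries reflect the author's understanding that Ruby's `\w` is limited to ASCII word characters, while `\p{Word}` follows the Unicode categories listed above.

```ruby
def property_demo
  text = "Caf\u00e9 \u03b1\u03b2\u03b3 42"   # "Café αβγ 42"
  {
    letters: text.scan(/\p{L}+/),
    uppercase: text.scan(/\p{Lu}/),
    greek: text.scan(/\p{Greek}+/),
    decimal_digits: text.scan(/\p{Nd}+/),
    hex_runs: "ff 10 zz".scan(/\b\h+\b/),
    # \p{^L} is the negated form: anything that is not a letter.
    non_letters: "ab1 cd".scan(/\p{^L}+/),
    ascii_word: "\u00e9".match?(/\w/),
    unicode_word: "\u00e9".match?(/\p{Word}/),
  }
end

property_demo
# => { letters: ["Café", "αβγ"], uppercase: ["C"], greek: ["αβγ"],
#      decimal_digits: ["42"], hex_runs: ["ff", "10"], non_letters: ["1 "],
#      ascii_word: false, unicode_word: true }
```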
### Literal Character Escapes
| Ruby | Postgres | Explanation |
|-|-|-|
| \\\\ | \\\\ | backslash |
| \\a | \\a | alert (bell) character, as in C |
| \\b | | backspace (only in character class) |
| | \\b | backspace, as in C |
| | \\B | synonym for backslash (\\) to help reduce the need for backslash doubling |
| | \\c*X* | (where *X* is any character) the character whose low-order 5 bits are the same as those of *X*, and whose other bits are all zero |
| \\e | | the escape character |
| | \\e | the character whose collating-sequence name is ESC, or failing that, the character with octal value 033 |
| \\f | \\f | form feed, as in C |
| \\n | \\n | newline, as in C |
| \\r | \\r | carriage return, as in C |
| \\t | \\t | horizontal tab, as in C |
| | \\u*hhhh* | (where *hhhh* is exactly four hexadecimal digits) the character whose hexadecimal value is 0x*hhhh* |
| | \\U*hhhhhhhh* | (where *hhhhhhhh* is exactly eight hexadecimal digits) the character whose hexadecimal value is 0x*hhhhhhhh* |
| \\v | \\v | vertical tab, as in C |
| | \\x*hhh* | (where *hhh* is any sequence of hexadecimal digits) the character whose hexadecimal value is 0x*hhh* (a single character no matter how many hexadecimal digits are used) |
| | \\0 | the character whose value is 0 (the null byte) |
| | \\*oo* | (where *oo* is exactly two octal digits, and is not a back reference) the character whose octal value is 0*oo* |
| | \\*ooo* | (where *ooo* is exactly three octal digits, and is not a back reference) the character whose octal value is 0*ooo* |
## Options
In Ruby, the single-letter version of an option can be specified after the closing delimiter of the regexp. Options `i`, `m`, and `x` can also be embedded within the expression using the (?*on*-*off*:*re*) syntax, which turns on options *on* and turns off options *off* while interpreting subpattern *re*. The `Regexp::CONSTANT` version can be passed as the second parameter to `Regexp.new` (optionally combined with other constants with `|`).
In Postgres, the options can be included at the very start of an expression (possibly after an initial `***:`) using the syntax (?*opts*); the options included in the string *opts* are in effect for the entire regular expression. The options can also be included in a string passed as a parameter to various pattern-related functions.
| Ruby | Postgres | Explanation |
|-|-|-|
| i or `Regexp::IGNORECASE` | | case-insensitive |
| m or `Regexp::MULTILINE` | | multiline (treat newline as a character matched by `.`) |
| x or `Regexp::EXTENDED` | | ignore whitespace and comments in pattern |
| o | | perform `#{}` interpolation only once |
| u | | regexp is encoded as UTF-8 |
| e | | regexp is encoded as EUC-JP |
| s | | regexp is encoded as Windows-31J |
| n | | regexp is encoded as ASCII-8BIT |
| | \*\*\*: | (at beginning of pattern) the rest of the pattern is an ARE |
| | \*\*\*= | (at beginning of pattern) the rest of the pattern is a literal string |
| | b | rest of RE is a BRE |
| | c | case-sensitive matching (overrides operator type) |
| | e | rest of RE is an ERE |
| | i | case-insensitive matching (overrides operator type) |
| | m | historical synonym for n |
| | n | newline-sensitive matching |
| | p | partial newline-sensitive matching |
| | q | rest of RE is a literal ("quoted") string, all ordinary characters |
| | s | non-newline-sensitive matching (default) |
| | t | tight syntax (default) |
| | w | inverse partial newline-sensitive ("weird") matching |
| | x | expanded syntax |
| | g | (with `regexp_replace` and `regexp_matches` only) operate on all matches, not just the first |
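
For the Ruby side of the table, the following sketch shows each way of supplying options: a trailing letter, an embedded (?*on*:*re*) group, and `Regexp::` constants passed to `Regexp.new`. The constant `YEAR_MONTH` and the method `option_demo` are names invented for this example.

```ruby
# With x, whitespace and comments inside the pattern are ignored.
YEAR_MONTH = /
  (?<year>\d{4})   # four-digit year
  -
  (?<month>\d{2})  # two-digit month
/x

def option_demo
  {
    ignore_case: "HELLO".match?(/hello/i),
    # Without m, "." does not match a newline; with m it does.
    dot_without_m: "a\nb".match?(/a.b/),
    dot_with_m: "a\nb".match?(/a.b/m),
    extended_year: "2024-03".match(YEAR_MONTH)[:year],
    # (?i:...) turns on case-insensitivity for the enclosed subpattern only.
    inline_on: "Hello WORLD".match?(/Hello (?i:world)/),
    inline_scope: "HELLO WORLD".match?(/Hello (?i:world)/),
    # Constants are combined with | when passed to Regexp.new.
    constants: Regexp.new("a.b", Regexp::IGNORECASE | Regexp::MULTILINE).match?("A\nB"),
  }
end

option_demo
# => { ignore_case: true, dot_without_m: false, dot_with_m: true,
#      extended_year: "2024", inline_on: true, inline_scope: false, constants: true }
```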