# Ruby and Postgres Regular Expression Syntaxes Ruby's regular expressions are unusually powerful. Postgres' regular expressions are not as powerful, but they come close; close enough that it's possible to do many pattern-based queries and string transformations entirely in a query. And sometimes, it's very useful to have a single regular expression that works in both environments, so that one can reliably query for data in a particular format and then transform the results in Ruby using the same pattern. But there are subtle differences between the syntaxes that are hard to remember, and the two organize the documentation is different ways, so it's hard to even map between them easily. This table is aimed at making it easier. ## Operators and Functions | Ruby | Postgres | Explanation | |-|-|-| | `=~` | `~` | matches | | | `~*` | matches (case-independent) | | `!~` | `!~` | does not match | | | `!~*` | does not match (case-independent) | | `String#match(re)`, `Regexp#match(string)` | | returns a MatchData instance or nil if no match | | | substring(*string* from *pattern*) | returns NULL if no match, the match text for the first capture group (if any), or the entire matched substring | | | regexp_replace(*source*, *pattern*, *replacement* [, *flags* ])| returns *source*, transformed per *pattern* and *replacement* (and *flags*, which can include `g`) | | | regexp_matches(*string*, *pattern* [, *flags* ]) | returns no rows (if no match); a row containing an array of capture matches (if the pattern uses capture groups), or a row containing the entire matched substring. (The `g` flag, if supplied, results in one returned row for each match in *string*.) | | | regexp_split_to_table(*string*, *pattern* [, *flags* ]) | splits *string* using *pattern* as a delimiter, returning a row for each fragment (ignoring zero-length fragments) | | | regexp_split_to_array(*string*, *pattern* [, *flags* ]) | same as `regexp_split_to_table` except it returns an array of strings rather than rows | ## Alternation
Ruby Postgres Explanation
| | combines two expressions into a single one that matches either of the expressions; each expression is an alternative
## Atoms Atoms match a sequence of one or more characters (or *zero* or more characters, in the case of subpatterns). | Ruby | Postgres | Explanation | |-|-|-| | (*re*) | (*re*) | a sub-pattern (capturing) | | (?<*name*>*re*) | | a sub-pattern (named capture) | | (?"*name*"*re*) | | a sub-pattern (named capture) | | (?:*re*) | (?:*re*) | a sub-pattern (non-capturing) | | (?>*re*) | | a sub-pattern (atomic, non-capturing) | | . | . | any single character | | [*chars*] | [*chars*] | a character class | | [^*chars*] | [^*chars*] | a negated character class | | \\*k* | \\*k* | where *k* is non-alphanumeric: matches *k* | | \\*c* | \\*c* | where *c* is alphanumeric: an escape | | { | { | if followed by a digit, introduces a bound quantifier; otherwise matches { | | *x* | *x* | other characters match themselves | (Additionally, backreferences and escapes function as atoms.) ## Quantifiers Quantifiers can follow atoms, and they change the number of occurrences of the atom that can be matched. (Without a subsequent quantifier, an atom will always match *exactly one* consecutive occurrence.) | Ruby | Postgres | Explanation | |-|-|-| | * | * | | | + | + | | | ? | ? | | | {*m*} | {*m*} | | | {*m*,} | {*m*,} | | | {,*m*} | | | | {*m*,*n*} | {*m*,*n*} | | | *? | *? | | | +? | +? | | | ?? | ?? | | | {*m*}? | {*m*}? | | | {*m*,}? | {*m*,}? | | | {,*m*}? | | | | {*m*,*n*}? | {*m*,*n*}? | | | *+ | | | | ++ | | | | ?+ | | | | {*m*}+ | | | | {*m*,}+ | | | | {,*m*}+ | | | | {*m*,*n*}+ | | | ## Constraints / Anchors | Ruby | Postgres | Explanation | |-|-|-| | ^ | ^ | beginning of line | | \\A | \\A | beginning of string | | $ | $ | end of line | | \\Z | \\Z | end of string (just before a terminating newline, if any) | | \\z | | end of string | | (?=*re*) | (?=*re*) | empty string when following characters match *re* | | (?!*re*) | (?!*re*) | empty string when following characters do not match *re* | | (?<=*re*) | | empty string when preceding characters match *re* | | (? | | reference to a named capture | | \\g<*name*> | | reference to a named subpattern (re-evaluates the subpattern, rather than matching the same text) ### Substitution Backreferences In Ruby, information from the match may be used in any of three ways: as global variables (`$something`), as substitution references in the second argument of `sub` or `gsub` (`\\something`) or by calling into the MatchData object returned from `match` (`md.something` or `md[:something]`). In Postgres, these can be used in the second argument to `regexp_replace`. | Ruby | Postgres | Explanation | |-|-|-| | `$-`, `Regexp.last_match` | | the MatchData object from the most recent match | | `$&`, `\&`, `md[0]` | `\&` | the complete matched text | | $\`, \\\`, `md.pre_match` | | the text of the string preceding the match | | `$'`, `\'`, `md.post_match` | | the text of the string after the match | | `$1`, `\1`, `md[1]` | `\1` | the first capture group (and so on for other numbered capture groups) | | `$+`, `\+` | | the last capture group | | \k<*name*>, md[:*name*] | | the named capture group with name *name* | ## Escapes ### Pattern Escapes | Ruby | Postgres | Explanation | |-|-|-| | \\d | \\d | [[:digit:]] | | \\D | \\D | [^[:digit:]] | | \\h | | [[:xdigit:]] | | \\H | | [^[:xdigit:]] | | \\s | \\s | [[:space:]] | | \\S | \\S | [^[:space:]] | | \\w | \\w | [[:alnum:]_] (note underscore is included) | | \\W | \\W | [^[:alnum:]_] (note underscore is included) | | \p{Alnum} | | Alphabetic and numeric character | | \p{Alpha} | | Alphabetic character | | \p{Blank} | | Space or tab | | \p{Cntrl} | | Control character | | \p{Digit} | | Digit | | \p{Graph} | | Non-blank character (excludes spaces, control characters, and similar) | | \p{Lower} | | Lowercase alphabetical character | | \p{Print} | | Like \p{Graph}, but includes the space character | | \p{Punct} | | Punctuation character | | \p{Space} | | Whitespace character ([:blank:], newline, carriage return, etc.) | | \p{Upper} | | Uppercase alphabetical | | \p{XDigit} | | Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F) | | \p{Word} | | A member of one of the following Unicode general category Letter, Mark, Number, Connector_Punctuation | | \p{ASCII} | | A character in the ASCII character set | | \p{Any} | | Any Unicode character (including unassigned characters) | | \p{Assigned} | | An assigned character | | \p{L} | | Any character with Unicode *General Category* 'Letter' | | \p{Ll} | | Any character with Unicode *General Category* 'Letter: Lowercase' | | \p{Lm} | | Any character with Unicode *General Category* 'Letter: Mark' | | \p{Lo} | | Any character with Unicode *General Category* 'Letter: Other' | | \p{Lt} | | Any character with Unicode *General Category* 'Letter: Titlecase' | | \p{Lu} | | Any character with Unicode *General Category* 'Letter: Uppercase | | \p{Lo} | | Any character with Unicode *General Category* 'Letter: Other' | | \p{M} | | Any character with Unicode *General Category* 'Mark' | | \p{Mn} | | Any character with Unicode *General Category* 'Mark: Nonspacing' | | \p{Mc} | | Any character with Unicode *General Category* 'Mark: Spacing Combining' | | \p{Me} | | Any character with Unicode *General Category* 'Mark: Enclosing' | | \p{N} | | Any character with Unicode *General Category* 'Number' | | \p{Nd} | | Any character with Unicode *General Category* 'Number: Decimal Digit' | | \p{Nl} | | Any character with Unicode *General Category* 'Number: Letter' | | \p{No} | | Any character with Unicode *General Category* 'Number: Other' | | \p{P} | | Any character with Unicode *General Category* 'Punctuation' | | \p{Pc} | | Any character with Unicode *General Category* 'Punctuation: Connector' | | \p{Pd} | | Any character with Unicode *General Category* 'Punctuation: Dash' | | \p{Ps} | | Any character with Unicode *General Category* 'Punctuation: Open' | | \p{Pe} | | Any character with Unicode *General Category* 'Punctuation: Close' | | \p{Pi} | | Any character with Unicode *General Category* 'Punctuation: Initial Quote' | | \p{Pf} | | Any character with Unicode *General Category* 'Punctuation: Final Quote' | | \p{Po} | | Any character with Unicode *General Category* 'Punctuation: Other' | | \p{S} | | Any character with Unicode *General Category* 'Symbol' | | \p{Sm} | | Any character with Unicode *General Category* 'Symbol: Math' | | \p{Sc} | | Any character with Unicode *General Category* 'Symbol: Currency' | | \p{Sc} | | Any character with Unicode *General Category* 'Symbol: Currency' | | \p{Sk} | | Any character with Unicode *General Category* 'Symbol: Modifier' | | \p{So} | | Any character with Unicode *General Category* 'Symbol: Other' | | \p{Z} | | Any character with Unicode *General Category* 'Separator' | | \p{Zs} | | Any character with Unicode *General Category* 'Separator: Space' | | \p{Zl} | | Any character with Unicode *General Category* 'Separator: Line' | | \p{Zp} | | Any character with Unicode *General Category* 'Separator: Paragraph' | | \p{C} | | Any character with Unicode *General Category* 'Other' | | \p{Cc} | | Any character with Unicode *General Category* 'Other: Control' | | \p{Cf} | | Any character with Unicode *General Category* 'Other: Format' | | \p{Cn} | | Any character with Unicode *General Category* 'Other: Not Assigned' | | \p{Co} | | Any character with Unicode *General Category* 'Other: Private Use' | | \p{Cs} | | Any character with Unicode *General Category* 'Other: Surrogate' | | \p{*script*} | | Any character from the Unicode *script*, where *script* is one of *Arabic*, *Armenian*, *Balinese*, *Bengali*, *Bopomofo*, *Braille*, *Buginese*, *Buhid*, *Canadian_Aboriginal*, *Carian*, *Cham*, *Cherokee*, *Common*, *Coptic*, *Cuneiform*, *Cypriot*, *Cyrillic*, *Deseret*, *Devanagari*, *Ethiopic*, *Georgian*, *Glagolitic*, *Gothic*, *Greek*, *Gujarati*, *Gurmukhi*, *Han*, *Hangul*, *Hanunoo*, *Hebrew*, *Hiragana*, *Inherited*, *Kannada*, *Katakana*, *Kayah_Li*, *Kharoshthi*, *Khmer*, *Lao*, *Latin*, *Lepcha*, *Limbu*, *Linear_B*, *Lycian*, *Lydian*, *Malayalam*, *Mongolian*, *Myanmar*, *New_Tai_Lue*, *Nko*, *Ogham*, *Ol_Chiki*, *Old_Italic*, *Old_Persian*, *Oriya*, *Osmanya*, *Phags_Pa*, *Phoenician*, *Rejang*, *Runic*, *Saurashtra*, *Shavian*, *Sinhala*, *Sundanese*, *Syloti_Nagri*, *Syriac*, *Tagalog*, *Tagbanwa*, *Tai_Le*, *Tamil*, *Telugu*, *Thaana*, *Thai*, *Tibetan*, *Tifinagh*, *Ugaritic*, *Vai*, and *Yi* | (Any of the above escapes of the form \p{*something*} can be negated by using the `^` character, as \p{^*something*}.) ### Literal Character Escapes | Ruby | Postgres | Explanation | |-|-|-| | \\\\ | \\\\ | backslash | | \\a | \\a | alert (bell) character, as in C | | \\b | | backspace (only in character class) | | | \\b | backspace, as in C | | | \\B | synonym for backslash (\\) to help reduce the need for backslash doubling | | | \\c*X* | (where *X* is any character) the character whose low-order 5 bits are the same as those of *X*, and whose other bits are all zero | | \\e | | the escape character | | | \\e | the character whose collating-sequence name is ESC, or failing that, the character with octal value 033 | | \\f | \\f | form feed, as in C | | \\n | \\n | newline, as in C | | \\r | \\r | carriage return, as in C | | \\t | \\t | horizontal tab, as in C | | | \\u*hhhh* | (where *hhhh* is exactly four hexadecimal digits) the character whose hexadecimal value is 0x*hhhh* | | | \\U*hhhhhhhh* | (where *hhhhhhhh* is exactly eight hexadecimal digits) the character whose hexadecimal value is 0x*hhhhhhhh* | | \\v | \\v | vertical tab, as in C | | | \\x*hhh* | (where *hhh* is any sequence of hexadecimal digits) the character whose hexadecimal value is 0x*hhh* (a single character no matter how many hexadecimal digits are used) | | | \\0 | the character whose value is 0 (the null byte) | | | \\*oo* | (where *oo* is exactly two octal digits, and is not a back reference) the character whose octal value is 0*oo* | | | \\*ooo* | (where *ooo* is exactly three octal digits, and is not a back reference) the character whose octal value is 0*ooo* | ## Options In Ruby, the single-letter version of an option can be specified after the closing delimiter of the regexp. Options `i`, `m`, and `x` can also be embedded within the expression using the (?*on*-*off*:*re*) syntax, which turns on options *on* and turns off options *off* while interpreting subpattern *re*. The `Regexp::CONSTANT` version can be passed as the second parameter to `Regexp.new` (optionally combined with other constants with `|`). In Postgres, the options can be included at the very start of an expression (possibly after an initial `***=`) using the syntax (?*opts*); the options included in the string *opts* are in effect for the entire regular expression. The options can also be included in a string passed as a parameter to various pattern-related functions. | Ruby | Postgres | Explanation | |-|-|-| | i or `Regexp::IGNORECASE` | | case-insensitive | | m or `Regexp::MULTILINE` | | multiline (treat newline as a character matched by `.`) | | x or `Regexp::EXTENDED` | | ignore whitespace and comments in pattern | | o | | perform `#{}` interpolation only once || | u | | regexp is encoded as UTF-8 | | e | | regexp is encoded as EUC-JP | | s | | regexp is encoded as Windows-31J | | n | | regexp is encoded as ASCII-8BIT | | | \*\*\*: | (at beginning of pattern) the rest of the pattern is an ARE | | | \*\*\*= | (at beginning of pattern) the rest of the pattern is a literal string | | | b | rest of RE is a BRE | | | c | case-sensitive matching (overrides operator type) | | | e | rest of RE is an ERE | | | i | case-insensitive matching (overrides operator type) | | | m | historical synonym for n | | | n | newline-sensitive matching | | | p | partial newline-sensitive matching | | | q | rest of RE is a literal ("quoted") string, all ordinary characters | | | s | non-newline-sensitive matching (default) | | | t | tight syntax (default) | | | w | inverse partial newline-sensitive ("weird") matching | | | x | expanded syntax | | | g | (with `regexp_replace` and `regexp_matches` only) operate on all matches, not just the first |