@imranity
Created April 2, 2023 18:49
Revisions

  1. imranity created this gist Apr 2, 2023.
    67 changes: 67 additions & 0 deletions python-regex.md
    # python regex

    ## chap1 : intro to regex

    * regex: a regular expression is a text pattern that defines the form a text string should have.
    - useful for checking that an email address follows the expected pattern
    - matching both spellings of a word, "color" and "colour"
    - extracting specific info, like a postal code
    LOL: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

    * How regex started (birth of grep)
    Ken Thompson's work didn't end in just writing a paper. He included support for these regular expressions in his version of QED. To search with a regular expression in QED, the following had to be written:
    g/<regular expression>/p
    In the preceding line, g means global search and p means print. If we write the short form re in place of "regular expression", we get g/re/p, and therefore the beginnings of the venerable UNIX command-line tool grep.

    In file-name wildcards (a simpler cousin of regex, not regex syntax itself):
    ? : matches a single char (file?.xml matches file1.xml and file9.xml but not file99.xml)
    * : matches any number of chars

    in file?.xml:
    literals -> file and .xml
    metacharacters -> ? (or '\*')
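
The wildcard rules above can be tried out from Python. This is only a sketch: it uses the standard-library `fnmatch` module, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, just for the demo.
names = ["file1.xml", "file9.xml", "file99.xml", "file.xml"]

# '?' stands for exactly one character.
single = [n for n in names if fnmatchcase(n, "file?.xml")]

# '*' stands for any run of characters, including an empty one.
any_run = [n for n in names if fnmatchcase(n, "file*.xml")]

print(single)   # ['file1.xml', 'file9.xml']
print(any_run)  # ['file1.xml', 'file9.xml', 'file99.xml', 'file.xml']
```

`fnmatchcase` is used instead of `fnmatch` so the comparison is case-sensitive on every OS.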
    ### OUR FIRST REGEX
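    /a\w*/ ==> matches 'a' followed by any number of word chars (this can start in the middle of a word; to match only words starting with 'a', add a word boundary: /\ba\w*/)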
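
A quick sketch of that first regex in Python, assuming the standard `re` module and a made-up sample sentence. It shows why a word boundary `\b` is needed if only words *starting* with 'a' are wanted.

```python
import re

text = "an apple and a banana"  # sample text, chosen for the demo

# Unanchored: 'a' plus word characters, found anywhere, even mid-word.
loose = re.findall(r"a\w*", text)

# Anchored with \b: only words that begin with 'a'.
words = re.findall(r"\ba\w*", text)

print(loose)  # ['an', 'apple', 'and', 'a', 'anana']
print(words)  # ['an', 'apple', 'and', 'a']
```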

    ### Escaping Metacharacters
    Metachars and literals can coexist in a pattern, but what if we need to use a metachar as a literal?
    3 ways to do it:
    * escape the metachar by preceding it with a backslash
    * in Python, use "re.escape"
    * quoting with \Q and \E (not supported in Python)
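
The first two options can be sketched like this. The sample string is an assumption picked because it contains several metachars; `re.escape` and `re.fullmatch` are standard `re` functions.

```python
import re

literal = "1+1=2 (really?)"  # sample text full of metacharacters

# Option 1: escape by hand, one backslash before each metacharacter.
by_hand = r"1\+1=2 \(really\?\)"

# Option 2: let re.escape add the backslashes.
escaped = re.escape(literal)

hand_ok = re.fullmatch(by_hand, literal) is not None
escape_ok = re.fullmatch(escaped, literal) is not None

# Unescaped, '+', '(' , ')' and '?' keep their special meaning,
# so the text does not match itself when used as a pattern.
raw_ok = re.search(literal, literal) is not None

print(hand_ok, escape_ok, raw_ok)  # True True False
```

`re.escape` is the safer choice when the literal text comes from user input, since nothing has to be escaped by hand.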

    There are 12 metachars that must be escaped when used as literal chars:
    \ Backslash
    ^ Caret
    $ Dollar sign
    . Dot
    | Pipe symbol
    ? Question mark
    * Asterisk
    + Plus sign
    ( Opening parenthesis
    ) Closing parenthesis
    [ Opening square bracket
    { Opening curly brace
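
A small check of the list above, as a sketch only: `matches_only_itself` is a hypothetical helper written for these notes, not part of `re`. It escapes each of the 12 chars with a backslash and confirms the result matches that char and not some other one.

```python
import re

# The 12 metacharacters from the list, as one raw string.
METACHARS = r"\^$.|?*+()[{"

def matches_only_itself(ch: str) -> bool:
    """True if backslash + ch matches ch itself but not the letter 'x'."""
    pattern = "\\" + ch
    return (
        re.fullmatch(pattern, ch) is not None
        and re.fullmatch(pattern, "x") is None
    )

results = {ch: matches_only_itself(ch) for ch in METACHARS}
print(len(results), all(results.values()))  # 12 True

# Without the backslash, '.' keeps its special meaning and matches any char.
dot_matches_x = re.fullmatch(".", "x") is not None
print(dot_matches_x)  # True
```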


    ## Character class
    Character classes let us define a set of chars; the class matches if any one char from the set is present at that position.

    for example, to match "license" and "licence" --> /licen[sc]e/
    we can use a range of chars [b-e] or digits [2-9]
    Ranges can be combined: [0-9a-zA-Z]
    * Negation of ranges: [^0-9] matches anything that is not a digit, but there has to be a char, e.g. /hello[^0-9]/ won't match "hello" on its own, as there is no char after it to fill that position
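
The points above, sketched with the standard `re` module. The test words are made up; the `[A-z]` line is there to show why the uppercase Z matters in a combined range.

```python
import re

# A set: either 's' or 'c' at that position.
spelling = re.compile(r"licen[sc]e")
spelling_hits = [w for w in ["license", "licence", "licenze"]
                 if spelling.fullmatch(w)]
print(spelling_hits)  # ['license', 'licence']

# Combined ranges: digits, lowercase, uppercase.
alnum_ok = re.fullmatch(r"[0-9a-zA-Z]+", "abc123XYZ") is not None

# Pitfall: A-z (lowercase z) also covers [ \ ] ^ _ ` in ASCII order.
underscore_in_A_z = re.fullmatch(r"[A-z]", "_") is not None
underscore_in_A_Z = re.fullmatch(r"[A-Z]", "_") is not None
print(alnum_ok, underscore_in_A_z, underscore_in_A_Z)  # True True False

# Negation still consumes one char, so bare "hello" fails.
negated = re.compile(r"hello[^0-9]")
negated_hits = [s for s in ["hello", "hello!", "hello1"]
                if negated.search(s)]
print(negated_hits)  # ['hello!']
```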

    ### Predefined char class

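    | Element | Description |
    | --- | --- |
    | . | matches any char except newline |
    | \d | matches any decimal digit, equivalent to [0-9] |
    | \D | matches any non-digit, eq to [^0-9] |
    | \s | matches any whitespace char, eq to [ \t\n\r\f\v] |
    | \S | matches any non-whitespace char, eq to [^ \t\n\r\f\v] |
    | \w | matches any alphanumeric char or underscore, eq to [0-9a-zA-Z_] |

    The bracket equivalents hold for ASCII text; with Python 3 str patterns, \d, \s and \w also match Unicode chars unless the re.ASCII flag is set.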
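
A sketch of the predefined classes in action, assuming the standard `re` module and a made-up sample string. The last two lines show the difference the `re.ASCII` flag makes for `\d` on a non-ASCII digit.

```python
import re

sample = "Room 42,\tfloor_3\n"  # sample text with a space, tab and newline

digits = re.findall(r"\d", sample)
word_runs = re.findall(r"\w+", sample)
non_space_runs = re.findall(r"\S+", sample)
whitespace = re.findall(r"\s", sample)

print(digits)          # ['4', '2', '3']
print(word_runs)       # ['Room', '42', 'floor_3']
print(non_space_runs)  # ['Room', '42,', 'floor_3']
print(whitespace)      # [' ', '\t', '\n']

# '.' skips the newline.
dot_hits = re.findall(r".", "a\nb")
print(dot_hits)        # ['a', 'b']

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit, but not in [0-9].
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII) is not None
print(unicode_digit, ascii_digit)  # True False
```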

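    [^\/\\] -> matches any char that's not a slash or a backslash (the backslash must itself be escaped inside the class, otherwise it escapes the closing bracket)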
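
A sketch of a class that excludes both kinds of slash, assuming the standard `re` module and a made-up path. In Python the forward slash needs no escaping, so the class is written `[^/\\]` in a raw string.

```python
import re

# One or more chars that are neither '/' nor '\'.
no_slashes = re.compile(r"[^/\\]+")

path = r"C:\Users/guest\notes.txt"  # hypothetical mixed-separator path
parts = no_slashes.findall(path)

print(parts)  # ['C:', 'Users', 'guest', 'notes.txt']
```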