@jiaoyk
Forked from ciscorn/unicode_normalization.md
Created July 25, 2024 06:24

Revisions

  1. @ciscorn ciscorn revised this gist Sep 27, 2021. No changes.
  2. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    Original file line number Diff line number Diff line change
    @@ -1,4 +1,4 @@
    -# How to normalize Unicode strings in Python / Go / Rust
    +# Normalizing Unicode strings in Python / Go / Rust

    NFC, NFD, NFKC, NFKD

  3. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -1,4 +1,4 @@
    -# Unicode normalization in Python / Go / Rust
    +# How to normalize Unicode strings in Python / Go / Rust

    NFC, NFD, NFKC, NFKD

  4. @ciscorn ciscorn revised this gist Sep 27, 2021. No changes.
  5. @ciscorn ciscorn revised this gist Sep 27, 2021. No changes.
  6. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions unicode_normalization.md
    @@ -37,7 +37,7 @@ conv_and_print("NFKD")

    ## Go

    -You can use [golang.org/x/text/unicode/norm](https://pkg.go.dev/golang.org/x/text/unicode/norm) package.
    +You can use the [golang.org/x/text/unicode/norm](https://pkg.go.dev/golang.org/x/text/unicode/norm) package.

    ```go
    package main
    @@ -61,7 +61,7 @@ func main() {

    ## Rust

    -You can use [unicode-normalization](https://crates.io/crates/unicode-normalization) crate.
    +You can use the [unicode-normalization](https://crates.io/crates/unicode-normalization) crate.

    ```rust
    fn main() {
  7. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -37,7 +37,7 @@ conv_and_print("NFKD")

    ## Go

    -You can use golang.org/x/text/unicode/norm package.
    +You can use [golang.org/x/text/unicode/norm](https://pkg.go.dev/golang.org/x/text/unicode/norm) package.

    ```go
    package main
  8. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions unicode_normalization.md
    @@ -37,6 +37,8 @@ conv_and_print("NFKD")

    ## Go

    You can use golang.org/x/text/unicode/norm package.

    ```go
    package main

    @@ -59,6 +61,8 @@ func main() {

    ## Rust

    You can use [unicode-normalization](https://crates.io/crates/unicode-normalization) crate.

    ```rust
    fn main() {
        use unicode_normalization::UnicodeNormalization;
  9. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -2,7 +2,7 @@

    NFC, NFD, NFKC, NFKD

    -source:
    +input:
    ```
    it’säå(1−2)ドブロク㍿
    ```
  10. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 7 additions and 3 deletions.
    10 changes: 7 additions & 3 deletions unicode_normalization.md
    @@ -2,12 +2,16 @@

    NFC, NFD, NFKC, NFKD

    source:
    ```
    it’säå(1−2)ドブロク㍿
    ```
    source: it’säå(1−2)ドブロク㍿

    result:
    NFC: it’säå(1−2)ドブロク㍿ (45 bytes)
    NFD: it’säå(1−2)ドブロク㍿ (50 bytes)

    ```
    NFC : it’säå(1−2)ドブロク㍿ (45 bytes)
    NFD : it’säå(1−2)ドブロク㍿ (50 bytes)
    NFKC: it’säå(1−2)ドブロク株式会社 (41 bytes)
    NFKD: it’säå(1−2)ドブロク株式会社 (49 bytes)
    ```
  11. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 10 additions and 0 deletions.
    10 changes: 10 additions & 0 deletions unicode_normalization.md
    @@ -2,6 +2,16 @@

    NFC, NFD, NFKC, NFKD

    ```
    source: it’säå(1−2)ドブロク㍿
    result:
    NFC: it’säå(1−2)ドブロク㍿ (45 bytes)
    NFD: it’säå(1−2)ドブロク㍿ (50 bytes)
    NFKC: it’säå(1−2)ドブロク株式会社 (41 bytes)
    NFKD: it’säå(1−2)ドブロク株式会社 (49 bytes)
    ```

    ## Python

    ```python
  12. @ciscorn ciscorn revised this gist Sep 27, 2021. 1 changed file with 6 additions and 0 deletions.
    6 changes: 6 additions & 0 deletions unicode_normalization.md
    @@ -2,6 +2,8 @@

    NFC, NFD, NFKC, NFKD

    ## Python

    ```python
    import unicodedata

    @@ -19,6 +21,8 @@ conv_and_print("NFKD")

    ```

    ## Go

    ```go
    package main

    @@ -39,6 +43,8 @@ func main() {

    ```

    ## Rust

    ```rust
    fn main() {
        use unicode_normalization::UnicodeNormalization;
  13. @ciscorn ciscorn created this gist Sep 27, 2021.
    53 changes: 53 additions & 0 deletions unicode_normalization.md
    @@ -0,0 +1,53 @@
    # Unicode normalization in Python / Go / Rust

    NFC, NFD, NFKC, NFKD

    ```python
    import unicodedata


    def conv_and_print(form):
        # Normalize the sample string to the given form and report its UTF-8 size.
        src = "it’säå(1−2)ドブロク㍿"
        norm = unicodedata.normalize(form, src)
        print(f"{form}: {norm} ({len(norm.encode('utf-8'))} bytes)")


    conv_and_print("NFC")
    conv_and_print("NFD")
    conv_and_print("NFKC")
    conv_and_print("NFKD")

    ```

    ```go
    package main

    import (
    	"fmt"

    	"golang.org/x/text/unicode/norm"
    )

    func main() {
    	src := "it’säå(1−2)ドブロク㍿"
    	forms := map[string]norm.Form{"NFC": norm.NFC, "NFD": norm.NFD, "NFKC": norm.NFKC, "NFKD": norm.NFKD}
    	// Map iteration order is unspecified, so the four lines may print in any order.
    	for name, form := range forms {
    		// Named out (not norm) so the variable does not shadow the norm package.
    		out := form.String(src)
    		fmt.Printf("%s: %v (%v bytes)\n", name, out, len(out))
    	}
    }

    ```

    ```rust
    fn main() {
        use unicode_normalization::UnicodeNormalization;

        let s = "it’säå(1−2)ドブロク㍿";
        let print = |form, norm: &str| println!("{}: {} ({} bytes)", form, norm, norm.len());
        print("NFC", &s.nfc().collect::<String>());
        print("NFD", &s.nfd().collect::<String>());
        print("NFKC", &s.nfkc().collect::<String>());
        print("NFKD", &s.nfkd().collect::<String>());
    }
    ```
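
A common reason to normalize is comparison: two strings that render identically can differ code point by code point. The following is a minimal sketch, not part of the original gist, that compares strings after normalizing both to the same form. The helper name `same_text` is hypothetical; only `unicodedata.normalize` from the standard library is used.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e4"     # ä as a single code point
decomposed = "a\u0308"  # a followed by COMBINING DIAERESIS

print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: canonically equivalent

# U+337F (㍿) only has a compatibility decomposition, so NFC leaves it alone.
print(same_text("\u337f", "株式会社"))          # False
print(same_text("\u337f", "株式会社", "NFKC"))  # True
```

The canonical forms (NFC, NFD) are usually enough for equality checks, while the compatibility forms (NFKC, NFKD) also fold characters such as ㍿ and can therefore discard distinctions the original text carried.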