@ciscorn
Last active July 25, 2024 06:24

Revisions

  1. ciscorn revised this gist Sep 27, 2021. No changes.
  2. ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -1,4 +1,4 @@
    # How to normalize Unicode strings in Python / Go / Rust
    # Normalizing Unicode strings in Python / Go / Rust

    NFC, NFD, NFKC, NFKD

  3. ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -1,4 +1,4 @@
    # Unicode normalization in Python / Go / Rust
    # How to normalize Unicode strings in Python / Go / Rust

    NFC, NFD, NFKC, NFKD

  4. ciscorn revised this gist Sep 27, 2021. No changes.
  5. ciscorn revised this gist Sep 27, 2021. No changes.
  6. ciscorn revised this gist Sep 27, 2021. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions unicode_normalization.md
    @@ -37,7 +37,7 @@ conv_and_print("NFKD")

    ## Go

    You can use [golang.org/x/text/unicode/norm](https://pkg.go.dev/golang.org/x/text/unicode/norm) package.
    You can use the [golang.org/x/text/unicode/norm](https://pkg.go.dev/golang.org/x/text/unicode/norm) package.

    ```go
    package main
    @@ -61,7 +61,7 @@ func main() {

    ## Rust

    You can use [unicode-normalization](https://crates.io/crates/unicode-normalization) crate.
    You can use the [unicode-normalization](https://crates.io/crates/unicode-normalization) crate.

    ```rust
    fn main() {
  7. ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -37,7 +37,7 @@ conv_and_print("NFKD")

    ## Go

    You can use golang.org/x/text/unicode/norm package.
    You can use [golang.org/x/text/unicode/norm](https://pkg.go.dev/golang.org/x/text/unicode/norm) package.

    ```go
    package main
  8. ciscorn revised this gist Sep 27, 2021. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions unicode_normalization.md
    @@ -37,6 +37,8 @@ conv_and_print("NFKD")

    ## Go

    You can use golang.org/x/text/unicode/norm package.

    ```go
    package main

    @@ -59,6 +61,8 @@ func main() {

    ## Rust

    You can use [unicode-normalization](https://crates.io/crates/unicode-normalization) crate.

    ```rust
    fn main() {
        use unicode_normalization::UnicodeNormalization;
  9. ciscorn revised this gist Sep 27, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion unicode_normalization.md
    @@ -2,7 +2,7 @@

    NFC, NFD, NFKC, NFKD

    source:
    input:
    ```
    it’säå(1−2)ドブロク㍿
    ```
  10. ciscorn revised this gist Sep 27, 2021. 1 changed file with 7 additions and 3 deletions.
    10 changes: 7 additions & 3 deletions unicode_normalization.md
    @@ -2,12 +2,16 @@

    NFC, NFD, NFKC, NFKD

    source:
    ```
    it’säå(1−2)ドブロク㍿
    ```
    source: it’säå(1−2)ドブロク㍿

    result:
    NFC: it’säå(1−2)ドブロク㍿ (45 bytes)
    NFD: it’säå(1−2)ドブロク㍿ (50 bytes)

    ```
    NFC : it’säå(1−2)ドブロク㍿ (45 bytes)
    NFD : it’säå(1−2)ドブロク㍿ (50 bytes)
    NFKC: it’säå(1−2)ドブロク株式会社 (41 bytes)
    NFKD: it’säå(1−2)ドブロク株式会社 (49 bytes)
    ```
  11. ciscorn revised this gist Sep 27, 2021. 1 changed file with 10 additions and 0 deletions.
    10 changes: 10 additions & 0 deletions unicode_normalization.md
    @@ -2,6 +2,16 @@

    NFC, NFD, NFKC, NFKD

    ```
    source: it’säå(1−2)ドブロク㍿
    result:
    NFC: it’säå(1−2)ドブロク㍿ (45 bytes)
    NFD: it’säå(1−2)ドブロク㍿ (50 bytes)
    NFKC: it’säå(1−2)ドブロク株式会社 (41 bytes)
    NFKD: it’säå(1−2)ドブロク株式会社 (49 bytes)
    ```

    ## Python

    ```python
  12. ciscorn revised this gist Sep 27, 2021. 1 changed file with 6 additions and 0 deletions.
    6 changes: 6 additions & 0 deletions unicode_normalization.md
    @@ -2,6 +2,8 @@

    NFC, NFD, NFKC, NFKD

    ## Python

    ```python
    import unicodedata

    @@ -19,6 +21,8 @@ conv_and_print("NFKD")

    ```

    ## Go

    ```go
    package main

    @@ -39,6 +43,8 @@ func main() {

    ```

    ## Rust

    ```rust
    fn main() {
        use unicode_normalization::UnicodeNormalization;
  13. ciscorn created this gist Sep 27, 2021.
    53 changes: 53 additions & 0 deletions unicode_normalization.md
    @@ -0,0 +1,53 @@
    # Unicode normalization in Python / Go / Rust

    NFC, NFD, NFKC, NFKD

    ```python
    import unicodedata


    def conv_and_print(form):
        src = "it’säå(1−2)ドブロク㍿"
        norm = unicodedata.normalize(form, src)
        print(f"{form}: {norm} ({len(norm.encode('utf-8'))} bytes)")


    conv_and_print("NFC")
    conv_and_print("NFD")
    conv_and_print("NFKC")
    conv_and_print("NFKD")

    ```

    ```go
    package main

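    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        src := "it’säå(1−2)ドブロク㍿"
        // A slice keeps the output order fixed; ranging over a map would randomize it.
        forms := []struct {
            name string
            form norm.Form
        }{
            {"NFC", norm.NFC}, {"NFD", norm.NFD}, {"NFKC", norm.NFKC}, {"NFKD", norm.NFKD},
        }
        for _, f := range forms {
            // Avoid naming the result "norm", which would shadow the package.
            normalized := f.form.String(src)
            fmt.Printf("%s: %v (%v bytes)\n", f.name, normalized, len(normalized))
        }
    }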

    ```

    ```rust
    fn main() {
        use unicode_normalization::UnicodeNormalization;

        let s = "it’säå(1−2)ドブロク㍿";
        let print = |form, norm: &str| println!("{}: {} ({} bytes)", form, norm, norm.len());
        print("NFC", &s.nfc().collect::<String>());
        print("NFD", &s.nfd().collect::<String>());
        print("NFKC", &s.nfkc().collect::<String>());
        print("NFKD", &s.nfkd().collect::<String>());
    }
    ```
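
As a minimal sketch of why the choice of form matters in practice, the Python snippet below compares strings that render alike but differ in code points. The sample strings and the `equal_after` helper are illustrative assumptions, not part of the gist; only `unicodedata.normalize` comes from the standard library.

```python
import unicodedata

# Hypothetical sample strings, written with escapes so the code points are explicit.
composed = "caf\u00e9"      # "é" as one precomposed code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
square = "\u337f"           # the single-character form of "株式会社"
spelled_out = "\u682a\u5f0f\u4f1a\u793e"


def equal_after(form: str, a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent strings differ as raw code points...
print(composed == decomposed)
# ...but match once both sides use the same canonical form.
print(equal_after("NFC", composed, decomposed))
print(equal_after("NFD", composed, decomposed))

# Compatibility equivalents only match under NFKC/NFKD.
print(equal_after("NFC", square, spelled_out))
print(equal_after("NFKC", square, spelled_out))

# Decomposed forms usually take more bytes in UTF-8.
print(len(unicodedata.normalize("NFC", composed).encode("utf-8")))
print(len(unicodedata.normalize("NFD", composed).encode("utf-8")))
```

The canonical forms (NFC, NFD) are enough for the accent case, while the ㍿ case only matches under the compatibility forms (NFKC, NFKD), which is consistent with the result listing in the gist.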