Skip to content

Instantly share code, notes, and snippets.

@bgromov
Last active November 3, 2022 10:10
Show Gist options
  • Save bgromov/ff624ad1fda63b47b75596e8c13dc3a5 to your computer and use it in GitHub Desktop.
Save bgromov/ff624ad1fda63b47b75596e8c13dc3a5 to your computer and use it in GitHub Desktop.

Revisions

  1. bgromov renamed this gist Aug 20, 2021. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  2. bgromov revised this gist Aug 20, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    Original file line number Diff line number Diff line change
    @@ -33,7 +33,7 @@ df.query('a.apply(lambda x: x[0]) > 5') # NotImplementedError: 'Lambda' nodes ar
    Luckily, we can solve this problem with [custom accessors](https://pandas.pydata.org/pandas-docs/stable/development/extending.html). We can extend the `pandas.Series` and extract necessary information from values of complex types.


    First, we have to register, an accessor:
    First, we have to register an accessor:

    ```python

  3. bgromov revised this gist Aug 20, 2021. 1 changed file with 60 additions and 43 deletions.
    103 changes: 60 additions & 43 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -1,60 +1,77 @@
    ## Pandas: `DataFrame.query()` for complex data types

    This method allows one to query data from pandas DataFrame where cells are of complex data types, e.g. `object`, `list`, `dict`.
    Consider the following pandas `DataFrame`:

    1. Register custom `pandas.Series` accessor:
    ```python
    df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
    ```

    ```python
    ```
    a
    0 [1, 2, 3]
    1 [4, 5, 6]
    2 [7, 8, 9]
    3 [0, 1, 2]
    ```

    import pandas as pd
    How do we query all rows where the first element is greater than 5?

    @pd.api.extensions.register_series_accessor('list')
    class ListAccessor:
    def __init__(self, pandas_obj):
    self._obj = pandas_obj
    One of the ways would be to use `DataFrame.query()`. Unfortunatelly, there is no way to access individual elements of cell values, i.e., we can't do this:

    def __getitem__(self, index):
    return self._obj.apply(lambda x: x[index])
    ```
    ```python
    df.query('a[0] > 5') # TypeError: '>' not supported between instances of 'list' and 'int'
    ```

    2. Use it with `query` or `eval`:
    Clearly, that happens because the item accessor selects a row number 0 which is then compared to a scalar number.

    Create a `DataFrame`:
    Another approach would be to extract necessary data using a lambda function within `apply()` method; however, it is not supported yet:

    ```python
    df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
    ```
    ```python
    df.query('a.apply(lambda x: x[0]) > 5') # NotImplementedError: 'Lambda' nodes are not implemented
    ```

    ```
    a
    0 [1, 2, 3]
    1 [4, 5, 6]
    2 [7, 8, 9]
    3 [0, 1, 2]
    ```
    Luckily, we can solve this problem with [custom accessors](https://pandas.pydata.org/pandas-docs/stable/development/extending.html). We can extend the `pandas.Series` and extract necessary information from values of complex types.

    Eval:

    ```python
    df.eval('a.list[1]')
    ```
    First, we have to register, an accessor:

    ```
    0 2
    1 5
    2 8
    3 1
    Name: a, dtype: int64
    ```
    ```python

    Query:
    import pandas as pd

    ```python
    df.query('a.list[2] > 4')
    ```
    @pd.api.extensions.register_series_accessor('list')
    class ListAccessor:
    def __init__(self, pandas_obj):
    self._obj = pandas_obj

    ```
    a
    1 [4, 5, 6]
    2 [7, 8, 9]
    ```
    def __getitem__(self, index):
    return self._obj.apply(lambda x: x[index])
    ```

    Then, we can immediately use it with `query()` or `eval()`:

    Eval:

    ```python
    df.eval('a.list[1]')
    ```

    ```
    0 2
    1 5
    2 8
    3 1
    Name: a, dtype: int64
    ```

    Query:

    ```python
    df.query('a.list[2] > 4')
    ```

    ```
    a
    1 [4, 5, 6]
    2 [7, 8, 9]
    ```
  4. bgromov revised this gist Aug 20, 2021. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -1,3 +1,5 @@
    ## Pandas: `DataFrame.query()` for complex data types

    This method allows one to query data from pandas DataFrame where cells are of complex data types, e.g. `object`, `list`, `dict`.

    1. Register custom `pandas.Series` accessor:
  5. bgromov revised this gist Aug 20, 2021. 1 changed file with 48 additions and 48 deletions.
    96 changes: 48 additions & 48 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -2,57 +2,57 @@ This method allows one to query data from pandas DataFrame where cells are of co

    1. Register custom `pandas.Series` accessor:

    ```python
    ```python

    import pandas as pd
    import pandas as pd

    @pd.api.extensions.register_series_accessor('list')
    class ListAccessor:
    def __init__(self, pandas_obj):
    self._obj = pandas_obj
    @pd.api.extensions.register_series_accessor('list')
    class ListAccessor:
    def __init__(self, pandas_obj):
    self._obj = pandas_obj

    def __getitem__(self, index):
    return self._obj.apply(lambda x: x[index])
    ```
    def __getitem__(self, index):
    return self._obj.apply(lambda x: x[index])
    ```

    2. Use it with `query` or `eval`:

    Create a `DataFrame`:

    ```python
    df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
    ```

    ```
    a
    0 [1, 2, 3]
    1 [4, 5, 6]
    2 [7, 8, 9]
    3 [0, 1, 2]
    ```

    Eval:

    ```python
    df.eval('a.list[1]')
    ```

    ```
    0 2
    1 5
    2 8
    3 1
    Name: a, dtype: int64
    ```

    Query:

    ```python
    df.query('a.list[2] > 4')
    ```

    ```
    a
    1 [4, 5, 6]
    2 [7, 8, 9]
    ```
    Create a `DataFrame`:

    ```python
    df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
    ```

    ```
    a
    0 [1, 2, 3]
    1 [4, 5, 6]
    2 [7, 8, 9]
    3 [0, 1, 2]
    ```

    Eval:

    ```python
    df.eval('a.list[1]')
    ```

    ```
    0 2
    1 5
    2 8
    3 1
    Name: a, dtype: int64
    ```

    Query:

    ```python
    df.query('a.list[2] > 4')
    ```

    ```
    a
    1 [4, 5, 6]
    2 [7, 8, 9]
    ```
  6. bgromov revised this gist Aug 20, 2021. 1 changed file with 14 additions and 3 deletions.
    17 changes: 14 additions & 3 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -17,31 +17,42 @@ class ListAccessor:

    2. Use it with `query` or `eval`:

    Create a `DataFrame`:

    ```python
    df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])
    ```

    ```
    a
    0 [1, 2, 3]
    1 [4, 5, 6]
    2 [7, 8, 9]
    3 [0, 1, 2]
    ```

    Eval:

    ```python
    df.eval('a.list[1]')
    ```

    ```
    0 2
    1 5
    2 8
    3 1
    Name: a, dtype: int64
    ```

    Query:

    ```python
    df.query('a.list[2] > 4')
    ```

    ```
    a
    1 [4, 5, 6]
    2 [7, 8, 9]
    ```



  7. bgromov created this gist Aug 20, 2021.
    47 changes: 47 additions & 0 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,47 @@
    This method allows one to query data from pandas DataFrame where cells are of complex data types, e.g. `object`, `list`, `dict`.

    1. Register custom `pandas.Series` accessor:

    ```python

    import pandas as pd

    @pd.api.extensions.register_series_accessor('list')
    class ListAccessor:
    def __init__(self, pandas_obj):
    self._obj = pandas_obj

    def __getitem__(self, index):
    return self._obj.apply(lambda x: x[index])
    ```

    2. Use it with `query` or `eval`:

    ```python
    df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a'])

    a
    0 [1, 2, 3]
    1 [4, 5, 6]
    2 [7, 8, 9]
    3 [0, 1, 2]


    df.eval('a.list[1]')

    0 2
    1 5
    2 8
    3 1
    Name: a, dtype: int64


    df.query('a.list[2] > 4')

    a
    1 [4, 5, 6]
    2 [7, 8, 9]
    ```