Last active
November 3, 2022 10:10
-
-
Save bgromov/ff624ad1fda63b47b75596e8c13dc3a5 to your computer and use it in GitHub Desktop.
Revisions
-
bgromov renamed this gist
Aug 20, 2021 . 1 changed file with 0 additions and 0 deletions.There are no files selected for viewing
File renamed without changes. -
bgromov revised this gist
Aug 20, 2021 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -33,7 +33,7 @@ df.query('a.apply(lambda x: x[0]) > 5') # NotImplementedError: 'Lambda' nodes ar Luckily, we can solve this problem with [custom accessors](https://pandas.pydata.org/pandas-docs/stable/development/extending.html). We can extend the `pandas.Series` and extract necessary information from values of complex types. First, we have to register an accessor: ```python -
bgromov revised this gist
Aug 20, 2021 . 1 changed file with 60 additions and 43 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,60 +1,77 @@ ## Pandas: `DataFrame.query()` for complex data types Consider the following pandas `DataFrame`: ```python df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a']) ``` ``` a 0 [1, 2, 3] 1 [4, 5, 6] 2 [7, 8, 9] 3 [0, 1, 2] ``` How do we query all rows where the first element is greater than 5? One of the ways would be to use `DataFrame.query()`. Unfortunatelly, there is no way to access individual elements of cell values, i.e., we can't do this: ```python df.query('a[0] > 5') # TypeError: '>' not supported between instances of 'list' and 'int' ``` Clearly, that happens because the item accessor selects a row number 0 which is then compared to a scalar number. Another approach would be to extract necessary data using a lambda function within `apply()` method; however, it is not supported yet: ```python df.query('a.apply(lambda x: x[0]) > 5') # NotImplementedError: 'Lambda' nodes are not implemented ``` Luckily, we can solve this problem with [custom accessors](https://pandas.pydata.org/pandas-docs/stable/development/extending.html). We can extend the `pandas.Series` and extract necessary information from values of complex types. First, we have to register, an accessor: ```python import pandas as pd @pd.api.extensions.register_series_accessor('list') class ListAccessor: def __init__(self, pandas_obj): self._obj = pandas_obj def __getitem__(self, index): return self._obj.apply(lambda x: x[index]) ``` Then, we can immediately use it with `query()` or `eval()`: Eval: ```python df.eval('a.list[1]') ``` ``` 0 2 1 5 2 8 3 1 Name: a, dtype: int64 ``` Query: ```python df.query('a.list[2] > 4') ``` ``` a 1 [4, 5, 6] 2 [7, 8, 9] ``` -
bgromov revised this gist
Aug 20, 2021 . 1 changed file with 2 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,3 +1,5 @@ ## Pandas: `DataFrame.query()` for complex data types This method allows one to query data from pandas DataFrame where cells are of complex data types, e.g. `object`, `list`, `dict`. 1. Register custom `pandas.Series` accessor: -
bgromov revised this gist
Aug 20, 2021 . 1 changed file with 48 additions and 48 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -2,57 +2,57 @@ This method allows one to query data from pandas DataFrame where cells are of co 1. Register custom `pandas.Series` accessor: ```python import pandas as pd @pd.api.extensions.register_series_accessor('list') class ListAccessor: def __init__(self, pandas_obj): self._obj = pandas_obj def __getitem__(self, index): return self._obj.apply(lambda x: x[index]) ``` 2. Use it with `query` or `eval`: Create a `DataFrame`: ```python df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a']) ``` ``` a 0 [1, 2, 3] 1 [4, 5, 6] 2 [7, 8, 9] 3 [0, 1, 2] ``` Eval: ```python df.eval('a.list[1]') ``` ``` 0 2 1 5 2 8 3 1 Name: a, dtype: int64 ``` Query: ```python df.query('a.list[2] > 4') ``` ``` a 1 [4, 5, 6] 2 [7, 8, 9] ``` -
bgromov revised this gist
Aug 20, 2021 . 1 changed file with 14 additions and 3 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -17,31 +17,42 @@ class ListAccessor: 2. Use it with `query` or `eval`: Create a `DataFrame`: ```python df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a']) ``` ``` a 0 [1, 2, 3] 1 [4, 5, 6] 2 [7, 8, 9] 3 [0, 1, 2] ``` Eval: ```python df.eval('a.list[1]') ``` ``` 0 2 1 5 2 8 3 1 Name: a, dtype: int64 ``` Query: ```python df.query('a.list[2] > 4') ``` ``` a 1 [4, 5, 6] 2 [7, 8, 9] ``` -
bgromov created this gist
Aug 20, 2021 .There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,47 @@ This method allows one to query data from pandas DataFrame where cells are of complex data types, e.g. `object`, `list`, `dict`. 1. Register custom `pandas.Series` accessor: ```python import pandas as pd @pd.api.extensions.register_series_accessor('list') class ListAccessor: def __init__(self, pandas_obj): self._obj = pandas_obj def __getitem__(self, index): return self._obj.apply(lambda x: x[index]) ``` 2. Use it with `query` or `eval`: ```python df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[0, 1, 2]]], columns=['a']) a 0 [1, 2, 3] 1 [4, 5, 6] 2 [7, 8, 9] 3 [0, 1, 2] df.eval('a.list[1]') 0 2 1 5 2 8 3 1 Name: a, dtype: int64 df.query('a.list[2] > 4') a 1 [4, 5, 6] 2 [7, 8, 9] ```