@TariqAHassan
Last active November 6, 2023 22:57

Revisions

  1. TariqAHassan revised this gist May 13, 2020. 1 changed file with 9 additions and 10 deletions.
    19 changes: 9 additions & 10 deletions PandasConcatWorkaround.adoc
    @@ -10,7 +10,8 @@ If these assumptions are not met, this approach could still work...but it
    will likely need to be modified.

    Start by importing `os`, `Pandas` and `chain` from `itertools`.
    ```

    ```python
    import os
    import pandas as pd
    from itertools import chain
    @@ -23,47 +24,45 @@ PATH_TO_FILES = '/your/path/here/'

    Read in the Data as Pandas DataFrames (csv files, in this example):

    ```
    ```python
    frames = list()
    for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
        frames.append(pd.read_csv(csv))
    ```

    Define a function to flatten large 2D lists quickly:
    ```

    ```python
    def fast_flatten(input_list):
        return list(chain.from_iterable(input_list))
    ```

    Next, get the column names from one of the dataframes (located at index 0):

    ```
    ```python
    COLUMN_NAMES = frames[0].columns
    ```

    Now, construct a dictionary from the column names:

    ```
    ```python
    df_dict = dict.fromkeys(COLUMN_NAMES, [])
    ```

    Iterate through the columns:

    ```
    ```python
    for col in COLUMN_NAMES:
        # Use a generator to save memory
        extracted = (frame[col] for frame in frames)

        # Flatten and save to df_dict
        df_dict[col] = fast_flatten(extracted)
    ```

    Lastly use the `from_dict` method to produce the combined DataFrame:
    ```
    ```python
    df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]
    ```

    While this method is not very pretty, it typically is much faster than `pd.concat()` and
    yields the exact same result.


  2. TariqAHassan revised this gist Dec 19, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion PandasConcatWorkaround.adoc
    @@ -1,5 +1,5 @@
    Pandas DataFrames are fantastic. However, concatenating them using standard approaches,
    such as `pandas.concat()`, become very slow when the dataframes become very large.
    such as `pandas.concat()`, can be very slow with large dataframes.
    This is a work around for that problem.

    Note: this approach assumes that:
  3. TariqAHassan revised this gist Oct 3, 2016. No changes.
  4. TariqAHassan revised this gist Oct 3, 2016. No changes.
  5. TariqAHassan revised this gist Oct 3, 2016. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions PandasConcatWorkaround.adoc
    @@ -6,7 +6,7 @@ Note: this approach assumes that:
    (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
    (b) all dataframes share the same column names.

    If these assumptions are not met, this solution could still work...but it
    If these assumptions are not met, this approach could still work...but it
    will likely need to be modified.

    Start by importing `os`, `Pandas` and `chain` from `itertools`.
    @@ -60,7 +60,7 @@ for col in COLUMN_NAMES:

    Lastly use the `from_dict` method to produce the combined DataFrame:
    ```
    df = pd.DataFrame.from_dict(df_dict)
    df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]
    ```

    While this method is not very pretty, it typically is much faster than `pd.concat()` and
  6. TariqAHassan revised this gist Sep 29, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion PandasConcatWorkaround.adoc
    @@ -4,7 +4,7 @@ This is a work around for that problem.

    Note: this approach assumes that:
    (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
    (b) all dataframes share the same column names
    (b) all dataframes share the same column names.

    If these assumptions are not met, this solution could still work...but it
    will likely need to be modified.
  7. TariqAHassan created this gist Sep 29, 2016.
    69 changes: 69 additions & 0 deletions PandasConcatWorkaround.adoc
    @@ -0,0 +1,69 @@
    Pandas DataFrames are fantastic. However, concatenating them using standard approaches,
    such as `pandas.concat()`, become very slow when the dataframes become very large.
    This is a work around for that problem.

    Note: this approach assumes that:
    (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
    (b) all dataframes share the same column names

    If these assumptions are not met, this solution could still work...but it
    will likely need to be modified.

    Start by importing `os`, `Pandas` and `chain` from `itertools`.
    ```
    import os
    import pandas as pd
    from itertools import chain
    ```

    Set the path to the data files:
    ```
    PATH_TO_FILES = '/your/path/here/'
    ```

    Read in the Data as Pandas DataFrames (csv files, in this example):

    ```
    frames = list()
    for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
        frames.append(pd.read_csv(csv))
    ```
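
A note on row order, with a minimal sketch: `os.listdir` returns entries in arbitrary order, so the order of rows in the combined DataFrame can differ between runs or machines. Sorting the file names is one way to make it deterministic. The helper name `read_csv_frames` and the file names below are hypothetical, chosen only for this illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_frames(path_to_files):
    """Read every .csv file in a directory into a list of DataFrames.

    Hypothetical helper: file names are sorted so the frame order is stable.
    """
    csv_names = sorted(f for f in os.listdir(path_to_files) if f.endswith(".csv"))
    return [pd.read_csv(os.path.join(path_to_files, name)) for name in csv_names]


# Small self-check using a throwaway directory and made-up data.
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"a": [1, 2]}).to_csv(os.path.join(tmp_dir, "part_1.csv"), index=False)
    pd.DataFrame({"a": [3, 4]}).to_csv(os.path.join(tmp_dir, "part_0.csv"), index=False)
    frames = read_csv_frames(tmp_dir)
```

Here `part_0.csv` is read before `part_1.csv` regardless of the order in which the files were written.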

    Define a function to flatten large 2D lists quickly:
    ```
    def fast_flatten(input_list):
        return list(chain.from_iterable(input_list))
    ```
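
As a quick illustration of what `fast_flatten` does, here is a minimal sketch on plain nested lists. The sample values are made up for the example.

```python
from itertools import chain


def fast_flatten(input_list):
    # chain.from_iterable walks each inner iterable in turn, lazily
    return list(chain.from_iterable(input_list))


nested = [[1, 2], [3], [4, 5, 6]]
flat = fast_flatten(nested)
print(flat)  # [1, 2, 3, 4, 5, 6]
```

Only one level of nesting is removed, which is all that is needed when each inner item is a single column from one dataframe.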

    Next, get the column names from one of the dataframes (located at index 0):

    ```
    COLUMN_NAMES = frames[0].columns
    ```

    Now, construct a dictionary from the column names:

    ```
    df_dict = dict.fromkeys(COLUMN_NAMES, [])
    ```
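
One caveat worth knowing about `dict.fromkeys` with a mutable default, sketched here with made-up column names: every key ends up referencing the same list object. In this workflow that appears to be harmless, because the loop over the columns assigns a new list to each key instead of appending to the existing one, but a dict comprehension avoids the surprise if the code is later changed to append.

```python
column_names = ["a", "b"]  # hypothetical column names for illustration

# dict.fromkeys reuses the single list passed as the default value
shared = dict.fromkeys(column_names, [])
shared["a"].append(1)
print(shared)  # {'a': [1], 'b': [1]}

# A dict comprehension creates a separate list per key
independent = {col: [] for col in column_names}
independent["a"].append(1)
print(independent)  # {'a': [1], 'b': []}
```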

    Iterate through the columns:

    ```
    for col in COLUMN_NAMES:
        # Use a generator to save memory
        extracted = (frame[col] for frame in frames)

        # Flatten and save to df_dict
        df_dict[col] = fast_flatten(extracted)
    ```

    Lastly use the `from_dict` method to produce the combined DataFrame:
    ```
    df = pd.DataFrame.from_dict(df_dict)
    ```

    While this method is not very pretty, it typically is much faster than `pd.concat()` and
    yields the exact same result.
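
The steps in the gist can be put together as one runnable sketch, shown below under a few assumptions: the wrapper name `concat_by_columns` is hypothetical, the frames are small in-memory examples with made-up values instead of CSV files, and the column reordering from the later revision (`[COLUMN_NAMES]`) is included. The comparison uses `pd.concat(..., ignore_index=True)`, because by default `pd.concat` keeps each frame's original index labels, whereas this approach produces a fresh 0..n-1 index. No timing is measured here.

```python
from itertools import chain

import pandas as pd


def fast_flatten(input_list):
    return list(chain.from_iterable(input_list))


def concat_by_columns(frames):
    """Row-wise concatenation of frames that share the same column names."""
    column_names = frames[0].columns
    df_dict = {}
    for col in column_names:
        # Use a generator so the per-frame columns are not all held at once
        extracted = (frame[col] for frame in frames)
        df_dict[col] = fast_flatten(extracted)
    # Reindex the columns so they follow the order of the first frame
    return pd.DataFrame.from_dict(df_dict)[column_names]


frames = [
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
    pd.DataFrame({"a": [3, 4], "b": ["z", "w"]}),
]

combined = concat_by_columns(frames)
expected = pd.concat(frames, ignore_index=True)
```

For these simple integer and string columns the two results match. Because values pass through plain Python lists, dtypes are re-inferred when the DataFrame is built, so columns with special dtypes (for example categoricals) may not round-trip the same way as with `pd.concat`.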