Last active
November 6, 2023 22:57
Revisions
-
TariqAHassan revised this gist
May 13, 2020 . 1 changed file with 9 additions and 10 deletions.
    @@ -10,7 +10,8 @@ If these assumptions are not met, this approach could still work...but it will likely need to be modified.

    Start by importing `os`, `pandas` and `chain` from `itertools`.

    ```python
    import os
    import pandas as pd
    from itertools import chain

    @@ -23,47 +24,45 @@ PATH_TO_FILES = '/your/path/here/'

    Read in the data as pandas DataFrames (CSV files, in this example):

    ```python
    frames = list()
    for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
        frames.append(pd.read_csv(csv))
    ```

    Define a function to flatten large 2D lists quickly:

    ```python
    def fast_flatten(input_list):
        return list(chain.from_iterable(input_list))
    ```

    Next, get the column names from one of the dataframes (the one at index 0):

    ```python
    COLUMN_NAMES = frames[0].columns
    ```

    Now, construct a dictionary from the column names:

    ```python
    df_dict = dict.fromkeys(COLUMN_NAMES, [])
    ```

    Iterate through the columns:

    ```python
    for col in COLUMN_NAMES:
        extracted = (frame[col] for frame in frames)

        # Flatten and save to df_dict
        df_dict[col] = fast_flatten(extracted)
    ```

    Lastly, use the `from_dict` method to produce the combined DataFrame:

    ```python
    df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]
    ```

    While this method is not very pretty, it is typically much faster than `pd.concat()` and yields the same data, though with a fresh integer index (as with `ignore_index=True`).
-
TariqAHassan revised this gist
Dec 19, 2016 . 1 changed file with 1 addition and 1 deletion.
    @@ -1,5 +1,5 @@
    Pandas DataFrames are fantastic.
    However, concatenating them using standard approaches, such as `pandas.concat()`, can be very slow with large dataframes.
    This is a workaround for that problem.

    Note: this approach assumes that:
-
TariqAHassan revised this gist
Oct 3, 2016 . No changes.
-
TariqAHassan revised this gist
Oct 3, 2016 . No changes.
-
TariqAHassan revised this gist
Oct 3, 2016 . 1 changed file with 2 additions and 2 deletions.
    @@ -6,7 +6,7 @@ Note: this approach assumes that:
    (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
    (b) all dataframes share the same column names.

    If these assumptions are not met, this approach could still work...but it will likely need to be modified.

    Start by importing `os`, `pandas` and `chain` from `itertools`.

    @@ -60,7 +60,7 @@ for col in COLUMN_NAMES:
    Lastly, use the `from_dict` method to produce the combined DataFrame:

    ```
    df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]
    ```

    While this method is not very pretty, it typically is much faster than `pd.concat()` and
-
TariqAHassan revised this gist
Sep 29, 2016 . 1 changed file with 1 addition and 1 deletion.
    @@ -4,7 +4,7 @@ This is a workaround for that problem.

    Note: this approach assumes that:
    (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
    (b) all dataframes share the same column names.

    If these assumptions are not met, this solution could still work...but it will likely need to be modified.
-
TariqAHassan created this gist
Sep 29, 2016
    @@ -0,0 +1,69 @@
    Pandas DataFrames are fantastic.
    However, concatenating them using standard approaches, such as `pandas.concat()`, becomes very slow when the dataframes become very large.
    This is a workaround for that problem.

    Note: this approach assumes that:
    (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
    (b) all dataframes share the same column names

    If these assumptions are not met, this solution could still work...but it will likely need to be modified.

    Start by importing `os`, `pandas` and `chain` from `itertools`.

    ```
    import os
    import pandas as pd
    from itertools import chain
    ```

    Set the path to the data files:

    ```
    PATH_TO_FILES = '/your/path/here/'
    ```

    Read in the data as pandas DataFrames (CSV files, in this example):

    ```
    frames = list()
    for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
        frames.append(pd.read_csv(csv))
    ```

    Define a function to flatten large 2D lists quickly:

    ```
    def fast_flatten(input_list):
        return list(chain.from_iterable(input_list))
    ```

    Next, get the column names from one of the dataframes (the one at index 0):

    ```
    COLUMN_NAMES = frames[0].columns
    ```

    Now, construct a dictionary from the column names:

    ```
    df_dict = dict.fromkeys(COLUMN_NAMES, [])
    ```

    Iterate through the columns:

    ```
    for col in COLUMN_NAMES:
        # Use a generator to save memory
        extracted = (frame[col] for frame in frames)

        # Flatten and save to df_dict
        df_dict[col] = fast_flatten(extracted)
    ```

    Lastly, use the `from_dict` method to produce the combined DataFrame:

    ```
    df = pd.DataFrame.from_dict(df_dict)
    ```

    While this method is not very pretty, it is typically much faster than `pd.concat()` and yields the same data, though with a fresh integer index (as with `ignore_index=True`).
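
The steps above can be combined into a single helper. What follows is a minimal sketch and not part of the gist itself: `fast_concat` is a hypothetical name, the two small frames are made-up sample data standing in for the CSV files, and the comparison against `pd.concat(..., ignore_index=True)` is added here only as a sanity check on the result.

```python
from itertools import chain

import pandas as pd


def fast_flatten(input_list):
    """Flatten an iterable of iterables into a single list."""
    return list(chain.from_iterable(input_list))


def fast_concat(frames):
    """Row-wise concatenation of DataFrames that share the same column names.

    Hypothetical helper name; it only wraps the steps described above.
    """
    column_names = list(frames[0].columns)
    # Build each combined column by chaining the per-frame columns together.
    df_dict = {
        col: fast_flatten(frame[col] for frame in frames)
        for col in column_names
    }
    # Re-select the columns so they keep the order of the first frame.
    return pd.DataFrame.from_dict(df_dict)[column_names]


# Made-up sample data (an assumption, in place of reading CSV files).
frames = [
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
    pd.DataFrame({"a": [3, 4], "b": ["z", "w"]}),
]

combined = fast_concat(frames)
expected = pd.concat(frames, ignore_index=True)
```

Because the rows are rebuilt from plain lists, the combined frame gets a fresh integer index, which is why the check uses `ignore_index=True`. With the default `pd.concat(frames)` the original row labels of each frame would be kept instead. Whether this route is actually faster depends on the data and the pandas version, so it is worth timing both on your own files.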