TariqAHassan · November 6, 2023 22:57 · May 13, 2020 · Dec 19, 2016 · Oct 3, 2016 · Oct 3, 2016
diff --git a/PandasConcatWorkaround.adoc b/PandasConcatWorkaround.adoc
@@ -10,7 +10,8 @@ If these assumptions are not met, this approach could still work...but it
 will likely need to be modified.
 
 Start by importing `os`, `pandas` and `chain` from `itertools`.
-```
+
+```python
 import os
 import pandas as pd
 from itertools import chain
@@ -23,47 +24,45 @@ PATH_TO_FILES = '/your/path/here/'
 
 Read in the Data as Pandas DataFrames (csv files, in this example):
 
-```
+```python
 frames = list()
 for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
     frames.append(pd.read_csv(csv))
 ```
 
 Define a function to flatten large 2D lists quickly:
-```
+
+```python
 def fast_flatten(input_list):
     return list(chain.from_iterable(input_list))
 ```
 
 Next, get the column names from one of the dataframes (located at index 0):
 
-```
+```python
 COLUMN_NAMES = frames[0].columns
 ```
 
 Now, construct a dictionary from the column names:
 
-```
+```python
 df_dict = dict.fromkeys(COLUMN_NAMES, [])
 ```
 
 Iterate through the columns:
 
-```
+```python
 for col in COLUMN_NAMES:
-    # Use a generator to save memory
     extracted = (frame[col] for frame in frames)
 
     # Flatten and save to df_dict
     df_dict[col] = fast_flatten(extracted)
 ```
 
 Lastly, use the `from_dict` method to produce the combined DataFrame:
-```
+```python
 df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]
 ```
 
 While this method is not very pretty, it typically is much faster than `pd.concat()` and
 yields the same rows as `pd.concat(frames, ignore_index=True)`.
-
-
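The steps in the revised walkthrough can be collected into one function. The following is a minimal sketch, not the author's code: `concat_by_columns` is a hypothetical helper name, the two toy frames stand in for the CSV files, and the comparison assumes the target is `pd.concat(frames, ignore_index=True)` (the column-wise approach always produces a fresh 0..n-1 index).

```python
from itertools import chain

import pandas as pd


def fast_flatten(input_list):
    # Flatten an iterable of iterables into a single list.
    return list(chain.from_iterable(input_list))


def concat_by_columns(frames):
    # Hypothetical helper; assumes every frame has the same column names.
    column_names = list(frames[0].columns)
    df_dict = {
        col: fast_flatten(frame[col] for frame in frames)
        for col in column_names
    }
    # Select by column_names to keep the column order of the first frame.
    return pd.DataFrame.from_dict(df_dict)[column_names]


# Toy frames standing in for the CSV files (illustrative data only).
frames = [
    pd.DataFrame({"x": [1, 2], "y": ["a", "b"]}),
    pd.DataFrame({"x": [3, 4], "y": ["c", "d"]}),
]

combined = concat_by_columns(frames)
expected = pd.concat(frames, ignore_index=True)
print(combined.equals(expected))
```

Whether this is actually faster than `pd.concat()` likely depends on the pandas version and the data, so it is worth timing both on a representative sample before committing to either.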
diff --git a/PandasConcatWorkaround.adoc b/PandasConcatWorkaround.adoc
@@ -1,5 +1,5 @@
 Pandas DataFrames are fantastic. However, concatenating them using standard approaches,
-such as `pandas.concat()`, become very slow when the dataframes become very large.
+such as `pandas.concat()`, can be very slow with large dataframes.
 This is a workaround for that problem.
 
 Note: this approach assumes that:

diff --git a/PandasConcatWorkaround.adoc b/PandasConcatWorkaround.adoc
@@ -6,7 +6,7 @@ Note: this approach assumes that:
   (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
   (b) all dataframes share the same column names.
 
-If these assumptions are not met, this solution could still work...but it
+If these assumptions are not met, this approach could still work...but it
 will likely need to be modified.
 
 Start by importing `os`, `pandas` and `chain` from `itertools`.
@@ -60,7 +60,7 @@ for col in COLUMN_NAMES:
 
 Lastly, use the `from_dict` method to produce the combined DataFrame:
 ```
-df = pd.DataFrame.from_dict(df_dict)
+df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]
 ```
 
 While this method is not very pretty, it typically is much faster than `pd.concat()` and

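Selecting with `[COLUMN_NAMES]` puts the columns back in the order of the first dataframe, regardless of the order in which the dictionary yields its keys. This was presumably the motivation for the change, since dict ordering was not guaranteed by the language before Python 3.7. A small sketch with illustrative data (the names below are not from the original):

```python
import pandas as pd

# Illustrative data; the dict keys are deliberately not in the desired order.
column_names = ["b", "a", "c"]
df_dict = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}

# Selecting with a list of labels returns the columns in that list's order.
df = pd.DataFrame.from_dict(df_dict)[column_names]
print(list(df.columns))
```

On recent Python and pandas versions the selection is usually a no-op for this workflow, but it is harmless to keep.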
diff --git a/PandasConcatWorkaround.adoc b/PandasConcatWorkaround.adoc
@@ -4,7 +4,7 @@ This is a work around for that problem.
 
 Note: this approach assumes that:
   (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
-  (b) all dataframes share the same column names
+  (b) all dataframes share the same column names.
 
 If these assumptions are not met, this solution could still work...but it
 will likely need to be modified.

diff --git a/PandasConcatWorkaround.adoc b/PandasConcatWorkaround.adoc
@@ -0,0 +1,69 @@
+Pandas DataFrames are fantastic. However, concatenating them using standard approaches,
+such as `pandas.concat()`, become very slow when the dataframes become very large.
+This is a workaround for that problem.
+
+Note: this approach assumes that:
+  (a) the goal is a row-wise concatenation (i.e., `axis=0`) and
+  (b) all dataframes share the same column names
+
+If these assumptions are not met, this solution could still work...but it
+will likely need to be modified.
+
+Start by importing `os`, `pandas` and `chain` from `itertools`.
+```
+import os
+import pandas as pd
+from itertools import chain
+```
+
+Set the path to the data files:
+```
+PATH_TO_FILES = '/your/path/here/'
+```
+
+Read in the Data as Pandas DataFrames (csv files, in this example):
+
+```
+frames = list()
+for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
+    frames.append(pd.read_csv(csv))
+```
+
+Define a function to flatten large 2D lists quickly:
+```
+def fast_flatten(input_list):
+    return list(chain.from_iterable(input_list))
+```
+
+Next, get the column names from one of the dataframes (located at index 0):
+
+```
+COLUMN_NAMES = frames[0].columns
+```
+
+Now, construct a dictionary from the column names:
+
+```
+df_dict = dict.fromkeys(COLUMN_NAMES, [])
+```
+
+Iterate through the columns:
+
+```
+for col in COLUMN_NAMES:
+    # Use a generator to save memory
+    extracted = (frame[col] for frame in frames)
+
+    # Flatten and save to df_dict
+    df_dict[col] = fast_flatten(extracted)
+```
+
+Lastly, use the `from_dict` method to produce the combined DataFrame:
+```
+df = pd.DataFrame.from_dict(df_dict)
+```
+
+While this method is not very pretty, it typically is much faster than `pd.concat()` and
+yields the same rows as `pd.concat(frames, ignore_index=True)`.
+
+
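One caveat about `dict.fromkeys(COLUMN_NAMES, [])`: every key is mapped to the same list object. The walkthrough is unaffected because each key is reassigned inside the loop, but appending to the values in place would change all columns at once. A short sketch of the difference, using illustrative column names:

```python
column_names = ["a", "b"]  # illustrative names, not from the original data

# dict.fromkeys reuses the single list passed as the default value.
shared = dict.fromkeys(column_names, [])
shared["a"].append(1)  # mutates the one list that both keys point to
print(shared)

# A dict comprehension creates a separate list for each key.
independent = {col: [] for col in column_names}
independent["a"].append(1)  # only the list under "a" changes
print(independent)
```

If the values are ever going to be mutated rather than reassigned, the dict comprehension is probably the safer starting point.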