
@dat-vikash
Forked from joshlk/faster_toPandas.py
Created February 27, 2018 12:52
Revisions

  1. @joshlk joshlk revised this gist Apr 8, 2016. 1 changed file with 1 addition and 1 deletion.

     faster_toPandas.py (1 addition, 1 deletion):

     @@ -6,7 +6,7 @@ def _map_to_pandas(rdds):

      def toPandas(df, n_partitions=None):
          """
     -    Returns the contents of this `DataFrame` as Pandas `pandas.DataFrame` in a speedy fashion. The DataFrame is
     +    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
          repartitioned if `n_partitions` is passed.
          :param df: pyspark.sql.DataFrame
          :param n_partitions: int or None
  2. @joshlk joshlk revised this gist Apr 8, 2016. 1 changed file with 1 addition and 5 deletions.

     faster_toPandas.py (1 addition, 5 deletions):

     @@ -12,12 +12,8 @@ def toPandas(df, n_partitions=None):
          :param n_partitions: int or None
          :return: pandas.DataFrame
          """
     -
     -    if n_partitions is not None:
     -        df = df.repartition(n_partitions)
     -
     +    if n_partitions is not None: df = df.repartition(n_partitions)
          df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
          df_pand = pd.concat(df_pand)
          df_pand.columns = df.columns
     -
          return df_pand
  3. @joshlk joshlk created this gist Mar 22, 2016. 1 new file with 23 additions.

     faster_toPandas.py (23 additions):

     @@ -0,0 +1,23 @@
     +import pandas as pd
     +
     +def _map_to_pandas(rdds):
     +    """ Needs to be here due to pickling issues """
     +    return [pd.DataFrame(list(rdds))]
     +
     +def toPandas(df, n_partitions=None):
     +    """
     +    Returns the contents of this `DataFrame` as Pandas `pandas.DataFrame` in a speedy fashion. The DataFrame is
     +    repartitioned if `n_partitions` is passed.
     +    :param df: pyspark.sql.DataFrame
     +    :param n_partitions: int or None
     +    :return: pandas.DataFrame
     +    """
     +
     +    if n_partitions is not None:
     +        df = df.repartition(n_partitions)
     +
     +    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
     +    df_pand = pd.concat(df_pand)
     +    df_pand.columns = df.columns
     +
     +    return df_pand
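The core trick in the gist is to build one pandas `DataFrame` per partition on the executors (via `mapPartitions`), then `collect()` the small list of frames and `pd.concat` them, rather than collecting raw rows one at a time. The following sketch illustrates that same two-step shape locally, with no Spark cluster: the `partitions` list of dicts is hypothetical stand-in data for what `df.rdd.mapPartitions(_map_to_pandas)` would see on each executor.

```python
import pandas as pd

def _map_to_pandas(rdds):
    """Turn an iterator of rows into a single-element list holding one DataFrame."""
    return [pd.DataFrame(list(rdds))]

# Hypothetical partition contents, standing in for the rows each Spark
# partition would feed to _map_to_pandas via df.rdd.mapPartitions(...).
partitions = [
    [{"id": 1, "x": 10.0}, {"id": 2, "x": 20.0}],
    [{"id": 3, "x": 30.0}],
]

# One DataFrame per partition (what collect() would return as a list) ...
frames = [frame for part in partitions for frame in _map_to_pandas(iter(part))]

# ... stitched into a single local DataFrame, as toPandas does after collect().
result = pd.concat(frames, ignore_index=True)
```

The speed-up comes from moving the row→DataFrame conversion onto the executors in parallel, so the driver only performs one cheap `concat` instead of constructing the whole frame from collected `Row` objects.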