@weldpua2008
Created August 4, 2020 10:13
example_spark.py
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Create a Spark session
spark = SparkSession.builder \
    .appName('appName') \
    .getOrCreate()

# Sample data as a list of tuples
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Define a schema for the DataFrame
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

# Convert the list to an RDD
rdd = spark.sparkContext.parallelize(data)

# Create the DataFrame from the RDD and the schema
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()
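
For reference, a minimal alternative sketch, assuming the same spark session, data list, and schema defined above: createDataFrame also accepts a plain Python list, so the explicit parallelize step can usually be skipped.

# Sketch: build the DataFrame straight from the list (assumes spark, data, schema from above)
df2 = spark.createDataFrame(data, schema)
df2.show()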