@kenhui521
Forked from dapangmao/spark_py_examples.md
Last active August 29, 2015 14:23
Revisions

  1. @dapangmao revised this gist Nov 21, 2014. 1 changed file with 0 additions and 5 deletions (spark_py_examples.md). The revision only removes stray blank lines around the imports; the affected hunks:

     ````diff
     @@ -4,7 +4,6 @@
     import sys
     from operator import add

     from pyspark import SparkContext

     if __name__ == "__main__":
     @@ -34,9 +33,7 @@ if __name__ == "__main__":
     Then create a text file in `localdir` and the words in the file will get counted.
     """

     import sys

     from pyspark import SparkContext
     from pyspark.streaming import StreamingContext

     @@ -60,10 +57,8 @@ if __name__ == "__main__":
     - Sort
     ```python
     import sys

     from pyspark import SparkContext


     if __name__ == "__main__":
         if len(sys.argv) != 2:
             print >> sys.stderr, "Usage: sort <file>"
     ````
  2. @dapangmao revised this gist Nov 21, 2014. 1 changed file with 24 additions and 0 deletions (spark_py_examples.md), appending a Sort example after the streaming word count:

     - Sort
     ```python
     import sys

     from pyspark import SparkContext


     if __name__ == "__main__":
         if len(sys.argv) != 2:
             print >> sys.stderr, "Usage: sort <file>"
             exit(-1)
         sc = SparkContext(appName="PythonSort")
         lines = sc.textFile(sys.argv[1], 1)
         # sortByKey() sorts the (int, 1) pairs by their integer key in
         # ascending order; no key function is needed.
         sortedCount = lines.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (int(x), 1)) \
             .sortByKey()
         # This is just a demo on how to bring all the sorted data back to a single node.
         # In reality, we wouldn't want to collect all the data to the driver node.
         output = sortedCount.collect()
         for (num, unitcount) in output:
             print num

         sc.stop()
     ```
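     For a quick sanity check without an input file, the same pipeline can run on an in-memory RDD from a `pyspark` shell (a minimal sketch; `sc` is the shell's SparkContext and the sample data is made up):
     ```python
     # Same transformations as the example, on a tiny in-memory RDD.
     nums = sc.parallelize(["3 1", "2"])
     sortedCount = nums.flatMap(lambda x: x.split(' ')) \
         .map(lambda x: (int(x), 1)) \
         .sortByKey()
     print sortedCount.collect()  # [(1, 1), (2, 1), (3, 1)]
     ```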
  3. @dapangmao revised this gist Nov 20, 2014. 1 changed file with 36 additions and 1 deletion (spark_py_examples.md), adding a streaming word count after the batch example:

     - Word Count by Streaming
     ```python
     """
     Counts words in new text files created in the given directory
     Usage: hdfs_wordcount.py <directory>
       <directory> is the directory that Spark Streaming will use to find and read new text files.
     To run this on your local machine on directory `localdir`, run this example
       $ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py localdir
     Then create a text file in `localdir` and the words in the file will get counted.
     """

     import sys

     from pyspark import SparkContext
     from pyspark.streaming import StreamingContext

     if __name__ == "__main__":
         if len(sys.argv) != 2:
             print >> sys.stderr, "Usage: hdfs_wordcount.py <directory>"
             exit(-1)

         sc = SparkContext(appName="PythonStreamingHDFSWordCount")
         ssc = StreamingContext(sc, 1)

         lines = ssc.textFileStream(sys.argv[1])
         counts = lines.flatMap(lambda line: line.split(" "))\
             .map(lambda x: (x, 1))\
             .reduceByKey(lambda a, b: a+b)
         counts.pprint()

         ssc.start()
         ssc.awaitTermination()
     ```
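     Here `StreamingContext(sc, 1)` sets a 1-second batch interval: each second, `textFileStream` turns any files newly created in the watched directory into an RDD, the flatMap/map/reduceByKey pipeline runs on that batch, and `pprint()` prints the first few (word, count) pairs.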
  4. @dapangmao created this gist Nov 20, 2014. 1 changed file with 24 additions and 0 deletions (spark_py_examples.md):

     # Examples for python and Spark
     [Link](https://github.com/apache/spark/tree/branch-1.2/examples/src/main/python)
     - Word Count
     ```python
     import sys
     from operator import add

     from pyspark import SparkContext

     if __name__ == "__main__":
         if len(sys.argv) != 2:
             print >> sys.stderr, "Usage: wordcount <file>"
             exit(-1)
         sc = SparkContext(appName="PythonWordCount")
         lines = sc.textFile(sys.argv[1], 1)
         counts = lines.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (x, 1)) \
             .reduceByKey(add)
         output = counts.collect()
         for (word, count) in output:
             print "%s: %i" % (word, count)

         sc.stop()
     ```
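     The gist gives no filename for this script; assuming it is saved as, say, `wordcount.py`, it can be run with `$ bin/spark-submit wordcount.py <file>` and prints each distinct word with its count.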