
Clare S. Y. Huang csyhuang

@csyhuang
csyhuang / gist:1fd0255d9c91f3a6da5a357353ba1cf2
Created October 27, 2018 16:23 — forked from CristinaSolana/gist:1885435
Keeping a fork up to date

1. Clone your fork:

git clone git@github.com:YOUR-USERNAME/YOUR-FORKED-REPO.git

2. Add the original repository as a remote in your fork:

cd into/cloned/fork-repo
git remote add upstream https://github.com/ORIGINAL-DEV-USERNAME/REPO-YOU-FORKED-FROM.git
git fetch upstream
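
After fetching, the fork's branch still has to be brought in line with upstream. The following is a minimal sketch of the whole cycle, not necessarily the steps the original gist goes on to list. It uses throwaway local repositories in place of GitHub, and the directory names (`original`, `fork.git`, `fork`), the branch name `main`, and the fast-forward-only merge are all assumptions chosen for illustration.

```shell
# Sketch: sync a fork with its upstream using local stand-in repositories.
# All names below (original, fork.git, fork, main) are hypothetical.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Helper so commits work without a global git identity or signing setup.
commit() {
    git -c user.name="Example" -c user.email="example@example.com" \
        -c commit.gpgsign=false commit -q "$@"
}

# Stand-in for the original project.
git init -q original
cd original
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
echo "v1" > README
git add README
commit -m "Initial commit"
cd ..

# Stand-in for the fork on the server, plus your local clone of it.
git clone -q --bare original fork.git
git clone -q fork.git fork

# The original project moves ahead after the fork was made.
cd original
echo "v2" > README
commit -am "Upstream change"
cd ..

# In the clone of the fork: register upstream, fetch it, and fast-forward.
cd fork
git remote add upstream "$workdir/original"
git fetch -q upstream
git merge -q --ff-only upstream/main

fork_readme=$(cat README)
echo "fork README is now: $fork_readme"
```

On a real fork the last three git commands are the relevant ones, with the GitHub URL as the upstream remote. `--ff-only` refuses to merge when the fork has diverged from upstream, in which case a regular merge or a rebase would be needed instead.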
csyhuang / spark_tips_and_tricks.md
Created October 1, 2018 21:44 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 (relative to 8-byte values).
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
csyhuang / Basic-Bash-Command.md
Last active October 24, 2018 14:57
Some useful basic bash commands for work

Kill all background jobs

kill $(jobs -p)

Run unit tests and report the duration of every test

pytest --durations=0

Profile unit tests

python -m cProfile -o profile $(which py.test)
csyhuang / tmux-cheatsheet.markdown
Created September 20, 2018 15:09 — forked from MohamedAlaa/tmux-cheatsheet.markdown
tmux shortcuts & cheatsheet

tmux shortcuts & cheatsheet

start new:

tmux

start new with session name:

tmux new -s myname
csyhuang / install_algs4.sh
Created July 3, 2018 03:58 — forked from JIghtuse/install_algs4.sh
Installing the Programming Environment for Sedgewick's "Algorithms" on Linux
#!/bin/bash
# Based on http://algs4.cs.princeton.edu/linux/
declare -r ALGS4_DIRECTORY=~/.local/algs4
get_drjava() {
wget http://algs4.cs.princeton.edu/linux/drjava.jar
wget http://algs4.cs.princeton.edu/linux/drjava
chmod 700 drjava