raksja (CA) · public gists
@raksja
raksja / README.md
Created June 14, 2021 05:18 — forked from davideicardi/README.md
Write and read Avro records from a byte array

Avro serialization

There are 4 possible serialization formats when using Avro:
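The preview cuts off before that list. As a minimal hedged sketch of the byte-array round-trip the title describes, using the plain Avro Java API from Scala (the User schema and the write/read helper names are illustrative, not from the gist):

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroBytes {
  // illustrative schema, not from the gist
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")

  // raw binary encoding: no header bytes, so the reader must already know the schema
  def write(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  def read(bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}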

#!/bin/sh
############################ QUICK ALIAS ###########################
alias gs="git status"
alias gb="git branch"
alias gl="git log -5"
alias gd="git diff"
alias gt="git tree" # assumes a custom "tree" alias in ~/.gitconfig; not a git builtin
alias gc="git checkout"
@raksja
raksja / gist:15a8c3ec1920967677e7cb86d354ae31
Created June 19, 2018 21:26
gitconfig with single-line git logs
[filter "media"]
    clean = git-media-clean %f
    smudge = git-media-smudge %f
[url "https://"]
    insteadOf = git://
[credential]
    helper = osxkeychain
[core]
    excludesfile = /Users/<usr>/.gitignore_global
[difftool "sourcetree"]
@raksja
raksja / gist:55066d9d0cb6c35dfd28739af0ab8644
Last active March 6, 2018 19:33
Spark Production Use Case References
https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/
https://venkateshiyer.net/production-ready-spark-streaming-a8d85b7d66be
https://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/
#!/bin/bash
# Minimum TODOs on a per-job basis:
# 1. define name, application jar path, main class, queue and log4j-yarn.properties path
@raksja
raksja / SparkAsyncHttpDistributed.scala
Last active February 3, 2018 02:46
Spark job with Async HTTP call
// We used this to run a batch operation over a huge query by splitting it into multiple sub-queries.
// Break your huge workload into smaller chunks; in this case the huge query string is broken
// down into a small set of sub-queries.
// If you need to optimize further, you can pass an explicit partition count when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each of those to a Spark task, in this case a Future that returns a String
val tasks: RDD[Future[String]] = queries.map(query => {
  val task = makeHttpCall(query) // method returns the HTTP call response as a Future[String]
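The preview truncates the job here. A hedged sketch of how the per-partition futures might then be gathered (the timeout and the blocking strategy are assumptions, not the gist's actual continuation):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// mapPartitions stays in the same stage as the map above, so each Spark task
// fires its HTTP calls and then blocks once for the whole partition
val results: RDD[String] = tasks.mapPartitions { futures =>
  Await.result(Future.sequence(futures.toList), 10.minutes).iterator
}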
@raksja
raksja / sparkEc.scala
Created July 25, 2017 00:12 — forked from atamborrino/sparkEc.scala
Serializable Scala ExecutionContext for Spark. Automatically re-creates a thread pool per Spark worker.
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import config.Config

class ECProvider()(implicit conf: Config) extends Serializable {
  // @transient + lazy: the pool is never serialized into the closure;
  // each worker JVM re-creates it on first access
  @transient implicit lazy val ec: ExecutionContext = {
    ExecutionContext.fromExecutorService(
      Executors.newWorkStealingPool(conf.getForkJoinPoolMaxParallelism())
    )
  }
}
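A hedged usage sketch (rdd and doWork are placeholders, not part of the gist): because ec is @transient lazy, serializing the provider into a closure ships no thread pool, and each executor lazily builds its own.

import scala.concurrent.Await
import scala.concurrent.duration._

val provider = new ECProvider()
val out = rdd.mapPartitions { it =>
  import provider.ec // thread pool created lazily on this worker
  it.map(x => Await.result(doWork(x), 1.minute)) // doWork returns a Future; placeholder
}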
@raksja
raksja / ExecutionContextMonitor.scala
Created July 25, 2017 00:11 — forked from atamborrino/ExecutionContextMonitor.scala
Monitor Scala's ExecutionContext / Akka Dispatcher lag (number of tasks in waiting queues)
import java.util.concurrent._

import akka.dispatch.{Dispatcher, ExecutorServiceDelegate}
import config.Config
import helpers.ScalaLogger

class ExecutionContextMonitor()(implicit metricsService: MetricsClient, config: Config) {
  private val log = ScalaLogger.get(this.getClass)
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
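The preview ends here. As a self-contained hedged sketch of the core idea (the pool, interval, and println output are illustrative; the gist itself reports to a MetricsClient): sample an executor's waiting-queue depth on a fixed schedule.

val pool = Executors.newFixedThreadPool(8).asInstanceOf[ThreadPoolExecutor]
val monitor = Executors.newSingleThreadScheduledExecutor()
monitor.scheduleWithFixedDelay(new Runnable {
  def run(): Unit =
    println(s"lag: ${pool.getQueue.size()} tasks waiting in queue")
}, 0, 5, TimeUnit.SECONDS)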