raksja (CA) · public gists
@raksja
raksja / README.md
Created June 14, 2021 05:18 — forked from davideicardi/README.md
Write and read Avro records from a byte array

Avro serialization

There are 4 possible serialization formats when using Avro:
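The preview cuts off before that list. As a minimal hedged sketch of the byte-array round-trip the title describes, using the plain Avro Java API from Scala (the User schema and the write/read helper names are illustrative, not from the gist):

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroBytes {
  // illustrative schema, not from the gist
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")

  // raw binary encoding: no header bytes, so the reader must already know the schema
  def write(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  def read(bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}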

#!/bin/sh
############################ QUICK ALIAS ###########################
alias gs="git status"
alias gb="git branch"
alias gl="git log -5"
alias gd="git diff"
alias gt="git tree" # assumes a custom "tree" alias in ~/.gitconfig; not a git builtin
alias gc="git checkout"
@raksja
raksja / gist:15a8c3ec1920967677e7cb86d354ae31
Created June 19, 2018 21:26
gitconfig with single-line git logs
[filter "media"]
    clean = git-media-clean %f
    smudge = git-media-smudge %f
[url "https://"]
    insteadOf = git://
[credential]
    helper = osxkeychain
[core]
    excludesfile = /Users/<usr>/.gitignore_global
[difftool "sourcetree"]
@raksja
raksja / gist:55066d9d0cb6c35dfd28739af0ab8644
Last active March 6, 2018 19:33
Spark Production Use Case References
https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/
https://venkateshiyer.net/production-ready-spark-streaming-a8d85b7d66be
https://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/
#!/bin/bash
# Minimum TODOs on a per-job basis:
# 1. define name, application jar path, main class, queue and log4j-yarn.properties path
@raksja
raksja / SparkAsyncHttpDistributed.scala
Last active February 3, 2018 02:46
Spark job with Async HTTP call
// We used this to run a batch operation over a huge query by splitting it into multiple sub-queries.
// Break your huge workload into smaller chunks; in this case the huge query string is broken
// down into a small set of sub-queries.
// If you need to optimize further, you can pass an explicit partition count when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each of those to a Spark task, in this case a Future that returns a String
val tasks: RDD[Future[String]] = queries.map(query => {
  val task = makeHttpCall(query) // method returns the HTTP call response as a Future[String]
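The preview truncates the job here. A hedged sketch of how the per-partition futures might then be gathered (the timeout and the blocking strategy are assumptions, not the gist's actual continuation):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// mapPartitions stays in the same stage as the map above, so each Spark task
// fires its HTTP calls and then blocks once for the whole partition
val results: RDD[String] = tasks.mapPartitions { futures =>
  Await.result(Future.sequence(futures.toList), 10.minutes).iterator
}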
@raksja
raksja / sparkEc.scala
Created July 25, 2017 00:12 — forked from atamborrino/sparkEc.scala
Serializable Scala ExecutionContext for Spark. Automatically re-creates a thread pool per Spark worker.
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import config.Config

class ECProvider()(implicit conf: Config) extends Serializable {
  // @transient + lazy: the pool is never serialized into the closure;
  // each worker JVM re-creates it on first access
  @transient implicit lazy val ec: ExecutionContext = {
    ExecutionContext.fromExecutorService(
      Executors.newWorkStealingPool(conf.getForkJoinPoolMaxParallelism())
    )
  }
}
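A hedged usage sketch (rdd and doWork are placeholders, not part of the gist): because ec is @transient lazy, serializing the provider into a closure ships no thread pool, and each executor lazily builds its own.

import scala.concurrent.Await
import scala.concurrent.duration._

val provider = new ECProvider()
val out = rdd.mapPartitions { it =>
  import provider.ec // thread pool created lazily on this worker
  it.map(x => Await.result(doWork(x), 1.minute)) // doWork returns a Future; placeholder
}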
@raksja
raksja / ExecutionContextMonitor.scala
Created July 25, 2017 00:11 — forked from atamborrino/ExecutionContextMonitor.scala
Monitor Scala's ExecutionContext / Akka Dispatcher lag (number of tasks in waiting queues)
import java.util.concurrent._

import akka.dispatch.{Dispatcher, ExecutorServiceDelegate}
import config.Config
import helpers.ScalaLogger

class ExecutionContextMonitor()(implicit metricsService: MetricsClient, config: Config) {
  private val log = ScalaLogger.get(this.getClass)
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
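The preview ends here. As a self-contained hedged sketch of the core idea (the pool, interval, and println output are illustrative; the gist itself reports to a MetricsClient): sample an executor's waiting-queue depth on a fixed schedule.

val pool = Executors.newFixedThreadPool(8).asInstanceOf[ThreadPoolExecutor]
val monitor = Executors.newSingleThreadScheduledExecutor()
monitor.scheduleWithFixedDelay(new Runnable {
  def run(): Unit =
    println(s"lag: ${pool.getQueue.size()} tasks waiting in queue")
}, 0, 5, TimeUnit.SECONDS)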