I hereby claim:
- I am mattfaus on github.
- I am mattfaus (https://keybase.io/mattfaus) on keybase.
- I have a public key whose fingerprint is 1CF5 6643 9369 2689 9402 2358 69E8 0354 58E5 E154
To claim this, I am signing this object:
import db_util

db_util.enable_db_protobuf_projection()
db_util.enable_ndb_protobuf_projection()

class BatchedGcsCsvShardFileWriter(object):
    """Writes CSV data into multiple output shards, grouping rows by keys.

    This class is a context manager, which closes all shards upon exit.

    Say you are writing a lot of CSV data, like:

        [0, "Bakery"],
        [2, "Francisco"],
        [3, "Matt"],
class SortedGcsCsvShardFileMergeReader(object):
    """Merges several sorted .csv files stored on GCS.

    This class is both an iterator and a context manager.

    Let's say there are 2 .csv files stored on GCS, with contents like:

    /bucket/file_1.csv:
        [0, "Matt"],
        [0, "Sam"],
class ParallelInMemorySortGcsCsvShardFiles(pipeline.Pipeline):

    def run(self, input_bucket, input_pattern, sort_columns,
            model_type, output_bucket, output_pattern):
        """Sorts each input file in-memory, then writes it to an output file.

        Arguments:
            input_bucket - The GCS bucket which contains the unsorted .csv
                files.
            input_pattern - A regular expression used to find files in the
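The per-file step of this pipeline amounts to: read one shard fully into memory, sort it, and write it back out. A hedged sketch of that step on a local file, assuming sort_columns is a list of column indexes; the real pipeline reads from and writes to GCS and may take column names instead.

import csv
import operator

def sort_shard_in_memory(input_path, output_path, sort_columns):
    """Read a whole shard into memory, sort it on the given columns, and
    write the sorted rows to the output path."""
    with open(input_path) as infile:
        rows = list(csv.reader(infile))
    rows.sort(key=operator.itemgetter(*sort_columns))
    with open(output_path, 'w') as outfile:
        csv.writer(outfile).writerows(rows)
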
class DeterministicCompressedFeatures(CompressedFeatures):
    """Generates random components after seeding with the component_key.

    By using a known seed to generate the random components, we do not need to
    store or manage them. We can just recompute them whenever we need them.
    """

    def __init__(self, num_features=RANDOM_FEATURE_LENGTH):
        super(DeterministicCompressedFeatures, self).__init__(num_features)
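The docstring's point is that a component vector never needs to be stored: it can be recomputed on demand by reseeding a PRNG from the component_key. A minimal sketch of that seeding trick follows; the function name and the use of numpy and MD5 are assumptions, not necessarily what the gist actually does.

import hashlib

import numpy as np

def get_component(component_key, num_features):
    """Derive a stable 32-bit seed from the key, then regenerate the same
    pseudo-random component vector on every call."""
    seed = int(hashlib.md5(component_key.encode('utf-8')).hexdigest(), 16) % (2 ** 32)
    return np.random.RandomState(seed).standard_normal(num_features)
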
{
    u'fields': [{
        u'type': u'STRING',
        u'name': u'playlists',
        u'mode': u'REPEATED'
    }, {
        u'type': u'STRING',
        u'name': u'source_table',
        u'mode': u'NULLABLE'
    }, {

def get_table_schema(dataset, table):
    """If the table exists, returns its schema. Otherwise, returns None."""
    table_service = BigQueryService.get_service().tables()
    try:
        get_result = table_service.get(
            projectId=BQ_PROJECT_ID,
            datasetId=dataset,
            tableId=table
        ).execute()
        return get_result['schema']
    except HttpError:
        # Assumed completion (the preview is truncated here): apiclient's
        # tables().get() raises HttpError (e.g. 404) for a missing table,
        # which we treat as "no schema".
        return None
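A short usage sketch of the helper above, with placeholder dataset and table names; the return value is a schema dict shaped like the u'fields' fragment shown earlier, so column names can be read straight out of it.

# 'analytics_backup' and 'VideoTranslationInfo' are placeholder names.
schema = get_table_schema('analytics_backup', 'VideoTranslationInfo')
if schema is None:
    pass  # The table does not exist yet; create it before loading data.
else:
    existing_columns = [field['name'] for field in schema['fields']]
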
import collections
import jinja2
import logging
import os

import request_handler
import third_party.mapreduce
import third_party.mapreduce.input_readers
import third_party.mapreduce.output_writers
import third_party.mapreduce.lib.files
import third_party.mapreduce.operation

class TransformedVideoTranslationInfo(bq_property_transform.TransformedEntity):

    CUSTOM_SCHEMAS = {
        'translated_youtube_ids': {
            'name': 'translated_youtube_ids',
            'type': 'record',
            'mode': 'repeated',
            'fields': [
                {'name': 'language',
                 'type': 'string'},
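For context, a repeated record field of this shape holds a list of sub-objects in each row. Below is a hypothetical row matching the schema above; the second sub-field name is a guess, since the preview cuts off after 'language'.

row = {
    'translated_youtube_ids': [
        {'language': 'es', 'translated_youtube_id': 'abc123'},
        {'language': 'pt', 'translated_youtube_id': 'def456'},
    ],
}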