# Processing Data

## Compute Services Overview

![Screen Shot 2022-07-19 at 11 09 09 AM](https://user-images.githubusercontent.com/5553105/179821140-9b53a1a9-4981-4ffc-9de1-52e054c203bc.png)

### Compute Engine

- fast-booting VMs
- highly configurable, zonal service
- choose from machine types: general-purpose, compute-optimized, memory-optimized, accelerator-optimized (GPU)
- select a public or private disk image
- options include preemptible (or Spot) VMs
- also good to know: sole-tenant nodes (for BYOL or dedicated-hardware requirements) and instance groups (MIG/UIG)

### Kubernetes Engine (GKE)

Container orchestration system built around clusters, node pools, and a control plane.

- regional, managed container service
- modes: Standard (total control) and Autopilot (fully managed)
- supports node auto-repair and auto-upgrade
- know the following:
  - kubectl syntax
  - private clusters (VPC-native, with RFC 1918 IP addresses)
  - how to deploy, scale, and expose services

### App Engine

Oldest of all GCP services; comes in two versions: Standard and Flexible.

- Standard
  - regional platform as a service (PaaS) for serverless apps
  - zero server management and configuration
  - near-instantaneous scaling, down to zero instances
  - features
    - second-generation runtimes: Python 3, Java 11, Node.js, PHP 7, Ruby, Go 1.12+
    - first generation is more limited
- Flexible
  - for containerized apps
  - zero server management and configuration
  - best for apps with consistent traffic where gradual scaling is acceptable
  - robust runtimes: Python 2.7/3.6, Java 8, Node.js, PHP 5/7, Ruby, Go, .NET

### Cloud Run

- great for modern websites, REST APIs, and back-office admin apps
- regional, fully managed serverless service for containers
- integrated support for Cloud Operations
- built on Knative open-source standards for easy portability
- supports any language, library, or binary
- scales from zero and back in an instant

### Cloud Functions

- regional, event-driven, serverless functions as a service (FaaS)
- triggers:
  - HTTP
  - Cloud Storage
  - Cloud Pub/Sub
  - Cloud Firestore
  - Audit Logs
  - Cloud Scheduler
- totally serverless
- automatic horizontal
scaling
- networks well with hybrid and multi-cloud architectures
- acts as glue between services
- great for streaming data and IoT apps

---

## Choosing the correct compute option

![Screen Shot 2022-07-19 at 11 30 19 AM](https://user-images.githubusercontent.com/5553105/179820925-7182394d-845b-4173-b96d-5c844369ab83.png)

![Screen Shot 2022-07-19 at 11 30 34 AM](https://user-images.githubusercontent.com/5553105/179821020-46b961da-b6d1-48cf-931b-383fd9d500dc.png)

Summary

- mobile apps: `Firebase`
- event-driven functions: `Cloud Functions`
- specific OS or kernel: `Compute Engine`
- no hybrid or multi-cloud requirements: `App Engine Standard` (rapid scaling) or `Flexible`
- containers: `Cloud Run` or `Kubernetes Engine`

---

## Compute autoscaling comparison

![Screen Shot 2022-07-19 at 11 49 18 AM](https://user-images.githubusercontent.com/5553105/179821198-d912725c-f2c2-4ccb-bb27-b98422d9cc73.png)

Summary

- when working with Compute Engine, remember that MIGs coupled with Cloud Load Balancing give a faster autoscaling response
- for HA, run a Kubernetes Engine node pool with a minimum of 3 nodes in production
- Cloud Run scales almost as fast as App Engine Standard, and you are only charged when a request is made

---

## Evolving the Cloud with AI and ML services

### AI Data Lifecycle Steps

Key DATA lifecycle steps (covered earlier)

1. Ingest
2. Store
3. Process / Analyze
4. Explore / Visualize

Key AI data lifecycle steps

1. Ingest
2. Store
3. Process / Analyze
4. Train
5. Model
6. Evaluate
7. Deploy
8.
Predict

### Reviewing AI and ML Services

AI tooling has been evolving on Google Cloud and is currently branded "Vertex AI".

ML Services

- Vision API (OCR, tagging)
- Video Intelligence API (local or Cloud Storage video; track objects, recognize text)
- Translation API (Cloud Translation for 100 language pairs, with auto-detect)
  - Basic / Advanced (Advanced adds batch requests, custom models, and glossaries)
- Text-to-Speech / Speech-to-Text
- Natural Language API
- Cloud TPU (the hardware behind the APIs above)
  - example: 8 VMs with GPUs took 200 minutes vs. 8 minutes on 1 TPU; faster and cheaper for some tasks

### ML Best Practices

Setting up the ML environment

- use Notebooks for development
  - create a Notebook instance for each teammate
  - treat each Notebook instance as a virtual workspace
  - stop instances when not in use
- store prepared data and the model in the same project

ML development

- prepare a good amount of training data
- store tabular data in BigQuery
- store unstructured data (images, video, audio) in Cloud Storage
  - includes TFRecord files, Avro, etc.
  - aim for files > 100 MB and between 100 and 10,000 shards

During data processing

- use TensorFlow Extended (TFX) for TensorFlow projects
  - NEW: Vertex AI Pipelines (its eventual replacement)
- process tabular data with BigQuery
  - you can use BigQuery ML and save results to a BigQuery permanent table
- process unstructured data with Cloud Dataflow (based on Apache Beam)
  - can generate TFRecords
  - if using Apache Spark, use Dataproc instead
- link data to the model with `managed datasets`

Putting the model into production

- specify appropriate (virtual) hardware
  - may be plain VMs or VMs with GPUs/TPUs
- plan for additional inputs (features) to the model, e.g.
data lake or messaging sources
- enable autoscaling

Summary

- the AI data lifecycle expands the traditional data lifecycle: ingest, store, transform, train, model, evaluate, deploy, and predict
- Vertex AI is Google Cloud's AI platform, incorporating all of the machine learning APIs (such as the Vision API), the AutoML services, and even related hardware like Cloud TPU
- be sure to use the proper GCP service for each stage of the AI data lifecycle, such as BigQuery for storing and processing tabular data, and Dataflow / Dataproc for processing unstructured data

---

## Handling Big Data and IoT

### Working with Cloud IoT Core Devices

- remember TerramEarth
- Cloud IoT Core
  - fully managed
  - Device Manager (identity, auth, config, control)
  - Protocol Bridge (publishes incoming telemetry data to Pub/Sub for processing)
- features
  - secure connection via HTTPS or MQTT
  - CA-signed certs verify device ownership
  - two-way comms allow updates, both online and offline
- how it works
  - Devices -> Cloud IoT Core -> Pub/Sub -> Cloud Functions or Dataflow (update device config after processing)

![Screen Shot 2022-07-25 at 8 59 34 AM](https://user-images.githubusercontent.com/5553105/180815481-d27148e8-953f-4a5e-9433-b402ebd2c1f6.png)

### Massive Messaging via Cloud Pub/Sub

- scalable, durable, global messaging and ingestion service, based on an at-least-once publish/subscribe model
- connects many services together and helps small increments of data flow better
- supports both push and pull modes, with exactly-once processing
  - pull mode delivers a message and waits for an ACK
- features
  - truly global: consistent latency from anywhere
  - messages can be ordered and/or filtered
  - lower-cost Pub/Sub Lite is available, requiring more management and offering lower availability and durability

### The Big Data Dog: Cloud BigQuery

- serverless, multi-regional, multi-cloud SQL column-store data warehouse
- scales to handle terabytes in seconds and petabytes in minutes
- built-in integration for ML; backbone for BI Engine
- supports
real-time analytics with streams from Pub/Sub, Dataflow, and Datastream
- automatically replicates data and keeps a seven-day history of changes

### Transforming Big Data

Cloud Dataprep

- visually explore, clean, and prepare data for analysis and ML; used by data analysts
- integrated partner service offered by Trifacta in conjunction with Google
- automatically detects schemas, data types, possible joins, and anomalies like missing values, outliers, and duplicates
- interprets data-transformation intent from user selections and predicts the next transformation
- transformation functions include aggregation, pivot, unpivot, joins, union, extraction, calculation, comparison, condition, merge, and regex
- works with CSV, JSON, or relational data from Cloud Storage, BigQuery, or file upload
- outputs to Dataflow or BigQuery, or exports to other file formats

Cloud Dataproc (MapReduce)

- zonal resource that manages Spark and Hadoop clusters for batch MapReduce processing
- can be scaled (up or down) while running jobs
- offers image versioning to switch between versions of Spark
- best for migrating existing Spark or Hadoop jobs to the cloud
- most VMs in a cluster can be preemptible, but at least one node must be non-preemptible

Cloud Dataflow (more recent approach)

- unified data processing: serverless, fast, and cost-effective
- handles both batch and streaming data with one processing model (vs. only one of the two in the alternatives)
- fully managed service, suitable for a wide variety of data processing patterns
- horizontal autoscaling with reliable, consistent, exactly-once processing
- based on open-source Apache Beam
  - Beam is an open-source, unified model for defining batch and streaming parallel data-processing pipelines
  - use a Beam SDK (Java, Python, or Go) to build a program that defines a pipeline
  - a supported distributed processing backend, such as Dataflow, then executes the pipeline

Choosing the right tool

![Screen Shot 2022-07-25 at 9 14 55
AM](https://user-images.githubusercontent.com/5553105/180815541-07c5c8b5-1a02-4f7f-ad09-f3c80f50d971.png)

Summary

- Cloud IoT Core: global, fully managed service to connect, manage, and ingest data from Internet-connected devices, and a primary source for streaming data
- Cloud Pub/Sub: global messaging and ingestion service that supports both push and pull modes with exactly-once processing for many GCP services
- Cloud BigQuery: serverless, multi-regional, multi-cloud SQL column-store data warehouse used for data analytics and ML, capable of scaling to petabytes in minutes
- GCP has a number of big data processing services:
  - Cloud Dataprep for visually preparing data
  - Cloud Dataproc for Spark and Hadoop-based workloads
  - Cloud Dataflow for both batch and streaming data with one processing model
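To make Pub/Sub's at-least-once pull model concrete, here is a toy, pure-Python sketch of the redelivery behavior noted above: a pulled message is delivered again unless the subscriber ACKs it before the ack deadline expires. `MiniPubSub` and its methods are hypothetical illustrations, not the real `google-cloud-pubsub` client API.

```python
import time
from collections import deque


class MiniPubSub:
    """Toy model of at-least-once pull delivery (hypothetical, not a GCP API)."""

    def __init__(self, ack_deadline=0.05):
        self.ack_deadline = ack_deadline
        self._queue = deque()       # messages awaiting delivery
        self._outstanding = {}      # ack_id -> (message, redelivery deadline)
        self._next_ack_id = 0

    def publish(self, message):
        self._queue.append(message)

    def pull(self):
        # Any message whose ack deadline has passed goes back on the queue.
        now = time.monotonic()
        for ack_id, (m, deadline) in list(self._outstanding.items()):
            if now > deadline:
                del self._outstanding[ack_id]
                self._queue.append(m)
        if not self._queue:
            return None, None
        m = self._queue.popleft()
        ack_id = self._next_ack_id
        self._next_ack_id += 1
        self._outstanding[ack_id] = (m, time.monotonic() + self.ack_deadline)
        return ack_id, m

    def ack(self, ack_id):
        # ACK within the deadline => the message is never redelivered.
        self._outstanding.pop(ack_id, None)


topic = MiniPubSub(ack_deadline=0.01)
topic.publish("telemetry-reading-1")

ack_id, msg = topic.pull()    # first delivery; subscriber "crashes" before acking
time.sleep(0.02)              # ack deadline expires
ack_id2, msg2 = topic.pull()  # the same message is redelivered
topic.ack(ack_id2)            # acked this time, so it will not come back
print(msg, msg2)
```

This is why subscribers in the real service must be idempotent: under at-least-once semantics, a slow ACK means the same payload can arrive twice.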
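The "one processing model for batch and streaming" point about Dataflow/Beam can be sketched without any Beam dependency: the same chain of transforms runs unchanged over a bounded (batch) source and a generator standing in for an unbounded stream. The pipeline stages, threshold, and data here are invented for illustration only.

```python
def pipeline(readings):
    """One transform chain for any iterable source: parse -> filter -> format."""
    parsed = (float(r) for r in readings)     # element-wise map (ParDo-style)
    hot = (t for t in parsed if t > 30.0)     # filter
    return (f"ALERT:{t:.1f}" for t in hot)    # map to output records


batch = ["21.5", "33.0", "35.5"]              # bounded source, e.g. a file extract


def stream():                                 # stand-in for an unbounded source
    yield from ["29.9", "31.2"]


print(list(pipeline(batch)))
print(list(pipeline(stream())))
```

In real Beam, the source would be a `PCollection` read from a file or from Pub/Sub, and a runner such as Dataflow would execute the pipeline; the key idea shown here is that the transform chain itself does not change between the two modes.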