# Processing Data

## Compute Services Overview

![Screen Shot 2022-07-19 at 11 09 09 AM](https://user-images.githubusercontent.com/5553105/179821140-9b53a1a9-4981-4ffc-9de1-52e054c203bc.png)

### Compute Engine

- fast-booting VMs
- highly configurable, zonal service
- choose from machine types: general-purpose, compute-optimized, memory-optimized, accelerator-optimized (GPU)
- select a public or private disk image
- options include preemptible (or Spot) VMs
- also good to know: sole-tenant nodes (for BYOL or dedicated-hardware requirements) and instance groups (MIG/UIG)

### Kubernetes Engine (GKE)

Container orchestration system built around clusters, node pools, and a control plane.

- regional, managed container service
- modes: Standard (total control) and Autopilot (fully managed)
- supports node auto-repair and auto-upgrade
- know the following:
  - kubectl syntax
  - private clusters (VPC-native, with RFC 1918 IP addresses)
  - how to deploy, scale, and expose services

### App Engine

Oldest of all GCP services; comes in two versions: Standard and Flexible.

- Standard
  - regional platform as a service (PaaS) for serverless apps
  - zero server management and configuration
  - near-instantaneous scaling, down to zero instances
  - features
    - second-generation runtimes: Python 3, Java 11, Node.js, PHP 7, Ruby, Go 1.12+
    - first generation is more limited
- Flexible
  - for containerized apps
  - zero server management and configuration
  - best for apps with consistent traffic where gradual scaling is acceptable
  - robust runtimes: Python 2.7/3.6, Java 8, Node.js, PHP 5/7, Ruby, Go, .NET

### Cloud Run

- great for modern websites, REST APIs, and back-office admin apps
- regional, fully managed serverless service for containers
- integrated support for Cloud Operations
- built on Knative open-source standards for easy portability
- supports any language, library, or binary
- scales from zero and back in an instant

### Cloud Functions

- regional, event-driven, serverless functions as a service (FaaS)
- triggers:
  - HTTP
  - Cloud Storage
  - Cloud Pub/Sub
  - Cloud Firestore
  - Audit Logs
  - Cloud Scheduler
- totally serverless
- automatic horizontal
scaling
- networks well with hybrid and multi-cloud architectures
- acts as glue between services
- great for streaming data and IoT apps

---

## Choosing the correct compute option

![Screen Shot 2022-07-19 at 11 30 19 AM](https://user-images.githubusercontent.com/5553105/179820925-7182394d-845b-4173-b96d-5c844369ab83.png)

![Screen Shot 2022-07-19 at 11 30 34 AM](https://user-images.githubusercontent.com/5553105/179821020-46b961da-b6d1-48cf-931b-383fd9d500dc.png)

Summary

- mobile apps: `Firebase`
- event-driven functions: `Cloud Functions`
- specific OS or kernel: `Compute Engine`
- no hybrid or multi-cloud requirements: `App Engine Standard` (rapid scaling) or `Flexible`
- containers: `Cloud Run` or `Kubernetes Engine`

---

## Compute autoscaling comparison

![Screen Shot 2022-07-19 at 11 49 18 AM](https://user-images.githubusercontent.com/5553105/179821198-d912725c-f2c2-4ccb-bb27-b98422d9cc73.png)

Summary

- when working with Compute Engine, remember that MIGs coupled with Cloud Load Balancing give a faster autoscaling response
- for HA, run a Kubernetes Engine node pool with a minimum of 3 nodes in production
- Cloud Run scales almost as fast as App Engine Standard, and you are only charged when a request is made

---

## Evolving the Cloud with AI and ML services

### AI Data Lifecycle Steps

Key DATA lifecycle steps (covered earlier)

1. Ingest
2. Store
3. Process / Analyze
4. Explore / Visualize

Key AI data lifecycle steps

1. Ingest
2. Store
3. Process / Analyze
4. Train
5. Model
6. Evaluate
7. Deploy
8.
Predict

### Reviewing AI and ML Services

AI tooling has been evolving on Google Cloud and is currently branded "Vertex AI".

ML Services

- Vision API (OCR, tagging)
- Video Intelligence API (local or Cloud Storage video; track objects, recognize text)
- Translation API (Cloud Translation for 100 language pairs, with auto-detect)
  - Basic / Advanced (Advanced adds batch requests, custom models, and glossaries)
- Text-to-Speech / Speech-to-Text
- Natural Language API
- Cloud TPU (the hardware behind the APIs above)
  - example: 8 VMs with GPUs took 200 minutes vs. 8 minutes on 1 TPU; faster and cheaper for some tasks

### ML Best Practices

Setting up the ML environment

- use Notebooks for development
  - create a Notebook instance for each teammate
  - treat each Notebook instance as a virtual workspace
  - stop instances when not in use
- store prepared data and the model in the same project

ML development

- prepare a good amount of training data
- store tabular data in BigQuery
- store unstructured data (images, video, audio) in Cloud Storage
  - includes TFRecord files, Avro, etc.
  - aim for files > 100 MB and between 100 and 10,000 shards

During data processing

- use TensorFlow Extended (TFX) for TensorFlow projects
  - NEW: Vertex AI Pipelines (its eventual replacement)
- process tabular data with BigQuery
  - you can use BigQuery ML and save results to a BigQuery permanent table
- process unstructured data with Cloud Dataflow (based on Apache Beam)
  - can generate TFRecords
  - if using Apache Spark, use Dataproc instead
- link data to the model with `managed datasets`

Putting the model into production

- specify appropriate (virtual) hardware
  - may be plain VMs or VMs with GPUs/TPUs
- plan for additional inputs (features) to the model, e.g.
data lake or messaging sources
- enable autoscaling

Summary

- the AI data lifecycle expands the traditional data lifecycle: ingest, store, transform, train, model, evaluate, deploy, and predict
- Vertex AI is Google Cloud's AI platform, incorporating all of the machine learning APIs (such as the Vision API), the AutoML services, and even related hardware like Cloud TPU
- be sure to use the proper GCP service for each stage of the AI data lifecycle, such as BigQuery for storing and processing tabular data, and Dataflow / Dataproc for processing unstructured data

---

## Handling Big Data and IoT

### Working with Cloud IoT Core Devices

- remember TerramEarth
- Cloud IoT Core
  - fully managed
  - Device Manager (identity, auth, config, control)
  - Protocol Bridge (publishes incoming telemetry data to Pub/Sub for processing)
- features
  - secure connection via HTTPS or MQTT
  - CA-signed certs verify device ownership
  - two-way comms allow updates, both online and offline
- how it works
  - Devices -> Cloud IoT Core -> Pub/Sub -> Cloud Functions or Dataflow (update device config after processing)

![Screen Shot 2022-07-25 at 8 59 34 AM](https://user-images.githubusercontent.com/5553105/180815481-d27148e8-953f-4a5e-9433-b402ebd2c1f6.png)

### Massive Messaging via Cloud Pub/Sub

- scalable, durable, global messaging and ingestion service, based on an at-least-once publish/subscribe model
- connects many services together and helps small increments of data flow better
- supports both push and pull modes, with exactly-once processing
  - pull mode delivers a message and waits for an ACK
- features
  - truly global: consistent latency from anywhere
  - messages can be ordered and/or filtered
  - lower-cost Pub/Sub Lite is available, requiring more management and offering lower availability and durability

### The Big Data Dog: Cloud BigQuery

- serverless, multi-regional, multi-cloud SQL column-store data warehouse
- scales to handle terabytes in seconds and petabytes in minutes
- built-in integration for ML; backbone for BI Engine
- supports
real-time analytics with streams from Pub/Sub, Dataflow, and Datastream
- automatically replicates data and keeps a seven-day history of changes

### Transforming Big Data

Cloud Dataprep

- visually explore, clean, and prepare data for analysis and ML; used by data analysts
- integrated partner service offered by Trifacta in conjunction with Google
- automatically detects schemas, data types, possible joins, and anomalies like missing values, outliers, and duplicates
- interprets data-transformation intent from user selections and predicts the next transformation
- transformation functions include aggregation, pivot, unpivot, joins, union, extraction, calculation, comparison, condition, merge, and regex
- works with CSV, JSON, or relational data from Cloud Storage, BigQuery, or file upload
- outputs to Dataflow or BigQuery, or exports to other file formats

Cloud Dataproc (MapReduce)

- zonal resource that manages Spark and Hadoop clusters for batch MapReduce processing
- can be scaled (up or down) while running jobs
- offers image versioning to switch between versions of Spark
- best for migrating existing Spark or Hadoop jobs to the cloud
- most VMs in a cluster can be preemptible, but at least one node must be non-preemptible

Cloud Dataflow (more recent approach)

- unified data processing: serverless, fast, and cost-effective
- handles both batch and streaming data with one processing model (vs. only one of the two in the alternatives)
- fully managed service, suitable for a wide variety of data processing patterns
- horizontal autoscaling with reliable, consistent, exactly-once processing
- based on open-source Apache Beam
  - Beam is an open-source, unified model for defining batch and streaming parallel data-processing pipelines
  - use a Beam SDK (Java, Python, or Go) to build a program that defines a pipeline
  - a supported distributed processing backend, such as Dataflow, then executes the pipeline

Choosing the right tool

![Screen Shot 2022-07-25 at 9 14 55
AM](https://user-images.githubusercontent.com/5553105/180815541-07c5c8b5-1a02-4f7f-ad09-f3c80f50d971.png)

Summary

- Cloud IoT Core: global, fully managed service to connect, manage, and ingest data from Internet-connected devices, and a primary source for streaming data
- Cloud Pub/Sub: global messaging and ingestion service that supports both push and pull modes with exactly-once processing for many GCP services
- Cloud BigQuery: serverless, multi-regional, multi-cloud SQL column-store data warehouse used for data analytics and ML, capable of scaling to petabytes in minutes
- GCP has a number of big data processing services:
  - Cloud Dataprep for visually preparing data
  - Cloud Dataproc for Spark and Hadoop-based workloads
  - Cloud Dataflow for both batch and streaming data with one processing model
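To make Pub/Sub's at-least-once pull model concrete, here is a toy, pure-Python sketch of the redelivery behavior noted above: a pulled message is delivered again unless the subscriber ACKs it before the ack deadline expires. `MiniPubSub` and its methods are hypothetical illustrations, not the real `google-cloud-pubsub` client API.

```python
import time
from collections import deque


class MiniPubSub:
    """Toy model of at-least-once pull delivery (hypothetical, not a GCP API)."""

    def __init__(self, ack_deadline=0.05):
        self.ack_deadline = ack_deadline
        self._queue = deque()       # messages awaiting delivery
        self._outstanding = {}      # ack_id -> (message, redelivery deadline)
        self._next_ack_id = 0

    def publish(self, message):
        self._queue.append(message)

    def pull(self):
        # Any message whose ack deadline has passed goes back on the queue.
        now = time.monotonic()
        for ack_id, (m, deadline) in list(self._outstanding.items()):
            if now > deadline:
                del self._outstanding[ack_id]
                self._queue.append(m)
        if not self._queue:
            return None, None
        m = self._queue.popleft()
        ack_id = self._next_ack_id
        self._next_ack_id += 1
        self._outstanding[ack_id] = (m, time.monotonic() + self.ack_deadline)
        return ack_id, m

    def ack(self, ack_id):
        # ACK within the deadline => the message is never redelivered.
        self._outstanding.pop(ack_id, None)


topic = MiniPubSub(ack_deadline=0.01)
topic.publish("telemetry-reading-1")

ack_id, msg = topic.pull()    # first delivery; subscriber "crashes" before acking
time.sleep(0.02)              # ack deadline expires
ack_id2, msg2 = topic.pull()  # the same message is redelivered
topic.ack(ack_id2)            # acked this time, so it will not come back
print(msg, msg2)
```

This is why subscribers in the real service must be idempotent: under at-least-once semantics, a slow ACK means the same payload can arrive twice.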
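The "one processing model for batch and streaming" point about Dataflow/Beam can be sketched without any Beam dependency: the same chain of transforms runs unchanged over a bounded (batch) source and a generator standing in for an unbounded stream. The pipeline stages, threshold, and data here are invented for illustration only.

```python
def pipeline(readings):
    """One transform chain for any iterable source: parse -> filter -> format."""
    parsed = (float(r) for r in readings)     # element-wise map (ParDo-style)
    hot = (t for t in parsed if t > 30.0)     # filter
    return (f"ALERT:{t:.1f}" for t in hot)    # map to output records


batch = ["21.5", "33.0", "35.5"]              # bounded source, e.g. a file extract


def stream():                                 # stand-in for an unbounded source
    yield from ["29.9", "31.2"]


print(list(pipeline(batch)))
print(list(pipeline(stream())))
```

In real Beam, the source would be a `PCollection` read from a file or from Pub/Sub, and a runner such as Dataflow would execute the pipeline; the key idea shown here is that the transform chain itself does not change between the two modes.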