    # Managing and Securing Data

    ## Establishing Core Security (Cloud IAM)

    Cloud IAM
    - determining WHO has ACCESS to WHICH resources
    - Who (principals or members)
    - Google account
    - Service account
    - Google group (best practice)
    - Google Workspace account (org)
- Cloud Identity domain (org without Workspace apps/features)
    - All authenticated users (users on Internet authenticated by Google)
    - All users (anyone on the Internet)
    - Access (roles)
    - Billing Account Administrator
    - Billing Account User
    - Storage Object Creator
    - Storage Object Viewer
    - Cloud SQL Editor
    - Cloud SQL Instance User
    - Security Admin (get/set any IAM policy)
    - Which resources
    - VM instance
    - GKE cluster
    - Storage bucket
    - Pub/Sub topic
    - Organization
    - Folder
    - Project
    - Roles
    - Primitive (oldest, pre-date Cloud IAM, broadest permissions)
    - Predefined (target specific resources w/ actions at granular level)
    - Custom (unique set of permissions, most granular level)
    - requires Role Administrator role
    - Policies
- Role binding - 1 or more principals assigned to a role (policy); see the sketch below
    - Summary (Part 1: Cloud IAM)
    - globally manages access control for organizations
    - resource access is granted to roles (collection of permissions), and roles are granted to principals
    - recommender helps identify excess or needed permissions from principals
    - grants IAM access to external identities (AD, etc.) with workload identity federation
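
To make role binding concrete, here is a minimal sketch using the google-cloud-storage Python client to bind a role on a bucket (project, bucket, and group names are hypothetical):

```python
from google.cloud import storage

client = storage.Client(project="my-project")   # hypothetical project ID
bucket = client.bucket("my-example-bucket")     # hypothetical bucket name

# Fetch the current IAM policy; version 3 also supports conditional bindings
policy = bucket.get_iam_policy(requested_policy_version=3)

# A role binding: one role bound to one or more principals -- here a
# Google group, per the best practice above
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"group:data-readers@example.com"},
})
bucket.set_iam_policy(policy)
```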

    Resource Manager
- centrally manages and secures an organization's projects with a custom folder hierarchy
    - example:
    - company
    - dept Y
    - team B
    - product 1
    - dev
    - test
    - GCE, GAE, GCS resources
    - production
    - modified Cloud IAM policies across an org
- Cloud Asset Inventory monitors and analyzes all GCP assets, including IAM policies
    - Organization Policy Service sets constraints on resources and helps orgs stay in compliance

    Cloud Identity
- fully-managed Identity as a Service (IDaaS) offering for provisioning and managing identity resources
- each user and group is given a Cloud Identity account, allowing Cloud IAM to manage access
    - can be configured to federate identities with other identity providers (i.e. Active Directory)
    - features
    - SSO with other apps
    - Multi-factor authentication (MFA)
    - Device security with endpoint management
    - Context-aware access without VPN

    Cloud Identity-Aware Proxy (IAP)
- establishes a central authorization layer for apps accessed over HTTPS, and internally over HTTP
- enforces access control policies for apps and resources
- based on the load balancer and IAM; permits only authenticated and authorized requests
    - supports
    - App Engine
    - Compute Engine
    - Kubernetes Engine
    - Cloud Run
    - On-premises

    Summary
- Cloud IAM: how principals, roles, and resources relate, as well as IAM policy creation and inheritance
    - keep the principle of least privilege in mind and practice; GCP stresses this concept and offers the Recommender service to help implement it
- controlling and managing access is critical to an org's security. GCP offers two services: Cloud Identity and Cloud Identity-Aware Proxy

    ## Detecting and Responding to Security Threats

    Cloud Security Command Center - hub for GCP protective resources
    - comprehensive security management and risk platform
    - two tiers: standard and premium
    - designed to prevent, detect, and respond to threats from a single pane of glass
    - integrates and monitors many security services on GCP as well as external services
    - identifies security compliance violations and misconfiguration in Google Cloud assets
    - exports SCC data to Splunk as well as other SIEMs
    - standard
    - SHA: security health analytics
    - WSS: web security scanner
    - CA/WAF: cloud armor
    - DLP: cloud data loss prevention
    - anomaly detection
- Forseti Security integration
    - premium
    - SHA: adds monitoring/reporting for compliance
    - WSS: adds managed scans
    - ETD: event threat detection
    - CTD: container threat detection
    - continuous exports to Pub/Sub

    Web Security Scanner - guarding frontlines of Internet traffic
    - detects key vulnerabilities in App Engine, Compute Engine, and Kubernetes Engine applications
    - crawler based, supports public URLs and IPs not behind a firewall
    - standard
    - custom scans
    - premium
    - managed scans
    - detects
    - Cross-site scripting (XSS)
    - Flash injection
    - mixed (HTTP/HTTPS) content
    - outdated and insecure JavaScript libraries
    - readable text passwords

    Cloud Armor
    - edge-level, enterprise-grade DDoS protection and web application firewall (WAF)
    - leverages Google Cloud load balancing
    - mitigates OWASP's top ten risks
    - features
    - allow or deny traffic by IPs or CIDR ranges
    - preview changes before pushing policy live
- configure WAF rules to reduce false positives
    - reference named IP address lists from CDN partners (Fastly, Cloudflare, Imperva)

    Event Threat Detection - malware, crypto mining
- identify threats in near-real time by monitoring and analyzing Cloud Logging
    - threats are defined by rules, which specify needed logs
- create custom rules by running queries on log data exported to BigQuery
    - quickly detect many types of attacks
    - malware
    - crypto mining
    - outgoing DDoS attacks
    - port scanning
    - IAM anomalous grant
    - brute-force SSH

    Cloud Data Loss Prevention
    - inspection, classification, and de-identification platform to protect sensitive data
- includes over 150 data detectors for personally identifiable information (PII)
    - connect DLP results to SCC, Data Catalog, or export to external SIEM or governance tool
    - detects data in
    - streams of data or structured text
    - files in cloud storage or BigQuery
    - images
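
As a minimal sketch of the inspection workflow, the google-cloud-dlp Python client can scan a text item for PII info types (project ID and sample text are hypothetical):

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        },
        "item": {"value": "Call 555-0100 or email jane@example.com"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```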

    Summary
    - the Cloud Security Command Center (SCC) platform monitors the majority of GCP's security services and is accessible through Standard and Premium tiers
    - if you use Google Cloud's external HTTPS load balancer, protect your web-based applications hosted on GAE, GCE, or GKE with Web Security Scanner
- when Event Threat Detection (ETD) is enabled, GCP analyzes a range of logs from Cloud Logging to find signs of malware, crypto mining, outgoing DDoS attacks, brute-force SSH, and other threats

    ## Managing Encrypted Keys
A crypto key is a string of characters that, when used with an encryption algorithm, makes ordinary text unreadable. When that key, or its counterpart, is used with a decryption algorithm, it makes the text readable again.

To be effective, crypto keys have to be complex and are not something anyone should memorize. As such, we need a service like KMS to maintain them.

    Cloud Key Management Service (KMS)
    - highly available, low-latency service to generate, manage and apply cryptographic keys
    - Cloud KMS encrypts and decrypts - does not store secrets itself - and controls access to keys
- supports both symmetric (e.g. AES) and asymmetric (e.g. RSA or EC) algorithms
    - includes a 24-hour delay for key material destruction, to prevent accidental or malicious data loss
    - supports regulatory compliance and adds optional variations
    - Cloud HSM
    - Cloud EKM
    - CMEK
    - CSEK
- Google recommends you regularly and automatically rotate symmetric keys
- rotation of asymmetric keys cannot be automated, but regular manual rotation is still good practice
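
A minimal sketch of the encrypt/decrypt round trip with the google-cloud-kms Python client (project, location, key ring, and key names are hypothetical):

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_name = client.crypto_key_path(
    "my-project", "us-east1", "my-key-ring", "my-key")  # hypothetical names

# Cloud KMS returns the ciphertext -- it does not store the data itself
encrypt_resp = client.encrypt(
    request={"name": key_name, "plaintext": b"sensitive payload"})

decrypt_resp = client.decrypt(
    request={"name": key_name, "ciphertext": encrypt_resp.ciphertext})
assert decrypt_resp.plaintext == b"sensitive payload"
```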

    Cloud Hardware Security Module (HSM)
- hosts encryption keys and performs cryptographic actions in a cluster of FIPS 140-2 Level 3 certified devices
    - enables compliance with hardware requirements
- HSM keys are cryptographically bound to region, with support for multi-regions
    - Cloud HSM properties
    - keys are non-exportable
    - tamper resistant
    - provides tamper evidence
    - auto-scales horizontally

Cloud External Key Management (EKM)
    - use keys from supported external key management partners instead of GCP
    - works only with supported CMEK integration services
    - BigQuery
    - Compute Engine
    - Cloud Run
    - Cloud Spanner
    - Cloud Storage
    - GKE
    - Pub/Sub
    - Secret Manager
    - key ring should be created in same location as external key management partner
    - benefits include
    - key provenance
    - access control
    - must grant GCP project access to key
    - centralized key management

    Secret Manager
    - allows storage of passwords and variables to use in applications
    - fully managed service for storing, managing, and accessing secrets as binary blobs or text strings
- used for storing sensitive runtime info such as database passwords, API keys, or TLS certificates
    - data of each secret is immutable and new versions are created each time value is modified
    - best practices
    - follow principle of least privilege
    - limit access with IAM conditions
    - use the Secret Manager API instead of env vars
    - reference secrets by version number, not "latest"
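
A minimal sketch of accessing a secret through the Secret Manager API, pinned to an explicit version per the best practices above (project, secret, and version are hypothetical):

```python
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

# Reference an explicit version number, not "latest"
name = "projects/my-project/secrets/db-password/versions/3"  # hypothetical
response = client.access_secret_version(request={"name": name})
db_password = response.payload.data.decode("UTF-8")
```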

    Encrypted Keys Flowcharts
    ![Service Types](https://user-images.githubusercontent.com/5553105/187797687-c74b10bf-8b9d-497d-83d8-7562e88de74d.png)

    ![Flowchart](https://user-images.githubusercontent.com/5553105/187797732-bca97f4a-5549-4005-80c0-0b950192b30a.png)

    Summary
    - cloud KMS offers a full range of key sources: Google-managed, Cloud HSM devices, or Cloud EKM partners as well as customer-managed or supplied keys
- regular automatic rotation of symmetric algorithm keys is considered a best practice; Cloud KMS does not support automatic rotation of asymmetric keys
    - follow the principle of least privilege when assigning access to Secret Manager entries by using Cloud IAM conditions or secret-level binding
    # Networking Data

Google has 75K miles of networking cable and 150+ POPs around the globe

    ## Globally Connecting with external networking

    Cloud Domains
    - global registrar for domain names using Google Domains
    - uses built-in DNS or allows custom nameservers
    - supports DNSSEC and private WhoIs records
    - integration features
    - managed as a GCP project, including billing
    - automatic domain verification with Search Console, App Engine, Cloud Run, etc.
    - works with Cloud IAM for access management
- partners including Shopify, Wix, Squarespace, Bluehost, Weebly, and others

    Cloud DNS
    - hierarchical DB linking domain names to IP addresses
- global, scalable, fully-managed authoritative domain name service
- 100% uptime SLA
    - offers both public and private managed zones
- private zones are visible only to one or more VPCs that you specify
    - features
    - Cloud IAM and Cloud Logging integration
    - DNS peering and DNS forwarding
    - Anycast nameservers - allows multiple machines to share same IP (nearest machine)
    - DNSSEC support

    Static External IP Addresses
    - reserve static external IP addresses in projects and assign to resources
    - GCP supports two types: regional and global
    - regional IP addresses can be assigned to compute engine VMs and network load balancers
    - global IP addresses are assigned to Anycast IPs and global load balancers (HTTP/S, SSL, and TCP proxies)
- IPv6 addresses are global only and work only with global load balancers
    - static IP addresses can be assigned through console, gcloud command line, API, or Terraform
    - no charge except for IP addresses that are reserved but not used
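
A sketch of reserving a regional static external IP with the google-cloud-compute Python client (project, region, and address names are hypothetical):

```python
from google.cloud import compute_v1

client = compute_v1.AddressesClient()
address = compute_v1.Address(name="my-static-ip")  # hypothetical name

# Reserve a regional external IPv4 address; global (Anycast) addresses
# are reserved through GlobalAddressesClient instead
operation = client.insert(
    project="my-project", region="us-central1", address_resource=address)
operation.result()  # block until the reservation completes
```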

    Cloud Load Balancing
    - fully distributed, software-defined managed service that spreads network traffic across multiple instances of your apps
    - Layer 4 and Layer 7 load balancing with Cloud CDN integration and automatic scaling
    - L4: transport layer (TCP, UDP)
    - L7: application layer (HTTP, HTTPS)
    - Regional load balancing features
    - health checks
    - session affinity
    - IPv4 only
    - good for single region, IPv4 only, or compliance
    - Global load balancing features
    - multi-region failover
    - connects to closest region for lowest latency
    - IPv4 and IPv6
    - good for backends distributed across multiple regions, want to deliver nearest to user with Anycast

    Cloud CDN
    - serves content closer to the user
    - relies on Google Cloud's global edge network
    - works with `external` GCP HTTPS load balancing
    - manage cache rules with cache control header or allow it to automatically cache static content
    - content sources include
    - instance groups
    - zonal network endpoint groups (NEGs)
    - serverless NEGs like GAE or Cloud Functions
    - Cloud Storage buckets

    Summary
    - optimal external networking requires establishing a domain, connection to DNS, reserving external IP, integrating load balancer, and accessing a CDN
    - because Cloud Load Balancing is software-defined and does not rely on any devices - virtual or physical - it can handle spikes with no prewarming
    - cloud CDN works only with content delivered from sources within Google Cloud, such as GCE instance groups or GCS buckets. Custom origins are not allowed.

    ## Networking internally

    Virtual Private Cloud (VPC)
- delivers networking for all your org's GCP resources
    - global IPv4 unicast software-defined network
- automatic or custom creation (configure subnets, firewall rules, routes, VPNs and BGP)
    - VPC is global, but subnets are regional
    - Options include
    - Shared VPC
- VPC peering (make services available privately across different VPC networks)
    - Bring your own IP addresses
    - Packet mirroring

    Cloud Interconnect
    - extends on-premises network to VPC through HA, low-latency connection
    - dedicated
    - direct physical connection to GCP
    - best for high bandwidth needs
    - 10Gbps or 100Gbps circuits
- capacities from 50 Mbps to 50 Gbps
    - traffic not encrypted but can be added
    - cannot use Cloud VPN with it
    - partner
    - connects to GCP via partner
    - better for lower bandwidth needs
    - depends on partner capabilities
- capacities from 50 Mbps to 50 Gbps
    - traffic not encrypted but can be added
    - cannot use Cloud VPN with it

    Cloud VPN
    - securely connects peer network to VPC
    - any network, including those on other providers
    - traffic is encrypted by one IPsec VPN gateway and decrypted by another
    - requires static IP address for persistence and does not support dynamic, e.g. "dial-in" VPN
    - best practices
    - keep cloud VPN resource in own project
    - use dynamic routing and BGP
- establish secure firewall rules for VPN
    - generate strong pre-shared keys for tunnels

Examining Other Networking Services (Cloud Router, CDN Interconnect)
    - Cloud Router
    - provides dynamic routing for hybrid networks linking VPCs to external networks via BGP
    - works with Cloud VPN and Dedicated Interconnect
    - automatically learns subnets in VPC and announces them to on-premises network
    - works with router appliances
    - CDN Interconnect
    - direct low-latency connectivity to certain CDN providers, with lower egress fees
    - works for both pull and push cache fills
    - best for high-volume egress traffic (lowers cost) and frequent content updates (lower latency)
- supports Akamai, Verizon, Cloudflare, Fastly, and a few others

    Summary
- because a GCP VPC is a software-defined network, a single VPC can cover multiple regions without traffic crossing the public internet
    - if your company requires direct connection from on-prem datacenters to their VPC, use either Dedicated Interconnect (highest capacity) or Partner Interconnect
    - for lower traffic requirements that require a secure connection, connect to VPC with Cloud VPN and Cloud Router using a static IP address

    ## Finding a Load Balancer

    External
    ![Screen Shot 2022-07-29 at 3 20 29 PM](https://user-images.githubusercontent.com/5553105/181845100-80876fb6-42a1-45e4-96e8-3a828ebfa04e.png)

    Internal
    ![Screen Shot 2022-07-29 at 3 21 26 PM](https://user-images.githubusercontent.com/5553105/181845125-c16ba50c-5055-4a22-ad5d-df881f2a7271.png)

    Summary
    - when handling HTTP or HTTPS traffic around the world, use external HTTP(S) Load Balancing
- if you have TCP traffic and would prefer to offload SSL/TLS, the best choice is SSL Proxy Load Balancing
    - Internal TCP or UDP traffic should rely on regional internal TCP/UDP Load Balancing for lowest latency and most direct connection
    # Storing Data

    ## Storing Objects and Files

    Going straight to Local SSD
    - fastest block storage option (physical disk attached to computer)
    - very fast zonal resource, 375 GB solid state disk directly attached to server hosting VM instance
    - expandable to 3, 6, 9 TB with increasing performance up to 2.4M reads and 1.2M write IOPS
    - all data encrypted at rest, lost when VM stops but it can survive live migration
    - best for transient data (media rendering, analytics, high-perf computing, caches)

Persevering with Persistent Disks
    - major benefit: persistence, available after VM shutdown
    - independent of VMs where data is distributed across disks for redundancy
    - highly durable (up to six 9s) and secure: data encrypted at rest and in transit
    - configurations
    - zonal
    - data in a single zone
    - 4 types: Standard, Balanced, SSD, Extreme
    - can be used for both snapshot and boot disk
    - can add more storage space, throughput, and IOPS
    - regional
    - data in 2 zones in same region
    - 3 types: Standard, Balanced, SSD
    - can be used for snapshots but not for boot disks
    - ONLY storage can be changed, not throughput or IOPS

    Managing File-based Storage
- files are stored as whole units, without data being broken down into blocks
    - fully managed file-based storage service (like a NAS)
    - provision instance in specific zone
    - access using NFSv3 protocol
    - consistently fast and good for lift and shift migration
    - read-only snapshots are supported
    - 3 tiers
    - Basic: best for file sharing, k8s, dev, web hosting (1-63.9 TiB)
    - Enterprise: best for critical large-scale ops, GCE, K8S (1-10 TiB)
    - High Scale: best for high-perf computing (i.e. genome sequencing: 10-100 TiB)

    Keeping Objects in Cloud Storage
    - infinitely scalable, fully-managed, highly durable object storage service (11 9s of durability)
- for unstructured data such as images, videos, and documents
    - all objects stored in buckets
    - can be regional or multi-regional
    - support folders/sub-folders
    - supports versioning per bucket, with live object and noncurrent versions
    - permissions granted by bucket or by object and limited to teams, or people, or fully public
    - storage classes
    - standard: most frequently accessed or for brief time
    - nearline: for data you plan to access once a month or less
    - coldline: access at most once every 90 days
    - archive: access less than once/year
    - use lifecycle management rules to move objects between classes
    - age
- date when put in bucket (created before, on, or after)
    - current version
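
A minimal sketch of lifecycle rules with the google-cloud-storage Python client, moving objects to a colder class by age and then deleting them (bucket name and thresholds are hypothetical):

```python
from google.cloud import storage

client = storage.Client(project="my-project")       # hypothetical project ID
bucket = client.get_bucket("my-example-bucket")     # hypothetical bucket name

# Move objects to Nearline after 30 days, then delete them after a year
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```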

    Summary
    - local SSDs are highest performance block storage but lose data if VM is stopped
    - file-based storage, go with Filestore: Basic, Enterprise, High Scale
    - control costs with bucket locations and lifecycle rules

    ---

    ## Selecting the proper storage

Flowcharts
![Screen Shot 2022-07-25 at 2 27 42 PM](https://user-images.githubusercontent.com/5553105/181115925-fd4cee0c-c849-48c1-b7f9-16cb94b18b6b.png)

### Buckets
![Screen Shot 2022-07-25 at 2 28 53 PM](https://user-images.githubusercontent.com/5553105/181115976-53f1a559-ef8e-4a96-88c8-213f69e313ab.png)

    ---

    ## Saving your data on GCP

    Cloud SQL
    - regional, fully-managed relational db service for SQL Server, MySQL, and PostgreSQL
    - automatic replication with automatic failover, backup, point-in-time recovery
    - scale manually up to 96 cores, more than 624GB RAM, add replicas as needed
    - features
    - built-in high availability
- automatically scale storage up to 30TB
    - connects with GAE, GCE, GKE, and BigQuery, among other services

    Cloud Spanner
    - fully-managed relational DB with up to 5 9s availability and unlimited scale (Mountkirk Games)
- create a Spanner instance by defining an instance config and compute capacity
- best practice: use query parameters to increase efficiency and lower costs (see the sketch below)
    - features
    - automatic sharding
- external consistency (all transactions appear to execute sequentially, even though distributed)
    - backup/restore and PITR
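
A minimal sketch of the query-parameter best practice using the google-cloud-spanner Python client (instance, database, and table names are hypothetical):

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")       # hypothetical names
database = client.instance("my-instance").database("my-database")

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT PlayerId, Score FROM Scores WHERE PlayerId = @player_id",
        params={"player_id": 42},
        param_types={"player_id": spanner.param_types.INT64},
    )
    for row in rows:
        print(row)
```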

    Cloud Bigtable
- fully-managed, scalable NoSQL db service used for large analytical and operational workloads
    - no related tables, primary or foreign keys
    - key / value store
- handles large amounts of data in a key-value store and supports high read and write throughput at low latency
- tables are stored in instances that contain up to 4 clusters located in different zones
    - use cases
    - time-series data
    - marketing and/or financial data
    - IoT data
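
A minimal sketch of the key/value model using the google-cloud-bigtable Python client -- one row key with cells under a column family (instance, table, and family names are hypothetical):

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")      # hypothetical names
table = client.instance("my-instance").table("sensor-data")

# Row keys often encode entity + timestamp for time-series workloads
row = table.direct_row(b"sensor-1#2022-08-31T12:00")
row.set_cell("readings", "temp_c", b"21.7")         # family, column, value
row.commit()

print(table.read_row(b"sensor-1#2022-08-31T12:00").cells)
```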

    Firestore
    - fully-managed, scalable, NoSQL serverless database
- live synchronization and offline mode enable multi-user, collaborative applications on mobile and web
    - supports Datastore dbs and Datastore API
    - workloads include:
    - live asset and activity tracking
    - real-time analytics
    - media and product catalogs
    - social user profiles and gaming leaderboards (Mountkirk Games?)
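
A minimal sketch of document reads and writes with the google-cloud-firestore Python client (collection and field names are hypothetical):

```python
from google.cloud import firestore

db = firestore.Client(project="my-project")         # hypothetical project ID
ref = db.collection("leaderboard").document("player-42")

ref.set({"name": "Ada", "score": 9001})             # create or overwrite
print(ref.get().to_dict())                          # read it back
```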

    Examining other DB options
    - Datastream
    - serverless CDC and replication service
- synchronizes data across heterogeneous databases and applications reliably
    - Firebase Realtime DB
- serverless NoSQL database for storing and syncing data
- enhances collaboration among users across devices and the web in real time
- Memorystore
    - in-memory service for Redis and Memcached
    - provides low-latency access and high-throughput for heavily-accessed data

    Summary
    - relational services: Cloud SQL, Cloud Spanner
    - NoSQL services: Firestore (document-based) and Bigtable (key-value)
    - cached, gaming, streaming data use Memorystore, which supports both Redis and Memcached

    ---

    ## Deciding on the best databases

    ![Screen Shot 2022-07-26 at 3 28 03 PM](https://user-images.githubusercontent.com/5553105/181115774-ef6fd15c-753c-40b9-a5d7-298180facef6.png)

    Summary
- first question is whether the data is structured or not; if not, go with Cloud Storage unless you need mobile SDKs
    - if workload primarily data analytics, best options are Bigtable (if NoSQL and low latency), and otherwise BigQuery
- if the workload is structured data, use Cloud SQL for basic relational DB needs and Cloud Spanner for horizontal scalability
    # Containers and Specialized Workloads

    ## Kubernetes Engine

    Coordinating Clusters
    - includes at least one control plane and multiple worker machines (a.k.a. nodes)
    - can create zonal or regional clusters
    - single or multi-zonal (single control plane replica)
- regional cluster (control plane replicated to multiple zones in a region)
    - private clusters are VPC native, dependent on internal IP addresses
- for HA apps, distribute your workload using multi-zonal node pools
    - Horizontal Pod Autoscaler (HPA) checks the workload's metrics against target thresholds
- Configure horizontal pod autoscaling on a Deployment, rather than a ReplicaSet (see the sketch below)
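
As a sketch of that last point, the official kubernetes Python client can create an autoscaling/v1 HPA that targets a Deployment (cluster context, names, and thresholds are hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl context points at the GKE cluster

# HPA targeting a Deployment (not a ReplicaSet), per the note above
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa"),   # hypothetical name
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,       # target threshold
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```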

    Working with Workloads (application running on Kubernetes)
- custom and external metrics let the HPA scale based on conditions other than the workload's built-in resource metrics
    - custom metric reported from your app running in K8S
    - external metric reported from service outside cluster
    - configuring limits for Pods based on workload is highly recommended
    - ConfigMaps bind non-sensitive configuration artifacts to your pod containers at runtime
    - Deployments are best for stateless apps with ReadOnlyMany or ReadWriteMany volumes
    - DaemonSets are good for ongoing background tasks that do not require user intervention
    - attempt to adhere to 1 pod/node model (across cluster or subset of nodes)
    - StatefulSets are pods with unique persistent identities and hostnames

    Networking Pods, Services and External Clients
    - VPC-native clusters scale better than routes-based clusters and are needed for private clusters
    - VPC native uses alias IP
    - routes-based uses static routes
    - Shared VPC networks are best for orgs with centralized management team
    - attach Service Projects to the Host Project (sharing selected subnets/ranges)
    - GKE Ingress (internal or external) implements Ingress resources as Google Cloud load balancers for HTTP(S) workloads
    - Workload Identity links Kubernetes service accounts to Google service accounts to safely access other Google services

    Keeping an eye on Operations
    - monitoring and logging can be enabled for both new and existing clusters
    - GKE container logs are removed when the host pod is removed, when their disk runs out of space, or when replaced by newer logs
    - GKE generates two types of metrics:
    - System metrics - metrics from essential system components describing CPU, memory, storage
    - Workload metrics - exposed by any GKE workload like cronjob, etc.
    - use Istio Fault Injection to test apps resiliency (chaos engineering)

    Summary
    - clusters can be zonal (single or multi-zonal), regional, or private. Use regional clusters for high-availability production workloads
    - keep in mind the best use cases and scenarios for Deployments, StatefulSets, DaemonSets, and ConfigMaps
- remember that VPC-native networks are required for private clusters and that Ingress objects create load balancers for both external and internal HTTP traffic. Remember to use Workload Identity to connect clusters to other Google services

    ---

    ## Anthos: Closer Look

    Uncovering the Anthos 411
- application deployment anywhere: GCP, on-prem, hybrid, and multicloud
    - supports K8S clusters, Cloud Run, Compute Engine VMs
    - use Migrate for Anthos to migrate and modernize existing workloads to containers
    - enhance app development and delivery with up-to-date CI/CD automated pipelines
    - uses open source
- enables a defense-in-depth security strategy with comprehensive security controls across all deployments
    - fully integrated with GCP Monitoring and Logging, including hybrid and on-prem configs

    Managing Microservices with Anthos Service Mesh
    - suite of tools to monitor and manage service mesh on-prem or Google Cloud
    - ASM enables managed, observable, and secure communication across microservices, on-prem, and GCP
- Powered by open-source Istio, ASM consists of one or more control planes and a data plane that monitors all traffic through a proxy
    - ASM controls traffic flow between services as well as ingress and egress
    - supports canary and blue-green deployments
    - configure load balancing between services
    - provides in-depth telemetry with Cloud Monitoring, Logging, and Trace

    The Kubernetes Engine Connection
    - Anthos Clusters provide a unified way to work with K8S clusters as part of Anthos, extending GKE to work in multiple environments
    - Anthos on GCP uses "traditional" GKE, while on-premises uses VMWare and Bare Metal
    - Logically group and normalize multiple clusters via Fleets to manage multi-cluster capabilities and apply consistent policies
    - Anthos Config Management (ACM) creates a common configuration across all infra, including custom policies, applied both on-premises and in the cloud
    - Binary Authorization configures a validation policy enforced when deploying a container image
    - only explicitly-authorized images deployed using an "Attester"

    Accessing Cloud Run for Anthos
- flexible serverless development platform for hybrid and on-prem environments
    - managed with Knative, which enables serverless workloads on K8S
    - streamlines operational needs with advanced workload autoscaling and automatic networking
    - Scale idle workloads to zero or set min instance count for baseline availability
    - Out-of-the-box integration with Monitoring, Logging, and Error Reporting
    - Easily perform A/B tests with traffic splitting and quickly roll back to known working services

    Summary
- Anthos makes it possible to deploy, manage, and monitor applications anywhere and in multiple locations: GCP, on-prem, multicloud, or hybrid
    - in addition to supporting GKE, Cloud Run, and VMs, Anthos offers system-spanning services such as Migrate for Anthos, Anthos Service Mesh (ASM), and Anthos Config Management (ACM)
    - familiarize yourself with special features that Anthos offers, particularly in securing CI/CD pipelines like Binary Authorization, Service Mesh testing and reporting, and Cloud Run for Anthos traffic splitting

    ---

    ## Bare Metal: Closer Look

    All about Anthos Bare Metal
    - Anthos clusters on bare metal allow you to directly deploy applications on your own hardware
    - manages app deployment and health across existing datacenters for more efficient operations
    - control system security without compatibility issues for virtual machines and OS
    - scale up apps while maintaining reliability regardless of fluctuations in workload and network traffic thanks to advanced monitoring
    - security can be customized with minimal connections to outside resources

    Discovering Deployment Options
    - Admin Cluster: manages user clusters
    - User Cluster: control plane + workers
    - 3 basic models to choose from
    - Standalone: single cluster both user and admin
    - best for single teams or workloads
    - no need for separate admin clusters
    - works great for edge locations
    - Multi-cluster: one admin and one or more user clusters
    - works well for fleet of clusters with central mgmt
    - provides separation between teams
    - isolates development and production workloads
    - Hybrid: runs user workloads on admin
    - create from standalone by adding more user clusters
    - use only if no security concerns with user workloads on admin
    - configure HA for user clusters independently

    Operating Bare Metal Clusters
    - use `Connect` to associate your bare metal clusters to Google Cloud
    - access is enabled for workload management and unified UI (Cloud Console)
    - Cloud Console displays health of all connected workloads and allows modifications to all
- Put nodes into `maintenance mode` to drain pods/workloads and exclude them from pod scheduling

    Summary
- Anthos on bare metal gives the best flexibility by using a company's own hardware
    - Bare metal offers 3 kinds of deployment for admin/user clusters: standalone, multi-cluster, and hybrid
    - once Connect has been used to associate your clusters with Google Cloud, the Cloud Console is enabled and provides a unified user interface for all clusters, regardless of location
# Processing Data

## Evolving the Cloud with AI and ML services

    AI Data Lifecycle Steps

    Key DATA lifecycle steps (covered earlier)
    1. Ingest
    2. Store
    3. Process / Analyze
    4. Explore / Visualize

    Key AI Data lifecycle steps
    1. Ingest
    2. Store
    3. Process / Analyze
    4. Train
    5. Model
    6. Evaluate
    7. Deploy
    8. Predict

    Reviewing AI and ML Services
AI has been evolving on Google and is currently called "Vertex AI"

    ML Services
    - Vision API (OCR, tagging)
    - Video Intelligence API (local, cloud storage, track objects, recognize text)
    - Translation API (Cloud Translation for 100 language pairs, with auto-detect)
    - Basic / Advanced (also includes batch requests, custom models, glossaries)
    - Text-to-speech / Speech-to-text
    - Natural Language API
- Cloud TPU (hardware behind the APIs above)
    - 8 VMs w/ GPU took 200 minutes vs 1 TPU 8 minutes; faster and cheaper for some tasks

    ML Best Practices
    Setting up the ML environment
    - use Notebooks for development
    - create a Notebook instance for each teammate
    - treat each notebook instance as virtual workspace
    - stop when not in use
    - store prepared data and model in same project

    ML development
    - prepare a good amount of training data
    - store tabular data in BigQuery
    - store unstructured data (images, video, audio) in Cloud Storage
- includes TFRecord files, Avro, etc.
    - aim for files > 100MB and between 100 - 10,000 shards

    During data processing
    - use Tensorflow Extended for TF projects
    - NEW: Vertex AI Pipelines (replacement in future)
    - process tabular data with BigQuery
    - can use BigQuery ML and save results in BQ permanent table
    - process unstructured data with Cloud Dataflow (based on Apache Beam)
    - can generate TF record
    - if using Apache Spark, then can use Dataproc
    - Link data to model with `managed datasets`

    Putting the model into production
    - specify appropriate (virtual) hardware
    - may be straight VMs or with GPU/TPU
    - plan for additional inputs (features) to model
    - i.e. data lake, messaging
    - enable autoscaling

    Summary
- AI data lifecycle expands the traditional lifecycle
    - ingest, store, transform, train, model, evaluate, deploy, and predict
    - Vertex AI is Google Cloud's AI platform, incorporating all machine learning APIs, such as Vision API, its AutoML services, and even related hardware, like Cloud TPU
- Be sure to use the proper GCP service for the various stages in the AI data lifecycle, such as using BigQuery for storing and processing tabular data, and Dataflow / Dataproc for processing unstructured data

    ---

    ## Handling Big Data and IoT

    Working with Cloud IoT Core Devices
    - remember TerramEarth
- Cloud IoT Core - fully managed
    - Device manager (identity, auth, config, control)
    - Protocol bridge (publishes incoming telemetry data to Pub/Sub for processing)
    - Features
    - Secure connection via HTTPS or MQTT
    - CA signed certs verify device ownership
    - 2-way comms allow updates, on and offline
    - How it works
    - Devices -> Cloud IoT Core -> Pub/Sub -> CF or Dataflow (update device config after process)
    ![Screen Shot 2022-07-25 at 8 59 34 AM](https://user-images.githubusercontent.com/5553105/180815481-d27148e8-953f-4a5e-9433-b402ebd2c1f6.png)

    Massive Messaging via Cloud Pub/Sub
    - Scalable, durable, global messaging and ingestion service, based on at-least-once publish/subscribe model
    - Connects many services together and helps small increments of data to flow better
    - Supports both push and pull modes, with exactly-once processing
    - Pull mode delivers message and waits for ACK
    - Features
    - Truly global: consistent latency from anywhere
    - Messages can be ordered and/or filtered
- Lower-cost Pub/Sub Lite is available, requiring more management and offering lower availability and durability
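
A minimal sketch of publishing with the google-cloud-pubsub Python client (project, topic, and attribute names are hypothetical):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "telemetry")  # hypothetical

# publish() is asynchronous and returns a future; result() blocks until
# the service acknowledges with a message ID
future = publisher.publish(
    topic_path, data=b"engine_temp=87", vehicle="te-001")  # attrs are strings
print(future.result())
```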

    The Big Data Dog: Cloud BigQuery
    - Serverless, multi-regional, multi-cloud SQL column-store data warehouse
- Scales to handle terabytes in seconds and petabytes in minutes
    - Built-in integration for ML and backbone for Business Intelligence Engine
    - Supports real-time analytics with streams from Pub/Sub, Dataflow, and Datastream
    - Automatically replicates data and keeps seven-day history of changes
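
A minimal sketch of running an analytic query with the google-cloud-bigquery Python client (project, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")      # hypothetical project ID

query = """
    SELECT vehicle_id, AVG(engine_temp) AS avg_temp
    FROM `my-project.telemetry.readings`
    GROUP BY vehicle_id
"""
for row in client.query(query):                     # iterating waits for the job
    print(row.vehicle_id, row.avg_temp)
```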

    Transforming Big Data

    Cloud Dataprep
    - visually explore, clean, and prepare data for analysis and ML, used by data analysts
- integrated partner service offered by Trifacta in conjunction with Google
- automatically detects schemas, data types, possible joins, and anomalies like missing values, outliers, and duplicates
    - interprets data transformation intent by user selection and predicts next transformation
    - transformation functions include
    - aggregation, pivot, unpivot, joins, union, extraction, calculation, comparison, condition, merge, and regex
    - works with CSV, JSON, or relational data from Cloud Storage, BigQuery, or upload
    - outputs to Dataflow, BigQuery, or exports to other file formats

    Cloud Dataproc (map reduce)
    - Zonal resource that manages Spark and Hadoop clusters for batch MapReduce processing
    - Can be scaled (up or down) while running jobs
    - Offers image versioning to switch between versions of Spark
    - Best for migrating existing Spark or Hadoop jobs to the cloud
    - Most VMs in cluster can be preemptible, but at least one node must be non-preemptible

    Cloud Dataflow (more recent approach)
    - Unified Data Processing
    - Serverless, fast, and cost-effective
    - Handles both batch and streaming data with one processing model (compared to one only in others)
    - Fully managed service, suitable for a wide variety of data processing patterns
    - Horizontal autoscaling with reliable, consistent, exactly-once processing
    - Based on open-source Apache Beam
- Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines
    - Use Beam SDK to build a program that defines a pipeline
    - Java, Python, Go
- Dataflow serves as a supported distributed processing backend for Apache Beam, executing pipelines defined with the Beam SDK
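
A minimal word-count sketch of Beam's unified model; the same pipeline runs locally or on Dataflow by switching the runner (the sample data is hypothetical):

```python
import apache_beam as beam

# With no options this uses the local DirectRunner; passing
# --runner=DataflowRunner (plus project/region options) runs it on Dataflow
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["alpha beta", "beta gamma"])
        | "Split" >> beam.FlatMap(str.split)
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```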

    Choosing the right tool
    ![Screen Shot 2022-07-25 at 9 14 55 AM](https://user-images.githubusercontent.com/5553105/180815541-07c5c8b5-1a02-4f7f-ad09-f3c80f50d971.png)

    Summary
    - Cloud IoT Core - global, fully-managed service to connect, manage and ingest data from Internet-connected devices and a primary source for streaming data
    - Cloud Pub/Sub - global messaging and ingestion service that supports both push and pull modes with exactly-once-processing for many GCP services
    - Cloud BigQuery - serverless, multi-regional, multi-cloud, SQL column-store data warehouse used for data analytics and ML capable of scaling to petabytes in minutes
    - GCP has a number of big data processing services
    - Cloud Dataprep for visually preparing data
    - Cloud Dataproc for working with Spark and Hadoop-based workloads
    - Cloud Dataflow for both batch and streaming data with one processing model

    ## Compute Services
    Overview
    - Configurable - Fully Managed (know which 3 are serverless)
    ![Screen Shot 2022-07-19 at 11 09 09 AM](https://user-images.githubusercontent.com/5553105/179821140-9b53a1a9-4981-4ffc-9de1-52e054c203bc.png)

    Compute Engine
    - fast-booting VMs
    ---

    ## Choosing the correct compute option


    ![Screen Shot 2022-07-19 at 11 30 19 AM](https://user-images.githubusercontent.com/5553105/179820925-7182394d-845b-4173-b96d-5c844369ab83.png)
    ![Screen Shot 2022-07-19 at 11 30 34 AM](https://user-images.githubusercontent.com/5553105/179821020-46b961da-b6d1-48cf-931b-383fd9d500dc.png)

    Summary
    - Mobile apps: `Firebase`
- event-driven functions: `Cloud Functions`
- specific OS or kernel: `Compute Engine`
    ---

    ## Compute autoscaling comparison
    Comparison table
    ![Screen Shot 2022-07-19 at 11 49 18 AM](https://user-images.githubusercontent.com/5553105/179821198-d912725c-f2c2-4ccb-bb27-b98422d9cc73.png)

    Summary
    - when working with Compute Engine, remember that MIGs coupled with Cloud Load Balancer results in faster autoscaling response
    # Case Studies


    ---

    ## EHR Healthcare

    Who is EHR Healthcare - leading provider of EHR software to medical industry (SaaS to multi-national medical offices, hospitals, and insurance providers)

    ---

    ## Helicopter Racing League

    Who is Helicopter Racing League - HRL is a global sports league for competitive helicopter racing. Each year HRL holds the world championship and several regional league competitions where teams compete to earn a spot on the world championship. HRL offers a paid service to stream the races all over the world with live telemetry and predictions throughout the race.
    - Global (covering lot of territory w/ lots of regional focus); cater to entire globe at one time, but also break down to smaller targeted services; commercial enterprise so uptime is important; gathering a lot of data in real time and analyzing and forecasting with it.

    Primary concerns
    - Migrate to new platform
    - Expand use of AI and ML
    - Fans in emerging regions
    - Move service of content, real-time and recorded
    - Closer to viewers to keep latency down

    Lay of the land
    - Already in the cloud (unnamed)
    - Existing content stored in Object Storage service on cloud
    - Video recording and editing handled at race tracks
    - VMs for every job handle Video Encode/Transcode in cloud
    - TensorFlow predictions run on other VMs in cloud

    Business requirements
    - Expose the predictive models to partners (API and private connectivity)
    - Increase predictive capabilities during and before races
    - Increase telemetry and create additional insights (enhance experience)
    - Measure fan engagement and new predictions
    - Enhance global availability and quality of broadcasts
    - Increase the number of concurrent viewers (streaming capacity increase)
    - Minimize operational complexity (standardize)
    - Ensure compliance with regulations
    - Create a merchandising revenue stream (e-comm app or connection to one)

    Technical requirements
    - Maintain or increase prediction throughput and accuracy (ramp up efficiency)
    - Reduce viewer latency (get content closer to viewers)
- Increase transcoding performance (vertically scale up VMs)
    - Create real-time analytics of viewer consumption patterns and engagement (streaming data and pipeline)
    - Create data mart to enable processing of large volumes of race data (batch data)

    Big picture (exec statement)
    Our CEO, S. Hawke, wants to bring high-adrenaline racing to fans all around the world. We listen to our fans, and they want enhanced video streams that include predictions of events within the race (e.g., overtaking).
- Global, ramped up graphic processing, heavily data dependent and may include video analysis

    Our current platform allows us to predict race outcomes but lacks the facility to support real-time predictions during races and the capacity to process season-long results.
    - Streaming data analysis, batch analysis

    Key takeaways:
    - emphasizes numerous scenarios involving data predictions and forecasts that would entail significant use of AI and ML

    ---

    ## Mountkirk Games

Who is Mountkirk Games - makes online, session-based, multiplayer games for mobile platforms. They have recently started expanding to other platforms after successfully migrating their on-premises environments to Google Cloud. Their most recent endeavor is to create a retro-style first-person shooter (FPS) game that allows hundreds of simultaneous players to join a geo-specific digital arena from multiple platforms and locations. A real-time digital banner will display a global leaderboard of all the top players across every active arena.

    Primary concerns
    - Building a new multiplayer game
    - Want to use GKE
    - Use global load balancer to keep latency down
    - Keep global leader board in sync (streaming data)
    - Willing to use Cloud Spanner as their database engine

    Lay of the land
    - Recently lift & shift 5 games to GCP
    - Each game in own project under one folder (most permissions and network policies)
    - Some legacy games with little traffic consolidated to single project
    - Separate environments for development and testing

    Business requirements
    - Support multiple gaming platforms (from mobile only to multiple platforms)
    - Support multiple regions (protect data and diff compliance regs)
    - Support rapid iteration of game features (CICD)
    - Minimize latency
    - Optimize for dynamic scaling
    - Use managed services and pooled resources (standardization)
    - Minimize costs

    Technical requirements
    - Dynamically scale based on game activity
    - Publish scoring data on near real-time global leaderboard
    - Store game activity logs in structured files for future analysis
    - Use GPU processing to render graphics server-side for multi-platform support
    - Support eventual migration of legacy games to this new platform

    Big picture (exec statement)
    Our last game was the first time we used Google Cloud and it was a success. We were able to analyze player behavior and game telemetry in ways that we never could before. This success allowed us to bet on a full migration to the cloud and to start building all new games using cloud native design principles.
    - See advantage reviewing user actions and game responses; going completely cloud native

    Our new game is our most ambitious to date and will open doors for us to support more gaming platforms beyond mobile. Latency is our top priority, although cost management is the next most important challenge.
    - Higher performance; lower cost

    As with our first cloud-based game, we have grown to expect the cloud to enable advanced analytics capabilities so we can rapidly iterate on our deployments of bug fixes and new functionality.
    - Double down on analytical approach that gave them an edge; invest in Cloud Spanner to achieve goals

    Key takeaways
    - Wants to expand reach to other gaming platforms and other regions of the world

    ---

    ## TerramEarth

    Who is TerramEarth - manufactures heavy equipment for the mining and agriculture industries. They have over 500 dealers and service centers in 100 countries. Their mission is to build products that make their customers more productive.
    - Sophisticated earth-moving equipment; solid network; customer focused

    Primary concerns
    - 2 million TE vehicles in operation
    - Collect telemetry data from many sensors (IoT)
    - Subset of critical data in real time
    - Rest of data collected, compressed, and uploaded daily
    - 200-500MB of data per vehicle per day (1 PB each day)
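
    A quick sanity check on that figure (a rough upper-bound estimate, not from the case study itself):

    ```latex
    2{,}000{,}000 \text{ vehicles} \times 500 \text{ MB/vehicle/day} = 10^{15} \text{ bytes/day} = 1 \text{ PB/day}
    ```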

    Lay of the land
    - Infra in GCP serving clients all around the world (data gathering and analysis)
    - Private data center integration (2 main mfr plants sent to) with multiple Interconnects

    Business requirements
    - Predict and detect vehicle malfunction
    - Ship parts to dealerships for just-in-time repair with little/no downtime
    - Decrease cloud operational costs and adapt to seasonality
    - Increase speed and reliability of developer workflow (SRE)
    - Allow remote developers to be productive without compromising code or data security
    - Create flexible and scalable platform for custom API Services for dealers and partners (Apigee)

    Technical requirements
    - Create new abstraction layer for HTTP API access to legacy systems to enable a gradual migration without disrupting operations (API gateway)
    - Modernize all CI/CD pipelines to allow developers to deploy container-based workloads in highly scalable environments (GKE, Cloud Run, Cloud Build)
    - Allow developers to experiment without compromising security and governance (new test project)
    - Create a self-service portal for internal and partner developers to create new projects, request resources for data analytics jobs, and centrally manage access to the API endpoints (secure new web front end with access to spin up resources; network tags)
    - Use cloud-native solutions for keys and secrets management and optimize for identity-based access (IAM, Secret Manager, and KMS; see the sketch after this list)
    - Improve and standardize tools necessary for application and network monitoring and troubleshooting (Cloud Operations: Monitoring, Logging, Debugging)
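
    A minimal sketch of the keys-and-secrets requirement with gcloud; the secret name, value, and service account below are hypothetical, not from the case study:

    ```bash
    # Store a secret centrally in Secret Manager (hypothetical name/value)
    echo -n "s3cr3t-value" | gcloud secrets create db-password --data-file=-

    # Identity-based access: only this service account may read it
    gcloud secrets add-iam-policy-binding db-password \
      --member="serviceAccount:api-sa@my-project.iam.gserviceaccount.com" \
      --role="roles/secretmanager.secretAccessor"

    # Retrieve the latest version at runtime
    gcloud secrets versions access latest --secret=db-password
    ```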

    Big picture (exec statement)
    Our advantage has always been our focus on the customer, with our ability to provide excellent customer service and minimize vehicle downtime. After moving multiple systems to Google Cloud, we are seeking new ways to provide best-in-class online fleet management services to our customers and improve operations of our dealerships.
    - When customers are successful, so are they; keeping vehicles operational leads to success; always improving

    5-year strategic plan is to create a partner ecosystem of new products by enabling access to our data, increasing autonomous operation capabilities of our vehicles, and creating a path to move the remaining legacy systems to the cloud.
    - Moving physical and digital information daily

    Key takeaways
    - places great emphasis on customer and partner support which requires consistent and secure communication between systems and devices
    - after success of initial migration, TE seeks to expand their global integration without disrupting operations or regulations
    - company's equipment must be able to transmit and analyze a great deal of telemetry data to maintain high-performance levels and just-in-time repairs

    ---

    # Overall Principles

    ## Grasping Key Tech Fundamentals
    - Describing distributed systems
    - Core networking fundamentals
    - Applying HTTP/HTTPS
    - Understanding SRE principles
    ## Keeping in Compliance - follow spirit and letter of "the law"
    - Compliance with what?
    - Getting help with compliance
    - Relevant products and services
    ## Annotating Resources Properly
    - Understanding annotation options
    - Applying security marks
    - Working with labels
    - Implementing networking tags
    - Choosing the right annotation
    ## Managing Quotas & Costs
    - Working with quota limits
    - Cost optimization principles
    - Best practices (overall, compute, storage and data analysis)

    ---

    ## Key Fundamentals

    Distributed System - group of servers working together so as to appear as a single server to the end user
    - Scale Horizontally - increase capacity by adding more servers that work together
    - Scale Vertically - Increasing capacity by adding more memory or using a faster CPU
    - Sharding - Splitting server into multiple servers, a.k.a. "partitioning"

    Networking - be familiar with 7-layer OSI model
    - 7 Layer OSI model
    - Application - End user layer (human comp interaction): HTTP, FTP, IRC, SSH, DNS
    - Presentation - Syntax layer: SSL, SSH, IMAP, FTP, MPEG, JPEG
    - Session - Sync and send to port: APIs, Sockets, WinSock
    - Transport - End to end Connections: TCP, UDP
    - Network - Packets: IP, ICMP, IPSec, IGMP
    - Data Link - Frames: Ethernet, PPP, Switch, Bridge
    - Physical - coax, fiber, wireless, hubs, repeaters
    - TCP/IP - primary way data gets around the Internet
    - Handshaking with syn/ack
    - Addressing with IPv4 and IPv6
    - Public Internet and private RFC1918 addressing
    - SSL/TLS - encrypted comms
    - SSH - access disks
    - Ports
    - 80 - HTTP
    - 22 - SSH
    - 53 - DNS
    - 443 - HTTPS
    - 25 - SMTP
    - 3306 - MySQL

    Applying HTTP/HTTPS - works on L7 (Application Layer)
    - Understand your resources (URL/URI) and how parameters are applied
    - Know verbs: GET, POST, PUT, DELETE & PATCH, OPTIONS, TRACE, CONNECT
    - Have firm grasp of caching: headers and locations (browsers, proxies, CDN, memory cache)
    - Be familiar with CORS
    - HTTP/HTTPS status codes
    - 100 Information
    - 100 - Continue
    - 101 - Switching protocol
    - 200 Successful response
    - 200 - Okay
    - 201 - Create
    - 202 - Accepted
    - 204 - No content
    - 206 - Partial content
    - 300 Redirection
    - 301 - Moved permanently
    - 304 - Not modified (caching)
    - 307 - Temporary redirect
    - 308 - Permanent redirect
    - 400 Client Errors
    - 400 - Bad request
    - 401 - Unauthorized
    - 403 - Forbidden
    - 408 - Request timeout
    - 429 - Too many requests
    - 500 Server Error
    - 500 - Internal server error
    - 501 - Not implemented
    - 502 - Bad gateway
    - 503 - Service unavailable / quota exceeded
    - 504 - Gateway timeout
    - 511 - Network authentication required
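
    A quick way to see status codes and caching headers in practice (example.com is just a stand-in host):

    ```bash
    # -s silences progress, -I fetches headers only (HEAD request)
    curl -sI https://example.com
    # Look for the status line (e.g. HTTP/2 200) and caching headers
    # such as Cache-Control, ETag, and Age in the response
    ```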

    Understanding SRE Principles - What happens when a software engineer is tasked with what used to be called operations (Ben Traynor ~ 2003)
    - SLI - Service Level Indicator (carefully defined quantitative measure of level of service provided over time)
    - Request latency - how long to return a response to a request
    - Failure rate - fraction of all requests received that fail
    - Batch throughput - proportion of time that data processing rate > threshold set
    - SLO - Service Level Objective (specify target level for reliability of service)
    - 100% is unrealistic, more expensive, and often not necessary for users; best to find where they don't notice a difference, so more resources are focused on the value-add of the service
    - SLA - contractual obligation
    - includes consequences of meeting or missing SLOs it contains
    - SLIs drive SLOs, which inform SLAs
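
    A worked example of how an SLO translates into an error budget (assuming a 30-day month):

    ```latex
    \text{error budget} = (1 - \text{SLO}) \times T
      = (1 - 0.999) \times 30 \times 24 \times 60 \text{ min}
      = 0.001 \times 43{,}200 \text{ min} = 43.2 \text{ min of allowed downtime per month}
    ```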

    ---

    ## Compliance

    Compliance with what
    - Legislation - targeted areas (health regs, privacy, children's privacy, ownership)
    - Commercial - protect sensitive data, credit cards / PII
    - Industry certifications - ensure following health, safety, and environmental regulations
    - Audits - create necessary structure to allow for 3rd-party audits

    Getting help with compliance
    - Visit the Compliance Center - sortable by region, industry, and focus area
    - General Data Protection Regulations (GDPR) - continue to have major impact on web services around the world
    - BAA - Google business associate agreement (customer must request BAA from account manager for HIPAA compliance)

    Relevant products and services
    - 2-factor authentication
    - Cloud Security Command Center (CSCC)
    - Cloud IAM (global across all Google Cloud)
    - Cloud Logging
    - Cloud DLP (de-identification routines to protect PII)
    - Cloud Monitoring (surface compliance missteps / alerts in real time)

    ---

    ## Annotations

    Understanding annotations
    - Security Marks - assigned and utilized through Cloud Security Command Center (CSCC)
    - Labels - key-value pairs that help you organize cloud resources
    - Network tags - applied to VM instances used for routing traffic to/fro

    Applying security marks
    - Adds business context to assets for compliance
    - Enhanced security focused insights into resources
    - Unique to CSCC
    - Set at org, project, or individually
    - Works with labels and network tags

    Working with labels
    - Key-value pairs supported by a wide range of GCP resources
    - Used for many scenarios
    - Identify individual teams or cost center resources
    - Distinguish deployment environments
    - Cost allocation and billing breakdowns
    - Monitor resource groups for metadata
    - Labels to projects, but NOT folders
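
    A minimal gcloud sketch for labels; the VM name, zone, and label keys are hypothetical:

    ```bash
    # Attach cost/ownership labels to an existing VM
    gcloud compute instances update my-vm \
      --zone=us-central1-a \
      --update-labels=env=dev,team=payments,cost-center=cc123

    # Filter resources by label, e.g. for billing breakdowns
    gcloud compute instances list --filter="labels.env=dev"
    ```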

    Implementing network tags
    - Control traffic to/from VM instances
    - Identify VM instances subject to firewall rules and network routes
    - Use tags as source and destination values in firewall rules
    - Identify instances on a certain route
    - Configured with gcloud, console, or API
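
    A short sketch of tag-based traffic control; instance and rule names are hypothetical:

    ```bash
    # Tag an instance, then target tagged instances with a firewall rule
    gcloud compute instances add-tags web-1 --zone=us-central1-a --tags=web

    gcloud compute firewall-rules create allow-http \
      --network=default \
      --allow=tcp:80 \
      --source-ranges=0.0.0.0/0 \
      --target-tags=web
    ```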

    Choosing right annotation
    - Need to group/classify for compliance?
    - Yes : use Security Marks
    - No : Need billing breakdown?
    - Yes : use Labels
    - No : Need to manage network traffic to/from VMs?
    - Yes : use Network Tags

    ---

    ## Managing Quotas & Costs

    Working within quota limits - restrict how much of a shared GCP resource you can use
    - Not to be confused with fixed constraints, which cannot be increased or decreased (i.e. max file size, database schema limits)
    - Two types of quotas:
    - Rate quotas - limit number of API or service requests
    - Allocation quotas - restrict the resource available at any one time
    - Limits are specific to your org
    - Add your own limits to impose spending limits
    - Exceeded quotas can generate a quota error and 503 status for HTTP requests
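
    Current usage against allocation quotas can be inspected per region; a minimal sketch:

    ```bash
    # Show regional quotas (CPUs, disks, in-use addresses, ...) vs. usage
    gcloud compute regions describe us-central1 --format="yaml(quotas)"
    ```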

    Cost optimization principles
    - Understand the total cost of ownership (TCO)
    - Commonly misunderstood when moving from on-prem (CapEx) model to cloud-based (OpEx)
    - Organize costs in relation to business needs
    - Maximize value of all expenses while eliminating waste
    - Implement standardized processes at the start

    Best practices: use cost management tools

    - Organize and Structure - set up folders, projects, and use labels to structure costs in relation to business needs
    - Billing Reports - view costs and analyze trends and filter as needed
    - Custom dashboards - can also export to BigQuery, then visualize in Cloud Data Studio

    - Compute - pay for the compute you need
    - Identify idle VMs
    - use Idle VM recommender service to identify inactive VMs
    - Snapshot them before deleting
    - Stop without deleting
    - Start/stop VMs automatically or via Cloud Functions
    - Create custom VMs with right size CPUs and memory
    - Make the most of preemptible/spot VMs (often is an option - consider it for exam)
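
    A sketch of the preemptible/Spot option; instance name, zone, and machine type are hypothetical:

    ```bash
    # Preemptible VM: heavily discounted, can be reclaimed at any time,
    # so best suited to fault-tolerant batch work
    gcloud compute instances create batch-worker-1 \
      --zone=us-central1-a \
      --machine-type=e2-medium \
      --preemptible
    ```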

    - Cloud Storage - ways to keep more of your company's hard-earned money
    - Choose right storage class: nearline 30, coldline 90, archive
    - Modify storage class as needed with lifecycle policies
    - Deduplicate data wherever possible (i.e. Cloud Dataflow)
    - Choose multi-region rather than single region buckets where viable
    - Set object versioning policies to keep copies down (i.e. delete oldest after 2 versions)
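
    A minimal lifecycle-policy sketch matching the nearline-30 / coldline-90 rule of thumb above (bucket name hypothetical):

    ```bash
    cat > lifecycle.json <<'EOF'
    {
      "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}}
      ]
    }
    EOF
    gsutil lifecycle set lifecycle.json gs://my-bucket
    ```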

    - Keep BigQuery from BigCosts
    - Limit query costs with the maximum bytes billed setting
    - Partition tables based on ingestion time, date/timestamp, or integer range column
    - Switch from on-demand to flat rate pricing to process unlimited bytes for fixed predictable cost
    - Combine Flex Slots (like preemptible) with annual and monthly commitments (blended)
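
    One way the maximum-bytes-billed cap looks in practice; project, dataset, and table are hypothetical:

    ```bash
    # The query fails fast instead of billing if it would scan more than ~1 GB
    bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000 \
      'SELECT COUNT(*) FROM `my-project.my_dataset.events`'
    ```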

    # Architecting for the cloud

    - Architect solutions to be scalable and resilient
    - Business requirements involve lowering costs / enhancing user experience
    - Keep an eye on technical needs during development and operation

    ---

    # 3 Major Questions To Ask

    1. Where is the company coming from
    - business, technical, personnel

    2. Where is the company going to
    - on GCP, hybrid, multi-cloud / regional, national, global

    3. What's next
    - allow for future changes

    ---

    ## Key Data Lifecycle Steps (4)
    1. Ingest - pull in raw data via streaming, batch, or app processes
    2. Store - keep the retrieved data in a durable and accessible environment
    3. Process/Analyze - transform the data into actionable information
    @@ -33,58 +27,58 @@ Key Data Lifecycle Steps (4)
    1. Ingesting Data (11 services)

    Streaming
    - Cloud Pub/Sub - messaging middleware system
    Batch
    - Cloud Storage - object storage in buckets
    - Storage Transfer Service - move data from one place to another
    - BigQuery Transfer Service - move structured data from one place to another
    - Storage Transfer Appliance - move very large amounts of data (physical to cloud)
    Application
    - Cloud Logging - outputs
    - Cloud Pub/Sub -
    - Cloud SQL - structured data
    - Cloud Firestore - serverless document data for NoSQL data
    - Cloud Bigtable - large amounts of NoSQL data
    - Cloud Spanner - fully managed relational database for structured SQL data
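
    Minimal wiring for the streaming-ingest path above; topic and subscription names are hypothetical:

    ```bash
    gcloud pubsub topics create game-events
    gcloud pubsub subscriptions create game-events-sub --topic=game-events

    # Publish a test message, then pull it
    gcloud pubsub topics publish game-events --message='{"player":"p1","score":42}'
    gcloud pubsub subscriptions pull game-events-sub --auto-ack --limit=1
    ```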

    ---

    ### Storing Data

    Objects
    - Cloud Storage
    - Cloud Storage for Firebase - mostly mobile / web apps with some overlap
    Databases
    - Cloud SQL - relational DB for MySQL, Postgres, SQL Server
    - Cloud Spanner - large distributed SQL
    - Cloud Bigtable - large NoSQL
    - Cloud Firestore - serverless NoSQL
    Warehouse
    - BigQuery - serverless highly-scalable multi-cloud data warehouse

    ---

    ### Processing and Analyzing Data
    - big data, ETL pipelines, machine learning

    Compute
    - Compute Engine - virtual compute machines
    - Kubernetes Engine - orchestration of containerized workloads
    - App Engine - quickly get apps up and running
    Large-Scale
    - Cloud Dataproc - modern data lake, ETL, (hadoop, spark, flink, presto, + 30 tools/frameworks)
    - Cloud Dataflow - based on Apache Beam
    - Cloud Dataprep - intelligent cloud data service to visually explore, clean, and prepare for analysis/ML
    Analysis
    - BigQuery - analyze petabytes of data at incredible speeds with zero operational overhead

    ---

    ### Exploring and Visualizing Data

    Science
    - Cloud Datalab - uses jupyter notebooks to interact and visualize data
    Visualizing
    - BigQuery BI - business intelligence functionality for BQ
    - Cloud Data Studio - can be utilized by a host of data sources
    - Looker - front-end enterprise platform for BI, apps, embedded data analytics

    ## Key points:
    - 4 phases: ingest, store, processed/analyzed, explored and visualized
    - Data ingested via streaming, batch, or application processes
    - Data structure can change, depending on its source and destination
    - Google offers a wide range of services to manage data in every phase of its lifecycle
    ---

    # Processing Data

    ## Review Compute Services
    - Overview of compute services
    - Compute Engine (VMs)
    - Kubernetes Engine (GKE)
    - App Engine (Standard, Flexible)
    - Cloud Run
    - Cloud Functions
    ## Choosing the correct compute option
    - Mobile or No?
    - Event-driven functions
    - Specific OS?
    - Setting up PaaS Apps?
    - Meeting Raw Computing Needs
    - Integrating Containers
    ## Compute Autoscaling Comparison
    - Intro
    - Setting up the comparisons
    - Compute Engine
    - Kubernetes Engine
    - App Engine
    - How fast does Cloud Run scale?
    - Cloud Functions
    ## Evolving the Cloud with AI and ML services

    ## Handling Big Data and IoT

    ---

    ## Compute Services
    Overview
    - Configurable - Fully Managed (know which 3 are serverless)

    Compute Engine
    - fast-booting VMs
    - highly configurable, zonal service
    - choose machine types: general purpose, compute-optimized, memory-optimized, processor-optimized (GPU)
    - select public or private disk image
    - options include preemptible (or spot)
    - also good to know about sole-tenant nodes (BYOL / dedicated hardware requirements), instance groups (MIG/UIG)

    Kubernetes Engine (GKE)
    Container orchestration system with clusters, node pools, and control plane
    - regional, managed container service
    - standard (total control), autopilot (fully managed)
    - supports auto repair and auto upgrade
    - know the following (see the kubectl sketch below):
    - kubectl syntax
    - private clusters (VPC native w/ RFC1918 IP addresses)
    - how to deploy, scale, expose services
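
    A minimal deploy/scale/expose sketch, assuming a hypothetical image in Container Registry:

    ```bash
    kubectl create deployment web --image=gcr.io/my-project/web:v1
    kubectl scale deployment web --replicas=3
    kubectl expose deployment web --type=LoadBalancer --port=80 --target-port=8080
    ```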

    App Engine
    Oldest of all GCP services and comes in 2 versions: Standard, Flexible
    - Standard
    - regional, platform as a service for serverless apps
    - zero server mgmt and config
    - instantaneous scaling, down to zero VMs
    - features
    - second gen runtimes: python 3, java 11, nodejs, php 7, ruby, go 1.12+
    - 1st gen is limited
    - Flex
    - for containerized apps
    - zero server mgmt and config
    - best for apps with consistent traffic, gradual scaling is acceptable
    - robust runtimes
    - python 2.7/3.6, java 8, nodejs, php 5/7, ruby, go, .net

    Cloud Run
    - great for modern websites, REST APIs, back-end office admin
    - regional, fully managed serverless service for containers
    - integrated support for cloud operations
    - built on Knative open-source standards for easy portability
    - supports any language, library, or binary
    - scales from zero and back in an instant
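
    Deploying to Cloud Run is a single command; image, region, and service name here are hypothetical:

    ```bash
    gcloud run deploy web \
      --image=gcr.io/my-project/web:v1 \
      --region=us-central1 \
      --allow-unauthenticated  # public endpoint; omit to require IAM-authenticated callers
    ```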

    Cloud Functions
    - regional, event-driven, serverless functions as a service (FaaS)
    - triggers
    - HTTP
    - Cloud Storage
    - Cloud Pub/Sub
    - Cloud Firestore
    - Audit Logs
    - Cloud Scheduler
    - totally serverless
    - automatic horizontal scaling
    - networks well with hybrid and multi-cloud
    - acts as glue between services
    - great for streaming data and IoT apps
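
    A sketch of an event-driven function wired to a Pub/Sub trigger; names and runtime are hypothetical:

    ```bash
    gcloud functions deploy on-game-event \
      --runtime=python310 \
      --trigger-topic=game-events \
      --entry-point=handle_event
    ```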

    ---

    ## Choosing the correct compute option
    Flowcharts
    - Mobile apps: `Firebase`
    - event-driven functions: `Cloud Functions`
    - specific OS or kernel: `Compute Engine`
    - no hybrid or multi-cloud: `App Engine Standard` (rapid scale) or `Flex`
    - containers: `Cloud Run` or `Kubernetes Engine`

    ---

    ## Compute autoscaling comparison
    Comparison table

    Summary
    - when working with Compute Engine, remember that MIGs coupled with Cloud Load Balancer result in faster autoscaling response
    - for HA, Kubernetes Engine node pool is best used with minimum of 3 nodes in production
    - Cloud Run scales almost as fast as App Engine Standard, and you are only charged when a request is made
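
    The MIG autoscaling mentioned above, as a gcloud sketch with hypothetical names and thresholds:

    ```bash
    gcloud compute instance-groups managed set-autoscaling web-mig \
      --zone=us-central1-a \
      --min-num-replicas=2 \
      --max-num-replicas=10 \
      --target-cpu-utilization=0.6  # add replicas above ~60% average CPU
    ```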

    ---

    ## Evolving the Cloud with AI and ML services
  22. mikesparr revised this gist Jul 19, 2022. 1 changed file with 9 additions and 5 deletions.
    14 changes: 9 additions & 5 deletions 03-case-studies.md
    Original file line number Diff line number Diff line change
    @@ -1,27 +1,31 @@
    # Case Studies

    EHR Healthcare - global company with electronic health record org
    ## EHR Healthcare
    Global company with electronic health record org
    - Who is EHR Healthcare?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - EHR business requirements
    - EHR technical requirements
    - Big picture (executive statement)
    Helicopter Racing League - strong emphasis on streaming content and predictions
    ## Helicopter Racing League
    Strong emphasis on streaming content and predictions
    - Who is Helicopter Racing League?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - HRL business requirements
    - HRL technical requirements
    - Big picture (executive statement)
    Mountkirk Games - make believe leader to mobile gaming world; looking to expand on other platforms
    ## Mountkirk Games
    Make believe leader to mobile gaming world; looking to expand on other platforms
    - Who is Mountkirk Games (2.0 - years later)?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - MG business requirements
    - MG technical requirements
    - Big picture (executive statement)
    TerramEarth - mining and ag equipment and sends "dump trucks of data" to GCP and share w/ partners and dealers
    ## TerramEarth
    Mining and ag equipment and sends "dump trucks of data" to GCP and share w/ partners and dealers
    - Who is TerramEarth?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    @@ -31,7 +35,7 @@ TerramEarth - mining and ag equipment and sends "dump trucks of data" to GCP and

    ---

    EHR Healthcare
    ## EHR Healthcare

    Who is EHR Healthcare - leading provider of EHR software to medical industry (SaaS to multi-national medical offices, hospitals, and insurance providers)
    - Big company, medical industry, multi-national (regulations), hospitals/insurance (HIPAA)
  23. mikesparr revised this gist Jul 19, 2022. 1 changed file with 56 additions and 56 deletions.
    112 changes: 56 additions & 56 deletions 03-case-studies.md
    Original file line number Diff line number Diff line change
    @@ -1,83 +1,83 @@
    # Case Studies

    EHR Healthcare - global company with electronic health record org
    Who is EHR Healthcare?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    EHR business requirements
    EHR technical requirements
    Big picture (executive statement)
    - Who is EHR Healthcare?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - EHR business requirements
    - EHR technical requirements
    - Big picture (executive statement)
    Helicopter Racing League - strong emphasis on streaming content and predictions
    Who is Helicopter Racing League?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    HRL business requirements
    HRL technical requirements
    Big picture (executive statement)
    - Who is Helicopter Racing League?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - HRL business requirements
    - HRL technical requirements
    - Big picture (executive statement)
    Mountkirk Games - make believe leader to mobile gaming world; looking to expand on other platforms
    Who is Mountkirk Games (2.0 - years later)?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    MG business requirements
    MG technical requirements
    Big picture (executive statement)
    - Who is Mountkirk Games (2.0 - years later)?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - MG business requirements
    - MG technical requirements
    - Big picture (executive statement)
    TerramEarth - mining and ag equipment and sends "dump trucks of data" to GCP and share w/ partners and dealers
    Who is TerramEarth?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    TE business requirements
    TE technical requirements
    Big picture (executive statement)
    - Who is TerramEarth?
    - Primary concerns (solution concept)
    - Lay of the land (existing tech environment)
    - TE business requirements
    - TE technical requirements
    - Big picture (executive statement)

    ---

    EHR Healthcare

    Who is EHR Healthcare - leading provider of EHR software to medical industry (SaaS to multi-national medical offices, hospitals, and insurance providers)
    Big company, medical industry, multi-national (regulations), hospitals/insurance (HIPAA)
    - Big company, medical industry, multi-national (regulations), hospitals/insurance (HIPAA)

    Primary concerns
    Growing exponentially
    Scaling their environment
    Disaster recovery plan
    New continuous deployment
    Replace colocation facilities with GCP
    - Growing exponentially
    - Scaling their environment
    - Disaster recovery plan
    - New continuous deployment
    - Replace colocation facilities with GCP

    Lay of the land (existing tech)
    Multiple colocation facilities; lease on one about to expire
    Apps are in containers; candidate for Kubernetes
    MySQL, MSSQL, Redis, Mongo DB
    Legacy integrations (no current plan to move short term)
    Users managed by Microsoft AD; monitoring via open source; email alerts often ignored
    - Multiple colocation facilities; lease on one about to expire
    - Apps are in containers; candidate for Kubernetes
    - MySQL, MSSQL, Redis, Mongo DB
    - Legacy integrations (no current plan to move short term)
    - Users managed by Microsoft AD; monitoring via open source; email alerts often ignored

    Business requirements
    Onboard new insurance providers ASAP
    Minimum 99.9% availability for customer apps
    Centralize visibility, proactive performance and usage
    Provide insights into healthcare trends (AI platform)
    Reduce latency for all customers
    Maintain regulatory compliance
    Decrease infra administration costs (can be handled through cloud computing)
    Make predictions and generate reports on industry trends based on provider data (models from external data sources)
    - Onboard new insurance providers ASAP
    - Minimum 99.9% availability for customer apps
    - Centralize visibility, proactive performance and usage
    - Provide insights into healthcare trends (AI platform)
    - Reduce latency for all customers
    - Maintain regulatory compliance
    - Decrease infra administration costs (can be handled through cloud computing)
    - Make predictions and generate reports on industry trends based on provider data (models from external data sources)

    Technical requirements
    Maintain legacy interfaces to insurance providers for both on-premisis systems and cloud providers
    Provide a consisten way to manage customer-facing, container-based applications (Anthos GKE)
    Security and high-perf connection between on-premises systems and GCP
    Consistent logging, log retention, monitoring, and alerting capabilities
    Maintain and managed multiple container-based environments
    Dynamically scale and provision new environments
    Create interfaces to ingest and process data from new providers (Dataproc or Dataflow)
    - Maintain legacy interfaces to insurance providers for both on-premisis systems and cloud providers
    - Provide a consisten way to manage customer-facing, container-based applications (Anthos GKE)
    - Security and high-perf connection between on-premises systems and GCP
    - Consistent logging, log retention, monitoring, and alerting capabilities
    - Maintain and managed multiple container-based environments
    - Dynamically scale and provision new environments
    - Create interfaces to ingest and process data from new providers (Dataproc or Dataflow)

    Big picture (exec statement)
    Our on-prem strategy has worked for years but has required major investment of time and money in training our team on distinctly different systems, managing similar, but separate environments, and responding to outages.
    CapEx and OpEx way too high (too many diverse systems increasing mgmt and training costs)
    - Our on-prem strategy has worked for years but has required major investment of time and money in training our team on distinctly different systems, managing similar, but separate environments, and responding to outages.
    - CapEx and OpEx way too high (too many diverse systems increasing mgmt and training costs)

    Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsisten monitoring practices.
    Too old or broken to deal with customer load; off/on monitoring; seeking change
    - Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsisten monitoring practices.
    - Too old or broken to deal with customer load; off/on monitoring; seeking change

    We want to use Google Cloud to leverage a scalable and resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.
    They see light at end of tunnel which is Google Cloud, capable of handling legacy to modern
    - We want to use Google Cloud to leverage a scalable and resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.
    - They see light at end of tunnel which is Google Cloud, capable of handling legacy to modern

    Key takeaways:
    - governance and compliance play signicant role
  24. mikesparr revised this gist Jul 18, 2022. 3 changed files with 0 additions and 0 deletions.
    File renamed without changes.
  25. mikesparr revised this gist Jul 18, 2022. 1 changed file with 233 additions and 0 deletions.
    233 changes: 233 additions & 0 deletions case-studies.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,233 @@
    # Case Studies

    EHR Healthcare - global company with electronic health record org
    Who is EHR Healthcare?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    EHR business requirements
    EHR technical requirements
    Big picture (executive statement)
    Helicopter Racing League - strong emphasis on streaming content and predictions
    Who is Helicopter Racing League?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    HRL business requirements
    HRL technical requirements
    Big picture (executive statement)
    Mountkirk Games - make believe leader to mobile gaming world; looking to expand on other platforms
    Who is Mountkirk Games (2.0 - years later)?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    MG business requirements
    MG technical requirements
    Big picture (executive statement)
    TerramEarth - mining and ag equipment and sends "dump trucks of data" to GCP and share w/ partners and dealers
    Who is TerramEarth?
    Primary concerns (solution concept)
    Lay of the land (existing tech environment)
    TE business requirements
    TE technical requirements
    Big picture (executive statement)

    ---

    EHR Healthcare

    Who is EHR Healthcare - leading provider of EHR software to medical industry (SaaS to multi-national medical offices, hospitals, and insurance providers)
    Big company, medical industry, multi-national (regulations), hospitals/insurance (HIPAA)

    Primary concerns
    Growing exponentially
    Scaling their environment
    Disaster recovery plan
    New continuous deployment
    Replace colocation facilities with GCP

    Lay of the land (existing tech)
    Multiple colocation facilities; lease on one about to expire
    Apps are in containers; candidate for Kubernetes
    MySQL, MSSQL, Redis, Mongo DB
    Legacy integrations (no current plan to move short term)
    Users managed by Microsoft AD; monitoring via open source; email alerts often ignored

    Business requirements
    Onboard new insurance providers ASAP
    Minimum 99.9% availability for customer apps
    Centralize visibility, proactive performance and usage
    Provide insights into healthcare trends (AI platform)
    Reduce latency for all customers
    Maintain regulatory compliance
    Decrease infra administration costs (can be handled through cloud computing)
    Make predictions and generate reports on industry trends based on provider data (models from external data sources)

    Technical requirements
    Maintain legacy interfaces to insurance providers for both on-premisis systems and cloud providers
    Provide a consisten way to manage customer-facing, container-based applications (Anthos GKE)
    Security and high-perf connection between on-premises systems and GCP
    Consistent logging, log retention, monitoring, and alerting capabilities
    Maintain and managed multiple container-based environments
    Dynamically scale and provision new environments
    Create interfaces to ingest and process data from new providers (Dataproc or Dataflow)

    Big picture (exec statement)
    Our on-prem strategy has worked for years but has required major investment of time and money in training our team on distinctly different systems, managing similar, but separate environments, and responding to outages.
    CapEx and OpEx way too high (too many diverse systems increasing mgmt and training costs)

    Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsisten monitoring practices.
    Too old or broken to deal with customer load; off/on monitoring; seeking change

    We want to use Google Cloud to leverage a scalable and resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.
    They see light at end of tunnel which is Google Cloud, capable of handling legacy to modern

    Key takeaways:
    - governance and compliance play signicant role
    - while dedicated to cloud computing, must maintain legacy integrations and high speed connections between GCP and on-prem
    - attention to security concerns is strong thread, containers and protecting patient data

    ---

    Helicopter Racing League

    Who is Helicopter Racing League - HRL is a global sports league for competitive helicopter racing. Each year HRL holds the world championship and several regional league competitions where teams compete to earn a spot on the world championship. HRL offers a paid service to stream the races all over the world with live telemetry and predictions throughout the race.
    Global (covering lot of territory w/ lots of regional focus); cater to entire globe at one time, but also break down to smaller targeted services; commercial enterprise so uptime is important; gathering a lot of data in real time and analyzing and forecasting with it.

    Primary concerns
    Migrate to new platform
    Expand use of AI and ML
    Fans in emerging regions
    Move service of content, real-time and recorded
    Closer to viewers to keep latency down

    Lay of the land
    Already in the cloud (unnamed)
    Existing content stored in Object Storage service on cloud
    Video recording and editing handled at race tracks
    VMs for every job handle Video Encode/Transcode in cloud
    TensorFlow predictions run on other VMs in cloud

    Business requirements
    Expose the predictive models to partners (API and private connectivity)
    Increase predictive capabilities during and before races
    Increase telemetry and create additional insights (enhance experience)
    Measure fan engagement and new predictions
    Enhance global availability and quality of broadcasts
    Increase the number of concurrent viewers (streaming capacity increase)
    Minimize operational complexity (standardize)
    Ensure compliance with regulations
    Create a merchandising revenue stream (e-comm app or connection to one)

    Technical requirements
    Maintain or increase prediction throughput and accuracy (ramp up efficiency)
    Reduce viewer latency (get content closer to viewers)
    Increate transcoding performance (vertically scale up VMs)
    Create real-time analytics of viewer consumption patterns and engagement (streaming data and pipeline)
    Create data mart to enable processing of large volumes of race data (batch data)

    Big picture (exec statement)
    Our CEO, S. Hawke, wants to bring high-adrenaline racing to fans all around the world. We listen to our fans, and they want enhanced video streams that include predictions of events within the race (e.g., overtaking).
    Global, ramped-up graphics processing, heavily data-dependent and may include video analysis

    Our current platform allows us to predict race outcomes but lacks the facility to support real-time predictions during races and the capacity to process season-long results.
    Streaming data analysis, batch analysis

    Key takeaways:
    - emphasizes numerous scenarios involving data predictions and forecasts that would entail significant use of AI and ML
    - global org and intent on extending their reach and market while maintaining high quality and low latency
    - HRL must process a tremendous amount of data in near real-time and output the results worldwide to specific regions

    ---

    Mountkirk Games

    Who is Mountkirk Games - makes online, session-based, multiplayer games for mobile platforms. They have recently started expanding to other platforms after successfully migrating their on-premises environments to Google Cloud. Their most recent endeavor is to create a retro-style first-person shooter (FPS) game that allows hundreds of simultaneous players to join a geo-specific digital arena from multiple platforms and locations. A real-time digital banner will display a global leaderboard of all the top players across every active arena.

    Primary concerns
    Building a new multiplayer game
    Want to use GKE
    Use global load balancer to keep latency down
    Keep global leaderboard in sync (streaming data)
    Willing to use Cloud Spanner as their database engine

    Lay of the land
    Recently lifted & shifted 5 games to GCP
    Each game in own project under one folder (folder maintains most permissions and network policies)
    Some legacy games with little traffic consolidated to single project
    Separate environments for development and testing

    Business requirements
    Support multiple gaming platforms (from mobile only to multiple platforms)
    Support multiple regions (protect data and diff compliance regs)
    Support rapid iteration of game features (CICD)
    Minimize latency
    Optimize for dynamic scaling
    Use managed services and pooled resources (standardization)
    Minimize costs

    Technical requirements
    Dynamically scale based on game activity
    Publish scoring data on near real-time global leaderboard
    Store game activity logs in structured files for future analysis
    Use GPU processing to render graphics server-side for multi-platform support
    Support eventual migration of legacy games to this new platform

    Big picture (exec statement)
    Our last game was the first time we used Google Cloud and it was a success. We were able to analyze player behavior and game telemetry in ways that we never could before. This success allowed us to bet on a full migration to the cloud and to start building all new games using cloud native design principles.
    See advantage reviewing user actions and game responses; going completely cloud native

    Our new game is our most ambitious to date and will open doors for us to support more gaming platforms beyond mobile. Latency is our top priority, although cost management is the next most important challenge.
    Higher performance; lower cost

    As with our first cloud-based game, we have grown to expect the cloud to enable advanced analytics capabilities so we can rapidly iterate on our deployments of bug fixes and new functionality.
    Double down on analytical approach that gave them an edge; invest in Cloud Spanner to achieve goals

    Key takeaways
    - Wants to expand reach to other gaming platforms and other regions of the world
    - Very specific ideas on how to architect their next steps, including Kubernetes, Load Balancer, and Cloud Spanner
    - Latency as top priority and cost management as second; happy users while keeping eye on bottom line

    ---

    TerramEarth

    Who is TerramEarth - manufactures heavy equipment for the mining and agriculture industries. They have over 500 dealers and service centers in 100 countries. Their mission is to build products that make their customers more productive.
    Sophisticated earth-moving equipment; solid network; customer focused

    Primary concerns
    2 million TE vehicles in operation
    Collect telemetry data from many sensors (IoT)
    Subset of critical data in real time
    Rest of data collected, compressed, and uploaded daily
    200-500 MB of data per vehicle per day (~1 PB total each day)

    Lay of the land
    Infra in GCP serving clients all around the world (data gathering and analysis)
    Private data center integration (data from its 2 main manufacturing plants) with multiple Interconnects

    Business requirements
    Predict and detect vehicle malfunction
    Ship parts to dealerships for just-in-time repair with little/no downtime
    Decrease cloud operational costs and adapt to seasonality
    Increase speed and reliability of developer workflow (SRE)
    Allow remote developers to be productive without compromising code or data security
    Create flexible and scalable platform for custom API Services for dealers and partners (Apigee)

    Technical requirements
    Create new abstraction layer for HTTP API access to legacy systems to enable a gradual migration without disrupting operations (API gateway)
    Modernize all CI/CD pipelines to allow developers to deploy container-based workloads in highly scalable environments (GKE, Cloud Run, Cloud Build)
    Allow developers to experiment without compromising security and governance (new test project)
    Create a self-service portal for internal and partner developers to create new projects, request resources for data analytics jobs, and centrally manage access to the API endpoints (secure new web front end with access to spin up resources; network tags)
    Use cloud-native solutions for keys and secrets management and optimize for identity-based access (IAM, Secrets Manager, and KMS)
    Improve and standardize tools necessary for application and network monitoring and troubleshooting (Cloud Operations: Monitoring, Logging, Debugging)

    Big picture (exec statement)
    Our advantage has always been our focus on the customer, with our ability to provide excellent customer service and minimize vehicle downtime. After moving multiple systems to Google Cloud, we are seeking new ways to provide best-in-class online fleet management services to our customers and improve operations of our dealerships.
    Customer is successful, they are; keeping vehicles operational leads to success; always improving

    5-year strategic plan is to create a partner ecosystem of new products by enabling access to our data, increasing autonomous operation capabilities of our vehicles, and creating a path to move the remaining legacy systems to the cloud.
    Moving physical and digital information daily

    Key takeaways
    - places great emphasis on customer and partner support which requires consistent and secure communication between systems and devices
    - after success of initial migration, TE seeks to expand their global integration without disrupting operations or breaching regulations
    - company's equipment must be able to transmit and analyze a great deal of telemetry data to maintain high-performance levels and just-in-time repairs

  26. mikesparr created this gist Jul 18, 2022.
    90 changes: 90 additions & 0 deletions architecting-solutions.md
    # Architecting for the cloud

    Architect solutions to be
    scalable and resilient

    Business requirements involve
    lowering costs / enhancing user experience

    Keep an eye on technical needs during
    development and operation

    ---

    3 Major Questions To Ask

    1. Where is the company coming from
    business, technical, personnel

    2. Where is the company going to
    on GCP, hybrid, multi-cloud / regional, national, global

    3. What's next
    allow for future changes

    ---

    Key Data Lifecycle Steps (4)
    1. Ingest - pull in raw data via streaming, batch, or app processes
    2. Store - keep the retrieved data in a durable and accessible environment
    3. Process/Analyze - transform the data into actionable information
    4. Explore/Visualize - convert processed data into shareable, relatable content

    1. Ingesting Data (11 services)

    Streaming
    Cloud Pub/Sub - messaging middleware system (publish sketch after this list)
    Batch
    Cloud Storage - object storage in buckets
    Storage Transfer Service - managed transfers into Cloud Storage from S3, HTTP sources, or other buckets
    BigQuery Data Transfer Service - scheduled loads of structured data into BigQuery
    Transfer Appliance - physical appliance shipped to you for moving very large datasets to the cloud
    Application
    Cloud Logging - application log output (exportable via sinks)
    Cloud Pub/Sub - application events and messages
    Cloud SQL - structured data
    Cloud Firestore - serverless document database for NoSQL data
    Cloud Bigtable - large amounts of NoSQL data
    Cloud Spanner - fully managed relational database for structured SQL data
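
    A minimal sketch of the streaming ingest path above, assuming the google-cloud-pubsub client library is installed and a topic named "telemetry" already exists in a project called "my-project" (both names are hypothetical):

    ```python
    # Minimal streaming-ingest sketch (assumes google-cloud-pubsub is installed
    # and a topic named "telemetry" already exists in project "my-project").
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "telemetry")  # hypothetical names

    # Publish a small JSON payload; Pub/Sub message data must be bytes.
    future = publisher.publish(topic_path, data=b'{"vehicle_id": 42, "speed_kph": 88}')
    print(f"Published message ID: {future.result()}")  # blocks until the publish completes
    ```

    Subscribers (for example a Dataflow pipeline or a pull subscription) then consume from the topic, which is what decouples producers from the rest of the lifecycle.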

    2. Storing Data

    Objects
    Cloud Storage - durable object storage (upload sketch after this list)
    Cloud Storage for Firebase - mostly mobile / web apps with some overlap
    Databases
    Cloud SQL - relational DB for MySQL, Postgres, SQL Server
    Cloud Spanner - large distributed SQL
    Cloud Bigtable - large NoSQL
    Cloud Firestore - serverless NoSQL
    Warehouse
    BigQuery - serverless highly-scalable multi-cloud data warehouse
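
    A quick sketch of the object-storage option above, assuming google-cloud-storage is installed and the bucket already exists; bucket and file names are made up for illustration:

    ```python
    # Object-storage sketch (assumes google-cloud-storage and an existing bucket;
    # bucket and file names are hypothetical).
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-ingest-bucket")
    blob = bucket.blob("raw/2022-08-31/telemetry.csv")

    # Upload a local file, then read its metadata back.
    blob.upload_from_filename("telemetry.csv")
    print(blob.size, blob.storage_class)
    ```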

    3. Processing and Analyzing Data
    big data, ETL pipelines, machine learning

    Compute
    Compute Engine - virtual compute machines
    Kubernetes Engine - orchestration of containerized workloads
    App Engine - quickly get apps up and running
    Large-Scale
    Cloud Dataproc - modern data lake, ETL (Hadoop, Spark, Flink, Presto, plus 30+ tools/frameworks)
    Cloud Dataflow - stream and batch processing based on Apache Beam (see the Beam sketch after this list)
    Cloud Dataprep - intelligent cloud data service to visually explore, clean, and prepare data for analysis/ML
    Analysis
    BigQuery - analyze petabytes of data at incredible speeds with zero operational overhead
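
    Since Dataflow executes Apache Beam pipelines, here is a tiny Beam sketch. It runs locally with the DirectRunner by default; the same code can target Dataflow by passing pipeline options such as --runner=DataflowRunner with a project, region, and temp_location. Assumes the apache-beam package is installed:

    ```python
    # Tiny Apache Beam pipeline (DirectRunner locally; pass --runner=DataflowRunner
    # plus project/region/temp_location options to run the same code on Dataflow).
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["alpha,1", "beta,2", "gamma,3"])
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}={kv[1]}")
            | "Print" >> beam.Map(print)
        )
    ```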

    4. Exploring and Visualizing Data

    Science
    Cloud Datalab - uses Jupyter notebooks to interact with and visualize data
    Visualizing
    BigQuery BI Engine - in-memory business intelligence functionality for BQ
    Cloud Data Studio - free dashboards and reports over a host of data sources
    Looker - front-end enterprise platform for BI, apps, embedded data analytics

    Key points:
    4 phases: ingest, store, process/analyze, explore/visualize
    Data ingested via streaming, batch, or application processes
    Data structure can change, depending on its source and destination
    Google offers a wide range of services to manage data in every phase of its lifecycle
    209 changes: 209 additions & 0 deletions overarching-principles.md
    # Overall Principles

    - Grasping Key Tech Fundamentals
    Describing distributed systems
    Core networking fundamentals
    Applying HTTP/HTTPS
    Understanding SRE principles
    - Keeping in Compliance - follow spirit and letter of "the law"
    Compliance with what?
    Getting help with compliance
    Relevant products and services
    - Annotating Resources Properly
    Understanding annotation options
    Applying security marks
    Working with labels
    Implementing networking tags
    Choosing the right annotation
    - Managing Quotas & Costs
    Working with quota limits
    Cost optimization principles
    Best practices (overall, compute, storage and data analysis)

    ---

    ## Key Fundamentals

    Distributed System - group of servers working together so as to appear as a single server to the end user
    Scale Horizontally - increase capacity by adding more servers that work together
    Scale Vertically - increase capacity by adding more memory or using a faster CPU
    Sharding - splitting a server into multiple servers, a.k.a. "partitioning"
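
    A toy illustration of the sharding idea, kept deliberately simple: real systems usually prefer consistent hashing so that adding a shard doesn't remap every key, but modulo hashing makes the key-to-partition mapping easy to see:

    ```python
    # Toy sharding illustration: route each key deterministically to one of N
    # partitions. Modulo hashing is shown for clarity only; consistent hashing
    # is the usual production choice.
    import hashlib

    NUM_SHARDS = 4

    def shard_for(key: str) -> int:
        """Map a key to a shard deterministically."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    for user in ["alice", "bob", "carol"]:
        print(user, "-> shard", shard_for(user))
    ```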

    Networking - be familiar with 7-layer OSI model
    7 Layer OSI model
    Application - End-user layer (human-computer interaction): HTTP, FTP, IRC, SSH, DNS
    Presentation - Syntax layer: SSL, SSH, IMAP, FTP, MPEG, JPEG
    Session - Sync and send to port: APIs, Sockets, WinSock
    Transport - End to end Connections: TCP, UDP
    Network - Packets: IP, ICMP, IPSec, IGMP
    Data Link - Frames: Ethernet, PPP, Switch, Bridge
    Physical - coax, fiber, wireless, hubs, repeaters
    TCP/IP - primary way data gets around the Internet
    Handshaking with syn/ack
    Addressing with IPv4 and IPv6
    Public Internet and private RFC1918 addressing
    SSL/TLS - encrypted comms
    SSH - secure remote shell access
    Ports
    80 - HTTP
    22 - SSH
    53 - DNS
    443 - HTTPS
    25 - SMTP
    3306 - MySQL

    Applying HTTP/HTTPS - works on L7 (Application Layer)
    Understand your resources (URL/URI) and how parameters are applied
    Know verbs: GET, POST, PUT, DELETE & PATCH, OPTIONS, TRACE, CONNECT
    Have a firm grasp of caching: headers and locations (browsers, proxies, CDNs, memory caches)
    Be familiar with CORS
    HTTP/HTTPS status codes
    100 Information
    100 - Continue
    101 - Switching protocols
    200 Successful response
    200 - Okay
    201 - Created
    202 - Accepted
    204 - No content
    206 - Partial content
    300 Redirection
    301 - Moved permanently
    304 - Not modified (caching)
    307 - Temporary redirect
    308 - Permanent redirect
    400 Client Errors
    400 - Bad request
    401 - Unauthorized
    403 - Forbidden
    408 - Request timeout
    429 - Too many requests
    500 Server Error
    500 - Internal server error
    501 - Not implemented
    502 - Bad gateway
    503 - Service unavailable / quota exceeded
    504 - Gateway timeout
    511 - Network authentication required
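
    A client-side sketch tying the status codes and caching notes above together: back off exponentially on 429/503, raise on other errors, and inspect the Cache-Control header on success. Uses the third-party requests package; the URL is a placeholder:

    ```python
    # Handle rate limiting (429) and unavailability (503) with exponential
    # backoff; inspect caching headers on success. URL is hypothetical.
    import time
    import requests

    def fetch_with_backoff(url: str, max_attempts: int = 5) -> requests.Response:
        for attempt in range(max_attempts):
            resp = requests.get(url)
            if resp.status_code in (429, 503):   # rate-limited / unavailable
                time.sleep(2 ** attempt)          # 1s, 2s, 4s, ...
                continue
            resp.raise_for_status()               # raise on other 4xx/5xx
            return resp
        raise RuntimeError(f"Gave up after {max_attempts} attempts")

    resp = fetch_with_backoff("https://example.com/api/resource")
    print(resp.status_code, resp.headers.get("Cache-Control"))
    ```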

    Understanding SRE Principles - "What happens when a software engineer is tasked with what used to be called operations" - Ben Treynor (2003)
    SLI - Service Level Indicator (carefully defined quantitative measure of level of service provided over time)
    Request latency - how long to return a response to a request
    Failure rate - fraction of all requests received that fail
    Batch throughput - proportion of time that data processing rate > threshold set
    SLO - Service Level Objective (specify target level for reliability of service)
    100% is unrealistic and increasingly expensive; often unnecessary for users, so find the level where they stop noticing the difference and focus remaining resources on adding value to the service
    SLA - contractual obligation
    includes consequences of meeting or missing SLOs it contains
    SLI - drives - SLO - informs - SLA
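
    A worked example of the SLI/SLO relationship above: compute an availability SLI from request counts and compare it to a 99.9% SLO; the request counts are made up.

    ```python
    # Availability SLI vs. a 99.9% SLO, with error-budget accounting.
    total_requests = 1_000_000
    failed_requests = 700   # hypothetical

    sli = (total_requests - failed_requests) / total_requests   # measured level
    slo = 0.999                                                 # target level

    error_budget = 1 - slo                                      # allowed failure fraction
    budget_spent = failed_requests / (total_requests * error_budget)

    print(f"SLI: {sli:.4%}  SLO met: {sli >= slo}")
    print(f"Error budget consumed: {budget_spent:.0%}")         # 70% in this example
    ```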

    ---

    ## Compliance

    Compliance with what
    Legislation - targeted areas (health regs, privacy, children's privacy, ownership)
    Commercial - protect sensitive data, credit cards / PII
    Industry certifications - ensure following health, safety, and environmental regulations
    Audits - create necessary structure to allow for 3rd-party audits

    Getting help with compliance
    Visit the Compliance Center - sortable by region, industry, and focus area
    General Data Protection Regulation (GDPR) - continues to have a major impact on web services around the world
    BAA - Google business associate agreement (customer must request BAA from account manager for HIPAA compliance)

    Relevant products and services
    2-factor authentication
    Cloud Security Command Center (CSCC)
    Cloud IAM (global across all Google Cloud)
    Cloud Logging
    Cloud DLP (de-identification routines to protect PII)
    Cloud Monitoring (surface compliance missteps / alerts in real time)
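
    A hedged sketch of the Cloud DLP service mentioned above, scanning a string for PII. Assumes the google-cloud-dlp client library and a project ID of "my-project"; the request shape follows the v2 API:

    ```python
    # Scan text for PII info types with Cloud DLP (google-cloud-dlp package;
    # project ID is hypothetical).
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(
        request={
            "parent": "projects/my-project",
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
                "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
            },
            "item": {"value": "Contact me at jane@example.com or 555-867-5309"},
        }
    )
    for finding in response.result.findings:
        print(finding.info_type.name, finding.likelihood)
    ```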

    ---

    ## Annotations

    Understanding annotations
    Security Marks - assigned and utilized through Cloud Security Command Center (CSCC)
    Labels - key-value pairs that help you organize cloud resources
    Network tags - applied to VM instances, used for firewall rules and routing traffic to/from them

    Applying security marks
    Adds business context to assets for compliance
    Enhanced security focused insights into resources
    Unique to CSCC
    Set at org, project, or individually
    Works with labels and network tags

    Working with labels
    Key-value pairs supported by a wide range of GCP resources
    Used for many scenarios
    Identify individual teams or cost center resources
    Distinguish deployment environments
    Cost allocation and billing breakdowns
    Monitor resource groups for metadata
    Labels can be applied to projects, but NOT folders
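
    A small sketch of the cost-allocation use case above: attach team/environment labels to a Cloud Storage bucket so billing reports can break costs down by them. Assumes google-cloud-storage and an existing bucket; all names are hypothetical:

    ```python
    # Attach labels to a bucket for billing breakdowns (hypothetical names;
    # label keys/values must be lowercase).
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-app-assets")
    bucket.labels = {"team": "frontend", "env": "dev", "cost-center": "cc-1234"}
    bucket.patch()  # persists the label change
    print(bucket.labels)
    ```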

    Implementing network tags
    Control traffic to/from VM instances
    Identify VM instances subject to firewall rules and network routes
    Use tags as source and destination values in firewall rules
    Identify instances on a certain route
    Configured with gcloud, console, or API
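
    A hedged sketch of the API route for the tag workflow above, using the google-cloud-compute client (the call shape follows the generated v1 client; project, zone, instance, and tag names are all hypothetical). Firewall rules can then use the tag as a target:

    ```python
    # Add a network tag to a running VM; the current tag fingerprint must be
    # sent back to avoid clobbering concurrent edits. Names are hypothetical.
    from google.cloud import compute_v1

    instances = compute_v1.InstancesClient()
    instance = instances.get(project="my-project", zone="us-central1-a", instance="web-1")

    tags = instance.tags
    tags.items = list(tags.items) + ["allow-health-checks"]  # keep existing tags

    operation = instances.set_tags(
        project="my-project", zone="us-central1-a", instance="web-1", tags_resource=tags
    )
    operation.result()  # wait for the operation to finish
    ```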

    Choosing right annotation
    Need to group/classify for compliance?
    Yes - use Security Marks
    No - Need billing breakdown?
    Yes - use Labels
    No - Need to manage network traffic to/from VMs?
    Yes - use Network Tags

    ---

    ## Managing Quotas & Costs

    Working within quota limits - restrict how much of a shared GCP resource you can use
    Not to be confused with fixed constraints, which cannot be increased or decreased (i.e. max file size, database schema limits)
    Two types of quotas:
    Rate quotas - limit number of API or service requests
    Allocation quotas - restrict the resource available at any one time
    Limits are specific to your org
    Add your own limits to impose spending limits
    Exceeded quotas can generate a quota error and 503 status for HTTP requests

    Cost optimization principles
    Understand the total cost of ownership (TCO)
    Commonly misunderstood when moving from on-prem (CapEx) model to cloud-based (OpEx)
    Organize costs in relation to business needs
    Maximize value of all expenses while eliminating waste
    Implement standardized processes at the start

    Best practices: use cost management tools
    Organize and Structure - set up folders, projects, and use labels to structure costs in relation to business needs
    Billing Reports - view costs and analyze trends and filter as needed
    Custom dashboards - can also export to BigQuery, then visualize in Cloud Data Studio

    Compute - pay for the compute you need
    Identify idle VMs
    use Idle VM recommender service to identify inactive VMs
    Snapshot them before deleting
    Stop without deleting
    Start/stop VMs automatically or via Cloud Functions (see the sketch after this list)
    Create custom VMs with right size CPUs and memory
    Make the most of preemptible/spot VMs (often is an option - consider it for exam)
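
    A sketch of the stop-idle-VMs idea, e.g. the body of a scheduled Cloud Function, using the google-cloud-compute client; project, zone, and instance names are hypothetical:

    ```python
    # Stop a VM programmatically; a stopped VM stops accruing compute charges
    # (attached disks still bill). Names are hypothetical.
    from google.cloud import compute_v1

    instances = compute_v1.InstancesClient()
    operation = instances.stop(
        project="my-project", zone="us-central1-a", instance="batch-worker-1"
    )
    operation.result()  # wait for the stop to complete
    print("Instance stopped")
    ```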

    Cloud Storage - ways to keep more of your company's hard-earned money
    Choose the right storage class: Nearline (30-day min), Coldline (90-day min), Archive (365-day min)
    Modify storage class as needed with lifecycle policies (see the sketch after this list)
    Deduplicate data wherever possible (i.e. Cloud Dataflow)
    Choose multi-region rather than single-region buckets where viable
    Set object versioning policies to keep copies down (i.e. delete oldest after 2 versions)
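
    A lifecycle-policy sketch covering two bullets above: demote objects to Coldline after 90 days and keep only the 2 newest noncurrent versions. Assumes google-cloud-storage and an existing versioned bucket with a hypothetical name:

    ```python
    # Lifecycle rules: demote to Coldline after 90 days; delete versions once
    # 2 newer ones exist. Bucket name is hypothetical.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-archive-bucket")

    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.add_lifecycle_delete_rule(number_of_newer_versions=2)
    bucket.patch()

    for rule in bucket.lifecycle_rules:
        print(rule)
    ```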

    Keep BigQuery from BigCosts
    Limit query costs with the maximum bytes billed setting
    Partition tables based on ingestion time, date, timestamp, or integer range column
    Switch from on-demand to flat rate pricing to process unlimited bytes for fixed predictable cost
    Combine Flex Slots (like preemptible) with annual and monthly commitments (blended)
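
    A sketch of the maximum-bytes-billed guardrail from the list above: the query fails instead of billing past the cap. Assumes google-cloud-bigquery; the table is a real public dataset and the 100 MB cap is arbitrary:

    ```python
    # Cap query cost: the job errors out rather than billing beyond the cap.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 1024 * 1024)

    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    rows = client.query(sql, job_config=job_config).result()  # raises if cap exceeded
    for row in rows:
        print(row.name, row.total)
    ```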