AWS Support - Q&A
Regarding why the write operation showed poor throughput: it depends on the amount of data being inserted by the operation, as well as any deadlocks or wait events for locks it needs to acquire. To that end, if possible you could run those write operations and check for any deadlocks they may be causing (by checking the output of SHOW ENGINE INNODB STATUS). You can also run a profile on them to see which stage of execution is slowing down the whole query. The EXPLAIN plan will also provide insight into the query execution plan, allowing you to make changes to improve efficiency.
https://dev.mysql.com/doc/refman/5.7/en/show-profile.html
https://dev.mysql.com/doc/refman/5.7/en/show-engine.html
https://dev.mysql.com/doc/refman/5.7/en/explain.html
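If it helps, here is a minimal sketch of running these diagnostics from Python (a sketch only, assuming the mysql-connector-python package and a reachable MySQL 5.7 endpoint; the endpoint, credentials, and sample INSERT below are placeholders, not values from your environment):

# Hypothetical diagnostic run; endpoint, credentials, and the sample
# statement are placeholders, not values from the case.
import mysql.connector

conn = mysql.connector.connect(
    host="mydb.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    user="admin",
    password="secret",
    database="mydb",
)
cur = conn.cursor()

# Check for deadlocks and lock waits around the slow write.
cur.execute("SHOW ENGINE INNODB STATUS")
print(cur.fetchall()[0][2])  # the third column holds the status report

# Profile one execution of the write to see which stage is slow.
cur.execute("SET profiling = 1")
cur.execute("INSERT INTO orders (id, total) VALUES (1, 9.99)")  # sample write
cur.execute("SHOW PROFILES")
last_query_id = cur.fetchall()[-1][0]
cur.execute("SHOW PROFILE FOR QUERY %d" % last_query_id)
for stage, seconds in cur.fetchall():
    print(stage, seconds)

# Inspect the execution plan of the same statement.
cur.execute("EXPLAIN INSERT INTO orders (id, total) VALUES (1, 9.99)")
print(cur.fetchall())

conn.rollback()  # discard the sample write
conn.close()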
>> Do we have to look at using a managed Aurora cluster with read replicas and connection pooling for better throughput?
It is indeed an alternative you can consider. Your workload contains both read and write traffic, so with a provisioned Aurora cluster you will have a writer and readers to split the write and read workloads. This, along with connection pooling, can indeed increase the overall throughput of your database. You can restore a snapshot of your cluster to a provisioned one and test it against your workload. Further, provisioned clusters come with Enhanced Monitoring and Performance Insights, which give more insight into resource utilisation and query performance.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html
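As a rough sketch, restoring a cluster snapshot into a provisioned Aurora cluster for testing could look like the following with boto3 (all identifiers, the engine name, the monitoring role ARN, and the instance class are placeholders; adjust them to match your snapshot):

# Hypothetical restore of a cluster snapshot into a provisioned Aurora
# cluster for load testing; all identifiers below are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Restore the cluster from an existing snapshot.
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="perf-test-cluster",
    SnapshotIdentifier="my-cluster-snapshot",
    Engine="aurora-mysql",
)

# Add a writer instance; further instances added to the cluster become readers.
rds.create_db_instance(
    DBInstanceIdentifier="perf-test-writer",
    DBClusterIdentifier="perf-test-cluster",
    DBInstanceClass="db.r5.large",
    Engine="aurora-mysql",
    MonitoringInterval=60,  # Enhanced Monitoring
    MonitoringRoleArn="arn:aws:iam::123456789012:role/rds-monitoring-role",
    EnablePerformanceInsights=True,
)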
From your case description, I understand that you have created two app clients in the user pool, which are integrated with two different load balancers. When a user logs in to one application, they are also able to log in to the other application without being asked to sign in again (an SSO experience).
In regard to your case, the implementation in question is called "multi-tenancy support". A Cognito user pool represents a single tenant: users in a user pool belong to the same directory and share the same settings, such as password policy, custom attributes, MFA settings, advanced security settings, etc.
In your case, the "same user pool, multiple app clients" approach is used. Here, a single user pool hosts all users, and each app client represents a tenant. This is easier to maintain, but tenants share the same settings. This approach requires additional consideration when the hosted UI is used to authenticate users with native accounts, e.g. username and password. When the hosted UI is in use, Cognito creates a session cookie to maintain the session for the authenticated user, and this provides an SSO experience between app clients in the same user pool. If SSO is not the desired behaviour in your application, the hosted UI should not be used with this approach to authenticate native accounts.
The cons of using the "same user pool, multiple app clients" approach are:
- It requires you to perform tenant-match logic on the client side, through a custom UI, to determine which app client to authenticate users against.
- It also requires additional auth logic to verify that the user belongs to the tenant (since all users share one pool, it is technically possible for users to authenticate against any app client).
The possible workaround at the moment is to use different user pools for this purpose. Later, you can move to the approach of using a custom UI to implement the tenant-match logic.
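For reference, a minimal sketch of that tenant-match logic in a custom UI using boto3 (it assumes the USER_PASSWORD_AUTH flow is enabled on the app clients; the tenant-to-client-ID mapping and all identifiers are hypothetical):

# Hypothetical tenant-match logic for a custom sign-in UI; the mapping of
# tenants to app client IDs and all identifiers are illustrative only.
import boto3

cognito = boto3.client("cognito-idp", region_name="us-east-1")

# One user pool, one app client per tenant.
TENANT_APP_CLIENTS = {
    "tenant-a": "1example23456789",
    "tenant-b": "9example87654321",
}

def sign_in(tenant, username, password):
    client_id = TENANT_APP_CLIENTS[tenant]  # the tenant-match step
    resp = cognito.initiate_auth(
        ClientId=client_id,
        AuthFlow="USER_PASSWORD_AUTH",  # must be enabled on the app client
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    # Additional check: confirm the user actually belongs to this tenant,
    # e.g. via a custom attribute, before trusting the tokens.
    return resp["AuthenticationResult"]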
We want to understand how ALB traffic routing takes place in an EKS context.
Assume that we have a 3-node Multi-AZ EKS cluster in the us-east-1 region.
Node 1 - us-east-1a
Node 2 - us-east-1b
Node 3 - us-east-1c
We have created an ALB in instance mode for a Kubernetes service, which means that the ALB has the 3 instance nodes as targets rather than the pods themselves.
Case 1:
We have 3 pods mapped to the Kubernetes service and each node has one of the pods running.
When a request is sent to the ALB from us-east-1a, does it always forward the traffic to the node in the same AZ as the load balancer?
Case 2:
We have only 1 pod mapped to the Kubernetes service and that pod is running in the us-east-1b node.
When a request is sent to the ALB from us-east-1a, does it send the traffic to the us-east-1b node, or does it send it to the us-east-1a node, with Kubernetes then forwarding the traffic to the us-east-1b node as pod-to-pod traffic?
Answer:
=======================
The default setting for externalTrafficPolicy is "Cluster", which allows every worker node in the cluster to accept traffic for every service, whether or not a pod for that service is running on that node. Traffic is then forwarded to a node running the service via kube-proxy.
This is typically fine for smaller or single-AZ clusters, but as you scale out your instances, more instances become backends for a service, and traffic is more likely to take an additional hop before it arrives at the instance running the container it is destined for.
When running services that span multiple AZs, you should consider setting externalTrafficPolicy on your service to help reduce cross-AZ traffic.
By setting externalTrafficPolicy to Local, only instances that are running the service's container become load balancer backends, which reduces both the number of endpoints on the load balancer and the number of hops the traffic needs to take.
Another benefit of using the Local policy is that you can preserve the source IP of the request: as packets route through the load balancer to your instance and ultimately your service, the IP of the originating request is preserved because there is no additional kube-proxy hop.
An example service object with externalTrafficPolicy set would look like this:
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example
  ports:
    - port: 8765
      targetPort: 9376
  externalTrafficPolicy: Local
  type: LoadBalancer
The red index must be deleted in order to bring the cluster status back to green.
The process to restore the correct index from a snapshot is shown below:
Identifying the red indices:
* GET _cat/indices?health=red
Run the following API call to find the snapshot repository name:
* GET /_snapshot?pretty
Once the snapshot repository name is identified (in most cases it is 'cs-automated' or 'cs-automated-enc'), please run the following API call to list the snapshots:
* GET /_snapshot/repository/_all?pretty (Replace 'repository' with your repository name.)
Deleting the red index:
* DELETE /index-name (Replace 'index-name' with the index that you need to delete.)
Once you have identified the snapshot from which you want to restore the deleted index, you can run the following API call to restore it:
* POST /_snapshot/cs-automated//_restore { "indices": "index_name" }
(Replace '' with the snapshot name that you have identified, and 'index_name' with the name of the index that you want to restore.)
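Putting the steps above together, here is a rough Python sketch (this assumes the domain uses fine-grained access control with HTTP basic auth; the endpoint, credentials, index name, and snapshot name are placeholders you would substitute):

# Hypothetical end-to-end run of the steps above; the endpoint, credentials,
# index name, and snapshot name are placeholders.
import requests

ENDPOINT = "https://search-mydomain-xxxx.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")  # assumes fine-grained access control

# 1. List the red indices.
print(requests.get(f"{ENDPOINT}/_cat/indices?health=red", auth=AUTH).text)

# 2. Find the snapshot repository and list its snapshots.
print(requests.get(f"{ENDPOINT}/_snapshot?pretty", auth=AUTH).text)
print(requests.get(f"{ENDPOINT}/_snapshot/cs-automated/_all?pretty", auth=AUTH).text)

# 3. Delete the red index.
requests.delete(f"{ENDPOINT}/my-index", auth=AUTH)

# 4. Restore the index from the snapshot identified in step 2.
requests.post(
    f"{ENDPOINT}/_snapshot/cs-automated/2023-01-01-snapshot/_restore",  # placeholder snapshot name
    json={"indices": "my-index"},
    auth=AUTH,
)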
For more info on restoring snapshots, see the 'Restoring Snapshots' link below [1].
If you have any further questions, feel free to reach out to me and I will be happy to assist.
References:
[1] - 'Restoring Snapshots' https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains-snapshots.html#es-managedomains-snapshot-restore
Attempt to heartbeat failed since group is rebalancing
Revoke previously assigned partitions
(Re-)joining group
Sending LeaveGroup request to coordinator (rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Answer
===========
The way Kafka messages are consumed is that a consumer.poll() call fetches a batch of records from the Kafka topic; the consumer application then processes those records in a loop and makes the next consumer.poll() call to fetch the next batch. The maximum permitted time between poll calls is defined by the "max.poll.interval.ms" consumer configuration parameter (which defaults to 300 seconds, i.e. 5 minutes, unless explicitly overridden). If the time between two consumer.poll() calls exceeds this limit, the consumer instance leaves the consumer group, forcing the group coordinator to trigger a rebalance and redistribute the topic's partitions across the other available consumer instances. This indicates slow processing logic, or that the Kafka records are being sent to a downstream application that is slow to respond, which in turn increases the overall time taken by the processing logic. In such cases, as the error message suggests, it is advisable to:
1. Increase the value of "max.poll.interval.ms". This helps accommodate sudden increases in record processing time and ensures that the consumer group does not enter a rebalancing state.
2. Decrease the number of records returned by Kafka in each poll cycle by tuning the "max.poll.records" consumer parameter (defaults to 500). This may, however, slow down the overall consumption process even when the processing logic is behaving normally and taking the usual time to process records.
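A minimal sketch of both adjustments with the kafka-python client follows (the broker address, topic, and group ID are placeholders; equivalent properties exist for other Kafka clients):

# Hypothetical consumer configuration showing both tuning options; the
# bootstrap servers, topic, and group ID are placeholders.
from kafka import KafkaConsumer

def process(record):
    # Placeholder for the (potentially slow) per-record processing logic.
    print(record.topic, record.partition, record.offset)

consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers=["b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"],
    group_id="my-consumer-group",
    max_poll_interval_ms=600_000,  # option 1: allow up to 10 minutes between polls
    max_poll_records=100,          # option 2: return fewer records per poll
)

for record in consumer:
    process(record)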
The issue has been identified as the restrictive bucket policy on the target bucket, named "xxxx". The 2 specific rules which are causing the deny are "DenyIncorrectEncryptionHeader" and "DenyUnEncryptedObjectUploads". I have added these rules to my own S3 bucket and immediately my outfile operations failed with "Error Code: 63994. S3 API returned error: Access Denied:Access Denied".
As the outfile generated by MySQL is not an encrypted object, the above policy rules deny the operation. Furthermore, as there is no option to create the outfile as an encrypted object, there are two options that come to mind.
1. Remove the above-mentioned rules from the bucket policy. This would obviously depend on your organization's own policies and procedures.
2. Create a new bucket without the above-mentioned rules in its bucket policy.
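For reference, this is the typical shape of those two deny statements (a generic sketch based on the common AWS example, not the exact policy on your bucket; "xxxx" stands in for the real bucket name):

# Typical shape of the two deny statements; "xxxx" is a placeholder for the
# real bucket name and the details may differ from your actual policy.
import json

deny_statements = [
    {
        "Sid": "DenyIncorrectEncryptionHeader",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::xxxx/*",
        # Denies uploads whose encryption header is not the expected algorithm.
        "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}},
    },
    {
        "Sid": "DenyUnEncryptedObjectUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::xxxx/*",
        # Denies uploads that carry no encryption header at all, which is the
        # case for the MySQL outfile upload.
        "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
    },
]

print(json.dumps({"Version": "2012-10-17", "Statement": deny_statements}, indent=2))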
When we try to resolve the public hosted zone record "" from within a pod running in the EKS cluster "" residing in a private subnet, it results in "getaddrinfo ENOTFOUND".
Answer
============
In your Route 53 configuration you have the same domain name in both the private and the public hosted zone. This is called split-view DNS and is described in detail in the documentation link below.
https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zone-private-considerations.html
The idea is that if there is a private hosted zone whose name matches the domain name in the request, that hosted zone is searched for a record matching the domain name and DNS type in the request.
If there is a matching private hosted zone but no record that matches the domain name and type in the request, the Resolver does not forward the request to a public DNS resolver. Instead, it returns NXDOMAIN (non-existent domain) to the client.
This explains the behaviour you are seeing: only the records in the private hosted zone will resolve from the VPC associated with that private zone.
To overcome this, I would advise adding the records you need to the private zone, mirroring the records in the public zone.
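As a rough sketch, copying such a record into the private zone with boto3 could look like this (the hosted zone ID, record name, and target value are placeholders):

# Hypothetical example of adding the record to the private hosted zone so it
# resolves from the VPC; the zone ID, record name, and target are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0PRIVATEZONEID",  # the PRIVATE hosted zone attached to the VPC
    ChangeBatch={
        "Comment": "Mirror the public record so pods in the VPC can resolve it",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "TTL": 300,
                    "ResourceRecords": [
                        {"Value": "my-alb-123456789.us-east-1.elb.amazonaws.com"}
                    ],
                },
            }
        ],
    },
)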