EKS Best Practice - Monitoring with Prometheus and Grafana

Written by: Ziyi.T | Celeste Shao

As more organizations adopt Kubernetes for container orchestration, Amazon Elastic Kubernetes Service (Amazon EKS) has emerged as a popular choice for running Kubernetes clusters in the cloud. While Amazon EKS offers many benefits, one pain point is monitoring the health and performance of your Kubernetes infrastructure.

Monitoring Amazon EKS can be complex and challenging, especially when you have to deal with multiple clusters and nodes, each with its own set of metrics, such as:

memory usage
network I/O
CPU usage

Without proper monitoring, you may not be able to detect and resolve issues in a timely manner, which can lead to downtime and poor application performance.

The Solution

Fortunately, there are tools that can help you streamline the monitoring process and gain better insight into your Amazon EKS infrastructure. In this document, we'll explore two mainstream tools for monitoring Amazon EKS: Prometheus and Grafana.

Prometheus is an open-source monitoring system and time-series database that allows you to collect, store, and query metrics from your applications and infrastructure. It is widely used in cloud-native environments such as Kubernetes and is supported by a large and active open-source community.

Grafana is an open-source, web-based data visualization and analytics tool that allows you to create and share interactive dashboards and graphs. It can be used with Prometheus to monitor and visualize metrics from applications and infrastructure running on platforms such as Kubernetes and AWS.

Benefits

Monitoring and Visualization: Because Prometheus is designed for time-series data, you can collect and store metrics from Amazon EKS nodes, pods, and containers to monitor the performance and health of your clusters, and then visualize the data with Grafana.
Alerting: Prometheus also provides a powerful alerting system that lets you define alerts based on specific metrics and thresholds, such as CPU utilization.
Integration: Prometheus and Grafana can be integrated with other tools and services used in Amazon EKS clusters, such as Istio and Fluentd.

The ASCENDING Approach

We use Helm charts to deploy all the applications, which lets us customize the configuration and keeps the deployment process concise. For each application, we create a separate namespace on our Amazon EKS cluster.

1. Prometheus for metrics collection

There are a couple of things worth considering when deploying Prometheus on your Amazon EKS cluster:

Ingress Configuration

The Ingress configuration for Prometheus is optional. If you want to make the Prometheus service accessible from the public internet, you can enable ingress by setting enabled to true under ingress in the chart values.

On Amazon EKS, Ingress resources are typically served by the AWS Load Balancer Controller (formerly the ALB Ingress Controller), so make sure it is installed before you create an ingress on your EKS cluster.

Basic Auth (https://prometheus.io/docs/guides/basic-auth/)

Prometheus supports basic authentication and TLS, although this support is experimental and might change in the future. Alternatively, Prometheus can be secured at the network level by not allowing public traffic. To enable basic auth, you currently have to update the web configuration by editing the ConfigMap manually, as the Helm chart does not offer such an option yet.

Repository info:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
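
Once basic auth is enabled on the Prometheus web endpoint, every client that queries the HTTP API has to send credentials. The following is a minimal Python sketch, using only the standard library, of how such a request could be built. The base URL, user name, and password are placeholders for illustration and not values from our setup; /api/v1/query is the standard Prometheus instant-query endpoint.

```python
import base64
import json
import urllib.parse
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_query_request(base_url: str, promql: str,
                        username: str, password: str) -> urllib.request.Request:
    """Prepare a GET request against the Prometheus instant-query endpoint."""
    query_string = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query_string}"
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


def run_query(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder endpoint and credentials (assumptions, replace with your own).
    req = build_query_request("http://localhost:9090", "up", "admin", "secret")
    print(req.full_url)
```

Calling run_query(req) would send the request. It is left out of the example above so that the sketch runs without a live Prometheus server.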

2. Grafana for data visualization

The process of deploying Grafana is almost the same as for Prometheus:

Add the Helm repo: helm repo add grafana https://grafana.github.io/helm-charts

Customize the Helm chart by referring to its official repo: https://github.com/grafana/helm-charts/tree/main/charts/grafana

As Grafana dashboards are represented as a JSON model, we recommend provisioning your dashboards from JSON files stored in a ConfigMap:

Create a ConfigMap whose data section contains the JSON for each dashboard.

In the chart's YAML values, set dashboardProviders and dashboardsConfigMaps so that they reference that ConfigMap.
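
The two provisioning steps can be sketched in Python as follows. This is an illustrative sketch and not our exact configuration: the ConfigMap name, namespace, provider name, and the tiny dashboard model are hypothetical, and the layout of dashboardProviders and dashboardsConfigMaps follows the structure documented in the Grafana chart's values file, which you should check against the chart version you deploy. The output is JSON, which kubectl accepts directly and which is also valid YAML for a Helm values file.

```python
import json


def dashboard_configmap(name: str, namespace: str, dashboards: dict) -> dict:
    """Return a ConfigMap manifest holding one JSON file per dashboard."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            f"{key}.json": json.dumps(model, indent=2)
            for key, model in dashboards.items()
        },
    }


def grafana_values(configmap_name: str, provider: str = "default") -> dict:
    """Return Helm value overrides that point Grafana at the ConfigMap."""
    return {
        "dashboardProviders": {
            "dashboardproviders.yaml": {
                "apiVersion": 1,
                "providers": [
                    {
                        "name": provider,
                        "orgId": 1,
                        "folder": "",
                        "type": "file",
                        "disableDeletion": False,
                        "editable": True,
                        "options": {
                            "path": f"/var/lib/grafana/dashboards/{provider}"
                        },
                    }
                ],
            }
        },
        # Maps the provider name to the ConfigMap that stores its dashboards.
        "dashboardsConfigMaps": {provider: configmap_name},
    }


if __name__ == "__main__":
    # Hypothetical minimal dashboard model, for illustration only.
    dashboards = {"eks-overview": {"title": "EKS Overview", "panels": []}}
    manifest = dashboard_configmap("grafana-dashboards", "grafana", dashboards)
    print(json.dumps(manifest, indent=2))
    print(json.dumps(grafana_values("grafana-dashboards"), indent=2))
```

Generating the manifest from the dashboard JSON files keeps the dashboards under version control and avoids pasting large JSON documents into YAML by hand.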

3. Use cases of Pushgateway and Pushgateway-Aggregator

Pushgateway (https://github.com/prometheus/pushgateway) is a tool in the Prometheus ecosystem that allows you to push metrics from short-lived jobs, such as batch jobs or cron jobs, that cannot be scraped by Prometheus directly. Instead of scraping the metrics from the job, Pushgateway allows the job to push its metrics to a designated endpoint, where they can be scraped by Prometheus.
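
To make the push model concrete, here is a minimal Python sketch of how a short-lived job could push a single gauge to Pushgateway over its HTTP API, using only the standard library. The gateway address, job name, label, and metric name are hypothetical. The /metrics/job/<job>/<label>/<value> path and the text exposition format are the ones described in the Pushgateway README. The sketch assumes label values that contain no "/" characters.

```python
import urllib.parse
import urllib.request


def push_url(gateway: str, job: str, grouping: dict = None) -> str:
    """Build the push endpoint for a job and optional grouping labels."""
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    for label, value in (grouping or {}).items():
        path += "/" + urllib.parse.quote(label, safe="")
        path += "/" + urllib.parse.quote(value, safe="")
    return gateway.rstrip("/") + path


def gauge_payload(name: str, value: float, help_text: str) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"  # the body must end with a newline
    )


def push(gateway: str, job: str, payload: str,
         grouping: dict = None, timeout: float = 10.0) -> int:
    """PUT the payload, replacing all metrics stored for this group."""
    request = urllib.request.Request(
        push_url(gateway, job, grouping),
        data=payload.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical names; nothing is sent, the sketch only prints the request.
    print(push_url("http://pushgateway:9091", "nightly_backup",
                   {"instance": "runner-1"}))
    print(gauge_payload("job_last_success_unixtime", 1700000000,
                        "Last successful run"))
```

A batch job would call push(...) as its last step. Prometheus then scrapes the Pushgateway like any other target.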

Pushgateway acts as an intermediary, storing the metrics temporarily until they are scraped by Prometheus. It accepts metrics over a simple HTTP API, but it is a metrics cache rather than an aggregator: each push replaces the values previously stored for the same job and labels. If you need values from ephemeral jobs to accumulate across runs (for example, to collect the total number of failed jobs of a Jenkins pipeline), you need Pushgateway-Aggregator (https://github.com/weaveworks/prom-aggregation-gateway), which adds pushed counters with matching labels to the totals it already holds.
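
The difference between the two gateways can be illustrated with a toy in-memory model in Python. This is not code from either project: the class names and the metric name are hypothetical, and the model only captures the behavior relevant here, namely that a push to Pushgateway replaces the stored value for a group, while an aggregating gateway adds pushed counter values to the running total.

```python
from collections import defaultdict


class ReplacingGateway:
    """Toy model of Pushgateway: each push overwrites the job's metrics."""

    def __init__(self):
        self.samples = {}

    def push(self, job: str, metrics: dict) -> None:
        # Mirrors PUT semantics: the whole group is replaced.
        self.samples[job] = dict(metrics)

    def value(self, job: str, name: str) -> float:
        return self.samples.get(job, {}).get(name, 0)


class AggregatingGateway:
    """Toy model of an aggregating gateway: pushed counters are summed."""

    def __init__(self):
        self.samples = defaultdict(float)

    def push(self, job: str, metrics: dict) -> None:
        for name, increment in metrics.items():
            self.samples[(job, name)] += increment

    def value(self, job: str, name: str) -> float:
        return self.samples[(job, name)]


if __name__ == "__main__":
    replacing, aggregating = ReplacingGateway(), AggregatingGateway()
    # Three separate pipeline runs each report one failed job.
    for _ in range(3):
        replacing.push("jenkins", {"jenkins_failed_jobs_total": 1})
        aggregating.push("jenkins", {"jenkins_failed_jobs_total": 1})
    print(replacing.value("jenkins", "jenkins_failed_jobs_total"))
    print(aggregating.value("jenkins", "jenkins_failed_jobs_total"))
```

With the replacing model, the stored value stays at 1 no matter how many runs fail, whereas the aggregating model accumulates to 3. This is why a running total across ephemeral runs needs an aggregating gateway.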

Our approach to setting up Pushgateway/Pushgateway-Aggregator

Ziyi.T

Solution Architect @ASCENDING

Celeste Shao

Data Engineer @ASCENDING
