# Metrics

## Metrics Overview


One of the main goals of Envoy is to make the network easy to understand. Envoy generates a large amount of statistical information depending on how it is configured. In general, statistics (metrics) fall into three categories:

- **Downstream**: Downstream metrics are related to incoming connections/requests. They are generated by `listener`, `HTTP connection manager (HCM)`, `TCP proxy filter` and so on.
- **Upstream**: Upstream metrics are related to outgoing connections/requests. They are generated by `connection pool`, `router filter`, `TCP proxy filter`, and so on.
- **Server**: `Server` metrics describe the operation of the Envoy server instance, for example server uptime or the amount of memory allocated.

In the simplest scenario, a single Envoy Proxy typically involves `Downstream` and `Upstream` statistics. These two kinds of metrics reflect the operation of the `Network Node` from which they are taken. Statistics from the entire mesh provide very detailed summary information about the health of each `Network Node` and of the network as a whole. Envoy's documentation has brief descriptions of these metrics.

<!-- Starting with the `Envoy v2 API`, Envoy is able to support a custom, pluggable `Metrics Sink`. Here's [a list of the Stats Sinks that come with Envoy](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statssink):

- envoy.stat_sinks.dog_statsd
- envoy.stat_sinks.graphite_statsd
- envoy.stat_sinks.hystrix
- envoy.stat_sinks.metrics_service
- envoy.stat_sinks.statsd
- envoy.stat_sinks.wasm -->


### Tag

Envoy metrics also support `tags` (also called `dimensions`). A tag is equivalent to a label on a Prometheus metric and can be understood as a categorical dimension.


Envoy's `metrics` are identified by canonical name strings. The dynamic parts (substrings) of these strings are extracted as `tags`. Tags can be customized by specifying [tag extraction rules (TagSpecifier configuration)](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier).

As an example:
```bash
### 1. The original Envoy metrics ###

$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats'

### Returns:
cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local.external.upstream_rq_2xx: 300

# where:
# - The `outbound|8080||fortio-server-l2.mark.svc.cluster.local` part is the name of the upstream cluster. It can be extracted as a tag.
# - The `2xx` part is the HTTP Status Code category. This can be extracted as a tag. The configuration of this extraction rule is described below.

### 2. Metrics for Prometheus ###
$ kubectl exec fortio-server -c istio-proxy -- curl 'localhost:15000/stats?format=prometheus' | grep 'outbound|8080||fortio-server-l2' | grep 'external_upstream_rq'

### Returns:
envoy_cluster_external_upstream_rq{response_code_class="2xx",cluster_name="outbound|8080||fortio-server-l2.mark.svc.cluster.local"} 300

```
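The extraction shown above can be sketched in a few lines of Python. This is only an illustration, not Envoy's implementation: the two regular expressions, the helper names `extract_tags` and `to_prometheus_name`, and the simplified Prometheus naming are assumptions made for this example. Each rule follows the TagSpecifier convention of two capture groups, where the first group is the substring removed from the name and the second group is the tag value.

```python
import re

# Hypothetical rules for this example only (not Envoy's built-in regexes).
# Group 1: substring to remove from the stat name. Group 2: the tag value.
TAG_RULES = [
    ("cluster_name", re.compile(r"^cluster\.((.+?)\.)external\.")),
    ("response_code_class", re.compile(r"_rq(_(\dxx))$")),
]


def extract_tags(stat_name):
    """Return (tag-extracted name, tags) for one canonical stat name."""
    tags = {}
    removals = []
    # Every rule is matched against the original, unmodified name.
    for tag_name, pattern in TAG_RULES:
        match = pattern.search(stat_name)
        if match:
            tags[tag_name] = match.group(2)
            removals.append(match.span(1))
    # Remove the matched substrings only after all rules have run,
    # from right to left so that earlier offsets stay valid.
    for start, end in sorted(removals, reverse=True):
        stat_name = stat_name[:start] + stat_name[end:]
    return stat_name, tags


def to_prometheus_name(stat_name):
    """Simplified naming: add the envoy_ prefix and replace dots."""
    return "envoy_" + stat_name.replace(".", "_")


raw = "cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local.external.upstream_rq_2xx"
name, tags = extract_tags(raw)
print(to_prometheus_name(name), tags)
```

Deferring the removal until every rule has matched mirrors the documented behavior that a matched substring is not stripped immediately, so several rules can inspect the same part of the name.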

### Metrics data types

Envoy emits three types of values as statistics:

- **Counters**: unsigned integers that only increase. For example, total requests.
- **Gauges**: unsigned integers that can both increase and decrease. For example, currently active requests.
- **Histograms**: unsigned integers that form a stream of values, which the collector then aggregates into percentile summaries (the usual P50/P99/Pxx). For example, `Upstream` response time.

In Envoy's internal implementation, Counters and Gauges are batched and flushed periodically to improve performance. Histograms are written as they are received.
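To make the difference between the three types concrete, here is a minimal Python sketch. The classes are hypothetical teaching models, not Envoy's C++ stats classes: the counter keeps a pending delta that a periodic flush would collect, and the histogram simply stores raw samples and uses a nearest-rank percentile, which is an assumption of this example rather than how Envoy computes its quantiles.

```python
import math


class Counter:
    """Monotonic value; latch() returns the increase since the last flush."""

    def __init__(self):
        self.value = 0
        self._pending = 0

    def inc(self, amount=1):
        self.value += amount
        self._pending += amount

    def latch(self):
        delta, self._pending = self._pending, 0
        return delta


class Gauge:
    """Value that can move in both directions."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

    def dec(self, amount=1):
        self.value -= amount


class Histogram:
    """Stream of samples summarized as percentiles on demand."""

    def __init__(self):
        self.samples = []

    def record(self, sample):
        self.samples.append(sample)

    def percentile(self, p):
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        # Nearest-rank method: smallest sample with at least p% at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


requests_total = Counter()
requests_active = Gauge()
response_time_ms = Histogram()

for latency_ms in range(1, 101):
    requests_total.inc()
    requests_active.inc()
    response_time_ms.record(latency_ms)
    requests_active.dec()

print(requests_total.latch(), requests_active.value, response_time_ms.percentile(99))
```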


## Metrics Interpretation

Metrics can be categorized by where they are produced:
- cluster manager: L3/L4/L7 metrics for `upstream`.
- http connection manager (HCM): L7 metrics for `upstream` & `downstream`.
- listeners: L3/L4 metrics for `downstream`.
- server (global)
- watch dog

Below, I briefly explain only a selection of the key performance metrics.

### cluster manager

[Envoy documentation:cluster manager stats](https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats)

The documentation above already covers these metrics in some detail. I'll only add some points to focus on when performance tuning. So which metrics deserve attention in general?

Let's analyze this with the well-known [Utilization Saturation and Errors (USE)](https://www.brendangregg.com/usemethod.html) method.

Utilization:
 - `upstream_cx_total` (Counter): total number of connections
 - `upstream_rq_active`

Saturation:
 - `upstream_rq_time` (Histogram): response latency
 - `upstream_cx_connect_ms` (Histogram)
 - `upstream_cx_rx_bytes_buffered`
 - `upstream_cx_tx_bytes_buffered`
 - `upstream_rq_pending_total` (Counter)
 - `upstream_rq_pending_active` (Gauge)
 - `circuit_breakers.*cx_open`
 - `circuit_breakers.*cx_pool_open`
 - `circuit_breakers.*rq_pending_open`
 - `circuit_breakers.*rq_open`
 - `circuit_breakers.*rq_retry_open`
 
Error:
 - `upstream_cx_connect_fail` (Counter): number of connection failures
 - `upstream_cx_connect_timeout` (Counter): number of connection timeouts
 - `upstream_cx_overflow` (Counter): number of times the cluster's connection circuit breaker overflowed
 - `upstream_cx_pool_overflow`
 - `upstream_cx_destroy_local_with_active_rq`
 - `upstream_cx_destroy_remote_with_active_rq`
 - `upstream_rq_timeout`
 - `upstream_rq_retry`
 - `upstream_rq_rx_reset`
 - `upstream_rq_tx_reset`
 - `upstream_rq_pending_overflow` (Counter): total number of requests that failed because they overflowed the connection pool, or (mainly for HTTP/2 and above) because the request circuit breaker tripped

Other:
 - `upstream_rq_total` (Counter): total requests; its rate gives TPS (throughput)
 - `upstream_cx_destroy_local` (Counter): number of connections closed locally, by Envoy
 - `upstream_cx_destroy_remote` (Counter): number of connections closed by the remote end
 - `upstream_cx_length_ms` (Histogram)
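Most of the metrics above are counters, so they become useful only as rates or ratios between two scrapes. The following Python sketch shows one way to do that from the plain-text output of the admin `/stats` endpoint. The helper names are hypothetical, the sample values are invented for illustration, and the parser assumes the `name: value` line format shown earlier in this chapter, skipping any line whose value is not a plain integer (such as histogram summaries).

```python
def parse_stats(text):
    """Parse 'name: value' lines, keeping only plain integer values."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(": ")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def counter_rate(before, after, name, interval_s):
    """Per-second rate of a counter between two scrapes."""
    return (after.get(name, 0) - before.get(name, 0)) / interval_s


def error_ratio(before, after, error_name, total_name):
    """Share of new requests that hit the given error counter."""
    total = after.get(total_name, 0) - before.get(total_name, 0)
    errors = after.get(error_name, 0) - before.get(error_name, 0)
    return errors / total if total > 0 else 0.0


# Invented sample data for two scrapes taken 10 seconds apart.
PREFIX = "cluster.outbound|8080||fortio-server-l2.mark.svc.cluster.local."
first_scrape = (
    f"{PREFIX}upstream_rq_total: 300\n"
    f"{PREFIX}upstream_rq_timeout: 0\n"
    f"{PREFIX}upstream_rq_time: P0(nan,0) P50(nan,1.05)\n"
)
second_scrape = (
    f"{PREFIX}upstream_rq_total: 600\n"
    f"{PREFIX}upstream_rq_timeout: 3\n"
    f"{PREFIX}upstream_rq_time: P0(nan,0) P50(nan,1.05)\n"
)

before = parse_stats(first_scrape)
after = parse_stats(second_scrape)
tps = counter_rate(before, after, PREFIX + "upstream_rq_total", 10)
timeouts = error_ratio(
    before, after, PREFIX + "upstream_rq_timeout", PREFIX + "upstream_rq_total"
)
print(tps, timeouts)
```

In practice a monitoring system such as Prometheus performs this rate calculation; the sketch only shows what the calculation amounts to.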


### http connection manager(HCM)

[Envoy docs:http connection manager(HCM) stats](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats)

These can be thought of as L7 metrics for `downstream` and some `upstream` traffic.

Utilization:
 - `downstream_cx_total` 
 - `downstream_cx_active`
 - `downstream_cx_http1_active`
 - `downstream_rq_total`
 - `downstream_rq_http1_total`
 - `downstream_rq_active`


Saturation:
 - `downstream_cx_rx_bytes_buffered` 
 - `downstream_cx_tx_bytes_buffered`
 - `downstream_flow_control_paused_reading_total`
 - `downstream_flow_control_resumed_reading_total`


Error:
 - `downstream_cx_destroy_local_active_rq`
 - `downstream_cx_destroy_remote_active_rq`
 - `downstream_rq_rx_reset`
 - `downstream_rq_tx_reset`
 - `downstream_rq_too_large`
 - `downstream_rq_max_duration_reached`
 - `downstream_rq_timeout`
 - `downstream_rq_overload_close`
 - `rs_too_large`

Other:
 - `downstream_cx_destroy_remote` 
 - `downstream_cx_destroy_local`
 - `downstream_cx_length_ms`

### listeners

[Envoy docs:listener stats](https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats)

These can be regarded as L3/L4 metrics for the downstream.

Utilization:
 - `downstream_cx_total` 
 - `downstream_cx_active`


Saturation:
 - `downstream_pre_cx_active`


Error:
 - `downstream_cx_transport_socket_connect_timeout`
 - `downstream_cx_overflow` 
 - `no_filter_chain_match`
 - `downstream_listener_filter_error`
 - `no_certificate`

Other:
 - `downstream_cx_length_ms` 


### server

Envoy basic info metrics

[Envoy docs:server stats](https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/statistics)

Utilization:
 - `concurrency` 


Error:
 - `days_until_first_cert_expiring`


### watch dog

[Envoy docs: Watchdog](https://www.envoyproxy.io/docs/envoy/latest/operations/performance)

Envoy also includes a configurable watchdog system that emits statistics and can optionally terminate the server when Envoy stops responding. The system has two separate watchdog configurations, one for the main thread and one for the worker threads, because the different threads have different workloads. These statistics help to understand, at a high level, whether Envoy's event loop is unresponsive because it is doing too much work, is blocked, or is not being scheduled by the operating system.

Saturation:
 - `watchdog_mega_miss` (Counter): number of mega misses
 - `watchdog_miss` (Counter): number of misses

If you are interested in the watchdog mechanism, see:
> https://github.com/envoyproxy/envoy/issues/11391
> https://github.com/envoyproxy/envoy/issues/11388


### Event loop 
[Envoy documentation: Event loop](https://www.envoyproxy.io/docs/envoy/latest/operations/performance)

The Envoy architecture is designed to optimize scalability and resource utilization by running the event loop on a small number of threads. The `"main"` thread is responsible for control plane processing, and each `"worker"` thread shares a portion of the data plane tasks. Envoy exposes two statistics to monitor the performance of all these threaded event loops.

Loop duration: each iteration of the event loop executes a number of tasks, and that number varies with load. However, if one or more threads show unusually long-tailed loop durations, there may be a performance issue. For example, work may be unevenly distributed between worker threads, or a long blocking operation in an extension may be impeding task progress.

Polling delay: in each iteration of the event loop, the event scheduler polls for I/O events and wakes up the thread when an I/O event is ready or a timeout occurs, whichever comes first. In the timeout case, we can measure the difference between the expected wakeup time and the actual wakeup time after polling; this difference is the `polling delay`. A small `polling delay` is normal, usually about the kernel scheduler's `time slice` or `quantum` (which depends on the operating system running Envoy), but if the value is significantly higher than its normally observed baseline, the kernel scheduler is probably experiencing delays.
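The idea of a polling delay can be demonstrated outside Envoy with a small Python sketch. It is only an analogy under stated assumptions: `time.sleep` stands in for the event loop blocking in its poller with a timeout, and the function name `measure_poll_delay` is made up for this example. The measured numbers depend entirely on the machine and its load.

```python
import time


def measure_poll_delay(timeout_s):
    """Return how late (in seconds) a timed wait woke up."""
    expected_wakeup = time.monotonic() + timeout_s
    time.sleep(timeout_s)  # stands in for polling with a timeout
    actual_wakeup = time.monotonic()
    # Clock granularity can produce tiny negative values; clamp them to zero.
    return max(0.0, actual_wakeup - expected_wakeup)


delays_us = [measure_poll_delay(0.01) * 1_000_000 for _ in range(20)]
print(f"max delay: {max(delays_us):.0f} us, min delay: {min(delays_us):.0f} us")
```

Running this on an idle machine and again on a heavily loaded one gives a feel for how scheduler pressure shows up as a longer tail in the delay.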

To enable these statistics, set [enable_dispatcher_stats](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#envoy-v3-api-field-config-bootstrap-v3-bootstrap-enable-dispatcher-stats) to `true`.

- The event scheduler for the `main` thread has a statistics tree rooted at `server.dispatcher.`.
- The event scheduler for each `worker` thread has a statistics tree rooted at `listener_manager.worker_<id>.dispatcher.`.

Each tree has the following statistics:


| Name             | Type      | Description                         |
| ---------------- | --------- | ----------------------------------- |
| loop_duration_us | Histogram | Event loop duration in microseconds |
| poll_delay_us    | Histogram | Polling delay in microseconds       |

Note that this does not include any auxiliary threads (threads that are neither the main thread nor a worker thread).
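Putting the roots and the table together, the full set of dispatcher statistic names can be enumerated as in the Python sketch below. The helper `dispatcher_stat_names` is hypothetical, and numbering the workers from 0 is an assumption of this example; the bootstrap fragment only shows the single field named above.

```python
import json

# Bootstrap fragment that turns the dispatcher statistics on.
bootstrap_fragment = {"enable_dispatcher_stats": True}

DISPATCHER_STATS = ("loop_duration_us", "poll_delay_us")


def dispatcher_stat_names(worker_count):
    """List the dispatcher histograms for the main thread and each worker."""
    roots = ["server.dispatcher."]
    # Assumption: worker ids run from 0 to worker_count - 1.
    roots += [
        f"listener_manager.worker_{worker_id}.dispatcher."
        for worker_id in range(worker_count)
    ]
    return [root + stat for root in roots for stat in DISPATCHER_STATS]


print(json.dumps(bootstrap_fragment))
for stat_name in dispatcher_stat_names(2):
    print(stat_name)
```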

```{hint}
Watch Dog and Event loop are both tools for monitoring and troubleshooting event processing delays and timing. There are many details and stories here, reaching all the way down to the Linux kernel. Hopefully there will be time later in the book to explore and analyze these interesting details with you.
```

## Configuration

```{hint}
If you read the introduction to this book {ref}`index:What this book is not`, it says it's not a "user's manual", so why is it talking about configuration? Well, all I can say is that it's better to start with understanding how to use it, and then learn how to implement it, than to come straight to the source code.

This section draws on:
[Envoy Documentation](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto)
```

### config.bootstrap.v3.Bootstrap

[Envoy docs:config.bootstrap.v3.Bootstrap proto](https://github.com/envoyproxy/envoy/blob/255af425e1d51066cc8b69a39208b70e18d07073/api/envoy/config/bootstrap/v3/bootstrap.proto#L44)

```
{
  "node": {...},
  "static_resources": {...},
  "dynamic_resources": {...},
  "cluster_manager": {...},
  "stats_sinks": [],
  "stats_config": {...},
  "stats_flush_interval": {...},
  "stats_flush_on_admin": ...,
...
}
```

```{hint}
What is a `stats sink`? This book does not explain it. Istio does not customize that configuration by default. Only the parts of the configuration that concern us are covered below.
```


- stats\_config
([config.metrics.v3.StatsConfig](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsconfig)) Configuration for internal processing of statistics.

- stats\_flush\_interval
([Duration](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration)) Interval between flushes to the `stats sink`. For performance reasons, Envoy does not flush in real time: it latches counters and flushes counters and gauges only at this periodic interval. If not specified, the default value is 5000 milliseconds. Only one of `stats_flush_interval` or `stats_flush_on_admin` can be set. Duration must be at least 1 millisecond and at most 5 minutes.


- stats\_flush\_on\_admin
([bool](https://developers.google.com/protocol-buffers/docs/proto#scalar)) Flush statistics to the `sink` only when queried on the `admin interface`. If set, no flush timer is created. Only one of `stats_flush_on_admin` or `stats_flush_interval` can be set.
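The constraints on the two flush settings can be summarized in a short Python sketch. The function `effective_flush_interval` is a hypothetical helper written for this chapter, not part of Envoy. It assumes the bootstrap is available as a dict in its JSON form, where a protobuf `Duration` is written as a string of seconds such as `"10s"`.

```python
DEFAULT_FLUSH_INTERVAL_S = 5.0   # 5000 milliseconds
MIN_FLUSH_INTERVAL_S = 0.001     # 1 millisecond
MAX_FLUSH_INTERVAL_S = 300.0     # 5 minutes


def effective_flush_interval(bootstrap):
    """Return the flush interval in seconds, or None for flush-on-admin."""
    interval = bootstrap.get("stats_flush_interval")
    on_admin = bootstrap.get("stats_flush_on_admin", False)
    if interval is not None and on_admin:
        raise ValueError(
            "only one of stats_flush_interval or stats_flush_on_admin can be set"
        )
    if on_admin:
        return None  # no periodic flush timer is created
    if interval is None:
        return DEFAULT_FLUSH_INTERVAL_S
    # Assumption: the Duration uses its JSON string form, e.g. "10s".
    seconds = float(interval.rstrip("s"))
    if not MIN_FLUSH_INTERVAL_S <= seconds <= MAX_FLUSH_INTERVAL_S:
        raise ValueError("stats_flush_interval must be between 1ms and 5min")
    return seconds


print(effective_flush_interval({}))
print(effective_flush_interval({"stats_flush_interval": "10s"}))
print(effective_flush_interval({"stats_flush_on_admin": True}))
```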

### config.metrics.v3.StatsConfig

[Envoy docs:config-metrics-v3-statsconfig](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#config-metrics-v3-statsconfig)

```
{
  "stats_tags": [],
  "use_all_default_tags": {...},
  "stats_matcher": {...},
  "histogram_bucket_settings": []
}
```

- stats_tags - dimension extraction rules (corresponds to Prometheus label extraction)
  (**Multiple** [config.metrics.v3.TagSpecifier](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier)) Each `metrics name string` is processed independently by these tag rules. When a tag matches, the first capture group is not immediately removed from the name, so later [TagSpecifiers](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-tagspecifier) can also match the same portion. After all tag matching is complete, the matched portions are removed from the `metrics name string`, and the result is used as the metric name for the `stats sink`, e.g., the metric name for Prometheus.

- use_all_default_tags
  (BoolValue) Use all of the default tag regular expressions built into Envoy. They can be combined with the custom tags specified in `stats_tags` and are processed before the custom tags. Istio defaults to false.

- stats_matcher
  ([config.metrics.v3.StatsMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-msg-config-metrics-v3-statsmatcher)) Specifies which metrics Envoy outputs. Supports `include`/`exclude` rules. If not provided, all metrics are output. Excluding sets of metrics that are not needed can slightly improve Envoy's performance.


### config.metrics.v3.StatsMatcher

[Envoy docs:config-metrics-v3-statsmatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#config-metrics-v3-statsmatcher)

Configuration for disabling/enabling the calculation and output of statistical metrics.

```
{
  "reject_all": ...,
  "exclusion_list": {...},
  "inclusion_list": {...}
}
```

- reject_all
  ([bool](https://developers.google.com/protocol-buffers/docs/proto#scalar)) If `reject_all` is true, disable all statistics. If `reject_all` is false, all statistics are enabled.

- exclusion_list
  ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Exclusion list: all statistics are enabled except those that match one of the listed string matchers.

- inclusion_list
  ([type.matcher.v3.ListStringMatcher](https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/matcher/v3/string.proto#envoy-v3-api-msg-type-matcher-v3-liststringmatcher)) Inclusion list: no statistics are enabled except those that match one of the listed string matchers.
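How the three fields decide whether a statistic is enabled can be sketched in Python as follows. This is a simplified model, not Envoy's code: the functions `string_matches` and `stat_enabled` are hypothetical, the matcher is assumed to be a dict mirroring the JSON form of the configuration, and only the `exact`, `prefix`, and `suffix` string matchers are handled (regular expressions are left out).

```python
def string_matches(matcher, name):
    """Evaluate one simplified StringMatcher against a stat name."""
    if "exact" in matcher:
        return name == matcher["exact"]
    if "prefix" in matcher:
        return name.startswith(matcher["prefix"])
    if "suffix" in matcher:
        return name.endswith(matcher["suffix"])
    raise ValueError(f"unsupported matcher in this sketch: {matcher}")


def stat_enabled(stats_matcher, name):
    """Decide whether a stat is enabled under a StatsMatcher-like dict."""
    if stats_matcher.get("reject_all"):
        return False
    if "inclusion_list" in stats_matcher:
        patterns = stats_matcher["inclusion_list"]["patterns"]
        return any(string_matches(m, name) for m in patterns)
    if "exclusion_list" in stats_matcher:
        patterns = stats_matcher["exclusion_list"]["patterns"]
        return not any(string_matches(m, name) for m in patterns)
    return True  # nothing configured: every stat is enabled


only_cluster_stats = {
    "inclusion_list": {
        "patterns": [{"prefix": "cluster."}, {"suffix": "downstream_rq_total"}]
    }
}
for stat in ("cluster.foo.upstream_rq_total", "server.uptime"):
    print(stat, stat_enabled(only_cluster_stats, stat))
```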


```{note}
This section references: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/statistics

The next section shows an example of how Istio uses the configuration above.
```