# Appendix C: Metrics Reference

Each metric is prefixed with a customizable prefix which is set appliance-wide. Metric names also vary slightly from collector to collector in terms of punctuation:

InfluxDB metric names do not have periods (.) in them. Instead, periods are replaced with underscores. For example, prefix.celery.metrics.active_application_count would be recorded as prefix_celery_metrics_active_application_count.
StatsD does not support tags by default. To ensure tags are preserved, they are appended to the metric name. For example, prefix.celery.metrics.active_application_count, which has tags=service, could be recorded as prefix.celery.metrics.active_application_count.service_box if the metric was tagged with service=box.
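
The two naming rules above can be illustrated with a minimal sketch. The helper names influxdb_name and statsd_name are hypothetical, and the handling of more than one tag (order and separator) is an assumption, since only the single-tag case is described here.

```python
def influxdb_name(metric: str) -> str:
    """Replace periods with underscores, as described for InfluxDB."""
    return metric.replace(".", "_")


def statsd_name(metric: str, tags: dict[str, str]) -> str:
    """Append each tag as a `key_value` suffix to the metric name.

    Assumption: multiple tags are appended in the order given, each
    separated by a period. Only the single-tag case is documented.
    """
    suffixes = [f"{key}_{value}" for key, value in tags.items()]
    return ".".join([metric, *suffixes])


name = "prefix.celery.metrics.active_application_count"
print(influxdb_name(name))
print(statsd_name(name, {"service": "box"}))
```

With the example metric from the text, the first call prints the underscore form and the second prints the name with service_box appended.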

Metrics are collected in the three general StatsD types: gauge, counter, and timing. They are collected by Telegraf or StatsD, summarized depending on the type and periodically flushed to the time-series backend. Each of those metrics measures a single value. In addition, there are event metrics that represent individual data points and are written directly to the time-series.

All of the metrics are tagged with extra information that provides additional details on the context or source of the data. Telegraf adds a hostname tag, set to the hostname of the appliance, to all the metrics it collects. This is particularly useful for the system-related checks, as it identifies the origin of the data points.

# Metric Types

gauge - A gauge value indicates the last accurate value of the metric. The gauge value at a given time is the value for that metric at that point in time. For example, disk pct_util is a gauge for the percentage of disk space used. To get the current value of the metric, look at the latest point recorded.

counter - A counter value indicates the number of events for a specific flush interval. Each value for a counter at a given time is the count of that metric over the flush interval that covers that point in time. For example, request_count is a counter for the number of API requests made per flush interval.

timing - A timing value indicates a timing for a specific process or function call. The timing metric at a given time consists of measurements over the flush interval that covers that point in time. This includes:
count - the number of times the timing was measured in the flush interval
mean - mean of the timings in the flush interval
upper - longest timing in the flush interval
lower - shortest timing in the flush interval
stddev - standard deviation of the timings in the flush interval

For example, celery_task_rtt is the time taken for a given celery task to complete.
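
As a rough illustration of how the timing statistics for one flush interval relate to the raw measurements, the sketch below summarizes a list of samples. The function name summarize_timings is hypothetical, and the use of the population standard deviation is an assumption; the collector's exact formula is not stated here.

```python
import statistics


def summarize_timings(samples_ms: list[float]) -> dict[str, float]:
    """Summarize the timings recorded during a single flush interval."""
    if not samples_ms:
        # Nothing was measured in this interval.
        return {"count": 0}
    return {
        "count": len(samples_ms),
        "mean": statistics.fmean(samples_ms),
        "upper": max(samples_ms),   # longest timing
        "lower": min(samples_ms),   # shortest timing
        # Assumption: population standard deviation.
        "stddev": statistics.pstdev(samples_ms),
    }


summary = summarize_timings([120.0, 80.0, 100.0])
print(summary)
```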

event - An event represents an individual time-series data point and is not aggregated by flush interval in the same way as the other types. Metrics of this type are not available when an external StatsD service is configured. For example, the api_request measurement tracks information about the performance and status of incoming API requests.

# Counts & Statistics

Format of metric entries:
Metric Name (Type)
Description

prefix.celery.metrics.active_application_count (Type gauge)
Count of active applications

prefix.celery.metrics.active_developer_count (Type gauge)
Count of active developers

prefix.celery.metrics.active_credential_count (Type gauge)
tags=service
Count of active accounts

prefix.celery.credentials.requests.request_count (Type counter)
tags=application_id, credential_id, service, task
Count of tasks executed

prefix.celery.credentials.requests.credential_request_count (Type counter)
tags=credential_id, service
Count of worker tasks for a specific connected account

prefix.celery.credentials.requests.non_credential_request_count (Type counter)
tags=service, task
Count of worker tasks for a specific service that are not connected to a specific account

prefix.api.notification_count (Type counter)
tags=service
Count of incoming webhook notifications from upstream services

prefix.api.celery_queue_length (Type gauge)
tags=queue
Length of each celery task queue. If the queues consistently have tasks waiting on them, there is likely a slow-down occurring within the workers. Alternatively, the deployment may need to be scaled up.
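
One way to turn "consistently have tasks waiting" into a check is to look at the most recent gauge samples for a queue. This is only a sketch: the function name queue_is_backed_up and the window of five samples are assumptions, not values defined by the appliance.

```python
def queue_is_backed_up(lengths: list[int], min_samples: int = 5) -> bool:
    """Return True if every one of the last `min_samples` readings is non-zero.

    `lengths` holds celery_queue_length gauge values for one queue,
    ordered oldest to newest.
    """
    recent = lengths[-min_samples:]
    # Require a full window so a short history does not trigger an alert.
    return len(recent) == min_samples and all(n > 0 for n in recent)


print(queue_is_backed_up([0, 3, 4, 2, 5, 1]))  # last five readings are all > 0
print(queue_is_backed_up([3, 0, 2, 5, 1]))     # one reading dropped to 0
```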

prefix.api.health_check_api (Type timing)
tags=status
Timing for API server health check (always 0 unless status is not OK)

prefix.api.health_check_celery_queues (Type timing)
tags=status
Time taken to check Redis queue lengths

prefix.api.health_check_db (Type timing)
tags=status
Time taken to check database connectivity

prefix.api.health_check_task (Type timing)
tags=status
Time taken to run a task that ensures worker processes are reachable

prefix.celery.credentials.requests.request_handler_rtt (Type timing)
tags=task
Time taken for the task to execute once it reaches the worker processes. If this is abnormally high, then it is likely that there is higher than average load on the worker processes or requests to the upstream services are taking longer than usual.

prefix.celery.credentials.requests.client_request_rtt (Type timing)
tags=application_id, credential_id, service, task
Time taken for task execution on worker after retrieving account data from the database.

prefix.celery.credentials.requests.db_credential_request_rtt (Type timing)
tags=task
Time to retrieve account information from the database during task execution

prefix.celery.utils.celery_task_rtt (Type timing)
tags=task
Time for celery tasks to complete as seen by process requesting the task

prefix.interfaces.upstream_request_rtt (Type timing)
tags=application_id, credential_id, service, task
Time for upstream API requests to complete. Not all services are instrumented. If this is abnormally high, then it might indicate an issue with the upstream service provider.

prefix.celery.credential_recentd.daemon_job_delay (Type timing)
The average delay in event retrieval commencing for accounts beyond the configured polling interval. If this is growing, contact Kloudless Support to update the daemon configuration to add more co-routines.

# Code behavior & Success/Failure/Errors

prefix.celery.utils.timeout_failure (Type counter)
tags=task
Count of CeleryTimeoutError exceptions. Indicates how many tasks are failing to complete within a reasonable amount of time (60s for most requests). Tagged by task name.

prefix.celery.utils.hard_failure (Type counter)
tags=task
Count of unhandled celery task failures. Tagged by task name.

prefix.celery.utils.success (Type counter)
tags=task
Count of tasks executed successfully. Tagged by task name.
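
The three counters above can be combined into a per-task failure rate for a flush interval. The sketch below assumes that success, hard_failure, and timeout_failure are mutually exclusive and together cover every task outcome, which is not stated in this reference; the function name is hypothetical.

```python
def task_failure_rate(success: int, hard_failure: int, timeout_failure: int) -> float:
    """Fraction of tasks in an interval that failed or timed out.

    Assumption: the three counters never overlap for the same task.
    """
    total = success + hard_failure + timeout_failure
    if total == 0:
        return 0.0  # no tasks ran in this interval
    return (hard_failure + timeout_failure) / total


print(task_failure_rate(success=90, hard_failure=6, timeout_failure=4))
```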

prefix.celery.credentials.requests.dry_run_probability (Type gauge)
Probability that event collection will not save results. This will always be 0 unless otherwise configured.

prefix.lib.interfaces.poolcache.get_pool_count (Type counter)
tags=type
Count of requests to internal client pool cache. Each count is tagged as hit, miss, uncached, or error.

prefix.api.notifications_throttle_percent (Type gauge)
The current probability that an incoming notification from a cloud service will be throttled. This will be 0 unless notifications_throttle_probability is set via the Administrative Portal.

prefix.api.unhandled_exceptions (Type counter)
Count of unhandled exceptions in API server

prefix.error.celery.credentials.update.credential_deactivation (Type event)
tags=application_id, credential_id, service
fields=error_id
Deactivated accounts due to refresh failures

prefix.api.oauth_failure (Type counter)
tags=application_id, service, type
Count of authentication failures. Tagged by application ID, service, and failure type.

# API Requests

prefix.celery.stats.api_requests (Type event)
tags=credential_id, method, status, application_id
fields=request_id, start, duration, query, remote_address, path, x_forwarded_for, tx_bytes, rx_bytes

Information about incoming API requests.
Values:

request_id: ID of the individual requests
start: unix timestamp of the start of the request
duration: duration of the API request in ms
query: the query-string used in the request
remote_address: remote address of the HTTP request
path: the path of the request
x_forwarded_for: the X-Forwarded-For header
tx_bytes: Size of the outgoing response in bytes.
rx_bytes: Size of the incoming request in bytes.
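
Because each API request is recorded as an individual event, summaries have to be computed by whoever reads the data. As a sketch, the snippet below averages duration (in ms) per status tag. Representing events as Python dictionaries, the function name, and the sample status values are all assumptions made for illustration; how the points are retrieved from the time-series backend is not covered.

```python
from collections import defaultdict


def mean_duration_by_status(events: list[dict]) -> dict[str, float]:
    """Average the `duration` field (ms) of events, grouped by `status` tag."""
    totals = defaultdict(lambda: [0.0, 0])  # status -> [sum of durations, count]
    for event in events:
        bucket = totals[event["status"]]
        bucket[0] += event["duration"]
        bucket[1] += 1
    return {status: total / count for status, (total, count) in totals.items()}


# Hypothetical sample events; only the fields used here are shown.
events = [
    {"status": "200", "duration": 120.0},
    {"status": "200", "duration": 80.0},
    {"status": "500", "duration": 900.0},
]
print(mean_duration_by_status(events))
```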

# System

The fields listed below are the basic system stats collected by Telegraf. They are used in the default host dashboards and alerts provided by Chronograf. More detailed system information is provided by the sysstat plugin, included with Telegraf. All of the sysstat-related metrics have the prefix sysstat_. Please refer to man sar (https://linux.die.net/man/1/sar) for full details. Note: Since these are collected by Telegraf, they will not be collected or sent if a remote StatsD server is configured.

prefix.cpu (Type gauge)
fields=usage_idle, usage_steal

usage_idle - cpu %idle
Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
usage_steal - cpu %steal
Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

prefix.sysstat_disk (Type gauge)
fields=pct_util

pct_util - disk %utilization
Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

prefix.sysstat_io (Type gauge)
fields=bread_per_s, bwrtn_per_s, rtps, wtps, tps

bread_per_s - blocks read per second
Total amount of data read from the devices in blocks per second. Blocks are equivalent to sectors with 2.4 kernels and newer and therefore have a size of 512 bytes. With older kernels, a block is of indeterminate size.
bwrtn_per_s - blocks written per second
Total amount of data written to devices in blocks per second.
rtps - read transfers per second
Total number of read requests per second issued to physical devices.
wtps - write transfers per second
Total number of write requests per second issued to physical devices.
tps - transfers per second
Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.
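
Since the block counts are reported in 512-byte sectors on current kernels, converting bread_per_s or bwrtn_per_s to bytes per second is a single multiplication. The helper below is a hypothetical convenience, not part of the appliance.

```python
BLOCK_SIZE_BYTES = 512  # sector size stated above for 2.4 kernels and newer


def blocks_to_bytes_per_s(blocks_per_s: float) -> float:
    """Convert a bread_per_s or bwrtn_per_s reading to bytes per second."""
    return blocks_per_s * BLOCK_SIZE_BYTES


# 2048 blocks/s is 1 MiB/s.
print(blocks_to_bytes_per_s(2048))
```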

prefix.sysstat_mem_util (Type gauge)
fields=kbbuffers, kbcached, kbmemfree, kbmemused

kbbuffers - Amount of memory used as buffers by the kernel in kilobytes.
kbcached - Amount of memory used to cache data by the kernel in kilobytes.
kbmemfree - Amount of free memory available in kilobytes.
kbmemused - Amount of used memory in kilobytes. This does not take into account memory used by the kernel itself.

%mem usage is derived as (kbmemused - kbbuffers - kbcached) / (kbmemused + kbmemfree) x 100%
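
The memory usage percentage can be sketched as a small function over the four sysstat_mem_util fields. This assumes total memory is kbmemused + kbmemfree and that buffers and cache are subtracted from used memory; treat it as an interpretation of the formula rather than the appliance's exact calculation.

```python
def mem_usage_percent(kbmemused: float, kbbuffers: float,
                      kbcached: float, kbmemfree: float) -> float:
    """Percentage of memory in use, excluding kernel buffers and cache.

    Assumption: total memory = kbmemused + kbmemfree.
    """
    total = kbmemused + kbmemfree
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty sample
    return (kbmemused - kbbuffers - kbcached) / total * 100.0


print(mem_usage_percent(kbmemused=6000, kbbuffers=500, kbcached=1500, kbmemfree=2000))
```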

prefix.sysstat_network (Type gauge)
fields=packet_per_s, rxkB_per_s, rxpck_per_s, txkB_per_s, txpck_per_s

packet_per_s - Number of network packets received per second.
rxkB_per_s - Total number of kilobytes received per second.
rxpck_per_s - Total number of packets received per second.
txkB_per_s - Total number of kilobytes transmitted per second.
txpck_per_s - Total number of packets transmitted per second.
