Appendix F: Metrics Reference

Each metric name begins with a customizable prefix that is set appliance-wide. Metric names also vary slightly in punctuation from collector to collector:

  • InfluxDB metric names do not have periods (.) in them. Instead, periods are replaced with underscores. For example, prefix.celery.metrics.active_application_count would be recorded as prefix_celery_metrics_active_application_count
  • StatsD does not support tags by default. To preserve tags, they are appended to the metric name. For example, prefix.celery.metrics.active_application_count with tags=service would be recorded as prefix.celery.metrics.active_application_count.service_box if the metric was tagged with service=box
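
The two naming rules above can be sketched in Python as follows. This is only an illustration of the transformations as described here, not the appliance's actual code; the function names are hypothetical, and the ordering used when several tags are present is an assumption, since only the single-tag case is documented.

```python
from typing import Mapping


def influxdb_name(metric: str) -> str:
    """Replace periods with underscores, as described for InfluxDB."""
    return metric.replace(".", "_")


def statsd_name(metric: str, tags: Mapping[str, str]) -> str:
    """Append each tag to the metric name as a ".key_value" suffix.

    Assumption: when several tags are present they are appended in
    sorted key order. The text above only shows a single tag.
    """
    suffix = "".join(f".{key}_{value}" for key, value in sorted(tags.items()))
    return metric + suffix


name = "prefix.celery.metrics.active_application_count"
print(influxdb_name(name))
print(statsd_name(name, {"service": "box"}))
```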

Metrics are collected in the three general StatsD types: gauge, counter, and timing. They are collected by Telegraf or StatsD, summarized depending on the type, and periodically flushed to the time-series backend. Each of those metrics measures a single value. In addition, there are event metrics that represent individual data points and are written directly to the time-series backend.

All of the metrics are tagged with extra information that provides additional details on the context and source of the data. Telegraf adds a hostname tag, set to the hostname of the appliance, to all the metrics it collects. This is particularly useful for the system-related checks, as it identifies the origin of the data points.

Metric Types

gauge - A gauge value indicates the last accurate value of the metric. The gauge value at a given time is the value for that metric at that point in time. For example, disk pct_util is a gauge for the percentage utilization of a disk device. To get the current value of the metric, look at the latest point recorded.

counter - A counter value indicates the number of events for a specific flush interval. Each value for a counter at a given time is the count of that metric over the flush interval that covers that point in time. For example, request_count is a counter for the number of API requests made per flush interval.

timing - A timing value indicates a timing for a specific process or function call. The timing metric at a given time consists of measurements over the flush interval that covers that point in time. This includes:

  • count - the number of times the timing was measured in the flush interval
  • mean - mean of the timings in the flush interval
  • upper - longest timing in the flush interval
  • lower - shortest timing in the flush interval
  • stddev - standard deviation of the timings in the flush interval

For example, celery_task_rtt is the time taken for a given celery task to complete.
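
To make the per-interval summaries concrete, the following Python sketch shows how the raw samples from one flush interval could be reduced for each of the three types. It is an illustration only and not how Telegraf or StatsD is implemented; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is reported.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_gauge(samples: List[float]) -> float:
    """A gauge reports the last value recorded in the interval."""
    return samples[-1]


def summarize_counter(increments: List[int]) -> int:
    """A counter reports the number of events in the interval."""
    return sum(increments)


def summarize_timing(timings_ms: List[float]) -> Dict[str, float]:
    """A timing reports several statistics over the interval.

    Assumes at least one timing was recorded in the interval.
    """
    return {
        "count": len(timings_ms),
        "mean": mean(timings_ms),
        "upper": max(timings_ms),
        "lower": min(timings_ms),
        # Assumption: population standard deviation (pstdev).
        "stddev": pstdev(timings_ms),
    }


print(summarize_gauge([41.0, 42.5]))
print(summarize_counter([1, 1, 1]))
print(summarize_timing([10.0, 20.0, 30.0]))
```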

event - An event represents an individual time-series data point and is not aggregated by flush interval in the same way as the other types. Metrics of this type are not available when an external StatsD service is configured. For example, the api_requests measurement tracks information about the performance and status of incoming API requests.

Counts & Statistics

Format of metric entries:
Metric Name (Type)
Description

prefix.celery.metrics.active_application_count (Type gauge)
Count of active applications

prefix.celery.metrics.active_developer_count (Type gauge)
Count of active developers

prefix.celery.metrics.active_credential_count (Type gauge)
tags=service
Count of active accounts

prefix.celery.credentials.requests.request_count (Type counter)
tags=application_id, credential_id, service, task
Count of tasks executed

prefix.celery.credentials.requests.credential_request_count (Type counter)
tags=credential_id, service
Count of worker tasks for a specific connected account

prefix.celery.credentials.requests.non_credential_request_count (Type counter)
tags=service, task
Count of worker tasks for a specific service, but not connected to a specific account

prefix.api.notification_count (Type counter)
tags=service
Count of incoming webhook notifications from upstream services

prefix.api.celery_queue_length (Type gauge)
tags=queue
Length of each celery task queue. If the queues consistently have tasks waiting on them, there is likely a slow-down occurring within the workers. Alternatively, the deployment may need to be scaled up.
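
As a rough illustration of what "consistently have tasks waiting" could mean in an alerting rule, the Python sketch below flags a queue whose most recent samples are all non-zero. The function name and the default of five consecutive samples are hypothetical choices, not values taken from the appliance.

```python
from typing import Sequence


def is_backlogged(queue_lengths: Sequence[int], min_consecutive: int = 5) -> bool:
    """Return True if the latest `min_consecutive` samples are all non-zero.

    `queue_lengths` holds successive celery_queue_length gauge values for
    one queue, oldest first. The threshold of 5 is an arbitrary example.
    """
    if len(queue_lengths) < min_consecutive:
        return False  # not enough history to call it consistent
    recent = queue_lengths[-min_consecutive:]
    return all(length > 0 for length in recent)


print(is_backlogged([0, 3, 4, 2, 1, 6]))
print(is_backlogged([3, 4, 0, 1, 6]))
```

A single non-zero reading is normal under load, which is why the check looks at a run of samples rather than the latest value alone.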

prefix.api.health_check_api (Type timing)
tags=status
Timing for API server health check (always 0 unless status is not OK)

prefix.api.health_check_celery_queues (Type timing)
tags=status
Time taken to check redis queue lengths

prefix.api.health_check_db (Type timing)
tags=status
Time taken to check database connectivity

prefix.api.health_check_task (Type timing)
tags=status
Time taken to run a task that ensures worker processes are reachable

prefix.celery.credentials.requests.request_handler_rtt (Type timing)
tags=task
Time taken for the task to execute once it reaches the worker processes. If this is abnormally high, then it is likely that there is higher than average load on the worker processes or requests to the upstream services are taking longer than usual.

prefix.celery.credentials.requests.client_request_rtt (Type timing)
tags=application_id, credential_id, service, task
Time taken for task execution on worker after retrieving account data from the database.

prefix.celery.credentials.requests.db_credential_request_rtt (Type timing)
tags=task
Time to retrieve account information from the database during task execution

prefix.celery.utils.celery_task_rtt (Type timing)
tags=task
Time for celery tasks to complete as seen by the process requesting the task

prefix.interfaces.upstream_request_rtt (Type timing)
tags=application_id, credential_id, service, task
Time for upstream API requests to complete. Not all services are instrumented. If this is abnormally high, then it might indicate an issue with the upstream service provider.

prefix.celery.credential_recentd.daemon_job_delay (Type timing)
The average delay in event retrieval commencing for accounts beyond the configured polling interval. If this is growing, contact Kloudless Support to update the daemon configuration to add more co-routines.

Code behavior & Success/Failure/Errors

prefix.celery.utils.timeout_failure (Type counter)
tags=task
Count of CeleryTimeoutError exceptions. Indicates how many tasks are failing to complete within a reasonable amount of time (60s for most requests). Tagged by task name.

prefix.celery.utils.hard_failure (Type counter)
tags=task
Count of unhandled celery task failures. Tagged by task name.

prefix.celery.utils.success (Type counter)
tags=task
Count of tasks executed successfully. Tagged by task name.

prefix.celery.credentials.requests.dry_run_probability (Type gauge)
Probability that event collection will not save results. This will always be 0 unless otherwise configured.

prefix.lib.interfaces.poolcache.get_pool_count (Type counter)
tags=type
Count of requests to internal client pool cache. Each count is tagged as hit, miss, uncached, or error.

prefix.api.notifications_throttle_percent (Type gauge)
The current probability that an incoming notification from a cloud service will be throttled. This will be 0 unless the notifications_throttle_probability is set via the Administrative Portal.

prefix.api.unhandled_exceptions (Type counter)
Count of unhandled exceptions in API server

prefix.error.celery.credentials.update.credential_deactivation (Type event)
tags=application_id, credential_id, service
fields=error_id
Deactivated accounts due to refresh failures

prefix.api.oauth_failure (Type counter)
tags=application_id, service, type
Count of authentication failures. Tagged by application ID, service, and failure type.

API Requests

prefix.celery.stats.api_requests (Type event)
tags=credential_id, method, status, application_id
fields=request_id, start, duration, query, remote_address, path, x_forwarded_for, tx_bytes, rx_bytes

Information about incoming API requests.
Values:

  • request_id: ID of the individual request
  • start: unix timestamp of the start of the request
  • duration: duration of the API request in ms
  • query: the query-string used in the request
  • remote_address: remote address of the HTTP request
  • path: the path of the request
  • x_forwarded_for: the X-Forwarded-For header
  • tx_bytes: size of the outgoing response in bytes
  • rx_bytes: size of the incoming request in bytes

System

The fields listed below are the basic system stats collected by Telegraf. They are used in the default host dashboards and alerts provided by Chronograf. More detailed system information is provided by the sysstat plugin, included with Telegraf. All of the sysstat-related metrics have the prefix sysstat_. Please refer to man sar (https://linux.die.net/man/1/sar) for full details. Note: Since these are collected by Telegraf, they will not be collected or sent if a remote StatsD server is configured.

prefix.cpu (Type gauge)
fields=usage_idle, usage_steal

  • usage_idle - cpu %idle
    Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
  • usage_steal - cpu %steal
    Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

prefix.sysstat_disk (Type gauge)
fields=pct_util

  • pct_util - disk %utilization
    Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

prefix.sysstat_io (Type gauge)
fields=bread_per_s, bwrtn_per_s, rtps, wtps, tps

  • bread_per_s - blocks read per second
    Total amount of data read from the devices in blocks per second. Blocks are equivalent to sectors with 2.4 kernels and newer and therefore have a size of 512 bytes. With older kernels, a block is of indeterminate size.
  • bwrtn_per_s - blocks written per second
    Total amount of data written to devices in blocks per second.
  • rtps - read transfers per second
    Total number of read requests per second issued to physical devices.
  • wtps - write transfers per second
    Total number of write requests per second issued to physical devices.
  • tps - transfers per second
    Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.

prefix.sysstat_mem_util (Type gauge)
fields=kbbuffers, kbcached, kbmemfree, kbmemused

  • kbbuffers - Amount of memory used as buffers by the kernel in kilobytes.
  • kbcached - Amount of memory used to cache data by the kernel in kilobytes.
  • kbmemfree - Amount of free memory available in kilobytes.
  • kbmemused - Amount of used memory in kilobytes. This does not take into account memory used by the kernel itself.

%mem usage is derived as (kbmemused - kbbuffers - kbcached) / (kbmemused + kbmemfree) x 100%
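
The derivation can be written as a small Python function. This sketch assumes the intended formula is used memory minus buffers and cache, divided by total memory (kbmemused + kbmemfree); the function name is hypothetical.

```python
def mem_usage_percent(
    kbmemused: float, kbbuffers: float, kbcached: float, kbmemfree: float
) -> float:
    """Percentage of total memory in use, excluding kernel buffers and cache.

    Assumption: total memory is kbmemused + kbmemfree.
    """
    total_kb = kbmemused + kbmemfree
    return (kbmemused - kbbuffers - kbcached) / total_kb * 100.0


# 6000 kB used, of which 500 kB buffers and 1500 kB cache; 2000 kB free.
print(mem_usage_percent(6000, 500, 1500, 2000))  # 50.0
```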

prefix.sysstat_network (Type gauge)
fields=packet_per_s, rxkB_per_s, rxpck_per_s, txkB_per_s, txpck_per_s

  • packet_per_s - Number of network packets received per second.
  • rxkB_per_s - Total number of kilobytes received per second.
  • rxpck_per_s - Total number of packets received per second.
  • txkB_per_s - Total number of kilobytes transmitted per second.
  • txpck_per_s - Total number of packets transmitted per second.