# Appendix C: Metrics Reference
Each metric is prefixed with a customizable prefix which is set appliance-wide. Metric names also vary slightly from collector to collector in terms of punctuation:
- InfluxDB metric names do not have periods (.) in them. Instead, periods are
replaced with underscores. For example,
prefix.celery.metrics.active_application_count would be recorded as
prefix_celery_metrics_active_application_count.
- StatsD does not support tags by default. To ensure tags are preserved, they
are appended to the metric name. For example,
prefix.celery.metrics.active_application_count, which has tags=service, could
be recorded as prefix.celery.metrics.active_application_count.service_box if
the metric was tagged with service=box.
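The two naming rules above can be sketched as small helper functions. This is only an illustration: the function names are hypothetical, and the ordering used when a metric carries several tags is an assumption, since only the single-tag case is described here.

```python
def influxdb_name(metric: str) -> str:
    """Replace periods with underscores, as described for InfluxDB."""
    return metric.replace(".", "_")


def statsd_name(metric: str, tags: dict) -> str:
    """Append each tag to the metric name as a `.key_value` suffix.

    Assumption: tags are appended in the order given. The ordering the
    appliance uses for multiple tags is not specified in this appendix.
    """
    suffix = "".join(f".{key}_{value}" for key, value in tags.items())
    return metric + suffix


name = "prefix.celery.metrics.active_application_count"
print(influxdb_name(name))
print(statsd_name(name, {"service": "box"}))
```

When searching a StatsD backend for a tagged metric, a helper like this can be used to build the expected name from the base name and the tag values of interest.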
Metrics are collected in the three general StatsD types: gauge, counter, and
timing. They are collected by Telegraf or StatsD, summarized depending on the
type, and periodically flushed to the time-series backend. Each of those metrics
measures a single value. In addition, there are event metrics that represent
individual data points and are written directly to the time-series backend.
All of the metrics are tagged with extra information that provides additional
details on the context or source of the information. Telegraf adds a hostname
tag, set to the hostname of the appliance, to all the metrics it collects. This
is particularly useful for the system-related checks as it identifies the
origin of the data points.
# Metric Types
- gauge - A gauge value indicates the last accurate value of the metric. The
  gauge value at a given time is the value for that metric at that point in
  time. For example, disk pct_util is a gauge for the percentage of disk space
  used. To get the current value of the metric, look at the latest point
  recorded.
- counter - A counter value indicates the number of events for a specific flush
  interval. Each value for a counter at a given time is the count of that
  metric over the flush interval that covers that point in time. For example,
  request_count is a counter for the number of API requests made per flush
  interval.
- timing - A timing value indicates a timing for a specific process or function
  call. The timing metric at a given time consists of measurements over the
  flush interval that covers that point in time. This includes:
  - count - the number of times the timing was measured in the flush interval
  - mean - mean of the timings in the flush interval
  - upper - longest timing in the flush interval
  - lower - shortest timing in the flush interval
  - stddev - standard deviation of the timings in the flush interval

  For example, celery_task_rtt is the time taken for a given celery task to
  complete.
- event - An event represents an individual time-series data point and is not
  aggregated by flush interval in the same way as the other types. Metrics of
  this type are not available when an external StatsD service is configured.
  For example, the api_request measurement tracks information about the
  performance and status of incoming API requests.
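To make the timing summary concrete, the following sketch computes the five values listed for a timing metric from the raw measurements of one flush interval. The function name and sample values are hypothetical. Using the population standard deviation is an assumption, as this appendix does not say which variant the collectors report.

```python
import statistics


def summarize_timings(samples_ms):
    """Summarize the timings recorded during a single flush interval."""
    if not samples_ms:
        # Nothing was measured in this interval.
        return {"count": 0}
    return {
        "count": len(samples_ms),
        "mean": statistics.fmean(samples_ms),
        "upper": max(samples_ms),
        "lower": min(samples_ms),
        # Assumption: population standard deviation.
        "stddev": statistics.pstdev(samples_ms),
    }


summary = summarize_timings([120.0, 80.0, 100.0])
print(summary)
```

A counter for the same interval would simply be the number of events observed, and a gauge would be the last value seen.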
# Counts & Statistics
Format of metric entries:
Metric Name
(Type)
tags and fields, where applicable
Description
prefix.celery.metrics.active_application_count
(Type gauge)
Count of active applications
prefix.celery.metrics.active_developer_count
(Type gauge)
Count of active developers
prefix.celery.metrics.active_credential_count
(Type gauge)
tags=service
Count of active accounts
prefix.celery.credentials.requests.request_count
(Type counter)
tags=application_id, credential_id, service, task
Count of tasks executed
prefix.celery.credentials.requests.credential_request_count
(Type counter)
tags=credential_id, service
Count of worker tasks for a specific connected account
prefix.celery.credentials.requests.non_credential_request_count
(Type counter)
tags=service, task
Count of worker tasks for a specific service, but not connected to a
specific account
prefix.api.notification_count
(Type counter)
tags=service
Count of incoming webhook notifications from upstream services
prefix.api.celery_queue_length
(Type gauge)
tags=queue
Length of each celery task queue. If the queues consistently have tasks
waiting on them, there is likely a slow-down occurring within the workers.
Alternatively, the deployment may need to be scaled up.
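One way to turn "consistently have tasks waiting" into a check is to look at the most recent gauge samples for a queue. The sketch below is illustrative only: the function name, the window of five samples, and the threshold of zero are assumptions rather than values recommended by this appendix.

```python
def queue_is_backed_up(samples, window=5, threshold=0):
    """Return True if the last `window` gauge samples all exceed `threshold`.

    `samples` is a chronological list of celery_queue_length values for
    one queue. Fewer than `window` samples is treated as inconclusive.
    """
    recent = samples[-window:]
    return len(recent) == window and all(value > threshold for value in recent)


print(queue_is_backed_up([0, 3, 4, 6, 2, 5]))
print(queue_is_backed_up([4, 0, 3, 0, 2]))
```

A queue that occasionally drains to zero, as in the second call, is not flagged. Only a sustained backlog is.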
prefix.api.health_check_api
(Type timing)
tags=status
Timing for API server health check (always 0 unless status is not OK)
prefix.api.health_check_celery_queues
(Type timing)
tags=status
Time taken to check Redis queue lengths
prefix.api.health_check_db
(Type timing)
tags=status
Time taken to check database connectivity
prefix.api.health_check_task
(Type timing)
tags=status
Time taken for a task to confirm that worker processes are reachable
prefix.celery.credentials.requests.request_handler_rtt
(Type timing)
tags=task
Time taken for the task to execute once it reaches the worker processes. If this
is abnormally high, then it is likely that there is higher than average load on
the worker processes or requests to the upstream services are taking longer than
usual.
prefix.celery.credentials.requests.client_request_rtt
(Type timing)
tags=application_id, credential_id, service, task
Time taken for task execution on worker after retrieving account data from the
database.
prefix.celery.credentials.requests.db_credential_request_rtt
(Type timing)
tags=task
Time to retrieve account information from the database during task execution
prefix.celery.utils.celery_task_rtt
(Type timing)
tags=task
Time for celery tasks to complete as seen by process requesting the task
prefix.interfaces.upstream_request_rtt
(Type timing)
tags=application_id, credential_id, service, task
Time for upstream API requests to complete. Not all services are instrumented.
If this is abnormally high, then it might indicate an issue with the upstream
service provider.
prefix.celery.credential_recentd.daemon_job_delay
(Type timing)
The average delay in event retrieval commencing for accounts beyond the
configured polling interval. If this is growing, contact Kloudless Support to
update the daemon configuration to add more co-routines.
# Code behavior & Success/Failure/Errors
prefix.celery.utils.timeout_failure
(Type counter)
tags=task
Count of CeleryTimeoutError exceptions. Indicates how many tasks are failing to
complete within a reasonable amount of time (60s for most requests). Tagged by
task name.
prefix.celery.utils.hard_failure
(Type counter)
tags=task
Count of unhandled celery task failures. Tagged by task name.
prefix.celery.utils.success
(Type counter)
tags=task
Count of tasks executed successfully. Tagged by task name.
prefix.celery.credentials.requests.dry_run_probability
(Type gauge)
Probability that event collection will not save results. This will always be 0
unless otherwise configured.
prefix.lib.interfaces.poolcache.get_pool_count
(Type counter)
tags=type
Count of requests to internal client pool cache. Each count is tagged as hit,
miss, uncached, or error.
prefix.api.notifications_throttle_percent
(Type gauge)
The current probability that an incoming notification from a cloud service
will be throttled. This will be 0 unless the notifications_throttle_probability
is set via the Administrative Portal.
prefix.api.unhandled_exceptions
(Type counter)
Count of unhandled exceptions in API server
prefix.error.celery.credentials.update.credential_deactivation
(Type event)
tags=application_id, credential_id, service
fields=error_id
Deactivated accounts due to refresh failures
prefix.api.oauth_failure
(Type counter)
tags=application_id, service, type
Count of authentication failures. Tagged by application ID, service, and failure
type.
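The success, hard_failure, and timeout_failure counters in this section can be combined into per-task failure rates for a flush interval. The sketch below assumes the three counters describe mutually exclusive outcomes of the same set of tasks, which this appendix does not state explicitly. The function name is hypothetical.

```python
def failure_rates(success, hard_failure, timeout_failure):
    """Return the share of tasks that failed, split by failure kind.

    Assumption: every task is counted in exactly one of the three
    counters for the interval. Returns None when no tasks were counted.
    """
    total = success + hard_failure + timeout_failure
    if total == 0:
        return None
    return {
        "hard_failure": hard_failure / total,
        "timeout_failure": timeout_failure / total,
    }


rates = failure_rates(success=90, hard_failure=4, timeout_failure=6)
print(rates)
```

Because all three counters are tagged by task name, the same calculation can be applied per task to find which tasks contribute most to the failures.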
# API Requests
prefix.celery.stats.api_requests
(Type event)
tags=credential_id, method, status, application_id
fields=request_id, start, duration, query, remote_address, path,
x_forwarded_for, tx_bytes, rx_bytes
Information about incoming API requests.
Values:
- request_id: ID of the individual request
- start: Unix timestamp of the start of the request
- duration: duration of the API request in ms
- query: the query string used in the request
- remote_address: remote address of the HTTP request
- path: the path of the request
- x_forwarded_for: the X-Forwarded-For header
- tx_bytes: size of the outgoing response in bytes
- rx_bytes: size of the incoming request in bytes
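Because these events are individual data points rather than pre-aggregated values, any summary has to be computed when reading them back. The sketch below groups events by the status tag and reports a count and mean duration for each. The dictionary layout and the sample status values are assumptions about how events might look once queried from the time-series backend, not a documented format.

```python
from collections import defaultdict


def summarize_requests(events):
    """Group API request events by status; report count and mean duration (ms)."""
    durations = defaultdict(list)
    for event in events:
        durations[event["status"]].append(event["duration"])
    return {
        status: {
            "count": len(values),
            "mean_duration_ms": sum(values) / len(values),
        }
        for status, values in durations.items()
    }


# Hypothetical events carrying only the tag and field used here.
events = [
    {"status": "200", "duration": 120.0},
    {"status": "200", "duration": 80.0},
    {"status": "500", "duration": 300.0},
]
report = summarize_requests(events)
print(report)
```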
# System
The fields listed below are the basic system stats collected by Telegraf. They
are used in the default host dashboards and alerts provided by Chronograf. More
detailed system information is provided by the sysstat plugin, included with
Telegraf. All of the sysstat-related metrics have the prefix sysstat_. Please
refer to man sar (https://linux.die.net/man/1/sar) for full details. Note:
Since these are collected by Telegraf, they will not be collected/sent if a
remote StatsD server is configured.
prefix.cpu
(Type gauge)
fields=usage_idle, usage_steal
- usage_idle - cpu %idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
- usage_steal - cpu %steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
prefix.sysstat_disk
(Type gauge)
fields=pct_util
- pct_util - disk %utilization: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.
prefix.sysstat_io
(Type gauge)
fields=bread_per_s, bwrtn_per_s, rtps, wtps, tps
- bread_per_s - blocks read per second: Total amount of data read from the devices in blocks per second. Blocks are equivalent to sectors with 2.4 kernels and newer and therefore have a size of 512 bytes. With older kernels, a block is of indeterminate size.
- bwrtn_per_s - blocks written per second: Total amount of data written to devices in blocks per second.
- rtps - read transfers per second: Total number of read requests per second issued to physical devices.
- wtps - write transfers per second: Total number of write requests per second issued to physical devices.
- tps - transfers per second: Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.
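Since bread_per_s and bwrtn_per_s are reported in blocks, it can help to convert them to a byte-based rate when comparing against disk throughput limits. The sketch below uses the 512-byte block size given above for 2.4 kernels and newer. The function name is hypothetical.

```python
BLOCK_SIZE_BYTES = 512  # block size stated above for 2.4 kernels and newer


def blocks_to_kib_per_s(blocks_per_s: float) -> float:
    """Convert a blocks-per-second rate to KiB per second."""
    return blocks_per_s * BLOCK_SIZE_BYTES / 1024


print(blocks_to_kib_per_s(2048))
```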
prefix.sysstat_mem_util
(Type gauge)
fields=kbbuffers, kbcached, kbmemfree, kbmemused
- kbbuffers - Amount of memory used as buffers by the kernel in kilobytes.
- kbcached - Amount of memory used to cache data by the kernel in kilobytes.
- kbmemfree - Amount of free memory available in kilobytes.
- kbmemused - Amount of used memory in kilobytes. This does not take into account memory used by the kernel itself.

%mem usage is derived as (kbmemused - kbbuffers - kbcached) / (kbmemused + kbmemfree) x 100%
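The %mem usage derivation can be sketched as a small function. It assumes the formula reads (kbmemused - kbbuffers - kbcached) / (kbmemused + kbmemfree) x 100%, that is, used memory excluding buffers and cache as a share of total memory. That reading, and the function name, are assumptions and should be checked against the dashboards in use.

```python
def mem_used_percent(kbmemused, kbbuffers, kbcached, kbmemfree):
    """Percentage of memory in use, excluding kernel buffers and cache.

    Assumption: total memory is kbmemused + kbmemfree.
    """
    total = kbmemused + kbmemfree
    if total == 0:
        raise ValueError("kbmemused + kbmemfree must be greater than zero")
    return (kbmemused - kbbuffers - kbcached) / total * 100.0


print(mem_used_percent(kbmemused=6000, kbbuffers=500, kbcached=1500, kbmemfree=2000))
```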
prefix.sysstat_network
(Type gauge)
fields=packet_per_s, rxkB_per_s, rxpck_per_s, txkB_per_s, txpck_per_s
- packet_per_s - Number of network packets received per second.
- rxkB_per_s - Total number of kilobytes received per second.
- rxpck_per_s - Total number of packets received per second.
- txkB_per_s - Total number of kilobytes transmitted per second.
- txpck_per_s - Total number of packets transmitted per second.