Metrics and Monitoring

See the Configuration section for information on configuring metrics collection for the Kloudless appliance. By default, the local Telegraf process collects all metrics and stores them in the local InfluxDB instance. To view the collected metrics, open Chronograf at http://appliance_hostname:8888/. Chronograf provides an easy way to view the available metrics, create dashboards, and configure simple alerts on the embedded Kapacitor instance.

Dashboards

After directing your browser to http://appliance_hostname:8888/, navigate to DASHBOARDS on the left navigation bar. To create a dashboard:

  • Navigate to Create Dashboard.
  • Use the bright blue + Add Cell button at the top to create a new dashboard cell.
  • Rename the added dashboard cell using the dropdown at the top right of the cell.
  • Edit the cell to select the series and graph type.

For more information about using Chronograf's dashboards, please refer to the Chronograf documentation.

Time-series data

All of the time-series data available in Chronograf is stored in InfluxDB. By default, the data is retained for two weeks. This section lists a subset of the metrics that are likely to be most immediately useful for monitoring and usage tracking. See Appendix F for a full reference of the available metrics. In the metric names below, prefix stands for the prefix configured in /data/kloudless.yml, which defaults to the empty string ('').

API Usage Metrics

The following metrics give insight into usage of the API provided by the appliance over time:

  • prefix.celery.metrics.active_application_count (gauge) Count of active applications.
  • prefix.celery.metrics.active_developer_count (gauge) Count of active developers.
  • prefix.celery.metrics.active_credential_count (gauge) Count of active connected accounts.
  • prefix.celery.stats.api_requests (event) Tracking of individual API requests. Each event in this measurement tracks the application, account ID, status, and path of the request.
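
As one way to work with these measurements outside Chronograf, the sketch below builds an InfluxQL query that counts API request events per hour and wraps it in a URL for the InfluxDB 1.x HTTP /query endpoint. It is an illustration, not a documented Kloudless procedure: the database name telegraf, port 8086 (InfluxDB's default), and the measurement name as it appears with an empty prefix are all assumptions, so check the real names in Chronograf before relying on it.

```python
from urllib.parse import urlencode


def build_influxql(measurement, lookback="1d", bucket="1h"):
    """Return an InfluxQL query that counts points per time bucket."""
    # COUNT(*) counts the non-null values of each field in the measurement.
    return (
        f'SELECT COUNT(*) FROM "{measurement}" '
        f"WHERE time > now() - {lookback} GROUP BY time({bucket})"
    )


def build_query_url(host, database, measurement, port=8086):
    """Return a URL for the InfluxDB 1.x HTTP /query endpoint."""
    params = urlencode({"db": database, "q": build_influxql(measurement)})
    return f"http://{host}:{port}/query?{params}"


# Hypothetical values: an empty prefix and a database named "telegraf".
url = build_query_url("appliance_hostname", "telegraf", "celery.stats.api_requests")
print(url)
```

The printed URL can be fetched with any HTTP client that can reach InfluxDB. Whether the appliance exposes that port to other hosts is not stated here and should be verified first.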

Appliance Health Metrics

The following metrics give insight into the health and performance of the appliance and can help locate slow-downs in different parts of it:

  • prefix.api.health_check_* (timing) Time taken to perform health checks of various parts of the appliance (e.g. testing whether the database is reachable). Slow-downs here most likely warrant investigation.
  • prefix.celery.utils.celery_task_rtt (timing) Time for Celery tasks to complete, as seen by the process requesting the task. If this is high, there could be slow-downs in either the message broker or the worker processes themselves.
  • prefix.api.celery_queue_length (gauge) Length of each Celery task queue. If the queues consistently have tasks waiting in them, there is likely a slow-down within the workers. Alternatively, the deployment may need to be scaled up.
  • prefix.celery.credentials.requests.request_handler_rtt (timing) Time taken for tasks to execute on the worker processes. If this is abnormally high, the worker processes are likely under higher than average load, or requests to the upstream services are taking longer than usual.
  • prefix.interfaces.upstream_request_rtt (timing) Duration of requests to upstream services. Not all services are instrumented. If this is abnormally high, it might indicate an issue with the upstream service provider.
  • prefix.error.celery.credentials.update.credential_deactivation (event) Accounts deactivated due to refresh failures. This can be used to alert on account deactivations.

System Status

Telegraf also records a number of measurements of the appliance's system statistics, which are useful for standard host monitoring:

  • prefix.cpu (gauge) Tracks CPU usage percentages by category (e.g. idle, steal).
  • prefix.mem (gauge) Tracks free and used memory, both as raw size and as a percentage.
  • prefix.sysstat_* (gauge) Tracks various sysstat statistics that are useful for monitoring system resource utilization (e.g. disk I/O, memory utilization).

For a full reference of the available metrics, please refer to Appendix F.

Monitoring

Chronograf has a simple built-in integration with Kapacitor, a continuous query engine that makes up the monitoring/alerting part of the TICK stack. The alerts configurable from within the Chronograf UI are limited to threshold, relative change, and "deadman" alerts. They can be managed by going to Alerting > Alert Rules in the sidebar.
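
To make the three rule types concrete, the sketch below expresses each one as a plain Python predicate. It only illustrates the general idea behind threshold, relative change, and deadman checks; it is not how Kapacitor or Chronograf implement them, and the function names and sample numbers are hypothetical.

```python
def threshold_alert(value, limit):
    """Fire when the latest value exceeds a fixed limit."""
    return value > limit


def relative_change_alert(previous, current, max_percent):
    """Fire when the value changed by more than max_percent versus an earlier value."""
    if previous == 0:
        # A percentage is undefined here; treat any move away from zero as a change.
        return current != 0
    change = abs(current - previous) / abs(previous) * 100
    return change > max_percent


def deadman_alert(seconds_since_last_point, max_silence_seconds):
    """Fire when no data has arrived within the allowed window."""
    return seconds_since_last_point > max_silence_seconds


# Hypothetical readings, e.g. a task queue length sampled at two points in time.
print(threshold_alert(12, limit=10))                       # queue longer than 10
print(relative_change_alert(100, 160, max_percent=50))     # value moved by 60%
print(deadman_alert(600, max_silence_seconds=300))         # silent for 10 minutes
```

A threshold rule suits gauges such as queue length, a relative change rule suits timings whose normal level varies between deployments, and a deadman rule catches the case where metrics stop arriving altogether.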

New rules can be created by clicking the blue "Create Rule" button on the Alert Rules page. For a more detailed walkthrough of how to create alerting rules via Chronograf, please see the Chronograf documentation. Chronograf and Kapacitor support several outputs for sending alerts, such as webhooks and chat messages. The alert contents support simple templating, which is also covered in the Chronograf documentation.

For more complex monitoring and alerting, it is recommended that you use TICKscript. Scripts can be managed from the appliance's management shell using the kapacitor command. For example:

# listing existing scripts (tasks)
kapacitor list tasks
# loading a TICKscript
kapacitor define my_task -tick /path/to/script.tick
# starting a defined task
kapacitor enable my_task
# stopping a running task
kapacitor disable my_task
# getting info about a task
kapacitor show my_task
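
As a rough end-to-end illustration, the Python sketch below renders a small TICKscript, writes it to disk, and registers it using the same kapacitor define and kapacitor enable commands listed above. Everything specific in it is an assumption made for the example: the database telegraf, the retention policy autogen, the measurement cpu, the field usage_idle, the threshold, and the log path. It also assumes a Kapacitor version that accepts a dbrp statement inside the script. Adjust the names to match what Chronograf shows for your appliance.

```python
import shutil
import subprocess
from pathlib import Path

# TICKscript template: a stream task that logs a critical alert when a
# field drops below a threshold. All concrete names are filled in later.
TICK_TEMPLATE = """\
dbrp "{database}"."{retention_policy}"

stream
    |from()
        .measurement('{measurement}')
    |alert()
        .crit(lambda: "{field}" < {threshold:.1f})
        .log('{log_path}')
"""


def render_tickscript(database, retention_policy, measurement, field, threshold, log_path):
    """Fill the template; the threshold is rendered as a float literal."""
    return TICK_TEMPLATE.format(
        database=database,
        retention_policy=retention_policy,
        measurement=measurement,
        field=field,
        threshold=threshold,
        log_path=log_path,
    )


def register_task(name, script_text, directory):
    """Write the script and, if the kapacitor CLI is available, define and enable it."""
    path = Path(directory) / f"{name}.tick"
    path.write_text(script_text)
    if shutil.which("kapacitor") is None:
        return path  # CLI not found: leave the file for manual inspection
    subprocess.run(["kapacitor", "define", name, "-tick", str(path)], check=True)
    subprocess.run(["kapacitor", "enable", name], check=True)
    return path


# Hypothetical values for illustration only.
script = render_tickscript(
    "telegraf", "autogen", "cpu", "usage_idle", 10, "/tmp/cpu_idle_alert.log"
)
print(script)
```

Calling register_task("cpu_idle_low", script, "/tmp") from the management shell would then load and start the task, after which kapacitor show cpu_idle_low reports its status.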

Kloudless maintains a repository of sample TICKscripts that can be used as a starting point for enabling monitoring of your appliance/deployment. If you have any questions about integrating the appliance's built-in monitoring capabilities with your existing infrastructure, please reach out to support@kloudless.com.