# Appliance Troubleshooting and Diagnosis
This section will describe common issues and indicate steps one can perform to diagnose and remedy them without having to contact a Kloudless engineer for assistance. A full list of useful commands is available in the Management Command Reference in Appendix D.
# Immediate Troubleshooting
Use this checklist to resolve any immediate issues affecting production if time is of the essence. Cluster SSH can be used to execute commands on several instances at once.
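If a dedicated cluster SSH utility is not available, a simple loop over the node hostnames achieves the same effect. This is a minimal sketch only; the hostnames below are placeholders for your appliance instances.

```
# Run the same diagnostic command on every node (replace the hostnames
# with the actual addresses of your instances).
for host in ke-node-1 ke-node-2 ke-node-3; do
    echo "== $host =="
    ssh "$host" 'supervisorctl status'
done
```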
- Check alerts for system metrics such as CPU, Memory and Disk Space to determine if resource scarcity is causing unavailability.
- Make a request to the instance while running `ke_logs` to view new log data, as it may point to the cause. Past log data can also be viewed at `/var/log/syslog`.
  - Errors related to `pgbouncer` usually indicate an issue with DB connectivity. See the “Database” section below for more information.
- If there is no clear cause and the API server is unreachable (500* status code) via a browser, run:

  ```
  sudo service nginx restart        # Restart nginx
  supervisorctl restart api:* devs  # Restart web servers
  ```
- If `curl http://localhost` works but the server is inaccessible (unable to make connections) externally, check firewall rules.
- If issues persist, restart all application processes:

  ```
  supervisorctl restart api:* devs
  # Restart serially to limit downtime.
  supervisorctl status | grep celery-worker | cut -f 1 -d ' ' | xargs -n1 supervisorctl restart
  ```
- If issues persist, re-configure the instance to ensure all processes possess the correct state. If the instance has internet connectivity, the following command can be used to re-sync any custom configuration specific to your instance:

  ```
  sudo ke_update_configuration
  ```

  Alternatively, salt can be called directly to ensure the instance is correctly configured without attempting to re-download the custom configuration. This will use the instance’s base configuration and the custom configuration already available from the last time it was retrieved:

  ```
  sudo salt-call --local state.highstate
  ```
- Retry your requests once the processes shown in `supervisorctl status` remain in the RUNNING state (see the sketch after this checklist). If the processes do not consistently remain in the RUNNING state for more than 30 seconds, proceed to the last step in this list to contact us. If any processes are in the UNKNOWN state, run the following commands:

  ```
  sudo service supervisor stop
  sudo killall -9 -u kloudless
  sudo killall -9 -u celery
  sudo service supervisor start
  ```

  Then verify the process state with the status command again.
- If issues persist, restart the instance to restart all system and application-level servers, daemons, and processes.
- If the instance continues to remain inaccessible, refer to the notes regarding External Services in the Inaccessibility section below for further steps.
- If issues persist, enable remote help by running `ke_remote_help start` and contact support@kloudless.com with the information output by it.
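For the step that requires processes to remain in the RUNNING state, the loop below is a minimal sketch for watching them over roughly 30 seconds; it assumes only the `supervisorctl` utility already used above.

```
# Poll process state every 5 seconds for ~30 seconds and report anything
# that is not RUNNING.
for i in $(seq 6); do
    supervisorctl status | awk '$2 != "RUNNING" {print "Not running:", $0}'
    sleep 5
done
```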
# Inaccessibility
If a curl request to the API or Developer Portal web servers does not return a response, this section provides further guidance on determining the cause. To begin, SSH into the instance. If you are unable to do so, the instance’s firewall rules may not be correct or the instance may be under high load.
# System Process Issues
If it appears processes on the system are not responsive, or it is unclear whether the services are running, please work through the Immediate Troubleshooting section above to confirm they are running.
# Firewall Rules
If `curl http://localhost` or `curl https://localhost` return a response successfully but the server is unable to make external connections (e.g. `curl https://kloudless.com` from the instance), or inbound connections do not succeed, check firewall rules.
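As a starting point, the commands below are a minimal sketch for inspecting the host-level firewall and listening ports. They assume standard Linux utilities (`iptables` and `ss`) and do not cover cloud-provider security groups or external load balancers, which should be checked separately.

```
sudo iptables -L -n -v                 # List host firewall rules and packet/byte counters
sudo ss -tlnp | grep -E ':(80|443) '   # Confirm the web server is listening on the expected ports
```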
# External Services
If the Kloudless appliance does not exhibit high CPU, I/O, or memory usage but is inaccessible or responding more slowly than usual, the issue might be due to non-responsive external services.
- If an external Database or Redis instance is used, check the health, number of connections, and attempt to connect to them via standard utilities (see the sketch after this list).
- Check the database storage capacity. If it is full, proceed with these steps:
  - Contact us and also begin steps to provision more space so that the database can be connected to and truncated to a smaller size. A 10% increase in disk capacity should be sufficient for immediate troubleshooting. Feel free to double the disk capacity if it is currently under 1 TB.
  - Halt all processes with the following command:

    ```
    supervisorctl stop all
    ```

  - Once you are able to connect to the database, locate the largest tables. For PostgreSQL, the query under “Finding the size of your biggest relations” on the PostgreSQL wiki can be used (see the sketch after this list).
  - If the largest table is `apimodels_event`, you can safely truncate it with `TRUNCATE apimodels_event;` to return disk space to the database server. NOTE: This will delete all currently collected event information.
  - You may then start services once more with `supervisorctl start all`.
- If running `sudo -l` does not return immediately and an external logging receiver is configured, the external logging server may not be processing log entries quickly enough, causing a system-wide slowdown.
  - As a temporary, immediate workaround, run the following commands to remove the external logging configuration and then restart rsyslog:

    ```
    sudo mv /etc/rsyslog.d/10-ke-logging.conf ~/
    sudo service rsyslog restart
    ```

    All services should now be accessible with no further changes required.
  - The Log Retention section of this configuration guide provides more information on updating the logging configuration at `/data/kloudless.yml` for a more permanent solution. The `salt-call` command mentioned earlier could then be used to sync the configuration.
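The following is a minimal sketch of the connectivity checks and the largest-relations query referenced above. The hostnames, username, and database name are placeholders for your external Redis and database endpoints, and the query shown is one common form of the PostgreSQL wiki query rather than the only one.

```
# Check Redis health and connection count (replace the host with your endpoint).
redis-cli -h redis.example.internal ping
redis-cli -h redis.example.internal info clients | grep connected_clients

# Check PostgreSQL connectivity and connection count.
psql -h db.example.internal -U kloudless -d kloudless -c 'SELECT 1;'
psql -h db.example.internal -U kloudless -d kloudless -c 'SELECT count(*) FROM pg_stat_activity;'

# List the ten largest relations to find what is consuming disk space.
psql -h db.example.internal -U kloudless -d kloudless -c "
  SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_catalog.pg_statio_user_tables
  ORDER BY pg_total_relation_size(relid) DESC
  LIMIT 10;"
```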
# Rate-Limiting
Both the upstream cloud service and the Kloudless appliance can rate-limit requests in the event of a high volume of API requests. An error response with a 429 status code will be returned if too many API requests are performed.
If Kloudless is responsible for the rate-limiting, the `error_code` returned in the body of the HTTP response will be `too_many_requests`. If the upstream service is responsible for rate-limiting requests, the error code will instead be `too_many_service_requests`.
The appliance’s default connection limit is 45 times the number of CPU cores. This represents the number of concurrent API requests that can be performed. This value is configurable if required.
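As an illustration only, the loop below sketches how a client could back off and retry when either form of 429 is returned. The endpoint path, host variable, and API key are placeholders rather than a confirmed appliance configuration.

```
# Retry a request with exponential backoff when a 429 response is returned.
url="https://$APPLIANCE_HOST/v1/accounts"    # placeholder endpoint
for attempt in 1 2 3 4 5; do
    status=$(curl -s -o /tmp/response.json -w '%{http_code}' \
             -H "Authorization: Bearer $API_KEY" "$url")
    [ "$status" != "429" ] && break
    # Distinguish appliance rate-limiting (too_many_requests) from
    # upstream rate-limiting (too_many_service_requests).
    grep -o '"error_code": *"[^"]*"' /tmp/response.json
    sleep $((2 ** attempt))                  # exponential backoff before retrying
done
```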
If you encounter rate-limiting by the upstream software service, please feel free to reach out to the upstream vendor to determine if rate limits for your developer application can be increased. Please also reach out to Kloudless to determine if a more efficient approach can be utilized.
# Invalid License Key
When there are issues validating or updating the license key on your appliance, the API gateway will show a page saying that your license key is invalid. If the license agreement is still valid and the associated Kloudless account is in good standing, then the issue should be solved by manually updating the license key on all nodes:
```
sudo ke_update_configuration
```
This should result in all of the application processes re-reading the new license key and resuming normal function.
# Unable to Connect Cloud Account
There are many possible causes of this issue and the following sections will outline common causes and their solutions.
# Misconfigured Service Keys
This will manifest as being able to select a service from the Kloudless account page, but receiving an error from the upstream cloud service either before or after the account’s credentials have been entered. The cloud storage provider’s error page will usually mention a `redirect_url` or something similar that needs to be configured in the storage service’s account page. There is documentation on how to configure these URLs in the developer dashboard on the appliance as well as on the main Kloudless website.
# Local Redis Runs Out of Memory
A locally configured Redis server may run out of memory in rare or unusual
circumstances. This error will result in the appliance presenting a 500 error
page when attempting to connect an account or perform certain other API actions.
The error can be definitively identified by examining the logs shown by the `ke_logs` utility on the primary node and looking for error messages containing "Redis" and "OOM". If this error is occurring, please record the current contents of the data store by running the following command:
```
redis-cli keys "*" > ~/redis-keys.log
```
This will help us determine the root cause of the problem as it should not be encountered under normal circumstances. The error can be temporarily resolved by increasing the value of `maxmemory` in `/etc/redis/redis.conf`. The change can then be applied by running `sudo service redis-server restart`. This change will be reverted when using the `ke_configure` command or any of the other configuration utilities, so please contact Kloudless as soon as possible if you encounter this issue.
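As an illustration of that temporary fix, the commands below are a sketch only; the `2gb` value is an arbitrary example and should be sized to the memory actually available on your instance.

```
# Compare current Redis memory usage against the configured limit.
redis-cli info memory | grep -E 'used_memory_human|maxmemory_human'

# Raise the limit (example value only) and restart Redis to apply it.
sudo sed -i 's/^maxmemory .*/maxmemory 2gb/' /etc/redis/redis.conf
sudo service redis-server restart
```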
# Unable to Update Kloudless
Please attempt to resolve the issue with the following steps, or contact Kloudless Support if you continue to experience difficulties:
- Confirm that a higher version is available via the Release Notes.
- Remember that `sudo ke_fetch_upgrade --system` updates to the latest minor release (x.y.0) rather than `sudo ke_fetch_upgrade`, which only updates to the latest patch release (x.y.z).
- Check your Kloudless Enterprise dashboard at https://kloudless.com/ to ensure the License Key in use is still valid. If it isn’t, contact Kloudless.
- Run `cat /etc/BUILD` to verify that it contains the type of the image (see the sketch below). Valid types include: amazon-ebs, virtualbox-ovf, docker, rackspace, azure, google. If it does not, run this command to populate the file: `echo virtualbox-ovf | sudo tee /etc/BUILD > /dev/null`. It is very important to replace `virtualbox-ovf` with the correct type of platform the instance is running in. For example, `amazon-ebs` if the instance is running in AWS.
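To make that check less error-prone, the snippet below is a small sketch that verifies `/etc/BUILD` contains one of the valid types listed above; it assumes only standard shell utilities.

```
# Verify /etc/BUILD contains one of the valid image types.
valid="amazon-ebs virtualbox-ovf docker rackspace azure google"
build=$(cat /etc/BUILD 2>/dev/null)
if [ -n "$build" ] && echo "$valid" | grep -qw -- "$build"; then
    echo "BUILD type OK: $build"
else
    echo "BUILD type missing or invalid: '$build'"
fi
```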
# Docker Issues
This section covers some common errors that have been encountered and how to resolve them.
```
FATA[0000] Error response from daemon: mkdir /var/lib/docker/overlay/c4a8f5e516d401534f2d994f5546f7e08639ffd675eb3573267f76d79394f172-init/merged/dev/shm: invalid argument
```
This issue typically arises when starting the container on a RedHat-based system with XFS and indicates that the current kernel is out of date. This issue should be resolved after upgrading to a later kernel version. For more information, please refer to the relevant Docker Issue.
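To confirm whether a host is affected, the commands below are a sketch assuming the overlay storage driver is in use on an XFS-backed `/var/lib/docker`; the exact kernel version required depends on your distribution.

```
uname -r                                              # Current kernel version
docker info 2>/dev/null | grep -i 'storage driver'    # Confirm the overlay driver is in use
# The overlay driver on XFS requires the filesystem to be formatted with ftype=1.
xfs_info "$(df --output=target /var/lib/docker | tail -n 1)" | grep ftype
```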