Appliance Troubleshooting and Diagnosis
This section describes common issues and the steps you can take to diagnose and remedy them without contacting a Kloudless engineer for assistance. A full list of useful commands is available in the Management Command Reference below.
Use this checklist to resolve any immediate issues affecting production if time is of the essence. Cluster SSH can be used to execute commands on several instances at once.
Check alerts for system metrics such as CPU, memory, and disk space to determine if resource scarcity is causing unavailability. Make a request to the instance while running ke_logs to view new log entries, as they may point to the cause. Past log data can also be viewed at
- Errors related to pgbouncer usually indicate an issue with DB connectivity. See the “Database” section below for more information.
- Errors related to
If there is no clear cause and the API server is unreachable (500* status code) via a browser, run:
sudo service nginx restart # Restart nginx
supervisorctl restart api devs # Restart web servers
If curl http://localhost works but the server is inaccessible externally (unable to make connections), check firewall rules.
If issues persist, restart all application processes:
supervisorctl restart all
If issues persist, re-configure the instance to ensure all processes possess the correct state. If the instance has internet connectivity, the following command can be used to re-sync any custom configuration specific to your instance:
Alternatively, salt can be called directly to ensure the instance is correctly configured without attempting to re-download the custom configuration. This will use the instance’s base configuration and the custom configuration already available from the last time it was retrieved:
sudo salt-call --local state.highstate
Retry your requests once the processes shown in supervisorctl status remain in the RUNNING state. If the processes do not consistently remain in the RUNNING state for more than 30 seconds, proceed to the last step in this list to contact us. If any processes are in the UNKNOWN state, run the following commands:
sudo service supervisor stop
sudo killall -9 -u kloudless
sudo killall -9 -u celery
sudo service supervisor start
Then the process state should be verified with the status command again.
If issues persist, restart the instance to restart all system and application-level servers, daemons, and processes.
If the instance continues to remain inaccessible, refer to the notes regarding External Services in the Inaccessibility section below for further steps.
If issues persist, enable remote help by running ke_remote_help start and contact firstname.lastname@example.org with the information it outputs.
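The process-state check in the checklist above can be scripted. The following is a minimal sketch, not part of the Kloudless tooling: not_running is a hypothetical helper name, and it assumes the usual supervisorctl status layout in which the process name is the first column and the state is the second.

```shell
# Hypothetical helper: read `supervisorctl status` output on stdin and
# print "name state" for every process whose state is not RUNNING.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2 }'
}

# Example usage on the appliance (prints nothing when every process is RUNNING):
#   supervisorctl status | not_running
```

Running the check twice, about 30 seconds apart, gives a rough indication of whether processes are staying in the RUNNING state or repeatedly restarting.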
If a curl request to the API or Developer Portal web servers does not return a response, this section provides further guidance on determining the cause. To begin, SSH into the instance. If you are unable to do so, the instance’s firewall rules may not be correct or the instance may be under high load.
System Process Issues
If processes on the system appear unresponsive, or if it is unclear whether the services are running, work through the Immediate Troubleshooting section above to confirm their state.
If curl http://localhost or curl https://localhost returns a response
successfully but the server is unable to make external connections (e.g.
curl https://kloudless.com from the instance), or inbound connections do not
succeed, check firewall rules.
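The local-versus-external comparison described above can be sketched as a pair of shell helpers. This is an illustration only: reachable and diagnose are hypothetical names, the 5-second timeout is an arbitrary choice, and the suggested next steps simply restate the guidance in this section.

```shell
# Hypothetical helper: succeed if the URL answers within 5 seconds.
reachable() {
    curl --silent --output /dev/null --max-time 5 "$1"
}

# Hypothetical helper: map the two probe results (0 = success) to a next step.
#   $1 = exit status of the local probe, $2 = exit status of the outbound probe
diagnose() {
    local_rc=$1
    external_rc=$2
    if [ "$local_rc" -ne 0 ]; then
        echo "local web server not responding: check application processes"
    elif [ "$external_rc" -ne 0 ]; then
        echo "local OK, outbound failing: check firewall rules"
    else
        echo "local and outbound OK: check inbound firewall rules"
    fi
}

# Example usage from an SSH session on the instance:
#   reachable http://localhost;      l=$?
#   reachable https://kloudless.com; e=$?
#   diagnose "$l" "$e"
```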
If the Kloudless appliance does not exhibit high CPU, I/O, or memory usage but is inaccessible or responding more slowly than usual, the issue might be due to non-responsive external services.
- If an external Database or Redis instance is used, check its health and number of connections, and attempt to connect to it via standard utilities.
- Check the database storage capacity. If it is full, proceed with these steps:
- Contact us and also begin steps to provision more space so that the database can be connected to and truncated to a smaller size. A 10% increase in disk capacity should be sufficient for immediate troubleshooting. Feel free to double disk capacity if disk capacity is < 1 TB.
- Halt all processes with the following command:
supervisorctl stop all
- Once you are able to connect to the database, locate the largest tables. For PostgreSQL, the query under “Finding the size of your biggest relations” on the PostgreSQL wiki can be used.
- If the table is apimodels_event, you can safely truncate this table with TRUNCATE apimodels_event; to return disk space to the database server. NOTE: This will delete all currently collected event information.
- You may then start services once more with supervisorctl start all.
- If running sudo -l does not return immediately and an external logging receiver is configured, the external logging server may not be processing log entries quickly enough, causing a system-wide slowdown.
- As a temporary, immediate workaround, run the following commands to remove the external logging configuration and then restart rsyslog:
sudo mv /etc/rsyslog.d/10-ke-logging.conf ~/
sudo service rsyslog restart
All services should now be accessible with no further changes required.
- The Log Retention section of this configuration guide provides more information on updating the logging configuration at /data/kloudless.yml for a more permanent solution. The salt-call command mentioned earlier could then be used to sync the configuration.
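For the step above that locates the largest tables in PostgreSQL, the following is a minimal sketch of one way to do it. It is not the exact query from the PostgreSQL wiki, only a similar one built on pg_total_relation_size. The helper name largest_tables_sql and the connection placeholders in the usage comment are hypothetical.

```shell
# Hypothetical helper: print a query listing the ten largest user tables
# by total on-disk size (table data plus indexes and TOAST data).
largest_tables_sql() {
    cat <<'SQL'
SELECT c.relname AS relation,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
SQL
}

# Example usage (DB_HOST, DB_USER, and DB_NAME are placeholders):
#   largest_tables_sql | psql -h DB_HOST -U DB_USER DB_NAME
```

If apimodels_event appears at the top of the result, the truncation step described above applies.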
Both the upstream cloud service and the Kloudless appliance can rate-limit requests when a high volume of API requests is performed. An error response with a 429 status code is returned when too many API requests are performed.
If Kloudless is responsible for the rate-limiting, the
error_code returned in
the body of the HTTP response will be
too_many_requests. If the upstream
service is responsible for rate-limiting requests, the error code will instead
The appliance’s default connection limit is 45 times the number of CPU cores. This represents the number of concurrent API requests that can be performed. This value is configurable if required.
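As a quick worked example of the default described above, the limit can be computed from the core count. This is a sketch of the arithmetic only; default_connection_limit is a hypothetical helper name, and the value actually in effect may differ if it has been reconfigured.

```shell
# Hypothetical helper: default concurrent-request limit for a given
# number of CPU cores (45 per core, per the default described above).
default_connection_limit() {
    echo $(( $1 * 45 ))
}

# Example usage with the number of online cores on this machine:
#   default_connection_limit "$(getconf _NPROCESSORS_ONLN)"
```

For instance, a 4-core instance would default to 180 concurrent API requests.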
If you encounter rate-limiting by the upstream software service, please feel free to reach out to the upstream vendor to determine if rate limits for your developer application can be increased. Please also reach out to Kloudless to determine if a more efficient approach can be utilized.
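A client can use the status code and error_code to decide which party applied the limit. The sketch below is an assumption-laden illustration: rate_limit_source is a hypothetical helper, and it assumes the response body is JSON containing an "error_code" field. Any 429 response without too_many_requests is treated as upstream rate-limiting.

```shell
# Hypothetical helper: classify a response by HTTP status and body.
#   $1 = HTTP status code, $2 = response body (assumed to be JSON)
rate_limit_source() {
    status=$1
    body=$2
    if [ "$status" != "429" ]; then
        echo "not-rate-limited"
    elif printf '%s' "$body" | grep -Eq '"error_code"[[:space:]]*:[[:space:]]*"too_many_requests"'; then
        echo "kloudless"    # the appliance itself applied the limit
    else
        echo "upstream"     # assumed to come from the upstream service
    fi
}
```

A caller might back off and retry in the "kloudless" case, and contact the upstream vendor about limits in the "upstream" case.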
Invalid License Key
When there are issues validating or updating the license key on your appliance, the API gateway will show a page saying that your license key is invalid. If the license agreement is still valid and the associated Kloudless account is in good standing, then the issue should be solved by manually updating the license key on all nodes:
This should result in all of the application processes re-reading the new license key and resuming normal function.
Unable to Connect Cloud Account
There are many possible causes of this issue and the following sections will outline common causes and their solutions.
Misconfigured Service Keys
This will manifest as being able to select a service from the Kloudless account
page, but receiving an error from the upstream cloud service either before or
after the account’s credentials have been entered. The cloud storage provider's
error page will usually mention a
redirect_url or something similar that needs
to be configured in the storage service's account page. There is documentation
on how to configure these URLs in the developer dashboard on the appliance as
well as on the main
Local Redis Runs Out of Memory
A locally configured Redis server may run out of memory in rare or unusual
circumstances. This error will result in the appliance presenting a 500 error
page when attempting to connect an account or perform certain other API actions.
The error can be definitively identified by examining the logs shown by the
ke_logs utility on the primary node and looking for error messages containing
"Redis" and "OOM". If this error is occurring, please record the current contents
of the data store by running the following command:
redis-cli keys "*" > ~/redis-keys.log
This will help us determine the root cause of the problem as it should not be
encountered under normal circumstances. The error can be temporarily resolved by
increasing the value of
/etc/redis/redis.conf. The change can
then be applied by running
sudo service redis-server restart. This change will
be reverted when using the
ke_configure command or any of the other
configuration utilities, so please contact Kloudless as soon as possible if you
encounter this issue.
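The log search described above can be narrowed with a small filter. This is a sketch under assumptions: redis_oom_lines is a hypothetical helper, and the usage line assumes ke_logs writes log entries to standard output so that they can be piped.

```shell
# Hypothetical helper: keep only lines that mention Redis (any case)
# and also contain the literal string "OOM".
redis_oom_lines() {
    grep -i 'redis' | grep 'OOM'
}

# Example usage on the primary node (assumes ke_logs prints to stdout):
#   ke_logs | redis_oom_lines
```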
Unable to Update Kloudless
Please attempt to resolve the issue with the following steps, or contact Kloudless Support if you continue to experience difficulties:
- Confirm that a higher version is available via the Release Notes.
- Remember that sudo ke_fetch_upgrade --system updates to the latest minor release (x.y.0) rather than sudo ke_fetch_upgrade, which only updates to the latest patch release (x.y.z).
- Check your Kloudless Enterprise dashboard at https://kloudless.com/ to ensure the License Key in use is still valid. If it isn’t, contact Kloudless.
- Run cat /etc/BUILD to verify that it contains the type of the image. Valid types include: amazon-ebs, virtualbox-ovf, docker, rackspace, azure, google. If it does not, run this command to populate the file: echo virtualbox-ovf | sudo tee /etc/BUILD > /dev/null. It is very important to replace virtualbox-ovf with the correct type of platform the instance is running in. For example, use amazon-ebs if the instance is running in AWS.
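The /etc/BUILD check in the last step can be sketched as a small validation helper. valid_build_type is a hypothetical name, and the accepted values are only the types listed above, which may not be exhaustive.

```shell
# Hypothetical helper: succeed only if the argument is one of the image
# types listed in this guide.
valid_build_type() {
    case "$1" in
        amazon-ebs|virtualbox-ovf|docker|rackspace|azure|google) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on the instance:
#   valid_build_type "$(cat /etc/BUILD)" || echo "unexpected /etc/BUILD contents"
```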
This section covers some common errors that have been encountered and how to resolve them.
FATA Error response from daemon: mkdir /var/lib/docker/overlay/c4a8f5e516d401534f2d994f5546f7e08639ffd675eb3573267f76d79394f172-init/merged/dev/shm: invalid argument
This issue typically arises when starting the container on a Red Hat-based system with XFS and indicates that the current kernel is out of date. Upgrading to a later kernel version should resolve it. For more information, please refer to the relevant Docker Issue.