Appliance Troubleshooting and Diagnosis

This section describes common issues and the steps you can take to diagnose and remedy them without contacting a Kloudless engineer for assistance. A full list of useful commands is available in the Management Command Reference below.

Immediate Troubleshooting

Use this checklist to resolve any immediate issues affecting production if time is of the essence. Cluster SSH can be used to execute commands on several instances at once.

  1. Check alerts for system metrics such as CPU, memory, and disk space to determine whether resource scarcity is causing unavailability. Make a request to the instance while running ke_logs to view new log entries, as they may point to the cause. Past log data can also be viewed at /var/log/syslog.

    • Errors related to pgbouncer usually indicate an issue with DB connectivity. See the “Database” section below for more information.
  2. If there is no clear cause and the API server is unreachable via a browser (5xx status code), run:

     sudo service nginx restart          # Restart nginx
     supervisorctl restart api devs      # Restart web servers
    
  3. If curl http://localhost works but the server is inaccessible (unable to make connections) externally, check firewall rules.

  4. If issues persist, restart all application processes:

    supervisorctl restart api devs
    # Restart serially to limit downtime.
    supervisorctl status | grep celery-worker | cut -f 1 -d ' ' | xargs -n1 supervisorctl restart
    

    If issues persist, re-configure the instance to ensure all processes possess the correct state. If the instance has internet connectivity, the following command can be used to re-sync any custom configuration specific to your instance:

    sudo ke_update_configuration
    

    Alternatively, salt can be called directly to ensure the instance is correctly configured without attempting to re-download the custom configuration. This will use the instance’s base configuration and the custom configuration already available from the last time it was retrieved:

     sudo salt-call --local state.highstate
    

    Retry your requests once the processes shown in supervisorctl status remain in the RUNNING state. If the processes do not consistently remain in the RUNNING state for more than 30 seconds, proceed to the last step in this list to contact us. If any processes are in the UNKNOWN state, run the following commands:

     sudo service supervisor stop
     sudo killall -9 -u kloudless
     sudo killall -9 -u celery
     sudo service supervisor start
    

    Then verify the process state again with supervisorctl status.

  5. If issues persist, restart the instance to restart all system and application-level servers, daemons, and processes.

  6. If the instance continues to remain inaccessible, refer to the notes regarding External Services in the Inaccessibility section below for further steps.

  7. If issues persist, enable remote help by running ke_remote_help start and contact support@kloudless.com with the information output by it.
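Step 4 above asks you to wait until every process stays in the RUNNING state. The following is a minimal sketch of one way to spot the processes that are not, assuming supervisorctl status prints the process name in the first column and its state in the second. The function name not_running is hypothetical and is not part of the appliance.

```shell
# Print the name of every process whose state is not RUNNING.
# Reads `supervisorctl status` output on stdin; assumes each line
# starts with the process name followed by its state.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}

# Hypothetical usage on the appliance:
#   supervisorctl status | not_running
```

An empty result means every listed process reported RUNNING at that moment. Run it again after 30 seconds or more to confirm the state is stable.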

Inaccessibility

If a curl request to the API or Developer Portal web servers does not return a response, this section provides further guidance on determining the cause. To begin, SSH into the instance. If you are unable to do so, the instance’s firewall rules may not be correct or the instance may be under high load.

System Process Issues

If processes on the system appear unresponsive, or if it is unclear whether the services are running, please work through the Immediate Troubleshooting section above.

Firewall Rules

If curl http://localhost or curl https://localhost returns a response but the instance is unable to make outbound connections (e.g. curl https://kloudless.com from the instance), or inbound connections do not succeed, check firewall rules.
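As a rough sketch, the local and outbound checks described above can be combined into a single decision. The function name classify_connectivity and its output labels are hypothetical, and the 10-second timeout in the usage comment is an arbitrary choice.

```shell
# Classify connectivity from two curl exit codes (0 means success):
#   $1 = exit code of a request to localhost
#   $2 = exit code of a request to an external host
classify_connectivity() {
    if [ "$1" -ne 0 ]; then
        echo "local-failure"     # the web servers themselves did not answer
    elif [ "$2" -ne 0 ]; then
        echo "check-firewall"    # local works, outbound does not
    else
        echo "ok"
    fi
}

# Hypothetical usage on the instance:
#   curl -sS -o /dev/null --max-time 10 http://localhost; local_rc=$?
#   curl -sS -o /dev/null --max-time 10 https://kloudless.com; ext_rc=$?
#   classify_connectivity "$local_rc" "$ext_rc"
```

This only covers outbound traffic. Inbound connections have to be tested from a machine outside the instance.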

External Services

If the Kloudless appliance does not exhibit high CPU, I/O, or memory usage but is inaccessible or responding more slowly than usual, the issue might be due to non-responsive external services.

  • If an external database or Redis instance is used, check its health and number of connections, and attempt to connect to it via standard utilities.
  • Check the database storage capacity. If it is full, proceed with these steps:
    • Contact us and begin provisioning more space so that you can connect to the database and truncate it to a smaller size. A 10% increase in disk capacity should be sufficient for immediate troubleshooting. Feel free to double disk capacity if it is currently below 1 TB.
    • Halt all processes with the following command: supervisorctl stop all
    • Once you are able to connect to the database, locate the largest tables. For PostgreSQL, the query under “Finding the size of your biggest relations” on the PostgreSQL wiki can be used.
    • If the largest table is apimodels_event, you can safely truncate it with TRUNCATE apimodels_event; to return disk space to the database server. NOTE: This will delete all currently collected event information.
    • You may then start services once more with supervisorctl start all.
  • If running sudo -l does not return immediately and an external logging receiver is configured, the external logging server may not be processing log entries quickly enough, causing a system-wide slowdown.
    • As a temporary, immediate workaround, run the following commands to remove the external logging configuration and then restart rsyslog:
      sudo mv /etc/rsyslog.d/10-ke-logging.conf ~/
      sudo service rsyslog restart
      
      All services should now be accessible with no further changes required.
    • The Log Retention section of this configuration guide provides more information on updating the logging configuration at /data/kloudless.yml for a more permanent solution. The salt-call command mentioned earlier could then be used to sync the configuration.
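The sudo -l check above depends on noticing that a command "does not return immediately". One way to make that measurable is to wrap the command in the coreutils timeout utility, as in this minimal sketch. The helper name returns_within and the 5-second limit are assumptions, not appliance defaults.

```shell
# Succeed if the given command finishes within LIMIT seconds,
# whatever its own exit status is.
# Relies on coreutils `timeout`, which exits with 124 when the limit is hit.
returns_within() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    [ "$rc" -ne 124 ]
}

# Hypothetical usage (-n keeps sudo from waiting on a password prompt):
#   returns_within 5 sudo -n -l || echo "sudo -l is slow; check the external logging receiver"
```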

Rate-Limiting

Both the upstream cloud service and the Kloudless appliance can rate-limit requests when a high volume of API requests is performed. When this happens, an error response with a 429 status code is returned.

If Kloudless is responsible for the rate-limiting, the error_code returned in the body of the HTTP response will be too_many_requests. If the upstream service is responsible for rate-limiting requests, the error code will instead be too_many_service_requests.
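A client script can use this distinction to decide whom to contact. The sketch below simply searches the response body for the two error codes named above. The function name rate_limit_source is hypothetical, and no particular body layout is assumed beyond the error code appearing somewhere in it.

```shell
# Report which side rate-limited a request.
# Reads the HTTP response body of a 429 response on stdin.
rate_limit_source() {
    body=$(cat)
    case "$body" in
        *too_many_service_requests*) echo "upstream" ;;
        *too_many_requests*)         echo "kloudless" ;;
        *)                           echo "unknown" ;;
    esac
}

# Hypothetical usage, with the saved body of a 429 response:
#   rate_limit_source < response-body.json
```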

The appliance’s default connection limit is 45 times the number of CPU cores. This represents the number of concurrent API requests that can be performed. This value is configurable if required.
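As a worked example of that arithmetic, the sketch below computes the default limit for a given core count. The function name default_connection_limit is hypothetical. When no argument is given it falls back to nproc, which reports the cores available to the current process and may differ from what the appliance itself counts.

```shell
# Print the default connection limit: 45 times the number of CPU cores.
#   $1 = number of cores (optional; defaults to the output of nproc)
default_connection_limit() {
    cores=${1:-$(nproc)}
    echo $((45 * cores))
}

# Example: a 4-core instance allows 180 concurrent API requests by default.
#   default_connection_limit 4
```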

If you encounter rate-limiting by the upstream software service, please feel free to reach out to the upstream vendor to determine if rate limits for your developer application can be increased. Please also reach out to Kloudless to determine if a more efficient approach can be utilized.

Invalid License Key

When there are issues validating or updating the license key on your appliance, the API gateway will show a page saying that your license key is invalid. If the license agreement is still valid and the associated Kloudless account is in good standing, then the issue should be solved by manually updating the license key on all nodes:

sudo ke_update_configuration

This should result in all of the application processes re-reading the new license key and resuming normal function.

Unable to Connect Cloud Account

There are many possible causes of this issue and the following sections will outline common causes and their solutions.

Misconfigured Service Keys

This manifests as being able to select a service from the Kloudless account page but receiving an error from the upstream cloud service either before or after the account’s credentials have been entered. The cloud storage provider's error page will usually mention a redirect_url or something similar that needs to be configured in the storage service's account page. Documentation on how to configure these URLs is available in the developer dashboard on the appliance as well as on the main Kloudless website.

Local Redis Runs Out of Memory

A locally configured Redis server may run out of memory in rare or unusual circumstances. This error will result in the appliance presenting a 500 error page when attempting to connect an account or perform certain other API actions. The error can be definitively identified by examining the logs shown by the ke_logs utility on the primary node and looking for error messages containing "Redis" and "OOM". If this error is occurring, please record the current contents of the data store by running the following command:

redis-cli keys "*" > ~/redis-keys.log

This will help us determine the root cause of the problem as it should not be encountered under normal circumstances. The error can be temporarily resolved by increasing the value of maxmemory in /etc/redis/redis.conf. The change can then be applied by running sudo service redis-server restart. This change will be reverted when using the ke_configure command or any of the other configuration utilities, so please contact Kloudless as soon as possible if you encounter this issue.
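To narrow log output down to the relevant messages, a filter along the following lines may help. It is a minimal sketch: the function name redis_oom_lines is hypothetical, and matching "redis" case-insensitively is an assumption made so that lowercase module names are also caught.

```shell
# Print only the log lines that mention both Redis and OOM.
# Reads log text on stdin.
redis_oom_lines() {
    grep -i 'redis' | grep 'OOM'
}

# Hypothetical usage on the primary node:
#   redis_oom_lines < /var/log/syslog
#
# Standard redis-cli commands for inspecting the current limit and usage:
#   redis-cli config get maxmemory
#   redis-cli info memory
```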

Unable to Update Kloudless

Please attempt to resolve the issue with the following steps, or contact Kloudless Support if you continue to experience difficulties:

  • Confirm that a higher version is available via the Release Notes.
  • Remember that sudo ke_fetch_upgrade only updates to the latest patch release (x.y.z), whereas sudo ke_fetch_upgrade --system updates to the latest minor release (x.y.0).
  • Check your Kloudless Enterprise dashboard at https://kloudless.com/ to ensure the License Key in use is still valid. If it isn’t, contact Kloudless.
  • Run cat /etc/BUILD to verify that it contains the type of the image. Valid types include: amazon-ebs, virtualbox-ovf, docker, rackspace, azure, google. If it does not, run this command to populate the file: echo virtualbox-ovf | sudo tee /etc/BUILD > /dev/null. It is very important to replace virtualbox-ovf with the type of platform the instance is actually running on, for example amazon-ebs if the instance is running in AWS.
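The /etc/BUILD check in the last step can be scripted. The sketch below tests a value against the image types listed above. The function name is_valid_build_type is hypothetical, and because the list above is introduced with "include", a value it rejects may still be valid. Confirm with Kloudless before overwriting the file.

```shell
# Succeed if the argument is one of the image types listed in this section.
is_valid_build_type() {
    case "$1" in
        amazon-ebs|virtualbox-ovf|docker|rackspace|azure|google) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage on the instance:
#   is_valid_build_type "$(cat /etc/BUILD)" || echo "/etc/BUILD needs attention"
```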

Docker Issues

This section covers some common errors that have been encountered and how to resolve them.

FATA[0000] Error response from daemon: mkdir /var/lib/docker/overlay/c4a8f5e516d401534f2d994f5546f7e08639ffd675eb3573267f76d79394f172-init/merged/dev/shm: invalid argument

This issue typically arises when starting the container on a Red Hat-based system with XFS and indicates that the current kernel is out of date. Upgrading to a later kernel version should resolve it. For more information, please refer to the relevant Docker Issue.