# Kloudless Crawler

# Introduction

The Crawler is a type of Subscription that performs a one-time retrieval of all file and folder metadata in a connected user's account. If the account is an admin account, authenticated using the Kloudless admin OAuth flow, this retrieval includes metadata from all the users in the admin's organization.

Initiate a crawl for a connected account by creating a Subscription with the subscription_type set to crawl. The Crawler publishes data on all the existing files and folders in the account. To also track new activity through the Activity API, create a separate Subscription with the subscription_type set to changes.

# Kloudless Crawler quickstart

Test out the Kloudless Crawler with the following steps, which are explained in further detail below:

  1. Configure a default notification channel in the Kloudless Developer Portal to receive the crawler's JSON response.

  2. Connect an account via the Kloudless OAuth flow or the API Explorer.

  3. Create a subscription with the subscription_type set to crawl.

  4. Continue to monitor for new activity (optional).

# Configure a notification channel

On the Webhooks and Activity Monitoring page, configure one of the following notification channels to receive metadata from the crawler: Amazon EventBridge, Google Cloud Pub/Sub, or Azure Service Bus.

# Amazon EventBridge

With Amazon EventBridge, crawler data can be filtered and routed to services such as SQS, SNS, Amazon Kinesis, and AWS Lambda. Provide the following details on the Webhooks and Activity Monitoring page, under the Amazon EventBridge section:

  • AWS Region
  • AWS Account ID

# Google Cloud Pub/Sub

Provision a Google Cloud Pub/Sub topic and a service account.

The service account needs at least the roles/pubsub.publisher (Pub/Sub Publisher) role. Refer to the Pub/Sub Access Control docs for information on roles and on how to grant project-wide and topic-specific permissions.

Provide the following details on the Webhooks and Activity Monitoring page, under the Google Cloud Pub/Sub section:

  • Topic name
  • Service account key

    The service account key should be in JSON format. It can be created on the Google Cloud Platform Console during or after service account creation.
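Before pasting the key into the portal, a quick local check can confirm that the file really is a JSON service account key. The following is a minimal sketch, not part of Kloudless: the function name check_service_account_key and the sample values are hypothetical, and it assumes the usual Google Cloud key layout with type, client_email, and private_key fields.

```python
import json

# Fields assumed to be present in a standard Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def check_service_account_key(raw_text):
    """Return a list of problems found in the key text; an empty list means none were found."""
    try:
        key = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    # A missing "type" is already reported above; only flag a wrong value here.
    if key.get("type") not in (None, "service_account"):
        problems.append("type is not service_account")
    return problems


# Hypothetical placeholder key, not a real credential.
sample_key = json.dumps({
    "type": "service_account",
    "client_email": "publisher@example-project.iam.gserviceaccount.com",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
})
print(check_service_account_key(sample_key))
```

The check only inspects the shape of the file. It does not verify that the key is active or that the service account holds the publisher role.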

# Azure Service Bus

Create a Service Bus resource and topic, then provide the following details on the Webhooks and Activity Monitoring page, under the Azure Service Bus section:

  • Topic name
  • Primary connection string

    The default Shared Access Key has full control of the Service Bus namespace, which is more access than is required. We recommend setting up a Shared Access Key at the topic level instead.
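One way to catch a namespace-level key before submitting it is to inspect the connection string itself. The sketch below rests on two assumptions about Azure that are not stated in this guide: a connection string copied from a topic-level policy carries an EntityPath segment, and the default namespace policy is named RootManageSharedAccessKey. The function names and sample strings are hypothetical.

```python
def parse_connection_string(connection_string):
    """Split a Service Bus connection string into a dict of its key=value segments."""
    parts = {}
    for segment in connection_string.strip().split(";"):
        if not segment:
            continue
        # Split on the first "=" only, since base64 keys may end with "=" padding.
        name, _, value = segment.partition("=")
        parts[name] = value
    return parts


def is_topic_scoped(connection_string):
    """Return True if the string names a topic and does not use the default root policy."""
    parts = parse_connection_string(connection_string)
    uses_root_policy = parts.get("SharedAccessKeyName") == "RootManageSharedAccessKey"
    return "EntityPath" in parts and not uses_root_policy


# Hypothetical sample values, not real credentials.
topic_level = (
    "Endpoint=sb://example-ns.servicebus.windows.net/;"
    "SharedAccessKeyName=crawler-send;SharedAccessKey=abc123=;"
    "EntityPath=crawler-topic"
)
namespace_level = (
    "Endpoint=sb://example-ns.servicebus.windows.net/;"
    "SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=abc123="
)

print(is_topic_scoped(topic_level))
print(is_topic_scoped(namespace_level))
```

This is a heuristic on the string format only. It cannot tell which rights the policy actually grants.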

# Connect an account

For testing purposes, you can use the API Explorer to connect your account. This simulates the process your customers would go through to authorize access to their account.

# The Kloudless OAuth flow

In your app, you can include the Kloudless Authenticator JS library to prompt users to connect their accounts, or your app can implement the Kloudless OAuth flow directly. You can also configure custom OAuth keys to white-label your app's authentication flow.

# Create a crawler subscription

Use the Create Subscription endpoint to manually create a new subscription. Set the subscription_type attribute to crawl. In the request header, include the bearer token you received during the OAuth flow:

curl -H 'Authorization: Bearer TOKEN' \
  -H 'Content-Type: application/json' \
  -XPOST -d '{"subscription_type": "crawl"}' \
  'https://api.kloudless.com/v1/accounts/me/subscriptions/'

If you are using the API Explorer to create the Subscription, the bearer token is automatically included in the generated request's header.
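If your app creates the subscription from code rather than from a shell, the same request can be sketched in Python using only the standard library. The URL, headers, and body mirror the curl example; the function name build_subscription_request and the TOKEN placeholder are illustrative, and the response format is not shown because this guide does not describe it.

```python
import json
import urllib.request

# Endpoint copied from the curl example in this guide.
SUBSCRIPTIONS_URL = "https://api.kloudless.com/v1/accounts/me/subscriptions/"


def build_subscription_request(token, subscription_type="crawl"):
    """Build, but do not send, the POST request that creates a subscription."""
    body = json.dumps({"subscription_type": subscription_type}).encode("utf-8")
    return urllib.request.Request(
        SUBSCRIPTIONS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# "TOKEN" is a placeholder for the bearer token received during the OAuth flow.
request = build_subscription_request("TOKEN")
# To send the request: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect or log the method, headers, and body before any call reaches the API.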

# Monitoring for new activity

Once you have received metadata for the existing files and folders in the connected account, you can continue to monitor for new activity using the List Activity endpoint.

If you enabled Track Activity on the Webhooks and Activity Monitoring page before connecting the account, a default changes subscription was automatically created when the account was connected, and you can immediately begin querying the List Activity endpoint. Otherwise, you'll need to manually create a subscription with the subscription_type attribute set to changes before you can use the List Activity endpoint.
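The decision described above can be sketched as a small helper that returns the request body for a manual changes subscription only when one is needed. The function name changes_subscription_body and its boolean argument are hypothetical; your app would have to record for itself whether Track Activity was enabled before the account was connected.

```python
import json


def changes_subscription_body(track_activity_was_enabled):
    """Return the JSON body for a manual changes subscription, or None if none is needed.

    When Track Activity was enabled before the account was connected, a default
    changes subscription already exists, so no Create Subscription call is required.
    """
    if track_activity_was_enabled:
        return None
    return json.dumps({"subscription_type": "changes"})


body = changes_subscription_body(track_activity_was_enabled=False)
if body is not None:
    # Send this body to the Create Subscription endpoint, as with the crawl subscription.
    print(body)
```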

See the Activity Monitoring usage guide for more information on using the Activity API to monitor for new activity.