Kubernetes Cluster

How to deploy and administer a container based Stroom cluster using Kubernetes.

1 - Introduction

Introduction to using Stroom on Kubernetes.

Kubernetes is an open-source system for automating the deployment, scaling and management of containerised applications.

Stroom is a distributed application designed to handle large-scale dataflows. As such, it is ideally suited to a Kubernetes deployment, especially when operated at scale. Features standard to Kubernetes, like Ingress and Cluster Networking, simplify the installation and ongoing operation of Stroom.

Running applications in K8s can be challenging for software not designed to operate natively in a K8s cluster. A purpose-built Kubernetes Operator (stroom-k8s-operator) has been developed to make deployment easier, while taking advantage of several key Kubernetes features to further automate Stroom cluster management.

The concept of Kubernetes operators is discussed here.

Key features

The Stroom K8s Operator provides the following key features:

Deployment

  1. Simplified configuration, enabling administrators to define the entire state of a Stroom cluster in one file
  2. Designate separate processing and UI nodes, to ensure the Stroom user interface remains responsive, regardless of processing load
  3. Automatic secrets management

Operations

  1. Scheduled database backups
  2. Stroom node audit log shipping
  3. Automatically drain Stroom tasks before node shutdown
  4. Automatic Stroom task limit tuning, to attempt to keep CPU usage within configured parameters
  5. Rolling Stroom version upgrades

Next steps

Install the Stroom K8s Operator

2 - Install Operator

How to install the Stroom Kubernetes operator.

Prerequisites

  1. Kubernetes cluster, version >= 1.20.2
  2. metrics-server (pre-installed with some K8s distributions)
  3. kubectl and cluster-wide admin access

Preparation

Stage the following images in a locally-accessible container registry:

  1. All images listed in: https://github.com/p-kimberley/stroom-k8s-operator/blob/master/deploy/images.txt
  2. MySQL (e.g. mysql/mysql-server:8.0.25)
  3. Stroom (e.g. gchq/stroom:v7-LATEST)
  4. gchq/stroom-log-sender:v2.2.0 (only required if log forwarding is enabled)
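
For example, one hypothetical way to stage an image using the docker CLI (the registry URL is a placeholder for your own):

# Pull the public image, retag it for the private registry, then push it
docker pull gchq/stroom-log-sender:v2.2.0
docker tag gchq/stroom-log-sender:v2.2.0 my-registry.example.com:5000/gchq/stroom-log-sender:v2.2.0
docker push my-registry.example.com:5000/gchq/stroom-log-sender:v2.2.0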

Install the Stroom K8s Operator

  1. Clone the repository

    git clone https://github.com/p-kimberley/stroom-k8s-operator.git
  2. Edit ./deploy/all-in-one.yaml, prefixing any referenced images with your private registry URL. For example, if your private registry is my-registry.example.com:5000, the image gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0 becomes: my-registry.example.com:5000/gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0.

  3. Deploy the Operator

    kubectl apply -f ./deploy/all-in-one.yaml

The Stroom K8s Operator is now deployed to namespace stroom-operator-system. You can monitor its progress by watching the Pod named stroom-operator-controller-manager. Once it reaches Ready state, you can deploy a Stroom cluster.
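
For example, you can watch the controller Pod start up with:

kubectl get pods -n stroom-operator-system -w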

Allocating more resources

If the Operator Pod is killed due to running out of memory, you may want to increase the amount allocated to it.

This can be done by:

  1. Editing the resources.limits settings of the controller Pod in all-in-one.yaml
  2. kubectl apply -f all-in-one.yaml
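
As a sketch, the relevant fragment of the controller container spec in all-in-one.yaml might look like the following (the values shown are illustrative, not recommendations):

resources:
  limits:
    # Illustrative values only; tune to your workload
    cpu: 500m
    memory: 512Mi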

Next steps

Configure a Stroom database server
Upgrade
Remove

3 - Upgrade Operator

How to upgrade the Stroom Kubernetes Operator.

Upgrading the Operator can be performed without disrupting any resources it controls, including Stroom clusters.

To perform the upgrade, follow the same steps in Installing the Stroom K8s Operator.

After you initiate the update (by executing kubectl apply -f all-in-one.yaml), an instance of the new Operator version is created. Once it starts up successfully, the old instance is removed.

You can check whether the update succeeded by inspecting the image tag of the Operator Pod: stroom-operator-system/stroom-operator-controller-manager. The tag should correspond to the release number that was downloaded (e.g. 1.0.0).
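
For example, a hypothetical way to list the container images in use within the Operator's namespace:

kubectl get pods -n stroom-operator-system \
  -o jsonpath='{.items[*].spec.containers[*].image}'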

If the upgrade failed, the existing Operator should still be running.

4 - Remove Operator

How to remove the Stroom Kubernetes operator.

Removing the Stroom K8s Operator must be done with caution, as doing so causes all resources it manages, including StroomCluster, DatabaseServer and StroomTaskAutoscaler, to be deleted.

While the Stroom clusters under its control will be gracefully terminated, they will become inaccessible until re-deployed.

It is good practice to first delete any dependent resources before deleting the Operator.

Deleting the Operator

Execute this command against the same version of the manifest that was used to deploy the currently running Operator.

kubectl delete -f all-in-one.yaml

5 - Configure Database

How to configure the database server for a Stroom cluster.

Before creating a Stroom cluster, a database server must first be configured.

There are two options for deploying a MySQL database for Stroom:

Managed by Stroom K8s Operator

A database server can be created and managed by the Operator. This is the recommended option, as the Operator takes care of creating and storing the database credentials, which are shared securely with the Pod via a Secret cluster resource.

Create a DatabaseServer resource manifest

Use the example at database-server.yaml.

See the DatabaseServer Custom Resource Definition (CRD) API documentation for an explanation of the various CRD fields.

By default, MySQL imposes a limit of 151 concurrent connections. If your Stroom cluster is larger than a few nodes, it is likely you will exceed this limit. Therefore, it is recommended to set the MySQL property max_connections to a suitable value.

Bear in mind the Operator generally consumes one connection per StroomCluster it manages, so be sure to include some headroom in your allocation.

You can specify this value via the spec.additionalConfig property as in the example below:

apiVersion: stroom.gchq.github.io/v1
kind: DatabaseServer
...
spec:
  additionalConfig:
    - max_connections=1000
...

Provision a PersistentVolume for the DatabaseServer

General instructions on creating a Kubernetes PersistentVolume (PV) are explained here.

The Operator will create a StatefulSet when the DatabaseServer is deployed, which will attempt to claim a PersistentVolume matching the specification provided in DatabaseServer.spec.volumeClaim.

Fast, low-latency storage should be used for the Stroom database.
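
Below is a minimal local PersistentVolume sketch; the name, storage class, capacity, path and hostname are all assumptions and must match your environment and DatabaseServer.spec.volumeClaim:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: stroom-db-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    # Assumed path on fast, low-latency storage
    path: /mnt/fast-ssd/stroom-db
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - db-node-1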

Deploy the DatabaseServer to the cluster

kubectl apply -f database-server.yaml

Observe the Pod stroom-<database server name>-db start up. Once it’s reached Ready state, the server has started, and the databases you specified have been created.

Backup the created credentials

The Operator generates a Secret containing the passwords of the users root and stroomuser when it initially creates the DatabaseServer resource. These credentials should be backed up to a secure location, in the event the Secret is inadvertently deleted.

The Secret is named using the format: stroom-<db server name>-db (e.g. stroom-dev-db).
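
For example, a hypothetical way to export the Secret for offline backup (the namespace stroom is a placeholder):

kubectl get secret stroom-dev-db -n stroom -o yaml > stroom-dev-db-secret.yaml

Store the resulting file in a secure location, as it contains the (base64-encoded) passwords.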

External

You may alternatively provide the connection details of an existing MySQL (or compatible) database server. This may be desirable if you have, for instance, a replication-enabled MySQL InnoDB cluster.

Provision the server and Stroom databases

Store credentials in a Secret

Create a Secret in the same namespace as the StroomCluster, containing the key stroomuser, with the value set to the password of that user.
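
A minimal sketch (the Secret name stroom-db-credentials and namespace stroom are placeholders; use the name your StroomCluster manifest expects):

kubectl create secret generic stroom-db-credentials \
  -n stroom \
  --from-literal=stroomuser='<stroomuser password>'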

Upgrading or removing a DatabaseServer

A DatabaseServer cannot shut down while its dependent StroomCluster is running. This is a necessary safeguard to prevent database connectivity from being lost.

Upgrading or removing a DatabaseServer requires that the StroomCluster be removed first.

Next steps

Configure a Stroom cluster

6 - Configure a cluster

How to configure a Stroom cluster.

A StroomCluster resource defines the topology and behaviour of a collection of Stroom nodes.

The following key concepts should be understood in order to optimally configure a cluster.

Concepts

NodeSet

A logical grouping of nodes intended to fulfil a common role together. There are three possible roles, as defined by ProcessingNodeRole:

  1. Undefined (default). Each node in the NodeSet can receive and process data, as well as service web frontend requests.
  2. Processing. Nodes can receive and process data, but do not service web frontend requests.
  3. Frontend. Nodes service web frontend requests only.

There is no imposed limit to the number of NodeSets; however, it generally doesn’t make sense to have more than one assigned to either the Processing or Frontend role. In clusters where nodes are not very busy, dedicated Frontend nodes should not be necessary. Where load is prone to spikes, however, such nodes can greatly improve the responsiveness of the Stroom user interface.

It is important to ensure there is at least one NodeSet for each role in the StroomCluster. The Operator automatically wires up traffic routing to ensure that only non-Frontend nodes receive event data. Additionally, Frontend-only nodes have server tasks disabled automatically on startup, effectively preventing them from participating in stream processing.

Ingress

Kubernetes Ingress resources determine how requests are routed to an application. Ingress resources are configured by the Operator based on the NodeSet roles and the provided StroomCluster.spec.ingress parameters.

It is possible to disable Ingress for a given NodeSet, which excludes nodes within that group from receiving any traffic via the public endpoint. This can be useful when creating nodes dedicated to data processing, which do not need to receive data directly.

StroomTaskAutoscaler

StroomTaskAutoscaler is an optional resource that, if defined, activates “auto-pilot” features for an associated StroomCluster. See this guide on how to configure it.

Creating a Stroom cluster

Create a StroomCluster resource manifest

Use the example stroom-cluster.yaml.

If you chose to create an Operator-managed DatabaseServer, the StroomCluster.spec.databaseServerRef should point to the name of the DatabaseServer.
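
For example, a hypothetical excerpt referencing an Operator-managed DatabaseServer named dev (the exact field layout is an assumption; consult the example manifest and CRD documentation):

apiVersion: stroom.gchq.github.io/v1
kind: StroomCluster
...
spec:
  databaseServerRef:
    serverRef:
      # Matches the DatabaseServer's metadata.name and namespace
      name: dev
      namespace: stroom
...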

Provision a PersistentVolume for each Stroom node

Each PersistentVolume provides persistent local storage for a Stroom node. The amount of storage doesn’t generally need to be large, as stream data is stored on another volume. When deciding on a storage quota, be sure to consider the needs of log and reference data, in particular.

This volume should ideally be backed by fast, low-latency storage in order to maximise the performance of LMDB.

Deploy the StroomCluster resource

kubectl apply -f stroom-cluster.yaml

If the StroomCluster configuration is valid, the Operator will deploy a StatefulSet for each NodeSet defined in StroomCluster.spec.nodeSets. Once these StatefulSets reach Ready state, you are ready to access the Stroom UI.

Log into Stroom

Access the Stroom UI at: https://<ingress hostname>. The initial credentials are:

  • Username: admin
  • Password: admin

Further customisation (optional)

The configuration bundled with the Operator provides enough customisation for most use cases, via explicit properties and environment variables.

If you need to further customise Stroom, you have the following methods available:

Override the Stroom configuration file

Deploy a ConfigMap separately. You can then specify the ConfigMap name and key (itemName) containing the configuration file to be mounted into each Stroom node container.
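
For instance, a hypothetical ConfigMap created from a local configuration file (all names are placeholders):

kubectl create configmap stroom-config -n stroom \
  --from-file=config.yml=./config.yml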

Provide additional environment variables

Specify custom environment variables in StroomCluster.spec.extraEnv. You can reference these in the Stroom configuration file.
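
A brief sketch (the variable and its value are hypothetical):

spec:
  extraEnv:
    - name: KAFKA_BOOTSTRAP_SERVERS
      value: kafka.example.com:9092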

Mount additional files

You can also define additional Volumes and VolumeMounts to be injected into each Stroom node. This can be useful when providing files like certificates for Kafka integration.
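
A hedged sketch, assuming the CRD accepts standard Kubernetes Volume and VolumeMount definitions (the field names extraVolumes and extraVolumeMounts are assumptions; consult the CRD documentation):

spec:
  extraVolumes:
    # A Secret holding client certificates, mounted into each node
    - name: kafka-certs
      secret:
        secretName: kafka-client-certs
  extraVolumeMounts:
    - name: kafka-certs
      mountPath: /stroom/certs/kafka
      readOnly: true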

Reconfiguring the cluster

Some StroomCluster configuration properties can be reconfigured while the cluster is still running:

  1. spec.image Change this to deploy a newer (or different) Stroom version
  2. spec.terminationGracePeriodSecs Applies the next time a node or cluster is deleted
  3. spec.nodeSets.count If changed, the NodeSet’s StatefulSet will be scaled (up or down) to match the corresponding number of replicas
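
For example, scaling a NodeSet might look like the following manifest excerpt (the NodeSet name is a placeholder):

spec:
  nodeSets:
    - name: processing
      # Increasing this value scales the StatefulSet up to match
      count: 5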

After changing any of the above properties, re-apply the manifest:

kubectl apply -f stroom-cluster.yaml

If any other changes need to be made, delete then re-create the StroomCluster.

Next steps

Configure Stroom task autoscaling
Stop a Stroom cluster

7 - Auto Scaler

How to configure Stroom task auto scaling.

Motivation

Setting optimal Stroom stream processor task limits is a crucial factor in running a healthy, performant cluster. If a node is allocated too many tasks, it may become unresponsive or crash. Conversely, if allocated too few tasks, it may have CPU cycles to spare.

The optimal number of tasks is often time-dependent, as load will usually fluctuate during the day and night. In large deployments, it’s not ideal to set static limits, as doing so risks over-committing nodes during intense spikes in activity (such as backlog processing or multiple concurrent searches). Therefore an automated solution, factoring in system load, is called for.

Stroom task autoscaling

When a StroomTaskAutoscaler resource is deployed to a linked StroomCluster, the Operator will periodically compare each Stroom node’s average Pod CPU usage against user-defined thresholds and adjust the node’s task limits accordingly.

Enabling autoscaling

Create a StroomTaskAutoscaler resource manifest

Use the example autoscaler.yaml.

Below is an explanation of some of the main parameters. The rest are documented here.

  • adjustmentIntervalMins Determines how often the Operator checks whether a node has exceeded its CPU parameters. It should be frequent enough to catch brief load spikes, but not so frequent as to overload the Operator and Kubernetes cluster with excessive API calls and other overhead.
  • metricsSlidingWindowMin The window of time over which CPU usage is averaged. If it is too small, momentary load spikes could cause task limits to be reduced unnecessarily; if too large, spikes may not cause throttling to occur.
  • minCpuPercent and maxCpuPercent should be set to a reasonably tight range, in order to keep the task limit as close to optimal as possible.
  • minTaskLimit and maxTaskLimit are safeguards to avoid nodes ever being allocated an unreasonable number of tasks. Setting maxTaskLimit equal to the number of assigned CPUs is a reasonable starting point.
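
Putting these together, a hypothetical spec might look like the following (all values are illustrative, not recommendations):

apiVersion: stroom.gchq.github.io/v1
kind: StroomTaskAutoscaler
...
spec:
  adjustmentIntervalMins: 5
  metricsSlidingWindowMin: 15
  minCpuPercent: 50
  maxCpuPercent: 90
  minTaskLimit: 1
  maxTaskLimit: 16
...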

Deploy the resource manifest

kubectl apply -f autoscaler.yaml

Disable autoscaling

Delete the StroomTaskAutoscaler resource

kubectl delete -f autoscaler.yaml

8 - Stop Stroom Cluster

How to stop the whole Stroom cluster.

A Stroom cluster can be stopped by deleting the StroomCluster resource that was deployed. When this occurs, the Operator will perform the following actions for each node, in sequence:

  1. Disable processing of all tasks.
  2. Wait for all processing tasks to be completed. This check is performed once every minute, so there may be a brief delay between a node completing its tasks and being shut down.
  3. Terminate the container.

The StroomCluster resource will be removed from the Kubernetes cluster once all nodes have finished processing tasks.

Stopping the cluster

kubectl delete -f stroom-cluster.yaml
kubectl delete -f database-server.yaml

If a StroomTaskAutoscaler was created, remove that as well.

If any of these commands appear to hang with no response, that’s normal; the Operator is likely waiting for tasks to drain. You may press Ctrl+C to return to the shell, and task termination will continue in the background.

Once the StroomCluster is removed, it can be reconfigured (if required) and redeployed, using the same process as in Configure a Stroom cluster.

PersistentVolumeClaim deletion

When a Stroom node is shut down, by default its PersistentVolumeClaim will remain. This ensures it gets re-assigned the same PersistentVolume when it starts up again.

This behaviour should satisfy most use cases. However, the Operator may be configured to delete the PVC in certain situations, by specifying StroomCluster.spec.volumeClaimDeletePolicy:

  1. DeleteOnScaledownOnly deletes a node’s PVC when the number of nodes in the NodeSet is reduced and, as a result, the node’s Pod is no longer part of the NodeSet.
  2. DeleteOnScaledownAndClusterDeletion deletes the PVC if the node Pod is removed.
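
For example, as a manifest excerpt:

spec:
  volumeClaimDeletePolicy: DeleteOnScaledownOnly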

Next steps

Removing the Stroom K8s Operator

9 - Restart Node

How to restart a Stroom node.

Stroom nodes may occasionally hang or become unresponsive. In these situations, it may be necessary to terminate the Pod.

After you identify the unresponsive Pod (e.g. by finding a node not responding to cluster ping):

kubectl delete pod -n <Stroom cluster namespace> <pod name>

This will attempt to drain tasks from the node. After the termination grace period has elapsed, the Pod will be killed and a new one will automatically respawn to take its place. Once the new Pod finishes starting up, if functioning correctly, it should begin responding to cluster ping.

Force deletion

If waiting for the grace period to elapse is unacceptable and you are willing to risk shutting down the node without draining it first (or you are sure it has no active tasks), you can force delete the Pod using the procedure outlined in the Kubernetes documentation:

kubectl delete pod -n <Stroom cluster namespace> <pod name> --grace-period=0 --force