Kubernetes Cluster
- 1: Introduction
- 2: Install Operator
- 3: Upgrade Operator
- 4: Remove Operator
- 5: Configure Database
- 6: Configure a cluster
- 7: Auto Scaler
- 8: Stop Stroom Cluster
- 9: Restart Node
1 - Introduction
Kubernetes is an open-source system for automating deployment, scaling and management of containerised applications.
Stroom is a distributed application designed to handle large-scale dataflows. As such, it is ideally suited to a Kubernetes deployment, especially when operated at scale. Features standard to Kubernetes, like Ingress and Cluster Networking, simplify the installation and ongoing operation of Stroom.
Running applications in K8s can be challenging for applications not designed to operate natively in a K8s cluster. A purpose-built Kubernetes Operator (stroom-k8s-operator) has been developed to make deployment easier, while taking advantage of several key Kubernetes features to further automate Stroom cluster management.
The concept of Kubernetes operators is discussed here.
Key features
The Stroom K8s Operator provides the following key features:
Deployment
- Simplified configuration, enabling administrators to define the entire state of a Stroom cluster in one file
- Designate separate processing and UI nodes, to ensure the Stroom user interface remains responsive, regardless of processing load
- Automatic secrets management
Operations
- Scheduled database backups
- Stroom node audit log shipping
- Automatically drain Stroom tasks before node shutdown
- Automatic Stroom task limit tuning, to attempt to keep CPU usage within configured parameters
- Rolling Stroom version upgrades
Next steps
Install the Stroom K8s Operator
2 - Install Operator
Prerequisites
- Kubernetes cluster, version >= 1.20.2
- metrics-server (pre-installed with some K8s distributions)
- `kubectl` and cluster-wide admin access
Preparation
Stage the following images in a locally-accessible container registry:
- All images listed in: https://github.com/p-kimberley/stroom-k8s-operator/blob/master/deploy/images.txt
- MySQL (e.g. `mysql/mysql-server:8.0.25`)
- Stroom (e.g. `gchq/stroom:v7-LATEST`)
- `gchq/stroom-log-sender:v2.2.0` (only required if log forwarding is enabled)
Install the Stroom K8s Operator
- Clone the repository.
- Edit `./deploy/all-in-one.yaml`, prefixing any referenced images with your private registry URL. For example, if your private registry is `my-registry.example.com:5000`, the image `gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0` will become: `my-registry.example.com:5000/gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0`.
- Deploy the Operator.
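These steps map to commands roughly as follows (a sketch: the repository URL is taken from the image list above, and the manifest path from step 2):

```sh
# 1. Clone the Operator repository
git clone https://github.com/p-kimberley/stroom-k8s-operator.git
cd stroom-k8s-operator

# 2. Edit ./deploy/all-in-one.yaml to point at your private registry, then:

# 3. Deploy the Operator
kubectl apply -f ./deploy/all-in-one.yaml
```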
The Stroom K8s Operator is now deployed to namespace `stroom-operator-system`.
You can monitor its progress by watching the Pod named `stroom-operator-controller-manager`.
Once it reaches `Ready` state, you can deploy a Stroom cluster.
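For example, one way to watch the Pod's status (standard `kubectl`, no Operator-specific assumptions):

```sh
kubectl get pods -n stroom-operator-system -w
```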
Allocating more resources
If the Operator Pod is killed due to running out of memory, you may want to increase the amount allocated to it.
This can be done by:
- Editing the `resources.limits` settings of the controller Pod in `all-in-one.yaml`
- Re-applying the manifest: `kubectl apply -f all-in-one.yaml`
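As a rough sketch, the relevant section of `all-in-one.yaml` looks something like this (the values shown are illustrative, not recommendations):

```yaml
resources:
  limits:
    cpu: 500m      # illustrative; size these for your deployment
    memory: 256Mi
```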
Note
The Operator retains CPU and memory metrics for all `StroomCluster` Pods for a 60-minute window.
In very large deployments, this may cause it to run out of memory.
Next steps
Configure a database server
3 - Upgrade Operator
Upgrading the Operator can be performed without disrupting any resources it controls, including Stroom clusters.
To perform the upgrade, follow the same steps as in Installing the Stroom K8s Operator.
Warning
Ensure you do NOT delete the Operator first (i.e. `kubectl delete ...`).
Once you have initiated the update (by executing `kubectl apply -f all-in-one.yaml`), an instance of the new Operator version will be created.
Once it starts up successfully, the old instance will be removed.
You can check whether the update succeeded by inspecting the image tag of the Operator Pod: `stroom-operator-system/stroom-operator-controller-manager`.
The tag should correspond to the release number that was downloaded (e.g. `1.0.0`).
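A quick way to list the images in use (standard `kubectl`; no Operator-specific labels assumed):

```sh
kubectl get pods -n stroom-operator-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].image}{"\n"}{end}'
```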
If the upgrade failed, the existing Operator should still be running.
4 - Remove Operator
Removing the Stroom K8s Operator must be done with caution, as it causes all resources it manages, including `StroomCluster`, `DatabaseServer` and `StroomTaskAutoscaler`, to be deleted.
While the Stroom clusters under its control will be gracefully terminated, they will become inaccessible until re-deployed.
It is good practice to first delete any dependent resources before deleting the Operator.
Deleting the Operator
Execute this command against the same version of the manifest that was used to deploy the currently running Operator.
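The deletion command mirrors the install step (a sketch, assuming the manifest path from the install instructions):

```sh
kubectl delete -f ./deploy/all-in-one.yaml
```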
5 - Configure Database
Before creating a Stroom cluster, a database server must first be configured.
There are two options for deploying a MySQL database for Stroom:
Managed by Stroom K8s Operator
A database server can be created and managed by the Operator.
This is the recommended option, as the Operator will take care of the creation and storage of database credentials, which are shared securely with the Pod via the use of a `Secret` cluster resource.
Create a `DatabaseServer` resource manifest
Use the example at database-server.yaml.
See the `DatabaseServer` Custom Resource Definition (CRD) API documentation for an explanation of the various CRD fields.
By default, MySQL imposes a limit of 151 concurrent connections.
If your Stroom cluster is larger than a few nodes, it is likely you will exceed this limit.
Therefore, it is recommended to set the MySQL property `max_connections` to a suitable value.
Bear in mind the Operator generally consumes one connection per `StroomCluster` it manages, so be sure to include some headroom in your allocation.
You can specify this value via the `spec.additionalConfig` property, as in the example below:
```yaml
apiVersion: stroom.gchq.github.io/v1
kind: DatabaseServer
...
spec:
  additionalConfig:
    - max_connections=1000
  ...
```
Provision a `PersistentVolume` for the `DatabaseServer`
General instructions on creating a Kubernetes Persistent Volume (PV) are explained here.
The Operator will create a `StatefulSet` when the `DatabaseServer` is deployed, which will attempt to claim a `PersistentVolume` matching the specification provided in `DatabaseServer.spec.volumeClaim`.
Fast, low-latency storage should be used for the Stroom database.
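If you are provisioning local storage manually, a minimal `PersistentVolume` sketch might look like the following (all names, sizes and paths are illustrative; match the capacity and access modes to your `DatabaseServer.spec.volumeClaim`):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: stroom-db-pv              # illustrative name
spec:
  capacity:
    storage: 100Gi                # size to suit your deployment
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/stroom-db          # fast, low-latency disk on the target node
  nodeAffinity:                   # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1        # illustrative node name
```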
Deploy the `DatabaseServer` to the cluster
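Deployment is a standard `kubectl apply` of the manifest created earlier (file name as per the example above):

```sh
kubectl apply -f database-server.yaml
```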
Observe the Pod `stroom-<database server name>-db` start up.
Once it reaches `Ready` state, the server has started, and the databases you specified have been created.
Back up the created credentials
The Operator generates a `Secret` containing the passwords of the users `root` and `stroomuser` when it initially creates the `DatabaseServer` resource.
These credentials should be backed up to a secure location, in the event the `Secret` is inadvertently deleted.
The `Secret` is named using the format `stroom-<db server name>-db` (e.g. `stroom-dev-db`).
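One way to export the `Secret` for safekeeping (standard `kubectl`; substitute your server name and namespace):

```sh
kubectl get secret stroom-dev-db -n <namespace> -o yaml > stroom-dev-db-secret.yaml
```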
External
You may alternatively provide the connection details of an existing MySQL (or compatible) database server. This may be desirable if you have, for instance, a replication-enabled MySQL InnoDB cluster.
Provision the server and Stroom databases
TODO
Complete this section.
Store credentials in a Secret
Create a `Secret` in the same namespace as the `StroomCluster`, containing the key `stroomuser`, with the value set to the password of that user.
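For example (a sketch; the `Secret` name is illustrative and the namespace must match the `StroomCluster`):

```sh
kubectl create secret generic stroom-db-credentials \
  -n <stroom namespace> \
  --from-literal=stroomuser='<password>'
```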
Warning
If at any time the MySQL password is updated, the value of the `Secret` must also be changed.
Otherwise, Stroom will stop functioning.
Upgrading or removing a `DatabaseServer`
A `DatabaseServer` cannot shut down while its dependent `StroomCluster` is running.
This is a necessary safeguard to prevent database connectivity from being lost.
Upgrading or removing a `DatabaseServer` requires that the `StroomCluster` be removed first.
Next steps
Configure a Stroom cluster
6 - Configure a cluster
A `StroomCluster` resource defines the topology and behaviour of a collection of Stroom nodes.
The following key concepts should be understood in order to optimally configure a cluster.
Concepts
NodeSet
A logical grouping of nodes intended to fulfil a common role together.
There are three possible roles, as defined by `ProcessingNodeRole`:
- Undefined (default). Each node in the `NodeSet` can receive and process data, as well as service web frontend requests.
- `Processing`. The node can receive and process data, but not service web frontend requests.
- `Frontend`. The node services web frontend requests only.
There is no imposed limit to the number of `NodeSet`s; however, it generally doesn't make sense to have more than one assigned to either the `Processing` or `Frontend` role.
In clusters where nodes are not very busy, it should not be necessary to have dedicated `Frontend` nodes.
In cases where load is prone to spikes, such nodes can greatly improve the responsiveness of the Stroom user interface.
It is important to ensure there is at least one `NodeSet` for each role in the `StroomCluster`.
The Operator automatically wires up traffic routing to ensure that only non-`Frontend` nodes receive event data. Additionally, `Frontend`-only nodes have server tasks disabled automatically on startup, effectively preventing them from participating in stream processing.
Ingress
Kubernetes `Ingress` resources determine how requests are routed to an application.
`Ingress` resources are configured by the Operator based on the `NodeSet` roles and the provided `StroomCluster.spec.ingress` parameters.
It is possible to disable `Ingress` for a given `NodeSet`, which excludes nodes within that group from receiving any traffic via the public endpoint.
This can be useful when creating nodes dedicated to data processing, which do not receive data.
StroomTaskAutoscaler
`StroomTaskAutoscaler` is an optional resource that, if defined, activates “auto-pilot” features for an associated `StroomCluster`.
See this guide on how to configure one.
Creating a Stroom cluster
Create a `StroomCluster` resource manifest
Use the example stroom-cluster.yaml.
If you chose to create an Operator-managed `DatabaseServer`, `StroomCluster.spec.databaseServerRef` should point to the name of the `DatabaseServer`.
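As a rough skeleton (`apiVersion` and `kind` follow the same convention as the `DatabaseServer` example above; everything else here is illustrative, so defer to the bundled example and the CRD documentation for the actual fields):

```yaml
apiVersion: stroom.gchq.github.io/v1
kind: StroomCluster
metadata:
  name: dev                        # illustrative
  namespace: stroom                # illustrative
spec:
  image: gchq/stroom:v7-LATEST     # one of the images staged during installation
  databaseServerRef: ...           # name of the Operator-managed DatabaseServer, if used
  nodeSets:                        # one entry per NodeSet; see the Concepts section
    - count: 3                     # count sets the number of nodes in the NodeSet
      ...
  ...
```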
See Also
See the `StroomCluster` Custom Resource Definition (CRD).
Provision a `PersistentVolume` for each Stroom node
Each `PersistentVolume` provides persistent local storage for a Stroom node.
The amount of storage doesn't generally need to be large, as stream data is stored on another volume.
When deciding on a storage quota, be sure to consider the needs of log and reference data, in particular.
This volume should ideally be backed by fast, low-latency storage in order to maximise the performance of LMDB.
Deploy the `StroomCluster` resource
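As with the other resources, deployment is a plain `kubectl apply` (file name as per the example above):

```sh
kubectl apply -f stroom-cluster.yaml
```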
If the `StroomCluster` configuration is valid, the Operator will deploy a `StatefulSet` for each `NodeSet` defined in `StroomCluster.spec.nodeSets`.
Once these `StatefulSet`s reach `Ready` state, you are ready to access the Stroom UI.
Note
If the `StatefulSet`s don't deploy, there is probably something wrong with your configuration.
Check the logs of the Pod `stroom-operator-system/stroom-operator-controller-manager` for any errors.
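For example (a sketch; this assumes the Operator's controller-manager is exposed as a Deployment of the same name, which is an assumption rather than something this page states):

```sh
kubectl logs -n stroom-operator-system deployment/stroom-operator-controller-manager
```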
Log into Stroom
Access the Stroom UI at: `https://<ingress hostname>`.
The initial credentials are:
- Username: `admin`
- Password: `admin`
Further customisation (optional)
The configuration bundled with the Operator provides enough customisation for most use cases, via explicit properties and environment variables.
If you need to further customise Stroom, you have the following methods available:
Override the Stroom configuration file
Deploy a `ConfigMap` separately.
You can then specify the `ConfigMap` `name` and key (`itemName`) containing the configuration file to be mounted into each Stroom node container.
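Creating the `ConfigMap` itself is standard Kubernetes (names are illustrative; the key you choose is what you would reference as `itemName`):

```sh
kubectl create configmap stroom-config -n stroom --from-file=config.yml=./config.yml
```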
Provide additional environment variables
Specify custom environment variables in `StroomCluster.spec.extraEnv`.
You can reference these in the Stroom configuration file.
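Assuming `extraEnv` entries take the standard Kubernetes `EnvVar` form (an assumption based on the name), a sketch:

```yaml
spec:
  extraEnv:
    - name: MY_EXTRA_SETTING   # illustrative variable name
      value: "some-value"      # referenced from the Stroom configuration file
```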
Mount additional files
You can also define additional `Volume`s and `VolumeMount`s to be injected into each Stroom node.
This can be useful when providing files like certificates for Kafka integration.
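A hypothetical sketch, assuming these fields take standard Kubernetes `Volume` and `VolumeMount` definitions (the field names below are placeholders; consult the CRD documentation for the actual ones):

```yaml
spec:
  extraVolumes:                     # placeholder field name
    - name: kafka-certs
      secret:
        secretName: kafka-client-certs
  extraVolumeMounts:                # placeholder field name
    - name: kafka-certs
      mountPath: /stroom/certs/kafka
      readOnly: true
```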
Reconfiguring the cluster
Some `StroomCluster` configuration properties can be reconfigured while the cluster is still running:
- `spec.image`: change this to deploy a newer (or different) Stroom version.
- `spec.terminationGracePeriodSecs`: applies the next time a node or cluster is deleted.
- `spec.nodeSets.count`: if changed, the `NodeSet`'s `StatefulSet` will be scaled (up or down) to match the corresponding number of replicas.
After changing any of the above properties, re-apply the manifest:
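A minimal sketch, assuming the same manifest file name as before:

```sh
kubectl apply -f stroom-cluster.yaml
```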
If any other changes need to be made, delete and then re-create the `StroomCluster`.
Next steps
Configure Stroom task autoscaling
Stop a Stroom cluster
7 - Auto Scaler
Motivation
Setting optimal Stroom stream processor task limits is a crucial factor in running a healthy, performant cluster. If a node is allocated too many tasks, it may become unresponsive or crash. Conversely, if allocated too few tasks, it may have CPU cycles to spare.
The optimal number of tasks is often time-dependent, as load will usually fluctuate during the day and night. In large deployments, it's not ideal to set static limits, as doing so risks over-committing nodes during intense spikes in activity (such as backlog processing or multiple concurrent searches). Therefore, an automated solution factoring in system load is called for.
Stroom task autoscaling
When a `StroomTaskAutoscaler` resource is deployed to a linked `StroomCluster`, the Operator will periodically compare each Stroom node's average Pod CPU usage against user-defined thresholds.
Enabling autoscaling
Create a `StroomTaskAutoscaler` resource manifest
Use the example autoscaler.yaml.
Below is an explanation of some of the main parameters (a sketch of a complete manifest follows the list). The rest are documented here.
- `adjustmentIntervalMins` determines how often the Operator will check whether a node has exceeded its CPU parameters. It should be often enough to catch brief load spikes, but not so often as to overload the Operator and Kubernetes cluster through excessive API calls and other overhead.
- `metricsSlidingWindowMin` is the window of time over which CPU usage is averaged. It should not be too small, otherwise momentary load spikes could cause task limits to be reduced unnecessarily. Too large, and spikes may not cause throttling to occur.
- `minCpuPercent` and `maxCpuPercent` should be set to a reasonably tight range, in order to keep the task limit as close to optimal as possible.
- `minTaskLimit` and `maxTaskLimit` are safeguards to avoid nodes ever being allocated an unreasonable number of tasks. Setting `maxTaskLimit` equal to the number of assigned CPUs would be a reasonable starting point.
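Putting these parameters together, a manifest might look roughly like the following (a sketch only: values are illustrative and the linkage to the target `StroomCluster` is elided; see the bundled autoscaler.yaml example for the actual structure):

```yaml
apiVersion: stroom.gchq.github.io/v1
kind: StroomTaskAutoscaler
metadata:
  name: dev                    # illustrative name
spec:
  ...                          # reference to the target StroomCluster (see the example)
  adjustmentIntervalMins: 10   # how often CPU usage is evaluated
  metricsSlidingWindowMin: 30  # averaging window for CPU metrics
  minCpuPercent: 50            # keep average CPU within this range...
  maxCpuPercent: 90            # ...by adjusting each node's task limit
  minTaskLimit: 1              # safeguards on the adjusted task limit
  maxTaskLimit: 8              # e.g. the number of CPUs assigned to each node
```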
Note
A node's task limits will only be adjusted while its task queue is full. That is, unless a node is fully-committed, it will not be scaled. This is to avoid continually downscaling each node to the minimum during periods of inactivity. Because of this, be realistic when setting `maxTaskLimit`, to ensure the node is actually capable of hitting that maximum.
If it can't, the autoscaler will continue adjusting upwards, potentially causing the node to become unresponsive.
Deploy the resource manifest
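A sketch, assuming the example file name above:

```sh
kubectl apply -f autoscaler.yaml
```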
Disable autoscaling
Delete the `StroomTaskAutoscaler` resource.
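For example, against the same manifest that was applied:

```sh
kubectl delete -f autoscaler.yaml
```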
8 - Stop Stroom Cluster
A Stroom cluster can be stopped by deleting the `StroomCluster` resource that was deployed.
When this occurs, the Operator will perform the following actions for each node, in sequence:
- Disable processing of all tasks.
- Wait for all processing tasks to be completed. This check is performed once every minute, so there may be a brief delay between a node completing its tasks and its shutdown.
- Terminate the container.
The `StroomCluster` resource will be removed from the Kubernetes cluster once all nodes have finished processing tasks.
Note
`StroomCluster.spec.nodeTerminationGracePeriodSecs` is an important setting that determines how long the Operator will wait for each node's tasks to complete before terminating it.
Ensure this is set to a reasonable value, otherwise long-running tasks may not have enough time to finish if the `StroomCluster` is taken down (e.g. for maintenance).
Stopping the cluster
If a `StroomTaskAutoscaler` was created, remove that as well.
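A sketch of both deletions, assuming the manifest file names used earlier:

```sh
kubectl delete -f autoscaler.yaml       # only if a StroomTaskAutoscaler was created
kubectl delete -f stroom-cluster.yaml   # initiates a graceful cluster shutdown
```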
If any of these commands appear to hang with no response, that’s normal; the Operator is likely waiting for tasks to drain.
You may press `Ctrl+C` to return to the shell, and task termination will continue in the background.
Note
If the `StroomCluster` deletion appears to be hung, you can inspect the Operator logs to see which nodes are holding up deletion due to outstanding tasks.
You will see a list of one or more node names, with the number of tasks outstanding in brackets (e.g. `StroomCluster deletion waiting on task completing for 1 nodes: stroom-dev-node-data-0 (5)`).
Once the `StroomCluster` is removed, it can be reconfigured (if required) and redeployed, using the same process as in Configure a Stroom cluster.
`PersistentVolumeClaim` deletion
When a Stroom node is shut down, by default its `PersistentVolumeClaim` will remain.
This ensures it gets re-assigned the same `PersistentVolume` when it starts up again.
This behaviour should satisfy most use cases.
However, the Operator may be configured to delete the PVC in certain situations, by specifying `StroomCluster.spec.volumeClaimDeletePolicy`:
- `DeleteOnScaledownOnly` deletes a node's PVC when the number of nodes in the `NodeSet` is reduced and, as a result, the node's Pod is no longer part of the `NodeSet`.
- `DeleteOnScaledownAndClusterDeletion` deletes the PVC if the node's Pod is removed.
Next steps
Removing the Stroom K8s Operator
9 - Restart Node
Stroom nodes may occasionally hang or become unresponsive. In these situations, it may be necessary to terminate the Pod.
After you identify the unresponsive Pod (e.g. by finding a node not responding to cluster ping):
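A sketch of the deletion (Pod name and namespace are placeholders):

```sh
kubectl delete pod -n <namespace> <stroom node pod name>
```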
This will attempt to drain tasks for the node. After the termination grace period has elapsed, the Pod will be killed, and a new one will automatically respawn to take its place. Once the new Pod finishes starting up, if functioning correctly, it should begin responding to cluster ping.
Note
Prior to a Stroom node being stopped (for whatever reason), task processing for that node is disabled and it is drained of all active tasks. Task processing resumes once the node starts up again.
Force deletion
If waiting for the grace period to elapse is unacceptable, and you are willing to risk shutting down the node without draining it first (or you are sure it has no active tasks), you can force delete the Pod using the procedure outlined in the Kubernetes documentation:
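For reference, the force-deletion form documented by Kubernetes looks like this (placeholders as before):

```sh
kubectl delete pod -n <namespace> <pod name> --grace-period=0 --force
```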