1 - Single Node Docker Installation

How to install a Single node instance of Stroom using Docker containers.

Running Stroom in Docker is the quickest and easiest way to get Stroom up and running. Using Docker means you don’t need to install the right versions of dependencies like Java or MySQL or get them configured correctly for Stroom.

This section details how to install single instances of Stroom and Stroom-Proxy using Docker.

Stroom Docker stacks

Stroom has a number of predefined stacks that combine multiple docker containers into a fully functioning Stroom environment. The Docker stacks are aimed primarily at single node instances or at evaluation/test use. The stacks make use of various shell scripts combined with Docker Compose to integrate the various Docker containers and make them easy to run.

At the moment the usable stacks are:

  • stroom_core - A single node stroom stack geared towards production use.

  • stroom_core_test - A single node stroom for test/evaluation, pre-loaded with content. Also includes a remote proxy for demonstration purposes. If you just want to try out Stroom, this is the one to use.

  • stroom_proxy - A remote proxy stack for aggregating and forwarding logs to stroom(-proxy). Intended for use as a remote proxy that will forward received/aggregated data into a downstream stroom/stroom-proxy.

  • stroom_services - A stack of supporting services (Nginx and the log sender) for deployments where Stroom itself runs without Docker.

Each stack contains the following docker compose services.

stroom_core

stroom
stroom-proxy-local
stroom-log-sender
nginx
mysql

stroom_core_test

stroom
stroom-proxy-local
stroom-proxy-remote
stroom-log-sender
nginx
mysql

stroom_proxy

stroom-proxy-remote
stroom-log-sender
nginx

stroom_services

stroom-log-sender
nginx

The services are as follows:

  • stroom - A Stroom instance.
  • stroom-proxy-local - A Stroom-Proxy instance that is typically local to Stroom and acts as its front door for data reception.
  • stroom-proxy-remote - A Stroom-Proxy instance that is remote from Stroom (e.g. owned by another team) and is intended to pass data to a downstream Stroom-Proxy.
  • nginx - An instance of nginx that is configured to reverse proxy to Stroom and Stroom-Proxy as appropriate. It can also be configured to act as a load balancer to multiple Stroom instances if Stroom is being installed without using Docker.
  • mysql - An instance of MySQL that is configured to create the database and users required by Stroom.
  • stroom-log-sender - A simple container that is configured to gather all the log files produced by Stroom, Stroom-Proxy and nginx, to then forward them to Stroom so Stroom can process its own logs.

Prerequisites

In order to run Stroom using Docker you will need the following installed on the machine you intend to run Stroom on:

  • Docker (the daemon/engine)
  • Docker Compose
  • bash and curl, which are used by the stack scripts

Install steps

This will install the core stack (Stroom and the peripheral services required to run Stroom).

Visit stroom-resources/releases to find the latest stack release. The Stroom stack comes in a number of different variants:

  • stroom_core_test - If you are just evaluating Stroom or just want to see it running then download the stroom_core_test*.tar.gz stack which includes some pre-loaded content.
  • stroom_core - If it is for an actual deployment of Stroom then download stroom_core*.tar.gz, which has no content and requires some configuration.

Using stroom_core_test-v7.10.11.tar.gz as an example:

# Define the version to download
VERSION="v7.10.11"; STACK="stroom_core_test"

# Download and extract the Stroom stack
curl -sL "https://github.com/gchq/stroom-resources/releases/download/stroom-stacks-${VERSION}/${STACK}-${VERSION}.tar.gz" | tar xz

# Navigate into the newly created stack directory
cd "${STACK}-${VERSION}"

# Start the stack
./start.sh

Alternatively, if you understand the risks of piping web-sourced content directly to bash, you can get the latest stroom_core_test release using:

# Download and extract the latest Stroom stack
bash <(curl -s https://gchq.github.io/stroom-resources/v7.1/get_stroom.sh)

# Navigate into the new stack directory
cd stroom_core_test/stroom_core_test*

# Start the stack
./start.sh

On first run stroom will build the database schemas so this can take a minute or two. The start.sh script will provide details of the various URLs that are available.

Open a browser (preferably Chrome) at https://localhost and login with:

  • username: admin
  • password: admin

The stroom stack comes supplied with self-signed certificates so you may need to accept a prompt warning you about visiting an untrusted site.

Configuration

To configure your new instance see Configuration.

2 - Configuration

Stroom and its associated services can be deployed in many ways (single node docker stack, non-docker cluster, Kubernetes, etc.). This document will cover two types of deployment:

  • Single node stroom_core docker stack.
  • A mixed deployment with nginx in docker and stroom, stroom-proxy and the database not in docker.

This document will explain how each application/service is configured and where its configuration files live.

Application Configuration

The following sections provide links to how to configure each application.

General configuration of docker stacks

Environment variables

The stroom docker stacks have a single env file <stack name>.env that acts as a single point to configure some aspects of the stack. Setting values in the env file can be useful when the value is shared between multiple containers. This env file sets environment variables that are then used for variable substitution in the docker compose YAML files, e.g.

    environment:
      - MYSQL_ROOT_PASSWORD=${STROOM_DB_ROOT_PASSWORD:-my-secret-pw}

In this example the environment variable STROOM_DB_ROOT_PASSWORD is read and used to set the environment variable MYSQL_ROOT_PASSWORD in the docker container. If STROOM_DB_ROOT_PASSWORD is not set then the value my-secret-pw is used instead.

The environment variables set in the env file are NOT automatically visible inside the containers. Only those environment variables defined in the environment section of the docker compose YAML files are visible. These environment entries can either be hard-coded values or use environment variables from outside the container. In some cases the names in the env file and the names of the environment variables set in the containers are the same, in others they are different.
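For example, a docker compose service entry might pass an env file value through like this (a minimal sketch; the file name and variable names mirror the earlier example):

```yaml
# Fragment of a docker compose YAML file (e.g. config/01_stroom.yml).
# Only variables listed under 'environment' become visible inside the
# container; STROOM_DB_ROOT_PASSWORD is read from the env file at
# compose time.
services:
  stroom-all-dbs:
    environment:
      - MYSQL_ROOT_PASSWORD=${STROOM_DB_ROOT_PASSWORD:-my-secret-pw}
      - SOME_HARD_CODED_VALUE=foo   # hard-coded, not from the env file
```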

The environment variables set in the containers can then be used by the application running in each container to set its configuration. For example, stroom’s config.yml file also uses variable substitution, e.g.

appConfig:
  commonDbDetails:
    connection:
      jdbcDriverClassName: "${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}"

In this example jdbcDriverClassName will be set to the value of the environment variable STROOM_JDBC_DRIVER_CLASS_NAME, or com.mysql.cj.jdbc.Driver if that is not set.

The following example shows how setting MY_ENV_VAR=123 means myProperty will ultimately get a value of 123 and not its default of 789.

env file (<stack name>.env) - MY_ENV_VAR=123
                |
                |
                | environment variable substitution
                |
                v
docker compose YAML (01_stroom.yml) - STROOM_ENV_VAR=${MY_ENV_VAR:-456}
                |
                |
                | environment variable substitution
                |
                v
Stroom configuration file (config.yml) - myProperty: "${STROOM_ENV_VAR:-789}"
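The same fallback chain can be sketched with plain Bash parameter expansion (MY_ENV_VAR, STROOM_ENV_VAR and myProperty are the hypothetical names from the diagram):

```shell
# Layer 1: the env file sets MY_ENV_VAR
export MY_ENV_VAR=123

# Layer 2: docker compose substitutes it (default 456 if unset)
STROOM_ENV_VAR="${MY_ENV_VAR:-456}"

# Layer 3: config.yml substitutes again (default 789 if unset)
myProperty="${STROOM_ENV_VAR:-789}"
echo "$myProperty"   # 123

# With MY_ENV_VAR unset, the compose-level default wins
unset MY_ENV_VAR
STROOM_ENV_VAR="${MY_ENV_VAR:-456}"
echo "${STROOM_ENV_VAR:-789}"   # 456
```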

Note that environment variables are only set into the container on start. Any changes to the env file will not take effect until the container is (re)started.

Configuration files

The following shows the basic structure of a stack with respect to the location of the configuration files:

── stroom_core_test-vX.Y.Z
   ├── config                [stack env file and docker compose YAML files]
   └── volumes
       └── <service>
            └── conf/config   [service specific configuration files]

Some aspects of configuration do not lend themselves to environment variable substitution, e.g. deeply nested parts of stroom’s config.yml. In these instances it may be necessary to have static configuration files that have no connection to the env file or only use environment variables for some values.

Bind mounts

Everything in the stack volumes directory is bind-mounted into the named docker container but is mounted read-only to the container. This allows configuration files to be read by the container but not modified.

Typically the bind mounts mount a directory into the container, though in the case of the stroom-all-dbs.cnf file, the file is mounted. The mounts are done using the inode of the file/directory rather than the name, so docker will mount whatever the inode points to even if the name changes. If for instance the stroom-all-dbs.cnf file is renamed to stroom-all-dbs.cnf.old then copied to stroom-all-dbs.cnf and then the new version modified, the container would still see the old file.
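This inode behaviour can be demonstrated without Docker (a minimal sketch using GNU stat; the file name mirrors the example above):

```shell
# Show that rename-and-copy changes the inode behind a name, while an
# in-place rewrite keeps it - the same reason a renamed bind-mounted
# file is invisible to the container.
cd "$(mktemp -d)"
echo "original" > stroom-all-dbs.cnf
inode_before=$(stat -c %i stroom-all-dbs.cnf)

# Rename then copy back: the name now points at a NEW inode, so a
# container that mounted the old inode would still see "original".
mv stroom-all-dbs.cnf stroom-all-dbs.cnf.old
cp stroom-all-dbs.cnf.old stroom-all-dbs.cnf
inode_after=$(stat -c %i stroom-all-dbs.cnf)
[ "$inode_before" != "$inode_after" ] && echo "rename-and-copy changed the inode"

# Truncate-and-rewrite reuses the existing inode, so the container
# would see the new content.
echo "modified" > stroom-all-dbs.cnf.old
cat stroom-all-dbs.cnf.old > stroom-all-dbs.cnf
[ "$(stat -c %i stroom-all-dbs.cnf)" = "$inode_after" ] && echo "in-place rewrite kept the inode"
```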

Docker managed volumes

When stroom is running various forms of data are persisted, e.g. stroom’s stream store, stroom-all-dbs database files, etc. All this data is stored in docker managed volumes. By default these will be located in /var/lib/docker/volumes/<volume name>/_data and root/sudo access will be needed to access these directories.

Docker data root

IMPORTANT

By default Docker stores all its images, container layers and managed volumes in its default data root directory which defaults to /var/lib/docker. It is typical in server deployments for the root file system to be kept fairly small and this is likely to result in the root file system running out of space due to the growth in docker images/layers/volumes in /var/lib/docker. It is therefore strongly recommended to move the docker data root to another location with more space.

There are various options for achieving this. In all cases the docker daemon should be stopped prior to making the changes, e.g. service docker stop, then started afterwards.

  • Symlink - One option is to move the /var/lib/docker directory to a new location then create a symlink to it. For example:

    mv /var/lib/docker /large_mount/docker_data_root
    ln -s /large_mount/docker_data_root /var/lib/docker

    This has the advantage that anyone unaware that the data root has moved will be able to easily find it if they look in the default location.

  • Configuration - The location can be changed by adding this key to the file /etc/docker/daemon.json (or creating this file if it doesn’t exist):

    {
      "data-root": "/mnt/docker"
    }
    
  • Mount - If your intention is to use a whole storage device for the docker data root then you can mount that device to /var/lib/docker. You will need to make a copy of the /var/lib/docker directory prior to doing this, then copy it onto the new mount once it has been created. The process for setting up this mount will be OS dependent and is outside the scope of this document.

Active services

Each stroom docker stack comes pre-built with a number of different services, e.g. the stroom_core stack contains the following:

  • stroom
  • stroom-proxy-local
  • stroom-all-dbs
  • nginx
  • stroom-log-sender

While you can pass a set of service names to the commands like start.sh and stop.sh, it may sometimes be required to configure the stack instance to only have a set of services active. You can set the active services like so:

./set_services.sh stroom stroom-all-dbs nginx

After running the above example, subsequent use of commands like start.sh and stop.sh with no named services will only act upon the active services set by set_services.sh. The list of active services is held in ACTIVE_SERVICES.txt and the full list of available services is held in ALL_SERVICES.txt.

Certificates

A number of the services in the docker stacks will make use of SSL certificates/keys in various forms. The certificate/key files are typically found in the directories volumes/<service>/certs/.

The stacks come with a set of client/server certificates that can be used for demo/test purposes. For production deployments these should be replaced with the actual certificates/keys for your environment.

In general the best approach to configuring the certificates/keys is to replace the existing files with symlinks to the actual files. For example in the case of the server certificates for nginx (found in volumes/nginx/certs/) the directory would look like:

ca.pem.crt -> /some/path/to/certificate_authority.pem.crt
server.pem.crt -> /some/path/to/host123.pem.crt
server.unencrypted.key -> /some/path/to/host123.key

This approach avoids the need to change any configuration files to reference differently named certificate/key files and avoids having to copy your real certificates/keys into multiple places.
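Under that approach, the symlinks for the nginx example above could be created like this (the /some/path/to/ files are placeholders for your real certificates):

```shell
# Create symlinks in the nginx certs directory pointing at the real
# certificate/key files (paths are placeholders).
cd volumes/nginx/certs/
ln -sf /some/path/to/certificate_authority.pem.crt ca.pem.crt
ln -sf /some/path/to/host123.pem.crt server.pem.crt
ln -sf /some/path/to/host123.key server.unencrypted.key
```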

For examples of how to create certificates, keys and keystores, see createCerts.sh.

2.1 - Stroom and Stroom-Proxy Configuration

How to configure Stroom and Stroom-Proxy.

The Stroom and Stroom-Proxy applications are built on the same Dropwizard framework so have a lot of similarities when it comes to configuration.

The Stroom/Stroom-Proxy applications are essentially just an executable JAR file that can be run when provided with a configuration file, config.yml. This config file is common to all forms of deployment.

2.1.1 - Common Configuration

Configuration common to Stroom and Stroom-Proxy.

This YAML file, sometimes known as the Dropwizard configuration file (as it conforms to a structure defined by Dropwizard) is the primary means of configuring Stroom/Stroom-Proxy. As a minimum this file should be used to configure anything that needs to be set before stroom can start up, e.g. web server, logging, database connection details, etc. It is also used to configure anything that is specific to a node in a stroom cluster.

If you are using some form of scripted deployment, e.g. Ansible, then it can be used to set all stroom properties for the environment that stroom runs in. If you are not using scripted deployments then you can maintain stroom’s node agnostic configuration properties via the user interface.

Config File Structure

This file contains both the Dropwizard configuration settings (settings for ports, paths and application logging) and the Stroom/Stroom-Proxy application specific properties configuration. The file is in YAML format and the application properties are located under the appConfig key. For details of the Dropwizard configuration structure, see the Dropwizard documentation.

The file is split into sections using these keys:

  • server - Configuration of the web server, e.g. ports, paths, request logging.
  • logging - Configuration of application logging
  • jerseyClients - Configuration of the various Jersey HTTP clients in use. See Jersey HTTP Client Configuration.
  • Application specific configuration:
    • appConfig - The Stroom configuration properties. These properties can be viewed/modified in the user interface.
    • proxyConfig - The Stroom-Proxy configuration properties. Stroom-Proxy has no user interface, so these properties can only be managed via the YAML file.

The following is an example of the YAML configuration file for Stroom:

# Dropwizard configuration section
server:
  # e.g. ports and paths
logging:
  # e.g. logging levels/appenders

jerseyClients:
  DEFAULT:
    # Configuration of the named client

# Stroom properties configuration section
appConfig:
  commonDbDetails:
    connection:
      jdbcDriverClassName: ${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}
      jdbcDriverUrl: ${STROOM_JDBC_DRIVER_URL:-jdbc:mysql://localhost:3307/stroom?useUnicode=yes&characterEncoding=UTF-8}
      jdbcDriverUsername: ${STROOM_JDBC_DRIVER_USERNAME:-stroomuser}
      jdbcDriverPassword: ${STROOM_JDBC_DRIVER_PASSWORD:-stroompassword1}
  contentPackImport:
    enabled: true
  ...

The following is an example of the YAML configuration file for Stroom-Proxy:

# Dropwizard configuration section
server:
  # e.g. ports and paths
logging:
  # e.g. logging levels/appenders

jerseyClients:
  DEFAULT:
    # Configuration of the named client

# Stroom properties configuration section
proxyConfig:
  path:
    home: /some/path
  ...

appConfig Section

The appConfig section is special as it maps to the Properties seen in the Stroom user interface so values can be managed in the file or via the Properties screen in the Stroom UI. The other sections of the file can only be managed via the YAML file. In the Stroom user interface, properties are named with a dot notation key, e.g. stroom.contentPackImport.enabled. Each part of the dot notation property name represents a key in the YAML file, e.g. for this example, the location in the YAML would be:

appConfig:
  contentPackImport:
    enabled: true   # stroom.contentPackImport.enabled

The stroom part of the dot notation name is replaced with appConfig.

For more details on the link between this YAML file and Stroom Properties, see Properties.

Variable Substitution

The YAML configuration file supports Bash style variable substitution in the form of:

${ENV_VAR_NAME:-value_if_not_set}

This allows values to be set either directly in the file or via an environment variable, e.g.

      jdbcDriverClassName: ${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}

In the above example, if the STROOM_JDBC_DRIVER_CLASS_NAME environment variable is not set then the value com.mysql.cj.jdbc.Driver will be used instead.
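This behaviour is plain Bash parameter expansion and can be verified in a shell (the MariaDB driver class here is just an arbitrary override value for illustration):

```shell
# No environment variable set - the default after ':-' is used
unset STROOM_JDBC_DRIVER_CLASS_NAME
echo "${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}"
# com.mysql.cj.jdbc.Driver

# Variable set - its value wins over the default
export STROOM_JDBC_DRIVER_CLASS_NAME="org.mariadb.jdbc.Driver"
echo "${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}"
# org.mariadb.jdbc.Driver
```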

Typed Values

YAML supports typed values rather than just strings, see https://yaml.org/refcard.html. YAML understands booleans, strings, integers, floating point numbers, as well as sequences/lists and maps. Some properties will be represented differently in the user interface to the YAML file. This is due to how values are stored in the database and how the current user interface works. This will likely be improved in future versions. For details of how different types are represented in the YAML and the UI, see Data Types.

Server configuration

The server section controls the configuration of the Jetty web server.

For full details of how to configure the server section, see the Dropwizard configuration reference.

The following is an example of the configuration for an application listening on HTTP.

server:
  # The base path for the main application and its API
  applicationContextPath: "/"
  # The base path for the administration pages/API
  # For Stroom-Proxy the default is /proxyAdmin
  adminContextPath: "/stroomAdmin"

  # The scheme/port for the main application and its API
  applicationConnectors:
    - type: http
      # For Stroom-Proxy the default is 8090
      port: 8080
      # Uses X-Forwarded-*** headers in request log instead of proxy server details.
      useForwardedHeaders: true

  # The scheme/port for the administration pages/API
  adminConnectors:
    - type: http
      # For Stroom-Proxy the default is 8091
      port: 8081
      useForwardedHeaders: true

Common Application Configuration

This section details configuration that is common in both the Stroom appConfig and Stroom-Proxy proxyConfig sections.

Receive Configuration

Configuration for controlling the receipt of data into Stroom and Stroom-Proxy through the /datafeed API.

appConfig / proxyConfig:
  receive:
    # An allow-list containing IP addresses or fully qualified host names to verify that the direct sender
    # of a request (e.g. a load balancer or reverse proxy) is trusted to supply certificate/DN headers
    # as configured with 'x509CertificateHeader' and 'x509CertificateDnHeader'.
    # If this list is null/empty then no check will be made on the client's address.
    allowedCertificateProviders: []
    # Standard cache configuration block for the cache of authenticated Datafeed Keys.
    # This cache is used to avoid having to re-verify every data feed key.
    authenticatedDataFeedKeyCache:
    # If true, the sender will be authenticated using a certificate or token depending on the
    # state of tokenAuthenticationEnabled and certificateAuthenticationEnabled. If the sender
    # can't be authenticated an error will be returned to the client
    # If false, then authentication will be performed if a token/key/certificate
    # is present, otherwise data will be accepted without a sender identity
    authenticationRequired: true
    # The meta key that is used to identify the owner of a Data Feed Key. This
    # may be an AccountId or similar. It must be provided as a header when sending data
    # using the associated Data Feed Key, and its value will be checked against the value
    # held with the hashed Data Feed Key by Stroom. Default value is 'AccountId'.
    # Case does not matter
    dataFeedKeyOwnerMetaKey: "AccountId"
    # The directory where Stroom will look for datafeed key files.
    # Only used if datafeedKeyAuthenticationEnabled is true
    # If the value is a relative path then it will be treated as being
    # relative to stroom.path.home. Data feed key files must have the extension .json.
    # Files in sub-directories will be ignored.
    dataFeedKeysDir: "data_feed_keys"
    # The types of authentication that are enabled for data receipt.
    # One or more of 
    # TOKEN - A Stroom API Key or an OAuth token in the 'Authorization' header
    # CERTIFICATE - An X509 certificate on the request or a DN in the header configured
    #               by .receive.x509CertificateDnHeader
    # DATA_FEED_KEY - A Stroom Data Feed Key in the 'Authorization' header
    enabledAuthenticationTypes:
    - "TOKEN"
    - "CERTIFICATE"
    # If receiptCheckMode is RECEIPT_POLICY or FEED_STATUS and stroom/proxy is
    # unable to perform the receipt check, then this action will be used as a fallback
    # until the receipt check can be successfully performed
    fallbackReceiveAction: "RECEIVE"
    # If true the client is not required to set the 'Feed' header. If Feed is not present
    # a feed name will be generated based on the template specified by the
    # 'feedNameTemplate' property. If false (the default), a populated 'Feed'
    # header will be required
    feedNameGenerationEnabled: false
    # The set of header keys that are mandatory if feedNameGenerationEnabled is set to true.
    # Should be set to complement the header keys used in 'feedNameTemplate', but may be a
    # sub-set of those in the template to allow for optional headers
    feedNameGenerationMandatoryHeaders:
    - "AccountId"
    - "Component"
    - "Format"
    - "Schema"
    # A template for generating a feed name from a set of headers. The value of
    # each header referenced in the template will have any unsuitable characters
    # replaced with '_'.
    # If this property is set in the YAML file, use single quotes to prevent the
    # variables being expanded when the config file is loaded
    feedNameTemplate: "${accountid}-${component}-${format}-${schema}"
    # If defined then states the maximum size of a request (uncompressed for gzip requests).
    # Will return a 413 Content Too Large response code for any requests exceeding this
    # value. If undefined then there is no limit to the size of the request.
    maxRequestSize: null
    # Set of supported meta type names. This set must contain all of the names
    # in the default value for this property but can contain additional names.
    metaTypes:
    - "Context"
    - "Detections"
    - "Error"
    - "Events"
    - "Meta Data"
    - "Raw Events"
    - "Raw Reference"
    - "Records"
    - "Reference"
    - "Test Events"
    - "Test Reference"
    # Controls how or whether data is checked on receipt. Valid values
    # (FEED_STATUS|RECEIPT_POLICY|RECEIVE_ALL|REJECT_ALL|DROP_ALL)
    receiptCheckMode: "FEED_STATUS"
    # The format of the Distinguished Name used in the certificate. Valid values are
    # LDAP and OPEN_SSL, where LDAP is the default
    x509CertificateDnFormat: "LDAP"
    # The HTTP header key used to extract the distinguished name (DN) as obtained from an X509 certificate.
    # This is used when a load balancer does the SSL/mTLS termination and passes the client DN through
    # in a header. Only used for
    # authentication if a value is set and 'enabledAuthenticationTypes' includes CERTIFICATE
    x509CertificateDnHeader: "X-SSL-CLIENT-S-DN"
    # The HTTP header key used to extract an X509 certificate. This is used when a load balancer does the
    # SSL/mTLS termination and passes the client certificate through in a header. Only used for
    # authentication if a value is set and 'enabledAuthenticationTypes' includes CERTIFICATE
    x509CertificateHeader: "X-SSL-CERT"

Cache Configuration

Multiple configuration branches in both Stroom and Stroom-Proxy have one or more properties for configuring a cache. Each of these share the same structure and will typically be named xxxCache, e.g. feedStatusCache or metaTypeCache.

      xxxCache:
        # Specifies that each entry should be automatically removed from the cache once
        # this duration has elapsed after the entry's creation, the most recent replacement of
        # its value, or its last read. In ISO-8601 duration format, e.g. 'PT10M'. If no value is set then
        # entries will not be aged out based on these criteria.
        expireAfterAccess: 
        # Specifies that each entry should be automatically removed from the cache once
        # a fixed duration has elapsed after the entry's creation, or the most recent replacement of its value.
        # In ISO-8601 duration format, e.g. 'PT5M'. If no value is set then entries will not be aged out based on
        # these criteria.
        expireAfterWrite:
        # Specifies the maximum number of entries the cache may contain. Note that the cache
        # may evict an entry before this limit is exceeded or temporarily exceed the threshold while evicting.
        # As the cache size grows close to the maximum, the cache evicts entries that are less likely to be used
        # again. For example, the cache may evict an entry because it hasn't been used recently or very often.
        # When size is zero, elements will be evicted immediately after being loaded into the cache. This can
        # be useful in testing, or to disable caching temporarily without a code change. If no value is set then
        # no size limit will be applied
        maximumSize:
        # Specifies that each entry should be automatically refreshed in the cache after
        # a fixed duration has elapsed after the entry's creation, or the most recent replacement of its value.
        # In ISO-8601 duration format, e.g. 'PT5M'. Refreshing is performed asynchronously and the current value
        # provided until the refresh has occurred. This mechanism allows the cache to update values without any
        # impact on performance
        refreshAfterWrite:
        # Determines whether/how statistics are captured on cache usage
        # (e.g. hits, misses, entries, etc.). Values are (NONE, INTERNAL, DROPWIZARD_METRICS).
        # NONE means capture no stats, offering a very slight performance gain, but the Caches screen in Stroom
        # won't be able to show any stats for this cache.
        # INTERNAL means the stats are captured but are only accessible via the Stroom Caches screen, thus not
        # suitable for Stroom-Proxy.
        # DROPWIZARD_METRICS means the stats are captured and are accessible via the Stroom Caches screen AND via
        # the metrics servlet on the admin port for integration with tools like Graphite/Collectd
        # The default for Stroom is INTERNAL, the default for Stroom-Proxy is DROPWIZARD_METRICS
        statisticsMode:
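As an illustration only (the values here are not recommendations), a populated cache block for feedStatusCache might look like:

```yaml
      feedStatusCache:
        # Re-fetch the feed status one minute after it was cached
        expireAfterWrite: "PT1M"
        # Cap the cache at 1000 entries
        maximumSize: 1000
        # Expose hit/miss stats via the metrics servlet on the admin port
        statisticsMode: "DROPWIZARD_METRICS"
```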

Open ID Configuration

Both Stroom and Stroom-Proxy share the same configuration structure for configuring Open ID Connect authentication. This section of config is only applicable if appConfig/proxyConfig.security.authentication.identityProviderType is set to EXTERNAL_IDP.

appConfig / proxyConfig:
  security:
    authentication:
      openId:
        # A set of audience claim values, one of which must appear in the audience
        # claim in the token.
        # If empty, no validation will be performed on the audience claim
        # If audienceClaimRequired is false and there is no audience claim in the token,
        # then allowedAudiences will be ignored
        allowedAudiences: []
        # If true the token will fail validation if the audience claim is not present
        # and allowedAudiences is not empty
        audienceClaimRequired: false
        # The authentication endpoint used in OpenId authentication
        # Should only be set if not using a configuration endpoint
        authEndpoint: null
        # If custom scopes are required for client_credentials requests then this should be
        # set to replace the default of 'openid'. E.g. for Azure AD you will likely need to set
        # this to 'openid' and '<your-app-id-uri>/.default'
        clientCredentialsScopes:
        - "openid"
        # The client ID used in OpenId authentication.
        clientId: null
        # The client secret used in OpenId authentication.
        clientSecret: null
        # If using an AWS load balancer to handle the authentication, set this to the Amazon
        # Resource Names (ARN) of the load balancer(s) fronting stroom, which will be something
        # like 'arn:aws:elasticloadbalancing:region-code:account-id:loadbalancer
        # /app/load-balancer-name/load-balancer-id'.
        # This config value will be used to verify the 'signer' in the JWT header.
        # Each value is the first N characters of the ARN and as a minimum must include up to
        # the colon after the account-id, i.e.
        # 'arn:aws:elasticloadbalancing:region-code:account-id:'
        # See https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html#user-claims-encoding
        expectedSignerPrefixes: []
        # Some OpenId providers, e.g. AWS Cognito, require a form to be used for token requests.
        formTokenRequest: true
        # A template to build the user's full name using claim values as variables in the
        # template. E.g '${firstName} ${lastName}' or '${name}'.
        # If this property is set in the YAML file, use single quotes to prevent the
        # variables being expanded when the config file is loaded. Note: claim names are
        # case sensitive
        fullNameClaimTemplate: "${name}"
        # The type of Open ID Connect identity provider that stroom/stroom-proxy
        # will use for authentication. Valid values are:
        # INTERNAL_IDP - Stroom's internal IDP. Not valid for Stroom-Proxy.
        # EXTERNAL_IDP - An external IDP such as KeyCloak/Cognito.
        # TEST_CREDENTIALS - Use hard-coded authentication credentials, for test/demo only.
        # NO_IDP - No IDP is used. API keys are set in config for feed status checks. Only
        #          for use by Stroom-Proxy.
        # Changing this property will require a restart of the application
        identityProviderType: "NO_IDP"
        # The issuer used in OpenId authentication.
        # Should only be set if not using a configuration endpoint
        issuer: null
        # The URI to obtain the JSON Web Key Set from in OpenId authentication
        # Should only be set if not using a configuration endpoint
        jwksUri: null
        # The logout endpoint for the identity provider
        # This is not typically provided by the configuration endpoint
        logoutEndpoint: null
        # The name of the URI parameter to use when passing the logout redirect URI to the IDP.
        # This is here as the spec seems to have changed from 'redirect_uri' to
        # 'post_logout_redirect_uri'
        logoutRedirectParamName: "post_logout_redirect_uri"
        # You can set an openid-configuration URL to automatically configure much of the openid
        # settings. Without this the other endpoints etc must be set manually
        openIdConfigurationEndpoint: null
        # If the token is signed by AWS then use this pattern to form the URI to obtain the
        # public key from. The pattern supports the variables '${awsRegion}' and '${keyId}'.
        # Multiple instances of a variable are also supported.
        # If this property is set in the YAML file, use single quotes to prevent the
        # variables being expanded when the config file is loaded.
        publicKeyUriPattern: "https://public-keys.auth.elb.${awsRegion}.amazonaws.com/${keyId}"
        # If custom auth flow request scopes are required then this should be set to replace
        # the defaults of 'openid' and 'email'.
        requestScopes:
        - "openid"
        - "email"
        # The token endpoint used in OpenId authentication
        # Should only be set if not using a configuration endpoint
        tokenEndpoint: null
        # The Open ID Connect claim used to link an identity on the IDP to a stroom user.
        # Must uniquely identify the user on the IDP and not be subject to change. Uses 'sub' by
        # default
        uniqueIdentityClaim: "sub"
        # The Open ID Connect claim used to provide a more human friendly username for a user
        # than that provided by uniqueIdentityClaim. It is not guaranteed to be unique and may
        # change
        userDisplayNameClaim: "preferred_username"
        # A set of issuers (in addition to the 'issuer' property provided by the IDP)
        # that are deemed valid when seen in a token. If no additional valid issuers are
        # required then set this to an empty set. This is also used to validate the
        # 'issuer' returned by the IDP when it is not a sub path of
        # 'openIdConfigurationEndpoint'.
        validIssuers: []
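As a sketch, an instance that delegates authentication to an external IDP typically needs little more than the following (the endpoint URL is illustrative for a KeyCloak-style IDP; the OAuth client ID/secret are configured elsewhere in this branch):

```yaml
        identityProviderType: "EXTERNAL_IDP"
        # Most endpoints (issuer, jwksUri, tokenEndpoint, etc.) are discovered from
        # this URL, so they do not need to be set individually. Illustrative value only.
        openIdConfigurationEndpoint: "https://idp.somedomain/realms/stroom/.well-known/openid-configuration"
        uniqueIdentityClaim: "sub"
        userDisplayNameClaim: "preferred_username"
```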

Jersey HTTP Client Configuration

Stroom and Stroom Proxy use the Jersey client for making HTTP connections with other nodes or other systems (e.g. Open ID Connect identity providers). In the YAML file, the jerseyClients key controls the configuration of the various clients in use.

To allow complete control of the client configuration, Stroom uses the concept of named client configurations. Each named client will be unique to a destination (where a destination is typically a server or a cluster of functionally identical servers). Thus the configuration of the connections to each of those destinations can be configured independently.

The client names are as follows:

  • DEFAULT - The default client configuration used if a named configuration is not present.
  • AWS_PUBLIC_KEYS - Connections to fetch AWS public keys used in Open ID Connect authentication.
  • DOWNSTREAM - Connections to downstream proxy/stroom instances to check feed status. (Stroom Proxy only).
  • OPEN_ID - Connections to an Open ID Connect identity provider, e.g. Cognito, Azure AD, KeyCloak, etc.
  • STROOM - Inter-node communications within the Stroom cluster (Stroom only).

The following is an example of how the clients are configured in the YAML file:

jerseyClients:
  DEFAULT:
    # Default client configuration, e.g.
    timeout: 500ms
  STROOM:
    # Configuration items for stroom inter-node communications
    timeout: 30s
  # etc.

The configuration keys (along with their default values and descriptions) for each client can be found in the Dropwizard documentation for its HTTP client configuration.

The following is another example including most keys:

jerseyClients:
  DEFAULT:
    minThreads: 1
    maxThreads: 128
    workQueueSize: 8
    gzipEnabled: true
    gzipEnabledForRequests: true
    chunkedEncodingEnabled: true
    timeout: 500ms
    connectionTimeout: 500ms
    timeToLive: 1h
    cookiesEnabled: false
    maxConnections: 1024
    maxConnectionsPerRoute: 1024
    keepAlive: 0ms
    retries: 0
    userAgent: <application name> (<client name>)
    proxy:
      host: 192.168.52.11
      port: 8080
      scheme : http
      auth:
        username: secret
        password: stuff
        authScheme: NTLM
        realm: realm
        hostname: host
        domain: WINDOWSDOMAIN
        credentialType: NT
      nonProxyHosts:
        - localhost
        - '192.168.52.*'
        - '*.example.com'
    tls:
      protocol: TLSv1.2
      provider: SunJSSE
      verifyHostname: true
      keyStorePath: /path/to/file
      keyStorePassword: changeit
      keyStoreType: JKS
      trustStorePath: /path/to/file
      trustStorePassword: changeit
      trustStoreType: JKS
      trustSelfSignedCertificates: false
      supportedProtocols: TLSv1.1,TLSv1.2
      supportedCipherSuites: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
      certAlias: alias-of-specific-cert

Logging Configuration

The Dropwizard configuration file controls all the logging by the application. In addition to the main application log, there are additional logs such as stroom user events (for audit), Stroom-Proxy send and receive logs and database migration logs.

For full details of the logging configuration, see Dropwizard Logging Configuration.

Request Log

The request log is slightly different to the other logs. It logs all requests to the web server. It is configured in the server section.

The property archivedLogFilenamePattern controls rolling of the active log file. The date pattern in the filename controls the frequency at which the log files are rolled. In this example, files will be rolled every minute.

server:
  requestLog:
    appenders:
    - type: file
      currentLogFilename: logs/access/access.log
      discardingThreshold: 0
      # Rolled and gzipped every minute
      archivedLogFilenamePattern: logs/access/access-%d{yyyy-MM-dd'T'HH:mm}.log.gz
      archivedFileCount: 10080
      logFormat: '%h %l "%u" [%t] "%r" %s %b "%i{Referer}" "%i{User-Agent}" %D'

Logback Logs

Dropwizard uses Logback for application level logging. All logs in Stroom and Stroom-Proxy apart from the request log are Logback based logs.

Logback uses the concept of Loggers and Appenders. A Logger is a named thing that produces log messages. An Appender is an output that a Logger can append its log messages to. Typical Appenders are:

  • File - appends messages to a file that may or may not be rolled.
  • Console - appends messages to stdout.
  • Syslog - appends messages to syslog.

Loggers

A Logger can append to more than one Appender if required. For example, the default configuration file for Stroom has two appenders for the application logs: the rolled files from one appender are POSTed to Stroom (so that Stroom can index its own logs) and then deleted, while the files from the other appender are intended to remain on the server until archived off, to allow viewing by an administrator.

A Logger can be configured with a severity; valid severities are TRACE, DEBUG, INFO, WARN and ERROR. The severity set on a logger means that only messages with that severity or higher will be logged; the rest are discarded.

Logger names are typically the name of the Java class that is producing the log message. You don’t need to understand too much about Java classes as you are only likely to change logger severities when requested by one of the developers. Some loggers, such as event-logger do not have a Java class name.

As an example this is a portion of a Stroom config.yml file to illustrate the different loggers/appenders:

logging:
  # This is root logging severity level for all loggers. Only messages >= to WARN will be logged unless overridden
  # for a specific logger
  level: WARN

  # All the named loggers
  loggers:
    # Logs useful information about stroom. Only set DEBUG on specific 'stroom' classes or packages
    # due to the large volume of logs that would be produced for all of 'stroom' in DEBUG.
    stroom: INFO
    # Logs useful information about dropwizard when booting stroom
    io.dropwizard: INFO
    # Logs useful information about the jetty server when booting stroom
    org.eclipse.jetty: INFO
    # Logs REST request/responses with headers/payloads. Set this to OFF to disable that logging.
    org.glassfish.jersey.logging.LoggingFeature: INFO
    # Logs summary information about FlyWay database migrations
    org.flywaydb: INFO
    # Logger and custom appender for audit logs
    event-logger:
      level: INFO
      # Prevents messages from this logger from being sent to other appenders
      additive: false
      appenders:
        - type: file
          currentLogFilename: logs/user/user.log
          discardingThreshold: 0
          # Rolled every minute
          archivedLogFilenamePattern: logs/user/user-%d{yyyy-MM-dd'T'HH:mm}.log
          # Minute rolled logs older than a week will be deleted. Note rolled logs are deleted
          # based on the age of the window they contain, not the number of them. This value should be greater
          # than the maximum time stroom is not producing events for.
          archivedFileCount: 10080
          logFormat: "%msg%n"
    # Logger and custom appender for the flyway DB migration SQL output
    org.flywaydb.core.internal.sqlscript:
      level: DEBUG
      additive: false
      appenders:
        - type: file
          currentLogFilename: logs/migration/migration.log
          discardingThreshold: 0
          # Rolled every day
          archivedLogFilenamePattern: logs/migration/migration-%d{yyyy-MM-dd}.log
          archivedFileCount: 10
          logFormat: "%-6level [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%t] %logger - %X{code} %msg %n"

Appenders

The following is an example of the default appenders that will be used for all loggers unless they have their own custom appender configured.

logging:
  # Appenders for all loggers except for where a logger has a custom appender configured
  appenders:

    # stdout
  - type: console
    # Multi-coloured log format for console output
    logFormat: "%highlight(%-6level) [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%green(%t)] %cyan(%logger) - %X{code} %msg %n"
    timeZone: UTC
#
    # Minute rolled files for stroom/datafeed, will be curl'd/deleted by stroom-log-sender
  - type: file
    currentLogFilename: logs/app/app.log
    discardingThreshold: 0
    # Rolled and gzipped every minute
    archivedLogFilenamePattern: logs/app/app-%d{yyyy-MM-dd'T'HH:mm}.log.gz
    # One week using minute files
    archivedFileCount: 10080
    logFormat: "%-6level [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%t] %logger - %X{code} %msg %n"

Log Rolling

Rolling of log files can be done based on size of file or time. The archivedLogFilenamePattern property controls the rolling behaviour. The rolling policy is determined from the filename pattern, e.g. a pattern with a minute precision date format will be rolled every minute. The following is an example of an appender that rolls based on the size of the log file:

  - type: file
    currentLogFilename: logs/app.log
    # The name pattern, where %i is a sequential number indicating age, 1 being the most recent
    archivedLogFilenamePattern: logs/app-%i.log
    # The maximum number of rolled files to keep
    archivedFileCount: 10
    # The maximum size of a log file
    maxFileSize: "100MB"
    logFormat: "%-6level [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%t] %logger - %X{code} %msg %n"

The following is an example of an appender that rolls every minute to gzipped files:

  - type: file
    currentLogFilename: logs/app/app.log
    # Rolled and gzipped every minute
    archivedLogFilenamePattern: logs/app/app-%d{yyyy-MM-dd'T'HH:mm}.log.gz
    # One week using minute files
    archivedFileCount: 10080
    logFormat: "%-6level [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%t] %logger - %X{code} %msg %n"

2.1.2 - Stroom Configuration

Describes how the Stroom application is configured.

General configuration

The Stroom application is essentially just an executable JAR file that can be run when provided with a configuration file, config.yml. This config file is common to all forms of deployment.

config.yml

Stroom operates on a configuration by exception basis so all configuration properties will have a sensible default value and a property only needs to be explicitly configured if the default value is not appropriate, e.g. for tuning a large scale production deployment or where values are environment specific. As a result config.yml only contains a minimal set of properties. The full tree of properties can be seen in ./config/config-defaults.yml and a schema for the configuration tree (along with descriptions for each property) can be found in ./config/config-schema.yml. These two files can be used as a reference when configuring stroom.

Key Configuration Properties

The following are key properties that would typically be changed for a production deployment. All configuration branches are relative to the appConfig root.

The database name(s), hostname(s), port(s), usernames(s) and password(s) should be configured using these properties. Typically stroom is configured to keep its statistics data in a separate database from the main stroom database, as configured below.

  commonDbDetails:
    connection:
      jdbcDriverUrl: "jdbc:mysql://localhost:3307/stroom?useUnicode=yes&characterEncoding=UTF-8"
      jdbcDriverUsername: "stroomuser"
      jdbcDriverPassword: "stroompassword1"
  statistics:
    sql:
      db:
        connection:
          jdbcDriverUrl: "jdbc:mysql://localhost:3307/stats?useUnicode=yes&characterEncoding=UTF-8"
          jdbcDriverUsername: "statsuser"
          jdbcDriverPassword: "stroompassword1"

In a clustered deployment each node must be given a node name that is unique within the cluster. This is used to identify nodes in the Nodes screen. It could be the hostname of the node or follow some other naming convention.

  node:
    name: "node1a"

Each node should have its identity on the network configured so that it uses the appropriate FQDNs. The nodeUri hostname is the FQDN of each node and used by nodes to communicate with each other, therefore it can be private to the cluster of nodes. The publicUri hostname is the public facing FQDN for stroom, i.e. the address of a load balancer or Nginx. This is the address that users will use in their browser.

  nodeUri:
    hostname: "localhost" # e.g. node5.stroomnodes.somedomain
  publicUri:
    hostname: "localhost" # e.g. stroom.somedomain

Deploying without Docker

Stroom running without docker has two files to configure it. The following locations are relative to the stroom home directory, i.e. the root of the distribution zip.

  • ./config/config.yml - Stroom configuration YAML file
  • ./config/scripts.env - Stroom scripts configuration env file

The distribution also includes these files which are helpful when it comes to configuring stroom.

  • ./config/config-defaults.yml - Full version of the config.yml file containing all branches/leaves with default values set. Useful as a reference for the structure and the default values.
  • ./config/config-schema.yml - The schema defining the structure of the config.yml file.

scripts.env

This file is used by the various shell scripts like start.sh, stop.sh, etc. This file should not need to be changed unless you want to change the locations where certain log files are written to or need to change the java memory settings.

In a production system it is highly likely that you will need to increase the Java heap size, as the default is only 2 GB. The heap size settings and any other Java command line options can be set by changing:

JAVA_OPTS="-Xms512m -Xmx2048m"
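For example, to run Stroom with a 2 GB initial and 8 GB maximum heap (the sizes here are illustrative; tune them to your data volumes and available RAM):

```shell
# Illustrative heap sizes; adjust for your environment
JAVA_OPTS="-Xms2g -Xmx8g"
```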

As part of a docker stack

When stroom is run as part of one of our docker stacks, e.g. stroom_core there are some additional layers of configuration to take into account, but the configuration is still primarily done using the config.yml file.

Stroom’s config.yml file is found in the stack in ./volumes/stroom/config/ and this is the primary means of configuring Stroom.

The stack also ships with a default config.yml file baked into the docker image. This minimal fallback file (located in /stroom/config-fallback/ inside the container) will be used in the absence of one provided in the docker stack configuration (./volumes/stroom/config/).

The default config.yml file uses environment variable substitution so some configuration items will be set by environment variables set into the container by the stack env file and the docker-compose YAML. This approach is useful for configuration values that need to be used by multiple containers, e.g. the public FQDN of Nginx, so it can be configured in one place.

If you need to further customise the stroom configuration then it is recommended to edit the ./volumes/stroom/config/config.yml file. This can either be a simple file with hard coded values or one that uses environment variables for some of its configuration items.

The configuration works as follows:

env file (stroom<stack name>.env)
                |
                |
                | environment variable substitution
                |
                v
docker compose YAML (01_stroom.yml)
                |
                |
                | environment variable substitution
                |
                v
Stroom configuration file (config.yml)
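As an illustration of this chain, a single value such as the public host name might pass through the three layers as follows (the variable names here are illustrative, not necessarily the exact names used by the stack):

```yaml
# stroom_core.env - the single source of the value:
#   export PUBLIC_HOSTNAME=stroom.somedomain
#
# 01_stroom.yml - docker compose sets it into the container environment:
#   services:
#     stroom:
#       environment:
#         - API_GATEWAY_HOST=${PUBLIC_HOSTNAME}
#
# config.yml - consumes the container environment variable, with a fallback default:
appConfig:
  publicUri:
    hostname: "${API_GATEWAY_HOST:-localhost}"
```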

Ansible

If you are using Ansible to deploy a stack then it is recommended that all of stroom’s configuration properties are set directly in the config.yml file using a templated version of the file and to NOT use any environment variable substitution. When using Ansible, the Ansible inventory is the single source of truth for your configuration so not using environment variable substitution for stroom simplifies the configuration and makes it clearer when looking at deployed configuration files.

Stroom-ansible has an example inventory for a single node stroom stack deployment. The group_vars/all file shows how values can be set into the env file.

2.1.3 - Stroom Proxy Configuration

Describes how the Stroom-Proxy application is configured.

The configuration of Stroom-proxy is very much the same as for Stroom with the only difference being the structure of the application specific part of the config.yml file. Stroom-proxy has a proxyConfig key in the YAML while Stroom has appConfig.

YAML Configuration File

The Stroom-proxy application is essentially just an executable JAR file that can be run when provided with a configuration file, config.yml. This configuration file is common to all forms of deployment.

As Stroom-proxy does not have a user interface, the config.yml file is the only way of configuring Stroom-Proxy. As with stroom, the config.yml file is split into three sections using these keys:

  • server - Configuration of the web server, e.g. ports, paths, request logging. See Server Configuration

  • logging - Configuration of application logging. See Logging Configuration

  • proxyConfig - Stroom-Proxy specific configuration

See also Properties for more details on structure of the config.yml file and supported data types.
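At the top level the file therefore looks like this (branch contents elided):

```yaml
server:
  # Web server configuration, e.g. ports, paths, request logging
logging:
  # Application logging configuration
proxyConfig:
  # Stroom-Proxy specific configuration
```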

Stroom-Proxy operates on a configuration by exception basis so as far as is possible, all configuration properties will have a sensible default value and a property only needs to be explicitly configured if the default value is not appropriate (e.g. for tuning a large scale production deployment) or where values are environment specific (e.g. the hostname of a forward destination).

As a result the config.yml shipped with Stroom Proxy only contains a minimal set of properties. The full tree of properties can be seen in ./config/config-defaults.yml and a schema for the configuration tree (along with descriptions for each property) can be found in ./config/config-schema.yml. These two files can be used as a reference when configuring stroom.

In the snippets of YAML configuration below, the values shown are the default values.

Basic Structure

Stroom-Proxy has a number of key functions which are all configured via its YAML configuration file.

The following YAML shows the high level structure of the Stroom-Proxy configuration file. Each branch of this YAML is explained in more detail below.

proxyConfig:

  # This should be set to a value that is unique within your Stroom/Stroom-Proxy estate.
  # It is used in the unique ReceiptId that is set in the meta of received data so
  # provides provenance of where data was received at each stage.
  proxyId: null

  # If true, Stroom-Proxy will halt on start up if any errors are found in the YAML
  # configuration file. If false, the errors will simply be logged. Setting this to
  # false is not advised
  haltBootOnConfigValidationFailure: true

  # Configuration of the base and temp paths used by Stroom-Proxy.
  # See Path Configuration below
  path:

  # This is the downstream (in flow of stream data terms) Stroom/Stroom-Proxy instance/cluster
  # used for feed status checks, supplying data receipt rules and verifying API keys.
  downstreamHost:

  # This controls the aggregation of received data into larger chunks prior to forwarding.
  # This is typically required to prevent Stroom receiving lots of small streams.
  aggregator:

  # If receive.receiptCheckMode is FEED_STATUS, this controls the feed status
  # checking. See Feed Status Configuration below.
  feedStatus:

  # Zero to many HTTP POST based destinations.
  # E.g. for forwarding to Stroom or another Stroom-Proxy
  forwardHttpDestinations:

  # Zero to many file system based destinations. See Forward Configuration below.
  forwardFileDestinations:

  # This controls the meta entries that will be included in the send and receive logs.
  logStream:

  # If receive.receiptCheckMode is RECEIPT_POLICY, this controls the fetching
  # of the policy rules.
  receiptPolicy:

  # This section is common to both Stroom and Stroom-Proxy
  # See Receive Configuration below.
  receive:

  # Configuration for authentication. See Security Configuration below.
  security:

Stroom-proxy should be configured to check the receipt status of feeds on receipt of data. This is done by configuring the end point of a downstream stroom-proxy or stroom.

  feedStatus:
    url: "http://stroom:8080/api/feedStatus/v1"
    apiKey: ""

The url should be the url for the feed status API on the downstream stroom(-proxy). If this is on the same host then you can use the http endpoint, however if it is on a remote host then you should use https and the host of its nginx, e.g. https://downstream-instance/api/feedStatus/v1.

In order to use the API, the proxy must have a configured apiKey. The API key must be created in the downstream stroom instance and then copied into this configuration.

If the proxy is configured to forward data then the forward destination(s) should be set. This is the datafeed endpoint of the downstream stroom-proxy or stroom instance that data will be forwarded to. This may also be the address of a load balancer or similar that is fronting a cluster of stroom-proxy or stroom instances. See also Feed status certificate configuration.

  forwardHttpDestinations:
    - enabled: true
      name: "downstream"
      forwardUrl: "https://some-host/stroom/datafeed"

forwardUrl specifies the URL of the datafeed endpoint on the destination host. Each forward location can use a different key/trust store pair. See also Forwarding certificate configuration.

If the proxy is configured to store then the location of the proxy repository may need to be configured if it needs to be in a different location to the proxy home directory, e.g. on another mount point.
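As a sketch (the paths are illustrative and the `temp` key is assumed from the base/temp path description in Path Configuration), the storage locations can be moved off the proxy home directory like this:

```yaml
proxyConfig:
  path:
    # Base directory for proxy data, e.g. on a larger mount point.
    # Illustrative path only.
    home: "/data/stroom-proxy"
    # Directory for temporary files. Illustrative path only.
    temp: "/data/stroom-proxy/tmp"
```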

Aggregator Configuration

proxyConfig:
  aggregator:
    enabled: true
    # Whether to split received ZIPs if they are too large.
    splitSources: true
    # Maximum number of items to include in an aggregate
    maxItemsPerAggregate: 1000
    # Maximum size of the aggregate in uncompressed bytes.
    # Aggregates may be larger than this if splitSources is false or single very
    # large streams are received.
    maxUncompressedByteSize: "1G"
    # The length of time that data is added to an aggregate before the aggregate is closed.
    aggregationFrequency: "PT10M"

Directory Scanner Configuration

This configuration controls the directories that Stroom-Proxy scans to look for ZIP files to ingest. It is primarily used as a means of manually re-processing files that have failed to forward, either as a result of too many retries or due to an unrecoverable error.

proxyConfig:
  dirScanner:
    # One or more directories to scan.
    # If the path is relative it is treated as relative to the proxyConfig.path.home property.
    dirs:
    - "zip_file_ingest"
    # Whether directory scanning is enabled or not
    enabled: true
    # The directory to move any failed files to.
    # If the path is relative it is treated as relative to the proxyConfig.path.home property.
    failureDir: "zip_file_ingest_failed"
    # How frequently each directory is scanned for files.
    scanFrequency: "PT1M"

Downstream Host Configuration

This is the default downstream (in flow of stream data terms) Stroom/Stroom-Proxy instance/cluster used for feed status checks, supplying data receipt rules and verifying API keys.


proxyConfig:
  downstreamHost:
    # http or https
    scheme: "https"
    # If not set, will default to 80/443 depending on scheme
    port: 443
    hostname: "...STROOM-PROXY OR STROOM FQDN..."
    # If not using OpenID authentication you will need to provide an API key.
    apiKey: "sak_6a011e3e5d_oKimmDxfNwj......<truncated>.....HYQxHaR2"

Event Store Configuration

The Event Store is used to store and aggregate individual events received via the /api/event API or the SQS Connectors. Events are appended to files specific to the Feed and Stream Type of the event. Once a threshold is reached, the file will be rolled and processed by Stroom-Proxy.

Each event is stored as a JSON line in the file.

proxyConfig:
  eventStore:
    # The size of an internal queue used to buffer aggregates that are ready to process.
    forwardQueueSize: 1000
    # The maximum age of the file before it is rolled.
    maxAge: "PT1M"
    # The maximum size of the file before it is rolled.
    maxByteCount: 9223372036854775807
    # The maximum number of events in the file before it is rolled.
    maxEventCount: 9223372036854775807
    # Configuration of the cache used for the event store.
    openFilesCache:
    # The frequency at which files are checked to see if they need to be rolled or not.
    rollFrequency: "PT10S"

Feed Status Configuration

The configuration for performing feed status checks. This section is only relevant if proxyConfig.receive.receiptCheckMode is set to FEED_STATUS.

proxyConfig:
  feedStatus:
    # Standard cache configuration block for configuring the cache of feed status check outcomes
    feedStatusCache:
    # The full URL to use for feed status checking.
    # ONLY set this if using a non-standard URL, otherwise
    # it will be derived from the downstreamHost.
    url: null

The configuration of the client certificates for feed status checks is done using the DOWNSTREAM jersey client configuration. See Stroom and Stroom-Proxy Common Configuration.
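For example, using the TLS keys shown in the Jersey client example earlier, a client certificate for the feed status checks might be configured like this (paths and passwords are illustrative):

```yaml
jerseyClients:
  DOWNSTREAM:
    tls:
      # Client certificate presented to the downstream stroom(-proxy)
      keyStorePath: "/path/to/proxy-client.jks"
      keyStorePassword: "changeit"
      # CA certificates used to verify the downstream's certificate
      trustStorePath: "/path/to/ca.jks"
      trustStorePassword: "changeit"
```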

Forward Configuration

Stroom-Proxy has two configuration branches for controlling forwarding as each has a different structure.

proxyConfig:
  # Zero to many HTTP POST based destinations.
  forwardHttpDestinations:
  # Zero to many file system based destinations.
  forwardFileDestinations:

Both types of forwarder have an enabled property. If a forwarder’s enabled state is set to false it is as if the forwarder configuration does not exist, i.e no data will be queued for that forwarder until its state is changed to true.

File Forward Destinations Configuration

proxyConfig:
  # Zero to many file system based destinations.
  forwardFileDestinations:
    # Stroom-Proxy will attempt to move files onto the forward destination using an atomic move.
    # This ensures that the move does not happen more than once. If an atomic move is not possible,
    # e.g. the destination is a remote file system that does not support an atomic move, then it will
    # fall back to a non-atomic move with the risk of it happening more than once. If you see warnings
    # in the logs or know the file system will not support atomic moves then set this to false
  - atomicMoveEnabled: true
    # Whether this destination is enabled or not.
    enabled: true
    # If Instant Forwarding is to be used.
    instant: false
    # The type of liveness check to perform:
    # READ - will attempt to read the file/dir specified in livenessCheckPath. 
    # WRITE - will attempt to touch the file specified in livenessCheckPath.
    livenessCheckMode: "READ"
    # The path to use for regular liveness checking of this forward destination.
    # If null, empty or if the 'queue' property is not configured, then no liveness check
    # will be performed and the destination will be
    # assumed to be healthy. If livenessCheckMode is READ, livenessCheckPath can be a
    # directory or a file and stroom-proxy will attempt to check it can read the
    # file/directory. If livenessCheckMode is WRITE, then livenessCheckPath must be a
    # file and stroom-proxy will attempt to touch that file. It is
    # only recommended to set this property for a remote file system where
    # connection issues may be likely. If it is a relative path, it will be assumed
    # to be relative to 'path'
    livenessCheckPath: null
    # The unique name of the destination (across all file/http forward destinations).
    # The name is used in the directories on the file system, so do not change the name
    # once proxy has processed data. Must be provided.
    name: "...PROVIDE FORWARDER NAME..."
    # The base path of a directory to forward to.
    path: "...PROVIDE PATH..."
    # See Queue Configuration section below
    queue:
    # The templated relative sub-path of path.
    # The default path template is '${year}${month}${day}/${feed}'
    # Cannot be an absolute path and must resolve to a descendant of path.
    # For details of this configuration branch, see Path Templating Configuration below.
    subPathTemplate: null

HTTP Forward Destinations Configuration

proxyConfig:
  # Zero to many HTTP POST based destinations.
  forwardHttpDestinations:
    # If true, add Open ID authentication headers to the request. Only works if the identityProviderType
    # is EXTERNAL_IDP and the destination is in the same Open ID Connect realm as the OIDC client that this
    # proxy instance is using.
  - addOpenIdAccessToken: false
    # The API key to use when forwarding data if Stroom is configured to require an API key.
    # Does NOT use the API Key from downstreamHost config.
    apiKey: null
    # Whether this destination is enabled or not.
    enabled: true
    forwardHeadersAdditionalAllowSet: []
    # The full URL to forward to if different from <downstreamHost>/datafeed
    forwardUrl: null
    # Configuration of the HTTP client, see below.
    httpClient:
    # If Instant Forwarding is to be used.
    instant: false
    # Whether liveness checking of the HTTP destination will take place. The queue property
    # must also be configured for liveness checking to happen
    livenessCheckEnabled: true
    # The URL/path to check for liveness of the forward destination. The URL should return a 200 response
    # to a GET request for the destination to be considered live.
    # If the response from the liveness check is not a 200, forwarding
    # will be paused at least until the next liveness check is performed.
    # If this property is not set, the downstreamHost configuration will be combined with the default API
    # path (/status).
    # If this property is just a path, it will be combined with the downstreamHost configuration.
    # Only set this property if you wish to use a non-default path,
    # or you want to use a different host/port/scheme to that defined in downstreamHost.
    livenessCheckUrl: null
    # The unique name of the destination (across all file/http forward destinations).
    # The name is used in the directories on the file system, so do not change the name
    # once proxy has processed data. Must be provided.
    name: "...PROVIDE FORWARDER NAME..."
    # See Queue Configuration section below
    queue:

Queue Configuration

Each forward destination (whether file or HTTP) has a queue configuration property that controls various aspects of forwarding, e.g. failure handling, delays, concurrency, etc.

  forwardHttpDestinations / forwardFileDestinations:
    queue:
      # The sub-path template to use for data that could not be retried
      # or has reached a retry limit.
      errorSubPathTemplate:
        enabled: true
        pathTemplate: "${year}${month}${day}/${feed}"
        templatingMode: "REPLACE_UNKNOWN_PARAMS"
      # A delay to add before forwarding. Primarily for testing.
      forwardDelay: "PT0S"
      # Number of threads to process retries
      forwardRetryThreadCount: 1
      # Number of threads to handle forwarding
      forwardThreadCount: 5
      # Duration between liveness checks
      livenessCheckInterval: "PT1M"
      # The maximum time from the first failed forward attempt to continue retrying.
      # After this the data will be moved to the failure directory permanently.
      maxRetryAge: "P7D"
      # The maximum time between retries. Must be greater than or equal to retryDelay.
      maxRetryDelay: "P1D"
      # If false, forwards will be attempted immediately and any failure will result in the
      # data being moved to the failure directory.
      queueAndRetryEnabled: false
      # The time between retries. If retryDelayGrowthFactor is >1, this value will grow
      # after each retry.
      retryDelay: "PT10M"
      # The factor to apply to retryDelay after each failed retry.
      retryDelayGrowthFactor: 1.0
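As a rough illustration of how the retry properties interact, the sketch below computes the sequence of delays produced by retryDelay, retryDelayGrowthFactor, maxRetryDelay and maxRetryAge. It assumes the growth factor is applied after each failed attempt and that retrying stops once the total elapsed time exceeds maxRetryAge; it is not the proxy's actual implementation.

```python
def retry_schedule(retry_delay_s, growth_factor, max_retry_delay_s, max_retry_age_s):
    """Yield the delay before each retry until maxRetryAge would be exceeded."""
    delay = retry_delay_s
    elapsed = 0.0
    while True:
        elapsed += delay
        if elapsed > max_retry_age_s:
            return
        yield delay
        # Grow the delay, but never beyond maxRetryDelay
        delay = min(delay * growth_factor, max_retry_delay_s)

# retryDelay=PT10M, retryDelayGrowthFactor=2.0, maxRetryDelay=P1D, maxRetryAge=P7D
delays = list(retry_schedule(600, 2.0, 86_400, 7 * 86_400))
print([d / 60 for d in delays[:8]])  # first delays in minutes: 10, 20, 40, ...
```

With a growth factor of 1.0 (the default) the delay stays constant at retryDelay for the whole retry window.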

Path Templating Configuration

The following properties all share the same structure:

  • proxyConfig.forwardFileDestinations.[n].subPathTemplate
  • proxyConfig.forwardFileDestinations.[n].queue.errorSubPathTemplate
  • proxyConfig.forwardHttpDestinations.[n].queue.errorSubPathTemplate
  xxxxxxTemplate:
    # Whether templating is enabled or not. If not enabled
    # no sub-path will be used.
    enabled: true
    # The template to use for the sub-path
    pathTemplate: "${year}${month}${day}/${feed}"
    # Controls how unknown parameters are dealt with. One of:
    # IGNORE_UNKNOWN_PARAMS - e.g. 'cat/${unknownparam}/dog' => 'cat/${unknownparam}/dog'
    # REMOVE_UNKNOWN_PARAMS - e.g. 'cat/${unknownparam}/dog' => 'cat/dog'
    # REPLACE_UNKNOWN_PARAMS - Replace unknown with 'XXX', e.g. 'cat/${unknownparam}/dog' => 'cat/XXX/dog'
    templatingMode: "REPLACE_UNKNOWN_PARAMS"

The following template parameters are supported:

  • ${feed} - The Feed name.
  • ${type} - The Stream Type.
  • ${year} - The 4 digit year of the current date/time.
  • ${month} - The 2 digit month of the current date/time.
  • ${day} - The 2 digit day of the current date/time.
  • ${hour} - The 2 digit hour of the current date/time.
  • ${minute} - The 2 digit minute of the current date/time.
  • ${second} - The 2 digit second of the current date/time.
  • ${millis} - The 3 digit milliseconds of the current date/time.
  • ${ms} - The current date/time as milliseconds since the Unix Epoch.
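The path templating behaviour described above can be sketched as follows. This is an illustrative approximation, not the proxy's own code; in particular the handling of empty path segments after REMOVE_UNKNOWN_PARAMS is an assumption.

```python
import re

def render_path(template, params, mode="REPLACE_UNKNOWN_PARAMS"):
    """Substitute ${name} parameters; unknown names are handled per templatingMode."""
    def repl(match):
        name = match.group(1)
        if name in params:
            return params[name]
        if mode == "IGNORE_UNKNOWN_PARAMS":
            return match.group(0)   # leave '${name}' as-is
        if mode == "REMOVE_UNKNOWN_PARAMS":
            return ""               # drop the parameter
        return "XXX"                # REPLACE_UNKNOWN_PARAMS
    path = re.sub(r"\$\{(\w+)\}", repl, template)
    # Collapse any '//' left behind by removed parameters (assumed behaviour)
    return re.sub(r"/{2,}", "/", path)

params = {"feed": "MY_FEED", "year": "2024", "month": "01", "day": "31"}
print(render_path("${year}${month}${day}/${feed}", params))  # 20240131/MY_FEED
```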

Liveness Checking

Each of the configured forward destinations has a liveness check that can be configured. This allows Stroom Proxy to periodically check that the destination is live. If the liveness check fails for a destination, all forwarding for that destination will be paused until a subsequent liveness check reports it as live again.

The liveness checks take the following forms:

  • HTTP Destination - Performs a GET request to the URL configured using forwardHttpDestinations.[n].livenessCheckUrl. If not configured it will use /status on the downstream host. The destination is considered live if it gets a 200 response. You can use a URL that allows the destination to control its liveness, i.e. to take itself offline during an upgrade.

  • File Destination - Reads or writes (touch) to a file defined by forwardFileDestinations.[n].livenessCheckPath. Liveness checking for a file destination may be useful if the destination is on a network file share. livenessCheckMode controls whether a read or write to the file is performed.
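The HTTP liveness semantics described above (a GET that must return 200, with anything else or any connection failure counting as not live) could be sketched as follows. This is illustrative only and not the proxy's implementation.

```python
import urllib.request

def is_live(liveness_url, timeout=10):
    """Treat the destination as live only on an HTTP 200 response to a GET."""
    try:
        with urllib.request.urlopen(liveness_url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, timeout, non-2xx, etc. all mean "not live"
        return False
```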

HTTP Client Configuration

proxyConfig:
  forwardHttpDestinations:
    httpClient:
      connectionRequestTimeout: "PT3M"
      connectionTimeout: "PT3M"
      cookiesEnabled: false
      keepAlive: "PT0S"
      maxConnections: 1024
      maxConnectionsPerRoute: 1024
      proxy: null
      retries: 0
      timeToLive: "PT1H"
      timeout: "PT3M"
      # Transport Layer Security, see below.
      tls: null
      userAgent: null
      validateAfterInactivityPeriod: "PT0S"

The tls branch of the configuration is for configuring Transport Layer Security (the successor to Secure Sockets Layer (SSL)). It is null by default, i.e. no additional TLS configuration is used. Its structure is:

proxyConfig:
  forwardHttpDestinations:
    httpClient:
      tls:
        protocol: "TLSv1.2"
        # The name of the JCE provider to use on the client side for cryptographic support
        # (for example, SunJCE, Conscrypt, BC, etc). See Oracle documentation for more information.
        provider:
        # The path of the key store file
        keyStorePath: null
        # The password of the key store file
        keyStorePassword: null
        # The type of key store (usually JKS, PKCS12, JCEKS, Windows-MY, or Windows-ROOT).
        keyStoreType: "JKS"
        keyStoreProvider: null
        # The path of the trust store file
        trustStorePath: null
        # The password of the trust store file
        trustStorePassword: null
        # The type of trust store (usually JKS, PKCS12, JCEKS, Windows-MY, or Windows-ROOT).
        trustStoreType: "JKS"
        trustStoreProvider: null
        trustSelfSignedCertificates: false
        verifyHostname: false
        # A list of protocols (e.g. SSLv3, TLSv1) which are supported.
        # All other protocols will be refused.
        supportedProtocols: null
        # A list of cipher suites (e.g., TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256) which are supported.
        # All other cipher suites will be refused.
        supportedCiphers: null
        certAlias: null

Log Stream Configuration

This controls the meta entries that will be included in the send and receive logs.

proxyConfig:
  logStream:
    # The headers attributes that will be output in the send/receive log lines.
    # They will be output in the order that they appear in this list.
    # Duplicates will be ignored, case does not matter.
    metaKeys:
      - "guid"
      - "receiptid"
      - "feed"
      - "system"
      - "environment"
      - "remotehost"
      - "remoteaddress"
      - "remotedn"
      - "remotecertexpiry"

Path Configuration

proxyConfig:
  path:
    # By default all files read or written to by stroom-proxy will be in directories relative to
    # the home location. Ideally this should differ from the location of the Stroom Proxy
    # installed software as it has a different lifecycle.
    # If not set the location of the Stroom-Proxy application JAR file will be used and if that
    # can't be determined, <user's home>/.stroom will be used.
    home: "...SET TO AN ABSOLUTE PATH..."
    # The location for Stroom-Proxy's persisted data
    data: "data"
    # The location for any temporary files/directories.
    # If not set, will use a sub-directory called 'stroom-proxy' in the system temp dir,
    # i.e. as defined by 'java.io.tmpdir'.
    temp: null

All paths in the configuration file can be either relative or absolute. If relative then they will be treated as being relative to the home path.
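The relative/absolute resolution rule amounts to the following (a minimal sketch of the stated behaviour, assuming POSIX-style paths):

```python
from pathlib import Path

def resolve(home, path):
    """Relative paths are resolved against the home path; absolute paths are used as-is."""
    p = Path(path)
    return p if p.is_absolute() else Path(home) / p

print(resolve("/stroomdata/stroom-proxy/home", "data"))      # home-relative data dir
print(resolve("/stroomdata/stroom-proxy/home", "/var/tmp"))  # absolute, used unchanged
```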

Receipt Policy Configuration

This section of configuration is only applicable if proxyConfig.receive.receiptCheckMode is RECEIPT_POLICY. It controls the fetching of the receipt policy rules from a downstream Stroom or Stroom-Proxy.

proxyConfig:
  receiptPolicy:
    # Only set if using a non-standard URL, else this is derived based on downstreamHost
    # config.
    receiveDataRulesUrl: null
    # The duration between calls to fetch the latest policy rules.
    syncFrequency: "PT1M"

The configuration of the client certificates for receipt policy checks is done using the DOWNSTREAM jersey client configuration. See Stroom and Stroom-Proxy Common Configuration.

Receive Configuration

The receive configuration is common to both Stroom and Stroom-Proxy, see Receive Configuration.

Security Configuration

proxyConfig:
  security:
    authentication:
      # This property is currently not used
      authenticationRequired: true
      # Open ID Connect configuration
      openId:

The openId branch of the config is common to both Stroom and Stroom-Proxy, see Open ID Configuration for details.

Amazon Simple Queue Service Configuration

Stroom-Proxy is able to consume messages from multiple AWS SQS queues. Each message received from a queue will be added to the Event Store for aggregation by Feed and Stream Type.

proxyConfig:
  # Zero to many connectors
  sqsConnectors:
    # This property is not currently used
  - awsProfileName: null
    # The name of the AWS region the SQS queue exists in.
    awsRegionName: "...AWS REGION..."
    # The maximum time to wait when polling the queue for messages
    pollFrequency: "PT10S"
    # This property is not currently used
    queueName: null
    # The URL of the Amazon SQS queue from which messages are received.
    queueUrl: "...SQS QUEUE URL..."

Thread Configuration

Stroom-Proxy is able to run certain operations in parallel. This configuration allows you to increase the number of threads used for each operation.

proxyConfig:
  threads:
    # Number of threads to consume from the aggregate input queue.
    aggregateInputQueueThreadCount: 1
    # Number of threads to consume from the forwarding input queue. 
    forwardingInputQueueThreadCount: 1
    # Number of threads to consume from the pre-aggregate input queue.
    preAggregateInputQueueThreadCount: 1
    # Number of threads to consume from the zip splitting input queue.
    zipSplittingInputQueueThreadCount: 1

Deploying without Docker

Apart from the structure of the config.yml file, the configuration in a non-docker environment is the same as for stroom.

As part of a docker stack

The way Stroom-Proxy is configured is essentially the same as for stroom, with the only real difference being the structure of the config.yml file, as noted above. As with stroom, the docker stack comes with a ./volumes/stroom-proxy-*/config/config.yml file that will be used in the absence of a provided one. Also as with stroom, the config.yml file supports environment variable substitution so it can make use of environment variables set in the stack .env file and passed down via the docker-compose YAML files.

Certificates

Stroom-proxy makes use of client certificates for two purposes:

  • Communicating with a downstream stroom/stroom-proxy in order to establish the receipt status for the feeds it has received data for.
  • When forwarding data to a downstream stroom/stroom-proxy.

The stack comes with the following files that can be used for demo/test purposes.

volumes/stroom-proxy-*/certs/ca.jks
volumes/stroom-proxy-*/certs/client.jks

For a production deployment these will need to be replaced with the certificates that are appropriate for your environment.

Typical Configuration

The following are a guide to typical configurations for operating Stroom-Proxy in different use cases.

Store and Forward

This is a typical case where you want to aggregate received data then forward it to a downstream Stroom or Stroom-Proxy, but also retain a store of the aggregates.

server:
  applicationContextPath: /
  adminContextPath: /proxyAdmin
  applicationConnectors:
    - type: http
      port: "8090"
      useForwardedHeaders: true
  adminConnectors:
    - type: http
      port: "8091"
      useForwardedHeaders: true
  detailedJsonProcessingExceptionMapper: true
  requestLog:
    appenders:
      # Log appender for the web server request logging
    - type: file
      currentLogFilename: logs/access/access.log
      discardingThreshold: 0
      # Rolled and gzipped every minute
      archivedLogFilenamePattern: logs/access/access-%d{yyyy-MM-dd'T'HH:mm}.log.gz
      # One week using minute files
      archivedFileCount: 10080
      logFormat: '%h %l "%u" [%t] "%r" %s %b "%i{Referer}" "%i{User-Agent}" %D'

logging:
  level: WARN
  loggers:
    # Logs useful information about stroom proxy. Only set DEBUG on specific 'stroom' classes or packages
    # due to the large volume of logs that would be produced for all of 'stroom' in DEBUG.
    stroom: INFO
    # Logs useful information about dropwizard when booting stroom
    io.dropwizard: INFO
    # Logs useful information about the jetty server when booting stroom
    # Set this to INFO if you want to log all REST request/responses with headers/payloads.
    org.glassfish.jersey.logging.LoggingFeature: OFF

    # Logger and appender for proxy receipt audit logs
    "receive":
      level: INFO
      additive: false
      appenders:
      - type: file
        currentLogFilename: logs/receive/receive.log
        discardingThreshold: 0
        # Rolled and gzipped every minute
        archivedLogFilenamePattern: logs/receive/receive-%d{yyyy-MM-dd'T'HH:mm}.log.gz
        # One week using minute files
        archivedFileCount: 10080
        logFormat: "%-6level [%d{yyyy-MM-dd'T'HH:mm:ss.SSS'Z'}] [%t] %logger - %X{code} %msg %n"

    # Logger and appender for proxy send audit logs
    "send":
      level: INFO
      additive: false
      appenders:
      - type: file
        currentLogFilename: logs/send/send.log
        discardingThreshold: 0
        # Rolled and gzipped every minute
        archivedLogFilenamePattern: logs/send/send-%d{yyyy-MM-dd'T'HH:mm}.log.gz
        # One week using minute files
        archivedFileCount: 10080
        logFormat: "%-6level [%d{yyyy-MM-dd'T'HH:mm:ss.SSS'Z'}] [%t] %logger - %X{code} %msg %n"

  appenders:

    # Log to stdout, use this if running in Docker
  - type: console
    # Multi-coloured log format for console output
    logFormat: "%highlight(%-6level) [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%green(%t)] %cyan(%logger) - %X{code} %msg %n"
    timeZone: UTC

    # Minute rolled files for stroom/datafeed, will be curl'd/deleted by stroom-log-sender
  - type: file
    currentLogFilename: logs/app/app.log
    discardingThreshold: 0
    archivedLogFilenamePattern: logs/app/app-%d{yyyy-MM-dd'T'HH:mm}.log.gz
    # One week using minute files
    archivedFileCount: 10080
    logFormat: "%-6level [%d{\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",UTC}] [%t] %logger - %X{code} %msg %n"

# This section contains the Stroom Proxy configuration properties
# For more information see:
# https://gchq.github.io/stroom-docs/user-guide/properties.html
# jerseyClients are used for making feed status and content sync REST calls
jerseyClients:
  default:
    tls:
      keyStorePath: "certs/client.jks"
      keyStorePassword: "password"
      trustStorePath: "certs/ca.jks"
      trustStorePassword: "password"

proxyConfig:
  path:
    # By default all files read or written to by stroom-proxy will be in directories relative to
    # the home location. This must be set to an absolute path and to one that differs from
    # the location of the installed software, as it has a different lifecycle.
    home: "/stroomdata/stroom-proxy/home"
  # This is the downstream (in datafeed flow terms) stroom/stroom-proxy used for
  # feed status checks, supplying data receipt rules and verifying API keys.
  downstreamHost:
    scheme: "https"
    port: "443"
    hostname: "stroom.some.domain"
    apiKey: "...API KEY..."

  aggregator:
    maxItemsPerAggregate: 1000
    maxUncompressedByteSize: "1G"
    aggregationFrequency: 10m

  forwardFileDestinations:
  - name: "archive-repo"
    path: "/stroomdata/stroom-proxy/archive-repo"
    subPathTemplate:
      pathTemplate: "${year}/${year}-${month}/${year}-${month}-${day}/${year}-${month}-${day}-${feed}/"

  forwardHttpDestinations:
  - name: "downstream-stroom"
    httpClient:
      tls:
        keyStorePath: "certs/client.jks"
        keyStorePassword: "password"
        trustStorePath: "certs/ca.jks"
        trustStorePassword: "password"

  receive:
    receiptCheckMode: "RECEIPT_POLICY"

Air-Gapped Store Only

This is an example of a Stroom-Proxy instance that is hosted in an environment where it has no direct link to a downstream Stroom/Stroom-Proxy. All data is aggregated and forwarded to the local file system for transport downstream by other means, outside the scope of this documentation.

server:
  # ... Same as configuration above

logging:
  # ... Same as configuration above

jerseyClients:
  # ... Same as configuration above

proxyConfig:
  path:
    # By default all files read or written to by stroom-proxy will be in directories relative to
    # the home location. This must be set to an absolute path and to one that differs from
    # the location of the installed software, as it has a different lifecycle.
    home: "/stroomdata/stroom-proxy/home"

  # No downstreamHost due to air-gap
  downstreamHost:
    enabled: false

  aggregator:
    maxItemsPerAggregate: 1000
    maxUncompressedByteSize: "1G"
    aggregationFrequency: 10m

  forwardFileDestinations:

  # Repo for a local archive
  - name: "archive-repo"
    path: "/stroomdata/stroom-proxy/archive-repo"
    subPathTemplate:
      pathTemplate: "${year}/${year}-${month}/${year}-${month}-${day}/${year}-${month}-${day}-${feed}/"

  # Repo to be transported downstream around air-gap
  - name: "downstream-repo"
    path: "/stroomdata/stroom-proxy/downstream-repo"
    subPathTemplate:
      pathTemplate: "${year}/${year}-${month}/${year}-${month}-${day}/${year}-${month}-${day}-${feed}/"

  forwardHttpDestinations: []

  receive:
    # No receipt checking due to air-gap. All data accepted.
    receiptCheckMode: "RECEIVE_ALL"

2.2 - Nginx Configuration

Configuring Nginx for use with Stroom and Stroom Proxy.

Nginx is the standard web server used by stroom. Its primary role is SSL termination and reverse proxying for stroom and stroom-proxy that sit behind it. It can also load balance incoming requests and ensure traffic from the same source is always routed to the same upstream instance. Other web servers can be used if required but their installation/configuration is out of the scope of this documentation.

Without Docker

The standard way of deploying Nginx with stroom running without docker involves running Nginx as part of the services stack. See below for details of how to configure it. If you want to deploy Nginx without docker then you can but that is outside the scope of this documentation.

As part of a docker stack

Nginx is included in all the stroom docker stacks. Nginx is configured using multiple configuration files to aid clarity and allow reuse of sections of configuration. The main file for configuring Nginx is nginx.conf.template and this makes use of other files via include statements.

The purpose of the various files is as follows:

  • nginx.conf.template - Top level configuration file that orchestrates the other files.
  • logging.conf.template - Configures the logging output, its content and format.
  • server.conf.template - Configures things like SSL settings, timeouts, ports, buffering, etc.
  • Upstream configuration
    • upstreams.stroom.ui.conf.template - Defines the upstream host(s) for stroom node(s) that are dedicated to serving the user interface.
    • upstreams.stroom.processing.conf.template - Defines the upstream host(s) for stroom node(s) that are dedicated to stream processing and direct data receipt.
    • upstreams.proxy.conf.template - Defines the upstream host(s) for local stroom-proxy node(s).
  • Location configuration
    • locations_defaults.conf.template - Defines some default directives (e.g. headers) for configuring stroom paths.
    • proxy_location_defaults.conf.template - Defines some default directives (e.g. headers) for configuring stroom-proxy paths.
    • locations.proxy.conf.template - Defines the various paths (e.g. /datafeed) that will be reverse proxied to stroom-proxy hosts.
    • locations.stroom.conf.template - Defines the various paths (e.g. /datafeeddirect) that will be reverse proxied to stroom hosts.

Templating

The nginx container has been configured to support using environment variables passed into it to set values in the Nginx configuration files. It should be noted that recent versions of Nginx have templating support built in. The templating mechanism used in stroom’s Nginx container was set up before this existed but achieves the same result.

All non-default configuration files for Nginx should be placed in volumes/nginx/conf/ and named with the suffix .template (even if no templating is needed). When the container starts any variables in these templates will be substituted and the resulting files will be copied into /etc/nginx. The result of the template substitution is logged to help with debugging.

The files can contain templating of the form:

ssl_certificate             /stroom-nginx/certs/<<<NGINX_SSL_CERTIFICATE>>>;

In this example <<<NGINX_SSL_CERTIFICATE>>> will be replaced with the value of environment variable NGINX_SSL_CERTIFICATE when the container starts.
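The substitution mechanism amounts to replacing each <<<VAR>>> tag with the value of the environment variable of that name. A minimal sketch of the idea (illustrative only; the container's actual substitution script may differ, e.g. in how it treats unset variables):

```python
import os
import re

def render_template(text, env=os.environ):
    """Replace <<<VAR>>> tags with the value of the named environment variable.

    Unset variables are left as-is here; the real script may behave differently.
    """
    return re.sub(r"<<<(\w+)>>>", lambda m: env.get(m.group(1), m.group(0)), text)

line = "ssl_certificate /stroom-nginx/certs/<<<NGINX_SSL_CERTIFICATE>>>;"
print(render_template(line, {"NGINX_SSL_CERTIFICATE": "server.pem.crt"}))
```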

Upstreams

When configuring a multi node cluster you will need to configure the upstream hosts. Nginx acts as a reverse proxy for the applications behind it so the lists of hosts for each application need to be configured.

For example if you have a 10 node cluster and 2 of those nodes are dedicated for user interface use then the configuration would look like:

upstreams.stroom.ui.conf.template

server node1.stroomhosts:<<<STROOM_PORT>>>
server node2.stroomhosts:<<<STROOM_PORT>>>

upstreams.stroom.processing.conf.template

server node3.stroomhosts:<<<STROOM_PORT>>>
server node4.stroomhosts:<<<STROOM_PORT>>>
server node5.stroomhosts:<<<STROOM_PORT>>>
server node6.stroomhosts:<<<STROOM_PORT>>>
server node7.stroomhosts:<<<STROOM_PORT>>>
server node8.stroomhosts:<<<STROOM_PORT>>>
server node9.stroomhosts:<<<STROOM_PORT>>>
server node10.stroomhosts:<<<STROOM_PORT>>>

upstreams.proxy.conf.template

server node3.stroomhosts:<<<STROOM_PORT>>>
server node4.stroomhosts:<<<STROOM_PORT>>>
server node5.stroomhosts:<<<STROOM_PORT>>>
server node6.stroomhosts:<<<STROOM_PORT>>>
server node7.stroomhosts:<<<STROOM_PORT>>>
server node8.stroomhosts:<<<STROOM_PORT>>>
server node9.stroomhosts:<<<STROOM_PORT>>>
server node10.stroomhosts:<<<STROOM_PORT>>>

In the above example the port is set using templating as it is the same for all nodes. Nodes 1 and 2 will receive all UI and REST API traffic. Nodes 3-10 will serve all datafeed(direct) requests.

Certificates

The stack comes with a default server certificate/key and CA certificate for demo/test purposes. The files are located in volumes/nginx/certs/. For a production deployment these will need to be changed, see Certificates.

Log rotation

The Nginx container makes use of logrotate to rotate Nginx’s log files after a period of time so that rotated logs can be sent to stroom. Logrotate is configured using the file volumes/stroom-log-sender/logrotate.conf.template. This file is templated in the same way as the Nginx configuration files, see above. The number of rotated files that should be kept before deleting them can be controlled using the line:

rotate 100

This should be set in conjunction with the frequency that logrotate is called, which is controlled by volumes/stroom-log-sender/crontab.txt. This crontab file drives the logrotate process and by default is set to run every minute.

2.3 - Stroom Log Sender Configuration

Stroom log sender is a docker image used for sending application logs to stroom. It is essentially just a combination of the send_to_stroom.sh script and a set of crontab entries to call the script at intervals.

Deploying without Docker

When deploying without docker stroom and stroom-proxy nodes will need to be configured to send their logs to stroom. This can be done using the ./bin/send_to_stroom.sh script in the stroom and stroom-proxy zip distributions and some crontab configuration.

The crontab file for the user account running stroom should be edited (crontab -e) and set to something like:

# stroom logs
* * * * * STROOM_HOME=<path to stroom home> ${STROOM_HOME}/bin/send_to_stroom.sh ${STROOM_HOME}/logs/access STROOM-ACCESS-EVENTS <datafeed URL> --system STROOM --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1
* * * * * STROOM_HOME=<path to stroom home> ${STROOM_HOME}/bin/send_to_stroom.sh ${STROOM_HOME}/logs/app    STROOM-APP-EVENTS    <datafeed URL> --system STROOM --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1
* * * * * STROOM_HOME=<path to stroom home> ${STROOM_HOME}/bin/send_to_stroom.sh ${STROOM_HOME}/logs/user   STROOM-USER-EVENTS   <datafeed URL> --system STROOM --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1

# stroom-proxy logs
* * * * * PROXY_HOME=<path to proxy home> ${PROXY_HOME}/bin/send_to_stroom.sh ${PROXY_HOME}/logs/access  STROOM_PROXY-ACCESS-EVENTS  <datafeed URL> --system STROOM-PROXY --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1
* * * * * PROXY_HOME=<path to proxy home> ${PROXY_HOME}/bin/send_to_stroom.sh ${PROXY_HOME}/logs/app     STROOM_PROXY-APP-EVENTS     <datafeed URL> --system STROOM-PROXY --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1
* * * * * PROXY_HOME=<path to proxy home> ${PROXY_HOME}/bin/send_to_stroom.sh ${PROXY_HOME}/logs/send    STROOM_PROXY-SEND-EVENTS    <datafeed URL> --system STROOM-PROXY --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1
* * * * * PROXY_HOME=<path to proxy home> ${PROXY_HOME}/bin/send_to_stroom.sh ${PROXY_HOME}/logs/receive STROOM_PROXY-RECEIVE-EVENTS <datafeed URL> --system STROOM-PROXY --environment <environment> --file-regex '.*/[a-z]+-[0-9]{4}-[0-9]{2}-[0-9]{2}T.*\\.log' --max-sleep 10 --key <key file> --cert <cert file> --cacert <CA cert file> --delete-after-sending --compress >> <path to log> 2>&1

where the environment specific values are:

  • <path to stroom home> - The absolute path to the stroom home, i.e. the location of the start.sh script.
  • <path to proxy home> - The absolute path to the stroom-proxy home, i.e. the location of the start.sh script.
  • <datafeed URL> - The URL that the logs will be sent to. This will typically be the nginx host or load balancer and the path will typically be https://host/datafeeddirect to bypass the proxy for faster access to the logs.
  • <environment> - The environment name that the stroom/proxy is deployed in, e.g. OPS, REF, DEV, etc.
  • <key file> - The absolute path to the SSL key file used by curl.
  • <cert file> - The absolute path to the SSL certificate file used by curl.
  • <CA cert file> - The absolute path to the SSL certificate authority file used by curl.
  • <path to log> - The absolute path to a log file to log all the send_to_stroom.sh output to.

If your implementation of cron supports environment variables then you can define some of the common values at the top of the crontab file and use them in the entries. cronie as used by Centos does not support environment variables in the crontab file but variables can be defined at the line level as has been shown with STROOM_HOME and PROXY_HOME.

The above crontab entries assume that stroom and stroom-proxy are running on the same host. If they are not then the entries can be split across the hosts accordingly.

Service host(s)

When deploying stroom/stroom-proxy without docker you may still be deploying the services stack (nginx and stroom-log-sender) to a host. In this case see As part of a docker stack below for details of how to configure stroom-log-sender to send the nginx logs.

As part of a docker stack

Crontab

The docker stacks include the stroom-log-sender docker image for sending the logs of all the other containers to stroom. Stroom-log-sender is configured using the crontab file volumes/stroom-log-sender/conf/crontab.txt. When the container starts this file will be read. Any variables in it will be substituted with the values from the corresponding environment variables that are present in the container. These common values can be set in the config/<stack name>.env file.

As the variables are substituted on container start you will need to restart the container following any configuration change.

Certificates

The directory volumes/stroom-log-sender/certs contains the default client certificates used for the stack. These allow stroom-log-sender to send the log files over SSL which also provides stroom with details of the sender. These will need to be replaced in a production environment.

volumes/stroom-log-sender/certs/ca.pem.crt
volumes/stroom-log-sender/certs/client.pem.crt
volumes/stroom-log-sender/certs/client.unencrypted.key

For a production deployment these will need to be changed, see Certificates.

2.4 - MySQL Configuration

Configuring MySQL for use with Stroom.

General configuration

MySQL is configured via the .cnf file which is typically located in one of these locations:

  • /etc/my.cnf
  • /etc/mysql/my.cnf
  • $MYSQL_HOME/my.cnf
  • <data dir>/my.cnf
  • ~/.my.cnf

Key configuration properties

  • lower_case_table_names - This property controls how the tables are stored on the filesystem and the case-sensitivity of table names in SQL. A value of 0 means tables are stored on the filesystem in the case used in CREATE TABLE and SQL is case-sensitive. This is the default on Linux and is the preferred value for deployments of Stroom v7+. A value of 1 means tables are stored on the filesystem in lowercase but SQL is case-insensitive. See also Identifier Case Sensitivity

  • max_connections - The maximum permitted number of simultaneous client connections. For a clustered deployment of stroom, the default value of 151 will typically be too low. Each stroom node will hold a pool of open database connections for its use, therefore with a large number of stroom nodes and a big connection pool the total number of connections can be very large. This property should be set taking into account the values of the stroom properties of the form *.db.connectionPool.maxPoolSize. See also Connection Interfaces

  • innodb_buffer_pool_size/innodb_buffer_pool_instances - Controls the amount of memory available to MySQL for caching table/index data. Typically this will be set to 80% of available RAM, assuming MySQL is running on a dedicated host and the total amount of table/index data is greater than 80% of available RAM. Note: innodb_buffer_pool_size must be set to a value that is equal to or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances. See also Configuring InnoDB Buffer Pool Size
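The buffer pool sizing rule can be made concrete with a small calculation. The sketch below assumes MySQL's usual defaults of innodb_buffer_pool_chunk_size = 128M and innodb_buffer_pool_instances = 8, giving a 1 GiB unit; it rounds 80% of RAM down to a multiple of that unit.

```python
def buffer_pool_size(total_ram_bytes, chunk_bytes=128 * 2**20, instances=8, fraction=0.8):
    """Round fraction-of-RAM down to a multiple of chunk_size * instances."""
    unit = chunk_bytes * instances          # 1 GiB with the assumed defaults
    target = int(total_ram_bytes * fraction)
    return max(unit, (target // unit) * unit)

gib = 2**30
# e.g. a dedicated 64 GiB host: 80% is 51.2 GiB, rounded down to 51 GiB
print(buffer_pool_size(64 * gib) / gib)  # 51.0 -> innodb_buffer_pool_size=51G
```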

Deploying without Docker

When MySQL is deployed without a docker stack then MySQL should be installed and configured according to the MySQL documentation. How MySQL is deployed and configured will depend on the requirements of the environment, e.g. clustered, primary/standby, etc.

As part of a docker stack

Where a stroom docker stack includes stroom-all-dbs (MySQL) the MySQL instance is configured via the .cnf file. The .cnf file is located in volumes/stroom-all-dbs/conf/stroom-all-dbs.cnf. This file is read-only to the container and will be read on container start.

Database initialisation

When the container is started for the first time the database will be initialised with the root user account. It will also then run any scripts found in volumes/stroom-all-dbs/init/stroom. The scripts in here will be run in alphabetical order. Scripts of the form .sh, .sql, .sql.gz and .sql.template are supported.

.sql.template files are proprietary to stroom stacks and are just templated .sql files. They can contain tags of the form <<<ENV_VAR_NAME>>> which will be replaced with the value of the named environment variable that has been set in the container.

If you need to add additional database users then either add them to volumes/stroom-all-dbs/init/stroom/001_create_databases.sql.template or create additional scripts/templates in that directory.
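The effect of the templating can be illustrated with a hypothetical template file. The file name, user and variable names below are invented for this example; the real substitution is performed by the init script, but it is equivalent to a sed replacement like this:

```shell
# Hypothetical example of the <<<ENV_VAR_NAME>>> templating used by .sql.template
# files. EXTRA_DB_USER/EXTRA_DB_PASSWORD are made-up names for this illustration.
export EXTRA_DB_USER="reporting"
export EXTRA_DB_PASSWORD="changeme"

# A template containing <<<...>>> tags (quoted heredoc so nothing expands yet).
cat > /tmp/002_create_extra_user.sql.template << 'EOF'
CREATE USER '<<<EXTRA_DB_USER>>>'@'%' IDENTIFIED BY '<<<EXTRA_DB_PASSWORD>>>';
GRANT ALL PRIVILEGES ON stroom.* TO '<<<EXTRA_DB_USER>>>'@'%';
EOF

# Replace each <<<VAR>>> tag with the value of the named environment variable.
sed -e "s/<<<EXTRA_DB_USER>>>/${EXTRA_DB_USER}/g" \
    -e "s/<<<EXTRA_DB_PASSWORD>>>/${EXTRA_DB_PASSWORD}/g" \
    /tmp/002_create_extra_user.sql.template > /tmp/002_create_extra_user.sql

cat /tmp/002_create_extra_user.sql
```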

The script that controls this templating is volumes/stroom-all-dbs/init/000_stroom_init.sh. This script MUST NOT have its executable bit set, otherwise it will be executed rather than sourced by the MySQL entry point scripts and the templating will not work.

3 - Installing in an Air Gapped Environment

How to install Stroom when there is no internet connection.

Docker images

For those deployments of Stroom that use docker containers, by default docker will try to pull the docker images from DockerHub on the internet. If you do not have an internet connection then you will need to make these images available to the local docker binary in another way.

Downloading the images

Firstly you need to determine which images and tags you need. Look at stroom-resources/releases; for each release and variant of the Stroom stacks there is a manifest of the docker images/tags in that release/variant. For example, for stroom-stacks-v7.0-beta.175 and stack variant stroom_core the list of images is:

nginx gchq/stroom-nginx:v7.0-beta.2
stroom gchq/stroom:v7.0-beta.175
stroom-all-dbs mysql:8.0.23
stroom-log-sender gchq/stroom-log-sender:v2.2.0
stroom-proxy-local gchq/stroom-proxy:v7.0-beta.175

With the docker binary

If you have access to an internet connected computer that has Docker installed on it then you can use Docker to pull the images. For each of the required images run a command like this:

docker pull gchq/stroom-nginx:v7.0-beta.2
docker save -o stroom-nginx.tar gchq/stroom-nginx:v7.0-beta.2
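Rather than typing the commands per image, a small loop over a local copy of the manifest can generate them all. This sketch assumes the manifest above has been saved to a file called images.txt (both file names are arbitrary choices for this example); it only echoes the commands into a script so you can review them before running them on the internet-connected machine:

```shell
# Assumes the image manifest has been saved as images.txt
# (first column: service name, second column: image:tag).
cat > images.txt << 'EOF'
nginx gchq/stroom-nginx:v7.0-beta.2
stroom gchq/stroom:v7.0-beta.175
stroom-all-dbs mysql:8.0.23
stroom-log-sender gchq/stroom-log-sender:v2.2.0
stroom-proxy-local gchq/stroom-proxy:v7.0-beta.175
EOF

# Generate one pull and one save command per image, naming each tar
# file after the service. Review pull_and_save.sh then run it with bash.
while read -r service image; do
  echo docker pull "${image}"
  echo docker save -o "${service}.tar" "${image}"
done < images.txt > pull_and_save.sh

cat pull_and_save.sh
```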

Without the docker binary

If you can’t install Docker on the internet connected machine then this shell script may help you to download and assemble the various layers of an image from DockerHub using only bash, curl and jq. This is a third party script so we cannot vouch for it in any way. As with all scripts you run that you find on the internet, look at and understand what they do before running them.

Loading the images

Once you have downloaded the image tar files and transferred them over the air gap you will need to load them into your local docker repo. Either this will be the local repo on the machine where you will deploy Stroom (or one of its component containers) or you will have a central docker repository that many machines can access. Managing a central air-gapped repository is beyond the scope of this documentation.

To load the images into your local repository use a command similar to this for each of the .tar files that you created using docker save above:

docker load --input stroom-nginx.tar

You can check the images are available using:

docker image ls

4 - Upgrades

4.1 - Minor Upgrades and Patches

How to upgrade to a new minor or patch release.

Stroom versioning follows Semantic Versioning.

Given a version number MAJOR.MINOR.PATCH:

  • MAJOR is incremented when there are major or breaking changes.
  • MINOR is incremented when functionality is added in a backwards compatible manner.
  • PATCH is incremented when bugs are fixed.

Stroom is designed to detect the version of the existing database schema and to run any migrations necessary to bring it up to the version being deployed. This means you can jump from, say, 7.0.0 => 7.2.0 or from 7.0.0 => 7.0.5.

This document covers minor and patch upgrades only.

Docker stack deployments

Non-docker deployments

Major version upgrades

The following sections contain notes specific to major version upgrades.

4.2 - Upgrade from v5 to v7

This document describes the process for upgrading Stroom from v5.x to v7.x.

Warning

Before commencing an upgrade to v7 you must upgrade Stroom to the latest minor and patch version of v5.
At the time of writing the latest version of v5 is v5.5.16.

Differences between v5 and v7

Stroom v7 has significant differences to v5 which make the upgrade process a little more complicated.

  • v5 handled authentication within the application. In v7 authentication is handled either internally in stroom (the default) or by an external identity provider such as Google or AWS Cognito.
  • v5 used the ~setup.xml, ~env.sh and stroom.properties files for configuration. In v7 stroom uses a config.yml file for its configuration (see Properties).
  • v5 used upper case and heavily abbreviated names for its tables. In v7 clearer, lower case table names are used. As a result ALL v5 tables are renamed with the prefix OLD_, the new tables are created and any content is copied over. As the database will be holding two copies of most data, you need to ensure you have space to accommodate it.

Pre-Upgrade tasks

Stroom can be upgraded straight from v5 to v7 without going via v6. There are however a few pre-migration steps that need to be followed.

Upgrade Stroom to the latest v5 version

Follow your standard process for performing a minor upgrade to bring your v5 Stroom instance up to the latest v5 version. This ensures all v5 migrations have been applied before the v6 and v7 migrations are run.

Download migration scripts

Download the migration SQL scripts from https://github.com/gchq/stroom/blob/STROOM_VERSION/scripts e.g. https://github.com/gchq/stroom/blob/v7.0-beta.198/scripts

Some of these scripts will be used in the steps below. The unused scripts are not applicable to a v5=>v7 upgrade.

Pre-migration database checks

Run the pre-migration checks script on the running database.

mysql --force --table -u"stroomuser" -p"stroompassword1" stroom \
< v7_db_pre_migration_checks.sql \
> v7_db_pre_migration_checks.out \
2>&1

This will produce a report of items that will not be migrated or need attention before migration.

Capture non-default Stroom properties

Run the following script to capture the non-default system properties that are held in the database. This is a precaution in case they are needed following migration.

mysql --force --table -u"stroomuser" -p"stroompassword1" stroom \
< v5_list_properties.sql \
> v5_list_properties.out \
2>&1

Stop processing

Before shutting stroom down it is wise to turn off stream processing and let all outstanding server tasks complete.

TODO clarify steps for this.

Stop Stroom

Stop stroom but leave the database running so that the pre-upgrade database tasks can be performed. This ensures that stroom is not trying to access the database.

./stop.sh

Backup the databases

Backup all the databases for the different components. Typically these will be stroom and stats (or statistics).

Stop the database

Stop the database instance.

./stop.sh

Deploy v7

Deploy the latest version of Stroom but don’t start it.

TODO - more detail

Migrate the v5 configuration into v7

The configuration properties held in the database and accessed via the Properties UI screen will be migrated automatically by Stroom where possible.

Stroom v5 and v7 however are configured differently when it comes to the configuration files used to bootstrap the application, such as the database connection details. These properties will need to be manually migrated from the v5 instance to the v7 instance. The configuration to bootstrap Stroom v5 can be found in instance/lib/stroom.properties. The configuration for v7 can be found in the following places:

  • Zip distribution - config/config.yml.
  • Docker stack - volumes/stroom/config/config.yml. Note that this file uses variable substitution, so values can be set via environment variables in config/<stack_name>.env.

The following table shows the key configuration properties that need to be set to start the application and how they map between v5 and v7.

V5 property V7 property Notes
stroom.temp appConfig.path.temp Set this if different from $TEMP env var.
- appConfig.path.home By default all local state (e.g. reference data stores, search results) will live under this directory. Typically it should be in a different location to the stroom instance as it has a different lifecycle.
stroom.node appConfig.node.name
- appConfig.nodeUrl.hostname Set this to the FQDN of the node so other nodes can communicate with it.
- appConfig.publicUrl.hostname Set this to the public FQDN of Stroom, typically a load balancer or Nginx instance.
stroom.jdbcDriverClassName appConfig.commonDbDetails.connection.jdbcDriverClassName Do not set this. Will get defaulted to com.mysql.cj.jdbc.Driver
stroom.jdbcDriverUrl appConfig.commonDbDetails.connection.jdbcDriverUrl
stroom.jdbcDriverUsername appConfig.commonDbDetails.connection.jdbcDriverUsername
stroom.jdbcDriverPassword appConfig.commonDbDetails.connection.jdbcDriverPassword
stroom.jpaDialect -
stroom.statistics.sql.jdbcDriverClassName appConfig.commonDbDetails.connection.jdbcDriverClassName Do not set this. Will get defaulted to com.mysql.cj.jdbc.Driver
stroom.statistics.sql.jdbcDriverUrl appConfig.statistics.sql.db.connection.jdbcDriverUrl
stroom.statistics.sql.jdbcDriverUsername appConfig.statistics.sql.db.connection.jdbcDriverUsername
stroom.statistics.sql.jdbcDriverPassword appConfig.statistics.sql.db.connection.jdbcDriverPassword
stroom.statistics.common.statisticEngines appConfig.statistics.internal.enabledStoreTypes Do not set this. Will get defaulted to StatisticStore
- appConfig.ui.helpUrl Set this to the URL of your locally published stroom-docs site.
stroom.contentPackImportEnabled appConfig.contentPackImport.enabled

Some v5 properties, such as connection pool settings, cannot be migrated to v7 equivalents. It is recommended to review the default values for the v7 appConfig.commonDbDetails.connectionPool.* and appConfig.statistics.sql.db.connectionPool.* properties to ensure they are suitable for your environment. If they are not, set them in the config.yml file. The defaults can be found in config-defaults.yml.
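Putting the mapping table together, a minimal bootstrap config.yml fragment might look like the following sketch. The hostnames, credentials and pool size are illustrative placeholders, not recommended values; check your own keys against config-defaults.yml:

```yaml
appConfig:
  node:
    name: "node1"                       # was stroom.node in v5
  nodeUrl:
    hostname: "node1.example.com"       # FQDN of this node
  publicUrl:
    hostname: "stroom.example.com"      # public FQDN (load balancer / Nginx)
  commonDbDetails:
    connection:
      jdbcDriverUrl: "jdbc:mysql://db.example.com:3306/stroom"
      jdbcDriverUsername: "stroomuser"
      jdbcDriverPassword: "stroompassword1"
    connectionPool:
      maxPoolSize: 20                   # review defaults in config-defaults.yml
  statistics:
    sql:
      db:
        connection:
          jdbcDriverUrl: "jdbc:mysql://db.example.com:3306/stats"
          jdbcDriverUsername: "statsuser"
          jdbcDriverPassword: "statspassword1"
```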

Upgrading the MySQL instance and database

Stroom v5 ran on MySQL v5.6. Stroom v7 runs on MySQL v8. The upgrade path for MySQL is 5.6 => 5.7.33 => 8.x (see Upgrade Paths).

To ensure the database is up to date, mysql_upgrade needs to be run using the 5.7.33 binaries, see the MySQL documentation.

This is the process for upgrading the database. The exact steps will depend on how you have installed MySQL.

  1. Shutdown the database instance.
  2. Remove the MySQL 5.6 binaries, e.g. using your package manager.
  3. Install the MySQL 5.7.33 binaries.
  4. Start the database instance using the 5.7.33 binaries.
  5. Run mysql_upgrade to upgrade the database to 5.7 specification.
  6. Shutdown the database instance.
  7. Remove the MySQL 5.7.33 binaries.
  8. Install the latest MySQL 8.0 binaries.
  9. Start the database instance. On start up MySQL 8 will detect a v5.7 instance and upgrade it to 8.0 spec automatically without the need to run mysql_upgrade.

Performing the Stroom upgrade

To perform the stroom schema upgrade to v7 run the migrate command (on a single node), which will migrate the database and then exit. For a large upgrade like this it is preferable to run the migrate command rather than just starting Stroom, as Stroom only migrates parts of the schema as it needs to use them, so some parts of the database may not be migrated initially. Running the migrate command ensures all parts of the migration are completed when the command is run, and no other parts of stroom are started.

./migrate.sh

Post-Upgrade tasks

TODO

4.3 - Upgrade from v6 to v7

This document describes the process for upgrading a Stroom single node docker stack from v6.x to v7.x.

Warning

Before commencing an upgrade to v7 you should upgrade Stroom to the latest minor and patch version of v6.

Differences between v6 and v7

Stroom v7 has significant differences to v6 which make the upgrade process a little more complicated.

  • v6 handled authentication using a separate application, stroom-auth-service, with its own database. In v7 authentication is handled either internally in stroom (the default) or by an external identity provider such as Google or AWS Cognito.
  • v6 used a stroom.conf file or environment variables for configuration. In v7 stroom uses a config.yml file for its configuration (see Properties).
  • v6 used upper case and heavily abbreviated names for its tables. In v7 clearer, lower case table names are used. As a result ALL v6 tables are renamed with the prefix OLD_, the new tables are created and any content is copied over. As the database will be holding two copies of most data, you need to ensure you have space to accommodate it.

Pre-Upgrade tasks

The following steps must be performed before migrating from v6 to v7.

Download migration scripts

Download the migration SQL scripts from https://github.com/gchq/stroom/blob/STROOM_VERSION/scripts e.g. https://github.com/gchq/stroom/blob/v7.0-beta.133/scripts

These scripts will be used in the steps below.

Pre-migration database checks

Run the pre-migration checks script on the running database.

docker exec \
-i \
stroom-all-dbs \
mysql --table -u"stroomuser" -p"stroompassword1" stroom \
< v7_db_pre_migration_checks.sql

This will produce a report of items that will not be migrated or need attention before migration.

Stop processing

Before shutting stroom down it is wise to turn off stream processing and let all outstanding server tasks complete.

TODO clarify steps for this.

Stop the stack

Stop the stack (stroom and the database) then start up the database. Do this using the v6 stack. This ensures that stroom is not trying to access the database.

./stop.sh
./start.sh stroom-all-dbs

Backup the databases

Backup all the databases for the different components. Typically these will be stroom, stats and auth.

If you are running in a docker stack then you can run the ./backup_databases.sh script.

Stop the database

Stop the database using the v6 stack.

./stop.sh

Deploy and configure v7

Deploy the v7 stack. TODO - more detail

Verify the database connection configuration for the stroom and stats databases. Ensure that there is NOT any configuration for a separate auth database, as its content will now be in stroom.
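As a sketch, the database connection section of the v7 config.yml should contain only the stroom and stats connections. The values below are placeholders based on the default stack credentials and container hostname; adjust them to your environment:

```yaml
appConfig:
  commonDbDetails:
    connection:
      jdbcDriverUrl: "jdbc:mysql://stroom-all-dbs:3306/stroom"
      jdbcDriverUsername: "stroomuser"
      jdbcDriverPassword: "stroompassword1"
  statistics:
    sql:
      db:
        connection:
          jdbcDriverUrl: "jdbc:mysql://stroom-all-dbs:3306/stats"
          jdbcDriverUsername: "statsuser"
          jdbcDriverPassword: "stroompassword1"
  # There must be NO connection details for a separate auth database -
  # the auth tables now live in the stroom database.
```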

Running mysql_upgrade

Stroom v6 ran on MySQL v5.6. Stroom v7 runs on MySQL v8. The upgrade path for MySQL is 5.6 => 5.7.33 => 8.x.

To ensure the database is up to date, mysql_upgrade needs to be run using the 5.7.33 binaries, see the MySQL documentation.

This is the process for upgrading the database. All of these commands are using the v7 stack.

# Set the version of the MySQL docker image to use
export MYSQL_TAG=5.7.33
(out)
# Start MySQL at v5.7, this will recreate the container
./start.sh stroom-all-dbs
(out)
# Run the upgrade from 5.6 => 5.7.33
docker exec -it stroom-all-dbs mysql_upgrade -u"root" -p"my-secret-pw"
(out)
# Stop MySQL
./stop.sh
(out)
# Unset the tag variable so that it now uses the default from the stack (8.x)
unset MYSQL_TAG
(out)
# Start MySQL at v8.x, this will recreate the container and run the upgrade from 5.7.33=>8
./start.sh stroom-all-dbs
(out)
./stop.sh

Rename legacy stroom-auth tables

Run this command to connect to the auth database and run the pre-migration SQL script.

docker exec \
-i \
stroom-all-dbs \
mysql --table -u"authuser" -p"stroompassword1" auth \
< v7_auth_db_table_rename.sql

This will rename all but one of the tables in the auth database.

Copy the auth database content to stroom

Having run the table rename, perform another backup of just the auth database.

./backup_databases.sh . auth

Now restore this backup into the stroom database. You can use the v7 stack scripts to do this.

./restore_database.sh stroom auth_20210312143513.sql.gz

You should now see the following tables in the stroom database:

OLD_AUTH_json_web_key
OLD_AUTH_schema_version
OLD_AUTH_token_types
OLD_AUTH_tokens
OLD_AUTH_users

This can be checked by running the following in the v7 stack.

echo 'select table_name from information_schema.tables where table_name like "OLD_AUTH%"' \
| ./database_shell.sh

Drop unused databases

There may be a number of databases that are no longer used which can be dropped prior to the upgrade. Note the use of the --force argument so that it copes with users that do not exist.

docker exec \
-i \
stroom-all-dbs \
mysql --force -u"root" -p"my-secret-pw" \
< v7_drop_unused_databases.sql

Verify it worked with:

echo 'show databases;' | docker exec -i stroom-all-dbs mysql -u"root" -p"my-secret-pw"

Performing the upgrade

To perform the stroom schema upgrade to v7 run the migrate command, which will migrate the database and then exit. For a large upgrade like this it is preferable to run the migrate command rather than just starting stroom, as stroom only migrates parts of the schema as it needs to use them. Running migrate ensures all parts of the migration are completed when the command is run, and no other parts of stroom are started.

./migrate.sh

Post-Upgrade tasks

TODO remove auth* containers,images,volumes

5 - Setup

5.1 - MySQL Setup

Prerequisites

  • MySQL 8.0.x server installed (e.g. yum install mysql-server)
  • Processing User Setup

A single MySQL database is required for each Stroom instance. You do not need to setup a MySQL instance per node in your cluster.

Check Database installed and running

/sbin/chkconfig --list mysqld
(out)mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off
mysql --user=root -p
(out)Enter password:
(out)Welcome to the MySQL monitor.  Commands end with ; or \g.
(out)...
quit

The following commands can be used to auto-start MySQL if required:

/sbin/chkconfig --level 345 mysqld on
/sbin/service mysqld start

Overview

MySQL configuration can range from simple to complex depending on your requirements.

For a very simple configuration you simply need an out-of-the-box MySQL install and a database user account.

Things get more complicated when considering:

  • Security
  • Replication
  • Tuning memory usage
  • Running Stroom Stats in a different database to Stroom
  • Performance Monitoring

Simple Install

Ensure the database is running, then create the database and grant access to it:

mysql --user=root
(out)Welcome to the MySQL monitor.  Commands end with ; or \g.
(out)...
create database stroom;
(out)Query OK, 1 row affected (0.02 sec)

create database stroom_stats;
(out)Query OK, 1 row affected (0.02 sec)

-- MySQL 8 requires the user to be created before privileges are granted
create user 'stroomuser'@'host' identified by 'password';
(out)Query OK, 0 rows affected (0.00 sec)

grant all privileges on stroom.* to 'stroomuser'@'host';
(out)Query OK, 0 rows affected (0.00 sec)

grant all privileges on stroom_stats.* to 'stroomuser'@'host';
(out)Query OK, 0 rows affected (0.00 sec)

flush privileges;
(out)Query OK, 0 rows affected (0.00 sec)

Advanced Security

It is recommended to run /usr/bin/mysql_secure_installation to remove the test database and test accounts.

./stroom-setup/mysql_grant.sh is a utility script that creates accounts for you to use within a cluster (or single node setup). Run to see the options:

./mysql_grant.sh
(out)usage : --name=<instance name (defaults to my for /etc/my.cnf)>
(out)        --user=<the stroom user for the db>
(out)        --password=<the stroom password for the db>
(out)        --cluster=<the file with a line per node in the cluster>
(out)--user=<db user> Must be set

N.B. name is used when multiple mysql instances are set up (see below).

You need to create a file cluster.txt with a line for each member of your cluster (or single line in the case of a one node Stroom install). Then run the utility script to lock down the server access.

hostname >> cluster.txt
./stroom-setup/mysql_grant.sh --name=mysql56_dev --user=stroomuser --password= --cluster=cluster.txt
(out)Enter root mysql password :
(out)--------------
(out)flush privileges
(out)--------------
(out)
(out)--------------
(out)delete from mysql.user where user = 'stroomuser'
(out)--------------
(out)...
(out)...
(out)...
(out)--------------
(out)flush privileges
(out)--------------

Advanced Install

The example below uses the utility scripts to create 3 custom mysql server instances across 2 servers:

  • server1 - stroom (source),
  • server2 - stroom (replica), stroom_stats

As root on server1:

yum install "mysql56-mysql-server"

Create the master database:

./stroom-setup/mysqld_instance.sh --name=mysqld56_stroom --port=3106 --server=mysqld56 --os=rhel6

(out)--master not set ... assuming master database
(out)Wrote base files in tmp (You need to move them as root).  cp /tmp/mysqld56_stroom /etc/init.d/mysqld56_stroom; cp /tmp/mysqld56_stroom.cnf /etc/mysqld56_stroom.cnf
(out)Run mysql client with mysql --defaults-file=/etc/mysqld56_stroom.cnf

cp /tmp/mysqld56_stroom /etc/init.d/mysqld56_stroom; cp /tmp/mysqld56_stroom.cnf /etc/mysqld56_stroom.cnf
/etc/init.d/mysqld56_stroom start

(out)Initializing MySQL database:  Installing MySQL system tables...
(out)OK
(out)Filling help tables...
(out)...
(out)...
(out)Starting mysql56-mysqld:                                   [  OK  ]

Check Start up Settings Correct

chkconfig mysqld off
chkconfig mysql56-mysqld off
chkconfig --add mysqld56_stroom
chkconfig mysqld56_stroom on

chkconfig --list | grep mysql
(out)mysql56-mysqld  0:off   1:off   2:off   3:off   4:off   5:off   6:off
(out)mysqld          0:off   1:off   2:off   3:off   4:off   5:off   6:off
(out)mysqld56_stroom    0:off   1:off   2:on    3:on    4:on    5:on    6:off
(out)mysqld56_stats  0:off   1:off   2:on    3:on    4:on    5:on    6:off

Create a text file with all members of the cluster:

vi cluster.txt

(out)node1.my.org
(out)node2.my.org
(out)node3.my.org
(out)node4.my.org

Create the grants:

./stroom-setup/mysql_grant.sh --name=mysqld56_stroom --user=stroomuser --password=password --cluster=cluster.txt

As root on server2:

yum install "mysql56-mysql-server"


./stroom-setup/mysqld_instance.sh --name=mysqld56_stroom --port=3106 --server=mysqld56 --os=rhel6 --master=node1.my.org --user=stroomuser --password=password

(out)--master set ... assuming slave database
(out)Wrote base files in tmp (You need to move them as root).  cp /tmp/mysqld56_stroom /etc/init.d/mysqld56_stroom; cp /tmp/mysqld56_stroom.cnf /etc/mysqld56_stroom.cnf
(out)Run mysql client with mysql --defaults-file=/etc/mysqld56_stroom.cnf

cp /tmp/mysqld56_stroom /etc/init.d/mysqld56_stroom; cp /tmp/mysqld56_stroom.cnf /etc/mysqld56_stroom.cnf
/etc/init.d/mysqld56_stroom start

(out)Initializing MySQL database:  Installing MySQL system tables...
(out)OK
(out)Filling help tables...
(out)...
(out)...
(out)Starting mysql56-mysqld:                                   [  OK  ]

Check Start up Settings Correct

chkconfig mysqld off
chkconfig mysql56-mysqld off
chkconfig --add mysqld56_stroom
chkconfig mysqld56_stroom on

chkconfig --list | grep mysql
(out)mysql56-mysqld  0:off   1:off   2:off   3:off   4:off   5:off   6:off
(out)mysqld          0:off   1:off   2:off   3:off   4:off   5:off   6:off
(out)mysqld56_stroom    0:off   1:off   2:on    3:on    4:on    5:on    6:off

Create the grants:

./stroom-setup/mysql_grant.sh --name=mysqld56_stroom --user=stroomuser --password=password --cluster=cluster.txt

Make the slave database start to follow:

cat /etc/mysqld56_stroom.cnf | grep "change master"
(out)# change master to MASTER_HOST='node1.my.org', MASTER_PORT=3106, MASTER_USER='stroomuser', MASTER_PASSWORD='password';

mysql --defaults-file=/etc/mysqld56_stroom.cnf
change master to MASTER_HOST='node1.my.org', MASTER_PORT=3106, MASTER_USER='stroomuser', MASTER_PASSWORD='password';
start slave;

As processing user on server1:

mysql --defaults-file=/etc/mysqld56_stroom.cnf --user=stroomuser --password=password
create database stroom;
(out)Query OK, 1 row affected (0.00 sec)

use stroom;
(out)Database changed

create table test (a int);
(out)Query OK, 0 rows affected (0.05 sec)

As processing user on server2 check server replicating OK:

mysql --defaults-file=/etc/mysqld56_stroom.cnf --user=stroomuser --password=password
show create table test;
(out)+-------+----------------------------------------------------------------------------------------+
(out)| Table | Create Table                                                                           |
(out)+-------+----------------------------------------------------------------------------------------+
(out)| test  | CREATE TABLE `test` (`a` int(11) DEFAULT NULL  ) ENGINE=InnoDB DEFAULT CHARSET=latin1  |
(out)+-------+----------------------------------------------------------------------------------------+
(out)1 row in set (0.00 sec)

As root on server2:

/home/stroomuser/stroom-setup/mysqld_instance.sh --name=mysqld56_stats --port=3206 --server=mysqld56 --os=rhel6 --user=statsuser --password=password
cp /tmp/mysqld56_stats /etc/init.d/mysqld56_stats; cp /tmp/mysqld56_stats.cnf /etc/mysqld56_stats.cnf
/etc/init.d/mysqld56_stats start
chkconfig mysqld56_stats on

Create the grants:

./stroom-setup/mysql_grant.sh --name=mysqld56_stats --database=stats  --user=stroomstats --password=password --cluster=cluster.txt

As processing user create the database:

mysql --defaults-file=/etc/mysqld56_stats.cnf --user=stroomstats --password=password
(out)Welcome to the MySQL monitor.  Commands end with ; or \g.
(out)....
create database stats;
(out)Query OK, 1 row affected (0.00 sec)

5.2 - Securing Stroom

How to secure Stroom and the cluster

NOTE This document was written for stroom v4/5. Some parts may not be applicable for v6+.

Firewall

The following firewall configuration is recommended:

  • Outside the cluster, drop all access except HTTP 80, HTTPS 443 and any other system ports you require (SSH, etc.)
  • Within the cluster, allow all access

This will enable nodes within the cluster to communicate on:

  • 8080 - Stroom HTTP.
  • 8081 - Stroom HTTP (admin).
  • 8090 - Stroom Proxy HTTP.
  • 8091 - Stroom Proxy HTTP (admin).
  • 3306 - MySQL

MySQL

It is recommended that you run mysql_secure_installation to set a root password and remove the test database:

mysql_secure_installation

When prompted, answer as follows (providing a root password when asked):

  • Set root password? → Y
  • Remove anonymous users? → Y
  • Disallow root login remotely? → Y
  • Remove test database and access to it? → Y
  • Reload privilege tables now? → Y

5.3 - Java Key Store Setup

In order that the java process communicates over https (for example Stroom Proxy forwarding onto Stroom), the JVM requires the relevant keystores to be set up.

As the processing user, copy the following files to a directory stroom-jks in the processing user's home directory:

  • CA.crt - Certificate Authority
  • SERVER.crt - Server certificate with client authentication attributes
  • SERVER.key - Server private key

As the processing user perform the following:

  • First turn your keys into der format:
cd ~/stroom-jks

SERVER=<SERVER crt/key PREFIX>
AUTHORITY=CA

openssl x509 -in ${SERVER}.crt -inform PEM -out ${SERVER}.crt.der -outform DER
openssl pkcs8 -topk8 -nocrypt -in ${SERVER}.key -inform PEM -out ${SERVER}.key.der -outform DER
  • Import Keys into the Key Stores:
STROOM_UTIL_JAR=`find ~/*app -name 'stroom-util*.jar' -print | head -1`

java -cp ${STROOM_UTIL_JAR} stroom.util.cert.ImportKey keystore=${SERVER}.jks keypass=${SERVER} alias=${SERVER} keyfile=${SERVER}.key.der certfile=${SERVER}.crt.der
keytool -import -noprompt -alias ${AUTHORITY} -file ${AUTHORITY}.crt -keystore ${AUTHORITY}.jks -storepass ${AUTHORITY}
  • Update Processing User Global Java Settings:
PWD=`pwd`
echo "export JAVA_OPTS=\"-Djavax.net.ssl.trustStore=${PWD}/${AUTHORITY}.jks -Djavax.net.ssl.trustStorePassword=${AUTHORITY} -Djavax.net.ssl.keyStore=${PWD}/${SERVER}.jks -Djavax.net.ssl.keyStorePassword=${SERVER}\"" >> ~/env.sh

Any Stroom or Stroom Proxy instance will now additionally pickup the above JAVA_OPTS settings.

5.4 - Processing Users

Processing User Setup

Stroom and Stroom Proxy should be run under a processing user (we assume stroomuser below).

Create user

/usr/sbin/adduser --system stroomuser

You may want to allow normal accounts to sudo to this account for maintenance (visudo).

Create service script

Create a service script to start/stop on server startup (as root).

vi /etc/init.d/stroomuser

Paste/type the following content into vi.

#!/bin/bash
#
# stroomuser       This shell script takes care of starting and stopping
#               the stroomuser subsystem (tomcat6, etc)
#
# chkconfig: - 86 14
# description: stroomuser is the stroomuser sub system

STROOM_USER=stroomuser
DEPLOY_DIR=/home/${STROOM_USER}

case $1 in
start)
/bin/su ${STROOM_USER} ${DEPLOY_DIR}/stroom-deploy/start.sh
;;
stop)
/bin/su ${STROOM_USER} ${DEPLOY_DIR}/stroom-deploy/stop.sh
;;
restart)
/bin/su ${STROOM_USER} ${DEPLOY_DIR}/stroom-deploy/stop.sh
/bin/su ${STROOM_USER} ${DEPLOY_DIR}/stroom-deploy/start.sh
;;
esac
exit 0

Now initialise the script.

/bin/chmod +x /etc/init.d/stroomuser
/sbin/chkconfig --level 345 stroomuser on

Setup user’s environment

Set up env.sh to include JAVA_HOME pointing to the installed directory of the JDK (this will be platform specific).

vi ~/env.sh

In vi add the following lines.

# User specific aliases and functions
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export PATH=${JAVA_HOME}/bin:${PATH}

Setup the user’s profile to source the env script.

vi ~/.bashrc

In vi add the following lines.

# User specific aliases and functions
. ~/env.sh

Verify Java installation

Assuming you are using Stroom without using docker and have installed Java, verify that the processing user can use the Java installation.

The shell output below may show a different version of Java to the one you are using.

. .bashrc
which java
(out)/usr/lib/jvm/java-1.8.0/bin/java

which javac
(out)/usr/lib/jvm/java-1.8.0/bin/javac

java -version
(out)openjdk version "1.8.0_65"
(out)OpenJDK Runtime Environment (build 1.8.0_65-b17)
(out)OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)

5.5 - Setting up Stroom with an Open ID Connect IDP

How to set up Stroom to use a 3rd party Identity Provider (e.g. KeyCloak, Cognito, etc.) for authentication.

5.5.1 - Accounts vs Users

The distinction between Accounts and Users in Stroom.

In Stroom we have the concept of Users and Accounts, and it is important to understand the distinction.

Accounts

Accounts are user identities in the internal Identity Provider (IDP). The internal IDP is used when you want Stroom to manage all the authentication; it is the default option and the simplest for test environments. Accounts are not applicable when using an external 3rd party IDP.

Accounts are managed in Stroom using the Manage Accounts screen, available from the Tools => Users menu item. An administrator can create and manage user accounts, allowing users to log in to Stroom.

Accounts are for authentication only, and play no part in authorisation (permissions). A Stroom user account has a unique identity that will be associated with a Stroom User to link the two together.

When using a 3rd party IDP this screen is not available as all management of users with respect to authentication is done in the 3rd party IDP.

Accounts are stored in the account database table.

Stroom Users

A user in Stroom is used for managing authorisation, i.e. permissions and group memberships. It plays no part in authentication. A user has a unique identifier, provided by the IDP (internal or 3rd party), that identifies it. This ID also links it to the Stroom Account in the case of the internal IDP, or to the identity in a 3rd party IDP.

Stroom users and groups are managed in the stroom_user and stroom_user_group database tables respectively.

5.5.2 - Stroom's Internal IDP

Details about Stroom’s own internal identity provider and authentication mechanisms.

By default a new Stroom instance/cluster will use its own internal Identity Provider (IDP) for authentication.

In this configuration, Stroom acts as its own Open ID Connect Identity Provider and manages both the user accounts for authentication and the user/group permissions (see Accounts and Users).

A fresh install will come pre-loaded with a user account called admin with the password admin. This user is a member of a group called Administrators which has the Administrator application permission. This admin user can be used to set up the other users on the system.

Additional user accounts are created and maintained using the Tools => Users menu item.

Configuration for the internal IDP

While Stroom is pre-configured to use its internal IDP, this section describes the configuration required.

In Stroom:

  security:
    authentication:
      authenticationRequired: true
      openId:
        identityProviderType: INTERNAL_IDP

In Stroom-Proxy:

  feedStatus:
    apiKey: "AN_API_KEY_CREATED_IN_STROOM"
  security:
    authentication:
      openId:
        identityProviderType: NO_IDP

5.5.3 - External IDP

How to setup KeyCloak as an external identity provider for Stroom.

You may be running Stroom in an environment with an existing Identity Provider (IDP) (KeyCloak, Cognito, Google, Active Directory, etc.) and want to use that for authenticating users. Stroom supports 3rd party IDPs that conform to the Open ID Connect specification.

The following is a guide to setting up a new Stroom instance/cluster with KeyCloak as the 3rd party IDP. KeyCloak is an Open ID Connect IDP; configuration for other IDPs will be very similar, so adapt these instructions accordingly. It is assumed that you have deployed a new instance/cluster of Stroom and have not yet started it.

Running KeyCloak

If you already have a KeyCloak instance running then move on to the next section.

This section is not a definitive guide to running/administering KeyCloak. It describes how to run KeyCloak using non-production settings for simplicity and to demonstrate using a 3rd party IDP. You should consult the KeyCloak documentation on how to set up a production ready instance of KeyCloak.

The easiest way to run KeyCloak is using Docker. To create a KeyCloak container do the following:

docker create \
  --name keycloak \
  -p 9999:8080 \
  -e KEYCLOAK_ADMIN=admin \
  -e KEYCLOAK_ADMIN_PASSWORD=admin \
  quay.io/keycloak/keycloak:20.0.1 \
  start-dev

This example maps KeyCloak’s port to 9999 to avoid a clash with Stroom, which also runs on 8080. It will create a Docker container called keycloak that uses an embedded H2 database to hold its state.

To start the container in the foreground, do:

docker start -a keycloak

KeyCloak should now be running on http://localhost:9999/admin. If you want to run KeyCloak on a different port then delete the container and re-create it with a different port for the -p argument.

Log into KeyCloak using the username admin and password admin as specified in the environment variables set in the container creation command above. You should see the admin console.

Creating a realm

First you need to create a Realm.

  1. Click on the drop-down in the left pane that contains the word master.
  2. Click Create Realm.
  3. Set the Realm name to StroomRealm.
  4. Click Create.

Creating a client

In the new realm click on Clients in the left pane, then Create client.

  1. Set the Client ID to StroomClient.
  2. Click Next.
  3. Set Client authentication to on.
  4. Ensure the following are ticked:
    • Standard flow
    • Direct access grants
  5. Click Save.

Open the new Client and on the Settings tab set:

  • Valid redirect URIs to https://localhost/*
  • Valid post logout redirect URIs to https://localhost/*

On the Credentials tab copy the Client secret for use later in Stroom config.

Creating users

Click on Users in the left pane then Add user. Set the following:

  • Username - admin
  • First name - Administrator
  • Last name - Administrator

Click Create.

Select the Credentials tab and click Set password.

Set the password to admin and set Temporary to off.

Repeat this process for the following user:

  • Username - jbloggs
  • First name - Joe
  • Last name - Bloggs
  • Password - password

Configure Stroom for KeyCloak

Edit the config.yml file and set the following values:

  receive:
    # Set to true to require authentication for /datafeed requests
    authenticationRequired: true
    # Set to true to allow authentication using an Open ID token
    tokenAuthenticationEnabled: true
  security:
    authentication:
      authenticationRequired: true
      openId:
        # The client ID created in KeyCloak
        clientId: "StroomClient"
        # The client secret copied from KeyCloak above
        clientSecret: "XwTPPudGZkDK2hu31MZkotzRUdBWfHO6"
        # Tells Stroom to use an external IDP for authentication
        identityProviderType: EXTERNAL_IDP
        # The URL on the IDP to redirect users to when logging out in Stroom
        logoutEndpoint: "http://localhost:9999/realms/StroomRealm/protocol/openid-connect/logout"
        # The endpoint to obtain the rest of the IDP’s configuration. Specific to the realm/issuer.
        openIdConfigurationEndpoint: "http://localhost:9999/realms/StroomRealm/.well-known/openid-configuration"

These values are obtained from the IDP. In the case of KeyCloak they can be found by clicking on Realm settings => Endpoints => OpenID Endpoint Configuration and extracting the various values from the JSON response. Alternatively they can typically be found at this address on any Open ID Connect IDP, https://host/.well-known/openid-configuration. The values will reflect the host/port that the IDP is running on along with the name of the realm.

Setting the above values assumes KeyCloak is running on localhost:9999 and the Realm name is StroomRealm.
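The endpoint values can be sanity-checked from the shell before editing the config. A minimal sketch, assuming the localhost:9999 host/port and StroomRealm realm used above:

```shell
# Compose the discovery URL for a realm and print it; with KeyCloak
# running, the commented curl line fetches the full configuration.
keycloak_base="http://localhost:9999"
realm="StroomRealm"
discovery_url="${keycloak_base}/realms/${realm}/.well-known/openid-configuration"
echo "${discovery_url}"

# curl --silent "${discovery_url}" | jq '.'
```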

Setting up the admin user in Stroom

Now that the admin user exists in the IDP we need to grant it Administrator rights in Stroom.

In the Users section of KeyCloak click on user admin. On the Details tab copy the value of the ID field. The ID is in the form of a UUID. This ID will be used in Stroom to uniquely identify the user and associate it with the identity in KeyCloak.
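As a precaution, the copied ID can be checked in the shell before use. A minimal sketch (the subject_id value here is made up for illustration):

```shell
# Check that the value copied from KeyCloak's ID field looks like a UUID.
subject_id="b6a1d2c3-4e5f-6789-abcd-0123456789ab"
uuid_regex='^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
if [[ "${subject_id}" =~ ${uuid_regex} ]]; then
  echo "looks like a UUID"
else
  echo "not a UUID - re-check the value copied from KeyCloak"
fi
```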

To set up Stroom with this admin user run the following (before Stroom has been started for the first time):

subject_id="XXX"; \
java -jar /absolute/path/to/stroom-app-all.jar \
  manage_users \
  ../local.yml \
  --createUser "${subject_id}" \
  --createGroup Administrators \
  --addToGroup "${subject_id}" Administrators \
  --grantPermission Administrators "Administrator"

Where XXX is the user ID copied from the IDP as described above. This command is repeatable as it will skip any users/groups/memberships that already exist.

This command will do the following:

  • Create the Stroom User by creating an entry in the stroom_user database table for the IDP’s admin user.
  • Ensure that an Administrators group exists (i.e. an entry in the stroom_user database table for the Administrators group).
  • Add the admin user to the group Administrators.
  • Grant the application permission Administrator to the group Administrators.

Logging into Stroom

As the administrator

Now that the user and permissions have been set up in Stroom, the administrator can log in.

First start the Stroom instance/cluster.

Navigate to http://STROOM_FQDN and Stroom should re-direct you to the IDP (KeyCloak) to authenticate. Enter the username admin and password admin. You should be authenticated by KeyCloak and re-directed back to Stroom. Your user ID is shown in the bottom right corner of the Welcome tab.

As an administrator, the Tools => User Permissions menu item will be available to manage the permissions of any users that have logged on at least once.

Now select User => Logout to be re-directed to the IDP to log out. Once you have logged out of the IDP it should re-direct you back to the IDP login screen, ready to log in to Stroom again.

As an ordinary user

On the IDP login screen, login as user jbloggs with the password password. You will be re-directed to Stroom however the explorer tree will be empty and most of the menu items will be disabled. In order to gain permissions to do anything in Stroom a Stroom administrator will need to grant application/document permissions and/or group memberships to the user via the Tools => User Permissions menu item.

Configure Stroom-Proxy for KeyCloak

In order to use Stroom-Proxy with an external Open ID Connect IDP, edit the config.yml file and set the following values:

  receive:
    # Set to true to require authentication for /datafeed requests
    authenticationRequired: true
    # Set to true to allow authentication using an Open ID token
    tokenAuthenticationEnabled: true
  security:
    authentication:
      openId:
        # The client ID created in KeyCloak
        clientId: "StroomClient"
        # The client secret copied from KeyCloak above
        clientSecret: "XwTPPudGZkDK2hu31MZkotzRUdBWfHO6"
        # Tells Stroom to use an external IDP for authentication
        identityProviderType: EXTERNAL_IDP
        # The URL on the IDP to redirect users to when logging out in Stroom
        logoutEndpoint: "http://localhost:9999/realms/StroomRealm/protocol/openid-connect/logout"
        # The endpoint to obtain the rest of the IDP’s configuration. Specific to the realm/issuer.
        openIdConfigurationEndpoint: "http://localhost:9999/realms/StroomRealm/.well-known/openid-configuration"

If Stroom-Proxy is configured to forward data onto another Stroom-Proxy or Stroom instance then it can use tokens when forwarding the data. This assumes the downstream Stroom or Stroom-Proxy is also configured to use the same external IDP.

  forwardHttpDestinations:
    # If true, adds a token for the service user to the request
    - addOpenIdAccessToken: true
      enabled: true
      name: "downstream"
      forwardUrl: "http://somehost/stroom/datafeed"

The token used will be for the service user account of the identity provider client used by Stroom-Proxy.

5.5.4 - Tokens for API use

How to create and use tokens for making API calls.

Creating a user access token

If a user wants to use the REST API they will need to create a token for authentication/authorisation in API calls. Any calls to the REST API will have the same permissions that the user has within Stroom.

The following excerpt of shell commands shows how you can get an access/refresh token pair for a user and then later use the refresh token to obtain a new access token. It also shows how you can extract the expiry date/time from a token using jq.

get_jwt_expiry() {
  jq \
    --raw-input \
    --raw-output \
    'split(".") | .[1] | @base64d | fromjson | .exp | todateiso8601' \
    <<< "${1}"
}

# Fetch a new set of tokens (id, access and refresh) for the user
response="$( \
  curl \
    --silent \
    --request POST \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'client_id=admin-cli' \
    --data-urlencode 'grant_type=password' \
    --data-urlencode 'scope=openid' \
    --data-urlencode 'username=jbloggs' \
    --data-urlencode 'password=password' \
    'http://localhost:9999/realms/StroomRealm/protocol/openid-connect/token' )"

# Extract the individual tokens from the response
access_token="$( jq -r '.access_token' <<< "${response}" )"
refresh_token="$( jq -r '.refresh_token' <<< "${response}" )"

# Output the tokens
echo -e "\nAccess token (expiry $( get_jwt_expiry "${access_token}")):\n${access_token}"
echo -e "\nRefresh token (expiry $( get_jwt_expiry "${refresh_token}")):\n${refresh_token}"

# Fetch a new access token using the stored refresh token
response="$( \
  curl \
    --silent \
    --request POST \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'client_id=admin-cli' \
    --data-urlencode 'grant_type=refresh_token' \
    --data-urlencode "refresh_token=${refresh_token}" \
    'http://localhost:9999/realms/StroomRealm/protocol/openid-connect/token' )"

access_token="$( jq -r '.access_token' <<< "${response}" )"
refresh_token="$( jq -r '.refresh_token' <<< "${response}" )"

echo -e "\nNew access token (expiry $( get_jwt_expiry "${access_token}")):\n${access_token}"
echo -e "\nNew refresh token (expiry $( get_jwt_expiry "${refresh_token}")):\n${refresh_token}"

The above example assumes that you have created a user called jbloggs and a client ID admin-cli.

Access tokens typically have a short life (of the order of minutes) while a refresh token will have a much longer life (maybe up to a year). Refreshing the token does not require re-authentication.
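The jq technique used by get_jwt_expiry above can extract any claim from a token, e.g. the subject. The sketch below runs against a hand-crafted, unsigned token so it needs no IDP; in practice you would pass in a real access token:

```shell
# Decode an arbitrary claim from a JWT payload without verifying it.
jwt_claim() {
  jq --raw-input --raw-output \
    "split(\".\") | .[1] | @base64d | fromjson | .${2}" \
    <<< "${1}"
}

# Build a dummy token; only the payload (second section) is decoded.
payload='{"sub":"jbloggs","exp":4102444800}'
token="e30.$(printf '%s' "${payload}" | base64 --wrap=0).signature"

jwt_claim "${token}" "sub"
```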

Creating a service account token

If you want another system to call one of Stroom’s APIs then it is likely that you will do that using a non-human service account (or processing user account).

Creating a new Client ID

The client system needs to be represented by a Client ID in KeyCloak. To create a new Client ID, assuming the client system is called System X, do the following in the KeyCloak admin UI.

  1. Click Clients in the left pane.
  2. Click Create client.
  3. Set the Client ID to be system-x.
  4. Set the Name to be System X.
  5. Click Next.
  6. Enable Client Authentication.
  7. Enable Service accounts roles.
  8. Click Save.

Open the Credentials tab and copy the Client secret for use later.

To create an access token run the following shell commands:

response="$( \
  curl \
    --silent \
    --request POST \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'client_secret=k0BhYyvt6PHQqwKnnQpbL3KXVFHG0Wa1' \
    --data-urlencode 'client_id=system-x' \
    --data-urlencode 'grant_type=client_credentials' \
    --data-urlencode 'scope=openid' \
    'http://localhost:9999/realms/StroomRealm/protocol/openid-connect/token' )"

access_token="$( jq -r '.access_token' <<< "${response}" )"
refresh_token="$( jq -r '.refresh_token' <<< "${response}" )"

echo -e "\nAccess token:\n${access_token}"

Where client_secret is the Client secret that you copied from KeyCloak earlier.

This access token can be refreshed in the same way as for a user access token, as described above.

Using access tokens

Access tokens can be used in calls to Stroom’s REST API or its datafeed API. The process of including the token in an HTTP request is described in API Authentication.
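The general pattern is to send the token as a bearer token in the Authorization header. A sketch (the token value and URL below are placeholders):

```shell
# Build the Authorization header for an API call using the access token.
access_token="eyJhbGciOi...truncated"
auth_header="Authorization: Bearer ${access_token}"
echo "${auth_header}"

# e.g. curl --silent --header "${auth_header}" "https://STROOM_FQDN/api/..."
```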

5.5.5 - Test Credentials

Hard coded Open ID credentials for test or demonstration purposes.

Stroom and Stroom-Proxy come with a set of hard coded Open ID credentials that are intended for use in test/demo environments. These credentials mean that the stroom_core_test Docker stack can function out of the box, with Stroom-Proxy able to authenticate with Stroom.

Configuring the test credentials

To configure Stroom to use these hard-coded credentials you need to set the following property:

  security:
    authentication:
      openId:
        identityProviderType: TEST_CREDENTIALS

When you start the Stroom instance you will see a large banner message in the logs that will include the token that can be used in API calls or by Stroom-proxy for its feed status checks.

To configure Stroom-Proxy to use these credentials set the following:

  feedStatus:
    apiKey: "THE_TOKEN_OBTAINED_FROM_STROOM'S_LOGS"
  security:
    authentication:
      openId:
        identityProviderType: NO_IDP

6 - Stroom 6 Installation

Running on a single box

Running a release

Download a release, for example Stroom Core v6.0 Beta 3, unpack it, and run the start.sh script. Once you have given it some time to start up, go to http://localhost/stroom. There’s a README.md file inside the tar.gz with more information.

Admin Account creation

By default, Stroom does not come with an administrator account/user, so one or more administrators will need to be set up in order to log in and continue provisioning Stroom via the UI.

See Creating an Internal IDP Administrator or Creating an External IDP Administrator depending on the type of IDP that is configured.

Post-install hardening

Before first run

Change database passwords

If you don’t do this before the first run of Stroom then the passwords will already be set, and you will have to change them manually in the database and then update the .env file.

This change should be made in the .env configuration file. If the values are not there then this service is not included in your Stroom stack and there is nothing to change.

  • STROOM_DB_PASSWORD

  • STROOM_DB_ROOT_PASSWORD

  • STROOM_STATS_DB_ROOT_PASSWORD

  • STROOM_STATS_DB_PASSWORD

  • STROOM_AUTH_DB_PASSWORD

  • STROOM_AUTH_DB_ROOT_PASSWORD

  • STROOM_ANNOTATIONS_DB_PASSWORD

  • STROOM_ANNOTATIONS_DB_ROOT_PASSWORD
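The password changes can be scripted. A sketch, demonstrated against a throw-away copy of the file; point env_file at the real .env in your stack (and back it up first). The old value shown is a placeholder:

```shell
# Generate a random 24-character password and substitute it into the
# .env file for one of the variables listed above.
env_file="$(mktemp)"
echo 'STROOM_DB_PASSWORD=changeme' > "${env_file}"

new_password="$(head -c 18 /dev/urandom | base64 | tr -d '\n' | tr '+/' 'Aa')"
sed -i "s|^STROOM_DB_PASSWORD=.*|STROOM_DB_PASSWORD=${new_password}|" "${env_file}"

grep '^STROOM_DB_PASSWORD=' "${env_file}"
```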

On first run

Create yourself an account

After first logging in as admin you should create yourself a normal account (using your email address) and add yourself to the Administrators group. You should then log out of admin, log in with your new administrator account and then disable the admin account.

If you decide to use the admin account as your normal account you might find yourself locked out. The admin account has no associated email address, so the Reset Password feature will not work if your account is locked. It might become locked if you enter your password incorrectly too many times.

Delete un-used users and API keys

  • If you’re not using stats you can delete or disable the following:
    • the user statsServiceUser
    • the API key for statsServiceUser

Change the API keys

First generate new API keys. You can generate a new API key in Stroom from the top menu via Tools => API Keys.

The following need to be changed:

  • STROOM_SECURITY_API_TOKEN

    • This is the API token for user stroomServiceUser.

Then stop Stroom and update the API key in the .env configuration file with the new value.

Troubleshooting

I’m trying to use certificate logins (PKI) but I keep being prompted for the username and password!

You need to be sure of several things:

  • When a user arrives at Stroom the first thing Stroom does is redirect the user to the authentication service. This is when the certificate is checked. If this redirect doesn’t use HTTPS then nginx will not get the cert and will not send it onwards to the authentication service. Remember that all of this stuff, apart from back-channel/service-to-service chatter, goes through nginx. The env var that needs to use HTTPS is STROOM_AUTHENTICATION_SERVICE_URL. Note that this is the var Stroom looks for, not the var as set in the stack, so you’ll find it in the stack YAML.
  • Are your certs configured properly? If nginx isn’t able to decode the incoming cert for some reason then it won’t pass anything on to the service.
  • Is your browser sending certificates?

7 - Stroom Installation

Details how to install Stroom and its associated services.

Typical Deployments

Stroom can be deployed in a number of ways:

  • Single node - For environments with low data volumes, test environments or where resilience is not critical. For a single node deployment, the simplest way to deploy is with a Single Node Docker Stack as this includes everything needed for Stroom to run.

  • Non-Docker Cluster - A Stroom cluster where the Stroom Java application is running directly on the physical/virtual host and Stroom’s peripheral services (e.g. Nginx, MySQL, Stroom-Proxy) have been installed adjacent to the Stroom cluster.

  • Kubernetes - For deploying a containerised Stroom cluster, Kubernetes (k8s) is the recommended approach. See Kubernetes Cluster.

This document will only be concerned with the installation of a non-Docker Stroom cluster.

For a more detailed description of the deployment architecture, see Architecture.

For details of how to install Stroom-Proxy see Stroom-Proxy Installation.

Assumptions

The following assumptions are used in this document.

  • The user has reasonable RHEL/CentOS/Rocky System administration skills.
  • Installation is on a fully patched minimal RHEL/CentOS/Rocky instance.
  • The application user stroomuser has been created in the OS.
  • The user has set up the Stroom processing user as described here.
  • The prerequisite software has been installed.

Firewall Configuration

The following are the ports used in a typical Stroom deployment. Some may need to be opened to allow access to the ports from outside the host.

  • 80 - Nginx listens on port 80 but redirects onto 443.
  • 443 - Nginx listens on port 443.
  • 3306 - MySQL listens on port 3306 by default.
  • 8080 - Stroom listens on port 8080 for its main public APIs (/datafeed, REST endpoints, etc).
  • 8081 - Stroom listens on port 8081 for its administration APIs. Access to this port should probably be carefully controlled.
  • 8090 - Stroom-Proxy listens on port 8090 for its main public APIs (/datafeed, REST endpoints, etc).
  • 8091 - Stroom-Proxy listens on port 8091 for its administration APIs. Access to this port should probably be carefully controlled.

Which ports you open on a host will depend on what service is running on that host. Typically Stroom will be running on different hosts to Nginx, MySQL and Stroom-Proxy, so Stroom’s 8080 port will need to be opened for traffic from Stroom-Proxy and Nginx.

For example on a RHEL/CentOS server using firewalld the commands would be as root user:

firewall-cmd --zone=public --permanent --add-port=80/tcp
firewall-cmd --zone=public --permanent --add-port=443/tcp
firewall-cmd --reload

Prerequisites

  • RHEL/CentOS/Rocky
  • Java JDK (JDK is preferred over JRE as it provides additional tools (e.g. jmap) for capturing heap histogram statistics). For details about which Java distribution and version to use, and how to install it, see Java.
  • bash v4 or greater - Used by the helper scripts.
  • GNU coreutils - Used by the helper scripts.
  • jq - Used by the stack scripts.


Install Components

Install Nginx

To deploy Nginx, it can either be installed manually (see Installing Nginx) or using the stroom_services Docker Stack.

Install Stroom-Proxy

For details of how to install Stroom-Proxy see Stroom-Proxy Installation.

Install MySQL

For details of how to install MySQL see MySQL Setup.

Install Stroom

Stroom releases are available from github.com/gchq/stroom/releases. Each release has a number of artefacts; the Stroom application is stroom-app-v*.zip.

The installation example below is for Stroom version 7.10.20, but is applicable to other Stroom v7 versions. As a suitable Stroom user (e.g. stroomuser), download and unpack the Stroom software.

wget https://github.com/gchq/stroom/releases/download/v7.10.20/stroom-app-v7.10.20.zip
unzip stroom-app-v7.10.20.zip

The configuration file – stroom/config/config.yml – is the principal file that controls the configuration of Stroom, although once Stroom is running, the configuration can be managed via System Properties. See Stroom Configuration.

8 - Java

Stroom and Stroom-Proxy both run on Java. This section details the requirements they have in terms of Java.

There are multiple distributions of Java available (Oracle, OpenJDK, Adoptium, Azul, etc). Our recommendation is to use Adoptium Eclipse Temurin as this is free and Open Source and has 4 year support periods for Long Term Support (LTS) releases of Java.

JDK or JRE

Java distributions are available as a Java Development Kit or a Java Runtime Environment. The JDK is primarily intended for development of Java applications (i.e. compiling code) while the JRE is simply for running a compiled application.

However, we recommend installing the JDK as this can run an application in the same way as the JRE, but also provides additional tools to aid in debugging the application if required. For example the JDK includes the jmap binary that can be used by Stroom to capture statistics on object use within the Java Heap.

Java Releases

Java now has a regular release cycle of new major versions. Periodically a Java release is deemed a Long Term Support (LTS) release, e.g. Java v11, v17 & v25. Intermediate versions have a short support lifecycle.

Stroom and Stroom-Proxy versions will typically require an LTS release of Java as a minimum. While you can run a later release of Java than that required by the Stroom/Stroom-Proxy release, it is generally simpler to run the minimum required version. Staying on the same LTS release means you will get security/bug updates for 4 or so years and you don’t need to worry about any breaking changes that a later version of Java may have introduced.

The following lists the minimum required Java version required by each Stroom release.

Stroom/Stroom-Proxy Version Minimum Java Version
v7.11 v25
v7.10 v21
v7.9 v21
v7.8 v21
v7.7 v21
v7.6 v21
v7.5 v21
v7.4 v21
v7.3 v21
v7.2 v17
v7.1 v17
v7.0 v15
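The installed Java can be checked against the table above from the shell. A sketch; version_line would normally be captured with java -version 2>&1 | head -n 1, and min_version set for your Stroom release:

```shell
# Extract the major version from `java -version` output and compare it
# to a minimum. The version_line here is a sample string.
version_line='openjdk version "21.0.2" 2024-01-16'
major="$(sed -E 's/.*"([0-9]+)[^"]*".*/\1/' <<< "${version_line}")"
min_version=21
if (( major >= min_version )); then
  echo "Java ${major} meets the minimum of v${min_version}"
else
  echo "Java ${major} is older than the required v${min_version}"
fi
```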

Installing Java

See Linux Installation Instructions for details of how to install the JDK using your package manager.

Alternatively, see Adoptium Eclipse Temurin for links to download the Java binaries for manual installation.

Setting Java Home

Create a shell script that will define the Java variable OR add the statements to .bash_profile. e.g. vi /etc/profile.d/jdk.sh

export JAVA_HOME=/path/to/java/home
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile.d/jdk.sh
echo $JAVA_HOME
(out)/path/to/java/home

java --version
(out)openjdk 25 2025-09-16 LTS
(out)OpenJDK Runtime Environment Temurin-25+36 (build 25+36-LTS)
(out)OpenJDK 64-Bit Server VM Temurin-25+36 (build 25+36-LTS, mixed mode, sharing)

9 - Kubernetes Cluster

How to deploy and administer a container based Stroom cluster using Kubernetes.

9.1 - Introduction

Introduction to using Stroom on Kubernetes.

Kubernetes is an open-source system for automating the deployment, scaling and management of containerised applications.

Stroom is a distributed application designed to handle large-scale dataflows. As such, it is ideally suited to a Kubernetes deployment, especially when operated at scale. Features standard to Kubernetes, like Ingress and Cluster Networking, simplify the installation and ongoing operation of Stroom.

Running applications in K8s can be challenging for applications not designed to operate natively in a K8s cluster. A purpose-built Kubernetes Operator (stroom-k8s-operator) has been developed to make deployment easier, while taking advantage of several key Kubernetes features to further automate Stroom cluster management.

The concept of Kubernetes operators is discussed here.

Key features

The Stroom K8s Operator provides the following key features:

Deployment

  1. Simplified configuration, enabling administrators to define the entire state of a Stroom cluster in one file
  2. Designate separate processing and UI nodes, to ensure the Stroom user interface remains responsive, regardless of processing load
  3. Automatic secrets management

Operations

  1. Scheduled database backups
  2. Stroom node audit log shipping
  3. Automatically drain Stroom tasks before node shutdown
  4. Automatic Stroom task limit tuning, to attempt to keep CPU usage within configured parameters
  5. Rolling Stroom version upgrades

Next steps

Install the Stroom K8s Operator

9.2 - Install Operator

How to install the Stroom Kubernetes operator.

Prerequisites

  1. Kubernetes cluster, version >= 1.20.2
  2. metrics-server (pre-installed with some K8s distributions)
  3. kubectl and cluster-wide admin access

Preparation

Stage the following images in a locally-accessible container registry:

  1. All images listed in: https://github.com/p-kimberley/stroom-k8s-operator/blob/master/deploy/images.txt
  2. MySQL (e.g. mysql/mysql-server:8.0.25)
  3. Stroom (e.g. gchq/stroom:v7-LATEST)
  4. gchq/stroom-log-sender:v2.2.0 (only required if log forwarding is enabled)

Install the Stroom K8s Operator

  1. Clone the repository

    git clone https://github.com/p-kimberley/stroom-k8s-operator.git
  2. Edit ./deploy/all-in-one.yaml, prefixing any referenced images with your private registry URL. For example, if your private registry is my-registry.example.com, the image gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0 will become: my-registry.example.com:5000/gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0.

  3. Deploy the Operator

    kubectl apply -f ./deploy/all-in-one.yaml
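Step 2 can be scripted with sed. A sketch, demonstrated against a throw-away file using the example registry from above; run it against ./deploy/all-in-one.yaml in practice, after backing it up:

```shell
# Prefix every image reference in a manifest with a private registry URL.
registry="my-registry.example.com:5000"
manifest="$(mktemp)"
echo 'image: gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0' > "${manifest}"

sed -i -E "s|image: ([^ ]+)|image: ${registry}/\1|" "${manifest}"
cat "${manifest}"
```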

The Stroom K8s Operator is now deployed to namespace stroom-operator-system. You can monitor its progress by watching the Pod named stroom-operator-controller-manager. Once it reaches Ready state, you can deploy a Stroom cluster.

Allocating more resources

If the Operator Pod is killed due to running out of memory, you may want to increase the amount allocated to it.

This can be done by:

  1. Editing the resources.limits settings of the controller Pod in all-in-one.yaml
  2. kubectl apply -f all-in-one.yaml

Next steps

Configure a Stroom database server
Upgrade
Remove

9.3 - Upgrade Operator

How to upgrade the Stroom Kubernetes Operator.

Upgrading the Operator can be performed without disrupting any resources it controls, including Stroom clusters.

To perform the upgrade, follow the same steps in Installing the Stroom K8s Operator.

Once you have initiated the update (by executing kubectl apply -f all-in-one.yaml), an instance of the new Operator version will be created. Once it starts up successfully, the old instance will be removed.

You can check whether the update succeeded by inspecting the image tag of the Operator Pod: stroom-operator-system/stroom-operator-controller-manager. The tag should correspond to the release number that was downloaded (e.g. 1.0.0).

If the upgrade failed, the existing Operator should still be running.

9.4 - Remove Operator

How to remove the Stroom Kubernetes operator.

Removing the Stroom K8s Operator must be done with caution, as it causes all resources it manages, including StroomCluster, DatabaseServer and StroomTaskAutoscaler to be deleted.

While the Stroom clusters under its control will be gracefully terminated, they will become inaccessible until re-deployed.

It is good practice to first delete any dependent resources before deleting the Operator.

Deleting the Operator

Execute this command against the same version of the manifest that was used to deploy the currently running Operator.

kubectl delete -f all-in-one.yaml

9.5 - Configure Database

How to configure the database server for a Stroom cluster.

Before creating a Stroom cluster, a database server must first be configured.

There are two options for deploying a MySQL database for Stroom:

Managed by Stroom K8s Operator

A Database server can be created and managed by the Operator. This is the recommended option, as the Operator will take care of the creation and storage of database credentials, which are shared securely with the Pod via the use of a Secret cluster resource.

Create a DatabaseServer resource manifest

Use the example at database-server.yaml.

See the DatabaseServer Custom Resource Definition (CRD) API documentation for an explanation of the various CRD fields.

By default, MySQL imposes a limit of 151 concurrent connections. If your Stroom cluster is larger than a few nodes, it is likely you will exceed this limit. Therefore, it is recommended to set the MySQL property max_connections to a suitable value.

Bear in mind the Operator generally consumes one connection per StroomCluster it manages, so be sure to include some headroom in your allocation.

You can specify this value via the spec.additionalConfig property as in the example below:

apiVersion: stroom.gchq.github.io/v1
kind: DatabaseServer
...
spec:
  additionalConfig:
    - max_connections=1000
...

Provision a PersistentVolume for the DatabaseServer

General instructions on creating a Kubernetes PersistentVolume (PV) are explained here.

The Operator will create a StatefulSet when the DatabaseServer is deployed, which will attempt to claim a PersistentVolume matching the specification provided in DatabaseServer.spec.volumeClaim.

Fast, low-latency storage should be used for the Stroom database.
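As an illustrative sketch, a local PersistentVolume backing the database might look like the following. The storage class name, capacity, path and hostname are all assumptions for your environment; only the general shape is meaningful.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: stroom-db-pv
spec:
  capacity:
    storage: 50Gi                     # assumed size; match DatabaseServer.spec.volumeClaim
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage     # assumed; must match the volumeClaim's storage class
  local:
    path: /mnt/ssd/stroom-db          # assumed fast local SSD path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1       # assumed node hosting the volume
```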

Deploy the DatabaseServer to the cluster

kubectl apply -f database-server.yaml

Observe the Pod stroom-<database server name>-db start up. Once it’s reached Ready state, the server has started, and the databases you specified have been created.

Backup the created credentials

The Operator generates a Secret containing the passwords of the users root and stroomuser when it initially creates the DatabaseServer resource. These credentials should be backed up to a secure location, in the event the Secret is inadvertently deleted.

The Secret is named using the format: stroom-<db server name>-db (e.g. stroom-dev-db).
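One way to take that backup is to export the Secret with kubectl. This is a sketch: the server name dev and namespace stroom are assumptions.

```sh
# Assumptions: DatabaseServer named "dev", deployed in namespace "stroom".
# Store the resulting file somewhere secure; it contains the credentials.
kubectl get secret -n stroom stroom-dev-db -o yaml > stroom-dev-db-secret.yaml
```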

External

You may alternatively provide the connection details of an existing MySQL (or compatible) database server. This may be desirable if you have, for instance, a replication-enabled MySQL InnoDB cluster.

Provision the server and Stroom databases

Store credentials in a Secret

Create a Secret in the same namespace as the StroomCluster, containing the key stroomuser, with the value set to the password of that user.
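A sketch of such a Secret follows. The Secret's name and namespace are assumptions for your environment; the stroomuser key is the requirement stated above.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: stroom-db-credentials   # assumed name; reference it from your StroomCluster configuration
  namespace: stroom             # must be the same namespace as the StroomCluster
type: Opaque
stringData:
  stroomuser: "<password for stroomuser>"
```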

Upgrading or removing a DatabaseServer

A DatabaseServer cannot shut down while its dependent StroomCluster is running. This is a necessary safeguard to prevent database connectivity from being lost.

Upgrading or removing a DatabaseServer requires the StroomCluster be removed first.

Next steps

Configure a Stroom cluster

9.6 - Configure a cluster

How to configure a Stroom cluster.

A StroomCluster resource defines the topology and behaviour of a collection of Stroom nodes.

The following key concepts should be understood in order to optimally configure a cluster.

Concepts

NodeSet

A logical grouping of nodes intended to fulfil a common role together. There are three possible roles, as defined by ProcessingNodeRole:

  1. Undefined (default). Each node in the NodeSet can receive and process data, as well as service web frontend requests.
  2. Processing. Each node can receive and process data, but not service web frontend requests.
  3. Frontend. Each node services web frontend requests only.

There is no imposed limit on the number of NodeSets; however, it generally doesn't make sense to have more than one assigned to either the Processing or the Frontend role. In clusters where nodes are not very busy, it should not be necessary to have dedicated Frontend nodes. In cases where load is prone to spikes, such nodes can greatly help improve the responsiveness of the Stroom user interface.

It is important to ensure there is at least one NodeSet for each role in the StroomCluster. The Operator automatically wires up traffic routing to ensure that only non-Frontend nodes receive event data. Additionally, Frontend-only nodes have server tasks disabled automatically on startup, effectively preventing them from participating in stream processing.
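The NodeSet arrangement described above might be sketched as follows. The field names and counts here are illustrative assumptions; consult the StroomCluster CRD documentation for the exact schema.

```yaml
# Illustrative only; verify field names against the StroomCluster CRD documentation.
spec:
  nodeSets:
    - name: processing
      count: 3
      role: Processing   # receives and processes data; no web frontend traffic
    - name: frontend
      count: 1
      role: Frontend     # services web frontend requests only
```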

Ingress

Kubernetes Ingress resources determine how requests are routed to an application. Ingress resources are configured by the Operator based on the NodeSet roles and the provided StroomCluster.spec.ingress parameters.

It is possible to disable Ingress for a given NodeSet, which excludes nodes within that group from receiving any traffic via the public endpoint. This can be useful when creating nodes dedicated to data processing, which do not receive data.

StroomTaskAutoscaler

StroomTaskAutoscaler is an optional resource that, if defined, activates “auto-pilot” features for an associated StroomCluster. See this guide for how to configure it.

Creating a Stroom cluster

Create a StroomCluster resource manifest

Use the example stroom-cluster.yaml.

If you chose to create an Operator-managed DatabaseServer, the StroomCluster.spec.databaseServerRef should point to the name of the DatabaseServer.
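For example, if the DatabaseServer created earlier was named dev, the reference might look like the following sketch (surrounding fields omitted; the name is an assumption).

```yaml
apiVersion: stroom.gchq.github.io/v1
kind: StroomCluster
...
spec:
  databaseServerRef:
    name: dev   # assumed: the metadata.name of your DatabaseServer
...
```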

Provision a PersistentVolume for each Stroom node

Each PersistentVolume provides persistent local storage for a Stroom node. The amount of storage doesn’t generally need to be large, as stream data is stored on another volume. When deciding on a storage quota, be sure to consider the needs of log and reference data, in particular.

This volume should ideally be backed by fast, low-latency storage in order to maximise the performance of LMDB.

Deploy the StroomCluster resource

kubectl apply -f stroom-cluster.yaml

If the StroomCluster configuration is valid, the Operator will deploy a StatefulSet for each NodeSet defined in StroomCluster.spec.nodeSets. Once these StatefulSets reach Ready state, you are ready to access the Stroom UI.

Log into Stroom

Access the Stroom UI at: https://<ingress hostname>. The initial credentials are:

  • Username: admin
  • Password: admin

Further customisation (optional)

The configuration bundled with the Operator provides enough customisation for most use cases, via explicit properties and environment variables.

If you need to further customise Stroom, you have the following methods available:

Override the Stroom configuration file

Deploy a ConfigMap separately. You can then specify the ConfigMap name and key (itemName) containing the configuration file to be mounted into each Stroom node container.
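A sketch of such a ConfigMap follows; the ConfigMap name and key are hypothetical. The StroomCluster would then reference this ConfigMap by name, with config.yml as the itemName.

```yaml
# Hypothetical example; the ConfigMap name and key are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: stroom-custom-config
data:
  config.yml: |
    # Custom Stroom configuration file contents go here
```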

Provide additional environment variables

Specify custom environment variables in StroomCluster.spec.extraEnv. You can reference these in the Stroom configuration file.
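A minimal sketch, assuming a hypothetical variable name:

```yaml
spec:
  extraEnv:
    - name: MY_CUSTOM_SETTING   # hypothetical variable name
      value: "some-value"
```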

Mount additional files

You can also define additional Volumes and VolumeMounts to be injected into each Stroom node. This can be useful when providing files like certificates for Kafka integration.
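A sketch of the Kafka certificate case might look like the following. The volume and Secret names, mount path, and exact property names are assumptions; check the StroomCluster CRD documentation.

```yaml
spec:
  extraVolumes:
    - name: kafka-certs                  # hypothetical volume sourcing a Secret
      secret:
        secretName: kafka-client-certs   # assumed Secret containing the certificates
  extraVolumeMounts:
    - name: kafka-certs
      mountPath: /stroom/certs           # assumed mount location inside the container
      readOnly: true
```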

Reconfiguring the cluster

Some StroomCluster configuration properties can be reconfigured while the cluster is still running:

  1. spec.image Change this to deploy a newer (or different) Stroom version
  2. spec.terminationGracePeriodSecs Applies the next time a node or cluster is deleted
  3. spec.nodeSets.count If changed, the NodeSet’s StatefulSet will be scaled (up or down) to match the corresponding number of replicas

After changing any of the above properties, re-apply the manifest:

kubectl apply -f stroom-cluster.yaml

If any other changes need to be made, delete then re-create the StroomCluster.

Next steps

Configure Stroom task autoscaling
Stop a Stroom cluster

9.7 - Auto Scaler

How to configure Stroom task auto scaling.

Motivation

Setting optimal Stroom stream processor task limits is a crucial factor in running a healthy, performant cluster. If a node is allocated too many tasks, it may become unresponsive or crash. Conversely, if allocated too few tasks, it may have CPU cycles to spare.

The optimal number of tasks is often time-dependent, as load will usually fluctuate during the day and night. In large deployments, it’s not ideal to set static limits, as doing so risks over-committing nodes during intense spikes in activity (such as backlog processing or multiple concurrent searches). Therefore an automated solution, factoring in system load, is called for.

Stroom task autoscaling

When a StroomTaskAutoscaler resource is deployed to a linked StroomCluster, the Operator will periodically compare each Stroom node’s average Pod CPU usage against user-defined thresholds.

Enabling autoscaling

Create a StroomTaskAutoscaler resource manifest

Use the example autoscaler.yaml.

Below is an explanation of some of the main parameters. The rest are documented here.

  • adjustmentIntervalMins Determines how often the Operator will check whether a node has exceeded its CPU parameters. It should run often enough to catch brief load spikes, but not so often as to overload the Operator and Kubernetes cluster with excessive API calls and other overhead.
  • metricsSlidingWindowMin The window of time over which CPU usage is averaged. It should not be too small, otherwise momentary load spikes could cause task limits to be reduced unnecessarily; too large, and spikes may not cause throttling to occur.
  • minCpuPercent and maxCpuPercent should be set to a reasonably tight range, in order to keep the task limit as close to optimal as possible.
  • minTaskLimit and maxTaskLimit are considered safeguards to avoid nodes ever being allocated an unreasonable number of tasks. Setting maxTaskLimit to be equal to the number of assigned CPUs would be a reasonable starting point.
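Pulling those parameters together, an autoscaler manifest might be sketched as follows. The numeric values are illustrative assumptions, and other fields are omitted; consult the CRD documentation for the full schema.

```yaml
apiVersion: stroom.gchq.github.io/v1
kind: StroomTaskAutoscaler
...
spec:
  adjustmentIntervalMins: 5    # assumed: check nodes every 5 minutes
  metricsSlidingWindowMin: 15  # assumed: average CPU usage over a 15-minute window
  minCpuPercent: 50            # assumed lower bound of the target CPU range
  maxCpuPercent: 90            # assumed upper bound of the target CPU range
  minTaskLimit: 1
  maxTaskLimit: 8              # assumed: equal to the number of assigned CPUs
...
```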

Deploy the resource manifest

kubectl apply -f autoscaler.yaml

Disable autoscaling

Delete the StroomTaskAutoscaler resource

kubectl delete -f autoscaler.yaml

9.8 - Stop Stroom Cluster

How to stop the whole Stroom cluster.

A Stroom cluster can be stopped by deleting the StroomCluster resource that was deployed. When this occurs, the Operator will perform the following actions for each node, in sequence:

  1. Disable processing of all tasks.
  2. Wait for all processing tasks to be completed. This check is performed once every minute, so there may be a brief delay between a node completing its tasks and being shut down.
  3. Terminate the container.

The StroomCluster resource will be removed from the Kubernetes cluster once all nodes have finished processing tasks.

Stopping the cluster

kubectl delete -f stroom-cluster.yaml
kubectl delete -f database-server.yaml

If a StroomTaskAutoscaler was created, remove that as well.

If any of these commands appear to hang with no response, that’s normal; the Operator is likely waiting for tasks to drain. You may press Ctrl+C to return to the shell and task termination will continue in the background.

Once the StroomCluster is removed, it can be reconfigured (if required) and redeployed, using the same process as in Configure a Stroom cluster.

PersistentVolumeClaim deletion

When a Stroom node is shut down, by default its PersistentVolumeClaim will remain. This ensures it gets re-assigned the same PersistentVolume when it starts up again.

This behaviour should satisfy most use cases. However, the Operator may be configured to delete the PVC in certain situations, by specifying StroomCluster.spec.volumeClaimDeletePolicy:

  1. DeleteOnScaledownOnly deletes a node’s PVC only when the NodeSet is scaled down and, as a result, the node’s Pod is no longer part of the NodeSet.
  2. DeleteOnScaledownAndClusterDeletion deletes the PVC whenever the node’s Pod is removed.
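For example, to have PVCs removed only on scale-down, the policy could be set as in this sketch (other StroomCluster fields omitted):

```yaml
apiVersion: stroom.gchq.github.io/v1
kind: StroomCluster
...
spec:
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
...
```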

Next steps

Removing the Stroom K8s Operator

9.9 - Restart Node

How to restart a Stroom node.

Stroom nodes may occasionally hang or become unresponsive. In these situations, it may be necessary to terminate the Pod.

After you identify the unresponsive Pod (e.g. by finding a node not responding to cluster ping):

kubectl delete pod -n <Stroom cluster namespace> <pod name>

This will attempt to drain tasks from the node. After the termination grace period has elapsed, the Pod will be killed and a new one will automatically be created to take its place. Once the new Pod finishes starting up, if it is functioning correctly it should begin responding to cluster ping.

Force deletion

If waiting for the grace period to elapse is unacceptable and you are willing to risk shutting down the node without draining it first (or you are sure it has no active tasks), you can force delete the Pod using the procedure outlined in the Kubernetes documentation:

kubectl delete pod -n <Stroom cluster namespace> <pod name> --grace-period=0 --force