Stroom has many public REST APIs to allow other systems to interact with Stroom. Everything that can be done via the user interface can also be done using the API.

All methods on the API will are authenticated and authorised, so the permissions will be exactly the same as if the API user is using the Stroom user interface directly.

1.1 - API Specification

Details of the API specifcation and how to find what API endpoints are available.

Swagger UI

The APIs are available as a Swagger Open API specification in the following forms:

JSON - stroom.json
YAML - stroom.yaml

A dynamic Swagger user interface is also available for viewing all the API endpoints with details of parameters and data types. This can be found in two places.

Published on GitHub for each minor version Swagger user interface .
Published on a running stroom instance at the path /stroom/noauth/swagger-ui.

API Endpoints in Application Logs

The API methods are also all listed in the application logs when Stroom first boots up, e.g.

INFO  2023-01-17T11:09:30.244Z main i.d.j.DropwizardResourceConfig The following paths were found for the configured resources:

    GET     /api/account/v1/ (stroom.security.identity.account.AccountResourceImpl)
    POST    /api/account/v1/ (stroom.security.identity.account.AccountResourceImpl)
    POST    /api/account/v1/search (stroom.security.identity.account.AccountResourceImpl)
    DELETE  /api/account/v1/{id} (stroom.security.identity.account.AccountResourceImpl)
    GET     /api/account/v1/{id} (stroom.security.identity.account.AccountResourceImpl)
    PUT     /api/account/v1/{id} (stroom.security.identity.account.AccountResourceImpl)
    GET     /api/activity/v1 (stroom.activity.impl.ActivityResourceImpl)
    POST    /api/activity/v1 (stroom.activity.impl.ActivityResourceImpl)
    POST    /api/activity/v1/acknowledge (stroom.activity.impl.ActivityResourceImpl)
    GET     /api/activity/v1/current (stroom.activity.impl.ActivityResourceImpl)
    ...

You will also see entries in the logs for the various servlets exposed by Stroom, e.g.

INFO  ... main s.d.common.Servlets            Adding servlets to application path/port: 
INFO  ... main s.d.common.Servlets            	stroom.core.servlet.DashboardServlet          => /stroom/dashboard 
INFO  ... main s.d.common.Servlets            	stroom.core.servlet.DynamicCSSServlet         => /stroom/dynamic.css 
INFO  ... main s.d.common.Servlets            	stroom.data.store.impl.ImportFileServlet      => /stroom/importfile.rpc 
INFO  ... main s.d.common.Servlets            	stroom.receive.common.ReceiveDataServlet      => /stroom/noauth/datafeed 
INFO  ... main s.d.common.Servlets            	stroom.receive.common.ReceiveDataServlet      => /stroom/noauth/datafeed/* 
INFO  ... main s.d.common.Servlets            	stroom.receive.common.DebugServlet            => /stroom/noauth/debug 
INFO  ... main s.d.common.Servlets            	stroom.data.store.impl.fs.EchoServlet         => /stroom/noauth/echo 
INFO  ... main s.d.common.Servlets            	stroom.receive.common.RemoteFeedServiceRPC    => /stroom/noauth/remoting/remotefeedservice.rpc 
INFO  ... main s.d.common.Servlets            	stroom.core.servlet.StatusServlet             => /stroom/noauth/status 
INFO  ... main s.d.common.Servlets            	stroom.core.servlet.SwaggerUiServlet          => /stroom/noauth/swagger-ui 
INFO  ... main s.d.common.Servlets            	stroom.resource.impl.SessionResourceStoreImpl => /stroom/resourcestore/* 
INFO  ... main s.d.common.Servlets            	stroom.dashboard.impl.script.ScriptServlet    => /stroom/script 
INFO  ... main s.d.common.Servlets            	stroom.security.impl.SessionListServlet       => /stroom/sessionList 
INFO  ... main s.d.common.Servlets            	stroom.core.servlet.StroomServlet             => /stroom/ui

1.2 - Calling an API

How to call a method on the Stroom API using curl.

Authentication

In order to use the API endpoints you will need to authenticate. Authentication is achieved using an API Key or Token .

You will either need to create an API key for your personal Stroom user account or for a shared processing user account. Whichever user account you use it will need to have the necessary permissions for each API endpoint it is to be used with.

To create an API key (token) for a user:

In the top menu, select:

Click Create.
Enter a suitable expiration date. Short expiry periods are more secure in case the key is compromised.
Select the user account that you are creating the key for.
Click OK
Select the newly created API Key from the list of keys and double click it to open it.
Click Copy Key to copy the key to the clipboard.

To make an authenticated API call you need to provide a header of the form Authorization:Bearer ${TOKEN}, where ${TOKEN} is your API Key as copied from Stroom.

Calling an API method with `curl`

This section describes how to call an API method using the command line tool curl as an example client. Other clients can be used, e.g. using python, but these examples should provide enough help to get started using another client.

HTTP Requests Without a Body

Typically HTTP GET requests will have no body/payload Often PUT and DELETE requests will also have no body/payload.

The following is an example of how to call an HTTP GET method (i.e. a method that does not require a request body) on the API using curl.

TOKEN='API KEY GOES IN HERE' \
curl \
  --silent \
  --insecure \
  --header "Authorization:Bearer ${TOKEN}" \
  https://stroom-fqdn/api/node/v1/info/node1a
(out){"discoverTime":"2022-02-16T17:28:37.710Z","buildInfo":{"buildDate":"2022-01-19T15:27:25.024677714Z","buildVersion":"7.0-beta.175","upDate":"2022-02-16T09:28:11.733Z"},"nodeName":"node1a","endpointUrl":"http://192.168.1.64:8080","itemList":[{"nodeName":"node1a","active":true,"master":true}],"ping":2}

Warning

The --insecure argument is used in this example which means certificate verification will not take place. It is recommended not to use this argument and instead supply curl with client and certificate authority certificates to make a secure connection.

TOKEN='API KEY GOES IN HERE' \
curl \
  --silent \
  --cert /path/to/client-cert \
  --key /path/to/client-key \
  --cacert /path/to/ca-cert \
  --header "Authorization:Bearer ${TOKEN}" \
  https://stroom-fqdn/api/some/path

You can either call the API via Nginx (or similar reverse proxy) at https://stroom-fddn/api/some/path or if you are making the call from one of the stroom hosts you can go direct using http://localhost:8080/api/some/path. The former is preferred as it is more secure.

Requests With a Body

A lot of the API methods in Stroom require complex bodies/payloads for the request. The following example is an HTTP POST to perform a reference data lookup on the local host.

Create a file req.json containing:

{
  "mapName": "USER_ID_TO_STAFF_NO_MAP",
  "effectiveTime": "2024-12-02T08:37:02.772Z",
  "key": "user2",
  "referenceLoaders": [
    {
      "loaderPipeline" : {
        "name" : "Reference Loader",
        "uuid" : "da1c7351-086f-493b-866a-b42dbe990700",
        "type" : "Pipeline"
      },
      "referenceFeed" : {
        "name": "STAFF-NO-REFERENCE",
        "uuid": "350003fe-2b6c-4c57-95ed-2e6018c5b3d5",
        "type" : "Feed"
      }
    }
  ]
}

Now send the request with curl.

TOKEN='API KEY GOES IN HERE' \
curl \
  --json @req.json \
  --request POST \
  --header "Authorization:Bearer ${TOKEN}" \
  http://localhost:8080/api/refData/v1/lookup
(out)staff2

This API method returns plain text or XML depending on the reference data value.

Note

This assumes you are using curl version 7.82.0 or later that supports the --json argument. If not you will need to replace --json with --data and add these arguments:

--header "Content-Type: application/json"
--header "Accept: application/json"

Handling JSON

jq is a utility for processing JSON and is very useful when using the API methods.

For example to get just the build version from the node info endpoint:

TOKEN='API KEY GOES IN HERE' \
curl \
    --silent \
    --insecure \
    --header "Authorization:Bearer ${TOKEN}" \
    https://localhost/api/node/v1/info/node1a \
  | jq -r '.buildInfo.buildVersion'
(out)7.0-beta.175

1.3 - Query APIs

The APIs to allow other systems to query the data held in Stroom.

The Query APIs use common request/response models and end points for querying each type of data source held in Stroom. The request/response models are defined in stroom-query .

Currently Stroom exposes a set of query endpoints for the following data source types. Each data source type will have its own endpoint due to differences in the way the data is queried and the restrictions imposed on the query terms. However they all share the same API definition.

stroom-index Queries - The Lucene based search indexes.
Sql Statistics Query - Stroom’s SQL Statistics store.
Searchable - Searchables are various data sources that allow you to search the internals of Stroom, e.g. local reference data store, annotations, processor tasks, etc.

The detailed documentation for the request/responses is contained in the Swagger definition linked to above.

Common endpoints

The standard query endpoints are

/datasource
/destroy
/keepAlive
/search

Datasource

The Data Source endpoint is used to query Stroom for the details of a data source with a given DocRef . The details will include such things as the fields available and any restrictions on querying the data.

Search

The search endpoint is used to initiate a search against a data source or to request more data for an active search. A search request can be made using iterative mode, where it will perform the search and then only return the data it has immediately available. Subsequent requests for the same queryKey will also return the data immediately available, expecting that more results will have been found by the query. Requesting a search in non-iterative mode will result in the response being returned when the query has completed and all known results have been found.

The SearchRequest model is fairly complicated and contains not only the query terms but also a definition of how the data should be returned. A single SearchRequest can include multiple ResultRequest sections to return the queried data in multiple ways, e.g. as flat data and in an alternative aggregated form.

Stroom as a query builder

Stroom is able to export the json form of a SearchRequest model from its dashboards. This makes the dashboard a useful tool for building a query and the table settings to go with it. You can use the dashboard to defined the data source, define the query terms tree and build a table definition (or definitions) to describe how the data should be returned. The, clicking the download icon on the query pane of the dashboard will generate the SearchRequest json which can be immediately used with the /search API or modified to suit.

Destroy

This endpoint is used to kill an active query by supplying the queryKey for query in question.

Keep alive

Stroom will only hold search results from completed queries for a certain lenght of time. It will also terminate running queries that are too old. In order to prevent queries being aged off you can hit this endpoint to indicate to Stroom that you still have an interest in a particular query by supplying the query key.

1.4 - Export Content API

An API method for exporting all Stroom content to a zip file.

Stroom has API methods for exporting content in Stroom to a single zip file.

Export All - `/api/export/v1`

This method will export all content in Stroom to a single zip file. This is useful as an alternative backup of the content or where you need to export the content for import into another Stroom instance.

In order to perform a full export, the user (identified by their API Key) performing the export will need to ensure the following:

Have created an API Key
The system property stroom.export.enabled is set to true.
The user has the application permission Export Configuration or Administrator.

Only those items that the user has Read permission on will be exported, so to export all items, the user performing the export will need Read permission on all items or have the Administrator application permission.

Performing an Export

To export all readable content to a file called export.zip do something like the following:

TOKEN="API KEY GOES IN HERE"
curl \
  --silent \
  --request GET \
  --header "Authorization:Bearer ${TOKEN}" \
  --output export.zip \
  https://stroom-fqdn/api/export/v1/

Note

If you encounter problems then replace --silent with --verbose to get more information.

Export Zip Format

The export zip will contain a number of files for each document exported. The number and type of these files will depend on the type of document, however every document will have the following two file types:

.node - This file represents the document’s location in the explorer tree along with its name and UUID.
.meta - This is the metadata for the document independent of the explorer tree. It contains the name, type and UUID of the document along with the unique identifier for the version of the document.

Documents may also have files like these (a non-exhaustive list):

.json - JSON data holding the content of the document, as used for Dashboards.
.txt - Plain text data holding the content of the document, as used for Dictionaries.
.xml - XML data holding the content of the document, as used for Pipelines.
.xsd - XML Schema content.
.xsl - XSLT content.

The following is an example of the content of an export zip file:

TEST_FEED_CERT.Feed.fcee4270-a479-4cc0-a79c-0e8f18a4bad8.meta
TEST_FEED_CERT.Feed.fcee4270-a479-4cc0-a79c-0e8f18a4bad8.node
TEST_FEED_PROXY.Feed.f06d4416-8b0e-4774-94a9-729adc5633aa.meta
TEST_FEED_PROXY.Feed.f06d4416-8b0e-4774-94a9-729adc5633aa.node
TEST_REFERENCE_DATA_EVENTS_XXX.XSLT.4f74999e-9d69-47c7-97f7-5e88cc7459f7.meta
TEST_REFERENCE_DATA_EVENTS_XXX.XSLT.4f74999e-9d69-47c7-97f7-5e88cc7459f7.xsl
TEST_REFERENCE_DATA_EVENTS_XXX.XSLT.4f74999e-9d69-47c7-97f7-5e88cc7459f7.node
Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.xml
Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.meta
Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.node

Filenames

When documents are added to the zip, they are added with a directory structure that mirrors the explorer tree.

The filenames are of the form:

<name>.<type>.<UUID>.<extension>

As Stroom allows characters in document and folder names that would not be supported in operating system paths (or cause confusion), some characters in the name/directory parts are replaced by _ to avoid this. e.g. Dashboard 01/02/2020 would become Dashboard_01_02_2020.

If you need to see the contents of the zip as if viewing it within Stroom you can run this bash script in the root of the extracted zip.

#!/usr/bin/env bash

shopt -s globstar
for node_file in **/*.node; do
  name=
  name="$(grep -o -P "(?<=name=).*" "${node_file}" )"
  path=
  path="$(grep -o -P "(?<=path=).*" "${node_file}" )"

  echo "./${path}/${name}   (./${node_file})"
done

This will output something like:

./Standard Pipelines/Json/Events to JSON   (./Standard_Pipelines/Json/Events_to_JSON.XSLT.1c3d42c2-f512-423f-aa6a-050c5cad7c0f.node)
./Standard Pipelines/Json/JSON Extraction   (./Standard_Pipelines/Json/JSON_Extraction.Pipeline.13143179-b494-4146-ac4b-9a6010cada89.node)
./Standard Pipelines/Json/JSON Search Extraction   (./Standard_Pipelines/Json/JSON_Search_Extraction.XSLT.a8c1aa77-fb90-461a-a121-d4d87d2ff072.node)
./Standard Pipelines/Reference Loader   (./Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.node)

1.5 - Reference Data API

How to perform reference data loads and lookups using the API.

The reference data store has an API to allow other systems to access the reference data store.

`/api/refData/v1/lookup`

The /lookup endpoint requires the caller to provide details of the reference feed and loader pipeline so if the effective stream is not in the store it can be loaded prior to performing the lookup. It is useful for forcing a reference load into the store and for performing ad-hoc lookups.

Note

As reference data stores are local to a node, it is best to send the request to a node that does processing as it is more likely to have already loaded the data. If you send it to a UI node that does not do processing, it is likely to trigger a load as the data will not be there.

Below is an example of a lookup request file req.json.

{
  "mapName": "USER_ID_TO_LOCATION",
  "effectiveTime": "2020-12-02T08:37:02.772Z",
  "key": "jbloggs",
  "referenceLoaders": [
    {
      "loaderPipeline" : {
        "name" : "Reference Loader",
        "uuid" : "da1c7351-086f-493b-866a-b42dbe990700",
        "type" : "Pipeline"
      },
      "referenceFeed" : {
        "name": "USER_ID_TOLOCATION-REFERENCE",
        "uuid": "60f9f51d-e5d6-41f5-86b9-ae866b8c9fa3",
        "type" : "Feed"
      }
    }
  ]
}

This is an example of how to perform the lookup on the local host.

curl \
  --json @req.json \
  --request POST \
  --header "Authorization:Bearer ${TOKEN}" \
  http://localhost:8080/api/refData/v1/lookup
(out)staff2

1.6 - Explorer API

The API for managing the folders and documents in the explorer tree.

Creating a New Document

The explorer API is responsible for creation of all document types. The explorer API is used to create the initial skeleton of a document then the API specific to the document type in question is used to update the document skeleton with additional settings/content.

This is an example request file req.json:

{
  "docType": "Feed",
  "docName": "MY_FEED",
  "destinationFolder": {
    "type": "Folder",
    "uuid": "3dfab6a2-dbd5-46ee-b6e9-6df45f90cd85",
    "name": "My Folder",
    "rootNodeUuid": "0"
  },
  "permissionInheritance": "DESTINATION"
}

You need to set the following properties in the JSON:

docType - The type of the document being created, see Documents.
docName - The name of the new document.
destinationFolder.uuid - The UUID of the destination folder (or 0 if the document is being created in the System root.
rootNodeUuid - This is always 0 for the System root.

To create the skeleton document run the following:

curl \
  -s \
  -X POST \
  -H "Authorization:Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  --data @"req.json" \
  http://localhost:8080/api/explorer/v2/create/ \
| jq -r '.uuid'

This will create the document and return its new UUID to stdout.

1.7 - Feed API

The API for fetching and updating feeds.

Creating a Feed

In order to create a Feed you must first create the skeleton document using the Explorer API.

Updating a Feed

To modify a feed you must first fetch the existing Feed document. This is done as follows:

curl \
  -s \
  -H "Authorization:Bearer ${TOKEN}" \
  "http://localhost:8080/api/feed/v1/${feed_uuid}" \

Where ${feed_uuid} is the UUID of the feed in question.

This will return the Feed document JSON.

{
  "type": "Feed",
  "uuid": "0dafc9c2-dcd8-4bb6-88ce-5ee228babe78",
  "name": "MY_FEED",
  "version": "32dae12f-a696-4e0e-8acb-47cf0ad3c77f",
  "createTimeMs": 1718103980225,
  "updateTimeMs": 1718103980225,
  "createUser": "admin",
  "updateUser": "admin",
  "reference": false,
  "streamType": "Raw Events",
  "status": "RECEIVE"
}

You can use jq to modify this JSON to add/change any of the document settings.

Example Script

The following is an example bash script for creating and modifying multiple Feeds. It requires curl and jq to run.

#!/usr/bin/env bash

set -e -o pipefail

main() {

  # Your API key
  local TOKEN="sak_d5752a32b2_mv1JYUYUuvRUDpikW75G5w4kQUq7EEjShQ9DiRjN14yEFonKTW42KbeQogui52gTjq9RDRufNEz2MXt1PRCThudzHU5RVpLMbZKThCgyyEX2y2sBrk31rYMJRKNg2yMG"
  # UUID of the dest folder
  local FOLDER_UUID="fc617580-8cf0-4ac3-93dd-93604603aef0"

  local feed_name
  local create_feed_req
  local feed_uuid
  local feed_doc

  for i in {1..2}; do
    # Use date to make a unique name for the test
    feed_name="MY_FEED_$(date +%s)_${i}"

    # Set the feed name and its destination
    create_feed_req=$(cat <<-END
      {
        "docType": "Feed",
        "docName": "${feed_name}",
        "destinationFolder": {
          "type": "Folder",
          "uuid": "${FOLDER_UUID}",
          "rootNodeUuid": "0"
        },
        "permissionInheritance": "DESTINATION"
      }
END
    )

    # Create the skeleton feed and extract its new UUID from the response
    feed_uuid=$( \
      curl \
        -s \
        -X POST \
        -H "Authorization:Bearer ${TOKEN}" \
        -H 'Content-Type: application/json' \
        --data "${create_feed_req}" \
        http://localhost:8080/api/explorer/v2/create/ \
      | jq -r '.uuid'
    )

    echo "Created feed $i with name '${feed_name}' and UUID '${feed_uuid}'"

    # Fetch the created feed
    feed_doc=$( \
      curl \
        -s \
        -H "Authorization:Bearer ${TOKEN}" \
        "http://localhost:8080/api/feed/v1/${feed_uuid}" \
      )

    echo -e "Skeleton Feed doc for '${feed_name}'\n$(jq '.' <<< "${feed_doc}")"

    # Add/modify propeties on the feed doc
    feed_doc=$(jq '
      .classification="HUSH HUSH" 
      | .encoding="UTF8" 
      | .contextEncoding="ASCII" 
      | .streamType="Events"
      | .volumeGroup="Default Volume Group"' <<< "${feed_doc}")

    #echo -e "Updated feed doc for '${feed_name}'\n$(jq '.' <<< "${feed_doc}")"

    # Update the feed with the new properties
    curl \
      -s \
      -X PUT \
      -H "Authorization:Bearer ${TOKEN}" \
      -H 'Content-Type: application/json' \
      --data "${feed_doc}" \
      "http://localhost:8080/api/feed/v1/${feed_uuid}" \
    > /dev/null

    # Fetch the created feed
    feed_doc=$( \
      curl \
        -s \
        -H "Authorization:Bearer ${TOKEN}" \
        "http://localhost:8080/api/feed/v1/${feed_uuid}" \
      )

    echo -e "Updated Feed doc for '${feed_name}'\n$(jq '.' <<< "${feed_doc}")"
    echo
  done
}

main "$@"

2 - Background Jobs

Managing background jobs.

There are various jobs that run in the background within Stroom. Among these are jobs that control pipeline processing, removing old files from the file system, checking the status of nodes and volumes etc. Each job executes at a different time depending on the purpose of the job. There are three ways that a job can be executed:

Cron scheduled jobs execute periodically according to a cron schedule. These include jobs such as cleaning the file system where Stroom only needs to perform this action once a day and can do so overnight.
Frequency controlled jobs are executed every X seconds, minutes, hours etc. Most of the jobs that execute with a given frequency are status checking jobs that perform a short lived action fairly frequently.
Distributed jobs are only applicable to stream processing with a pipeline. Distributed jobs are executed by a worker node as soon as a worker has available threads to execute a jobs and the task distributor has work available.

A list of job types and their execution method can be seen by opening Jobs from the main menu.

Each job can be enabled/disabled at the job level. If you click on a job you will see an entry for each Stroom node in the lower pane. The job can be enabled/disabled at the node level for fine grained control of which nodes are running which jobs.

For a full list of all the jobs and details of what each one does, see the Job reference.

2.1 - Scheduler

How background jobs are scheduled.

Stroom has two main types of schedule, a simple frequency schedule that runs the job at a fixed time interval or a more complex cron schedule.

Note

This scheduler and its syntax are also used for Analytic Rules .

Frequency Schedules

A frequency schedule is expressed as a fixed time interval. The frequency schedule expression syntax is stroom’s standard duration syntax and takes the form of a value followed by an optional unit suffix, e.g. 10m for ten minutes.

Prefix	Time Unit
milliseconds
`ms`	milliseconds
`s`	seconds
`m`	minutes
`h`	hours
`d`	days

Cron Schedules

cron is a syntax for expressing schedules.

For full details of cron expressions see Cron Syntax

Stroom uses a scheduler called Quartz which supports cron expressions for scheduling.

3 - Concepts

Describes a number of core concepts involved in using Stroom.

3.1 - Streams

The unit of data that Stroom operates on, essentially a bounded stream of data.

Streams can either be created when data is directly POSTed in to Stroom or during the proxy aggregation process. When data is directly POSTed to Stroom the content of the POST will be stored as one Stream. With proxy aggregation multiple files in the proxy repository will/can be aggregated together into a single Stream.

Anatomy of a Stream

A Stream is made up of a number of parts of which the raw or cooked data is just one. In addition to the data the Stream can contain a number of other child stream types, e.g. Context and Meta Data.

The hierarchy of a stream is as follows:

Stream nnn
- Part [1 to *]
  - Data [1-1]
  - Context [0-1]
  - Meta Data [0-1]

Although all streams conform to the above hierarchy there are three main types of Stream that are used in Stroom:

Non-segmented Stream - Raw events, Raw Reference
Segmented Stream - Events, Reference
Segmented Error Stream - Error

Segmented means that the data has been demarcated into segments or records.

Child Stream Types

Data

This is the actual data of the stream, e.g. the XML events, raw CSV, JSON, etc.

Context

This is additional contextual data that can be sent with the data. Context data can be used for reference data lookups.

Meta Data

This is the data about the Stream (e.g. the feed name, receipt time, user agent, etc.). This meta data either comes from the HTTP headers when the data was POSTed to Stroom or is added by Stroom or Stroom-Proxy on receipt/processing.

Non-Segmented Stream

The following is a representation of a non-segmented stream with three parts, each with Meta Data and Context child streams.

Raw Events and Raw Reference streams contain non-segmented data, e.g. a large batch of CSV, JSON, XML, etc. data. There is no notion of a record/event/segment in the data, it is simply data in any form (including malformed data) that is yet to be processed and demarcated into records/events, for example using a Data Splitter or an XML parser.

The Stream may be single-part or multi-part depending on how it is received. If it is the product of proxy aggregation then it is likely to be multi-part. Each part will have its own context and meta data child streams, if applicable.

Segmented Stream

The following is a representation of a segmented stream that contains three records (i.e events) and the Meta Data.

Cooked Events and Reference data are forms of segmented data. The raw data has been parsed and split into records/events and the resulting data is stored in a way that allows Stroom to know where each record/event starts/ends. These streams only have a single part.

Error Stream

Error streams are similar to segmented Event/Reference streams in that they are single-part and have demarcated records (where each error/warning/info message is a record). Error streams do not have any Meta Data or Context child streams.

4 - Data Retention

Controlling the purging/retention of old data.

By default Stroom will retain all the data it ingests and creates forever. It is likely that storage constraints/costs will mean that data needs to be deleted after a certain time. It is also likely that certain types of data may need to be kept for longer than other types.

Rules

Stroom allows for a set of data retention policy rules to be created to control at a fine grained level what data is deleted and what is retained.

The data retention rules are accessible by selecting Data Retention from the Tools menu. On first use the Rules tab of the Data Retention screen will show a single rule named Default Retain All Forever Rule. This is the implicit rule in stroom that retains all data and is always in play unless another rule overrides it. This rule cannot be edited, moved or removed.

Rule Precedence

Rules have a precedence, with a lower rule number being a higher priority. When running the data retention job, Stroom will look at each stream held on the system and the retention policy of the first rule (starting from the lowest numbered rule) that matches that stream will apply. One a matching rule is found all other rules with higher rule numbers (lower priority) are ignored. For example if rule 1 says to retain streams from feed X-EVENTS for 10 years and rule 2 says to retain streams from feeds *-EVENTS for 1 year then rule 1 would apply to streams from feed X-EVENTS and they would be kept for 10 years, but rule 2 would apply to feed Y-EVENTS and they would only be kept for 1 year. Rules are re-numbered as new rules are added/deleted/moved.

Creating a Rule

To create a rule do the following:

Click the icon to add a new rule.
Edit the expression to define the data that the rule will match on.
Provide a name for the rule to help describe what its purpose is.
Set the retention period for data matching this rule, i.e Forever or a set time period.

The new rule will be added at the top of the list of rules, i.e. with the highest priority. The and icons can be used to change the priority of the rule.

Rules can be enabled/disabled by clicking the checkbox next to the rule.

Changes to rules will not take effect until the icon is clicked.

Rules can also be deleted ( ) and copied ( ).

Impact Summary

When you have a number of complex rules it can be difficult to determine what data will actually be deleted next time the Policy Based Data Retention job runs. To help with this, Stroom has the Impact Summary tab that acts as a dry run for the active rules. The impact summary provides a count of the number of streams that will be deleted broken down by rule, stream type and feed name. On large systems with lots of data or complex rules, this query may take a long time to run.

The impact summary operates on the current state of the rules on the Rules tab whether saved or un-saved. This allows you to make a change to the rules and test its impact before saving it.

5 - Data Splitter

Data Splitter was created to transform text into XML. The XML produced is basic but can be processed further with XSLT to form any desired XML output.

Data Splitter works by using regular expressions to match a region of content or tokenizers to split content. The whole match or match group can then be output or passed to other expressions to further divide the matched data.

The root <dataSplitter> element controls the way content is read and buffered from the source. It then passes this content on to one or more child expressions that attempt to match the content. The child expressions attempt to match content one at a time in the order they are specified until one matches. The matching expression then passes the content that it has matched to other elements that either emit XML or apply other expressions to the content matched by the parent.

This process of content supply, match, (supply, match)*, emit is best illustrated in a simple CSV example. Note that the elements and attributes used in all examples are explained in detail in the element reference.

5.1 - Simple CSV Example

The following CSV data will be split up into separate fields using Data Splitter.

01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,

The first thing we need to do is match each record. Each record in a CSV file is delimited by a new line character. The following configuration will split the data into records using ‘\n’ as a delimiter:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  
  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n"/>

</dataSplitter>

In the above example the ‘split’ tokenizer matches all of the supplied content up to the end of each line ready to pass each line of content on for further treatment.

We can now add a <group> element within <split> to take content matched by the tokenizer.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">

    <!-- Take the matched line (using group 1 ignores the delimiters, 
    without this each match would include the new line character) -->
    <group value="$1">

    </group>
  </split>
</dataSplitter>

The <group> within the <split> chooses to take the content from the <split> without including the new line ‘\n’ delimiter by using match group 1, see expression match references for details.

01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,

The content selected by the <group> from its parent match can then be passed onto sub expressions for further matching:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">

    <!-- Take the matched line (using group 1 ignores the delimiters, 
    without this each match would include the new line character) -->
    <group value="$1">

      <!-- Match each value separated by a comma as the delimiter -->
      <split delimiter=",">

      </split>
    </group>
  </split>
</dataSplitter>

In the above example the additional <split> element within the <group> will match the content provided by the group repeatedly until it has used all of the group content.

The content matched by the inner <split> element can be passed to a <data> element to emit XML content.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">

    <!-- Take the matched line (using group 1 ignores the delimiters, 
    without this each match would include the new line character) -->
    <group value="$1">

      <!-- Match each value separated by a comma as the delimiter -->
      <split delimiter=",">

        <!-- Output the value from group 1 (as above using group 1
        ignores the delimiters, without this each value would include
        the comma) -->
        <data value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>

In the above example each match from the inner <split> is made available to the inner <data> element that chooses to output content from match group 1, see expression match references for details.

The above configuration results in the following XML output for the whole input:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data value="01/01/2010" />
    <data value="00:00:00" />
    <data value="192.168.1.100" />
    <data value="SOMEHOST.SOMEWHERE.COM" />
    <data value="user1" />
    <data value="logon" />
  </record>
  <record>
    <data value="01/01/2010" />
    <data value="00:01:00" />
    <data value="192.168.1.100" />
    <data value="SOMEHOST.SOMEWHERE.COM" />
    <data value="user1" />
    <data value="create" />
    <data value="c:\test.txt" />
  </record>
  <record>
    <data value="01/01/2010" />
    <data value="00:02:00" />
    <data value="192.168.1.100" />
    <data value="SOMEHOST.SOMEWHERE.COM" />
    <data value="user1" />
    <data value="logoff" />
  </record>
</records>

5.2 - Simple CSV example with heading

In addition to referencing content produced by a parent element it is often desirable to store content and reference it later. The following example of a CSV with a heading demonstrates how content can be stored in a variable and then referenced later on.

Input

This example will use a similar input to the one in the previous CSV example but also adds a heading line.

Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,

Configuration

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match heading line (note that maxMatch="1" means that only the
  first line will be matched by this splitter) -->
  <split delimiter="\n" maxMatch="1">

    <!-- Store each heading in a named list -->
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </split>

  <!-- Match each record -->
  <split delimiter="\n">

    <!-- Take the matched line -->
    <group value="$1">

      <!-- Split the line up -->
      <split delimiter=",">

        <!-- Output the stored heading for each iteration and the value
        from group 1 -->
        <data name="$heading$1" value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>

Output

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="Date" value="01/01/2010" />
    <data name="Time" value="00:00:00" />
    <data name="IPAddress" value="192.168.1.100" />
    <data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
    <data name="User" value="user1" />
    <data name="EventType" value="logon" />
  </record>
  <record>
    <data name="Date" value="01/01/2010" />
    <data name="Time" value="00:01:00" />
    <data name="IPAddress" value="192.168.1.100" />
    <data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
    <data name="User" value="user1" />
    <data name="EventType" value="create" />
    <data name="Detail" value="c:\test.txt" />
  </record>
  <record>
    <data name="Date" value="01/01/2010" />
    <data name="Time" value="00:02:00" />
    <data name="IPAdress" value="192.168.1.100" />
    <data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
    <data name="User" value="user1" />
    <data name="EventType" value="logoff" />
  </record>
</records>

5.3 - Complex example with regex and user defined names

The following example uses a real world Apache log and demonstrates the use of regular expressions rather than simple ‘split’ tokenizers. The usage and structure of regular expressions is outside of the scope of this document but Data Splitter uses Java’s standard regular expression library that is POSIX compliant and documented in numerous places.

This example also demonstrates that the names and values that are output can be hard coded in the absence of field name information to make XSLT conversion easier later on. Also shown is that any match can be divided into further fields with additional expressions and the ability to nest data elements to provide structure if needed.

Input

192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /doc.htm HTTP/1.1" 200 4235 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /default.css HTTP/1.1" 200 3494 "http://some.server:8080/doc.htm" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

Configuration

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!--
  Standard Apache Format

  %h - host name should be ok without quotes
  %l - Remote logname (from identd, if supplied). This will return a dash unless IdentityCheck is set On.
  \"%u\" - user name should be quoted to deal with DNs
  %t - time is added in square brackets so is contained for parsing purposes
  \"%r\" - URL is quoted
  %>s - Response code doesn't need to be quoted as it is a single number
  %b - The size in bytes of the response sent to the client
  \"%{Referer}i\" - Referrer is quoted so that’s ok
  \"%{User-Agent}i\" - User agent is quoted so also ok

  LogFormat "%h %l \"%u\" %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
  -->

  <!-- Match line -->
  <split delimiter="\n">
    <group value="$1">

      <!-- Provide a regular expression for the whole line with match
      groups for each field we want to split out -->
      <regex pattern="^([^ ]+) ([^ ]+) &#34;([^&#34;]+)&#34; \[([^\]]+)] &#34;([^&#34;]+)&#34; ([^ ]+) ([^ ]+) &#34;([^&#34;]+)&#34; &#34;([^&#34;]+)&#34;">
        <data name="host" value="$1" />
        <data name="log" value="$2" />
        <data name="user" value="$3" />
        <data name="time" value="$4" />
        <data name="url" value="$5">

          <!-- Take the 5th regular expression group and pass it to
          another expression to divide into smaller components -->
          <group value="$5">
            <regex pattern="^([^ ]+) ([^ ]+) ([^ /]*)/([^ ]*)">
              <data name="httpMethod" value="$1" />
              <data name="url" value="$2" />
              <data name="protocol" value="$3" />
              <data name="version" value="$4" />
            </regex>
          </group>
        </data>
        <data name="response" value="$6" />
        <data name="size" value="$7" />
        <data name="referrer" value="$8" />
        <data name="userAgent" value="$9" />
      </regex>
    </group>
  </split>
</dataSplitter>

Output

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="host" value="192.168.1.100" />
    <data name="log" value="-" />
    <data name="user" value="-" />
    <data name="time" value="12/Jul/2012:11:57:07 +0000" />
    <data name="url" value="GET /doc.htm HTTP/1.1">
      <data name="httpMethod" value="GET" />
      <data name="url" value="/doc.htm" />
      <data name="protocol" value="HTTP" />
      <data name="version" value="1.1" />
    </data>
    <data name="response" value="200" />
    <data name="size" value="4235" />
    <data name="referrer" value="-" />
    <data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
  </record>
  <record>
    <data name="host" value="192.168.1.100" />
    <data name="log" value="-" />
    <data name="user" value="-" />
    <data name="time" value="12/Jul/2012:11:57:07 +0000" />
    <data name="url" value="GET /default.css HTTP/1.1">
      <data name="httpMethod" value="GET" />
      <data name="url" value="/default.css" />
      <data name="protocol" value="HTTP" />
      <data name="version" value="1.1" />
    </data>
    <data name="response" value="200" />
    <data name="size" value="3494" />
    <data name="referrer" value="http://some.server:8080/doc.htm" />
    <data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
  </record>
</records>

5.4 - Multi Line Example

Example multi line file where records are split over may lines. There are various ways this data could be treated but this example forms a record from data created when some fictitious query starts plus the subsequent query results.

Input

09/07/2016    14:49:36    User = user1
09/07/2016    14:49:36    Query = some query

09/07/2016    16:34:40    Results:
09/07/2016    16:34:40    Line 1:   result1
09/07/2016    16:34:40    Line 2:   result2
09/07/2016    16:34:40    Line 3:   result3
09/07/2016    16:34:40    Line 4:   result4

09/07/2009    16:35:21    User = user2
09/07/2009    16:35:21    Query = some other query

09/07/2009    16:45:36    Results:
09/07/2009    16:45:36    Line 1:   result1
09/07/2009    16:45:36    Line 2:   result2
09/07/2009    16:45:36    Line 3:   result3
09/07/2009    16:45:36    Line 4:   result4

Configuration

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each record. We want to treat the query and results as a single event so match the two sets of data separated by a double new line -->
  <regex pattern="\n*((.*\n)+?\n(.*\n)+?\n)|\n*(.*\n?)+">
    <group>

      <!-- Split the record into query and results -->
      <regex pattern="(.*?)\n\n(.*)" dotAll="true">

        <!-- Create a data element to output query data -->
        <data name="query">
          <group value="$1">

            <!-- We only want to output the date and time from the first line. -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
              <data name="date" value="$1" />
              <data name="time" value="$2" />
              <data name="$3" value="$4" />
            </regex>
            
            <!-- Output all other values -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
              <data name="$3" value="$4" />
            </regex>
          </group>
        </data>

        <!-- Create a data element to output result data -->
        <data name="results">
          <group value="$2">

            <!-- We only want to output the date and time from the first line. -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
              <data name="date" value="$1" />
              <data name="time" value="$2" />
              <data name="$3" value="$4" />
            </regex>
            
            <!-- Output all other values -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
              <data name="$3" value="$4" />
            </regex>
          </group>
        </data>
      </regex>
    </group>
  </regex>
</dataSplitter>

Output

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="records:2 file://records-v2.0.xsd"
                version="2.0">
  <record>
    <data name="query">
      <data name="date" value="09/07/2016" />
      <data name="time" value="14:49:36" />
      <data name="User" value="user1" />
      <data name="Query" value="some query" />
    </data>
    <data name="results">
      <data name="date" value="09/07/2016" />
      <data name="time" value="16:34:40" />
      <data name="Results" />
      <data name="Line 1" value="result1" />
      <data name="Line 2" value="result2" />
      <data name="Line 3" value="result3" />
      <data name="Line 4" value="result4" />
    </data>
  </record>
  <record>
    <data name="query">
      <data name="date" value="09/07/2016" />
      <data name="time" value="16:35:21" />
      <data name="User" value="user2" />
      <data name="Query" value="some other query" />
    </data>
    <data name="results">
      <data name="date" value="09/07/2016" />
      <data name="time" value="16:45:36" />
      <data name="Results" />
      <data name="Line 1" value="result1" />
      <data name="Line 2" value="result2" />
      <data name="Line 3" value="result3" />
      <data name="Line 4" value="result4" />
    </data>
  </record>
</records>

5.5 - Element Reference

There are various elements used in a Data Splitter configuration to control behaviour. Each of these elements can be categorised as one of the following:

5.5.1 - Content Providers

Content providers take some content from the input source or elsewhere (see fixed strings and provide it to one or more expressions. Both the root element <dataSplitter> and <group> elements are content providers.

Root element `<dataSplitter>`

The root element of a Data Splitter configuration is <dataSplitter>. It supplies content from the input source to one or more expressions defined within it. The way that content is buffered is controlled by the root element and the way that errors are handled as a result of child expressions not matching all of the content it supplies.

Attributes

The following attributes can be added to the <dataSplitter> root element:

ignoreErrors
bufferSize

`ignoreErrors`

Data Splitter generates errors if not all of the content is matched by the regular expressions beneath the <dataSplitter> or within <group> elements. The error messages are intended to aid the user in writing good Data Splitter configurations. The intent is to indicate when the input data is not being matched fully and therefore possibly skipping some important data. Despite this, in some cases it is laborious to have to write expressions to match all content. In these cases it is preferable to add this attribute to ignore these errors. However it is often better to write expressions that capture all of the supplied content and discard unwanted characters. This attribute also affects errors generated by the use of the minMatch attribute on <regex> which is described later on.

Take the following example input:

Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment

This could be matched with the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex id="heading" pattern=".+" maxMatch="1">
…
  </regex>
  <regex id="body" pattern="\n[^#]+">
…
  </regex>
</dataSplitter>

The above configuration would only match up to a comment for each record line, e.g.

Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment

This may well be the desired functionality but if there was useful content within the comment it would be lost. Because of this Data Splitter warns you when expressions are failing to match all of the content presented so that you can make sure that you aren’t missing anything important. In the above example it is obvious that this is the required behaviour but in more complex cases you might be otherwise unaware that your expressions were losing data.

To maintain this assurance that you are handling all content it is usually best to write expressions to explicitly match all content even though you may do nothing with some matches, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex id="heading" pattern=".+" maxMatch="1">
…
  </regex>
  <regex id="body" pattern="\n([^#]+)#.+">
…
  </regex>
</dataSplitter>

The above example would match all of the content and would therefore not generate warnings. Sub-expressions of ‘body’ could use match group 1 and ignore the comment.

However as previously stated it might often be difficult to write expressions that will just match content that is to be discarded. In these cases ignoreErrors can be used to suppress errors caused by unmatched content.

`bufferSize` (Advanced)

This is an optional attribute used to tune the size of the character buffer used by Data Splitter. The default size is 20000 characters and should be fine for most translations. The minimum value that this can be set to is 20000 characters and the maximum is 1000000000. The only reason to specify this attribute is when individual records are bigger than 10000 characters which is rarely the case.

Group element `<group>`

Groups behave in a similar way to the root element in that they provide content for one or more inner expressions to deal with, e.g.

<group value="$1">
  <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
  ...
  <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
  ...

Attributes

As the <group> element is a content provider it also includes the same ignoreErrors attribute which behaves in the same way. The complete list of attributes for the <group> element is as follows:

id
value
ignoreErrors
matchOrder
reverse

`id`

When Data Splitter reports errors it outputs an XPath to describe the part of the configuration that generated the error, e.g.

DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group>

It is often a little difficult to identify the configuration element that generated the error by looking at the path and the element description, particularly when multiple elements are the same, e.g. many <group> elements without attributes. To make identification easier you can add an ‘id’ attribute to any element in the configuration resulting in error descriptions as follows:

DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group id="myGroupId">

`value`

This attribute determines what content to present to child expressions. By default the entire content matched by a group’s parent expression is passed on by the group to child expressions. If required, content from a specific match group in the parent expression can be passed to child expressions using the value attribute, e.g. value="$1". In addition to this content can be composed in the same way as it is for data names and values.

`ignoreErrors`

This behaves in the same way as for the root element.

`matchOrder`

This is an optional attribute used to control how content is consumed by expression matches. Content can be consumed in sequence or in any order using matchOrder="sequence" or matchOrder="any". If the attribute is not specified, Data Splitter will default to matching in sequence.

When matching in sequence, each match consumes some content and the content position is moved beyond the match ready for the subsequent match. However, in some cases the order of these constructs is not predictable, e.g. we may sometimes be presented with:

Value1=1 Value2=2

… or sometimes with:

Value2=2 Value1=1

Using a sequential match order the following example would work to find both values in Value1=1 Value2=2

<group>
  <regex pattern="Value1=([^ ]*)">
  ...
  <regex pattern="Value2=([^ ]*)">
  ...

… but this example would skip over Value2 and only find the value of Value1 if the input was Value2=2 Value1=1.

To be able to deal with content that contains these constructs in either order we need to change the match order to any.

When matching in any order, each match removes the matched section from the content rather than moving the position past the match so that all remaining content can be matched by subsequent expressions. In the following example the first expression would match and remove Value1=1 from the supplied content and the second expression would be presented with Value2=2 which it could also match.

<group matchOrder="any">
  <regex pattern="Value1=([^ ]*)">
  ...
  <regex pattern="Value2=([^ ]*)">
  ...

If the attribute is omitted by default the match order will be sequential. This is the default behaviour as tokens are most often in sequence and consuming content in this way is more efficient as content does not need to be copied by the parser to chop out sections as is required for matching in any order. It is only necessary to use this feature when fields that are identifiable with a specific match can occur in any order.

`reverse`

Occasionally it is desirable to reverse the content presented by a group to child expressions. This is because it is sometimes easier to form a pattern by matching content in reverse.

Take the following example content of name, value pairs delimited by = but with no spaces between names, multiple spaces between values and only a space between subsequent pairs:

ipAddress=123.123.123.123 zones=Zone 1, Zone 2, Zone 3 location=loc1 A user=An end user serverName=bigserver

We could write a pattern that matches each name value pair by matching up to the start of the next name, e.g.

<regex pattern="([^=]+)=(.+?)( [^=]+=)">

This would match the following:

ipAddress=123.123.123.123 zones=

Here we are capturing the name and value for each pair in separate groups but the pattern has to also match the name from the next name value pair to find the end of the value. By default Data Splitter will move the content buffer to the end of the match ready for subsequent matches so the next name will not be available for matching.

In addition to matching too much content the above example also uses a reluctant qualifier .+?. Use of reluctant qualifiers almost always impacts performance so they are to be avoided if at all possible.

A better way to match the example content is to match the input in reverse, reading characters from right to left.

The following example demonstrates this:

<group reverse="true">
  <regex pattern="([^=]+)=([^ ]+)">
    <data name="$2" value="$1" />
  </regex>
</group>

Using the reverse attribute on the parent group causes content to be supplied to all child expressions in reverse order. In the above example this allows the pattern to match values followed by names which enables us to cope with the fact that values have multiple spaces but names have no spaces.

Content is only presented to child regular expressions in reverse. When referencing values from match groups the content is returned in the correct order, e.g. the above example would return:

<data name="ipAddress" value="123.123.123.123" />
<data name="zones" value="Zone 1, Zone 2, Zone 3" />
<data name="location" value="loc1" />
<data name="user" value="An end user" />
<data name="serverName" value="bigserver" />

The reverse feature isn’t needed very often but there are a few cases where it really helps produce the desired output without the complexity and performance overhead of a reluctant match.

An alternative to using the reverse attribute is to use the original reluctant expression example but tell Data Splitter to make the subsequent name available for the next match by not advancing the content beyond the end of the previous value. This is done by using the advance attribute on the <regex>. However, the reverse attribute represents a better way to solve this particular problem and allows a simpler and more efficient regular expression to be used.

5.5.2 - Expressions

Expressions match some data supplied by a parent content provider. The content matched by an expression depends on the type of expression and how it is configured.

The <split>, <regex> and <all> elements are all expressions and match content as described below.

The `<split>` element

The <split> element directs Data Splitter to break up content using a specified character sequence as a delimiter. In addition to this it is possible to specify characters that are used to escape the delimiter as well as characters that contain or “quote” a value that may include the delimiter sequence but allow it to be ignored.

Attributes

The <split> element has the following attributes:

id
delimiter
escape
containerStart
containerEnd
maxMatch
minMatch
onlyMatch

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

`delimiter`

A required attribute used to specify the character string that will be used as a delimiter to split the supplied content unless it is preceded by an escape character or within a container if specified. Several of the previous examples use this attribute.

`escape`

An optional attribute used to specify a character sequence that is used to escape the delimiter. Many delimited text formats have an escape character that is used to tell any parser that the following delimiter should be ignored, e.g. often a character such as ‘' is used to escape the character that follows it so that it is not treated as a delimiter. When specified this escape sequence also applies to any container characters that may be specified.

`containerStart`

An optional attribute used to specify a character sequence that will make this expression ignore the presence of delimiters until an end container is found. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters up to a delimiter.

If used containerEnd must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.

`containerEnd`

An optional attribute used to specify a character sequence that will make this expression stop ignoring the presence of delimiters if it believes it is currently in a container. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters while ignoring the presence of any delimiter.

If used containerStart must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.

`maxMatch`

An optional attribute used to specify the maximum number of times this expression is allowed to match the supplied content. If you do not supply this attribute then the Data Splitter will keep matching the supplied content until it reaches the end. If specified Data Splitter will stop matching the supplied content when it has matched it the specified number of times.

This attribute is used in the ‘CSV with header line’ example to ensure that only the first line is treated as a header line.

`minMatch`

An optional attribute used to specify the minimum number of times this expression should match the supplied content. If you do not supply this attribute then Data Splitter will not enforce that the expression matches the supplied content. If specified Data Splitter will generate an error if the expression does not match the supplied content at least as many times as specified.

Unlike maxMatch, minMatch does not control the matching process but instead controls the production of error messages generated if the parser is not seeing the expected input.

`onlyMatch`

Optional attribute to use this expression only for specific instances of a match of the parent expression, e.g. on the 4th, 5th and 8th matches of the parent expression specified by ‘4,5,8’. This is used when this expression should only be used to subdivide content from certain parent matches.

The `<regex>` element

The <regex> element directs Data Splitter to match content using the specified regular expression pattern. In addition to this the same match control attributes that are available on the <split> element are also present as well as attributes to alter the way the pattern works.

Attributes

The <regex> element has the following attributes:

id
pattern
dotAll
caseInsensitive
maxMatch
minMatch
onlyMatch
advance

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

`pattern`

This is a required attribute used to specify a regular expression to use to match on the supplied content. The pattern is used to match the content multiple times until the end of the content is reached while the maxMatch and onlyMatch conditions are satisfied.

`dotAll`

An optional attribute used to specify if the use of ‘.’ in the supplied pattern matches all characters including new lines. If ’true’ ‘.’ will match all characters including new lines, if ‘false’ it will only match up to a new line. If this attribute is not specified it defaults to ‘false’ and will only match up to a new line.

This attribute is used in many of the multi-line examples above.

`caseInsensitive`

An optional attribute used to specify if the supplied pattern should match content in a case insensitive way. If ’true’ the expression will match content in a case insensitive manner, if ‘false’ it will match the content in a case sensitive manner. If this attribute is not specified it defaults to ‘false’ and will match the content in a case sensitive manner.

`maxMatch`

This is used in the same way it is on the <split> element, see maxMatch.

`minMatch`

This is used in the same way it is on the <split> element, see minMatch.

`onlyMatch`

This is used in the same way it is on the <split> element, see onlyMatch.

`advance`

After an expression has matched content in the buffer, the buffer start position is advanced so that it moves to the end of the entire match. This means that subsequent expressions operating on the content buffer will not see the previously matched content again. This is normally required behaviour, but in some cases some of the content from a match is still required for subsequent matches. Take the following example of name value pairs:

name1=some value 1 name2=some value 2 name3=some value 3

The first name value pair could be matched with the following expression:

<regex pattern="([^=]+)=(.+?) [^= ]+=">

The above expression would match as follows:

name1=some value 1 name2=some value 2 name3=some value 3

In this example we have had to do a reluctant match to extract the value in group 2 and not include the subsequent name. Because the reluctant match requires us to specify what we are reluctantly matching up to, we have had to include an expression after it that matches the next name.

By default the parser will move the character buffer to the end of the entire match so the next expression will be presented with the following:

some value 2 name3=some value 3

Therefore name2 will have been lost from the content buffer and will not be available for matching.

This behaviour can be altered by telling the expression how far to advance the character buffer after matching. This is done with the advance attribute and is used to specify the match group whose end position should be treated as the point the content buffer should advance to, e.g.

<regex pattern="([^=]+)=(.+?) [^= ]+=" advance="2">

In this example the content buffer will only advance to the end of match group 2 and subsequent expressions will be presented with the following content:

name2=some value 2 name3=some value 3

Therefore name2 will still be available in the content buffer.

It is likely that the advance feature will only be useful in cases where a reluctant match is performed. Reluctant matches are discouraged for performance reasons so this feature should rarely be used. A better way to tackle the above example would be to present the content in reverse, however this is only possible if the expression is within a group, i.e. is not a root expression. There may also be more complex cases where reversal is not an option and the use of a reluctant match is the only option.

The `<all>` element

The <all> element matches the entire content of the parent group and makes it available to child groups or <data> elements. The purpose of <all> is to act as a catch all expression to deal with content that is not handled by a more specific expression, e.g. to output some other unknown, unrecognised or unexpected data.

<group>
  <regex pattern="^\s*([^=]+)=([^=]+)\s*">
    <data name="$1" value="$2" />
  </regex>

  <!-- Output unexpected data -->
  <all>
    <data name="unknown" value="$" />
  </all>
</group>

The <all> element provides the same functionality as using .* as a pattern in a <regex> element and where dotAll is set to true, e.g. <regex pattern=".*" dotAll="true">. However it performs much faster as it doesn’t require pattern matching logic and is therefore always preferred.

Attributes

The <all> element has the following attributes:

id

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

5.5.3 - Variables

A variable is added to Data Splitter using the <var> element. A variable is used to store matches from a parent expression for use in a reference elsewhere in the configuration, see variable reference.

The most recent matches are stored for use in local references, i.e. references that are in the same match scope as the variable. Multiple matches are stored for use in references that are in a separate match scope. The concept of different variable scopes is described in scopes.

The `<var>` element

The <var> element is used to tell Data Splitter to store matches from a parent expression for use in a reference.

Attributes

The <var> element has the following attributes:

id

`id`

Mandatory attribute used to uniquely identify it within the configuration (see id) and is the means by which a variable is referenced, e.g. $VAR_ID$ .

5.5.4 - Output

As with all other aspects of Data Splitter, output XML is determined by adding certain elements to the Data Splitter configuration.

The `<data>` element

Output is created by Data Splitter using one or more <data> elements in the configuration. The first <data> element that is encountered within a matched expression will result in parent <record> elements being produced in the output.

Attributes

The <data> element has the following attributes:

id
name
value

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

`name`

Both the name and value attributes of the <data> element can be specified using match references.

`value`

Both the name and value attributes of the <data> element can be specified using match references.

Single `<data>` element example

The simplest example that can be provided uses a single <data> element within a <split> expression.

Given the following input:

This is line 1
This is line 2
This is line 3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter 
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <split delimiter="\n" >
    <data value="$1"/>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data value="This is line 1" />
  </record>
  <record>
    <data value="This is line 2" />
  </record>
  <record>
    <data value="This is line 3" />
  </record>
</records>

Multiple `<data>` element example

You could also output multiple <data> elements for the same <record> by adding multiple elements within the same expression:

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
    <data name="ip" value="$1"/>
    <data name="user" value="$2"/>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="ip" value="1.1.1.1" />
    <data name="user" value="user1" />
  </record>
  <record>
    <data name="ip" value="2.2.2.2" />
    <data name="user" value="user2" />
  </record>
  <record>
    <data name="ip" value="3.3.3.3" />
    <data name="user" value="user3" />
  </record>
</records>

Multi level `<data>` elements

As long as all data elements occur within the same parent/ancestor expression, all data elements will be output within the same record.

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <split delimiter="\n" >
    <data name="line" value="$1"/>

    <group value="$1">
      <regex pattern="ip=([^ ]+) user=([^ ]+)">
        <data name="ip" value="$1"/>
        <data name="user" value="$2"/>
      </regex>
    </group>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="line" value="ip=1.1.1.1 user=user1" />
    <data name="ip" value="1.1.1.1" />
    <data name="user" value="user1" />
  </record>
  <record>
    <data name="line" value="ip=2.2.2.2 user=user2" />
    <data name="ip" value="2.2.2.2" />
    <data name="user" value="user2" />
  </record>
  <record>
    <data name="line" value="ip=3.3.3.3 user=user3" />
    <data name="ip" value="3.3.3.3" />
    <data name="user" value="user3" />
  </record>
</records>

Nesting `<data>` elements

Rather than having <data> elements all appear as children of <record> it is possible to nest them either as direct children or within child groups.

Direct children

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
    <data name="line" value="$">
      <data name="ip" value="$1"/>
      <data name="user" value="$2"/>
    </data>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<record
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="line" value="ip=1.1.1.1 user=user1">
      <data name="ip" value="1.1.1.1" />
      <data name="user" value="user1" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=2.2.2.2 user=user2">
      <data name="ip" value="2.2.2.2" />
      <data name="user" value="user2" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=3.3.3.3 user=user3">
      <data name="ip" value="3.3.3.3" />
      <data name="user" value="user3" />
    </data>
  </record>
</records>

Within child groups

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <split delimiter="\n" >
    <data name="line" value="$1">
      <group value="$1">
        <regex pattern="ip=([^ ]+) user=([^ ]+)">
          <data name="ip" value="$1"/>
          <data name="user" value="$2"/>
        </regex>
      </group>
    </data>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="line" value="ip=1.1.1.1 user=user1">
      <data name="ip" value="1.1.1.1" />
      <data name="user" value="user1" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=2.2.2.2 user=user2">
      <data name="ip" value="2.2.2.2" />
      <data name="user" value="user2" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=3.3.3.3 user=user3">
      <data name="ip" value="3.3.3.3" />
      <data name="user" value="user3" />
    </data>
  </record>
</records>

The above example produces the same output as the previous but could be used to apply much more complex expression logic to produce the child <data> elements, e.g. the inclusion of multiple child expressions to deal with different types of lines.

5.6 - Match References, Variables and Fixed Strings

The <group> and <data> elements can reference match groups from parent expressions or from stored matches in variables. In the case of the <group> element, referenced values are passed on to child expressions whereas the <data> element can use match group references for name and value attributes. In the case of both elements the way of specifying references is the same.

5.6.1 - Expression match references

Referencing matches in expressions is done using $. In addition to this a match group number may be added to just retrieve part of the expression match. The applicability and effect that this has depends on the type of expression used.

References to `<split>` Match Groups

In the following example a line matched by a parent <split> expression is referenced by a child <data> element.

<split delimiter="\n" >
  <data name="line" value="$"/>
</split>

A <split> element matches content up to and including the specified delimiter, so the above reference would output the entire line plus the delimiter. However there are various match groups that can be used by child <group> and <data> elements to reference sections of the matched content.

To illustrate the content provided by each match group, take the following example:

"This is some text\, that we wish to match", "This is the next text"

And the following <split> element:

<split delimiter="," escape="\">

The match groups are as follows:

$ or $0: The entire content that is matched including the specified delimiter at the end

"This is some text\, that we wish to match",

$1: The content up to the specified delimiter at the end

"This is some text\, that we wish to match"

$2: The content up to the specified delimiter at the end and filtered to remove escape characters (more expensive than $1)

"This is some text, that we wish to match"

In addition to this behaviour match groups 1 and 2 will omit outermost whitespace and container characters if specified, e.g. with the following content:

"  This is some text\, that we wish to match  "  , "This is the next text"

And the following <split> element:

<split delimiter="," escape="\" containerStart="&quot" containerEnd="&quot">

The match groups are as follows:

$ or $0: The entire content that is matched including the specified delimiter at the end

" This is some text\, that we wish to match " ,

$1: The content up to the specified delimiter at the end and strips outer containers.

This is some text\, that we wish to match

$2: The content up to the specified delimiter at the end and strips outer containers and filtered to remove escape characters (more computationally expensive than $1)

This is some text, that we wish to match

References to Match Groups

Like the <split> element various match groups can be referenced in a <regex> expression to retrieve portions of matched content. This content can be used as values for <group> and <data> elements.

Given the following input:

ip=1.1.1.1 user=user1

And the following <regex> element:

<regex pattern="ip=([^ ]+) user=([^ ]+)">

The match groups are as follows:

$ or $0: The entire content that is matched by the expression

ip=1.1.1.1 user=user1

$1: The content of the first match group

1.1.1.1

$2: The content of the second match group

user1

Match group numbers in regular expressions are determined by the order that their open bracket appears in the expression.

References to `<any>` Match Groups

The <any> element does not have any match groups and always returns the entire content that was passed to it when referenced with $.

5.6.2 - Variable reference

Variables are added to Data Splitter configuration using the <var> element, see variables. Each variable must have a unique id so that it can be referenced. References to variables have the form $VARIABLE_ID$ , e.g.

<data name="$heading$" value="$" />

Identification

Data Splitter validates the configuration on load and ensures that all element ids are unique and that referenced ids belong to a variable.

A variable will only store data if it is referenced so variables that are not referenced will do nothing. In addition to this a variable will only store data for match groups that are referenced, e.g. if $heading$1 is the only reference to a variable with an id of ‘heading’ then only data for match group 1 will be stored for reference lookup.

Scopes

Variables have two scopes which affect how data is retrieved when referenced:

Local scope
Remote scope

Local Scope

Variables are local to a reference if the reference exists as a descendant of the variables parent expression, e.g.

<split delimiter="\n" >
  <var id="line" />

  <group value="$1">
    <regex pattern="ip=([^ ]+) user=([^ ]+)">
      <data name="line" value="$line$"/>
      <data name="ip" value="$1"/>
      <data name="user" value="$2"/>
    </regex>
  </group>
</split>

In the above example, matches for the outermost <split> expression are stored in the variable with the id of line. The only reference to this variable is in a data element that is a descendant of the variables parent expression <split>, i.e. it is nested within split/group/regex.

Because the variable is referenced locally only the most recent parent match is relevant, i.e. no retrieval of values by iteration, iteration offset or fixed position is applicable. These features only apply to remote variables that store multiple values.

Remote Scope

The CSV example with a heading is an example of a variable being referenced from a remote scope.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match heading line (note that maxMatch="1" means that only the first line will be matched by this splitter) -->
  <split delimiter="\n" maxMatch="1">

    <!-- Store each heading in a named list -->
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </split>

  <!-- Match each record -->
  <split delimiter="\n">

    <!-- Take the matched line -->
    <group value="$1">

      <!-- Split the line up -->
      <split delimiter=",">

        <!-- Output the stored heading for each iteration and the value from group 1 -->
        <data name="$heading$1" value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>

In the above example the parent expression of the variable is not the ancestor of the reference in the <data> element. This makes the <data> elements reference to the variable a remote one. In this situation the variable knows that it must store multiple values as the remote reference <data> may retrieve one of many values from the variable based on:

The match count of the parent expression.
The match count of the parent expression, plus or minus an offset.
A fixed position in the variable store.

Retrieval of value by iteration

In the above example the first line is taken then repeatedly matched by delimiting with commas. This results in multiple values being stored in the ‘heading’ variable. Once this is done subsequent lines are matched and then also repeatedly matched by delimiting with commas in the same way the heading is.

Each time a line is matched the internal match count of all sub expressions, (e.g. the <split> expression that is delimited by comma) is reset to 0. Every time the sub <split> expression matches up to a comma delimiter the match count is incremented. Any references to remote variables will, by default, use the current match count as an index to retrieve one of the many values stored in the variable. This means that the <data> element in the above example will retrieve the corresponding heading for each value as the match count of the values will match the storage position of each heading.

Retrieval of value by iteration offset

In some cases there may be a mismatch between the position where a value is stored in a variable and the match count applicable when remotely referencing the variable.

Take the following input:

BAD,Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,

In the above example we can see that the first heading ‘BAD’ is not correct for the first value of every line. In this situation we could either adjust the way the heading line is parsed to ignore ‘BAD’ or just adjust the way the heading variable is referenced.

To make this adjustment the reference just needs to be told what offset to apply to the current match count to correctly retrieve the stored value. In the above example this would be done like this:

<data name="$heading$1[+1]" value="$1" />

The above reference just uses the match count plus 1 to retrieve the stored value. Any integral offset plus or minus may be used, e.g. [+4] or [-10]. Offsets that result in a position that is outside of the storage range for the variable will not return a value.

Retrieval of value by fixed position

In addition to retrieval by offset from the current match count, a stored value can be returned by a fixed position that has no relevance to the current match count.

In the following example the value retrieved from the ‘heading’ variable will always be ‘IPAddress’ as this is the fourth value stored in the ‘heading’ variable and the position index starts at 0.

<data name="$heading$1[3]" value="$1" />

5.6.3 - Use of fixed strings

Any <group> value or <data> name and value can use references to matched content, but in addition to this it is possible just to output a known string, e.g.

<data name="somename" value="$" />

The above example would output somename as the <data> name attribute. This can often be useful where there are no headings specified in the input data but we want to associate certain names with certain values.

Given the following data:

01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,

We could provide useful headings with the following configuration:

<regex pattern="([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),">
  <data name="date" value="$1" />
  <data name="time" value="$2" />
  <data name="ipAddress" value="$3" />
  <data name="hostName" value="$4" />
  <data name="user" value="$5" />
  <data name="action" value="$6" />
</regex>

5.6.4 - Concatenation of references

It is possible to concatenate multiple fixed strings and match group references using the + character. As with all references and fixed strings this can be done in <group> value and <data> name and value attributes. However concatenation does have some performance overhead as new buffers have to be created to store concatenated content.

A good example of concatenation is the production of ISO8601 date format from data in the previous example:

01/01/2010,00:00:00

Here the following <regex> could be used to extract the relevant date, time groups:

<regex pattern="(\d{2})/(\d{2})/(\d{4}),(\d{2}):(\d{2}):(\d{2})">

The match groups from this expression can be concatenated with the following value output pattern in the data element:

<data name="dateTime" value="$3+’-‘+$2+’-‘+$1+’-‘+’T’+$4+’:’+$5+’:’+$6+’.000Z’" />

Using the original example, this would result in the output:

<data name="dateTime" value="2010-01-01T00:00:00.000Z" />

Note that the value output pattern wraps all fixed strings in single quotes. This is necessary when concatenating strings and references so that Data Splitter can determine which parts are to be treated as fixed strings. This also allows fixed strings to contain $ and + characters.

As single quotes are used for this purpose, a single quote needs to be escaped with another single quote if one is desired in a fixed string, e.g.

‘this ‘’is quoted text’’’

This will result in:

this ‘is quoted text’

6 - Event Feeds

Feeds provide the means to logically group data common to a data format and source system.

In order for Stroom to be able to handle the various data types as described in the previous section, Stroom must be told what the data is when it is received. This is achieved using Event Feeds. Each feed has a unique name within the system.

Events Feeds can be related to one or more Reference Feed. Reference Feeds are used to provide look up data for a translation. E.g. lookup a computer name by it’s IP address.

Feeds can also have associated context data. Context data used to provide look up information that is only applicable for the events file it relates to. E.g. if the events file is missing information relating to the computer it was generated on, and you don’t want to create separate feeds for each computer, an associated context file could be used to provide this information.

Feed Identifiers

Feed identifiers must be unique within the system. Identifiers can be in any format but an established convetnion is to use the following format

<SYSTEM>-<ENVIRONMENT>-<TYPE>-<EVENTS/REFERENCE>-<VERSION>

If feeds in a certain site need different reference data then the site can be broken down further.

_ may be used to represent a space.

7 - Indexing data

Indexing data for querying.

7.1 - Elasticsearch

Using Elasticsearch to index data

7.1.1 - Introduction

Concepts, assumptions and key differences to Solr and built-in Lucene indexing

Stroom supports using an external Elasticsearch cluster to index event data. This allows you to leverage all the features of the Elastic Stack, such as shard allocation, replication, fault tolerance and aggregations.

With Elasticsearch as an external service, your search infrastructure can scale independently of your Stroom data processing cluster, enhancing interoperability with other platforms by providing a performant and resilient time-series event data store. For instance, you can:

Deploy Kibana to search and visualise Elasticsearch data.
Index Stroom’s stream meta and Error streams so monitoring systems can generate metrics and alerts.
Use Apache Spark to perform stateful data processing and enrichment, through the use of the Elasticsearch-Hadoop connector.

Stroom achieves indexing and search integration by interfacing securely with the Elasticsearch REST API using the Java high-level client.

This guide will walk you through configuring a Stroom indexing pipeline, creating an Elasticsearch index template, activating a stream processor and searching the indexed data in both Stroom and Kibana.

Assumptions

You have created an Elasticsearch cluster. Elasticsearch 8.x is recommended, though the latest supported 7.x version will also work. For test purposes, you can quickly create a single-node cluster using Docker by following the steps in the Elasticsearch Docs .
The Elasticsearch cluster is reachable via HTTPS from all Stroom nodes participating in stream processing.
Elasticsearch security is enabled. This is mandatory and is enabled by default in Elasticsearch 8.x and above.
The Elasticsearch HTTPS interface presents a trusted X.509 server certificate. The Stroom node(s) connecting to Elasticsearch need to be able to verify the certificate, so for custom PKI, a Stroom truststore entry may be required.
You have a feed containing Event streams to index.

Key differences

Indexing data with Elasticsearch differs from Solr and built-in Lucene methods in a number of ways:

Unlike with Solr and built-in Lucene indexing, Elasticsearch field mappings are managed outside Stroom, through the use of index and component templates . These are normally created either via the Elasticsearch API, or interactively using Kibana.
Aside from creating the mandatory StreamId and EventId field mappings, explicitly defining mappings for other fields is optional. Elasticsearch will use dynamic mapping by default, to infer each field’s type at index time. Explicitly defining mappings is recommended where consistency or greater control are required, such as for IP address fields (Elasticsearch mapping type ip).

Next page - Getting Started

7.1.2 - Getting Started

Establishing an Elasticsearch cluster connection

Establish an Elasticsearch cluster connection in Stroom

The first step is to configure Stroom to connect to an Elasticsearch cluster. You can configure multiple cluster connections if required, such as a separate one for production and another for development. Each cluster connection is defined by an Elastic Cluster document within the Stroom UI.

In the Stroom Explorer pane ( ), right-click on the folder where you want to create the Elastic Cluster document.
Select:

New

Elastic Cluster
Give the cluster document a name and press OK .
Complete the fields as explained in the section below. Any fields not marked as “Optional” are mandatory.
Click Test Connection. A dialog will display with the test result. If Connection Success, details of the target cluster will be displayed. Otherwise, error details will be displayed.
Click to commit changes.

Warning

Ensure you restrict permissions to the Elastic Cluster document. The Read privilege permits retrieval of the Elasticsearch API key and secret, granting the holder the same level of privilege as Stroom. Users authorised to search Elasticsearch indices via Stroom dashboards should only be assigned the Use privilege.

Elastic Cluster document fields

Description

(Optional) You might choose to enter the Elasticsearch cluster name or purpose here.

Connection URLs

Enter one or more node or cluster addresses, including protocol, hostname and port. Only HTTPS is supported; attempts to use plain-text HTTP will fail.

Examples

Local development node: https://localhost:9200
FQDN: https://elasticsearch.example.com:9200
Kubernetes service: https://prod-es-http.elastic.svc:9200

CA certificate

PEM-format CA certificate chain used by Stroom to verify TLS connections to the Elasticsearch HTTPS REST interface. This is usually your organisation’s root enterprise CA certificate. For development, you can provide a self-signed certificate.

Use authentication

(Optional) Tick this box if Elasticsearch requires authentication. This is enabled by default from Elasticsearch version 8.0.

API key ID

Required if Use authentication is checked. Specifies the Elasticsearch API key ID for a valid Elasticsearch user account. This user requires at a minimum the following privileges :

Cluster privileges

monitor
manage_own_api_key

Index privileges

all

API key secret

Required if Use authentication is checked.

Socket timeout (ms)

Number of milliseconds to wait for an Elasticsearch indexing or search REST call to complete. Set to -1 (the default) to wait indefinitely, or until Elasticsearch closes the connection.

Next page - Indexing data

7.1.3 - Indexing data

Indexing event data to Elasticsearch

A typical workflow is for a Stroom pipeline to convert XML Event elements into the XML equivalent of JSON, complying with the schema http://www.w3.org/2005/xpath-functions, using a format identical to the output of the XML function xml-to-json().

Understanding JSON XML representation

In an Elasticsearch indexing pipeline translation, you model JSON documents in a compatible XML representation.

Common JSON primitives and examples of their XML equivalents are outlined below.

Arrays

Array of maps

<array key="users" xmlns="http://www.w3.org/2005/xpath-functions">
  <map>
    <string key="name">John Smith</string>
  </map>
</array>

Array of strings

<array key="userNames" xmlns="http://www.w3.org/2005/xpath-functions">
  <string>John Smith</string>
  <string>Jane Doe</string>
</array>

Maps and properties

<map key="user" xmlns="http://www.w3.org/2005/xpath-functions">
  <string key="name">John Smith</string>
  <boolean key="active">true</boolean>
  <number key="daysSinceLastLogin">42</number>
  <string key="loginDate">2022-12-25T01:59:01.000Z</string>
  <null key="emailAddress" />
  <array key="phoneNumbers">
    <string>1234567890</string>
  </array>
</map>

Note

It is recommended to insert a schema validation filter into your pipeline XML (with schema group JSON), to make it easier to diagnose JSON conversion errors.

We will now explore how create an Elasticsearch index template, which specifies field mappings and settings for one or more indices.

Create an Elasticsearch index template

For information on what index and component templates are, consult the Elastic documentation .

When Elasticsearch first receives a document from Stroom targeting an index, whose name matches any of the index_patterns entries in the index template, it will create a new index / data stream using the settings and mappings properties from the template. In this way, the index does not need to be manually created in advance.

Note

If an index doesn’t match a template when it is created, data will still be indexed - with default mappings and settings. This may be appropriate for small indices, but with a default shard count of 5, the indexing and search performance will likely be inadequate for large indices.

The following example creates a basic index template stroom-events-v1 in a local Elasticsearch cluster, with the following explicit field mappings:

StreamId – mandatory, data type long or keyword.
EventId – mandatory, data type long or keyword.
@timestamp – required if the index is to be part of a data stream (recommended).
User – An object containing properties Id, Name and Active, each with their own data type.
Tags – An array of one or more strings.
Message – Contains arbitrary content such as unstructured raw log data. Supports full-text search. Nested field wildcard supports regexp queries .

Note

Elasticsearch does not have a dedicated array field mapping data type. An Elasticsearch field may contain zero or more values by default. See:

Arrays in the Elastic documentation.

In Kibana Dev Tools, execute the following query:

PUT _index_template/stroom-events-v1

{
  "index_patterns": [
    "stroom-events-v1*" // Apply this template to index names matching this pattern.
  ],
  "data_stream": {}, // For time-series data. Recommended for event data.
  "template": {
    "settings": {
      "number_of_replicas": 1, // Replicas impact indexing throughput. This setting can be changed at any time.
      "number_of_shards": 10, // Consider the shard sizing guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#shard-size-recommendation
      "refresh_interval": "10s", // How often to refresh the index. For high-throughput indices, it's recommended to increase this from the default of 1s
      "lifecycle": {
        "name": "stroom_30d_retention_policy" // (Optional) Apply an ILM policy https://www.elastic.co/guide/en/elasticsearch/reference/current/set-up-lifecycle-policy.html
      }
    },
    "mappings": {
      "dynamic_templates": [],
      "properties": {
        "StreamId": { // Required.
          "type": "long"
        },
        "EventId": { // Required.
          "type": "long"
        },
        "@timestamp": { // Required if the index is part of a data stream.
          "type": "date"
        },
        "User": {
          "properties": {
            "Id": {
              "type": "keyword"
            },
            "Name": {
              "type": "keyword"
            },
            "Active": {
              "type": "boolean"
            }
          }
        },
        "Tags": {
          "type": "keyword"
        },
        "Message": {
          "type": "text",
          "fields": {
            "wildcard": {
              "type": "wildcard"
            }
          }
        }
      }
    }
  },
  "composed_of": [
    // Optional array of component template names.
  ]
}

Create an Elasticsearch indexing pipeline template

An Elasticsearch indexing pipeline is similar in structure to the built-in packaged Indexing template pipeline. It typically consists of the following pipeline elements:

Source

XMLParser

recordCount (read)

SplitFilter

IdEnrichmentFilter

XSLTFilter

SchemaFilter

ElasticIndexingFilter

recordCount (written)

XSLTFilter contains the translation mapping Events to JSON array.
SchemaFilter uses schema group JSON.

It is recommended to create a template Elasticsearch indexing pipeline, which can then be re-used.

Procedure

Right-click on the Template Pipelines folder in the Stroom Explorer pane ( ).
Select:

New

Pipeline
Enter the name Indexing (Elasticsearch) and click OK .
Define the pipeline structure as above, and customise the following pipeline elements:
1. Set the Split Filter splitCount property to a sensible default value, based on the expected source XML element count (e.g. 100).
2. Set the Schema Filter schemaGroup property to JSON.
3. Set the Elastic Indexing Filter cluster property to point to the Elastic Cluster document you created earlier.
4. Set the Write Record Count filter countRead property to false.

Now you have created a template indexing pipeline, it’s time to create a feed-specific pipeline that inherits this template.

Create an Elasticsearch indexing pipeline

Procedure

Right-click on a folder in the Stroom Explorer pane .
New

Pipeline
Enter a name for your pipeline and click OK .
Click the Inherit From button.
In the dialog that appears, select the template pipeline you created named Indexing (Elasticsearch) and click OK .
Select the Elastic Indexing Filter pipeline element.
Set the indexName property to the name of the destination index or data stream. indexName may be a simple string (static) or dynamic.
If using dynamic index names, configure the translation to output named element(s) that will be interpolated into indexName for each document indexed.

Choosing between simple and dynamic index names

Indexing data to a single, named data stream or index, is a simple and convenient way to manage data. There are cases however, where indices may contain significant volumes of data spanning long periods - and where a large portion of indexing will be performed up-front (such as when processing a feed with a lot of historical data). As Elasticsearch data stream indices roll over based on the current time (not event time), it is helpful to be able to partition data streams by user-defined properties such as year. This use case is met by Stroom’s dynamic index naming.

Note

An Elasticsearch

data stream consists of one or more backing indices, which automatically roll over once a size or date threshold are met. This abstraction assists with the lifecycle management of time-series log data, enabling users to define time and sized-based rules that can for instance, delete indices after they reach a certain age - or move older indices to different data tiers (e.g. cold storage).

Single named index or data stream

This is the simplest use case and is suitable where you want to write all data for a particular pipeline, to a single data stream or index. Whether data is written to an actual index or data stream depends on your index template, specifically whether you have declared data_stream: {}. If this property exists in the index template matching indexName, a data stream is created when the first document is indexed. Data streams, amongst many other features, provide the option to use Elasticsearch Index Lifecycle Management (ILM) policies to manage their lifecycle.

Note

When indexing to a data stream, ensure to include a string field named @timestamp in the output JSON XML. This is mandatory and indexing will fail if this field isn’t a valid date value.

Dynamic data stream names

With a dynamic stream name, indexName contains the names of elements, for example: stroom-events-v1-{year}. For each document, the final index name is computed based on the values of the corresponding elements within the resulting JSON XML. For example, if the JSON XML representation of an event consists of the following, the document will be indexed to the index or data stream named stroom-events-v1-2022:

<?xml version="1.1" encoding="UTF-8"?>
<array
        xmlns="http://www.w3.org/2005/xpath-functions"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2005/xpath-functions file://xpath-functions.xsd">
    <map>
        <number key="StreamId">3045516</number>
        <number key="EventId">1</number>
        <string key="@timestamp">2022-12-16T02:46:29.218Z</string>
        <number key="year">2022</number>
    </map>
</array>

This is due to the value of /map/number[@key='year'] being 2022. This approach can be useful when you need to apply different ILM policies, such as maintaining older data on slower storage tiers.

Warning

Any element names defined in indexName must exist in the JSON XML (unless it is an empty document). If a blank value is desired, output an empty string element.

Note

If an element name begins with _ (underscore), its value is only used for indexName interpolation, and it is not included in the final JSON.

Other applications for dynamic data stream names

Dynamic data stream names can also help in other scenarios, such as implementing fine-grained retention policies, such as deleting documents that aren’t user-attributed after 12 months. While Stroom ElasticIndex support data retention expressions, deleting documents in Elasticsearch by query is highly inefficient and doesn’t cause disk space to be freed (this requires an index to be force-merged, an expensive operation). A better solution therefore, is to use dynamic data stream names to partition data and assign certain partitions to specific ILM policies and/or data tiers.

Migrating older data streams to other data tiers

Say a feed is indexed, spanning data from 2020 through 2023. Assuming most searches only need to query data from the current year, the data streams stroom-events-v1-2020 and stroom-events-v1-2021 can be moved to cold storage. To achieve this, use index-level shard allocation filtering .

In Kibana Dev Tools, execute the following command:

PUT stroom-events-v1-2020,stroom-events-v1-2021/_settings

{
  "index.routing.allocation.include._tier_preference": "data_cold"
}

This example assumes a cold data tier has been defined for the cluster. If the command executes successfully, shards from the specified data streams are gradually migrated to the nodes comprising the destination data tier.

Create an indexing translation

In this example, let’s assume you have event data that looks like the following:

<?xml version="1.1" encoding="UTF-8"?>
<Events
    xmlns="event-logging:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="event-logging:3 file://event-logging-v3.5.2.xsd"
    Version="3.5.2">
  <Event>
    <EventTime>
      <TimeCreated>2022-12-16T02:46:29.218Z</TimeCreated>
    </EventTime>
    <EventSource>
      <System>
        <Name>Nginx</Name>
        <Environment>Development</Environment>
      </System>
      <Generator>Filebeat</Generator>
      <Device>
        <HostName>localhost</HostName>
      </Device>
      <User>
        <Id>john.smith1</Id>
        <Name>John Smith</Name>
        <State>active</State>
      </User>
    </EventSource>
    <EventDetail>
      <View>
        <Resource>
          <URL>http://localhost:8080/index.html</URL>
        </Resource>
        <Data Name="Tags" Value="dev,testing" />
        <Data 
          Name="Message" 
          Value="TLSv1.2 AES128-SHA 1.1.1.1 &quot;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&quot;" />
      </View>
    </EventDetail>
  </Event>
  <Event>
    ...
  </Event>
</Events>

We need to write an XSL transform (XSLT) to form a JSON document for each stream processed. Each document must consist of an array element one or more map elements (each representing an Event), each with the necessary properties as per our index template.

See XSLT Conversion for instructions on how to write an XSLT.

The output from your XSLT should match the following:

<?xml version="1.1" encoding="UTF-8"?>
<array
    xmlns="http://www.w3.org/2005/xpath-functions"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/xpath-functions file://xpath-functions.xsd">
  <map>
    <number key="StreamId">3045516</number>
    <number key="EventId">1</number>
    <string key="@timestamp">2022-12-16T02:46:29.218Z</string>
    <map key="User">
      <string key="Id">john.smith1</string>
      <string key="Name">John Smith</string>
      <boolean key="Active">true</boolean>
    </map>
    <array key="Tags">
      <string>dev</string>
      <string>testing</string>
    </array>
    <string key="Message">TLSv1.2 AES128-SHA 1.1.1.1 "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"</string>
  </map>
  <map>
    ...
  </map>
</array>

Assign the translation to the indexing pipeline

Having created your translation, you need to reference it in your indexing pipeline.

Open the pipeline you created.
Select the Structure tab.
Select the XSLTFilter pipeline element.
Double-click the xslt property value cell.
Select the XSLT you created and click OK .
Click .

Step the pipeline

At this point, you will want to step the pipeline to ensure there are no errors and that output looks as expected.

Execute the pipeline

Create a pipeline processor and filter to run the pipeline against one or more feeds. Stroom will distribute processing tasks to enabled nodes and send documents to Elasticsearch for indexing.

You can monitor indexing status via your Elasticsearch monitoring tool of choice.

Detecting and handling errors

If any errors occur while a stream is being indexed, an Error stream is created, containing details of each failure. Error streams can be found under the Data tab of either the indexing pipeline or receiving Feed.

Note

You can filter the selected pipeline or feed to list only Error streams. Click then add a condition Type = Error.

Once you have addressed the underlying cause for a particular type of error (such as an incorrect field mapping), reprocess affected streams:

Select any Error streams relating for reprocessing, by clicking the relevant checkboxes in the stream list (top pane).
Click .
In the dialog that appears, check Reprocess data and click OK .
Click OK for any confirmation prompts that follow.

Stroom will re-send data from the selected Event streams to Elasticsearch for indexing. Any existing documents matching the StreamId of the original Event stream are first deleted automatically to avoid duplication.

Tips and tricks

Use a common schema for your indices

An example is Elastic Common Schema (ECS) . This helps users understand the purpose of each field and to build cross-index queries simpler by using a set of common fields (such as a user ID).

With this in mind, it is important that common fields also have the same data type in each index. Component templates help make this easier and reduce the chance of error, by centralising the definition of common fields to a single component.

Use a version control system (such as git) to track index and component templates

This helps keep track of changes over time and can be an important resource for both administrators and users.

Rebuilding an index

Sometimes it is necessary to rebuild an index. This could be due to a change in field mapping, shard count or responding to a user feature request.

To rebuild an index:

Drain the indexing pipeline by deactivating any processor filters and waiting for any running tasks to complete.
Delete the index or data stream via the Elasticsearch API or Kibana.
Make the required changes to the index template and/or XSL translation.
Create a new processor filter either from scratch or using the button.
Activate the new processor filter.

Use a versioned index naming convention

As with the earlier example stroom-events-v1, a version number is appended to the name of the index or data stream. If a new field is added, or some other change occurred requiring the index to be rebuilt, users would experience downtime. This can be avoided by incrementing the version and performing the rebuild against a new index: stroom-events-v2. Users could continue querying stroom-events-v1 until it is deleted. This approach involves the following steps:

Create a new Elasticsearch index template targeting the new index name (in this case, stroom-events-v2).
Create a copy of the indexing pipeline, targeting the new index in the Elastic Indexing Filter.
Create and activate a processing filter for the new pipeline.
Once indexing is complete, update the Elastic Index document to point to stroom-events-v2. Users will now be searching against the new index.
Drain any tasks for the original indexing pipeline and delete it.
Delete index stroom-events-v1 using either the Elasticsearch API or Kibana.

If you created a data view in Kibana, you’ll also want to update this to point to the new index / data stream.

7.1.4 - Exploring Data in Kibana

Using Kibana to search, aggregate and explore data indexed in Stroom

Kibana is part of the Elastic Stack and provides users with an interactive, visual way to query, visualise and explore data in Elasticsearch.

It is highly customisable and provides users and teams with tools to create and share dashboards, searches, reports and other content.

Once data has been indexed by Stroom into Elasticsearch, it can be explored in Kibana. You will first need to create a data view in order to query your indices.

Why use Kibana?

There are several use cases that benefit from Kibana:

Convenient and powerful drag-and-drop charts and other visualisation types using Kibana Lens. Much more performant and easier to customise than built-in Stroom dashboard visualisations.
Field statistics and value summaries with Kibana Discover. Great for doing initial audit data survey.
Geospatial analysis and visualisation.
Search field auto-completion.
Runtime fields . Good for data exploration, at the cost of performance.

7.2 - Lucene Indexes

Stroom’s built in Apache Lucene based indexing.

Stroom uses Apache Lucene for its built-in indexing solution. Index documents are stored in a Volume .

TODO

Complete this page.

Field configuration

Field Types

Id - Treated as a Long.
Boolean - True/False values.
Integer - Whole numbers from -2,147,483,648 to 2,147,483,647.
Long - Whole numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
Float - Fractional numbers. Sufficient for storing 6 to 7 decimal digits.
Double - Fractional numbers. Sufficient for storing 15 decimal digits.
Date - Date and time values.
Text - Text data.
Number - An alias for Long.

Stored fields

If a field is Stored then it means the complete field value will be stored in the index. This means the value can be retrieved from the index when building search results rather than using the slower Search Extraction process. Storing field values comes at the cost of hight storage requirements for the index. If storage space is not an issue then storing all fields that you want to return in search results is the optimum.

Indexed fields

An Indexed field is one that will be processed by Lucene so that the field can be queried. How the field is indexed will depend on the Field type and the Analyser used.

If you have fields that you do not want to be able to filter (i.e. that you won’t use as a query term) then you can include them as non-Indexed fields. Including a non-indexed field means it will be available for the user to select in the Dashboard table. A non-indexed field would either need to be Stored in the index or added via Search Extraction to be available in the search results.

Positions

If Positions is selected then Lucene will store the positions of all the field terms in the document.

Analyser types

The Analyser determines how Lucene reads the fields value and extracts tokens from it. The choice of Analyser will depend on the date in the field and how you want to search it.

Keyword - Treats the whole field value as one token. Useful for things like IDs and post codes. Supports the Case Sensitivity setting.
Alpha - Tokenises on any non-letter characters, e.g. one1 two2 three 3 => one two three. Strips non-letter characters. Supports the Case Sensitivity setting.
Numeric -
Alpha numeric - Tokenises on any non-letter/digit characters, e.g. one1 two2 three 3 => one1 two2 three 3. Supports the Case Sensitivity setting.
Whitespace - Tokenises only on white space. Not affected by the Case Sensitivity setting, case sensitive.
Stop words - Tokenises bases on non-letter characters and removes Stop Words, e.g. and. Not affected by the Case Sensitivity setting. Case insensitive.
Standard - The most common analyser. Tokenises the value on spaces and punctuation but recognises URLs and email addresses. Removes Stop Words, e.g. and. Not affected by the Case Sensitivity setting. Case insensitive. e.g. Find Stroom at github.com/stroom => Find Stroom at github.com/stroom.

Stop words

Some of the Analysers use a set of stop words for the tokenisers. This is the list of stop words that will not be indexed.

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

Case sensitivity

Some of the Analyser types support case (in)sensitivity. For example if the Analyser supports it the value TWO two would either be tokenised as TWO two or two two.

7.3 - Solr Integration

Indexing data using a Solr cluster.

TODO

Complete this section.

8 - Nodes

Configuring the nodes in a Stroom cluster.

All nodes in an Stroom cluster must be configured correctly for them to communicate with each other.

Configuring nodes

Open Monitoring/Nodes from the top menu. The nodes screen looks like this:

TODO

Screenshot

You need to edit each line by selecting it and then clicking the edit icon at the bottom. The URL for each node needs to be set as above but obviously substituting in the host name of the individual node, e.g. http://<HOST_NAME>:8080/stroom/clustercall.rpc

Nodes are expected communicate with each other on port 8080 over http. Ensure you have configured your firewall to allow nodes to talk to each other over this port. You can configure the URL to use a different port and possibly HTTPS but performance will be better with HTTP as no SSL termination is required.

Once you have set the URLs of each node you should also set the master assignment priority for each node to be different to all of the others. In the image above the priorities have been set in a random fashion to ensure that node3 assumes the role of master node for as long as it is enabled. You also need to check all of the nodes are enabled that you want to take part in processing or any other jobs.

Keep refreshing the table until all nodes show healthy pings as above. If you do not get ping results for each node then they are not configured correctly.

Once a cluster is configured correctly you will get proper distribution of processing tasks and search will be able to access all nodes to take part in a distributed query.

9 - Pipelines

Pipelines are the mechanism for processing and transforming ingested data.

Stroom uses Pipelines to process its data. A pipeline is a set of pipeline elements connected together. Pipelines are very powerful and flexible and allow the user to transform, index, store and forward data in a wide variety of ways.

Example Pipeline

Pipelines can take many forms and be used for a wide variety of purposes, however a typical pipeline to convert CSV data into cooked events might look like this:

Source

DSParser

recordCount (read)

SplitFilter

IdEnrichmentFilter

XSLTFilter

SchemaFilter

recordCount (written)

XMLWriter

StreamAppender

Input Data

Pipelines process data in batches. This batch of data is referred to as a Stream . The input for the pipeline is a single Stream that exists within a Feed and this data is fed into the left-hand side of the pipeline at Source . Pipelines can accept streams from multiple Feeds assuming those feeds contain similar data.

The data in the Stream is always text data (XML, JSON, CSV, fixed-width, etc.) in a known character encoding . Stroom does not currently support processing binary formats.

XML

The working format for pipeline processing is XML (with the exception of raw streaming). Data can be input and output in other forms, e.g. JSON, CSV, fixed-width, etc. but the majority of pipelines do most of their processing in XML. Input data is converted into XML SAX events, processed using XSLT to transform it into different shapes of XML then either consumed as XML (e.g. an IndexingFilter ) or converted into a desired output format for storage/forwarding.

Forks

Pipelines can also be forked at any point in the pipeline. This allows the same data to processed in different ways.

Note

Rather than creating complicated pipelines with forks, it is sometimes better to create multiple pipelines as this makes it easer to handle errors in one fork of the processing. It also makes it easier to re-use common simple pipelines. For example if you have a pipeline to transform CSV events into normalised XML then index it and forward it to a remote server, it may be better to have a pipeline to cook the events, then a common one to index those XML events and one to forward XML events.

Pipeline Inheritance

It is possible for pipelines to inherit from other pipelines. This allows for the creation of a standard abstract pipelines with a set structure, though not fully configured, to be inherited by many concrete pipelines.

For example you may have a standard pipeline for indexing XML events, i.e. read XML data and pass it to an IndexingFilter , but the IndexingFilter is not configured with the actual Index to send documents to. A pipeline that inherits this one can then be simply configured with the Index to use.

Pipeline inheritance allows for changes to the inherited structure, e.g. adding additional elements in line. Multi level inheritance is also supported.

Pipeline Element Types

Stroom has a number of categories of pipeline element.

Reader

Readers are responsible for reading the raw bytes of the input data and converting it to character data using the Feed’s character encoding. They also provide functionality to modify the data before or after it is decoded to characters, e.g. Bye Order Mark removal, or doing find/replace on the character data. You can chain multiple Readers.

Parser

A parser is designed to convert the character data into XML for processing. For example, the JSONParser will use a JSON parser to read the character data as JSON and convert it into XML elements and attributes that represent the JSON structure, so that it can be transformed downstream using XSLT.

Parsers have a built in reader so if they are not preceded by a Reader they will decode the raw bytes into character data before parsing.

Filter

A filter is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and can either return those events unchanged or modify them. An example of Filter is an XSLTFilter element. Multiple filters can be chained, with each one consuming the events output by the one preceding it, therefore you can have lots of common reusable XSLTFilters that all do small incremental changes to a document.

Writer

A writer is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and converts them into encoded character data (using a specified encoding) of some form. The preceding filter may have been an XSLTFilter which transformed XML into plain text, in which case only character data events will be output and a TextWriter can just write these out as text data. Other writers will handle the XML SAX events to convert them into another format, e.g. the JSONWriter before encoding them as character data.

Destination

A destination element is a consumer of character data, as produced by a writer. A typical destination is a StreamAppender that writes the character data (which may be XML, JSON, CSV, etc.) to a new Stream in Stroom’s stream store. Other destinations can be used for sending the encoded character data to Kafka, a file on a file system or forwarding to an HTTP URL.

9.1 - Pipeline Recipies

A set of basic pipeline structure recipies for common use cases.

The following are a basic set of pipeline recipes for doing typical tasks in Stroom. Is it not an exhaustive list as the possibilities with Pipelines are vast. They are intended as a rough guide to get you started with building Pipelines.

Data Ingest and Transformation

CSV to Normalised XML

CSV data is ingested.
The Data Splitter parser parses the records and fields into records format XML using an XML based TextConverter document.
The first XSLTFilter is used to normalise the events in records XML into event-logging XML.
The second XSLTFilter is used to decorate the events with additional data, e.g. <UserDetails> using reference data lookups.
The SchemaFilter ensures that the XML output by the stages of XSLT transformation conforms to the event-logging XMLSchema.
The XML events are then written out as an Event Stream to the Stream store.

Source

Data Splitter

Rec Count (read)

Split

ID

Normalise

Decorate

SchemaFilter

Rec Count (written)

XMLWriter

StreamAppender

Configured Content

Data Splitter - A TextConverter containing XML conforming to data-splitter:3.
Normalise - An XSLT transforming records:2 => event-logging:3.
Decorate - An XSLT transforming event-logging:3 => event-logging:3.
SchemaFilter - XMLSchema event-logging:3

JSON to Normalised XML

The same as ingesting CSV data above, except the input JSON is converted into an XML representation of the JSON by the JSONParser. The Normalise XSLTFilter will be specific to the format of the JSON being ingested. The Decorate) XSLTFilter will likely be identical to that used for the CSV ingest above, demonstrating reuse of pipeline element content.

Source

JSONParser

Rec Count (read)

Split

ID

Normalise

Decorate

SchemaFilter

Rec Count (written)

XMLWriter

StreamAppender

Configured Content

Normalise - An XSLT transforming http://www.w3.org/2013/XSL/json => event-logging:3.
Decorate - An XSLT transforming event-logging:3 => event-logging:3.
SchemaFilter - XMLSchema event-logging:3

XML (not event-logging) to Normalised XML

As above except that the input data is already XML, though not in event-logging format. The XMLParser simply reads the XML character data and converts it to XML SAX events for processing. The Normalise XSLTFilter will be specific to the format of this XML and will transform it into event-logging format.

Source

XMLParser

Rec Count (read)

Split

ID

Normalise

Decorate

SchemaFilter

Rec Count (written)

XMLWriter

StreamAppender

Configured Content

Normalise - An XSLT transforming a 3rd party schema => event-logging:3.
Decorate - An XSLT transforming event-logging:3 => event-logging:3.
SchemaFilter - XMLSchema event-logging:3

XML (event-logging) to Normalised XML

As above except that the input data is already in event-logging XML format, so no normalisation is required. Decoration is still needed though.

Source

XMLParser

Rec Count (read)

Split

ID

Decorate

SchemaFilter

Rec Count (written)

XMLWriter

StreamAppender

Configured Content

Decorate - An XSLT transforming event-logging:3 => event-logging:3.
SchemaFilter - XMLSchema event-logging:3

XML Fragments to Normalised XML

XML Fragments are where the input data looks like:

<Event>
  ...
</Event>
<Event>
  ...
</Event>

In other words, it is technically badly formed XML as it has no root element or declaration. This format is however easier for client systems to send as they can send multiple <Event> blocks in one stream (e.g. just appending them together in a rolled log file) but don’t need to wrap them with an outer <Events> element.

The XMLFragmentParser understands this format and will add the wrapping element to make well-formed XML. If the XML fragments are already in event-logging format then no Normalise XSLTFilter is required.

Source

XMLFragParser

Rec Count (read)

Split

ID

Decorate

SchemaFilter

Rec Count (write)

XMLWriter

StreamAppender

Configured Content

XMLFragParser - Content similar to:

<?xml version="1.1" encoding="utf-8"?>
<!DOCTYPE Records [
<!ENTITY fragment SYSTEM "fragment">]>
<Events
    xmlns="event-logging:3"
    xmlns:stroom="stroom"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="event-logging:3 file://event-logging-v3.4.2.xsd"
    Version="3.4.2">
&fragment;
</Events>

Decorate - An XSLT transforming event-logging:3 => event-logging:3.
SchemaFilter - XMLSchema event-logging:3

Handling Malformed Data

Cleaning Malformed XML data

In some cases client systems may send XML containing characters that are not supported by the XML standard. These can be removed using the InvalidXMLCharFilterReader .

The input data may also be known to contain other sets of characters that will cause problems in processing. The FindReplaceFilter can be used to remove/replace either a fixed string or a Regex pattern.

Source

InvalidXMLCharFilterReader

FindReplaceFilter

XMLParser

[Pipeline truncated]

Raw Streaming

In cases where you want to export the raw (or cooked) data from a feed you can have a very simply pipeline to pipe the source data directly to an appender. This may be so that the raw data can be ingested into another system for analysis. In this case the data is being written to disk using a file appender.

Source

FileAppender

Note

Be careful when specifying the directory structure for the FileAppender so that you don’t end up with too many files in one folder, which can cause some OS issues.

Indexing

XML to Stroom Lucene Index

This use case is for indexing XML event data that had already been normalised using one of the ingest pipelines above. The XSLTFilter is used to transform the event into records format, extracting the fields to be indexed from the event. The IndexingFilter reads the records XML and loads each one into Stroom’s internal Lucene index .

Source

XMLParser

Rec Count (read)

Split

ID

XSLTFilter

SchemaFilter

IndexingFilter

Rec Count (written)

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => records:2.
SchemaFilter - XMLSchema records:2

The records:2 XML looks something like this, with each <data> element representing an indexed field value.

<?xml version="1.1" encoding="UTF-8"?>
<records 
    xmlns="records:2"
    xmlns:stroom="stroom"
    xmlns:sm="stroom-meta"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="2.0">
  <record>
    <data name="StreamId" value="1997" />
    <data name="EventId" value="1" />
    <data name="Feed" value="MY_FEED" />
    <data name="EventTime" value="2010-01-01T00:00:00.000Z" />
    <data name="System" value="MySystem" />
    <data name="Generator" value="CSV" />
    <data name="IPAddress" value="1.1.1.1" />
    <data name="UserId" analyzer="KEYWORD" value="user1" />
    <data name="Action" value="Authenticate" />
    <data name="Description" value="Some message 1" />
  </record>
</records>

XML to Stroom Lucene Index (Dynamic)

Dynamic indexing in Stroom allows you to use the XSLT to define the fields that are being indexed and how each field should be indexed. This avoids having to define all the fields up front in the Index and allows for the creation of fields based on the actual data received. The only difference with normal indexing in Stroom is that is uses the DynamicIndexingFilter and rather than transforming the event into records:2 XML, it is transformed into index-documents:1 XML as shown in the example below.

<?xml version="1.1" encoding="UTF-8"?>
<index-documents
    xmlns="index-documents:1"
    xmlns:stroom="stroom"
    xmlns:sm="stroom-meta"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="index-documents:1 file://index-documents-v1.0.xsd"
    version="1.0">
  <document>
    <field><name>StreamId</name><type>Id</type><indexed>true</indexed><stored>true</stored><value>1997</value></field>
    <field><name>EventId</name><type>Id</type><indexed>true</indexed><stored>true</stored><value>1</value></field>
    <field><name>Feed</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>MY_FEED</value></field>
    <field><name>EventTime</name><type>Date</type><indexed>true</indexed><value>2010-01-01T00:00:00.000Z</value></field>
    <field><name>System</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>MySystem</value></field>
    <field><name>Generator</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>CSV</value></field>
    <field><name>IPAddress</name><type>Text</type><indexed>true</indexed><value>1.1.1.1</value></field>
    <field><name>UserId</name><type>Text</type><indexed>true</indexed><value>user1</value></field>
    <field><name>Action</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>Authenticate</value></field>
    <field><name>Description</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>Some message 1</value></field>
  </document>
</index-documents>

Source

XMLParser

Rec Count (read)

Split

ID

XSLTFilter

SchemaFilter

DynamicIndexingFilter

Rec Count (written)

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => index-documents:1.
SchemaFilter - XMLSchema index-documents:1

XML to an Elastic Search Index

This use case is for indexing XML event data that had already been normalised using one of the ingest pipelines above. The XSLTFilter is used to transform the event into records format, extracting the fields to be indexed from the event. The ElasticIndexingFilter reads the records XML and loads each one into an external Elasticsearch index .

Source

XMLParser

Rec Count (read)

Split

ID

XSLTFilter

SchemaFilter

ElasticIndexingFilter

Rec Count (written)

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => records:2.
SchemaFilter - XMLSchema records:2

Search Extraction

Search extraction is the process of combining the data held in the index with data obtained from the original indexed document, i.e. the event. Search extraction is useful when you do not want to store the whole of an event in the index (to reduce storage used) but still want to be able to access all the event data in a Dashboard/View. An extraction pipeline is required to combine data in this way. Search extraction pipelines are referenced in Dashboard and View settings.

Standard Lucene Index Extraction

This is a non-dynamic search extraction pipeline for a Lucene index.

Source

XMLParser

Split

ID

XSLTFilter

SearchResultOutputFilter

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => records:2.

Dynamic Lucene Index Extraction

This is a dynamic search extraction pipeline for a Lucene index.

Source

XMLParser

Split

ID

XSLTFilter

DynamicSearchResultOutputFilter

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => index-documents:1.

Data Egress

XML to CSV File

An recipe of writing normalised XML events (as produced by an ingest pipeline above) to a file, but in a flat file format like CSV. The XSLTFilter transforms the events XML into CSV data with XSLT including this:

<xsl:output method="text" omit-xml-declaration="yes" indent="no"/>

The TextWriter converts the XML character events into a stream of characters encoded using the desired output character encoding. The data is appended to a file on a file system, with one file per Stream.

Source

XMLParser

Rec Count (read)

Split

ID

XSLTFilter

Rec Count (written)

TextWriter

FileAppender

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => schemaless plain text.
SchemaFilter - XMLSchema records:2

XML to JSON Rolling File

This is similar to the above recipe for writing out CSV, except that the XSLTFilter converts the event XML into XML conforming to the https://www.w3.org/2013/XSL/json/ XMLSchema. The JSONWriter can read this format of XML and convert it into JSON using the desired character encoding. The RollingFileAppender will append the encoded JSON character data to a file on the file system that is rolled based on a size/time threshold.

Source

XMLParser

Rec Count (read)

Split

ID

XSLTFilter

SchemaFilter

Rec Count (written)

JSONWriter

RollingFileAppender

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => http://www.w3.org/2013/XSL/json.
SchemaFilter - XMLSchema http://www.w3.org/2013/XSL/json.

XML to HTTP Destination

This recipe is for sending normalised XML events to another system over HTTP. The HTTPAppender is configured with the URL and any TLS certificates/keys/credentials.

Source

XMLParser

Rec Count (read)

Split

Rec Count (written)

XMLWriter

HTTPAppender

Reference Data

Reference Loader

A typical pipeline for loading XML reference data (conforming to the reference-data:2 XMLSchema) into the reference data store. The ReferenceDataFilter reads the reference-data:2 format data and loads each entry into the appropriate map in the store.

As an example, the reference-data:2 XML for mapping userIDs to staff numbers looks something like this:

<?xml version="1.1" encoding="UTF-8"?>
<referenceData
    xmlns="reference-data:2"
    xmlns:evt="event-logging:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.1.xsd"
    version="2.0.1">
  <reference>
    <map>USER_ID_TO_STAFF_NO_MAP</map>
    <key>user1</key>
    <value>staff1</value>
  </reference>
  <reference>
    <map>USER_ID_TO_STAFF_NO_MAP</map>
    <key>user2</key>
    <value>staff2</value>
  </reference>
  ...
</referenceData>

Source

XMLParser

ReferenceDataFilter

Statistics

This recipe converts normalised XML data and converts it into statistic events (confirming to the statistics:4 XMLSchema). Stroom’s Statistic Stores are a way to store aggregated counts or averaged values over time periods. For example you may want counts of certain types of event, aggregated over fixed time buckets. Each XML event is transformed using the XSLTFilter to either return no output or a statistic event. An example of statistics:4 data for two statistic events is:

<?xml version="1.1" encoding="UTF-8"?>
<statistics
    xmlns="statistics:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="statistics:2 file://statistics-v2.0.xsd">
  <statistic>
    <time>2023-12-22T00:00:00.000Z</time>
    <count>1</count>
    <tags>
      <tag name="user" value="user1" />
    </tags>
  </statistic>
  <statistic>
    <time>2023-12-23T00:00:00.000Z</time>
    <count>5</count>
    <tags>
      <tag name="user" value="user6" />
    </tags>
  </statistic>
</statistics>

Source

XMLParser

Split

XSLTFilter

SchemaFilter

StatisticsFilter

Configured Content

XSLTFilter - An XSLT transforming event-logging:3 => statistics:2.
SchemaFilter - XMLSchema statistics:2.

9.2 - Parser

Parsing input data.

The following capabilities are available to parse input data:

XML - XML input can be parsed with the XML parser.
XML Fragment - Treat input data as an XML fragment, i.e. XML that does not have an XML declaration or root elements.
Data Splitter - Delimiter and regular expression based language for turning non XML data into XML (e.g. CSV)

9.2.1 - XML Fragments

Handling XML data without root level elements.

Some input XML data may be missing an XML declaration and root level enclosing elements. This data is not a valid XML document and must be treated as an XML fragment. To use XML fragments the input type for a translation must be set to ‘XML Fragment’. A fragment wrapper must be defined in the XML conversion that tells Stroom what declaration and root elements to place around the XML fragment data.

Here is an example:

<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE records [
<!ENTITY fragment SYSTEM "fragment">
]>
<records
  xmlns="records:2"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="records:2 file://records-v2.0.xsd"
  version="2.0">
  &fragment;
</records>

During conversion Stroom replaces the fragment text entity with the input XML fragment data. Note that XML fragments must still be well formed so that they can be parsed correctly.

9.3 - XSLT Conversion

Using Extensible Stylesheet Language Transformations (XSLT) to transform data.

XSLT is a language that is typically used for transforming XML documents into either a different XML document or plain text. XSLT is key part of Stroom’s pipeline processing as it is used to normalise bespoke events into a common XML audit event document conforming to the event-logging XML Schema .

Once a text file has been converted into intermediary XML (or the feed is already XML), XSLT is used to translate the XML into the event-logging XML format.

The XSLTFilter pipeline element defines the XSLT document and is used to do the transformation of the input XML into XML or plain text. You can have multiple XSLTFilter elements in a pipeline if you want to break the transformation into steps, or wish to have simpler XSLTs that can be reused.

Raw Event Feeds are typically translated into the event-logging:3 schema and Raw Reference into the reference-data:2 schema.

9.3.1 - XSLT Basics

The basics of using XSLT and the XSLTFilter element.

XSLT is a very powerful language and allows the user to perform very complex transformations of XML data. This documentation does not aim to document how to write XSLT documents, for that, we strongly recommend you refer to online references (e.g. W3Schools or obtain a book covering XSLT 2.0 and XPath ). It does however aim to document aspects of XSLT that are specific to the use of XSLT in Stroom.

Examples

Event Normalisation

Here is an example XSLT document that transforms XML data in the records:2 namespace (which is the output of the DSParser element) into event XML in the event-logging:3 namespace. It is an example of event normalisation from a bespoke format.

Warning

This example aims to show some typical uses of XSLT in a typical Stroom use case. It does not necessarily represent best practice in terms of creation of a normalised event.

<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet 
    xpath-default-namespace="records:2" 
    xmlns="event-logging:3" 
    xmlns:stroom="stroom" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    version="2.0">

  <!-- Match the root element -->
  <xsl:template match="records">
    <Events 
        xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd" 
        Version="3.0.0">
      <xsl:apply-templates />
    </Events>
  </xsl:template>

  <!-- Match each 'record' element -->
  <xsl:template match="record">
    <xsl:variable name="user" select="data[@name='User']/@value" />
    <Event>
      <xsl:call-template name="header" />
      <xsl:value-of select="stroom:log('info', concat('Processing user: ', $user))"/>
      <EventDetail>
        <TypeId>0001</TypeId>
        <Description>
          <xsl:value-of select="data[@name='Message']/@value" />
        </Description>
        <Authenticate>
          <Action>Logon</Action>
          <LogonType>Interactive</LogonType>
          <User>
            <Id>
              <xsl:value-of select="$user" />
            </Id>
          </User>
          <Data Name="FileNo">
            <xsl:attribute name="Value" select="data[@name='FileNo']/@value" />
          </Data>
          <Data Name="LineNo">
            <xsl:attribute name="Value" select="data[@name='LineNo']/@value" />
          </Data>
        </Authenticate>
      </EventDetail>
    </Event>
  </xsl:template>

  <xsl:template name="header">
    <xsl:variable name="date" select="data[@name='Date']/@value" />
    <xsl:variable name="time" select="data[@name='Time']/@value" />
    <xsl:variable name="dateTime" select="concat($date, $time)" />
    <xsl:variable name="formattedDateTime" select="stroom:format-date($dateTime, 'dd/MM/yyyyHH:mm:ss')" />
    <xsl:variable name="user" select="data[@name='User']/@value" />
    <EventTime>
      <TimeCreated>
        <xsl:value-of select="$formattedDateTime" />
      </TimeCreated>
    </EventTime>
    <EventSource>
      <System>
        <Name>Test</Name>
        <Environment>Test</Environment>
      </System>
      <Generator>CSV</Generator>
      <Device>
        <IPAddress>1.1.1.1</IPAddress>
        <MACAddress>00-00-00-00-00-00</MACAddress>
        <xsl:variable name="location" select="stroom:lookup('FILENO_TO_LOCATION_MAP', data[@name='FileNo']/@value, $formattedDateTime)" />
        <xsl:if test="$location">
          <xsl:copy-of select="$location" />
        </xsl:if>
        <Data Name="Zone1">
          <xsl:attribute name="Value" select="stroom:lookup('IPToLocation', stroom:numeric-ip('192.168.1.1'))" />
        </Data>
      </Device>
      <User>
        <Id>
          <xsl:value-of select="$user" />
        </Id>
      </User>
    </EventSource>
  </xsl:template>
</xsl:stylesheet>

Reference Data

Here is an example of transforming Reference Data in the records:2 namespace (which is the output of the DSParser element) into XML in the reference-data:2 namespace that is suitable for loading using the ReferenceDataFilter

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet 
    xpath-default-namespace="records:2" 
    xmlns="reference-data:2" 
    xmlns:evt="event-logging:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <xsl:template match="records">
        <referenceData 
           xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.1.xsd event-logging:3 file://event-logging-v3.0.0.xsd"
           version="2.0.1">
            <xsl:apply-templates/>
        </referenceData>
    </xsl:template>

    <xsl:template match="record">
        <reference>
            <map>USER_ID_TO_STAFF_NO_MAP</map>
            <key><xsl:value-of select="data[@name='userId']/@value"/></key>
            <value><xsl:value-of select="data[@name='staffNo']/@value"/></value>
        </reference>
    </xsl:template>
    
</xsl:stylesheet>

Identity Transformation

If you want an XSLT to decorate an Events XML document with some additional data or to change it slightly without changing its namespace then a good starting point is the identity transformation.

<xsl:stylesheet 
    version="1.0" 
    xpath-default-namespace="event-logging:3" 
    xmlns="event-logging:3" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Match Root Object -->
  <xsl:template match="Events">
    <Events 
        xmlns="event-logging:3" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xsi:schemaLocation="event-logging:3 file://event-logging-v3.4.2.xsd" 
        Version="3.4.2">

      <xsl:apply-templates />
    </Events>
  </xsl:template>
  

  <!-- Whenever you match any node or any attribute -->
  <xsl:template match="node( )|@*">

    <!-- Copy the current node -->
    <xsl:copy>

      <!-- Including any attributes it has and any child nodes -->
      <xsl:apply-templates select="@*|node( )" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

This XSLT will copy every node and attribute as they are, returning the input document completely un-changed. You can then add additional templates to match on specific elements and modify them, for example decorating a user’s UserDetails elements with value obtained from a reference data lookup on a user ID.

Note

You can insert this identity skeleton into an XSLT editor using this editor snippet.

`<xsl:message>`

Stroom supports the standard <xsl:message> element from the http://www.w3.org/1999/XSL/Transform . This element behaves in a similar way to the stroom:log() XSLT function. The element text is logged to the Error stream with a default severity of ERROR.

A child element can optionally be used to set the severity level (one of FATAL|ERROR|WARN|INFO). The namespace of this element does not matter. You can also set the attribute terminate="yes" to log the message at severity FATAL and halt processing of that stream part. If the stream is multi-part then processing will continue with the next part.

Note

Setting terminate="yes" will trump any severity defined by a child element. It will always be logged at FATAL.

The following are some examples of using <xsl:message>.

<!-- Log a message using default severity of ERROR -->
<xsl:message>Invalid length</xsl:message>

<!-- terminate="yes" means log the message as a FATAL ERROR and halt processing of the stream part -->
<xsl:message terminate="yes">Invalid length</xsl:message>

<!-- Log a message with a child element name specifying the severity. -->
<xsl:message>
  <warn>Invalid length</warn>
</xsl:message>

<!-- Log a message with a child element name specifying the severity. -->
<xsl:message>
  <info>Invalid length</info>
</xsl:message>

<!-- Log a message, specifying the severity and using a dynamic value. -->
<xsl:message>
  <info>
    <xsl:value-of select="concat('User ID ', $userId, ' is invalid')" />
  </info>
</xsl:message>

9.3.2 - XSLT Functions

Custom XSLT functions available in Stroom.

By including the following namespace:

xmlns:stroom="stroom"

E.g.

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
    xmlns="event-logging:3"
    xmlns:stroom="stroom"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="2.0">

The following functions are available to aid your translation:

bitmap-lookup(String map, String key) - Bitmap based look up against reference data map using the period start time
bitmap-lookup(String map, String key, String time) - Bitmap based look up against reference data map using a specified time, e.g. the event time
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings) - Bitmap based look up against reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace) - Bitmap based look up against reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup and get trace information for the path taken to resolve the lookup.
cidr-to-numeric-ip-range() - Converts a CIDR IP address range to an array of numeric IP addresses representing the start and end addresses of the range.
classification() - The classification of the feed for the data being processed
col-from() - The column in the input that the current record begins on (can be 0).
col-to() - The column in the input that the current record ends at.
current-time() - The current system time
current-user() - The current user logged into Stroom (only relevant for interactive use, e.g. search)
decode-url(String encodedUrl) - Decode the provided url.
dictionary(String name) - Loads the contents of the named dictionary for use within the translation
encode-url(String url) - Encode the provided url.
feed-attribute(String attributeKey) - NOTE: This function is deprecated, use meta(String key) instead. The value for the supplied feed attributeKey.
feed-name() - Name of the feed for the data being processed
fetch-json(String url) - Simplistic version of http-call that sends a request to the passed url and converts the JSON response body to XML using json-to-xml. Currently does not support SSL configuration like http-call does.
format-date(String date, String pattern) - Format a date that uses the specified pattern using the default time zone
format-date(String date, String pattern, String timeZone) - Format a date that uses the specified pattern with the specified time zone
format-date(String date, String patternIn, String timeZoneIn, String patternOut, String timeZoneOut) - Parse a date with the specified input pattern and time zone and format the output with the specified output pattern and time zone
format-date(String milliseconds) - Format a date that is specified as a number of milliseconds since a standard base time known as “the epoch”, namely January 1, 1970, 00:00:00 GMT
get(String key) - Returns the value associated with a key that has been stored in a map using the put() function. The map is in the scope of the current pipeline process so values do not live after the stream has been processed.
hash(String value) - Hash a string value using the default SHA-256 algorithm and no salt
hash(String value, String algorithm, String salt) - Hash a string value using the specified hashing algorithm and supplied salt value. Supported hashing algorithms include SHA-256, SHA-512, MD5.
hex-to-dec(String hex) - Convert hex to dec representation.
hex-to-oct(String hex) - Convert hex to oct representation.
hex-to-string(String hex, String textEncoding) - Convert hex to string using the specified text encoding.
host-address(String hostname) - Convert a hostname into an IP address.
host-name(String ipAddress) - Convert an IP address into a hostname.
http-call(String url, String headers, String mediaType, String data, String clientConfig) - Makes an HTTP(S) request to a remote server.
ip-in-cidr(String ipAddress, String cidr) - Return whether an IPv4 address is within the specified CIDR (e.g. 192.168.1.0/24).
json-to-xml(String json) - Returns an XML representation of the supplied JSON value for use in XPath expressions
line-from() - The line in the input that the current record begins on (1 based).
line-to() - The line in the input that the current record ends at.
link(String url) - Creates a stroom dashboard table link.
link(String title, String url) - Creates a stroom dashboard table link.
link(String title, String url, String type) - Creates a stroom dashboard table link.
log(String severity, String message) - Logs a message to the processing log with the specified severity
lookup(String map, String key) - Look up a reference data map using the period start time
lookup(String map, String key, String time) - Look up a reference data map using a specified time, e.g. the event time
lookup(String map, String key, String time, Boolean ignoreWarnings) - Look up a reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup
lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace) - Look up a reference data map using a specified time, e.g. the event time, ignore any warnings generated by a failed lookup and get trace information for the path taken to resolve the lookup.
meta(String key) - Lookup a meta data value for the current stream using the specified key. The key can be Feed, StreamType, CreatedTime, EffectiveTime, Pipeline or any other attribute supplied when the stream was sent to Stroom, e.g. meta(‘System’).
meta-keys() - Returns an array of meta keys for the current stream. Each key can then be used to retrieve its corresponding meta value, by calling meta($key).
numeric-ip(String ipAddress) - Convert an IP address to a numeric representation for range comparison
part-no() - The current part within a multi part aggregated input stream (AKA the substream number) (1 based)
parse-uri(String URI) - Returns an XML structure of the URI providing authority, fragment, host, path, port, query, scheme, schemeSpecificPart, and userInfo components if present.
pipeline-name() - Get the name of the pipeline currently processing the stream.
pointIsInsideXYPolygon(Number xPos, Number yPos, Number[] xPolyData, Number[] yPolyData) - Get the name of the pipeline currently processing the stream.
random() - Get a system generated random number between 0 and 1.
record-no() - The current record number within the current part (substream) (1 based).
search-id() - Get the id of the batch search when a pipeline is processing as part of a batch search
source() - Returns an XML structure with the stroom-meta namespace detailing the current source location.
source-id() - Get the id of the current input stream that is being processed
stream-id() - An alias for source-id included for backward compatibility.
pipeline-name() - Name of the current processing pipeline using the XSLT
put(String key, String value) - Store a value for use later on in the translation

bitmap-lookup()

The bitmap-lookup() function looks up a bitmap key from reference or context data a value (which can be an XML node set) for each set bit position and adds it to the resultant XML.

bitmap-lookup(String map, String key)
bitmap-lookup(String map, String key, String time)
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings)
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace)

map - The name of the reference data map to perform the lookup against.
key - The bitmap value to lookup. This can either be represented as a decimal integer (e.g. 14) or as hexadecimal by prefixing with 0x (e.g 0xE).
time - Determines which set of reference data was effective at the requested time. If no reference data exists with an effective time before the requested time then the lookup will fail. Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
trace - If true, additional trace information is output as INFO messages.

If the look up fails no result will be returned.

The key is a bitmap expressed as either a decimal integer or a hexidecimal value, e.g. 14/0xE is 1110 as a binary bitmap. For each bit position that is set, (i.e. has a binary value of 1) a lookup will be performed using that bit position as the key. In this example, positions 1, 2 & 3 are set so a lookup would be performed for these bit positions. The result of each lookup for the bitmap are concatenated together in bit position order, separated by a space.

If ignoreWarnings is true then any lookup failures will be ignored and it will return the value(s) for the bit positions it was able to lookup.

This function can be useful when you have a set of values that can be represented as a bitmap and you need them to be converted back to individual values. For example if you have a set of additive account permissions (e.g Admin, ManageUsers, PerformExport, etc.), each of which is associated with a bit position, then a user’s permissions could be defined as a single decimal/hex bitmap value. Thus a bitmap lookup with this value would return all the permissions held by the user.

For example the reference data store may contain:

Key (Bit position)	Value
0	Administrator
1	Manage_Users
2	Perform_Export
3	View_Data
4	Manage_Jobs
5	Delete_Data
6	Manage_Volumes

The following are example lookups using the above reference data:

Lookup Key (decimal)	Lookup Key (Hex)	Bitmap	Result
`0`	`0x0`	`0000000`	-
`1`	`0x1`	`0000001`	`Administrator`
`74`	`0x4A`	`1001010`	`Manage_Users View_Data Manage_Volumes`
`2`	`0x2`	`0000010`	`Manage_Users`
`96`	`0x60`	`1100000`	`Delete_Data Manage_Volumes`

cidr-to-numeric-ip-range()

Converts a CIDR IP address range to an array of numeric IP addresses representing the start and end (broadcast) of the range.

When storing the result in a variable, ensure you indicate the type as a string array (xs:string*), as shown in the below example.

Example XSLT

<xsl:variable name="range" select="stroom:cidr-to-numeric-ip-range('192.168.1.0/24')" as="xs:string*" />
<Range>
  <Start><xsl:value-of select="$range[1]" /></Start>
  <End><xsl:value-of select="$range[2]" /></End>
</Range>

Example output

<Range>
  <Start>3232235776</Start>
  <End>3232236031</End>
</Range>

dictionary()

The dictionary() function gets the contents of the specified dictionary for use during translation. The main use for this function is to allow users to abstract the management of a set of keywords from the XSLT so that it is easier for some users to make quick alterations to a dictionary that is used by some XSLT, without the need for the user to understand the complexities of XSLT.

format-date()

The format-date() function combines parsing and formatting of date strings. In its simplest form it will parse a date string and return the parsed date in the XML standard Date Format. It also supports supplying a custom format pattern to output the parsed date in a specified format.

Function Signatures

The following are the possible forms of the format-date function.

<!-- Convert time in millis to standard date format -->
format-date(long millisSinceEpoch)

<!-- Convert inputDate to standard date format -->
format-date(String inputDate, String inputPattern)

<!-- Convert inputDate to standard date format using specified input time zone -->
format-date(String inputDate, String inputPattern, String inputTimeZone)

<!-- Convert inputDate to a custom date format using optional input time zone inputTimeZone -->
format-date(String inputDate, String inputPattern, String inputTimeZone, String outputPattern)

<!-- Convert inputDate to a custom date format using optional input time zone and a specified output time zone -->
format-date(String inputDate, String inputPattern, String inputTimeZone, String outputPattern, String outputTimeZone)

millisSinceEpoch - The date/time expressed as the number of milliseconds since the UNIX epoch .
inputDate - The input date string, e.g. 2009/08/01 12:34:11.
inputPattern - The pattern that defines the structure of inputDate (see Custom Date Formats).
inputTimeZone - Optional time zone of the inputDate. If null then the UTC/Zulu time zone will be used. If inputTimeZone is present, the inputPattern must not include the time zone pattern tokens (z and Z).
outputPattern - The pattern that defines the format of the output date (see Custom Date Formats).
inputTimeZone - Optional time zone of the output date. If null then the UTC/Zulu time zone will be used.

Time Zones

The following is a list of some common time zone values:

Values	Zone Name
`GMT/BST`	A Stroom specific value for UK daylight saving time (see below)
`UTC`, `UCT`, `Zulu`, `Universal`, `+00:00`, `-00:00`, `+00`, `+0`	Coordinated Universal Time (UTC)
`GMT`, `GMT0`, `Greenwich`	Greenwich Mean Time (GMT)
`GB`, `GB-Eire`, `Europe/London`	British Time
`NZ`, `Pacific/Auckland`	New Zealand Time
`Australia/Canberra`, `Australia/Sydney`	Eastern Australia Time
`CET`	Central European Time
`EET`	Eastern European Time
`Canada/Atlantic`	Atlantic Time
`Canada/Central`	Central Time
`Canada/Pacific`	Pacific Time
`US/Central`	Central Time
`US/Eastern`	Eastern Time
`US/Mountain`	Mountain Time
`US/Pacific`	Pacific Time
`+02:00`, `+02`, `+2`	UTC +2hrs
`-03:00`, `-03`, `-3`	UTC -3hrs

A special time zone value of GMT/BST can be used when the inputDate is in local wall clock time with time zone information. In this case, the date/time will be used to determine whether the date is in British Summer Time or in GMT and adjust the output accordingly. See the examples below.

Parsing Examples

The following table shows various examples of calls to stroom:format-date() with their output. The stroom:format-date part has been omitted for brevity.

<!-- Date in millis since UNIX epoch -->
stroom:format-date('1269270011640')
-> '2010-03-22T15:00:11.640Z'

<!-- Simple date UK style date -->
stroom:format-date('29/08/24', 'dd/MM/yy')
-> '2024-08-29T00:00:00.000Z'

<!-- Simple date US style date -->
stroom:format-date('08/29/24', 'MM/dd/yy')
-> '2024-08-29T00:00:00.000Z'

<!-- ISO date with no delimiters -->
stroom:format-date('20010801184559', 'yyyyMMddHHmmss')
-> '2001-08-01T18:45:59.000Z'

<!-- Standard output, no TZ -->
stroom:format-date('2001/08/01 18:45:59', 'yyyy/MM/dd HH:mm:ss')
-> '2001-08-01T18:45:59.000Z'

<!-- Standard output, date only, with TZ -->
stroom:format-date('2001/08/01', 'yyyy/MM/dd', '-07:00')
-> '2001-08-01T07:00:00.000Z'

<!-- Standard output, with TZ -->
stroom:format-date('2001/08/01 01:00:00', 'yyyy/MM/dd HH:mm:ss', '-08:00')
-> '2001-08-01T09:00:00.000Z'

<!-- Standard output, with TZ -->
stroom:format-date('2001/08/01 01:00:00', 'yyyy/MM/dd HH:mm:ss', '+01:00')
-> '2001-08-01T00:00:00.000Z'

<!-- Single digit day and month, no padding -->
stroom:format-date('2001 8 1', 'yyyy MM dd')
-> '2001-08-01T00:00:00.000Z'

<!-- Double digit day and month, no padding -->
stroom:format-date('2001 12 28', 'yyyy MM dd')
-> '2001-12-28T00:00:00.000Z'

<!-- Single digit day and month, with optional padding -->
stroom:format-date('2001  8  1', 'yyyy ppMM ppdd')
-> '2001-08-01T00:00:00.000Z'

<!-- Double digit day and month, with optional padding -->
stroom:format-date('2001 12 31', 'yyyy ppMM ppdd')
-> '2001-12-31T00:00:00.000Z'

<!-- With abbreviated day of week month -->
stroom:format-date('Wed Aug 14 2024', 'EEE MMM dd yyyy')
-> '2024-08-14T00:00:00.000Z'

<!-- With long form day of week and month -->
stroom:format-date('Wednesday August 14 2024', 'EEEE MMMM dd yyyy')
-> '2024-08-14T00:00:00.000Z'

<!-- With 12 hour clock, AM -->
stroom:format-date('Wed Aug 14 2024 10:32:58 AM', 'E MMM dd yyyy hh:mm:ss a')
-> '2024-08-14T10:32:58.000Z'

<!-- With 12 hour clock, PM (lower case) -->
stroom:format-date('Wed Aug 14 2024 10:32:58 pm', 'E MMM dd yyyy hh:mm:ss a')
-> '2024-08-14T22:32:58.000Z'

<!-- Using minimal symbols -->
stroom:format-date('2001 12 31 22:58:32.123', 'y M d H:m:s.S')
-> '2001-12-31T22:58:32.123Z'

<!-- Optional time portion, with time -->
stroom:format-date('2001/12/31 22:58:32.123', 'yyyy/MM/dd[ HH:mm:ss.SSS]')
-> '2001-12-31T22:58:32.123Z'

<!-- Optional time portion, without time -->
stroom:format-date('2001/12/31', 'yyyy/MM/dd[ HH:mm:ss.SSS]')
-> '2001-12-31T00:00:00.000Z'

<!-- Optional millis portion, with millis -->
stroom:format-date('2001/12/31 22:58:32.123', 'yyyy/MM/dd HH:mm:ss[.SSS]')
-> '2001-12-31T22:58:32.123Z'

<!-- Optional millis portion, without millis -->
stroom:format-date('2001/12/31 22:58:32', 'yyyy/MM/dd HH:mm:ss[.SSS]')
-> '2001-12-31T22:58:32.000Z'

<!-- Optional millis/nanos portion, with nanos -->
stroom:format-date('2001/12/31 22:58:32.123456', 'yyyy/MM/dd HH:mm:ss[.SSS]')
-> '2001-12-31T22:58:32.123Z'

<!-- Fixed text -->
stroom:format-date('Date: 2001/12/31 Time: 22:58:32.123', ''Date: 'yyyy/MM/dd 'Time: 'HH:mm:ss.SSS')
-> '2001-12-31T22:58:32.123Z'

<!-- GMT/BST date that is BST -->
stroom:format-date('2009/06/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', 'GMT/BST')
-> '2009-06-01T11:34:11.000Z'

<!-- GMT/BST date that is GMT -->
stroom:format-date('2009/02/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', 'GMT/BST')
-> '2009-02-01T12:34:11.000Z'

<!-- Time zone offset -->
stroom:format-date('2009/02/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', '+01:00')
-> '2009-02-01T11:34:11.000Z'

<!-- Named timezone -->
stroom:format-date('2009/02/01 23:34:11', 'yyyy/MM/dd HH:mm:ss', 'US/Eastern')
-> '2009-02-02T04:34:11.000Z'

Note

Parsing is done in lenient mode so, the count of each symbol is not critical, e.g. you can parse the year 2024 with y, yy, yyy or yyyy. Despite this, it is advisable to use a pattern that matches the known format of the input dates, e.g. in this example yyyy, to avoid confusing with anyone else reading your XSLT.

The count of each symbol is however critical when it comes to formatting.

Formatting Examples

<!-- Specific output, no input or output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', null, 'E dd MMM yyyy HH:mm (s 'secs')')
-> 'Wed 01 Aug 2001 14:30 (59 secs)'

<!-- Specific output, UTC input, no output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', 'UTC', 'E dd MMM yyyy HH:mm (s 'secs')')
-> 'Wed 01 Aug 2001 14:30 (59 secs)'

<!-- Specific output, no output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', '+01:00', 'E dd MMM yyyy HH:mm (s 'secs')')
-> 'Wed 01 Aug 2001 13:30 (59 secs)'

<!-- Specific output, with input and output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', '+01:00', 'E dd MMM yyyy HH:mm', '+02:00')
-> 'Wed 01 Aug 2001 15:30'

<!-- Padded 12 hour clock output -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'E dd MMM yyyy pph:ppm:pps a')
-> 'Wed 01 Aug 2001  2: 7: 5 PM'

<!-- Padded 12 hour clock output -->
stroom:format-date('2001/08/01 22:27:25.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'E dd MMM yyyy pph:ppm:pps a')
-> 'Wed 01 Aug 2001 10:27:25 PM'

<!-- Non-Padded 12 hour clock output -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'E dd MMM yyyy h:m:s a')
-> 'Wed 01 Aug 2001 2:7:5 PM'

<!-- Long form text -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'EEEE d MMMM yyyy HH:mm:ss')
-> 'Wednesday 1 August 2001 14:07:05'

Reference Time

When parsing a date string that does not contain a full zoned date and time, certain assumptions will be made.

If there is no time zone in inputDate and no inputTimeZone argument has been passed then the time zone of the input date will be assumed to be in the UTC time zone.

If any of the date parts are not present, e.g. an input of 28 Oct then Stroom will use a reference date to fill in the gaps. The reference date is the first of these values that is non-null

The create time of the stream being processed by the XSLT.
The current time, i.e. now().

For example for a call of stroom:format-date('28 Oct', 'dd MMM') and a stream create time of 2024, it will return 2024-10-28T00:00:00.000Z.

hex-to-string()

For a hexadecimal input string, decode it using the specified character set to its original form.

Valid character set names are listed at: https://www.iana.org/assignments/character-sets/character-sets.xhtml. Common examples are: ASCII, UTF-8 and UTF-16.

Example

Input

<string><xsl:value-of select="hex-to-string('74 65 73 74 69 6e 67 20 31 32 33', 'UTF-8')" /></string>

Output

<string>testing 123</string>

http-call()

Executes an HTTP(S) request to a remote server and returns the response.

http-call(String url, [String headers], [String mediaType], [String data], [String clientConfig])

The arguments are as follows:

url - The URL to send the request to.
headers - A newline (
) delimited list of HTTP headers to send. Each header is of the form key:value.
mediaType - The media (or MIME) type of the request data, e.g. application/json. If not set application/json; charset=utf-8 will be used.
data - The data to send. The data type should be consistent with mediaType. Supplying the data argument means a POST request method will be used rather than the default GET.
clientConfig - A JSON object containing the configuration for the HTTP client to use, including any SSL configuration.

The function returns the response as XML with namespace stroom-http. The XML includes the body of the response in addition to the status code, success status, message and any headers.

`clientConfig`

The client can be configured using a JSON object containing various optional configuration items. The following is an example of the client configuration object with all keys populated.

{
  "callTimeout": "PT30S",
  "connectionTimeout": "PT30S",
  "followRedirects": false,
  "followSslRedirects": false,
  "httpProtocols": [
    "http/2",
    "http/1.1"
  ],
  "readTimeout": "PT30S",
  "retryOnConnectionFailure": true,
  "sslConfig": {
    "keyStorePassword": "password",
    "keyStorePath": "/some/path/client.jks",
    "keyStoreType": "JKS",
    "trustStorePassword": "password",
    "trustStorePath": "/some/path/ca.jks",
    "trustStoreType": "JKS",
    "sslProtocol": "TLSv1.2",
    "hostnameVerificationEnabled": false
  },
  "writeTimeout": "PT30S"
}

If you are using two-way SSL then you may need to set the protocol to HTTP/1.1.

  "httpProtocols": [
    "http/1.1"
  ],

Example output

The following is an example of the XML returned from the http-call function:

<response xmlns="stroom-http">
  <successful>true</successful>
  <code>200</code>
  <message>OK</message>
  <headers>
    <header>
      <key>cache-control</key>
      <value>public, max-age=600</value>
    </header>
    <header>
      <key>connection</key>
      <value>keep-alive</value>
    </header>
    <header>
      <key>content-length</key>
      <value>108</value>
    </header>
    <header>
      <key>content-type</key>
      <value>application/json;charset=iso-8859-1</value>
    </header>
    <header>
      <key>date</key>
      <value>Wed, 29 Jun 2022 13:03:38 GMT</value>
    </header>
    <header>
      <key>expires</key>
      <value>Wed, 29 Jun 2022 13:13:38 GMT</value>
    </header>
    <header>
      <key>server</key>
      <value>nginx/1.21.6</value>
    </header>
    <header>
      <key>vary</key>
      <value>Accept-Encoding</value>
    </header>
    <header>
      <key>x-content-type-options</key>
      <value>nosniff</value>
    </header>
    <header>
      <key>x-frame-options</key>
      <value>sameorigin</value>
    </header>
    <header>
      <key>x-xss-protection</key>
      <value>1; mode=block</value>
    </header>
  </headers>
  <body>{"buildDate":"2022-06-29T09:22:41.541886118Z","buildVersion":"SNAPSHOT","upDate":"2022-06-29T11:06:26.869Z"}</body>
</response>

Example usage

This is an example of how to use the function call in your XSLT. It is recommended to place the clientConfig JSON in a Dictionary to make it easier to edit and to avoid having to escape all the quotes.

  ...
  <xsl:template match="record">
    ...
    <!-- Read the client config from a Dictionary into a variable -->
    <xsl:variable name="clientConfig" select="stroom:dictionary('HTTP Client Config')" />
    <!-- Make the HTTP call and store the response in a variable -->
    <xsl:variable name="response" select="stroom:http-call('https://reqbin.com/echo', null, null, null, $clientConfig)" />
    <!-- Apply 'response' templates to the response -->
    <xsl:apply-templates mode="response" select="$response" />
    ...
  </xsl:template>
  
  <xsl:template mode="response" match="http:response">
    <!-- Extract just the body of the response -->
    <val><xsl:value-of select="./http:body/text()" /></val>
  </xsl:template>
  ...

link()

Create a string that represents a hyperlink for display in a dashboard table.

link(url)
link(title, url)
link(title, url, type)

Example

link('https://www.somehost.com/somepath')
> [https://www.somehost.com/somepath](https://www.somehost.com/somepath)
link('Click Here','https://www.somehost.com/somepath')
> [Click Here](https://www.somehost.com/somepath)
link('Click Here','https://www.somehost.com/somepath', 'dialog')
> [Click Here](https://www.somehost.com/somepath){dialog}
link('Click Here','https://www.somehost.com/somepath', 'dialog|Dialog Title')
> [Click Here](https://www.somehost.com/somepath){dialog|Dialog Title}

Type can be one of:

dialog : Display the content of the link URL within a stroom popup dialog.
tab : Display the content of the link URL within a stroom tab.
browser : Display the content of the link URL within a new browser tab.
dashboard : Used to launch a stroom dashboard internally with parameters in the URL.

If you wish to override the default title or URL of the target link in either a tab or dialog you can. Both dialog and tab types allow titles to be specified after a |, e.g. dialog|My Title.

log()

The log() function writes a message to the processing log with the specified severity. Severities of INFO, WARN, ERROR and FATAL can be used. Severities of ERROR and FATAL will result in records being omitted from the output if a RecordOutputFilter is used in the pipeline. The counts for RecWarn, RecError will be affected by warnings or errors generated in this way therefore this function is useful for adding business rules to XML output.

E.g. Warn if a SID is not the correct length.

<xsl:if test="string-length($sid) != 7">
  <xsl:value-of select="stroom:log('WARN', concat($sid, ' is not the correct length'))"/>
</xsl:if>

The same functionality can also be achieved using the standard xsl:message element, see <xsl:message>

lookup()

The lookup() function looks up from reference or context data a value (which can be an XML node set) and adds it to the resultant XML.

lookup(String map, String key)
lookup(String map, String key, String time)
lookup(String map, String key, String time, Boolean ignoreWarnings)
lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace)

map - The name of the reference data map to perform the lookup against.
key - The key to lookup. The key can be a simple string, an integer value in a numeric range or a nested lookup key.
time - Determines which set of reference data was effective at the requested time. If no reference data exists with an effective time before the requested time then the lookup will fail. Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
trace - If true, additional trace information is output as INFO messages.

If the look up fails no result will be returned. By testing the result a default value may be output if no result is returned.

E.g. Look up a SID given a PF

<xsl:variable name="pf" select="PFNumber"/>
<xsl:if test="$pf">
   <xsl:variable name="sid" select="stroom:lookup('PF_TO_SID', $pf, $formattedDateTime)"/>

   <xsl:choose>
      <xsl:when test="$sid">
         <User>
             <Id><xsl:value-of select="$sid"/></Id>
         </User>
      </xsl:when>
      <xsl:otherwise>
         <data name="PFNumber">
            <xsl:attribute name="Value"><xsl:value-of select="$pf"/></xsl:attribute>
         </data>
      </xsl:otherwise>
   </xsl:choose>
</xsl:if>

Range lookups

Reference data entries can either be stored with single string key or a key range that defines a numeric range, e.g 1-100. When a lookup is preformed the passed key is looked up as if it were a normal string key. If that lookup fails Stroom will try to convert the key to an integer (long) value. If it can be converted to an integer than a second lookup will be performed against entries with key ranges to see if there is a key range that includes the requested key.

Range lookups can be used for looking up an IP address where the reference data values are associated with ranges of IP addresses. In this use case, the IP address must first be converted into a numeric value using numeric-ip(), e.g:

stroom:lookup('IP_TO_LOCATION', numeric-ip($ipAddress))

Similarly the reference data must be stored with key ranges whose bounds were created using this function.

Nested Maps

The lookup function allows you to perform chained lookups using nested maps. For example you may have a reference data map called USER_ID_TO_LOCATION that maps user IDs to some location information for that user and a map called USER_ID_TO_MANAGER that maps user IDs to the user ID of their manager. If you wanted to decorate a user’s event with the location of their manager you could use a nested map to achieve the lookup chain. To perform the lookup set the map argument to the list of maps in the lookup chain, separated by a /, e.g. USER_ID_TO_MANAGER/USER_ID_TO_LOCATION.

This will perform a lookup against the first map in the list using the requested key. If a value is found the value will be used as the key in a lookup against the next map. The value from each map lookup is used as the key in the next map all the way down the chain. The value from the last lookup is then returned as the result of the lookup() call. If no value is found at any point in the chain then that results in no value being returned from the function.

In order to use nested map lookups each intermediate map must contain simple string values. The last map in the chain can either contain string values or XML fragment values.

put() and get()

You can put values into a map using the put() function. These values can then be retrieved later using the get() function. Values are stored against a key name so that multiple values can be stored. These functions can be used for many purposes but are most commonly used to count a number of records that meet certain criteria.

The map is in the scope of the current pipeline process so values do not live after the stream has been processed. Also, the map will only contain entries that were put() within the current pipeline process.

An example of how to count records is shown below:

<!-- Get the current record count -->
<xsl:variable name="currentCount" select="number(s:get('count'))" />

<!-- Increment the record count -->
<xsl:variable name="count">
  <xsl:choose>
    <xsl:when test="$currentCount">
      <xsl:value-of select="$currentCount + 1" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="1" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>

<!-- Store the count for future retrieval -->
<xsl:value-of select="stroom:put('count', $count)" />

<!-- Output the new count -->
<data name="Count">
  <xsl:attribute name="Value" select="$count" />
</data>

meta-keys()

When calling this function and assigning the result to a variable, you must specify the variable data type of xs:string* (array of strings).

The following fragment is an example of using meta-keys() to emit all meta values for a given stream, into an Event/Meta element:

<Event>
  <xsl:variable name="metaKeys" select="stroom:meta-keys()" as="xs:string*" />
  <Meta>
    <xsl:for-each select="$metaKeys">
      <string key="{.}"><xsl:value-of select="stroom:meta(.)" /></string>
    </xsl:for-each>
  </Meta>
</Event>

parse-uri()

The parse-uri() function takes a Uniform Resource Identifier (URI) in string form and returns an XML node with a namespace of uri containing the URI’s individual components of authority, fragment, host, path, port, query, scheme, schemeSpecificPart and userInfo. See either RFC 2306: Uniform Resource Identifiers (URI): Generic Syntax or Java’s java.net.URI Class for details regarding the components.

The following xml

<!-- Display and parse the URI contained within the text of the rURI element -->
<xsl:variable name="u" select="stroom:parseUri(rURI)" />

<URI>
  <xsl:value-of select="rURI" />
</URI>
<URIDetail>
  <xsl:copy-of select="$v"/>
</URIDetail>

given the rURI text contains

   http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&amp;p2=v2#more-details

would provide

<URL>http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&amp;p2=v2#more-details</URL>
<URIDetail>
  <authority xmlns="uri">foo:bar@w1.superman.com:8080</authority>
  <fragment xmlns="uri">more-details</fragment>
  <host xmlns="uri">w1.superman.com</host>
  <path xmlns="uri">/very/long/path.html</path>
  <port xmlns="uri">8080</port>
  <query xmlns="uri">p1=v1&amp;p2=v2</query>
  <scheme xmlns="uri">http</scheme>
  <schemeSpecificPart xmlns="uri">//foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&amp;p2=v2</schemeSpecificPart>
  <userInfo xmlns="uri">foo:bar</userInfo>
</URIDetail>

pointIsInsideXYPolygon()

Returns true if the specified point is inside the specified polygon. Useful for determining if a user is inside a physical zone based on their location and the boundary of that zone.

pointIsInsideXYPolygon(Number xPos, Number yPos, Number[] xPolyData, Number[] yPolyData)

Arguments:

xPos - The X value of the point to be tested.
yPos - The Y value of the point to be tested.
xPolyData - A sequence of X values that define the polygon.
yPolyData - A sequence of Y values that define the polygon.

The list of values supplied for xPolyData must correspond with the list of values supplied for yPolyData. The points that define the polygon must be provided in order, i.e. starting from one point on the polygon and then traveling round the path of the polygon until it gets back to the beginning.

9.3.3 - XSLT Includes

Using an XSLT import to include XSLT from another translation.

You can use an XSLT import to include XSLT from another translation. E.g.:

<xsl:import href="ApacheAccessCommon" />

This would include the XSLT from the ApacheAccessCommon translation.

9.4 - File Output

Substitution variables for use in output file names and paths.

When outputting files with Stroom, the output file names and paths can include various substitution variables to form the file and path names.

Context Variables

The following replacement variables are specific to the current processing context.

${feed} - The name of the feed that the stream being processed belongs to
${pipeline} - The name of the pipeline that is producing output
${sourceId} - The id of the input data being processed
${partNo} - The part number of the input data being processed where data is in aggregated batches
${searchId} - The id of the batch search being performed. This is only available during a batch search
${node} - The name of the node producing the output

Time Variables

The following replacement variables can be used to include aspects of the current time in UTC.

${year} - Year in 4 digit form, e.g. 2000
${month} - Month of the year padded to 2 digits
${day} - Day of the month padded to 2 digits
${hour} - Hour padded to 2 digits using 24 hour clock, e.g. 22
${minute} - Minute padded to 2 digits
${second} - Second padded to 2 digits
${millis} - Milliseconds padded to 3 digits
${ms} - Milliseconds since the epoch

System (Environment) Variables

System variables (environment variables) can also be used, e.g. ${TMP}.

File Name References

rolledFileName in RollingFileAppender can use references to the fileName to incorporate parts of the non rolled file name.

${fileName} - The complete file name
${fileStem} - Part of the file name before the file extension, i.e. everything before the last ‘.’
${fileExtension} - The extension part of the file name, i.e. everything after the last ‘.’

Other Variables

${uuid} - A randomly generated UUID to guarantee unique file names

9.5 - Reference Data

Performing temporal reference data lookups to decorate event data.

Reference Data Structure

A reference data entry essentially consists of the following:

Effective time - The data/time that the entry was effective from, i.e the time the raw reference data was received.
Map name - A unique name for the key/value map that the entry will be stored in. The name only needs to be unique within all map names that may be loaded within an XSLT Filter. In practice it makes sense to keep map names globally unique.
Key - The text that will be used to lookup the value in the reference data map. Mutually exclusive with Range.
Range - The inclusive range of integer keys that the entry applies to. Mutually exclusive with Key.
Value - The value can either be simple text, e.g. an IP address, or an XML fragment that can be inserted into another XML document. XML values must be correctly namespaced.

The following is an example of some reference data that has been converted from its raw form into reference-data:2 XML. In this example the reference data contains three entries that each belong to a different map. Two of the entries are simple text values and the last has an XML value.

<?xml version="1.1" encoding="UTF-8"?>
<referenceData 
    xmlns="reference-data:2" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:stroom="stroom" 
    xmlns:evt="event-logging:3" 
    xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd" 
    version="2.0.1">

  <!-- A simple string value -->
  <reference>
    <map>FQDN_TO_IP</map>
    <key>stroomnode00.strmdev00.org</key>
    <value>
      <IPAddress>192.168.2.245</IPAddress>
    </value>
  </reference>

  <!-- A simple string value -->
  <reference>
    <map>IP_TO_FQDN</map>
    <key>192.168.2.245</key>
    <value>
      <HostName>stroomnode00.strmdev00.org</HostName>
    </value>
  </reference>

  <!-- A key range -->
  <reference>
    <map>USER_ID_TO_COUNTRY_CODE</map>
    <range>
      <from>1</from>
      <to>1000</to>
    </range>
    <value>GBR</value>
  </reference>

  <!-- An XML fragment value -->
  <reference>
    <map>FQDN_TO_LOC</map>
    <key>stroomnode00.strmdev00.org</key>
    <value>
      <evt:Location>
        <evt:Country>GBR</evt:Country>
        <evt:Site>Bristol-S00</evt:Site>
        <evt:Building>GZero</evt:Building>
        <evt:Room>R00</evt:Room>
        <evt:TimeZone>+00:00/+01:00</evt:TimeZone>
      </evt:Location>
    </value>
  </reference>
</referenceData>

Reference Data Namespaces

When XML reference data values are created, as in the example XML above, the XML values must be qualified with a namespace to distinguish them from the reference-data:2 XML that surrounds them. In the above example the XML fragment will become as follows when injected into an event:

      <evt:Location xmlns:evt="event-logging:3" >
        <evt:Country>GBR</evt:Country>
        <evt:Site>Bristol-S00</evt:Site>
        <evt:Building>GZero</evt:Building>
        <evt:Room>R00</evt:Room>
        <evt:TimeZone>+00:00/+01:00</evt:TimeZone>
      </evt:Location>

Even if evt is already declared in the XML being injected into it, if it has been declared for the reference fragment then it will be explicitly declared in the destination. While duplicate namespacing may appear odd it is valid XML.

The namespacing can also be achieved like this:

<?xml version="1.1" encoding="UTF-8"?>
<referenceData 
    xmlns="reference-data:2" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:stroom="stroom" 
    xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd" 
    version="2.0.1">

  <!-- An XML value -->
  <reference>
    <map>FQDN_TO_LOC</map>
    <key>stroomnode00.strmdev00.org</key>
    <value>
      <Location xmlns="event-logging:3">
        <Country>GBR</Country>
        <Site>Bristol-S00</Site>
        <Building>GZero</Building>
        <Room>R00</Room>
        <TimeZone>+00:00/+01:00</TimeZone>
      </Location>
    </value>
  </reference>
</referenceData>

This reference data will be injected into event XML exactly as it, i.e.:

      <Location xmlns="event-logging:3">
        <Country>GBR</Country>
        <Site>Bristol-S00</Site>
        <Building>GZero</Building>
        <Room>R00</Room>
        <TimeZone>+00:00/+01:00</TimeZone>
      </Location>

Reference Data Storage

Reference data is stored in two different places on a Stroom node. All reference data is only visible to the node where it is located. Each node that is performing reference data lookups will need to load and store its own reference data. While this will result in duplicate data being held by nodes it makes the storage of reference data and its subsequent lookup very performant.

On-Heap Store

The On-Heap store is the reference data store that is held in memory in the Java Heap. This store is volatile and will be lost on shut down of the node. The On-Heap store is only used for storage of context data.

Off-Heap Store

The Off-Heap store is the reference data store that is held in memory outside of the Java Heap and is persisted to to local disk. As the store is also persisted to local disk it means the reference data will survive the shutdown of the stroom instance. Storing the data off-heap means Stroom can run with a much smaller Java Heap size.

The Off-Heap store is based on the Lightning Memory-Mapped Database (LMDB). LMDB makes use of the Linux page cache to ensure that hot portions of the reference data are held in the page cache (making use of all available free memory). Infrequently used portions of the reference data will be evicted from the page cache by the Operating System. Given that LMDB utilises the page cache for holding reference data in memory the more free memory the host has the better as there will be less shifting of pages in/out of the OS page cache. When storing large amounts of data you may experience the OS reporting very little free memory as a large amount will be in use by the page cache. This is not an issue as the OS will evict pages when memory is needed for other applications, e.g. the Java Heap.

Local Disk

The Off-Heap store is intended to be located on local disk on the Stroom node. The location of the store is set using the property stroom.pipeline.referenceData.localDir. Using LMDB on remote storage is NOT advised, see http://www.lmdb.tech/doc. Using the fastest storage (i.g. fast SSDs) is advised to reduce load times and lookups of data that is not in memory.

Warning

If you are running stroom on AWS EC2 instances then you will need to attach some local instance storage to the host, e.g. SSD, to use for the reference data store. In tests EBS storage was found to be VERY slow.

It should be noted that AWS instance storage is not persistent between instance stops, terminations and hardware failure. However any loss of the reference data store will mean that the next time Stroom boots a new store will be created and reference data will be loaded on demand as normal.

Transactions

LMDB is a transactional database with ACID semantics. All interaction with LMDB is done within a read or write transaction. There can only be one write transaction at a time so if there are a number of concurrent reference data loads then they will have to wait in line.

Read transactions, i.e. lookups, are not blocked by each other but may be blocked by a write transaction depending on the value of the system property stroom.pipeline.referenceData.lmdb.readerBlockedByWriter. LMDB can operate such that readers are not blocked by writers but if there is an open read transaction while a write transaction is writing data to the store then it is unable to make use of free space (from previous deletes, see Store Size & Compaction) so will result in the store increasing in size. If read transactions are likely while writes are taking place then this can lead to excessive growth of the store. Setting stroom.pipeline.referenceData.lmdb.readerBlockedByWriter to true will block all reads while a load is happening so any free space can be re-used, at the cost of making all lookups wait for the load to complete. Use of this setting will depend on how likely it is that loads will clash with lookups and the store size should be monitored.

Read-Ahead Mode

When data is read from the store, if the data is not already in the page cache then it will be read from disk and added to the page cache by the OS. Read-ahead is the process of speculatively reading ahead to load more pages into the page cache than were requested. This is on the basis that future requests for data may need the pages speculatively read into memory as it is more efficient to read multiple pages at once. If the reference data store is very large or is larger than the available memory then it is recommended to turn read-ahead off as the result will be to evict hot reference data from the page cache to make room for speculative pages that may not be needed. It can be tuned off with the system property stroom.pipeline.referenceData.readAheadEnabled.

Key Size

When reference data is created care must be taken to ensure that the Key used for each entry is less than 507 bytes. For simple ASCII characters then this means less than 507 characters. If non-ASCII characters are in the key then these will take up more than one byte per character so the length of the key in characters will be less. This is a limitation inherent to LMDB.

Commit intervals

The property stroom.pipeline.referenceData.maxPutsBeforeCommit controls the number of entries that are put into the store between each commit. As there can be only one transaction writing to the store at a time, committing periodically allows other process to jump in and make writes. There is a trade off though as reducing the number of entries put between each commit can seriously affect performance. For the fastest single process performance a value of 0 should be used which means it will not commit mid-load. This however means all other processes wanting to write to the store will need to wait. Low values (e.g. in the hundreds) mean very frequent commits so will hamper performance.

Cloning The Off Heap Store

If you are provisioning a new stroom node it is possible to copy the off heap store from another node. Stroom should not be running on the node being copied from. Simply copy the contents of stroom.pipeline.referenceData.localDir into the same configured location on the other node. The new node will use the copied store and have access to its reference data.

Store Size & Compaction

Due to the way LMDB works the store can only grow in size, it will never shrink, even if reference data is deleted. Deleted data frees up space for new writes to the store so will be reused but will never be freed back to the operating system. If there is a regular process of purging old data and adding new reference data then this should not be an issue as the new reference data will use the space made available by the purged data.

If store size becomes an issue then it is possible to compact the store. lmdb-utils is package that is available on some package managers and this has an mdb_copy command that can be used with the -c switch to copy the LMDB environment to a new one, compacting it in the process. This should be done when Stroom is down to avoid writes happening to the store while the copy is happening.

The following is an example of how to compact the store assuming Stroom has been shut down first.

# Navigate to the 'stroom.pipeline.referenceData.localDir' directory
cd /some/path/to/reference_data
# Verify contents
ls
(out) data.mdb  lock.mdb
# Create a directory to write the compacted file to
mkdir compacted
# Run the compaction, writing the new data.mdb file to the new sub-dir
mdb_copy -c ./ ./compacted
# Delete the existing store
rm data.mdb lock.mdb
# Copy the compacted store back in (note a lock file gets created as needed)
mv compacted/data.mdb ./
# Remove the created directory
rmdir compacted

Now you can re-start Stroom and it will use the new compacted store, creating a lock file for it.

The compaction process is fast. A test compaction of a 4Gb store, compacted down to 1.6Gb took about 7s on non-flash HDD storage.

Alternatively, given that the store is essentially a cache and all data can be re-loaded another option is to delete the contents of stroom.pipeline.referenceData.localDir when Stroom is not running. On boot Stroom will create a brand new empty store and reference data will be re-loaded as required. This approach will result in all data having to be re-loaded so will slow lookups down until it has been loaded.

The Loading Process

Reference data is loaded into the store on demand during the processing of a stroom:lookup() method call. Reference data will only be loaded if it does not already exist in the store, however it is always loaded as a complete stream, rather than entry by entry.

The test for existence in the store is based on the following criteria:

The UUID of the reference loader pipeline.
The version of the reference loader pipeline.
The Stream ID for the stream of reference data that has been deemed effective for the lookup.
The Stream Number (in the case of multi part streams).

If a reference stream has already been loaded matching the above criteria then no additional load is required.

IMPORTANT: It should be noted that as the version of the reference data pipeline forms part of the criteria, if the reference loader pipeline is changed, for whatever reason, then this will invalidate ALL existing reference data associated with that reference loader pipeline.

Typically the reference loader pipeline is very static so this should not be an issue.

Standard practice is to convert raw reference data into reference:2 XML on receipt using a pipeline separate to the reference loader. The reference loader is then only concerned with reading cooked reference:2 into the Reference Data Filter.

In instances where reference data streams are infrequently used it may be preferable to not convert the raw reference on receipt but instead to do it in the reference loader pipeline.

Duplicate Keys

The Reference Data Filter pipeline element has a property overrideExistingValues which if set to true means if an entry is found in an effective stream with the same key as an entry already loaded then it will overwrite the existing one. Entries are loaded in the order they are found in the reference:2 XML document. If set to false then the existing entry will be kept. If warnOnDuplicateKeys is set to true then a warning will be logged for any duplicate keys, whether an overwrite happens or not.

Value De-Duplication

Only unique values are held in the store to reduce the storage footprint. This is useful given that typically, reference data updates may be received daily and each one is a full snapshot of the whole reference data. As a result this can mean many copies of the same value being loaded into the store. The store will only hold the first instance of duplicate values.

Querying the Reference Data Store

The reference data store can be queried within a Dashboard in Stroom by selecting Reference Data Store in the data source selection pop-up. Querying the store is currently an experimental feature and is mostly intended for use in fault finding. Given the localised nature of the reference data store the dashboard can currently only query the store on the node that the user interface is being served from. In a multi-node environment where some nodes are UI only and most are processing only, the UI nodes will have no reference data in their store.

Purging Old Reference Data

Reference data loading and purging is done at the level of a reference stream. Whenever a reference lookup is performed the last accessed time of the reference stream in the store is checked. If it is older than one hour then it will be updated to the current time. This last access time is used to determine reference streams that are no longer in active use and thus can be purged.

The Stroom job Ref Data Off-heap Store Purge is used to perform the purge operation on the Off-Heap reference data store. No purge is required for the On-Heap store as that only holds transient data. When the purge job is run it checks the time since each reference stream was accessed against the purge cut-off age. The purge age is configured via the property stroom.pipeline.referenceData.purgeAge. It is advised to schedule this job for quiet times when it is unlikely to conflict with reference loading operations as they will fight for access to the single write transaction.

Lookups

Lookups are performed in XSLT Filters using the XSLT functions. In order to perform a lookup one or more reference feeds must be specified on the XSLT Filter pipeline element. Each reference feed is specified along with a reference loader pipeline that will ingest the specified feed (optional convert it into reference:2 XML if it is not already) and pass it into a Reference Data Filter pipeline element.

Reference Feeds & Loaders

In the XSLT Filter pipeline element multiple combinations of feed and reference loader pipeline can be specified. There must be at least one in order to perform lookups. If there are multiple then when a lookup is called for a given time, the effective stream for each feed/loader combination is determined. The effective stream for each feed/loader combination will be loaded into the store, unless it is already present.

When the actual lookup is performed Stroom will try to find the key in each of the effective streams that have been loaded and that contain the map in the lookup call. If the lookup is unsuccessful in the effective stream for the first feed/loader combination then it will try the next, and so on until it has tried all of them. For this reason if you have multiple feed/loader combinations then order is important. It is possible for multiple effective streams to contain the same map/key so a feed/loader combination higher up the list will trump one lower down with the same map/key. Also if you have some lookups that may not return a value and others that should always return a value then the feed/loader for the latter should be higher up the list so it is searched first.

Effective Streams

Reference data lookups have the concept of Effective Streams. An effective stream is the most recent stream for a given Feed that has an effective date that is less than or equal to the date used for the lookup. When performing a lookup, Stroom will search the stream store to find all the effective streams in a time bucket that surrounds the lookup time. These sets of effective streams are cached so if a new reference stream is created it will not be used until the cached set has expired. To rectify this you can clear the cache Reference Data - Effective Stream Cache on the Caches screen accessed from:

Standard Key/Value Lookups

Standard key/value lookups consist of a simple string key and a value that is either a simple string or an XML fragment. Standard lookups are performed using the various forms of the stroom:lookup() XSLT function.

Note

If the key is not found and the key is an integer then it will attempt a range lookup using the same key. This is to allow for maps that contain a mixture of key/value pairs and range/value pairs.

Range Lookups

Range lookups consist of a key that is an integer and a value that is either a simple string or an XML fragment. For more detail on range lookups see the XSLT function stroom:lookup().

Note

The lookup will initially look for a single key that matches the lookup key. If an exact match is not found then it will look for a range that contains the key. This is to allow for maps that contain a mixture of key/value pairs and range/value pairs.

Nested Map Lookups

Nested map lookups involve chaining a number of lookups with the value of each map being used as the key for the next. For more detail on nested lookups see the XSLT function stroom:lookup().

Bitmap Lookups

A bitmap lookup is a special kind of lookup that actually performs a lookup for each enabled bit position of the passed bitmap value. For more detail on bitmap lookups see the XSLT function stroom:bitmap-lookup().

Values can either be a simple string or an XML fragment.

Context data lookups

Some event streams have a Context stream associated with them. Context streams allow the system sending the events to Stroom to supply an additional stream of data that provides context to the raw event stream. This can be useful when the system sending the events has no control over the event content but needs to supply additional information. The context stream can be used in lookups as a reference source to decorate events on receipt. Context reference data is specific to a single event stream so is transient in nature, therefore the On Heap Store is used to hold it for the duration of the event stream processing only.

Typically the reference loader for a context stream will include a translation step to convert the raw context data into reference:2 XML.

Reference Data API

See Reference Data API.

9.6 - Context Data

Context data is additional contextual data Stream that is sent alongside the main event data Stream.

TODO

This section needs some explanation.

Context File

Input File:

<?xml version="1.0" encoding="UTF-8"?>
<SomeData>
	<SomeEvent>
			<SomeTime>01/01/2009:12:00:01</SomeTime>
			<SomeAction>OPEN</SomeAction>
			<SomeUser>userone</SomeUser>
			<SomeFile>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</SomeFile>
	</SomeEvent>
</SomeData>

Context File:

<?xml version="1.0" encoding="UTF-8"?>
<SomeContext>
	<Machine>MyMachine</Machine>
</SomeContext>

Context XSLT:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
	xmlns="reference-data:2"
	xmlns:evt="event-logging:3"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	version="2.0">
		
		<xsl:template match="SomeContext">
			<referenceData 
					xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd reference-data:2 file://reference-data-v2.0.1.xsd"
					version="2.0.1">
							
					<xsl:apply-templates/>
			</referenceData>
		</xsl:template>

		<xsl:template match="Machine">
			<reference>
					<map>CONTEXT</map>
					<key>Machine</key>
					<value><xsl:value-of select="."/></value>
			</reference>
		</xsl:template>
		
</xsl:stylesheet>

Context XML Translation:

<?xml version="1.0" encoding="UTF-8"?>
<referenceData xmlns:evt="event-logging:3"
								xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
								xmlns="reference-data:2"
								xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd reference-data:2 file://reference-data-v2.0.1.xsd"
								version="2.0.1">
		<reference>
			<map>CONTEXT</map>
			<key>Machine</key>
			<value>MyMachine</value>
		</reference>
</referenceData>

Input File:

<?xml version="1.0" encoding="UTF-8"?>
<SomeData>
	<SomeEvent>
			<SomeTime>01/01/2009:12:00:01</SomeTime>
			<SomeAction>OPEN</SomeAction>
			<SomeUser>userone</SomeUser>
			<SomeFile>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</SomeFile>
	</SomeEvent>
</SomeData>

Main XSLT (Note the use of the context lookup):

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
	xmlns="event-logging:3"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
	version="2.0">
	
    <xsl:template match="SomeData">
        <Events xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd" Version="3.0.0">
            <xsl:apply-templates/>
        </Events>
    </xsl:template>
    <xsl:template match="SomeEvent">
        <xsl:if test="SomeAction = 'OPEN'">
            <Event>
                <EventTime>
                        <TimeCreated>
                            <xsl:value-of select="s:format-date(SomeTime, 'dd/MM/yyyy:hh:mm:ss')"/>
                        </TimeCreated>
                </EventTime>
				<EventSource>
					<System>Example</System>
					<Environment>Example</Environment>
					<Generator>Very Simple Provider</Generator>
					<Device>
						<IPAddress>182.80.32.132</IPAddress>
						<Location>
							<Country>UK</Country>
							<Site><xsl:value-of select="s:lookup('CONTEXT', 'Machine')"/></Site>
							<Building>Main</Building>
							<Floor>1</Floor>              
							<Room>1aaa</Room>
						</Location>           
					</Device>
					<User><Id><xsl:value-of select="SomeUser"/></Id></User>
				</EventSource>
				<EventDetail>
					<View>
						<Document>
							<Title>UNKNOWN</Title>
							<File>
								<Path><xsl:value-of select="SomeFile"/></Path>
							</File>
						</Document>
					</View>
				</EventDetail>
            </Event>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Main Output XML:

<?xml version="1.0" encoding="UTF-8"?>
<Events xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
				xmlns="event-logging:3"
				xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd"
				Version="3.0.0">
		<Event Id="6:1">
			<EventTime>
					<TimeCreated>2009-01-01T00:00:01.000Z</TimeCreated>
			</EventTime>
			<EventSource>
					<System>Example</System>
					<Environment>Example</Environment>
					<Generator>Very Simple Provider</Generator>
					<Device>
						<IPAddress>182.80.32.132</IPAddress>
						<Location>
								<Country>UK</Country>
								<Site>MyMachine</Site>
								<Building>Main</Building>
								<Floor>1</Floor>
								<Room>1aaa</Room>
						</Location>
					</Device>
					<User>
						<Id>userone</Id>
					</User>
			</EventSource>
			<EventDetail>
					<View>
						<Document>
								<Title>UNKNOWN</Title>
								<File>
									<Path>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</Path>
								</File>
						</Document>
					</View>
			</EventDetail>
		</Event>
</Events>

10 - Properties

Configuration of Stroom’s application properties.

Properties are the means of configuring the Stroom application and are typically maintained by the Stroom system administrator. The value of some properties are required in order for Stroom to function, e.g. database connection details, and thus need to be set prior to running Stroom. Some properties can be changed at runtime to alter the behaviour of Stroom.

Sources

Property values can be defined in the following locations.

System Default

The system defaults are hard-coded into the Stroom application code by the developers and can’t be changed. These represent reasonable defaults, where applicable, but may need to be changed, e.g. to reflect the scale of the Stroom system or the specific environment.

The default property values can either be viewed in the Stroom user interface or in the file config/config-defaults.yml in the Stroom distribution. Properties can be accessed in the user interface by selecting this from the top menu:

Global Database Value

Global database values are property values stored in the database that are global across the whole cluster.

The global database value is defined as a record in the config table in the database. The database record will only exist if a database value has explicitly been set. The database value will apply to all nodes in the cluster, overriding the default value, unless a node also has a value set in its YAML configuration.

Database values can be set from the Stroom user interface, accessed by selecting this from the top menu:

Some properties are marked Read Only which means they cannot have a database value set for them. These properties can only be altered via the YAML configuration file on each node. Such properties are typically used to configure values required for Stroom to be able to boot, so it does not make sense for them to be configurable from the User Interface.

YAML Configuration file

Stroom is built on top of a framework called Dropwizard. Dropwizard uses a YAML configuration file on each node to configure the application. This is typically config.yml and is located in the config directory.

For details of the structure of this file, see Stroom and Stroom-Proxy Common Configuration

Source Precedence

The three sources (Default, Database & YAML) are listed in increasing priority, i.e YAML trumps Database, which trumps Default.

For example, in a two node cluster, this table shows the effective value of a property on each node. A - indicates the value has not been set in that source. NULL indicates that the value has been explicitly set to NULL.

Source	Node1	Node2
Default	red	red
Database	-	-
YAML	-	blue
Effective	red	blue

Or where a Database value is set.

Source	Node1	Node2
Default	red	red
Database	green	green
YAML	-	blue
Effective	green	blue

Or where a YAML value is explicitly set to NULL.

Source	Node1	Node2
Default	red	red
Database	green	green
YAML	-	NULL
Effective	green	NULL

Data Types

Stroom property values can be set using a number of different data types. Database property values are currently set in the user interface using the string form of the value. For each of the data types defined below, there will be an example of how the data type is recorded in its string form.

Data Type	Example UI String Forms	Example YAML form
Boolean	`true` `false`	`true` `false`
String	`This is a string`	`"This is a string"`
Integer/Long	`123`	`123`
Float	`1.23`	`1.23`
Stroom Duration	`P30D` `P1DT12H` `PT30S` `30d` `30s` `30000`	`"P30D"` `"P1DT12H"` `"PT30S"` `"30d"` `"30s"` `"30000"` See Stroom Duration Data Type.
List	`#red#Green#Blue` `,1,2,3`	See List Data Type
Map	`,=red=FF0000,Green=00FF00,Blue=0000FF`	See Map Data Type
DocRef	`,docRef(MyType,a56ff805-b214-4674-a7a7-a8fac288be60,My DocRef name)`	See DocRef Data Type
Enum	`HIGH` `LOW`	`"HIGH"` `"LOW"`
Path	`/some/path/to/a/file`	`"/some/path/to/a/file"`
ByteSize	`32`, `512Kib`	`32`, `512Kib` See Byte Size Data Type

Stroom Duration Data Type

The Stroom Duration data type is used to specify time durations, for example the time to live of a cache or the time to keep data before it is purged. Stroom Duration uses a number of string forms to support legacy property values.

ISO 8601 Durations

Stroom Duration can be expressed using ISO 8601 duration strings. It does NOT support the full ISO 8601 format, only D, H, M and S. For details of how the string is parsed to a Stroom Duration, see Duration

The following are examples of ISO 8601 duration strings:

P30D - 30 days
P1DT12H - 1 day 12 hours (36 hours)
PT30S - 30 seconds
PT0.5S - 500 milliseconds

Legacy Stroom Durations

This format was used in versions of Stroom older than v7 and is included to support legacy property values.

The following are examples of legacy duration strings:

30d - 30 days
12h - 12 hours
10m - 10 minutes
30s - 30 seconds
500 - 500 milliseconds

Combinations such as 1m30s are not supported.

List Data Type

This type supports ordered lists of items, where an item can be of any supported data type, e.g. a list of strings or list of integers.

The following is an example of how a property (statusValues) that is is List of strings is represented in the YAML:

  annotation:
    statusValues:
    - "New"
    - "Assigned"
    - "Closed"

This would be represented as a string in the User Interface as:

|New|Assigned|Closed.

See Delimiters in String Conversion for details of how the items are delimited in the string.

The following is an example of how a property (cpu) that is is List of DocRefs is represented in the YAML:

  statistics:
    internal:
      cpu:
      - type: "StatisticStore"
        uuid: "af08c4a7-ee7c-44e4-8f5e-e9c6be280434"
        name: "CPU"
      - type: "StroomStatsStore"
        uuid: "1edfd582-5e60-413a-b91c-151bd544da47"
        name: "CPU"

This would be represented as a string in the User Interface as:

|,docRef(StatisticStore,af08c4a7-ee7c-44e4-8f5e-e9c6be280434,CPU)|,docRef(StroomStatsStore,1edfd582-5e60-413a-b91c-151bd544da47,CPU)

See Delimiters in String Conversion for details of how the items are delimited in the string.

Map Data Type

This type supports a collection of key/value pairs where the key is unique within the collection. The type of the key must be string, but the type of the value can be any supported type.

The following is an example of how a property (mapProperty) that is a map of string => string would be represented in the YAML:

mapProperty:
  red: "FF0000"
  green: "00FF00"
  blue: "0000FF"

This would be represented as a string in the User Interface as:

,=red=FF0000,Green=00FF00,Blue=0000FF

The delimiter between pairs is defined first, then the delimiter for the key and value.

See Delimiters in String Conversion for details of how the items are delimited in the string.

DocRef Data Type

A DocRef (or Document Reference) is a type specific to Stroom that defines a reference to an instance of a Document within Stroom, e.g. an XLST, Pipeline, Dictionary, etc. A DocRef consists of three parts, the type, the UUID and the name of the Document.

The following is an example of how a property (aDocRefProperty) that is a DocRef would be represented in the YAML:

aDocRefProperty:
  type: "MyType"
  uuid: "a56ff805-b214-4674-a7a7-a8fac288be60"
  name: "My DocRef name"

This would be represented as a string in the User Interface as:

,docRef(MyType,a56ff805-b214-4674-a7a7-a8fac288be60,My DocRef name)

See Delimiters in String Conversion for details of how the items are delimited in the string.

Byte Size Data Type

The Byte Size data type is used to represent a quantity of bytes using the IEC standard. Quantities are represented as powers of 1024, i.e. a Kib (Kibibyte) means 1024 bytes.

Examples of Byte Size values in string form are (a YAML value would optionally be surrounded with double quotes):

32, 32b, 32B, 32bytes - 32 bytes
32K, 32KB, 32KiB - 32 kibibytes
32M, 32MB, 32MiB - 32 mebibytes
32G, 32GB, 32GiB - 32 gibibytes
32T, 32TB, 32TiB - 32 tebibytes
32P, 32PB, 32PiB - 32 pebibytes

The *iB form is preferred as it is more explicit and avoids confusion with SI units.

Delimiters in String Conversion

The string conversion used for collection types like List, Map etc. relies on the string form defining the delimiter(s) to use for the collection. The delimiter(s) are added as the first n characters of the string form, e.g. |red|green|blue or |=red=FF0000|Green=00FF00|Blue=0000FF. It is possible to use a number of different delimiters to allow for delimiter characters appearing in the actual value, e.g. #some text#some text with a | in it The following are the delimiter characters that can be used.

|, :, ;, ,, !, /, \, #, @, ~, -, _, =, +, ?

When Stroom records a property value to the database it may use a delimiter of its own choosing, ensuring that it picks a delimiter that is not used in the property value.

Paths

File and directory paths can either be absolute (e.g. /some/path/) or relative (e.g. some/path). All relative paths will be resolved to an absolute path using the value of stroom.home as the base.

Path values also support variable substitution. For full details on the possible variable substitution options, see File Output.

Restart Required

Some properties are marked as requiring a restart. There are two scopes for this:

Requires UI Refresh

If a property is marked in UI as requiring a UI refresh then this means that a change to the property requires that the Stroom nodes serving the UI are restarted for the new value to take effect.

Requires Restart

If a property is marked in UI as requiring a restart then this means that a change to the property requires that all Stroom nodes are restarted for the new value to take effect.

11 - Searching Data

Searching the data held in Stroom using Dashboards, Queries, Views and Analytic Rules.

Data in stroom (and in external Elastic indexes) can be searched using a number of ways:

Dashboard Combines multiple query expressions, result tables and visualisations in one configurable layout.
Query Executes a single search query written in StroomQL and displays the results as a table or visualisation.
Analytic Rule Executes a StroomQL search query either against data as it is ingested into Stroom or on a scheduled basis.

11.1 - Data Sources

Stroom has multiple different types of data sources that can be queried by Stroom using Dashboards, Queries and Analytic Rules.

11.1.1 - Lucene Index Data Source

Stroom’s own Lucene based index for indexing and searching its stream data.

Stroom’s primary data source is its internal Lucene based search indexes. For details of how data is indexed see Lucene Indexes.

TODO

Complete this section

11.1.2 - Statistics

Using Stroom’s statistic stores as a data source.

TODO

Complete this section

11.1.3 - Elasticsearch

Using Elasticsearch as a data source.

Stroom can integrate with external Elasticsearch indexes to allow querying using Stroom’s various mechanisms for querying data sources. These indexes may have been populated using a Stroom pipeline (See here).

Searching using a Stroom dashboard

Searching an Elasticsearch index (or data stream) using a Stroom dashboard is conceptually similar to the process described in Dashboards.

Before you set the dashboard’s data source, you must first create an Elastic Index document to tell Stroom which index (or indices) you wish to query.

Create an Elastic Index document

Right-click a folder in the Stroom Explorer pane ( ).
Select:

New

Elastic Index
Enter a name for the index document and click OK .
Click next to the Cluster configuration field label.
In the dialog that appears, select the Elastic Cluster document where the index exists, and click OK .
Enter the name of an index or data stream in Index name or pattern. Data view (formerly known as index pattern) syntax is supported, which enables you to query multiple indices or data streams at once. For example: stroom-events-v1.
(Optional) Set Search slices, which is the number of parallel workers that will query the index. For very large indices, increasing this value up to and including the number of shards can increase scroll performance, which will allow you to download results faster.
(Optional) Set Search scroll size, which specifies the number of documents to return in each search response. Greater values generally increase efficiency. By default, Elasticsearch limits this number to 10,000.
Click Test Connection. A dialog will appear with the result, which will state Connection Success if the connection was successful and the index pattern matched one or more indices.
Click .

Set the Elastic Index document as the dashboard data source

Open or create a dashboard.
Click in the Query panel.
Click next to the Data Source field label.
Select the Elastic Index document you created and click OK .
Configure the query expression as explained in Dashboards. Note the tips for particular Elasticsearch field mapping data types.
Configure the table.

Query expression tips

Certain Elasticsearch field mapping types support special syntax when used in a Stroom dashboard query expression.

To identify the field mapping type for a particular field:

Click in the Query panel to add a new expression item.
Select the Elasticsearch field name in the drop-down list.
Note the blue data type indicator to the far right of the row. Common examples are: keyword, text and number.

After you identify the field mapping type, move the mouse cursor over the mapping type indicator. A tooltip appears, explaining various types of queries you can perform against that particular field’s type.

Searching multiple indices

Using data view (index pattern) syntax, you can create powerful dashboards that query multiple indices at a time. An example of this is where you have multiple indices covering different types of email systems. Let’s assume these indices are named: stroom-exchange-v1, stroom-domino-v1 and stroom-mailu-v1.

There is a common set of fields across all three indices: @timestamp, Subject, Sender and Recipient. You want to allow search across all indices at once, in effect creating a unified email dashboard.

You can achieve this by creating an Elastic Index document called (for example) Elastic-Email-Combined and setting the property Index name or pattern to: stroom-exchange-v1,stroom-domino-v1,stroom-mailu-v1. Click and re-open the dashboard. You’ll notice that the available fields are a union of the fields across all three indices. You can now search by any of these - in particular, the fields common to all three.

11.1.4 - Internal Data Sources

A set of data sources for querying the inner workings of Stroom.

Stroom provides a number of built in data sources for querying the inner workings of stroom. These data sources do not have a corresponding Document so do not feature in the explorer tree.

These data sources appear as children of the root folder when selecting a data source in a Dashboard , View . They are also available in the list of data sources when editing a Query .

Analytics

TODO

Complete

Annotations

Annotations are a means of annotating search results with additional information and for assigning those annotations to users. The Annotations data source allows you to query the annotations that have been created.

Field	Type	Description
`annotaion:Id`	Long	Annotation unique identifier.
`annotation:CreatedOn`	Date	Date created.
`annotation:CreatedBy`	String	Username of the user that created the annotation.
`annotation:UpdatedOn`	Date	Date last updated.
`annotation:UpdatedBy`	String	Username of the user that last updated the annotation.
`annotation:Title`	String
`annotation:Subject`	String
`annotation:AssignedTo`	String	Username the annotation is assigned to.
`annotation:Comment`	String	Any comments on the annotation.
`annotation:History`	String	History of changes to the annotation.

Dual

The Dual data source is one with a single field that always returns one row with the same value. This data source can be useful for testing expression functions. It can also be useful when combined with an extraction pipeline that uses the stroom:http-call() XSLT function in order to make a single HTTP call using Dashboard parameter values.

Field	Type	Description
`Dummy`	String	Always one row that has the value `X`

Index Shards

Exposes the details of the index shards that make up Stroom’s Lucene based index. Each index is split up into one or more partitions and each partition is further divided into one or more shards. Each row represents one index shard.

Field	Type	Description
`Node`	String	The name of the node that the index belongs to.
`Index`	String	The name of the index document.
`Index Name`	String	The name of the index document.
`Volume Path`	String	The file path for the index shard.
`Volume Group`	String	The name of the volume group the index is using.
`Partition`	String	The name of the partition that the shard is in.
`Doc Count`	Integer	The number of documents in the shard.
`File Size`	Long	The size of the shard on disk in bytes.
`Status`	String	The status of the shard (`Closed`, `Open`, `Closing`, `Opening`, `New`, `Deleted`, `Corrupt`).
`Last Commit`	Date	The time and date of the last commit to the shard.

Meta Store

Exposes details of the streams held in Stroom’s stream (aka meta) store. Each row represents one stream.

Field	Type	Description
`Feed`	String	The name of the feed the stream belongs to.
`Pipeline`	String	The name of the pipeline that created the stream. [Optional]
`Pipeline Name`	String	The name of the pipeline that created the stream. [Optional]
`Status`	String	The status of the stream (`Unlocked`, `Locked`, `Deleted`).
`Type`	String	The Stream Type , e.g. `Events`, `Raw Events`, etc.
`Id`	Long	The unique ID (within this Stroom cluster) for the stream .
`Parent Id`	Long	The unique ID (within this Stroom cluster) for the parent stream, e.g. the Raw stream that spawned an Events stream. [Optional]
`Processor Id`	Long	The unique ID (within this Stroom cluster) for the processor that produced this stream. [Optional]
`Processor Filter Id`	Long	The unique ID (within this Stroom cluster) for the processor filter that produced this stream. [Optional]
`Processor Task Id`	Long	The unique ID (within this Stroom cluster) for the processor task that produced this stream. [Optional]
`Create Time`	Date	The time the stream was created.
`Effective Time`	Date	The time that the data in this stream is effective for. This is only used for reference data stream and is the time that the snapshot of reference data was captured. [Optional]
`Status Time`	Date	The time that the status was last changed.
`Duration`	Long	The time it took to process the stream in milliseconds. [Optional]
`Read Count`	Long	The number of records read in segmented streams. [Optional]
`Write Count`	Long	The number of records written in segmented streams. [Optional]
`Info Count`	Long	The number of INFO messages.
`Warning Count`	Long	The number of WARNING messages.
`Error Count`	Long	The number of ERROR messages.
`Fatal Error Count`	Long	The number of FATAL_ERROR messages.
`File Size`	Long	The compressed size of the stream on disk in bytes.
`Raw Size`	Long	The un-compressed size of the stream on disk in bytes.

Processor Tasks

Exposes details of the tasks spawned by the processor filters. Each row represents one processor task.

Field	Type	Description
`Create Time`	Date	The time the task was created.
`Create Time Ms`	Long	The time the task was created (milliseconds).
`Start Time`	Date	The time the task was executed.
`Start Time Ms`	Long	The time the task was executed (milliseconds).
`End Time`	Date	The time the task finished.
`End Time Ms`	Long	The time the task finished (milliseconds).
`Status Time`	Date	The time the status of the task was last updated.
`Status Time Ms`	Long	The time the status of the task was last updated (milliseconds).
`Meta Id`	Long	The unique ID (unique within this Stroom cluster) of the stream the task was for.
`Node`	String	The name of the node that the task was executed on.
`Pipeline`	String	The name of the pipeline that spawned the task.
`Pipeline Name`	String	The name of the pipeline that spawned the task.
`Processor Filter Id`	Long	The ID of the processor filter that spawned the task.
`Processor Filter Priority`	Integer	The priority of the processor filter when the task was executed.
`Processor Id`	Long	The unique ID (unique within this Stroom cluster) of the pipeline processor that spawned this task.
`Feed`	String
`Status`	String	The status of the task (`Created`, `Queued`, `Processing`, `Complete`, `Failed`, `Deleted`).
`Task Id`	Long	The unique ID (unique within this Stroom cluster) of this task.

Reference Data Store

Warning

This data source is for advanced users only and is primarily aimed at debugging issues with reference data.

Reference data is written to a persistent cache on storage local to the node. This data source exposes the data held in the store on the local node only. Given that most Stroom deployments are clustered and the UI nodes are typically not doing processing, this means the UI node will have no reference data.

Task Manager

This data source exposed the back ground tasks currently running across the Stroom cluster. Each row represents a single background server task.

Requires the Manage Tasks application permission.

Field	Type	Description
`Node`	String	The name of the node that the task is running on.
`Name`	String	The name of the task.
`User`	String	The user name of the user that the task is running as.
`Submit Time`	Date	The time the task was submitted.
`Age`	Duration	The time the task has been running for.
`Info`	String	The latest information message from the task.

11.2 - Dashboards

A Dashboard document is a way to combine multiple search queries, tables and visualisations in a configurable layout.

11.2.1 - Queries

How to query the data in Stroom.

Dashboard queries are created with the query expression builder. The expression builder allows for complex boolean logic to be created across multiple index fields. The way in which different index fields may be queried depends on the type of data that the index field contains.

Date Time Fields

Time fields can be queried for times equal, greater than, greater than or equal, less than, less than or equal or between two times.

Times can be specified in two ways:

Absolute times
Relative times

Absolute Times

An absolute time is specified in ISO 8601 date time format, e.g. 2016-01-23T12:34:11.844Z

Relative Times

In addition to absolute times it is possible to specify times using expressions. Relative time expressions create a date time that is relative to the execution time of the query. Supported expressons are as follows:

now() - The current execution time of the query.
second() - The current execution time of the query rounded down to the nearest second.
minute() - The current execution time of the query rounded down to the nearest minute.
hour() - The current execution time of the query rounded down to the nearest hour.
day() - The current execution time of the query rounded down to the nearest day.
week() - The current execution time of the query rounded down to the first day of the week (Monday).
month() - The current execution time of the query rounded down to the start of the current month.
year() - The current execution time of the query rounded down to the start of the current year.

Adding/Subtracting Durations

With relative times it is possible to add or subtract durations so that queries can be constructed to provide for example, the last week of data, the last hour of data etc.

To add/subtract a duration from a query term the duration is simply appended after the relative time, e.g.

now() + 2d

Multiple durations can be combined in the expression, e.g.

now() + 2d - 10h

now() + 2w - 1d10h

Durations consist of a number and duration unit. Supported duration units are:

s - Seconds
m - Minutes
h - Hours
d - Days
w - Weeks
M - Months
y - Years

Using these durations a query to get the last weeks data could be as follows:

between now() - 1w and now()

Or midnight a week ago to midnight today:

between day() - 1w and day()

Or if you just wanted data for the week so far:

greater than week()

Or all data for the previous year:

between year() - 1y and year()

Or this year so far:

greater than year()

11.2.2 - Internal Links

Adding links within Stroom to internal features/items or external URLs.

Within Stroom, links can be created in dashboard tables or dashboard text panes that will direct Stroom to display an item in various ways.

Links are inserted in the form:

[Link Text](URL and parameters){Link Type}

In dashboard tables links can be inserted using the link() function or more specialised functions such as data() or stepping().

In dashboard text panes, links can be inserted into the HTML as link attributes on elements.

Note

The text pane must be set to Show As HTML for links to operate.

  <div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" link="[link](uuid=e177cf16-da6c-4c7d-a19c-09a201f5a2da&amp;params=user%3Duser1&amp;title=Details%20For%20user1){dashboard}">Details For user1</span>
  </div>

The link type can be one of the following:

dialog : Display the content of a link URL within a stroom popup dialog.
tab : Display the content of a link URL within a stroom tab.
browser : Display the content of a link URL within a new browser tab.
dashboard : Used to launch a Stroom dashboard internally with parameters in the URL.
stepping : Used to launch Stroom stepping internally with parameters in the URL.
data : Used to show Stroom data internally with parameters in the URL.
annotation : Used to show a Stroom annotation internally with parameters in the URL.

Dialog

Dialog links are used to embed any referenced URL in a Stroom popup Dialog. Dialog links look something like this in HTML:

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[Show](https://www.somehost.com/somepath){dialog|Embedded In Stroom}">
        Show In Stroom Dialog
    </span>
</div>

Note

The dialog title can be controlled by adding a | and required title after the type, e.g.

{dialog|Embedded In Stroom}

Tab

Tab links are similar to dialog links are used to embed any referenced URL in a Stroom tab. Tab links look something like this in HTML:

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[Show](https://www.somehost.com/somepath){tab|Embedded In Stroom}">
        Show In Stroom Tab
    </span>
</div>

Note

The tab title can be controlled by adding a | and required title after the type, e.g.

{tab|Embedded In Stroom}

Browser

Browser links are used to open any referenced URL in a new browser tab. In most cases this is easily accomplished via a normal hyperlink but Stroom also provides a mechanism to do this as a link event so that dashboard tables are also able to open new browser tabs. This can be accomplished by using the link() table function. In a dashboard text pane the HTML could look like this:

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[Show](https://www.somehost.com/somepath){browser}">
        Show In Browser Tab
    </span>
</div>

Note

Unlike the other link types there is no way to control the browser tab title.

Dashboard

In addition to viewing/embedding external URLs, Stroom links can be used to direct Stroom to show an internal item or feature. The dashboard link type allows Stroom to open a new tab and show a dashboard with the specified parameters.

The format for a dashboard link is as follows:

[Link Text](uuid=<UUID>&params=<PARAMS>&title=<CUSTOM_TITLE>){dashboard}

The parameters for dashboard links are:

uuid - The UUID of the dashboard to open.
params - A URL encoded list of params to supply to the dashboard, e.g. params=user%3Duser1.
title - An optional URL encoded title to better identify the specific instance of the dashboard, e.g. title=Details%20For%20user1.

Note

Parameter values can be URL encoded in XSLT using the encode-for-uri function.

An example of this type of link in HTML:

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[link](uuid=e177cf16-da6c-4c7d-a19c-09a201f5a2da&amp;params=user%3Duser1&amp;title=Details%20For%20user1){dashboard}">
        Details For user1
    </span>
</div>

Note

By using a pipeline with the appropriate XSLT it is possible to dynamically generate links in dashboard text panes that will be specific to the data being displayed.

Data

A link can be created to open a sub-set of a source of data (i.e. part of a stream) for viewing. The data can either be opened in a popup dialog (dialog) or in another stroom tab (tab). It can also be display in preview form (with formatting and syntax highlighting) or unaltered source form.

Note

To make full use of data links for viewing raw data, you need to use the stroom:source() XSLT Function to decorate an event with the details of the source location it derived from.

The format for a data link is as follows:

[Link Text](id=<STREAM_ID>&partNo=<PART_NO>&recordNo=<RECORD_NO>&lineFrom=<LINE_FROM>&colFrom=<COL_FROM>&lineTo=<LINE_TO>&colTo=<COL_TO>&viewType=<VIEW_TYPE>&displayType=<DISPLAY_TYPE>){data}

Stroom deals in two main types of stream, segmented and non-segmented (see Streams). Data in a non-segmented (i.e. raw) stream is identified by an id, a partNo and optionally line and column positions to define the sub-set of that stream part to display. Data in a segmented (i.e. cooked) stream is identified by an id, a recordNo and optionally line and column positions to define the sub-set of that record (i.e. event) within that stream.

The parameters for data links are:

id - The stream ID.
partNo - The part number of the stream (one based). Always 1 for segmented (cooked) streams.
recordNo - The record number within a segmented stream (optional). Not applicable for non-segmented streams so use null() instead.
lineFrom - The line number of the start of the sub-set of data (optional, one based).
colFrom - The column number of the start of the sub-set of data (optional, one based).
lineTo - The line number of the end of the sub-set of data (optional, one based).
colTo - The column number of the end of the sub-set of data (optional, one based).
viewType - The type of view of the data (optional, defaults to preview):
- preview : Display the data as a formatted preview of a limited portion of the data.
- source : Display the un-formatted data in its original form with the ability to navigate around all of the data source.
displayType - The way of displaying the data (optional, defaults to dialog):
- dialog : Open as a modal popup dialog.
- tab : Open as a top level tab within the Stroom browser tab.

In preview mode the line and column positions will limit the data displayed to the specified selection. In source mode the line and column positions define a highlight block of text within the part/record.

Warning

The displayType value tab is not supported if the dashboard is viewed via a Direct URL. This is because a direct URL displays only the dashboard without Stroom’s top level tab bar so it is not possible to open it as a top level tab.

An example of this type of link in HTML:

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[link](id=1822&amp;partNo=1&amp;recordNo=1){data}">
        Show Source</span>
</div>

View Type

The additional parameter viewType can be used to switch the data view mode from preview (default) to source.

In preview mode the optional parameters lineFrom, colFrom, lineTo, colTo can be used to limit the portion of the data that is displayed.

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[link](id=1822&amp;partNo=1&amp;recordNo=1&amp;viewType=preview&amp;lineFrom=1&amp;colFrom=1&amp;lineTo=10&amp;colTo=8){data}">
        Show Source Preview
    </span>
</div>

In source mode the optional parameters lineFrom, colFrom, lineTo, colTo can be used to highlight a portion of the data that is displayed.

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[link](id=1822&amp;partNo=1&amp;recordNo=1&amp;viewType=source&amp;lineFrom=1&amp;colFrom=1&amp;lineTo=10&amp;colTo=8){data}">
        Show Source
    </span>
</div>

Display Type

Choose whether to display data in a dialog (default) or a Stroom tab.

Stepping

A stepping link can be used to launch the data stepping feature with the specified data. The format for a stepping link is as follows:

[Link Text](id=<STREAM_ID>&partNo=<PART_NO>&recordNo=<RECORD_NO>){stepping}

The parameters for stepping links are as follows:

id - The id of the stream to step.
partNo - The sub part no within the stream to step (usually 1).
recordNo - The record or event number within the stream to step.

An example of this type of link in HTML:

<div style="padding: 5px;">
    <span style="text-decoration:underline;color:blue;cursor:pointer" 
          link="[link](id=1822&amp;partNo=1&amp;recordNo=1){stepping}">
        Step Source</span>
</div>

Annotation

A link can be used to edit or create annotations. To view or edit an existing annotation the id must be known or one can be found using a stream and event id. If all parameters are specified an annotation will either be created or edited depending on whether it exists or not. The format for an annotation link is as follows:

[Link Text](annotationId=<ANNOTATION_ID>&streamId=<STREAM_ID>&eventId=<EVENT_ID>&title=<TITLE>&subject=<SUBJECT>&status=<STATUS>&assignedTo=<ASSIGNED_TO>&comment=<COMMENT>){annotation}

The parameters for annotation links are as follows:

annotationId - The optional existing id of an annotation if one already exists.
streamId - An optional stream id to link to a newly created annotation, or used to lookup an existing annotation if no annotation id is provided.
eventId - An optional event id to link to a newly created annotation, or used to lookup an existing annotation if no annotation id is provided.
title - An optional default title to give the annotation if a new one is created.
subject - An optional default subject to give the annotation if a new one is created.
status - An optional default status to give the annotation if a new one is created.
assignedTo - An optional initial assignedTo value to give the annotation if a new one is created.
comment - An optional initial comment to give the annotation if a new one is created.

11.2.3 - Direct URLs

Navigating directly to a specific Stroom dashboard using a direct URL.

It is possible to navigate directly to a specific Stroom dashboard using a direct URL. This can be useful when you have a dashboard that needs to be viewed by users that would otherwise not be using the Stroom user interface.

URL format

The format for the URL is as follows:

https://<HOST>/stroom/dashboard?type=Dashboard&uuid=<DASHBOARD UUID>[&title=<DASHBOARD TITLE>][&params=<DASHBOARD PARAMETERS>]

Example:

https://localhost/stroom/dashboard?type=Dashboard&uuid=c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09&title=My%20Dash&params=userId%3DFred%20Bloggs

Host and path

The host and path are typically https://<HOST>/stroom/dashboard where <HOST> is the hostname/IP for Stroom.

type

type is a required parameter and must always be Dashboard since we are opening a dashboard.

uuid

uuid is a required parameter where <DASHBOARD UUID> is the UUID for the dashboard you want a direct URL to, e.g. uuid=c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09

The UUID for the dashboard that you want to link to can be found by right clicking on the dashboard icon in the explorer tree and selecting Info.

The Info dialog will display something like this and the UUID can be copied from it:

DB ID: 4
UUID: c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09
Type: Dashboard
Name: Stroom Family App Events Dashboard
Created By: INTERNAL
Created On: 2018-12-10T06:33:03.275Z
Updated By: admin
Updated On: 2018-12-10T07:47:06.841Z

title (Optional)

title is an optional URL parameter where <DASHBOARD TITLE> allows the specification of a specific title for the opened dashboard instead of the default dashboard name.

The inclusion of ${name} in the title allows the default dashboard name to be used and appended with other values, e.g. 'title=${name}%20-%20' + param.name

params (Optional)

params is an optional URL parameter where <DASHBOARD PARAMETERS> includes any parameters that have been defined for the dashboard in any of the expressions, e.g. params=userId%3DFred%20Bloggs

Permissions

In order for as user to view a dashboard they will need the necessary permission on the various entities that make up the dashboard.

For a Lucene index query and associated table the following permissions will be required:

Read permission on the Dashboard entity.
Use permission on any Indexe entities being queried in the dashboard.
Use permission on any Pipeline entities set as search extraction Pipelines in any of the dashboard’s tables.
Use permission on any XSLT entities used by the above search extraction Pipeline entites.
Use permission on any ancestor pipelines of any of the above search extraction Pipeline entites (if applicable).
Use permission on any Feed entities that you want the user to be able to see data for.

For a SQL Statistics query and associated table the following permissions will be required:

Read permission on the Dashboard entity.
Use permission on the StatisticStore entity being queried.

For a visualisation the following permissions will be required:

Read permission on any Visualiation entities used in the dashboard.
Read permission on any Script entities used by the above Visualiation entities.
Read permission on any Script entities used by the above Script entities.

11.3 - Query

A Query document defines a search query in text form using the Stroom Query Language and displays the results as a table or a visualisation.

TODO

Complete this section.

11.3.1 - Stroom Query Language

Stroom Query Language (StroomQL) is a query language for retrieving data in Stroom.

Query Format

Stroom Query Language (StroomQL) is a text based replacement for the existing Dashboard query builder and allows you to express the same queries in text form as well as providing additional functionality. It is currently used on the Query entity as the means of defining a query.

The following shows the supported syntax for a StroomQL query.

from <DATA_SOURCE>
where <FIELD> <CONDITION> <VALUE> [and|or|not]
[and|or|not]
[window] <TIME_FIELD> by <WINDOW_SIZE> [advance <ADVANCE_WINDOW_SIZE>]
[filter] <FIELD> <CONDITION> <VALUE> [and|or|not]
[and|or|not]
[eval...] <FIELD> = <EXPRESSION>
[having] <FIELD> <CONDITION> <VALUE> [and|or|not]
[group by] <FIELD>
[sort by] <FIELD> [desc|asc] // asc by default
[limit] <MAX_ROWS> 
select <FIELD> [as <COLUMN NAME>], ...
[show as] <VIS_NAME> (<VIS_CONTROL_ID_1> = <COLUMN_1>, <VIS_CONTROL_ID_2> = <COLUMN_2>)

Keywords

From

The first part of a StroomQL expression is the from clause that defines the single data source to query. All queries must include the from clause.

Select the data source to query, e.g.

from my_source

If the name of the data source contains white space then it must be quoted, e.g.

from "my source"

Where

Use where to construct query criteria, e.g.

where feed = "my feed"

Add boolean logic with and, or and not to build complex criteria, e.g.

where feed = "my feed"
or feed = "other feed"

Use brackets to group logical sub expressions, e.g.

where user = "bob"
and (feed = "my feed" or feed = "other feed")

Conditions

Supported conditions are:

=
!=
>
>=
<
<=
is null
is not null

And|Or|Not

Logical operators to add to where and filter clauses.

Bracket groups

You can force evaluation of items in a specific order using bracketed groups.

and X = 5 OR (name = foo and surname = bar)

Window

window <TIME_FIELD> by <WINDOW_SIZE> [advance <ADVANCE_WINDOW_SIZE>]

Windowing groups data by a specified window size applied to a time field. A window inserts additional rows for future periods so that rows for future periods contain count columns for previous periods.

Specify the field to window by and a duration. Durations are specified in simple terms e.g. 1d, 2w etc.

By default, a window will insert a count into the next period row. This is because by default we advance by the specified window size. If you wish to advance by a different duration you can specify the advance amount which will insert counts into multiple future rows.

Filter

Use filter to filter values that have not been indexed during search retrieval. This is used the same way as the where clause but applies to data after being retrieved from the index, e.g.

filter obscure_field = "some value"

Add boolean logic with and, or and not to build complex criteria as supported by the where clause. Use brackets to group logical sub expressions as supported by the where clause.

Note

As filters do not make use of the index they can be considerably slower than a where clause, however they allow filtering on fields that have not been indexed for some reason. Frequent use of filter on a field suggests you may want to consider including that field in an index.

Eval

Use eval to assign the value returned from an Expression Function to a named variable, e.g.

eval my_count = count()

Here the result of the count() function is being stored in a variable called my_count. Functions can be nested and applied to variables, e.g.

eval new_name = concat(
  substring(name, 3, 5),
  substring(name, 8, 9))

Note that all fields in the data source selected using from will be available as variables by default.

Multiple eval statements can also be used to breakup complex function expressions and make it easier to comment out individual evaluations, e.g.

eval name_prefix = substring(name, 3, 5)
eval name_suffix = substring(name, 8, 9)
eval new_name = concat(
  name_prefix,
  name_suffix)

Variables can be reused, e.g.

eval name_prefix = substring(name, 3, 5)
eval new_name = substring(name, 8, 9)
eval new_name = concat(
  name_prefix,
  new_name)

In this example, the second assignment of new_name will override the value initially assigned to it. Note that that when reusing a variable name, the assignment can depend on the previous value assigned to that variable.

Add boolean logic with and, or and not to build complex criteria, e.g.

where feed = "my feed" or feed = "other feed"

Use brackets to group logical sub expressions, e.g.

where user = "bob" and (feed = "my feed" or feed = "other feed")

Having

A post aggregate filter that is applied at query time to return only rows that match the having conditions.

having count > 3

Group By

Use to group by columns, e.g.

group by feed

You can group across multiple columns, e.g.

group by feed, name

You can create nested groups, e.g.

group by feed
group by name

Sort By

Use to sort by columns, e.g.

sort by feed

You can sort across multiple columns, e.g.

sort by feed, name

You can change the sort direction, e.g.

sort by feed asc

Or

sort by feed desc

Limit

Limit the number of results, e.g.

limit 10

Select

The select keyword is used to define the fields that will be selected out of the data source (and any eval’d fields) for display in the table output.

select feed, name

You can optionally rename the fields so that they appear in the table with more human friendly names.

select feed as 'my feed column',
  name as 'my name column'

Show

The show keyword is used to tell StroomQL how to show the data resulting from the select. A Stroom visualisation can be specified and then passed column values from the select for the visualisation control properties.

show LineChart(x = EventTime, y = count)
show Doughnut(names = Feed, values = count)

For visualisations that contain spaces in their names it is necessary to use quotes, e.g.

show "My Visualisation" (x = EventTime, y = count)

Comments

Single line

StroomQL supports single line comments using //. For example:

from "index_view" // view
where EventTime > now() - 1227d
// and StreamId = 1210
select StreamId as "Stream Id", EventTime as "Event Time"

Multi line

Multiple lines can be commented by surrounding sections with /* and */. For example:

from "index_view" // view
where EventTime > now() - 1227d
/*
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
*/
select StreamId as "Stream Id", EventTime as "Event Time"

Examples

The following are various example queries.

// add a where
from "index_view" // view
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
eval count = count()
group by StreamId
sort by Sl desc
select Sl, StreamId as "Stream Id", EventId as "Event Id", EventTime as "Event Time", UserId as "User Id", FirstName, count
limit 10

from "index_view" // view
// add a where
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
eval count = count()
group by StreamId
sort by Sl desc
select Sl, StreamId as "Stream Id", EventId as "Event Id", EventTime as "Event Time", UserId as "User Id", FirstName, count
limit 10

from "index_view" // view
// add a where
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
// eval count = count()
// group by StreamId
// sort by Sl desc
select StreamId as "Stream Id", EventId as "Event Id"
// limit 10

from "index_view" // view
// add a where
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
eval count = count()
group by StreamId
sort by Sl desc
select Sl, StreamId as "Stream Id", EventId as "Event Id", EventTime as "Event Time", UserId as "User Id", FirstName, count
limit 10

11.4 - Analytic Rules

Analytic Rules are queries that can be run against the data either as it is ingested or on a scheduled basis.

TODO

Complete this section.

11.5 - Search Extraction

The process of combining data extracted from events with the data stored in an index.

When indexing data it is possible to store (see Stored Fields all data in the index. This comes with a storage cost as the data is then held in two places; the event; and the index document.

Stroom has the capability of doing Search Extraction at query time. This involves combining the data stored in the index document with data extracted using a search extraction pipeline. Extracting data in this way is slower but reduces the data stored in the index, so it is a trade off between performance and storage space consumed.

Search Extraction relies on the StreamId and EventId being stored in the Index. Stroom can then used these two fields to locate the event in the stream store and process it with the search extraction pipeline.

TODO

Add more detail

11.6 - Dictionaries

Creating

Right click on a folder in the explorer tree that you want to create a dictionary in. Choose ‘New/Dictionary’ from the popup menu:

Call the dictionary something like ‘My Dictionary’ and click OK .

Now just add any search terms you want to the newly created dictionary and click .

You can add multiple terms.

Terms on separate lines act as if they are part of an ‘OR’ expression when used in a search.
```
apple
banana
orange
```
Terms on a single line separated by spaces act as if they are part of an ‘AND’ expression when used in a search.
```
apple,banana,orange
```

Using the Dictionary

To perform a search using your dictionary, just choose the newly created dictionary as part of your search expression:

TODO: Fix image

12 - Security

All aspects of securing Stroom and the content within it. Includes application security, user and group accounts and the permissions model.

12.1 - Deployment

There are many aspects of security that should be considered when installing and running Stroom.

Shared Storage

For most large installations Stroom uses shared storage for its data store. This storage could be a CIFS, NFS or similar shared file system. It is recommended that access to this shared storage is protected so that only the application can access it. This could be achieved by placing the storage and application behind a firewall and by requiring appropriate authentication to the shared storage. It should be noted that NFS is unauthenticated so should be used with appropriate safeguards.

MySQL

Accounts

It is beyond the scope of this article to discuss this in detail but all MySQL accounts should be secured on initial install. Official guidance for doing this can be found here .

Communication

Communication between MySQL and the application should be secured. This can be achieved in one of the following ways:

Placing MySQL and the application behind a firewall
Securing communication through the use of iptables
Making MySQL and the application communicate over SSL (see here for instructions)

The above options are not mutually exclusive and may be combined to better secure communication.

Application

Node to node communication

In a multi node Stroom deployment each node communicates with the master node. This can be configured securely in one of several ways:

Direct communication to Tomcat on port 8080 - Secured by being behind a firewall or using iptables
Direct communication to Tomcat on port 8443 - Secured using SSL and certificates
Removal of Tomcat connectors other than AJP and configuration of Apache to communicate on port 443 using SSL and certificates

Application to Stroom Proxy Communication

The application can be configured to share some information with Stroom Proxy so that Stroom Proxy can decide whether or not to accept data for certain feeds based on the existence of the feed or it’s reject/accept status. The amount of information shared between the application and the proxy is minimal but could be used to discover what feeds are present within the system. Securing this communication is harder as both the application and the proxy will not typically reside behind the same firewall. Despite this communication can still be performed over SSL thus protecting this potential attack vector.

Admin port

Stroom (v6 and above) and its associated family of stroom-* Dropwizard based services all expose an admin port (8081 in the case of stroom). This port serves up various health check and monitoring pages as well as a number of restful services for initiating admin tasks. There is currently no authentication on this admin port so it is assumed that access to this port will be tightly controlled using a firewall, iptables or similar.

Servlets

There are several servlets in Stroom that are accessible by certain URLs. Considerations should be made about what URLs are made available via Apache and who can access them. The servlets, path and function are described below:

Servlet	Path	Function	Risk
DataFeed	/datafeed or /datafeed/*	Used to receive data	Possible denial of service attack by posting too much data/noise
RemoteFeedService	/remoting/remotefeedservice.rpc	Used by proxy to ask application about feed status (described in previous section)	Possible to systematically discover which feeds are available. Communication with this service should be secured over SSL discussed above
DynamicCSSServlet	/stroom/dynamic.css	Serves dynamic CSS based on theme configuration	Low risk as no important data is made available by this servlet
DispatchService	/stroom/dispatch.rpc	Service for UI and server communication	All back-end services accessed by this umbrella service are secured appropriately by the application
ImportFileServlet	/stroom/importfile.rpc	Used during configuration upload	Users must be authenticated and have appropriate permissions to import configuration
ScriptServlet	/stroom/script	Serves user defined visualisation scripts to the UI	The visualisation script is considered to be part of the application just as the CSS so is not secured
ClusterCallService	/clustercall.rpc	Used for node to node communication as discussed above	Communication must be secured as discussed above
ExportConfig	/export/*	Servlet used to export configuration data	Servlet access must be restricted with Apache to prevent configuration data being made available to unauthenticated users
Status	/status	Shows the application status including volume usage	Needs to be secured so that only appropriate users can see the application status
Echo	/echo	Block GZIP data posted to the echo servlet is sent back uncompressed. This is a utility servlet for decompression of external data	URL should be secured or not made available
Debug	/debug	Servlet for echoing HTTP header arguments including certificate details	Should be secured in production environments
SessionList	/sessionList	Lists the logged in users	Needs to be secured so that only appropriate users can see who is logged in
SessionResourceStore	/resourcestore/*	Used to create, download and delete temporary files liked to a users session such as data for export	This is secured by using the users session and requiring authentication

HDFS, Kafka, HBase, Zookeeper

Stroom and stroom-stats can integrate with HDFS, Kafka, HBase and Zookeeper. It should be noted that communication with these external services is currently not secure. Until additional security measures (e.g. authentication) are put in place it is assumed that access to these services will be careful controlled (using a firewall, iptables or similar) so that only stroom nodes can access the open ports.

Content

It may be possible for a user to write XSLT, Data Splitter or other content that may expose data that we do not wish to or to cause the application some harm. At present processing operations are not isolated processes and so it is easy to cripple processing performance with a badly written translation whether written accidentally or on purpose. To mitigate this risk it is recommended that users that are given permission to create XSLT, Data Splitter and Pipeline configurations are trusted to do so.

Visualisations can be completely customised with javascript. The javascript that is added is executed in a clients browser potentially opening up the possibility of XSS attacks, an attack on the application to access data that a user shouldn’t be able to access, an attack to destroy data or simply failure/incorrect operation of the user interface. To mitigate this risk all user defined javascript is executed within a separate browser IFrame. In addition all javascript should be examined before being added to a production system unless the author is trusted. This may necessitate the creation of a separate development and testing environment for user content.

12.2 - User Accounts

User accounts for authentication when using Stroom’s internal identity provider.

TODO

The Users, Groups and Permissions screens are undergoing significant change in Stroom v7.6. Therefore this section will be updated with more detail in v7.6.

Note

If Stroom is configured to use an external Identity Provider (e.g. Azure Active Directory or AWS Cognito) then all user accounts are managed within that Identity Provider and the Manage Accounts screen in Stroom will not be available. For more details about external Identity Providers, see Open ID Connect.

Accounts vs Stroom Users

See Accounts vs Users for details on the difference between a Stroom User Account and a Stroom User.

Creating User Accounts

User accounts can only be created by a user that holds the Manage Users or Administrator Application Permission .

Create a new user account by selecting

from the main menu.

As a minimum a user account must have a unique identifier that will be used to identify them in Stroom.

If the user’s email address is added then Stroom will be able to email the user to reset their password. This functionality is configured using the properties starting with this prefix stroom.security.identity.email..

Account Flags

User accounts have a number of flags that can be set by an administrator or automatically by Stroom.

Enabled - Enables/disables the account. A disabled account cannot login. Useful for disabling a user that is temporarily on leave.
Locked - Set when a users has too many failed login attempts (controlled by the property stroom.security.identity.failedLoginLockThreshold). Can be un-set by a user with Manage Users Application Permission . A locked account cannot login.
Inactive - Set automatically in one of these cases:
- A brand new account has not been used for a duration greater than stroom.security.identity.passwordPolicy.neverUsedAccountDeactivationThreshold.
- An account has not been used for a duration greater than stroom.security.identity.passwordPolicy.unusedAccountDeactivationThreshold. A inactive account cannot login.

12.3 - Users and Groups

The Stroom user and group entities that can be granted application and document permissions.

TODO

The Users, Groups and Permissions screens are undergoing significant change in Stroom v7.6. Therefore this section will be updated with more detail in v7.6.

Accounts vs Stroom Users

See Accounts vs Users for details on the difference between a Stroom User Account and a Stroom User.

User

A Stroom User represents a human user and is linked to either a User Account in Stroom or to a user account in an external Identity Provider . It can also represent non-human processing user, e.g. where a Stroom User is created and has an API Key created for it to allow a client system to use Stroom’s API .

All audited activity in Stroom will be attributed to a Stroom User and their unique identifier will be included in the audit events.

A User can have the following:

Membership of one or more Groups.
One or more Application Permissions granted to it.
One or more Document Permissions granted to it.

Group

A Group represents a collection of Stroom Users and/or other Groups. A Group can be used to ease the management of application and document permissions by granting permissions to one Group then adding users to that Group. For example if all the users in a team require the same application and document permissions, then a Group can be created for them and the permissions assigned to the Group. When a user joins or leaves the team it is simply a case of editing the membership of the Group.

A Group can have the following:

One or more members (Users and/or other Groups).
Membership of one or more other Groups.
One or more Application Permissions granted to it.
One or more Document Permissions granted to it.

12.4 - Application Permissions

Assigning application level permssions (such as ‘Manage Users’) to users or groups.

TODO

The Users, Groups and Permissions screens are undergoing significant change in Stroom v7.6. Therefore this section will be updated with more detail in v7.6.

An Application Permission is a permission to perform an action that is not associated with a single Document or is unrelated to Documents. Application Permissions can be granted to Users or Groups.

In order to grant Application Permissions to yourself or to other Users/Groups you must have the Manage Users or Administrator Application Permissions. If you have one of these permissions then you can access the Application Permissions screen from the main menu:

Application Permission Types

The following is the list of different application permissions that can be granted to users/groups.

Permission	Description
Administrator	Full administrator rights to access and manage all data, documents and screens, i.e. everything.
Annotations	Create and view annotations in query results.
Change Owner	Change the ownership of a document or folder to another user.
Data - Delete	Delete streams.
Data - Export	Download/export streams from a feed.
Data - Import	Upload stream data into a feed.
Data - View	View stream data (e.g. in the Data Viewer or a Dashboard text pane
Data - View With Pipeline	View data in a Dashboard text pane that uses a pipeline.
Download Search Results	Download search result data on a Dashboard.
Export Configuration	Export content (i.e. documents, that you have permission to view) to a file.
Import Configuration	Import content from a file.
Manage API Keys	Access the API Keys screen to view, create, edit, delete the user’s own API keys. ‘Manage Users’ permission is also required to managed other users API keys
Manage Cache	Access the Caches screen to view and clear system caches.
Manage DB	Access the Monitoring -> Database Tables screen to view the state of the tables in the database.
Manage Index Shards	Access the Shards sub-tab on an Index document.
Manage Jobs	Access the Jobs screen to manage Stroom’s background jobs.
Manage Nodes	Access the Nodes screen to view the nodes the cluster and manage their priority and enabled states.
Manage Policies	Access the Data Retention screen to manage data retention rules.
Manage Processors	Access the Processors tab and manage the processors/filters used to process stream data through pipelines.
Manage Properties	Access to the Properties to manage the system configuration.
Manage Tasks	Access the Server Tasks screen to view/stop tasks running on the nodes.
Manage Users	Access the screens to manage users, groups, document/application permissions. Also gives the user the ability to manage API keys for other users.
Manage Volumes	Access the Data Volumes and Index Volumes screens to create/edit/delete the index/data volumes used for Lucene indexing and the stream store.
Pipeline Stepping	Step data through a pipeline using the Stepper.
View System Information	Use the System Information API. This is used by the administrators for viewing some of the internal working of Stroom to aid in debugging issues.

12.5 - Document Permissions

Assigning document level permssions (such as ‘View’) to users or groups.

TODO

The Users, Groups and Permissions screens are undergoing significant change in Stroom v7.6. Therefore this section will be updated with more detail in v7.6.

Document Permissions are permissions that are granted to Users or Groups for a specific Document . They control what documents a user/group can see and what they can do to those documents. They allow very fine grained control over what a user/group can see or do in Stroom.

For example, User jbloggs may be granted Use permission on the Index named Alert Index in order for him to be able to query that index in a dashboard, but not be able to see it in the explorer tree or change it in any way.

By default a new user with no Application Permissions, Document Permissions or Group memberships cannot view/use/modify any documents. They do not even have permission to create any documents. When logging into Stroom, they will simply see an empty explorer tree.

A user can gain varying levels of access to documents in a number of ways:

Being added to a Group that has direct or inherited permissions on one or more existing documents.
Being added to a Group that has direct or inherited permissions to create one or more document types.
Being directly granted permissions one or more existing documents.
Being directly granted the permission to create one or more document types.
Being granted the Administrator Application Permission which gives them access to ALL documents.

In order to modify the permissions on a document, you must either hold Owner permission on the document or have the Administrator Application Permission. The Document Permissions screen for a document/folder can be accessed by right clicking on it in the explorer tree and selecting:

Permission Types

The following is the list of different permissions that can be granted to users/groups on a document.

Permission	Description
Owner	Same as delete plus ability to change the document’s permissions (i.e. grant permissions on this document to other users/groups.
Delete	Same as edit plus permission to delete the document.
Edit	Same as view plus permission to edit, move, rename or add tags to the document.
View	Permission to see the document in the explorer tree, open it as read-only, copy it or export it (subject to also having the `Export Configuration` application permission.
Use	Only allow use of a document, e.g. allow use of an index as part of a search process but do not allow viewing of the document itself.

The following is the list of different permissions that can be granted to users/groups on a folder .

Permission	Description
Owner	Same as delete plus ability to change the folder’s permissions (i.e. grant permissions on this folder to other users/groups.
Delete	Same as edit plus permission to delete the folder.
Edit	Same as view plus permission to edit, move, rename or add tags to the folder.
View	Permission to see the folder in the explorer tree (and it’s child items that you also have View permission on), open it as read-only, copy it or export it (subject to also having the `Export Configuration` application permission.
Use	Only allow use of a folder, e.g. allow use of an index as part of a search process but do not allow viewing of the folder itself.

Implied Permissions

Note that each permission in the two tables above also includes all the permissions below it in the table, e.g. a user with Edit permission on a document will also have the implied permissions View and Use. There is no need to grant these lower permissions to the user, though doing so will have no impact as Stroom will user the highest value permission when checking permissions.

Inherited Permissions

If a User jbloggs is a member of Group Team A and that group is a member of group Division 123, then jbloggs will inherit all permissions from both Team A and Division 123. A User/Group will inherit all permissions of the groups that they are a member of and also from any ancestor groups of those groups.

User/Group	Permissions	Direct/Inherited
Division 123	View on Dictionary IP Allow List	Direct
Team A	View on Dictionary IP Allow List	Inherited
Team A	Owner on Dashboard Team Dashboard	Direct
jbloggs	View on Dictionary IP Allow List	Inherited
jbloggs	Owner on Dashboard Team Dashboard	Inherited
jbloggs	View on Dashboard Frank’s Dashboard	Direct

`Owner` Permission

A document can have multiple owners. An owner can be a user or a group. When a document is created by a user they are automatically made an owner of it. Any user with the Administrator role has implied ownership of ALL documents.

Having Owner permission on a document means the user can grant permissions on that document to other users, or revoke permissions from other users.

`Use` Permission

This permission allows users to access a document but not actually see it in the explorer tree or open the document in Stroom. They can however make use of the document, e.g. selecting and querying an Index in a Dashboard .

The Use permission is not relevant to all document types.

Permissions on Folders

Folders in the explorer tree work mostly in the same way as documents when it comes to permissions. There are a couple of exceptions to this.

Permission on Folder Contents

The permissions on a folder apply only to the folder itself and has no bearing on what you can/can’t do to its child items. The permissions on each child item in the folder control what you can/can’t do to those items.

For example, if you only have View permission on a folder, but have Delete on a document in that folder, then you are able to delete that document and thus change the contents of the folder.

Similarly, if you have View permission on a folder but have no permission on any of its child items, then you will just see an empty folder.

Ancestor Folder Visibility

A folder will be visible to a user in the explorer tree if the user has View permission on it OR if the user has View permission on any single document/folder that is a descendant of it.

For example, if a user has View permission on a Dictionary Dictionary_XYZ with path

System / Folder_A / Folder_B / Dictionary_XYZ

but no permissions on Folder A or Folder B, they will be able to see both Folders in the explorer tree in addition to the Dictionary. They will however not be able to open those Folders as they do not have the permission.

Therefore, when granting permissions on a document/folder to a user/group, you are also implicitly granting visibility (but not View permission) on all ancestor folders.

Create Permissions

Folders can have one or more Create Permissions granted on them to users/groups.

There is a Create Permission for each document type, e.g. Index, Dictionary, Feed, etc. A Create Permission is the ability to create a new document of that type in that folder.

For example, user jbloggs is an analyst and is granted Create Dashboard and Create Query permissions on the Folder named Joe's Folder. This means Joe can only create Dashboard or Query documents in that folder and nothing else.

Applying Changes to Descendants

When making changes to the permissions on Folder you have the option of making the changes to just that folder or to all descendants of that folder. Selecting to apply to all descendants will make all permission changes apply to every descendant, i.e. including any sub-folders and their contents or own sub-folders.

Moving and Copying Documents

When you move or copy a document/folder you have the choice of how the destination document/folder’s permissions should be derived. The move/copy dialog offers the following choices:

None - Removes all current permissions. Ignores permissions of the destination folder. You will be the owner of the moved document if not already.
Source - Keep the current permissions and ownership as they are.
Destination - Removes all current permissions. Adds the permissions of the destination folder. You will be the owner of the moved document if not already.
Combined - Keep the current permissions and add the permissions of the destination folder. You will be the owner of the moved document if not already.

Note

You must have Owner permission on the source document/folder (or Admistrator Application Permission ) if you wish to use None, Destination or Combined as these all involve a change of permissions.

13 - Tools

Various additional tools to assist in administering Stroom and accessing its data.

13.1 - Command Line Tools

Command line actions for administering Stroom.

Stroom has a number of tools that are available from the command line in addition to starting the main application.

Running commands

The basic structure of the shell command for starting one of stroom’s commands depends on whether you are running the zip distribution of stroom or a docker stack.

In either case, COMMAND is the name of the stroom command to run, as specified by the various headings on this page. Each command value is described in its own section and may take no arguments or a mixture of mandatory and optional arguments.

Note

These commands are very powerful and potentially dangerous in the wrong hands, e.g. they allow the changing of user’s passwords. Access to these commands should be strictly limited. Also, each command will run in its own JVM so are not really intended to be run when Stroom is running on the node.

Running commands with the zip distribution

The commands are run by passing the command and any of its arguments to the java command. The jar file is in the bin directory of the zip distribution.

java -jar /absolute/path/to/stroom-app-all.jar \
COMMAND \
[COMMAND_ARG...] \
path/to/config.yml

For example:

java -jar /opt/stroom/bin/stroom-app-all.jar \
reset_password \
-u joe \
-p "correct horse battery staple" \
/opt/stroom/config/config.yml

Running commands in a stroom Docker stack

Commands are run in a Docker stack using the command.sh script found in the root of the stack directory structure.

Note

You do not specify the config file location as the script does this for you.

./command.sh COMMAND [COMMAND_ARG...]

For example:

./command.sh \
reset_password \
-u joe \
-p "correct horse battery staple"

Command reference

Note

All the examples below assume you are running stroom as part of the zip distribution. If you are running a Docker stack then you will need to use the command.sh script (as described above) with the same arguments but omitting the config file path.

`server`

java -jar /absolute/path/to/stroom-app-all.jar \
server \
path/to/config.yml

This is the normal command for starting the Stroom application using the supplied YAML configuration file. The example above will start the application as a foreground process. Stroom would typically be started using the start.sh shell script, but the command above is listed for completeness.

When stroom starts it will check the database to see if any migration is required. If migration from an earlier version (including from an empty database) is required then this will happen as part of the application start process.

`migrate`

java -jar /absolute/path/to/stroom-app-all.jar migrate path/to/config.yml

There may be occasions where you want to migrate an old version but not start the application, e.g. during migration testing or to initiate the migration before starting up a cluster. This command will run the process that checks for any required migrations and then performs them. On completion of the process it exits. This runs as a foreground process.

`create_account`

java -jar /absolute/path/to/stroom-app-all.jar \
create_account \
--u USER \
--p PASSWORD \
[OPTIONS] \
path/to/config.yml

Where the named arguments are:

-u --user - The username for the user.
-p --password - The password for the user.
-e --email - The email address of the user.
-f --firstName - The first name of the user.
-s --lastName - The last name of the user.
--noPasswordChange - If set do not require a password change on first login.
--neverExpires - If set, the account will never expire.

This command will create an account in the internal identity provider within Stroom. Stroom is able to use an external OpenID identity providers such as Google or AWS Cognito but by default will use its own. When configured to use its own (the default) it will auto create an admin account when starting up a fresh instance. There are times when you may wish to create this account manually which this command allows.

Authentication Accounts and Stroom Users

The user account used for authentication is distinct to the Stroom user entity that is used for authorisation within Stroom. If an external IDP is used then the mechanism for creating the authentication account will be specific to that IDP. If using the default internal Stroom IDP then an account must be created in order to authenticate, either from within the UI if you are already authenticated as a privileged used or using this command. In either case a Stroom user will need to exist with the same username as the authentication account.

The command will fail if the user already exists. This command should NOT be run if you are using an external identity provider.

This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.

`reset_password`

java -jar /absolute/path/to/stroom-app-all.jar \
reset_password \
--u USER \
--p PASSWORD \
path/to/config.yml

Where the named arguments are:

-u --user - The username for the user.
-p --password - The password for the user.

This command is used for changing the password of an existing account in stroom’s internal identity provider. It will also reset all locked/inactive/disabled statuses to ensure the account can be logged into. This command should NOT be run if you are using an external identity provider. It will fail if the account does not exist.

This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.

`manage_users`

java -jar /absolute/path/to/stroom-app-all.jar \
manage_users \
[OPTIONS] \
path/to/config.yml

Where the named arguments are:

--createUser USER_IDENTIFIER - Creates a Stroom user with the supplied user identifier.
--greateGroup GROUP_IDENTIFIER - Creates a Stroom user group with the supplied group name.
--addToGroup USER_OR_GROUP_IDENTIFIER TARGET_GROUP - Adds a user/group to an existing group.
--removeFromGroup USER_OR_GROUP_IDENTIFIER TARGET_GROUP - Removes a user/group from an existing group.
--grantPermission USER_OR_GROUP_IDENTIFIER PERMISSION_IDENTIFIER - Grants the named application permission to the user/group.
--revokePermission USER_OR_GROUP_IDENTIFIER PERMISSION_IDENTIFIER - Revokes the named application permission from the user/group.
--listPermissions - Lists all the valid permission names.

This command allows you to manage the account permissions within stroom regardless of whether the internal identity provider or an external party is used. A typical use case for this is when using a external identity provider. In this instance Stroom has no way of auto creating an admin account when first started so the association between the account on the 3rd party IDP and the stroom user account needs to be made manually. To set up an admin account to enable you to login to stroom you can do:

This command is not intended for automation of user management tasks on a running Stroom instance that you can authenticate with. It is only intended for cases where you cannot authenticate with Stroom, i.e. when setting up a new Stroom with a 3rd party IDP or when scripting the creation of a test environment. If you want to automate actions that can be performed in the UI then you can make use of the REST API that is described at /stroom/noauth/swagger-ui.

Warning

See the section above about the distinction between authentication accounts and stroom users.

The following is an example command to create a new stroom user jbloggs, create a group called Administrators with the Administrator application permission and then add jbloggs to the Administrators group. This is a typical command to bootstrap a stroom instance with one admin user so they can login to stroom with full privileges to manage other users from within the application.

java -jar /absolute/path/to/stroom-app.jar \
manage_users \
--createUser jbloggs \
--createGroup Administrators \
--addToGroup jbloggs Administrators \
--grantPermission Administrators "Administrator" \
path/to/config.yml

Where jbloggs is the user name of the account on the identity provider.

This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.

The named arguments can be used as many times as you like so you can create multiple users/groups/grants/etc. Regardless of the order of the arguments, the changes are executed in the following order:

Create users
Create groups
Add users/groups to a group
Remove users/groups from a group
Grant permissions to users/groups
Revoke permissions from users/groups

External Identity Providers

The manageUsers command is particularly useful when using stroom with an external identity provider. In order to use a new install of stroom that is configured with an external identity provider you must first set up a user with the Administrator system permission. If this is not done, users will be able to log in to stroom but will have no permissions to do anything. You can optionally set up other groups/users with other permissions to bootstrap the stroom instance.

External OIDC identity providers have a unique identifier for each user (this may be called sub or oid) and this often takes the form of a UUID . Stroom stores this unique identifier (know as a Subject ID in stroom) against a user so it is able to associate the stroom user with the identity provider user. Identity providers may also have a more friendly display name and full name for the user, though these may not be unique.

`USER_IDENTIFIER`

The USER_IDENTIFIER is of the form subject_id[,display_name[,full_name]] e.g.:

eaddac6e-6762-404c-9778-4b74338d4a17
eaddac6e-6762-404c-9778-4b74338d4a17,jbloggs
eaddac6e-6762-404c-9778-4b74338d4a17,jbloggs,Joe Bloggs

The optional parts are so that stroom can display more human friendly identifiers for a user. They are only initial values and will always be over written with the values from the identity provider when the user logs in.

The following are examples of various uses of the --createUser argument group.

# Create a user using their unique IDP identifier and add them to group Administrators
java -jar /absolute/path/to/stroom-app.jar \
manage_users \
--createUser "45744aee-0b4c-414b-a82a-8b8b134cc201" \
--addToGroup "45744aee-0b4c-414b-a82a-8b8b134cc201"  Administrators \
path/to/config.yml

# Create a user using their unique IDP identifier, display name and full name
java -jar /absolute/path/to/stroom-app.jar \
manage_users \
--createUser "45744aee-0b4c-414b-a82a-8b8b134cc201,jbloggs,Joe Bloggs" \
--addToGroup "jbloggs"  Administrators \
path/to/config.yml

# Create multiple users at once, adding them to appropriate groups
java -jar /absolute/path/to/stroom-app.jar \
manage_users \
--createUser "45744aee-0b4c-414b-a82a-8b8b134cc201,jbloggs,Joe Bloggs" \
--createUser "37fb1eb4-f59c-4040-8e1d-57485e0f912f,jdoe,John Doe" \
--addToGroup "jbloggs"  Administrators \
--addToGroup "jdoe"  Analysts \
path/to/config.yml

`GROUP_IDENTIFIER`

The GROUP_IDENTIFIER is the name of the group in stroom, e.g. Administrators, Analysts, etc. Groups are created by an admin to help manage permissions for large number of similar users. Groups relate only to stroom and have nothing to do with the identity provider.

`USER_OR_GROUP_IDENTIFIER`

The USER_OR_GROUP_IDENTIFIER can either be the identifier for a user or a group, e.g. when granting a permission to a user/group.

It takes the following forms (with examples for each):

user_subject_id
- eaddac6e-6762-404c-9778-4b74338d4a17
user_display_name
- jbloggs
group_name
- Administrators

The value for the argument will first be treated as a unique identifier (i.e. the subject ID or group name). If the user cannot be found it will fall back to using the display name to find the user.

`create_api_key`

The create_api_key command can be used to create an API Key for a user. This is useful if, when bootstrapping a cluster, you want to set up a user and associated API Key to allow an external process to monitor/manage that Stroom cluster, e.g. using an Operator in Kubernetes.

java -jar /absolute/path/to/stroom-app.jar \
create_api_key \
--user jbloggs \
--expiresDays 365 \
--keyName "Test key" \
--outFile  /tmp/api_key.txt \
path/to/config.yml

The arguments to the command are as follows:

u user - The identity of the user to create the API Key for. This is the unique subject ID of the user.
n keyName - The name of the key. This must be unique for the user.
e expiresDays - Optional number of days after which the key should expire. This must not be greater than the configured property stroom.security.authentication.maxApiKeyExpiryAge. If not set, it will be defaulted to the maximum configured age.
c comments - Optional string to set the comments for the API Key.
o outFile - Optional path to use to output the API Key string to. If not set, the API Key string will be output to stdout.

13.2 - Stream Dump Tool

A tool for exporting stream data from Stroom.

Data within Stroom can be exported to a directory using the StreamDumpTool. The tool is contained within the core Stroom Java library and can be accessed via the command line, e.g.

java -cp "apache-tomcat-7.0.53/lib/*:lib/*:instance/webapps/stroom/WEB-INF/lib/*" stroom.util.StreamDumpTool outputDir=output

Note the classpath may need to be altered depending on your installation.

The above command will export all content from Stroom and output it to a directory called output. Data is exported to zip files in the same format as zip files in proxy repositories. The structure of the exported data is ${feed}/${pathId}/${id} by default with a .zip extension.

To provide greater control over what is exported and how the following additional parameters can be used:

feed - Specify the name of the feed to export data for (all feeds by default).

streamType - The single stream type to export (all stream types by default).

createPeriodFrom - Exports data created after this time specified in ISO8601 UTC format, e.g. 2001-01-01T00:00:00.000Z (exports from earliest data by default).

createPeriodTo - Exports data created before this time specified in ISO8601 UTC format, e.g. 2001-01-01T00:00:00.000Z (exports up to latest data by default).

outputDir - The output directory to write data to (required).

format - The format of the output data directory and file structure (${feed}/${pathId}/${id} by default).

Format

The format parameter can include several replacement variables:

feed - The name of the feed for the exported data.

streamType - The data type of the exported data, e.g. RAW_EVENTS.

streamId - The id of the data being exported.

pathId - A incrementing numeric id that creates sub directories when required to ensure no directory ends up containing too many files.

id - A incrementing numeric id similar to pathId but without sub directories.

14 - User Content

This section relates to the content created in Stroom by the user, e.g. Dashboards, Translations, Feeds, etc.

14.1 - Editing Text

How to edit user defined text content within Stroom.

Stroom uses the Ace text editor for editing and viewing text, such as XSLTs, raw data, cooked events, stepping, etc. The editor provides various useful features:

Syntax highlighting
Themes
Find/replace (see Keyboard Shortcuts)
Code auto-completion

Keyboard shortcuts

See Keyboard Shortcuts for details of the keyboard shortcuts available when using the Ace editor.

Vim key bindings

If you are familiar with the Vi/Vim text editors then it is possible to enable Vim key bindings in Stroom. This can be done in two ways.

Either globally by setting Editor Key Bindings to Vim in your user preferences:

Or within an editor using the context menu. This latter option allows you to temporarily change your bindings.

The Ace editor does not support all features of Vim however the core navigation/editing key bindings are present. The key supported features of Vim are:

Visual mode and visual block mode.
Searching with / (javascript flavour regex)
Search/replace with commands like :%s/foo/bar/g
Incrementing/decrementing numbers with Ctrl ^ + a / Ctrl ^ + b
Code (un-)folding with z , o , z , c , etc.
Text objects, e.g. >, ), ], ', ", p paragraph, w word.
Repetition with the . command.
Jumping to a line with :<line no>.

Notable features not supported by the Ace editor:

The following text objects are not supported
- b - Braces, i.e { or [.
- t - Tags, i.e. XML tags <value>.
- s - Sentence.
The g command mode command, i.e. :g/foo/d
Splits

For a list of useful Vim key bindings see this cheat sheet , though not all bindings will be available in Stroom’s Ace editor.

Use of `Esc` key in Vim mode

The Esc key is bound to the close action in Stroom, so pressing Esc will typically close a popup, dialog, selection box, etc. Dialogs will not be closed if the Ace editor has focus but as Esc is used so frequently with Vim bindings it may be advisable to use an alternative key to exit insert mode to avoid accidental closure. You can use the standard Vim binding of Ctrl ^ + [ or the custom binding of k , b as alternatives to Esc .

Auto-Completion And Snippets

The editor supports a number of different types of auto-completion of text. Completion suggestions are triggered by the following mechanisms:

Ctrl ^ + Space ␣ - when live auto-complete is disabled.
Typing - when live auto-complete is enabled.

When completion suggestions are triggered the follow types of completion may be available depending on the text being edited.

Local - any word/token found in the existing document. Useful if you have typed a long word and need to type it again.
Keyword - A word/token that has been defined in the syntax highlighting rules for the text type, i.e. function is a keyword when editing Javascript.
Snippet - A block of text that has been defined as a snippet for the editor mode (XML, Javascript, etc.).

Snippets

Snippets allow you to quickly enter pre-defined blocks of common text into the editor. For example when editing an XSLT you may want to insert a call-template with parameters. To do this using snippets you can do the following:

Type call then hit Ctrl ^ + Space ␣ .
In the list of options use the cursor keys to select call-template with-param then hit Enter ↵ or Tab ↹ to insert the snippet. The snippet will look like
```
<xsl:call-template name="template">
  <xsl:with-param name="param"></xsl:with-param>
</xsl:call-template>
```
The cursor will be positioned on the first tab stop (the template name) with the tab stop text selected.
At this point you can type in your template name, e.g. MyTemplate, then hit Tab ↹ to advance to the next tab stop (the param name)
Now type the name of the param, e.g. MyParam, then hit Tab ↹ to advance to the last tab stop positioned within the <with-param> ready to enter the param value.

Snippets can be disabled from the list of suggestions by selecting the option in the editor context menu.

Tab triggers

Some snippets can be triggered by typing an abbreviation and then hitting Tab ↹ to insert the snippet. This mechanism is faster than hitting Ctrl ^ + Space ␣ and selecting the snippet, if you can remember the snippet tab trigger abbreviations.

Available snippets

For a list of the available completion snippets see the Completion Snippet Reference.

Theme

The editor has a number of different themes that control what colours are used for the different elements in syntax highlighted text. The theme can be set User Preferences, from the main menu , select:

The list of themes available match the main Stroom theme, i.e. dark Ace editor themes for a dark Stroom theme.

14.2 - Naming Conventions

A set of guidelines for how to name and organise your content.

Stroom has been in use by GCHQ for many years and is used to process logs from a large number of different systems. This sections aims to provide some guidelines on how to name and organise your content, e.g. Feeds, XSLTs, Pipelines, Folders, etc. These are not hard rules and you do not have to follow them, however it may help when it comes to sharing content.

TODO

Complete this section

14.3 - Documenting content

The ability to document each entity created in Stroom.

The screen for each Entity in Stroom has a Documentation sub-tab. The purpose of this sub-tab is to allow the user to provide any documentation about the entity that is relevant. For example a user might want to provide information about the system that a Feed receives data from, or document the purpose of a complex XSLT translation.

In previous versions of stroom this documentation was a small and simple Description text field, however now it is a full screen of rich text. This screen defaults to its read-only preview mode, but the user can toggle it to the edit mode to edit the content. In the edit mode, the documentation can be created/edited using plain text, or Markdown . Markdown is a fairly simple markup language for producing richly formatted text from plain text.

There are many variants of markdown that all have subtly different features or syntax. Stroom uses the Showdown markdown converter to render users’ markdown content into formatted text. This link is the definitive source for supported markdown syntax.

Note

The Showdown markdown processor used in stroom is not the same as the markdown processor used within this documentation site (stroom-docs), so there may be some subtle differences in syntax.

Example Markdown Content

The following is a brief guide to the most common formatting that can be done with markdown and that is supported in Stroom.

# Markdown Example

This is an example of a markdown document.


## Headings Example

This is at level 2.


### Heading Level 3

This is at level 3.


#### Heading Level 4

This is at level 4.


## Text Styles

**bold**, __bold__, *italic*, _italic_, ***bold and italic***, ~~strike-through~~


## Bullets 

Use four spaces to indent a sub-item.

* Bullet 1
    * Bullet 1a
* Bullet 2
    * Bullet 2a

## Numbered Lists

Use four spaces to indent a sub-item.
Using `1` for all items means the makrdown processor will replace them with the correct number, making it easier to re-order items.

1. Item 1
    1. Item 1a
    1. Item 1b
1. Item 2
    1. Item 2a
    1. Item 2b

## Quoted Text

> This is a quote.

Text

> This is another quote.  
> It has multiple lines...
>
> ...and gaps and bullets
> * Item 1
> * Item 2


## Tables

Note `---:` to right align a column, `:---:` to center align it.

| Syntax      | Description | Value | Fruit  |
| ----------- | ----------- | -----:| :----: |
| Row 1       | Title       | 1     | Pear   |
| Row 2       | Text        | 10    | Apple  |
| Row 3       | Text        | 100   | Kiwi   |
| Row 4       | Text        | 1000  | Orange |

Table using `<br>` for multi-line cells.

| Name      | Description     |
|-----------|-----------------|
| Row 1     | Line 1<br>Line2 |
| Row 2     | Line 1<br>Line2 |


## Links

Line: [title](https://www.example.com)


## Simple Lists

Add two spaces to the end of each line to stop each line being treated as a paragraph.

One  
Two  
Three  

## Paragraphs

Lines not separated by a blank line will be joined together with a space between them.
Stroom will wrap long lines when rendered.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

## Task Lists

The `X` indicates a task has been completed.

* [x] Write the press release
* [ ] Update the website
* [ ] Contact the media


## Images

A non-standard syntax is supported to render images at a set size in the form of `<width>x<height>`.
Use a `*` for one of the dimensions to scale it proportionately.

![This is my alt text](images/logo.svg =200x* "This is my tooltip title")

## Separator

This is a horizontal rule separator

---

## Code

Code can be represented in-line like this `echo "hello world"` by surround it with single back-ticks.

Multi-line blocks of code can rendered with the appropriate syntax highlighting using a fenced block comprising three back-ticks.
Specify the language after the first set of three back ticks, or `text` for plain text.
Only certain languages are supported in Stroom.

**JSON**
```json
{
  "key1": "some text",
  "key2": 123
}
```

**XML**
```xml
  <record>
    <data name="dateTime" value="2020-09-28T14:30:33.476" />
    <data name="machineIp" value="19.141.201.14" />
  </record>
```

**bash**
```bash
#!/bin/bash
now="$(date)"
computer_name="$(hostname)"
echo "Current date and time : $now"
echo "Computer name : $computer_name"
```

Wrapping

Long paragraphs will be wrapped

Code Syntax Highlighting

This is an example of a fenced code block.

```xml
  <record>
    <data name="dateTime" value="2020-09-28T14:30:33.476" />
  </record>
```

In this example, xml defines the language used within the fenced block.

Stroom supports the following languages for fenced code blocks. If you require additional languages then please raised a ticket here . If your language is not currently supported or is just plain text then use text.

text
sh
bash
xml
css
javascript
csv
regex
powershell
sql
json
yaml
properties
toml

Fenced blocks with content that is wider than the pane will result in the fenced block having its own horizontal scroll bar.

Escaping Characters

It is common to use _ characters in Feed names, however if there are two of these in a word then the markdown processor will interpret them as italic markup. To prevent this, either surround the word with back ticks to be rendered as code or escape each underscore with a \, i.e. THIS\_IS\_MY\_FEED. THIS_IS_MY_FEED.

HTML

While it is possible to use HTML in the documentation, its use is not recomended as it increases the complexity of the documentation content and requires that other users have knowledge of HTML. Markdown should be sufficient for most cases, with the possible exception of complex tables where HTML may be prefereable.

Note

No form of HTML scripting (i.e. Javascript) is supported within the documentation content.

14.4 - Finding Things

How to find things in Stroom, for example content, simple string values, etc.

Explorer Tree

The Explorer Tree in stroom is the primary means of finding user created content, for example Feeds, XSLTs, Pipelines, etc.

Branches of the Explorer Tree can be expanded and collapsed to reveal/hide the content at different levels.

Filtering by Type

The Explorer Tree can be filtered by the type of content, e.g. to display only Feeds, or only Feeds and XSLTs. This is done by clicking the filter icon . The following is an example of filtering by Feeds and XSLTs.

images/user-guide/finding-things/explorer_tree_type_filter_picker.png — Explorer Tree Type Filtering

Clicking All/None toggles between all types selected and no types selected.

Filtering by type can also be achieved using the Quick Filter by entering the type name (or a partial form of the type name), prefixed with type:. I.e:

type:feed

For example:

images/user-guide/finding-things/explorer_tree_type_quick_filter.png — Explorer Tree Type Filtering

NOTE: If both the type picker and the Quick Filter are used to filter on type then the two filters will be combined as an AND.

Filtering by Name

The Explorer Tree can be filtered by the name of the entity. This is done by entering some text in the Quick Filter field. The tree will then be updated to only show entities matching the Quick Filter. The way the matching works for entity names is described in Common Fuzzy Matching

Filtering by UUID

What is a UUID?

The Explorer Tree can be filtered by the UUID of the entity. The UUID Universally unique identifier is an identifier that can be relied on to be unique both within the system and universally across all other systems. Stroom uses UUIDs as the primary identifier for all content (Feeds, XSLTs, Pipelines, etc.) created in Stroom. An entity’s UUID is generated randomly by Stroom upon creation and is fixed for the life of that entity.

When an entity is exported it is exported with its UUID and if it is then imported into another instance of Stroom the same UUID will be used. The name of an entity can be changed within Stroom but its UUID remains un-changed.

With the exception of Feeds, Stroom allows multiple entities to have the same name. This is because entities may exist that a user does not have access to see so restricting their choice of names based on existing invisible entities would be confusing. Where there are multiple entities with the same name the UUID can be used to distinguish between them.

The UUID of an entity can be viewed using the context menu for the entity. The context menu is accessed by right-clicking on the entity.

images/user-guide/finding-things/entity_context_menu.png — Entity Context Menu

Clicking Info displays the entities UUID.

images/user-guide/finding-things/entity_info.png — Entity Info

The UUID can be copied by selecting it and then pressing Ctrl ^ + c .

UUID Quick Filter Matching

In the Explorer Tree Quick Filter you can filter by UUIDs in the following ways:

To show the entity matching a UUID, enter the full UUID value (with dashes) prefixed with the field qualifier uuid, e.g. uuid:a95e5c59-2a3a-4f14-9b26-2911c6043028.

To filter on part of a UUID you can do uuid:/2a3a to find an entity whose UUID contains 2a3a or uuid:^2a3a to find an entity whose UUID starts with 2a3a.

Quick Filters

Quick Filter controls are used in a number of screens in Stroom. The most prominent use of a Quick Filter is in the Explorer Tree as described above. Quick filters allow for quick searching of a data set or a list of items using a text based query language. The basis of the query language is described in Common Fuzzy Matching.

A number of the Quick Filters are used for filter tables of data that have a number of fields. The quick filter query language supports matching in specified fields. Each Quick Filter will have a number of named fields that it can filter on. The field to match on is specified by prefixing the match term with the name of the field followed by a :, i.e. type:. Multiple field matches can be used, each separate by a space. E.g:

name:^xml name:events$ type:feed

In the above example the filter will match on items with a name beginning xml, a name ending events and a type partially matching feed.

All the match terms are combined together with an AND operator. The same field can be used multiple times in the match. The list of filterable fields and their qualifier names (sometimes a shortened form) are listed by clicking on the help icon .

One or more of the fields will be defined as default fields. This means the if no qualifier is entered the match will be applied to all default fields using an OR operator. Sometimes all fields may be considered default which means a match term will be tested against all fields and an item will be included in the results if one or more of those fields match.

For example if the Quick Filter has fields Name, Type and Status, of which Name and Type are default:

feed status:ok

The above would match items where the Name OR Type fields match feed AND the Status field matches ok.

Match Negation

Each match item can be negated using the ! prefix. This is also described in Common Fuzzy Matching. The prefix is applied after the field qualifier. E.g:

name:xml source:!/default

In the above example it would match on items where the Name field matched xml and the Source field does NOT match the regex pattern default.

Spaces and Quotes

If your match term contains a space then you can surround the match term with double quotes. Also if your match term contains a double quote you can escape it with a \ character. The following would be valid for example.

"name:csv splitter" "default field match" "symbol:\""

Multiple Terms

If multiple terms are provided for the same field then an AND is used to combine them. This can be useful where you are not sure of the order of words within the items being filtered.

For example:

User input: spain plain rain

Will match:

The rain in spain stays mainly in the plain
    ^^^^    ^^^^^                     ^^^^^
rainspainplain
^^^^^^^^^^^^^^
spain plain rain
^^^^^ ^^^^^ ^^^^
raining spain plain
^^^^^^^ ^^^^^ ^^^^^

Won’t match: sprain, rain, spain

OR Logic

There is no support for combining terms with an OR. However you can acheive this using a sinlge regular expression term. For example

User input: status:/(disabled|locked)

Will match:

Locked
^^^^^^
Disabled
^^^^^^^^

Won’t match: A MAN, HUMAN

Suggestion Input Fields

Stroom uses a number of suggestion input fields, such as when selecting Feeds, Pipelines, types, status values, etc. in the pipeline processor filter screen.

images/user-guide/finding-things/feed_suggestion.png — Feed Input Suggestions

These fields will typically display the full list of values or a truncated list where the total number of value is too large. Entering text in one of these fields will use the fuzzy matching algorithm to partially/fully match on values. See CommonFuzzy Matching below for details of how the matching works.

Common Fuzzy Matching

A common fuzzy matching mechanism is used in a number of places in Stroom. It is used for partially matching the user input to a list of a list of possible values.

In some instances, the list of matched items will be truncated to a more manageable size with the expectation that the filter will be refined.

The fuzzy matching employs a number of approaches that are attempted in the following order:

NOTE: In the following examples the ^ character is used to indicate which characters have been matched.

No Input

If no input is provided all items will match.

Contains (Default)

If no prefixes or suffixes are used then all characters in the user input will need to be contained as a whole somewhere within the string being tested. The matching is case insensitive.

User input: bad

Will match:

bad angry dog
^^^          
BAD
^^^
very badly
     ^^^  
Very bad
     ^^^

Won’t match: dab, ba d, ba

Characters Anywhere Matching

If the user input is prefixed with a ~ (tilde) character then characters anywhere matching will be employed. The matching is case insensitive.

User input: bad

Will match:

Big Angry Dog
^   ^     ^  
bad angry dog
^^^          
BAD
^^^
badly
^^^  
Very bad
     ^^^
b a d
^ ^ ^
bbaadd
^ ^ ^

Won’t match: dab, ba

Word Boundary Matching

If the user input is prefixed with a ? character then word boundary matching will be employed. This approache uses upper case letters to denote the start of a word. If you know the some or all of words in the item you are looking for, and their order, then condensing those words down to their first letters (capitalised) makes this a more targeted way to find what you want than the characters anywhere matching above. Words can either be separated by characters like _- ()[]., or be distinguished with lowerCamelCase or upperCamelCase format. An upper case letter in the input denotes the beginning of a word and any subsequent lower case characters are treated as contiguously following the character at the start of the word.

User input: ?OTheiMa

Will match:

the cat sat on their mat
            ^  ^^^^  ^^                                                                  
ON THEIR MAT
^  ^^^^  ^^ 
Of their magic
^  ^^^^  ^^   
o thei ma
^ ^^^^ ^^
onTheirMat
^ ^^^^ ^^ 
OnTheirMat
^ ^^^^ ^^

Won’t match: On the mat, the cat sat on there mat, On their moat

User input: ?MFN

Will match:

MY_FEED_NAME
^  ^    ^   
MY FEED NAME
^  ^    ^   
MY_FEED_OTHER_NAME
^  ^          ^   
THIS_IS_MY_FEED_NAME_TOO
        ^  ^    ^                  
myFeedName
^ ^   ^   
MyFeedName
^ ^   ^   
also-my-feed-name
     ^  ^    ^   
MFN
^^^
stroom.something.somethingElse.maxFileNumber
                               ^  ^   ^

Won’t match: myfeedname, MY FEEDNAME

Regular Expression Matching

If the user input is prefixed with a / character then the remaining user input is treated as a Java syntax regular expression. An string will be considered a match if any part of it matches the regular expression pattern. The regular expression operates in case insensitive mode. For more details on the syntax of java regular expressions see this internet link https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html.

User input: /(^|wo)man

Will match:

MAN
^^^
A WOMAN
  ^^^^^
Manly
^^^  
Womanly
^^^^^

Won’t match: A MAN, HUMAN

Exact Match

If the user input is prefixed with a ^ character and suffixed with a $ character then a case-insensitive exact match will be used. E.g:

User input: ^xml-events$

Will match:

xml-events
^^^^^^^^^^
XML-EVENTS
^^^^^^^^^^

Won’t match: xslt-events, XML EVENTS, SOME-XML-EVENTS, AN-XML-EVENTS-PIPELINE

Note: Despite the similarity in syntax, this is NOT regular expression matching.

Starts With

If the user input is prefixed with a ^ character then a case-insensitive starts with match will be used. E.g:

User input: ^events

Will match:

events
^^^^^^
EVENTS_FEED
^^^^^^     
events-xslt
^^^^^^

Won’t match: xslt-events, JSON_EVENTS

Note: Despite the similarity in syntax, this is NOT regular expression matching.

Ends With

If the user input is suffixed with a $ character then a case-insensitive ends with match will be used. E.g:

User input: events$

Will match:

events
^^^^^^
xslt-events
     ^^^^^^
JSON_EVENTS
     ^^^^^^

Won’t match: EVENTS_FEED, events-xslt

Note: Despite the similarity in syntax, this is NOT regular expression matching.

Wild-Carded Case Sensitive Exact Matching

If one or more * characters are found in the user input then this form of matching will be used.

This form of matching is to support those fields that accept wild-carded values, e.g. a whild-carded feed name expression term. In this instance you are NOT picking a value from the suggestion list but entering a wild-carded value that will be evaluated when the expression/filter is actually used. The user may want an expression term that matches on all feeds starting with XML_, in which case they would enter XML_*. To give an indication of what it would match on if the list of feeds remains the same, the list of suggested items will reflect the wild-carded input.

User input: XML_*

Will match:

XML_
^^^^
XML_EVENTS
^^^^

Won’t match: BAD_XML_EVENTS, XML-EVENTS, xml_events

User input: XML_*EVENTS*

Will match:

XML_EVENTS
^^^^^^^^^^
XML_SEC_EVENTS
^^^^    ^^^^^^
XML_SEC_EVENTS_FEED
^^^^    ^^^^^^

Won’t match: BAD_XML_EVENTS, xml_events

Match Negation

A match can be negated, ie. the NOT operator using the prefix !. This prefix can be applied before all the match prefixes listed above. E.g:

!/(error|warn)

In the above example it will match everything except those matched by the regex pattern (error|warn).

15 - Viewing Data

How to view data in Stroom.

Viewing Data

The data viewer is shown on the Data tab when you open (by double clicking) one of these items in the explorer tree:

Feed - to show all data for that feed.
Folder - to show all data for all feeds that are descendants of the folder.
System Root Folder - to show all data for all feeds that are ancestors of the folder.

In all cases the data shown is dependant on the permissions of the user performing the action and any permissions set on the feeds/folders being viewed.

The Data Viewer screen is made up of the following three parts which are shown as three panes split horizontally.

Stream List

This shows all streams within the opened entity (feed or folder). The streams are shown in reverse chronological order. By default Deleted and Locked streams are filtered out. The filtering can be changed by clicking on the Filter icon. This will show all stream types by default so may be a mixture of Raw events, Events, Errors, etc. depending on the feed/folder in question.

This list only shows data when a stream is selected in the streams list above it. It shows all streams related to the currently selected stream. It may show streams that are ‘ancestors’ of the selected stream, e.g. showing the Raw Events stream for an Events stream, or show descendants, e.g. showing the Errors stream which resulted from processing the selected Raw Events stream.

Content Viewer Pane

This pane shows the contents of the stream selected in the Related Streams List. The content of a stream will differ depending on the type of stream selected and the child stream types in that stream. For more information on the anatomy of streams, see Streams. This pane is split into multiple sub tabs depending on the different types of content available.

Info Tab

This sub-tab shows the information for the stream, such as creation times, size, physical file location, state, etc.

Error Tab

This sub-tab is only visible for an Error stream. It shows a table of errors and warnings with associated messages and locations in the stream that it relates to.

Data Preview Tab

This sub-tab shows the content of the data child stream, formatted if it is XML. It will only show a limited amount of data so if the data child stream is large then it will only show the first n characters.

If the stream is multi-part then you will see Part navigation controls to switch between parts. For each part you will be shown the first n character of that part (formatted if applicable).

If the stream is a Segmented stream stream then you will see the Record navigation controls to switch between records. Only one record is shown at once. If a record is very large then only the first n characters of the record will be shown.

This sub-tab is intended for seeing a quick preview of the data in a form that is easy to read, i.e. formatted. If you want to see the full data in its original form then click on the View Source link at the top right of the sub-tab.

The Data Preview tab shows a ‘progress’ bar to indicate what portion of the content is visible in the editor.

Context Tab

This sub-tab is only shown for non-segmented streams, e.g. Raw Events and Raw_Reference that have an associated context data child stream. For more details of context streams, see Context This sub-tab works in exactly the same way as the Data Preview sub-tab except that it shows a different child stream.

Meta Tab

This sub-tab is only shown for non-segmented streams, e.g. Raw Events and Raw_Reference that have an associated meta data child stream. For more details of meta streams, see Meta This sub-tab works in exactly the same way as the Data Preview sub-tab except that it shows a different child stream.

Source View

The source view is accessed by clicking the View Source link on the Data Preview sub-tab or from the data() dashboard column function. Its purpose is to display the selected child stream (data, context, meta, etc) or record in the form in which it was received, i.e un-formatted.

The Data Preview tab shows a ‘progress’ bar to indicate what portion of the content is visible in the editor.

In order to navigate through the data you have three options

Click on the ‘progress bar’ to show a porting of the data starting from the position clicked on.
Page through the data using the navigation controls.
Select a source range to display using the Set Source Range dialog which is accessed by clicking on the Lines or Chars links. This allows you to precisely select the range to display. You can either specify a range with a just start point or a start point and some form of size/position limit. If no limit is specified then Stroom will limit the data shown to the configured maximum (stroom.ui.source.maxCharactersPerFetch). If a range is entered that is too big to display Stroom will limit the data to its maximum.

A Note About Characters

Stroom does not know the size of a stream in terms of character lines/cols, it only knows the size in bytes. Due to the way character data is encoded into bytes it is not possible to say how many characters are in a stream based on its size in bytes. Stroom can only provide an estimate based on the ratio of characters to bytes seen so far in the stream.

Data Progress Bar

Stroom often handles very large streams of data and it is not feasible to show all of this data in the editor at once. Therefore Stroom will show a limited amount of the data in the editor at a time. The ‘progress’ bar at the top of the Data Preview and Source View screens shows what percentage of the data is visible in the editor and where in the stream the visible portion is located. If all of the data is visible in the editor (which includes scrolling down to see it) the bar will be green and will occupy the full width. If only some of the data is visible then the bar will be blue and the coloured part will only occupy part of the width.

16 - Volumes

Stroom’s logical storage volumes for storing event and index data.

TODO

Describe volumes

User Guide

1 - Application Programming Interfaces (API)

1.1 - API Specification

Swagger UI

API Endpoints in Application Logs

1.2 - Calling an API

Authentication

Calling an API method with curl

HTTP Requests Without a Body

Warning

Requests With a Body

Note

Handling JSON

1.3 - Query APIs

Common endpoints

Datasource

Search

Stroom as a query builder

Destroy

Keep alive

1.4 - Export Content API

Export All - /api/export/v1

Performing an Export

Note

Export Zip Format

Filenames

1.5 - Reference Data API

/api/refData/v1/lookup

Note

1.6 - Explorer API

Creating a New Document

1.7 - Feed API

Creating a Feed

Updating a Feed

Example Script

2 - Background Jobs

2.1 - Scheduler

Note

Frequency Schedules

Cron Schedules

3 - Concepts

3.1 - Streams

Anatomy of a Stream

Child Stream Types

Data

Context

Meta Data

Non-Segmented Stream

Segmented Stream

Error Stream

4 - Data Retention

Rules

Rule Precedence

Creating a Rule

Impact Summary

5 - Data Splitter

5.1 - Simple CSV Example

5.2 - Simple CSV example with heading

Input

Configuration

Output

5.3 - Complex example with regex and user defined names

Input

Configuration

Output

5.4 - Multi Line Example

Input

Configuration

Output

5.5 - Element Reference

5.5.1 - Content Providers

Root element <dataSplitter>

Attributes

ignoreErrors

bufferSize (Advanced)

Group element <group>

Attributes

id

value

See Also

Calling an API method with `curl`

Export All - `/api/export/v1`

`/api/refData/v1/lookup`

Root element `<dataSplitter>`

`ignoreErrors`

`bufferSize` (Advanced)

Group element `<group>`

`id`

`value`

`ignoreErrors`

`matchOrder`

`reverse`

The `<split>` element

`id`

`delimiter`

`escape`

`containerStart`

`containerEnd`

`maxMatch`

`minMatch`

`onlyMatch`

The `<regex>` element

`id`

`pattern`

`dotAll`

`caseInsensitive`

`maxMatch`

`minMatch`

`onlyMatch`

`advance`

The `<all>` element

`id`

The `<var>` element

`id`

The `<data>` element

`id`

`name`

`value`

Single `<data>` element example

Multiple `<data>` element example

Multi level `<data>` elements

Nesting `<data>` elements

References to `<split>` Match Groups

References to `<any>` Match Groups