User Guide
- 1: Application Programming Interfaces (API)
- 1.1: Query API
- 2: Concepts
- 2.1: Streams
- 3: Dashboards
- 3.1: Dashboard Expressions
- 3.1.1: Aggregate Functions
- 3.1.2: Cast Functions
- 3.1.3: Date Functions
- 3.1.4: Link Functions
- 3.1.5: Logic Functions
- 3.1.6: Mathematics Functions
- 3.1.7: Rounding Functions
- 3.1.8: Selection Functions
- 3.1.9: String Functions
- 3.1.10: Type Checking Functions
- 3.1.11: URI Functions
- 3.1.12: Value Functions
- 3.2: Dictionaries
- 3.3: Direct URLs
- 3.4: Queries
- 4: Data Retention
- 5: Data Splitter
- 5.1: Simple CSV Example
- 5.2: Simple CSV example with heading
- 5.3: Complex example with regex and user defined names
- 5.4: Multi Line Example
- 5.5: Element Reference
- 5.5.1: Content Providers
- 5.5.2: Expressions
- 5.5.3: Output
- 5.5.4: Variables
- 5.6: Match References, Variables and Fixed Strings
- 5.6.1: Concatenation of references
- 5.6.2: Expression match references
- 5.6.3: Use of fixed strings
- 5.6.4: Variable reference
- 6: Editing and Viewing Data
- 7: Event Feeds
- 8: Finding Things
- 9: Nodes
- 10: Pipelines
- 10.1: Reference Data
- 10.2: File Output
- 10.3: Parser
- 10.3.1: Context Data
- 10.3.2: XML Fragments
- 10.4: XSLT Conversion
- 10.4.1: XSLT Functions
- 10.4.2: XSLT Includes
- 11: Properties
- 12: Roles
- 13: Security
- 14: Stroom Jobs
- 15: Tools
- 15.1: Command Line Tools
- 15.2: Stream Dump Tool
- 16: Volumes
1 - Application Programming Interfaces (API)
Stroom has a number of public APIs to allow other systems to interact with Stroom. The APIs are defined by a Swagger spec that can be viewed as json or yaml. A dynamic user interface is also available for viewing the APIs.
1.1 - Query API
The Query API uses common request/response models and endpoints for querying each type of data source held in Stroom. The request/response models are defined in stroom-query.
Currently Stroom exposes a set of query endpoints for the following data source types. Each data source type will have its own endpoint due to differences in the way the data is queried and the restrictions imposed on the query terms. However they all share the same API definition.
- stroom-index (the Lucene based event index)
- sqlstatistics (Stroom’s own statistics store)
The detailed documentation for the request/responses is contained in the Swagger definition linked to above.
Common endpoints
The standard query endpoints are
datasource
The data source endpoint is used to query Stroom for the details of a data source with a given docRef. The details will include such things as the fields available and any restrictions on querying the data.
search
The search endpoint is used to initiate a search against a data source or to request more data for an active search. A search request can be made using iterative mode, where it will perform the search and then only return the data it has immediately available. Subsequent requests for the same queryKey will also return the data immediately available, expecting that more results will have been found by the query. Requesting a search in non-iterative mode will result in the response being returned when the query has completed and all known results have been found.
The SearchRequest model is fairly complicated and contains not only the query terms but also a definition of how the data should be returned. A single SearchRequest can include multiple ResultRequest sections to return the queried data in multiple ways, e.g. as flat data and in an alternative aggregated form.
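As a purely illustrative sketch of the json form (the field names shown here are indicative only and may not match the current model; the Swagger definition linked above is the authoritative reference), a SearchRequest broadly contains a query key, the query itself and one or more result requests:
{
  "key": { "uuid": "<QUERY KEY UUID>" },
  "query": {
    "dataSource": { "type": "Index", "uuid": "<INDEX UUID>", "name": "<INDEX NAME>" },
    "expression": { "op": "AND", "children": [] }
  },
  "resultRequests": [],
  "incremental": true
}
The easiest way to obtain a known-good SearchRequest is to export one from a dashboard, as described in the next section.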
Stroom as a query builder
Stroom is able to export the json form of a SearchRequest model from its dashboards. This makes the dashboard a useful tool for building a query and the table settings to go with it. You can use the dashboard to define the data source, define the query terms tree and build a table definition (or definitions) to describe how the data should be returned. Then, clicking the download icon on the query pane of the dashboard will generate the SearchRequest json which can be immediately used with the /search API or modified to suit.
destroy
This endpoint is used to kill an active query by supplying the queryKey for the query in question.
2 - Concepts
2.1 - Streams
Streams can either be created when data is directly POSTed in to Stroom or during the proxy aggregation process. When data is directly POSTed to Stroom the content of the POST will be stored as one Stream. With proxy aggregation, multiple files in the proxy repository may be aggregated together into a single Stream.
Anatomy of a Stream
A Stream is made up of a number of parts of which the raw or cooked data is just one. In addition to the data the Stream can contain a number of other child stream types, e.g. Context and Meta Data.
The hierarchy of a stream is as follows:
- Stream
  - Part [1 to *]
    - Data [1-1]
    - Context [0-1]
    - Meta Data [0-1]
Although all streams conform to the above hierarchy there are three main types of Stream that are used in Stroom:
- Non-segmented Stream - Raw events, Raw Reference
- Segmented Stream - Events, Reference
- Segmented Error Stream - Error
Segmented means that the data has been demarcated into segments or records.
Child Stream Types
Data
This is the actual data of the stream, e.g. the XML events, raw CSV, JSON, etc.
Context
This is additional contextual data that can be sent with the data. Context data can be used for reference data lookups.
Meta Data
This is the data about the Stream (e.g. the feed name, receipt time, user agent, etc.). This meta data either comes from the HTTP headers when the data was POSTed to Stroom or is added by Stroom or Stroom-Proxy on receipt/processing.
Non-Segmented Stream
The following is a representation of a non-segmented stream with three parts, each with Meta Data and Context child streams.
Raw Events and Raw Reference streams contain non-segmented data, e.g. a large batch of CSV, JSON, XML, etc. data. There is no notion of a record/event/segment in the data, it is simply data in any form (including malformed data) that is yet to be processed and demarcated into records/events, for example using a Data Splitter or an XML parser.
The Stream may be single-part or multi-part depending on how it is received. If it is the product of proxy aggregation then it is likely to be multi-part. Each part will have its own context and meta data child streams, if applicable.
Segmented Stream
The following is a representation of a segmented stream that contains three records (i.e. events) and the Meta Data.
Cooked Events and Reference data are forms of segmented data. The raw data has been parsed and split into records/events and the resulting data is stored in a way that allows Stroom to know where each record/event starts/ends. These streams only have a single part.
Error Stream
Error streams are similar to segmented Event/Reference streams in that they are single-part and have demarcated records (where each error/warning/info message is a record). Error streams do not have any Meta Data or Context child streams.
3 - Dashboards
3.1 - Dashboard Expressions
Expressions can be used to manipulate data on Stroom Dashboards.
Each function has a name, and some have additional aliases.
In some cases, functions can be nested, with the return value of one function being used as an argument to another.
The arguments to functions can either be other functions, literal values, or they can refer to fields on the input data using the field reference syntax ${val}.
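For example (a simple illustration using functions described later in this section), the result of one function can be passed directly to another:
upperCase(substringBefore(${val}, '-'))
${val} = 'abc-def'
> 'ABC'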
3.1.1 - Aggregate Functions
Average
Takes an average value of the arguments
average(arg)
mean(arg)
Examples
average(${val})
${val} = [10, 20, 30, 40]
> 25
mean(${val})
${val} = [10, 20, 30, 40]
> 25
Count
Counts the number of records that are passed through it. Doesn’t take any notice of the values of any fields.
count()
Examples
Supplying 3 values...
count()
> 3
Count Groups
This is used to count the number of unique values where there are multiple group levels. For Example, a data set grouped as follows
- Group by Name
- Group by Type
A groupCount could be used to count the number of distinct values of 'type' for each value of 'name'.
Count Unique
This is used to count the number of unique values passed to the function where grouping is used to aggregate values in other columns. For Example, a data set grouped as follows
- Group by Name
- Group by Type
countUnique() could be used to count the number of distinct values of 'type' for each value of 'name'.
Examples
countUnique(${val})
${val} = ['bill', 'bob', 'fred', 'bill']
> 3
Joining
Concatenates all values together into a single string. If a delimiter is supplied then the delimiter is placed between each concatenated string. If a limit is supplied then it will only concatenate up to limit values.
joining(values)
joining(values, delimiter)
joining(values, delimiter, limit)
Examples
joining(${val}, ', ')
${val} = ['bill', 'bob', 'fred', 'bill']
> 'bill, bob, fred, bill'
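If a limit is also supplied then, based on the description above, only that many values would be expected in the result, e.g.:
joining(${val}, ', ', 2)
${val} = ['bill', 'bob', 'fred', 'bill']
> 'bill, bob'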
Max
Determines the maximum value given in the args
max(arg)
Examples
max(${val})
${val} = [100, 30, 45, 109]
> 109
# They can be nested
max(max(${val}), 40, 67, 89)
${val} = [20, 1002]
> 1002
Min
Determines the minimum value given in the args
min(arg)
Examples
min(${val})
${val} = [100, 30, 45, 109]
> 30
They can be nested
min(max(${val}), 40, 67, 89)
${val} = [20, 1002]
> 20
Standard Deviation
Calculate the standard deviation for a set of input values.
stDev(arg)
Examples
round(stDev(${val}))
${val} = [600, 470, 170, 430, 300]
> 147
Sum
Sums all the arguments together
sum(arg)
Examples
sum(${val})
${val} = [89, 12, 3, 45]
> 149
Variance
Calculate the variance of a set of input values.
variance(arg)
Examples
variance(${val})
${val} = [600, 470, 170, 430, 300]
> 21704
3.1.2 - Cast Functions
To Boolean
Attempts to convert the passed value to a boolean data type.
toBoolean(arg1)
Examples:
toBoolean(1)
> true
toBoolean(0)
> false
toBoolean('true')
> true
toBoolean('false')
> false
To Double
Attempts to convert the passed value to a double data type.
toDouble(arg1)
Examples:
toDouble('1.2')
> 1.2
To Integer
Attempts to convert the passed value to an integer data type.
toInteger(arg1)
Examples:
toInteger('1')
> 1
To Long
Attempts to convert the passed value to a long data type.
toLong(arg1)
Examples:
toLong('1')
> 1
To String
Attempts to convert the passed value to a string data type.
toString(arg1)
Examples:
toString(1.2)
> '1.2'
3.1.3 - Date Functions
Parse Date
Parse a date and return a long number of milliseconds since the epoch.
parseDate(aString)
parseDate(aString, pattern)
parseDate(aString, pattern, timeZone)
Example
parseDate('2014 02 22', 'yyyy MM dd', '+0400')
> 1393012800000
Format Date
Format a date supplied as milliseconds since the epoch.
formatDate(aLong)
formatDate(aLong, pattern)
formatDate(aLong, pattern, timeZone)
Example
formatDate(1393071132888, 'yyyy MM dd', '+1200')
> '2014 02 23'
Ceiling Year/Month/Day/Hour/Minute/Second
ceilingYear(args...)
ceilingMonth(args...)
ceilingDay(args...)
ceilingHour(args...)
ceilingMinute(args...)
ceilingSecond(args...)
Examples
ceilingSecond("2014-02-22T12:12:12.888Z"
> "2014-02-22T12:12:13.000Z"
ceilingMinute("2014-02-22T12:12:12.888Z"
> "2014-02-22T12:13:00.000Z"
ceilingHour("2014-02-22T12:12:12.888Z"
> "2014-02-22T13:00:00.000Z"
ceilingDay("2014-02-22T12:12:12.888Z"
> "2014-02-23T00:00:00.000Z"
ceilingMonth("2014-02-22T12:12:12.888Z"
> "2014-03-01T00:00:00.000Z"
ceilingYear("2014-02-22T12:12:12.888Z"
> "2015-01-01T00:00:00.000Z"
Floor Year/Month/Day/Hour/Minute/Second
floorYear(args...)
floorMonth(args...)
floorDay(args...)
floorHour(args...)
floorMinute(args...)
floorSecond(args...)
Examples
floorSecond("2014-02-22T12:12:12.888Z"
> "2014-02-22T12:12:12.000Z"
floorMinute("2014-02-22T12:12:12.888Z"
> "2014-02-22T12:12:00.000Z"
floorHour("2014-02-22T12:12:12.888Z"
> 2014-02-22T12:00:00.000Z"
floorDay("2014-02-22T12:12:12.888Z"
> "2014-02-22T00:00:00.000Z"
floorMonth("2014-02-22T12:12:12.888Z"
> "2014-02-01T00:00:00.000Z"
floorYear("2014-02-22T12:12:12.888Z"
> "2014-01-01T00:00:00.000Z"
Round Year/Month/Day/Hour/Minute/Second
roundYear(args...)
roundMonth(args...)
roundDay(args...)
roundHour(args...)
roundMinute(args...)
roundSecond(args...)
Examples
roundSecond("2014-02-22T12:12:12.888Z")
> "2014-02-22T12:12:13.000Z"
roundMinute("2014-02-22T12:12:12.888Z")
> "2014-02-22T12:12:00.000Z"
roundHour("2014-02-22T12:12:12.888Z"
> "2014-02-22T12:00:00.000Z"
roundDay("2014-02-22T12:12:12.888Z"
> "2014-02-23T00:00:00.000Z"
roundMonth("2014-02-22T12:12:12.888Z"
> "2014-03-01T00:00:00.000Z"
roundYear("2014-02-22T12:12:12.888Z"
> "2014-01-01T00:00:00.000Z"
3.1.4 - Link Functions
Annotation
A helper function to make forming links to annotations easier than using Link. The Annotation function allows you to create a link to open the Annotation editor, either to view an existing annotation or to begin creating one with pre-populated values.
annotation(text, annotationId)
annotation(text, annotationId, [streamId, eventId, title, subject, status, assignedTo, comment])
If you provide just the text and an annotationId then it will produce a link that opens an existing annotation with the supplied ID in the Annotation Edit dialog.
Example
annotation('Open annotation', ${annotation:Id})
> [Open annotation](?annotationId=1234){annotation}
annotation('Create annotation', '', ${StreamId}, ${EventId})
> [Create annotation](?annotationId=&streamId=1234&eventId=45){annotation}
annotation('Escalate', '', ${StreamId}, ${EventId}, 'Escalation', 'Triage required')
> [Escalate](?annotationId=&streamId=1234&eventId=45&title=Escalation&subject=Triage%20required){annotation}
If you don’t supply an annotationId then the link will open the Annotation Edit dialog pre-populated with the optional arguments so that an annotation can be created.
If the annotationId is not provided then you must provide a streamId and an eventId.
If you don’t need to pre-populate a value then you can use '' or null() instead.
Example
annotation('Create suspect event annotation', null(), 123, 456, 'Suspect Event', null(), 'assigned', 'jbloggs')
> [Create suspect event annotation](?streamId=123&eventId=456&title=Suspect%20Event&assignedTo=jbloggs){annotation}
Dashboard
A helper function to make forming links to dashboards easier than using Link.
dashboard(text, uuid)
dashboard(text, uuid, params)
Example
dashboard('Click Here','e177cf16-da6c-4c7d-a19c-09a201f5a2da')
> [Click Here](?uuid=e177cf16-da6c-4c7d-a19c-09a201f5a2da){dashboard}
dashboard('Click Here','e177cf16-da6c-4c7d-a19c-09a201f5a2da', 'userId=user1')
> [Click Here](?uuid=e177cf16-da6c-4c7d-a19c-09a201f5a2da&params=userId%3Duser1){dashboard}
Data
Creates a link to open a source for data for viewing.
data(text, id, partNo, [recordNo, lineFrom, colFrom, lineTo, colTo, viewType, displayType])
viewType can be one of:
preview
: Display the data as a formatted preview of a limited portion of the data.
source
: Display the un-formatted data in its original form with the ability to navigate around all of the data source.
displayType can be one of:
dialog
: Open as a modal popup dialog.
tab
: Open as a top level tab within the Stroom browser tab.
Example
data('Quick View', ${StreamId}, 1)
> [Quick View](?id=1234&partNo=1)
Link
Create a string that represents a hyperlink for display in a dashboard table.
link(url)
link(text, url)
link(text, url, type)
Example
link('http://www.somehost.com/somepath')
> [http://www.somehost.com/somepath](http://www.somehost.com/somepath)
link('Click Here','http://www.somehost.com/somepath')
> [Click Here](http://www.somehost.com/somepath)
link('Click Here','http://www.somehost.com/somepath', 'dialog')
> [Click Here](http://www.somehost.com/somepath){dialog}
link('Click Here','http://www.somehost.com/somepath', 'dialog|Dialog Title')
> [Click Here](http://www.somehost.com/somepath){dialog|Dialog Title}
Type can be one of:
dialog
: Display the content of the link URL within a stroom popup dialog.
tab
: Display the content of the link URL within a stroom tab.
browser
: Display the content of the link URL within a new browser tab.
dashboard
: Used to launch a stroom dashboard internally with parameters in the URL.
If you wish to override the default title or URL of the target link in either a tab or dialog you can. Both dialog and tab types allow titles to be specified after a |, e.g. dialog|My Title.
Stepping
Open the Stepping tab for the requested data source.
stepping(text, id)
stepping(text, id, partNo)
stepping(text, id, partNo, recordNo)
Example
stepping('Click here to step',${StreamId})
> [Click here to step](?id=1)
3.1.5 - Logic Functions
Equals
Evaluates if arg1 is equal to arg2
arg1 = arg2
equals(arg1, arg2)
Examples
'foo' = 'bar'
> false
'foo' = 'foo'
> true
51 = 50
> false
50 = 50
> true
equals('foo', 'bar')
> false
equals('foo', 'foo')
> true
equals(51, 50)
> false
equals(50, 50)
> true
Note that equals cannot be applied to null and error values, e.g. x=null() or x=err(). The isNull() and isError() functions must be used instead.
Greater Than
Evaluates if arg1 is greater than arg2
arg1 > arg2
greaterThan(arg1, arg2)
Examples
51 > 50
> true
50 > 50
> false
49 > 50
> false
greaterThan(51, 50)
> true
greaterThan(50, 50)
> false
greaterThan(49, 50)
> false
Greater Than or Equal To
Evaluates if arg1 is greater than or equal to arg2
arg1 >= arg2
greaterThanOrEqualTo(arg1, arg2)
Examples
51 >= 50
> true
50 >= 50
> true
49 >= 50
> false
greaterThanOrEqualTo(51, 50)
> true
greaterThanOrEqualTo(50, 50)
> true
greaterThanOrEqualTo(49, 50)
> false
If
Evaluates the supplied boolean condition and returns one value if true or another if false
if(expression, trueReturnValue, falseReturnValue)
Examples
if(5 < 10, 'foo', 'bar')
> 'foo'
if(5 > 10, 'foo', 'bar')
> 'bar'
if(isNull(null()), 'foo', 'bar')
> 'foo'
Less Than
Evaluates if arg1 is less than arg2
arg1 < arg2
lessThan(arg1, arg2)
Examples
51 < 50
> false
50 < 50
> false
49 < 50
> true
lessThan(51, 50)
> false
lessThan(50, 50)
> false
lessThan(49, 50)
> true
Less Than or Equal To
Evaluates if arg1 is less than or equal to arg2
arg1 <= arg2
lessThanOrEqualTo(arg1, arg2)
Examples
51 <= 50
> false
50 <= 50
> true
49 <= 50
> true
lessThanOrEqualTo(51, 50)
> false
lessThanOrEqualTo(50, 50)
> true
lessThanOrEqualTo(49, 50)
> true
Not
Inverts boolean values, making true become false and false become true.
not(booleanValue)
Examples
not(5 > 10)
> true
not(5 = 5)
> false
not(false())
> true
3.1.6 - Mathematics Functions
Add
arg1 + arg2
Or reduce the args by successive addition
add(args...)
Examples
34 + 9
> 43
add(45, 6, 72)
> 123
Average
Takes an average value of the arguments
average(args...)
mean(args...)
Examples
average(10, 20, 30, 40)
> 25
mean(8.9, 24, 1.2, 1008)
> 260.525
Divide
Divides arg1 by arg2
arg1 / arg2
Or reduce the args by successive division
divide(args...)
Examples
42 / 7
> 6
divide(1000, 10, 5, 2)
> 10
divide(100, 4, 3)
> 8.33
Max
Determines the maximum value given in the args
max(args...)
Examples
max(100, 30, 45, 109)
> 109
# They can be nested
max(max(${val}), 40, 67, 89)
${val} = [20, 1002]
> 1002
Min
Determines the minimum value given in the args
min(args...)
Examples
min(100, 30, 45, 109)
> 30
They can be nested
min(max(${val}), 40, 67, 89)
${val} = [20, 1002]
> 20
Modulo
Determines the modulus of the dividend divided by the divisor.
modulo(dividend, divisor)
Examples
modulo(100, 30)
> 10
Multiply
Multiplies arg1 by arg2
arg1 * arg2
Or reduce the args by successive multiplication
multiply(args...)
Examples
4 * 5
> 20
multiply(4, 5, 2, 6)
> 240
Negate
Multiplies arg1 by -1
negate(arg1)
Examples
negate(80)
> -80
negate(23.33)
> -23.33
negate(-9.5)
> 9.5
Power
Raises arg1 to the power arg2
arg1 ^ arg2
Or reduce the args by successive raising to the power
power(args...)
Examples
4 ^ 3
> 64
power(2, 4, 3)
> 4096
Random
Generates a random number between 0.0 and 1.0
random()
Examples
random()
> 0.78
random()
> 0.89
...you get the idea
Subtract
arg1 - arg2
Or reduce the args by successive subtraction
subtract(args...)
Examples
29 - 8
> 21
subtract(100, 20, 34, 2)
> 44
Sum
Sums all the arguments together
sum(args...)
Examples
sum(89, 12, 3, 45)
> 149
3.1.7 - Rounding Functions
These functions take a value and an optional number of decimal places. If the number of decimal places is not given, the result is a whole number.
Ceiling
ceiling(value, decimalPlaces<optional>)
Examples
ceiling(8.4234)
> 9
ceiling(4.56, 1)
> 4.6
ceiling(1.22345, 3)
> 1.223
Floor
floor(value, decimalPlaces<optional>)
Examples
floor(8.4234)
> 8
floor(4.56, 1)
> 4.5
floor(1.2237, 3)
> 1.223
Round
round(value, decimalPlaces<optional>)
Examples
round(8.4234)
> 8
round(4.56, 1)
> 4.6
round(1.2237, 3)
> 1.224
3.1.8 - Selection Functions
Selection functions are a form of aggregate function operating on grouped data.
Any
Selects the first value found in the group that is not null() or err().
If no explicit ordering is set then the value selected is indeterminate.
any(${val})
Examples
any(${val})
${val} = [10, 20, 30, 40]
> 10
Bottom
Selects the bottom N values and returns them as a delimited string in the order they are read.
bottom(${val}, delimiter, limit)
Examples
bottom(${val}, ', ', 2)
${val} = [10, 20, 30, 40]
> '30, 40'
First
Selects the first value found in the group even if it is null() or err().
If no explicit ordering is set then the value selected is indeterminate.
first(${val})
Examples
first(${val})
${val} = [10, 20, 30, 40]
> 10
Last
Selects the last value found in the group even if it is null() or err().
If no explicit ordering is set then the value selected is indeterminate.
last(${val})
Examples
last(${val})
${val} = [10, 20, 30, 40]
> 40
Nth
Selects the Nth value in a set of grouped values. If there is no explicit ordering on the field selected then the value returned is indeterminate.
nth(${val}, position)
Examples
nth(${val}, 2)
${val} = [20, 40, 30, 10]
> 40
Top
Selects the top N values and returns them as a delimited string in the order they are read.
top(${val}, delimiter, limit)
Examples
top(${val}, ', ', 2)
${val} = [10, 20, 30, 40]
> '10, 20'
3.1.9 - String Functions
Concat
Appends all the arguments end to end in a single string
concat(args...)
Example
concat('this ', 'is ', 'how ', 'it ', 'works')
> 'this is how it works'
Current User
Returns the username of the user running the query.
currentUser()
Example
currentUser()
> 'jbloggs'
Decode
The arguments are split into 3 parts
- The input value to test
- Pairs of regex matchers with their respective output value
- A default result, if the input doesn’t match any of the regexes
decode(input, test1, result1, test2, result2, ... testN, resultN, otherwise)
It works much like a Java Switch/Case statement
Example
decode(${val}, 'red', 'rgb(255, 0, 0)', 'green', 'rgb(0, 255, 0)', 'blue', 'rgb(0, 0, 255)', 'rgb(255, 255, 255)')
${val}='blue'
> rgb(0, 0, 255)
${val}='green'
> rgb(0, 255, 0)
${val}='brown'
> rgb(255, 255, 255) // falls back to the 'otherwise' value
In Java, this would be equivalent to:
String decode(String value) {
  switch (value) {
    case "red":
      return "rgb(255, 0, 0)";
    case "green":
      return "rgb(0, 255, 0)";
    case "blue":
      return "rgb(0, 0, 255)";
    default:
      return "rgb(255, 255, 255)";
  }
}
decode('red')
> 'rgb(255, 0, 0)'
DecodeUrl
Decodes a URL
decodeUrl('userId%3Duser1')
> userId=user1
EncodeUrl
Encodes a URL
encodeUrl('userId=user1')
> userId%3Duser1
Exclude
If the supplied string matches one of the supplied match strings then return null, otherwise return the supplied string
exclude(aString, match...)
Example
exclude('hello', 'hello', 'hi')
> null
exclude('hi', 'hello', 'hi')
> null
exclude('bye', 'hello', 'hi')
> 'bye'
Hash
Cryptographically hashes a string
hash(value)
hash(value, algorithm)
hash(value, algorithm, salt)
Example
hash(${val}, 'SHA-512', 'mysalt')
> A hashed result...
If not specified the hash() function will use the SHA-256 algorithm. Supported algorithms are determined by the Java runtime environment.
Include
If the supplied string matches one of the supplied match strings then return it, otherwise return null
include(aString, match...)
Example
include('hello', 'hello', 'hi')
> 'hello'
include('hi', 'hello', 'hi')
> 'hi'
include('bye', 'hello', 'hi')
> null
Index Of
Finds the first position of the second string within the first
indexOf(firstString, secondString)
Example
indexOf('aa-bb-cc', '-')
> 2
Last Index Of
Finds the last position of the second string within the first
lastIndexOf(firstString, secondString)
Example
lastIndexOf('aa-bb-cc', '-')
> 5
Lower Case
Converts the string to lower case
lowerCase(aString)
Example
lowerCase('Hello DeVeLoPER')
> 'hello developer'
Match
Test an input string using a regular expression to see if it matches
match(input, regex)
Example
match('this', 'this')
> true
match('this', 'that')
> false
Query Param
Returns the value of the requested query parameter.
queryParam(paramKey)
Examples
queryParam('user')
> 'jbloggs'
Query Params
Returns all query parameters as a space delimited string.
queryParams()
Examples
queryParams()
> 'user=jbloggs site=HQ'
Replace
Perform text replacement on an input string using a regular expression to match part (or all) of the input string and a replacement string to insert in place of the matched part
replace(input, regex, replacement)
Example
replace('this', 'is', 'at')
> 'that'
String Length
Takes the length of a string
stringLength(aString)
Example
stringLength('hello')
> 5
Substring
Take a substring based on start/end index of letters
substring(aString, startIndex, endIndex)
Example
substring('this', 1, 2)
> 'h'
Substring After
Get the substring from the first string that occurs after the presence of the second string
substringAfter(firstString, secondString)
Example
substringAfter('aa-bb', '-')
> 'bb'
Substring Before
Get the substring from the first string that occurs before the presence of the second string
substringBefore(firstString, secondString)
Example
substringBefore('aa-bb', '-')
> 'aa'
Upper Case
Converts the string to upper case
upperCase(aString)
Example
upperCase('Hello DeVeLoPER')
> 'HELLO DEVELOPER'
3.1.10 - Type Checking Functions
Is Boolean
Checks if the passed value is a boolean data type.
isBoolean(arg1)
Examples:
isBoolean(toBoolean('true'))
> true
Is Double
Checks if the passed value is a double data type.
isDouble(arg1)
Examples:
isDouble(toDouble('1.2'))
> true
Is Error
Checks if the passed value is an error caused by an invalid evaluation of an expression on passed values, e.g. some values passed to an expression could result in a divide by 0 error.
Note that this method must be used to check for error as error equality using x=err() is not supported.
isError(arg1)
Examples:
isError(toLong('1'))
> false
isError(err())
> true
Is Integer
Checks if the passed value is an integer data type.
isInteger(arg1)
Examples:
isInteger(toInteger('1'))
> true
Is Long
Checks if the passed value is a long data type.
isLong(arg1)
Examples:
isLong(toLong('1'))
> true
Is Null
Checks if the passed value is null.
Note that this method must be used to check for null as null equality using x=null() is not supported.
isNull(arg1)
Examples:
isNull(toLong('1'))
> false
isNull(null())
> true
Is Number
Checks if the passed value is a numeric data type.
isNumber(arg1)
Examples:
isNumber(toLong('1'))
> true
Is String
Checks if the passed value is a string data type.
isString(arg1)
Examples:
isString(toString(1.2))
> true
Is Value
Checks if the passed value is a value data type, e.g. not null or error.
isValue(arg1)
Examples:
isValue(toLong('1'))
> true
isValue(null())
> false
Type Of
Returns the data type of the passed value as a string.
typeOf(arg1)
Examples:
typeOf('abc')
> string
typeOf(toInteger(123))
> integer
typeOf(err())
> error
typeOf(null())
> null
typeOf(toBoolean('false'))
> boolean
3.1.11 - URI Functions
Fields containing a Uniform Resource Identifier (URI) in string form can be queried to extract the URI’s individual components of authority, fragment, host, path, port, query, scheme, schemeSpecificPart and userInfo. See either RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax or Java’s java.net.URI class for details regarding the components. If any component is not present within the passed URI, then an empty string is returned.
The extraction functions are
- extractAuthorityFromUri() - extract the Authority component
- extractFragmentFromUri() - extract the Fragment component
- extractHostFromUri() - extract the Host component
- extractPathFromUri() - extract the Path component
- extractPortFromUri() - extract the Port component
- extractQueryFromUri() - extract the Query component
- extractSchemeFromUri() - extract the Scheme component
- extractSchemeSpecificPartFromUri() - extract the Scheme specific part component
- extractUserInfoFromUri() - extract the UserInfo component
If the URI is http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details
the table below displays the extracted components
Expression | Extraction |
---|---|
extractAuthorityFromUri(${URI}) | foo:bar@w1.superman.com:8080 |
extractFragmentFromUri(${URI}) | more-details |
extractHostFromUri(${URI}) | w1.superman.com |
extractPathFromUri(${URI}) | /very/long/path.html |
extractPortFromUri(${URI}) | 8080 |
extractQueryFromUri(${URI}) | p1=v1&p2=v2 |
extractSchemeFromUri(${URI}) | http |
extractSchemeSpecificPartFromUri(${URI}) | //foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2 |
extractUserInfoFromUri(${URI}) | foo:bar |
extractAuthorityFromUri
Extracts the Authority component from a URI
extractAuthorityFromUri(uri)
Example
extractAuthorityFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> 'foo:bar@w1.superman.com:8080'
extractFragmentFromUri
Extracts the Fragment component from a URI
extractFragmentFromUri(uri)
Example
extractFragmentFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> 'more-details'
extractHostFromUri
Extracts the Host component from a URI
extractHostFromUri(uri)
Example
extractHostFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> 'w1.superman.com'
extractPathFromUri
Extracts the Path component from a URI
extractPathFromUri(uri)
Example
extractPathFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> '/very/long/path.html'
extractPortFromUri
Extracts the Port component from a URI
extractPortFromUri(uri)
Example
extractPortFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> '8080'
extractQueryFromUri
Extracts the Query component from a URI
extractQueryFromUri(uri)
Example
extractQueryFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> 'p1=v1&p2=v2'
extractSchemeFromUri
Extracts the Scheme component from a URI
extractSchemeFromUri(uri)
Example
extractSchemeFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> 'http'
extractSchemeSpecificPartFromUri
Extracts the SchemeSpecificPart component from a URI
extractSchemeSpecificPartFromUri(uri)
Example
extractSchemeSpecificPartFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> '//foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2'
extractUserInfoFromUri
Extracts the UserInfo component from a URI
extractUserInfoFromUri(uri)
Example
extractUserInfoFromUri('http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details')
> 'foo:bar'
3.1.12 - Value Functions
Err
Returns err
err()
False
Returns boolean false
false()
Null
Returns null
null()
True
Returns boolean true
true()
3.2 - Dictionaries
Creating
Right click on a folder in the explorer tree that you want to create a dictionary in. Choose ‘New/Dictionary’ from the popup menu:
TODO: Fix image
Call the dictionary something like ‘My Dictionary’ and click OK.
TODO: Fix image
Now just add any search terms you want to the newly created dictionary and click save.
TODO: Fix image
You can add multiple terms.
- Terms on separate lines act as if they are part of an ‘OR’ expression when used in a search.
- Terms on a single line separated by spaces act as if they are part of an ‘AND’ expression when used in a search, as shown in the example below.
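For example (an illustrative dictionary using terms from examples elsewhere in this guide), a dictionary containing the two lines below would match events containing 'logon', as well as events containing both 'logoff' and 'user1':
logon
logoff user1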
Using
To perform a search using your dictionary, just choose the newly created dictionary as part of your search expression:
TODO: Fix image
3.3 - Direct URLs
It is possible to navigate directly to a specific Stroom dashboard using a direct URL. This can be useful when you have a dashboard that needs to be viewed by users that would otherwise not be using the Stroom user interface.
URL format
The format for the URL is as follows:
https://<HOST>/stroom/dashboard?type=Dashboard&uuid=<DASHBOARD UUID>[&title=<DASHBOARD TITLE>][&params=<DASHBOARD PARAMETERS>]
Example:
https://localhost/stroom/dashboard?type=Dashboard&uuid=c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09&title=My%20Dash&params=userId%3DFred%20Bloggs
Host and path
The host and path are typically https://<HOST>/stroom/dashboard where <HOST> is the hostname/IP for Stroom.
type
type is a required parameter and must always be Dashboard since we are opening a dashboard.
uuid
uuid is a required parameter where <DASHBOARD UUID> is the UUID for the dashboard you want a direct URL to, e.g. uuid=c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09
The UUID for the dashboard that you want to link to can be found by right clicking on the dashboard icon in the explorer tree and selecting Info.
The Info dialog will display something like this and the UUID can be copied from it:
DB ID: 4
UUID: c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09
Type: Dashboard
Name: Stroom Family App Events Dashboard
Created By: INTERNAL
Created On: 2018-12-10T06:33:03.275Z
Updated By: admin
Updated On: 2018-12-10T07:47:06.841Z
title (Optional)
title is an optional URL parameter where <DASHBOARD TITLE> allows the specification of a specific title for the opened dashboard instead of the default dashboard name.
The inclusion of ${name} in the title allows the default dashboard name to be used and appended with other values, e.g. 'title=${name}%20-%20' + param.name
params (Optional)
params is an optional URL parameter where <DASHBOARD PARAMETERS> includes any parameters that have been defined for the dashboard in any of the expressions, e.g. params=userId%3DFred%20Bloggs
Permissions
In order for a user to view a dashboard they will need the necessary permission on the various entities that make up the dashboard.
For a Lucene index query and associated table the following permissions will be required:
- Read permission on the Dashboard entity.
- Use permission on any Index entities being queried in the dashboard.
- Use permission on any Pipeline entities set as search extraction Pipelines in any of the dashboard’s tables.
- Use permission on any XSLT entities used by the above search extraction Pipeline entities.
- Use permission on any ancestor pipelines of any of the above search extraction Pipeline entities (if applicable).
- Use permission on any Feed entities that you want the user to be able to see data for.
For a SQL Statistics query and associated table the following permissions will be required:
- Read permission on the Dashboard entity.
- Use permission on the StatisticStore entity being queried.
For a visualisation the following permissions will be required:
- Read permission on any Visualisation entities used in the dashboard.
- Read permission on any Script entities used by the above Visualisation entities.
- Read permission on any Script entities used by the above Script entities.
3.4 - Queries
Dashboard queries are created with the query expression builder. The expression builder allows for complex boolean logic to be created across multiple index fields. The way in which different index fields may be queried depends on the type of data that the index field contains.
Date Time Fields
Time fields can be queried for times equal to, greater than, greater than or equal to, less than, or less than or equal to a given time, or between two times.
Times can be specified in two ways:
- Absolute times
- Relative times
Absolute Times
An absolute time is specified in ISO 8601 date time format, e.g. 2016-01-23T12:34:11.844Z
Relative Times
In addition to absolute times it is possible to specify times using expressions. Relative time expressions create a date time that is relative to the execution time of the query. Supported expressions are as follows:
- now() - The current execution time of the query.
- second() - The current execution time of the query rounded down to the nearest second.
- minute() - The current execution time of the query rounded down to the nearest minute.
- hour() - The current execution time of the query rounded down to the nearest hour.
- day() - The current execution time of the query rounded down to the nearest day.
- week() - The current execution time of the query rounded down to the first day of the week (Monday).
- month() - The current execution time of the query rounded down to the start of the current month.
- year() - The current execution time of the query rounded down to the start of the current year.
Adding/Subtracting Durations
With relative times it is possible to add or subtract durations so that queries can be constructed to provide, for example, the last week of data, the last hour of data, etc.
To add/subtract a duration from a query term the duration is simply appended after the relative time, e.g.
now() + 2d
Multiple durations can be combined in the expression, e.g.
now() + 2d - 10h
now() + 2w - 1d10h
Durations consist of a number and duration unit. Supported duration units are:
- s - Seconds
- m - Minutes
- h - Hours
- d - Days
- w - Weeks
- M - Months
- y - Years
Using these durations a query to get the last week's data could be as follows:
between now() - 1w and now()
Or midnight a week ago to midnight today:
between day() - 1w and day()
Or if you just wanted data for the week so far:
greater than week()
Or all data for the previous year:
between year() - 1y and year()
Or this year so far:
greater than year()
4 - Data Retention
By default Stroom will retain all the data it ingests and creates forever. It is likely that storage constraints/costs will mean that data needs to be deleted after a certain time. It is also likely that certain types of data may need to be kept for longer than other types.
Rules
Stroom allows for a set of data retention policy rules to be created to control at a fine grained level what data is deleted and what is retained.
The data retention rules are accessible by selecting Data Retention from the Tools menu. On first use the Rules tab of the Data Retention screen will show a single rule named Default Retain All Forever Rule. This is the implicit rule in Stroom that retains all data and is always in play unless another rule overrides it. This rule cannot be edited, moved or removed.
Rule Precedence
Rules have a precedence, with a lower rule number being a higher priority.
When running the data retention job, Stroom will look at each stream held on the system and apply the retention policy of the first rule (starting from the lowest numbered rule) that matches that stream.
Once a matching rule is found, all other rules with higher rule numbers (lower priority) are ignored.
For example, if rule 1 says to retain streams from feed X-EVENTS for 10 years and rule 2 says to retain streams from feeds *-EVENTS for 1 year, then rule 1 would apply to streams from feed X-EVENTS and they would be kept for 10 years, but rule 2 would apply to feed Y-EVENTS and they would only be kept for 1 year.
Rules are re-numbered as new rules are added/deleted/moved.
Creating a Rule
To create a rule do the following:
- Click the add icon to create a new rule.
- Edit the expression to define the data that the rule will match on.
- Provide a name for the rule to help describe what its purpose is.
- Set the retention period for data matching this rule, i.e. Forever or a set time period.
The new rule will be added at the top of the list of rules, i.e. with the highest priority. The up and down icons can be used to change the priority of the rule.
Rules can be enabled/disabled by clicking the checkbox next to the rule.
Changes to rules will not take effect until the save icon is clicked.
Rules can also be deleted and copied using the respective icons.
Impact Summary
When you have a number of complex rules it can be difficult to determine what data will actually be deleted next time the Policy Based Data Retention job runs. To help with this, Stroom has the Impact Summary tab that acts as a dry run for the active rules. The impact summary provides a count of the number of streams that will be deleted broken down by rule, stream type and feed name. On large systems with lots of data or complex rules, this query may take a long time to run.
The impact summary operates on the current state of the rules on the Rules tab, whether saved or unsaved. This allows you to make a change to the rules and test its impact before saving it.
5 - Data Splitter
Data Splitter was created to transform text into XML. The XML produced is basic but can be processed further with XSLT to form any desired XML output.
Data Splitter works by using regular expressions to match a region of content or tokenizers to split content. The whole match or match group can then be output or passed to other expressions to further divide the matched data.
The root <dataSplitter>
element controls the way content is read and buffered from the source. It then passes this content on to one or more child expressions that attempt to match the content. The child expressions attempt to match content one at a time in the order they are specified until one matches. The matching expression then passes the content that it has matched to other elements that either emit XML or apply other expressions to the content matched by the parent.
This process of content supply, match, (supply, match)*, emit is best illustrated in a simple CSV example. Note that the elements and attributes used in all examples are explained in detail in the element reference.
5.1 - Simple CSV Example
The following CSV data will be split up into separate fields using Data Splitter.
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,
The first thing we need to do is match each record. Each record in a CSV file is delimited by a new line character. The following configuration will split the data into records using ‘\n’ as a delimiter:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n"/>
</dataSplitter>
In the above example the ‘split’ tokenizer matches all of the supplied content up to the end of each line ready to pass each line of content on for further treatment.
We can now add a <group>
element within <split>
to take content matched by the tokenizer.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n">
<!-- Take the matched line (using group 1 ignores the delimiters,
without this each match would include the new line character) -->
<group value="$1">
</group>
</split>
</dataSplitter>
The <group>
within the <split>
chooses to take the content from the <split>
without including the new line ‘\n’ delimiter by using match group 1, see expression match references for details.
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
The content selected by the <group>
from its parent match can then be passed onto sub expressions for further matching:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n">
<!-- Take the matched line (using group 1 ignores the delimiters,
without this each match would include the new line character) -->
<group value="$1">
<!-- Match each value separated by a comma as the delimiter -->
<split delimiter=",">
</split>
</group>
</split>
</dataSplitter>
In the above example the additional <split>
element within the <group>
will match the content provided by the group repeatedly until it has used all of the group content.
The content matched by the inner <split>
element can be passed to a <data>
element to emit XML content.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n">
<!-- Take the matched line (using group 1 ignores the delimiters,
without this each match would include the new line character) -->
<group value="$1">
<!-- Match each value separated by a comma as the delimiter -->
<split delimiter=",">
<!-- Output the value from group 1 (as above using group 1
ignores the delimiters, without this each value would include
the comma) -->
<data value="$1" />
</split>
</group>
</split>
</dataSplitter>
In the above example each match from the inner <split>
is made available to the inner <data>
element that chooses to output content from match group 1, see expression match references for details.
The above configuration results in the following XML output for the whole input:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data value="01/01/2010" />
<data value="00:00:00" />
<data value="192.168.1.100" />
<data value="SOMEHOST.SOMEWHERE.COM" />
<data value="user1" />
<data value="logon" />
</record>
<record>
<data value="01/01/2010" />
<data value="00:01:00" />
<data value="192.168.1.100" />
<data value="SOMEHOST.SOMEWHERE.COM" />
<data value="user1" />
<data value="create" />
<data value="c:\test.txt" />
</record>
<record>
<data value="01/01/2010" />
<data value="00:02:00" />
<data value="192.168.1.100" />
<data value="SOMEHOST.SOMEWHERE.COM" />
<data value="user1" />
<data value="logoff" />
</record>
</records>
5.2 - Simple CSV example with heading
In addition to referencing content produced by a parent element it is often desirable to store content and reference it later. The following example of a CSV with a heading demonstrates how content can be stored in a variable and then referenced later on.
Input
This example will use a similar input to the one in the previous CSV example but also adds a heading line.
Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match heading line (note that maxMatch="1" means that only the
first line will be matched by this splitter) -->
<split delimiter="\n" maxMatch="1">
<!-- Store each heading in a named list -->
<group>
<split delimiter=",">
<var id="heading" />
</split>
</group>
</split>
<!-- Match each record -->
<split delimiter="\n">
<!-- Take the matched line -->
<group value="$1">
<!-- Split the line up -->
<split delimiter=",">
<!-- Output the stored heading for each iteration and the value
from group 1 -->
<data name="$heading$1" value="$1" />
</split>
</group>
</split>
</dataSplitter>
Output
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:00:00" />
<data name="IPAddress" value="192.168.1.100" />
<data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="User" value="user1" />
<data name="EventType" value="logon" />
</record>
<record>
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:01:00" />
<data name="IPAddress" value="192.168.1.100" />
<data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="User" value="user1" />
<data name="EventType" value="create" />
<data name="Detail" value="c:\test.txt" />
</record>
<record>
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:02:00" />
<data name="IPAdress" value="192.168.1.100" />
<data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="User" value="user1" />
<data name="EventType" value="logoff" />
</record>
</records>
5.3 - Complex example with regex and user defined names
The following example uses a real world Apache log and demonstrates the use of regular expressions rather than simple ‘split’ tokenizers. The usage and structure of regular expressions is outside of the scope of this document but Data Splitter uses Java’s standard regular expression library, which is documented in numerous places.
This example also demonstrates that the names and values that are output can be hard coded in the absence of field name information to make XSLT conversion easier later on. Also shown is that any match can be divided into further fields with additional expressions and the ability to nest data elements to provide structure if needed.
Input
192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /doc.htm HTTP/1.1" 200 4235 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /default.css HTTP/1.1" 200 3494 "http://some.server:8080/doc.htm" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!--
Standard Apache Format
%h - host name should be ok without quotes
%l - Remote logname (from identd, if supplied). This will return a dash unless IdentityCheck is set On.
\"%u\" - user name should be quoted to deal with DNs
%t - time is added in square brackets so is contained for parsing purposes
\"%r\" - URL is quoted
%>s - Response code doesn't need to be quoted as it is a single number
%b - The size in bytes of the response sent to the client
\"%{Referer}i\" - Referrer is quoted so that’s ok
\"%{User-Agent}i\" - User agent is quoted so also ok
LogFormat "%h %l \"%u\" %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
-->
<!-- Match line -->
<split delimiter="\n">
<group value="$1">
<!-- Provide a regular expression for the whole line with match
groups for each field we want to split out -->
<regex pattern="^([^ ]+) ([^ ]+) "([^"]+)" \[([^\]]+)] "([^"]+)" ([^ ]+) ([^ ]+) "([^"]+)" "([^"]+)"">
<data name="host" value="$1" />
<data name="log" value="$2" />
<data name="user" value="$3" />
<data name="time" value="$4" />
<data name="url" value="$5">
<!-- Take the 5th regular expression group and pass it to
another expression to divide into smaller components -->
<group value="$5">
<regex pattern="^([^ ]+) ([^ ]+) ([^ /]*)/([^ ]*)">
<data name="httpMethod" value="$1" />
<data name="url" value="$2" />
<data name="protocol" value="$3" />
<data name="version" value="$4" />
</regex>
</group>
</data>
<data name="response" value="$6" />
<data name="size" value="$7" />
<data name="referrer" value="$8" />
<data name="userAgent" value="$9" />
</regex>
</group>
</split>
</dataSplitter>
Output
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data name="host" value="192.168.1.100" />
<data name="log" value="-" />
<data name="user" value="-" />
<data name="time" value="12/Jul/2012:11:57:07 +0000" />
<data name="url" value="GET /doc.htm HTTP/1.1">
<data name="httpMethod" value="GET" />
<data name="url" value="/doc.htm" />
<data name="protocol" value="HTTP" />
<data name="version" value="1.1" />
</data>
<data name="response" value="200" />
<data name="size" value="4235" />
<data name="referrer" value="-" />
<data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
</record>
<record>
<data name="host" value="192.168.1.100" />
<data name="log" value="-" />
<data name="user" value="-" />
<data name="time" value="12/Jul/2012:11:57:07 +0000" />
<data name="url" value="GET /default.css HTTP/1.1">
<data name="httpMethod" value="GET" />
<data name="url" value="/default.css" />
<data name="protocol" value="HTTP" />
<data name="version" value="1.1" />
</data>
<data name="response" value="200" />
<data name="size" value="3494" />
<data name="referrer" value="http://some.server:8080/doc.htm" />
<data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
</record>
</records>
5.4 - Multi Line Example
Example multi-line file where records are split over many lines. There are various ways this data could be treated but this example forms a record from data created when some fictitious query starts plus the subsequent query results.
Input
09/07/2016 14:49:36 User = user1
09/07/2016 14:49:36 Query = some query
09/07/2016 16:34:40 Results:
09/07/2016 16:34:40 Line 1: result1
09/07/2016 16:34:40 Line 2: result2
09/07/2016 16:34:40 Line 3: result3
09/07/2016 16:34:40 Line 4: result4
09/07/2009 16:35:21 User = user2
09/07/2009 16:35:21 Query = some other query
09/07/2009 16:45:36 Results:
09/07/2009 16:45:36 Line 1: result1
09/07/2009 16:45:36 Line 2: result2
09/07/2009 16:45:36 Line 3: result3
09/07/2009 16:45:36 Line 4: result4
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match each record. We want to treat the query and results as a single event so match the two sets of data separated by a double new line -->
<regex pattern="\n*((.*\n)+?\n(.*\n)+?\n)|\n*(.*\n?)+">
<group>
<!-- Split the record into query and results -->
<regex pattern="(.*?)\n\n(.*)" dotAll="true">
<!-- Create a data element to output query data -->
<data name="query">
<group value="$1">
<!-- We only want to output the date and time from the first line. -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
<data name="date" value="$1" />
<data name="time" value="$2" />
<data name="$3" value="$4" />
</regex>
<!-- Output all other values -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
<data name="$3" value="$4" />
</regex>
</group>
</data>
<!-- Create a data element to output result data -->
<data name="results">
<group value="$2">
<!-- We only want to output the date and time from the first line. -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
<data name="date" value="$1" />
<data name="time" value="$2" />
<data name="$3" value="$4" />
</regex>
<!-- Output all other values -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
<data name="$3" value="$4" />
</regex>
</group>
</data>
</regex>
</group>
</regex>
</dataSplitter>
Output
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="2.0">
<record>
<data name="query">
<data name="date" value="09/07/2016" />
<data name="time" value="14:49:36" />
<data name="User" value="user1" />
<data name="Query" value="some query" />
</data>
<data name="results">
<data name="date" value="09/07/2016" />
<data name="time" value="16:34:40" />
<data name="Results" />
<data name="Line 1" value="result1" />
<data name="Line 2" value="result2" />
<data name="Line 3" value="result3" />
<data name="Line 4" value="result4" />
</data>
</record>
<record>
<data name="query">
<data name="date" value="09/07/2016" />
<data name="time" value="16:35:21" />
<data name="User" value="user2" />
<data name="Query" value="some other query" />
</data>
<data name="results">
<data name="date" value="09/07/2016" />
<data name="time" value="16:45:36" />
<data name="Results" />
<data name="Line 1" value="result1" />
<data name="Line 2" value="result2" />
<data name="Line 3" value="result3" />
<data name="Line 4" value="result4" />
</data>
</record>
</records>
5.5 - Element Reference
There are various elements used in a Data Splitter configuration to control behaviour. Each of these elements can be categorised as one of the following:
- Content Providers
- Expressions
- Output
- Variables
5.5.1 - Content Providers
Content providers take some content from the input source or elsewhere (see fixed strings) and provide it to one or more expressions. Both the root element <dataSplitter> and <group> elements are content providers.
Root element <dataSplitter>
The root element of a Data Splitter configuration is <dataSplitter>. It supplies content from the input source to one or more expressions defined within it. The root element also controls the way that content is buffered and the way that errors are handled when child expressions do not match all of the content it supplies.
Attributes
The following attributes can be added to the <dataSplitter> root element:
ignoreErrors
Data Splitter generates errors if not all of the content is matched by the regular expressions beneath the <dataSplitter> or within <group> elements. The error messages are intended to aid the user in writing good Data Splitter configurations by indicating when the input data is not being fully matched and therefore when some important data may be being skipped. Despite this, in some cases it is laborious to have to write expressions to match all content, and it is preferable to add this attribute to ignore these errors. However it is often better to write expressions that capture all of the supplied content and discard unwanted characters. This attribute also affects errors generated by the use of the minMatch attribute on <regex>, which is described later on.
Take the following example input:
Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment
This could be matched with the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex id="heading" pattern=".+" maxMatch="1">
…
</regex>
<regex id="body" pattern="\n[^#]+">
…
</regex>
</dataSplitter>
The above configuration would only match up to a comment for each record line, e.g.
Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment
This may well be the desired functionality but if there was useful content within the comment it would be lost. Because of this Data Splitter warns you when expressions are failing to match all of the content presented so that you can make sure that you aren’t missing anything important. In the above example it is obvious that this is the required behaviour but in more complex cases you might be otherwise unaware that your expressions were losing data.
To maintain this assurance that you are handling all content it is usually best to write expressions to explicitly match all content even though you may do nothing with some matches, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex id="heading" pattern=".+" maxMatch="1">
…
</regex>
<regex id="body" pattern="\n([^#]+)#.+">
…
</regex>
</dataSplitter>
The above example would match all of the content and would therefore not generate warnings. Sub-expressions of ‘body’ could use match group 1 and ignore the comment.
However as previously stated it might often be difficult to write expressions that will just match content that is to be discarded. In these cases ignoreErrors can be used to suppress errors caused by unmatched content.
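For example, the following sketch (illustrative only, based on the comment example above) adds ignoreErrors to the root element so that the unmatched comment text does not generate errors:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0" ignoreErrors="true">
  <!-- Match the heading line only -->
  <regex id="heading" pattern=".+" maxMatch="1">
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </regex>
  <!-- Match each record line up to the comment; the comment itself is never matched
       and the resulting errors are suppressed by ignoreErrors -->
  <regex id="body" pattern="\n[^#]+">
    <group>
      <split delimiter=",">
        <data name="$heading$1" value="$1" />
      </split>
    </group>
  </regex>
</dataSplitter>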
bufferSize (Advanced)
This is an optional attribute used to tune the size of the character buffer used by Data Splitter. The default size is 20000 characters and should be fine for most translations. The minimum value that this can be set to is 20000 characters and the maximum is 1000000000. The only reason to specify this attribute is when individual records are bigger than 10000 characters which is rarely the case.
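For example (the value shown here is purely illustrative), a translation with unusually large records might increase the buffer as follows:
<dataSplitter xmlns="data-splitter:3" version="3.0" bufferSize="100000">
…
</dataSplitter>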
Group element <group>
Groups behave in a similar way to the root element in that they provide content for one or more inner expressions to deal with, e.g.
<group value="$1">
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
...
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
...
Attributes
As the <group> element is a content provider it also includes the same ‘ignoreErrors’ attribute, which behaves in the same way. The complete list of attributes for the <group> element is as follows:
id
When Data Splitter reports errors it outputs an XPath to describe the part of the configuration that generated the error, e.g.
DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group>
It is often a little difficult to identify the configuration element that generated the error by looking at the path and the element description, particularly when multiple elements are the same, e.g. many <group> elements without attributes. To make identification easier you can add an ‘id’ attribute to any element in the configuration, resulting in error descriptions as follows:
DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group id="myGroupId">
value
This attribute determines what content to present to child expressions. By default the entire content matched by a group’s parent expression is passed on by the group to child expressions. If required, content from a specific match group in the parent expression can be passed to child expressions using the value attribute, e.g. value="$1". In addition to this, content can be composed in the same way as it is for data names and values. See match references for a full description.
ignoreErrors
This behaves in the same way as for the root element.
matchOrder
This is an optional attribute used to control how content is consumed by expression matches. Content can be consumed in sequence or in any order using matchOrder="sequence" or matchOrder="any". If the attribute is not specified, Data Splitter will default to matching in sequence.
When matching in sequence, each match consumes some content and the content position is moved beyond the match ready for the subsequent match. However, in some cases the order of these constructs is not predictable, e.g. we may sometimes be presented with:
Value1=1 Value2=2
… or sometimes with:
Value2=2 Value1=1
Using a sequential match order the following example would work to find both values in Value1=1 Value2=2
<group>
<regex pattern="Value1=([^ ]*)">
...
<regex pattern="Value2=([^ ]*)">
...
… but this example would skip over Value2 and only find the value of Value1 if the input was Value2=2 Value1=1.
To be able to deal with content that contains these constructs in either order we need to change the match order to any.
When matching in any order, each match removes the matched section from the content rather than moving the position past the match, so that all remaining content can be matched by subsequent expressions. In the following example the first expression would match and remove Value1=1 from the supplied content and the second expression would then be presented with Value2=2, which it could also match.
<group matchOrder="any">
<regex pattern="Value1=([^ ]*)">
...
<regex pattern="Value2=([^ ]*)">
...
If the attribute is omitted the match order defaults to sequential. This is the default behaviour as tokens are most often in sequence, and consuming content in this way is more efficient because content does not need to be copied by the parser to chop out sections as is required when matching in any order. It is only necessary to use this feature when fields that are identifiable with a specific match can occur in any order.
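As an illustration only, a complete sketch that copes with the two orderings shown above (assuming one set of name value pairs per line) might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
  <!-- Each line is a record -->
  <split delimiter="\n">
    <group value="$1" matchOrder="any">
      <!-- Because matched sections are removed rather than consumed in sequence,
           these expressions can match in either order -->
      <regex pattern="Value1=([^ ]*)">
        <data name="value1" value="$1" />
      </regex>
      <regex pattern="Value2=([^ ]*)">
        <data name="value2" value="$1" />
      </regex>
    </group>
  </split>
</dataSplitter>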
reverse
Occasionally it is desirable to reverse the content presented by a group to child expressions. This is because it is sometimes easier to form a pattern by matching content in reverse.
Take the following example content of name value pairs delimited by =, where names contain no spaces, values may contain spaces, and subsequent pairs are separated by only a single space:
ipAddress=123.123.123.123 zones=Zone 1, Zone 2, Zone 3 location=loc1 A user=An end user serverName=bigserver
We could write a pattern that matches each name value pair by matching up to the start of the next name, e.g.
<regex pattern="([^=]+)=(.+?)( [^=]+=)">
This would match the following:
ipAddress=123.123.123.123 zones=
Here we are capturing the name and value for each pair in separate groups but the pattern has to also match the name from the next name value pair to find the end of the value. By default Data Splitter will move the content buffer to the end of the match ready for subsequent matches so the next name will not be available for matching.
In addition to matching too much content, the above example also uses a reluctant qualifier .+?. Use of reluctant qualifiers almost always impacts performance so they are to be avoided if at all possible.
A better way to match the example content is to match the input in reverse, reading characters from right to left.
The following example demonstrates this:
<group reverse="true">
<regex pattern="([^=]+)=([^ ]+)">
<data name="$2" value="$1" />
</regex>
</group>
Using the reverse attribute on the parent group causes content to be supplied to all child expressions in reverse order. In the above example this allows the pattern to match values followed by names which enables us to cope with the fact that values have multiple spaces but names have no spaces.
Content is only presented to child regular expressions in reverse. When referencing values from match groups the content is returned in the correct order, e.g. the above example would return:
<data name="ipAddress" value="123.123.123.123" />
<data name="zones" value="Zone 1, Zone 2, Zone 3" />
<data name="location" value="loc1" />
<data name="user" value="An end user" />
<data name="serverName" value="bigserver" />
The reverse feature isn’t needed very often but there are a few cases where it really helps produce the desired output without the complexity and performance overhead of a reluctant match.
An alternative to using the reverse attribute is to use the original reluctant expression example but tell Data Splitter to make the subsequent name available for the next match by not advancing the content beyond the end of the previous value. This is done by using the advance attribute on the <regex>. However, the reverse attribute represents a better way to solve this particular problem and allows a simpler and more efficient regular expression to be used.
5.5.2 - Expressions
Expressions match some data supplied by a parent content provider. The content matched by an expression depends on the type of expression and how it is configured.
The <split>, <regex> and <all> elements are all expressions and match content as described below.
The <split> element
The <split> element directs Data Splitter to break up content using a specified character sequence as a delimiter. In addition to this it is possible to specify characters that are used to escape the delimiter, as well as characters that contain or “quote” a value that may include the delimiter sequence but allow it to be ignored.
Attributes
The <split> element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
delimiter
A required attribute used to specify the character string that will be used as a delimiter to split the supplied content unless it is preceded by an escape character or within a container if specified. Several of the previous examples use this attribute.
escape
An optional attribute used to specify a character sequence that is used to escape the delimiter. Many delimited text formats have an escape character that is used to tell any parser that the following delimiter should be ignored, e.g. often a character such as ‘\’ is used to escape the character that follows it so that it is not treated as a delimiter. When specified, this escape sequence also applies to any container characters that may be specified.
containerStart
An optional attribute used to specify a character sequence that will make this expression ignore the presence of delimiters until an end container is found. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters up to a delimiter.
If used, containerEnd must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.
containerEnd
An optional attribute used to specify a character sequence that will make this expression stop ignoring the presence of delimiters if it believes it is currently in a container. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters while ignoring the presence of any delimiter.
If used, containerStart must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.
maxMatch
An optional attribute used to specify the maximum number of times this expression is allowed to match the supplied content. If you do not supply this attribute then the Data Splitter will keep matching the supplied content until it reaches the end. If specified Data Splitter will stop matching the supplied content when it has matched it the specified number of times.
This attribute is used in the ‘CSV with header line’ example to ensure that only the first line is treated as a header line.
minMatch
An optional attribute used to specify the minimum number of times this expression should match the supplied content. If you do not supply this attribute then Data Splitter will not enforce that the expression matches the supplied content. If specified Data Splitter will generate an error if the expression does not match the supplied content at least as many times as specified.
Unlike maxMatch, minMatch does not control the matching process but instead controls the production of error messages generated if the parser is not seeing the expected input.
onlyMatch
Optional attribute to use this expression only for specific instances of a match of the parent expression, e.g. on the 4th, 5th and 8th matches of the parent expression specified by ‘4,5,8’. This is used when this expression should only be used to subdivide content from certain parent matches.
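As an illustration only, the following sketch combines several of the attributes described above to parse CSV data where the first line is a heading and values may be quoted with double quotes or escaped with a backslash (it mirrors the CSV with heading example used elsewhere in this guide):
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
  <!-- Only the first line is treated as the heading -->
  <split delimiter="\n" maxMatch="1">
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </split>
  <!-- Every subsequent line is a record -->
  <split delimiter="\n">
    <group value="$1">
      <!-- Commas inside "quoted" values or preceded by \ are not treated as delimiters.
           Match group 2 strips the containers and removes escape characters -->
      <split delimiter="," escape="\" containerStart="&quot;" containerEnd="&quot;">
        <data name="$heading$1" value="$2" />
      </split>
    </group>
  </split>
</dataSplitter>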
The <regex> element
The <regex> element directs Data Splitter to match content using the specified regular expression pattern. In addition to this, the same match control attributes that are available on the <split> element are also present, as well as attributes to alter the way the pattern works.
Attributes
The <regex> element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
pattern
This is a required attribute used to specify a regular expression to use to match on the supplied content. The pattern is used to match the content multiple times until the end of the content is reached while the maxMatch and onlyMatch conditions are satisfied.
dotAll
An optional attribute used to specify if the use of ‘.’ in the supplied pattern matches all characters including new lines. If ’true’ ‘.’ will match all characters including new lines, if ‘false’ it will only match up to a new line. If this attribute is not specified it defaults to ‘false’ and will only match up to a new line.
This attribute is used in many of the multiline examples above.
caseInsensitive
An optional attribute used to specify if the supplied pattern should match content in a case insensitive way. If ’true’ the expression will match content in a case insensitive manner, if ‘false’ it will match the content in a case sensitive manner. If this attribute is not specified it defaults to ‘false’ and will match the content in a case sensitive manner.
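For example (names and pattern purely illustrative), the following expression matches a block delimited by the words ‘begin’ and ‘end’ in any case, with ‘.’ also matching new lines:
<regex pattern="begin(.*)end" dotAll="true" caseInsensitive="true">
  <data name="block" value="$1" />
</regex>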
maxMatch
This is used in the same way it is on the <split> element, see maxMatch.
minMatch
This is used in the same way it is on the <split> element, see minMatch.
onlyMatch
This is used in the same way it is on the <split> element, see onlyMatch.
advance
After an expression has matched content in the buffer, the buffer start position is advanced so that it moves to the end of the entire match. This means that subsequent expressions operating on the content buffer will not see the previously matched content again. This is normally required behaviour, but in some cases some of the content from a match is still required for subsequent matches. Take the following example of name value pairs:
name1=some value 1 name2=some value 2 name3=some value 3
The first name value pair could be matched with the following expression:
<regex pattern="([^=]+)=(.+?) [^= ]+=">
The above expression would match as follows:
name1=some value 1 name2=some value 2 name3=some value 3
In this example we have had to do a reluctant match to extract the value in group 2 and not include the subsequent name. Because the reluctant match requires us to specify what we are reluctantly matching up to, we have had to include an expression after it that matches the next name.
By default the parser will move the character buffer to the end of the entire match so the next expression will be presented with the following:
some value 2 name3=some value 3
Therefore name2 will have been lost from the content buffer and will not be available for matching.
This behaviour can be altered by telling the expression how far to advance the character buffer after matching. This is done with the advance attribute and is used to specify the match group whose end position should be treated as the point the content buffer should advance to, e.g.
<regex pattern="([^=]+)=(.+?) [^= ]+=" advance="2">
In this example the content buffer will only advance to the end of match group 2 and subsequent expressions will be presented with the following content:
name2=some value 2 name3=some value 3
Therefore name2 will still be available in the content buffer.
It is likely that the advance feature will only be useful in cases where a reluctant match is performed. Reluctant matches are discouraged for performance reasons so this feature should rarely be used. A better way to tackle the above example would be to present the content in reverse, however this is only possible if the expression is within a group, i.e. is not a root expression. There may also be more complex cases where reversal is not an option and the use of a reluctant match is the only option.
The <all> element
The <all> element matches the entire content of the parent group and makes it available to child groups or <data> elements. The purpose of <all> is to act as a catch-all expression to deal with content that is not handled by a more specific expression, e.g. to output some other unknown, unrecognised or unexpected data.
<group>
<regex pattern="^\s*([^=]+)=([^=]+)\s*">
<data name="$1" value="$2" />
</regex>
<!-- Output unexpected data -->
<all>
<data name="unknown" value="$" />
</all>
</group>
The <all> element provides the same functionality as using .* as a pattern in a <regex> element with dotAll set to true, e.g. <regex pattern=".*" dotAll="true">. However it performs much faster as it doesn’t require pattern matching logic and is therefore always preferred.
Attributes
The <all> element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
5.5.3 - Output
As with all other aspects of Data Splitter, output XML is determined by adding certain elements to the Data Splitter configuration.
The <data> element
Output is created by Data Splitter using one or more <data> elements in the configuration. The first <data> element that is encountered within a matched expression will result in parent <record> elements being produced in the output.
Attributes
The <data> element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
name
Both the name and value attributes of the <data> element can be specified using match references.
value
Both the name and value attributes of the <data> element can be specified using match references.
Single <data> element example
The simplest example that can be provided uses a single <data> element within a <split> expression.
Given the following input:
This is line 1
This is line 2
This is line 3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<split delimiter="\n" >
<data value="$1"/>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="2.0">
<record>
<data value="This is line 1" />
</record>
<record>
<data value="This is line 2" />
</record>
<record>
<data value="This is line 3" />
</record>
</records>
Multiple <data> element example
You could also output multiple <data> elements for the same <record> by adding multiple elements within the same expression:
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="2.0">
<record>
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</record>
<record>
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</record>
<record>
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</record>
</records>
Multi level <data> elements
As long as all data elements occur within the same parent/ancestor expression, all data elements will be output within the same record.
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<split delimiter="\n" >
<data name="line" value="$1"/>
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="2.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1" />
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2" />
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3" />
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</record>
</records>
Nesting <data> elements
Rather than having <data> elements all appear as children of <record> it is possible to nest them, either as direct children or within child groups.
Direct children
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
<data name="line" value="$">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</data>
</regex>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="2.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1">
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</data>
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2">
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</data>
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3">
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</data>
</record>
</records>
Within child groups
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<split delimiter="\n" >
<data name="line" value="$1">
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</data>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="2.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1">
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</data>
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2">
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</data>
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3">
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</data>
</record>
</records>
The above example produces the same output as the previous but could be used to apply much more complex expression logic to produce the child <data> elements, e.g. the inclusion of multiple child expressions to deal with different types of lines.
5.5.4 - Variables
A variable is added to Data Splitter using the <var> element. A variable is used to store matches from a parent expression for use in a reference elsewhere in the configuration, see variable reference.
The most recent matches are stored for use in local references, i.e. references that are in the same match scope as the variable. Multiple matches are stored for use in references that are in a separate match scope. The concept of different variable scopes is described in scopes.
The <var> element
The <var> element is used to tell Data Splitter to store matches from a parent expression for use in a reference.
Attributes
The <var> element has the following attributes:
id
Mandatory attribute used to uniquely identify the variable within the configuration (see id); it is the means by which a variable is referenced, e.g. $VAR_ID$.
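A minimal sketch of a variable being stored and then referenced (this mirrors the CSV with heading example described in the following section) is:
<!-- Store each comma delimited heading from the first line -->
<split delimiter="\n" maxMatch="1">
  <group>
    <split delimiter=",">
      <var id="heading" />
    </split>
  </group>
</split>
<!-- Reference the stored headings when outputting each value -->
<split delimiter="\n">
  <group value="$1">
    <split delimiter=",">
      <data name="$heading$1" value="$1" />
    </split>
  </group>
</split>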
5.6 - Match References, Variables and Fixed Strings
The <group> and <data> elements can reference match groups from parent expressions or from stored matches in variables. In the case of the <group> element, referenced values are passed on to child expressions, whereas the <data> element can use match group references for name and value attributes. In the case of both elements the way of specifying references is the same.
5.6.1 - Concatenation of references
It is possible to concatenate multiple fixed strings and match group references using the + character. As with all references and fixed strings this can be done in <group> value and <data> name and value attributes. However, concatenation does have some performance overhead as new buffers have to be created to store concatenated content.
A good example of concatenation is the production of ISO8601 date format from data in the previous example:
01/01/2010,00:00:00
Here the following <regex> could be used to extract the relevant date and time groups:
<regex pattern="(\d{2})/(\d{2})/(\d{4}),(\d{2}):(\d{2}):(\d{2})">
The match groups from this expression can be concatenated with the following value output pattern in the data element:
<data name="dateTime" value="$3+'-'+$2+'-'+$1+'T'+$4+':'+$5+':'+$6+'.000Z'" />
Using the original example, this would result in the output:
<data name="dateTime" value="2010-01-01T00:00:00.000Z" />
Note that the value output pattern wraps all fixed strings in single quotes. This is necessary when concatenating strings and references so that Data Splitter can determine which parts are to be treated as fixed strings. This also allows fixed strings to contain $ and + characters.
As single quotes are used for this purpose, a single quote needs to be escaped with another single quote if one is desired in a fixed string, e.g.
'this ''is quoted text'''
will result in:
this 'is quoted text'
5.6.2 - Expression match references
Referencing matches in expressions is done using $. In addition to this, a match group number may be added to retrieve just part of the expression match. The applicability and effect that this has depends on the type of expression used.
References to <split> Match Groups
In the following example a line matched by a parent <split> expression is referenced by a child <data> element.
<split delimiter="\n" >
<data name="line" value="$"/>
</split>
A <split> element matches content up to and including the specified delimiter, so the above reference would output the entire line plus the delimiter. However there are various match groups that can be used by child <group> and <data> elements to reference sections of the matched content.
To illustrate the content provided by each match group, take the following example:
"This is some text\, that we wish to match", "This is the next text"
And the following <split> element:
<split delimiter="," escape="\">
The match groups are as follows:
- $ or $0: The entire content that is matched including the specified delimiter at the end
"This is some text\, that we wish to match",
- $1: The content up to the specified delimiter at the end
"This is some text\, that we wish to match"
- $2: The content up to the specified delimiter at the end and filtered to remove escape characters (more expensive than $1)
"This is some text, that we wish to match"
In addition to this behaviour match groups 1 and 2 will omit outermost whitespace and container characters if specified, e.g. with the following content:
" This is some text\, that we wish to match " , "This is the next text"
And the following <split> element:
<split delimiter="," escape="\" containerStart="&quot;" containerEnd="&quot;">
The match groups are as follows:
- $ or $0: The entire content that is matched including the specified delimiter at the end
" This is some text\, that we wish to match " ,
- $1: The content up to the specified delimiter at the end and strips outer containers.
This is some text\, that we wish to match
- $2: The content up to the specified delimiter at the end and strips outer containers and filtered to remove escape characters (more computationally expensive than $1)
This is some text, that we wish to match
References to <regex> Match Groups
Like the <split> element, various match groups can be referenced in a <regex> expression to retrieve portions of matched content. This content can be used as values for <group> and <data> elements.
Given the following input:
ip=1.1.1.1 user=user1
And the following <regex> element:
<regex pattern="ip=([^ ]+) user=([^ ]+)">
The match groups are as follows:
- $ or $0: The entire content that is matched by the expression
ip=1.1.1.1 user=user1
- $1: The content of the first match group
1.1.1.1
- $2: The content of the second match group
user1
Match group numbers in regular expressions are determined by the order that their open bracket appears in the expression.
References to <all> Match Groups
The <all> element does not have any match groups and always returns the entire content that was passed to it when referenced with $.
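For example (mirroring the catch-all example shown earlier), the following outputs whatever content was supplied to the <all> element:
<all>
  <data name="unknown" value="$" />
</all>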
5.6.3 - Use of fixed strings
Any <group> value or <data> name and value can use references to matched content, but in addition to this it is possible just to output a known string, e.g.
<data name="somename" value="$" />
The above example would output somename as the <data> name attribute. This can often be useful where there are no headings specified in the input data but we want to associate certain names with certain values.
Given the following data:
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
We could provide useful headings with the following configuration:
<regex pattern="([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),">
<data name="date" value="$1" />
<data name="time" value="$2" />
<data name="ipAddress" value="$3" />
<data name="hostName" value="$4" />
<data name="user" value="$5" />
<data name="action" value="$6" />
</regex>
5.6.4 - Variable reference
Variables are added to Data Splitter configuration using the <var> element, see variables. Each variable must have a unique id so that it can be referenced. References to variables have the form $VARIABLE_ID$, e.g.
<data name="$heading$" value="$" />
Identification
Data Splitter validates the configuration on load and ensures that all element ids are unique and that referenced ids belong to a variable.
A variable will only store data if it is referenced, so variables that are not referenced will do nothing. In addition to this, a variable will only store data for match groups that are referenced, e.g. if $heading$1 is the only reference to a variable with an id of ‘heading’ then only data for match group 1 will be stored for reference lookup.
Scopes
Variables have two scopes which affect how data is retrieved when referenced:
Local Scope
Variables are local to a reference if the reference exists as a descendant of the variable’s parent expression, e.g.
<split delimiter="\n" >
<var id="line" />
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="line" value="$line$"/>
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</split>
In the above example, matches for the outermost <split> expression are stored in the variable with the id of line. The only reference to this variable is in a data element that is a descendant of the variable’s parent expression <split>, i.e. it is nested within split/group/regex.
Because the variable is referenced locally only the most recent parent match is relevant, i.e. no retrieval of values by iteration, iteration offset or fixed position is applicable. These features only apply to remote variables that store multiple values.
Remote Scope
The CSV example with a heading is an example of a variable being referenced from a remote scope.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<!-- Match heading line (note that maxMatch="1" means that only the first line will be matched by this splitter) -->
<split delimiter="\n" maxMatch="1">
<!-- Store each heading in a named list -->
<group>
<split delimiter=",">
<var id="heading" />
</split>
</group>
</split>
<!-- Match each record -->
<split delimiter="\n">
<!-- Take the matched line -->
<group value="$1">
<!-- Split the line up -->
<split delimiter=",">
<!-- Output the stored heading for each iteration and the value from group 1 -->
<data name="$heading$1" value="$1" />
</split>
</group>
</split>
</dataSplitter>
In the above example the parent expression of the variable is not an ancestor of the reference in the <data> element. This makes the <data> element’s reference to the variable a remote one. In this situation the variable knows that it must store multiple values, as the remote <data> reference may retrieve one of many values from the variable based on:
- The match count of the parent expression.
- The match count of the parent expression, plus or minus an offset.
- A fixed position in the variable store.
Retrieval of value by iteration
In the above example the first line is taken then repeatedly matched by delimiting with commas. This results in multiple values being stored in the ‘heading’ variable. Once this is done subsequent lines are matched and then also repeatedly matched by delimiting with commas in the same way the heading is.
Each time a line is matched, the internal match count of all sub expressions (e.g. the <split> expression that is delimited by comma) is reset to 0. Every time the sub <split> expression matches up to a comma delimiter the match count is incremented. Any references to remote variables will, by default, use the current match count as an index to retrieve one of the many values stored in the variable. This means that the <data> element in the above example will retrieve the corresponding heading for each value, as the match count of the values will match the storage position of each heading.
Retrieval of value by iteration offset
In some cases there may be a mismatch between the position where a value is stored in a variable and the match count applicable when remotely referencing the variable.
Take the following input:
BAD,Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
In the above example we can see that the first heading ‘BAD’ is not correct for the first value of every line. In this situation we could either adjust the way the heading line is parsed to ignore ‘BAD’ or just adjust the way the heading variable is referenced.
To make this adjustment the reference just needs to be told what offset to apply to the current match count to correctly retrieve the stored value. In the above example this would be done like this:
<data name="$heading$1[+1]" value="$1" />
The above reference just uses the match count plus 1 to retrieve the stored value. Any integral offset plus or minus may be used, e.g. [+4] or [-10]. Offsets that result in a position that is outside of the storage range for the variable will not return a value.
Retrieval of value by fixed position
In addition to retrieval by offset from the current match count, a stored value can be returned by a fixed position that has no relevance to the current match count.
In the following example the value retrieved from the ‘heading’ variable will always be ‘IPAddress’ as this is the fourth value stored in the ‘heading’ variable and the position index starts at 0.
<data name="$heading$1[3]" value="$1" />
6 - Editing and Viewing Data
Viewing Data
The data viewer is shown on the Data tab when you open (by double clicking) one of these items in the explorer tree:
- Feed - to show all data for that feed.
- Folder - to show all data for all feeds that are descendants of the folder.
- System Root Folder - to show all data for all feeds in the system, i.e. all feeds that are descendants of the root folder.
In all cases the data shown is dependent on the permissions of the user performing the action and any permissions set on the feeds/folders being viewed.
The Data Viewer screen is made up of the following three parts which are shown as three panes split horizontally.
Stream List
This shows all streams within the opened entity (feed or folder). The streams are shown in reverse chronological order. By default Deleted and Locked streams are filtered out. The filtering can be changed by clicking on the filter icon. This will show all stream types by default so may be a mixture of Raw events, Events, Errors, etc. depending on the feed/folder in question.
Related Stream List
This list only shows data when a stream is selected in the streams list above it. It shows all streams related to the currently selected stream. It may show streams that are ‘ancestors’ of the selected stream, e.g. showing the Raw Events stream for an Events stream, or show descendants, e.g. showing the Errors stream which resulted from processing the selected Raw Events stream.
Content Viewer Pane
This pane shows the contents of the stream selected in the Related Streams List. The content of a stream will differ depending on the type of stream selected and the child stream types in that stream. For more information on the anatomy of streams, see Streams. This pane is split into multiple sub tabs depending on the different types of content available.
Info Tab
This sub-tab shows the information for the stream, such as creation times, size, physical file location, state, etc.
Error Tab
This sub-tab is only visible for an Error stream. It shows a table of errors and warnings with associated messages and locations in the stream that it relates to.
Data Preview Tab
This sub-tab shows the content of the data child stream, formatted if it is XML. It will only show a limited amount of data so if the data child stream is large then it will only show the first n characters.
If the stream is multi-part then you will see Part navigation controls to switch between parts. For each part you will be shown the first n characters of that part (formatted if applicable).
If the stream is a Segmented stream then you will see the Record navigation controls to switch between records. Only one record is shown at once. If a record is very large then only the first n characters of the record will be shown.
This sub-tab is intended for seeing a quick preview of the data in a form that is easy to read, i.e. formatted. If you want to see the full data in its original form then click on the View Source link at the top right of the sub-tab.
The Data Preview tab shows a ‘progress’ bar to indicate what portion of the content is visible in the editor.
Context Tab
This sub-tab is only shown for non-segmented streams, e.g. Raw Events and Raw_Reference, that have an associated context data child stream. For more details of context streams, see Context. This sub-tab works in exactly the same way as the Data Preview sub-tab except that it shows a different child stream.
Meta Tab
This sub-tab is only shown for non-segmented streams, e.g. Raw Events and Raw_Reference, that have an associated meta data child stream. For more details of meta streams, see Meta. This sub-tab works in exactly the same way as the Data Preview sub-tab except that it shows a different child stream.
Source View
The source view is accessed by clicking the View Source link on the Data Preview sub-tab or from the data() dashboard column function.
Its purpose is to display the selected child stream (data, context, meta, etc.) or record in the form in which it was received, i.e. un-formatted.
The Data Preview tab shows a ‘progress’ bar to indicate what portion of the content is visible in the editor.
In order to navigate through the data you have three options
- Click on the ‘progress bar’ to show a portion of the data starting from the position clicked on.
- Page through the data using the navigation controls.
- Select a source range to display using the Set Source Range dialog which is accessed by clicking on the Lines or Chars links.
This allows you to precisely select the range to display.
You can either specify a range with just a start point, or a start point and some form of size/position limit.
If no limit is specified then Stroom will limit the data shown to the configured maximum (stroom.ui.source.maxCharactersPerFetch). If a range is entered that is too big to display, Stroom will limit the data to its maximum.
A Note About Characters
Stroom does not know the size of a stream in terms of character lines/cols, it only knows the size in bytes. Due to the way character data is encoded into bytes it is not possible to say how many characters are in a stream based on its size in bytes. Stroom can only provide an estimate based on the ratio of characters to bytes seen so far in the stream.
Data Progress Bar
Stroom often handles very large streams of data and it is not feasible to show all of this data in the editor at once. Therefore Stroom will show a limited amount of the data in the editor at a time. The ‘progress’ bar at the top of the Data Preview and Source View screens shows what percentage of the data is visible in the editor and where in the stream the visible portion is located. If all of the data is visible in the editor (which includes scrolling down to see it) the bar will be green and will occupy the full width. If only some of the data is visible then the bar will be blue and the coloured part will only occupy part of the width.
Editor
Stroom uses the Ace editor for editing and viewing text, such as XSLTs, raw data, cooked events, stepping, etc.
Keyboard shortcuts
The following are some useful keyboard shortcuts in the editor:
- ctrl-z - Undo last action.
- ctrl-shift-z - Redo previously undone action.
- ctrl-/ - Toggle commenting of current line/selection. Applies when editing XML, XSLT or Javascript.
- alt-up/alt-down - Move line/selection up/down respectively.
- ctrl-d - Delete current line.
- ctrl-f - Open find dialog.
- ctrl-h - Open find/replace dialog.
- ctrl-k - Find next match.
- ctrl-shift-k - Find previous match.
- tab - Indent selection.
- shift-tab - Outdent selection.
- ctrl-u - Make selection upper-case.
See here (external link) for more.
Vim key bindings
If you are familiar with the Vi/Vim text editors then it is possible to enable Vim key bindings in Stroom. This is currently done by enabling Vim bindings in the editor context menu in each editor screen. In future versions of Stroom it will be possible to store these preferences on a per user basis.
The Ace editor does not support all features of Vim, however the core navigation/editing key bindings are present. The key supported features of Vim are:
- Visual mode and visual block mode.
- Searching with / (javascript flavour regex).
- Search/replace with commands like :%s/foo/bar/g
- Incrementing/decrementing numbers with ctrl-a/ctrl-x
- Code (un-)folding with zo, zc, etc.
- Text objects, e.g. >, ), ], ', ", p paragraph, w word.
- Repetition with the . command.
- Jumping to a line with :<line no>.
Notable features not supported by the Ace editor:
- The following text objects are not supported:
  - b - Braces, i.e. { or [.
  - t - Tags, i.e. XML tags <value>.
  - s - Sentence.
- The g command mode command, i.e. :g/foo/d
- Splits
For a list of useful Vim key bindings see this cheat sheet (external link), though not all bindings will be available in Stroom’s Ace editor.
Auto-Completion And Snippets
The editor supports a number of different types of auto-completion of text. Completion suggestions are triggered by the following mechanisms:
- ctrl-space - when live auto-complete is disabled.
- Typing - when live auto-complete is enabled.
When completion suggestions are triggered the following types of completion may be available depending on the text being edited.
- Local - any word/token found in the existing text. Useful if you have typed a long word and need to type it again.
- Keyword - A word/token that has been defined in the syntax highlighting rules for the text type, i.e. function is a keyword when editing Javascript.
- Snippet - A block of text that has been defined as a snippet for the editor mode (XML, Javascript, etc.).
Snippets
Snippets allow you to quickly enter pre-defined blocks of common text. For example when editing an XSLT you may want to insert a call-template with parameters. To do this using snippets you can do the following:
- Type call then hit ctrl-space.
- In the list of options use the cursor keys to select call-template with-param then hit enter or tab to insert the snippet. The snippet will look like <xsl:call-template name="template"> <xsl:with-param name="param"></xsl:with-param> </xsl:call-template>
- The cursor will be positioned on the first tab stop (the template name) with the tab stop text selected.
- At this point you can type in your template name, e.g. MyTemplate, then hit tab to advance to the next tab stop (the param name).
- Now type the name of the param, e.g. MyParam, then hit tab to advance to the last tab stop, positioned within the <with-param> ready to enter the param value.
Snippets can be disabled from the list of suggestions by selecting the option in the editor context menu.
7 - Event Feeds
In order for Stroom to be able to handle the various data types as described in the previous section, Stroom must be told what the data is when it is received. This is achieved using Event Feeds. Each feed has a unique name within the system.
Event Feeds can be related to one or more Reference Feeds. Reference Feeds are used to provide lookup data for a translation, e.g. looking up a computer name by its IP address.
Feeds can also have associated context data. Context data is used to provide lookup information that is only applicable to the events file it relates to, e.g. if the events file is missing information relating to the computer it was generated on, and you don’t want to create separate feeds for each computer, an associated context file could be used to provide this information.
Feed Identifiers
Feed identifiers must be unique within the system. Identifiers can be in any format but an established convention is to use the following format:
<SYSTEM>-<ENVIRONMENT>-<TYPE>-<EVENTS/REFERENCE>-<VERSION>
If feeds in a certain site need different reference data then the site can be broken down further.
_ may be used to represent a space.
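For example, a purely hypothetical feed name following this convention might be:
MY_SYSTEM-PROD-SECURITY-EVENTS-V1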
8 - Finding Things
Version Information: Created with Stroom v7.0
Last Updated: 01 Jun 2020
Explorer Tree
The Explorer Tree in Stroom is the primary means of finding user created content, for example Feeds, XSLTs, Pipelines, etc.
Branches of the Explorer Tree can be expanded and collapsed to reveal/hide the content at different levels.
Filtering by Type
The Explorer Tree can be filtered by the type of content, e.g. to display only Feeds, or only Feeds and XSLTs. This is done by clicking the filter icon. The following is an example of filtering by Feeds and XSLTs.
Clicking All/None toggles between all types selected and no types selected.
Filtering by type can also be achieved using the Quick Filter by entering the type name (or a partial form of the type name), prefixed with type:, i.e:
type:feed
NOTE: If both the type picker and the Quick Filter are used to filter on type then the two filters will be combined as an AND.
Filtering by Name
The Explorer Tree can be filtered by the name of the entity. This is done by entering some text in the Quick Filter field. The tree will then be updated to only show entities matching the Quick Filter. The way the matching works for entity names is described in Common Fuzzy Matching.
Filtering by UUID
What is a UUID?
The Explorer Tree can be filtered by the UUID of the entity. A UUID (Universally Unique Identifier (external link)) is an identifier that can be relied on to be unique both within the system and universally across all other systems. Stroom uses UUIDs as the primary identifier for all content (Feeds, XSLTs, Pipelines, etc.) created in Stroom. An entity’s UUID is generated randomly by Stroom upon creation and is fixed for the life of that entity.
When an entity is exported it is exported with its UUID and if it is then imported into another instance of Stroom the same UUID will be used. The name of an entity can be changed within Stroom but its UUID remains unchanged.
With the exception of Feeds, Stroom allows multiple entities to have the same name. This is because entities may exist that a user does not have access to see so restricting their choice of names based on existing invisible entities would be confusing. Where there are multiple entities with the same name the UUID can be used to distinguish between them.
The UUID of an entity can be viewed using the context menu for the entity. The context menu is accessed by right-clicking on the entity.
Clicking Info displays the entity’s UUID.
The UUID can be copied by selecting it and then pressing ctrl-c.
UUID Quick Filter Matching
In the Explorer Tree Quick Filter you can filter by UUIDs in the following ways:
To show the entity matching a UUID, enter the full UUID value (with dashes) prefixed with the field qualifier uuid, e.g. uuid:a95e5c59-2a3a-4f14-9b26-2911c6043028.
To filter on part of a UUID you can do uuid:/2a3a to find an entity whose UUID contains 2a3a, or uuid:^2a3a to find an entity whose UUID starts with 2a3a.
Quick Filters
Quick Filter controls are used in a number of screens in Stroom. The most prominent use of a Quick Filter is in the Explorer Tree as described above. Quick filters allow for quick searching of a data set or a list of items using a text based query language. The basis of the query language is described in Common Fuzzy Matching.
A number of the Quick Filters are used for filtering tables of data that have a number of fields.
The quick filter query language supports matching in specified fields.
Each Quick Filter will have a number of named fields that it can filter on.
The field to match on is specified by prefixing the match term with the name of the field followed by a :, i.e. type:.
Multiple field matches can be used, each separated by a space.
E.g:
name:^xml name:events$ type:feed
In the above example the filter will match on items with a name beginning xml, a name ending events and a type partially matching feed.
All the match terms are combined together with an AND operator. The same field can be used multiple times in the match. The list of filterable fields and their qualifier names (sometimes a shortened form) can be viewed by clicking on the help icon.
One or more of the fields will be defined as default fields. This means that if no qualifier is entered the match will be applied to all default fields using an OR operator. Sometimes all fields may be considered default, which means a match term will be tested against all fields and an item will be included in the results if one or more of those fields match.
For example if the Quick Filter has fields Name, Type and Status, of which Name and Type are default:
feed status:ok
The above would match items where the Name OR Type fields match feed, AND the Status field matches ok.
Match Negation
Each match item can be negated using the ! prefix. This is also described in Common Fuzzy Matching. The prefix is applied after the field qualifier.
E.g:
name:xml source:!/default
In the above example it would match on items where the Name field matches xml and the Source field does NOT match the regex pattern default.
Spaces and Quotes
If your match term contains a space then you can surround the match term with double quotes. Also if your match term contains a double quote you can escape it with a \ character. For example, the following would be valid:
"name:csv splitter" "default field match" "symbol:\""
Multiple Terms
If multiple terms are provided for the same field then an AND is used to combine them. This can be useful where you are not sure of the order of words within the items being filtered.
For example:
User input: spain plain rain
Will match:
The rain in spain stays mainly in the plain
^^^^ ^^^^^ ^^^^^
rainspainplain
^^^^^^^^^^^^^^
spain plain rain
^^^^^ ^^^^^ ^^^^
raining spain plain
^^^^^^^ ^^^^^ ^^^^^
Won’t match: sprain, rain, spain
OR Logic
There is no support for combining terms with an OR. However you can achieve this using a single regular expression term. For example:
User input: status:/(disabled|locked)
Will match:
Locked
^^^^^^
Disabled
^^^^^^^^
Won’t match: Enabled, Active
Suggestion Input Fields
Stroom uses a number of suggestion input fields, such as when selecting Feeds, Pipelines, types, status values, etc. in the pipeline processor filter screen.
These fields will typically display the full list of values or a truncated list where the total number of values is too large. Entering text in one of these fields will use the fuzzy matching algorithm to partially/fully match on values. See Common Fuzzy Matching below for details of how the matching works.
Common Fuzzy Matching
A common fuzzy matching mechanism is used in a number of places in Stroom. It is used for partially matching the user input to a list of possible values.
In some instances, the list of matched items will be truncated to a more manageable size with the expectation that the filter will be refined.
The fuzzy matching employs a number of approaches that are attempted in the following order:
NOTE: In the following examples the ^ character is used to indicate which characters have been matched.
No Input
If no input is provided all items will match.
Contains (Default)
If no prefixes or suffixes are used then all characters in the user input will need to be contained as a whole somewhere within the string being tested. The matching is case insensitive.
User input: bad
Will match:
bad angry dog
^^^
BAD
^^^
very badly
^^^
Very bad
^^^
Won’t match: dab, ba d, ba
Characters Anywhere Matching
If the user input is prefixed with a ~ (tilde) character then characters anywhere matching will be employed.
The matching is case insensitive.
User input: bad
Will match:
Big Angry Dog
^ ^ ^
bad angry dog
^^^
BAD
^^^
badly
^^^
Very bad
^^^
b a d
^ ^ ^
bbaadd
^ ^ ^
Won’t match: dab, ba
Word Boundary Matching
If the user input is prefixed with a ? character then word boundary matching will be employed.
This approach uses upper case letters to denote the start of a word.
If you know some or all of the words in the item you are looking for, and their order, then condensing those words down to their first letters (capitalised) makes this a more targeted way to find what you want than the characters anywhere matching above.
Words can either be separated by characters like _- ()[]., or be distinguished with lowerCamelCase or upperCamelCase format.
An upper case letter in the input denotes the beginning of a word and any subsequent lower case characters are treated as contiguously following the character at the start of the word.
User input: ?OTheiMa
Will match:
the cat sat on their mat
^ ^^^^ ^^
ON THEIR MAT
^ ^^^^ ^^
Of their magic
^ ^^^^ ^^
o thei ma
^ ^^^^ ^^
onTheirMat
^ ^^^^ ^^
OnTheirMat
^ ^^^^ ^^
Won’t match: On the mat, the cat sat on there mat, On their moat
User input: ?MFN
Will match:
MY_FEED_NAME
^ ^ ^
MY FEED NAME
^ ^ ^
MY_FEED_OTHER_NAME
^ ^ ^
THIS_IS_MY_FEED_NAME_TOO
^ ^ ^
myFeedName
^ ^ ^
MyFeedName
^ ^ ^
also-my-feed-name
^ ^ ^
MFN
^^^
stroom.something.somethingElse.maxFileNumber
^ ^ ^
Won’t match: myfeedname, MY FEEDNAME
Regular Expression Matching
If the user input is prefixed with a / character then the remaining user input is treated as a Java syntax regular expression.
A string will be considered a match if any part of it matches the regular expression pattern.
The regular expression operates in case insensitive mode.
For more details on the syntax of Java regular expressions see https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html.
User input: /(^|wo)man
Will match:
MAN
^^^
A WOMAN
^^^^^
Manly
^^^
Womanly
^^^^^
Won’t match: A MAN, HUMAN
Exact Match
If the user input is prefixed with a ^ character and suffixed with a $ character then a case-insensitive exact match will be used.
E.g:
User input: ^xml-events$
Will match:
xml-events
^^^^^^^^^^
XML-EVENTS
^^^^^^^^^^
Won’t match: xslt-events, XML EVENTS, SOME-XML-EVENTS, AN-XML-EVENTS-PIPELINE
Note: Despite the similarity in syntax, this is NOT regular expression matching.
Starts With
If the user input is prefixed with a ^ character then a case-insensitive starts with match will be used.
E.g:
User input: ^events
Will match:
events
^^^^^^
EVENTS_FEED
^^^^^^
events-xslt
^^^^^^
Won’t match: xslt-events, JSON_EVENTS
Note: Despite the similarity in syntax, this is NOT regular expression matching.
Ends With
If the user input is suffixed with a $ character then a case-insensitive ends with match will be used.
E.g:
User input: events$
Will match:
events
^^^^^^
xslt-events
^^^^^^
JSON_EVENTS
^^^^^^
Won’t match: EVENTS_FEED, events-xslt
Note: Despite the similarity in syntax, this is NOT regular expression matching.
Wild-Carded Case Sensitive Exact Matching
If one or more * characters are found in the user input then this form of matching will be used.
This form of matching is to support those fields that accept wild-carded values, e.g. a wild-carded feed name expression term.
In this instance you are NOT picking a value from the suggestion list but entering a wild-carded value that will be evaluated when the expression/filter is actually used.
The user may want an expression term that matches on all feeds starting with XML_, in which case they would enter XML_*.
To give an indication of what it would match on if the list of feeds remains the same, the list of suggested items will reflect the wild-carded input.
User input: XML_*
Will match:
XML_
^^^^
XML_EVENTS
^^^^
Won’t match: BAD_XML_EVENTS, XML-EVENTS, xml_events
User input: XML_*EVENTS*
Will match:
XML_EVENTS
^^^^^^^^^^
XML_SEC_EVENTS
^^^^ ^^^^^^
XML_SEC_EVENTS_FEED
^^^^ ^^^^^^
Won’t match: BAD_XML_EVENTS, xml_events
Match Negation
A match can be negated, i.e. the NOT operator, using the prefix !.
This prefix can be applied before all the match prefixes listed above.
E.g:
!/(error|warn)
In the above example it will match everything except those matched by the regex pattern (error|warn).
9 - Nodes
All nodes in a Stroom cluster must be configured correctly for them to communicate with each other.
Configuring nodes
Open Monitoring/Nodes from the top menu. In the nodes screen, you need to edit each line by selecting it and then clicking the edit icon at the bottom.
The URL for each node needs to be set as above but obviously substituting in the host name of the individual node, e.g. http://<HOST_NAME>:8080/stroom/clustercall.rpc
Nodes are expected to communicate with each other on port 8080 over HTTP. Ensure you have configured your firewall to allow nodes to talk to each other over this port. You can configure the URL to use a different port and possibly HTTPS but performance will be better with HTTP as no SSL termination is required.
Once you have set the URLs of each node you should also set the master assignment priority for each node to be different to all of the others. In the image above the priorities have been set in a random fashion to ensure that node3 assumes the role of master node for as long as it is enabled. You also need to check all of the nodes are enabled that you want to take part in processing or any other jobs.
Keep refreshing the table until all nodes show healthy pings as above. If you do not get ping results for each node then they are not configured correctly.
Once a cluster is configured correctly you will get proper distribution of processing tasks and search will be able to access all nodes to take part in a distributed query.
10 - Pipelines
Pipelines
Every feed has an associated translation. The translation is used to convert the input text or XML into event logging XML format.
XSLT is used to translate from XML to event logging XML.
10.1 - Reference Data
See Also:
In Stroom reference data is primarily used to decorate events using stroom:lookup() calls in XSLTs.
For example you may have a reference data feed that associates the FQDN of a device to its physical location.
You can then perform a stroom:lookup() in the XSLT to decorate an event with the physical location of a device by looking up the FQDN found in the event.
Reference data is time sensitive and each stream of reference data has an Effective Date set against it. This allows reference data lookups to be performed using the date of the event to ensure the reference data that was actually effective at the time of the event is used.
Using reference data involves the following steps/processes:
- Ingesting the raw reference data into Stroom.
- Creating (and processing) a pipeline to transform the raw reference into reference-data:2 format XML.
- Creating a reference loader pipeline with a Reference Data Filter element to load cooked reference data into the reference data store.
- Adding reference pipeline/feeds to an XSLT Filter in your event pipeline.
- Adding the lookup call to the XSLT.
- Processing the raw events through the event pipeline.
The process of creating a reference data pipeline is described in the HOWTO linked at the top of this document.
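As a rough sketch of the final lookup step (the element names and the $fqdn and $eventTime variables are hypothetical, and it assumes the stroom namespace prefix is declared on the stylesheet), an XSLT fragment decorating an event with a location from the FQDN_TO_LOC map shown in the example below might look like:
<!-- $fqdn and $eventTime are hypothetical variables extracted from the event being processed -->
<Location>
  <xsl:copy-of select="stroom:lookup('FQDN_TO_LOC', $fqdn, $eventTime)"/>
</Location>
Because the value in that map is an XML fragment, xsl:copy-of is used here rather than xsl:value-of.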
Reference Data Structure
A reference data entry essentially consists of the following:
- Effective time - The date/time that the entry was effective from, i.e. the time the raw reference data was received.
- Map name - A unique name for the key/value map that the entry will be stored in. The name only needs to be unique within all map names that may be loaded within an XSLT Filter. In practice it makes sense to keep map names globally unique.
- Key - The text that will be used to lookup the value in the reference data map. Mutually exclusive with Range.
- Range - The inclusive range of integer keys that the entry applies to. Mutually exclusive with Key.
- Value - The value can either be simple text, e.g. an IP address, or an XML fragment that can be inserted into another XML document. XML values must be correctly namespaced.
The following is an example of some reference data that has been converted from its raw form into reference-data:2 XML.
In this example the reference data contains three entries that each belong to a different map.
Two of the entries are simple text values and the last has an XML value.
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stroom="stroom"
xmlns:evt="event-logging:3"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd"
version="2.0.1">
<!-- A simple string value -->
<reference>
<map>FQDN_TO_IP</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<IPAddress>192.168.2.245</IPAddress>
</value>
</reference>
<!-- A simple string value -->
<reference>
<map>IP_TO_FQDN</map>
<key>192.168.2.245</key>
<value>
<HostName>stroomnode00.strmdev00.org</HostName>
</value>
</reference>
<!-- A key range -->
<reference>
<map>USER_ID_TO_COUNTRY_CODE</map>
<range>
<from>1</from>
<to>1000</to>
</range>
<value>GBR</value>
</reference>
<!-- An XML fragment value -->
<reference>
<map>FQDN_TO_LOC</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<evt:Location>
<evt:Country>GBR</evt:Country>
<evt:Site>Bristol-S00</evt:Site>
<evt:Building>GZero</evt:Building>
<evt:Room>R00</evt:Room>
<evt:TimeZone>+00:00/+01:00</evt:TimeZone>
</evt:Location>
</value>
</reference>
</referenceData>
Reference Data Namespaces
When XML reference data values are created, as in the example XML above, the XML values must be qualified with a namespace to distinguish them from the reference-data:2 XML that surrounds them.
In the above example the XML fragment will become as follows when injected into an event:
<evt:Location xmlns:evt="event-logging:3" >
<evt:Country>GBR</evt:Country>
<evt:Site>Bristol-S00</evt:Site>
<evt:Building>GZero</evt:Building>
<evt:Room>R00</evt:Room>
<evt:TimeZone>+00:00/+01:00</evt:TimeZone>
</evt:Location>
Even if the evt prefix is already declared in the XML that the fragment is being injected into, the declaration on the reference fragment will still be explicitly added in the destination.
While duplicate namespacing may appear odd, it is valid XML.
The namespacing can also be achieved like this:
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stroom="stroom"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd"
version="2.0.1">
<!-- An XML value -->
<reference>
<map>FQDN_TO_LOC</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<Location xmlns="event-logging:3">
<Country>GBR</Country>
<Site>Bristol-S00</Site>
<Building>GZero</Building>
<Room>R00</Room>
<TimeZone>+00:00/+01:00</TimeZone>
</Location>
</value>
</reference>
</referenceData>
This reference data will be injected into event XML exactly as it is, i.e.:
<Location xmlns="event-logging:3">
<Country>GBR</Country>
<Site>Bristol-S00</Site>
<Building>GZero</Building>
<Room>R00</Room>
<TimeZone>+00:00/+01:00</TimeZone>
</Location>
Reference Data Storage
Reference data is stored in two different places on a Stroom node. All reference data is only visible to the node where it is located. Each node that is performing reference data lookups will need to load and store its own reference data. While this will result in duplicate data being held by nodes it makes the storage of reference data and its subsequent lookup very performant.
On-Heap Store
The On-Heap store is the reference data store that is held in memory in the Java Heap. This store is volatile and will be lost on shut down of the node. The On-Heap store is only used for storage of context data.
Off-Heap Store
The Off-Heap store is the reference data store that is held in memory outside of the Java Heap and is persisted to local disk. As the store is also persisted to local disk it means the reference data will survive the shutdown of the stroom instance. Storing the data off-heap means Stroom can run with a much smaller Java Heap size.
The Off-Heap store is based on the Lightning Memory-Mapped Database (LMDB). LMDB makes use of the Linux page cache to ensure that hot portions of the reference data are held in the page cache (making use of all available free memory). Infrequently used portions of the reference data will be evicted from the page cache by the Operating System. Given that LMDB utilises the page cache for holding reference data in memory the more free memory the host has the better as there will be less shifting of pages in/out of the OS page cache. When storing large amounts of data you may experience the OS reporting very little free memory as a large amount will be in use by the page cache. This is not an issue as the OS will evict pages when memory is needed for other applications, e.g. the Java Heap.
Local Disk
The Off-Heap store is intended to be located on local disk on the Stroom node.
The location of the store is set using the property stroom.pipeline.referenceData.localDir.
Using LMDB on remote storage is NOT advised, see http://www.lmdb.tech/doc.
If you are running stroom on AWS EC2 instances then you will need to attach some local instance storage to the host, e.g. SSD, to use for the reference data store. In tests EBS storage was found to be VERY slow. It should be noted that AWS instance storage is not persistent between instance stops, terminations and hardware failure. However any loss of the reference data store will mean that the next time Stroom boots a new store will be created and reference data will be loaded on demand as normal.
Transactions
LMDB is a transactional database with ACID semantics. All interaction with LMDB is done within a read or write transaction. There can only be one write transaction at a time so if there are a number of concurrent reference data loads then they will have to wait in line. Read transactions, i.e. lookups, are not blocked by each other or by write transactions so once the data is loaded and is in memory lookups can be performed very quickly.
Read-Ahead Mode
When data is read from the store, if the data is not already in the page cache then it will be read from disk and added to the page cache by the OS.
Read-ahead is the process of speculatively reading ahead to load more pages into the page cache than were requested.
This is on the basis that future requests for data may need the pages speculatively read into memory.
If the reference data store is very large or is larger than the available memory then it is recommended to turn read-ahead off as the result will be to evict hot reference data from the page cache to make room for speculative pages that may not be needed.
It can be turned off with the property stroom.pipeline.referenceData.readAheadEnabled.
Key Size
When reference data is created care must be taken to ensure that the Key used for each entry is less than 507 bytes. For simple ASCII characters then this means less than 507 characters. If non-ASCII characters are in the key then these will take up more than one byte per character so the length of the key in characters will be less. This is a limitation inherent to LMDB.
Commit intervals
The property stroom.pipeline.referenceData.maxPutsBeforeCommit controls the number of entries that are put into the store between each commit.
As there can be only one transaction writing to the store at a time, committing periodically allows other processes to jump in and make writes.
There is a trade off though, as reducing the number of entries put between each commit can seriously affect performance.
For the fastest single process performance a value of zero should be used, which means it will not commit mid-load.
This however means all other processes wanting to write to the store will need to wait.
Cloning The Off Heap Store
If you are provisioning a new stroom node it is possible to copy the off heap store from another node.
Stroom should not be running on the node being copied from.
Simply copy the contents of stroom.pipeline.referenceData.localDir into the same configured location on the other node.
The new node will use the copied store and have access to its reference data.
Store Size & Compaction
Due to the way LMDB works the store can only grow in size, it will never shrink, even if data is deleted. Deleted data frees up space for new writes to the store so will be reused but never freed. If there is a regular process of purging old data and adding new reference data then this should not be an issue as the new reference data will use the space made available by the purged data.
If store size becomes an issue then it is possible to compact the store.
lmdb-utils is available on some package managers and this has an mdb_copy command that can be used with the -c switch to copy the LMDB environment to a new one, compacting it in the process.
This should be done when Stroom is down to avoid writes happening to the store while the copy is happening.
Given that the store is essentially a cache and all data can be re-loaded, another option is to delete the contents of stroom.pipeline.referenceData.localDir when Stroom is not running.
On boot Stroom will create a brand new store and reference data will be re-loaded as required.
The Loading Process
Reference data is loaded into the store on demand during the processing of a stroom:lookup() method call.
Reference data will only be loaded if it does not already exist in the store, however it is always loaded as a complete stream, rather than entry by entry.
The test for existence in the store is based on the following criteria:
- The UUID of the reference loader pipeline.
- The version of the reference loader pipeline.
- The Stream ID for the stream of reference data that has been deemed effective for the lookup.
- The Stream Number (in the case of multi part streams).
If a reference stream has already been loaded matching the above criteria then no additional load is required.
IMPORTANT: It should be noted that as the version of the reference data pipeline forms part of the criteria, if the reference loader pipeline is changed, for whatever reason, then this will invalidate ALL existing reference data associated with that reference loader pipeline.
Typically the reference loader pipeline is very static so this should not be an issue.
Standard practice is to convert raw reference data into reference:2 XML on receipt using a pipeline separate to the reference loader.
The reference loader is then only concerned with reading cooked reference:2 into the Reference Data Filter.
In instances where reference data streams are infrequently used it may be preferable to not convert the raw reference on receipt but instead to do it in the reference loader pipeline.
Duplicate Keys
The Reference Data Filter pipeline element has a property overrideExistingValues which, if set to true, means that if an entry is found in an effective stream with the same key as an entry already loaded then it will overwrite the existing one.
Entries are loaded in the order they are found in the reference:2 XML document.
If set to false then the existing entry will be kept.
If warnOnDuplicateKeys is set to true then a warning will be logged for any duplicate keys, whether an overwrite happens or not.
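As an illustration (a hypothetical fragment with made-up map, key and value names), if the two entries below appear in the same effective stream then with overrideExistingValues set to true the stored value will be 10.0.0.2 (the later entry wins), and with it set to false the stored value will remain 10.0.0.1:
<!-- Hypothetical example of two entries sharing the same map and key -->
<reference>
  <map>FQDN_TO_IP</map>
  <key>hosta.example.com</key>
  <value>10.0.0.1</value>
</reference>
<reference>
  <map>FQDN_TO_IP</map>
  <key>hosta.example.com</key>
  <value>10.0.0.2</value>
</reference>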
Value De-Duplication
Only unique values are held in the store to reduce the storage footprint. This is useful given that typically, reference data updates may be received daily and each one is a full snapshot of the whole reference data. As a result this can mean many copies of the same value being loaded into the store. The store will only hold the first instance of duplicate values.
Querying the Reference Data Store
The reference data store can be queried within a Dashboard in Stroom by selecting Reference Data Store in the data source selection pop-up.
Querying the store is currently an experimental feature and is mostly intended for use in fault finding.
Given the localised nature of the reference data store the dashboard can currently only query the store on the node that the user interface is being served from.
In a multi-node environment where some nodes are UI only and most are processing only, the UI nodes will have no reference data in their store.
Purging Old Reference Data
Reference data loading and purging is done at the level of a reference stream. Whenever a reference lookup is performed the last accessed time of the reference stream in the store is checked. If it is older than one hour then it will be updated to the current time. This last access time is used to determine reference streams that are no longer in active use and thus can be purged.
The Stroom job Ref Data Off-heap Store Purge is used to perform the purge operation on the Off-Heap reference data store.
No purge is required for the On-Heap store as that only holds transient data.
When the purge job is run it checks the time since each reference stream was accessed against the purge cut-off age.
The purge age is configured via the property stroom.pipeline.referenceData.purgeAge.
It is advised to schedule this job for quiet times when it is unlikely to conflict with reference loading operations as they will fight for access to the single write transaction.
Lookups
Lookups are performed in XSLT Filters using the XSLT functions.
In order to perform a lookup one or more reference feeds must be specified on the XSLT Filter pipeline element.
Each reference feed is specified along with a reference loader pipeline that will ingest the specified feed (optionally converting it into reference:2 XML if it is not already) and pass it into a Reference Data Filter pipeline element.
Reference Feeds & Loaders
In the XSLT Filter pipeline element multiple combinations of feed and reference loader pipeline can be specified. There must be at least one in order to perform lookups. If there are multiple then when a lookup is called for a given time, the effective stream for each feed/loader combination is determined. The effective stream for each feed/loader combination will be loaded into the store, unless it is already present.
When the actual lookup is performed Stroom will try to find the key in each of the effective streams that have been loaded and that contain the map in the lookup call. If the lookup is unsuccessful in the effective stream for the first feed/loader combination then it will try the next, and so on until it has tried all of them. For this reason if you have multiple feed/loader combinations then order is important. It is possible for multiple effective streams to contain the same map/key so a feed/loader combination higher up the list will trump one lower down with the same map/key. Also if you have some lookups that may not return a value and others that should always return a value then the feed/loader for the latter should be higher up the list so it is searched first.
Standard Key/Value Lookups
Standard key/value lookups consist of a simple string key and a value that is either a simple string or an XML fragment.
Standard lookups are performed using the various forms of the stroom:lookup() XSLT function.
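For example, a minimal sketch of a standard lookup in an XSLT (the $fqdn and $eventTime variables are hypothetical, and it assumes the stroom namespace prefix is declared) using the FQDN_TO_IP map from the earlier example data might be:
<IPAddress>
  <xsl:value-of select="stroom:lookup('FQDN_TO_IP', $fqdn, $eventTime)"/>
</IPAddress>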
Range Lookups
Range lookups consist of a key that is an integer and a value that is either a simple string or an XML fragment.
For more detail on range lookups see the XSLT function stroom:lookup().
Nested Map Lookups
Nested map lookups involve chaining a number of lookups with the value of each map being used as the key for the next.
For more detail on nested lookups see the XSLT function stroom:lookup().
Bitmap Lookups
A bitmap lookup is a special kind of lookup that actually performs a lookup for each enabled bit position of the passed bitmap value.
For more detail on bitmap lookups see the XSLT function stroom:bitmap-lookup().
Values can either be a simple string or an XML fragment.
Context data lookups
Some event streams have a Context stream associated with them. Context streams allow the system sending the events to Stroom to supply an additional stream of data that provides context to the raw event stream. This can be useful when the system sending the events has no control over the event content but needs to supply additional information. The context stream can be used in lookups as a reference source to decorate events on receipt. Context reference data is specific to a single event stream so is transient in nature, therefore the On Heap Store is used to hold it for the duration of the event stream processing only.
Typically the reference loader for a context stream will include a translation step to convert the raw context data into reference:2 XML.
Reference Data API
The reference data store has an API to allow other systems to access the reference data store.
The lookup endpoint requires the caller to provide details of the reference feed and loader pipeline so if the effective stream is not in the store it can be loaded prior to performing the lookup.
Below is an example of a lookup request.
{
"mapName": "USER_ID_TO_LOCATION",
"effectiveTime": "2020-12-02T08:37:02.772Z",
"key": "jbloggs",
"referenceLoaders": [
{
"loaderPipeline" : {
"name" : "Reference Loader",
"uuid" : "da1c7351-086f-493b-866a-b42dbe990700",
"type" : "Pipeline"
},
"referenceFeed" : {
"name": "USER_ID_TOLOCATION-REFERENCE",
"uuid": "60f9f51d-e5d6-41f5-86b9-ae866b8c9fa3",
"type" : "Feed"
}
}
]
}
10.2 - File Output
When outputting files with Stroom, the output file names and paths can include various substitution variables to form the file and path names.
Context Variables
The following replacement variables are specific to the current processing context.
- ${feed} - The name of the feed that the stream being processed belongs to
- ${pipeline} - The name of the pipeline that is producing output
- ${sourceId} - The id of the input data being processed
- ${partNo} - The part number of the input data being processed where data is in aggregated batches
- ${searchId} - The id of the batch search being performed. This is only available during a batch search
- ${node} - The name of the node producing the output
Time Variables
The following replacement variables can be used to include aspects of the current time in UTC.
- ${year} - Year in 4 digit form, e.g. 2000
- ${month} - Month of the year padded to 2 digits
- ${day} - Day of the month padded to 2 digits
- ${hour} - Hour padded to 2 digits using 24 hour clock, e.g. 22
- ${minute} - Minute padded to 2 digits
- ${second} - Second padded to 2 digits
- ${millis} - Milliseconds padded to 3 digits
- ${ms} - Milliseconds since the epoch
System (Environment) Variables
System variables (environment variables) can also be used, e.g. ${TMP}.
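As an illustrative (hypothetical) example only, several of these variables could be combined into an output file name pattern such as:
${feed}_${year}${month}${day}_${hour}${minute}${second}_${node}_${uuid}.xml
which would produce a unique, time-stamped file name per feed and node.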
File Name References
rolledFileName in RollingFileAppender can use references to the fileName to incorporate parts of the non-rolled file name.
- ${fileName} - The complete file name
- ${fileStem} - Part of the file name before the file extension, i.e. everything before the last ‘.’
- ${fileExtension} - The extension part of the file name, i.e. everything after the last ‘.’
Other Variables
- ${uuid} - A randomly generated UUID to guarantee unique file names
10.3 - Parser
The following capabilities are available to parse input data:
- XML - XML input can be parsed with the XML parser.
- XML Fragment - Treat input data as an XML fragment, i.e. XML that does not have an XML declaration or root elements.
- Data Splitter - Delimiter and regular expression based language for turning non XML data into XML (e.g. CSV)
10.3.1 - Context Data
Context File
Input File:
<?xml version="1.0" encoding="UTF-8"?>
<SomeData>
<SomeEvent>
<SomeTime>01/01/2009:12:00:01</SomeTime>
<SomeAction>OPEN</SomeAction>
<SomeUser>userone</SomeUser>
<SomeFile>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</SomeFile>
</SomeEvent>
</SomeData>
Context File:
<?xml version="1.0" encoding="UTF-8"?>
<SomeContext>
<Machine>MyMachine</Machine>
</SomeContext>
Context XSLT:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="reference-data:2"
xmlns:evt="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:template match="SomeContext">
<referenceData
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd reference-data:2 file://reference-data-v2.0.1.xsd"
version="2.0.1">
<xsl:apply-templates/>
</referenceData>
</xsl:template>
<xsl:template match="Machine">
<reference>
<map>CONTEXT</map>
<key>Machine</key>
<value><xsl:value-of select="."/></value>
</reference>
</xsl:template>
</xsl:stylesheet>
Context XML Translation:
<?xml version="1.0" encoding="UTF-8"?>
<referenceData xmlns:evt="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="reference-data:2"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd reference-data:2 file://reference-data-v2.0.1.xsd"
version="2.0.1">
<reference>
<map>CONTEXT</map>
<key>Machine</key>
<value>MyMachine</value>
</reference>
</referenceData>
Input File:
<?xml version="1.0" encoding="UTF-8"?>
<SomeData>
<SomeEvent>
<SomeTime>01/01/2009:12:00:01</SomeTime>
<SomeAction>OPEN</SomeAction>
<SomeUser>userone</SomeUser>
<SomeFile>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</SomeFile>
</SomeEvent>
</SomeData>
Main XSLT (Note the use of the context lookup):
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="event-logging:3"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="2.0">
<xsl:template match="SomeData">
<Events xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd" Version="3.0.0">
<xsl:apply-templates/>
</Events>
</xsl:template>
<xsl:template match="SomeEvent">
<xsl:if test="SomeAction = 'OPEN'">
<Event>
<EventTime>
<TimeCreated>
<xsl:value-of select="s:format-date(SomeTime, 'dd/MM/yyyy:hh:mm:ss')"/>
</TimeCreated>
</EventTime>
<EventSource>
<System>Example</System>
<Environment>Example</Environment>
<Generator>Very Simple Provider</Generator>
<Device>
<IPAddress>182.80.32.132</IPAddress>
<Location>
<Country>UK</Country>
<Site><xsl:value-of select="s:lookup('CONTEXT', 'Machine')"/></Site>
<Building>Main</Building>
<Floor>1</Floor>
<Room>1aaa</Room>
</Location>
</Device>
<User><Id><xsl:value-of select="SomeUser"/></Id></User>
</EventSource>
<EventDetail>
<View>
<Document>
<Title>UNKNOWN</Title>
<File>
<Path><xsl:value-of select="SomeFile"/></Path>
</File>
</Document>
</View>
</EventDetail>
</Event>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Main Output XML:
<?xml version="1.0" encoding="UTF-8"?>
<Events xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="event-logging:3"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd"
Version="3.0.0">
<Event Id="6:1">
<EventTime>
<TimeCreated>2009-01-01T00:00:01.000Z</TimeCreated>
</EventTime>
<EventSource>
<System>Example</System>
<Environment>Example</Environment>
<Generator>Very Simple Provider</Generator>
<Device>
<IPAddress>182.80.32.132</IPAddress>
<Location>
<Country>UK</Country>
<Site>MyMachine</Site>
<Building>Main</Building>
<Floor>1</Floor>
<Room>1aaa</Room>
</Location>
</Device>
<User>
<Id>userone</Id>
</User>
</EventSource>
<EventDetail>
<View>
<Document>
<Title>UNKNOWN</Title>
<File>
<Path>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</Path>
</File>
</Document>
</View>
</EventDetail>
</Event>
</Events>
10.3.2 - XML Fragments
Some input XML data may be missing an XML declaration and root level enclosing elements. This data is not a valid XML document and must be treated as an XML fragment. To use XML fragments the input type for a translation must be set to ‘XML Fragment’. A fragment wrapper must be defined in the XML conversion that tells Stroom what declaration and root elements to place around the XML fragment data.
Here is an example:
<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE records [
<!ENTITY fragment SYSTEM "fragment">
]>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="2.0">
&fragment;
</records>
During conversion Stroom replaces the fragment text entity with the input XML fragment data. Note that XML fragments must still be well formed so that they can be parsed correctly.
10.4 - XSLT Conversion
Once the text file has been converted into Intermediary XML (or the feed is already XML), XSLT is used to translate the XML into event logging XML format.
Event Feeds must be translated into the events schema and Reference into the reference schema. You can browse documentation relating to the schemas within the application.
Here is an example XSLT:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="event-logging:3"
xmlns:s="stroom"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="2.0">
<xsl:template match="SomeData">
<Events
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd"
Version="3.0.0">
<xsl:apply-templates/>
</Events>
</xsl:template>
<xsl:template match="SomeEvent">
<xsl:variable name="dateTime" select="SomeTime"/>
<xsl:variable name="formattedDateTime" select="s:format-date($dateTime, 'dd/MM/yyyyhh:mm:ss')"/>
<xsl:if test="SomeAction = 'OPEN'">
<Event>
<EventTime>
<TimeCreated>
<xsl:value-of select="$formattedDateTime"/>
</TimeCreated>
</EventTime>
<EventSource>
<System>Example</System>
<Environment>Example</Environment>
<Generator>Very Simple Provider</Generator>
<Device>
<IPAddress>3.3.3.3</IPAddress>
</Device>
<User>
<Id><xsl:value-of select="SomeUser"/></Id>
</User>
</EventSource>
<EventDetail>
<View>
<Document>
<Title>UNKNOWN</Title>
<File>
<Path><xsl:value-of select="SomeFile"/></Path>
</File>
</Document>
</View>
</EventDetail>
</Event>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
10.4.1 - XSLT Functions
By including the following namespace:
xmlns:s="stroom"
E.g.
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="event-logging:3"
xmlns:s="stroom"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="2.0">
The following functions are available to aid your translation:
- bitmap-lookup(String map, String key) - Bitmap based look up against reference data map using the period start time
- bitmap-lookup(String map, String key, String time) - Bitmap based look up against reference data map using a specified time, e.g. the event time
- bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings) - Bitmap based look up against reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup
- bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace) - Bitmap based look up against reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup and get trace information for the path taken to resolve the lookup.
- classification() - The classification of the feed for the data being processed
- col-from() - The column in the input that the current record begins on (can be 0).
- col-to() - The column in the input that the current record ends at.
- current-time() - The current system time
- current-user() - The current user logged into Stroom (only relevant for interactive use, e.g. search)
- decode-url(String encodedUrl) - Decode the provided url.
- dictionary(String name) - Loads the contents of the named dictionary for use within the translation
- encode-url(String url) - Encode the provided url.
- feed-name() - Name of the feed for the data being processed
- format-date(String date, String pattern) - Format a date that uses the specified pattern using the default time zone
- format-date(String date, String pattern, String timeZone) - Format a date that uses the specified pattern with the specified time zone
- format-date(String date, String patternIn, String timeZoneIn, String patternOut, String timeZoneOut) - Parse a date with the specified input pattern and time zone and format the output with the specified output pattern and time zone
- format-date(String milliseconds) - Format a date that is specified as a number of milliseconds since a standard base time known as “the epoch”, namely January 1, 1970, 00:00:00 GMT
- get(String key) - Returns the value associated with a key that has been stored using put()
- hash(String value) - Hash a string value using the default SHA-256 algorithm and no salt
- hash(String value, String algorithm, String salt) - Hash a string value using the specified hashing algorithm and supplied salt value. Supported hashing algorithms include SHA-256, SHA-512, MD5.
- hex-to-dec(String hex) - Convert hex to dec representation
- hex-to-oct(String hex) - Convert hex to oct representation
- json-to-xml(String json) - Returns an XML representation of the supplied JSON value for use in XPath expressions
- line-from() - The line in the input that the current record begins on (1 based).
- line-to() - The line in the input that the current record ends at.
- link(String url) - Creates a stroom dashboard table link.
- link(String title, String url) - Creates a stroom dashboard table link.
- link(String title, String url, String type) - Creates a stroom dashboard table link.
- log(String severity, String message) - Logs a message to the processing log with the specified severity
- lookup(String map, String key) - Look up a reference data map using the period start time
- lookup(String map, String key, String time) - Look up a reference data map using a specified time, e.g. the event time
- lookup(String map, String key, String time, Boolean ignoreWarnings) - Look up a reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup
- lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace) - Look up a reference data map using a specified time, e.g. the event time, ignore any warnings generated by a failed lookup and get trace information for the path taken to resolve the lookup.
- meta(String key) - Lookup a meta data value for the current stream using the specified key. The key can be Feed, StreamType, CreatedTime, EffectiveTime, Pipeline or any other attribute supplied when the stream was sent to Stroom, e.g. meta(‘System’).
- numeric-ip(String ipAddress) - Convert an IP address to a numeric representation for range comparison
- part-no() - The current part within a multi part aggregated input stream (AKA the substream number) (1 based)
- parse-uri(String URI) - Returns an XML structure of the URI providing authority, fragment, host, path, port, query, scheme, schemeSpecificPart, and userInfo components if present.
- random() - Get a system generated random number between 0 and 1.
- record-no() - The current record number within the current part (substream) (1 based).
- search-id() - Get the id of the batch search when a pipeline is processing as part of a batch search
- source() - Returns an XML structure with the stroom-meta namespace detailing the current source location.
- source-id() - Get the id of the current input stream that is being processed
- stream-id() - An alias for source-id included for backward compatibility.
- pipeline-name() - Name of the current processing pipeline using the XSLT
- put(String key, String value) - Store a value for use later on in the translation
bitmap-lookup()
The bitmap-lookup() function looks up a bitmap key against reference or context data and, for each set bit position, adds the value (which can be an XML node set) to the resultant XML.
bitmap-lookup(String map, String key)
bitmap-lookup(String map, String key, String time)
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings)
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace)
- map - The name of the reference data map to perform the lookup against.
- key - The bitmap value to lookup. This can either be represented as a decimal integer (e.g. 14) or as hexadecimal by prefixing with 0x (e.g. 0xE).
- time - Determines which set of reference data was effective at the requested time. If no reference data exists with an effective time before the requested time then the lookup will fail. Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
- ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
- trace - If true, additional trace information is output as INFO messages.
If the look up fails no result will be returned.
The key is a bitmap expressed as either a decimal integer or a hexadecimal value, e.g. 14/0xE is 1110 as a binary bitmap.
For each bit position that is set (i.e. has a binary value of 1) a lookup will be performed using that bit position as the key.
In this example, positions 1, 2 & 3 are set so a lookup would be performed for these bit positions.
The results of each lookup for the bitmap are concatenated together in bit position order, separated by a space.
If ignoreWarnings is true then any lookup failures will be ignored and it will return the value(s) for the bit positions it was able to lookup.
This function can be useful when you have a set of values that can be represented as a bitmap and you need them to be converted back to individual values. For example if you have a set of additive account permissions (e.g Admin, ManageUsers, PerformExport, etc.), each of which is associated with a bit position, then a user’s permissions could be defined as a single decimal/hex bitmap value. Thus a bitmap lookup with this value would return all the permissions held by the user.
For example the reference data store may contain:
Key (Bit position) | Value |
---|---|
0 | Administrator |
1 | Manage_Users |
2 | Perform_Export |
3 | View_Data |
4 | Manage_Jobs |
5 | Delete_Data |
6 | Manage_Volumes |
The following are example lookups using the above reference data:
Lookup Key (decimal) | Lookup Key (Hex) | Bitmap | Result |
---|---|---|---|
0 | 0x0 | 0000000 | - |
1 | 0x1 | 0000001 | Administrator |
74 | 0x4A | 1001010 | Manage_Users View_Data Manage_Volumes |
2 | 0x2 | 0000010 | Manage_Users |
96 | 0x60 | 1100000 | Delete_Data Manage_Volumes |
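A usage sketch based on the permissions data above (the map name USER_PERMISSIONS and the variables are hypothetical; it assumes the s prefix is bound to the stroom namespace as shown earlier):
<!-- $permissions holds a decimal or 0x-prefixed hex bitmap value taken from the event -->
<Permissions>
  <xsl:value-of select="s:bitmap-lookup('USER_PERMISSIONS', $permissions, $formattedDateTime)"/>
</Permissions>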
dictionary()
The dictionary() function gets the contents of the specified dictionary for use during translation. The main use for this function is to allow users to abstract the management of a set of keywords from the XSLT so that it is easier for some users to make quick alterations to a dictionary that is used by some XSLT, without the need for the user to understand the complexities of XSLT.
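A minimal usage sketch (the dictionary name SUSPICIOUS_COMMANDS and the $command variable are hypothetical): since dictionary() returns the dictionary contents as text, one simple approach is to test whether a value appears within it:
<!-- Load the dictionary contents and warn if the command appears in it -->
<xsl:variable name="keywords" select="s:dictionary('SUSPICIOUS_COMMANDS')"/>
<xsl:if test="contains($keywords, $command)">
  <xsl:value-of select="s:log('WARN', concat($command, ' is in the suspicious commands dictionary'))"/>
</xsl:if>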
format-date()
The format-date() function takes a Pattern and optional TimeZone arguments and replaces the parsed contents with an XML standard Date Format. The pattern must be a Java based SimpleDateFormat. If the optional TimeZone argument is present the pattern must not include the time zone pattern tokens (z and Z). A special time zone value of “GMT/BST” can be used to guess the time based on the date (BST during British Summer Time).
E.g. Convert a GMT date time “2009/12/01 12:34:11”
<xsl:value-of select="s:format-date('2009/08/01 12:34:11', 'yyyy/MM/dd HH:mm:ss')"/>
E.g. Convert a GMT or BST date time “2009/08/01 12:34:11”
<xsl:value-of select="s:format-date('2009/08/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', 'GMT/BST')"/>
E.g. Convert a GMT+1:00 date time “2009/08/01 12:34:11”
<xsl:value-of select="s:format-date('2009/08/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', 'GMT+1:00')"/>
E.g. Convert a date time specified as milliseconds since the epoch “1269270011640”
<xsl:value-of select="s:format-date('1269270011640')"/>
Time Zone Must be as per the rules defined in SimpleDateFormat under General Time Zone syntax.
link()
Create a string that represents a hyperlink for display in a dashboard table.
link(url)
link(title, url)
link(title, url, type)
Example
link('http://www.somehost.com/somepath')
> [http://www.somehost.com/somepath](http://www.somehost.com/somepath)
link('Click Here','http://www.somehost.com/somepath')
> [Click Here](http://www.somehost.com/somepath)
link('Click Here','http://www.somehost.com/somepath', 'dialog')
> [Click Here](http://www.somehost.com/somepath){dialog}
link('Click Here','http://www.somehost.com/somepath', 'dialog|Dialog Title')
> [Click Here](http://www.somehost.com/somepath){dialog|Dialog Title}
Type can be one of:
- dialog : Display the content of the link URL within a stroom popup dialog.
- tab : Display the content of the link URL within a stroom tab.
- browser : Display the content of the link URL within a new browser tab.
- dashboard : Used to launch a stroom dashboard internally with parameters in the URL.
If you wish to override the default title or URL of the target link in either a tab or dialog you can. Both dialog and tab types allow titles to be specified after a |, e.g. dialog|My Title.
log()
The log() function writes a message to the processing log with the specified severity. Severities of INFO, WARN, ERROR and FATAL can be used. Severities of ERROR and FATAL will result in records being omitted from the output if a RecordOutputFilter is used in the pipeline. The counts for RecWarn, RecError will be affected by warnings or errors generated in this way therefore this function is useful for adding business rules to XML output.
E.g. Warn if a SID is not the correct length.
<xsl:if test="string-length($sid) != 7">
<xsl:value-of select="s:log('WARN', concat($sid, ' is not the correct length'))"/>
</xsl:if>
lookup()
The lookup() function looks up a value (which can be an XML node set) from reference or context data and adds it to the resultant XML.
lookup(String map, String key)
lookup(String map, String key, String time)
lookup(String map, String key, String time, Boolean ignoreWarnings)
lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace)
- map - The name of the reference data map to perform the lookup against.
- key - The key to lookup. The key can be a simple string, an integer value in a numeric range or a nested lookup key.
- time - Determines which set of reference data was effective at the requested time. If no reference data exists with an effective time before the requested time then the lookup will fail. Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
- ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
- trace - If true, additional trace information is output as INFO messages.
If the look up fails no result will be returned. By testing the result a default value may be output if no result is returned.
E.g. Look up a SID given a PF
<xsl:variable name="pf" select="PFNumber"/>
<xsl:if test="$pf">
<xsl:variable name="sid" select="s:lookup('PF_TO_SID', $pf, $formattedDateTime)"/>
<xsl:choose>
<xsl:when test="$sid">
<User>
<Id><xsl:value-of select="$sid"/></Id>
</User>
</xsl:when>
<xsl:otherwise>
<data name="PFNumber">
<xsl:attribute name="Value"><xsl:value-of select="$pf"/></xsl:attribute>
</data>
</xsl:otherwise>
</xsl:choose>
</xsl:if>
Range lookups
Reference data entries can either be stored with a single string key or a key range that defines a numeric range, e.g. 1-100. When a lookup is performed the passed key is looked up as if it were a normal string key. If that lookup fails Stroom will try to convert the key to an integer (long) value. If it can be converted to an integer then a second lookup will be performed against entries with key ranges to see if there is a key range that includes the requested key.
Range lookups can be used for looking up an IP address where the reference data values are associated with ranges of IP addresses.
In this use case, the IP address must first be converted into a numeric value using numeric-ip(), e.g:
stroom:lookup('IP_TO_LOCATION', numeric-ip($ipAddress))
Similarly the reference data must be stored with key ranges whose bounds were created using this function.
Nested Maps
The lookup function allows you to perform chained lookups using nested maps.
For example you may have a reference data map called USER_ID_TO_LOCATION that maps user IDs to some location information for that user and a map called USER_ID_TO_MANAGER that maps user IDs to the user ID of their manager.
If you wanted to decorate a user’s event with the location of their manager you could use a nested map to achieve the lookup chain.
To perform the lookup set the map argument to the list of maps in the lookup chain, separated by a /, e.g. USER_ID_TO_MANAGER/USER_ID_TO_LOCATION.
This will perform a lookup against the first map in the list using the requested key.
If a value is found the value will be used as the key in a lookup against the next map.
The value from each map lookup is used as the key in the next map all the way down the chain.
The value from the last lookup is then returned as the result of the lookup() call.
If no value is found at any point in the chain then that results in no value being returned from the function.
In order to use nested map lookups each intermediate map must contain simple string values. The last map in the chain can either contain string values or XML fragment values.
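For example, a sketch of the manager location lookup described above (the $userId and $formattedDateTime variables are hypothetical):
<ManagerLocation>
  <xsl:copy-of select="s:lookup('USER_ID_TO_MANAGER/USER_ID_TO_LOCATION', $userId, $formattedDateTime)"/>
</ManagerLocation>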
put() and get()
You can put values into a map using the put() function. These values can then be retrieved later using the get() function. Values are stored against a key name so that multiple values can be stored. These functions can be used for many purposes but are most commonly used to count a number of records that meet certain criteria.
An example of how to count records is shown below:
<!-- Get the current record count -->
<xsl:variable name="currentCount" select="number(s:get('count'))" />
<!-- Increment the record count -->
<xsl:variable name="count">
<xsl:choose>
<xsl:when test="$currentCount">
<xsl:value-of select="$currentCount + 1" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="1" />
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<!-- Store the count for future retrieval -->
<xsl:value-of select="s:put('count', $count)" />
<!-- Output the new count -->
<data name="Count">
<xsl:attribute name="Value" select="$count" />
</data>
parse-uri()
The parse-uri() function takes a Uniform Resource Identifier (URI) in string form and returns an XML node with a namespace of uri containing the URI’s individual components of authority, fragment, host, path, port, query, scheme, schemeSpecificPart and userInfo. See either RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax or Java’s java.net.URI Class for details regarding the components.
The following xml
<!-- Display and parse the URI contained within the text of the rURI element -->
<xsl:variable name="u" select="s:parse-uri(rURI)" />
<URI>
<xsl:value-of select="rURI" />
</URI>
<URIDetail>
<xsl:copy-of select="$v"/>
</URIDetail>
given the rURI text contains
http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details
would provide
<URI>http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details</URI>
<URIDetail>
<authority xmlns="uri">foo:bar@w1.superman.com:8080</authority>
<fragment xmlns="uri">more-details</fragment>
<host xmlns="uri">w1.superman.com</host>
<path xmlns="uri">/very/long/path.html</path>
<port xmlns="uri">8080</port>
<query xmlns="uri">p1=v1&p2=v2</query>
<scheme xmlns="uri">http</scheme>
<schemeSpecificPart xmlns="uri">//foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2</schemeSpecificPart>
<userInfo xmlns="uri">foo:bar</userInfo>
</URIDetail>
10.4.2 - XSLT Includes
You can use an XSLT import to include XSLT from another translation. E.g.
<xsl:import href="ApacheAccessCommon" />
This would include the XSLT from the ApacheAccessCommon translation.
11 - Properties
This documentation applies to Stroom 7.0 or higher
Properties are the means of configuring the Stroom application and are typically maintained by the Stroom system administrator. The values of some properties are required in order for Stroom to function, e.g. database connection details, and thus need to be set prior to running Stroom. Some properties can be changed at runtime to alter the behaviour of Stroom.
Sources
Property values can be defined in the following locations.
System Default
The system defaults are hard-coded into the Stroom application code by the developers and can’t be changed. These represent reasonable defaults, where applicable, but may need to be changed, e.g. to reflect the scale of the Stroom system or the specific environment.
The default property values can either be viewed in the Stroom user interface (Tools -> Properties) or in the file config/config-defaults.yml in the Stroom distribution.
Global Database Value
Global database values are property values stored in the database that are global across the whole cluster.
The global database value is defined as a record in the config table in the database.
The database record will only exist if a database value has explicitly been set.
The database value will apply to all nodes in the cluster, overriding the default value, unless a node also has a value set in its YAML configuration.
Database values can be set from the Stroom user interface using Tools -> Properties. Some properties are marked Read Only which means they cannot have a database value set for them. These properties can only be altered via the YAML configuration file on each node. Such properties are typically used to configure values required for Stroom to be able to boot, so it does not make sense for them to be configurable from the User Interface.
YAML Configuration file
Stroom is built on top of a framework called Drop Wizard.
Drop Wizard uses a YAML configuration file on each node to configure the application.
This is typically config.yml and is located in the config directory.
This file contains both the Drop Wizard configuration settings (settings for ports, paths and application logging) and the Stroom specific properties configuration.
The file is in YAML format and the Stroom properties are located under the appConfig key.
For details of the Drop Wizard configuration structure, see here (external link).
The file is split into three sections using these keys:
- server - Configuration of the web server, e.g. ports, paths, request logging.
- logging - Configuration of application logging
- appConfig - The stroom configuration properties
The following is an example of the YAML configuration file:
# Drop Wizard configuration section
server:
# e.g. ports and paths
logging:
# e.g. logging levels/appenders
# Stroom properties configuration section
appConfig:
commonDbDetails:
connection:
jdbcDriverClassName: ${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}
jdbcDriverUrl: ${STROOM_JDBC_DRIVER_URL:-jdbc:mysql://localhost:3307/stroom?useUnicode=yes&characterEncoding=UTF-8}
jdbcDriverUsername: ${STROOM_JDBC_DRIVER_USERNAME:-stroomuser}
jdbcDriverPassword: ${STROOM_JDBC_DRIVER_PASSWORD:-stroompassword1}
contentPackImport:
enabled: true
...
In the Stroom user interface properties are named with a dot notation key, e.g. stroom.contentPackImport.enabled. Each part of the dot notation property name represents a key in the YAML file, e.g. for this example, the location in the YAML would be:
appConfig:
contentPackImport:
enabled: true # stroom.contentPackImport.enabled
The stroom part of the dot notation name is replaced with appConfig.
Variable Substitution
The YAML configuration file supports Bash style variable substitution in the form of:
${ENV_VAR_NAME:-value_if_not_set}
This allows values to be set either directly in the file or via an environment variable, e.g.
jdbcDriverClassName: ${STROOM_JDBC_DRIVER_CLASS_NAME:-com.mysql.cj.jdbc.Driver}
In the above example, if the STROOM_JDBC_DRIVER_CLASS_NAME environment variable is not set then the value com.mysql.cj.jdbc.Driver will be used instead.
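As an illustrative sketch, the same property could be supplied from the environment when starting Stroom. The database host used below and the use of the distribution's start.sh script are assumptions for the example, not required values:
# Illustrative only: db-host:3306 is an example address
export STROOM_JDBC_DRIVER_URL="jdbc:mysql://db-host:3306/stroom?useUnicode=yes&characterEncoding=UTF-8"
# config.yml picks the value up via ${STROOM_JDBC_DRIVER_URL:-...}
./start.sh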
Typed Values
YAML supports typed values rather than just strings, see https://yaml.org/refcard.html. YAML understands booleans, strings, integers, floating point numbers, as well as sequences/lists and maps. Some properties will be represented differently in the user interface to the YAML file. This is due to how values are stored in the database and how the current user interface works. This will likely be improved in future versions. For details of how different types are represented in the YAML and the UI, see Data Types.
Source Precedence
The three sources (Default, Database & YAML) are listed in increasing priority, i.e. YAML trumps Database, which trumps Default.
For example, in a two node cluster, this table shows the effective value of a property on each node.
A - indicates the value has not been set in that source. NULL indicates that the value has been explicitly set to NULL.
Source | Node1 | Node2 |
---|---|---|
Default | red | red |
Database | - | - |
YAML | - | blue |
Effective | red | blue |
Or where a Database value is set.
Source | Node1 | Node2 |
---|---|---|
Default | red | red |
Database | green | green |
YAML | - | blue |
Effective | green | blue |
Or where a YAML value is explicitly set to NULL.
Source | Node1 | Node2 |
---|---|---|
Default | red | red |
Database | green | green |
YAML | - | NULL |
Effective | green | NULL |
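To illustrate the last case, a node's YAML file can explicitly null a property so that neither the database value nor the default applies on that node. This is only a sketch; someSection and colour are hypothetical names used for the example:
appConfig:
  someSection:
    colour: null # explicitly set to NULL on this node only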
Data Types
Stroom property values can be set using a number of different data types. Database property values are currently set in the user interface using the string form of the value. For each of the data types defined below, there will be an example of how the data type is recorded in its string form.
Data Type | Example UI String Forms | Example YAML form |
---|---|---|
Boolean | true false | true false |
String | This is a string | "This is a string" |
Integer/Long | 123 | 123 |
Float | 1.23 | 1.23 |
Stroom Duration | P30D P1DT12H PT30S 30d 30s 30000 | "P30D" "P1DT12H" "PT30S" "30d" "30s" "30000" (see Stroom Duration Data Type) |
List | #red#Green#Blue or ,1,2,3 | See List Data Type |
Map | ,=red=FF0000,Green=00FF00,Blue=0000FF | See Map Data Type |
DocRef | ,docRef(MyType,a56ff805-b214-4674-a7a7-a8fac288be60,My DocRef name) | See DocRef Data Type |
Enum | HIGH LOW | "HIGH" "LOW" |
Path | /some/path/to/a/file | "/some/path/to/a/file" |
ByteSize | 32, 512KiB | 32, 512KiB (see Byte Size Data Type) |
Stroom Duration Data Type
The Stroom Duration data type is used to specify time durations, for example the time to live of a cache or the time to keep data before it is purged. Stroom Duration uses a number of string forms to support legacy property values.
ISO 8601 Durations
Stroom Duration can be expressed using ISO 8601 (external link) duration strings.
It does NOT support the full ISO 8601 format, only D, H, M and S.
For details of how the string is parsed to a Stroom Duration, see Duration (external link)
The following are examples of ISO 8601 duration strings:
- P30D - 30 days
- P1DT12H - 1 day 12 hours (36 hours)
- PT30S - 30 seconds
- PT0.5S - 500 milliseconds
Legacy Stroom Durations
This format was used in versions of Stroom older than v7 and is included to support legacy property values.
The following are examples of legacy duration strings:
- 30d - 30 days
- 12h - 12 hours
- 10m - 10 minutes
- 30s - 30 seconds
- 500 - 500 milliseconds
Combinations such as 1m30s are not supported.
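As an illustration, a duration-typed property can be written in the YAML in either form. The property path below (someCache / timeToLive) is hypothetical and used only to show the value formats:
appConfig:
  someCache:
    timeToLive: "PT10M" # ISO 8601 form - 10 minutes
    # equivalent legacy form: timeToLive: "10m"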
List Data Type
This type supports ordered lists of items, where an item can be of any supported data type, e.g. a list of strings or list of integers.
The following is an example of how a property (statusValues) that is a List of strings is represented in the YAML:
annotation:
  statusValues:
    - "New"
    - "Assigned"
    - "Closed"
This would be represented as a string in the User Interface as:
|New|Assigned|Closed
See Delimiters in String Conversion for details of how the items are delimited in the string.
The following is an example of how a property (cpu) that is a List of DocRefs is represented in the YAML:
statistics:
  internal:
    cpu:
      - type: "StatisticStore"
        uuid: "af08c4a7-ee7c-44e4-8f5e-e9c6be280434"
        name: "CPU"
      - type: "StroomStatsStore"
        uuid: "1edfd582-5e60-413a-b91c-151bd544da47"
        name: "CPU"
This would be represented as a string in the User Interface as:
|,docRef(StatisticStore,af08c4a7-ee7c-44e4-8f5e-e9c6be280434,CPU)|,docRef(StroomStatsStore,1edfd582-5e60-413a-b91c-151bd544da47,CPU)
See Delimiters in String Conversion for details of how the items are delimited in the string.
Map Data Type
This type supports a collection of key/value pairs where the key is unique within the collection. The type of the key must be string, but the type of the value can be any supported type.
The following is an example of how a property (mapProperty) that is a map of string => string would be represented in the YAML:
mapProperty:
  red: "FF0000"
  green: "00FF00"
  blue: "0000FF"
This would be represented as a string in the User Interface as:
,=red=FF0000,Green=00FF00,Blue=0000FF
The delimiter between pairs is defined first, then the delimiter for the key and value.
See Delimiters in String Conversion for details of how the items are delimited in the string.
DocRef Data Type
A DocRef (or Document Reference) is a type specific to Stroom that defines a reference to an instance of a Document within Stroom, e.g. an XSLT, Pipeline, Dictionary, etc. A DocRef consists of three parts: the type, the UUID (external link) and the name of the Document.
The following is an example of how a property (aDocRefProperty) that is a DocRef would be represented in the YAML:
aDocRefProperty:
  type: "MyType"
  uuid: "a56ff805-b214-4674-a7a7-a8fac288be60"
  name: "My DocRef name"
This would be represented as a string in the User Interface as:
,docRef(MyType,a56ff805-b214-4674-a7a7-a8fac288be60,My DocRef name)
See Delimiters in String Conversion for details of how the items are delimited in the string.
Byte Size Data Type
The Byte Size data type is used to represent a quantity of bytes using the IEC (external link) standard. Quantities are represented as powers of 1024, i.e. a KiB (kibibyte) means 1024 bytes.
Examples of Byte Size values in string form are (a YAML value would optionally be surrounded with double quotes):
- 32, 32b, 32B, 32bytes - 32 bytes
- 32K, 32KB, 32KiB - 32 kibibytes
- 32M, 32MB, 32MiB - 32 mebibytes
- 32G, 32GB, 32GiB - 32 gibibytes
- 32T, 32TB, 32TiB - 32 tebibytes
- 32P, 32PB, 32PiB - 32 pebibytes
The *iB form is preferred as it is more explicit and avoids confusion with SI units.
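As an illustration, a byte-size property can be written in the YAML as follows. The property path below (someStore / maxFileSize) is hypothetical and used only to show the value format:
appConfig:
  someStore:
    maxFileSize: "512MiB" # 512 mebibytes, i.e. 512 * 1024 * 1024 bytes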
Delimiters in String Conversion
The string conversion used for collection types like List, Map etc. relies on the string form defining the delimiter(s) to use for the collection.
The delimiter(s) are added as the first n characters of the string form, e.g. |red|green|blue or |=red=FF0000|Green=00FF00|Blue=0000FF.
It is possible to use a number of different delimiters to allow for delimiter characters appearing in the actual value, e.g. #some text#some text with a | in it
The following are the delimiter characters that can be used: | : ; , ! / \ # @ ~ - _ = + ?
When Stroom records a property value to the database it may use a delimiter of its own choosing, ensuring that it picks a delimiter that is not used in the property value.
Restart Required
Some properties are marked as requiring a restart. There are two scopes for this:
Requires UI Refresh
If a property is marked in the UI as requiring a UI refresh, this means that a change to the property requires the Stroom nodes serving the UI to be restarted for the new value to take effect.
Requires Restart
If a property is marked in the UI as requiring a restart, this means that a change to the property requires all Stroom nodes to be restarted for the new value to take effect.
12 - Roles
TODO
Describe application level permissions and how users and groups behave.
13 - Security
Shared Storage
For most large installations Stroom uses shared storage for its data store. This storage could be a CIFS, NFS or similar shared file system. It is recommended that access to this shared storage is protected so that only the application can access it. This could be achieved by placing the storage and application behind a firewall and by requiring appropriate authentication to the shared storage. It should be noted that NFS is unauthenticated so should be used with appropriate safeguards.
MySQL
Accounts
It is beyond the scope of this article to discuss this in detail but all MySQL accounts should be secured on initial install. Official guidance for doing this can be found here (external link).
Communication
Communication between MySQL and the application should be secured. This can be achieved in one of the following ways:
- Placing MySQL and the application behind a firewall
- Securing communication through the use of iptables
- Making MySQL and the application communicate over SSL (see here (external link) for instructions)
The above options are not mutually exclusive and may be combined to better secure communication.
Application
Node to node communication
In a multi node Stroom deployment each node communicates with the master node. This can be configured securely in one of several ways:
- Direct communication to Tomcat on port 8080 - Secured by being behind a firewall or using iptables
- Direct communication to Tomcat on port 8443 - Secured using SSL and certificates
- Removal of Tomcat connectors other than AJP and configuration of Apache to communicate on port 443 using SSL and certificates
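As a sketch of the firewall/iptables approach, the inter-node port could be restricted so that only known cluster members can reach it. The node addresses (10.0.0.x) and port 8080 below are illustrative assumptions and should be adjusted to your deployment:
# allow node-to-node traffic from known cluster members only (example addresses)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.11 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.12 -j ACCEPT
# drop everything else arriving on the inter-node port
iptables -A INPUT -p tcp --dport 8080 -j DROP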
Application to Stroom Proxy Communication
The application can be configured to share some information with Stroom Proxy so that Stroom Proxy can decide whether or not to accept data for certain feeds based on the existence of the feed or its reject/accept status. The amount of information shared between the application and the proxy is minimal but could be used to discover what feeds are present within the system. Securing this communication is harder as the application and the proxy will not typically reside behind the same firewall. Despite this, communication can still be performed over SSL, thus protecting this potential attack vector.
Admin port
Stroom (v6 and above) and its associated family of stroom-* DropWizard based services all expose an admin port (8081 in the case of stroom). This port serves up various health check and monitoring pages as well as a number of restful services for initiating admin tasks. There is currently no authentication on this admin port so it is assumed that access to this port will be tightly controlled using a firewall, iptables or similar.
Servlets
There are several servlets in Stroom that are accessible by certain URLs. Considerations should be made about what URLs are made available via Apache and who can access them. The servlets, path and function are described below:
Servlet | Path | Function | Risk |
---|---|---|---|
DataFeed | /datafeed or /datafeed/* | Used to receive data | Possible denial of service attack by posting too much data/noise |
RemoteFeedService | /remoting/remotefeedservice.rpc | Used by proxy to ask application about feed status (described in previous section) | Possible to systematically discover which feeds are available. Communication with this service should be secured over SSL as discussed above |
DynamicCSSServlet | /stroom/dynamic.css | Serves dynamic CSS based on theme configuration | Low risk as no important data is made available by this servlet |
DispatchService | /stroom/dispatch.rpc | Service for UI and server communication | All back-end services accessed by this umbrella service are secured appropriately by the application |
ImportFileServlet | /stroom/importfile.rpc | Used during configuration upload | Users must be authenticated and have appropriate permissions to import configuration |
ScriptServlet | /stroom/script | Serves user defined visualisation scripts to the UI | The visualisation script is considered to be part of the application just as the CSS so is not secured |
ClusterCallService | /clustercall.rpc | Used for node to node communication as discussed above | Communication must be secured as discussed above |
ExportConfig | /export/* | Servlet used to export configuration data | Servlet access must be restricted with Apache to prevent configuration data being made available to unauthenticated users |
Status | /status | Shows the application status including volume usage | Needs to be secured so that only appropriate users can see the application status |
Echo | /echo | Block GZIP data posted to the echo servlet is sent back uncompressed. This is a utility servlet for decompression of external data | URL should be secured or not made available |
Debug | /debug | Servlet for echoing HTTP header arguments including certificate details | Should be secured in production environments |
SessionList | /sessionList | Lists the logged in users | Needs to be secured so that only appropriate users can see who is logged in |
SessionResourceStore | /resourcestore/* | Used to create, download and delete temporary files linked to a user's session, such as data for export | This is secured by using the user's session and requiring authentication |
HDFS, Kafka, HBase, Zookeeper
Stroom and stroom-stats can integrate with HDFS, Kafka, HBase and Zookeeper. It should be noted that communication with these external services is currently not secure. Until additional security measures (e.g. authentication) are put in place it is assumed that access to these services will be carefully controlled (using a firewall, iptables or similar) so that only Stroom nodes can access the open ports.
Content
It may be possible for a user to write XSLT, Data Splitter or other content that exposes data that should not be exposed, or that causes the application some harm. At present, processing operations are not isolated processes, so it is easy to cripple processing performance with a badly written translation, whether written accidentally or on purpose. To mitigate this risk it is recommended that only users trusted to do so are given permission to create XSLT, Data Splitter and Pipeline configurations.
Visualisations can be completely customised with JavaScript. The JavaScript that is added is executed in a client's browser, potentially opening up the possibility of XSS attacks, an attack on the application to access data that a user shouldn't be able to access, an attack to destroy data, or simply failure/incorrect operation of the user interface. To mitigate this risk, all user defined JavaScript is executed within a separate browser IFrame. In addition, all JavaScript should be examined before being added to a production system unless the author is trusted. This may necessitate the creation of a separate development and testing environment for user content.
14 - Stroom Jobs
There are various jobs that run in the background within Stroom. Among these are jobs that control pipeline processing, removing old files from the file system, checking the status of nodes and volumes etc. Each task executes at a different time depending on the purpose of the task. There are three ways that a task can be executed:
- Scheduled jobs execute periodically according to a cron schedule. These include jobs such as cleaning the file system where Stroom only needs to perform this action once a day and can do so overnight.
- Frequency controlled jobs are executed every X seconds, minutes, hours etc. Most of the jobs that execute with a given frequency are status checking jobs that perform a short lived action fairly frequently.
- Distributed jobs are only applicable to stream processing with a pipeline. Distributed jobs are executed by a worker node as soon as a worker has available threads to execute a job and the task distributor has work available.
A list of task types and their execution method can be seen by opening Monitoring/Jobs from the main menu.
TODO: image
Expanding each task type allows you to configure how a task behaves on each node:
TODO: image
Account Maintenance
This job checks user accounts on the system and de-activates them under the following conditions:
- An unused account that has been inactive for longer than the age configured by stroom.security.identity.passwordPolicy.neverUsedAccountDeactivationThreshold.
- An account that has been inactive for longer than the age configured by stroom.security.identity.passwordPolicy.unusedAccountDeactivationThreshold.
Attribute Value Data Retention
Deletes Meta attribute values (additional and less valuable metadata) older than stroom.data.meta.metaValue.deleteAge.
Data Delete
Before data is physically removed from the database and file system it is marked as logically deleted by adding a flag to the metadata record in the database.
Data can be logically deleted by a user from the UI or via a process such as data retention.
Data is deleted logically as it is faster to do than a physical delete (important in the UI), and it also allows for data to be restored (undeleted) from the UI.
This job performs the actual physical deletion of data that has been marked logically deleted for longer than the duration configured with stroom.data.store.deletePurgeAge.
All data files associated with a metadata record are deleted from the file system before the metadata is physically removed from the database.
Data Processor
Processes data by finding data that matches processing filters on each pipeline.
When enabled, each worker node asks the master node for data processing tasks.
The master node creates tasks based on processing filters added to the Processors screen of each pipeline and supplies them to the requesting workers.
Feed Based Data Retention
This job uses the retention property of each feed to logically delete data from the associated feed that is older than the retention period. The recommended way of specifying data retention rules is via the data retention policy feature, but feed based retention still exists for backwards compatibility. Feed based data retention will be removed in a future release and should be considered deprecated.
File System Clean (deprecated)
This is the previous incarnation of the Data Delete job.
This job scans the file system looking for files that are no longer associated with metadata in the database or where the metadata is marked as deleted and deletes the files if this is the case.
The process is slow to run as it has to traverse all stored data files and examine each.
However, this version of the data deletion process was created when metadata was deleted immediately, i.e. not marked for future physical deletion, so was the only way to perform this clean up activity at the time.
This job will be removed in a future release. The Data Delete job should be used instead from now on.
File System Volume Status
Scans your data volumes to ensure they are available and determines how much free space they have. Records this status in the Volume Status table.
Index Searcher Cache Refresh
Refresh references to Lucene index searchers that have been cached for a while.
Index Shard Delete
How frequently index shards that have been logically deleted are physically deleted from the file system.
Index Shard Retention
How frequently index shards that are older than their retention period are logically deleted.
Index Volume Status
Scans your index volumes to ensure they are available and determines how much free space they have. Records this status in the Index Volume Status table.
Index Writer Cache Sweep
How frequently entries in the Index Shard Writer cache are evicted based on the time-to-live, time-to-idle and cache size settings.
Index Writer Flush
How frequently in-memory changes to the index shards are flushed to the file system and committed to the index.
Java Heap Histogram Statistics
How frequently heap histogram statistics will be captured. This can be useful for diagnosing issues or seeing where memory is being used. Each run will result in a JVM pause so care should be taken when running this on a production system.
Node Status
How frequently we try to write stats about node status, including JVM and memory usage.
Pipeline Destination Roll
How frequently rolling pipeline destinations, e.g. a Rolling File Appender are checked to see if they need to be rolled. This frequency should be at least as short as the most frequent rolling frequency.
Policy Based Data Retention
Run the policy based data retention rules over the data and logically delete any data that should no longer be retained.
Processor Task Queue Statistics
How frequently statistics about the state of the stream processing task queue are captured.
Processor Task Retention
How frequently failed and completed tasks are checked to see if they are older than the delete age threshold set by stroom.processor.deleteAge. Any that are older than this threshold are deleted.
Property Cache Reload
Stroom’s configuration properties can each be configured globally in the database. This job controls the frequency that each node refreshes the values of its properties cache from the global database values. See also Properties.
Proxy Aggregation
If you front Stroom with a Stroom proxy which is configured to 'store' rather than 'store/forward', then this task, when run, will pick up all files in the proxy repository dir, aggregate them by feed and bring them into Stroom.
It uses the system property stroom.proxyDir.
Query History Clean
How frequently items in the query history are removed from the history if their age is older than stroom.history.daysRetention or if the number of items in the history exceeds stroom.history.itemsRetention.
Ref Data Off-heap Store Purge
Purges all data older than the purge age defined by the property stroom.pipeline.purgeAge.
See also Reference Data.
Solr Index Optimise
How frequently Solr index segments are explicitly optimised by merging them into one.
Solr Index Retention
How frequently a process is run to delete items from the Solr indexes that don’t meet the retention rule of that index.
SQL Stats Database Aggregation
This job controls the frequency that the database statistics aggregation process is run.
This process takes the entries in SQL_STAT_VAL_SRC and merges them into the main statistics tables SQL_STAT_KEY and SQL_STAT_VAL.
As this process is reliant on data flushed by the SQL Stats In Memory Flush job it is advisable to schedule it to run after that, leaving some time for the in-memory flush to finish.
SQL Stats In Memory Flush
SQL Statistics are initially held and aggregated in memory.
This job controls the frequency that the in memory statistics are flushed from the in memory buffer to the staging table SQL_STAT_VAL_SRC in the database.
15 - Tools
15.1 - Command Line Tools
Version Information: Created with Stroom v7.0
Last Updated: 10 September 2020
Stroom has a number of tools that are available from the command line in addition to starting the main application.
The basic structure of the command for starting one of stroom’s commands is:
java -jar /absolute/path/to/stroom-app.jar COMMAND
COMMAND can be a number of values depending on what you want to do. Each command value is described in its own section.
NOTE: These commands are very powerful and potentially dangerous in the wrong hands, e.g. they allow the changing of users' passwords. Access to these commands should be strictly limited. Also, each command will run in its own JVM so they are not really intended to be run while Stroom is running on the node.
server
java -jar /absolute/path/to/stroom-app.jar server path/to/config.yml
This is the normal command for starting the Stroom application using the supplied YAML configuration file.
The example above will start the application as a foreground process.
Stroom would typically be started using the start.sh shell script, but the command above is listed for completeness.
When stroom starts it will check the database to see if any migration is required. If migration from an earlier version (including from an empty database) is required then this will happen as part of the application start process.
migrate
java -jar /absolute/path/to/stroom-app.jar migrate path/to/config.yml
There may be occasions where you want to migrate an old version but not start the application, e.g. during migration testing or to initiate the migration before starting up a cluster. This command will run the process that checks for any required migrations and then performs them. On completion of the process it exits. This runs as a foreground process.
create_account
java -jar /absolute/path/to/stroom-app.jar create_account --user USER --password PASSWORD [OPTIONS] path/to/config.yml
Where the named arguments are:
- -u --user - The username for the user.
- -p --password - The password for the user.
- -e --email - The email address of the user.
- -f --firstName - The first name of the user.
- -s --lastName - The last name of the user.
- --noPasswordChange - If set, do not require a password change on first login.
- --neverExpires - If set, the account will never expire.
This command will create an account in the internal identity provider within Stroom. Stroom is able to use third party OpenID identity providers such as Google or AWS Cognito but by default will use its own. When configured to use its own (the default) it will auto create an admin account when starting up a fresh instance. There are times when you may wish to create this account manually which this command allows.
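For example, the initial admin account could be created manually with a command along the following lines. The username, password and config path are illustrative values only:
java -jar /absolute/path/to/stroom-app.jar create_account --user admin --password "ChangeMe123" --noPasswordChange path/to/config.yml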
Authentication Accounts and Stroom Users
The user account used for authentication is distinct from the Stroom user entity that is used for authorisation within Stroom. If an external IDP is used then the mechanism for creating the authentication account will be specific to that IDP. If using the default internal Stroom IDP then an account must be created in order to authenticate, either from within the UI if you are already authenticated as a privileged user, or using this command. In either case a Stroom user will need to exist with the same username as the authentication account.
The command will fail if the user already exists. This command should NOT be run if you are using a third party identity provider.
This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.
reset_password
java -jar /absolute/path/to/stroom-app.jar reset_password --user USER --password PASSWORD path/to/config.yml
Where the named arguments are:
- -u --user - The username for the user.
- -p --password - The password for the user.
This command is used for changing the password of an existing account in stroom’s internal identity provider. It will also reset all locked/inactive/disabled statuses to ensure the account can be logged into. This command should NOT be run if you are using a third party identity provider. It will fail if the account does not exist.
This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.
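As a concrete sketch (the username, new password and config path are illustrative values only):
java -jar /absolute/path/to/stroom-app.jar reset_password --user admin --password "NewPassword456" path/to/config.yml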
manage_users
java -jar /absolute/path/to/stroom-app.jar manage_users [OPTIONS] path/to/config.yml
Where the named arguments are:
- --createUser USER_NAME - Creates a Stroom user with the supplied username.
- --createGroup GROUP_NAME - Creates a Stroom user group with the supplied group name.
- --addToGroup USER_OR_GROUP_NAME TARGET_GROUP - Adds a user/group to an existing group.
- --removeFromGroup USER_OR_GROUP_NAME TARGET_GROUP - Removes a user/group from an existing group.
- --grantPermission USER_OR_GROUP_NAME PERMISSION_NAME - Grants the named application permission to the user/group.
- --revokePermission USER_OR_GROUP_NAME PERMISSION_NAME - Revokes the named application permission from the user/group.
- --listPermissions - Lists all the valid permission names.
This command allows you to manage the account permissions within Stroom regardless of whether the internal identity provider or a 3rd party one is used. A typical use case for this is when using a third party identity provider. In this instance Stroom has no way of auto-creating an admin account when first started, so the association between the account on the 3rd party IDP and the Stroom user account needs to be made manually. To set up an admin account that enables you to log in to Stroom, you can run the command shown in the example below.
This command is not intended for automation of user management tasks on a running Stroom instance that you can authenticate with.
It is only intended for cases where you cannot authenticate with Stroom, i.e. when setting up a new Stroom with a 3rd party IDP or when scripting the creation of a test environment.
If you want to automate actions that can be performed in the UI then you can make use of the REST API that is described at /stroom/noauth/swagger-ui.
See the note above about the distinction between authentication accounts and stroom users.
java -jar /absolute/path/to/stroom-app.jar manage_users --createUser jbloggs --createGroup Administrators --addToGroup jbloggs Administrators --grantPermission Administrators "Administrator" path/to/config.yml
Where jbloggs is the user name of the account on the 3rd party IDP.
This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.
The named arguments can be used as many times as you like so you can create multiple users/groups/grants/etc. Regardless of the order of the arguments, the changes are executed in the following order:
- Create users
- Create groups
- Add users/groups to a group
- Remove users/groups from a group
- Grant permissions to users/groups
- Revoke permissions from users/groups
15.2 - Stream Dump Tool
Data within Stroom can be exported to a directory using the StreamDumpTool.
The tool is contained within the core Stroom Java library and can be accessed via the command line, e.g.
java -cp "apache-tomcat-7.0.53/lib/*:lib/*:instance/webapps/stroom/WEB-INF/lib/*" stroom.util.StreamDumpTool outputDir=output
Note the classpath may need to be altered depending on your installation.
The above command will export all content from Stroom and output it to a directory called output. Data is exported to zip files in the same format as zip files in proxy repositories. The structure of the exported data is ${feed}/${pathId}/${id} by default, with a .zip extension.
To provide greater control over what is exported and how, the following additional parameters can be used:
- feed - Specify the name of the feed to export data for (all feeds by default).
- streamType - The single stream type to export (all stream types by default).
- createPeriodFrom - Exports data created after this time, specified in ISO8601 UTC format, e.g. 2001-01-01T00:00:00.000Z (exports from earliest data by default).
- createPeriodTo - Exports data created before this time, specified in ISO8601 UTC format, e.g. 2001-01-01T00:00:00.000Z (exports up to latest data by default).
- outputDir - The output directory to write data to (required).
- format - The format of the output data directory and file structure (${feed}/${pathId}/${id} by default).
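For example, to export only raw events for a single feed created during 2020, a command along these lines could be used. The feed name, date range and classpath are illustrative values and may need adjusting for your installation:
java -cp "apache-tomcat-7.0.53/lib/*:lib/*:instance/webapps/stroom/WEB-INF/lib/*" stroom.util.StreamDumpTool feed=MY_FEED streamType=RAW_EVENTS createPeriodFrom=2020-01-01T00:00:00.000Z createPeriodTo=2021-01-01T00:00:00.000Z outputDir=output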
Format
The format parameter can include several replacement variables:
- feed - The name of the feed for the exported data.
- streamType - The data type of the exported data, e.g. RAW_EVENTS.
- streamId - The id of the data being exported.
- pathId - An incrementing numeric id that creates sub directories when required to ensure no directory ends up containing too many files.
- id - An incrementing numeric id similar to pathId but without sub directories.
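As an illustration, the format parameter can be combined with the other parameters to group exported files by feed and stream type. The layout below is an example only; the argument is quoted so that the shell does not expand the ${...} variables:
java -cp "apache-tomcat-7.0.53/lib/*:lib/*:instance/webapps/stroom/WEB-INF/lib/*" stroom.util.StreamDumpTool outputDir=output 'format=${feed}/${streamType}/${pathId}/${id}'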