Stroom uses Pipelines to process its data.
A pipeline is a set of pipeline elements connected together.
Pipelines are very powerful and flexible and allow the user to transform, index, store and forward data in a wide variety of ways.
Example Pipeline
Pipelines can take many forms and be used for a wide variety of purposes; however, a typical pipeline to convert CSV data into cooked events might look like this:
Pipelines process data in batches.
This batch of data is referred to as a Stream.
The input for the pipeline is a single Stream that exists within a Feed, and this data is fed into the left-hand side of the pipeline at Source.
Pipelines can accept streams from multiple Feeds assuming those feeds contain similar data.
The data in the Stream is always text data (XML, JSON, CSV, fixed-width, etc.) in a known character encoding.
Stroom does not currently support processing binary formats.
XML
The working format for pipeline processing is XML (with the exception of raw streaming).
Data can be input and output in other forms, e.g. JSON, CSV, fixed-width, etc. but the majority of pipelines do most of their processing in XML.
Input data is converted into XML SAX events, processed using XSLT to transform it into different shapes of XML, then either consumed as XML (e.g. an IndexingFilter) or converted into a desired output format for storage/forwarding.
Forks
Pipelines can also be forked at any point in the pipeline.
This allows the same data to be processed in different ways.
Note
Rather than creating complicated pipelines with forks, it is sometimes better to create multiple pipelines as this makes it easier to handle errors in one fork of the processing.
It also makes it easier to re-use common simple pipelines.
For example if you have a pipeline to transform CSV events into normalised XML then index it and forward it to a remote server, it may be better to have a pipeline to cook the events, then a common one to index those XML events and one to forward XML events.
Pipeline Inheritance
It is possible for pipelines to inherit from other pipelines.
This allows for the creation of standard abstract pipelines with a set structure, though not fully configured, to be inherited by many concrete pipelines.
For example you may have a standard pipeline for indexing XML events, i.e. read XML data and pass it to an IndexingFilter, but the IndexingFilter is not configured with the actual Index to send documents to.
A pipeline that inherits this one can then be simply configured with the Index to use.
Pipeline inheritance allows for changes to the inherited structure, e.g. adding additional elements in line.
Multi-level inheritance is also supported.
Pipeline Element Types
Stroom has a number of categories of pipeline element.
Reader
Readers are responsible for reading the raw bytes of the input data and converting it to character data using the Feed’s character encoding.
They also provide functionality to modify the data before or after it is decoded to characters, e.g. Byte Order Mark removal, or doing find/replace on the character data.
You can chain multiple Readers.
Parser
A parser is designed to convert the character data into XML for processing.
For example, the JSONParser will use a JSON parser to read the character data as JSON and convert it into XML elements and attributes that represent the JSON structure, so that it can be transformed downstream using XSLT.
Parsers have a built in reader so if they are not preceded by a Reader they will decode the raw bytes into character data before parsing.
Filter
A filter is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and can either return those events unchanged or modify them.
An example of a Filter is an XSLTFilter element.
Multiple filters can be chained, with each one consuming the events output by the one preceding it. You can therefore have many common, reusable XSLTFilters that each make small incremental changes to a document.
Writer
A writer is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and converts them into encoded character data (using a specified encoding) of some form.
The preceding filter may have been an XSLTFilter which transformed XML into plain text, in which case only character data events will be output and a TextWriter can just write these out as text data.
Other writers, e.g. the JSONWriter, will handle the XML SAX events to convert them into another format before encoding them as character data.
Destination
A destination element is a consumer of character data, as produced by a writer.
A typical destination is a StreamAppender that writes the character data (which may be XML, JSON, CSV, etc.) to a new Stream in Stroom’s stream store.
Other destinations can be used for sending the encoded character data to Kafka, a file on a file system or forwarding to an HTTP URL.
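The hand-off between the five element categories can be sketched outside Stroom. The following Python is purely illustrative and is not Stroom code: every function name is hypothetical, and a real pipeline passes SAX events between elements rather than a whole XML tree. It only shows the progression from raw bytes, to characters, to XML, and back to encoded character data.

```python
import csv
import io
import xml.etree.ElementTree as ET


def reader(raw: bytes, encoding: str = "utf-8") -> str:
    """Reader: decode raw bytes and drop a leading Byte Order Mark."""
    return raw.decode(encoding).lstrip("\ufeff")


def parser(text: str) -> ET.Element:
    """Parser: convert character data (CSV here) into XML."""
    root = ET.Element("records")
    for row in csv.reader(io.StringIO(text)):
        record = ET.SubElement(root, "record")
        for value in row:
            ET.SubElement(record, "data", value=value)
    return root


def upper_case_filter(root: ET.Element) -> ET.Element:
    """Filter: receive XML, modify it, and pass it on."""
    for data in root.iter("data"):
        data.set("value", data.get("value").upper())
    return root


def writer(root: ET.Element, encoding: str = "utf-8") -> bytes:
    """Writer: serialise XML into encoded character data."""
    return ET.tostring(root, encoding=encoding)


def destination(data: bytes, store: list) -> None:
    """Destination: consume the encoded data (a list stands in for a stream store)."""
    store.append(data)


store = []
raw_input = "\ufeffjbloggs,login\n".encode("utf-8")
destination(writer(upper_case_filter(parser(reader(raw_input)))), store)
print(store[0].decode("utf-8"))
```

Because each stage only depends on the output type of the stage before it, stages of the same kind (for example several filters) can be chained or swapped without touching the rest of the chain.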
1 - Pipeline Recipes
A set of basic pipeline structure recipes for common use cases.
The following are a basic set of pipeline recipes for doing typical tasks in Stroom.
It is not an exhaustive list as the possibilities with Pipelines are vast.
They are intended as a rough guide to get you started with building Pipelines.
The same as ingesting CSV data above, except the input JSON is converted into an XML representation of the JSON by the JSONParser.
The Normalise XSLTFilter will be specific to the format of the JSON being ingested.
The Decorate XSLTFilter will likely be identical to that used for the CSV ingest above, demonstrating reuse of pipeline element content.
As above except that the input data is already XML, though not in event-logging format.
The XMLParser simply reads the XML character data and converts it to XML SAX events for processing.
The Normalise XSLTFilter will be specific to the format of this XML and will transform it into event-logging format.
XML Fragments are where the input data looks like:
<Event>
...
</Event>
<Event>
...
</Event>
In other words, it is technically badly formed XML as it has no root element or declaration.
This format is however easier for client systems to send as they can send multiple <Event> blocks in one stream (e.g. just appending them together in a rolled log file) but don’t need to wrap them with an outer <Events> element.
The XMLFragmentParser understands this format and will add the wrapping element to make well-formed XML.
If the XML fragments are already in event-logging format then no Normalise XSLTFilter is required.
In some cases client systems may send XML containing characters that are not supported by the XML standard.
These can be removed using the InvalidXMLCharFilterReader.
The input data may also be known to contain other sets of characters that will cause problems in processing.
The FindReplaceFilter can be used to remove/replace either a fixed string or a Regex pattern.
In cases where you want to export the raw (or cooked) data from a feed you can have a very simple pipeline to pipe the source data directly to an appender.
This may be so that the raw data can be ingested into another system for analysis.
In this case the data is being written to disk using a file appender.
Be careful when specifying the directory structure for the FileAppender so that you don’t end up with too many files in one folder, which can cause some OS issues.
Indexing
XML to Stroom Lucene Index
This use case is for indexing XML event data that had already been normalised using one of the ingest pipelines above.
The XSLTFilter is used to transform the event into records format, extracting the fields to be indexed from the event.
The IndexingFilter reads the records XML and loads each one into Stroom’s internal Lucene index.
Dynamic indexing in Stroom allows you to use the XSLT to define the fields that are being indexed and how each field should be indexed.
This avoids having to define all the fields up front in the Index and allows for the creation of fields based on the actual data received.
The only difference with normal indexing in Stroom is that it uses the DynamicIndexingFilter and, rather than transforming the event into records:2 XML, it is transformed into index-documents:1 XML as shown in the example below.
This use case is for indexing XML event data that had already been normalised using one of the ingest pipelines above.
The XSLTFilter is used to transform the event into records format, extracting the fields to be indexed from the event.
The ElasticIndexingFilter reads the records XML and loads each one into an external Elasticsearch index.
Search extraction is the process of combining the data held in the index with data obtained from the original indexed document, i.e. the event.
Search extraction is useful when you do not want to store the whole of an event in the index (to reduce storage used) but still want to be able to access all the event data in a Dashboard/View.
An extraction pipeline is required to combine data in this way.
Search extraction pipelines are referenced in Dashboard and View settings.
Standard Lucene Index Extraction
This is a non-dynamic search extraction pipeline for a Lucene index.
XSLTFilter - An XSLT transforming event-logging:3 => index-documents:1.
Data Egress
XML to CSV File
A recipe for writing normalised XML events (as produced by an ingest pipeline above) to a file, but in a flat file format like CSV.
The XSLTFilter transforms the events XML into CSV data with XSLT including this:
The TextWriter converts the XML character events into a stream of characters encoded using the desired output character encoding.
The data is appended to a file on a file system, with one file per Stream.
This is similar to the above recipe for writing out CSV, except that the XSLTFilter converts the event XML into XML conforming to the https://www.w3.org/2013/XSL/json/ XMLSchema.
The JSONWriter can read this format of XML and convert it into JSON using the desired character encoding.
The RollingFileAppender will append the encoded JSON character data to a file on the file system that is rolled based on a size/time threshold.
This recipe is for sending normalised XML events to another system over HTTP.
The HTTPAppender is configured with the URL and any TLS certificates/keys/credentials.
A typical pipeline for loading XML reference data (conforming to the reference-data:2 XMLSchema) into the reference data store.
The ReferenceDataFilter reads the reference-data:2 format data and loads each entry into the appropriate map in the store.
As an example, the reference-data:2 XML for mapping userIDs to staff numbers looks something like this:
This recipe converts normalised XML data into statistic events (conforming to the statistics:4 XMLSchema).
Stroom’s Statistic Stores are a way to store aggregated counts or averaged values over time periods.
For example you may want counts of certain types of event, aggregated over fixed time buckets.
Each XML event is transformed using the XSLTFilter to either return no output or a statistic event.
An example of statistics:4 data for two statistic events is:
The following capabilities are available to parse input data:
XML - XML input can be parsed with the XML parser.
XML Fragment - Treat input data as an XML fragment, i.e. XML that does not have an XML declaration or root elements.
Data Splitter - Delimiter and regular expression based language for turning non XML data into XML (e.g. CSV)
2.1 - XML Fragments
Handling XML data without root level elements.
Some input XML data may be missing an XML declaration and root level enclosing elements.
This data is not a valid XML document and must be treated as an XML fragment.
To use XML fragments the input type for a translation must be set to ‘XML Fragment’.
A fragment wrapper must be defined in the XML conversion that tells Stroom what declaration and root elements to place around the XML fragment data.
Here is an example:
<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE records [
  <!ENTITY fragment SYSTEM "fragment">
]>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="2.0">
  &fragment;
</records>
During conversion Stroom replaces the fragment text entity with the input XML fragment data.
Note that XML fragments must still be well formed so that they can be parsed correctly.
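The effect of the wrapper can be demonstrated outside Stroom. This Python sketch is only an illustration of why a root element is needed, not of how Stroom performs the substitution; the function names and the sample record elements are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two sibling elements with no root element: not a well-formed document.
fragment = '<record><data value="1"/></record>\n<record><data value="2"/></record>'


def wrap_fragment(fragment_text: str, root_name: str = "records") -> str:
    """Place a single root element around the fragment text."""
    return f"<{root_name}>\n{fragment_text}\n</{root_name}>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a single XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(fragment))                 # the bare fragment fails to parse
print(is_well_formed(wrap_fragment(fragment)))  # the wrapped fragment parses
```

The fragment content itself still has to be well formed: an unclosed element inside the fragment would make the wrapped document fail to parse as well.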
3 - XSLT Conversion
Using Extensible Stylesheet Language Transformations (XSLT) to transform data.
XSLT is a language that is typically used for transforming XML documents into either a different XML document or plain text.
XSLT is a key part of Stroom’s pipeline processing as it is used to normalise bespoke events into a common XML audit event document conforming to the event-logging XML Schema.
Once a text file has been converted into intermediary XML (or the feed is already XML), XSLT is used to translate the XML into the event-logging XML format.
The XSLTFilter pipeline element defines the XSLT document and is used to do the transformation of the input XML into XML or plain text.
You can have multiple XSLTFilter elements in a pipeline if you want to break the transformation into steps, or wish to have simpler XSLTs that can be reused.
Raw Event Feeds are typically translated into the event-logging:3 schema and Raw Reference into the reference-data:2 schema.
3.1 - XSLT Basics
The basics of using XSLT and the XSLTFilter element.
XSLT is a very powerful language and allows the user to perform very complex transformations of XML data.
This documentation does not aim to document how to write XSLT documents. For that, we strongly recommend you refer to online references (e.g. W3Schools) or obtain a book covering XSLT 2.0 and XPath.
It does however aim to document aspects of XSLT that are specific to the use of XSLT in Stroom.
Examples
Event Normalisation
Here is an example XSLT document that transforms XML data in the records:2 namespace (which is the output of the DSParser element) into event XML in the event-logging:3 namespace.
It is an example of event normalisation from a bespoke format.
Warning
This example aims to show some typical uses of XSLT in a typical Stroom use case.
It does not necessarily represent best practice in terms of creation of a normalised event.
Here is an example of transforming Reference Data in the records:2 namespace (which is the output of the DSParser element) into XML in the reference-data:2 namespace that is suitable for loading using the ReferenceDataFilter.
If you want an XSLT to decorate an Events XML document with some additional data or to change it slightly without changing its namespace then a good starting point is the identity transformation.
<xsl:stylesheet
    version="1.0"
    xpath-default-namespace="event-logging:3"
    xmlns="event-logging:3"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Match Root Object -->
  <xsl:template match="Events">
    <Events
        xmlns="event-logging:3"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="event-logging:3 file://event-logging-v3.4.2.xsd"
        Version="3.4.2">
      <xsl:apply-templates />
    </Events>
  </xsl:template>

  <!-- Whenever you match any node or any attribute -->
  <xsl:template match="node()|@*">
    <!-- Copy the current node -->
    <xsl:copy>
      <!-- Including any attributes it has and any child nodes -->
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
This XSLT will copy every node and attribute as they are, returning the input document completely unchanged.
You can then add additional templates to match on specific elements and modify them, for example decorating a user’s UserDetails elements with a value obtained from a reference data lookup on a user ID.
Note
You can insert this identity skeleton into an XSLT editor using this editor snippet.
<xsl:message>
Stroom supports the standard <xsl:message> element from the http://www.w3.org/1999/XSL/Transform namespace.
This element behaves in a similar way to the stroom:log() XSLT function.
The element text is logged to the Error stream with a default severity of ERROR.
A child element can optionally be used to set the severity level (one of FATAL|ERROR|WARN|INFO).
The namespace of this element does not matter.
You can also set the attribute terminate="yes" to log the message at severity FATAL and halt processing of that stream part.
If the stream is multi-part then processing will continue with the next part.
Note
Setting terminate="yes" will trump any severity defined by a child element.
It will always be logged at FATAL.
The following are some examples of using <xsl:message>.
<!-- Log a message using default severity of ERROR -->
<xsl:message>Invalid length</xsl:message>
<!-- terminate="yes" means log the message as a FATAL ERROR and halt processing of the stream part -->
<xsl:message terminate="yes">Invalid length</xsl:message>
<!-- Log a message with a child element name specifying the severity. -->
<xsl:message>
  <warn>Invalid length</warn>
</xsl:message>
<!-- Log a message with a child element name specifying the severity. -->
<xsl:message>
  <info>Invalid length</info>
</xsl:message>
<!-- Log a message, specifying the severity and using a dynamic value. -->
<xsl:message>
  <info>
    <xsl:value-of select="concat('User ID ', $userId, ' is invalid')" />
  </info>
</xsl:message>
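The precedence rules described above (terminate first, then a child element, then the default of ERROR) can be summarised in a short Python sketch. This is not Stroom's implementation: the function names are hypothetical, and matching the child element name case-insensitively is an assumption made here because the examples use lower case names while the severities are listed in upper case.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
SEVERITIES = {"FATAL", "ERROR", "WARN", "INFO"}


def resolve_severity(message_xml: str) -> str:
    """Work out the severity an <xsl:message> element would be logged at."""
    message = ET.fromstring(message_xml)
    # terminate="yes" trumps any child element and is always FATAL.
    if message.get("terminate") == "yes":
        return "FATAL"
    for child in message:
        # The namespace of the child does not matter, so keep only the local name.
        local_name = child.tag.rsplit("}", 1)[-1].upper()
        if local_name in SEVERITIES:
            return local_name
    # No terminate attribute and no severity child: default severity.
    return "ERROR"


def xsl_message(body: str, attributes: str = "") -> str:
    """Build a standalone <xsl:message> element for the examples below."""
    return f'<xsl:message xmlns:xsl="{XSL_NS}"{attributes}>{body}</xsl:message>'


print(resolve_severity(xsl_message("Invalid length")))
print(resolve_severity(xsl_message("<warn>Invalid length</warn>")))
print(resolve_severity(xsl_message("<info>Invalid length</info>", ' terminate="yes"')))
```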
cidr-to-numeric-ip-range() - Converts a CIDR IP address range to an array of numeric IP addresses representing the start and end addresses of the range.
classification() - The classification of the feed for the data being processed
col-from() - The column in the input that the current record begins on (can be 0).
col-to() - The column in the input that the current record ends at.
current-time() - The current system time
current-user() - The current user logged into Stroom (only relevant for interactive use, e.g. search)
decode-url(String encodedUrl) - Decode the provided url.
dictionary(String name) - Loads the contents of the named dictionary for use within the translation
encode-url(String url) - Encode the provided url.
feed-attribute(String attributeKey) - NOTE: This function is deprecated, use meta(String key) instead.
The value for the supplied feed attributeKey.
feed-name() - Name of the feed for the data being processed
fetch-json(String url) - Simplistic version of http-call that sends a request to the passed url and converts the JSON response body to XML using json-to-xml.
Currently does not support SSL configuration like http-call does.
format-date(String milliseconds) - Format a date that is specified as a number of milliseconds since a standard base time known as “the epoch”, namely January 1, 1970, 00:00:00 GMT
get(String key) - Returns the value associated with a key that has been stored in a map using the put() function.
The map is in the scope of the current pipeline process so values do not live after the stream has been processed.
hash(String value) - Hash a string value using the default SHA-256 algorithm and no salt
hash(String value, String algorithm, String salt) - Hash a string value using the specified hashing algorithm and supplied salt value.
Supported hashing algorithms include SHA-256, SHA-512, MD5.
hex-to-dec(String hex) - Convert hex to dec representation.
hex-to-oct(String hex) - Convert hex to oct representation.
meta(String key) - Lookup a meta data value for the current stream using the specified key.
The key can be Feed, StreamType, CreatedTime, EffectiveTime, Pipeline or any other attribute supplied when the stream was sent to Stroom, e.g. meta('System').
meta-keys() - Returns an array of meta keys for the current stream. Each key can then be used to retrieve its corresponding meta value, by calling meta($key).
numeric-ip(String ipAddress) - Convert an IP address to a numeric representation for range comparison
part-no() - The current part within a multi part aggregated input stream (AKA the substream number) (1 based)
parse-uri(String URI) - Returns an XML structure of the URI providing authority, fragment, host, path, port, query, scheme, schemeSpecificPart, and userInfo components if present.
pipeline-name() - Get the name of the pipeline currently processing the stream.
The bitmap-lookup() function looks up a value (which can be an XML node set) from reference or context data for each set bit position of a bitmap key and adds it to the resultant XML.
map - The name of the reference data map to perform the lookup against.
key - The bitmap value to lookup.
This can either be represented as a decimal integer (e.g. 14) or as hexadecimal by prefixing with 0x (e.g. 0xE).
time - Determines which set of reference data was effective at the requested time.
If no reference data exists with an effective time before the requested time then the lookup will fail.
Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
trace - If true, additional trace information is output as INFO messages.
If the look up fails no result will be returned.
The key is a bitmap expressed as either a decimal integer or a hexadecimal value, e.g. 14/0xE is 1110 as a binary bitmap.
For each bit position that is set, (i.e. has a binary value of 1) a lookup will be performed using that bit position as the key.
In this example, positions 1, 2 & 3 are set so a lookup would be performed for these bit positions.
The results of the lookups for the bitmap are concatenated together in bit position order, separated by a space.
If ignoreWarnings is true then any lookup failures will be ignored and it will return the value(s) for the bit positions it was able to lookup.
This function can be useful when you have a set of values that can be represented as a bitmap and you need them to be converted back to individual values.
For example if you have a set of additive account permissions (e.g. Admin, ManageUsers, PerformExport, etc.), each of which is associated with a bit position, then a user’s permissions could be defined as a single decimal/hex bitmap value.
Thus a bitmap lookup with this value would return all the permissions held by the user.
For example the reference data store may contain:
| Key (Bit position) | Value          |
|--------------------|----------------|
| 0                  | Administrator  |
| 1                  | Manage_Users   |
| 2                  | Perform_Export |
| 3                  | View_Data      |
| 4                  | Manage_Jobs    |
| 5                  | Delete_Data    |
| 6                  | Manage_Volumes |
The following are example lookups using the above reference data:
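| Lookup Key (decimal) | Lookup Key (Hex) | Bitmap  | Result                                |
|----------------------|------------------|---------|---------------------------------------|
| 0                    | 0x0              | 0000000 | -                                     |
| 1                    | 0x1              | 0000001 | Administrator                         |
| 74                   | 0x4A             | 1001010 | Manage_Users View_Data Manage_Volumes |
| 2                    | 0x2              | 0000010 | Manage_Users                          |
| 96                   | 0x60             | 1100000 | Delete_Data Manage_Volumes            |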
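The decoding of a bitmap key into individual lookups can be reproduced with a few lines of Python. This is a sketch of the logic only, not Stroom's implementation: the function name and the in-memory dictionary are hypothetical stand-ins for the reference data store, and bit positions with no entry are silently skipped, which corresponds to ignoreWarnings being true.

```python
# Stand-in for the reference data map shown above, keyed by bit position.
REFERENCE_MAP = {
    0: "Administrator",
    1: "Manage_Users",
    2: "Perform_Export",
    3: "View_Data",
    4: "Manage_Jobs",
    5: "Delete_Data",
    6: "Manage_Volumes",
}


def bitmap_lookup(reference_map: dict, key: str) -> str:
    """Look up every set bit position of the key and join the values with spaces."""
    # int(key, 0) accepts decimal ("74") and 0x-prefixed hexadecimal ("0x4A").
    bitmap = int(key, 0)
    values = []
    position = 0
    while bitmap:
        # Test the lowest bit, then shift so the next position becomes the lowest.
        if bitmap & 1 and position in reference_map:
            values.append(reference_map[position])
        bitmap >>= 1
        position += 1
    return " ".join(values)


for lookup_key in ("0", "1", "74", "0x4A", "2", "0x60"):
    print(lookup_key, "->", bitmap_lookup(REFERENCE_MAP, lookup_key) or "-")
```

Walking the bits from least significant to most significant is what produces the results in bit position order.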
cidr-to-numeric-ip-range()
Converts a CIDR IP address range to an array of numeric IP addresses representing the start and end (broadcast) of the range.
When storing the result in a variable, ensure you indicate the type as a string array (xs:string*), as shown in the below example.
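Separately from the XSLT usage, the arithmetic behind the start and end values can be illustrated in Python using the standard ipaddress module. This is a sketch, not Stroom's implementation: the function name is borrowed for readability, it returns integers where the Stroom function returns strings, and accepting a CIDR with host bits set (strict=False) is an assumption.

```python
import ipaddress


def cidr_to_numeric_ip_range(cidr: str) -> tuple[int, int]:
    """Return the numeric start and end (broadcast) addresses of a CIDR range."""
    network = ipaddress.ip_network(cidr, strict=False)
    return int(network.network_address), int(network.broadcast_address)


start, end = cidr_to_numeric_ip_range("192.168.1.0/24")
print(start, end)  # 3232235776 3232236031

# A numeric address can then be tested against the range with simple comparisons.
candidate = int(ipaddress.ip_address("192.168.1.57"))
print(start <= candidate <= end)  # True
```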
The dictionary() function gets the contents of the specified dictionary for use during translation.
The main use for this function is to separate the management of a set of keywords from the XSLT, so that users can make quick alterations to a dictionary used by some XSLT without needing to understand the complexities of XSLT.
format-date()
The format-date() function combines parsing and formatting of date strings.
In its simplest form it will parse a date string and return the parsed date in the XML standard Date Format.
It also supports supplying a custom format pattern to output the parsed date in a specified format.
Function Signatures
The following are the possible forms of the format-date function.
<!-- Convert time in millis to standard date format -->
format-date(long millisSinceEpoch)
<!-- Convert inputDate to standard date format -->
format-date(String inputDate, String inputPattern)
<!-- Convert inputDate to standard date format using specified input time zone -->
format-date(String inputDate, String inputPattern, String inputTimeZone)
<!-- Convert inputDate to a custom date format using optional input time zone inputTimeZone -->
format-date(String inputDate, String inputPattern, String inputTimeZone, String outputPattern)
<!-- Convert inputDate to a custom date format using optional input time zone and a specified output time zone -->
format-date(String inputDate, String inputPattern, String inputTimeZone, String outputPattern, String outputTimeZone)
millisSinceEpoch - The date/time expressed as the number of milliseconds since the UNIX epoch.
inputDate - The input date string, e.g. 2009/08/01 12:34:11.
inputPattern - The pattern that defines the structure of inputDate (see Custom Date Formats).
inputTimeZone - Optional time zone of the inputDate.
If null then the UTC/Zulu time zone will be used.
If inputTimeZone is present, the inputPattern must not include the time zone pattern tokens (z and Z).
outputPattern - The pattern that defines the format of the output date (see Custom Date Formats).
outputTimeZone - Optional time zone of the output date.
If null then the UTC/Zulu time zone will be used.
Time Zones
The following is a list of some common time zone values:
| Values  | Zone Name                                                        |
|---------|------------------------------------------------------------------|
| GMT/BST | A Stroom specific value for UK daylight saving time (see below) |
A special time zone value of GMT/BST can be used when the inputDate is in local wall clock time with no time zone information.
In this case, the date/time will be used to determine whether the date is in British Summer Time or in GMT and adjust the output accordingly.
See the examples below.
Parsing Examples
The following examples show various calls to stroom:format-date() with their output.
<!-- Date in millis since UNIX epoch -->
stroom:format-date('1269270011640')
-> '2010-03-22T15:00:11.640Z'
<!-- Simple date UK style date -->
stroom:format-date('29/08/24', 'dd/MM/yy')
-> '2024-08-29T00:00:00.000Z'
<!-- Simple date US style date -->
stroom:format-date('08/29/24', 'MM/dd/yy')
-> '2024-08-29T00:00:00.000Z'
<!-- ISO date with no delimiters -->
stroom:format-date('20010801184559', 'yyyyMMddHHmmss')
-> '2001-08-01T18:45:59.000Z'
<!-- Standard output, no TZ -->
stroom:format-date('2001/08/01 18:45:59', 'yyyy/MM/dd HH:mm:ss')
-> '2001-08-01T18:45:59.000Z'
<!-- Standard output, date only, with TZ -->
stroom:format-date('2001/08/01', 'yyyy/MM/dd', '-07:00')
-> '2001-08-01T07:00:00.000Z'
<!-- Standard output, with TZ -->
stroom:format-date('2001/08/01 01:00:00', 'yyyy/MM/dd HH:mm:ss', '-08:00')
-> '2001-08-01T09:00:00.000Z'
<!-- Standard output, with TZ -->
stroom:format-date('2001/08/01 01:00:00', 'yyyy/MM/dd HH:mm:ss', '+01:00')
-> '2001-08-01T00:00:00.000Z'
<!-- Single digit day and month, no padding -->
stroom:format-date('2001 8 1', 'yyyy MM dd')
-> '2001-08-01T00:00:00.000Z'
<!-- Double digit day and month, no padding -->
stroom:format-date('2001 12 28', 'yyyy MM dd')
-> '2001-12-28T00:00:00.000Z'
<!-- Single digit day and month, with optional padding -->
stroom:format-date('2001 8 1', 'yyyy ppMM ppdd')
-> '2001-08-01T00:00:00.000Z'
<!-- Double digit day and month, with optional padding -->
stroom:format-date('2001 12 31', 'yyyy ppMM ppdd')
-> '2001-12-31T00:00:00.000Z'
<!-- With abbreviated day of week and month -->
stroom:format-date('Wed Aug 14 2024', 'EEE MMM dd yyyy')
-> '2024-08-14T00:00:00.000Z'
<!-- With long form day of week and month -->
stroom:format-date('Wednesday August 14 2024', 'EEEE MMMM dd yyyy')
-> '2024-08-14T00:00:00.000Z'
<!-- With 12 hour clock, AM -->
stroom:format-date('Wed Aug 14 2024 10:32:58 AM', 'E MMM dd yyyy hh:mm:ss a')
-> '2024-08-14T10:32:58.000Z'
<!-- With 12 hour clock, PM (lower case) -->
stroom:format-date('Wed Aug 14 2024 10:32:58 pm', 'E MMM dd yyyy hh:mm:ss a')
-> '2024-08-14T22:32:58.000Z'
<!-- Using minimal symbols -->
stroom:format-date('2001 12 31 22:58:32.123', 'y M d H:m:s.S')
-> '2001-12-31T22:58:32.123Z'
<!-- Optional time portion, with time -->
stroom:format-date('2001/12/31 22:58:32.123', 'yyyy/MM/dd[ HH:mm:ss.SSS]')
-> '2001-12-31T22:58:32.123Z'
<!-- Optional time portion, without time -->
stroom:format-date('2001/12/31', 'yyyy/MM/dd[ HH:mm:ss.SSS]')
-> '2001-12-31T00:00:00.000Z'
Parsing is done in lenient mode, so the count of each symbol is not critical, e.g. you can parse the year 2024 with y, yy, yyy or yyyy.
Despite this, it is advisable to use a pattern that matches the known format of the input dates, e.g. in this example yyyy, to avoid confusing anyone else reading your XSLT.
The count of each symbol is however critical when it comes to formatting.
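The effect of the inputTimeZone argument in the parsing examples can be checked with a small Python sketch. It is not Stroom's implementation: the function names are hypothetical, the patterns use Python's strptime directives rather than Stroom's pattern symbols, and only numeric offsets such as -08:00 are handled.

```python
from datetime import datetime, timedelta, timezone


def parse_offset(offset):
    """Turn an offset such as '+01:00' or '-08:00' into a tzinfo; None means UTC."""
    if offset is None:
        return timezone.utc
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = offset.lstrip("+-").split(":")
    return timezone(sign * timedelta(hours=int(hours), minutes=int(minutes)))


def to_standard_format(input_date, python_pattern, input_offset=None):
    """Parse a wall clock date in the given zone and render it as a UTC timestamp."""
    local = datetime.strptime(input_date, python_pattern)
    local = local.replace(tzinfo=parse_offset(input_offset))
    utc = local.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


print(to_standard_format("2001/08/01 18:45:59", "%Y/%m/%d %H:%M:%S"))
print(to_standard_format("2001/08/01", "%Y/%m/%d", "-07:00"))
print(to_standard_format("2001/08/01 01:00:00", "%Y/%m/%d %H:%M:%S", "-08:00"))
print(to_standard_format("2001/08/01 01:00:00", "%Y/%m/%d %H:%M:%S", "+01:00"))
```

A negative offset means the local clock is behind UTC, so the UTC result is later than the input; a positive offset gives an earlier UTC result.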
Formatting Examples
<!-- Specific output, no input or output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', null, 'E dd MMM yyyy HH:mm (s ''secs'')')
-> 'Wed 01 Aug 2001 14:30 (59 secs)'
<!-- Specific output, UTC input, no output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', 'UTC', 'E dd MMM yyyy HH:mm (s ''secs'')')
-> 'Wed 01 Aug 2001 14:30 (59 secs)'
<!-- Specific output, with input TZ, no output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', '+01:00', 'E dd MMM yyyy HH:mm (s ''secs'')')
-> 'Wed 01 Aug 2001 13:30 (59 secs)'
<!-- Specific output, with input and output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', '+01:00', 'E dd MMM yyyy HH:mm', '+02:00')
-> 'Wed 01 Aug 2001 15:30'
<!-- Long form text -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'EEEE d MMMM yyyy HH:mm:ss')
-> 'Wednesday 1 August 2001 14:07:05'
Reference Time
When parsing a date string that does not contain a full zoned date and time, certain assumptions will be made.
If there is no time zone in inputDate and no inputTimeZone argument has been passed then the time zone of the input date will be assumed to be in the UTC time zone.
If any of the date parts are not present, e.g. an input of 28 Oct then Stroom will use a reference date to fill in the gaps.
The reference date is the first of these values that is non-null:
The create time of the stream being processed by the XSLT.
The current time, i.e. now().
For example for a call of stroom:format-date('28 Oct', 'dd MMM') and a stream create time of 2024, it will return 2024-10-28T00:00:00.000Z.
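The gap-filling behaviour can be sketched in Python as follows. This is an illustration under stated assumptions, not Stroom's implementation: the function name is hypothetical, only a missing year is handled, and the pattern is a Python strptime pattern rather than a Stroom one.

```python
from datetime import datetime, timezone


def parse_day_month(text, stream_create_time=None):
    """Parse a 'dd MMM' style date, borrowing the year from the reference date."""
    # The reference date is the first non-null of: stream create time, current time.
    reference = stream_create_time or datetime.now(timezone.utc)
    # Append the reference year so that a complete date can be parsed.
    parsed = datetime.strptime(f"{text} {reference.year}", "%d %b %Y")
    # No time zone was supplied, so the result is treated as UTC.
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


create_time = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(parse_day_month("28 Oct", create_time))  # 2024-10-28T00:00:00.000Z
```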
hex-to-string()
For a hexadecimal input string, decode it using the specified character set to its original form.
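The decoding step is equivalent to the following Python sketch, in which the function and parameter names are hypothetical and the UTF-8 default is an assumption made for the example.

```python
def hex_to_string(hex_text: str, charset: str = "UTF-8") -> str:
    """Convert pairs of hex digits to bytes, then decode the bytes as text."""
    return bytes.fromhex(hex_text).decode(charset)


print(hex_to_string("74657374"))              # test
print(hex_to_string("00740065", "UTF-16BE"))  # te
```

The character set matters because the same bytes decode to different text in different encodings, as the second call shows with two bytes per character.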
headers - A newline (\n) delimited list of HTTP headers to send.
Each header is of the form key:value.
mediaType - The media (or MIME) type of the request data, e.g. application/json.
If not set application/json; charset=utf-8 will be used.
data - The data to send.
The data type should be consistent with mediaType.
Supplying the data argument means a POST request method will be used rather than the default GET.
clientConfig - A JSON object containing the configuration for the HTTP client to use, including any SSL configuration.
The function returns the response as XML with namespace stroom-http.
The XML includes the body of the response in addition to the status code, success status, message and any headers.
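How the http-call arguments described above combine into a request can be summarised in Python. This sketch makes no HTTP call and is not Stroom's implementation: the function name and the returned dictionary are hypothetical, and trimming whitespace around header keys and values is an assumption.

```python
def plan_request(headers=None, media_type=None, data=None):
    """Describe the request implied by the headers, mediaType and data arguments."""
    parsed_headers = {}
    for line in (headers or "").split("\n"):
        if line.strip():
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            parsed_headers[key.strip()] = value.strip()
    return {
        # Supplying data switches the method from the default GET to POST.
        "method": "POST" if data is not None else "GET",
        "headers": parsed_headers,
        # The default media type applies when none is given.
        "media_type": media_type or "application/json; charset=utf-8",
        "data": data,
    }


print(plan_request("Accept:application/json\nX-Trace:abc:123"))
print(plan_request(media_type="text/plain", data="hello")["method"])
```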
clientConfig
The client can be configured using a JSON object containing various optional configuration items.
The following is an example of the client configuration object with all keys populated.
This is an example of how to use the function call in your XSLT.
It is recommended to place the clientConfig JSON in a Dictionary to make it easier to edit and to avoid having to escape all the quotes.
...
<xsl:template match="record">
  ...
  <!-- Read the client config from a Dictionary into a variable -->
  <xsl:variable name="clientConfig" select="stroom:dictionary('HTTP Client Config')" />
  <!-- Make the HTTP call and store the response in a variable -->
  <xsl:variable name="response" select="stroom:http-call('https://reqbin.com/echo', null, null, null, $clientConfig)" />
  <!-- Apply 'response' templates to the response -->
  <xsl:apply-templates mode="response" select="$response" />
  ...
</xsl:template>
<xsl:template mode="response" match="http:response">
  <!-- Extract just the body of the response -->
  <val><xsl:value-of select="./http:body/text()" /></val>
</xsl:template>
...
link()
Create a string that represents a hyperlink for display in a dashboard table.
dialog : Display the content of the link URL within a stroom popup dialog.
tab : Display the content of the link URL within a stroom tab.
browser : Display the content of the link URL within a new browser tab.
dashboard : Used to launch a stroom dashboard internally with parameters in the URL.
If you wish to override the default title or URL of the target link in either a tab or dialog you can. Both dialog and tab types allow titles to be specified after a |, e.g. dialog|My Title.
log()
The log() function writes a message to the processing log with the specified severity.
Severities of INFO, WARN, ERROR and FATAL can be used.
Severities of ERROR and FATAL will result in records being omitted from the output if a RecordOutputFilter is used in the pipeline.
The counts for RecWarn and RecError will be affected by warnings or errors generated in this way, so this function is useful for adding business rules to XML output.
E.g. Warn if a SID is not the correct length.
<xsl:if test="string-length($sid) != 7">
<xsl:value-of select="stroom:log('WARN', concat($sid, ' is not the correct length'))"/>
</xsl:if>
The same functionality can also be achieved using the standard xsl:message element, see <xsl:message>
lookup()
The lookup() function looks up from reference or context data a value (which can be an XML node set) and adds it to the resultant XML.
map - The name of the reference data map to perform the lookup against.
key - The key to lookup. The key can be a simple string, an integer value in a numeric range or a nested lookup key.
time - Determines which set of reference data was effective at the requested time.
If no reference data exists with an effective time before the requested time then the lookup will fail.
Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
trace - If true, additional trace information is output as INFO messages.
If the lookup fails no result will be returned.
By testing the result a default value may be output if no result is returned.
Reference data entries can either be stored with a single string key or a key range that defines a numeric range, e.g. 1-100.
When a lookup is performed the passed key is looked up as if it were a normal string key.
If that lookup fails Stroom will try to convert the key to an integer (long) value.
If it can be converted to an integer then a second lookup will be performed against entries with key ranges to see if there is a key range that includes the requested key.
Range lookups can be used for looking up an IP address where the reference data values are associated with ranges of IP addresses.
In this use case, the IP address must first be converted into a numeric value using numeric-ip().
Similarly the reference data must be stored with key ranges whose bounds were created using this function.
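The two-stage lookup described above (exact string key first, then an integer range) can be sketched in Python as below. This is an illustration only: the data structures are hypothetical, and numeric_ip here assumes that the numeric form of an IPv4 address is its 32-bit integer value.

```python
import ipaddress


def numeric_ip(ip):
    """Convert a dotted IPv4 address to an integer (assumed numeric form)."""
    return int(ipaddress.IPv4Address(ip))


def lookup(string_entries, range_entries, key):
    """Try an exact string match first, then fall back to a range match."""
    if key in string_entries:
        return string_entries[key]
    try:
        numeric_key = int(key)
    except ValueError:
        return None  # Not an integer, so no range lookup is possible.
    for start, end, value in range_entries:
        if start <= numeric_key <= end:  # Range bounds are inclusive.
            return value
    return None


# Hypothetical reference data: one string key and one range of IP addresses.
string_entries = {"stroomnode00": "Bristol"}
range_entries = [
    (numeric_ip("192.168.2.0"), numeric_ip("192.168.2.255"), "Site-A"),
]

print(lookup(string_entries, range_entries, str(numeric_ip("192.168.2.245"))))
```

Both the lookup key and the range bounds go through the same conversion, which is why the reference data must be built with bounds produced by the same function.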
Nested Maps
The lookup function allows you to perform chained lookups using nested maps.
For example you may have a reference data map called USER_ID_TO_LOCATION that maps user IDs to some location information for that user and a map called USER_ID_TO_MANAGER that maps user IDs to the user ID of their manager.
If you wanted to decorate a user’s event with the location of their manager you could use a nested map to achieve the lookup chain.
To perform the lookup set the map argument to the list of maps in the lookup chain, separated by a /, e.g. USER_ID_TO_MANAGER/USER_ID_TO_LOCATION.
This will perform a lookup against the first map in the list using the requested key.
If a value is found the value will be used as the key in a lookup against the next map.
The value from each map lookup is used as the key in the next map all the way down the chain.
The value from the last lookup is then returned as the result of the lookup() call.
If no value is found at any point in the chain then that results in no value being returned from the function.
In order to use nested map lookups each intermediate map must contain simple string values.
The last map in the chain can either contain string values or XML fragment values.
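The chaining behaviour described above can be sketched in Python, using plain dictionaries as stand-ins for reference data maps. The user IDs and location values are made up for illustration.

```python
def nested_lookup(maps, map_chain, key):
    """Follow a chain of maps, feeding each value in as the next key."""
    value = key
    for map_name in map_chain.split("/"):
        value = maps.get(map_name, {}).get(value)
        if value is None:
            # A miss anywhere in the chain means no result at all.
            return None
    return value


# Hypothetical reference data maps.
maps = {
    "USER_ID_TO_MANAGER": {"user1": "user2"},
    "USER_ID_TO_LOCATION": {"user2": "Bristol"},
}

print(nested_lookup(maps, "USER_ID_TO_MANAGER/USER_ID_TO_LOCATION", "user1"))
```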
put() and get()
You can put values into a map using the put() function.
These values can then be retrieved later using the get() function.
Values are stored against a key name so that multiple values can be stored.
These functions can be used for many purposes but are most commonly used to count a number of records that meet certain criteria.
The map is in the scope of the current pipeline process so values do not live after the stream has been processed.
Also, the map will only contain entries that were put() within the current pipeline process.
An example of how to count records is shown below:
<!-- Get the current record count -->
<xsl:variable name="currentCount" select="number(stroom:get('count'))" />
<!-- Increment the record count -->
<xsl:variable name="count">
<xsl:choose>
<xsl:when test="$currentCount">
<xsl:value-of select="$currentCount + 1" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="1" />
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<!-- Store the count for future retrieval -->
<xsl:value-of select="stroom:put('count', $count)" />
<!-- Output the new count -->
<data name="Count">
<xsl:attribute name="Value" select="$count" />
</data>
meta-keys()
When calling this function and assigning the result to a variable, you must specify the variable data type of xs:string* (array of strings).
The following fragment is an example of using meta-keys() to emit all meta values for a given stream, into an Event/Meta element:
parse-uri()
The parse-uri() function takes a Uniform Resource Identifier (URI) in string form and returns an XML node with a namespace of uri containing the URI’s individual components of authority, fragment, host, path, port, query, scheme, schemeSpecificPart and userInfo. See either RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax or Java’s java.net.URI Class for details regarding the components.
The following XSLT fragment parses the URI held in the rURI element and outputs both the original URI and its parsed components:
<!-- Display and parse the URI contained within the text of the rURI element -->
<xsl:variable name="u" select="stroom:parse-uri(rURI)" />
<URI>
<xsl:value-of select="rURI" />
</URI>
<URIDetail>
<xsl:copy-of select="$u"/>
</URIDetail>
pointIsInsideXYPolygon()
Returns true if the specified point is inside the specified polygon.
Useful for determining if a user is inside a physical zone based on their location and the boundary of that zone.
pointIsInsideXYPolygon(Number xPos, Number yPos, Number[] xPolyData, Number[] yPolyData)
Arguments:
xPos - The X value of the point to be tested.
yPos - The Y value of the point to be tested.
xPolyData - A sequence of X values that define the polygon.
yPolyData - A sequence of Y values that define the polygon.
The list of values supplied for xPolyData must correspond with the list of values supplied for yPolyData.
The points that define the polygon must be provided in order, i.e. starting from one point on the polygon and then traveling round the path of the polygon until it gets back to the beginning.
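The source does not state which algorithm Stroom uses for this test. As an illustration of why the polygon points must be supplied in order, the following Python sketch uses the common ray-casting (even-odd) technique, with argument names mirroring the signature above.

```python
def point_is_inside_xy_polygon(x_pos, y_pos, x_poly_data, y_poly_data):
    """Return True if (x_pos, y_pos) lies inside the polygon (ray casting)."""
    if len(x_poly_data) != len(y_poly_data):
        raise ValueError("xPolyData and yPolyData must be the same length")
    inside = False
    count = len(x_poly_data)
    j = count - 1  # Index of the previous vertex, starting with the last one.
    for i in range(count):
        xi, yi = x_poly_data[i], y_poly_data[i]
        xj, yj = x_poly_data[j], y_poly_data[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > y_pos) != (yj > y_pos):
            # X coordinate where the edge crosses that horizontal line.
            x_cross = xi + (y_pos - yi) * (xj - xi) / (yj - yi)
            if x_pos < x_cross:
                inside = not inside  # Each crossing to the right toggles the state.
        j = i
    return inside


# A 4 x 4 square, with vertices listed in order around its boundary.
square_x = [0, 4, 4, 0]
square_y = [0, 0, 4, 4]
print(point_is_inside_xy_polygon(2, 2, square_x, square_y))
```

Because each edge is formed from consecutive points, supplying the vertices out of order describes a different (possibly self-intersecting) shape and changes the result.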
3.3 - XSLT Includes
Using an XSLT import to include XSLT from another translation.
You can use an XSLT import to include XSLT from another translation.
E.g.:
<xsl:import href="ApacheAccessCommon" />
This would include the XSLT from the ApacheAccessCommon translation.
4 - File Output
Substitution variables for use in output file names and paths.
When outputting files with Stroom, the output file names and paths can include various substitution variables to form the file and path names.
Context Variables
The following replacement variables are specific to the current processing context.
${feed} - The name of the feed that the stream being processed belongs to
${pipeline} - The name of the pipeline that is producing output
${sourceId} - The id of the input data being processed
${partNo} - The part number of the input data being processed where data is in aggregated batches
${searchId} - The id of the batch search being performed. This is only available during a batch search
${node} - The name of the node producing the output
Time Variables
The following replacement variables can be used to include aspects of the current time in UTC.
${year} - Year in 4 digit form, e.g. 2000
${month} - Month of the year padded to 2 digits
${day} - Day of the month padded to 2 digits
${hour} - Hour padded to 2 digits using 24 hour clock, e.g. 22
${minute} - Minute padded to 2 digits
${second} - Second padded to 2 digits
${millis} - Milliseconds padded to 3 digits
${ms} - Milliseconds since the epoch
System (Environment) Variables
System variables (environment variables) can also be used, e.g. ${TMP}.
File Name References
rolledFileName in RollingFileAppender can use references to the fileName to incorporate parts of the non-rolled file name.
${fileName} - The complete file name
${fileStem} - Part of the file name before the file extension, i.e. everything before the last ‘.’
${fileExtension} - The extension part of the file name, i.e. everything after the last ‘.’
Other Variables
${uuid} - A randomly generated UUID to guarantee unique file names
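To show how these substitution variables combine into a path, here is a minimal Python sketch. It is not Stroom's implementation: the helper names are hypothetical, and leaving unknown variables untouched is an assumption made for the example.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_variables(now):
    """Build the time variables for a given instant, expressed in UTC."""
    now = now.astimezone(timezone.utc)
    return {
        "year": f"{now.year:04d}",
        "month": f"{now.month:02d}",
        "day": f"{now.day:02d}",
        "hour": f"{now.hour:02d}",
        "minute": f"{now.minute:02d}",
        "second": f"{now.second:02d}",
        "millis": f"{now.microsecond // 1000:03d}",
        "ms": str((now - EPOCH) // timedelta(milliseconds=1)),
    }


def file_name_variables(file_name):
    """Split a file name into stem and extension at the last '.'."""
    stem, dot, extension = file_name.rpartition(".")
    if not dot:
        # No '.' in the name, so treat the whole name as the stem.
        stem, extension = file_name, ""
    return {"fileName": file_name, "fileStem": stem, "fileExtension": extension}


def substitute(template, variables):
    """Replace ${name} tokens; unknown names are left as they are."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: str(variables.get(match.group(1), match.group(0))),
        template,
    )


variables = {"feed": "MY_FEED"}
variables.update(time_variables(datetime(2000, 1, 2, 22, 3, 4, 5000, tzinfo=timezone.utc)))
variables.update(file_name_variables("events.log"))
print(substitute("${feed}/${year}-${month}-${day}/${fileStem}_${hour}.${fileExtension}", variables))
```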
5 - Reference Data
Performing temporal reference data lookups to decorate event data.
In Stroom reference data is primarily used to decorate events using stroom:lookup() calls in XSLTs.
For example you may have a reference data feed that associates the FQDN of a device with its physical location.
You can then perform a stroom:lookup() in the XSLT to decorate an event with the physical location of a device by looking up the FQDN found in the event.
Reference data is time sensitive and each stream of reference data has an Effective Date set against it.
This allows reference data lookups to be performed using the date of the event to ensure the reference data that was actually effective at the time of the event is used.
Using reference data involves the following steps/processes:
Ingesting the raw reference data into Stroom.
Creating (and processing) a pipeline to transform the raw reference into reference-data:2 format XML.
Creating a reference loader pipeline with a Reference Data Filter element to load cooked reference data into the reference data store.
Adding reference pipeline/feeds to an XSLT Filter in your event pipeline.
Adding the lookup call to the XSLT.
Processing the raw events through the event pipeline.
The process of creating a reference data pipeline is described in the HOWTO linked at the top of this document.
Reference Data Structure
A reference data entry essentially consists of the following:
Effective time - The date/time that the entry was effective from, i.e. the time the raw reference data was received.
Map name - A unique name for the key/value map that the entry will be stored in.
The name only needs to be unique within all map names that may be loaded within an XSLT Filter.
In practice it makes sense to keep map names globally unique.
Key - The text that will be used to lookup the value in the reference data map.
Mutually exclusive with Range.
Range - The inclusive range of integer keys that the entry applies to.
Mutually exclusive with Key.
Value - The value can either be simple text, e.g. an IP address, or an XML fragment that can be inserted into another XML document.
XML values must be correctly namespaced.
The following is an example of some reference data that has been converted from its raw form into reference-data:2 XML.
In this example the reference data contains four entries that each belong to a different map.
Three of the entries use a single key and one uses a key range; the last entry has an XML fragment value.
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stroom="stroom"
xmlns:evt="event-logging:3"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd"
version="2.0.1">
<!-- A simple string value -->
<reference>
<map>FQDN_TO_IP</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<IPAddress>192.168.2.245</IPAddress>
</value>
</reference>
<!-- A simple string value -->
<reference>
<map>IP_TO_FQDN</map>
<key>192.168.2.245</key>
<value>
<HostName>stroomnode00.strmdev00.org</HostName>
</value>
</reference>
<!-- A key range -->
<reference>
<map>USER_ID_TO_COUNTRY_CODE</map>
<range>
<from>1</from>
<to>1000</to>
</range>
<value>GBR</value>
</reference>
<!-- An XML fragment value -->
<reference>
<map>FQDN_TO_LOC</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<evt:Location>
<evt:Country>GBR</evt:Country>
<evt:Site>Bristol-S00</evt:Site>
<evt:Building>GZero</evt:Building>
<evt:Room>R00</evt:Room>
<evt:TimeZone>+00:00/+01:00</evt:TimeZone>
</evt:Location>
</value>
</reference>
</referenceData>
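To make the structure of a reference-data:2 document concrete, the following Python sketch reads entries from a cut-down document into (map, key or range, value) tuples. It is only an illustration of the entry structure and is not how Stroom loads reference data; the sample document reuses two entries from the example above.

```python
import xml.etree.ElementTree as ET

NS = {"rd": "reference-data:2"}

SAMPLE = """
<referenceData xmlns="reference-data:2" xmlns:evt="event-logging:3" version="2.0.1">
  <reference>
    <map>USER_ID_TO_COUNTRY_CODE</map>
    <range><from>1</from><to>1000</to></range>
    <value>GBR</value>
  </reference>
  <reference>
    <map>FQDN_TO_LOC</map>
    <key>stroomnode00.strmdev00.org</key>
    <value>
      <evt:Location><evt:Country>GBR</evt:Country></evt:Location>
    </value>
  </reference>
</referenceData>
"""


def read_entries(xml_text):
    """Return a list of (map name, key or (from, to) range, value) tuples."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for ref in root.findall("rd:reference", NS):
        map_name = ref.findtext("rd:map", namespaces=NS)
        range_el = ref.find("rd:range", NS)
        if range_el is not None:
            # Key and Range are mutually exclusive; a range is an inclusive pair.
            lookup_key = (
                int(range_el.findtext("rd:from", namespaces=NS)),
                int(range_el.findtext("rd:to", namespaces=NS)),
            )
        else:
            lookup_key = ref.findtext("rd:key", namespaces=NS)
        value_el = ref.find("rd:value", NS)
        children = list(value_el)
        # A value is either an XML fragment (child element) or simple text.
        value = children[0] if children else (value_el.text or "").strip()
        entries.append((map_name, lookup_key, value))
    return entries


entries = read_entries(SAMPLE)
for map_name, lookup_key, value in entries:
    print(map_name, lookup_key, value)
```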
Reference Data Namespaces
When XML reference data values are created, as in the example XML above, the XML values must be qualified with a namespace to distinguish them from the reference-data:2 XML that surrounds them.
In the above example the XML fragment will become as follows when injected into an event:
Even if evt is already declared in the XML that the fragment is being injected into, it will be explicitly declared again in the destination because it has been declared for the reference fragment.
While duplicate namespacing may appear odd it is valid XML.
The namespacing can also be achieved like this:
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stroom="stroom"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd"
version="2.0.1">
<!-- An XML value -->
<reference>
<map>FQDN_TO_LOC</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<Location xmlns="event-logging:3">
<Country>GBR</Country>
<Site>Bristol-S00</Site>
<Building>GZero</Building>
<Room>R00</Room>
<TimeZone>+00:00/+01:00</TimeZone>
</Location>
</value>
</reference>
</referenceData>
This reference data will be injected into event XML exactly as it is, i.e.:
Reference data is stored in two different places on a Stroom node.
All reference data is only visible to the node where it is located.
Each node that is performing reference data lookups will need to load and store its own reference data.
While this will result in duplicate data being held by nodes it makes the storage of reference data and its subsequent lookup very performant.
On-Heap Store
The On-Heap store is the reference data store that is held in memory in the Java Heap.
This store is volatile and will be lost on shut down of the node.
The On-Heap store is only used for storage of context data.
Off-Heap Store
The Off-Heap store is the reference data store that is held in memory outside of the Java Heap and is persisted to local disk.
As the store is also persisted to local disk it means the reference data will survive the shutdown of the stroom instance.
Storing the data off-heap means Stroom can run with a much smaller Java Heap size.
The Off-Heap store is based on the Lightning Memory-Mapped Database (LMDB).
LMDB makes use of the Linux page cache to ensure that hot portions of the reference data are held in the page cache (making use of all available free memory).
Infrequently used portions of the reference data will be evicted from the page cache by the Operating System.
Given that LMDB utilises the page cache for holding reference data in memory the more free memory the host has the better as there will be less shifting of pages in/out of the OS page cache.
When storing large amounts of data you may experience the OS reporting very little free memory as a large amount will be in use by the page cache.
This is not an issue as the OS will evict pages when memory is needed for other applications, e.g. the Java Heap.
Local Disk
The Off-Heap store is intended to be located on local disk on the Stroom node.
The location of the store is set using the property stroom.pipeline.referenceData.localDir.
Using LMDB on remote storage is NOT advised, see http://www.lmdb.tech/doc.
Using the fastest storage (e.g. fast SSDs) is advised to reduce load times and lookups of data that is not in memory.
Warning
If you are running stroom on AWS EC2 instances then you will need to attach some local instance storage to the host, e.g. SSD, to use for the reference data store.
In tests EBS storage was found to be VERY slow.
It should be noted that AWS instance storage is not persistent between instance stops, terminations and hardware failure.
However any loss of the reference data store will mean that the next time Stroom boots a new store will be created and reference data will be loaded on demand as normal.
Transactions
LMDB is a transactional database with ACID semantics.
All interaction with LMDB is done within a read or write transaction.
There can only be one write transaction at a time so if there are a number of concurrent reference data loads then they will have to wait in line.
Read transactions, i.e. lookups, are not blocked by each other but may be blocked by a write transaction depending on the value of the system property stroom.pipeline.referenceData.lmdb.readerBlockedByWriter.
LMDB can operate such that readers are not blocked by writers. However, if a read transaction is open while a write transaction is writing data to the store, the writer is unable to make use of free space (from previous deletes, see Store Size & Compaction), so the store will increase in size.
If read transactions are likely while writes are taking place then this can lead to excessive growth of the store.
Setting stroom.pipeline.referenceData.lmdb.readerBlockedByWriter to true will block all reads while a load is happening so any free space can be re-used, at the cost of making all lookups wait for the load to complete.
Use of this setting will depend on how likely it is that loads will clash with lookups and the store size should be monitored.
Read-Ahead Mode
When data is read from the store, if the data is not already in the page cache then it will be read from disk and added to the page cache by the OS.
Read-ahead is the process of speculatively reading ahead to load more pages into the page cache than were requested.
This is on the basis that future requests for data may need the pages speculatively read into memory as it is more efficient to read multiple pages at once.
If the reference data store is very large or is larger than the available memory then it is recommended to turn read-ahead off as the result will be to evict hot reference data from the page cache to make room for speculative pages that may not be needed.
It can be turned off with the system property stroom.pipeline.referenceData.readAheadEnabled.
Key Size
When reference data is created care must be taken to ensure that the Key used for each entry is less than 507 bytes.
For simple ASCII characters then this means less than 507 characters.
If non-ASCII characters are in the key then these will take up more than one byte per character so the length of the key in characters will be less.
This is a limitation inherent to LMDB.
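A quick way to see the effect of multi-byte characters on the limit is to measure the encoded length of a key. The Python sketch below assumes keys are encoded as UTF-8, which the text above does not state explicitly, and uses the 507 byte figure quoted above.

```python
MAX_KEY_BYTES = 507  # Limit quoted in the text above.


def key_fits(key):
    """Return True if the key's encoded size is under the limit.

    Assumes UTF-8 encoding of keys; this is an assumption for illustration.
    """
    return len(key.encode("utf-8")) < MAX_KEY_BYTES


print(key_fits("a" * 506))   # ASCII characters use one byte each.
print(key_fits("é" * 254))   # 'é' uses two bytes in UTF-8.
```

A check like this could be applied when generating reference data, so that over-length keys are caught before a load is attempted.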
Commit intervals
The property stroom.pipeline.referenceData.maxPutsBeforeCommit controls the number of entries that are put into the store between each commit.
As there can be only one transaction writing to the store at a time, committing periodically allows other processes to jump in and make writes.
There is a trade-off though as reducing the number of entries put between each commit can seriously affect performance.
For the fastest single process performance a value of 0 should be used which means it will not commit mid-load.
This however means all other processes wanting to write to the store will need to wait.
Low values (e.g. in the hundreds) mean very frequent commits so will hamper performance.
Cloning The Off Heap Store
If you are provisioning a new stroom node it is possible to copy the off heap store from another node.
Stroom should not be running on the node being copied from.
Simply copy the contents of stroom.pipeline.referenceData.localDir into the same configured location on the other node.
The new node will use the copied store and have access to its reference data.
Store Size & Compaction
Due to the way LMDB works the store can only grow in size; it will never shrink, even if reference data is deleted.
Deleted data frees up space for new writes to the store so will be reused but will never be freed back to the operating system.
If there is a regular process of purging old data and adding new reference data then this should not be an issue as the new reference data will use the space made available by the purged data.
If store size becomes an issue then it is possible to compact the store.
lmdb-utils is a package that is available on some package managers and this has an mdb_copy command that can be used with the -c switch to copy the LMDB environment to a new one, compacting it in the process.
This should be done when Stroom is down to avoid writes happening to the store while the copy is happening.
The following is an example of how to compact the store assuming Stroom has been shut down first.
# Navigate to the 'stroom.pipeline.referenceData.localDir' directory
cd /some/path/to/reference_data
# Verify contents
ls
(out) data.mdb lock.mdb
# Create a directory to write the compacted file to
mkdir compacted
# Run the compaction, writing the new data.mdb file to the new sub-dir
mdb_copy -c ./ ./compacted
# Delete the existing store
rm data.mdb lock.mdb
# Copy the compacted store back in (note a lock file gets created as needed)
mv compacted/data.mdb ./
# Remove the created directory
rmdir compacted
Now you can re-start Stroom and it will use the new compacted store, creating a lock file for it.
The compaction process is fast.
A test compaction of a 4GB store, compacted down to 1.6GB, took about 7s on non-flash HDD storage.
Alternatively, given that the store is essentially a cache and all data can be re-loaded another option is to delete the contents of stroom.pipeline.referenceData.localDir when Stroom is not running.
On boot Stroom will create a brand new empty store and reference data will be re-loaded as required.
This approach will result in all data having to be re-loaded so will slow lookups down until it has been loaded.
The Loading Process
Reference data is loaded into the store on demand during the processing of a stroom:lookup() function call.
Reference data will only be loaded if it does not already exist in the store, however it is always loaded as a complete stream, rather than entry by entry.
The test for existence in the store is based on the following criteria:
The UUID of the reference loader pipeline.
The version of the reference loader pipeline.
The Stream ID for the stream of reference data that has been deemed effective for the lookup.
The Stream Number (in the case of multi part streams).
If a reference stream has already been loaded matching the above criteria then no additional load is required.
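The existence test above amounts to a composite key made up of the four criteria. The following Python sketch models that idea with hypothetical class and field names; it is not Stroom's implementation, and the identifier values are made up.

```python
from typing import NamedTuple


class RefStreamKey(NamedTuple):
    """The four criteria that identify a loaded reference stream."""
    pipeline_uuid: str
    pipeline_version: str
    stream_id: int
    stream_number: int


class RefDataStore:
    """Toy store that loads each reference stream at most once."""

    def __init__(self):
        self._loaded = set()

    def ensure_loaded(self, key, loader):
        """Load the stream if needed; return True if a load was performed."""
        if key in self._loaded:
            return False  # Already present, no additional load required.
        loader(key)  # The whole stream is loaded, not individual entries.
        self._loaded.add(key)
        return True


loads = []
store = RefDataStore()
key_v1 = RefStreamKey("pipe-uuid-1", "v1", 1001, 1)
key_v2 = key_v1._replace(pipeline_version="v2")

first = store.ensure_loaded(key_v1, loads.append)
second = store.ensure_loaded(key_v1, loads.append)
after_change = store.ensure_loaded(key_v2, loads.append)
print(first, second, after_change)
```

Because the pipeline version is part of the key, changing the loader pipeline produces a new key, so the same stream is loaded again.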
IMPORTANT: It should be noted that as the version of the reference data pipeline forms part of the criteria, if the reference loader pipeline is changed, for whatever reason, then this will invalidate ALL existing reference data associated with that reference loader pipeline.
Typically the reference loader pipeline is very static so this should not be an issue.
Standard practice is to convert raw reference data into reference-data:2 XML on receipt using a pipeline separate from the reference loader.
The reference loader is then only concerned with reading cooked reference-data:2 XML into the Reference Data Filter.
In instances where reference data streams are infrequently used it may be preferable not to convert the raw reference data on receipt but instead to do it in the reference loader pipeline.
Duplicate Keys
The Reference Data Filter pipeline element has a property overrideExistingValues.
If it is set to true and an entry is found in an effective stream with the same key as an entry that has already been loaded, the new entry will overwrite the existing one.
If it is set to false then the existing entry will be kept.
Entries are loaded in the order they are found in the reference-data:2 XML document.
If warnOnDuplicateKeys is set to true then a warning will be logged for any duplicate keys, whether an overwrite happens or not.
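The interaction of the two properties can be sketched in Python as follows. The function is a simplified model of the behaviour described above, using a dictionary as the store; the map name and values are made up.

```python
def load_entries(entries, override_existing_values, warn_on_duplicate_keys):
    """Load (map, key, value) entries in order, applying the duplicate rules."""
    store = {}
    warnings = []
    for map_name, key, value in entries:
        store_key = (map_name, key)
        if store_key in store:
            if warn_on_duplicate_keys:
                # Warn whether or not the overwrite goes ahead.
                warnings.append(f"Duplicate key {key!r} in map {map_name!r}")
            if not override_existing_values:
                continue  # Keep the entry that was loaded first.
        store[store_key] = value
    return store, warnings


entries = [("MAP", "k", "first"), ("MAP", "k", "second")]
print(load_entries(entries, True, True))
print(load_entries(entries, False, True))
```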
Value De-Duplication
Only unique values are held in the store to reduce the storage footprint.
This is useful given that typically, reference data updates may be received daily and each one is a full snapshot of the whole reference data.
As a result this can mean many copies of the same value being loaded into the store.
The store will only hold the first instance of duplicate values.
Querying the Reference Data Store
The reference data store can be queried within a Dashboard in Stroom by selecting Reference Data Store in the data source selection pop-up.
Querying the store is currently an experimental feature and is mostly intended for use in fault finding.
Given the localised nature of the reference data store the dashboard can currently only query the store on the node that the user interface is being served from.
In a multi-node environment where some nodes are UI only and most are processing only, the UI nodes will have no reference data in their store.
Purging Old Reference Data
Reference data loading and purging is done at the level of a reference stream.
Whenever a reference lookup is performed the last accessed time of the reference stream in the store is checked.
If it is older than one hour then it will be updated to the current time.
This last access time is used to determine reference streams that are no longer in active use and thus can be purged.
The Stroom job Ref Data Off-heap Store Purge is used to perform the purge operation on the Off-Heap reference data store.
No purge is required for the On-Heap store as that only holds transient data.
When the purge job is run it checks the time since each reference stream was accessed against the purge cut-off age.
The purge age is configured via the property stroom.pipeline.referenceData.purgeAge.
It is advised to schedule this job for quiet times when it is unlikely to conflict with reference loading operations as they will fight for access to the single write transaction.
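The last access tracking and purge logic described above can be modelled with a short Python sketch. The function names are hypothetical and a plain dictionary stands in for the store's stream metadata.

```python
from datetime import datetime, timedelta

ACCESS_UPDATE_THRESHOLD = timedelta(hours=1)


def record_access(last_accessed, stream_key, now):
    """Update the last accessed time only if it is more than an hour old."""
    previous = last_accessed.get(stream_key)
    if previous is None or now - previous > ACCESS_UPDATE_THRESHOLD:
        last_accessed[stream_key] = now


def purge(last_accessed, now, purge_age):
    """Remove streams not accessed within purge_age; return the purged keys."""
    cut_off = now - purge_age
    stale = [key for key, accessed in last_accessed.items() if accessed < cut_off]
    for key in stale:
        del last_accessed[key]
    return stale


start = datetime(2024, 1, 1, 12, 0)
last_accessed = {}
record_access(last_accessed, "stream-1", start)
record_access(last_accessed, "stream-1", start + timedelta(minutes=30))  # No update.
print(last_accessed["stream-1"])
```

Only updating the access time when it is over an hour old keeps the number of writes to the store low when the same stream is used for many lookups.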
Lookups
Lookups are performed in XSLT Filters using the XSLT functions.
In order to perform a lookup one or more reference feeds must be specified on the XSLT Filter pipeline element.
Each reference feed is specified along with a reference loader pipeline that will ingest the specified feed (optionally converting it into reference-data:2 XML if it is not already) and pass it into a Reference Data Filter pipeline element.
Reference Feeds & Loaders
In the XSLT Filter pipeline element multiple combinations of feed and reference loader pipeline can be specified.
There must be at least one in order to perform lookups.
If there are multiple then when a lookup is called for a given time, the effective stream for each feed/loader combination is determined.
The effective stream for each feed/loader combination will be loaded into the store, unless it is already present.
When the actual lookup is performed Stroom will try to find the key in each of the effective streams that have been loaded and that contain the map in the lookup call.
If the lookup is unsuccessful in the effective stream for the first feed/loader combination then it will try the next, and so on until it has tried all of them.
For this reason if you have multiple feed/loader combinations then order is important.
It is possible for multiple effective streams to contain the same map/key so a feed/loader combination higher up the list will trump one lower down with the same map/key.
Also if you have some lookups that may not return a value and others that should always return a value then the feed/loader for the latter should be higher up the list so it is searched first.
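The effect of ordering can be shown with a small Python sketch, in which each effective stream is modelled as a dictionary of maps and the list order represents the order of the feed/loader combinations. The data is made up for illustration.

```python
def lookup_in_order(effective_streams, map_name, key):
    """Search each effective stream in list order; the first hit wins."""
    for stream in effective_streams:
        entries = stream.get(map_name)
        if entries is not None and key in entries:
            return entries[key]
    return None


# Two hypothetical effective streams that both contain the same map.
primary = {"FQDN_TO_LOC": {"host-a": "Bristol"}}
fallback = {"FQDN_TO_LOC": {"host-a": "London", "host-b": "Leeds"}}

print(lookup_in_order([primary, fallback], "FQDN_TO_LOC", "host-a"))
print(lookup_in_order([primary, fallback], "FQDN_TO_LOC", "host-b"))
```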
Effective Streams
Reference data lookups have the concept of Effective Streams.
An effective stream is the most recent stream for a given Feed that has an effective date that is less than or equal to the date used for the lookup.
When performing a lookup, Stroom will search the stream store to find all the effective streams in a time bucket that surrounds the lookup time.
These sets of effective streams are cached so if a new reference stream is created it will not be used until the cached set has expired.
To rectify this you can clear the cache Reference Data - Effective Stream Cache on the Caches screen, accessed from Monitoring > Caches.
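The rule for choosing an effective stream (the most recent stream whose effective date is less than or equal to the lookup time) can be sketched in Python with a binary search. The stream IDs and dates are made up, and the time bucketing and caching described above are not modelled.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def effective_stream(streams, lookup_time):
    """Pick the latest stream effective at or before lookup_time.

    streams is a list of (effective_time, stream_id) sorted by effective_time.
    """
    times = [effective_time for effective_time, _ in streams]
    index = bisect_right(times, lookup_time)
    if index == 0:
        return None  # Nothing was effective yet, so the lookup would fail.
    return streams[index - 1][1]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


streams = [(utc(2020, 1, 1), 1001), (utc(2020, 2, 1), 1002)]
print(effective_stream(streams, utc(2020, 1, 15)))
```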
Standard Key/Value Lookups
Standard key/value lookups consist of a simple string key and a value that is either a simple string or an XML fragment.
Standard lookups are performed using the various forms of the stroom:lookup() XSLT function.
Note
If the key is not found and the key is an integer then it will attempt a range lookup using the same key.
This is to allow for maps that contain a mixture of key/value pairs and range/value pairs.
Range Lookups
Range lookups consist of a key that is an integer and a value that is either a simple string or an XML fragment.
For more detail on range lookups see the XSLT function stroom:lookup().
Note
The lookup will initially look for a single key that matches the lookup key.
If an exact match is not found then it will look for a range that contains the key.
This is to allow for maps that contain a mixture of key/value pairs and range/value pairs.
Nested Map Lookups
Nested map lookups involve chaining a number of lookups with the value of each map being used as the key for the next.
For more detail on nested lookups see the XSLT function stroom:lookup().
Bitmap Lookups
A bitmap lookup is a special kind of lookup that actually performs a lookup for each enabled bit position of the passed bitmap value.
For more detail on bitmap lookups see the XSLT function stroom:bitmap-lookup().
Values can either be a simple string or an XML fragment.
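The idea of one lookup per enabled bit can be sketched in Python. This sketch assumes that bit positions are counted from zero at the least significant bit and that the map is keyed by the position as a string; check the stroom:bitmap-lookup() documentation for the actual convention.

```python
def enabled_bit_positions(bitmap):
    """Return the positions of the set bits, counting from 0 at the low end."""
    return [pos for pos in range(bitmap.bit_length()) if bitmap & (1 << pos)]


def bitmap_lookup(entries, bitmap):
    """Look up each enabled bit position and collect the values found."""
    results = []
    for pos in enabled_bit_positions(bitmap):
        value = entries.get(str(pos))  # Assumed key format: position as text.
        if value is not None:
            results.append(value)
    return results


# Hypothetical map of bit position to permission name.
permissions = {"1": "READ", "2": "WRITE", "3": "EXECUTE"}
print(bitmap_lookup(permissions, 0b1010))
```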
Context data lookups
Some event streams have a Context stream associated with them.
Context streams allow the system sending the events to Stroom to supply an additional stream of data that provides context to the raw event stream.
This can be useful when the system sending the events has no control over the event content but needs to supply additional information.
The context stream can be used in lookups as a reference source to decorate events on receipt.
Context reference data is specific to a single event stream so is transient in nature, therefore the On-Heap store is used to hold it for the duration of the event stream processing only.
Typically the reference loader for a context stream will include a translation step to convert the raw context data into reference-data:2 XML.