User Guide
- 1: Application Programming Interfaces (API)
- 1.1: API Specification
- 1.2: Calling an API
- 1.3: Query APIs
- 1.4: Export Content API
- 1.5: Reference Data
- 2: Background Jobs
- 2.1: Scheduler
- 3: Concepts
- 3.1: Streams
- 4: Data Retention
- 5: Data Splitter
- 5.1: Simple CSV Example
- 5.2: Simple CSV example with heading
- 5.3: Complex example with regex and user defined names
- 5.4: Multi Line Example
- 5.5: Element Reference
- 5.5.1: Content Providers
- 5.5.2: Expressions
- 5.5.3: Variables
- 5.5.4: Output
- 5.6: Match References, Variables and Fixed Strings
- 5.6.1: Expression match references
- 5.6.2: Variable reference
- 5.6.3: Use of fixed strings
- 5.6.4: Concatenation of references
- 6: Event Feeds
- 7: Indexing data
- 7.1: Elasticsearch
- 7.1.1: Introduction
- 7.1.2: Getting Started
- 7.1.3: Indexing data
- 7.1.4: Exploring Data in Kibana
- 7.2: Lucene Indexes
- 7.3: Solr Integration
- 8: Nodes
- 9: Pipelines
- 9.1: Pipeline Recipes
- 9.2: Parser
- 9.2.1: XML Fragments
- 9.3: XSLT Conversion
- 9.3.1: XSLT Basics
- 9.3.2: XSLT Functions
- 9.3.3: XSLT Includes
- 9.4: File Output
- 9.5: Reference Data
- 9.6: Context Data
- 10: Properties
- 11: Roles
- 12: Searching Data
- 12.1: Data Sources
- 12.1.1: Lucene Index Data Source
- 12.1.2: Statistics
- 12.1.3: Elasticsearch
- 12.1.4: Internal Data Sources
- 12.2: Dashboards
- 12.2.1: Queries
- 12.2.2: Internal Links
- 12.2.3: Direct URLs
- 12.3: Query
- 12.3.1: Stroom Query Language
- 12.4: Analytic Rules
- 12.5: Search Extraction
- 12.6: Dictionaries
- 13: Security
- 14: Tools
- 14.1: Command Line Tools
- 14.2: Stream Dump Tool
- 15: User Content
- 15.1: Editing Text
- 15.2: Naming Conventions
- 15.3: Documenting content
- 15.4: Finding Things
- 16: Viewing Data
- 17: Volumes
1 - Application Programming Interfaces (API)
Stroom has many public REST APIs to allow other systems to interact with Stroom. Everything that can be done via the user interface can also be done using the API.
All methods on the API are authenticated and authorised, so the permissions will be exactly the same as if the API user were using the Stroom user interface directly.
1.1 - API Specification
Swagger UI
The APIs are available as a Swagger Open API specification in the following forms:
- JSON - stroom.json
- YAML - stroom.yaml
A dynamic Swagger user interface is also available for viewing all the API endpoints with details of parameters and data types. This can be found in two places:
- Published on GitHub for each minor version of Stroom (Swagger user interface).
- Published on a running Stroom instance at the path /stroom/noauth/swagger-ui.
API Endpoints in Application Logs
The API methods are also all listed in the application logs when Stroom first boots up, e.g.
INFO 2023-01-17T11:09:30.244Z main i.d.j.DropwizardResourceConfig The following paths were found for the configured resources:
GET /api/account/v1/ (stroom.security.identity.account.AccountResourceImpl)
POST /api/account/v1/ (stroom.security.identity.account.AccountResourceImpl)
POST /api/account/v1/search (stroom.security.identity.account.AccountResourceImpl)
DELETE /api/account/v1/{id} (stroom.security.identity.account.AccountResourceImpl)
GET /api/account/v1/{id} (stroom.security.identity.account.AccountResourceImpl)
PUT /api/account/v1/{id} (stroom.security.identity.account.AccountResourceImpl)
GET /api/activity/v1 (stroom.activity.impl.ActivityResourceImpl)
POST /api/activity/v1 (stroom.activity.impl.ActivityResourceImpl)
POST /api/activity/v1/acknowledge (stroom.activity.impl.ActivityResourceImpl)
GET /api/activity/v1/current (stroom.activity.impl.ActivityResourceImpl)
...
You will also see entries in the logs for the various servlets exposed by Stroom, e.g.
INFO ... main s.d.common.Servlets Adding servlets to application path/port:
INFO ... main s.d.common.Servlets stroom.core.servlet.DashboardServlet => /stroom/dashboard
INFO ... main s.d.common.Servlets stroom.core.servlet.DynamicCSSServlet => /stroom/dynamic.css
INFO ... main s.d.common.Servlets stroom.data.store.impl.ImportFileServlet => /stroom/importfile.rpc
INFO ... main s.d.common.Servlets stroom.receive.common.ReceiveDataServlet => /stroom/noauth/datafeed
INFO ... main s.d.common.Servlets stroom.receive.common.ReceiveDataServlet => /stroom/noauth/datafeed/*
INFO ... main s.d.common.Servlets stroom.receive.common.DebugServlet => /stroom/noauth/debug
INFO ... main s.d.common.Servlets stroom.data.store.impl.fs.EchoServlet => /stroom/noauth/echo
INFO ... main s.d.common.Servlets stroom.receive.common.RemoteFeedServiceRPC => /stroom/noauth/remoting/remotefeedservice.rpc
INFO ... main s.d.common.Servlets stroom.core.servlet.StatusServlet => /stroom/noauth/status
INFO ... main s.d.common.Servlets stroom.core.servlet.SwaggerUiServlet => /stroom/noauth/swagger-ui
INFO ... main s.d.common.Servlets stroom.resource.impl.SessionResourceStoreImpl => /stroom/resourcestore/*
INFO ... main s.d.common.Servlets stroom.dashboard.impl.script.ScriptServlet => /stroom/script
INFO ... main s.d.common.Servlets stroom.security.impl.SessionListServlet => /stroom/sessionList
INFO ... main s.d.common.Servlets stroom.core.servlet.StroomServlet => /stroom/ui
1.2 - Calling an API
Authentication
In order to use the API endpoints you will need to authenticate. Authentication is achieved using an API Key or Token .
You will either need to create an API key for your personal Stroom user account or for a shared processing user account. Whichever user account you use it will need to have the necessary permissions for each API endpoint it is to be used with.
To create an API key (token) for a user:
- Open the API Keys screen from the top menu.
- Click Create.
- Enter a suitable expiration date. Short expiry periods are more secure in case the key is compromised.
- Select the user account that you are creating the key for.
- Confirm to create the key.
- Select the newly created API Key from the list of keys and double click it to open it.
- Click the copy button to copy the key to the clipboard.
To make an authenticated API call you need to provide a header of the form Authorization:Bearer ${TOKEN}, where ${TOKEN} is your API Key as copied from Stroom.
Calling an API method with curl
This section describes how to call an API method using the command line tool curl as an example client.
Other clients can be used, e.g. using python, but these examples should provide enough help to get started using another client.
HTTP Requests Without a Body
Typically HTTP GET requests will have no body/payload. Often PUT and DELETE requests will also have no body/payload.
The following is an example of how to call an HTTP GET method (i.e. a method that does not require a request body) on the API using curl.
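For example, the following sketch (not an official example; the hostname is a placeholder and the endpoint path is one of the GET paths shown in the application log listing above) calls a simple GET endpoint:

# Export your API Key into a variable first
TOKEN='your-api-key-here'

curl \
  --silent \
  --insecure \
  --header "Authorization:Bearer ${TOKEN}" \
  https://stroom-fqdn/api/activity/v1/current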
Warning
The --insecure argument is used in this example, which means certificate verification will not take place. It is recommended not to use this argument and instead supply curl with client and certificate authority certificates to make a secure connection.
You can either call the API via Nginx (or similar reverse proxy) at https://stroom-fqdn/api/some/path or, if you are making the call from one of the Stroom hosts, you can go direct using http://localhost:8080/api/some/path. The former is preferred as it is more secure.
Requests With a Body
A lot of the API methods in Stroom require complex bodies/payloads for the request.
The following example is an HTTP POST to perform a reference data lookup on the local host.
Create a file req.json containing:
{
"mapName": "USER_ID_TO_STAFF_NO_MAP",
"effectiveTime": "2024-12-02T08:37:02.772Z",
"key": "user2",
"referenceLoaders": [
{
"loaderPipeline" : {
"name" : "Reference Loader",
"uuid" : "da1c7351-086f-493b-866a-b42dbe990700",
"type" : "Pipeline"
},
"referenceFeed" : {
"name": "STAFF-NO-REFERENCE",
"uuid": "350003fe-2b6c-4c57-95ed-2e6018c5b3d5",
"type" : "Feed"
}
}
]
}
Now send the request with curl.
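For example, something like the following sketch (the endpoint path is the reference data lookup endpoint described in the Reference Data section of this guide; the hostname is a placeholder for your own deployment):

curl \
  --silent \
  --insecure \
  --header "Authorization:Bearer ${TOKEN}" \
  --json @req.json \
  https://stroom-fqdn/api/refData/v1/lookup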
This API method returns plain text or XML depending on the reference data value.
Note
This assumes you are using curl version 7.82.0 or later, which supports the --json argument. If not, you will need to replace --json with --data and add these arguments:
--header "Content-Type: application/json"
--header "Accept: application/json"
Handling JSON
jq is a utility for processing JSON and is very useful when using the API methods.
For example to get just the build version from the node info endpoint:
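The following is an illustrative sketch only; the node name node1a, the exact endpoint path and the buildInfo.buildVersion field name are assumptions that should be checked against the Swagger definition:

curl \
  --silent \
  --insecure \
  --header "Authorization:Bearer ${TOKEN}" \
  https://stroom-fqdn/api/node/v1/info/node1a \
  | jq -r '.buildInfo.buildVersion'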
1.3 - Query APIs
The Query APIs use common request/response models and end points for querying each type of data source held in Stroom. The request/response models are defined in stroom-query .
Currently Stroom exposes a set of query endpoints for the following data source types. Each data source type will have its own endpoint due to differences in the way the data is queried and the restrictions imposed on the query terms. However they all share the same API definition.
- stroom-index Queries - The Lucene based search indexes.
- Sql Statistics Query - Stroom’s SQL Statistics store.
- Searchable - Searchables are various data sources that allow you to search the internals of Stroom, e.g. local reference data store, annotations, processor tasks, etc.
The detailed documentation for the request/responses is contained in the Swagger definition linked to above.
Common endpoints
The standard query endpoints are described in the sections below.
Datasource
The Data Source endpoint is used to query Stroom for the details of a data source with a given DocRef . The details will include such things as the fields available and any restrictions on querying the data.
Search
The search endpoint is used to initiate a search against a data source or to request more data for an active search. A search request can be made in iterative mode, where it will perform the search and then return only the data it has immediately available. Subsequent requests for the same queryKey will also return the data immediately available, on the expectation that more results will have been found by then. Requesting a search in non-iterative mode will result in the response being returned only when the query has completed and all known results have been found.
The SearchRequest model is fairly complicated and contains not only the query terms but also a definition of how the data should be returned. A single SearchRequest can include multiple ResultRequest sections to return the queried data in multiple ways, e.g. as flat data and in an alternative aggregated form.
Stroom as a query builder
Stroom is able to export the JSON form of a SearchRequest model from its dashboards. This makes the dashboard a useful tool for building a query and the table settings to go with it. You can use the dashboard to define the data source, define the query terms tree and build a table definition (or definitions) to describe how the data should be returned. Then, clicking the download icon on the query pane of the dashboard will generate the SearchRequest JSON, which can be used immediately with the /search API or modified to suit.
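The downloaded JSON will look something like the following, heavily simplified, sketch. The field names and structure shown here are illustrative assumptions only; take the exact model from the Swagger definition or from a real dashboard download.

{
  "key": { "uuid": "7740bcd0-a49e-4c22-8540-044f85770716" },
  "query": {
    "dataSource": { "type": "Index", "uuid": "some-index-uuid", "name": "Example Index" },
    "expression": {
      "type": "operator",
      "op": "AND",
      "children": [
        { "type": "term", "field": "Feed", "condition": "EQUALS", "value": "MY-FEED" }
      ]
    }
  },
  "resultRequests": [
    { "componentId": "table-1", "fetch": "ALL" }
  ],
  "incremental": true
}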
Destroy
This endpoint is used to kill an active query by supplying the queryKey for the query in question.
Keep alive
Stroom will only hold search results from completed queries for a certain length of time. It will also terminate running queries that are too old. In order to prevent queries being aged off, you can hit this endpoint to indicate to Stroom that you still have an interest in a particular query by supplying the query key.
1.4 - Export Content API
Stroom has API methods for exporting content in Stroom to a single zip file.
Export All - /api/export/v1
This method will export all content in Stroom to a single zip file. This is useful as an alternative backup of the content or where you need to export the content for import into another Stroom instance.
In order to perform a full export, the user (identified by their API Key) performing the export will need to ensure the following:
- They have created an API Key.
- The system property stroom.export.enabled is set to true.
- They have the application permission Export Configuration or Administrator.

Only those items that the user has Read permission on will be exported, so to export all items, the user performing the export will need Read permission on all items or have the Administrator application permission.
Performing an Export
To export all readable content to a file called export.zip, do something like the following:
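For example (a sketch only; the use of a GET request and the hostname are assumptions, the endpoint path is as given above):

curl \
  --silent \
  --insecure \
  --header "Authorization:Bearer ${TOKEN}" \
  --output export.zip \
  https://stroom-fqdn/api/export/v1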
Note
If you encounter problems then replace --silent with --verbose to get more information.
Export Zip Format
The export zip will contain a number of files for each document exported. The number and type of these files will depend on the type of document, however every document will have the following two file types:
- .node - This file represents the document’s location in the explorer tree along with its name and UUID.
- .meta - This is the metadata for the document independent of the explorer tree. It contains the name, type and UUID of the document along with the unique identifier for the version of the document.

Documents may also have files like these (a non-exhaustive list):

- .json - JSON data holding the content of the document, as used for Dashboards.
- .txt - Plain text data holding the content of the document, as used for Dictionaries.
- .xml - XML data holding the content of the document, as used for Pipelines.
- .xsd - XML Schema content.
- .xsl - XSLT content.
The following is an example of the content of an export zip file:
TEST_FEED_CERT.Feed.fcee4270-a479-4cc0-a79c-0e8f18a4bad8.meta
TEST_FEED_CERT.Feed.fcee4270-a479-4cc0-a79c-0e8f18a4bad8.node
TEST_FEED_PROXY.Feed.f06d4416-8b0e-4774-94a9-729adc5633aa.meta
TEST_FEED_PROXY.Feed.f06d4416-8b0e-4774-94a9-729adc5633aa.node
TEST_REFERENCE_DATA_EVENTS_XXX.XSLT.4f74999e-9d69-47c7-97f7-5e88cc7459f7.meta
TEST_REFERENCE_DATA_EVENTS_XXX.XSLT.4f74999e-9d69-47c7-97f7-5e88cc7459f7.xsl
TEST_REFERENCE_DATA_EVENTS_XXX.XSLT.4f74999e-9d69-47c7-97f7-5e88cc7459f7.node
Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.xml
Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.meta
Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.node
Filenames
When documents are added to the zip, they are added with a directory structure that mirrors the explorer tree.
The filenames are of the form:
<name>.<type>.<UUID>.<extension>
As Stroom allows characters in document and folder names that would not be supported in operating system paths (or would cause confusion), some characters in the name/directory parts are replaced by _ to avoid this, e.g. Dashboard 01/02/2020 would become Dashboard_01_02_2020.
If you need to see the contents of the zip as if viewing it within Stroom you can run this bash script in the root of the extracted zip.
#!/usr/bin/env bash
shopt -s globstar
for node_file in **/*.node; do
name=
name="$(grep -o -P "(?<=name=).*" "${node_file}" )"
path=
path="$(grep -o -P "(?<=path=).*" "${node_file}" )"
echo "./${path}/${name} (./${node_file})"
done
This will output something like:
./Standard Pipelines/Json/Events to JSON (./Standard_Pipelines/Json/Events_to_JSON.XSLT.1c3d42c2-f512-423f-aa6a-050c5cad7c0f.node)
./Standard Pipelines/Json/JSON Extraction (./Standard_Pipelines/Json/JSON_Extraction.Pipeline.13143179-b494-4146-ac4b-9a6010cada89.node)
./Standard Pipelines/Json/JSON Search Extraction (./Standard_Pipelines/Json/JSON_Search_Extraction.XSLT.a8c1aa77-fb90-461a-a121-d4d87d2ff072.node)
./Standard Pipelines/Reference Loader (./Standard_Pipelines/Reference_Loader.Pipeline.da1c7351-086f-493b-866a-b42dbe990700.node)
1.5 - Reference Data
The reference data store has an API to allow other systems to access it.
/api/refData/v1/lookup
The /lookup endpoint requires the caller to provide details of the reference feed and loader pipeline so that, if the effective stream is not in the store, it can be loaded prior to performing the lookup.
It is useful for forcing a reference load into the store and for performing ad-hoc lookups.
Note
As reference data stores are local to a node, it is best to send the request to a node that does processing, as it is more likely to have already loaded the data. If you send it to a UI node that does not do processing, it is likely to trigger a load as the data will not be there.
Below is an example of a lookup request file req.json.
{
"mapName": "USER_ID_TO_LOCATION",
"effectiveTime": "2020-12-02T08:37:02.772Z",
"key": "jbloggs",
"referenceLoaders": [
{
"loaderPipeline" : {
"name" : "Reference Loader",
"uuid" : "da1c7351-086f-493b-866a-b42dbe990700",
"type" : "Pipeline"
},
"referenceFeed" : {
"name": "USER_ID_TOLOCATION-REFERENCE",
"uuid": "60f9f51d-e5d6-41f5-86b9-ae866b8c9fa3",
"type" : "Feed"
}
}
]
}
This is an example of how to perform the lookup on the local host.
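The following sketch assumes req.json is in the current directory and that Stroom is listening on the default application port 8080 mentioned in Calling an API:

curl \
  --silent \
  --request POST \
  --header "Authorization:Bearer ${TOKEN}" \
  --header "Content-Type: application/json" \
  --header "Accept: application/json" \
  --data @req.json \
  http://localhost:8080/api/refData/v1/lookup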
2 - Background Jobs
There are various jobs that run in the background within Stroom. Among these are jobs that control pipeline processing, removing old files from the file system, checking the status of nodes and volumes etc. Each job executes at a different time depending on the purpose of the job. There are three ways that a job can be executed:
- Cron scheduled jobs execute periodically according to a cron schedule. These include jobs such as cleaning the file system where Stroom only needs to perform this action once a day and can do so overnight.
- Frequency controlled jobs are executed every X seconds, minutes, hours etc. Most of the jobs that execute with a given frequency are status checking jobs that perform a short lived action fairly frequently.
- Distributed jobs are only applicable to stream processing with a pipeline. Distributed jobs are executed by a worker node as soon as a worker has available threads to execute a job and the task distributor has work available.
A list of job types and their execution method can be seen by opening Jobs from the main menu.
Each job can be enabled/disabled at the job level. If you click on a job you will see an entry for each Stroom node in the lower pane. The job can be enabled/disabled at the node level for fine grained control of which nodes are running which jobs.
For a full list of all the jobs and details of what each one does, see the Job reference.
2.1 - Scheduler
Stroom has two main types of schedule, a simple frequency schedule that runs the job at a fixed time interval or a more complex cron schedule.
Note
This scheduler and its syntax are also used for Analytic Rules.
Frequency Schedules
A frequency schedule is expressed as a fixed time interval.
The frequency schedule expression syntax is Stroom’s standard duration syntax and takes the form of a value followed by an optional unit suffix, e.g. 10m for ten minutes.
| Suffix | Time Unit |
|---|---|
| (no suffix) | milliseconds |
| ms | milliseconds |
| s | seconds |
| m | minutes |
| h | hours |
| d | days |
Cron Schedules
cron is a syntax for expressing schedules.
For full details of cron expressions see Cron Syntax.
Stroom uses a scheduler called Quartz, which supports cron expressions for scheduling.
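For illustration only (see the Cron Syntax reference for the full field list), Quartz cron expressions include a seconds field, so an expression of the following form would run a job at 2am every day:

0 0 2 * * ?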
3 - Concepts
3.1 - Streams
Streams can either be created when data is directly POSTed in to Stroom or during the proxy aggregation process. When data is directly POSTed to Stroom the content of the POST will be stored as one Stream. With proxy aggregation, multiple files in the proxy repository can be aggregated together into a single Stream.
Anatomy of a Stream
A Stream is made up of a number of parts of which the raw or cooked data is just one. In addition to the data the Stream can contain a number of other child stream types, e.g. Context and Meta Data.
The hierarchy of a stream is as follows:
- Stream nnn
  - Part [1 to *]
    - Data [1-1]
    - Context [0-1]
    - Meta Data [0-1]
Although all streams conform to the above hierarchy there are three main types of Stream that are used in Stroom:
- Non-segmented Stream - Raw events, Raw Reference
- Segmented Stream - Events, Reference
- Segmented Error Stream - Error
Segmented means that the data has been demarcated into segments or records.
Child Stream Types
Data
This is the actual data of the stream, e.g. the XML events, raw CSV, JSON, etc.
Context
This is additional contextual data that can be sent with the data. Context data can be used for reference data lookups.
Meta Data
This is the data about the Stream (e.g. the feed name, receipt time, user agent, etc.). This meta data either comes from the HTTP headers when the data was POSTed to Stroom or is added by Stroom or Stroom-Proxy on receipt/processing.
Non-Segmented Stream
The following is a representation of a non-segmented stream with three parts, each with Meta Data and Context child streams.
Raw Events and Raw Reference streams contain non-segmented data, e.g. a large batch of CSV, JSON, XML, etc. data. There is no notion of a record/event/segment in the data, it is simply data in any form (including malformed data) that is yet to be processed and demarcated into records/events, for example using a Data Splitter or an XML parser.
The Stream may be single-part or multi-part depending on how it is received. If it is the product of proxy aggregation then it is likely to be multi-part. Each part will have its own context and meta data child streams, if applicable.
Segmented Stream
The following is a representation of a segmented stream that contains three records (i.e. events) and the Meta Data.
Cooked Events and Reference data are forms of segmented data. The raw data has been parsed and split into records/events and the resulting data is stored in a way that allows Stroom to know where each record/event starts/ends. These streams only have a single part.
Error Stream
Error streams are similar to segmented Event/Reference streams in that they are single-part and have demarcated records (where each error/warning/info message is a record). Error streams do not have any Meta Data or Context child streams.
4 - Data Retention
By default Stroom will retain all the data it ingests and creates forever. It is likely that storage constraints/costs will mean that data needs to be deleted after a certain time. It is also likely that certain types of data may need to be kept for longer than other types.
Rules
Stroom allows for a set of data retention policy rules to be created to control at a fine grained level what data is deleted and what is retained.
The data retention rules are accessible by selecting Data Retention from the Tools menu. On first use the Rules tab of the Data Retention screen will show a single rule named Default Retain All Forever Rule. This is the implicit rule in Stroom that retains all data and is always in play unless another rule overrides it. This rule cannot be edited, moved or removed.
Rule Precedence
Rules have a precedence, with a lower rule number being a higher priority.
When running the data retention job, Stroom will look at each stream held on the system and the retention policy of the first rule (starting from the lowest numbered rule) that matches that stream will apply.
Once a matching rule is found, all other rules with higher rule numbers (lower priority) are ignored.
For example if rule 1 says to retain streams from feed X-EVENTS
for 10 years and rule 2 says to retain streams from feeds *-EVENTS
for 1 year then rule 1 would apply to streams from feed X-EVENTS
and they would be kept for 10 years, but rule 2 would apply to feed Y-EVENTS
and they would only be kept for 1 year.
Rules are re-numbered as new rules are added/deleted/moved.
Creating a Rule
To create a rule do the following:
- Click the add icon to add a new rule.
- Edit the expression to define the data that the rule will match on.
- Provide a name for the rule to help describe what its purpose is.
- Set the retention period for data matching this rule, i.e. Forever or a set time period.
The new rule will be added at the top of the list of rules, i.e. with the highest priority. The up and down icons can be used to change the priority of the rule.
Rules can be enabled/disabled by clicking the checkbox next to the rule.
Changes to rules will not take effect until the save icon is clicked.
Rules can also be deleted and copied using the corresponding icons.
Impact Summary
When you have a number of complex rules it can be difficult to determine what data will actually be deleted next time the Policy Based Data Retention job runs. To help with this, Stroom has the Impact Summary tab that acts as a dry run for the active rules. The impact summary provides a count of the number of streams that will be deleted broken down by rule, stream type and feed name. On large systems with lots of data or complex rules, this query may take a long time to run.
The impact summary operates on the current state of the rules on the Rules tab whether saved or un-saved. This allows you to make a change to the rules and test its impact before saving it.
5 - Data Splitter
Data Splitter was created to transform text into XML. The XML produced is basic but can be processed further with XSLT to form any desired XML output.
Data Splitter works by using regular expressions to match a region of content or tokenizers to split content. The whole match or match group can then be output or passed to other expressions to further divide the matched data.
The root <dataSplitter>
element controls the way content is read and buffered from the source. It then passes this content on to one or more child expressions that attempt to match the content. The child expressions attempt to match content one at a time in the order they are specified until one matches. The matching expression then passes the content that it has matched to other elements that either emit XML or apply other expressions to the content matched by the parent.
This process of content supply, match, (supply, match)*, emit is best illustrated in a simple CSV example. Note that the elements and attributes used in all examples are explained in detail in the element reference.
5.1 - Simple CSV Example
The following CSV data will be split up into separate fields using Data Splitter.
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,
The first thing we need to do is match each record. Each record in a CSV file is delimited by a new line character. The following configuration will split the data into records using ‘\n’ as a delimiter:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n"/>
</dataSplitter>
In the above example the ‘split’ tokenizer matches all of the supplied content up to the end of each line ready to pass each line of content on for further treatment.
We can now add a <group>
element within <split>
to take content matched by the tokenizer.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n">
<!-- Take the matched line (using group 1 ignores the delimiters,
without this each match would include the new line character) -->
<group value="$1">
</group>
</split>
</dataSplitter>
The <group>
within the <split>
chooses to take the content from the <split>
without including the new line ‘\n’ delimiter by using match group 1, see expression match references for details.
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
The content selected by the <group>
from its parent match can then be passed onto sub expressions for further matching:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n">
<!-- Take the matched line (using group 1 ignores the delimiters,
without this each match would include the new line character) -->
<group value="$1">
<!-- Match each value separated by a comma as the delimiter -->
<split delimiter=",">
</split>
</group>
</split>
</dataSplitter>
In the above example the additional <split>
element within the <group>
will match the content provided by the group repeatedly until it has used all of the group content.
The content matched by the inner <split>
element can be passed to a <data>
element to emit XML content.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match each line using a new line character as the delimiter -->
<split delimiter="\n">
<!-- Take the matched line (using group 1 ignores the delimiters,
without this each match would include the new line character) -->
<group value="$1">
<!-- Match each value separated by a comma as the delimiter -->
<split delimiter=",">
<!-- Output the value from group 1 (as above using group 1
ignores the delimiters, without this each value would include
the comma) -->
<data value="$1" />
</split>
</group>
</split>
</dataSplitter>
In the above example each match from the inner <split>
is made available to the inner <data>
element that chooses to output content from match group 1, see expression match references for details.
The above configuration results in the following XML output for the whole input:
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data value="01/01/2010" />
<data value="00:00:00" />
<data value="192.168.1.100" />
<data value="SOMEHOST.SOMEWHERE.COM" />
<data value="user1" />
<data value="logon" />
</record>
<record>
<data value="01/01/2010" />
<data value="00:01:00" />
<data value="192.168.1.100" />
<data value="SOMEHOST.SOMEWHERE.COM" />
<data value="user1" />
<data value="create" />
<data value="c:\test.txt" />
</record>
<record>
<data value="01/01/2010" />
<data value="00:02:00" />
<data value="192.168.1.100" />
<data value="SOMEHOST.SOMEWHERE.COM" />
<data value="user1" />
<data value="logoff" />
</record>
</records>
5.2 - Simple CSV example with heading
In addition to referencing content produced by a parent element it is often desirable to store content and reference it later. The following example of a CSV with a heading demonstrates how content can be stored in a variable and then referenced later on.
Input
This example will use a similar input to the one in the previous CSV example but also adds a heading line.
Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match heading line (note that maxMatch="1" means that only the
first line will be matched by this splitter) -->
<split delimiter="\n" maxMatch="1">
<!-- Store each heading in a named list -->
<group>
<split delimiter=",">
<var id="heading" />
</split>
</group>
</split>
<!-- Match each record -->
<split delimiter="\n">
<!-- Take the matched line -->
<group value="$1">
<!-- Split the line up -->
<split delimiter=",">
<!-- Output the stored heading for each iteration and the value
from group 1 -->
<data name="$heading$1" value="$1" />
</split>
</group>
</split>
</dataSplitter>
Output
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:00:00" />
<data name="IPAddress" value="192.168.1.100" />
<data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="User" value="user1" />
<data name="EventType" value="logon" />
</record>
<record>
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:01:00" />
<data name="IPAddress" value="192.168.1.100" />
<data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="User" value="user1" />
<data name="EventType" value="create" />
<data name="Detail" value="c:\test.txt" />
</record>
<record>
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:02:00" />
<data name="IPAdress" value="192.168.1.100" />
<data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="User" value="user1" />
<data name="EventType" value="logoff" />
</record>
</records>
5.3 - Complex example with regex and user defined names
The following example uses a real world Apache log and demonstrates the use of regular expressions rather than simple ‘split’ tokenizers. The usage and structure of regular expressions is outside of the scope of this document but Data Splitter uses Java’s standard regular expression library that is POSIX compliant and documented in numerous places.
This example also demonstrates that the names and values that are output can be hard coded in the absence of field name information to make XSLT conversion easier later on. Also shown is that any match can be divided into further fields with additional expressions and the ability to nest data elements to provide structure if needed.
Input
192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /doc.htm HTTP/1.1" 200 4235 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /default.css HTTP/1.1" 200 3494 "http://some.server:8080/doc.htm" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!--
Standard Apache Format
%h - host name should be ok without quotes
%l - Remote logname (from identd, if supplied). This will return a dash unless IdentityCheck is set On.
\"%u\" - user name should be quoted to deal with DNs
%t - time is added in square brackets so is contained for parsing purposes
\"%r\" - URL is quoted
%>s - Response code doesn't need to be quoted as it is a single number
%b - The size in bytes of the response sent to the client
\"%{Referer}i\" - Referrer is quoted so that’s ok
\"%{User-Agent}i\" - User agent is quoted so also ok
LogFormat "%h %l \"%u\" %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
-->
<!-- Match line -->
<split delimiter="\n">
<group value="$1">
<!-- Provide a regular expression for the whole line with match
groups for each field we want to split out -->
<regex pattern="^([^ ]+) ([^ ]+) "([^"]+)" \[([^\]]+)] "([^"]+)" ([^ ]+) ([^ ]+) "([^"]+)" "([^"]+)"">
<data name="host" value="$1" />
<data name="log" value="$2" />
<data name="user" value="$3" />
<data name="time" value="$4" />
<data name="url" value="$5">
<!-- Take the 5th regular expression group and pass it to
another expression to divide into smaller components -->
<group value="$5">
<regex pattern="^([^ ]+) ([^ ]+) ([^ /]*)/([^ ]*)">
<data name="httpMethod" value="$1" />
<data name="url" value="$2" />
<data name="protocol" value="$3" />
<data name="version" value="$4" />
</regex>
</group>
</data>
<data name="response" value="$6" />
<data name="size" value="$7" />
<data name="referrer" value="$8" />
<data name="userAgent" value="$9" />
</regex>
</group>
</split>
</dataSplitter>
Output
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data name="host" value="192.168.1.100" />
<data name="log" value="-" />
<data name="user" value="-" />
<data name="time" value="12/Jul/2012:11:57:07 +0000" />
<data name="url" value="GET /doc.htm HTTP/1.1">
<data name="httpMethod" value="GET" />
<data name="url" value="/doc.htm" />
<data name="protocol" value="HTTP" />
<data name="version" value="1.1" />
</data>
<data name="response" value="200" />
<data name="size" value="4235" />
<data name="referrer" value="-" />
<data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
</record>
<record>
<data name="host" value="192.168.1.100" />
<data name="log" value="-" />
<data name="user" value="-" />
<data name="time" value="12/Jul/2012:11:57:07 +0000" />
<data name="url" value="GET /default.css HTTP/1.1">
<data name="httpMethod" value="GET" />
<data name="url" value="/default.css" />
<data name="protocol" value="HTTP" />
<data name="version" value="1.1" />
</data>
<data name="response" value="200" />
<data name="size" value="3494" />
<data name="referrer" value="http://some.server:8080/doc.htm" />
<data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
</record>
</records>
5.4 - Multi Line Example
Example multi-line file where records are split over many lines. There are various ways this data could be treated, but this example forms a record from data created when some fictitious query starts plus the subsequent query results.
Input
09/07/2016 14:49:36 User = user1
09/07/2016 14:49:36 Query = some query
09/07/2016 16:34:40 Results:
09/07/2016 16:34:40 Line 1: result1
09/07/2016 16:34:40 Line 2: result2
09/07/2016 16:34:40 Line 3: result3
09/07/2016 16:34:40 Line 4: result4
09/07/2009 16:35:21 User = user2
09/07/2009 16:35:21 Query = some other query
09/07/2009 16:45:36 Results:
09/07/2009 16:45:36 Line 1: result1
09/07/2009 16:45:36 Line 2: result2
09/07/2009 16:45:36 Line 3: result3
09/07/2009 16:45:36 Line 4: result4
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match each record. We want to treat the query and results as a single event so match the two sets of data separated by a double new line -->
<regex pattern="\n*((.*\n)+?\n(.*\n)+?\n)|\n*(.*\n?)+">
<group>
<!-- Split the record into query and results -->
<regex pattern="(.*?)\n\n(.*)" dotAll="true">
<!-- Create a data element to output query data -->
<data name="query">
<group value="$1">
<!-- We only want to output the date and time from the first line. -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
<data name="date" value="$1" />
<data name="time" value="$2" />
<data name="$3" value="$4" />
</regex>
<!-- Output all other values -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
<data name="$3" value="$4" />
</regex>
</group>
</data>
<!-- Create a data element to output result data -->
<data name="results">
<group value="$2">
<!-- We only want to output the date and time from the first line. -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
<data name="date" value="$1" />
<data name="time" value="$2" />
<data name="$3" value="$4" />
</regex>
<!-- Output all other values -->
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
<data name="$3" value="$4" />
</regex>
</group>
</data>
</regex>
</group>
</regex>
</dataSplitter>
Output
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="2.0">
<record>
<data name="query">
<data name="date" value="09/07/2016" />
<data name="time" value="14:49:36" />
<data name="User" value="user1" />
<data name="Query" value="some query" />
</data>
<data name="results">
<data name="date" value="09/07/2016" />
<data name="time" value="16:34:40" />
<data name="Results" />
<data name="Line 1" value="result1" />
<data name="Line 2" value="result2" />
<data name="Line 3" value="result3" />
<data name="Line 4" value="result4" />
</data>
</record>
<record>
<data name="query">
<data name="date" value="09/07/2016" />
<data name="time" value="16:35:21" />
<data name="User" value="user2" />
<data name="Query" value="some other query" />
</data>
<data name="results">
<data name="date" value="09/07/2016" />
<data name="time" value="16:45:36" />
<data name="Results" />
<data name="Line 1" value="result1" />
<data name="Line 2" value="result2" />
<data name="Line 3" value="result3" />
<data name="Line 4" value="result4" />
</data>
</record>
</records>
5.5 - Element Reference
There are various elements used in a Data Splitter configuration to control behaviour. Each of these elements can be categorised as one of the following:
5.5.1 - Content Providers
Content providers take some content from the input source or elsewhere (see fixed strings) and provide it to one or more expressions.
Both the root element <dataSplitter> and <group> elements are content providers.
Root element <dataSplitter>
The root element of a Data Splitter configuration is <dataSplitter>.
It supplies content from the input source to one or more expressions defined within it.
The root element controls the way content is buffered from the source and how errors are handled when child expressions fail to match all of the content it supplies.
Attributes
The following attributes can be added to the <dataSplitter> root element:
ignoreErrors
Data Splitter generates errors if not all of the content is matched by the regular expressions beneath the <dataSplitter> or within <group> elements.
The error messages are intended to aid the user in writing good Data Splitter configurations.
The intent is to indicate when the input data is not being matched fully and therefore possibly skipping some important data.
Despite this, in some cases it is laborious to have to write expressions to match all content.
In these cases it is preferable to add this attribute to ignore these errors.
However it is often better to write expressions that capture all of the supplied content and discard unwanted characters.
This attribute also affects errors generated by the use of the minMatch attribute on <regex>, which is described later on.
Take the following example input:
Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment
This could be matched with the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<regex id="heading" pattern=".+" maxMatch="1">
…
</regex>
<regex id="body" pattern="\n[^#]+">
…
</regex>
</dataSplitter>
The above configuration would only match up to the comment on each record line (the comments themselves would not be matched), e.g.
Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment
This may well be the desired functionality but if there was useful content within the comment it would be lost. Because of this Data Splitter warns you when expressions are failing to match all of the content presented so that you can make sure that you aren’t missing anything important. In the above example it is obvious that this is the required behaviour but in more complex cases you might be otherwise unaware that your expressions were losing data.
To maintain this assurance that you are handling all content it is usually best to write expressions to explicitly match all content even though you may do nothing with some matches, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<regex id="heading" pattern=".+" maxMatch="1">
…
</regex>
<regex id="body" pattern="\n([^#]+)#.+">
…
</regex>
</dataSplitter>
The above example would match all of the content and would therefore not generate warnings. Sub-expressions of ‘body’ could use match group 1 and ignore the comment.
However as previously stated it might often be difficult to write expressions that will just match content that is to be discarded.
In these cases ignoreErrors
can be used to suppress errors caused by unmatched content.
bufferSize
(Advanced)
This is an optional attribute used to tune the size of the character buffer used by Data Splitter. The default size is 20000 characters and should be fine for most translations. The minimum value that this can be set to is 20000 characters and the maximum is 1000000000. The only reason to specify this attribute is when individual records are bigger than 10000 characters which is rarely the case.
Group element <group>
Groups behave in a similar way to the root element in that they provide content for one or more inner expressions to deal with, e.g.
<group value="$1">
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
...
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
...
Attributes
As the <group> element is a content provider it also includes the same ignoreErrors attribute, which behaves in the same way.
The complete list of attributes for the <group> element is as follows:
id
When Data Splitter reports errors it outputs an XPath to describe the part of the configuration that generated the error, e.g.
DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group>
It is often a little difficult to identify the configuration element that generated the error by looking at the path and the element description, particularly when multiple elements are the same, e.g. many <group>
elements without attributes.
To make identification easier you can add an ‘id’ attribute to any element in the configuration resulting in error descriptions as follows:
DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group id="myGroupId">
value
This attribute determines what content to present to child expressions.
By default the entire content matched by a group’s parent expression is passed on by the group to child expressions.
If required, content from a specific match group in the parent expression can be passed to child expressions using the value attribute, e.g. value="$1".
In addition to this content can be composed in the same way as it is for data names and values.
See Also
Match references for a full description of match references.
ignoreErrors
This behaves in the same way as for the root element.
matchOrder
This is an optional attribute used to control how content is consumed by expression matches.
Content can be consumed in sequence or in any order using matchOrder="sequence" or matchOrder="any". If the attribute is not specified, Data Splitter will default to matching in sequence.
When matching in sequence, each match consumes some content and the content position is moved beyond the match ready for the subsequent match. However, in some cases the order of these constructs is not predictable, e.g. we may sometimes be presented with:
Value1=1 Value2=2
… or sometimes with:
Value2=2 Value1=1
Using a sequential match order the following example would work to find both values in Value1=1 Value2=2
<group>
<regex pattern="Value1=([^ ]*)">
...
<regex pattern="Value2=([^ ]*)">
...
… but this example would skip over Value2 and only find the value of Value1 if the input was Value2=2 Value1=1.
To be able to deal with content that contains these constructs in either order we need to change the match order to any.
When matching in any order, each match removes the matched section from the content rather than moving the position past the match so that all remaining content can be matched by subsequent expressions.
In the following example the first expression would match and remove Value1=1
from the supplied content and the second expression would be presented with Value2=2
which it could also match.
<group matchOrder="any">
<regex pattern="Value1=([^ ]*)">
...
<regex pattern="Value2=([^ ]*)">
...
If the attribute is omitted by default the match order will be sequential. This is the default behaviour as tokens are most often in sequence and consuming content in this way is more efficient as content does not need to be copied by the parser to chop out sections as is required for matching in any order. It is only necessary to use this feature when fields that are identifiable with a specific match can occur in any order.
reverse
Occasionally it is desirable to reverse the content presented by a group to child expressions. This is because it is sometimes easier to form a pattern by matching content in reverse.
Take the following example content of name, value pairs delimited by =
but with no spaces between names, multiple spaces between values and only a space between subsequent pairs:
ipAddress=123.123.123.123 zones=Zone 1, Zone 2, Zone 3 location=loc1 A user=An end user serverName=bigserver
We could write a pattern that matches each name value pair by matching up to the start of the next name, e.g.
<regex pattern="([^=]+)=(.+?)( [^=]+=)">
This would match the following:
ipAddress=123.123.123.123 zones=
Here we are capturing the name and value for each pair in separate groups but the pattern has to also match the name from the next name value pair to find the end of the value. By default Data Splitter will move the content buffer to the end of the match ready for subsequent matches so the next name will not be available for matching.
In addition to matching too much content the above example also uses a reluctant qualifier .+?
. Use of reluctant qualifiers almost always impacts performance so they are to be avoided if at all possible.
A better way to match the example content is to match the input in reverse, reading characters from right to left.
The following example demonstrates this:
<group reverse="true">
<regex pattern="([^=]+)=([^ ]+)">
<data name="$2" value="$1" />
</regex>
</group>
Using the reverse attribute on the parent group causes content to be supplied to all child expressions in reverse order. In the above example this allows the pattern to match values followed by names which enables us to cope with the fact that values have multiple spaces but names have no spaces.
Content is only presented to child regular expressions in reverse. When referencing values from match groups the content is returned in the correct order, e.g. the above example would return:
<data name="ipAddress" value="123.123.123.123" />
<data name="zones" value="Zone 1, Zone 2, Zone 3" />
<data name="location" value="loc1" />
<data name="user" value="An end user" />
<data name="serverName" value="bigserver" />
The reverse feature isn’t needed very often but there are a few cases where it really helps produce the desired output without the complexity and performance overhead of a reluctant match.
An alternative to using the reverse attribute is to use the original reluctant expression example but tell Data Splitter to make the subsequent name available for the next match by not advancing the content beyond the end of the previous value. This is done by using the advance attribute on the <regex>
. However, the reverse attribute represents a better way to solve this particular problem and allows a simpler and more efficient regular expression to be used.
5.5.2 - Expressions
Expressions match some data supplied by a parent content provider. The content matched by an expression depends on the type of expression and how it is configured.
The <split>, <regex> and <all> elements are all expressions and match content as described below.
The <split> element
The <split> element directs Data Splitter to break up content using a specified character sequence as a delimiter.
In addition to this it is possible to specify characters that are used to escape the delimiter as well as characters that contain or “quote” a value that may include the delimiter sequence but allow it to be ignored.
Attributes
The <split> element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
delimiter
A required attribute used to specify the character string that will be used as a delimiter to split the supplied content unless it is preceded by an escape character or within a container if specified. Several of the previous examples use this attribute.
escape
An optional attribute used to specify a character sequence that is used to escape the delimiter. Many delimited text formats have an escape character that is used to tell any parser that the following delimiter should be ignored, e.g. often a character such as ‘\’ is used to escape the character that follows it so that it is not treated as a delimiter. When specified this escape sequence also applies to any container characters that may be specified.
containerStart
An optional attribute used to specify a character sequence that will make this expression ignore the presence of delimiters until an end container is found. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters up to a delimiter.
If used, containerEnd must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.
containerEnd
An optional attribute used to specify a character sequence that will make this expression stop ignoring the presence of delimiters if it believes it is currently in a container. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters while ignoring the presence of any delimiter.
If used, containerStart must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.
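As an illustration (a sketch that is not part of the examples above), the following fragment could be used to split a line of quoted CSV values such as "one","two, with a comma","three" on commas while ignoring any commas inside the quotes. As noted above, match group 1 is used so that the quote characters are not included in the output.

<split delimiter="," containerStart="&quot;" containerEnd="&quot;">
  <data value="$1" />
</split>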
maxMatch
An optional attribute used to specify the maximum number of times this expression is allowed to match the supplied content. If you do not supply this attribute then the Data Splitter will keep matching the supplied content until it reaches the end. If specified Data Splitter will stop matching the supplied content when it has matched it the specified number of times.
This attribute is used in the ‘CSV with header line’ example to ensure that only the first line is treated as a header line.
minMatch
An optional attribute used to specify the minimum number of times this expression should match the supplied content. If you do not supply this attribute then Data Splitter will not enforce that the expression matches the supplied content. If specified Data Splitter will generate an error if the expression does not match the supplied content at least as many times as specified.
Unlike maxMatch, minMatch does not control the matching process but instead controls the production of error messages generated if the parser is not seeing the expected input.
onlyMatch
Optional attribute to use this expression only for specific instances of a match of the parent expression, e.g. on the 4th, 5th and 8th matches of the parent expression specified by ‘4,5,8’. This is used when this expression should only be used to subdivide content from certain parent matches.
The <regex> element
The <regex> element directs Data Splitter to match content using the specified regular expression pattern.
In addition to this, the same match control attributes that are available on the <split> element are also present, as well as attributes to alter the way the pattern works.
Attributes
The <regex> element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
pattern
This is a required attribute used to specify a regular expression to use to match on the supplied content.
The pattern is used to match the content multiple times until the end of the content is reached while the maxMatch and onlyMatch conditions are satisfied.
dotAll
An optional attribute used to specify if the use of ‘.’ in the supplied pattern matches all characters including new lines. If ’true’ ‘.’ will match all characters including new lines, if ‘false’ it will only match up to a new line. If this attribute is not specified it defaults to ‘false’ and will only match up to a new line.
This attribute is used in many of the multi-line examples above.
caseInsensitive
An optional attribute used to specify if the supplied pattern should match content in a case insensitive way. If ’true’ the expression will match content in a case insensitive manner, if ‘false’ it will match the content in a case sensitive manner. If this attribute is not specified it defaults to ‘false’ and will match the content in a case sensitive manner.
maxMatch
This is used in the same way it is on the <split>
element, see maxMatch
.
minMatch
This is used in the same way it is on the <split>
element, see minMatch
.
onlyMatch
This is used in the same way it is on the <split>
element, see onlyMatch
.
advance
After an expression has matched content in the buffer, the buffer start position is advanced so that it moves to the end of the entire match. This means that subsequent expressions operating on the content buffer will not see the previously matched content again. This is normally required behaviour, but in some cases some of the content from a match is still required for subsequent matches. Take the following example of name value pairs:
name1=some value 1 name2=some value 2 name3=some value 3
The first name value pair could be matched with the following expression:
<regex pattern="([^=]+)=(.+?) [^= ]+=">
The above expression would match as follows:
name1=some value 1 name2=some value 2 name3=some value 3
In this example we have had to do a reluctant match to extract the value in group 2 and not include the subsequent name. Because the reluctant match requires us to specify what we are reluctantly matching up to, we have had to include an expression after it that matches the next name.
By default the parser will move the character buffer to the end of the entire match so the next expression will be presented with the following:
some value 2 name3=some value 3
Therefore name2
will have been lost from the content buffer and will not be available for matching.
This behaviour can be altered by telling the expression how far to advance the character buffer after matching. This is done with the advance attribute and is used to specify the match group whose end position should be treated as the point the content buffer should advance to, e.g.
<regex pattern="([^=]+)=(.+?) [^= ]+=" advance="2">
In this example the content buffer will only advance to the end of match group 2 and subsequent expressions will be presented with the following content:
name2=some value 2 name3=some value 3
Therefore name2
will still be available in the content buffer.
It is likely that the advance feature will only be useful in cases where a reluctant match is performed. Reluctant matches are discouraged for performance reasons so this feature should rarely be used. A better way to tackle the above example would be to present the content in reverse, however this is only possible if the expression is within a group, i.e. is not a root expression. There may also be more complex cases where reversal is not an option and the use of a reluctant match is the only option.
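Separately from advance, the following minimal sketch illustrates the dotAll and caseInsensitive attributes described above; the pattern and output name are illustrative assumptions only:
<!-- '.' also matches new lines, and "ERROR:" / "error:" are matched alike -->
<regex pattern="error:(.*)" dotAll="true" caseInsensitive="true">
  <data name="errorDetail" value="$1" />
</regex>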
The <all>
element
The <all>
element matches the entire content of the parent group and makes it available to child groups or <data>
elements.
The purpose of <all>
is to act as a catch all expression to deal with content that is not handled by a more specific expression, e.g. to output some other unknown, unrecognised or unexpected data.
<group>
<regex pattern="^\s*([^=]+)=([^=]+)\s*">
<data name="$1" value="$2" />
</regex>
<!-- Output unexpected data -->
<all>
<data name="unknown" value="$" />
</all>
</group>
The <all>
element provides the same functionality as using .*
as a pattern in a <regex>
element and where dotAll
is set to true, e.g. <regex pattern=".*" dotAll="true">
.
However it performs much faster as it doesn’t require pattern matching logic and is therefore always preferred.
Attributes
The <all>
element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
5.5.3 - Variables
A variable is added to Data Splitter using the <var>
element. A variable is used to store matches from a parent expression for use in a reference elsewhere in the configuration, see variable reference.
The most recent matches are stored for use in local references, i.e. references that are in the same match scope as the variable. Multiple matches are stored for use in references that are in a separate match scope. The concept of different variable scopes is described in scopes.
The <var>
element
The <var>
element is used to tell Data Splitter to store matches from a parent expression for use in a reference.
Attributes
The <var>
element has the following attributes:
id
Mandatory attribute used to uniquely identify the variable within the configuration (see id
) and is the means by which a variable is referenced, e.g. $VAR_ID$
.
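As a minimal sketch, a variable storing each value matched by a parent <split> expression could be declared as follows and then referenced elsewhere as $heading$ (the id heading is just an illustrative choice, mirroring the CSV heading example):
<split delimiter=",">
  <!-- Store each matched value for later reference as $heading$ -->
  <var id="heading" />
</split>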
5.5.4 - Output
As with all other aspects of Data Splitter, output XML is determined by adding certain elements to the Data Splitter configuration.
The <data>
element
Output is created by Data Splitter using one or more <data>
elements in the configuration.
The first <data>
element that is encountered within a matched expression will result in parent <record>
elements being produced in the output.
Attributes
The <data>
element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
name
Both the name and value attributes of the <data>
element can be specified using match references.
value
Both the name and value attributes of the <data>
element can be specified using match references.
Single <data>
element example
The simplest example that can be provided uses a single <data>
element within a <split>
expression.
Given the following input:
This is line 1
This is line 2
This is line 3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<split delimiter="\n" >
<data value="$1"/>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data value="This is line 1" />
</record>
<record>
<data value="This is line 2" />
</record>
<record>
<data value="This is line 3" />
</record>
</records>
Multiple <data>
element example
You could also output multiple <data>
elements for the same <record>
by adding multiple elements within the same expression:
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</record>
<record>
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</record>
<record>
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</record>
</records>
Multi level <data>
elements
As long as all data elements occur within the same parent/ancestor expression, all data elements will be output within the same record.
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<split delimiter="\n" >
<data name="line" value="$1"/>
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1" />
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2" />
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3" />
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</record>
</records>
Nesting <data>
elements
Rather than having <data>
elements all appear as children of <record>
it is possible to nest them either as direct children or within child groups.
Direct children
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
<data name="line" value="$">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</data>
</regex>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1">
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</data>
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2">
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</data>
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3">
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</data>
</record>
</records>
Within child groups
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<split delimiter="\n" >
<data name="line" value="$1">
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</data>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="3.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1">
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</data>
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2">
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</data>
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3">
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</data>
</record>
</records>
The above example produces the same output as the previous but could be used to apply much more complex expression logic to produce the child <data>
elements, e.g. the inclusion of multiple child expressions to deal with different types of lines.
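For example, a hypothetical configuration might give the same group several child expressions, each handling a different line format, with an <all> element as a catch-all for anything unrecognised:
<!-- Hypothetical sketch: handle two different line formats within the same group -->
<data name="line" value="$1">
  <group value="$1">
    <!-- key=value style lines -->
    <regex pattern="ip=([^ ]+) user=([^ ]+)">
      <data name="ip" value="$1" />
      <data name="user" value="$2" />
    </regex>
    <!-- comma separated style lines -->
    <regex pattern="([^,]+),([^,]+)">
      <data name="ip" value="$1" />
      <data name="user" value="$2" />
    </regex>
    <!-- anything else -->
    <all>
      <data name="unknown" value="$" />
    </all>
  </group>
</data>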
5.6 - Match References, Variables and Fixed Strings
The <group>
and <data>
elements can reference match groups from parent expressions or from stored matches in variables. In the case of the <group>
element, referenced values are passed on to child expressions whereas the <data>
element can use match group references for name and value attributes. In the case of both elements the way of specifying references is the same.
5.6.1 - Expression match references
Referencing matches in expressions is done using $
. In addition to this a match group number may be added to just retrieve part of the expression match. The applicability and effect that this has depends on the type of expression used.
References to <split>
Match Groups
In the following example a line matched by a parent <split>
expression is referenced by a child <data>
element.
<split delimiter="\n" >
<data name="line" value="$"/>
</split>
A <split>
element matches content up to and including the specified delimiter, so the above reference would output the entire line plus the delimiter. However there are various match groups that can be used by child <group>
and <data>
elements to reference sections of the matched content.
To illustrate the content provided by each match group, take the following example:
"This is some text\, that we wish to match", "This is the next text"
And the following <split>
element:
<split delimiter="," escape="\">
The match groups are as follows:
- $ or $0: The entire content that is matched including the specified delimiter at the end
"This is some text\, that we wish to match",
- $1: The content up to the specified delimiter at the end
"This is some text\, that we wish to match"
- $2: The content up to the specified delimiter at the end and filtered to remove escape characters (more expensive than $1)
"This is some text, that we wish to match"
In addition to this behaviour match groups 1 and 2 will omit outermost whitespace and container characters if specified, e.g. with the following content:
" This is some text\, that we wish to match " , "This is the next text"
And the following <split>
element:
<split delimiter="," escape="\" containerStart="&quot;" containerEnd="&quot;">
The match groups are as follows:
- $ or $0: The entire content that is matched including the specified delimiter at the end
" This is some text\, that we wish to match " ,
- $1: The content up to the specified delimiter at the end and strips outer containers.
This is some text\, that we wish to match
- $2: The content up to the specified delimiter at the end and strips outer containers and filtered to remove escape characters (more computationally expensive than $1)
This is some text, that we wish to match
References to <regex> Match Groups
Like the <split>
element various match groups can be referenced in a <regex>
expression to retrieve portions of matched content. This content can be used as values for <group>
and <data>
elements.
Given the following input:
ip=1.1.1.1 user=user1
And the following <regex>
element:
<regex pattern="ip=([^ ]+) user=([^ ]+)">
The match groups are as follows:
- $ or $0: The entire content that is matched by the expression
ip=1.1.1.1 user=user1
- $1: The content of the first match group
1.1.1.1
- $2: The content of the second match group
user1
Match group numbers in regular expressions are determined by the order that their open bracket appears in the expression.
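For instance, in the hypothetical pattern below the outer brackets open first, so $1 holds the whole ip=... token, $2 just the address and $3 the user name:
<regex pattern="(ip=([^ ]+)) user=([^ ]+)">
  <!-- For "ip=1.1.1.1 user=user1": $1 = "ip=1.1.1.1", $2 = "1.1.1.1", $3 = "user1" -->
  <data name="ip" value="$2" />
  <data name="user" value="$3" />
</regex>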
References to <all>
Match Groups
The <all>
element does not have any match groups and always returns the entire content that was passed to it when referenced with $.
5.6.2 - Variable reference
Variables are added to Data Splitter configuration using the <var>
element, see variables. Each variable must have a unique id so that it can be referenced. References to variables have the form $VARIABLE_ID$
, e.g.
<data name="$heading$" value="$" />
Identification
Data Splitter validates the configuration on load and ensures that all element ids are unique and that referenced ids belong to a variable.
A variable will only store data if it is referenced so variables that are not referenced will do nothing. In addition to this a variable will only store data for match groups that are referenced, e.g. if $heading$1
is the only reference to a variable with an id of ‘heading’ then only data for match group 1 will be stored for reference lookup.
Scopes
Variables have two scopes which affect how data is retrieved when referenced:
Local Scope
Variables are local to a reference if the reference exists as a descendant of the variable's parent expression, e.g.
<split delimiter="\n" >
<var id="line" />
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="line" value="$line$"/>
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</split>
In the above example, matches for the outermost <split>
expression are stored in the variable with the id of line
. The only reference to this variable is in a data element that is a descendant of the variable's parent expression <split>
, i.e. it is nested within split/group/regex.
Because the variable is referenced locally only the most recent parent match is relevant, i.e. no retrieval of values by iteration, iteration offset or fixed position is applicable. These features only apply to remote variables that store multiple values.
Remote Scope
The CSV example with a heading is an example of a variable being referenced from a remote scope.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
xmlns="data-splitter:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
version="3.0">
<!-- Match heading line (note that maxMatch="1" means that only the first line will be matched by this splitter) -->
<split delimiter="\n" maxMatch="1">
<!-- Store each heading in a named list -->
<group>
<split delimiter=",">
<var id="heading" />
</split>
</group>
</split>
<!-- Match each record -->
<split delimiter="\n">
<!-- Take the matched line -->
<group value="$1">
<!-- Split the line up -->
<split delimiter=",">
<!-- Output the stored heading for each iteration and the value from group 1 -->
<data name="$heading$1" value="$1" />
</split>
</group>
</split>
</dataSplitter>
In the above example the parent expression of the variable is not the ancestor of the reference in the <data>
element. This makes the <data>
elements reference to the variable a remote one. In this situation the variable knows that it must store multiple values as the remote reference <data>
may retrieve one of many values from the variable based on:
- The match count of the parent expression.
- The match count of the parent expression, plus or minus an offset.
- A fixed position in the variable store.
Retrieval of value by iteration
In the above example the first line is taken then repeatedly matched by delimiting with commas. This results in multiple values being stored in the ‘heading’ variable. Once this is done subsequent lines are matched and then also repeatedly matched by delimiting with commas in the same way the heading is.
Each time a line is matched the internal match count of all sub expressions, (e.g. the <split>
expression that is delimited by comma) is reset to 0. Every time the sub <split>
expression matches up to a comma delimiter the match count is incremented. Any references to remote variables will, by default, use the current match count as an index to retrieve one of the many values stored in the variable. This means that the <data>
element in the above example will retrieve the corresponding heading for each value as the match count of the values will match the storage position of each heading.
Retrieval of value by iteration offset
In some cases there may be a mismatch between the position where a value is stored in a variable and the match count applicable when remotely referencing the variable.
Take the following input:
BAD,Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
In the above example we can see that the first heading ‘BAD’ is not correct for the first value of every line. In this situation we could either adjust the way the heading line is parsed to ignore ‘BAD’ or just adjust the way the heading variable is referenced.
To make this adjustment the reference just needs to be told what offset to apply to the current match count to correctly retrieve the stored value. In the above example this would be done like this:
<data name="$heading$1[+1]" value="$1" />
The above reference just uses the match count plus 1 to retrieve the stored value. Any positive or negative integer offset may be used, e.g. [+4] or [-10]. Offsets that result in a position outside of the storage range for the variable will not return a value.
Retrieval of value by fixed position
In addition to retrieval by offset from the current match count, a stored value can be returned by a fixed position that has no relevance to the current match count.
In the following example the value retrieved from the ‘heading’ variable will always be ‘IPAddress’ as this is the fourth value stored in the ‘heading’ variable and the position index starts at 0.
<data name="$heading$1[3]" value="$1" />
5.6.3 - Use of fixed strings
Any <group>
value or <data>
name and value can use references to matched content, but in addition to this it is possible just to output a known string, e.g.
<data name="somename" value="$" />
The above example would output somename
as the <data>
name attribute. This can often be useful where there are no headings specified in the input data but we want to associate certain names with certain values.
Given the following data:
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
We could provide useful headings with the following configuration:
<regex pattern="([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),">
<data name="date" value="$1" />
<data name="time" value="$2" />
<data name="ipAddress" value="$3" />
<data name="hostName" value="$4" />
<data name="user" value="$5" />
<data name="action" value="$6" />
</regex>
5.6.4 - Concatenation of references
It is possible to concatenate multiple fixed strings and match group references using the +
character. As with all references and fixed strings this can be done in <group>
value and <data>
name and value attributes. However concatenation does have some performance overhead as new buffers have to be created to store concatenated content.
A good example of concatenation is the production of ISO8601 date format from data in the previous example:
01/01/2010,00:00:00
Here the following <regex>
could be used to extract the relevant date, time groups:
<regex pattern="(\d{2})/(\d{2})/(\d{4}),(\d{2}):(\d{2}):(\d{2})">
The match groups from this expression can be concatenated with the following value output pattern in the data element:
<data name="dateTime" value="$3+'-'+$2+'-'+$1+'T'+$4+':'+$5+':'+$6+'.000Z'" />
Using the original example, this would result in the output:
<data name="dateTime" value="2010-01-01T00:00:00.000Z" />
Note that the value output pattern wraps all fixed strings in single quotes. This is necessary when concatenating strings and references so that Data Splitter can determine which parts are to be treated as fixed strings. This also allows fixed strings to contain $
and +
characters.
As single quotes are used for this purpose, a single quote needs to be escaped with another single quote if one is desired in a fixed string, e.g.
'this ''is quoted text'''
This will result in:
this 'is quoted text'
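Combining these rules, a hypothetical value that concatenates fixed strings (one containing an escaped quote) with a match reference might look like the following, assuming $2 holds a user name captured by the parent expression:
<!-- Produces values such as: user 'user1' logged in -->
<data name="message" value="'user '''+$2+''' logged in'" />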
6 - Event Feeds
In order for Stroom to be able to handle the various data types as described in the previous section, Stroom must be told what the data is when it is received. This is achieved using Event Feeds. Each feed has a unique name within the system.
Event Feeds can be related to one or more Reference Feeds. Reference Feeds are used to provide lookup data for a translation, e.g. looking up a computer name by its IP address.
Feeds can also have associated context data. Context data is used to provide lookup information that is only applicable to the events file it relates to, e.g. if the events file is missing information about the computer it was generated on, and you don't want to create separate feeds for each computer, an associated context file can be used to provide this information.
Feed Identifiers
Feed identifiers must be unique within the system. Identifiers can be in any format but an established convention is to use the following format:
<SYSTEM>-<ENVIRONMENT>-<TYPE>-<EVENTS/REFERENCE>-<VERSION>
If feeds in a certain site need different reference data then the site can be broken down further.
_ may be used to represent a space.
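For example, a hypothetical feed of audit events from a production system called MYAPP might be named MYAPP-PROD-AUDIT-EVENTS-V1 (the system, environment and type names here are illustrative only).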
7 - Indexing data
7.1 - Elasticsearch
7.1.1 - Introduction
Stroom supports using an external Elasticsearch cluster to index event data. This allows you to leverage all the features of the Elastic Stack, such as shard allocation, replication, fault tolerance and aggregations.
With Elasticsearch as an external service, your search infrastructure can scale independently of your Stroom data processing cluster, enhancing interoperability with other platforms by providing a performant and resilient time-series event data store. For instance, you can:
- Deploy Kibana to search and visualise Elasticsearch data.
- Index Stroom’s stream meta and
Error
streams so monitoring systems can generate metrics and alerts. - Use Apache Spark to perform stateful data processing and enrichment, through the use of the Elasticsearch-Hadoop connector.
Stroom achieves indexing and search integration by interfacing securely with the Elasticsearch REST API using the Java high-level client.
This guide will walk you through configuring a Stroom indexing pipeline, creating an Elasticsearch index template, activating a stream processor and searching the indexed data in both Stroom and Kibana.
Assumptions
- You have created an Elasticsearch cluster. Elasticsearch 8.x is recommended, though the latest supported 7.x version will also work. For test purposes, you can quickly create a single-node cluster using Docker by following the steps in the Elasticsearch Docs .
- The Elasticsearch cluster is reachable via HTTPS from all Stroom nodes participating in stream processing.
- Elasticsearch security is enabled. This is mandatory and is enabled by default in Elasticsearch 8.x and above.
- The Elasticsearch HTTPS interface presents a trusted X.509 server certificate. The Stroom node(s) connecting to Elasticsearch need to be able to verify the certificate, so for custom PKI, a Stroom truststore entry may be required.
- You have a feed containing
Event
streams to index.
Key differences
Indexing data with Elasticsearch differs from Solr and built-in Lucene methods in a number of ways:
- Unlike with Solr and built-in Lucene indexing, Elasticsearch field mappings are managed outside Stroom, through the use of index and component templates . These are normally created either via the Elasticsearch API, or interactively using Kibana.
- Aside from creating the mandatory
StreamId
andEventId
field mappings, explicitly defining mappings for other fields is optional. Elasticsearch will use dynamic mapping by default, to infer each field’s type at index time. Explicitly defining mappings is recommended where consistency or greater control are required, such as for IP address fields (Elasticsearch mapping typeip
).
Next page - Getting Started
7.1.2 - Getting Started
Establish an Elasticsearch cluster connection in Stroom
The first step is to configure Stroom to connect to an Elasticsearch cluster.
You can configure multiple cluster connections if required, such as a separate one for production and another for development.
Each cluster connection is defined by an Elastic Cluster
document within the Stroom UI.
- In the Stroom Explorer pane (
Elastic Cluster
document.
), right-click on the folder where you want to create the - Select:
- Give the cluster document a name and press .
- Complete the fields as explained in the section below. Any fields not marked as “Optional” are mandatory.
- Click
Test Connection
. A dialog will display with the test result. IfConnection Success
, details of the target cluster will be displayed. Otherwise, error details will be displayed. - Click to commit changes.
Warning
Ensure you restrict permissions to theElastic Cluster
document.
The Read
privilege permits retrieval of the Elasticsearch API key and secret, granting the holder the same level of privilege as Stroom.
Users authorised to search Elasticsearch indices via Stroom dashboards should only be assigned the Use
privilege.
Elastic Cluster document fields
Description
(Optional) You might choose to enter the Elasticsearch cluster name or purpose here.
Connection URLs
Enter one or more node or cluster addresses, including protocol, hostname and port. Only HTTPS is supported; attempts to use plain-text HTTP will fail.
Examples
- Local development node:
https://localhost:9200
- FQDN:
https://elasticsearch.example.com:9200
- Kubernetes service:
https://prod-es-http.elastic.svc:9200
CA certificate
PEM-format CA certificate chain used by Stroom to verify TLS connections to the Elasticsearch HTTPS REST interface. This is usually your organisation’s root enterprise CA certificate. For development, you can provide a self-signed certificate.
Use authentication
(Optional) Tick this box if Elasticsearch requires authentication. This is enabled by default from Elasticsearch version 8.0.
API key ID
Required if Use authentication
is checked.
Specifies the Elasticsearch API key ID for a valid Elasticsearch user account.
This user requires at a minimum the following
privileges
:
Cluster privileges
- monitor
- manage_own_api_key
Index privileges
- all
API key secret
Required if Use authentication
is checked.
Socket timeout (ms)
Number of milliseconds to wait for an Elasticsearch indexing or search REST call to complete.
Set to -1
(the default) to wait indefinitely, or until Elasticsearch closes the connection.
Next page - Indexing data
7.1.3 - Indexing data
A typical workflow is for a Stroom pipeline to convert XML Event
elements into the XML equivalent of JSON, complying with the schema http://www.w3.org/2005/xpath-functions
, using a format identical to the output of the XML function xml-to-json()
.
Understanding JSON XML representation
In an Elasticsearch indexing pipeline translation, you model JSON documents in a compatible XML representation.
Common JSON primitives and examples of their XML equivalents are outlined below.
Arrays
Array of maps
<array key="users" xmlns="http://www.w3.org/2005/xpath-functions">
<map>
<string key="name">John Smith</string>
</map>
</array>
Array of strings
<array key="userNames" xmlns="http://www.w3.org/2005/xpath-functions">
<string>John Smith</string>
<string>Jane Doe</string>
</array>
Maps and properties
<map key="user" xmlns="http://www.w3.org/2005/xpath-functions">
<string key="name">John Smith</string>
<boolean key="active">true</boolean>
<number key="daysSinceLastLogin">42</number>
<string key="loginDate">2022-12-25T01:59:01.000Z</string>
<null key="emailAddress" />
<array key="phoneNumbers">
<string>1234567890</string>
</array>
</map>
Note
It is recommended to insert a schema validation filter into your pipeline XML (with schema groupJSON
), to make it easier to diagnose JSON conversion errors.
We will now explore how to create an Elasticsearch index template, which specifies field mappings and settings for one or more indices.
Create an Elasticsearch index template
For information on what index and component templates are, consult the Elastic documentation .
When Elasticsearch first receives a document from Stroom targeting an index, whose name matches any of the index_patterns
entries in the index template, it will create a new index / data stream using the settings
and mappings
properties from the template.
In this way, the index does not need to be manually created in advance.
Note
If an index doesn’t match a template when it is created, data will still be indexed - with default mappings and settings. This may be appropriate for small indices, but with a default shard count of5
, the indexing and search performance will likely be inadequate for large indices.
The following example creates a basic index template stroom-events-v1
in a local Elasticsearch cluster, with the following explicit field mappings:
StreamId
– mandatory, data typelong
orkeyword
.EventId
– mandatory, data typelong
orkeyword
.@timestamp
– required if the index is to be part of a data stream (recommended).User
– An object containing propertiesId
,Name
andActive
, each with their own data type.Tags
– An array of one or more strings.Message
– Contains arbitrary content such as unstructured raw log data. Supports full-text search. Nested fieldwildcard
supports regexp queries .
Note
Elasticsearch does not have a dedicated array
field mapping data type.
An Elasticsearch field may contain zero or more values by default.
In Kibana Dev Tools, execute the following query:
PUT _index_template/stroom-events-v1
{
"index_patterns": [
"stroom-events-v1*" // Apply this template to index names matching this pattern.
],
"data_stream": {}, // For time-series data. Recommended for event data.
"template": {
"settings": {
"number_of_replicas": 1, // Replicas impact indexing throughput. This setting can be changed at any time.
"number_of_shards": 10, // Consider the shard sizing guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#shard-size-recommendation
"refresh_interval": "10s", // How often to refresh the index. For high-throughput indices, it's recommended to increase this from the default of 1s
"lifecycle": {
"name": "stroom_30d_retention_policy" // (Optional) Apply an ILM policy https://www.elastic.co/guide/en/elasticsearch/reference/current/set-up-lifecycle-policy.html
}
},
"mappings": {
"dynamic_templates": [],
"properties": {
"StreamId": { // Required.
"type": "long"
},
"EventId": { // Required.
"type": "long"
},
"@timestamp": { // Required if the index is part of a data stream.
"type": "date"
},
"User": {
"properties": {
"Id": {
"type": "keyword"
},
"Name": {
"type": "keyword"
},
"Active": {
"type": "boolean"
}
}
},
"Tags": {
"type": "keyword"
},
"Message": {
"type": "text",
"fields": {
"wildcard": {
"type": "wildcard"
}
}
}
}
}
},
"composed_of": [
// Optional array of component template names.
]
}
Create an Elasticsearch indexing pipeline template
An Elasticsearch indexing pipeline is similar in structure to the built-in packaged Indexing
template pipeline.
It typically consists of the following pipeline elements:
-
XSLTFilter
contains the translation mapping
Events
to JSONarray
. -
SchemaFilter
uses schema group
JSON
.
It is recommended to create a template Elasticsearch indexing pipeline, which can then be re-used.
Procedure
- Right-click on the
Template Pipelines
folder in the Stroom Explorer pane ( ). - Select:
- Enter the name
Indexing (Elasticsearch)
and click . - Define the pipeline structure as above, and customise the following pipeline elements:
- Set the Split Filter
splitCount
property to a sensible default value, based on the expected source XML element count (e.g.100
). - Set the Schema Filter
schemaGroup
property toJSON
. - Set the Elastic Indexing Filter
cluster
property to point to theElastic Cluster
document you created earlier. - Set the Write Record Count filter
countRead
property tofalse
.
- Set the Split Filter
Now you have created a template indexing pipeline, it’s time to create a feed-specific pipeline that inherits this template.
Create an Elasticsearch indexing pipeline
Procedure
- Right-click on a folder in the Stroom Explorer pane .
- Enter a name for your pipeline and click .
- Click the
Inherit From
button. - In the dialog that appears, select the template pipeline you created named
Indexing (Elasticsearch)
and click . - Select the Elastic Indexing Filter pipeline element.
- Set the
indexName
property to the name of the destination index or data stream.indexName
may be a simple string (static) or dynamic. - If using dynamic index names, configure the translation to output named element(s) that will be interpolated into
indexName
for each document indexed.
Choosing between simple and dynamic index names
Indexing data to a single, named data stream or index is a simple and convenient way to manage data. There are cases, however, where indices may contain significant volumes of data spanning long periods, and where a large portion of indexing will be performed up-front (such as when processing a feed with a lot of historical data). As Elasticsearch data stream indices roll over based on the current time (not event time), it is helpful to be able to partition data streams by user-defined properties such as year. This use case is met by Stroom’s dynamic index naming.
Note
An Elasticsearch
data stream consists of one or more backing indices, which automatically roll over once a size or date threshold is met. This abstraction assists with the lifecycle management of time-series log data, enabling users to define time and size-based rules that can, for instance, delete indices after they reach a certain age, or move older indices to different data tiers (e.g. cold storage).
Single named index or data stream
This is the simplest use case and is suitable where you want to write all data for a particular pipeline to a single data stream or index.
Whether data is written to an actual index or data stream depends on your index template, specifically whether you have declared data_stream: {}
.
If this property exists in the index template matching indexName
, a data stream is created when the first document is indexed.
Data streams, amongst many other features, provide the option to use Elasticsearch
Index Lifecycle Management (ILM) policies
to manage their lifecycle.
Note
When indexing to a data stream, ensure to include astring
field named @timestamp
in the output JSON XML.
This is mandatory and indexing will fail if this field isn’t a valid date
value.
Dynamic data stream names
With a dynamic stream name, indexName
contains the names of elements, for example: stroom-events-v1-{year}
.
For each document, the final index name is computed based on the values of the corresponding elements within the resulting JSON XML.
For example, if the JSON XML representation of an event consists of the following, the document will be indexed to the index or data stream named stroom-events-v1-2022
:
<?xml version="1.1" encoding="UTF-8"?>
<array
xmlns="http://www.w3.org/2005/xpath-functions"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/xpath-functions file://xpath-functions.xsd">
<map>
<number key="StreamId">3045516</number>
<number key="EventId">1</number>
<string key="@timestamp">2022-12-16T02:46:29.218Z</string>
<number key="year">2022</number>
</map>
</array>
This is due to the value of /map/number[@key='year']
being 2022
.
This approach can be useful when you need to apply different ILM policies, such as maintaining older data on slower storage tiers.
Warning
Any element names defined inindexName
must exist in the JSON XML (unless it is an empty document).
If a blank value is desired, output an empty string
element.
Note
If an element name begins with_
(underscore), its value is only used for indexName
interpolation, and it is not included in the final JSON.
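As a hypothetical illustration, if indexName were set to stroom-events-v1-{_partition}, a document whose JSON XML contained the following would be indexed to stroom-events-v1-archive, while the _partition value would not appear in the indexed document (_partition is an illustrative element name, not a required one):
<map xmlns="http://www.w3.org/2005/xpath-functions">
  <number key="StreamId">3045516</number>
  <number key="EventId">1</number>
  <string key="@timestamp">2022-12-16T02:46:29.218Z</string>
  <!-- Used only for indexName interpolation; excluded from the indexed JSON -->
  <string key="_partition">archive</string>
</map>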
Other applications for dynamic data stream names
Dynamic data stream names can also help in other scenarios, such as implementing fine-grained retention policies, for example deleting documents that aren’t user-attributed after 12 months.
While Stroom Elastic Index documents
support data retention expressions, deleting documents in Elasticsearch by query is highly inefficient and doesn’t cause disk space to be freed (this requires an index to be force-merged, an expensive operation).
A better solution therefore, is to use dynamic data stream names to partition data and assign certain partitions to specific ILM policies and/or data tiers.
Migrating older data streams to other data tiers
Say a feed is indexed, spanning data from 2020 through 2023.
Assuming most searches only need to query data from the current year, the data streams stroom-events-v1-2020
and stroom-events-v1-2021
can be moved to cold storage.
To achieve this, use
index-level shard allocation filtering
.
In Kibana Dev Tools, execute the following command:
PUT stroom-events-v1-2020,stroom-events-v1-2021/_settings
{
"index.routing.allocation.include._tier_preference": "data_cold"
}
This example assumes a cold data tier has been defined for the cluster. If the command executes successfully, shards from the specified data streams are gradually migrated to the nodes comprising the destination data tier.
Create an indexing translation
In this example, let’s assume you have event data that looks like the following:
<?xml version="1.1" encoding="UTF-8"?>
<Events
xmlns="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.5.2.xsd"
Version="3.5.2">
<Event>
<EventTime>
<TimeCreated>2022-12-16T02:46:29.218Z</TimeCreated>
</EventTime>
<EventSource>
<System>
<Name>Nginx</Name>
<Environment>Development</Environment>
</System>
<Generator>Filebeat</Generator>
<Device>
<HostName>localhost</HostName>
</Device>
<User>
<Id>john.smith1</Id>
<Name>John Smith</Name>
<State>active</State>
</User>
</EventSource>
<EventDetail>
<View>
<Resource>
<URL>http://localhost:8080/index.html</URL>
</Resource>
<Data Name="Tags" Value="dev,testing" />
<Data
Name="Message"
Value="TLSv1.2 AES128-SHA 1.1.1.1 &quot;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&quot;" />
</View>
</EventDetail>
</Event>
<Event>
...
</Event>
</Events>
We need to write an XSL transform (XSLT) to form a JSON document for each stream processed.
Each document must consist of an array
element containing one or more map
elements (each representing an Event
), each with the necessary properties as per our index template.
See XSLT Conversion for instructions on how to write an XSLT.
The output from your XSLT should match the following:
<?xml version="1.1" encoding="UTF-8"?>
<array
xmlns="http://www.w3.org/2005/xpath-functions"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/xpath-functions file://xpath-functions.xsd">
<map>
<number key="StreamId">3045516</number>
<number key="EventId">1</number>
<string key="@timestamp">2022-12-16T02:46:29.218Z</string>
<map key="User">
<string key="Id">john.smith1</string>
<string key="Name">John Smith</string>
<boolean key="Active">true</boolean>
</map>
<array key="Tags">
<string>dev</string>
<string>testing</string>
</array>
<string key="Message">TLSv1.2 AES128-SHA 1.1.1.1 "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"</string>
</map>
<map>
...
</map>
</array>
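The following is a minimal, hypothetical XSLT sketch of the kind of translation that could produce output of this shape. It is deliberately simplified: StreamId and EventId are hard-coded placeholders and the Active flag is omitted; a real translation would populate these using Stroom's XSLT functions and its own mapping logic (see the XSLT Functions documentation).
<?xml version="1.1" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2005/xpath-functions"
    xmlns:evt="event-logging:3"
    exclude-result-prefixes="evt"
    version="2.0">

  <!-- Wrap all events in a single array -->
  <xsl:template match="/evt:Events">
    <array>
      <xsl:apply-templates select="evt:Event" />
    </array>
  </xsl:template>

  <!-- One map per Event -->
  <xsl:template match="evt:Event">
    <map>
      <number key="StreamId">0</number> <!-- placeholder -->
      <number key="EventId">0</number>  <!-- placeholder -->
      <string key="@timestamp">
        <xsl:value-of select="evt:EventTime/evt:TimeCreated" />
      </string>
      <map key="User">
        <string key="Id">
          <xsl:value-of select="evt:EventSource/evt:User/evt:Id" />
        </string>
        <string key="Name">
          <xsl:value-of select="evt:EventSource/evt:User/evt:Name" />
        </string>
      </map>
      <array key="Tags">
        <xsl:for-each select="tokenize(evt:EventDetail/evt:View/evt:Data[@Name='Tags']/@Value, ',')">
          <string><xsl:value-of select="." /></string>
        </xsl:for-each>
      </array>
      <string key="Message">
        <xsl:value-of select="evt:EventDetail/evt:View/evt:Data[@Name='Message']/@Value" />
      </string>
    </map>
  </xsl:template>

</xsl:stylesheet>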
Assign the translation to the indexing pipeline
Having created your translation, you need to reference it in your indexing pipeline.
- Open the pipeline you created.
- Select the
Structure
tab. - Select the XSLTFilter pipeline element.
- Double-click the
xslt
property value cell. - Select the XSLT you created and click .
- Click .
Step the pipeline
At this point, you will want to step the pipeline to ensure there are no errors and that output looks as expected.
Execute the pipeline
Create a pipeline processor and filter to run the pipeline against one or more feeds. Stroom will distribute processing tasks to enabled nodes and send documents to Elasticsearch for indexing.
You can monitor indexing status via your Elasticsearch monitoring tool of choice.
Detecting and handling errors
If any errors occur while a stream is being indexed, an Error
stream is created, containing details of each failure.
Error
streams can be found under the Data
tab of either the indexing pipeline or receiving Feed
.
Note
You can filter the selected pipeline or feed to list only Error
streams.
Click then add a condition Type
=
Error
.
Once you have addressed the underlying cause for a particular type of error (such as an incorrect field mapping), reprocess affected streams:
- Select any
Error
streams relating for reprocessing, by clicking the relevant checkboxes in the stream list (top pane). - Click .
- In the dialog that appears, check
Reprocess data
and click . - Click for any confirmation prompts that follow.
Stroom will re-send data from the selected Event
streams to Elasticsearch for indexing.
Any existing documents matching the StreamId
of the original Event
stream are first deleted automatically to avoid duplication.
Tips and tricks
Use a common schema for your indices
An example is Elastic Common Schema (ECS) . This helps users understand the purpose of each field and makes building cross-index queries simpler by using a set of common fields (such as a user ID).
With this in mind, it is important that common fields also have the same data type in each index. Component templates help make this easier and reduce the chance of error, by centralising the definition of common fields to a single component.
Use a version control system (such as git) to track index and component templates
This helps keep track of changes over time and can be an important resource for both administrators and users.
Rebuilding an index
Sometimes it is necessary to rebuild an index. This could be due to a change in field mapping, shard count or responding to a user feature request.
To rebuild an index:
- Drain the indexing pipeline by deactivating any processor filters and waiting for any running tasks to complete.
- Delete the index or data stream via the Elasticsearch API or Kibana.
- Make the required changes to the index template and/or XSL translation.
- Create a new processor filter either from scratch or using the button.
- Activate the new processor filter.
Use a versioned index naming convention
As with the earlier example stroom-events-v1
, a version number is appended to the name of the index or data stream.
If a new field is added, or some other change occurred requiring the index to be rebuilt, users would experience downtime.
This can be avoided by incrementing the version and performing the rebuild against a new index: stroom-events-v2
.
Users could continue querying stroom-events-v1
until it is deleted.
This approach involves the following steps:
- Create a new Elasticsearch index template targeting the new index name (in this case,
stroom-events-v2
). - Create a copy of the indexing pipeline, targeting the new index in the Elastic Indexing Filter.
- Create and activate a processing filter for the new pipeline.
- Once indexing is complete, update the Elastic Index document to point to
stroom-events-v2
. Users will now be searching against the new index. - Drain any tasks for the original indexing pipeline and delete it.
- Delete index
stroom-events-v1
using either the Elasticsearch API or Kibana.
If you created a data view in Kibana, you’ll also want to update this to point to the new index / data stream.
7.1.4 - Exploring Data in Kibana
Kibana is part of the Elastic Stack and provides users with an interactive, visual way to query, visualise and explore data in Elasticsearch.
It is highly customisable and provides users and teams with tools to create and share dashboards, searches, reports and other content.
Once data has been indexed by Stroom into Elasticsearch, it can be explored in Kibana. You will first need to create a data view in order to query your indices.
Why use Kibana?
There are several use cases that benefit from Kibana:
- Convenient and powerful drag-and-drop charts and other visualisation types using Kibana Lens. Much more performant and easier to customise than built-in Stroom dashboard visualisations.
- Field statistics and value summaries with Kibana Discover. Great for doing initial audit data survey.
- Geospatial analysis and visualisation.
- Search field auto-completion.
- Runtime fields . Good for data exploration, at the cost of performance.
7.2 - Lucene Indexes
Stroom uses Apache Lucene for its built-in indexing solution. Index documents are stored in a Volume .
TODO
Complete this page.
Field configuration
Field Types
Id
- Treated as aLong
.Boolean
- True/False values.Integer
- Whole numbers from -2,147,483,648 to 2,147,483,647.Long
- Whole numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.Float
- Fractional numbers. Sufficient for storing 6 to 7 decimal digits.Double
- Fractional numbers. Sufficient for storing 15 decimal digits.Date
- Date and time values.Text
- Text data.Number
- An alias forLong
.
Stored fields
If a field is Stored then it means the complete field value will be stored in the index. This means the value can be retrieved from the index when building search results rather than using the slower Search Extraction process. Storing field values comes at the cost of higher storage requirements for the index. If storage space is not an issue then storing all fields that you want to return in search results is optimal.
Indexed fields
An Indexed field is one that will be processed by Lucene so that the field can be queried. How the field is indexed will depend on the Field type and the Analyser used.
If you have fields that you do not want to be able to filter (i.e. that you won’t use as a query term) then you can include them as non-Indexed fields. Including a non-indexed field means it will be available for the user to select in the Dashboard table. A non-indexed field would either need to be Stored in the index or added via Search Extraction to be available in the search results.
Positions
If Positions is selected then Lucene will store the positions of all the field terms in the document.
Analyser types
The Analyser determines how Lucene reads the field's value and extracts tokens from it. The choice of Analyser will depend on the data in the field and how you want to search it.
Keyword
- Treats the whole field value as one token. Useful for things like IDs and post codes. Supports the Case Sensitivity setting.Alpha
- Tokenises on any non-letter characters, e.g.one1 two2 three 3
=>one
two
three
. Strips non-letter characters. Supports the Case Sensitivity setting.Numeric
- Tokenises on any non-digit characters.
Alpha numeric
- Tokenises on any non-letter/digit characters, e.g.one1 two2 three 3
=>one1
two2
three
3
. Supports the Case Sensitivity setting.Whitespace
- Tokenises only on white space. Not affected by the Case Sensitivity setting; always case sensitive.
Stop words
- Tokenises based on non-letter characters and removes Stop Words, e.g. and
. Not affected by the Case Sensitivity setting. Case insensitive.Standard
- The most common analyser. Tokenises the value on spaces and punctuation but recognises URLs and email addresses. Removes Stop Words, e.g.and
. Not affected by the Case Sensitivity setting. Case insensitive. e.g.Find Stroom at github.com/stroom
=>Find
Stroom
at
github.com/stroom
.
Stop words
Some of the Analysers use a set of stop words for the tokenisers. This is the list of stop words that will not be indexed.
a
, an
, and
, are
, as
, at
, be
, but
, by
, for
, if
, in
, into
, is
, it
, no
, not
, of
, on
, or
, such
, that
, the
, their
, then
, there
, these
, they
, this
, to
, was
, will
, with
Case sensitivity
Some of the Analyser types support case (in)sensitivity.
For example, if the Analyser supports it, the value TWO two
would either be tokenised as TWO
two
or two
two
.
7.3 - Solr Integration
TODO
Complete this section.
8 - Nodes
All nodes in a Stroom cluster must be configured correctly for them to communicate with each other.
Configuring nodes
Open Monitoring/Nodes from the top menu. The nodes screen looks like this:
TODO
Screenshot
You need to edit each line by selecting it and then clicking the edit icon, setting each node's cluster call URL to http://<HOST_NAME>:8080/stroom/clustercall.rpc
Nodes are expected to communicate with each other on port 8080 over HTTP. Ensure you have configured your firewall to allow nodes to talk to each other over this port. You can configure the URL to use a different port and possibly HTTPS, but performance will be better with HTTP as no SSL termination is required.
Once you have set the URLs of each node you should also set the master assignment priority for each node to be different to all of the others. In the image above the priorities have been set in a random fashion to ensure that node3 assumes the role of master node for as long as it is enabled. You also need to check all of the nodes are enabled that you want to take part in processing or any other jobs.
Keep refreshing the table until all nodes show healthy pings as above. If you do not get ping results for each node then they are not configured correctly.
Once a cluster is configured correctly you will get proper distribution of processing tasks and search will be able to access all nodes to take part in a distributed query.
9 - Pipelines
Stroom uses Pipelines to process its data. A pipeline is a set of pipeline elements connected together. Pipelines are very powerful and flexible and allow the user to transform, index, store and forward data in a wide variety of ways.
Example Pipeline
Pipelines can take many forms and be used for a wide variety of purposes, however a typical pipeline to convert CSV data into cooked events might look like this:
Input Data
Pipelines process data in batches. This batch of data is referred to as a Stream . The input for the pipeline is a single Stream that exists within a Feed and this data is fed into the left-hand side of the pipeline at Source . Pipelines can accept streams from multiple Feeds assuming those feeds contain similar data.
The data in the Stream is always text data (XML, JSON, CSV, fixed-width, etc.) in a known character encoding . Stroom does not currently support processing binary formats.
XML
The working format for pipeline processing is XML (with the exception of raw streaming). Data can be input and output in other forms, e.g. JSON, CSV, fixed-width, etc. but the majority of pipelines do most of their processing in XML. Input data is converted into XML SAX events, processed using XSLT to transform it into different shapes of XML then either consumed as XML (e.g. an IndexingFilter ) or converted into a desired output format for storage/forwarding.
Forks
Pipelines can also be forked at any point in the pipeline. This allows the same data to be processed in different ways.
Note
Rather than creating complicated pipelines with forks, it is sometimes better to create multiple pipelines as this makes it easier to handle errors in one fork of the processing. It also makes it easier to re-use common simple pipelines. For example, if you have a pipeline to transform CSV events into normalised XML then index it and forward it to a remote server, it may be better to have a pipeline to cook the events, then a common one to index those XML events and one to forward XML events.
Pipeline Inheritance
It is possible for pipelines to inherit from other pipelines. This allows for the creation of standard abstract pipelines with a set structure, though not fully configured, that can be inherited by many concrete pipelines.
For example you may have a standard pipeline for indexing XML events, i.e. read XML data and pass it to an IndexingFilter , but the IndexingFilter is not configured with the actual Index to send documents to. A pipeline that inherits this one can then be simply configured with the Index to use.
Pipeline inheritance allows for changes to the inherited structure, e.g. adding additional elements in line. Multi level inheritance is also supported.
Pipeline Element Types
Stroom has a number of categories of pipeline element.
Reader
Readers are responsible for reading the raw bytes of the input data and converting it to character data using the Feed's character encoding. They also provide functionality to modify the data before or after it is decoded to characters, e.g. Byte Order Mark removal, or doing find/replace on the character data. You can chain multiple Readers.
Parser
A parser is designed to convert the character data into XML for processing. For example, the JSONParser will use a JSON parser to read the character data as JSON and convert it into XML elements and attributes that represent the JSON structure, so that it can be transformed downstream using XSLT.
Parsers have a built in reader so if they are not preceded by a Reader they will decode the raw bytes into character data before parsing.
Filter
A filter is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and can either return those events unchanged or modify them. An example of a Filter is the XSLTFilter element. Multiple filters can be chained, with each one consuming the events output by the one preceding it, so you can have lots of common reusable XSLTFilters that each make small incremental changes to a document.
Writer
A writer is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and converts them into encoded character data (using a specified encoding) of some form. The preceding filter may have been an XSLTFilter which transformed XML into plain text, in which case only character data events will be output and a TextWriter can just write these out as text data. Other writers handle the XML SAX events to convert them into another format, e.g. the JSONWriter, before encoding them as character data.
Destination
A destination element is a consumer of character data, as produced by a writer. A typical destination is a StreamAppender that writes the character data (which may be XML, JSON, CSV, etc.) to a new Stream in Stroom’s stream store. Other destinations can be used to send the encoded character data to Kafka, to a file on a file system, or to forward it to an HTTP URL.
9.1 - Pipeline Recipes
The following are a basic set of pipeline recipes for doing typical tasks in Stroom. It is not an exhaustive list as the possibilities with Pipelines are vast. They are intended as a rough guide to get you started with building Pipelines.
Data Ingest and Transformation
CSV to Normalised XML
- CSV data is ingested.
- The Data Splitter parser parses the records and fields into records format XML using an XML based TextConverter document.
- The first XSLTFilter is used to normalise the events in records XML into event-logging XML.
- The second XSLTFilter is used to decorate the events with additional data, e.g. <UserDetails> using reference data lookups.
- The SchemaFilter ensures that the XML output by the stages of XSLT transformation conforms to the event-logging XMLSchema.
- The XML events are then written out as an Event Stream to the Stream store.
Configured Content
- Data Splitter - A TextConverter containing XML conforming to data-splitter:3.
- Normalise - An XSLT transforming records:2 => event-logging:3.
- Decorate - An XSLT transforming event-logging:3 => event-logging:3.
- SchemaFilter - XMLSchema event-logging:3
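As a rough illustration only (the Data Splitter sections earlier in this guide cover the syntax in detail), a minimal TextConverter for simple comma separated data without a heading row might look like the sketch below; the element and field structure will vary with your data.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" version="3.0">
  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">
    <!-- Pass each matched line (group 1, i.e. the line without its delimiter) to the sub expressions -->
    <group value="$1">
      <!-- Match each comma delimited value within the line -->
      <split delimiter=",">
        <!-- Output each matched value as a data element -->
        <data value="$1"/>
      </split>
    </group>
  </split>
</dataSplitter>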
JSON to Normalised XML
The same as ingesting CSV data above, except the input JSON is converted into an XML representation of the JSON by the JSONParser. The Normalise XSLTFilter will be specific to the format of the JSON being ingested. The Decorate XSLTFilter will likely be identical to that used for the CSV ingest above, demonstrating reuse of pipeline element content.
Configured Content
- Normalise - An XSLT transforming http://www.w3.org/2013/XSL/json => event-logging:3.
- Decorate - An XSLT transforming event-logging:3 => event-logging:3.
- SchemaFilter - XMLSchema event-logging:3
XML (not event-logging) to Normalised XML
As above except that the input data is already XML, though not in event-logging format.
The XMLParser simply reads the XML character data and converts it to XML SAX events for processing.
The Normalise XSLTFilter will be specific to the format of this XML and will transform it into event-logging format.
Configured Content
- Normalise - An XSLT transforming a 3rd party schema => event-logging:3.
- Decorate - An XSLT transforming event-logging:3 => event-logging:3.
- SchemaFilter - XMLSchema event-logging:3
XML (event-logging) to Normalised XML
As above except that the input data is already in event-logging XML format, so no normalisation is required. Decoration is still needed though.
Configured Content
- Decorate - An XSLT transforming event-logging:3 => event-logging:3.
- SchemaFilter - XMLSchema event-logging:3
XML Fragments to Normalised XML
XML Fragments are where the input data looks like:
<Event>
...
</Event>
<Event>
...
</Event>
In other words, it is technically badly formed XML as it has no root element or declaration.
This format is however easier for client systems to send as they can send multiple <Event> blocks in one stream (e.g. just appending them together in a rolled log file) but don’t need to wrap them with an outer <Events> element.
The XMLFragmentParser understands this format and will add the wrapping element to make well-formed XML.
If the XML fragments are already in event-logging format then no Normalise XSLTFilter is required.
Configured Content
- XMLFragParser - Content similar to:
<?xml version="1.1" encoding="utf-8"?>
<!DOCTYPE Records [
<!ENTITY fragment SYSTEM "fragment">
]>
<Events
xmlns="event-logging:3"
xmlns:stroom="stroom"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.4.2.xsd"
Version="3.4.2">
&fragment;
</Events>
- Decorate - An XSLT transforming event-logging:3 => event-logging:3.
- SchemaFilter - XMLSchema event-logging:3
Handling Malformed Data
Cleaning Malformed XML data
In some cases client systems may send XML containing characters that are not supported by the XML standard. These can be removed using the InvalidXMLCharFilterReader .
The input data may also be known to contain other sets of characters that will cause problems in processing. The FindReplaceFilter can be used to remove/replace either a fixed string or a Regex pattern.
Raw Streaming
In cases where you want to export the raw (or cooked) data from a feed you can have a very simple pipeline that pipes the source data directly to an appender. This may be so that the raw data can be ingested into another system for analysis. In this case the data is being written to disk using a file appender.
Note
Be careful when specifying the directory structure for the FileAppender so that you don’t end up with too many files in one folder, which can cause some OS issues.
Indexing
XML to Stroom Lucene Index
This use case is for indexing XML event data that had already been normalised using one of the ingest pipelines above.
The XSLTFilter is used to transform the event into records format, extracting the fields to be indexed from the event.
The IndexingFilter reads the records XML and loads each one into Stroom’s internal Lucene index.
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => records:2.
- SchemaFilter - XMLSchema records:2
The records:2 XML looks something like this, with each <data> element representing an indexed field value.
<?xml version="1.1" encoding="UTF-8"?>
<records
xmlns="records:2"
xmlns:stroom="stroom"
xmlns:sm="stroom-meta"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="2.0">
<record>
<data name="StreamId" value="1997" />
<data name="EventId" value="1" />
<data name="Feed" value="MY_FEED" />
<data name="EventTime" value="2010-01-01T00:00:00.000Z" />
<data name="System" value="MySystem" />
<data name="Generator" value="CSV" />
<data name="IPAddress" value="1.1.1.1" />
<data name="UserId" analyzer="KEYWORD" value="user1" />
<data name="Action" value="Authenticate" />
<data name="Description" value="Some message 1" />
</record>
</records>
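The XSLT used to produce such records is specific to your data and Index fields, but a minimal sketch (with purely illustrative field names, which would need to match the fields defined in your Index) might look like this:
<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet
  xpath-default-namespace="event-logging:3"
  xmlns="records:2"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  <!-- Wrap all output in a records root element -->
  <xsl:template match="Events">
    <records version="2.0">
      <xsl:apply-templates select="Event" />
    </records>
  </xsl:template>
  <!-- Emit one record per event with the fields to be indexed -->
  <xsl:template match="Event">
    <record>
      <data name="EventTime" value="{EventTime/TimeCreated}" />
      <data name="UserId" value="{EventSource/User/Id}" />
      <data name="Description" value="{EventDetail/Description}" />
    </record>
  </xsl:template>
</xsl:stylesheet>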
XML to Stroom Lucene Index (Dynamic)
Dynamic indexing in Stroom allows you to use the XSLT to define the fields that are being indexed and how each field should be indexed. This avoids having to define all the fields up front in the Index. The pipeline uses a DynamicIndexingFilter, and rather than transforming the event into records:2 XML, it is transformed into index-documents:1 XML as shown in the example below.
<?xml version="1.1" encoding="UTF-8"?>
<index-documents
xmlns="index-documents:1"
xmlns:stroom="stroom"
xmlns:sm="stroom-meta"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="index-documents:1 file://index-documents-v1.0.xsd"
version="1.0">
<document>
<field><name>StreamId</name><type>Id</type><indexed>true</indexed><stored>true</stored><value>1997</value></field>
<field><name>EventId</name><type>Id</type><indexed>true</indexed><stored>true</stored><value>1</value></field>
<field><name>Feed</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>MY_FEED</value></field>
<field><name>EventTime</name><type>Date</type><indexed>true</indexed><value>2010-01-01T00:00:00.000Z</value></field>
<field><name>System</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>MySystem</value></field>
<field><name>Generator</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>CSV</value></field>
<field><name>IPAddress</name><type>Text</type><indexed>true</indexed><value>1.1.1.1</value></field>
<field><name>UserId</name><type>Text</type><indexed>true</indexed><value>user1</value></field>
<field><name>Action</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>Authenticate</value></field>
<field><name>Description</name><type>Text</type><analyser>Alpha numeric</analyser><indexed>true</indexed><value>Some message 1</value></field>
</document>
</index-documents>
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => index-documents:1.
- SchemaFilter - XMLSchema index-documents:1
XML to an Elastic Search Index
This use case is for indexing XML event data that had already been normalised using one of the ingest pipelines above.
The XSLTFilter is used to transform the event into records format, extracting the fields to be indexed from the event.
The ElasticIndexingFilter reads the records XML and loads each one into an external Elasticsearch index.
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => records:2.
- SchemaFilter - XMLSchema records:2
Search Extraction
Search extraction is the process of combining the data held in the index with data obtained from the original indexed document, i.e. the event. Search extraction is useful when you do not want to store the whole of an event in the index (to reduce storage used) but still want to be able to access all the event data in a Dashboard/View. An extraction pipeline is required to combine data in this way. Search extraction pipelines are referenced in Dashboard and View settings.
Standard Lucene Index Extraction
This is a non-dynamic search extraction pipeline for a Lucene index.
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => records:2.
Dynamic Lucene Index Extraction
This is a dynamic search extraction pipeline for a Lucene index.
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => index-documents:1.
Data Egress
XML to CSV File
A recipe for writing normalised XML events (as produced by an ingest pipeline above) to a file, but in a flat file format like CSV. The XSLTFilter transforms the events XML into CSV data with XSLT that includes this output declaration:
<xsl:output method="text" omit-xml-declaration="yes" indent="no"/>
The TextWriter converts the XML character events into a stream of characters encoded using the desired output character encoding. The data is appended to a file on a file system, with one file per Stream.
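A minimal sketch of such an XSLT (the choice and order of fields is purely illustrative) could be:
<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet
  xpath-default-namespace="event-logging:3"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  <xsl:output method="text" omit-xml-declaration="yes" indent="no"/>
  <!-- Emit one comma separated line per event -->
  <xsl:template match="Event">
    <xsl:value-of select="concat(EventTime/TimeCreated, ',', EventSource/User/Id, ',', EventDetail/TypeId)" />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <!-- Suppress the default copying of all other text nodes -->
  <xsl:template match="text()" />
</xsl:stylesheet>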
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => schemaless plain text.
- SchemaFilter - XMLSchema records:2
XML to JSON Rolling File
This is similar to the above recipe for writing out CSV, except that the XSLTFilter converts the event XML into XML conforming to the http://www.w3.org/2013/XSL/json XMLSchema. The JSONWriter can read this format of XML and convert it into JSON using the desired character encoding. The RollingFileAppender will append the encoded JSON character data to a file on the file system that is rolled based on a size/time threshold.
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => http://www.w3.org/2013/XSL/json.
- SchemaFilter - XMLSchema http://www.w3.org/2013/XSL/json.
XML to HTTP Destination
This recipe is for sending normalised XML events to another system over HTTP. The HTTPAppender is configured with the URL and any TLS certificates/keys/credentials.
Reference Data
Reference Loader
A typical pipeline for loading XML reference data (conforming to the reference-data:2 XMLSchema) into the reference data store.
The ReferenceDataFilter reads the reference-data:2 format data and loads each entry into the appropriate map in the store.
As an example, the reference-data:2 XML for mapping userIDs to staff numbers looks something like this:
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:evt="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.1.xsd"
version="2.0.1">
<reference>
<map>USER_ID_TO_STAFF_NO_MAP</map>
<key>user1</key>
<value>staff1</value>
</reference>
<reference>
<map>USER_ID_TO_STAFF_NO_MAP</map>
<key>user2</key>
<value>staff2</value>
</reference>
...
</referenceData>
Statistics
This recipe converts normalised XML data into statistic events (conforming to the statistics:2 XMLSchema).
Stroom’s Statistic Stores are a way to store aggregated counts or averaged values over time periods.
For example you may want counts of certain types of event, aggregated over fixed time buckets.
Each XML event is transformed using the XSLTFilter to either return no output or a statistic event.
An example of statistics:2 data for two statistic events is:
<?xml version="1.1" encoding="UTF-8"?>
<statistics
xmlns="statistics:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="statistics:2 file://statistics-v2.0.xsd">
<statistic>
<time>2023-12-22T00:00:00.000Z</time>
<count>1</count>
<tags>
<tag name="user" value="user1" />
</tags>
</statistic>
<statistic>
<time>2023-12-23T00:00:00.000Z</time>
<count>5</count>
<tags>
<tag name="user" value="user6" />
</tags>
</statistic>
</statistics>
Configured Content
- XSLTFilter - An XSLT transforming event-logging:3 => statistics:2.
- SchemaFilter - XMLSchema statistics:2.
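As an illustrative sketch only, an XSLT that emits one statistic per authentication event (assuming event-logging:3 input containing Authenticate elements; the map of events to statistics will depend on your own rules) might look like this:
<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet
  xpath-default-namespace="event-logging:3"
  xmlns="statistics:2"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  <xsl:template match="Events">
    <statistics>
      <xsl:apply-templates select="Event" />
    </statistics>
  </xsl:template>
  <!-- Emit a statistic for each authentication event -->
  <xsl:template match="Event[EventDetail/Authenticate]">
    <statistic>
      <time><xsl:value-of select="EventTime/TimeCreated" /></time>
      <count>1</count>
      <tags>
        <tag name="user" value="{EventSource/User/Id}" />
      </tags>
    </statistic>
  </xsl:template>
  <!-- All other events produce no output -->
  <xsl:template match="Event" />
</xsl:stylesheet>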
9.2 - Parser
The following capabilities are available to parse input data:
- XML - XML input can be parsed with the XML parser.
- XML Fragment - Treat input data as an XML fragment, i.e. XML that does not have an XML declaration or root elements.
- Data Splitter - Delimiter and regular expression based language for turning non XML data into XML (e.g. CSV)
9.2.1 - XML Fragments
Some input XML data may be missing an XML declaration and root level enclosing elements. This data is not a valid XML document and must be treated as an XML fragment. To use XML fragments the input type for a translation must be set to ‘XML Fragment’. A fragment wrapper must be defined in the XML conversion that tells Stroom what declaration and root elements to place around the XML fragment data.
Here is an example:
<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE records [
<!ENTITY fragment SYSTEM "fragment">
]>
<records
xmlns="records:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="records:2 file://records-v2.0.xsd"
version="2.0">
&fragment;
</records>
During conversion Stroom replaces the fragment text entity with the input XML fragment data. Note that XML fragments must still be well formed so that they can be parsed correctly.
9.3 - XSLT Conversion
XSLT is a language that is typically used for transforming XML documents into either a different XML document or plain text.
XSLT is a key part of Stroom’s pipeline processing as it is used to normalise bespoke events into a common XML audit event document conforming to the event-logging XML Schema.
Once a text file has been converted into intermediary XML (or the feed is already XML), XSLT is used to translate the XML into the event-logging XML format.
The XSLTFilter pipeline element defines the XSLT document and is used to do the transformation of the input XML into XML or plain text. You can have multiple XSLTFilter elements in a pipeline if you want to break the transformation into steps, or wish to have simpler XSLTs that can be reused.
Raw Event Feeds are typically translated into the event-logging:3 schema and Raw Reference into the reference-data:2 schema.
9.3.1 - XSLT Basics
XSLT is a very powerful language and allows the user to perform very complex transformations of XML data. This documentation does not aim to document how to write XSLT documents; for that we strongly recommend you refer to online references (e.g. W3Schools ) or obtain a book covering XSLT 2.0 and XPath. It does however aim to document aspects of XSLT that are specific to the use of XSLT in Stroom.
Examples
Event Normalisation
Here is an example XSLT document that transforms XML data in the records:2 namespace (which is the output of the DSParser element) into event XML in the event-logging:3 namespace.
It is an example of event normalisation from a bespoke format.
Warning
This example aims to show some typical uses of XSLT in a typical Stroom use case. It does not necessarily represent best practice in terms of creation of a normalised event.
<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet
xpath-default-namespace="records:2"
xmlns="event-logging:3"
xmlns:stroom="stroom"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="2.0">
<!-- Match the root element -->
<xsl:template match="records">
<Events
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd"
Version="3.0.0">
<xsl:apply-templates />
</Events>
</xsl:template>
<!-- Match each 'record' element -->
<xsl:template match="record">
<xsl:variable name="user" select="data[@name='User']/@value" />
<Event>
<xsl:call-template name="header" />
<xsl:value-of select="stroom:log('info', concat('Processing user: ', $user))"/>
<EventDetail>
<TypeId>0001</TypeId>
<Description>
<xsl:value-of select="data[@name='Message']/@value" />
</Description>
<Authenticate>
<Action>Logon</Action>
<LogonType>Interactive</LogonType>
<User>
<Id>
<xsl:value-of select="$user" />
</Id>
</User>
<Data Name="FileNo">
<xsl:attribute name="Value" select="data[@name='FileNo']/@value" />
</Data>
<Data Name="LineNo">
<xsl:attribute name="Value" select="data[@name='LineNo']/@value" />
</Data>
</Authenticate>
</EventDetail>
</Event>
</xsl:template>
<xsl:template name="header">
<xsl:variable name="date" select="data[@name='Date']/@value" />
<xsl:variable name="time" select="data[@name='Time']/@value" />
<xsl:variable name="dateTime" select="concat($date, $time)" />
<xsl:variable name="formattedDateTime" select="stroom:format-date($dateTime, 'dd/MM/yyyyHH:mm:ss')" />
<xsl:variable name="user" select="data[@name='User']/@value" />
<EventTime>
<TimeCreated>
<xsl:value-of select="$formattedDateTime" />
</TimeCreated>
</EventTime>
<EventSource>
<System>
<Name>Test</Name>
<Environment>Test</Environment>
</System>
<Generator>CSV</Generator>
<Device>
<IPAddress>1.1.1.1</IPAddress>
<MACAddress>00-00-00-00-00-00</MACAddress>
<xsl:variable name="location" select="stroom:lookup('FILENO_TO_LOCATION_MAP', data[@name='FileNo']/@value, $formattedDateTime)" />
<xsl:if test="$location">
<xsl:copy-of select="$location" />
</xsl:if>
<Data Name="Zone1">
<xsl:attribute name="Value" select="stroom:lookup('IPToLocation', stroom:numeric-ip('192.168.1.1'))" />
</Data>
</Device>
<User>
<Id>
<xsl:value-of select="$user" />
</Id>
</User>
</EventSource>
</xsl:template>
</xsl:stylesheet>
Reference Data
Here is an example of transforming Reference Data in the records:2 namespace (which is the output of the DSParser element) into XML in the reference-data:2 namespace that is suitable for loading using the ReferenceDataFilter.
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xpath-default-namespace="records:2"
xmlns="reference-data:2"
xmlns:evt="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:template match="records">
<referenceData
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.1.xsd event-logging:3 file://event-logging-v3.0.0.xsd"
version="2.0.1">
<xsl:apply-templates/>
</referenceData>
</xsl:template>
<xsl:template match="record">
<reference>
<map>USER_ID_TO_STAFF_NO_MAP</map>
<key><xsl:value-of select="data[@name='userId']/@value"/></key>
<value><xsl:value-of select="data[@name='staffNo']/@value"/></value>
</reference>
</xsl:template>
</xsl:stylesheet>
Identity Transformation
If you want an XSLT to decorate an Events XML document with some additional data or to change it slightly without changing its namespace then a good starting point is the identity transformation.
<xsl:stylesheet
version="1.0"
xpath-default-namespace="event-logging:3"
xmlns="event-logging:3"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Match Root Object -->
<xsl:template match="Events">
<Events
xmlns="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.4.2.xsd"
Version="3.4.2">
<xsl:apply-templates />
</Events>
</xsl:template>
<!-- Whenever you match any node or any attribute -->
<xsl:template match="node( )|@*">
<!-- Copy the current node -->
<xsl:copy>
<!-- Including any attributes it has and any child nodes -->
<xsl:apply-templates select="@*|node( )" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This XSLT will copy every node and attribute as they are, returning the input document completely unchanged.
You can then add additional templates to match on specific elements and modify them, for example decorating a user’s UserDetails elements with values obtained from a reference data lookup on a user ID.
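For example, assuming a hypothetical reference map named USER_ID_TO_USER_DETAILS containing UserDetails XML fragments, and with xmlns:stroom="stroom" added to the stylesheet, a template like this could be added to the identity skeleton to decorate each User element:
<!-- Decorate each User element with details looked up from reference data -->
<xsl:template match="User">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
    <!-- USER_ID_TO_USER_DETAILS is a hypothetical map name -->
    <xsl:copy-of select="stroom:lookup('USER_ID_TO_USER_DETAILS', Id)" />
  </xsl:copy>
</xsl:template>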
Note
You can insert this identity skeleton into an XSLT editor using this editor snippet.
<xsl:message>
Stroom supports the standard <xsl:message> element from the http://www.w3.org/1999/XSL/Transform namespace.
This element behaves in a similar way to the stroom:log() XSLT function.
The element text is logged to the Error stream with a default severity of ERROR.
A child element can optionally be used to set the severity level (one of FATAL|ERROR|WARN|INFO).
The namespace of this child element does not matter.
You can also set the attribute terminate="yes" to log the message at severity FATAL and halt processing of that stream part.
If the stream is multi-part then processing will continue with the next part.
Note
Setting terminate="yes" will trump any severity defined by a child element. It will always be logged at FATAL.
The following are some examples of using <xsl:message>.
<!-- Log a message using default severity of ERROR -->
<xsl:message>Invalid length</xsl:message>
<!-- terminate="yes" means log the message as a FATAL ERROR and halt processing of the stream part -->
<xsl:message terminate="yes">Invalid length</xsl:message>
<!-- Log a message with a child element name specifying the severity. -->
<xsl:message>
<warn>Invalid length</warn>
</xsl:message>
<!-- Log a message with a child element name specifying the severity. -->
<xsl:message>
<info>Invalid length</info>
</xsl:message>
<!-- Log a message, specifying the severity and using a dynamic value. -->
<xsl:message>
<info>
<xsl:value-of select="concat('User ID ', $userId, ' is invalid')" />
</info>
</xsl:message>
9.3.2 - XSLT Functions
By including the following namespace:
xmlns:stroom="stroom"
E.g.
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="event-logging:3"
xmlns:stroom="stroom"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="2.0">
The following functions are available to aid your translation:
- bitmap-lookup(String map, String key) - Bitmap based look up against reference data map using the period start time
- bitmap-lookup(String map, String key, String time) - Bitmap based look up against reference data map using a specified time, e.g. the event time
- bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings) - Bitmap based look up against reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup
- bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace) - Bitmap based look up against reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup and get trace information for the path taken to resolve the lookup.
- cidr-to-numeric-ip-range() - Converts a CIDR IP address range to an array of numeric IP addresses representing the start and end addresses of the range.
- classification() - The classification of the feed for the data being processed
- col-from() - The column in the input that the current record begins on (can be 0).
- col-to() - The column in the input that the current record ends at.
- current-time() - The current system time
- current-user() - The current user logged into Stroom (only relevant for interactive use, e.g. search)
- decode-url(String encodedUrl) - Decode the provided URL.
- dictionary(String name) - Loads the contents of the named dictionary for use within the translation
- encode-url(String url) - Encode the provided URL.
- feed-attribute(String attributeKey) - NOTE: This function is deprecated, use meta(String key) instead. The value for the supplied feed attributeKey.
- feed-name() - Name of the feed for the data being processed
- fetch-json(String url) - Simplistic version of http-call that sends a request to the passed url and converts the JSON response body to XML using json-to-xml. Currently does not support SSL configuration like http-call does.
- format-date(String date, String pattern) - Format a date that uses the specified pattern using the default time zone
- format-date(String date, String pattern, String timeZone) - Format a date that uses the specified pattern with the specified time zone
- format-date(String date, String patternIn, String timeZoneIn, String patternOut, String timeZoneOut) - Parse a date with the specified input pattern and time zone and format the output with the specified output pattern and time zone
- format-date(String milliseconds) - Format a date that is specified as a number of milliseconds since a standard base time known as “the epoch”, namely January 1, 1970, 00:00:00 GMT
- get(String key) - Returns the value associated with a key that has been stored in a map using the put() function. The map is in the scope of the current pipeline process so values do not live after the stream has been processed.
- hash(String value) - Hash a string value using the default SHA-256 algorithm and no salt
- hash(String value, String algorithm, String salt) - Hash a string value using the specified hashing algorithm and supplied salt value. Supported hashing algorithms include SHA-256, SHA-512, MD5.
- hex-to-dec(String hex) - Convert hex to dec representation.
- hex-to-oct(String hex) - Convert hex to oct representation.
- hex-to-string(String hex, String textEncoding) - Convert hex to string using the specified text encoding.
- host-address(String hostname) - Convert a hostname into an IP address.
- host-name(String ipAddress) - Convert an IP address into a hostname.
- http-call(String url, String headers, String mediaType, String data, String clientConfig) - Makes an HTTP(S) request to a remote server.
- ip-in-cidr(String ipAddress, String cidr) - Return whether an IPv4 address is within the specified CIDR (e.g. 192.168.1.0/24).
- json-to-xml(String json) - Returns an XML representation of the supplied JSON value for use in XPath expressions
- line-from() - The line in the input that the current record begins on (1 based).
- line-to() - The line in the input that the current record ends at.
- link(String url) - Creates a stroom dashboard table link.
- link(String title, String url) - Creates a stroom dashboard table link.
- link(String title, String url, String type) - Creates a stroom dashboard table link.
- log(String severity, String message) - Logs a message to the processing log with the specified severity
- lookup(String map, String key) - Look up a reference data map using the period start time
- lookup(String map, String key, String time) - Look up a reference data map using a specified time, e.g. the event time
- lookup(String map, String key, String time, Boolean ignoreWarnings) - Look up a reference data map using a specified time, e.g. the event time, and ignore any warnings generated by a failed lookup
- lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace) - Look up a reference data map using a specified time, e.g. the event time, ignore any warnings generated by a failed lookup and get trace information for the path taken to resolve the lookup.
- meta(String key) - Lookup a meta data value for the current stream using the specified key. The key can be Feed, StreamType, CreatedTime, EffectiveTime, Pipeline or any other attribute supplied when the stream was sent to Stroom, e.g. meta(‘System’).
- meta-keys() - Returns an array of meta keys for the current stream. Each key can then be used to retrieve its corresponding meta value, by calling meta($key).
- numeric-ip(String ipAddress) - Convert an IP address to a numeric representation for range comparison
- part-no() - The current part within a multi part aggregated input stream (AKA the substream number) (1 based)
- parse-uri(String URI) - Returns an XML structure of the URI providing authority, fragment, host, path, port, query, scheme, schemeSpecificPart, and userInfo components if present.
- pipeline-name() - Get the name of the pipeline currently processing the stream.
- pointIsInsideXYPolygon(Number xPos, Number yPos, Number[] xPolyData, Number[] yPolyData) - Returns true if the specified point is inside the specified polygon.
- random() - Get a system generated random number between 0 and 1.
- record-no() - The current record number within the current part (substream) (1 based).
- search-id() - Get the id of the batch search when a pipeline is processing as part of a batch search
- source() - Returns an XML structure with the stroom-meta namespace detailing the current source location.
- source-id() - Get the id of the current input stream that is being processed
- stream-id() - An alias for source-id included for backward compatibility.
- pipeline-name() - Name of the current processing pipeline using the XSLT
- put(String key, String value) - Store a value for use later on in the translation
bitmap-lookup()
The bitmap-lookup() function looks up a bitmap key against reference or context data and, for each set bit position, retrieves a value (which can be an XML node set) and adds it to the resultant XML.
bitmap-lookup(String map, String key)
bitmap-lookup(String map, String key, String time)
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings)
bitmap-lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace)
- map - The name of the reference data map to perform the lookup against.
- key - The bitmap value to lookup. This can either be represented as a decimal integer (e.g. 14) or as hexadecimal by prefixing with 0x (e.g. 0xE).
- time - Determines which set of reference data was effective at the requested time. If no reference data exists with an effective time before the requested time then the lookup will fail. Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
- ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
- trace - If true, additional trace information is output as INFO messages.
If the look up fails no result will be returned.
The key is a bitmap expressed as either a decimal integer or a hexadecimal value, e.g. 14/0xE is 1110 as a binary bitmap.
For each bit position that is set (i.e. has a binary value of 1) a lookup will be performed using that bit position as the key.
In this example, positions 1, 2 & 3 are set so a lookup would be performed for these bit positions.
The results of each lookup for the bitmap are concatenated together in bit position order, separated by a space.
If ignoreWarnings is true then any lookup failures will be ignored and it will return the value(s) for the bit positions it was able to look up.
This function can be useful when you have a set of values that can be represented as a bitmap and you need them to be converted back to individual values. For example if you have a set of additive account permissions (e.g Admin, ManageUsers, PerformExport, etc.), each of which is associated with a bit position, then a user’s permissions could be defined as a single decimal/hex bitmap value. Thus a bitmap lookup with this value would return all the permissions held by the user.
For example the reference data store may contain:
Key (Bit position) | Value |
---|---|
0 | Administrator |
1 | Manage_Users |
2 | Perform_Export |
3 | View_Data |
4 | Manage_Jobs |
5 | Delete_Data |
6 | Manage_Volumes |
The following are example lookups using the above reference data:
Lookup Key (decimal) | Lookup Key (Hex) | Bitmap | Result |
---|---|---|---|
0 | 0x0 | 0000000 | - |
1 | 0x1 | 0000001 | Administrator |
74 | 0x4A | 1001010 | Manage_Users View_Data Manage_Volumes |
2 | 0x2 | 0000010 | Manage_Users |
96 | 0x60 | 1100000 | Delete_Data Manage_Volumes |
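As an illustrative usage against a permissions map like the one above (the map name, field name and variable are assumptions):
<!-- Look up all permission names for the user's permission bitmap value -->
<Permissions>
  <xsl:value-of select="stroom:bitmap-lookup('USER_PERMISSIONS_MAP', data[@name='PermissionBits']/@value, $formattedDateTime)" />
</Permissions>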
cidr-to-numeric-ip-range()
Converts a CIDR IP address range to an array of numeric IP addresses representing the start and end (broadcast) of the range.
When storing the result in a variable, ensure you indicate the type as a string array (xs:string*), as shown in the below example.
Example XSLT
<xsl:variable name="range" select="stroom:cidr-to-numeric-ip-range('192.168.1.0/24')" as="xs:string*" />
<Range>
<Start><xsl:value-of select="$range[1]" /></Start>
<End><xsl:value-of select="$range[2]" /></End>
</Range>
Example output
<Range>
<Start>3232235776</Start>
<End>3232236031</End>
</Range>
dictionary()
The dictionary() function gets the contents of the specified dictionary for use during translation. The main use for this function is to allow users to abstract the management of a set of keywords from the XSLT so that it is easier for some users to make quick alterations to a dictionary that is used by some XSLT, without the need for the user to understand the complexities of XSLT.
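For example, the contents of a hypothetical dictionary named 'Privileged Users' could be used in a simple keyword test like this (the $userId variable is an assumption):
<!-- Load the dictionary contents and test whether the user ID appears anywhere in it -->
<xsl:variable name="privilegedUsers" select="stroom:dictionary('Privileged Users')" />
<xsl:if test="contains($privilegedUsers, $userId)">
  <Data Name="Privileged" Value="true" />
</xsl:if>
This is a simple substring match; for exact matching the dictionary contents could first be split into individual entries using tokenize().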
format-date()
The format-date() function combines parsing and formatting of date strings. In its simplest form it will parse a date string and return the parsed date in the XML standard Date Format. It also supports supplying a custom format pattern to output the parsed date in a specified format.
Function Signatures
The following are the possible forms of the format-date
function.
<!-- Convert time in millis to standard date format -->
format-date(long millisSinceEpoch)
<!-- Convert inputDate to standard date format -->
format-date(String inputDate, String inputPattern)
<!-- Convert inputDate to standard date format using specified input time zone -->
format-date(String inputDate, String inputPattern, String inputTimeZone)
<!-- Convert inputDate to a custom date format using optional input time zone inputTimeZone -->
format-date(String inputDate, String inputPattern, String inputTimeZone, String outputPattern)
<!-- Convert inputDate to a custom date format using optional input time zone and a specified output time zone -->
format-date(String inputDate, String inputPattern, String inputTimeZone, String outputPattern, String outputTimeZone)
- millisSinceEpoch - The date/time expressed as the number of milliseconds since the UNIX epoch.
- inputDate - The input date string, e.g. 2009/08/01 12:34:11.
- inputPattern - The pattern that defines the structure of inputDate (see Custom Date Formats).
- inputTimeZone - Optional time zone of the inputDate. If null then the UTC/Zulu time zone will be used. If inputTimeZone is present, the inputPattern must not include the time zone pattern tokens (z and Z).
- outputPattern - The pattern that defines the format of the output date (see Custom Date Formats).
- outputTimeZone - Optional time zone of the output date. If null then the UTC/Zulu time zone will be used.
Time Zones
The following is a list of some common time zone values:
Values | Zone Name |
---|---|
GMT/BST | A Stroom specific value for UK daylight saving time (see below) |
UTC, UCT, Zulu, Universal, +00:00, -00:00, +00, +0 | Coordinated Universal Time (UTC) |
GMT, GMT0, Greenwich | Greenwich Mean Time (GMT) |
GB, GB-Eire, Europe/London | British Time |
NZ, Pacific/Auckland | New Zealand Time |
Australia/Canberra, Australia/Sydney | Eastern Australia Time |
CET | Central European Time |
EET | Eastern European Time |
Canada/Atlantic | Atlantic Time |
Canada/Central | Central Time |
Canada/Pacific | Pacific Time |
US/Central | Central Time |
US/Eastern | Eastern Time |
US/Mountain | Mountain Time |
US/Pacific | Pacific Time |
+02:00, +02, +2 | UTC +2hrs |
-03:00, -03, -3 | UTC -3hrs |
A special time zone value of GMT/BST can be used when the inputDate is in local UK wall clock time with no time zone information.
In this case, the date/time will be used to determine whether the date is in British Summer Time or in GMT and adjust the output accordingly.
See the examples below.
Parsing Examples
The following shows various example calls to stroom:format-date() with their output.
<!-- Date in millis since UNIX epoch -->
stroom:format-date('1269270011640')
-> '2010-03-22T15:00:11.640Z'
<!-- Simple date UK style date -->
stroom:format-date('29/08/24', 'dd/MM/yy')
-> '2024-08-29T00:00:00.000Z'
<!-- Simple date US style date -->
stroom:format-date('08/29/24', 'MM/dd/yy')
-> '2024-08-29T00:00:00.000Z'
<!-- ISO date with no delimiters -->
stroom:format-date('20010801184559', 'yyyyMMddHHmmss')
-> '2001-08-01T18:45:59.000Z'
<!-- Standard output, no TZ -->
stroom:format-date('2001/08/01 18:45:59', 'yyyy/MM/dd HH:mm:ss')
-> '2001-08-01T18:45:59.000Z'
<!-- Standard output, date only, with TZ -->
stroom:format-date('2001/08/01', 'yyyy/MM/dd', '-07:00')
-> '2001-08-01T07:00:00.000Z'
<!-- Standard output, with TZ -->
stroom:format-date('2001/08/01 01:00:00', 'yyyy/MM/dd HH:mm:ss', '-08:00')
-> '2001-08-01T09:00:00.000Z'
<!-- Standard output, with TZ -->
stroom:format-date('2001/08/01 01:00:00', 'yyyy/MM/dd HH:mm:ss', '+01:00')
-> '2001-08-01T00:00:00.000Z'
<!-- Single digit day and month, no padding -->
stroom:format-date('2001 8 1', 'yyyy MM dd')
-> '2001-08-01T00:00:00.000Z'
<!-- Double digit day and month, no padding -->
stroom:format-date('2001 12 28', 'yyyy MM dd')
-> '2001-12-28T00:00:00.000Z'
<!-- Single digit day and month, with optional padding -->
stroom:format-date('2001 8 1', 'yyyy ppMM ppdd')
-> '2001-08-01T00:00:00.000Z'
<!-- Double digit day and month, with optional padding -->
stroom:format-date('2001 12 31', 'yyyy ppMM ppdd')
-> '2001-12-31T00:00:00.000Z'
<!-- With abbreviated day of week month -->
stroom:format-date('Wed Aug 14 2024', 'EEE MMM dd yyyy')
-> '2024-08-14T00:00:00.000Z'
<!-- With long form day of week and month -->
stroom:format-date('Wednesday August 14 2024', 'EEEE MMMM dd yyyy')
-> '2024-08-14T00:00:00.000Z'
<!-- With 12 hour clock, AM -->
stroom:format-date('Wed Aug 14 2024 10:32:58 AM', 'E MMM dd yyyy hh:mm:ss a')
-> '2024-08-14T10:32:58.000Z'
<!-- With 12 hour clock, PM (lower case) -->
stroom:format-date('Wed Aug 14 2024 10:32:58 pm', 'E MMM dd yyyy hh:mm:ss a')
-> '2024-08-14T22:32:58.000Z'
<!-- Using minimal symbols -->
stroom:format-date('2001 12 31 22:58:32.123', 'y M d H:m:s.S')
-> '2001-12-31T22:58:32.123Z'
<!-- Optional time portion, with time -->
stroom:format-date('2001/12/31 22:58:32.123', 'yyyy/MM/dd[ HH:mm:ss.SSS]')
-> '2001-12-31T22:58:32.123Z'
<!-- Optional time portion, without time -->
stroom:format-date('2001/12/31', 'yyyy/MM/dd[ HH:mm:ss.SSS]')
-> '2001-12-31T00:00:00.000Z'
<!-- Optional millis portion, with millis -->
stroom:format-date('2001/12/31 22:58:32.123', 'yyyy/MM/dd HH:mm:ss[.SSS]')
-> '2001-12-31T22:58:32.123Z'
<!-- Optional millis portion, without millis -->
stroom:format-date('2001/12/31 22:58:32', 'yyyy/MM/dd HH:mm:ss[.SSS]')
-> '2001-12-31T22:58:32.000Z'
<!-- Optional millis/nanos portion, with nanos -->
stroom:format-date('2001/12/31 22:58:32.123456', 'yyyy/MM/dd HH:mm:ss[.SSS]')
-> '2001-12-31T22:58:32.123Z'
<!-- Fixed text -->
stroom:format-date('Date: 2001/12/31 Time: 22:58:32.123', ''Date: 'yyyy/MM/dd 'Time: 'HH:mm:ss.SSS')
-> '2001-12-31T22:58:32.123Z'
<!-- GMT/BST date that is BST -->
stroom:format-date('2009/06/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', 'GMT/BST')
-> '2009-06-01T11:34:11.000Z'
<!-- GMT/BST date that is GMT -->
stroom:format-date('2009/02/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', 'GMT/BST')
-> '2009-02-01T12:34:11.000Z'
<!-- Time zone offset -->
stroom:format-date('2009/02/01 12:34:11', 'yyyy/MM/dd HH:mm:ss', '+01:00')
-> '2009-02-01T11:34:11.000Z'
<!-- Named timezone -->
stroom:format-date('2009/02/01 23:34:11', 'yyyy/MM/dd HH:mm:ss', 'US/Eastern')
-> '2009-02-02T04:34:11.000Z'
Note
Parsing is done in lenient mode so the count of each symbol is not critical, e.g. you can parse the year 2024 with y, yy, yyy or yyyy.
Despite this, it is advisable to use a pattern that matches the known format of the input dates, e.g. in this example yyyy, to avoid confusion for anyone else reading your XSLT.
The count of each symbol is however critical when it comes to formatting.
Formatting Examples
<!-- Specific output, no input or output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', null, 'E dd MMM yyyy HH:mm (s 'secs')')
-> 'Wed 01 Aug 2001 14:30 (59 secs)'
<!-- Specific output, UTC input, no output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', 'UTC', 'E dd MMM yyyy HH:mm (s 'secs')')
-> 'Wed 01 Aug 2001 14:30 (59 secs)'
<!-- Specific output, no output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', '+01:00', 'E dd MMM yyyy HH:mm (s 'secs')')
-> 'Wed 01 Aug 2001 13:30 (59 secs)'
<!-- Specific output, with input and output TZ -->
stroom:format-date('2001/08/01 14:30:59', 'yyyy/MM/dd HH:mm:ss', '+01:00', 'E dd MMM yyyy HH:mm', '+02:00')
-> 'Wed 01 Aug 2001 15:30'
<!-- Padded 12 hour clock output -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'E dd MMM yyyy pph:ppm:pps a')
-> 'Wed 01 Aug 2001 2: 7: 5 PM'
<!-- Padded 12 hour clock output -->
stroom:format-date('2001/08/01 22:27:25.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'E dd MMM yyyy pph:ppm:pps a')
-> 'Wed 01 Aug 2001 10:27:25 PM'
<!-- Non-Padded 12 hour clock output -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'E dd MMM yyyy h:m:s a')
-> 'Wed 01 Aug 2001 2:7:5 PM'
<!-- Long form text -->
stroom:format-date('2001/08/01 14:07:05.123', 'yyyy/MM/dd HH:mm:ss.SSS', 'UTC', 'EEEE d MMMM yyyy HH:mm:ss')
-> 'Wednesday 1 August 2001 14:07:05'
Reference Time
When parsing a date string that does not contain a full zoned date and time, certain assumptions will be made.
If there is no time zone in inputDate and no inputTimeZone argument has been passed then the input date will be assumed to be in the UTC time zone.
If any of the date parts are not present, e.g. an input of 28 Oct, then Stroom will use a reference date to fill in the gaps. The reference date is the first of these values that is non-null:
- The create time of the stream being processed by the XSLT.
- The current time, i.e. now().
For example, for a call of stroom:format-date('28 Oct', 'dd MMM') on a stream created in 2024, it will return 2024-10-28T00:00:00.000Z.
hex-to-string()
For a hexadecimal input string, decode it using the specified character set to its original form.
Valid character set names are listed at: https://www.iana.org/assignments/character-sets/character-sets.xhtml.
Common examples are: ASCII, UTF-8 and UTF-16.
Example
Input
<string><xsl:value-of select="hex-to-string('74 65 73 74 69 6e 67 20 31 32 33', 'UTF-8')" /></string>
Output
<string>testing 123</string>
http-call()
Executes an HTTP(S) request to a remote server and returns the response.
http-call(String url, [String headers], [String mediaType], [String data], [String clientConfig])
The arguments are as follows:
- url - The URL to send the request to.
- headers - A newline (\n) delimited list of HTTP headers to send. Each header is of the form key:value.
- mediaType - The media (or MIME) type of the request data, e.g. application/json. If not set application/json; charset=utf-8 will be used.
- data - The data to send. The data type should be consistent with mediaType. Supplying the data argument means a POST request method will be used rather than the default GET.
- clientConfig - A JSON object containing the configuration for the HTTP client to use, including any SSL configuration.
The function returns the response as XML with namespace stroom-http.
The XML includes the body of the response in addition to the status code, success status, message and any headers.
clientConfig
The client can be configured using a JSON object containing various optional configuration items. The following is an example of the client configuration object with all keys populated.
{
"callTimeout": "PT30S",
"connectionTimeout": "PT30S",
"followRedirects": false,
"followSslRedirects": false,
"httpProtocols": [
"http/2",
"http/1.1"
],
"readTimeout": "PT30S",
"retryOnConnectionFailure": true,
"sslConfig": {
"keyStorePassword": "password",
"keyStorePath": "/some/path/client.jks",
"keyStoreType": "JKS",
"trustStorePassword": "password",
"trustStorePath": "/some/path/ca.jks",
"trustStoreType": "JKS",
"sslProtocol": "TLSv1.2",
"hostnameVerificationEnabled": false
},
"writeTimeout": "PT30S"
}
If you are using two-way SSL then you may need to set the protocol to HTTP/1.1
.
"httpProtocols": [
"http/1.1"
],
Example output
The following is an example of the XML returned from the http-call
function:
<response xmlns="stroom-http">
<successful>true</successful>
<code>200</code>
<message>OK</message>
<headers>
<header>
<key>cache-control</key>
<value>public, max-age=600</value>
</header>
<header>
<key>connection</key>
<value>keep-alive</value>
</header>
<header>
<key>content-length</key>
<value>108</value>
</header>
<header>
<key>content-type</key>
<value>application/json;charset=iso-8859-1</value>
</header>
<header>
<key>date</key>
<value>Wed, 29 Jun 2022 13:03:38 GMT</value>
</header>
<header>
<key>expires</key>
<value>Wed, 29 Jun 2022 13:13:38 GMT</value>
</header>
<header>
<key>server</key>
<value>nginx/1.21.6</value>
</header>
<header>
<key>vary</key>
<value>Accept-Encoding</value>
</header>
<header>
<key>x-content-type-options</key>
<value>nosniff</value>
</header>
<header>
<key>x-frame-options</key>
<value>sameorigin</value>
</header>
<header>
<key>x-xss-protection</key>
<value>1; mode=block</value>
</header>
</headers>
<body>{"buildDate":"2022-06-29T09:22:41.541886118Z","buildVersion":"SNAPSHOT","upDate":"2022-06-29T11:06:26.869Z"}</body>
</response>
Example usage
This is an example of how to use the function call in your XSLT.
It is recommended to place the clientConfig JSON in a Dictionary to make it easier to edit and to avoid having to escape all the quotes.
...
<xsl:template match="record">
...
<!-- Read the client config from a Dictionary into a variable -->
<xsl:variable name="clientConfig" select="stroom:dictionary('HTTP Client Config')" />
<!-- Make the HTTP call and store the response in a variable -->
<xsl:variable name="response" select="stroom:http-call('https://reqbin.com/echo', null, null, null, $clientConfig)" />
<!-- Apply 'response' templates to the response -->
<xsl:apply-templates mode="response" select="$response" />
...
</xsl:template>
<xsl:template mode="response" match="http:response">
<!-- Extract just the body of the response -->
<val><xsl:value-of select="./http:body/text()" /></val>
</xsl:template>
...
link()
Create a string that represents a hyperlink for display in a dashboard table.
link(url)
link(title, url)
link(title, url, type)
Example
link('https://www.somehost.com/somepath')
> [https://www.somehost.com/somepath](https://www.somehost.com/somepath)
link('Click Here','https://www.somehost.com/somepath')
> [Click Here](https://www.somehost.com/somepath)
link('Click Here','https://www.somehost.com/somepath', 'dialog')
> [Click Here](https://www.somehost.com/somepath){dialog}
link('Click Here','https://www.somehost.com/somepath', 'dialog|Dialog Title')
> [Click Here](https://www.somehost.com/somepath){dialog|Dialog Title}
Type can be one of:
- dialog : Display the content of the link URL within a stroom popup dialog.
- tab : Display the content of the link URL within a stroom tab.
- browser : Display the content of the link URL within a new browser tab.
- dashboard : Used to launch a stroom dashboard internally with parameters in the URL.
If you wish to override the default title or URL of the target link in either a tab or dialog you can. Both dialog and tab types allow titles to be specified after a |, e.g. dialog|My Title.
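In an XSLT, for example in a search extraction transformation, the function is called via the stroom namespace. The URL, field name and $userId variable below are purely illustrative:
<!-- Output a dashboard table link as an indexed/extracted field value -->
<data name="UserLink">
  <xsl:attribute name="value" select="stroom:link('View user', concat('https://intranet.example.com/users/', $userId), 'tab')" />
</data>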
log()
The log() function writes a message to the processing log with the specified severity. Severities of INFO, WARN, ERROR and FATAL can be used. Severities of ERROR and FATAL will result in records being omitted from the output if a RecordOutputFilter is used in the pipeline. The counts for RecWarn and RecError will be affected by warnings or errors generated in this way, therefore this function is useful for adding business rules to XML output.
E.g. Warn if a SID is not the correct length.
<xsl:if test="string-length($sid) != 7">
<xsl:value-of select="stroom:log('WARN', concat($sid, ' is not the correct length'))"/>
</xsl:if>
The same functionality can also be achieved using the standard xsl:message element, see <xsl:message>.
lookup()
The lookup() function looks up from reference or context data a value (which can be an XML node set) and adds it to the resultant XML.
lookup(String map, String key)
lookup(String map, String key, String time)
lookup(String map, String key, String time, Boolean ignoreWarnings)
lookup(String map, String key, String time, Boolean ignoreWarnings, Boolean trace)
- map - The name of the reference data map to perform the lookup against.
- key - The key to lookup. The key can be a simple string, an integer value in a numeric range or a nested lookup key.
- time - Determines which set of reference data was effective at the requested time. If no reference data exists with an effective time before the requested time then the lookup will fail. Time is in the format yyyy-MM-dd'T'HH:mm:ss.SSSXX, e.g. 2010-01-01T00:00:00.000Z.
- ignoreWarnings - If true, any lookup failures will be ignored, else they will be reported as warnings.
- trace - If true, additional trace information is output as INFO messages.
If the look up fails no result will be returned. By testing the result a default value may be output if no result is returned.
E.g. Look up a SID given a PF
<xsl:variable name="pf" select="PFNumber"/>
<xsl:if test="$pf">
<xsl:variable name="sid" select="stroom:lookup('PF_TO_SID', $pf, $formattedDateTime)"/>
<xsl:choose>
<xsl:when test="$sid">
<User>
<Id><xsl:value-of select="$sid"/></Id>
</User>
</xsl:when>
<xsl:otherwise>
<data name="PFNumber">
<xsl:attribute name="Value"><xsl:value-of select="$pf"/></xsl:attribute>
</data>
</xsl:otherwise>
</xsl:choose>
</xsl:if>
Range lookups
Reference data entries can either be stored with a single string key or a key range that defines a numeric range, e.g. 1-100. When a lookup is performed the passed key is looked up as if it were a normal string key. If that lookup fails Stroom will try to convert the key to an integer (long) value. If it can be converted to an integer then a second lookup will be performed against entries with key ranges to see if there is a key range that includes the requested key.
Range lookups can be used for looking up an IP address where the reference data values are associated with ranges of IP addresses.
In this use case, the IP address must first be converted into a numeric value using numeric-ip(), e.g.:
stroom:lookup('IP_TO_LOCATION', numeric-ip($ipAddress))
Similarly the reference data must be stored with key ranges whose bounds were created using this function.
Nested Maps
The lookup function allows you to perform chained lookups using nested maps.
For example you may have a reference data map called USER_ID_TO_LOCATION that maps user IDs to some location information for that user and a map called USER_ID_TO_MANAGER that maps user IDs to the user ID of their manager.
If you wanted to decorate a user’s event with the location of their manager you could use a nested map to achieve the lookup chain.
To perform the lookup set the map argument to the list of maps in the lookup chain, separated by a /, e.g. USER_ID_TO_MANAGER/USER_ID_TO_LOCATION.
This will perform a lookup against the first map in the list using the requested key.
If a value is found the value will be used as the key in a lookup against the next map.
The value from each map lookup is used as the key in the next map all the way down the chain.
The value from the last lookup is then returned as the result of the lookup() call.
If no value is found at any point in the chain then that results in no value being returned from the function.
In order to use nested map lookups each intermediate map must contain simple string values. The last map in the chain can either contain string values or XML fragment values.
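A sketch of such a chained lookup (the $userId and $formattedDateTime variables are assumptions):
<!-- Look up the user's manager, then the manager's location, in one call -->
<xsl:variable
  name="managerLocation"
  select="stroom:lookup('USER_ID_TO_MANAGER/USER_ID_TO_LOCATION', $userId, $formattedDateTime)" />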
put() and get()
You can put values into a map using the put() function.
These values can then be retrieved later using the get() function.
Values are stored against a key name so that multiple values can be stored.
These functions can be used for many purposes but are most commonly used to count a number of records that meet certain criteria.
The map is in the scope of the current pipeline process so values do not live after the stream has been processed.
Also, the map will only contain entries that were put() within the current pipeline process.
An example of how to count records is shown below:
<!-- Get the current record count -->
<xsl:variable name="currentCount" select="number(s:get('count'))" />
<!-- Increment the record count -->
<xsl:variable name="count">
<xsl:choose>
<xsl:when test="$currentCount">
<xsl:value-of select="$currentCount + 1" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="1" />
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<!-- Store the count for future retrieval -->
<xsl:value-of select="stroom:put('count', $count)" />
<!-- Output the new count -->
<data name="Count">
<xsl:attribute name="Value" select="$count" />
</data>
meta-keys()
When calling this function and assigning the result to a variable, you must specify the variable data type of xs:string* (array of strings).
The following fragment is an example of using meta-keys() to emit all meta values for a given stream, into an Event/Meta element:
<Event>
<xsl:variable name="metaKeys" select="stroom:meta-keys()" as="xs:string*" />
<Meta>
<xsl:for-each select="$metaKeys">
<string key="{.}"><xsl:value-of select="stroom:meta(.)" /></string>
</xsl:for-each>
</Meta>
</Event>
parse-uri()
The parse-uri() function takes a Uniform Resource Identifier (URI) in string form and returns an XML node with a namespace of uri containing the URI’s individual components of authority, fragment, host, path, port, query, scheme, schemeSpecificPart and userInfo. See either RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax or Java’s java.net.URI Class for details regarding the components.
The following XML
<!-- Display and parse the URI contained within the text of the rURI element -->
<xsl:variable name="u" select="stroom:parseUri(rURI)" />
<URI>
<xsl:value-of select="rURI" />
</URI>
<URIDetail>
<xsl:copy-of select="$v"/>
</URIDetail>
given the rURI text contains
http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details
would provide
<URI>http://foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2#more-details</URI>
<URIDetail>
<authority xmlns="uri">foo:bar@w1.superman.com:8080</authority>
<fragment xmlns="uri">more-details</fragment>
<host xmlns="uri">w1.superman.com</host>
<path xmlns="uri">/very/long/path.html</path>
<port xmlns="uri">8080</port>
<query xmlns="uri">p1=v1&p2=v2</query>
<scheme xmlns="uri">http</scheme>
<schemeSpecificPart xmlns="uri">//foo:bar@w1.superman.com:8080/very/long/path.html?p1=v1&p2=v2</schemeSpecificPart>
<userInfo xmlns="uri">foo:bar</userInfo>
</URIDetail>
pointIsInsideXYPolygon()
Returns true if the specified point is inside the specified polygon. Useful for determining if a user is inside a physical zone based on their location and the boundary of that zone.
pointIsInsideXYPolygon(Number xPos, Number yPos, Number[] xPolyData, Number[] yPolyData)
Arguments:
- xPos - The X value of the point to be tested.
- yPos - The Y value of the point to be tested.
- xPolyData - A sequence of X values that define the polygon.
- yPolyData - A sequence of Y values that define the polygon.
The list of values supplied for xPolyData must correspond with the list of values supplied for yPolyData.
The points that define the polygon must be provided in order, i.e. starting from one point on the polygon and then traveling round the path of the polygon until it gets back to the beginning.
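For example, to test a point against a simple square zone (the coordinates and the X/Y field names are illustrative):
<!-- Test whether the event's X/Y position falls inside a 100 x 100 square -->
<xsl:if test="stroom:pointIsInsideXYPolygon(number(data[@name='X']/@value), number(data[@name='Y']/@value), (0, 0, 100, 100), (0, 100, 100, 0))">
  <Data Name="InZone" Value="true" />
</xsl:if>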
9.3.3 - XSLT Includes
You can use an XSLT import to include XSLT from another translation. E.g.:
<xsl:import href="ApacheAccessCommon" />
This would include the XSLT from the ApacheAccessCommon translation.
9.4 - File Output
When outputting files with Stroom, the output file names and paths can include various substitution variables to form the file and path names.
Context Variables
The following replacement variables are specific to the current processing context.
- ${feed} - The name of the feed that the stream being processed belongs to.
- ${pipeline} - The name of the pipeline that is producing output.
- ${sourceId} - The id of the input data being processed.
- ${partNo} - The part number of the input data being processed where data is in aggregated batches.
- ${searchId} - The id of the batch search being performed. This is only available during a batch search.
- ${node} - The name of the node producing the output.
Time Variables
The following replacement variables can be used to include aspects of the current time in UTC.
- ${year} - Year in 4 digit form, e.g. 2000.
- ${month} - Month of the year padded to 2 digits.
- ${day} - Day of the month padded to 2 digits.
- ${hour} - Hour padded to 2 digits using 24 hour clock, e.g. 22.
- ${minute} - Minute padded to 2 digits.
- ${second} - Second padded to 2 digits.
- ${millis} - Milliseconds padded to 3 digits.
- ${ms} - Milliseconds since the epoch.
System (Environment) Variables
System variables (environment variables) can also be used, e.g. ${TMP}.
File Name References
The rolledFileName property of the RollingFileAppender can use references to the fileName to incorporate parts of the non-rolled file name.
- ${fileName} - The complete file name.
- ${fileStem} - Part of the file name before the file extension, i.e. everything before the last '.'.
- ${fileExtension} - The extension part of the file name, i.e. everything after the last '.'.
Other Variables
- ${uuid} - A randomly generated UUID to guarantee unique file names.
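For example (purely illustrative), a file name pattern of ${feed}_${year}${month}${day}_${uuid}.xml would produce file names starting with the feed name and the current UTC date, followed by a random UUID and the .xml extension.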
9.5 - Reference Data
In Stroom reference data is primarily used to decorate events using stroom:lookup()
calls in XSLTs.
For example you may have a reference data feed that associates the FQDN of a device with its physical location.
You can then perform a stroom:lookup()
in the XSLT to decorate an event with the physical location of a device by looking up the FQDN found in the event.
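As a sketch of what that lookup might look like in the XSLT (the FQDN_TO_LOC map comes from the example reference data later in this section; the SomeTime and SomeHost elements and the date pattern are illustrative assumptions):
<!-- Convert the event time into Stroom's standard date format for the lookup -->
<xsl:variable name="eventTime" select="stroom:format-date(SomeTime, 'dd/MM/yyyy:HH:mm:ss')" />
<!-- Decorate the event with the location of the device, looked up by its FQDN -->
<xsl:copy-of select="stroom:lookup('FQDN_TO_LOC', SomeHost, $eventTime)" />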
Reference data is time sensitive and each stream of reference data has an Effective Date set against it. This allows reference data lookups to be performed using the date of the event to ensure the reference data that was actually effective at the time of the event is used.
Using reference data involves the following steps/processes:
- Ingesting the raw reference data into Stroom.
- Creating (and processing) a pipeline to transform the raw reference into reference-data:2 format XML.
- Creating a reference loader pipeline with a Reference Data Filter element to load cooked reference data into the reference data store.
- Adding reference pipeline/feeds to an XSLT Filter in your event pipeline.
- Adding the lookup call to the XSLT.
- Processing the raw events through the event pipeline.
The process of creating a reference data pipeline is described in the HOWTO linked at the top of this document.
Reference Data Structure
A reference data entry essentially consists of the following:
- Effective time - The date/time that the entry was effective from, i.e. the time the raw reference data was received.
- Map name - A unique name for the key/value map that the entry will be stored in. The name only needs to be unique within all map names that may be loaded within an XSLT Filter. In practice it makes sense to keep map names globally unique.
- Key - The text that will be used to lookup the value in the reference data map. Mutually exclusive with Range.
- Range - The inclusive range of integer keys that the entry applies to. Mutually exclusive with Key.
- Value - The value can either be simple text, e.g. an IP address, or an XML fragment that can be inserted into another XML document. XML values must be correctly namespaced.
The following is an example of some reference data that has been converted from its raw form into reference-data:2
XML.
In this example the reference data contains three entries that each belong to a different map.
Two of the entries are simple text values and the last has an XML value.
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stroom="stroom"
xmlns:evt="event-logging:3"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd"
version="2.0.1">
<!-- A simple string value -->
<reference>
<map>FQDN_TO_IP</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<IPAddress>192.168.2.245</IPAddress>
</value>
</reference>
<!-- A simple string value -->
<reference>
<map>IP_TO_FQDN</map>
<key>192.168.2.245</key>
<value>
<HostName>stroomnode00.strmdev00.org</HostName>
</value>
</reference>
<!-- A key range -->
<reference>
<map>USER_ID_TO_COUNTRY_CODE</map>
<range>
<from>1</from>
<to>1000</to>
</range>
<value>GBR</value>
</reference>
<!-- An XML fragment value -->
<reference>
<map>FQDN_TO_LOC</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<evt:Location>
<evt:Country>GBR</evt:Country>
<evt:Site>Bristol-S00</evt:Site>
<evt:Building>GZero</evt:Building>
<evt:Room>R00</evt:Room>
<evt:TimeZone>+00:00/+01:00</evt:TimeZone>
</evt:Location>
</value>
</reference>
</referenceData>
Reference Data Namespaces
When XML reference data values are created, as in the example XML above, the XML values must be qualified with a namespace to distinguish them from the reference-data:2
XML that surrounds them.
In the above example the XML fragment will become as follows when injected into an event:
<evt:Location xmlns:evt="event-logging:3" >
<evt:Country>GBR</evt:Country>
<evt:Site>Bristol-S00</evt:Site>
<evt:Building>GZero</evt:Building>
<evt:Room>R00</evt:Room>
<evt:TimeZone>+00:00/+01:00</evt:TimeZone>
</evt:Location>
Even if evt is already declared in the XML it is being injected into, if it has been declared for the reference fragment then it will be explicitly declared in the destination.
While duplicate namespacing may appear odd it is valid XML.
The namespacing can also be achieved like this:
<?xml version="1.1" encoding="UTF-8"?>
<referenceData
xmlns="reference-data:2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stroom="stroom"
xsi:schemaLocation="reference-data:2 file://reference-data-v2.0.xsd"
version="2.0.1">
<!-- An XML value -->
<reference>
<map>FQDN_TO_LOC</map>
<key>stroomnode00.strmdev00.org</key>
<value>
<Location xmlns="event-logging:3">
<Country>GBR</Country>
<Site>Bristol-S00</Site>
<Building>GZero</Building>
<Room>R00</Room>
<TimeZone>+00:00/+01:00</TimeZone>
</Location>
</value>
</reference>
</referenceData>
This reference data will be injected into event XML exactly as it is, i.e.:
<Location xmlns="event-logging:3">
<Country>GBR</Country>
<Site>Bristol-S00</Site>
<Building>GZero</Building>
<Room>R00</Room>
<TimeZone>+00:00/+01:00</TimeZone>
</Location>
Reference Data Storage
Reference data is stored in two different places on a Stroom node. All reference data is only visible to the node where it is located. Each node that is performing reference data lookups will need to load and store its own reference data. While this will result in duplicate data being held by nodes it makes the storage of reference data and its subsequent lookup very performant.
On-Heap Store
The On-Heap store is the reference data store that is held in memory in the Java Heap. This store is volatile and will be lost on shut down of the node. The On-Heap store is only used for storage of context data.
Off-Heap Store
The Off-Heap store is the reference data store that is held in memory outside of the Java Heap and is persisted to local disk. As the store is persisted to local disk it means the reference data will survive the shutdown of the Stroom instance. Storing the data off-heap means Stroom can run with a much smaller Java Heap size.
The Off-Heap store is based on the Lightning Memory-Mapped Database (LMDB). LMDB makes use of the Linux page cache to ensure that hot portions of the reference data are held in the page cache (making use of all available free memory). Infrequently used portions of the reference data will be evicted from the page cache by the Operating System. Given that LMDB utilises the page cache for holding reference data in memory the more free memory the host has the better as there will be less shifting of pages in/out of the OS page cache. When storing large amounts of data you may experience the OS reporting very little free memory as a large amount will be in use by the page cache. This is not an issue as the OS will evict pages when memory is needed for other applications, e.g. the Java Heap.
Local Disk
The Off-Heap store is intended to be located on local disk on the Stroom node.
The location of the store is set using the property stroom.pipeline.referenceData.localDir
.
Using LMDB on remote storage is NOT advised, see http://www.lmdb.tech/doc.
Using the fastest storage (e.g. fast SSDs) is advised to reduce load times and the time taken for lookups of data that is not already in memory.
Warning
If you are running stroom on AWS EC2 instances then you will need to attach some local instance storage to the host, e.g. SSD, to use for the reference data store. In tests EBS storage was found to be VERY slow.
It should be noted that AWS instance storage is not persistent between instance stops, terminations and hardware failure. However any loss of the reference data store will mean that the next time Stroom boots a new store will be created and reference data will be loaded on demand as normal.
Transactions
LMDB is a transactional database with ACID semantics. All interaction with LMDB is done within a read or write transaction. There can only be one write transaction at a time so if there are a number of concurrent reference data loads then they will have to wait in line.
Read transactions, i.e. lookups, are not blocked by each other but may be blocked by a write transaction depending on the value of the system property stroom.pipeline.referenceData.lmdb.readerBlockedByWriter
.
LMDB can operate such that readers are not blocked by writers but if there is an open read transaction while a write transaction is writing data to the store then it is unable to make use of free space (from previous deletes, see Store Size & Compaction) so will result in the store increasing in size.
If read transactions are likely while writes are taking place then this can lead to excessive growth of the store.
Setting stroom.pipeline.referenceData.lmdb.readerBlockedByWriter
to true
will block all reads while a load is happening so any free space can be re-used, at the cost of making all lookups wait for the load to complete.
Use of this setting will depend on how likely it is that loads will clash with lookups and the store size should be monitored.
Read-Ahead Mode
When data is read from the store, if the data is not already in the page cache then it will be read from disk and added to the page cache by the OS.
Read-ahead is the process of speculatively reading ahead to load more pages into the page cache than were requested.
This is on the basis that future requests for data may need the pages speculatively read into memory as it is more efficient to read multiple pages at once.
If the reference data store is very large or is larger than the available memory then it is recommended to turn read-ahead off as the result will be to evict hot reference data from the page cache to make room for speculative pages that may not be needed.
It can be turned off with the system property stroom.pipeline.referenceData.readAheadEnabled.
Key Size
When reference data is created, care must be taken to ensure that the Key used for each entry is less than 507 bytes. For simple ASCII characters this means fewer than 507 characters. If non-ASCII characters are used in the key then these will take up more than one byte per character, so the maximum length of the key in characters will be lower. This is a limitation inherent to LMDB.
Commit intervals
The property stroom.pipeline.referenceData.maxPutsBeforeCommit
controls the number of entries that are put into the store between each commit.
As there can be only one transaction writing to the store at a time, committing periodically allows other processes to jump in and make writes.
There is a trade off though as reducing the number of entries put between each commit can seriously affect performance.
For the fastest single process performance a value of 0
should be used which means it will not commit mid-load.
This however means all other processes wanting to write to the store will need to wait.
Low values (e.g. in the hundreds) mean very frequent commits so will hamper performance.
Cloning The Off Heap Store
If you are provisioning a new stroom node it is possible to copy the off heap store from another node.
Stroom should not be running on the node being copied from.
Simply copy the contents of stroom.pipeline.referenceData.localDir
into the same configured location on the new node.
The new node will use the copied store and have access to its reference data.
Store Size & Compaction
Due to the way LMDB works the store can only grow in size, it will never shrink, even if reference data is deleted. Deleted data frees up space for new writes to the store so will be reused but will never be freed back to the operating system. If there is a regular process of purging old data and adding new reference data then this should not be an issue as the new reference data will use the space made available by the purged data.
If store size becomes an issue then it is possible to compact the store.
lmdb-utils is a package that is available on some package managers; it provides an mdb_copy command that can be used with the -c switch to copy the LMDB environment to a new one, compacting it in the process.
This should be done when Stroom is down to avoid writes happening to the store while the copy is happening.
For example, with Stroom shut down, mdb_copy -c can be run with the existing store directory (the value of stroom.pipeline.referenceData.localDir) as the source and a new empty directory as the destination, and the contents of the store directory then replaced with the compacted copy.
Now you can re-start Stroom and it will use the new compacted store, creating a lock file for it.
The compaction process is fast. A test compaction of a 4Gb store, compacted down to 1.6Gb took about 7s on non-flash HDD storage.
Alternatively, given that the store is essentially a cache and all data can be re-loaded another option is to delete the contents of stroom.pipeline.referenceData.localDir
when Stroom is not running.
On boot Stroom will create a brand new empty store and reference data will be re-loaded as required.
This approach will result in all data having to be re-loaded so will slow lookups down until it has been loaded.
The Loading Process
Reference data is loaded into the store on demand during the processing of a stroom:lookup()
function call.
Reference data will only be loaded if it does not already exist in the store, however it is always loaded as a complete stream, rather than entry by entry.
The test for existence in the store is based on the following criteria:
- The UUID of the reference loader pipeline.
- The version of the reference loader pipeline.
- The Stream ID for the stream of reference data that has been deemed effective for the lookup.
- The Stream Number (in the case of multi part streams).
If a reference stream has already been loaded matching the above criteria then no additional load is required.
IMPORTANT: It should be noted that as the version of the reference data pipeline forms part of the criteria, if the reference loader pipeline is changed, for whatever reason, then this will invalidate ALL existing reference data associated with that reference loader pipeline.
Typically the reference loader pipeline is very static so this should not be an issue.
Standard practice is to convert raw reference data into reference:2
XML on receipt using a pipeline separate to the reference loader.
The reference loader is then only concerned with reading cooked reference:2
into the Reference Data Filter.
In instances where reference data streams are infrequently used it may be preferable to not convert the raw reference on receipt but instead to do it in the reference loader pipeline.
Duplicate Keys
The Reference Data Filter pipeline element has a property overrideExistingValues
which, if set to true, means that if an entry is found in an effective stream with the same key as an entry already loaded then it will overwrite the existing one.
Entries are loaded in the order they are found in the reference:2
XML document.
If set to false then the existing entry will be kept.
If warnOnDuplicateKeys
is set to true then a warning will be logged for any duplicate keys, whether an overwrite happens or not.
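For example (illustrative only, reusing the FQDN_TO_IP map from the example data above), if an effective stream contained the following two entries then with overrideExistingValues set to true the stored value for the key would be 10.0.0.2, and with it set to false it would remain 10.0.0.1:
<reference>
<map>FQDN_TO_IP</map>
<key>host01.strmdev00.org</key>
<value>
<IPAddress>10.0.0.1</IPAddress>
</value>
</reference>
<reference>
<map>FQDN_TO_IP</map>
<key>host01.strmdev00.org</key>
<value>
<IPAddress>10.0.0.2</IPAddress>
</value>
</reference>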
Value De-Duplication
Only unique values are held in the store to reduce the storage footprint. This is useful given that typically, reference data updates may be received daily and each one is a full snapshot of the whole reference data. As a result this can mean many copies of the same value being loaded into the store. The store will only hold the first instance of duplicate values.
Querying the Reference Data Store
The reference data store can be queried within a Dashboard in Stroom by selecting Reference Data Store
in the data source selection pop-up.
Querying the store is currently an experimental feature and is mostly intended for use in fault finding.
Given the localised nature of the reference data store the dashboard can currently only query the store on the node that the user interface is being served from.
In a multi-node environment where some nodes are UI only and most are processing only, the UI nodes will have no reference data in their store.
Purging Old Reference Data
Reference data loading and purging is done at the level of a reference stream. Whenever a reference lookup is performed the last accessed time of the reference stream in the store is checked. If it is older than one hour then it will be updated to the current time. This last access time is used to determine reference streams that are no longer in active use and thus can be purged.
The Stroom job Ref Data Off-heap Store Purge is used to perform the purge operation on the Off-Heap reference data store.
No purge is required for the On-Heap store as that only holds transient data.
When the purge job is run it checks the time since each reference stream was accessed against the purge cut-off age.
The purge age is configured via the property stroom.pipeline.referenceData.purgeAge
.
It is advised to schedule this job for quiet times when it is unlikely to conflict with reference loading operations as they will fight for access to the single write transaction.
Lookups
Lookups are performed in XSLT Filters using the XSLT functions.
In order to perform a lookup one or more reference feeds must be specified on the XSLT Filter pipeline element.
Each reference feed is specified along with a reference loader pipeline that will ingest the specified feed (optionally converting it into reference:2 XML if it is not already) and pass it into a Reference Data Filter pipeline element.
Reference Feeds & Loaders
In the XSLT Filter pipeline element multiple combinations of feed and reference loader pipeline can be specified. There must be at least one in order to perform lookups. If there are multiple then when a lookup is called for a given time, the effective stream for each feed/loader combination is determined. The effective stream for each feed/loader combination will be loaded into the store, unless it is already present.
When the actual lookup is performed Stroom will try to find the key in each of the effective streams that have been loaded and that contain the map in the lookup call. If the lookup is unsuccessful in the effective stream for the first feed/loader combination then it will try the next, and so on until it has tried all of them. For this reason if you have multiple feed/loader combinations then order is important. It is possible for multiple effective streams to contain the same map/key so a feed/loader combination higher up the list will trump one lower down with the same map/key. Also if you have some lookups that may not return a value and others that should always return a value then the feed/loader for the latter should be higher up the list so it is searched first.
Effective Streams
Reference data lookups have the concept of Effective Streams.
An effective stream is the most recent stream for a given Feed that has an effective date that is less than or equal to the date used for the lookup.
When performing a lookup, Stroom will search the stream store to find all the effective streams in a time bucket that surrounds the lookup time.
These sets of effective streams are cached so if a new reference stream is created it will not be used until the cached set has expired.
To rectify this you can clear the Reference Data - Effective Stream Cache on the Caches screen.
Standard Key/Value Lookups
Standard key/value lookups consist of a simple string key and a value that is either a simple string or an XML fragment.
Standard lookups are performed using the various forms of the stroom:lookup()
XSLT function.
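For example, a simple string lookup against the IP_TO_FQDN map from the example data above might look like the following sketch, where the IPAddress element is an assumption and $eventTime holds the event time in Stroom's standard date format (as in the earlier sketch):
<HostName>
<xsl:value-of select="stroom:lookup('IP_TO_FQDN', IPAddress, $eventTime)" />
</HostName>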
Note
If the key is not found and the key is an integer then it will attempt a range lookup using the same key. This is to allow for maps that contain a mixture of key/value pairs and range/value pairs.
Range Lookups
Range lookups consist of a key that is an integer and a value that is either a simple string or an XML fragment.
For more detail on range lookups see the XSLT function stroom:lookup()
.
Note
The lookup will initially look for a single key that matches the lookup key. If an exact match is not found then it will look for a range that contains the key. This is to allow for maps that contain a mixture of key/value pairs and range/value pairs.
Nested Map Lookups
Nested map lookups involve chaining a number of lookups with the value of each map being used as the key for the next.
For more detail on nested lookups see the XSLT function stroom:lookup()
.
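As an illustration of the concept only, the same effect can be produced by chaining two ordinary lookup calls, with the value returned by the first used as the key for the second (the map names come from the example data above; IPAddress and $eventTime are assumptions):
<!-- First resolve the IP address to an FQDN, then use the FQDN to find the location -->
<xsl:variable name="fqdn" select="stroom:lookup('IP_TO_FQDN', IPAddress, $eventTime)" />
<xsl:copy-of select="stroom:lookup('FQDN_TO_LOC', $fqdn, $eventTime)" />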
Bitmap Lookups
A bitmap lookup is a special kind of lookup that actually performs a lookup for each enabled bit position of the passed bitmap value.
For more detail on bitmap lookups see the XSLT function stroom:bitmap-lookup()
.
Values can either be a simple string or an XML fragment.
Context data lookups
Some event streams have a Context stream associated with them. Context streams allow the system sending the events to Stroom to supply an additional stream of data that provides context to the raw event stream. This can be useful when the system sending the events has no control over the event content but needs to supply additional information. The context stream can be used in lookups as a reference source to decorate events on receipt. Context reference data is specific to a single event stream so is transient in nature, therefore the On Heap Store is used to hold it for the duration of the event stream processing only.
Typically the reference loader for a context stream will include a translation step to convert the raw context data into reference:2
XML.
Reference Data API
See Reference Data API.
9.6 - Context Data
TODO
This section needs some explanation.
Context File
Input File:
<?xml version="1.0" encoding="UTF-8"?>
<SomeData>
<SomeEvent>
<SomeTime>01/01/2009:12:00:01</SomeTime>
<SomeAction>OPEN</SomeAction>
<SomeUser>userone</SomeUser>
<SomeFile>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</SomeFile>
</SomeEvent>
</SomeData>
Context File:
<?xml version="1.0" encoding="UTF-8"?>
<SomeContext>
<Machine>MyMachine</Machine>
</SomeContext>
Context XSLT:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="reference-data:2"
xmlns:evt="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:template match="SomeContext">
<referenceData
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd reference-data:2 file://reference-data-v2.0.1.xsd"
version="2.0.1">
<xsl:apply-templates/>
</referenceData>
</xsl:template>
<xsl:template match="Machine">
<reference>
<map>CONTEXT</map>
<key>Machine</key>
<value><xsl:value-of select="."/></value>
</reference>
</xsl:template>
</xsl:stylesheet>
Context XML Translation:
<?xml version="1.0" encoding="UTF-8"?>
<referenceData xmlns:evt="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="reference-data:2"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd reference-data:2 file://reference-data-v2.0.1.xsd"
version="2.0.1">
<reference>
<map>CONTEXT</map>
<key>Machine</key>
<value>MyMachine</value>
</reference>
</referenceData>
Input File:
<?xml version="1.0" encoding="UTF-8"?>
<SomeData>
<SomeEvent>
<SomeTime>01/01/2009:12:00:01</SomeTime>
<SomeAction>OPEN</SomeAction>
<SomeUser>userone</SomeUser>
<SomeFile>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</SomeFile>
</SomeEvent>
</SomeData>
Main XSLT (Note the use of the context lookup):
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet
xmlns="event-logging:3"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="2.0">
<xsl:template match="SomeData">
<Events xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd" Version="3.0.0">
<xsl:apply-templates/>
</Events>
</xsl:template>
<xsl:template match="SomeEvent">
<xsl:if test="SomeAction = 'OPEN'">
<Event>
<EventTime>
<TimeCreated>
<xsl:value-of select="s:format-date(SomeTime, 'dd/MM/yyyy:hh:mm:ss')"/>
</TimeCreated>
</EventTime>
<EventSource>
<System>Example</System>
<Environment>Example</Environment>
<Generator>Very Simple Provider</Generator>
<Device>
<IPAddress>182.80.32.132</IPAddress>
<Location>
<Country>UK</Country>
<Site><xsl:value-of select="s:lookup('CONTEXT', 'Machine')"/></Site>
<Building>Main</Building>
<Floor>1</Floor>
<Room>1aaa</Room>
</Location>
</Device>
<User><Id><xsl:value-of select="SomeUser"/></Id></User>
</EventSource>
<EventDetail>
<View>
<Document>
<Title>UNKNOWN</Title>
<File>
<Path><xsl:value-of select="SomeFile"/></Path>
</File>
</Document>
</View>
</EventDetail>
</Event>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Main Output XML:
<?xml version="1.0" encoding="UTF-8"?>
<Events xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="event-logging:3"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.0.0.xsd"
Version="3.0.0">
<Event Id="6:1">
<EventTime>
<TimeCreated>2009-01-01T00:00:01.000Z</TimeCreated>
</EventTime>
<EventSource>
<System>Example</System>
<Environment>Example</Environment>
<Generator>Very Simple Provider</Generator>
<Device>
<IPAddress>182.80.32.132</IPAddress>
<Location>
<Country>UK</Country>
<Site>MyMachine</Site>
<Building>Main</Building>
<Floor>1</Floor>
<Room>1aaa</Room>
</Location>
</Device>
<User>
<Id>userone</Id>
</User>
</EventSource>
<EventDetail>
<View>
<Document>
<Title>UNKNOWN</Title>
<File>
<Path>D:\TranslationKit\example\VerySimple\OpenFileEvents.txt</Path>
</File>
</Document>
</View>
</EventDetail>
</Event>
</Events>
10 - Properties
Properties are the means of configuring the Stroom application and are typically maintained by the Stroom system administrator. The values of some properties are required in order for Stroom to function, e.g. database connection details, and thus need to be set prior to running Stroom. Some properties can be changed at runtime to alter the behaviour of Stroom.
Sources
Property values can be defined in the following locations.
System Default
The system defaults are hard-coded into the Stroom application code by the developers and can’t be changed. These represent reasonable defaults, where applicable, but may need to be changed, e.g. to reflect the scale of the Stroom system or the specific environment.
The default property values can either be viewed in the Stroom user interface or in the file config/config-defaults.yml
in the Stroom distribution.
Properties can be accessed in the Stroom user interface from the top menu.
Global Database Value
Global database values are property values stored in the database that are global across the whole cluster.
The global database value is defined as a record in the config
table in the database.
The database record will only exist if a database value has explicitly been set.
The database value will apply to all nodes in the cluster, overriding the default value, unless a node also has a value set in its YAML configuration.
Database values can be set from the Stroom user interface, accessed from the top menu.
Some properties are marked Read Only which means they cannot have a database value set for them. These properties can only be altered via the YAML configuration file on each node. Such properties are typically used to configure values required for Stroom to be able to boot, so it does not make sense for them to be configurable from the User Interface.
YAML Configuration file
Stroom is built on top of a framework called Dropwizard.
Dropwizard uses a YAML configuration file on each node to configure the application.
This is typically config.yml
and is located in the config
directory.
For details of the structure of this file, see Stroom and Stroom-Proxy Common Configuration
Source Precedence
The three sources (Default, Database & YAML) are listed in increasing priority, i.e. YAML trumps Database, which trumps Default.
For example, in a two node cluster, this table shows the effective value of a property on each node.
A -
indicates the value has not been set in that source.
NULL
indicates that the value has been explicitly set to NULL.
Source | Node1 | Node2 |
---|---|---|
Default | red | red |
Database | - | - |
YAML | - | blue |
Effective | red | blue |
Or where a Database value is set.
Source | Node1 | Node2 |
---|---|---|
Default | red | red |
Database | green | green |
YAML | - | blue |
Effective | green | blue |
Or where a YAML value is explicitly set to NULL
.
Source | Node1 | Node2 |
---|---|---|
Default | red | red |
Database | green | green |
YAML | - | NULL |
Effective | green | NULL |
Data Types
Stroom property values can be set using a number of different data types. Database property values are currently set in the user interface using the string form of the value. For each of the data types defined below, there will be an example of how the data type is recorded in its string form.
Data Type | Example UI String Forms | Example YAML form |
---|---|---|
Boolean | true false | true false |
String | This is a string | "This is a string" |
Integer/Long | 123 | 123 |
Float | 1.23 | 1.23 |
Stroom Duration | P30D P1DT12H PT30S 30d 30s 30000 | "P30D" "P1DT12H" "PT30S" "30d" "30s" "30000" See Stroom Duration Data Type. |
List | #red#Green#Blue ,1,2,3 | See List Data Type |
Map | ,=red=FF0000,Green=00FF00,Blue=0000FF | See Map Data Type |
DocRef | ,docRef(MyType,a56ff805-b214-4674-a7a7-a8fac288be60,My DocRef name) | See DocRef Data Type |
Enum | HIGH LOW | "HIGH" "LOW" |
Path | /some/path/to/a/file | "/some/path/to/a/file" |
ByteSize | 32 , 512Kib | 32 , 512Kib See Byte Size Data Type |
Stroom Duration Data Type
The Stroom Duration data type is used to specify time durations, for example the time to live of a cache or the time to keep data before it is purged. Stroom Duration uses a number of string forms to support legacy property values.
ISO 8601 Durations
Stroom Duration can be expressed using ISO 8601 duration strings.
It does NOT support the full ISO 8601 format, only D, H, M and S.
For details of how the string is parsed to a Stroom Duration, see Duration.
The following are examples of ISO 8601 duration strings:
- P30D - 30 days
- P1DT12H - 1 day 12 hours (36 hours)
- PT30S - 30 seconds
- PT0.5S - 500 milliseconds
Legacy Stroom Durations
This format was used in versions of Stroom older than v7 and is included to support legacy property values.
The following are examples of legacy duration strings:
- 30d - 30 days
- 12h - 12 hours
- 10m - 10 minutes
- 30s - 30 seconds
- 500 - 500 milliseconds
Combinations such as 1m30s
are not supported.
List Data Type
This type supports ordered lists of items, where an item can be of any supported data type, e.g. a list of strings or list of integers.
The following is an example of how a property (statusValues) that is a List of strings is represented in the YAML:
annotation:
statusValues:
- "New"
- "Assigned"
- "Closed"
This would be represented as a string in the User Interface as:
|New|Assigned|Closed
.
See Delimiters in String Conversion for details of how the items are delimited in the string.
The following is an example of how a property (cpu) that is a List of DocRefs is represented in the YAML:
statistics:
internal:
cpu:
- type: "StatisticStore"
uuid: "af08c4a7-ee7c-44e4-8f5e-e9c6be280434"
name: "CPU"
- type: "StroomStatsStore"
uuid: "1edfd582-5e60-413a-b91c-151bd544da47"
name: "CPU"
This would be represented as a string in the User Interface as:
|,docRef(StatisticStore,af08c4a7-ee7c-44e4-8f5e-e9c6be280434,CPU)|,docRef(StroomStatsStore,1edfd582-5e60-413a-b91c-151bd544da47,CPU)
See Delimiters in String Conversion for details of how the items are delimited in the string.
Map Data Type
This type supports a collection of key/value pairs where the key is unique within the collection. The type of the key must be string, but the type of the value can be any supported type.
The following is an example of how a property (mapProperty
) that is a map of string => string would be represented in the YAML:
mapProperty:
red: "FF0000"
green: "00FF00"
blue: "0000FF"
This would be represented as a string in the User Interface as:
,=red=FF0000,Green=00FF00,Blue=0000FF
The delimiter between pairs is defined first, then the delimiter for the key and value.
See Delimiters in String Conversion for details of how the items are delimited in the string.
DocRef Data Type
A DocRef (or Document Reference) is a type specific to Stroom that defines a reference to an instance of a Document within Stroom, e.g. an XSLT, Pipeline, Dictionary, etc. A DocRef consists of three parts: the type, the UUID and the name of the Document.
The following is an example of how a property (aDocRefProperty
) that is a DocRef would be represented in the YAML:
aDocRefProperty:
type: "MyType"
uuid: "a56ff805-b214-4674-a7a7-a8fac288be60"
name: "My DocRef name"
This would be represented as a string in the User Interface as:
,docRef(MyType,a56ff805-b214-4674-a7a7-a8fac288be60,My DocRef name)
See Delimiters in String Conversion for details of how the items are delimited in the string.
Byte Size Data Type
The Byte Size data type is used to represent a quantity of bytes using the IEC standard. Quantities are represented as powers of 1024, i.e. a KiB (Kibibyte) means 1024 bytes.
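For example, 512KiB is 512 × 1024 = 524,288 bytes and 32MiB is 32 × 1024 × 1024 = 33,554,432 bytes.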
Examples of Byte Size values in string form are (a YAML value would optionally be surrounded with double quotes):
- 32, 32b, 32B, 32bytes - 32 bytes
- 32K, 32KB, 32KiB - 32 kibibytes
- 32M, 32MB, 32MiB - 32 mebibytes
- 32G, 32GB, 32GiB - 32 gibibytes
- 32T, 32TB, 32TiB - 32 tebibytes
- 32P, 32PB, 32PiB - 32 pebibytes
The *iB
form is preferred as it is more explicit and avoids confusion with SI units.
Delimiters in String Conversion
The string conversion used for collection types like List, Map etc. relies on the string form defining the delimiter(s) to use for the collection.
The delimiter(s) are added as the first n characters of the string form, e.g. |red|green|blue
or |=red=FF0000|Green=00FF00|Blue=0000FF
.
It is possible to use a number of different delimiters to allow for delimiter characters appearing in the actual value, e.g. #some text#some text with a | in it
The following are the delimiter characters that can be used: | : ; , ! / \ # @ ~ - _ = + ?
When Stroom records a property value to the database it may use a delimiter of its own choosing, ensuring that it picks a delimiter that is not used in the property value.
Paths
File and directory paths can either be absolute (e.g. /some/path/
) or relative (e.g. some/path
).
All relative paths will be resolved to an absolute path using the value of stroom.home
as the base.
Path values also support variable substitution. For full details on the possible variable substitution options, see File Output.
Restart Required
Some properties are marked as requiring a restart. There are two scopes for this:
Requires UI Refresh
If a property is marked in UI as requiring a UI refresh then this means that a change to the property requires that the Stroom nodes serving the UI are restarted for the new value to take effect.
Requires Restart
If a property is marked in UI as requiring a restart then this means that a change to the property requires that all Stroom nodes are restarted for the new value to take effect.
11 - Roles
TODO
Describe application level permissions and how users and groups behave.
12 - Searching Data
Data in stroom (and in external Elastic indexes) can be searched in a number of ways:
- Dashboard - Combines multiple query expressions, result tables and visualisations in one configurable layout.
- Query - Executes a single search query written in StroomQL and displays the results as a table or visualisation.
- Analytic Rule - Executes a StroomQL search query either against data as it is ingested into Stroom or on a scheduled basis.
12.1 - Data Sources
12.1.1 - Lucene Index Data Source
Stroom’s primary data source is its internal Lucene based search indexes. For details of how data is indexed see Lucene Indexes.
TODO
Complete this section
12.1.2 - Statistics
TODO
Complete this section
12.1.3 - Elasticsearch
Stroom can integrate with external Elasticsearch indexes to allow querying using Stroom’s various mechanisms for querying data sources. These indexes may have been populated using a Stroom pipeline (See here).
Searching using a Stroom dashboard
Searching an Elasticsearch index (or data stream) using a Stroom dashboard is conceptually similar to the process described in Dashboards.
Before you set the dashboard’s data source, you must first create an Elastic Index document to tell Stroom which index (or indices) you wish to query.
Create an Elastic Index document
- Right-click a folder in the Stroom Explorer pane.
- Select:
- Enter a name for the index document and click .
- Click
Cluster configuration
field label.
next to the - In the dialog that appears, select the Elastic Cluster document where the index exists, and click .
- Enter the name of an index or data stream in
Index name or pattern
. Data view (formerly known as index pattern) syntax is supported, which enables you to query multiple indices or data streams at once. For example:stroom-events-v1
. - (Optional) Set
Search slices
, which is the number of parallel workers that will query the index. For very large indices, increasing this value up to and including the number of shards can increase scroll performance, which will allow you to download results faster. - (Optional) Set
Search scroll size
, which specifies the number of documents to return in each search response. Greater values generally increase efficiency. By default, Elasticsearch limits this number to10,000
. - Click
Test Connection
. A dialog will appear with the result, which will stateConnection Success
if the connection was successful and the index pattern matched one or more indices. - Click .
Set the Elastic Index document as the dashboard data source
- Open or create a dashboard.
- Click
Query
panel.
in the - Click
Data Source
field label.
next to the - Select the Elastic Index document you created and click .
- Configure the query expression as explained in Dashboards. Note the tips for particular Elasticsearch field mapping data types.
- Configure the table.
Query expression tips
Certain Elasticsearch field mapping types support special syntax when used in a Stroom dashboard query expression.
To identify the field mapping type for a particular field:
- Click
Query
panel to add a new expression item.
in the - Select the Elasticsearch field name in the drop-down list.
- Note the blue data type indicator to the far right of the row.
Common examples are:
keyword
,text
andnumber
.
After you identify the field mapping type, move the mouse cursor over the mapping type indicator. A tooltip appears, explaining various types of queries you can perform against that particular field’s type.
Searching multiple indices
Using data view (index pattern) syntax, you can create powerful dashboards that query multiple indices at a time.
An example of this is where you have multiple indices covering different types of email systems.
Let’s assume these indices are named: stroom-exchange-v1, stroom-domino-v1 and stroom-mailu-v1.
There is a common set of fields across all three indices: @timestamp, Subject, Sender and Recipient.
You want to allow search across all indices at once, in effect creating a unified email dashboard.
You can achieve this by creating an Elastic Index document called (for example) Elastic-Email-Combined and setting the property Index name or pattern to: stroom-exchange-v1,stroom-domino-v1,stroom-mailu-v1.
Click and re-open the dashboard.
You’ll notice that the available fields are a union of the fields across all three indices.
You can now search by any of these - in particular, the fields common to all three.
12.1.4 - Internal Data Sources
Stroom provides a number of built in data sources for querying the inner workings of stroom. These data sources do not have a corresponding Document so do not feature in the explorer tree.
These data sources appear as children of the root folder when selecting a data source in a Dashboard or View. They are also available in the list of data sources when editing a Query.
Analytics
TODO
Complete
Annotations
Annotations are a means of annotating search results with additional information and for assigning those annotations to users. The Annotations data source allows you to query the annotations that have been created.
Field | Type | Description |
---|---|---|
annotation:Id |
Long | Annotation unique identifier. |
annotation:CreatedOn |
Date | Date created. |
annotation:CreatedBy |
String | Username of the user that created the annotation. |
annotation:UpdatedOn |
Date | Date last updated. |
annotation:UpdatedBy |
String | Username of the user that last updated the annotation. |
annotation:Title |
String | |
annotation:Subject |
String | |
annotation:AssignedTo |
String | Username the annotation is assigned to. |
annotation:Comment |
String | Any comments on the annotation. |
annotation:History |
String | History of changes to the annotation. |
Dual
The Dual data source is one with a single field that always returns one row with the same value.
This data source can be useful for testing expression functions.
It can also be useful when combined with an extraction pipeline that uses the stroom:http-call()
XSLT function in order to make a single HTTP call using Dashboard parameter values.
Field | Type | Description |
---|---|---|
Dummy |
String | Always one row that has the value X |
Index Shards
Exposes the details of the index shards that make up Stroom’s Lucene based index. Each index is split up into one or more partitions and each partition is further divided into one or more shards. Each row represents one index shard.
Field | Type | Description |
---|---|---|
Node |
String | The name of the node that the index belongs to. |
Index |
String | The name of the index document. |
Index Name |
String | The name of the index document. |
Volume Path |
String | The file path for the index shard. |
Volume Group |
String | The name of the volume group the index is using. |
Partition |
String | The name of the partition that the shard is in. |
Doc Count |
Integer | The number of documents in the shard. |
File Size |
Long | The size of the shard on disk in bytes. |
Status |
String | The status of the shard (Closed , Open , Closing , Opening , New , Deleted , Corrupt ). |
Last Commit |
Date | The time and date of the last commit to the shard. |
Meta Store
Exposes details of the streams held in Stroom’s stream (aka meta) store. Each row represents one stream.
Field | Type | Description |
---|---|---|
Feed |
String | The name of the feed the stream belongs to. |
Pipeline |
String | The name of the pipeline that created the stream. [Optional] |
Pipeline Name |
String | The name of the pipeline that created the stream. [Optional] |
Status |
String | The status of the stream (Unlocked , Locked , Deleted ). |
Type |
String | The
Stream Type
, e.g. Events , Raw Events , etc. |
Id |
Long | The unique ID (within this Stroom cluster) for the stream . |
Parent Id |
Long | The unique ID (within this Stroom cluster) for the parent stream, e.g. the Raw stream that spawned an Events stream. [Optional] |
Processor Id |
Long | The unique ID (within this Stroom cluster) for the processor that produced this stream. [Optional] |
Processor Filter Id |
Long | The unique ID (within this Stroom cluster) for the processor filter that produced this stream. [Optional] |
Processor Task Id |
Long | The unique ID (within this Stroom cluster) for the processor task that produced this stream. [Optional] |
Create Time |
Date | The time the stream was created. |
Effective Time |
Date | The time that the data in this stream is effective for. This is only used for reference data stream and is the time that the snapshot of reference data was captured. [Optional] |
Status Time |
Date | The time that the status was last changed. |
Duration |
Long | The time it took to process the stream in milliseconds. [Optional] |
Read Count |
Long | The number of records read in segmented streams. [Optional] |
Write Count |
Long | The number of records written in segmented streams. [Optional] |
Info Count |
Long | The number of INFO messages. |
Warning Count |
Long | The number of WARNING messages. |
Error Count |
Long | The number of ERROR messages. |
Fatal Error Count |
Long | The number of FATAL_ERROR messages. |
File Size |
Long | The compressed size of the stream on disk in bytes. |
Raw Size |
Long | The un-compressed size of the stream on disk in bytes. |
Processor Tasks
Exposes details of the tasks spawned by the processor filters. Each row represents one processor task.
Field | Type | Description |
---|---|---|
Create Time |
Date | The time the task was created. |
Create Time Ms |
Long | The time the task was created (milliseconds). |
Start Time |
Date | The time the task was executed. |
Start Time Ms |
Long | The time the task was executed (milliseconds). |
End Time |
Date | The time the task finished. |
End Time Ms |
Long | The time the task finished (milliseconds). |
Status Time |
Date | The time the status of the task was last updated. |
Status Time Ms |
Long | The time the status of the task was last updated (milliseconds). |
Meta Id |
Long | The unique ID (unique within this Stroom cluster) of the stream the task was for. |
Node |
String | The name of the node that the task was executed on. |
Pipeline |
String | The name of the pipeline that spawned the task. |
Pipeline Name |
String | The name of the pipeline that spawned the task. |
Processor Filter Id |
Long | The ID of the processor filter that spawned the task. |
Processor Filter Priority |
Integer | The priority of the processor filter when the task was executed. |
Processor Id |
Long | The unique ID (unique within this Stroom cluster) of the pipeline processor that spawned this task. |
Feed |
String | |
Status |
String | The status of the task (Created , Queued , Processing , Complete , Failed , Deleted ). |
Task Id |
Long | The unique ID (unique within this Stroom cluster) of this task. |
Reference Data Store
Warning
This data source is for advanced users only and is primarily aimed at debugging issues with reference data.
Reference data is written to a persistent cache on storage local to the node. This data source exposes the data held in the store on the local node only. Given that most Stroom deployments are clustered and the UI nodes are typically not doing processing, this means the UI node will have no reference data.
Task Manager
This data source exposes the background tasks currently running across the Stroom cluster. Each row represents a single background server task.
Requires the Manage Tasks
application permission.
Field | Type | Description |
---|---|---|
Node |
String | The name of the node that the task is running on. |
Name |
String | The name of the task. |
User |
String | The user name of the user that the task is running as. |
Submit Time |
Date | The time the task was submitted. |
Age |
Duration | The time the task has been running for. |
Info |
String | The latest information message from the task. |
12.2 - Dashboards
12.2.1 - Queries
Dashboard queries are created with the query expression builder. The expression builder allows for complex boolean logic to be created across multiple index fields. The way in which different index fields may be queried depends on the type of data that the index field contains.
Date Time Fields
Time fields can be queried for times equal, greater than, greater than or equal, less than, less than or equal or between two times.
Times can be specified in two ways:
- Absolute times
- Relative times
Absolute Times
An absolute time is specified in ISO 8601 date time format, e.g. 2016-01-23T12:34:11.844Z
Relative Times
In addition to absolute times it is possible to specify times using expressions. Relative time expressions create a date time that is relative to the execution time of the query. Supported expressions are as follows:
- now() - The current execution time of the query.
- second() - The current execution time of the query rounded down to the nearest second.
- minute() - The current execution time of the query rounded down to the nearest minute.
- hour() - The current execution time of the query rounded down to the nearest hour.
- day() - The current execution time of the query rounded down to the nearest day.
- week() - The current execution time of the query rounded down to the first day of the week (Monday).
- month() - The current execution time of the query rounded down to the start of the current month.
- year() - The current execution time of the query rounded down to the start of the current year.
Adding/Subtracting Durations
With relative times it is possible to add or subtract durations so that queries can be constructed to provide for example, the last week of data, the last hour of data etc.
To add/subtract a duration from a query term the duration is simply appended after the relative time, e.g.
now() + 2d
Multiple durations can be combined in the expression, e.g.
now() + 2d - 10h
now() + 2w - 1d10h
Durations consist of a number and duration unit. Supported duration units are:
- s - Seconds
- m - Minutes
- h - Hours
- d - Days
- w - Weeks
- M - Months
- y - Years
Using these durations, a query to get the last week's data could be as follows:
between now() - 1w and now()
Or midnight a week ago to midnight today:
between day() - 1w and day()
Or if you just wanted data for the week so far:
greater than week()
Or all data for the previous year:
between year() - 1y and year()
Or this year so far:
greater than year()
12.2.2 - Internal Links
Within Stroom, links can be created in dashboard tables or dashboard text panes that will direct Stroom to display an item in various ways.
Links are inserted in the form:
[Link Text](URL and parameters){Link Type}
In dashboard tables links can be inserted using the link()
function or more specialised functions such as data()
or stepping()
.
In dashboard text panes, links can be inserted into the HTML as link
attributes on elements.
Note
The text pane must be set toShow As HTML
for links to operate.
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer" link="[link](uuid=e177cf16-da6c-4c7d-a19c-09a201f5a2da&params=user%3Duser1&title=Details%20For%20user1){dashboard}">Details For user1</span>
</div>
The link type can be one of the following:
- dialog : Display the content of a link URL within a stroom popup dialog.
- tab : Display the content of a link URL within a stroom tab.
- browser : Display the content of a link URL within a new browser tab.
- dashboard : Used to launch a Stroom dashboard internally with parameters in the URL.
- stepping : Used to launch Stroom stepping internally with parameters in the URL.
- data : Used to show Stroom data internally with parameters in the URL.
- annotation : Used to show a Stroom annotation internally with parameters in the URL.
Dialog
Dialog links are used to embed any referenced URL in a Stroom popup Dialog. Dialog links look something like this in HTML:
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[Show](https://www.somehost.com/somepath){dialog|Embedded In Stroom}">
Show In Stroom Dialog
</span>
</div>
Note
The dialog title can be controlled by adding a | and required title after the type, e.g. {dialog|Embedded In Stroom}
Tab
Tab links are similar to dialog links and are used to embed any referenced URL in a Stroom tab. Tab links look something like this in HTML:
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[Show](https://www.somehost.com/somepath){tab|Embedded In Stroom}">
Show In Stroom Tab
</span>
</div>
Note
The tab title can be controlled by adding a | and required title after the type, e.g. {tab|Embedded In Stroom}
Browser
Browser links are used to open any referenced URL in a new browser tab.
In most cases this is easily accomplished via a normal hyperlink but Stroom also provides a mechanism to do this as a link event so that dashboard tables are also able to open new browser tabs.
This can be accomplished by using the link()
table function.
In a dashboard text pane the HTML could look like this:
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[Show](https://www.somehost.com/somepath){browser}">
Show In Browser Tab
</span>
</div>
Note
Unlike the other link types there is no way to control the browser tab title.
Dashboard
In addition to viewing/embedding external URLs, Stroom links can be used to direct Stroom to show an internal item or feature.
The dashboard
link type allows Stroom to open a new tab and show a dashboard with the specified parameters.
The format for a dashboard link is as follows:
[Link Text](uuid=<UUID>&params=<PARAMS>&title=<CUSTOM_TITLE>){dashboard}
The parameters for dashboard links are:
- uuid - The UUID of the dashboard to open.
- params - A URL encoded list of params to supply to the dashboard, e.g. params=user%3Duser1.
- title - An optional URL encoded title to better identify the specific instance of the dashboard, e.g. title=Details%20For%20user1.
Note
Parameter values can be URL encoded in XSLT using the encode-for-uri function.
An example of this type of link in HTML:
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[link](uuid=e177cf16-da6c-4c7d-a19c-09a201f5a2da&params=user%3Duser1&title=Details%20For%20user1){dashboard}">
Details For user1
</span>
</div>
Note
By using a pipeline with the appropriate XSLT it is possible to dynamically generate links in dashboard text panes that will be specific to the data being displayed.
Data
A link can be created to open a sub-set of a source of data (i.e. part of a stream) for viewing.
The data can either be opened in a popup dialog (dialog) or in another stroom tab (tab).
It can also be displayed in preview form (with formatting and syntax highlighting) or unaltered source form.
Note
To make full use of data links for viewing raw data, you need to use the stroom:source()
XSLT Function to decorate an event with the details of the source location it derived from.
The format for a data link is as follows:
[Link Text](id=<STREAM_ID>&partNo=<PART_NO>&recordNo=<RECORD_NO>&lineFrom=<LINE_FROM>&colFrom=<COL_FROM>&lineTo=<LINE_TO>&colTo=<COL_TO>&viewType=<VIEW_TYPE>&displayType=<DISPLAY_TYPE>){data}
Stroom deals in two main types of stream, segmented and non-segmented (see Streams).
Data in a non-segmented (i.e. raw) stream is identified by an id
, a partNo
and optionally line and column positions to define the sub-set of that stream part to display.
Data in a segmented (i.e. cooked) stream is identified by an id
, a recordNo
and optionally line and column positions to define the sub-set of that record (i.e. event) within that stream.
The parameters for data links are:
- id - The stream ID.
- partNo - The part number of the stream (one based). Always 1 for segmented (cooked) streams.
- recordNo - The record number within a segmented stream (optional). Not applicable for non-segmented streams so use null() instead.
- lineFrom - The line number of the start of the sub-set of data (optional, one based).
- colFrom - The column number of the start of the sub-set of data (optional, one based).
- lineTo - The line number of the end of the sub-set of data (optional, one based).
- colTo - The column number of the end of the sub-set of data (optional, one based).
- viewType - The type of view of the data (optional, defaults to preview):
  - preview : Display the data as a formatted preview of a limited portion of the data.
  - source : Display the un-formatted data in its original form with the ability to navigate around all of the data source.
- displayType - The way of displaying the data (optional, defaults to dialog):
  - dialog : Open as a modal popup dialog.
  - tab : Open as a top level tab within the Stroom browser tab.
In preview
mode the line and column positions will limit the data displayed to the specified selection.
In source
mode the line and column positions define a highlight block of text within the part/record.
Warning
The displayType
value tab
is not supported if the dashboard is viewed via a Direct URL.
This is because a direct URL displays only the dashboard without Stroom’s top level tab bar so it is not possible to open it as a top level tab.
An example of this type of link in HTML:
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[link](id=1822&partNo=1&recordNo=1){data}">
Show Source</span>
</div>
View Type
The additional parameter viewType
can be used to switch the data view mode from preview
(default) to source
.
In preview mode the optional parameters lineFrom, colFrom, lineTo and colTo can be used to limit the portion of the data that is displayed.
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[link](id=1822&partNo=1&recordNo=1&viewType=preview&lineFrom=1&colFrom=1&lineTo=10&colTo=8){data}">
Show Source Preview
</span>
</div>
In source mode the optional parameters lineFrom, colFrom, lineTo and colTo can be used to highlight a portion of the data that is displayed.
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[link](id=1822&partNo=1&recordNo=1&viewType=source&lineFrom=1&colFrom=1&lineTo=10&colTo=8){data}">
Show Source
</span>
</div>
Display Type
Choose whether to display data in a dialog
(default) or a Stroom tab
.
Stepping
A stepping link can be used to launch the data stepping feature with the specified data. The format for a stepping link is as follows:
[Link Text](id=<STREAM_ID>&partNo=<PART_NO>&recordNo=<RECORD_NO>){stepping}
The parameters for stepping links are as follows:
- id - The id of the stream to step.
- partNo - The sub part no within the stream to step (usually 1).
- recordNo - The record or event number within the stream to step.
An example of this type of link in HTML:
<div style="padding: 5px;">
<span style="text-decoration:underline;color:blue;cursor:pointer"
link="[link](id=1822&partNo=1&recordNo=1){stepping}">
Step Source</span>
</div>
Annotation
A link can be used to edit or create annotations. To view or edit an existing annotation the id must be known or one can be found using a stream and event id. If all parameters are specified an annotation will either be created or edited depending on whether it exists or not. The format for an annotation link is as follows:
[Link Text](annotationId=<ANNOTATION_ID>&streamId=<STREAM_ID>&eventId=<EVENT_ID>&title=<TITLE>&subject=<SUBJECT>&status=<STATUS>&assignedTo=<ASSIGNED_TO>&comment=<COMMENT>){annotation}
The parameters for annotation links are as follows:
- annotationId - The optional existing id of an annotation if one already exists.
- streamId - An optional stream id to link to a newly created annotation, or used to lookup an existing annotation if no annotation id is provided.
- eventId - An optional event id to link to a newly created annotation, or used to lookup an existing annotation if no annotation id is provided.
- title - An optional default title to give the annotation if a new one is created.
- subject - An optional default subject to give the annotation if a new one is created.
- status - An optional default status to give the annotation if a new one is created.
- assignedTo - An optional initial assignedTo value to give the annotation if a new one is created.
- comment - An optional initial comment to give the annotation if a new one is created.
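An example of this type of link in HTML, following the same pattern as the other link types (the stream/event ids and field values are illustrative):
<div style="padding: 5px;">
  <span style="text-decoration:underline;color:blue;cursor:pointer"
        link="[link](streamId=1822&eventId=1&title=Investigate%20Event&status=New){annotation}">
    Create Annotation</span>
</div>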
12.2.3 - Direct URLs
It is possible to navigate directly to a specific Stroom dashboard using a direct URL. This can be useful when you have a dashboard that needs to be viewed by users that would otherwise not be using the Stroom user interface.
URL format
The format for the URL is as follows:
https://<HOST>/stroom/dashboard?type=Dashboard&uuid=<DASHBOARD UUID>[&title=<DASHBOARD TITLE>][&params=<DASHBOARD PARAMETERS>]
Example:
https://localhost/stroom/dashboard?type=Dashboard&uuid=c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09&title=My%20Dash&params=userId%3DFred%20Bloggs
Host and path
The host and path are typically https://<HOST>/stroom/dashboard
where <HOST>
is the hostname/IP for Stroom.
type
type
is a required parameter and must always be Dashboard
since we are opening a dashboard.
uuid
uuid
is a required parameter where <DASHBOARD UUID>
is the UUID for the dashboard you want a direct URL to, e.g. uuid=c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09
The UUID for the dashboard that you want to link to can be found by right clicking on the dashboard icon in the explorer tree and selecting Info.
The Info dialog will display something like this and the UUID can be copied from it:
DB ID: 4
UUID: c7c6b03c-5d47-4b8b-b84e-e4dfc6c84a09
Type: Dashboard
Name: Stroom Family App Events Dashboard
Created By: INTERNAL
Created On: 2018-12-10T06:33:03.275Z
Updated By: admin
Updated On: 2018-12-10T07:47:06.841Z
title (Optional)
title
is an optional URL parameter where <DASHBOARD TITLE>
allows the specification of a specific title for the opened dashboard instead of the default dashboard name.
The inclusion of ${name}
in the title allows the default dashboard name to be used and appended with other values, e.g. 'title=${name}%20-%20' + param.name
params (Optional)
params
is an optional URL parameter where <DASHBOARD PARAMETERS>
includes any parameters that have been defined for the dashboard in any of the expressions, e.g. params=userId%3DFred%20Bloggs
Permissions
In order for a user to view a dashboard they will need the necessary permissions on the various entities that make up the dashboard.
For a Lucene index query and associated table the following permissions will be required:
- Read permission on the Dashboard entity.
- Use permission on any Index entities being queried in the dashboard.
- Use permission on any Pipeline entities set as search extraction Pipelines in any of the dashboard’s tables.
- Use permission on any XSLT entities used by the above search extraction Pipeline entities.
- Use permission on any ancestor pipelines of any of the above search extraction Pipeline entities (if applicable).
- Use permission on any Feed entities that you want the user to be able to see data for.
For a SQL Statistics query and associated table the following permissions will be required:
- Read permission on the Dashboard entity.
- Use permission on the StatisticStore entity being queried.
For a visualisation the following permissions will be required:
- Read permission on any Visualisation entities used in the dashboard.
- Read permission on any Script entities used by the above Visualisation entities.
- Read permission on any Script entities used by the above Script entities.
12.3 - Query
TODO
Complete this section.
12.3.1 - Stroom Query Language
Query Format
Stroom Query Language (StroomQL) is a text based replacement for the existing Dashboard query builder and allows you to express the same queries in text form as well as providing additional functionality. It is currently used on the Query entity as the means of defining a query.
The following shows the supported syntax for a StroomQL query.
from <DATA_SOURCE>
where <FIELD> <CONDITION> <VALUE> [and|or|not]
[and|or|not]
[window] <TIME_FIELD> by <WINDOW_SIZE> [advance <ADVANCE_WINDOW_SIZE>]
[filter] <FIELD> <CONDITION> <VALUE> [and|or|not]
[and|or|not]
[eval...] <FIELD> = <EXPRESSION>
[having] <FIELD> <CONDITION> <VALUE> [and|or|not]
[group by] <FIELD>
[sort by] <FIELD> [desc|asc] // asc by default
[limit] <MAX_ROWS>
select <FIELD> [as <COLUMN NAME>], ...
[show as] <VIS_NAME> (<VIS_CONTROL_ID_1> = <COLUMN_1>, <VIS_CONTROL_ID_2> = <COLUMN_2>)
Keywords
From
The first part of a StroomQL expression is the from
clause that defines the single data source to query.
All queries must include the from
clause.
Select the data source to query, e.g.
from my_source
If the name of the data source contains white space then it must be quoted, e.g.
from "my source"
Where
Use where
to construct query criteria, e.g.
where feed = "my feed"
Add boolean logic with and
, or
and not
to build complex criteria, e.g.
where feed = "my feed"
or feed = "other feed"
Use brackets to group logical sub expressions, e.g.
where user = "bob"
and (feed = "my feed" or feed = "other feed")
Conditions
Supported conditions are:
=
!=
>
>=
<
<=
is null
is not null
And|Or|Not
Logical operators to add to where and filter clauses.
Bracket groups
You can force evaluation of items in a specific order using bracketed groups.
and X = 5 OR (name = foo and surname = bar)
Window
window <TIME_FIELD> by <WINDOW_SIZE> [advance <ADVANCE_WINDOW_SIZE>]
Windowing groups data by a specified window size applied to a time field. A window inserts additional rows for future periods so that rows for future periods contain count columns for previous periods.
Specify the field to window by and a duration.
Durations are specified in simple terms e.g. 1d
, 2w
etc.
By default, a window will insert a count into the next period row. This is because by default we advance by the specified window size. If you wish to advance by a different duration you can specify the advance amount which will insert counts into multiple future rows.
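For example, assuming the data source has a time field called EventTime, the following sketch counts into daily windows and advances every 6 hours (the field name and durations are illustrative):
window EventTime by 1d advance 6h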
Filter
Use filter
to filter values that have not been indexed during search retrieval.
This is used the same way as the where
clause but applies to data after being retrieved from the index, e.g.
filter obscure_field = "some value"
Add boolean logic with and
, or
and not
to build complex criteria as supported by the where
clause.
Use brackets to group logical sub expressions as supported by the where
clause.
Note
As filters do not make use of the index they can be considerably slower than a where clause, however they allow filtering on fields that have not been indexed for some reason.
Frequent use of filter
on a field suggests you may want to consider including that field in an index.
Eval
Use eval
to assign the value returned from an Expression Function to a named variable, e.g.
eval my_count = count()
Here the result of the count()
function is being stored in a variable called my_count
.
Functions can be nested and applied to variables, e.g.
eval new_name = concat(
substring(name, 3, 5),
substring(name, 8, 9))
Note that all fields in the data source selected using from
will be available as variables by default.
Multiple eval
statements can also be used to break up complex function expressions and make it easier to comment out individual evaluations, e.g.
eval name_prefix = substring(name, 3, 5)
eval name_suffix = substring(name, 8, 9)
eval new_name = concat(
name_prefix,
name_suffix)
Variables can be reused, e.g.
eval name_prefix = substring(name, 3, 5)
eval new_name = substring(name, 8, 9)
eval new_name = concat(
name_prefix,
new_name)
In this example, the second assignment of new_name
will override the value initially assigned to it.
Note that when reusing a variable name, the assignment can depend on the previous value assigned to that variable.
Having
A post aggregate filter that is applied at query time to return only rows that match the having
conditions.
having count > 3
Group By
Use to group by columns, e.g.
group by feed
You can group across multiple columns, e.g.
group by feed, name
You can create nested groups, e.g.
group by feed
group by name
Sort By
Use to sort by columns, e.g.
sort by feed
You can sort across multiple columns, e.g.
sort by feed, name
You can change the sort direction, e.g.
sort by feed asc
Or
sort by feed desc
Limit
Limit the number of results, e.g.
limit 10
Select
The select
keyword is used to define the fields that will be selected out of the data source (and any eval
’d fields) for display in the table output.
select feed, name
You can optionally rename the fields so that they appear in the table with more human friendly names.
select feed as 'my feed column',
name as 'my name column'
Show
The show
keyword is used to tell StroomQL how to show the data resulting from the select
.
A Stroom visualisation can be specified and then passed column values from the select
for the visualisation control properties.
show LineChart(x = EventTime, y = count)
show Doughnut(names = Feed, values = count)
For visualisations that contain spaces in their names it is necessary to use quotes, e.g.
show "My Visualisation" (x = EventTime, y = count)
Comments
Single line
StroomQL supports single line comments using //
.
For example:
from "index_view" // view
where EventTime > now() - 1227d
// and StreamId = 1210
select StreamId as "Stream Id", EventTime as "Event Time"
Multi line
Multiple lines can be commented by surrounding sections with /*
and */
.
For example:
from "index_view" // view
where EventTime > now() - 1227d
/*
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
*/
select StreamId as "Stream Id", EventTime as "Event Time"
Examples
The following are various example queries.
// add a where
from "index_view" // view
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
eval count = count()
group by StreamId
sort by Sl desc
select Sl, StreamId as "Stream Id", EventId as "Event Id", EventTime as "Event Time", UserId as "User Id", FirstName, count
limit 10
from "index_view" // view
// add a where
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
eval count = count()
group by StreamId
sort by Sl desc
select Sl, StreamId as "Stream Id", EventId as "Event Id", EventTime as "Event Time", UserId as "User Id", FirstName, count
limit 10
from "index_view" // view
// add a where
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
// eval count = count()
// group by StreamId
// sort by Sl desc
select StreamId as "Stream Id", EventId as "Event Id"
// limit 10
from "index_view" // view
// add a where
where EventTime > now() - 1227d
// and StreamId = 1210
eval UserId = any(upperCase(UserId))
eval FirstName = lowerCase(substringBefore(UserId, '.'))
eval FirstName = any(FirstName)
eval Sl = stringLength(FirstName)
eval count = count()
group by StreamId
sort by Sl desc
select Sl, StreamId as "Stream Id", EventId as "Event Id", EventTime as "Event Time", UserId as "User Id", FirstName, count
limit 10
12.4 - Analytic Rules
TODO
Complete this section.
12.5 - Search Extraction
When indexing data it is possible to store (see Stored Fields) all data in the index. This comes with a storage cost as the data is then held in two places: the event and the index document.
Stroom has the capability of doing Search Extraction at query time. This involves combining the data stored in the index document with data extracted using a search extraction pipeline. Extracting data in this way is slower but reduces the data stored in the index, so it is a trade off between performance and storage space consumed.
Search Extraction relies on the StreamId and EventId being stored in the Index. Stroom can then use these two fields to locate the event in the stream store and process it with the search extraction pipeline.
TODO
Add more detail
12.6 - Dictionaries
Creating
Right click on a folder in the explorer tree that you want to create a dictionary in. Choose ‘New/Dictionary’ from the popup menu:
Call the dictionary something like ‘My Dictionary’ and click OK.
Now just add any search terms you want to the newly created dictionary and save it.
You can add multiple terms.
- Terms on separate lines act as if they are part of an ‘OR’ expression when used in a search.
apple
banana
orange
- Terms on a single line separated by spaces act as if they are part of an ‘AND’ expression when used in a search.
apple,banana,orange
Using the Dictionary
To perform a search using your dictionary, just choose the newly created dictionary as part of your search expression:
TODO: Fix image
13 - Security
Shared Storage
For most large installations Stroom uses shared storage for its data store. This storage could be a CIFS, NFS or similar shared file system. It is recommended that access to this shared storage is protected so that only the application can access it. This could be achieved by placing the storage and application behind a firewall and by requiring appropriate authentication to the shared storage. It should be noted that NFS is unauthenticated so should be used with appropriate safeguards.
MySQL
Accounts
It is beyond the scope of this article to discuss this in detail but all MySQL accounts should be secured on initial install. Official guidance for doing this can be found here .
Communication
Communication between MySQL and the application should be secured. This can be achieved in one of the following ways:
- Placing MySQL and the application behind a firewall
- Securing communication through the use of iptables
- Making MySQL and the application communicate over SSL (see here for instructions)
The above options are not mutually exclusive and may be combined to better secure communication.
Application
Node to node communication
In a multi node Stroom deployment each node communicates with the master node. This can be configured securely in one of several ways:
- Direct communication to Tomcat on port 8080 - Secured by being behind a firewall or using iptables
- Direct communication to Tomcat on port 8443 - Secured using SSL and certificates
- Removal of Tomcat connectors other than AJP and configuration of Apache to communicate on port 443 using SSL and certificates
Application to Stroom Proxy Communication
The application can be configured to share some information with Stroom Proxy so that Stroom Proxy can decide whether or not to accept data for certain feeds based on the existence of the feed or its reject/accept status. The amount of information shared between the application and the proxy is minimal but could be used to discover what feeds are present within the system. Securing this communication is harder as both the application and the proxy will not typically reside behind the same firewall. Despite this, communication can still be performed over SSL, thus protecting this potential attack vector.
Admin port
Stroom (v6 and above) and its associated family of stroom-* Dropwizard based services all expose an admin port (8081 in the case of stroom). This port serves up various health check and monitoring pages as well as a number of restful services for initiating admin tasks. There is currently no authentication on this admin port so it is assumed that access to this port will be tightly controlled using a firewall, iptables or similar.
Servlets
There are several servlets in Stroom that are accessible by certain URLs. Considerations should be made about what URLs are made available via Apache and who can access them. The servlets, path and function are described below:
Servlet | Path | Function | Risk |
---|---|---|---|
DataFeed | /datafeed or /datafeed/* | Used to receive data | Possible denial of service attack by posting too much data/noise |
RemoteFeedService | /remoting/remotefeedservice.rpc | Used by proxy to ask application about feed status (described in previous section) | Possible to systematically discover which feeds are available. Communication with this service should be secured over SSL discussed above |
DynamicCSSServlet | /stroom/dynamic.css | Serves dynamic CSS based on theme configuration | Low risk as no important data is made available by this servlet |
DispatchService | /stroom/dispatch.rpc | Service for UI and server communication | All back-end services accessed by this umbrella service are secured appropriately by the application |
ImportFileServlet | /stroom/importfile.rpc | Used during configuration upload | Users must be authenticated and have appropriate permissions to import configuration |
ScriptServlet | /stroom/script | Serves user defined visualisation scripts to the UI | The visualisation script is considered to be part of the application just as the CSS so is not secured |
ClusterCallService | /clustercall.rpc | Used for node to node communication as discussed above | Communication must be secured as discussed above |
ExportConfig | /export/* | Servlet used to export configuration data | Servlet access must be restricted with Apache to prevent configuration data being made available to unauthenticated users |
Status | /status | Shows the application status including volume usage | Needs to be secured so that only appropriate users can see the application status |
Echo | /echo | Block GZIP data posted to the echo servlet is sent back uncompressed. This is a utility servlet for decompression of external data | URL should be secured or not made available |
Debug | /debug | Servlet for echoing HTTP header arguments including certificate details | Should be secured in production environments |
SessionList | /sessionList | Lists the logged in users | Needs to be secured so that only appropriate users can see who is logged in |
SessionResourceStore | /resourcestore/* | Used to create, download and delete temporary files linked to a user’s session such as data for export | This is secured by using the user’s session and requiring authentication |
HDFS, Kafka, HBase, Zookeeper
Stroom and stroom-stats can integrate with HDFS, Kafka, HBase and Zookeeper. It should be noted that communication with these external services is currently not secure. Until additional security measures (e.g. authentication) are put in place it is assumed that access to these services will be carefully controlled (using a firewall, iptables or similar) so that only stroom nodes can access the open ports.
Content
It may be possible for a user to write XSLT, Data Splitter or other content that may expose data that we do not wish to or to cause the application some harm. At present processing operations are not isolated processes and so it is easy to cripple processing performance with a badly written translation whether written accidentally or on purpose. To mitigate this risk it is recommended that users that are given permission to create XSLT, Data Splitter and Pipeline configurations are trusted to do so.
Visualisations can be completely customised with javascript. The javascript that is added is executed in a client’s browser, potentially opening up the possibility of XSS attacks, an attack on the application to access data that a user shouldn’t be able to access, an attack to destroy data, or simply failure/incorrect operation of the user interface. To mitigate this risk all user defined javascript is executed within a separate browser IFrame. In addition all javascript should be examined before being added to a production system unless the author is trusted. This may necessitate the creation of a separate development and testing environment for user content.
14 - Tools
14.1 - Command Line Tools
Stroom has a number of tools that are available from the command line in addition to starting the main application.
Running commands
The basic structure of the shell command for starting one of stroom’s commands depends on whether you are running the zip distribution of stroom or a docker stack.
In either case, COMMAND
is the name of the stroom command to run, as specified by the various headings on this page.
Each command value is described in its own section and may take no arguments or a mixture of mandatory and optional arguments.
Note
These commands are very powerful and potentially dangerous in the wrong hands, e.g. they allow the changing of users’ passwords. Access to these commands should be strictly limited. Also, each command will run in its own JVM so they are not really intended to be run when Stroom is running on the node.
Running commands with the zip distribution
The commands are run by passing the command and any of its arguments to the java
command.
The jar file is in the bin
directory of the zip distribution.
For example:
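A minimal sketch of the general form (the jar file name and paths are illustrative and will vary with your Stroom version and install location):
java -jar bin/stroom-app-all.jar COMMAND [ARGUMENTS...] path/to/config.yml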
Running commands in a stroom Docker stack
Commands are run in a Docker stack using the command.sh
script found in the root of the stack directory structure.
Note
You do not specify the config file location as the script does this for you.
For example:
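A sketch, assuming the command name and its arguments are passed straight to the script (the command shown is illustrative):
./command.sh migrate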
Command reference
Note
All the examples below assume you are running stroom as part of the zip distribution. If you are running a Docker stack then you will need to use the command.sh script (as described above) with the same arguments but omitting the config file path.
server
This is the normal command for starting the Stroom application using the supplied YAML configuration file.
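A sketch of the command (the jar name and paths are illustrative):
java -jar bin/stroom-app-all.jar server path/to/config.yml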
The example above will start the application as a foreground process.
Stroom would typically be started using the start.sh
shell script, but the command above is listed for completeness.
When stroom starts it will check the database to see if any migration is required. If migration from an earlier version (including from an empty database) is required then this will happen as part of the application start process.
migrate
There may be occasions where you want to migrate an old version but not start the application, e.g. during migration testing or to initiate the migration before starting up a cluster. This command will run the process that checks for any required migrations and then performs them. On completion of the process it exits. This runs as a foreground process.
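A sketch of the command (the jar name and paths are illustrative):
java -jar bin/stroom-app-all.jar migrate path/to/config.yml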
create_account
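A sketch of the command using the arguments listed below (the jar name, paths and values are illustrative):
java -jar bin/stroom-app-all.jar create_account -u admin -p "changeme" --noPasswordChange path/to/config.yml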
Where the named arguments are:
- -u --user - The username for the user.
- -p --password - The password for the user.
- -e --email - The email address of the user.
- -f --firstName - The first name of the user.
- -s --lastName - The last name of the user.
- --noPasswordChange - If set, do not require a password change on first login.
- --neverExpires - If set, the account will never expire.
This command will create an account in the internal identity provider within Stroom. Stroom is able to use an external OpenID identity provider such as Google or AWS Cognito but by default will use its own. When configured to use its own (the default) it will auto create an admin account when starting up a fresh instance. There are times when you may wish to create this account manually, which this command allows.
Authentication Accounts and Stroom Users
The user account used for authentication is distinct from the Stroom user entity that is used for authorisation within Stroom. If an external IDP is used then the mechanism for creating the authentication account will be specific to that IDP. If using the default internal Stroom IDP then an account must be created in order to authenticate, either from within the UI if you are already authenticated as a privileged user, or using this command. In either case a Stroom user will need to exist with the same username as the authentication account.
The command will fail if the user already exists. This command should NOT be run if you are using an external identity provider.
This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.
reset_password
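A sketch of the command (the jar name, paths and values are illustrative):
java -jar bin/stroom-app-all.jar reset_password -u admin -p "newpassword" path/to/config.yml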
Where the named arguments are:
- -u --user - The username for the user.
- -p --password - The password for the user.
This command is used for changing the password of an existing account in stroom’s internal identity provider. It will also reset all locked/inactive/disabled statuses to ensure the account can be logged into. This command should NOT be run if you are using an external identity provider. It will fail if the account does not exist.
This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.
manage_users
Where the named arguments are:
- --createUser USER_IDENTIFIER - Creates a Stroom user with the supplied user identifier.
- --createGroup GROUP_IDENTIFIER - Creates a Stroom user group with the supplied group name.
- --addToGroup USER_OR_GROUP_IDENTIFIER TARGET_GROUP - Adds a user/group to an existing group.
- --removeFromGroup USER_OR_GROUP_IDENTIFIER TARGET_GROUP - Removes a user/group from an existing group.
- --grantPermission USER_OR_GROUP_IDENTIFIER PERMISSION_IDENTIFIER - Grants the named application permission to the user/group.
- --revokePermission USER_OR_GROUP_IDENTIFIER PERMISSION_IDENTIFIER - Revokes the named application permission from the user/group.
- --listPermissions - Lists all the valid permission names.
This command allows you to manage the account permissions within stroom regardless of whether the internal identity provider or an external party is used. A typical use case for this is when using an external identity provider. In this instance Stroom has no way of auto creating an admin account when first started, so the association between the account on the 3rd party IDP and the stroom user account needs to be made manually. To set up an admin account to enable you to log in to stroom you can use this command, as shown in the example below.
This command is not intended for automation of user management tasks on a running Stroom instance that you can authenticate with.
It is only intended for cases where you cannot authenticate with Stroom, i.e. when setting up a new Stroom with a 3rd party IDP or when scripting the creation of a test environment.
If you want to automate actions that can be performed in the UI then you can make use of the REST API that is described at /stroom/noauth/swagger-ui
.
Warning
See the section above about the distinction between authentication accounts and stroom users.
The following is an example command to create a new stroom user jbloggs
, create a group called Administrators
with the Administrator application permission and then add jbloggs
to the Administrators
group.
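A sketch of such a command (the jar name and paths are illustrative; omit the config file path if you are using command.sh):
java -jar bin/stroom-app-all.jar manage_users \
  --createUser jbloggs \
  --createGroup Administrators \
  --addToGroup jbloggs Administrators \
  --grantPermission Administrators Administrator \
  path/to/config.yml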
This is a typical command to bootstrap a stroom instance with one admin user so they can login to stroom with full privileges to manage other users from within the application.
Where jbloggs is the user name of the account on the identity provider.
This command will also run any necessary database migrations to ensure it is working with the correct version of the database schema.
The named arguments can be used as many times as you like so you can create multiple users/groups/grants/etc. Regardless of the order of the arguments, the changes are executed in the following order:
- Create users
- Create groups
- Add users/groups to a group
- Remove users/groups from a group
- Grant permissions to users/groups
- Revoke permissions from users/groups
External Identity Providers
The manage_users command is particularly useful when using stroom with an external identity provider.
In order to use a new install of stroom that is configured with an external identity provider you must first set up a user with the Administrator system permission.
If this is not done, users will be able to log in to stroom but will have no permissions to do anything.
You can optionally set up other groups/users with other permissions to bootstrap the stroom instance.
External OIDC identity providers have a unique identifier for each user (this may be called sub
or oid
) and this often takes the form of a
UUID
.
Stroom stores this unique identifier (known as a Subject ID in stroom) against a user so it is able to associate the stroom user with the identity provider user.
Identity providers may also have a more friendly display name and full name for the user, though these may not be unique.
USER_IDENTIFIER
The USER_IDENTIFIER
is of the form subject_id[,display_name[,full_name]]
e.g.:
eaddac6e-6762-404c-9778-4b74338d4a17
eaddac6e-6762-404c-9778-4b74338d4a17,jbloggs
eaddac6e-6762-404c-9778-4b74338d4a17,jbloggs,Joe Bloggs
The optional parts are so that stroom can display more human friendly identifiers for a user. They are only initial values and will always be overwritten with the values from the identity provider when the user logs in.
The following are examples of various uses of the --createUser
argument group.
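For example, based on the identifier forms shown above (quotes are needed where the identifier contains spaces):
--createUser eaddac6e-6762-404c-9778-4b74338d4a17
--createUser eaddac6e-6762-404c-9778-4b74338d4a17,jbloggs
--createUser "eaddac6e-6762-404c-9778-4b74338d4a17,jbloggs,Joe Bloggs"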
GROUP_IDENTIFIER
The GROUP_IDENTIFIER
is the name of the group in stroom, e.g. Administrators
, Analysts
, etc.
Groups are created by an admin to help manage permissions for a large number of similar users.
Groups relate only to stroom and have nothing to do with the identity provider.
USER_OR_GROUP_IDENTIFIER
The USER_OR_GROUP_IDENTIFIER
can either be the identifier for a user or a group, e.g. when granting a permission to a user/group.
It takes the following forms (with examples for each):
- user_subject_id, e.g. eaddac6e-6762-404c-9778-4b74338d4a17
- user_display_name, e.g. jbloggs
- group_name, e.g. Administrators
The value for the argument will first be treated as a unique identifier (i.e. the subject ID or group name). If the user cannot be found it will fall back to using the display name to find the user.
create_api_key
The create_api_key
command can be used to create an API Key for a user.
This is useful if, when bootstrapping a cluster, you want to set up a user and associated API Key to allow an external process to monitor/manage that Stroom cluster, e.g. using an Operator in Kubernetes.
The arguments to the command are as follows:
- -u --user - The identity of the user to create the API Key for. This is the unique subject ID of the user.
- -n --keyName - The name of the key. This must be unique for the user.
- -e --expiresDays - Optional number of days after which the key should expire. This must not be greater than the configured property stroom.security.authentication.maxApiKeyExpiryAge. If not set, it will be defaulted to the maximum configured age.
- -c --comments - Optional string to set the comments for the API Key.
- -o --outFile - Optional path to use to output the API Key string to. If not set, the API Key string will be output to stdout.
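A sketch of the command (the jar name, paths and values are illustrative):
java -jar bin/stroom-app-all.jar create_api_key \
  -u eaddac6e-6762-404c-9778-4b74338d4a17 \
  -n "monitoring-key" \
  -e 365 \
  -o /tmp/monitoring-key.txt \
  path/to/config.yml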
14.2 - Stream Dump Tool
Data within Stroom can be exported to a directory using the StreamDumpTool
.
The tool is contained within the core Stroom Java library and can be accessed via the command line, e.g.
java -cp "apache-tomcat-7.0.53/lib/*:lib/*:instance/webapps/stroom/WEB-INF/lib/*" stroom.util.StreamDumpTool outputDir=output
Note the classpath may need to be altered depending on your installation.
The above command will export all content from Stroom and output it to a directory called output
. Data is exported to zip files in the same format as zip files in proxy repositories. The structure of the exported data is ${feed}/${pathId}/${id}
by default with a .zip
extension.
To provide greater control over what is exported and how, the following additional parameters can be used:
- feed - Specify the name of the feed to export data for (all feeds by default).
- streamType - The single stream type to export (all stream types by default).
- createPeriodFrom - Exports data created after this time, specified in ISO8601 UTC format, e.g. 2001-01-01T00:00:00.000Z (exports from earliest data by default).
- createPeriodTo - Exports data created before this time, specified in ISO8601 UTC format, e.g. 2001-01-01T00:00:00.000Z (exports up to latest data by default).
- outputDir - The output directory to write data to (required).
- format - The format of the output data directory and file structure (${feed}/${pathId}/${id} by default).
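For example, a sketch that restricts the export to one feed and stream type over a period (the classpath is as in the earlier example and the values are illustrative):
java -cp "apache-tomcat-7.0.53/lib/*:lib/*:instance/webapps/stroom/WEB-INF/lib/*" stroom.util.StreamDumpTool \
  feed=MY_FEED streamType=RAW_EVENTS \
  createPeriodFrom=2001-01-01T00:00:00.000Z createPeriodTo=2002-01-01T00:00:00.000Z \
  outputDir=output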
Format
The format parameter can include several replacement variables:
- feed - The name of the feed for the exported data.
- streamType - The data type of the exported data, e.g. RAW_EVENTS.
- streamId - The id of the data being exported.
- pathId - An incrementing numeric id that creates sub directories when required to ensure no directory ends up containing too many files.
- id - An incrementing numeric id similar to pathId but without sub directories.
15 - User Content
15.1 - Editing Text
Stroom uses the Ace text editor for editing and viewing text, such as XSLTs, raw data, cooked events, stepping, etc. The editor provides various useful features:
- Syntax highlighting
- Themes
- Find/replace (see Keyboard Shortcuts)
- Code auto-completion
Keyboard shortcuts
See Keyboard Shortcuts for details of the keyboard shortcuts available when using the Ace editor.
Vim key bindings
If you are familiar with the Vi/Vim text editors then it is possible to enable Vim key bindings in Stroom. This can be done in two ways.
Either globally by setting Editor Key Bindings to Vim
in your user preferences:
Or within an editor using the context menu. This latter option allows you to temporarily change your bindings.
The Ace editor does not support all features of Vim however the core navigation/editing key bindings are present. The key supported features of Vim are:
- Visual mode and visual block mode.
- Searching with
/
(javascript flavour regex) - Search/replace with commands like
:%s/foo/bar/g
- Incrementing/decrementing numbers with Ctrl ^ + a / Ctrl ^ + b
- Code (un-)folding with z , o , z , c , etc.
- Text objects, e.g.
>
,)
,]
,'
,"
,p
paragraph,w
word. - Repetition with the
.
command. - Jumping to a line with
:<line no>
.
Notable features not supported by the Ace editor:
- The following text objects are not supported
b
- Braces, i.e{
or[
.t
- Tags, i.e. XML tags<value>
.s
- Sentence.
- The
g
command mode command, i.e.:g/foo/d
- Splits
For a list of useful Vim key bindings see this cheat sheet , though not all bindings will be available in Stroom’s Ace editor.
Use of Esc
key in Vim mode
The Esc key is bound to the close action in Stroom, so pressing Esc will typically close a popup, dialog, selection box, etc. Dialogs will not be closed if the Ace editor has focus but as Esc is used so frequently with Vim bindings it may be advisable to use an alternative key to exit insert mode to avoid accidental closure. You can use the standard Vim binding of Ctrl ^ + [ or the custom binding of k , b as alternatives to Esc .
Auto-Completion And Snippets
The editor supports a number of different types of auto-completion of text. Completion suggestions are triggered by the following mechanisms:
- Ctrl ^ + Space ␣ - when live auto-complete is disabled.
- Typing - when live auto-complete is enabled.
When completion suggestions are triggered the follow types of completion may be available depending on the text being edited.
- Local - any word/token found in the existing document. Useful if you have typed a long word and need to type it again.
- Keyword - A word/token that has been defined in the syntax highlighting rules for the text type, i.e.
function
is a keyword when editing Javascript. - Snippet - A block of text that has been defined as a snippet for the editor mode (XML, Javascript, etc.).
Snippets
Snippets allow you to quickly enter pre-defined blocks of common text into the editor.
For example when editing an XSLT you may want to insert a call-template
with parameters.
To do this using snippets you can do the following:
-
Type
call
then hit Ctrl ^ + Space ␣ . -
In the list of options use the cursor keys to select
call-template with-param
then hit Enter ↵ or Tab ↹ to insert the snippet. The snippet will look like<xsl:call-template name="template"> <xsl:with-param name="param"></xsl:with-param> </xsl:call-template>
-
The cursor will be positioned on the first tab stop (the template name) with the tab stop text selected.
-
At this point you can type in your template name, e.g.
MyTemplate
, then hit Tab ↹ to advance to the next tab stop (the param name) -
Now type the name of the param, e.g.
MyParam
, then hit Tab ↹ to advance to the last tab stop positioned within the<with-param>
ready to enter the param value.
Snippets can be disabled from the list of suggestions by selecting the option in the editor context menu.
Tab triggers
Some snippets can be triggered by typing an abbreviation and then hitting Tab ↹ to insert the snippet. This mechanism is faster than hitting Ctrl ^ + Space ␣ and selecting the snippet, if you can remember the snippet tab trigger abbreviations.
Available snippets
For a list of the available completion snippets see the Completion Snippet Reference.
Theme
The editor has a number of different themes that control what colours are used for the different elements in syntax highlighted text. The theme can be set in User Preferences, accessed from the main menu.
The list of themes available matches the main Stroom theme, i.e. dark Ace editor themes for a dark Stroom theme.
15.2 - Naming Conventions
Stroom has been in use by GCHQ for many years and is used to process logs from a large number of different systems. This section aims to provide some guidelines on how to name and organise your content, e.g. Feeds, XSLTs, Pipelines, Folders, etc. These are not hard rules and you do not have to follow them, however it may help when it comes to sharing content.
See Also
TODO
Complete this section.
15.3 - Documenting content
The screen for each Entity in Stroom has a Documentation sub-tab. The purpose of this sub-tab is to allow the user to provide any documentation about the entity that is relevant. For example a user might want to provide information about the system that a Feed receives data from, or document the purpose of a complex XSLT translation.
In previous versions of stroom this documentation was a small and simple Description text field, however now it is a full screen of rich text. This screen defaults to its read-only preview mode, but the user can toggle it to the edit mode to edit the content. In the edit mode, the documentation can be created/edited using plain text, or Markdown . Markdown is a fairly simple markup language for producing richly formatted text from plain text.
There are many variants of markdown that all have subtly different features or syntax. Stroom uses the Showdown markdown converter to render users’ markdown content into formatted text. This link is the definitive source for supported markdown syntax.
Note
The Showdown markdown processor used in stroom is not the same as the markdown processor used within this documentation site (stroom-docs), so there may be some subtle differences in syntax.
Example Markdown Content
The following is a brief guide to the most common formatting that can be done with markdown and that is supported in Stroom.
# Markdown Example
This is an example of a markdown document.
## Headings Example
This is at level 2.
### Heading Level 3
This is at level 3.
#### Heading Level 4
This is at level 4.
## Text Styles
**bold**, __bold__, *italic*, _italic_, ***bold and italic***, ~~strike-through~~
## Bullets
Use four spaces to indent a sub-item.
* Bullet 1
* Bullet 1a
* Bullet 2
* Bullet 2a
## Numbered Lists
Use four spaces to indent a sub-item.
Using `1` for all items means the markdown processor will replace them with the correct number, making it easier to re-order items.
1. Item 1
1. Item 1a
1. Item 1b
1. Item 2
1. Item 2a
1. Item 2b
## Quoted Text
> This is a quote.
Text
> This is another quote.
> It has multiple lines...
>
> ...and gaps and bullets
> * Item 1
> * Item 2
## Tables
Note `---:` to right align a column, `:---:` to center align it.
| Syntax | Description | Value | Fruit |
| ----------- | ----------- | -----:| :----: |
| Row 1 | Title | 1 | Pear |
| Row 2 | Text | 10 | Apple |
| Row 3 | Text | 100 | Kiwi |
| Row 4 | Text | 1000 | Orange |
Table using `<br>` for multi-line cells.
| Name | Description |
|-----------|-----------------|
| Row 1 | Line 1<br>Line2 |
| Row 2 | Line 1<br>Line2 |
## Links
Link: [title](https://www.example.com)
## Simple Lists
Add two spaces to the end of each line to stop each line being treated as a paragraph.
One
Two
Three
## Paragraphs
Lines not separated by a blank line will be joined together with a space between them.
Stroom will wrap long lines when rendered.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
## Task Lists
The `X` indicates a task has been completed.
* [x] Write the press release
* [ ] Update the website
* [ ] Contact the media
## Images
A non-standard syntax is supported to render images at a set size in the form of `<width>x<height>`.
Use a `*` for one of the dimensions to scale it proportionately.
![This is my alt text](images/logo.svg =200x* "This is my tooltip title")
## Separator
This is a horizontal rule separator
---
## Code
Code can be represented in-line like this `echo "hello world"` by surround it with single back-ticks.
Multi-line blocks of code can rendered with the appropriate syntax highlighting using a fenced block comprising three back-ticks.
Specify the language after the first set of three back ticks, or `text` for plain text.
Only certain languages are supported in Stroom.
**JSON**
```json
{
"key1": "some text",
"key2": 123
}
```
**XML**
```xml
<record>
<data name="dateTime" value="2020-09-28T14:30:33.476" />
<data name="machineIp" value="19.141.201.14" />
</record>
```
**bash**
```bash
#!/bin/bash
now="$(date)"
computer_name="$(hostname)"
echo "Current date and time : $now"
echo "Computer name : $computer_name"
```
Wrapping
Long paragraphs will be wrapped
Code Syntax Highlighting
This is an example of a fenced code block.
```xml
<record>
<data name="dateTime" value="2020-09-28T14:30:33.476" />
</record>
```
In this example, xml
defines the language used within the fenced block.
Stroom supports the following languages for fenced code blocks.
If you require additional languages then please raise a ticket
here
. If your language is not currently supported or is just plain text then use text
.
- text
- sh
- bash
- xml
- css
- javascript
- csv
- regex
- powershell
- sql
- json
- yaml
- properties
- toml
Fenced blocks with content that is wider than the pane will result in the fenced block having its own horizontal scroll bar.
Escaping Characters
It is common to use _
characters in
Feed
names, however if there are two of these in a word then the markdown processor will interpret them as italic markup.
To prevent this, either surround the word with back ticks so it is rendered as code, or escape each underscore with a \, i.e. THIS\_IS\_MY\_FEED, which renders as THIS_IS_MY_FEED.
HTML
While it is possible to use HTML in the documentation, its use is not recommended as it increases the complexity of the documentation content and requires that other users have knowledge of HTML. Markdown should be sufficient for most cases, with the possible exception of complex tables where HTML may be preferable.
Note
No form of HTML scripting (i.e. Javascript) is supported within the documentation content.
15.4 - Finding Things
Explorer Tree
The Explorer Tree in stroom is the primary means of finding user created content, for example Feeds, XSLTs, Pipelines, etc.
Branches of the Explorer Tree can be expanded and collapsed to reveal/hide the content at different levels.
Filtering by Type
The Explorer Tree can be filtered by the type of content, e.g. to display only Feeds, or only Feeds and XSLTs. This is done by clicking the filter icon. The following is an example of filtering by Feeds and XSLTs.
Clicking All/None toggles between all types selected and no types selected.
Filtering by type can also be achieved using the Quick Filter by entering the type name (or a partial form of the type name), prefixed with type:
.
I.e:
type:feed
For example:
NOTE: If both the type picker and the Quick Filter are used to filter on type then the two filters will be combined as an AND.
Filtering by Name
The Explorer Tree can be filtered by the name of the entity. This is done by entering some text in the Quick Filter field. The tree will then be updated to only show entities matching the Quick Filter. The way the matching works for entity names is described in Common Fuzzy Matching.
Filtering by UUID
What is a UUID?
The Explorer Tree can be filtered by the UUID of the entity. A UUID (Universally Unique Identifier) is an identifier that can be relied on to be unique both within the system and universally across all other systems. Stroom uses UUIDs as the primary identifier for all content (Feeds, XSLTs, Pipelines, etc.) created in Stroom. An entity’s UUID is generated randomly by Stroom upon creation and is fixed for the life of that entity.
When an entity is exported it is exported with its UUID and if it is then imported into another instance of Stroom the same UUID will be used. The name of an entity can be changed within Stroom but its UUID remains un-changed.
With the exception of Feeds, Stroom allows multiple entities to have the same name. This is because entities may exist that a user does not have access to see so restricting their choice of names based on existing invisible entities would be confusing. Where there are multiple entities with the same name the UUID can be used to distinguish between them.
The UUID of an entity can be viewed using the context menu for the entity. The context menu is accessed by right-clicking on the entity.
Clicking Info displays the entity’s UUID.
The UUID can be copied by selecting it and then pressing Ctrl ^ + c .
UUID Quick Filter Matching
In the Explorer Tree Quick Filter you can filter by UUIDs in the following ways:
To show the entity matching a UUID, enter the full UUID value (with dashes) prefixed with the field qualifier uuid
, e.g. uuid:a95e5c59-2a3a-4f14-9b26-2911c6043028
.
To filter on part of a UUID you can do uuid:/2a3a
to find an entity whose UUID contains 2a3a
or uuid:^2a3a
to find an entity whose UUID starts with 2a3a
.
Quick Filters
Quick Filter controls are used in a number of screens in Stroom. The most prominent use of a Quick Filter is in the Explorer Tree as described above. Quick filters allow for quick searching of a data set or a list of items using a text based query language. The basis of the query language is described in Common Fuzzy Matching.
A number of the Quick Filters are used for filtering tables of data that have a number of fields.
The quick filter query language supports matching in specified fields.
Each Quick Filter will have a number of named fields that it can filter on.
The field to match on is specified by prefixing the match term with the name of the field followed by a :
, i.e. type:
.
Multiple field matches can be used, each separated by a space.
E.g:
name:^xml name:events$ type:feed
In the above example the filter will match on items with a name beginning xml
, a name ending events
and a type partially matching feed
.
All the match terms are combined together with an AND operator. The same field can be used multiple times in the match. The list of filterable fields and their qualifier names (sometimes a shortened form) are listed by clicking on the help icon.
One or more of the fields will be defined as default fields. This means that if no qualifier is entered the match will be applied to all default fields using an OR operator. Sometimes all fields may be considered default, which means a match term will be tested against all fields and an item will be included in the results if one or more of those fields match.
For example if the Quick Filter has fields Name
, Type
and Status
, of which Name
and Type
are default:
feed status:ok
The above would match items where the Name OR Type fields match feed
AND the Status field matches ok
.
Match Negation
Each match item can be negated using the !
prefix.
This is also described in Common Fuzzy Matching.
The prefix is applied after the field qualifier.
E.g:
name:xml source:!/default
In the above example it would match on items where the Name field matched xml
and the Source field does NOT match the regex pattern default
.
Spaces and Quotes
If your match term contains a space then you can surround the match term with double quotes.
Also if your match term contains a double quote you can escape it with a \
character.
The following would be valid for example.
"name:csv splitter" "default field match" "symbol:\""
Multiple Terms
If multiple terms are provided for the same field then an AND is used to combine them. This can be useful where you are not sure of the order of words within the items being filtered.
For example:
User input: spain plain rain
Will match:
The rain in spain stays mainly in the plain
^^^^ ^^^^^ ^^^^^
rainspainplain
^^^^^^^^^^^^^^
spain plain rain
^^^^^ ^^^^^ ^^^^
raining spain plain
^^^^^^^ ^^^^^ ^^^^^
Won’t match: sprain
, rain
, spain
OR Logic
There is no support for combining terms with an OR. However, you can achieve this using a single regular expression term. For example:
User input: status:/(disabled|locked)
Will match:
Locked
^^^^^^
Disabled
^^^^^^^^
Won’t match: Enabled
, Deleted
Suggestion Input Fields
Stroom uses a number of suggestion input fields, such as when selecting Feeds, Pipelines, types, status values, etc. in the pipeline processor filter screen.
These fields will typically display the full list of values or a truncated list where the total number of values is too large. Entering text in one of these fields will use the fuzzy matching algorithm to partially/fully match on values. See Common Fuzzy Matching below for details of how the matching works.
Common Fuzzy Matching
A common fuzzy matching mechanism is used in a number of places in Stroom. It is used for partially matching the user input to a list of possible values.
In some instances, the list of matched items will be truncated to a more manageable size with the expectation that the filter will be refined.
The fuzzy matching employs a number of approaches that are attempted in the following order:
NOTE: In the following examples the ^
character is used to indicate which characters have been matched.
No Input
If no input is provided all items will match.
Contains (Default)
If no prefixes or suffixes are used then all characters in the user input will need to be contained as a whole somewhere within the string being tested. The matching is case insensitive.
User input: bad
Will match:
bad angry dog
^^^
BAD
^^^
very badly
^^^
Very bad
^^^
Won’t match: dab
, ba d
, ba
Characters Anywhere Matching
If the user input is prefixed with a ~
(tilde) character then characters anywhere matching will be employed.
The matching is case insensitive.
User input: bad
Will match:
Big Angry Dog
^ ^ ^
bad angry dog
^^^
BAD
^^^
badly
^^^
Very bad
^^^
b a d
^ ^ ^
bbaadd
^ ^ ^
Won’t match: dab
, ba
Word Boundary Matching
If the user input is prefixed with a ?
character then word boundary matching will be employed.
This approach uses upper case letters to denote the start of a word.
If you know some or all of the words in the item you are looking for, and their order, then condensing those words down to their first letters (capitalised) makes this a more targeted way to find what you want than the characters anywhere matching above.
Words can either be separated by characters like _- ()[].
, or be distinguished with lowerCamelCase
or upperCamelCase
format.
An upper case letter in the input denotes the beginning of a word and any subsequent lower case characters are treated as contiguously following the character at the start of the word.
User input: ?OTheiMa
Will match:
the cat sat on their mat
^ ^^^^ ^^
ON THEIR MAT
^ ^^^^ ^^
Of their magic
^ ^^^^ ^^
o thei ma
^ ^^^^ ^^
onTheirMat
^ ^^^^ ^^
OnTheirMat
^ ^^^^ ^^
Won’t match: On the mat
, the cat sat on there mat
, On their moat
User input: ?MFN
Will match:
MY_FEED_NAME
^ ^ ^
MY FEED NAME
^ ^ ^
MY_FEED_OTHER_NAME
^ ^ ^
THIS_IS_MY_FEED_NAME_TOO
^ ^ ^
myFeedName
^ ^ ^
MyFeedName
^ ^ ^
also-my-feed-name
^ ^ ^
MFN
^^^
stroom.something.somethingElse.maxFileNumber
^ ^ ^
Won’t match: myfeedname
, MY FEEDNAME
Regular Expression Matching
If the user input is prefixed with a /
character then the remaining user input is treated as a Java syntax regular expression.
A string will be considered a match if any part of it matches the regular expression pattern.
The regular expression operates in case insensitive mode.
For more details on the syntax of java regular expressions see this internet link https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html.
User input: /(^|wo)man
Will match:
MAN
^^^
A WOMAN
^^^^^
Manly
^^^
Womanly
^^^^^
Won’t match: A MAN, HUMAN
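The behaviour described above, where any part of the string matching the pattern counts as a match, corresponds to compiling the pattern in case-insensitive mode and doing a partial find rather than a full match. A minimal sketch, illustrative only and not Stroom's actual code:

import java.util.regex.Pattern;

public class RegexMatchSketch {

    // A match is found if any part of the item matches the pattern, ignoring case.
    static boolean matches(final String userRegex, final String item) {
        return Pattern.compile(userRegex, Pattern.CASE_INSENSITIVE)
                .matcher(item)
                .find();
    }

    public static void main(String[] args) {
        System.out.println(matches("(^|wo)man", "A WOMAN")); // true
        System.out.println(matches("(^|wo)man", "Manly"));   // true
        System.out.println(matches("(^|wo)man", "HUMAN"));   // false
    }
}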
Exact Match
If the user input is prefixed with a ^ character and suffixed with a $ character then a case-insensitive exact match will be used.
E.g:
User input: ^xml-events$
Will match:
xml-events
^^^^^^^^^^
XML-EVENTS
^^^^^^^^^^
Won’t match: xslt-events, XML EVENTS, SOME-XML-EVENTS, AN-XML-EVENTS-PIPELINE
Note: Despite the similarity in syntax, this is NOT regular expression matching.
Starts With
If the user input is prefixed with a ^ character then a case-insensitive starts with match will be used.
E.g:
User input: ^events
Will match:
events
^^^^^^
EVENTS_FEED
^^^^^^
events-xslt
^^^^^^
Won’t match: xslt-events, JSON_EVENTS
Note: Despite the similarity in syntax, this is NOT regular expression matching.
Ends With
If the user input is suffixed with a $ character then a case-insensitive ends with match will be used.
E.g:
User input: events$
Will match:
events
^^^^^^
xslt-events
^^^^^^
JSON_EVENTS
^^^^^^
Won’t match: EVENTS_FEED, events-xslt
Note: Despite the similarity in syntax, this is NOT regular expression matching.
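Taken together, the Exact Match, Starts With and Ends With forms above can be sketched by stripping the ^ and $ markers and delegating to the corresponding case-insensitive string operations. This is an illustration of the behaviour described above only, with made-up class and method names, not Stroom's actual implementation:

import java.util.Locale;

public class AnchoredMatchSketch {

    // Dispatch on the ^ prefix and $ suffix described in the sections above.
    static boolean matches(final String userInput, final String item) {
        final String value = item.toLowerCase(Locale.ROOT);
        if (userInput.startsWith("^") && userInput.endsWith("$")) {
            // Exact match, e.g. ^xml-events$
            return value.equals(userInput.substring(1, userInput.length() - 1).toLowerCase(Locale.ROOT));
        } else if (userInput.startsWith("^")) {
            // Starts with, e.g. ^events
            return value.startsWith(userInput.substring(1).toLowerCase(Locale.ROOT));
        } else if (userInput.endsWith("$")) {
            // Ends with, e.g. events$
            return value.endsWith(userInput.substring(0, userInput.length() - 1).toLowerCase(Locale.ROOT));
        }
        // Fall back to the default contains matching.
        return value.contains(userInput.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(matches("^xml-events$", "XML-EVENTS")); // true  (exact)
        System.out.println(matches("^events", "EVENTS_FEED"));     // true  (starts with)
        System.out.println(matches("events$", "xslt-events"));     // true  (ends with)
        System.out.println(matches("events$", "events-xslt"));     // false
    }
}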
Wild-Carded Case Sensitive Exact Matching
If one or more * characters are found in the user input then this form of matching will be used.
This form of matching supports those fields that accept wild-carded values, e.g. a wild-carded feed name expression term.
In this instance you are NOT picking a value from the suggestion list but entering a wild-carded value that will be evaluated when the expression/filter is actually used.
The user may want an expression term that matches on all feeds starting with XML_, in which case they would enter XML_*.
The list of suggested items will reflect the wild-carded input, to give an indication of what it would match on if the list of feeds remains the same.
User input: XML_*
Will match:
XML_
^^^^
XML_EVENTS
^^^^
Won’t match: BAD_XML_EVENTS, XML-EVENTS, xml_events
User input: XML_*EVENTS*
Will match:
XML_EVENTS
^^^^^^^^^^
XML_SEC_EVENTS
^^^^ ^^^^^^
XML_SEC_EVENTS_FEED
^^^^ ^^^^^^
Won’t match: BAD_XML_EVENTS, xml_events
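Wild-carded matching can be thought of as converting the input into an anchored, case sensitive regular expression in which each * becomes .* and everything else is treated literally. A possible sketch of that conversion, illustrative only and not Stroom's actual code:

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WildCardMatchSketch {

    // Quote each literal segment and join the segments with ".*", then require a whole-string match.
    static boolean matches(final String wildCarded, final String item) {
        final String regex = Arrays.stream(wildCarded.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return Pattern.compile(regex).matcher(item).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("XML_*", "XML_EVENTS"));                 // true
        System.out.println(matches("XML_*", "BAD_XML_EVENTS"));             // false (must start with XML_)
        System.out.println(matches("XML_*", "xml_events"));                 // false (case sensitive)
        System.out.println(matches("XML_*EVENTS*", "XML_SEC_EVENTS_FEED")); // true
    }
}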
Match Negation
A match can be negated, ie. the NOT operator using the prefix !
.
This prefix can be applied before all the match prefixes listed above.
E.g:
!/(error|warn)
In the above example it will match everything except those matched by the regex pattern (error|warn).
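Negation can be sketched as stripping the leading ! and inverting the result of whichever match type the rest of the input selects. The fuzzyMatch helper below is a hypothetical stand-in for the full matcher described above (it only does a case-insensitive contains so the example runs); this is an illustration, not Stroom's actual code.

public class NegationSketch {

    // Hypothetical stand-in for the full fuzzy matcher described in the sections above.
    static boolean fuzzyMatch(final String userInput, final String item) {
        return item.toLowerCase().contains(userInput.toLowerCase());
    }

    // Strip a leading '!' and invert the result of the underlying match.
    static boolean matches(final String userInput, final String item) {
        if (userInput.startsWith("!")) {
            return !fuzzyMatch(userInput.substring(1), item);
        }
        return fuzzyMatch(userInput, item);
    }

    public static void main(String[] args) {
        System.out.println(matches("!error", "warning message"));  // true (does not contain "error")
        System.out.println(matches("!error", "an error message")); // false
    }
}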
16 - Viewing Data
The data viewer is shown on the Data tab when you open (by double clicking) one of these items in the explorer tree:
- Feed - to show all data for that feed.
- Folder - to show all data for all feeds that are descendants of the folder.
- System Root Folder - to show all data for all feeds that are descendants of the root folder, i.e. all feeds in the system.
In all cases the data shown is dependent on the permissions of the user performing the action and any permissions set on the feeds/folders being viewed.
The Data Viewer screen is made up of the following three parts which are shown as three panes split horizontally.
Stream List
This shows all streams within the opened entity (feed or folder). The streams are shown in reverse chronological order. By default Deleted and Locked streams are filtered out. The filtering can be changed by clicking on the icon. This will show all stream types by default so may be a mixture of Raw Events, Events, Errors, etc. depending on the feed/folder in question.
Related Stream List
This list only shows data when a stream is selected in the streams list above it. It shows all streams related to the currently selected stream. It may show streams that are ‘ancestors’ of the selected stream, e.g. showing the Raw Events stream for an Events stream, or show descendants, e.g. showing the Errors stream which resulted from processing the selected Raw Events stream.
Content Viewer Pane
This pane shows the contents of the stream selected in the Related Streams List. The content of a stream will differ depending on the type of stream selected and the child stream types in that stream. For more information on the anatomy of streams, see Streams. This pane is split into multiple sub tabs depending on the different types of content available.
Info Tab
This sub-tab shows the information for the stream, such as creation times, size, physical file location, state, etc.
Error Tab
This sub-tab is only visible for an Error stream. It shows a table of errors and warnings with their associated messages and the locations in the stream that they relate to.
Data Preview Tab
This sub-tab shows the content of the data child stream, formatted if it is XML. It will only show a limited amount of data so if the data child stream is large then it will only show the first n characters.
If the stream is multi-part then you will see Part navigation controls to switch between parts. For each part you will be shown the first n characters of that part (formatted if applicable).
If the stream is a Segmented stream then you will see the Record navigation controls to switch between records. Only one record is shown at once. If a record is very large then only the first n characters of the record will be shown.
This sub-tab is intended for seeing a quick preview of the data in a form that is easy to read, i.e. formatted. If you want to see the full data in its original form then click on the View Source link at the top right of the sub-tab.
The Data Preview tab shows a ‘progress’ bar to indicate what portion of the content is visible in the editor.
Context Tab
This sub-tab is only shown for non-segmented streams, e.g. Raw Events and Raw Reference, that have an associated context data child stream. For more details of context streams, see Context Data. This sub-tab works in exactly the same way as the Data Preview sub-tab except that it shows a different child stream.
Meta Tab
This sub-tab is only shown for non-segmented streams, e.g. Raw Events and Raw Reference, that have an associated meta data child stream. For more details of meta streams, see Streams. This sub-tab works in exactly the same way as the Data Preview sub-tab except that it shows a different child stream.
Source View
The source view is accessed by clicking the View Source link on the Data Preview sub-tab or from the data() dashboard column function.
Its purpose is to display the selected child stream (data, context, meta, etc.) or record in the form in which it was received, i.e. un-formatted.
The Source View also shows a ‘progress’ bar to indicate what portion of the content is visible in the editor.
In order to navigate through the data you have three options
- Click on the ‘progress bar’ to show a portion of the data starting from the position clicked on.
- Page through the data using the navigation controls.
- Select a source range to display using the Set Source Range dialog which is accessed by clicking on the Lines or Chars links.
This allows you to precisely select the range to display.
You can either specify a range with just a start point, or a start point and some form of size/position limit.
If no limit is specified then Stroom will limit the data shown to the configured maximum (stroom.ui.source.maxCharactersPerFetch). If a range is entered that is too big to display, Stroom will limit the data to its maximum.
A Note About Characters
Stroom does not know the size of a stream in terms of character lines/cols, it only knows the size in bytes. Due to the way character data is encoded into bytes it is not possible to say how many characters are in a stream based on its size in bytes. Stroom can only provide an estimate based on the ratio of characters to bytes seen so far in the stream.
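As a simple illustration of why byte counts cannot be converted directly into character counts, the same number of characters can encode to different numbers of bytes depending on which characters are used. This example is purely illustrative and not part of Stroom:

import java.nio.charset.StandardCharsets;

public class ByteVsCharExample {
    public static void main(String[] args) {
        final String ascii = "abcd";    // 4 characters, 1 byte each in UTF-8
        final String accented = "àéîö"; // 4 characters, 2 bytes each in UTF-8
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length); // 8
    }
}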
Data Progress Bar
Stroom often handles very large streams of data and it is not feasible to show all of this data in the editor at once. Therefore Stroom will show a limited amount of the data in the editor at a time. The ‘progress’ bar at the top of the Data Preview and Source View screens shows what percentage of the data is visible in the editor and where in the stream the visible portion is located. If all of the data is visible in the editor (which includes scrolling down to see it) the bar will be green and will occupy the full width. If only some of the data is visible then the bar will be blue and the coloured part will only occupy part of the width.