Indexing data
- 1: Elasticsearch
- 1.1: Introduction
- 1.2: Getting Started
- 1.3: Indexing data
- 1.4: Exploring Data in Kibana
- 2: Lucene Indexes
- 3: Solr Integration
1 - Elasticsearch
1.1 - Introduction
Stroom supports using an external Elasticsearch cluster to index event data. This allows you to leverage all the features of the Elastic Stack, such as shard allocation, replication, fault tolerance and aggregations.
With Elasticsearch as an external service, your search infrastructure can scale independently of your Stroom data processing cluster, enhancing interoperability with other platforms by providing a performant and resilient time-series event data store. For instance, you can deploy Kibana to search and visualise Elasticsearch data.
Stroom achieves indexing and search integration by interfacing securely with the Elasticsearch REST API using the Java high-level client.
This guide will walk you through configuring a Stroom indexing pipeline, creating an Elasticsearch index template, activating a stream processor and searching the indexed data in both Stroom and Kibana.
Assumptions
- You have created an Elasticsearch cluster. Elasticsearch 8.x is recommended, though the latest supported 7.x version will also work. For test purposes, you can quickly create a single-node cluster using Docker by following the steps in the Elasticsearch Docs.
- The Elasticsearch cluster is reachable via HTTPS from all Stroom nodes participating in stream processing.
- Elasticsearch security is enabled. This is mandatory and is enabled by default in Elasticsearch 8.x and above.
- The Elasticsearch HTTPS interface presents a trusted X.509 server certificate. The Stroom node(s) connecting to Elasticsearch need to be able to verify the certificate, so for custom PKI, a Stroom truststore entry may be required.
- You have a feed containing Event streams to index.
Key differences
Indexing data with Elasticsearch differs from Solr and built-in Lucene methods in a number of ways:
- Unlike with Solr and built-in Lucene indexing, Elasticsearch field mappings are managed outside Stroom, through the use of index and component templates. These are normally created either via the Elasticsearch API, or interactively using Kibana.
- Aside from creating the mandatory StreamId and EventId field mappings, explicitly defining mappings for other fields is optional. Elasticsearch will use dynamic mapping by default to infer each field's type at index time. Explicitly defining mappings is recommended where consistency or greater control are required, such as for IP address fields (Elasticsearch mapping type ip); see the example below.
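For example, an explicit ip mapping could be defined via a component template in Kibana Dev Tools. This is a minimal sketch; the template name stroom-common-mappings and field name SourceIp are illustrative:

PUT _component_template/stroom-common-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "SourceIp": { // Illustrative field name.
          "type": "ip" // Enables IP-specific queries, such as CIDR range searches.
        }
      }
    }
  }
}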
1.2 - Getting Started
Establish an Elasticsearch cluster connection in Stroom
The first step is to configure Stroom to connect to an Elasticsearch cluster.
You can configure multiple cluster connections if required, such as a separate one for production and another for development.
Each cluster connection is defined by an Elastic Cluster document within the Stroom UI.
- In the Stroom Explorer pane, right-click on the folder where you want to create the Elastic Cluster document.
- Select New > Elastic Cluster.
- Give the cluster document a name and click OK.
- Complete the fields as explained in the section below. Any fields not marked as "Optional" are mandatory.
- Click Test Connection. A dialog will display the test result. If the result is Connection Success, details of the target cluster will be displayed; otherwise, error details will be displayed.
- Click the save button to commit changes.
Warning
Ensure you restrict permissions to the Elastic Cluster document. The Read privilege permits retrieval of the Elasticsearch API key and secret, granting the holder the same level of privilege as Stroom. Users authorised to search Elasticsearch indices via Stroom dashboards should only be assigned the Use privilege.
Elastic Cluster document fields
Description
(Optional) You might choose to enter the Elasticsearch cluster name or purpose here.
Connection URLs
Enter one or more node or cluster addresses, including protocol, hostname and port. Only HTTPS is supported; attempts to use plain-text HTTP will fail.
Examples
- Local development node: https://localhost:9200
- FQDN: https://elasticsearch.example.com:9200
- Kubernetes service: https://prod-es-http.elastic.svc:9200
CA certificate
PEM-format CA certificate chain used by Stroom to verify TLS connections to the Elasticsearch HTTPS REST interface. This is usually your organisation’s root enterprise CA certificate. For development, you can provide a self-signed certificate.
Use authentication
(Optional) Tick this box if Elasticsearch requires authentication. Elasticsearch security, which requires authentication, is enabled by default from version 8.0.
API key ID
Required if Use authentication is checked. Specifies the Elasticsearch API key ID for a valid Elasticsearch user account. This user requires at a minimum the following privileges (an example of creating such a key follows at the end of this section):
Cluster privileges
- monitor
- manage_own_api_key
Index privileges
- all
API key secret
Required if Use authentication is checked.
Socket timeout (ms)
Number of milliseconds to wait for an Elasticsearch indexing or search REST call to complete. Set to -1 (the default) to wait indefinitely, or until Elasticsearch closes the connection.
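As a sketch, an API key holding the minimum privileges listed above can be created in Kibana Dev Tools as follows. The key name stroom-indexing and index pattern stroom-events-* are illustrative; adjust them to your naming convention. The id and api_key values in the response correspond to the API key ID and API key secret fields:

POST /_security/api_key
{
  "name": "stroom-indexing", // Illustrative key name.
  "role_descriptors": {
    "stroom_writer": {
      "cluster": ["monitor", "manage_own_api_key"],
      "indices": [
        {
          "names": ["stroom-events-*"], // Illustrative index pattern.
          "privileges": ["all"]
        }
      ]
    }
  }
}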
1.3 - Indexing data
A typical workflow is for a Stroom pipeline to convert XML Event elements into the XML equivalent of JSON, complying with the schema http://www.w3.org/2005/xpath-functions, using a format identical to the output of the XPath function xml-to-json().
Understanding JSON XML representation
In an Elasticsearch indexing pipeline translation, you model JSON documents in a compatible XML representation.
Common JSON primitives and examples of their XML equivalents are outlined below.
Arrays
Array of maps
<array key="users" xmlns="http://www.w3.org/2005/xpath-functions">
<map>
<string key="name">John Smith</string>
</map>
</array>
Array of strings
<array key="userNames" xmlns="http://www.w3.org/2005/xpath-functions">
<string>John Smith</string>
<string>Jane Doe</string>
</array>
Maps and properties
<map key="user" xmlns="http://www.w3.org/2005/xpath-functions">
<string key="name">John Smith</string>
<boolean key="active">true</boolean>
<number key="daysSinceLastLogin">42</number>
<string key="loginDate">2022-12-25T01:59:01.000Z</string>
<null key="emailAddress" />
<array key="phoneNumbers">
<string>1234567890</string>
</array>
</map>
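For reference, passing the map above through xml-to-json() would produce the following JSON object as the value of the user property (shown here purely to illustrate the correspondence):

{
  "name": "John Smith",
  "active": true,
  "daysSinceLastLogin": 42,
  "loginDate": "2022-12-25T01:59:01.000Z",
  "emailAddress": null,
  "phoneNumbers": ["1234567890"]
}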
Note
It is recommended to insert a schema validation filter into your pipeline (with schema group JSON), to make it easier to diagnose JSON conversion errors.
We will now explore how to create an Elasticsearch index template, which specifies field mappings and settings for one or more indices.
Create an Elasticsearch index template
For information on what index and component templates are, consult the Elastic documentation.
When Elasticsearch first receives a document from Stroom targeting an index whose name matches any of the index_patterns entries in the index template, Elasticsearch creates a new index using the settings and mappings properties from the template.
The following example creates a basic index template stroom-events-v1 in a local Elasticsearch cluster, with the following explicit field mappings:
- StreamId – mandatory, data type long or keyword.
- EventId – mandatory, data type long or keyword.
- @timestamp – required if the index is to be part of a data stream (recommended).
- User – an object containing properties Id, Name and Active, each with its own data type.
- Tags – an array of one or more strings.
- Message – contains arbitrary content such as unstructured raw log data. Supports full-text search. The nested field wildcard supports regexp queries.
Note
Elasticsearch does not have a dedicated array field mapping data type; by default, any Elasticsearch field may contain zero or more values.
In Kibana Dev Tools, execute the following query:
PUT _index_template/stroom-events-v1
{
"index_patterns": [
"stroom-events-v1*" // Apply this template to index names matching this pattern.
],
"data_stream": {}, // For time-series data. Recommended for event data.
"template": {
"settings": {
"number_of_replicas": 1, // Replicas impact indexing throughput. This setting can be changed at any time.
"number_of_shards": 10, // Consider the shard sizing guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#shard-size-recommendation
"refresh_interval": "10s", // How often to refresh the index. For high-throughput indices, it's recommended to increase this from the default of 1s
"lifecycle": {
"name": "stroom_30d_retention_policy" // (Optional) Apply an ILM policy https://www.elastic.co/guide/en/elasticsearch/reference/current/set-up-lifecycle-policy.html
}
},
"mappings": {
"dynamic_templates": [],
"properties": {
"StreamId": { // Required.
"type": "long"
},
"EventId": { // Required.
"type": "long"
},
"@timestamp": { // Required if the index is part of a data stream.
"type": "date"
},
"User": {
"properties": {
"Id": {
"type": "keyword"
},
"Name": {
"type": "keyword"
},
"Active": {
"type": "boolean"
}
}
},
"Tags": {
"type": "keyword"
},
"Message": {
"type": "text",
"fields": {
"wildcard": {
"type": "wildcard"
}
}
}
}
}
},
"composed_of": [
// Optional array of component template names.
]
}
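To confirm the template was registered, you can retrieve it:

GET _index_template/stroom-events-v1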
Create an Elasticsearch indexing pipeline template
An Elasticsearch indexing pipeline is similar in structure to the built-in packaged Indexing template pipeline.
It typically consists of the following pipeline elements:
- XSLTFilter contains the translation, mapping Events to a JSON array.
- SchemaFilter uses schema group JSON.
It is recommended to create a template Elasticsearch indexing pipeline, which can then be re-used.
Procedure
- Right-click on the Template Pipelines folder in the Stroom Explorer pane.
- Select New > Pipeline.
- Enter the name Indexing (Elasticsearch) and click OK.
- Define the pipeline structure as above, and customise the following pipeline elements:
  - Set the Split Filter splitCount property to a sensible default value, based on the expected source XML element count (e.g. 100).
  - Set the Schema Filter schemaGroup property to JSON.
  - Set the Elastic Indexing Filter cluster property to point to the Elastic Cluster document you created earlier.
  - Set the Write Record Count filter countRead property to false.
Now that you have created a template indexing pipeline, it's time to create a feed-specific pipeline that inherits this template.
Create an Elasticsearch indexing pipeline
Procedure
- Right-click on a folder in the Stroom Explorer pane.
- Select New > Pipeline.
- Enter a name for your pipeline and click OK.
- Click the Inherit From button.
- In the dialog that appears, select the template pipeline you created named Indexing (Elasticsearch) and click OK.
- Select the Elastic Indexing Filter pipeline element.
- Set its properties as per one of the examples below.
Example 1: Single index or data stream
This is the simplest use case and is suitable where you want to write to a single
data stream
(for time-series data) or index.
If your index template contains the property data_stream: {}
, be sure to include a string
field named @timestamp
in the output JSON XML.
If targeting a data stream, you may choose to use Elasticsearch ILM to manage its lifecycle.
indexBaseName: stroom-events-v1
Example 2: Dynamic time-based data streams
In this example, Stroom creates data streams as needed, named according to the value of a particular JSON date field and date pattern. This is useful when you need to roll over data streams manually, such as maintaining older data on slower storage tiers.
For instance, you may have data spanning many years and want Stroom to create a separate data stream for each year, such as stroom-events-v1-2020, stroom-events-v1-2021, stroom-events-v1-2022 and so on. With the settings below, an event whose @timestamp value is 2022-12-16T02:46:29.218Z would be written to the data stream stroom-events-v1-2022.
indexBaseName: stroom-events-v1
indexNameDateFieldName: @timestamp
indexNameDateFormat: -yyyy
Other options
There are other options available for the Elastic Indexing Filter. These are documented in the UI.
Create an indexing translation
In this example, let’s assume you have event data that looks like the following:
<?xml version="1.1" encoding="UTF-8"?>
<Events
xmlns="event-logging:3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="event-logging:3 file://event-logging-v3.5.2.xsd"
Version="3.5.2">
<Event>
<EventTime>
<TimeCreated>2022-12-16T02:46:29.218Z</TimeCreated>
</EventTime>
<EventSource>
<System>
<Name>Nginx</Name>
<Environment>Development</Environment>
</System>
<Generator>Filebeat</Generator>
<Device>
<HostName>localhost</HostName>
</Device>
<User>
<Id>john.smith1</Id>
<Name>John Smith</Name>
<State>active</State>
</User>
</EventSource>
<EventDetail>
<View>
<Resource>
<URL>http://localhost:8080/index.html</URL>
</Resource>
<Data Name="Tags" Value="dev,testing" />
<Data
Name="Message"
Value="TLSv1.2 AES128-SHA 1.1.1.1 "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"" />
</View>
</EventDetail>
</Event>
<Event>
...
</Event>
</Events>
We need to write an XSL transform (XSLT) to form a JSON document for each stream processed. Each document must consist of an array element containing one or more map elements (each representing an Event), each with the necessary properties as per our index template.
See XSLT Conversion for instructions on how to write an XSLT.
The output from your XSLT should match the following:
<?xml version="1.1" encoding="UTF-8"?>
<array
xmlns="http://www.w3.org/2005/xpath-functions"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/xpath-functions file://xpath-functions.xsd">
<map>
<number key="StreamId">3045516</number>
<number key="EventId">1</number>
<string key="@timestamp">2022-12-16T02:46:29.218Z</string>
<map key="User">
<string key="Id">john.smith1</string>
<string key="Name">John Smith</string>
<boolean key="Active">true</boolean>
</map>
<array key="Tags">
<string>dev</string>
<string>testing</string>
</array>
<string key="Message">TLSv1.2 AES128-SHA 1.1.1.1 "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"</string>
</map>
<map>
...
</map>
</array>
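The following XSLT is a minimal sketch of a translation producing output of this shape for the example events above. It assumes the stroom:stream-id() and stroom:event-id() functions are available to populate the mandatory fields (consult the XSLT documentation for your Stroom version), and that Tags arrives as a comma-delimited value:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2005/xpath-functions"
    xmlns:stroom="stroom"
    xmlns:evt="event-logging:3"
    exclude-result-prefixes="evt stroom">

  <!-- Wrap all events in a single JSON array -->
  <xsl:template match="/evt:Events">
    <array>
      <xsl:apply-templates select="evt:Event" />
    </array>
  </xsl:template>

  <!-- Emit one map (JSON object) per event -->
  <xsl:template match="evt:Event">
    <map>
      <!-- Assumed Stroom XSLT functions for the mandatory fields -->
      <number key="StreamId"><xsl:value-of select="stroom:stream-id()" /></number>
      <number key="EventId"><xsl:value-of select="stroom:event-id()" /></number>
      <string key="@timestamp"><xsl:value-of select="evt:EventTime/evt:TimeCreated" /></string>
      <map key="User">
        <string key="Id"><xsl:value-of select="evt:EventSource/evt:User/evt:Id" /></string>
        <string key="Name"><xsl:value-of select="evt:EventSource/evt:User/evt:Name" /></string>
        <!-- Map the State value onto the boolean Active mapping -->
        <boolean key="Active"><xsl:value-of select="evt:EventSource/evt:User/evt:State = 'active'" /></boolean>
      </map>
      <array key="Tags">
        <!-- Split the comma-delimited Tags value into one string per tag -->
        <xsl:for-each select="tokenize(evt:EventDetail/evt:View/evt:Data[@Name = 'Tags']/@Value, ',')">
          <string><xsl:value-of select="." /></string>
        </xsl:for-each>
      </array>
      <string key="Message"><xsl:value-of select="evt:EventDetail/evt:View/evt:Data[@Name = 'Message']/@Value" /></string>
    </map>
  </xsl:template>
</xsl:stylesheet>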
Assign the translation to the indexing pipeline
Having created your translation, you need to reference it in your indexing pipeline.
- Open the pipeline you created.
- Select the Structure tab.
- Select the XSLTFilter pipeline element.
- Double-click the xslt property value cell.
- Select the XSLT you created and click OK.
- Click the save button.
Step the pipeline
At this point, you will want to step the pipeline to ensure there are no errors and that output looks as expected.
Execute the pipeline
Create a pipeline processor and filter to run the pipeline against one or more feeds. Stroom will distribute processing tasks to enabled nodes and send documents to Elasticsearch for indexing.
You can monitor indexing status via your Elasticsearch monitoring tool of choice.
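To spot-check that documents are arriving, you can query the index or data stream in Kibana Dev Tools. This is a sketch, using the index naming and StreamId value from the earlier examples:

GET stroom-events-v1/_search
{
  "size": 1,
  "query": {
    "term": { "StreamId": 3045516 } // Example StreamId from the sample output above.
  }
}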
Detecting and handling errors
If any errors occur while a stream is being indexed, an Error stream is created, containing details of each failure. Error streams can be found under the Data tab of either the indexing pipeline or the receiving Feed.
Note
You can filter the selected pipeline or feed to list only Error streams: click the filter button, then add the condition Type = Error.
Once you have addressed the underlying cause for a particular type of error (such as an incorrect field mapping), reprocess affected streams:
- Select the Error streams you want to reprocess, by clicking the relevant checkboxes in the stream list (top pane).
- Click the process button.
- In the dialog that appears, check Reprocess data and click OK.
- Click OK for any confirmation prompts that follow.
Stroom will re-send data from the selected Event streams to Elasticsearch for indexing. Any existing documents matching the StreamId of the original Event stream are first deleted automatically to avoid duplication.
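For illustration, this clean-up is conceptually equivalent to the following query. Stroom performs it automatically, so you do not need to run it yourself:

POST stroom-events-v1/_delete_by_query
{
  "query": {
    "term": { "StreamId": 3045516 } // StreamId of the original Event stream.
  }
}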
Tips and tricks
Use a common schema for your indices
An example is Elastic Common Schema (ECS). This helps users understand the purpose of each field and makes building cross-index queries simpler by using a set of common fields (such as a user ID).
With this in mind, it is important that common fields also have the same data type in each index. Component templates help make this easier and reduce the chance of error, by centralising the definition of common fields to a single component.
Use a version control system (such as git) to track index and component templates
This helps keep track of changes over time and can be an important resource for both administrators and users.
Rebuilding an index
Sometimes it is necessary to rebuild an index. This could be due to a change in field mapping or shard count, or in response to a user feature request.
To rebuild an index:
- Drain the indexing pipeline by deactivating any processor filters and waiting for any running tasks to complete.
- Delete the index or data stream via the Elasticsearch API or Kibana (see the example after this list).
- Make the required changes to the index template and/or XSL translation.
- Create a new processor filter, either from scratch or by copying the existing one.
- Activate the new processor filter.
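For example, step 2 can be performed in Kibana Dev Tools. Use DELETE stroom-events-v1 instead if the target is a plain index rather than a data stream:

DELETE _data_stream/stroom-events-v1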
Use a versioned index naming convention
As with the earlier example stroom-events-v1, a version number is appended to the name of the index or data stream. If a new field is added, or some other change occurs that requires the index to be rebuilt, users would otherwise experience downtime. This can be avoided by incrementing the version and performing the rebuild against a new index: stroom-events-v2. Users can continue querying stroom-events-v1 until it is deleted.
This approach involves the following steps:
- Create a new Elasticsearch index template targeting the new index name (in this case, stroom-events-v2).
- Create a copy of the indexing pipeline, targeting the new index in the Elastic Indexing Filter.
- Create and activate a processing filter for the new pipeline.
- Once indexing is complete, update the Elastic Index document to point to stroom-events-v2. Users will now be searching against the new index.
- Drain any tasks for the original indexing pipeline and delete it.
- Delete index stroom-events-v1 using either the Elasticsearch API or Kibana.
If you created a data view in Kibana, you’ll also want to update this to point to the new index / data stream.
1.4 - Exploring Data in Kibana
Kibana is part of the Elastic Stack and provides users with an interactive, visual way to query, visualise and explore data in Elasticsearch.
It is highly customisable and provides users and teams with tools to create and share dashboards, searches, reports and other content.
Once data has been indexed by Stroom into Elasticsearch, it can be explored in Kibana. You will first need to create a data view in order to query your indices.
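Data views are normally created interactively in Kibana, under Stack Management. As a sketch, one can also be created via the Kibana data views REST API, assuming Kibana 8.x; note this is a Kibana API call, not an Elasticsearch one, and requires a kbn-xsrf header. The title and name values below are illustrative:

POST /api/data_views/data_view
{
  "data_view": {
    "title": "stroom-events-v1*",
    "name": "Stroom Events"
  }
}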
Why use Kibana?
There are several use cases that benefit from Kibana:
- Convenient and powerful drag-and-drop charts and other visualisation types using Kibana Lens. Much more performant and easier to customise than built-in Stroom dashboard visualisations.
- Field statistics and value summaries with Kibana Discover. Great for an initial survey of audit data.
- Geospatial analysis and visualisation.
- Search field auto-completion.
- Runtime fields. Good for data exploration, at the cost of performance.
2 - Lucene Indexes
Stroom uses Apache Lucene for its built-in indexing solution. Index documents are stored in a Volume.
TODO
Complete this page.
Field configuration
Field Types
- Id - Treated as a Long.
- Boolean - True/False values.
- Integer - Whole numbers from -2,147,483,648 to 2,147,483,647.
- Long - Whole numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
- Float - Fractional numbers. Sufficient for storing 6 to 7 decimal digits.
- Double - Fractional numbers. Sufficient for storing 15 decimal digits.
- Date - Date and time values.
- Text - Text data.
- Number - An alias for Long.
Stored fields
If a field is Stored then the complete field value will be stored in the index. This means the value can be retrieved from the index when building search results, rather than using the slower Search Extraction process. Storing field values comes at the cost of higher storage requirements for the index. If storage space is not an issue, then storing all fields that you want to return in search results is optimal.
Indexed fields
An Indexed field is one that will be processed by Lucene so that the field can be queried. How the field is indexed will depend on the Field type and the Analyser used.
If you have fields that you do not need to filter on (i.e. that you won't use as a query term), you can include them as non-Indexed fields. Including a non-indexed field means it will be available for the user to select in the Dashboard table. A non-indexed field would either need to be Stored in the index or added via Search Extraction to be available in the search results.
Positions
If Positions is selected then Lucene will store the positions of all the field terms in the document.
Analyser types
The Analyser determines how Lucene reads the field's value and extracts tokens from it. The choice of Analyser will depend on the data in the field and how you want to search it.
- Keyword - Treats the whole field value as one token. Useful for things like IDs and post codes. Supports the Case Sensitivity setting.
- Alpha - Tokenises on any non-letter characters, e.g. one1 two2 three 3 => one two three. Strips non-letter characters. Supports the Case Sensitivity setting.
- Numeric -
- Alpha numeric - Tokenises on any non-letter/digit characters, e.g. one1 two2 three 3 => one1 two2 three 3. Supports the Case Sensitivity setting.
- Whitespace - Tokenises only on white space. Not affected by the Case Sensitivity setting; always case sensitive.
- Stop words - Tokenises based on non-letter characters and removes stop words, e.g. and. Not affected by the Case Sensitivity setting; case insensitive.
- Standard - The most common analyser. Tokenises the value on spaces and punctuation but recognises URLs and email addresses. Removes stop words, e.g. and. Not affected by the Case Sensitivity setting; case insensitive. E.g. Find Stroom at github.com/stroom => Find Stroom at github.com/stroom.
Stop words
Some of the Analysers use a set of stop words for the tokenisers. This is the list of stop words that will not be indexed.
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
Case sensitivity
Some of the Analyser types support case (in)sensitivity.
For example, if the Analyser supports it, the value TWO two would be tokenised either as TWO two or as two two.