The <group> and <data> elements can reference match groups from parent expressions or from matches stored in variables. For the <group> element, referenced values are passed on to child expressions, whereas the <data> element can use match group references in its name and value attributes. The way of specifying references is the same for both elements.
Match References, Variables and Fixed Strings
- 1: Expression match references
- 2: Variable reference
- 3: Use of fixed strings
- 4: Concatenation of references
1 - Expression match references
Referencing matches in expressions is done using $. In addition, a match group number may be appended to retrieve just part of the expression match. The applicability and effect of this depend on the type of expression used.
References to <split> Match Groups
In the following example a line matched by a parent <split> expression is referenced by a child <data> element.
<split delimiter="\n">
  <data name="line" value="$"/>
</split>
A <split> element matches content up to and including the specified delimiter, so the above reference would output the entire line plus the delimiter. However, there are various match groups that child <group> and <data> elements can use to reference sections of the matched content.
To illustrate the content provided by each match group, take the following example:
"This is some text\, that we wish to match", "This is the next text"
And the following <split> element:
<split delimiter="," escape="\">
The match groups are as follows:
- $ or $0: The entire content that is matched including the specified delimiter at the end
"This is some text\, that we wish to match",
- $1: The content up to the specified delimiter at the end
"This is some text\, that we wish to match"
- $2: The content up to the specified delimiter at the end and filtered to remove escape characters (more expensive than $1)
"This is some text, that we wish to match"
In addition to this behaviour match groups 1 and 2 will omit outermost whitespace and container characters if specified, e.g. with the following content:
" This is some text\, that we wish to match " , "This is the next text"
And the following <split> element:
<split delimiter="," escape="\" containerStart="&quot;" containerEnd="&quot;">
The match groups are as follows:
- $ or $0: The entire content that is matched including the specified delimiter at the end
" This is some text\, that we wish to match " ,
- $1: The content up to the specified delimiter at the end, with outer whitespace and containers stripped
This is some text\, that we wish to match
- $2: The content up to the specified delimiter at the end, with outer whitespace and containers stripped and escape characters removed (more computationally expensive than $1)
This is some text, that we wish to match
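As a minimal sketch of how these groups might be used, the following fragment outputs each delimited value through match group 2, so that containers and escape characters are removed. The name `field` is a placeholder chosen for illustration and is not part of the original example.

```xml
<!-- Sketch only: "field" is a hypothetical name -->
<split delimiter="," escape="\" containerStart="&quot;" containerEnd="&quot;">
  <!-- $2: content up to the delimiter, containers stripped and escapes removed -->
  <data name="field" value="$2" />
</split>
```

For the sample content above, the first value would be expected to be `This is some text, that we wish to match`. Where the input is known not to contain escape characters, `$1` avoids the extra filtering cost.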
References to <regex> Match Groups
As with the <split> element, various match groups can be referenced in a <regex> expression to retrieve portions of matched content. This content can be used as values for <group> and <data> elements.
Given the following input:
ip=1.1.1.1 user=user1
And the following <regex> element:
<regex pattern="ip=([^ ]+) user=([^ ]+)">
The match groups are as follows:
- $ or $0: The entire content that is matched by the expression
ip=1.1.1.1 user=user1
- $1: The content of the first match group
1.1.1.1
- $2: The content of the second match group
user1
Match group numbers in regular expressions are determined by the order in which their opening brackets appear in the expression.
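To illustrate the numbering rule with nested groups, the sketch below wraps each key/value pair of the same input in an outer group. The pattern and the `<data>` names are illustrative assumptions and are not taken from the original configuration.

```xml
<!-- Groups are numbered by the position of their opening bracket:
     $1 = outer ip pair, $2 = ip value, $3 = outer user pair, $4 = user value -->
<regex pattern="(ip=([^ ]+)) (user=([^ ]+))">
  <data name="ipPair" value="$1" />    <!-- ip=1.1.1.1 -->
  <data name="ip" value="$2" />        <!-- 1.1.1.1 -->
  <data name="userPair" value="$3" />  <!-- user=user1 -->
  <data name="user" value="$4" />      <!-- user1 -->
</regex>
```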
References to <any> Match Groups
The <any> element does not have any match groups and always returns the entire content that was passed to it when referenced with $.
2 - Variable reference
Variables are added to Data Splitter configuration using the <var> element (see variables). Each variable must have a unique id so that it can be referenced. References to variables have the form $VARIABLE_ID$, e.g.
<data name="$heading$" value="$" />
Identification
Data Splitter validates the configuration on load and ensures that all element ids are unique and that referenced ids belong to a variable.
A variable only stores data if it is referenced, so variables that are not referenced do nothing. In addition, a variable only stores data for match groups that are referenced, e.g. if $heading$1 is the only reference to a variable with an id of ‘heading’ then only data for match group 1 will be stored for reference lookup.
Scopes
Variables have two scopes which affect how data is retrieved when referenced:
Local Scope
Variables are local to a reference if the reference exists as a descendant of the variable's parent expression, e.g.
<split delimiter="\n">
  <var id="line" />
  <group value="$1">
    <regex pattern="ip=([^ ]+) user=([^ ]+)">
      <data name="line" value="$line$"/>
      <data name="ip" value="$1"/>
      <data name="user" value="$2"/>
    </regex>
  </group>
</split>
In the above example, matches for the outermost <split> expression are stored in the variable with the id of line. The only reference to this variable is in a <data> element that is a descendant of the variable's parent expression <split>, i.e. it is nested within split/group/regex.
Because the variable is referenced locally only the most recent parent match is relevant, i.e. no retrieval of values by iteration, iteration offset or fixed position is applicable. These features only apply to remote variables that store multiple values.
Remote Scope
The CSV example with a heading is an example of a variable being referenced from a remote scope.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match heading line (note that maxMatch="1" means that only the first line will be matched by this splitter) -->
  <split delimiter="\n" maxMatch="1">
    <!-- Store each heading in a named list -->
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </split>

  <!-- Match each record -->
  <split delimiter="\n">
    <!-- Take the matched line -->
    <group value="$1">
      <!-- Split the line up -->
      <split delimiter=",">
        <!-- Output the stored heading for each iteration and the value from group 1 -->
        <data name="$heading$1" value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>
In the above example the parent expression of the variable is not an ancestor of the reference in the <data> element. This makes the <data> element's reference to the variable a remote one. In this situation the variable knows that it must store multiple values, as the remote reference in <data> may retrieve one of many values from the variable based on:
- The match count of the parent expression.
- The match count of the parent expression, plus or minus an offset.
- A fixed position in the variable store.
Retrieval of value by iteration
In the above example the first line is taken then repeatedly matched by delimiting with commas. This results in multiple values being stored in the ‘heading’ variable. Once this is done subsequent lines are matched and then also repeatedly matched by delimiting with commas in the same way the heading is.
Each time a line is matched, the internal match count of all sub-expressions (e.g. the <split> expression that is delimited by comma) is reset to 0. Every time the sub <split> expression matches up to a comma delimiter, the match count is incremented. Any references to remote variables will, by default, use the current match count as an index to retrieve one of the many values stored in the variable. This means that the <data> element in the above example will retrieve the corresponding heading for each value, as the match count of the values will match the storage position of each heading.
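As a worked illustration of retrieval by iteration, assume the hypothetical two-line input shown in the comment below (it is not part of the original example). With the CSV configuration above, each value would be expected to be paired with the heading stored at the same position. Only the resulting `<data>` elements are sketched; any surrounding output structure is omitted.

```xml
<!-- Hypothetical input:
       Date,Time,User
       01/01/2010,00:00:00,user1
     Expected pairing of stored headings with values: -->
<data name="Date" value="01/01/2010" />
<data name="Time" value="00:00:00" />
<data name="User" value="user1" />
```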
Retrieval of value by iteration offset
In some cases there may be a mismatch between the position where a value is stored in a variable and the match count applicable when remotely referencing the variable.
Take the following input:
BAD,Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
In the above example we can see that the first heading ‘BAD’ is not correct for the first value of every line. In this situation we could either adjust the way the heading line is parsed to ignore ‘BAD’ or just adjust the way the heading variable is referenced.
To make this adjustment the reference just needs to be told what offset to apply to the current match count to correctly retrieve the stored value. In the above example this would be done like this:
<data name="$heading$1[+1]" value="$1" />
The above reference uses the match count plus 1 to retrieve the stored value. Any positive or negative integer offset may be used, e.g. [+4] or [-10]. Offsets that result in a position outside the storage range of the variable will not return a value.
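A sketch of the record-matching part of the earlier CSV configuration with the offset applied is given below. It assumes the heading line is still parsed into the `heading` variable exactly as before; only the reference changes.

```xml
<!-- Match each record -->
<split delimiter="\n">
  <group value="$1">
    <split delimiter=",">
      <!-- Skip the unwanted first heading by adding 1 to the match count -->
      <data name="$heading$1[+1]" value="$1" />
    </split>
  </group>
</split>
```

With the sample input, the first three values would then be expected to be named `Date`, `Time` and `IPAddress` rather than `BAD`, `Date` and `Time`.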
Retrieval of value by fixed position
In addition to retrieval by offset from the current match count, a stored value can be returned by a fixed position that has no relevance to the current match count.
In the following example the value retrieved from the ‘heading’ variable will always be ‘IPAddress’ as this is the fourth value stored in the ‘heading’ variable and the position index starts at 0.
<data name="$heading$1[3]" value="$1" />
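The three retrieval styles can be compared side by side. The fragment below is a sketch for comparison only (a real configuration would normally use just one of these forms). The comments describe what each reference would be expected to resolve to for the first value on a line of the sample input above.

```xml
<!-- By iteration: the heading at the current match count, i.e. BAD -->
<data name="$heading$1" value="$1" />
<!-- By iteration offset: the current match count plus 1, i.e. Date -->
<data name="$heading$1[+1]" value="$1" />
<!-- By fixed position: always position 3, i.e. IPAddress -->
<data name="$heading$1[3]" value="$1" />
```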
3 - Use of fixed strings
Any <group> value or <data> name and value can use references to matched content, but it is also possible simply to output a known string, e.g.
<data name="somename" value="$" />
The above example would output somename as the <data> name attribute. This can often be useful where there are no headings specified in the input data but we want to associate certain names with certain values.
Given the following data:
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
We could provide useful headings with the following configuration:
<regex pattern="([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),">
  <data name="date" value="$1" />
  <data name="time" value="$2" />
  <data name="ipAddress" value="$3" />
  <data name="hostName" value="$4" />
  <data name="user" value="$5" />
  <data name="action" value="$6" />
</regex>
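For the sample line above, this configuration would be expected to produce the following name/value pairs. Only the `<data>` elements are sketched here; any surrounding output structure is omitted.

```xml
<data name="date" value="01/01/2010" />
<data name="time" value="00:00:00" />
<data name="ipAddress" value="192.168.1.100" />
<data name="hostName" value="SOMEHOST.SOMEWHERE.COM" />
<data name="user" value="user1" />
<data name="action" value="logon" />
```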
4 - Concatenation of references
It is possible to concatenate multiple fixed strings and match group references using the + character. As with all references and fixed strings, this can be done in <group> value and <data> name and value attributes. However, concatenation does have some performance overhead, as new buffers have to be created to store the concatenated content.
A good example of concatenation is the production of an ISO 8601 date from the date and time fields in the previous example:
01/01/2010,00:00:00
Here the following <regex> could be used to extract the relevant date and time groups:
<regex pattern="(\d{2})/(\d{2})/(\d{4}),(\d{2}):(\d{2}):(\d{2})">
The match groups from this expression can be concatenated with the following value output pattern in the data element:
<data name="dateTime" value="$3+'-'+$2+'-'+$1+'T'+$4+':'+$5+':'+$6+'.000Z'" />
Using the original example, this would result in the output:
<data name="dateTime" value="2010-01-01T00:00:00.000Z" />
Note that the value output pattern wraps all fixed strings in single quotes. This is necessary when concatenating strings and references so that Data Splitter can determine which parts are to be treated as fixed strings. It also allows fixed strings to contain $ and + characters.
As single quotes are used for this purpose, a single quote must be escaped with another single quote if one is wanted in a fixed string, e.g.
'this ''is quoted text'''
This will result in:
this 'is quoted text'
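Escaped quotes can be combined with match group references in the same value. The fragment below is a sketch that reuses the earlier `ip=1.1.1.1 user=user1` input; the `message` name is a hypothetical choice for illustration, and the expected value assumes the escaping rule behaves exactly as described above.

```xml
<!-- Sketch for the input: ip=1.1.1.1 user=user1 -->
<regex pattern="ip=([^ ]+) user=([^ ]+)">
  <!-- Expected value: user 'user1' logged on from 1.1.1.1 -->
  <data name="message" value="'user '''+$2+''' logged on from '+$1" />
</regex>
```

Each doubled quote inside a fixed string yields one literal quote, while the outermost quotes only mark where the fixed string starts and ends.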