There are various elements used in a Data Splitter configuration to control behaviour. Each of these elements can be categorised as one of the following:
This is the multi-page printable view of this section. Click here to print.
Element Reference
- 1: Content Providers
- 2: Expressions
- 3: Output
- 4: Variables
1 - Content Providers
Content providers take some content from the input source or elsewhere (see fixed strings and provide it to one or more expressions. Both the root element <dataSplitter>
and <group>
elements are content providers.
Root element <dataSplitter>
The root element of a Data Splitter configuration is <dataSplitter>
. It supplies content from the input source to one or more expressions defined within it. The way that content is buffered is controlled by the root element and the way that errors are handled as a result of child expressions not matching all of the content it supplies.
Attributes
The following attributes can be added to the <dataSplitter>
root element:
ignoreErrors
Data Splitter generates errors if not all of the content is matched by the regular expressions beneath the <dataSplitter>
or within <group>
elements. The error messages are intended to aid the user in writing good Data Splitter configurations. The intent is to indicate when the input data is not being matched fully and therefore possibly skipping some important data. Despite this, in some cases it is laborious to have to write expressions to match all content. In these cases it is preferable to add this attribute to ignore these errors. However it is often better to write expressions that capture all of the supplied content and discard unwanted characters. This attribute also affects errors generated by the use of the minMatch attribute on <regex>
which is described later on.
Take the following example input:
Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment
This could be matched with the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex id="heading" pattern=".+" maxMatch="1">
…
</regex>
<regex id="body" pattern="\n[^#]+">
…
</regex>
</dataSplitter>
The above configuration would only match up to a comment for each record line, e.g.
Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment
This may well be the desired functionality but if there was useful content within the comment it would be lost. Because of this Data Splitter warns you when expressions are failing to match all of the content presented so that you can make sure that you aren’t missing anything important. In the above example it is obvious that this is the required behaviour but in more complex cases you might be otherwise unaware that your expressions were losing data.
To maintain this assurance that you are handling all content it is usually best to write expressions to explicitly match all content even though you may do nothing with some matches, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex id="heading" pattern=".+" maxMatch="1">
…
</regex>
<regex id="body" pattern="\n([^#]+)#.+">
…
</regex>
</dataSplitter>
The above example would match all of the content and would therefore not generate warnings. Sub-expressions of ‘body’ could use match group 1 and ignore the comment.
However as previously stated it might often be difficult to write expressions that will just match content that is to be discarded. In these cases ignoreErrors can be used to suppress errors caused by unmatched content.
bufferSize (Advanced)
This is an optional attribute used to tune the size of the character buffer used by Data Splitter. The default size is 20000 characters and should be fine for most translations. The minimum value that this can be set to is 20000 characters and the maximum is 1000000000. The only reason to specify this attribute is when individual records are bigger than 10000 characters which is rarely the case.
Group element <group>
Groups behave in a similar way to the root element in that they provide content for one or more inner expressions to deal with, e.g.
<group value="$1">
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
...
<regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
...
Attributes
As the <group>
element is a content provider it also includes the same ‘ignoreErrors’ attribute which behaves in the same way. The complete list of attributes for the <group>
element is as follows:
id
When Data Splitter reports errors it outputs an XPath to describe the part of the configuration that generated the error, e.g.
DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group>
It is often a little difficult to identify the configuration element that generated the error by looking at the path and the element description, particularly when multiple elements are the same, e.g. many <group>
elements without attributes. To make identification easier you can add an ‘id’ attribute to any element in the configuration resulting in error descriptions as follows:
DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group id="myGroupId">
value
This attribute determines what content to present to child expressions. By default the entire content matched by a group’s parent expression is passed on by the group to child expressions. If required, content from a specific match group in the parent expression can be passed to child expressions using the value attribute, e.g. value="$1"
. In addition to this content can be composed in the same way as it is for data names and values. see match references for a full description of match references.
ignoreErrors
This behaves in the same way as for the root element.
matchOrder
This is an optional attribute used to control how content is consumed by expression matches. Content can be consumed in sequence or in any order using matchOrder="sequence"
or matchOrder="any"
. If the attribute is not specified, Data Splitter will default to matching in sequence.
When matching in sequence, each match consumes some content and the content position is moved beyond the match ready for the subsequent match. However, in some cases the order of these constructs is not predictable, e.g. we may sometimes be presented with:
Value1=1 Value2=2
… or sometimes with:
Value2=2 Value1=1
Using a sequential match order the following example would work to find both values in Value1=1 Value2=2
<group>
<regex pattern="Value1=([^ ]*)">
...
<regex pattern="Value2=([^ ]*)">
...
… but this example would skip over Value2 and only find the value of Value1 if the input was Value2=2 Value1=1
.
To be able to deal with content that contains these constructs in either order we need to change the match order to any
.
When matching in any order, each match removes the matched section from the content rather than moving the position past the match so that all remaining content can be matched by subsequent expressions. In the following example the first expression would match and remove Value1=1
from the supplied content and the second expression would be presented with Value2=2
which it could also match.
<group matchOrder="any">
<regex pattern="Value1=([^ ]*)">
...
<regex pattern="Value2=([^ ]*)">
...
If the attribute is omitted by default the match order will be sequential. This is the default behaviour as tokens are most often in sequence and consuming content in this way is more efficient as content does not need to be copied by the parser to chop out sections as is required for matching in any order. It is only necessary to use this feature when fields that are identifiable with a specific match can occur in any order.
reverse
Occasionally it is desirable to reverse the content presented by a group to child expressions. This is because it is sometimes easier to form a pattern by matching content in reverse.
Take the following example content of name, value pairs delimited by =
but with no spaces between names, multiple spaces between values and only a space between subsequent pairs:
ipAddress=123.123.123.123 zones=Zone 1, Zone 2, Zone 3 location=loc1 A user=An end user serverName=bigserver
We could write a pattern that matches each name value pair by matching up to the start of the next name, e.g.
<regex pattern="([^=]+)=(.+?)( [^=]+=)">
This would match the following:
ipAddress=123.123.123.123 zones=
Here we are capturing the name and value for each pair in separate groups but the pattern has to also match the name from the next name value pair to find the end of the value. By default Data Splitter will move the content buffer to the end of the match ready for subsequent matches so the next name will not be available for matching.
In addition to matching too much content the above example also uses a reluctant qualifier .+?
. Use of reluctant qualifiers almost always impacts performance so they are to be avoided if at all possible.
A better way to match the example content is to match the input in reverse, reading characters from right to left.
The following example demonstrates this:
<group reverse="true">
<regex pattern="([^=]+)=([^ ]+)">
<data name="$2" value="$1" />
</regex>
</group>
Using the reverse attribute on the parent group causes content to be supplied to all child expressions in reverse order. In the above example this allows the pattern to match values followed by names which enables us to cope with the fact that values have multiple spaces but names have no spaces.
Content is only presented to child regular expressions in reverse. When referencing values from match groups the content is returned in the correct order, e.g. the above example would return:
<data name="ipAddress" value="123.123.123.123" />
<data name="zones" value="Zone 1, Zone 2, Zone 3" />
<data name="location" value="loc1" />
<data name="user" value="An end user" />
<data name="serverName" value="bigserver" />
The reverse feature isn’t needed very often but there are a few cases where it really helps produce the desired output without the complexity and performance overhead of a reluctant match.
An alternative to using the reverse attribute is to use the original reluctant expression example but tell Data Splitter to make the subsequent name available for the next match by not advancing the content beyond the end of the previous value. This is done by using the advance attribute on the <regex>
. However, the reverse attribute represents a better way to solve this particular problem and allows a simpler and more efficient regular expression to be used.
2 - Expressions
Expressions match some data supplied by a parent content provider. The content matched by an expression depends on the type of expression and how it is configured.
The <split>
, <regex>
and <all>
elements are all expressions and match content as described below.
The <split> element
The <split>
element directs Data Splitter to break up content using a specified character sequence as a delimiter. In addition to this it is possible to specify characters that are used to escape the delimiter as well as characters that contain or “quote” a value that may include the delimiter sequence but allow it to be ignored.
Attributes
The <split>
element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
delimiter
A required attribute used to specify the character string that will be used as a delimiter to split the supplied content unless it is preceded by an escape character or within a container if specified. Several of the previous examples use this attribute.
escape
An optional attribute used to specify a character sequence that is used to escape the delimiter. Many delimited text formats have an escape character that is used to tell any parser that the following delimiter should be ignored, e.g. often a character such as ‘' is used to escape the character that follows it so that it is not treated as a delimiter. When specified this escape sequence also applies to any container characters that may be specified.
containerStart
An optional attribute used to specify a character sequence that will make this expression ignore the presence of delimiters until an end container is found. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters up to a delimiter.
If used containerEnd
must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.
containerEnd
An optional attribute used to specify a character sequence that will make this expression stop ignoring the presence of delimiters if it believes it is currently in a container. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters while ignoring the presence of any delimiter.
If used containerStart
must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.
maxMatch
An optional attribute used to specify the maximum number of times this expression is allowed to match the supplied content. If you do not supply this attribute then the Data Splitter will keep matching the supplied content until it reaches the end. If specified Data Splitter will stop matching the supplied content when it has matched it the specified number of times.
This attribute is used in the ‘CSV with header line’ example to ensure that only the first line is treated as a header line.
minMatch
An optional attribute used to specify the minimum number of times this expression should match the supplied content. If you do not supply this attribute then Data Splitter will not enforce that the expression matches the supplied content. If specified Data Splitter will generate an error if the expression does not match the supplied content at least as many times as specified.
Unlike maxMatch
, minMatch
does not control the matching process but instead controls the production of error messages generated if the parser is not seeing the expected input.
onlyMatch
Optional attribute to use this expression only for specific instances of a match of the parent expression, e.g. on the 4th, 5th and 8th matches of the parent expression specified by ‘4,5,8’. This is used when this expression should only be used to subdivide content from certain parent matches.
The <regex> element
The <regex>
element directs Data Splitter to match content using the specified regular expression pattern. In addition to this the same match control attributes that are available on the <split>
element are also present as well as attributes to alter the way the pattern works.
Attributes
The <regex>
element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
pattern
This is a required attribute used to specify a regular expression to use to match on the supplied content. The pattern is used to match the content multiple times until the end of the content is reached while the maxMatch and onlyMatch conditions are satisfied.
dotAll
An optional attribute used to specify if the use of ‘.’ in the supplied pattern matches all characters including new lines. If ’true’ ‘.’ will match all characters including new lines, if ‘false’ it will only match up to a new line. If this attribute is not specified it defaults to ‘false’ and will only match up to a new line.
This attribute is used in many of the multiline examples above.
caseInsensitive
An optional attribute used to specify if the supplied pattern should match content in a case insensitive way. If ’true’ the expression will match content in a case insensitive manner, if ‘false’ it will match the content in a case sensitive manner. If this attribute is not specified it defaults to ‘false’ and will match the content in a case sensitive manner.
maxMatch
This is used in the same way it is on the <split>
element, see maxMatch.
minMatch
This is used in the same way it is on the <split>
element, see minMatch.
onlyMatch
This is used in the same way it is on the <split>
element, see onlyMatch.
advance
After an expression has matched content in the buffer, the buffer start position is advanced so that it moves to the end of the entire match. This means that subsequent expressions operating on the content buffer will not see the previously matched content again. This is normally required behaviour, but in some cases some of the content from a match is still required for subsequent matches. Take the following example of name value pairs:
name1=some value 1 name2=some value 2 name3=some value 3
The first name value pair could be matched with the following expression:
<regex pattern="([^=]+)=(.+?) [^= ]+=">
The above expression would match as follows:
name1=some value 1 name2=some value 2 name3=some value 3
In this example we have had to do a reluctant match to extract the value in group 2 and not include the subsequent name. Because the reluctant match requires us to specify what we are reluctantly matching up to, we have had to include an expression after it that matches the next name.
By default the parser will move the character buffer to the end of the entire match so the next expression will be presented with the following:
some value 2 name3=some value 3
Therefore name2
will have been lost from the content buffer and will not be available for matching.
This behaviour can be altered by telling the expression how far to advance the character buffer after matching. This is done with the advance attribute and is used to specify the match group whose end position should be treated as the point the content buffer should advance to, e.g.
<regex pattern="([^=]+)=(.+?) [^= ]+=" advance="2">
In this example the content buffer will only advance to the end of match group 2 and subsequent expressions will be presented with the following content:
name2=some value 2 name3=some value 3
Therefore name2
will still be available in the content buffer.
It is likely that the advance feature will only be useful in cases where a reluctant match is performed. Reluctant matches are discouraged for performance reasons so this feature should rarely be used. A better way to tackle the above example would be to present the content in reverse, however this is only possible if the expression is within a group, i.e. is not a root expression. There may also be more complex cases where reversal is not an option and the use of a reluctant match is the only option.
The <all> element
The <all>
element matches the entire content of the parent group and makes it available to child groups or <data>
elements. The purpose of <all>
is to act as a catch all expression to deal with content that is not handled by a more specific expression, e.g. to output some other unknown, unrecognised or unexpected data.
<group>
<regex pattern="^\s*([^=]+)=([^=]+)\s*">
<data name="$1" value="$2" />
</regex>
<!-- Output unexpected data -->
<all>
<data name="unknown" value="$" />
</all>
</group>
The <all>
element provides the same functionality as using .*
as a pattern in a <regex>
element and where dotAll is set to true, e.g. <regex pattern=".*" dotAll="true">
. However it performs much faster as it doesn’t require pattern matching logic and is therefore always preferred.
Attributes
The <all>
element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
3 - Output
As with all other aspects of Data Splitter, output XML is determined by adding certain elements to the Data Splitter configuration.
The <data> element
Output is created by Data Splitter using one or more <data>
elements in the configuration. The first <data>
element that is encountered within a matched expression will result in parent <record>
elements being produced in the output.
Attributes
The <data>
element has the following attributes:
id
Optional attribute used to debug the location of expressions causing errors, see id.
name
Both the name and value attributes of the <data>
element can be specified using match references.
value
Both the name and value attributes of the <data>
element can be specified using match references.
Single <data> element example
The simplest example that can be provided uses a single <data>
element within a <split>
expression.
Given the following input:
This is line 1
This is line 2
This is line 3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<split delimiter="\n" >
<data value="$1"/>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data value="This is line 1" />
</record>
<record>
<data value="This is line 2" />
</record>
<record>
<data value="This is line 3" />
</record>
</records>
Multiple <data> element example
You could also output multiple <data>
elements for the same <record>
by adding multiple elements within the same expression:
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</record>
<record>
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</record>
<record>
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</record>
</records>
Multi level <data> elements
As long as all data elements occur within the same parent/ancestor expression, all data elements will be output within the same record.
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<split delimiter="\n" >
<data name="line" value="$1"/>
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1" />
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2" />
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3" />
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</record>
</records>
Nesting <data> elements
Rather than having <data>
elements all appear as children of <record>
it is possible to nest them either as direct children or within child groups.
Direct children
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
<data name="line" value="$">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</data>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1">
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</data>
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2">
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</data>
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3">
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</data>
</record>
</records>
Within child groups
Given the following input:
ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3
… and the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter xmlns="data-splitter:3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd" version="3.0">
<split delimiter="\n" >
<data name="line" value="$1">
<group value="$1">
<regex pattern="ip=([^ ]+) user=([^ ]+)">
<data name="ip" value="$1"/>
<data name="user" value="$2"/>
</regex>
</group>
</data>
</split>
</dataSplitter>
… you would get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="records:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="records:2 file://records-v2.0.xsd" version="3.0">
<record>
<data name="line" value="ip=1.1.1.1 user=user1">
<data name="ip" value="1.1.1.1" />
<data name="user" value="user1" />
</data>
</record>
<record>
<data name="line" value="ip=2.2.2.2 user=user2">
<data name="ip" value="2.2.2.2" />
<data name="user" value="user2" />
</data>
</record>
<record>
<data name="line" value="ip=3.3.3.3 user=user3">
<data name="ip" value="3.3.3.3" />
<data name="user" value="user3" />
</data>
</record>
</records>
The above example produces the same output as the previous but could be used to apply much more complex expression logic to produce the child <data>
elements, e.g. the inclusion of multiple child expressions to deal with different types of lines.
4 - Variables
A variable is added to Data Splitter using the <var>
element. A variable is used to store matches from a parent expression for use in a reference elsewhere in the configuration, see variable reference.
The most recent matches are stored for use in local references, i.e. references that are in the same match scope as the variable. Multiple matches are stored for use in references that are in a separate match scope. The concept of different variable scopes is described in scopes.
The <var> element
The <var>
element is used to tell Data Splitter to store matches from a parent expression for use in a reference.
Attributes
The <var>
element has the following attributes:
id
Mandatory attribute used to uniquely identify it within the configuration (see id) and is the means by which a variable is referenced, e.g. $VAR_ID$
.