This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Data Splitter

1: Simple CSV Example
2: Simple CSV example with heading
3: Complex example with regex and user defined names
4: Multi Line Example
5: Element Reference

5.1: Content Providers
5.2: Expressions
5.3: Variables
5.4: Output

6: Match References, Variables and Fixed Strings

6.1: Expression match references
6.2: Variable reference
6.3: Use of fixed strings
6.4: Concatenation of references

Data Splitter was created to transform text into XML. The XML produced is basic but can be processed further with XSLT to form any desired XML output.

Data Splitter works by using regular expressions to match a region of content or tokenizers to split content. The whole match or match group can then be output or passed to other expressions to further divide the matched data.

The root <dataSplitter> element controls the way content is read and buffered from the source. It then passes this content on to one or more child expressions that attempt to match the content. The child expressions attempt to match content one at a time in the order they are specified until one matches. The matching expression then passes the content that it has matched to other elements that either emit XML or apply other expressions to the content matched by the parent.

This process of content supply, match, (supply, match)*, emit is best illustrated in a simple CSV example. Note that the elements and attributes used in all examples are explained in detail in the element reference.

1 - Simple CSV Example

The following CSV data will be split up into separate fields using Data Splitter.

01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,

The first thing we need to do is match each record. Each record in a CSV file is delimited by a new line character. The following configuration will split the data into records using ‘\n’ as a delimiter:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  
  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n"/>

</dataSplitter>

In the above example the ‘split’ tokenizer matches all of the supplied content up to the end of each line ready to pass each line of content on for further treatment.

We can now add a <group> element within <split> to take content matched by the tokenizer.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">

    <!-- Take the matched line (using group 1 ignores the delimiters, 
    without this each match would include the new line character) -->
    <group value="$1">

    </group>
  </split>
</dataSplitter>

The <group> within the <split> chooses to take the content from the <split> without including the new line ‘\n’ delimiter by using match group 1, see expression match references for details.

01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,

The content selected by the <group> from its parent match can then be passed onto sub expressions for further matching:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">

    <!-- Take the matched line (using group 1 ignores the delimiters, 
    without this each match would include the new line character) -->
    <group value="$1">

      <!-- Match each value separated by a comma as the delimiter -->
      <split delimiter=",">

      </split>
    </group>
  </split>
</dataSplitter>

In the above example the additional <split> element within the <group> will match the content provided by the group repeatedly until it has used all of the group content.

The content matched by the inner <split> element can be passed to a <data> element to emit XML content.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each line using a new line character as the delimiter -->
  <split delimiter="\n">

    <!-- Take the matched line (using group 1 ignores the delimiters, 
    without this each match would include the new line character) -->
    <group value="$1">

      <!-- Match each value separated by a comma as the delimiter -->
      <split delimiter=",">

        <!-- Output the value from group 1 (as above using group 1
        ignores the delimiters, without this each value would include
        the comma) -->
        <data value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>

In the above example each match from the inner <split> is made available to the inner <data> element that chooses to output content from match group 1, see expression match references for details.

The above configuration results in the following XML output for the whole input:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data value="01/01/2010" />
    <data value="00:00:00" />
    <data value="192.168.1.100" />
    <data value="SOMEHOST.SOMEWHERE.COM" />
    <data value="user1" />
    <data value="logon" />
  </record>
  <record>
    <data value="01/01/2010" />
    <data value="00:01:00" />
    <data value="192.168.1.100" />
    <data value="SOMEHOST.SOMEWHERE.COM" />
    <data value="user1" />
    <data value="create" />
    <data value="c:\test.txt" />
  </record>
  <record>
    <data value="01/01/2010" />
    <data value="00:02:00" />
    <data value="192.168.1.100" />
    <data value="SOMEHOST.SOMEWHERE.COM" />
    <data value="user1" />
    <data value="logoff" />
  </record>
</records>

2 - Simple CSV example with heading

In addition to referencing content produced by a parent element it is often desirable to store content and reference it later. The following example of a CSV with a heading demonstrates how content can be stored in a variable and then referenced later on.

Input

This example will use a similar input to the one in the previous CSV example but also adds a heading line.

Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,
01/01/2010,00:01:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,create,c:\test.txt
01/01/2010,00:02:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logoff,

Configuration

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match heading line (note that maxMatch="1" means that only the
  first line will be matched by this splitter) -->
  <split delimiter="\n" maxMatch="1">

    <!-- Store each heading in a named list -->
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </split>

  <!-- Match each record -->
  <split delimiter="\n">

    <!-- Take the matched line -->
    <group value="$1">

      <!-- Split the line up -->
      <split delimiter=",">

        <!-- Output the stored heading for each iteration and the value
        from group 1 -->
        <data name="$heading$1" value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>

Output

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="Date" value="01/01/2010" />
    <data name="Time" value="00:00:00" />
    <data name="IPAddress" value="192.168.1.100" />
    <data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
    <data name="User" value="user1" />
    <data name="EventType" value="logon" />
  </record>
  <record>
    <data name="Date" value="01/01/2010" />
    <data name="Time" value="00:01:00" />
    <data name="IPAddress" value="192.168.1.100" />
    <data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
    <data name="User" value="user1" />
    <data name="EventType" value="create" />
    <data name="Detail" value="c:\test.txt" />
  </record>
  <record>
    <data name="Date" value="01/01/2010" />
    <data name="Time" value="00:02:00" />
    <data name="IPAdress" value="192.168.1.100" />
    <data name="HostName" value="SOMEHOST.SOMEWHERE.COM" />
    <data name="User" value="user1" />
    <data name="EventType" value="logoff" />
  </record>
</records>

3 - Complex example with regex and user defined names

The following example uses a real world Apache log and demonstrates the use of regular expressions rather than simple ‘split’ tokenizers. The usage and structure of regular expressions is outside of the scope of this document but Data Splitter uses Java’s standard regular expression library that is POSIX compliant and documented in numerous places.

This example also demonstrates that the names and values that are output can be hard coded in the absence of field name information to make XSLT conversion easier later on. Also shown is that any match can be divided into further fields with additional expressions and the ability to nest data elements to provide structure if needed.

Input

192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /doc.htm HTTP/1.1" 200 4235 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
192.168.1.100 - "-" [12/Jul/2012:11:57:07 +0000] "GET /default.css HTTP/1.1" 200 3494 "http://some.server:8080/doc.htm" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

Configuration

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!--
  Standard Apache Format

  %h - host name should be ok without quotes
  %l - Remote logname (from identd, if supplied). This will return a dash unless IdentityCheck is set On.
  \"%u\" - user name should be quoted to deal with DNs
  %t - time is added in square brackets so is contained for parsing purposes
  \"%r\" - URL is quoted
  %>s - Response code doesn't need to be quoted as it is a single number
  %b - The size in bytes of the response sent to the client
  \"%{Referer}i\" - Referrer is quoted so that’s ok
  \"%{User-Agent}i\" - User agent is quoted so also ok

  LogFormat "%h %l \"%u\" %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
  -->

  <!-- Match line -->
  <split delimiter="\n">
    <group value="$1">

      <!-- Provide a regular expression for the whole line with match
      groups for each field we want to split out -->
      <regex pattern="^([^ ]+) ([^ ]+) &#34;([^&#34;]+)&#34; \[([^\]]+)] &#34;([^&#34;]+)&#34; ([^ ]+) ([^ ]+) &#34;([^&#34;]+)&#34; &#34;([^&#34;]+)&#34;">
        <data name="host" value="$1" />
        <data name="log" value="$2" />
        <data name="user" value="$3" />
        <data name="time" value="$4" />
        <data name="url" value="$5">

          <!-- Take the 5th regular expression group and pass it to
          another expression to divide into smaller components -->
          <group value="$5">
            <regex pattern="^([^ ]+) ([^ ]+) ([^ /]*)/([^ ]*)">
              <data name="httpMethod" value="$1" />
              <data name="url" value="$2" />
              <data name="protocol" value="$3" />
              <data name="version" value="$4" />
            </regex>
          </group>
        </data>
        <data name="response" value="$6" />
        <data name="size" value="$7" />
        <data name="referrer" value="$8" />
        <data name="userAgent" value="$9" />
      </regex>
    </group>
  </split>
</dataSplitter>

Output

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="host" value="192.168.1.100" />
    <data name="log" value="-" />
    <data name="user" value="-" />
    <data name="time" value="12/Jul/2012:11:57:07 +0000" />
    <data name="url" value="GET /doc.htm HTTP/1.1">
      <data name="httpMethod" value="GET" />
      <data name="url" value="/doc.htm" />
      <data name="protocol" value="HTTP" />
      <data name="version" value="1.1" />
    </data>
    <data name="response" value="200" />
    <data name="size" value="4235" />
    <data name="referrer" value="-" />
    <data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
  </record>
  <record>
    <data name="host" value="192.168.1.100" />
    <data name="log" value="-" />
    <data name="user" value="-" />
    <data name="time" value="12/Jul/2012:11:57:07 +0000" />
    <data name="url" value="GET /default.css HTTP/1.1">
      <data name="httpMethod" value="GET" />
      <data name="url" value="/default.css" />
      <data name="protocol" value="HTTP" />
      <data name="version" value="1.1" />
    </data>
    <data name="response" value="200" />
    <data name="size" value="3494" />
    <data name="referrer" value="http://some.server:8080/doc.htm" />
    <data name="userAgent" value="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" />
  </record>
</records>

4 - Multi Line Example

Example multi line file where records are split over may lines. There are various ways this data could be treated but this example forms a record from data created when some fictitious query starts plus the subsequent query results.

Input

09/07/2016    14:49:36    User = user1
09/07/2016    14:49:36    Query = some query

09/07/2016    16:34:40    Results:
09/07/2016    16:34:40    Line 1:   result1
09/07/2016    16:34:40    Line 2:   result2
09/07/2016    16:34:40    Line 3:   result3
09/07/2016    16:34:40    Line 4:   result4

09/07/2009    16:35:21    User = user2
09/07/2009    16:35:21    Query = some other query

09/07/2009    16:45:36    Results:
09/07/2009    16:45:36    Line 1:   result1
09/07/2009    16:45:36    Line 2:   result2
09/07/2009    16:45:36    Line 3:   result3
09/07/2009    16:45:36    Line 4:   result4

Configuration

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match each record. We want to treat the query and results as a single event so match the two sets of data separated by a double new line -->
  <regex pattern="\n*((.*\n)+?\n(.*\n)+?\n)|\n*(.*\n?)+">
    <group>

      <!-- Split the record into query and results -->
      <regex pattern="(.*?)\n\n(.*)" dotAll="true">

        <!-- Create a data element to output query data -->
        <data name="query">
          <group value="$1">

            <!-- We only want to output the date and time from the first line. -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
              <data name="date" value="$1" />
              <data name="time" value="$2" />
              <data name="$3" value="$4" />
            </regex>
            
            <!-- Output all other values -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
              <data name="$3" value="$4" />
            </regex>
          </group>
        </data>

        <!-- Create a data element to output result data -->
        <data name="results">
          <group value="$2">

            <!-- We only want to output the date and time from the first line. -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
              <data name="date" value="$1" />
              <data name="time" value="$2" />
              <data name="$3" value="$4" />
            </regex>
            
            <!-- Output all other values -->
            <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
              <data name="$3" value="$4" />
            </regex>
          </group>
        </data>
      </regex>
    </group>
  </regex>
</dataSplitter>

Output

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="records:2 file://records-v2.0.xsd"
                version="2.0">
  <record>
    <data name="query">
      <data name="date" value="09/07/2016" />
      <data name="time" value="14:49:36" />
      <data name="User" value="user1" />
      <data name="Query" value="some query" />
    </data>
    <data name="results">
      <data name="date" value="09/07/2016" />
      <data name="time" value="16:34:40" />
      <data name="Results" />
      <data name="Line 1" value="result1" />
      <data name="Line 2" value="result2" />
      <data name="Line 3" value="result3" />
      <data name="Line 4" value="result4" />
    </data>
  </record>
  <record>
    <data name="query">
      <data name="date" value="09/07/2016" />
      <data name="time" value="16:35:21" />
      <data name="User" value="user2" />
      <data name="Query" value="some other query" />
    </data>
    <data name="results">
      <data name="date" value="09/07/2016" />
      <data name="time" value="16:45:36" />
      <data name="Results" />
      <data name="Line 1" value="result1" />
      <data name="Line 2" value="result2" />
      <data name="Line 3" value="result3" />
      <data name="Line 4" value="result4" />
    </data>
  </record>
</records>

5 - Element Reference

There are various elements used in a Data Splitter configuration to control behaviour. Each of these elements can be categorised as one of the following:

5.1 - Content Providers

Content providers take some content from the input source or elsewhere (see fixed strings and provide it to one or more expressions. Both the root element <dataSplitter> and <group> elements are content providers.

Root element `<dataSplitter>`

The root element of a Data Splitter configuration is <dataSplitter>. It supplies content from the input source to one or more expressions defined within it. The way that content is buffered is controlled by the root element and the way that errors are handled as a result of child expressions not matching all of the content it supplies.

Attributes

The following attributes can be added to the <dataSplitter> root element:

ignoreErrors
bufferSize

`ignoreErrors`

Data Splitter generates errors if not all of the content is matched by the regular expressions beneath the <dataSplitter> or within <group> elements. The error messages are intended to aid the user in writing good Data Splitter configurations. The intent is to indicate when the input data is not being matched fully and therefore possibly skipping some important data. Despite this, in some cases it is laborious to have to write expressions to match all content. In these cases it is preferable to add this attribute to ignore these errors. However it is often better to write expressions that capture all of the supplied content and discard unwanted characters. This attribute also affects errors generated by the use of the minMatch attribute on <regex> which is described later on.

Take the following example input:

Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment

This could be matched with the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex id="heading" pattern=".+" maxMatch="1">
…
  </regex>
  <regex id="body" pattern="\n[^#]+">
…
  </regex>
</dataSplitter>

The above configuration would only match up to a comment for each record line, e.g.

Name1,Name2,Name3
value1,value2,value3 # a useless comment
value1,value2,value3 # a useless comment

This may well be the desired functionality but if there was useful content within the comment it would be lost. Because of this Data Splitter warns you when expressions are failing to match all of the content presented so that you can make sure that you aren’t missing anything important. In the above example it is obvious that this is the required behaviour but in more complex cases you might be otherwise unaware that your expressions were losing data.

To maintain this assurance that you are handling all content it is usually best to write expressions to explicitly match all content even though you may do nothing with some matches, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex id="heading" pattern=".+" maxMatch="1">
…
  </regex>
  <regex id="body" pattern="\n([^#]+)#.+">
…
  </regex>
</dataSplitter>

The above example would match all of the content and would therefore not generate warnings. Sub-expressions of ‘body’ could use match group 1 and ignore the comment.

However as previously stated it might often be difficult to write expressions that will just match content that is to be discarded. In these cases ignoreErrors can be used to suppress errors caused by unmatched content.

`bufferSize` (Advanced)

This is an optional attribute used to tune the size of the character buffer used by Data Splitter. The default size is 20000 characters and should be fine for most translations. The minimum value that this can be set to is 20000 characters and the maximum is 1000000000. The only reason to specify this attribute is when individual records are bigger than 10000 characters which is rarely the case.

Group element `<group>`

Groups behave in a similar way to the root element in that they provide content for one or more inner expressions to deal with, e.g.

<group value="$1">
  <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)" maxMatch="1">
  ...
  <regex pattern="([^\t]*)\t([^\t]*)[\t]*([^=:]*)[=:]*(.*)">
  ...

Attributes

As the <group> element is a content provider it also includes the same ignoreErrors attribute which behaves in the same way. The complete list of attributes for the <group> element is as follows:

id
value
ignoreErrors
matchOrder
reverse

`id`

When Data Splitter reports errors it outputs an XPath to describe the part of the configuration that generated the error, e.g.

DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group>

It is often a little difficult to identify the configuration element that generated the error by looking at the path and the element description, particularly when multiple elements are the same, e.g. many <group> elements without attributes. To make identification easier you can add an ‘id’ attribute to any element in the configuration resulting in error descriptions as follows:

DSParser [2:1] ERROR: Expressions failed to match all of the content provided by group: regex[0]/group[0]/regex[3]/group[1] : <group id="myGroupId">

`value`

This attribute determines what content to present to child expressions. By default the entire content matched by a group’s parent expression is passed on by the group to child expressions. If required, content from a specific match group in the parent expression can be passed to child expressions using the value attribute, e.g. value="$1". In addition to this content can be composed in the same way as it is for data names and values.

`ignoreErrors`

This behaves in the same way as for the root element.

`matchOrder`

This is an optional attribute used to control how content is consumed by expression matches. Content can be consumed in sequence or in any order using matchOrder="sequence" or matchOrder="any". If the attribute is not specified, Data Splitter will default to matching in sequence.

When matching in sequence, each match consumes some content and the content position is moved beyond the match ready for the subsequent match. However, in some cases the order of these constructs is not predictable, e.g. we may sometimes be presented with:

Value1=1 Value2=2

… or sometimes with:

Value2=2 Value1=1

Using a sequential match order the following example would work to find both values in Value1=1 Value2=2

<group>
  <regex pattern="Value1=([^ ]*)">
  ...
  <regex pattern="Value2=([^ ]*)">
  ...

… but this example would skip over Value2 and only find the value of Value1 if the input was Value2=2 Value1=1.

To be able to deal with content that contains these constructs in either order we need to change the match order to any.

When matching in any order, each match removes the matched section from the content rather than moving the position past the match so that all remaining content can be matched by subsequent expressions. In the following example the first expression would match and remove Value1=1 from the supplied content and the second expression would be presented with Value2=2 which it could also match.

<group matchOrder="any">
  <regex pattern="Value1=([^ ]*)">
  ...
  <regex pattern="Value2=([^ ]*)">
  ...

If the attribute is omitted by default the match order will be sequential. This is the default behaviour as tokens are most often in sequence and consuming content in this way is more efficient as content does not need to be copied by the parser to chop out sections as is required for matching in any order. It is only necessary to use this feature when fields that are identifiable with a specific match can occur in any order.

`reverse`

Occasionally it is desirable to reverse the content presented by a group to child expressions. This is because it is sometimes easier to form a pattern by matching content in reverse.

Take the following example content of name, value pairs delimited by = but with no spaces between names, multiple spaces between values and only a space between subsequent pairs:

ipAddress=123.123.123.123 zones=Zone 1, Zone 2, Zone 3 location=loc1 A user=An end user serverName=bigserver

We could write a pattern that matches each name value pair by matching up to the start of the next name, e.g.

<regex pattern="([^=]+)=(.+?)( [^=]+=)">

This would match the following:

ipAddress=123.123.123.123 zones=

Here we are capturing the name and value for each pair in separate groups but the pattern has to also match the name from the next name value pair to find the end of the value. By default Data Splitter will move the content buffer to the end of the match ready for subsequent matches so the next name will not be available for matching.

In addition to matching too much content the above example also uses a reluctant qualifier .+?. Use of reluctant qualifiers almost always impacts performance so they are to be avoided if at all possible.

A better way to match the example content is to match the input in reverse, reading characters from right to left.

The following example demonstrates this:

<group reverse="true">
  <regex pattern="([^=]+)=([^ ]+)">
    <data name="$2" value="$1" />
  </regex>
</group>

Using the reverse attribute on the parent group causes content to be supplied to all child expressions in reverse order. In the above example this allows the pattern to match values followed by names which enables us to cope with the fact that values have multiple spaces but names have no spaces.

Content is only presented to child regular expressions in reverse. When referencing values from match groups the content is returned in the correct order, e.g. the above example would return:

<data name="ipAddress" value="123.123.123.123" />
<data name="zones" value="Zone 1, Zone 2, Zone 3" />
<data name="location" value="loc1" />
<data name="user" value="An end user" />
<data name="serverName" value="bigserver" />

The reverse feature isn’t needed very often but there are a few cases where it really helps produce the desired output without the complexity and performance overhead of a reluctant match.

An alternative to using the reverse attribute is to use the original reluctant expression example but tell Data Splitter to make the subsequent name available for the next match by not advancing the content beyond the end of the previous value. This is done by using the advance attribute on the <regex>. However, the reverse attribute represents a better way to solve this particular problem and allows a simpler and more efficient regular expression to be used.

5.2 - Expressions

Expressions match some data supplied by a parent content provider. The content matched by an expression depends on the type of expression and how it is configured.

The <split>, <regex> and <all> elements are all expressions and match content as described below.

The `<split>` element

The <split> element directs Data Splitter to break up content using a specified character sequence as a delimiter. In addition to this it is possible to specify characters that are used to escape the delimiter as well as characters that contain or “quote” a value that may include the delimiter sequence but allow it to be ignored.

Attributes

The <split> element has the following attributes:

id
delimiter
escape
containerStart
containerEnd
maxMatch
minMatch
onlyMatch

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

`delimiter`

A required attribute used to specify the character string that will be used as a delimiter to split the supplied content unless it is preceded by an escape character or within a container if specified. Several of the previous examples use this attribute.

`escape`

An optional attribute used to specify a character sequence that is used to escape the delimiter. Many delimited text formats have an escape character that is used to tell any parser that the following delimiter should be ignored, e.g. often a character such as ‘' is used to escape the character that follows it so that it is not treated as a delimiter. When specified this escape sequence also applies to any container characters that may be specified.

`containerStart`

An optional attribute used to specify a character sequence that will make this expression ignore the presence of delimiters until an end container is found. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters up to a delimiter.

If used containerEnd must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.

`containerEnd`

An optional attribute used to specify a character sequence that will make this expression stop ignoring the presence of delimiters if it believes it is currently in a container. If the character is preceded by the specified escape sequence then this container sequence will be ignored and the expression will continue matching characters while ignoring the presence of any delimiter.

If used containerStart must also be specified. If the container characters are to be ignored from the match then match group 1 must be used instead of 0.

`maxMatch`

An optional attribute used to specify the maximum number of times this expression is allowed to match the supplied content. If you do not supply this attribute then the Data Splitter will keep matching the supplied content until it reaches the end. If specified Data Splitter will stop matching the supplied content when it has matched it the specified number of times.

This attribute is used in the ‘CSV with header line’ example to ensure that only the first line is treated as a header line.

`minMatch`

An optional attribute used to specify the minimum number of times this expression should match the supplied content. If you do not supply this attribute then Data Splitter will not enforce that the expression matches the supplied content. If specified Data Splitter will generate an error if the expression does not match the supplied content at least as many times as specified.

Unlike maxMatch, minMatch does not control the matching process but instead controls the production of error messages generated if the parser is not seeing the expected input.

`onlyMatch`

Optional attribute to use this expression only for specific instances of a match of the parent expression, e.g. on the 4th, 5th and 8th matches of the parent expression specified by ‘4,5,8’. This is used when this expression should only be used to subdivide content from certain parent matches.

The `<regex>` element

The <regex> element directs Data Splitter to match content using the specified regular expression pattern. In addition to this the same match control attributes that are available on the <split> element are also present as well as attributes to alter the way the pattern works.

Attributes

The <regex> element has the following attributes:

id
pattern
dotAll
caseInsensitive
maxMatch
minMatch
onlyMatch
advance

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

`pattern`

This is a required attribute used to specify a regular expression to use to match on the supplied content. The pattern is used to match the content multiple times until the end of the content is reached while the maxMatch and onlyMatch conditions are satisfied.

`dotAll`

An optional attribute used to specify if the use of ‘.’ in the supplied pattern matches all characters including new lines. If ’true’ ‘.’ will match all characters including new lines, if ‘false’ it will only match up to a new line. If this attribute is not specified it defaults to ‘false’ and will only match up to a new line.

This attribute is used in many of the multi-line examples above.

`caseInsensitive`

An optional attribute used to specify if the supplied pattern should match content in a case insensitive way. If ’true’ the expression will match content in a case insensitive manner, if ‘false’ it will match the content in a case sensitive manner. If this attribute is not specified it defaults to ‘false’ and will match the content in a case sensitive manner.

`maxMatch`

This is used in the same way it is on the <split> element, see maxMatch.

`minMatch`

This is used in the same way it is on the <split> element, see minMatch.

`onlyMatch`

This is used in the same way it is on the <split> element, see onlyMatch.

`advance`

After an expression has matched content in the buffer, the buffer start position is advanced so that it moves to the end of the entire match. This means that subsequent expressions operating on the content buffer will not see the previously matched content again. This is normally required behaviour, but in some cases some of the content from a match is still required for subsequent matches. Take the following example of name value pairs:

name1=some value 1 name2=some value 2 name3=some value 3

The first name value pair could be matched with the following expression:

<regex pattern="([^=]+)=(.+?) [^= ]+=">

The above expression would match as follows:

name1=some value 1 name2=some value 2 name3=some value 3

In this example we have had to do a reluctant match to extract the value in group 2 and not include the subsequent name. Because the reluctant match requires us to specify what we are reluctantly matching up to, we have had to include an expression after it that matches the next name.

By default the parser will move the character buffer to the end of the entire match so the next expression will be presented with the following:

some value 2 name3=some value 3

Therefore name2 will have been lost from the content buffer and will not be available for matching.

This behaviour can be altered by telling the expression how far to advance the character buffer after matching. This is done with the advance attribute and is used to specify the match group whose end position should be treated as the point the content buffer should advance to, e.g.

<regex pattern="([^=]+)=(.+?) [^= ]+=" advance="2">

In this example the content buffer will only advance to the end of match group 2 and subsequent expressions will be presented with the following content:

name2=some value 2 name3=some value 3

Therefore name2 will still be available in the content buffer.

It is likely that the advance feature will only be useful in cases where a reluctant match is performed. Reluctant matches are discouraged for performance reasons so this feature should rarely be used. A better way to tackle the above example would be to present the content in reverse, however this is only possible if the expression is within a group, i.e. is not a root expression. There may also be more complex cases where reversal is not an option and the use of a reluctant match is the only option.

The `<all>` element

The <all> element matches the entire content of the parent group and makes it available to child groups or <data> elements. The purpose of <all> is to act as a catch all expression to deal with content that is not handled by a more specific expression, e.g. to output some other unknown, unrecognised or unexpected data.

<group>
  <regex pattern="^\s*([^=]+)=([^=]+)\s*">
    <data name="$1" value="$2" />
  </regex>

  <!-- Output unexpected data -->
  <all>
    <data name="unknown" value="$" />
  </all>
</group>

The <all> element provides the same functionality as using .* as a pattern in a <regex> element and where dotAll is set to true, e.g. <regex pattern=".*" dotAll="true">. However it performs much faster as it doesn’t require pattern matching logic and is therefore always preferred.

Attributes

The <all> element has the following attributes:

id

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

5.3 - Variables

A variable is added to Data Splitter using the <var> element. A variable is used to store matches from a parent expression for use in a reference elsewhere in the configuration, see variable reference.

The most recent matches are stored for use in local references, i.e. references that are in the same match scope as the variable. Multiple matches are stored for use in references that are in a separate match scope. The concept of different variable scopes is described in scopes.

The `<var>` element

The <var> element is used to tell Data Splitter to store matches from a parent expression for use in a reference.

Attributes

The <var> element has the following attributes:

id

`id`

Mandatory attribute used to uniquely identify it within the configuration (see id) and is the means by which a variable is referenced, e.g. $VAR_ID$ .

5.4 - Output

As with all other aspects of Data Splitter, output XML is determined by adding certain elements to the Data Splitter configuration.

The `<data>` element

Output is created by Data Splitter using one or more <data> elements in the configuration. The first <data> element that is encountered within a matched expression will result in parent <record> elements being produced in the output.

Attributes

The <data> element has the following attributes:

id
name
value

`id`

Optional attribute used to debug the location of expressions causing errors, see id.

`name`

Both the name and value attributes of the <data> element can be specified using match references.

`value`

Both the name and value attributes of the <data> element can be specified using match references.

Single `<data>` element example

The simplest example that can be provided uses a single <data> element within a <split> expression.

Given the following input:

This is line 1
This is line 2
This is line 3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter 
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <split delimiter="\n" >
    <data value="$1"/>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data value="This is line 1" />
  </record>
  <record>
    <data value="This is line 2" />
  </record>
  <record>
    <data value="This is line 3" />
  </record>
</records>

Multiple `<data>` element example

You could also output multiple <data> elements for the same <record> by adding multiple elements within the same expression:

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
    <data name="ip" value="$1"/>
    <data name="user" value="$2"/>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="ip" value="1.1.1.1" />
    <data name="user" value="user1" />
  </record>
  <record>
    <data name="ip" value="2.2.2.2" />
    <data name="user" value="user2" />
  </record>
  <record>
    <data name="ip" value="3.3.3.3" />
    <data name="user" value="user3" />
  </record>
</records>

Multi level `<data>` elements

As long as all data elements occur within the same parent/ancestor expression, all data elements will be output within the same record.

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <split delimiter="\n" >
    <data name="line" value="$1"/>

    <group value="$1">
      <regex pattern="ip=([^ ]+) user=([^ ]+)">
        <data name="ip" value="$1"/>
        <data name="user" value="$2"/>
      </regex>
    </group>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="line" value="ip=1.1.1.1 user=user1" />
    <data name="ip" value="1.1.1.1" />
    <data name="user" value="user1" />
  </record>
  <record>
    <data name="line" value="ip=2.2.2.2 user=user2" />
    <data name="ip" value="2.2.2.2" />
    <data name="user" value="user2" />
  </record>
  <record>
    <data name="line" value="ip=3.3.3.3 user=user3" />
    <data name="ip" value="3.3.3.3" />
    <data name="user" value="user3" />
  </record>
</records>

Nesting `<data>` elements

Rather than having <data> elements all appear as children of <record> it is possible to nest them either as direct children or within child groups.

Direct children

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <regex pattern="ip=([^ ]+) user=([^ ]+)\s*">
    <data name="line" value="$">
      <data name="ip" value="$1"/>
      <data name="user" value="$2"/>
    </data>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<record
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="line" value="ip=1.1.1.1 user=user1">
      <data name="ip" value="1.1.1.1" />
      <data name="user" value="user1" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=2.2.2.2 user=user2">
      <data name="ip" value="2.2.2.2" />
      <data name="user" value="user2" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=3.3.3.3 user=user3">
      <data name="ip" value="3.3.3.3" />
      <data name="user" value="user3" />
    </data>
  </record>
</records>

Within child groups

Given the following input:

ip=1.1.1.1 user=user1
ip=2.2.2.2 user=user2
ip=3.3.3.3 user=user3

… and the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">
  <split delimiter="\n" >
    <data name="line" value="$1">
      <group value="$1">
        <regex pattern="ip=([^ ]+) user=([^ ]+)">
          <data name="ip" value="$1"/>
          <data name="user" value="$2"/>
        </regex>
      </group>
    </data>
  </split>
</dataSplitter>

… you would get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<records
    xmlns="records:2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="records:2 file://records-v2.0.xsd"
    version="3.0">
  <record>
    <data name="line" value="ip=1.1.1.1 user=user1">
      <data name="ip" value="1.1.1.1" />
      <data name="user" value="user1" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=2.2.2.2 user=user2">
      <data name="ip" value="2.2.2.2" />
      <data name="user" value="user2" />
    </data>
  </record>
  <record>
    <data name="line" value="ip=3.3.3.3 user=user3">
      <data name="ip" value="3.3.3.3" />
      <data name="user" value="user3" />
    </data>
  </record>
</records>

The above example produces the same output as the previous but could be used to apply much more complex expression logic to produce the child <data> elements, e.g. the inclusion of multiple child expressions to deal with different types of lines.

6 - Match References, Variables and Fixed Strings

The <group> and <data> elements can reference match groups from parent expressions or from stored matches in variables. In the case of the <group> element, referenced values are passed on to child expressions whereas the <data> element can use match group references for name and value attributes. In the case of both elements the way of specifying references is the same.

6.1 - Expression match references

Referencing matches in expressions is done using $. In addition to this a match group number may be added to just retrieve part of the expression match. The applicability and effect that this has depends on the type of expression used.

References to `<split>` Match Groups

In the following example a line matched by a parent <split> expression is referenced by a child <data> element.

<split delimiter="\n" >
  <data name="line" value="$"/>
</split>

A <split> element matches content up to and including the specified delimiter, so the above reference would output the entire line plus the delimiter. However there are various match groups that can be used by child <group> and <data> elements to reference sections of the matched content.

To illustrate the content provided by each match group, take the following example:

"This is some text\, that we wish to match", "This is the next text"

And the following <split> element:

<split delimiter="," escape="\">

The match groups are as follows:

$ or $0: The entire content that is matched including the specified delimiter at the end

"This is some text\, that we wish to match",

$1: The content up to the specified delimiter at the end

"This is some text\, that we wish to match"

$2: The content up to the specified delimiter at the end and filtered to remove escape characters (more expensive than $1)

"This is some text, that we wish to match"

In addition to this behaviour match groups 1 and 2 will omit outermost whitespace and container characters if specified, e.g. with the following content:

"  This is some text\, that we wish to match  "  , "This is the next text"

And the following <split> element:

<split delimiter="," escape="\" containerStart="&quot" containerEnd="&quot">

The match groups are as follows:

$ or $0: The entire content that is matched including the specified delimiter at the end

" This is some text\, that we wish to match " ,

$1: The content up to the specified delimiter at the end and strips outer containers.

This is some text\, that we wish to match

$2: The content up to the specified delimiter at the end and strips outer containers and filtered to remove escape characters (more computationally expensive than $1)

This is some text, that we wish to match

References to Match Groups

Like the <split> element various match groups can be referenced in a <regex> expression to retrieve portions of matched content. This content can be used as values for <group> and <data> elements.

Given the following input:

ip=1.1.1.1 user=user1

And the following <regex> element:

<regex pattern="ip=([^ ]+) user=([^ ]+)">

The match groups are as follows:

$ or $0: The entire content that is matched by the expression

ip=1.1.1.1 user=user1

$1: The content of the first match group

1.1.1.1

$2: The content of the second match group

user1

Match group numbers in regular expressions are determined by the order that their open bracket appears in the expression.

References to `<any>` Match Groups

The <any> element does not have any match groups and always returns the entire content that was passed to it when referenced with $.

6.2 - Variable reference

Variables are added to Data Splitter configuration using the <var> element, see variables. Each variable must have a unique id so that it can be referenced. References to variables have the form $VARIABLE_ID$ , e.g.

<data name="$heading$" value="$" />

Identification

Data Splitter validates the configuration on load and ensures that all element ids are unique and that referenced ids belong to a variable.

A variable will only store data if it is referenced so variables that are not referenced will do nothing. In addition to this a variable will only store data for match groups that are referenced, e.g. if $heading$1 is the only reference to a variable with an id of ‘heading’ then only data for match group 1 will be stored for reference lookup.

Scopes

Variables have two scopes which affect how data is retrieved when referenced:

Local scope
Remote scope

Local Scope

Variables are local to a reference if the reference exists as a descendant of the variables parent expression, e.g.

<split delimiter="\n" >
  <var id="line" />

  <group value="$1">
    <regex pattern="ip=([^ ]+) user=([^ ]+)">
      <data name="line" value="$line$"/>
      <data name="ip" value="$1"/>
      <data name="user" value="$2"/>
    </regex>
  </group>
</split>

In the above example, matches for the outermost <split> expression are stored in the variable with the id of line. The only reference to this variable is in a data element that is a descendant of the variables parent expression <split>, i.e. it is nested within split/group/regex.

Because the variable is referenced locally only the most recent parent match is relevant, i.e. no retrieval of values by iteration, iteration offset or fixed position is applicable. These features only apply to remote variables that store multiple values.

Remote Scope

The CSV example with a heading is an example of a variable being referenced from a remote scope.

<?xml version="1.0" encoding="UTF-8"?>
<dataSplitter
    xmlns="data-splitter:3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="data-splitter:3 file://data-splitter-v3.0.xsd"
    version="3.0">

  <!-- Match heading line (note that maxMatch="1" means that only the first line will be matched by this splitter) -->
  <split delimiter="\n" maxMatch="1">

    <!-- Store each heading in a named list -->
    <group>
      <split delimiter=",">
        <var id="heading" />
      </split>
    </group>
  </split>

  <!-- Match each record -->
  <split delimiter="\n">

    <!-- Take the matched line -->
    <group value="$1">

      <!-- Split the line up -->
      <split delimiter=",">

        <!-- Output the stored heading for each iteration and the value from group 1 -->
        <data name="$heading$1" value="$1" />
      </split>
    </group>
  </split>
</dataSplitter>

In the above example the parent expression of the variable is not the ancestor of the reference in the <data> element. This makes the <data> elements reference to the variable a remote one. In this situation the variable knows that it must store multiple values as the remote reference <data> may retrieve one of many values from the variable based on:

The match count of the parent expression.
The match count of the parent expression, plus or minus an offset.
A fixed position in the variable store.

Retrieval of value by iteration

In the above example the first line is taken then repeatedly matched by delimiting with commas. This results in multiple values being stored in the ‘heading’ variable. Once this is done subsequent lines are matched and then also repeatedly matched by delimiting with commas in the same way the heading is.

Each time a line is matched the internal match count of all sub expressions, (e.g. the <split> expression that is delimited by comma) is reset to 0. Every time the sub <split> expression matches up to a comma delimiter the match count is incremented. Any references to remote variables will, by default, use the current match count as an index to retrieve one of the many values stored in the variable. This means that the <data> element in the above example will retrieve the corresponding heading for each value as the match count of the values will match the storage position of each heading.

Retrieval of value by iteration offset

In some cases there may be a mismatch between the position where a value is stored in a variable and the match count applicable when remotely referencing the variable.

Take the following input:

BAD,Date,Time,IPAddress,HostName,User,EventType,Detail
01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,

In the above example we can see that the first heading ‘BAD’ is not correct for the first value of every line. In this situation we could either adjust the way the heading line is parsed to ignore ‘BAD’ or just adjust the way the heading variable is referenced.

To make this adjustment the reference just needs to be told what offset to apply to the current match count to correctly retrieve the stored value. In the above example this would be done like this:

<data name="$heading$1[+1]" value="$1" />

The above reference just uses the match count plus 1 to retrieve the stored value. Any integral offset plus or minus may be used, e.g. [+4] or [-10]. Offsets that result in a position that is outside of the storage range for the variable will not return a value.

Retrieval of value by fixed position

In addition to retrieval by offset from the current match count, a stored value can be returned by a fixed position that has no relevance to the current match count.

In the following example the value retrieved from the ‘heading’ variable will always be ‘IPAddress’ as this is the fourth value stored in the ‘heading’ variable and the position index starts at 0.

<data name="$heading$1[3]" value="$1" />

6.3 - Use of fixed strings

Any <group> value or <data> name and value can use references to matched content, but in addition to this it is possible just to output a known string, e.g.

<data name="somename" value="$" />

The above example would output somename as the <data> name attribute. This can often be useful where there are no headings specified in the input data but we want to associate certain names with certain values.

Given the following data:

01/01/2010,00:00:00,192.168.1.100,SOMEHOST.SOMEWHERE.COM,user1,logon,

We could provide useful headings with the following configuration:

<regex pattern="([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),">
  <data name="date" value="$1" />
  <data name="time" value="$2" />
  <data name="ipAddress" value="$3" />
  <data name="hostName" value="$4" />
  <data name="user" value="$5" />
  <data name="action" value="$6" />
</regex>

6.4 - Concatenation of references

It is possible to concatenate multiple fixed strings and match group references using the + character. As with all references and fixed strings this can be done in <group> value and <data> name and value attributes. However concatenation does have some performance overhead as new buffers have to be created to store concatenated content.

A good example of concatenation is the production of ISO8601 date format from data in the previous example:

01/01/2010,00:00:00

Here the following <regex> could be used to extract the relevant date, time groups:

<regex pattern="(\d{2})/(\d{2})/(\d{4}),(\d{2}):(\d{2}):(\d{2})">

The match groups from this expression can be concatenated with the following value output pattern in the data element:

<data name="dateTime" value="$3+’-‘+$2+’-‘+$1+’-‘+’T’+$4+’:’+$5+’:’+$6+’.000Z’" />

Using the original example, this would result in the output:

<data name="dateTime" value="2010-01-01T00:00:00.000Z" />

Note that the value output pattern wraps all fixed strings in single quotes. This is necessary when concatenating strings and references so that Data Splitter can determine which parts are to be treated as fixed strings. This also allows fixed strings to contain $ and + characters.

As single quotes are used for this purpose, a single quote needs to be escaped with another single quote if one is desired in a fixed string, e.g.

‘this ‘’is quoted text’’’

This will result in:

this ‘is quoted text’

Data Splitter

1 - Simple CSV Example

2 - Simple CSV example with heading

Input

Configuration

Output

3 - Complex example with regex and user defined names

Input

Configuration

Output

4 - Multi Line Example

Input

Configuration

Output

5 - Element Reference

5.1 - Content Providers

Root element <dataSplitter>

Attributes

ignoreErrors

bufferSize (Advanced)

Group element <group>

Attributes

id

value

See Also

ignoreErrors

matchOrder

reverse

5.2 - Expressions

The <split> element

Attributes

id

delimiter

escape

containerStart

containerEnd

maxMatch

minMatch

onlyMatch

The <regex> element

Attributes

id

pattern

dotAll

caseInsensitive

maxMatch

minMatch

onlyMatch

advance

The <all> element

Attributes

id

5.3 - Variables

The <var> element

Attributes

id

5.4 - Output

The <data> element

Attributes

id

name

value

Single <data> element example

Multiple <data> element example

Multi level <data> elements

Nesting <data> elements

Direct children

Within child groups

6 - Match References, Variables and Fixed Strings

6.1 - Expression match references

References to <split> Match Groups

References to Match Groups

References to <any> Match Groups

6.2 - Variable reference

Identification

Scopes

Local Scope

Remote Scope

Retrieval of value by iteration

Retrieval of value by iteration offset

Root element `<dataSplitter>`

`ignoreErrors`

`bufferSize` (Advanced)

Group element `<group>`

`id`

`value`

`ignoreErrors`

`matchOrder`

`reverse`

The `<split>` element

`id`

`delimiter`

`escape`

`containerStart`

`containerEnd`

`maxMatch`

`minMatch`

`onlyMatch`

The `<regex>` element

`id`

`pattern`

`dotAll`

`caseInsensitive`

`maxMatch`

`minMatch`

`onlyMatch`

`advance`

The `<all>` element

`id`

The `<var>` element

`id`

The `<data>` element

`id`

`name`

`value`

Single `<data>` element example

Multiple `<data>` element example

Multi level `<data>` elements

Nesting `<data>` elements

References to `<split>` Match Groups

References to `<any>` Match Groups