Pipelines

Pipelines are the mechanism for processing and transforming ingested data.

Stroom uses Pipelines to process its data. A pipeline is a set of pipeline elements connected together. Pipelines are very powerful and flexible and allow the user to transform, index, store and forward data in a wide variety of ways.

Example Pipeline

Pipelines can take many forms and be used for a wide variety of purposes; however, a typical pipeline to convert CSV data into cooked events might look like this:

Input Data

Pipelines process data in batches. Each batch of data is referred to as a Stream. The input for the pipeline is a single Stream that exists within a Feed, and this data is fed into the left-hand side of the pipeline at Source. Pipelines can accept streams from multiple Feeds, assuming those feeds contain similar data.

The data in the Stream is always text data (XML, JSON, CSV, fixed-width, etc.) in a known character encoding. Stroom does not currently support processing binary formats.

XML

The working format for pipeline processing is XML (with the exception of raw streaming). Data can be input and output in other forms, e.g. JSON, CSV, fixed-width, etc., but the majority of pipelines do most of their processing in XML. Input data is converted into XML SAX events and processed using XSLT to transform it into different shapes of XML, then either consumed as XML (e.g. by an IndexingFilter) or converted into a desired output format for storage/forwarding.
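For example, a single CSV line such as:

2024-01-01T00:00:00Z,jbloggs,login

might be represented by a parser as simple record XML like the following, which downstream XSLT can then reshape. This is purely illustrative; the exact element names and namespaces depend on the parser and XML Schemas in use.

<!-- Illustrative record XML; real element names and namespaces depend on the parser and schema in use -->
<records>
  <record>
    <data name="time" value="2024-01-01T00:00:00Z"/>
    <data name="user" value="jbloggs"/>
    <data name="action" value="login"/>
  </record>
</records>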

Forks

Pipelines can also be forked at any point in the pipeline. This allows the same data to be processed in different ways.

Pipeline Inheritance

It is possible for pipelines to inherit from other pipelines. This allows for the creation of standard abstract pipelines with a set structure, though not fully configured, that can be inherited by many concrete pipelines.

For example, you may have a standard pipeline for indexing XML events, i.e. one that reads XML data and passes it to an IndexingFilter, but where the IndexingFilter is not configured with the actual Index to send documents to. A pipeline that inherits this one can then simply be configured with the Index to use.

Pipeline inheritance allows for changes to the inherited structure, e.g. adding additional elements inline. Multi-level inheritance is also supported.

Pipeline Element Types

Stroom has a number of categories of pipeline element.

Reader

Readers are responsible for reading the raw bytes of the input data and converting them to character data using the Feed’s character encoding. They also provide functionality to modify the data before or after it is decoded to characters, e.g. Byte Order Mark removal, or doing find/replace on the character data. You can chain multiple Readers.

Parser

A parser is designed to convert the character data into XML for processing. For example, the JSONParser will use a JSON parser to read the character data as JSON and convert it into XML elements and attributes that represent the JSON structure, so that it can be transformed downstream using XSLT.

Parsers have a built-in reader, so if they are not preceded by a Reader they will decode the raw bytes into character data before parsing.
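As a sketch of what this might look like (the exact element names and namespace produced by the JSONParser depend on the Stroom version, so treat this as illustrative only), a JSON object such as {"user": "jbloggs", "action": "login"} might be represented as XML along these lines, ready for matching in XSLT:

<!-- Illustrative mapping only; not the exact JSONParser output -->
<map>
  <string key="user">jbloggs</string>
  <string key="action">login</string>
</map>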

Filter

A filter is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and can either return those events unchanged or modify them. An example of a Filter is the XSLTFilter element. Multiple filters can be chained, with each one consuming the events output by the one preceding it, so you can have many common, reusable XSLTFilters that each make small incremental changes to a document.
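As a minimal sketch of such an incremental change (the element names here are illustrative rather than part of any Stroom schema), an XSLTFilter might copy every event through unchanged apart from renaming a single element:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: pass all other events through untouched -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The one small change this filter is responsible for: rename <user> to <UserId> -->
  <xsl:template match="user">
    <UserId>
      <xsl:apply-templates select="@*|node()"/>
    </UserId>
  </xsl:template>

</xsl:stylesheet>

Chaining several filters like this keeps each transformation small and reusable.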

Writer

A writer is an element that handles XML SAX events (e.g. element, attribute, character data, etc.) and converts them into encoded character data (using a specified encoding) of some form. The preceding filter may have been an XSLTFilter that transformed XML into plain text, in which case only character data events will be output and a TextWriter can simply write these out as text data. Other writers (e.g. the JSONWriter) handle the XML SAX events to convert them into another format before encoding them as character data.
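For example (an illustrative sketch rather than a prescribed configuration, with illustrative element names), an XSLTFilter placed before a TextWriter might emit plain text so that only character data events reach the writer:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Produce plain text rather than XML elements -->
  <xsl:output method="text"/>

  <!-- Write one comma-separated line per record -->
  <xsl:template match="record">
    <xsl:value-of select="concat(data[@name='user']/@value, ',',
                                 data[@name='action']/@value, '&#10;')"/>
  </xsl:template>

</xsl:stylesheet>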

Destination

A destination element is a consumer of character data, as produced by a writer. A typical destination is a StreamAppender that writes the character data (which may be XML, JSON, CSV, etc.) to a new Stream in Stroom’s stream store. Other destinations can be used to send the encoded character data to Kafka or to a file on a file system, or to forward it to an HTTP URL.


Pipeline Recipes

A set of basic pipeline structure recipes for common use cases.

Parser

Parsing input data.

XSLT Conversion

Using Extensible Stylesheet Language Transformations (XSLT) to transform data.

File Output

Substitution variables for use in output file names and paths.

Reference Data

Performing temporal reference data lookups to decorate event data.

Context Data

Context data is an additional contextual data Stream that is sent alongside the main event data Stream.
