Stroom Proxy

Stroom Proxy acts as a proxy for sending data to a Stroom instance/cluster. Stroom Proxy has various modes such as storing, aggregating and forwarding the received data. Stroom Proxies can be used to forward to other Stroom Proxy instances.

Stroom-Proxy’s primary role is to act as a front door for data being sent to Stroom. Data can be sent to Stroom-Proxy in small chunks and it will aggregate the data into larger chunks (grouped by Feed and Stream Type) so that Stroom doesn’t have to process lots of small Streams. It also provides a separation between the client and Stroom, so Stroom can be taken offline while data is still being accepted by Stroom-Proxy.

See Architecture for an example of how Stroom-Proxy is typically deployed.

API

Stroom-Proxy presents an identical HTTP POST /datafeed API to Stroom, so clients can send the same data in the same way to either Stroom or Stroom-Proxy. For more detail on sending data into Stroom-Proxy, see Sending Data.

It also presents a number of other APIs for administration and communication with other proxies. For more detail on Stroom-Proxy’s other APIs, see Proxy API.

Functions

Stroom-Proxy has a number of key functions:

  • Receipt Filtering - The process of filtering the incoming data based on the HTTP headers. Data can either be Received, silently Dropped or Rejected with an error.
  • Splitting - Splitting received ZIP files by Feed and Stream Type.
  • Aggregation - Storing received data locally and forwarding it when the aggregation limits have been reached.
  • Forwarding - Forwarding the received/aggregated data to one or more forward destinations.
  • Instant Forwarding - Data is streamed to a single HTTP forward destination (i.e. Stroom or another Stroom-Proxy) as the data is received. This function does not support multiple forward destinations or aggregation.
  • Directory Scanning - Periodically scanning one or more directories for ZIP files in Stroom ZIP Format.
  • Event Store - Stroom-Proxy presents an API for receiving individual events. This is to support applications that want to log events directly to Stroom-Proxy rather than writing them to rolled files locally.

For a more detailed explanation of each function, see Proxy Functions.

1 - Stroom Proxy Installation

How to install Stroom-Proxy.

Stroom-Proxy can be installed in 4 main ways:

  • App - There is an app version that runs Stroom-Proxy as a Java JAR file locally on the server, with settings contained in a configuration file that controls access to the stroom server and database.

  • Docker Stack - Stroom-Proxy, Nginx and Stroom-Log-Sender run in Docker containers, orchestrated using Docker Compose and some shell scripts. The stroom-proxy image is essentially a minimal Alpine Linux container with the appropriate Java version installed and the Stroom-Proxy JAR contained within it.

  • Docker Images - Manually run containers based on the Stroom-Proxy docker image.

  • Kubernetes - Deploy Stroom-Proxy into a Kubernetes cluster.

This document covers the installation and configuration of the Stroom-Proxy software for both the ‘app’ and Docker stack deployments.

Typical Deployments

Stroom-Proxy is typically deployed in front of Stroom to act as a proxy for data receipt into Stroom. This abstracts Stroom from the clients sending the data and ensures that received data is aggregated into sensibly sized streams.

For a production Stroom cluster, it is likely that you will want multiple Stroom-Proxy instances behind a load balancer for resiliency and load management.

Assumptions

The following assumptions are used in this document.

  • The user has reasonable RHEL/CentOS/Rocky System administration skills.
  • Installation is on a fully patched minimal RHEL/CentOS/Rocky instance.
  • The application user stroomuser has been created in the OS.
  • The user has set up the Stroom processing user as described here.
  • The prerequisite software has been installed.

Firewall Configuration

For both methods of deployment, the ports used are as follows. Some may need to be opened to allow access to them from outside the host.

  • 80 - Nginx listens on port 80 but redirects to 443.
  • 443 - Nginx listens on port 443.
  • 8090 - Stroom-Proxy listens on port 8090 for its main public APIs (/datafeed, REST endpoints, etc).
  • 8091 - Stroom-Proxy listens on port 8091 for its administration APIs. Access to this port should probably be carefully controlled.

It is therefore likely that you will only want to expose 443 and maybe 80 to other hosts.

For example on a RHEL/CentOS server using firewalld the commands would be as root user:

firewall-cmd --zone=public --permanent --add-port=80/tcp
firewall-cmd --zone=public --permanent --add-port=443/tcp
firewall-cmd --reload

Stroom Proxy (docker version)

In this deployment, the Stroom-Proxy Java application (and associated services) run in Docker containers.

Because everything is running in Docker containers, the only requirement for the host is for the following:

  • Docker Engine
  • Docker Compose Plugin
  • bash v4 or greater - Used by the stack scripts.
  • GNU coreutils - Used by the stack scripts.
  • jq - Used by the stack scripts.

Download and install docker

To install Docker Engine and the Docker Compose Plugin see:

All the Stroom-Proxy logs and data will be stored in Docker managed volumes that will, by default, reside in /var/lib/docker. It is typical for this directory to be on a small mount point for the OS. It is therefore recommended to relocate this directory to a mount with more space and sufficient resilience, e.g. RAID mirroring.

To do this you need to follow these steps:

  1. Stop the Docker engine.
  2. Move the directory to its new location.
  3. Edit the file /etc/docker/daemon.json and ensure this field is present with the new location as the value.
    {
      "data-root": "/path/to/new/location"
    }
    
  4. Start the Docker engine.
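As a sketch, assuming Docker is managed by systemd and that /srv/docker-data is the new location (both are examples, not requirements), the steps might look like:

```shell
# Sketch only: assumes a systemd-managed Docker engine.
# /srv/docker-data is an example path - use your larger, resilient mount.
sudo systemctl stop docker

# Move the existing data to the new location.
sudo mv /var/lib/docker /srv/docker-data

# Point Docker at the new data root.
# Note: this overwrites daemon.json - merge by hand if it has other settings.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "data-root": "/srv/docker-data"
}
EOF

sudo systemctl start docker

# Confirm the new data root is in use.
docker info --format '{{ .DockerRootDir }}'
```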

Download and Install Docker Stack

The stroom_proxy Docker stack is available from stroom-resources releases on GitHub. The stack distribution is simply a collection of shell scripts and Docker Compose configuration files. The Docker images will get pulled down from DockerHub when the stack is started.

The installation example below is for stroom version v7.10.20, but is applicable to other stroom v7 versions. As a suitable stroom user, e.g. stroomuser, download and unpack the stroom software.

mkdir -p ~/stroom-proxy
cd ~/stroom-proxy
wget https://github.com/gchq/stroom-resources/releases/download/stroom-stacks-v7.10.20/stroom_proxy-v7.10.20.tar.gz
tar -zxf stroom_proxy-v7.10.20.tar.gz
cd stroom_proxy-v7.10.20

For a stroom proxy, the configuration file stroom_proxy/stroom_proxy-v7.10.20/stroom_proxy.env needs to be edited, with the connection details of the stroom server that data files will be sent to. The default network port for connection to the stroom server is 8080.

The values that need to be set are:

STROOM_PROXY_REMOTE_FEED_STATUS_API_KEY  
STROOM_PROXY_REMOTE_FEED_STATUS_URL  
STROOM_PROXY_REMOTE_FORWARD_URL  

The ‘API key’ is generated on the stroom server and is related to a specific user, e.g. proxyServiceUser. The two URL values also refer to the stroom server and can use a fully qualified domain name (FQDN) or an IP address.

For example, if the stroom server was stroom-serve.somewhere.co.uk, the URL lines would be:

export STROOM_PROXY_REMOTE_FEED_STATUS_URL="http://stroom-serve.somewhere.co.uk:8080/api/feedStatus/v1"
export STROOM_PROXY_REMOTE_FORWARD_URL="http://stroom-serve.somewhere.co.uk:8080/stroom/datafeed"

To Start Stroom Proxy

As the stroom user, run the ‘start.sh’ script found in the stack directory:

cd ~/stroom-proxy/stroom_proxy-v7.10.20/
./start.sh

The first time the script is run it will download the docker images from DockerHub:

  • stroom-proxy-remote
  • stroom-log-sender
  • stroom-nginx

Once the script has completed the Stroom-Proxy server should be running.

The stack directory contains the following scripts for managing the Stroom-Proxy stack.

  • health.sh - Tests and displays the health of the stack.
  • info.sh - Displays info about the stack.
  • pull_images.sh - Pulls all the docker images used in the stack.
  • logs.sh - Tails the logs from all services in the stack.
  • remove.sh - Removes all services and volumes in the stack. Warning: this will delete any data held in Stroom-Proxy.
  • restart.sh - Restarts all or named services in the stack.
  • send_data.sh - Script to aid POSTing data into Stroom-Proxy.
  • set_log_levels.sh - Sets log levels for classes/packages on the running Stroom-Proxy.
  • set_services.sh - Used for disabling services in the stack.
  • show_config.sh - Displays the effective docker compose config taking the env file into account.
  • start.sh - Starts all or named services in the stack.
  • status.sh - Shows the status of the services in the stack.
  • stop.sh - Stops all or named services in the stack.

Stroom Proxy (app version)

This is the bare bones installation method that requires installing everything manually. If you are able to use Docker we recommend doing so, as there are fewer things to install and configure, e.g. nginx, send_to_stroom.sh, cron, etc.

Stroom-Proxy is distributed as a JAR file, so this method runs the JAR using the java executable.

The pre-requisites for this deployment are:

  • RHEL/CentOS/Rocky
  • Java 25+ JDK (JDK is preferred over JRE as it provides additional tools (e.g. jmap) for capturing heap histogram statistics).
  • bash v4 or greater - Used by the helper scripts.
  • GNU coreutils - Used by the helper scripts.

For details about which Java distribution and version to use, and how to install it, see Java.

Download and install Stroom v7 (app version)

Stroom-Proxy releases are available from github.com/gchq/stroom/releases. Each release has a number of artefacts; the Stroom-Proxy application is stroom-proxy-app-v*.zip.

The installation example below is for stroom version v7.10.20, but is applicable to other stroom v7 versions. As a suitable stroom user e.g. stroomuser - download and unpack the stroom software.

wget https://github.com/gchq/stroom/releases/download/v7.10.20/stroom-proxy-app-v7.10.20.zip
unzip stroom-proxy-app-v7.10.20.zip

The configuration file – stroom-proxy/config/config.yml – is the principal file that controls the configuration of Stroom-Proxy. See Stroom Proxy Configuration.

2 - Proxy Configuration

How Stroom Proxy is configured.

See Stroom Proxy Configuration for details.

3 - Proxy Functions

The key functions and capabilities of Stroom-Proxy.

Receipt Filtering

Stroom-Proxy can be configured with a number of different methods of data receipt filtering:

  • FEED_STATUS - Data is filtered based on the Status of the Feed in Stroom.
  • RECEIPT_POLICY - Data is filtered based on a set of policy rules that have been created in Stroom.
  • RECEIVE_ALL - All data is accepted, regardless.
  • DROP_ALL - All data is silently dropped.
  • REJECT_ALL - All data is rejected with an error.

Splitting

When ZIP data is received in Stroom ZIP Format it will be examined to determine if it contains multiple groups (where a group is identified by Feed and Stream Type). ZIP data with multiple groups will be split so that data for each group will be processed separately.

Aggregation

If enabled, the aggregation function will locally store the received data and aggregate data from multiple HTTP requests together until the aggregation threshold is reached. Data will be aggregated by common group key (Feed and Stream Type).

Aggregation can be limited by one or more of:

  • Item count - The number of items in the aggregate.
  • Maximum uncompressed size - The total uncompressed size of the aggregate. Note, this is a target as Stroom-Proxy may receive a single item of data that is larger than this limit.
  • Frequency - How often data is assembled into a completed aggregate.
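For illustration, these limits map onto configuration properties under proxyConfig.aggregator in config.yml. The maxItemsPerAggregate and maxUncompressedByteSize property names appear elsewhere in this document; the values shown, and the frequency property name, are assumptions only:

```yaml
# Illustrative sketch only - the values, and the aggregationFrequency name, are assumptions.
proxyConfig:
  aggregator:
    maxItemsPerAggregate: 1000        # item count limit
    maxUncompressedByteSize: "1G"     # target uncompressed size
    aggregationFrequency: "10m"       # assumed name for the frequency setting
```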

Forwarding

Stroom-Proxy can forward data to one or more destinations and the following destination types are supported:

  • File - The data (in ZIP format) is written to a configured directory.
  • HTTP - The data (in ZIP format) is POSTed to a configured URL.

Instant Forwarding

This is a special type of forwarding where data is streamed directly to a destination rather than being written to local disk first. Instant forwarding is only possible if there is only one forwarding destination configured. Data will still be subject to the configured receipt filtering.

Directory Scanning

Stroom-Proxy can periodically scan one or more directories to look for ZIP files to ingest. Any ZIP files found will be treated as if they were received via the /datafeed API. The scanning will recurse into any directories found.

This feature is primarily aimed at re-processing data that Stroom-Proxy has been unable to forward due to an unrecoverable error or too many retries.

Event Store API

Stroom-Proxy presents an HTTP POST API at /api/event to receive individual events. If the Stroom-Proxy instances are sufficiently resilient then client systems can use this API to send events directly without needing to buffer them locally.

HTTP headers are used to provide the Feed and Stream Type, which are used as the key for aggregation. The POSTed data is assumed to be text data, UTF-8 encoded.

Each event is converted into the following JSON object and aggregated by Feed and Stream Type in the Event Store. The JSON combines the receipt information, the HTTP headers and the event data into one structured object that can be processed and transformed by Stroom.

{
  "version": 0,
  "event-id": "1771956627189_0001_P_test-proxy",
  "proxy-id": "test-proxy",
  "feed": "FEED_X",
  "type": "Raw Events",
  "receive-time": "2026-02-24T18:10:27.192Z",
  "headers": [
    { "name": "Feed", "value": "FEED_X" },
    { "name": "Type", "value": "Raw Events" }
  ],
  "detail": "this\nis some data \n with new \n\n lines"
}
  • version - The version of the Event structure, currently 0.
  • event-id - A unique ID for the event. This uses the Receipt ID.
  • proxy-id - The unique identity for the Stroom-Proxy instance within the estate.
  • feed - The Feed the event is destined for, taken from the Feed HTTP header.
  • type - The Stream Type the event is destined for, taken from the Type HTTP header.
  • receive-time - The ISO-8601 timestamp taken when the event was received.
  • headers - A list of the meta attributes extracted from the HTTP headers.
  • detail - The event payload.
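As an illustration, an individual event could be POSTed to the Event Store API as follows. The feed name, type and host are example values only:

```shell
# Example values only: FEED_X, "Raw Events" and localhost:8090 are illustrative.
echo "user=jbloggs action=logon outcome=success" \
| curl -X POST --data-binary @- \
    -H "Feed:FEED_X" \
    -H "Type:Raw Events" \
    http://localhost:8090/api/event
```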

AWS Simple Queue Service Connector

Stroom-Proxy supports receiving individual events from one or more AWS Simple Queue Service queues. Each event received is treated in the same way as an event received via the Event Store API.

4 - Proxy API

Details of the various APIs presented by Stroom-Proxy.

Application APIs

These are the public APIs of the Stroom-Proxy application and are all available on the application port (which defaults to 8090). Administrators may still want to restrict access to specific endpoints, e.g. making the /datafeed API public, but limiting the REST API to within the Stroom estate as the REST APIs are typically called by other Stroom-Proxy instances.

/datafeed

Stroom-Proxy presents the same /datafeed API as Stroom. This also has a legacy alias of /stroom/datafeed.

For more details of how to use this API, see Sending Data to Stroom.

/ui

This returns HTML and is intended to be used in a browser. It will display something like:

Stroom Proxy v7.10.20 built on 2026-02-25T15:32:45.708Z
Send data to http://localhost:8090/datafeed

/status

This provides a basic status response for Stroom-Proxy. It returns a JSON object like this:

{
  "upTime": 1772119560408,
  "buildVersion": "v7.10.20",
  "buildTime": 1772033565708
}

/debug

This endpoint can be used for debugging datafeed requests. A datafeed request can be POSTed to this endpoint instead, so that the client can see what headers and payload are reaching the server.

This example POSTs a simple bit of data with one extra header.

echo "Today is $(date)" \
| curl -X POST --data-binary @- -H "Feed:MY_FEED" http://localhost:8090/debug
(out)
(out)HTTP Header
(out)===========
(out)[Accept]=[*/*]
(out)[User-Agent]=[curl/8.18.0]
(out)[Host]=[localhost:8090]
(out)[Content-Length]=[38]
(out)[Feed]=[MY_FEED]
(out)[Content-Type]=[application/x-www-form-urlencoded]
(out)
(out)HTTP Header
(out)===========
(out)contentLength=38
(out)HTTP Payload
(out)============
(out)Today is Thu 26 Feb 16:23:50 GMT 2026

REST API

Stroom-Proxy presents a number of REST endpoints:

  • POST - /api/apikey/v2/verifyApiKey - Allows an upstream Stroom-Proxy to verify an API key.
  • POST - /api/event - The Event Store API for POSTing individual events.
  • POST - /api/feedStatus/v1/getFeedStatus - Allows an upstream Stroom-Proxy to check the receipt status of a Feed.
  • POST - /api/feedStatus/v2/getFeedStatus - Allows an upstream Stroom-Proxy to check the receipt status of a Feed.
  • GET - /api/ruleset/v2/fetchHashedRules - Allows an upstream Stroom-Proxy to fetch the obfuscated receipt policy rules.

Admin APIs

These APIs are presented on the administration port/path which by default is:

localhost:8091/proxyAdmin/....

More details about the admin APIs (with the exception of the Prometheus endpoint) can be found in Metrics Servlets.

Metrics

Stroom-Proxy exposes two endpoints for capturing metrics on its inner workings:

  • Dropwizard Metrics - http://localhost:8091/proxyAdmin/metrics. This exposes the metrics as a JSON object. For more details see Dropwizard Metrics.

  • Prometheus Metrics - http://localhost:8091/proxyAdmin/prometheusMetrics. Exposes the same data as Dropwizard Metrics, but in a format suitable for scraping by Prometheus.

Health Check

http://localhost:8091/proxyAdmin/healthcheck
http://localhost:8091/proxyAdmin/healthcheck?pretty=true

Performing a GET request on this endpoint will initiate a health check on all parts of Stroom-Proxy that have registered a health check. Each registered health check will return healthy or unhealthy along with any details relating to its state. If all health checks return healthy then the endpoint will return a 200 status.

It allows the Stroom-Proxy instance to self-check its inner workings.

Current registered health checks are:

  • deadlocks - Checks for any deadlocked threads.
  • stroom.dropwizard.common.LogLevelInspector - Reports the current logger levels that have been set. This is not strictly a health check as it will always return healthy, more for information purposes.
  • stroom.proxy.app.ProxyConfigHealthCheck - Displays the current configuration values. This is not strictly a health check as it will always return healthy, more for information purposes.
  • stroom.proxy.app.ProxyConfigMonitor - Returns healthy if the monitoring of the config file is working correctly.
  • stroom.proxy.app.ReceiveDataRuleSetClient - Returns healthy if the receipt policy rules could be fetched from the downstream host. Will return healthy if receipt policy checking is not enabled/configured.
  • stroom.proxy.app.handler.RemoteFeedStatusClient - Returns healthy if a feed status check could be fetched from the downstream host. Will return healthy if feed status checking is not enabled/configured.
  • stroom.proxy.app.security.ProxyApiKeyCheckClient - Returns healthy if an API Key check could be performed. Will return healthy if API key checking is not enabled/configured.
  • stroom.receive.common.DataFeedKeyDirWatcher - Returns healthy if the monitoring of the Datafeed Key directory is working correctly.
  • stroom.security.common.impl.ExternalIdpConfigurationProvider - Returns healthy if the configuration of the external IDP could be fetched. Will return healthy if no external IDP is configured.

Filtered Health Check

http://localhost:8091/proxyAdmin/filteredhealthcheck

This performs the same as the Health Check, but allows for filtering of the checks, which can be useful if there are certain checks that need to be ignored.

It takes the following optional query parameters:

  • allow - A comma delimited list of health check names to include.
  • deny - A comma delimited list of health check names to exclude.
  • minimal - Set to true to exclude all the detail in the health check response.
  • pretty - Set to true to format the JSON.

Queues

http://localhost:8091/proxyAdmin/queues

This endpoint returns HTML and is intended as a means for an admin to monitor the state of the various internal queues within Stroom-Proxy. It is intended to be called from a browser.

Tasks

Stroom-Proxy has a number of administrative tasks that can be executed via its tasks API.

The list of available task names can be found by performing a GET request on:

http://localhost:8091/proxyAdmin/tasks

The following is a list of the task names that are currently available:

  • clear-all-cache - Clears all caches in Stroom-Proxy.
  • clear-cache-Authenticated-Data-Feed-Key-Cache - Clears the Authenticated Datafeed Key cache.
  • clear-cache-Event-Store-Open-Appenders - Clears the Event Store Open Appenders cache.
  • clear-cache-Remote-Feed-Status-Response-Cache - Clears the Remote Feed Status Response cache.
  • gc - Forces a Java garbage collection to destroy unused objects in memory.
  • log-level - Sets the log level for a named class or package.

Tasks are executed using a POST request; some tasks require form data.

curl -X POST http://localhost:8091/proxyAdmin/tasks/clear-all-cache

The log-level task requires parameters to tell it the log level to set and on which class/package to set it.

curl -X POST http://localhost:8091/proxyAdmin/tasks/log-level -d "logger=stroom.core.servlet.StatusServlet&level=DEBUG"

The task may or may not return content.

Ping

http://localhost:8091/proxyAdmin/ping

Simple endpoint that will respond with the text pong and a 200 status if Stroom-Proxy is running. This can be used by load balancers to determine if Stroom-Proxy is up or not.

Threads

http://localhost:8091/proxyAdmin/threads

Lists the currently running threads with a stack trace for each. Can be useful for debugging.

5 - Receipt ID

A unique identifier that is assigned to each item of data received by Stroom-Proxy.

On receipt of data, Stroom-Proxy will assign the data a unique Receipt ID. This value will be set in the ReceiptId meta attribute. It will also be appended to the ReceiptIdPath meta attribute, which is a comma delimited list of Receipt IDs.

The format of this attribute is designed to make it useful to administrators, while still being unique across the environment in which the Stroom and Stroom-Proxy instances are deployed.

The format is as follows:

<timestamp>_<seq no>_<(P|S)>_<proxyId or stroom nodeName>

  • <timestamp> - The receipt timestamp in milliseconds since the Unix Epoch, zero padded.

  • <seq no> - This is a zero padded four-digit sequential number (starting at 0000) that is used to distinguish between multiple receipt events happening during the same millisecond on the same instance.

  • <P|S> - Indicates whether the item was received by Stroom (S) or Stroom-Proxy (P).

  • <proxyId or stroom nodeName> - For Stroom-Proxy this will be the proxyConfig.proxyId that is either set in configuration to uniquely identify a proxy instance or is derived from the FQDN or IP address of the host. For Stroom this is the node name of the Stroom instance. The proxyId set on each Stroom-Proxy instance must be unique across all Stroom-Proxy instances in the estate. The nodeName set on each Stroom instance must be unique across all Stroom instances in the estate.

An example Receipt ID is 0000001738332835967_0000_P_node1
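The example ID can be pulled apart with standard shell tools. This is a sketch that assumes the trailing identifier contains no underscores; a proxyId containing underscores would need different handling:

```shell
#!/usr/bin/env bash
# Parse the example Receipt ID into its four parts.
# Assumption: the final identifier contains no underscore characters.
receipt_id="0000001738332835967_0000_P_node1"

IFS='_' read -r ts seq source ident <<< "${receipt_id}"

echo "timestamp (ms): $((10#${ts}))"    # 10# strips the zero padding
echo "sequence no:    ${seq}"
if [ "${source}" = "P" ]; then
  echo "received by:   Stroom-Proxy"
else
  echo "received by:   Stroom"
fi
echo "instance:       ${ident}"
```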

The new format is useful for tracing the flow of data through a chain of proxies as it will be included in receive and send logs as well as being written to the meta attributes.

To ensure uniqueness of these IDs across the estate, proxyId values should be unique within the environment that data will flow through. The same is true for Stroom nodeName values.

6 - Proxy Architecture

An overview of the architecture of Stroom-Proxy.

Overview

Stroom-Proxy has a number of moving parts and it can be configured in a variety of ways. This document aims to describe some typical configurations of Stroom-Proxy.

Directories as Queues

Stroom-Proxy makes heavy use of multiple file system directories as work queues. These queues act as the interface between the different processing steps in Stroom-Proxy.

Data representing one queue item is placed into a directory. That directory is atomically moved into a queue directory with a new name to represent its position in the queue. The directory is consumed from the directory queue by atomically moving it to a different path; typically this will be a numbered directory that acts as a staging area where it can be worked on before being moved to a different directory queue.

These sub-directories are placed in a path structure that indicates the position in the queue, e.g.:

./50_forwarding/downstream/02_retry/2/012/345/012345678

In the above example:

  • ./50_forwarding/downstream/02_retry represents the base directory of the queue.
  • /2/ represents the depth of the directory tree, i.e. the queue item has two sub-directories above it.
  • /012/ is a sub-directory containing items 12,000,000 to 12,999,999.
  • /345/ is a sub-directory containing items 12,345,000 to 12,345,999.
  • /012345678/ is the queue item containing the data to be processed. The number is the position in the queue and the number of digits is always left padded with zeros to be a multiple of three.

This structure ensures that there are never more than 999 items in each directory and the head/tail of the queue can be found quickly.
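The numbering scheme can be sketched as a small shell function. queue_path is a hypothetical helper for illustration, not part of Stroom-Proxy:

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the queue path scheme described above.
# Builds the path of a queue item from its position number.
queue_path() {
  local base="$1" n="$2"
  local s="${n}"
  # Left-pad with zeros so the digit count is a multiple of three.
  while (( ${#s} % 3 != 0 )); do s="0${s}"; done
  local depth=$(( ${#s} / 3 - 1 ))    # sub-directories above the item
  local path="${base}/${depth}"
  local i
  for (( i = 0; i < depth; i++ )); do
    path="${path}/${s:i*3:3}"         # one three-digit group per level
  done
  printf '%s/%s\n' "${path}" "${s}"
}

queue_path ./50_forwarding/downstream/02_retry 12345678
# -> ./50_forwarding/downstream/02_retry/2/012/345/012345678
```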

Numbered Directories

Typically between each queue is a numbered directory that acts as a staging area to work on the data. Numbered directories are sequentially numbered directories that all exist in a single parent directory. They are expected to be transient in nature, i.e. only existing until they can be moved to another queue.

For example, 01_receiving_simple contains numbered directories, each used to stage non-ZIP data that has been received into Stroom-Proxy:

./01_receiving_simple/0000001407/
./01_receiving_simple/0000001408/
./01_receiving_simple/0000001409/

Each directory represents data for a single request into Stroom Proxy. Once the data has been successfully written to one of these directories, the directory will be atomically moved to one of the directory queues, e.g.

./01_receiving_simple/0000001407/ => 20_pre_aggregate_input_queue/0/382/

The directory then becomes the responsibility of the queue directory it was moved into.

Directory Structure

The following is a list of the directories used by Stroom-Proxy in its data directory (as configured by proxyConfig.path.data).

|-- 01_receiving_simple/
|-- 01_receiving_zip/
|-- 02_split_zip_input_queue/
|-- 03_split_zip_splits/
|-- 20_pre_aggregate_input_queue/
|-- 21_pre_aggregates/
|-- 22_splitting/
|-- 23_split_output/
|-- 30_aggregate_input_queue/
|-- 31_aggregates/
|-- 40_forwarding_input_queue/
|-- 50_forwarding/
|   |-- <destination name 1>/
|   |   |-- 01_forward/
|   |   |-- 02_retry/
|   |   `-- 03_failure/
|   `-- <destination name 2>/
|       |-- 01_forward/
|       |-- 02_retry/
|       `-- 03_failure/
|-- 99_deleting/
|-- event/
`-- temp_forward_copies/

The following diagram illustrates how data flows between the various queues and numbered directories.

images/proxy/architecture.puml.svg

/01_receiving_simple/

This directory is the reception area for data that is NOT a ZIP file, i.e. uncompressed or gzip compressed data. It contains numbered directories.

Data will be written to this directory before the client receives the HTTP response.

Each numbered directory will contain two files:

  • /01_receiving_simple/0000002034/0000000001.meta - The meta sidecar file containing the HTTP headers.
  • /01_receiving_simple/0000002034/0000000001.dat - The file containing the received payload data.

The filenames are always the same as it is only dealing with a single stream.

/01_receiving_zip/

This directory is the reception area for data that has been received as a ZIP file, which may contain one or more streams of data and associated metadata. It contains numbered directories.

Received ZIP files will be written to a numbered sub-directory in this directory before the client receives the HTTP response.

All .meta files in the ZIP file will be updated to add the HTTP headers from the request. In order to do this, Stroom Proxy will first write the ZIP as a .zip.staging file. It will clone all the ZIP entries in this file into a .zip file, updating the .meta entries as it goes. The .zip.staging file will be deleted once complete.

The ZIP entries will be scanned and all valid entries will be written to a .entries sidecar file for subsequent processes to use. This .entries file defines the entries in the ZIP that are valid for further processing and allows subsequent processing to use this file as a reference rather than having to re-scan the ZIP.

The scanning process will also establish how many groups are in the ZIP. A group is defined as a combination of the Feed and the Stream Type.

If the ZIP contains more than one group, or the ZIP does not adhere to the correct Stroom ZIP Format, the directory will be moved to /02_split_zip_input_queue/ for splitting.

If the ZIP has a valid format and only contains one group, it will either be moved to the 20_pre_aggregate_input_queue queue, if aggregation is enabled, or 40_forwarding_input_queue queue if not.

/02_split_zip_input_queue/

Each directory placed into this directory queue will contain a ZIP file and a .entries file. The ZIP may be in an invalid format, in which case a new ZIP will be created with the correct entry naming and structure. This is to ensure that all ZIP files received downstream are in a consistent format. Alternatively it will contain more than one group, so will need to be split into one ZIP file per group.

A numbered directory will be created in /03_split_zip_splits/ to hold each split. For each group of entries in a split, it will create a sub-directory named after the group in the numbered directory, e.g. for two splits:

  • /03_split_zip_splits/0000000392/FEED_X__raw_events/proxy.zip
  • /03_split_zip_splits/0000000392/FEED_X__raw_events/proxy.entries
  • /03_split_zip_splits/0000000392/FEED_X__raw_events/proxy.meta
  • /03_split_zip_splits/0000000392/FEED_Y__raw_events/proxy.zip
  • /03_split_zip_splits/0000000392/FEED_Y__raw_events/proxy.entries
  • /03_split_zip_splits/0000000392/FEED_Y__raw_events/proxy.meta

Once the splitting is complete, each split directory will be moved to the 20_pre_aggregate_input_queue queue, if aggregation is enabled, or 40_forwarding_input_queue queue if not.

/20_pre_aggregate_input_queue/

Each directory on this queue will contain a ZIP file that contains one or more entries for the same group (combination of Feed and Stream Type).

If proxyConfig.aggregator.splitSources is set to true, Stroom Proxy will inspect the ZIP to see if it needs to be split up into multiple parts, to meet the aggregation targets (defined by proxyConfig.aggregator.maxItemsPerAggregate and proxyConfig.aggregator.maxUncompressedByteSize), else the zip will be treated as a single split-part.

If there is just one split-part, the directory will be moved into the current aggregate directory for its group, e.g.

  • /21_pre_aggregates/FEED_X__raw_events/009/proxy.zip

If there are multiple split-parts the ZIP file will require splitting into multiple ZIP files with one per split-part, i.e. all entries from the input ZIP spread over multiple split-part ZIPs. Each split-part will be written like this:

  • /22_splitting/0000000343/009_part_1/proxy.zip
  • /22_splitting/0000000343/009_part_2/proxy.zip
  • /22_splitting/0000000343/009_part_3/proxy.zip

Once the splitting has been completed, the common parent directory is moved to /23_split_output/:

  • /23_split_output/0000000343/009_part_1/proxy.zip
  • /23_split_output/0000000343/009_part_2/proxy.zip
  • /23_split_output/0000000343/009_part_3/proxy.zip

Each split-part is then moved to /21_pre_aggregates/.

  • /21_pre_aggregates/FEED_X__raw_events/011/proxy.zip

When the aggregate for a Feed|Type group is complete (based on item count and uncompressed size), the aggregate will be closed. Closing of the aggregate involves moving the parent directory of all the aggregate items to /30_aggregate_input_queue/.

/30_aggregate_input_queue/

Each directory on this queue will contain multiple directory groups (each containing a ZIP file and its associated files) that are to be part of a single aggregate.

If there is only one item in the queue directory, the directory will be moved to /40_forwarding_input_queue/ for forwarding.

If there is more than one item in the queue directory, a new aggregate ZIP will be created in /31_aggregates/. The entries from each item ZIP will be written into the new aggregate ZIP.

It will also create a set of meta entries for the aggregate. This will contain only key/value entries that are present in every item in the aggregate.

Once the aggregate has been written it is moved to /40_forwarding_input_queue/.

/40_forwarding_input_queue/

Each directory on this queue will contain a single ZIP file that may contain one or more streams (plus associated files). In addition to the ZIP file will be a combined .meta file for the aggregate.

Depending on how forwarding has been configured (using proxyConfig.forwardFileDestinations and proxyConfig.forwardHttpDestinations), there will be a pair of directory queues for each of the forwarding destinations, with the destination name in the path, e.g.:

  • /50_forwarding/file-dest-1/01_forward/
  • /50_forwarding/file-dest-1/02_retry/
  • /50_forwarding/file-dest-2/01_forward/
  • /50_forwarding/file-dest-2/02_retry/
  • /50_forwarding/http-dest-1/01_forward/
  • /50_forwarding/http-dest-1/02_retry/
  • /50_forwarding/http-dest-2/01_forward/
  • /50_forwarding/http-dest-2/02_retry/

Each item on the /40_forwarding_input_queue/ queue will be copied into each of the 01_forward queues, then the source item will be deleted. This keeps each destination independent and prevents a loss of connection to one destination from impacting the others.

/50_forwarding/

This directory contains multiple directory queues, two per forward destination.

  • ..../<destination name>/01_forward/ - Items initially queued for forwarding to the destination.
  • ..../<destination name>/02_retry/ - Items that have failed to forward to the destination and have been queued for a retry.

Each forward destination directory also contains a failure directory:

  • ..../<destination name>/03_failure/ - Items that have failed to forward. Either they have failed too many times or have failed with an error that prevents retry.