Stroom Proxy

Stroom Proxy acts as a proxy for sending data to a Stroom instance/cluster. Stroom Proxy has various modes such as storing, aggregating and forwarding the received data. Stroom Proxies can be used to forward to other Stroom Proxy instances.

Stroom-Proxy’s primary role is to act as a front door for data being sent to Stroom. Data can be sent to Stroom-Proxy in small chunks and it will aggregate the data into larger chunks (grouped by Feed and Stream Type) so that Stroom doesn’t have to process lots of small Streams. It also provides a separation between the client and Stroom, so Stroom can be taken offline while data is still being accepted by Stroom-Proxy.

See Architecture for an example of how Stroom-Proxy is typically deployed.

API

Stroom-Proxy presents an identical HTTP POST /datafeed API to Stroom, so clients can send the same data in the same way to either Stroom or Stroom-Proxy. For more detail on sending data into Stroom-Proxy, see Sending Data.

It also presents a number of other APIs for administration and communication with other proxies. For more detail on Stroom-Proxy’s other APIs, see Proxy API.

Functions

Stroom-Proxy has a number of key functions:

  • Receipt Filtering - The process of filtering the incoming data based on the HTTP headers. Data can either be Received, silently Dropped or Rejected with an error.
  • Splitting - Splitting received ZIP files by Feed and Stream Type.
  • Aggregation - Storing received data locally and forwarding it when the aggregation limits have been reached.
  • Forwarding - Forwarding the received/aggregated data to one or more forward destinations.
  • Instant Forwarding - Data is streamed to a single HTTP forward destination (i.e. Stroom or another Stroom-Proxy) as the data is received. This function does not support multiple forward destinations or aggregations.
  • Directory Scanning - Periodically scanning one or more directories for ZIP files in Stroom ZIP Format.
  • Event Store - Stroom-Proxy presents an API for receiving individual events. This is to support applications that want to log events directly to Stroom-Proxy rather than writing them to rolled files locally.

For a more detailed explanation of each function, see Proxy Functions.

1 - Stroom Proxy Installation

How to install Stroom-Proxy.

Stroom-Proxy can be installed in 4 main ways:

  • App - There is an app version that runs Stroom-Proxy as a Java JAR file locally on the server and has settings contained in a configuration file that controls access to the stroom server and database.

  • Docker Stack - Stroom-Proxy, Nginx and Stroom-Log-Sender run in Docker containers, orchestrated using Docker Compose and some shell scripts. The stroom-proxy image is essentially a minimal Alpine Linux container with the appropriate Java version installed and the Stroom-Proxy JAR contained within it.

  • Docker Images - Manually run containers based on the Stroom-Proxy docker image.

  • Kubernetes - Deploy Stroom-Proxy into a Kubernetes cluster.

This document covers the installation and configuration of the Stroom-Proxy software for both the ‘app’ and Docker stack deployments.

Typical Deployments

Stroom-Proxy is typically deployed in front of Stroom to act as a proxy for data receipt into Stroom. This abstracts Stroom from the clients sending the data and ensures that received data is aggregated into sensibly sized streams.

For a production Stroom cluster, it is likely that you will want multiple Stroom-Proxy instances behind a load balancer for resiliency and load management.

Assumptions

The following assumptions are used in this document.

  • The user has reasonable RHEL/CentOS/Rocky System administration skills.
  • Installation is on a fully patched minimal RHEL/CentOS/Rocky instance.
  • The application user stroomuser has been created in the OS.
  • The user has set up the Stroom processing user as described here.
  • The prerequisite software has been installed.

Firewall Configuration

For both methods of deployment, the ports used are listed below. Some may need to be opened to allow access from outside the host.

  • 80 - Nginx listens on port 80 but redirects onto 443.
  • 443 - Nginx listens on port 443.
  • 8090 - Stroom-Proxy listens on port 8090 for its main public APIs (/datafeed, REST endpoints, etc).
  • 8091 - Stroom-Proxy listens on port 8091 for its administration APIs. Access to this port should be carefully controlled.

It is therefore likely that you will only want to expose 443 and maybe 80 to other hosts.

For example, on a RHEL/CentOS server using firewalld, run the following commands as the root user:

firewall-cmd --zone=public --permanent --add-port=80/tcp
firewall-cmd --zone=public --permanent --add-port=443/tcp
firewall-cmd --reload

Stroom Proxy (docker version)

This is the build of Stroom-Proxy in which the Stroom-Proxy Java application (and associated services) run in Docker containers.

Because everything runs in Docker containers, the only requirements for the host are:

  • Docker Engine
  • Docker Compose Plugin
  • bash v4 or greater - Used by the stack scripts.
  • GNU coreutils - Used by the stack scripts.
  • jq - Used by the stack scripts.

Download and install docker

To install Docker Engine and the Docker Compose Plugin, see the official Docker installation documentation.

All the Stroom-Proxy logs and data will be stored in Docker managed volumes that will, by default, reside in /var/lib/docker. It is typical for this directory to be on a small mount point used for the OS. It is therefore recommended to relocate this directory to a mount with more space and sufficient resilience, e.g. RAID mirroring.

To do this you need to follow these steps:

  1. Stop the Docker engine.
  2. Move the directory to its new location.
  3. Edit the file /etc/docker/daemon.json and ensure this field is present with the new location as the value.
    {
      "data-root": "/path/to/new/location"
    }
    
  4. Start the Docker engine.

Download and Install Docker Stack

The stroom_proxy Docker stack is available from stroom-resources releases on GitHub. The stack distribution is simply a collection of shell scripts and Docker Compose configuration files. The Docker images will get pulled down from DockerHub when the stack is started.

The installation example below is for Stroom version 7.10.20, but is applicable to other Stroom v7 versions. As a suitable user, e.g. stroomuser, download and unpack the software.

mkdir -p ~/stroom-proxy
cd ~/stroom-proxy
wget https://github.com/gchq/stroom-resources/releases/download/stroom-stacks-v7.10.20/stroom_proxy-v7.10.20.tar.gz
tar -zxf stroom_proxy-v7.10.20.tar.gz
cd stroom_proxy-v7.10.20

For Stroom-Proxy, the configuration file stroom_proxy/stroom_proxy-v7.10.20/stroom_proxy.env needs to be edited with the connection details of the Stroom server that data files will be sent to. The default network port for connection to the Stroom server is 8080.

The values that need to be set are:

STROOM_PROXY_REMOTE_FEED_STATUS_API_KEY  
STROOM_PROXY_REMOTE_FEED_STATUS_URL  
STROOM_PROXY_REMOTE_FORWARD_URL  

The ‘API key’ is generated on the stroom server and is related to a specific user e.g. proxyServiceUser. The 2 URL values also refer to the stroom server and can be a fully qualified domain name (fqdn) or the IP Address.

e.g. if the stroom server was - stroom-serve.somewhere.co.uk - the URL lines would be:

export STROOM_PROXY_REMOTE_FEED_STATUS_URL="http://stroom-serve.somewhere.co.uk:8080/api/feedStatus/v1"
export STROOM_PROXY_REMOTE_FORWARD_URL="http://stroom-serve.somewhere.co.uk:8080/stroom/datafeed"
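Before starting the stack it can be worth confirming that all three values have actually been set. The following is a minimal sketch, assuming the env file uses export NAME="value" lines as in the example above. The function name missing_proxy_settings is hypothetical and is not part of the stack scripts.

```shell
#!/usr/bin/env bash
# Prints the name of each required setting that is absent or empty in the
# given env file. Hypothetical helper, not part of the stroom_proxy stack.
missing_proxy_settings() {
  local env_file="$1" name value
  for name in \
    STROOM_PROXY_REMOTE_FEED_STATUS_API_KEY \
    STROOM_PROXY_REMOTE_FEED_STATUS_URL \
    STROOM_PROXY_REMOTE_FORWARD_URL; do
    # Take the last assignment of the variable, with or without 'export'.
    value="$(sed -n -E "s/^[[:space:]]*(export[[:space:]]+)?${name}=//p" "$env_file" | tail -n 1)"
    # Strip double quotes and whitespace so that NAME="" counts as unset.
    value="${value//\"/}"
    value="${value//[[:space:]]/}"
    [[ -n "$value" ]] || echo "$name"
  done
}

# Example:
# missing_proxy_settings stroom_proxy.env
```

The function prints nothing when all three settings have a value, so any output indicates a setting that still needs attention.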

To Start Stroom Proxy

As the stroom user, run the ‘start.sh’ script found in the stroom install:

cd ~/stroom-proxy/stroom_proxy-v7.10.20/
./start.sh

The first time the script is run it will download the docker images from DockerHub:

  • stroom-proxy-remote
  • stroom-log-sender
  • stroom-nginx

Once the script has completed the Stroom-Proxy server should be running.

The stack directory contains the following scripts for managing the Stroom-Proxy stack.

  • health.sh - Tests and displays the health of the stack.
  • info.sh - Displays info about the stack.
  • pull_images.sh - Pulls all the docker images used in the stack.
  • logs.sh - Tails the logs from all services in the stack.
  • remove.sh - Removes all services and volumes in the stack. Warning: this will delete any data held in Stroom-Proxy.
  • restart.sh - Restarts all or named services in the stack.
  • send_data.sh - Script to aid POSTing data into Stroom-Proxy.
  • set_log_levels.sh - Sets log levels for classes/packages on the running Stroom-Proxy.
  • set_services.sh - Used for disabling services in the stack.
  • show_config.sh - Displays the effective docker compose config taking the env file into account.
  • start.sh - Starts all or named services in the stack.
  • status.sh - Shows the status of the services in the stack.
  • stop.sh - Stops all or named services in the stack.

Stroom Proxy (app version)

This is the bare-bones installation method, which requires installing everything manually. If you are able to use Docker, we recommend doing so as there are fewer things to install and configure, e.g. nginx, send_to_stroom.sh, cron, etc.

Stroom-Proxy is distributed as a JAR file, so this method runs the JAR using the java executable.

The pre-requisites for this deployment are:

  • RHEL/CentOS/Rocky
  • Java 25+ JDK (JDK is preferred over JRE as it provides additional tools (e.g. jmap) for capturing heap histogram statistics).
  • bash v4 or greater - Used by the helper scripts.
  • GNU coreutils - Used by the helper scripts.

For details about which Java distribution and version to use, and how to install it, see Java.

Download and install Stroom-Proxy v7 (app version)

Stroom-Proxy releases are available from github.com/gchq/stroom/releases. Each release has a number of artefacts; the Stroom-Proxy application is stroom-proxy-app-v*.zip.

The installation example below is for Stroom version v7.10.20, but is applicable to other Stroom v7 versions. As a suitable user, e.g. stroomuser, download and unpack the software.

wget https://github.com/gchq/stroom/releases/download/v7.10.20/stroom-proxy-app-v7.10.20.zip
unzip stroom-proxy-app-v7.10.20.zip

The configuration file – stroom-proxy/config/config.yml – is the principal file that controls the configuration of Stroom-Proxy. See Stroom Proxy Configuration.

2 - Proxy Configuration

How Stroom Proxy is configured.

See Stroom Proxy Configuration for details.

3 - Proxy Functions

The key functions and capabilities of Stroom-Proxy.

Data Receipt

Data Feed API

This is Stroom-Proxy’s traditional API for receiving data and Stroom shares the same API. See /datafeed for more details.

Event Store API

Stroom-Proxy presents an alternative HTTP POST API at /api/event to receive individual events. If the Stroom-Proxy instances are sufficiently resilient then client systems can use this API to send events directly without needing to buffer them locally. It must only be used for sending a single event, not a batch of events.

The HTTP headers Feed and Type are used to provide the Feed and Stream Type, which are used as the compound aggregation key. The request content is assumed to be UTF-8 encoded text data but can be in any format, e.g. XML, JSON, CSV, etc.

Stroom-Proxy will convert each request into the following JSON object and aggregate them by Feed and Stream Type in the Event Store, with one file per key. The JSON combines the receipt information, the HTTP headers and the event data into one structured object that can be processed and transformed by Stroom.

{
  "version": 0,
  "event-id": "1771956627189_0001_P_test-proxy",
  "proxy-id": "test-proxy",
  "feed": "FEED_X",
  "type": "Raw Events",
  "receive-time": "2026-02-24T18:10:27.192Z",
  "headers": [
    { "name": "Feed", "value": "FEED_X" },
    { "name": "Type", "value": "Raw Events" }
  ],
  "detail": "this\nis some data \n with new \n\n lines"
}
  • version - The version of the Event structure, currently 0.
  • event-id - A unique ID for the event. This is the Receipt ID assigned to the event when it was received.
  • proxy-id - The unique identity for the Stroom-Proxy instance within the estate.
  • feed - The Feed the event is destined for, taken from the Feed HTTP header.
  • type - The Stream Type the event is destined for, taken from the Type HTTP header.
  • receive-time - The ISO-8601 timestamp taken when the event was received.
  • headers - A list of the meta attributes extracted from the HTTP headers.
  • detail - The event payload, i.e. the HTTP request content.
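As an illustration of the request described above, the following sketch sends one event using curl. It assumes Stroom-Proxy is listening on the default application port on localhost and that no authentication is required. The function name post_event and the Content-Type header are illustrative choices rather than anything mandated by Stroom-Proxy.

```shell
#!/usr/bin/env bash
# Sends a single event to the Event Store API.
# Usage: post_event FEED STREAM_TYPE PAYLOAD [BASE_URL]
post_event() {
  local feed="$1" type="$2" payload="$3"
  local base_url="${4:-http://localhost:8090}"   # assumed default address
  # Note: curl treats a --data-binary value starting with '@' as a file name.
  curl --silent --show-error --fail \
    -X POST \
    -H "Feed: ${feed}" \
    -H "Type: ${type}" \
    -H "Content-Type: text/plain; charset=UTF-8" \
    --data-binary "${payload}" \
    "${base_url}/api/event"
}

# Example (one event per request, never a batch):
# post_event "FEED_X" "Raw Events" "user alice logged in"
```

The Feed and Type headers together form the aggregation key, so events sent with the same pair of values would end up in the same Event Store file.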

Each event is written as one line in the aggregated file, delimited by a Line Feed (\n). A file containing one JSON object per line is typically referred to as JSON Lines Format. This format is much easier to parse than a single JSON object containing many events.

If Stroom-Proxy is configured for aggregation then the Event Store essentially adds another layer of aggregation in front of Stroom-Proxy’s standard aggregation. The Event Store aggregation is configured separately to the standard aggregation. See Event Store Configuration for details on how to configure the Event Store and the aggregation thresholds.

Once a file of one or more individual event objects has met its aggregation thresholds it will be processed in the same way as data arriving via /datafeed.

Authentication

/api/event differs from the other /api/... REST endpoints in how requests are authenticated. It does not use the same authentication as the other endpoints.

Its authentication is performed in the same way as /datafeed and is configured using Event Store Configuration.

AWS Simple Queue Service Connector

Stroom-Proxy supports receiving individual events from one or more AWS Simple Queue Service queues. Each event received is treated in the same way as an event received via the Event Store API.

Receipt Filtering

Stroom-Proxy can be configured to use one of a number of different methods of data receipt filtering:

  • FEED_STATUS - Data is filtered based on the Status of the Feed in Stroom.
  • RECEIPT_POLICY - Data is filtered based on a set of policy rules that have been created in Stroom.
  • RECEIVE_ALL - All data is accepted, regardless.
  • DROP_ALL - All data is silently dropped.
  • REJECT_ALL - All data is rejected with an error.

Splitting

When ZIP data is received in Stroom ZIP Format it will be examined to determine if it contains multiple groups (where a group is identified by Feed and Stream Type). ZIP data with multiple groups will be split so that data for each group will be processed separately.

Aggregation

If enabled, the aggregation function will locally store the received data and aggregate data from multiple HTTP requests together until the aggregation threshold is reached. Data will be aggregated by common group key (Feed and Stream Type).

Aggregation can be limited by one or more of:

  • Item count - The number of items in the aggregate.
  • Maximum uncompressed size - The total uncompressed size of the aggregate. Note, this is a target as Stroom-Proxy may receive a single item of data that is larger than this limit.
  • Frequency - How often data is assembled into a completed aggregate.

Forwarding

Stroom-Proxy can forward data to one or more destinations and the following destination types are supported:

  • File - The data (in ZIP format) is written to a configured directory.
  • HTTP - The data (in ZIP format) is POSTed to a configured URL.

If multiple destinations are configured, the ZIP to be forwarded will be copied to each of the forward destination input queues. This means the failure to send to one destination has no impact on sending to the other destinations.

Forwarding is configured using Forward Configuration.

For details of the directories used in forwarding, see /40_forwarding_input_queue/ and /50_forwarding/.

Instant Forwarding

This is a special type of forwarding in which data is streamed directly to a destination rather than being written to local disk first. Instant forwarding is only possible if exactly one forwarding destination is configured. Data will still be subject to the configured receipt filtering.

Instant forwarding is enabled by setting instant to true on the forward destination configuration branch.

Forward Failure Handling

When there is a failure to forward a ZIP, Stroom-Proxy will move it to one of two places:

Retry Queue
If the reason for the failure is considered a recoverable one, e.g. the HTTP destination is down, it will move the ZIP onto the retry queue.

The retry behaviour is configured using Queue Configuration.

Failure Directory
If the failure is deemed unrecoverable, the ZIP will be moved to the 03_failure sub directory within the forward destination directory. At this point the ZIP file is no longer under the control of Stroom-Proxy and will have to be dealt with manually by the administrator.

If the reason for the failure is addressed it is possible to re-process the failed data by moving it into a directory that is configured for Directory Scanning.

Directory Scanning

Stroom-Proxy can periodically scan one or more directories to look for ZIP files to ingest. Any ZIP files found will be treated as if they were received via the /datafeed API. The scanning will recurse into any directories found.

This feature is primarily aimed at re-processing data that Stroom-Proxy has been unable to forward due to an un-recoverable error or too many retries. This mechanism can also be used as an additional means of passing data into Stroom-Proxy (instead of via /datafeed).

Example

A typical scenario is that some data has failed to send to Stroom and the retry age has been reached, so the ZIP has been moved to the forward failure directory:

Contents of data/50_forwarding/downstream/

./03_failure/20251014/BAD_FEED/0/001/proxy.zip
./03_failure/20251014/BAD_FEED/0/001/proxy.meta
./03_failure/20251014/BAD_FEED/0/001/error.log

If you wish to re-send this ZIP you can do the following:

mv data/50_forwarding/downstream/03_failure/20251014/BAD_FEED/0/001 "./zip_file_ingest/$(uuidgen)"

This will move the 001 directory into zip_file_ingest/, renaming it to a unique UUID to ensure it doesn’t clash with any existing files/directories. The name of this directory in the ingest directory has no bearing on processing, other than the order in which directories are scanned.

On the next scan, Stroom-Proxy will discover the proxy.zip file. It will check for the presence of any of the optional associated side-car files (i.e. proxy.meta and error.log). The entries in the .meta file will be consumed. The error.log file will be deleted following successful ingest.

Stroom-Proxy will scan into all sub-directories within the ingest directory, regardless of depth.

The .meta sidecar file is optional, but if provided will be used to provide meta values equivalent to HTTP headers when sending to /datafeed. For a .meta file to be consumed, it must have the same base-name as the ZIP file, e.g. data.zip and data.meta, and be in the same directory as the ZIP file.
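The manual move shown in the example could be generalised to every failed item. The following is a sketch only, assuming the failure and ingest directory layout used in the example and one ZIP per item directory. The names requeue_failed and new_uuid are hypothetical helpers, and the fallback to /proc/sys/kernel/random/uuid is a Linux-specific assumption for hosts without uuidgen.

```shell
#!/usr/bin/env bash
# Generates a UUID, falling back to the Linux kernel source if uuidgen is absent.
new_uuid() {
  uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid
}

# Moves every directory under FAILURE_DIR that holds a ZIP into INGEST_DIR,
# giving each one a unique name.
# Usage: requeue_failed FAILURE_DIR INGEST_DIR
requeue_failed() {
  local failure_dir="$1" ingest_dir="$2" zip item_dir
  local -a zips=()
  # Collect the ZIP paths first so directories are not moved mid-traversal.
  while IFS= read -r -d '' zip; do
    zips+=("$zip")
  done < <(find "$failure_dir" -type f -name '*.zip' -print0)

  mkdir -p "$ingest_dir"
  for zip in "${zips[@]}"; do
    item_dir="$(dirname "$zip")"
    # Skip directories that have already been moved.
    [[ -d "$item_dir" ]] || continue
    # Move the ZIP and its sidecar files (.meta, error.log) as one unit.
    mv "$item_dir" "${ingest_dir}/$(new_uuid)"
  done
}

# Example:
# requeue_failed data/50_forwarding/downstream/03_failure ./zip_file_ingest
```

Moving the whole item directory keeps the sidecar files alongside the ZIP, which is what allows the .meta entries to be consumed on the next scan.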

4 - Proxy API

Details of the various APIs presented by Stroom-Proxy.

Application APIs

These are the public APIs of the Stroom-Proxy application and are all available on the application port (which defaults to 8090). Administrators may still want to restrict access to specific endpoints, e.g. making the /datafeed API public, but limiting the REST API to within the Stroom estate as the REST APIs are typically called by other Stroom-Proxy instances.

/datafeed

Stroom-Proxy presents the same /datafeed API as Stroom. This also has a legacy alias of /stroom/datafeed.

For more details of how to use this API, see Sending Data to Stroom.

/api/event

This is an alternative to the /datafeed API and is for sending individual events to Stroom-Proxy.

For more details see Event Store API.

/ui

This returns HTML and is intended to be used in a browser. It will display something like:

Stroom Proxy v7.10.20 built on 2026-02-25T15:32:45.708Z
Send data to http://localhost:8090/datafeed

/status

This provides a basic status response for Stroom-Proxy. It returns a JSON object like this:

{
  "upTime": 1772119560408,
  "buildVersion": "v7.10.20",
  "buildTime": 1772033565708
}

/debug

This endpoint can be used for debugging datafeed requests. A datafeed request can be POSTed to this endpoint instead, so that the client can see what headers and payload are reaching the server.

This example POSTs a simple bit of data with one extra header.

echo "Today is $(date)" \
| curl -X POST --data-binary @- -H "Feed:MY_FEED" http://localhost:8090/debug
(out)
(out)HTTP Header
(out)===========
(out)[Accept]=[*/*]
(out)[User-Agent]=[curl/8.18.0]
(out)[Host]=[localhost:8090]
(out)[Content-Length]=[38]
(out)[Feed]=[MY_FEED]
(out)[Content-Type]=[application/x-www-form-urlencoded]
(out)
(out)HTTP Header
(out)===========
(out)contentLength=38
(out)HTTP Payload
(out)============
(out)Today is Thu 26 Feb 16:23:50 GMT 2026

REST API

Stroom-Proxy presents a number of REST endpoints:

  • POST - /api/apikey/v2/verifyApiKey - Allows an upstream Stroom-Proxy to verify an API key.
  • POST - /api/event - The Event Store API for POSTing individual events. Note that this endpoint does not use the same authentication as the other REST endpoints.
  • POST - /api/feedStatus/v1/getFeedStatus - Allows an upstream Stroom-Proxy to check the receipt status of a Feed.
  • POST - /api/feedStatus/v2/getFeedStatus - Allows an upstream Stroom-Proxy to check the receipt status of a Feed.
  • GET - /api/ruleset/v2/fetchHashedRules - Allows an upstream Stroom-Proxy to fetch the obfuscated receipt policy rules.

Admin APIs

These APIs are presented on the administration port/path which by default is:

localhost:8091/proxyAdmin/....

More details about the admin APIs (with the exception of the Prometheus endpoint) can be found in Metrics Servlets.

Metrics

Proxy exposes two endpoints for capturing metrics on its inner workings:

  • Dropwizard Metrics - http://localhost:8091/proxyAdmin/metrics. This exposes the metrics as a JSON object. For more details see Dropwizard Metrics.

  • Prometheus Metrics - http://localhost:8091/proxyAdmin/prometheusMetrics. Exposes the same data as Dropwizard Metrics, but in a format suitable for scraping by Prometheus.

Health Check

http://localhost:8091/proxyAdmin/healthcheck
http://localhost:8091/proxyAdmin/healthcheck?pretty=true

Performing a GET request on this endpoint will initiate a health check on all parts of Stroom-Proxy that have registered a health check. Each registered health check will return healthy or unhealthy along with any details relating to its state. If all health checks return healthy then the endpoint will return a 200 status.

It allows the Stroom-Proxy instance to self check its inner workings.

Current registered health checks are:

  • deadlocks - Checks for any deadlocked threads.
  • stroom.dropwizard.common.LogLevelInspector - Reports the current logger levels that have been set. This is not strictly a health check as it will always return healthy; it is included for information purposes.
  • stroom.proxy.app.ProxyConfigHealthCheck - Displays the current configuration values. This is not strictly a health check as it will always return healthy; it is included for information purposes.
  • stroom.proxy.app.ProxyConfigMonitor - Returns healthy if the monitoring of the config file is working correctly.
  • stroom.proxy.app.ReceiveDataRuleSetClient - Returns healthy if the receipt policy rules could be fetched from the downstream host. Will return healthy if receipt policy checking is not enabled/configured.
  • stroom.proxy.app.handler.RemoteFeedStatusClient - Returns healthy if a feed status check could be fetched from the downstream host. Will return healthy if receipt policy checking is not enabled/configured.
  • stroom.proxy.app.security.ProxyApiKeyCheckClient - Returns healthy if an API Key check could be performed. Will return healthy if receipt policy checking is not enabled/configured.
  • stroom.receive.common.DataFeedKeyDirWatcher - Returns healthy if the monitoring of the Datafeed Key directory is working correctly.
  • stroom.security.common.impl.ExternalIdpConfigurationProvider - Returns healthy if the configuration of the external IDP could be fetched. Will return healthy if no external IDP is configured.

Filtered Health Check

http://localhost:8091/proxyAdmin/filteredhealthcheck

This performs the same checks as the Health Check, but allows for filtering of the checks, which can be useful if there are certain checks that need to be ignored.

It takes the following optional query parameters:

  • allow - A comma delimited list of health check names to include.
  • deny - A comma delimited list of health check names to exclude.
  • minimal - Set to true to exclude all the detail in the health check response.
  • pretty - Set to true to format the JSON.
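The query parameters above can be combined in one request. The following sketch builds such a URL, assuming the default admin port and path on localhost. The function name filtered_healthcheck_url and the ADMIN_URL variable are hypothetical, and no URL encoding is applied to the values.

```shell
#!/usr/bin/env bash
# Builds a filtered health check URL from name=value query parameters.
# Usage: filtered_healthcheck_url [name=value ...]
filtered_healthcheck_url() {
  local base="${ADMIN_URL:-http://localhost:8091/proxyAdmin}"  # assumed default
  local query="" param
  for param in "$@"; do
    # Join parameters with '&', omitting it before the first one.
    query+="${query:+&}${param}"
  done
  printf '%s\n' "${base}/filteredhealthcheck${query:+?${query}}"
}

# Example: skip the deadlocks check and return a minimal, formatted response.
# curl --silent "$(filtered_healthcheck_url deny=deadlocks minimal=true pretty=true)"
```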

Queues

http://localhost:8091/proxyAdmin/queues

This endpoint returns HTML and is intended as a means for an admin to monitor the state of the various internal queues within Stroom-Proxy. It is intended to be called from a browser.

Tasks

Stroom-Proxy has a number of administrative tasks that can be executed via its tasks API.

The list of available task names can be found by performing a GET request on:

http://localhost:8091/proxyAdmin/tasks

The following is a list of the task names that are currently available:

  • clear-all-cache - Clears all caches in Stroom-Proxy.
  • clear-cache-Authenticated-Data-Feed-Key-Cache - Clears the Authenticated Datafeed Key cache.
  • clear-cache-Event-Store-Open-Appenders - Clears the Event Store Open Appenders cache.
  • clear-cache-Remote-Feed-Status-Response-Cache - Clears the Remote Feed Status Response cache.
  • gc - Forces a Java garbage collection to destroy unused objects in memory.
  • log-level - Sets the log level for a named class or package.

Tasks are executed using a POST and may require form data if the task requires it.

curl -X POST http://localhost:8091/proxyAdmin/tasks/clear-all-caches

The log-level task requires parameters to tell it the log level to set and on which class/package to set it.

curl -X POST http://localhost:8091/proxyAdmin/tasks/log-level -d "logger=stroom.core.servlet.StatusServlet&level=DEBUG"

The task may or may not return content.

Ping

http://localhost:8091/proxyAdmin/ping

Simple endpoint that will respond with the text pong and a 200 status if Stroom-Proxy is running. This can be used by load balancers to determine if Stroom-Proxy is up or not.
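A script-based availability check could use this endpoint in the same way a load balancer would. The following is a minimal sketch, assuming the default admin port and path on localhost. The function name proxy_is_up and the five second timeout are illustrative choices.

```shell
#!/usr/bin/env bash
# Returns success only if the ping endpoint answers 200 with the text 'pong'.
# Usage: proxy_is_up [ADMIN_BASE_URL]
proxy_is_up() {
  local base="${1:-http://localhost:8091/proxyAdmin}"  # assumed default
  local nl=$'\n' response status body
  # Append the HTTP status code on its own line after the response body.
  response="$(curl --silent --max-time 5 \
    --write-out '\n%{http_code}' "${base}/ping")" || return 1
  status="${response##*"$nl"}"   # text after the last newline
  body="${response%"$nl"*}"      # everything before it
  body="${body//[[:space:]]/}"   # ignore surrounding whitespace
  [[ "$status" == "200" && "$body" == "pong" ]]
}

# Example:
# if proxy_is_up; then echo "Stroom-Proxy is up"; else echo "Stroom-Proxy is down"; fi
```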

Threads

http://localhost:8091/proxyAdmin/threads

Lists the currently running threads with a stack trace for each. Can be useful for debugging.

5 - Receipt ID

A unique identifier that is assigned to each item of data received by Stroom-Proxy.

On receipt of data, Stroom-Proxy will assign the data a unique Receipt ID. This value will be set in the ReceiptId meta attribute. It will also be appended to the ReceiptIdPath meta attribute, which is a comma delimited list of Receipt IDs.

The format of this attribute has been designed to be useful to administrators, while still being unique across the environment that the Stroom and Stroom-Proxy instances are deployed in.

The format is as follows:

<timestamp>_<seq no>_<(P|S)>_<proxyId or stroom nodeName>

  • <timestamp> - The receipt timestamp in milliseconds since the Unix Epoch, zero padded.

  • <seq no> - A zero-padded four-digit sequential number (starting at 0000) that is used to distinguish between multiple receipt events happening during the same millisecond on the same instance.

  • <(P|S)> - Indicates whether the item was received by Stroom (S) or Stroom-Proxy (P).

  • <proxyId or stroom nodeName> - For Stroom-Proxy this will be the proxyConfig.proxyId that is either set in configuration to uniquely identify a proxy instance or is one of the Fully Qualified Domain Name (FQDN) / IP address. For Stroom this is the node name of the Stroom instance. The proxyId set on each Stroom-Proxy instance must be unique across all Stroom-Proxy instances in the estate. The nodeName set on each Stroom instance must be unique across all Stroom instances in the estate.

An example Receipt ID is 0000001738332835967_0000_P_node1

The new format is useful for tracing the flow of data through a chain of proxies as it will be included in receive and send logs as well as being written to the meta attributes.

To ensure uniqueness of these IDs across the estate, proxyId values should be unique within the environment through which data will flow. The same is true for Stroom nodeName values.
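The format described above can be sketched as a small formatter and parser. This is only an illustration: the names `ReceiptId`, `format_receipt_id` and `parse_receipt_id` are hypothetical, and the 19 digit timestamp width is an assumption inferred from the example Receipt ID rather than a documented value.

```python
from typing import NamedTuple

TIMESTAMP_WIDTH = 19  # assumption: inferred from the example Receipt ID
SEQ_WIDTH = 4         # four digit sequence number, as described above


class ReceiptId(NamedTuple):
    epoch_ms: int   # receipt time in milliseconds since the Unix Epoch
    seq_no: int     # distinguishes receipts within the same millisecond
    receiver: str   # "P" for Stroom-Proxy, "S" for Stroom
    node: str       # proxyId or Stroom nodeName


def format_receipt_id(rid: ReceiptId) -> str:
    """Build <timestamp>_<seq no>_<(P|S)>_<proxyId or stroom nodeName>."""
    return (
        f"{rid.epoch_ms:0{TIMESTAMP_WIDTH}d}_"
        f"{rid.seq_no:0{SEQ_WIDTH}d}_{rid.receiver}_{rid.node}"
    )


def parse_receipt_id(text: str) -> ReceiptId:
    """Split a Receipt ID into its four parts."""
    # Split on the first three underscores only, in case the node name has one.
    timestamp, seq_no, receiver, node = text.split("_", 3)
    if receiver not in ("P", "S"):
        raise ValueError(f"Unexpected receiver type: {receiver!r}")
    return ReceiptId(int(timestamp), int(seq_no), receiver, node)


def parse_receipt_id_path(path: str) -> list:
    """ReceiptIdPath is a comma delimited list of Receipt IDs."""
    return [parse_receipt_id(part.strip()) for part in path.split(",")]
```

Parsing the example ID and formatting the result again should give back the same string, which makes the sketch easy to check against real log entries.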

6 - Proxy Architecture

An overview of the architecture of Stroom-Proxy.

Overview

Stroom-Proxy has a number of moving parts and it can be configured in a variety of ways. This document aims to describe some typical configurations of Stroom-Proxy.

Directories as Queues

Stroom-Proxy makes heavy use of multiple file system directories as work queues. These queues act as the interface between the different processing steps in Stroom-Proxy.

Data representing one queue item is placed into a directory. That directory is atomically moved into a queue directory with a new name that represents its position in the queue. An item is consumed from the directory queue by atomically moving it to a different path. Typically this will be a numbered directory that acts as a staging area, where the item can be worked on before it is moved to a different directory queue.

These sub-directories are placed in a path structure that indicates the position in the queue, e.g.:

./50_forwarding/downstream/02_retry/2/012/345/012345678

In the above example:

  • ./50_forwarding/downstream/02_retry represents the base directory of the queue.
  • /2/ represents the depth of the directory tree, i.e. the queue item has two sub-directories above it.
  • /012/ is a sub-directory containing items 12,000,000 to 12,999,999.
  • /345/ is a sub-directory containing items 12,345,000 to 12,345,999.
  • /012345678/ is the queue item containing the data to be processed. The number is the position in the queue, left padded with zeros so that the number of digits is always a multiple of three.

This structure ensures that there are never more than 999 items in each directory and the head/tail of the queue can be found quickly.
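As a rough illustration of the path structure above, the following sketch derives the relative path of a queue item from its position. The function name `queue_item_relative_path` is hypothetical, and the handling of small positions (a depth of 0 with no intermediate sub-directories) is an assumption based on the example paths in this document.

```python
def queue_item_relative_path(position: int) -> str:
    """Return <depth>/<sub-dirs...>/<item> for a queue position."""
    if position < 0:
        raise ValueError("Queue position must not be negative")
    digits = str(position)
    # Left pad with zeros so the digit count is a multiple of three.
    width = -(-len(digits) // 3) * 3
    padded = digits.zfill(width)
    # Every group of three digits except the last names a parent sub-directory.
    parents = [padded[i:i + 3] for i in range(0, width - 3, 3)]
    depth = len(parents)
    return "/".join([str(depth), *parents, padded])


print(queue_item_relative_path(12345678))
```

For position 12345678 this yields the `2/012/345/012345678` portion of the example path, relative to the base directory of the queue.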

Numbered Directories

Typically between each queue is a numbered directory that acts as a staging area to work on the data. Numbered directories are sequentially numbered directories that all exist in a single parent directory. They are expected to be transient in nature, i.e. only existing until they can be moved to another queue.

For example, 01_receiving_simple contains numbered directories and each one is used to stage non-ZIP data that has been received into proxy:

./01_receiving_simple/0000001407/
./01_receiving_simple/0000001408/
./01_receiving_simple/0000001409/

Each directory represents data for a single request into Stroom Proxy. Once the data has been successfully written to one of these directories, the directory will be atomically moved to one of the directory queues, e.g.

./01_receiving_simple/0000001407/ => 20_pre_aggregate_input_queue/0/382/

The directory then becomes the responsibility of the queue directory it was moved into.
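The staging and hand-over step described above can be sketched as follows. This is a minimal illustration, not Stroom-Proxy's implementation: the function name `stage_and_enqueue` is hypothetical, the caller supplies the numbered directory name and queue position, and the sketch relies on `os.rename` being atomic when source and target are on the same POSIX file system.

```python
import os
from pathlib import Path


def stage_and_enqueue(data_dir: Path, payload: bytes,
                      staging_name: str, queue_item: str) -> Path:
    """Write a payload to a numbered directory, then move it onto a queue."""
    # 1. Stage the data in a numbered directory that nothing else consumes.
    staging = data_dir / "01_receiving_simple" / staging_name
    staging.mkdir(parents=True)
    (staging / "0000000001.dat").write_bytes(payload)

    # 2. Only once the write has succeeded, move the whole directory in a
    #    single rename so a consumer never sees a partially written item.
    target = data_dir / "20_pre_aggregate_input_queue" / queue_item
    target.parent.mkdir(parents=True, exist_ok=True)
    os.rename(staging, target)
    return target
```

Because the rename happens only after the data is fully written, a reader of the queue either sees the complete item or does not see it at all.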

Directory Structure

The following is a list of the directories used by Stroom-Proxy in its data directory (as configured by proxyConfig.path.data).

|-- 01_receiving_simple/
|-- 01_receiving_zip/
|-- 02_split_zip_input_queue/
|-- 03_split_zip_splits/
|-- 20_pre_aggregate_input_queue/
|-- 21_pre_aggregates/
|-- 22_splitting/
|-- 23_split_output/
|-- 30_aggregate_input_queue/
|-- 31_aggregates/
|-- 40_forwarding_input_queue/
|-- 50_forwarding/
|   |-- <destination name 1>/
|   |   |-- 01_forward/
|   |   |-- 02_retry/
|   |   `-- 03_failure/
|   `-- <destination name 2>/
|       |-- 01_forward/
|       |-- 02_retry/
|       `-- 03_failure/
|-- 99_deleting/
|-- event/
`-- temp_forward_copies/

The following diagram illustrates how data flows between the various queues and numbered directories.

images/proxy/architecture.puml.svg

/01_receiving_simple/

This directory is the reception area for data that is NOT a ZIP file, i.e. uncompressed or gzip compressed data. It contains numbered directories.

Data will be written to this directory before the client receives the HTTP response.

Each numbered directory will contain two files:

  • /01_receiving_simple/0000002034/0000000001.meta - The meta sidecar file containing the HTTP headers.
  • /01_receiving_simple/0000002034/0000000001.dat - The file containing the received payload data.

The filenames are always the same as it is only dealing with a single stream.

/01_receiving_zip/

This directory is the reception area for data that has been received as a ZIP file, which may contain one or more streams of data and associated metadata. It contains numbered directories.

Received ZIP files will be written to a numbered sub-directory in this directory before the client receives the HTTP response.

All .meta files in the ZIP file will be updated to add the HTTP headers from the request. In order to do this, Stroom Proxy will first write the ZIP as a .zip.staging file. It will clone all the ZIP entries in this file into a .zip file, updating the .meta entries as it goes. The .zip.staging file will be deleted once complete.

The ZIP entries will be scanned and all valid entries will be written to a .entries sidecar file for subsequent processes to use. This .entries file defines the entries in the ZIP that are valid for further processing and allows subsequent processing to use this file as a reference rather than having to re-scan the ZIP.

The scanning process will also establish how many groups are in the ZIP. A group is defined as a combination of the Feed and the Stream Type.

If the ZIP contains more than one group or the ZIP does not adhere to the correct [Stroom ZIP Format](/docs/sending-data/payloads/#stroom-zip-format), the directory will be moved to /02_split_zip_input_queue/ for splitting.

If the ZIP has a valid format and only contains one group, it will either be moved to the 20_pre_aggregate_input_queue queue, if aggregation is enabled, or 40_forwarding_input_queue queue if not.
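The routing rules for a received ZIP can be summarised in a short sketch. The function names and the representation of entries as (feed, stream type) pairs are hypothetical, chosen only to make the decision logic above concrete.

```python
def count_groups(entries) -> int:
    """A group is a distinct combination of Feed and Stream Type."""
    return len({(feed, stream_type) for feed, stream_type in entries})


def next_queue_for_zip(group_count: int, valid_format: bool,
                       aggregation_enabled: bool) -> str:
    """Pick the directory queue a received ZIP should be moved to."""
    # Multiple groups or a non-conforming ZIP must be split/rewritten first.
    if group_count > 1 or not valid_format:
        return "02_split_zip_input_queue"
    # A valid, single-group ZIP skips splitting entirely.
    if aggregation_enabled:
        return "20_pre_aggregate_input_queue"
    return "40_forwarding_input_queue"
```

For example, a ZIP holding entries for both FEED_X and FEED_Y has two groups, so it would be routed to the split queue regardless of whether aggregation is enabled.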

/02_split_zip_input_queue/

Each directory placed into this directory queue will contain a ZIP file and a .entries file. The ZIP may be in an invalid format, in which case a new ZIP will be created with the correct entry naming and structure. This is to ensure that all ZIP files received downstream are in a consistent format. Alternatively it will contain more than one group, so will need to be split into one ZIP file per group.

A numbered directory will be created in /03_split_zip_splits/ to hold each split. For each group of entries in a split, it will create a sub-directory named after the group in the numbered directory, e.g. for two splits:

  • /03_split_zip_splits/0000000392/FEED_X__raw_events/proxy.zip
  • /03_split_zip_splits/0000000392/FEED_X__raw_events/proxy.entries
  • /03_split_zip_splits/0000000392/FEED_X__raw_events/proxy.meta
  • /03_split_zip_splits/0000000392/FEED_Y__raw_events/proxy.zip
  • /03_split_zip_splits/0000000392/FEED_Y__raw_events/proxy.entries
  • /03_split_zip_splits/0000000392/FEED_Y__raw_events/proxy.meta

Once the splitting is complete, each split directory will be moved to the 20_pre_aggregate_input_queue queue, if aggregation is enabled, or 40_forwarding_input_queue queue if not.

/20_pre_aggregate_input_queue/

Each directory on this queue will contain a ZIP file that contains one or more entries for the same group (combination of Feed and Stream Type).

If proxyConfig.aggregator.splitSources is set to true, Stroom Proxy will inspect the ZIP to see if it needs to be split into multiple parts to meet the aggregation targets (defined by proxyConfig.aggregator.maxItemsPerAggregate and proxyConfig.aggregator.maxUncompressedByteSize). Otherwise the ZIP will be treated as a single split-part.
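This document does not describe how the split-parts are chosen, so the following is only one plausible approach: a greedy pass that starts a new part whenever adding the next entry would exceed either limit. The function name `plan_split_parts` is hypothetical; `max_items` and `max_bytes` stand in for maxItemsPerAggregate and maxUncompressedByteSize.

```python
def plan_split_parts(entry_sizes, max_items: int, max_bytes: int):
    """Greedily group entry sizes (uncompressed bytes) into split-parts."""
    parts = []
    current = []
    current_bytes = 0
    for size in entry_sizes:
        part_is_full = bool(current) and (
            len(current) >= max_items or current_bytes + size > max_bytes
        )
        if part_is_full:
            parts.append(current)
            current, current_bytes = [], 0
        # An entry larger than max_bytes still gets a part of its own.
        current.append(size)
        current_bytes += size
    if current:
        parts.append(current)
    return parts
```

If the plan contains a single part, the ZIP can be used as it is. Multiple parts would correspond to the `_part_N` directories.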

If there is just one split-part, the directory will be moved into the current aggregate directory for its group, e.g.

  • /21_pre_aggregates/FEED_X__raw_events/009/proxy.zip

If there are multiple split-parts the ZIP file will require splitting into multiple ZIP files with one per split-part, i.e. all entries from the input ZIP spread over multiple split-part ZIPs. Each split-part will be written like this:

  • /22_splitting/0000000343/009_part_1/proxy.zip
  • /22_splitting/0000000343/009_part_2/proxy.zip
  • /22_splitting/0000000343/009_part_3/proxy.zip

Once the splitting has been completed, the common parent directory is moved to /23_split_output/:

  • /23_split_output/0000000343/009_part_1/proxy.zip
  • /23_split_output/0000000343/009_part_2/proxy.zip
  • /23_split_output/0000000343/009_part_3/proxy.zip

Each split-part is then moved to /21_pre_aggregates/.

  • /21_pre_aggregates/FEED_X__raw_events/011/proxy.zip

When the aggregate for a Feed|Type group is complete (based on item count and uncompressed size), the aggregate will be closed. Closing of the aggregate involves moving the parent directory of all the aggregate items to /30_aggregate_input_queue/.

/30_aggregate_input_queue/

Each directory on this queue will contain multiple directory groups (each containing a ZIP file and its associated files) that are to be part of a single aggregate.

If there is only one item in the queue directory, the directory will be moved to /40_forwarding_input_queue/ for forwarding.

If there is more than one item in the queue directory then a new aggregate ZIP will be created in /31_aggregates/. The entries from each item ZIP will be written into the new aggregate ZIP.

It will also create a set of meta entries for the aggregate. This will contain only key/value entries that are present in every item in the aggregate.
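The rule for the aggregate's meta entries amounts to taking the intersection of the items' key/value pairs. A minimal sketch, assuming each item's meta has already been read into a dictionary (the function name `common_meta` is hypothetical):

```python
def common_meta(item_metas: list) -> dict:
    """Keep only the key/value pairs that are present in every item."""
    if not item_metas:
        return {}
    first, *rest = item_metas
    return {
        key: value
        for key, value in first.items()
        # The key must exist in every other item with the same value.
        if all(meta.get(key) == value for meta in rest)
    }


print(common_meta([
    {"Feed": "FEED_X", "RemoteHost": "host1"},
    {"Feed": "FEED_X", "RemoteHost": "host2"},
]))
```

In this example only `Feed` survives, because `RemoteHost` differs between the two items.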

Once the aggregate has been written it is moved to /40_forwarding_input_queue/.

/40_forwarding_input_queue/

Each directory on this queue will contain a single ZIP file that may contain one or more streams (plus associated files). Alongside the ZIP file there will be a combined .meta file for the aggregate.

Depending on how forwarding has been configured (using proxyConfig.forwardFileDestinations and proxyConfig.forwardHttpDestinations), there will be a pair of directory queues for each of the forwarding destinations, with the destination name in the path, e.g.:

  • /50_forwarding/file-dest-1/01_forward/
  • /50_forwarding/file-dest-1/02_retry/
  • /50_forwarding/file-dest-2/01_forward/
  • /50_forwarding/file-dest-2/02_retry/
  • /50_forwarding/http-dest-1/01_forward/
  • /50_forwarding/http-dest-1/02_retry/
  • /50_forwarding/http-dest-2/01_forward/
  • /50_forwarding/http-dest-2/02_retry/

Each item on the /40_forwarding_input_queue/ queue will be copied into each of the 01_forward queues, then the source item will be deleted. This keeps each destination independent and prevents a loss of connection to one destination from impacting the others.
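The fan-out step can be sketched as below. This is a simplified illustration with a hypothetical function name: it reuses the source item's directory name in each forward queue and copies directly into place, whereas a real directory queue would assign a new queue position and would need the copy to appear atomically.

```python
import shutil
from pathlib import Path


def fan_out(item: Path, forwarding_dir: Path, destinations: list) -> list:
    """Copy one queue item to every destination, then delete the source."""
    copies = []
    for name in destinations:
        forward_queue = forwarding_dir / name / "01_forward"
        forward_queue.mkdir(parents=True, exist_ok=True)
        # Each destination gets its own independent copy of the item.
        copies.append(Path(shutil.copytree(item, forward_queue / item.name)))
    # The source is removed only after every copy has been made.
    shutil.rmtree(item)
    return copies
```

Deleting the source last means a failure part way through leaves the item on the input queue, so the fan-out can be attempted again.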

/50_forwarding/

This directory contains multiple directory queues, two per forward destination.

  • ..../<destination name>/01_forward/ - Items initially queued for forwarding to the destination.
  • ..../<destination name>/02_retry/ - Items that have failed to forward to the destination and have been queued for a retry.

Each forward destination directory also contains a failure directory:

  • ..../<destination name>/03_failure/ - Items that have failed to forward. Either they have failed too many times or have failed with an error that prevents retry. Items in this directory are now outside the control of Stroom-Proxy and will remain until moved/deleted by an administrator.
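The choice between the retry queue and the failure directory can be expressed as a small decision function. The name `location_after_failed_forward` and its parameters are hypothetical, and the way Stroom-Proxy counts attempts or classifies errors is not covered here.

```python
def location_after_failed_forward(attempts: int, retryable: bool,
                                  max_attempts: int) -> str:
    """Decide where an item goes after a forward attempt has failed."""
    # Errors that prevent retry, or too many failures, end in 03_failure,
    # where the item stays until an administrator moves or deletes it.
    if not retryable or attempts >= max_attempts:
        return "03_failure"
    return "02_retry"
```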