How to Deploy with Accumulo on Docker
After reading the previous page you should have a good understanding of what images are available for Gaffer and how to configure them to your needs. However, before running a deployment backed by Accumulo you will need to know a bit of background on Hadoop to understand how the data will scale and be distributed.
Usually when deploying a container image you simply run the image and everything is contained locally to the container (hence the name). For larger scale graphs this is less desirable as we will usually want to be able to scale and load balance the storage based on the volume of data; this is where Hadoop comes in.
Tip
Please see the Accumulo Store page for more information on Accumulo and Hadoop.
Running a Cluster via Docker for Gaffer
To run an Accumulo cluster on Docker containers, we will need a few different containers that will work together. The general set of containers we will need to run is the following:
- zookeeper
- gchq/hdfs
    - namenode
    - datanode (one or more)
- gchq/gaffer (accumulo)
    - master
    - tserver
    - monitor
    - gc
- gchq/gaffer-rest
The guide here will walk through the setup of each container using standard Docker, but it may be more practical to use a tool such as docker compose. An example docker compose file can be found in the gaffer-docker repository.
Before starting any containers we need to create a network so all the containers can talk to each other. To do this we simply run the following command to create a network named gaffer-example, which the rest of the containers in this guide will join:
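docker network create gaffer-example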
Note
If you are using something like docker compose or host networking (e.g. with --net=host) you can skip the network creation step.
ZooKeeper
Starting with ZooKeeper; this is used by Accumulo to provide distributed synchronization, so it is useful to start it up first.
docker run \
--detach \
--name zookeeper \
--hostname zookeeper \
--net gaffer-example \
--env ZOO_SERVERS="server.1=zookeeper:2888:3888;2181" \
--env ZOO_4LW_COMMANDS_WHITELIST="*" \
--volume /data \
--volume /datalog \
zookeeper:3.7.1
The above docker command will run a ZooKeeper container with a few bits of configuration. The main parts being set up are the hostname and ports. With ZooKeeper you can also do this by providing a zoo.cfg file, but as there is not much to configure for a small cluster we can just pass the configuration in via environment variables with --env. See the official ZooKeeper docs for more information and configuration options.
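As a quick check that ZooKeeper has started correctly, you can ask it to report its status; a minimal example, assuming the zkServer.sh script is on the PATH inside the official ZooKeeper image:

# Should report the server mode and confirm the container is serving requests
docker exec zookeeper zkServer.sh status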
Hadoop/HDFS
Next we can launch the Hadoop cluster. For this we will use the custom gchq/hdfs image; as mentioned before, this provides a single-node Hadoop setup which we can run multiple times to extend into a multi-node cluster.
To run a Hadoop cluster we first need the configuration files for Hadoop, which we can then add into the running containers. As a starting point you can use the files from the gaffer-docker repository, but you may wish to edit these for your deployment; you can read more in the official Hadoop docs.
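As an indication of the layout, the bind-mounted directory used later in this guide would contain the standard Hadoop configuration files, for example (the exact set of files in the gaffer-docker repository may differ):

/custom/configs/hdfs/
├── core-site.xml      # core settings, e.g. the default filesystem URI
├── hdfs-site.xml      # HDFS settings, e.g. replication and data directories
└── log4j.properties   # logging configuration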
The first Hadoop container we need is a namenode container, which runs the NameNode service and essentially acts as the master node. We can run this using docker like the following:
docker run \
--detach \
--name hdfs-namenode \
--hostname hdfs-namenode \
--net gaffer-example \
--publish 9870:9870 \
--env HADOOP_CONF_DIR="/etc/hadoop/conf" \
--volume /custom/configs/hdfs:/etc/hadoop/conf \
--volume /var/log/hadoop \
--volume /data1 \
--volume /data2 \
gchq/hdfs:3.3.3 namenode
The configuration of the above docker command is fairly straightforward: we make sure the port configured for the NameNode web UI (e.g. port 9870) is made available on the host. We also set up the bind-mount and environment variables so the custom configuration files are available and used. Finally, there are a few defined volumes for data; these can be changed to wherever you might want to store the data, and it is highly recommended to set these up correctly for your available infrastructure.
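Once the container is up, the NameNode web UI should be reachable on the published port at http://localhost:9870. You can also query the cluster state from inside the container; a quick check, assuming the hdfs command line tool is on the PATH in the gchq/hdfs image:

# Print a summary of the HDFS cluster, including any registered datanodes
docker exec hdfs-namenode hdfs dfsadmin -report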
Once the namenode has been created we need to add the datanode(s); these are additional nodes in the Hadoop cluster which store the data. You can add multiple data nodes to distribute the data across volumes and even machines (with some additional setup).

To create a datanode container we follow much the same steps as for the namenode; however, we use a different command to run the datanode service.
docker run \
--detach \
--name hdfs-datanode1 \
--hostname hdfs-datanode1 \
--net gaffer-example \
--env HADOOP_CONF_DIR="/etc/hadoop/conf" \
--volume /custom/configs/hdfs:/etc/hadoop/conf \
--volume /var/log/hadoop \
--volume /data1 \
--volume /data2 \
gchq/hdfs:3.3.3 datanode
Note
It is recommended you configure the volumes used by your data nodes to fit
your infrastructure and the amount of data you are expecting to store. You
can also run datanode
containers across multiple machines and network them
together to have full data distribution.
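For example, a second datanode could be added by repeating the same command with a different container name and hostname (and ideally separate volumes):

docker run \
--detach \
--name hdfs-datanode2 \
--hostname hdfs-datanode2 \
--net gaffer-example \
--env HADOOP_CONF_DIR="/etc/hadoop/conf" \
--volume /custom/configs/hdfs:/etc/hadoop/conf \
--volume /var/log/hadoop \
--volume /data1 \
--volume /data2 \
gchq/hdfs:3.3.3 datanode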
Accumulo
There are a few different containers that need to be started for an Accumulo instance. They work on a similar principle to the Hadoop/HDFS containers, where the configuration is added via a bind-mount and the different services are started via the command passed to the container.

For Accumulo we will use the gchq/gaffer image, which includes the libraries for both Gaffer and Accumulo. For a deployment of Accumulo, the following nodes/containers are generally needed:
- master - This is the primary coordinating process. Must specify one node. Can specify a few for fault tolerance (note: as of Accumulo v2.1 this is referred to as manager).
- gc - The Accumulo garbage collector. Must specify one node. Can specify a few for fault tolerance.
- monitor - Node where the Accumulo monitoring web server is run.
- tserver - An Accumulo worker process.
The master container needs to be running before starting the others.
docker run \
--detach \
--name accumulo-master \
--hostname accumulo-master \
--net gaffer-example \
--env ACCUMULO_CONF_DIR="/etc/accumulo/conf" \
--env HADOOP_USER_NAME="hadoop" \
--volume /custom/configs/accumulo:/etc/accumulo/conf \
--volume /var/log/accumulo \
gchq/gaffer:2.0.0-accumulo-2.0.1 master
docker run \
--detach \
--name accumulo-monitor \
--hostname accumulo-monitor \
--net gaffer-example \
--publish 9995:9995 \
--env ACCUMULO_CONF_DIR="/etc/accumulo/conf" \
--env HADOOP_USER_NAME="hadoop" \
--volume /custom/configs/accumulo:/etc/accumulo/conf \
--volume /var/log/accumulo \
gchq/gaffer:2.0.0-accumulo-2.0.1 monitor
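Once running, the Accumulo monitoring web UI should be available on the published port at http://localhost:9995.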
docker run \
--detach \
--name accumulo-tserver \
--hostname accumulo-tserver \
--net gaffer-example \
--env ACCUMULO_CONF_DIR="/etc/accumulo/conf" \
--env HADOOP_USER_NAME="hadoop" \
--volume /custom/configs/accumulo:/etc/accumulo/conf \
--volume /var/log/accumulo \
gchq/gaffer:2.0.0-accumulo-2.0.1 tserver
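The gc service listed earlier can be started in the same way; a sketch following the same pattern, assuming the image accepts gc as the service command like the others:

docker run \
--detach \
--name accumulo-gc \
--hostname accumulo-gc \
--net gaffer-example \
--env ACCUMULO_CONF_DIR="/etc/accumulo/conf" \
--env HADOOP_USER_NAME="hadoop" \
--volume /custom/configs/accumulo:/etc/accumulo/conf \
--volume /var/log/accumulo \
gchq/gaffer:2.0.0-accumulo-2.0.1 gc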
Note
Please see the official Accumulo docs for more information on configuring the deployment.
REST API
The final container we need to start up is the REST API; this essentially provides the front end, tying the containers together into a Gaffer cluster. The REST API container is also where the configuration for the graph is applied, such as the schema files and store properties.

To start up the REST API the process is similar to the other containers; however, there are a few more bind-mounts that need defining to configure the graph (you can also build a custom image with the files baked in).
docker run \
--detach \
--name gaffer-rest \
--net gaffer-example \
--publish 8080:8080 \
--volume /custom/configs/application.properties:/gaffer/config/application.properties \
--volume /custom/configs/graph:/gaffer/graph \
--volume /custom/configs/schema:/gaffer/schema \
--volume /custom/configs/store:/gaffer/store \
gchq/gaffer-rest:2.0.0-accumulo-2.0.1
Note
The gaffer-rest image comes with some default configuration files and graph schema; you'll likely want to configure these for your project, so please see the pages on gaffer configs and graph schema for more information.
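Once all of the containers are up, you should be able to reach the REST API on the published port; the Swagger UI is typically served under /rest. As a quick check you can submit an operation to the execute endpoint (the exact path may differ depending on the version of the REST API):

# Fetch all elements from the graph via the REST API
curl -X POST 'http://localhost:8080/rest/graph/operations/execute' \
--header 'Content-Type: application/json' \
--data '{"class": "uk.gov.gchq.gaffer.operation.impl.get.GetAllElements"}'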