Palisade Clients

This section gives some insight into how clients could be used as the entry point for users to access data via Palisade.

Current

Reading a stream of data

By default, the Data Service presents an HTTP stream of binary data to the client on request.

A code sample using the client-akka package to read all AVRO resources in a dataset (without the client interpreting the AVRO data):

void printInputStream(InputStream is) {
    new BufferedReader(new InputStreamReader(is))
        .lines()
        .forEach(line -> System.out.println(line));
}

void doRequest() {
    // Register the request with the Palisade Service and receive a unique token
    String token = client.register("Alice", "file:/palisade-data-store/dataset-1/", "").toCompletableFuture().join();
    // Fetch the resources returned by the Filtered-Resource Service, keeping only AVRO resources
    Source<LeafResource, ?> avroResources = client.fetch(token).filter(resource -> resource.getType().equals("AVRO"));
    // Read each resource from the Data Service as a stream of raw bytes
    Source<InputStream, ?> avroRecords = avroResources.flatMapConcat(resource -> client.read(token, resource));
    avroRecords.runWith(Sink.foreach(this::printInputStream), materializer);
}

In reality, this Source<InputStream, ?> would be interpreted by the client using an AVRO deserialiser.
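
A minimal sketch of what that interpretation might look like, using Apache Avro's GenericDatumReader to read the object-container stream, is shown below; the actual deserialiser wiring used by the client is not shown here, so treat this as an illustration rather than the client's implementation:

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.IOException;
import java.io.InputStream;

// Deserialise an AVRO object-container stream and print each record
void printAvroRecords(InputStream is) throws IOException {
    try (DataFileStream<GenericRecord> records =
             new DataFileStream<>(is, new GenericDatumReader<GenericRecord>())) {
        records.forEach(record -> System.out.println(record));
    }
}

A method like this could replace printInputStream in the snippet above, printing deserialised records rather than raw bytes.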

Similarly, Palisade has a client-java package which implements the Palisade Service/Filtered-Resource Service/Data Service protocol using only the Java standard library, exposing a java.util.concurrent.Flow.Publisher and java.io.InputStream instead of Akka's Source.
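
The consuming code then uses the JDK's Flow API directly. The sketch below shows a Flow.Subscriber that could be subscribed to such a Publisher, assuming (as an illustration only) that the published elements are the per-resource InputStreams; the exact element types and method names exposed by client-java are not reproduced here:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.concurrent.Flow;

// A minimal Flow.Subscriber that pulls one resource's data stream at a time and prints it
class PrintingSubscriber implements Flow.Subscriber<InputStream> {
    private Flow.Subscription subscription;

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        this.subscription = subscription;
        subscription.request(1); // request the first stream
    }

    @Override
    public void onNext(InputStream is) {
        new BufferedReader(new InputStreamReader(is)).lines().forEach(System.out::println);
        subscription.request(1); // request the next stream once this one has been handled
    }

    @Override
    public void onError(Throwable throwable) {
        throwable.printStackTrace();
    }

    @Override
    public void onComplete() {
        System.out.println("Finished reading all resources");
    }
}

Requesting one element at a time keeps backpressure explicit, which is the main difference in feel from the Akka Streams version, where backpressure is handled by the materialised graph.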

Using the 'cat' command line tool

It should be possible to use command-line tools like 'cat' to view files that are protected by Palisade. To do that, we would need to write a client that mimics the behaviour of the 'cat' command but routes the request for data via Palisade. 'cat' could then be aliased to run that client code, so that, to the end user, there is very little difference from how they would normally use 'cat' without the data access policy restrictions.

When using Palisade, there is some extra context to be managed, and a 'dumb' cat-client might be quite verbose (the user would have to keep track of tokens themselves). Instead, there are two implementations of this style of client.

The first of these, the client-shell package, is a simple shell text user interface with commands similar to cat, cd, and ls. This aims to demonstrate to a developer how Palisade works, mapping each part of the protocol between the client and server to a different UNIX-like command. As commands are executed, the data returned is printed out in a human-readable fashion.

The second of these, the client-fuse package, is a connector between Palisade and the FUSE interface, which mounts the results of a Palisade query as a filesystem. This aims to provide a quick and easy way to achieve compatibility with a whole host of existing tools, as well as presenting an interactive, graphical way to explore returned data. This client allows the use of many existing UNIX CLI tools such as ls and cat, as well as more complex tools such as sed and grep. This approach is not the most performant, especially when querying many thousands of resources, but should be more than enough for proof-of-concept and demonstrative purposes.

Using the S3 Client

Via the S3 Client, it is possible to read resources stored in AWS-S3-API-compliant data stores, allowing for a full AWS implementation of Palisade. When Palisade is deployed using the S3 client, the relevant services are configured to expect data stored in S3, and are loaded with the serialisers required to deserialise the returned LeafResources before they are returned to the client.

Reading a stream of data using Apache Spark

The S3 Client can easily be used to read data from within Apache Spark.
Given a Spark job running against AWS S3 as follows:

// Point the S3A connector directly at AWS S3
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.eu-west-2.amazonaws.com/")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
// Load the non-recursive AVRO schema (see note below) and use it to read the data
val nonrecursive = scala.io.Source.fromFile("/schema/nonrecursive.json").mkString
spark.read.format("avro").option("avroSchema", nonrecursive).load("s3a://palisade-application-dev/data/remote-data-store/data/employee_file0.avro").show()

Note that we use a modified non-recursive AVRO schema, /schema/nonrecursive.json (this excludes the managers field), as recursive schemas are not compatible with Spark SQL.

Adapt the Spark job to run against the Palisade S3 client (ensure the client is running and correctly configured). This short snippet requires curl, but otherwise works wholly within spark-shell and the s3 and avro libraries, as the previous example did:

import sys.process._
// User 'Alice' wants 'file:/data/local-data-store/' directory for 'SALARY' purposes
// We get back the token '09d3a677-3d03-42e0-8cdb-f048f3929f8c', to be used as a bucket-name
val token = (Seq("curl", "-X", "POST", "http://localhost:8092/register?userId=Alice&resourceId=file%3A%2Fdata%2Flocal-data-store%2F&purpose=SALARY")!!).stripSuffix("\n")
Thread.sleep(5000) // give Palisade time to process the registered request before Spark starts reading

spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "localhost:8092/request")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
// These are not interpreted or validated by Palisade, but Spark requires them to be non-null
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "accesskey")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secretkey")
// spark.read.format("avro").load("s3a://" + token + "/with-policy/employee_small.avro").show()
val nonrecursive = scala.io.Source.fromFile("/schema/nonrecursive.json").mkString
spark.read.format("avro").option("avroSchema", nonrecursive).load("s3a://" + token + "/data/employee_file0.avro").show()
