Writing the Schema
In Gaffer, JSON-based schemas must be written up front to model the data and to describe how it should be loaded and treated in the graph. These schemas define all aspects of the entities and edges in the graph, and can even be used to automatically perform basic analysis or aggregation on queries and ingested data.
For reference, this guide will use the same CSV data set from the project setup page.
Elements Schema
In Gaffer, an element refers to any object in the graph, i.e. your entities and edges. To set
up a graph we need to tell Gaffer what objects are in the graph and the properties they have. The
standard way to do this is with a JSON config file in the schema directory. The file can be called
something like elements.json; the name is not special, as all files under the schema directory are
merged into a master schema, but we recommend using an appropriate name.
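To illustrate why the filename carries no meaning, here is a minimal Python sketch of merging every JSON file in a directory into one dictionary. It is not Gaffer's actual merge logic; the function name merge_schema_files is hypothetical, the file contents are illustrative, and it assumes each file holds top-level sections (such as "entities" or "types") that are themselves JSON objects.

```python
import json
import tempfile
from pathlib import Path


def merge_schema_files(schema_dir):
    """Merge every *.json file in schema_dir into one dictionary.

    Top-level sections are combined key by key, so the result does not
    depend on what the individual files are called.
    """
    merged = {}
    for path in sorted(Path(schema_dir).glob("*.json")):
        part = json.loads(path.read_text())
        for section, definitions in part.items():
            merged.setdefault(section, {}).update(definitions)
    return merged


# Demonstration with two throwaway files (contents are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "elements.json").write_text(
        json.dumps({"entities": {"Person": {"vertex": "id.person.string"}}})
    )
    Path(tmp, "types.json").write_text(
        json.dumps({"types": {"id.person.string": {"class": "java.lang.String"}}})
    )
    schema = merge_schema_files(tmp)

print(sorted(schema))  # ['entities', 'types']
```

Renaming either file would leave the merged result unchanged, which is the point made above.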
As covered in the Getting Started Schema page, a schema has some required fields, but it is otherwise highly specific to your input data.
Starting with the entities from the example, we can see there will be two distinct types of entity
in the graph: one representing a Person and another for Software. These can be added to the
schema to give something like the following. The types used here, such as id.person.string, are
covered in the next section.
{
    "entities": {
        "Person": {
            "description": "Entity representing a person vertex",
            "vertex": "id.person.string"
        },
        "Software": {
            "description": "Entity representing a software vertex",
            "vertex": "id.software.string"
        }
    }
}
From the basic schema you can see that we have added two entity types for the graph. For now, each
entity just contains a short description and a type associated with the vertex key. The type here
is just a placeholder, but it has been named to reflect the assumption that we will use the
string representation of the entity's ID (this will be defined in types.json later in the guide).
Expanding on the basic schema, we will now add the edges to the graph. As the example graph is
small we only need to add one edge, the Created edge. This is a directed edge that connects a
Person to a Software and can be defined as follows.
{
    "edges": {
        "Created": {
            "source": "id.person.string",
            "destination": "id.software.string",
            "directed": "true"
        }
    },
    "entities": {
        "Person": {
            "description": "Entity representing a person vertex",
            "vertex": "id.person.string"
        },
        "Software": {
            "description": "Entity representing a software vertex",
            "vertex": "id.software.string"
        }
    }
}
As discussed in the user schema guide, edges have some mandatory fields. Starting with
the source and destination fields, these must match the types associated with the vertex field
in the relevant entities. From the example, we can see that the source of a Created edge is a
Person, so we use the placeholder type we set as its vertex field, which is id.person.string.
Similarly, the destination is a Software vertex, so we use its placeholder of id.software.string.
We must also set whether an edge is directed or not. In this case it is, as only a person can
create software, not the other way around. To set this we use the true type, but note that this is
also a placeholder and must still be defined in types.json.
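The matching rule for source and destination can be checked mechanically. The following is a minimal Python sketch, not part of Gaffer: the function name unmatched_edge_endpoints is hypothetical, and the dictionary simply mirrors the elements schema with the Created edge.

```python
# The elements schema with the Created edge, written as a Python dictionary.
elements = {
    "edges": {
        "Created": {
            "source": "id.person.string",
            "destination": "id.software.string",
            "directed": "true",
        }
    },
    "entities": {
        "Person": {"vertex": "id.person.string"},
        "Software": {"vertex": "id.software.string"},
    },
}


def unmatched_edge_endpoints(schema):
    """Return (edge, field, type) for each endpoint type no entity uses."""
    vertex_types = {e["vertex"] for e in schema.get("entities", {}).values()}
    problems = []
    for name, edge in schema.get("edges", {}).items():
        for field in ("source", "destination"):
            if edge.get(field) not in vertex_types:
                problems.append((name, field, edge.get(field)))
    return problems


problems = unmatched_edge_endpoints(elements)
print(problems)  # [] because both endpoints match an entity vertex type
```

A typo in either endpoint, for example "id.persn.string", would show up in the returned list rather than surfacing later when the graph is loaded.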
Continuing with the example, the entities and edges also have properties associated with them, such as name and age. These can be added to the schema using a properties map, resulting in the extended schema below.
{
    "edges": {
        "Created": {
            "source": "id.person.string",
            "destination": "id.software.string",
            "directed": "true",
            "aggregate": "false",
            "properties": {
                "weight": "property.float"
            }
        }
    },
    "entities": {
        "Person": {
            "description": "Entity representing a person vertex",
            "vertex": "id.person.string",
            "aggregate": "false",
            "properties": {
                "name": "property.string",
                "age": "property.integer"
            }
        },
        "Software": {
            "description": "Entity representing a software vertex",
            "vertex": "id.software.string",
            "aggregate": "false",
            "properties": {
                "name": "property.string",
                "lang": "property.string"
            }
        }
    }
}
Note
Take note of the "aggregate": "false" setting. It skips any ingest aggregation, which is not
required and is out of scope for this example. All entity property types must have an aggregation
function in Gaffer unless this option is added. Aggregation is a fairly advanced but very powerful
topic in Gaffer, and it is covered in more depth later in the documentation.
Types Schema
The other schema that now needs to be written is the types schema. As you have seen in the elements schema, placeholder types were used as the values for many of the keys. These work much like types in a strongly typed language: each one is essentially a wrapper that encapsulates a value.
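Because every placeholder in the elements schema needs a definition in the types schema, it can help to list them first. This is a minimal Python sketch, not a Gaffer tool: the function name referenced_types is hypothetical, and the dictionary mirrors the extended elements schema with descriptions omitted.

```python
# The extended elements schema as a Python dictionary (descriptions omitted).
elements = {
    "edges": {
        "Created": {
            "source": "id.person.string",
            "destination": "id.software.string",
            "directed": "true",
            "aggregate": "false",
            "properties": {"weight": "property.float"},
        }
    },
    "entities": {
        "Person": {
            "vertex": "id.person.string",
            "aggregate": "false",
            "properties": {"name": "property.string", "age": "property.integer"},
        },
        "Software": {
            "vertex": "id.software.string",
            "aggregate": "false",
            "properties": {"name": "property.string", "lang": "property.string"},
        },
    },
}


def referenced_types(schema):
    """Collect every placeholder type name used by entities and edges."""
    names = set()
    for entity in schema.get("entities", {}).values():
        names.add(entity["vertex"])
        names.update(entity.get("properties", {}).values())
    for edge in schema.get("edges", {}).values():
        names.update(edge[key] for key in ("source", "destination", "directed"))
        names.update(edge.get("properties", {}).values())
    return names


needed = sorted(referenced_types(elements))
for name in needed:
    print(name)
```

For the example schema this lists the two ID types, the three property types and true. The "aggregate" value is deliberately left out because it is a setting, not a type reference.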
Starting with the types for the entities, we used two placeholder types, one for the Person
entity and one for the Software entity. From the example CSV you can see there is an _id column
containing a string identifier that is used as the ID of the entity (this will also be used by
the edge to identify the source and destination). We will define a type for each entity ID using the
standard Java String class to encapsulate it, which leads to a basic types.json like the following.
{
    "types": {
        "id.person.string": {
            "description": "A basic type to hold the string id of a person entity",
            "class": "java.lang.String"
        },
        "id.software.string": {
            "description": "A basic type to hold the string id of a software entity",
            "class": "java.lang.String"
        }
    }
}
The next set of types to define are the ones used for the properties attached to the entities
and edges. Again, we need to look back at the input data: in the CSV file we can see that three
different types are used for the properties, which are analogous to a String, an Integer and a
Float.
Tip
Technically, all of these properties could be encapsulated in a string, but assigning a relevant type enables additional type-specific features that are often used in grouping and aggregation.
If we make a type for each of the possible properties using the standard Java classes we end up with the following:
{
    "types": {
        "id.person.string": {
            "description": "A basic type to hold the string id of a person entity",
            "class": "java.lang.String"
        },
        "id.software.string": {
            "description": "A basic type to hold the string id of a software entity",
            "class": "java.lang.String"
        },
        "property.string": {
            "description": "A type to hold string properties of entities",
            "class": "java.lang.String"
        },
        "property.integer": {
            "description": "A basic type to hold integer properties of entities",
            "class": "java.lang.Integer"
        },
        "property.float": {
            "description": "A basic type to hold float properties of entities",
            "class": "java.lang.Float"
        }
    }
}
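To see what assigning a relevant type buys over keeping everything as a string, here is a rough Python analogue. It is only a sketch: the JAVA_TO_PYTHON mapping and the convert helper are assumptions made for illustration, the row values are hypothetical, and Python's float is wider than a Java Float.

```python
# Assumed mapping from the Java classes in the types schema to Python types.
JAVA_TO_PYTHON = {
    "java.lang.String": str,
    "java.lang.Integer": int,
    "java.lang.Float": float,
}

# Property types from the types schema (descriptions omitted).
types = {
    "property.string": {"class": "java.lang.String"},
    "property.integer": {"class": "java.lang.Integer"},
    "property.float": {"class": "java.lang.Float"},
}

# Properties of the Person entity from the elements schema.
person_properties = {"name": "property.string", "age": "property.integer"}


def convert(raw, property_types):
    """Turn raw CSV strings into typed values using the schema's types."""
    typed = {}
    for key, text in raw.items():
        java_class = types[property_types[key]]["class"]
        typed[key] = JAVA_TO_PYTHON[java_class](text)
    return typed


raw_row = {"name": "example", "age": "30"}  # hypothetical CSV values
person = convert(raw_row, person_properties)
print(person)             # {'name': 'example', 'age': 30}
print(person["age"] + 1)  # 31
```

Once the age is held as an integer, numeric operations such as sums or comparisons become possible, which is the kind of type-specific behaviour grouping and aggregation rely on.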
The final thing that we need to add to the schema is a type for the true Boolean value that is used
by the directed field of the edge element. This leaves us with the complete list of types for this
example.
{
    "types": {
        "id.person.string": {
            "description": "A basic type to hold the string id of a person entity",
            "class": "java.lang.String"
        },
        "id.software.string": {
            "description": "A basic type to hold the string id of a software entity",
            "class": "java.lang.String"
        },
        "property.string": {
            "description": "A type to hold string properties of entities",
            "class": "java.lang.String"
        },
        "property.integer": {
            "description": "A basic type to hold integer properties of entities",
            "class": "java.lang.Integer"
        },
        "property.float": {
            "description": "A basic type to hold float properties of entities",
            "class": "java.lang.Float"
        },
        "true": {
            "description": "A simple boolean that must always be true.",
            "class": "java.lang.Boolean",
            "validateFunctions": [
                {
                    "class": "uk.gov.gchq.koryphe.impl.predicate.IsTrue"
                }
            ]
        }
    }
}
As you can see, the Boolean type also demonstrates the validation feature, which allows any value using the type to be validated. In this example it verifies that the value is true, but you could also check that a value exists, check that it is less than another value, and so on, or even run your own custom validator class.
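Conceptually, each entry in validateFunctions is a predicate that a value must satisfy. The Python sketch below shows that idea only; is_true, is_less_than and validate are hypothetical stand-ins written for this illustration and are not the Koryphe classes or Gaffer's validation code.

```python
def is_true(value):
    """Stand-in for an IsTrue-style check: accept only boolean True."""
    return value is True


def is_less_than(limit):
    """Build a predicate that accepts values strictly below limit."""
    def predicate(value):
        return value < limit
    return predicate


def validate(value, predicates):
    """A value is valid only if every predicate accepts it."""
    return all(predicate(value) for predicate in predicates)


print(validate(True, [is_true]))         # True
print(validate(False, [is_true]))        # False
print(validate(5, [is_less_than(10)]))   # True
```

Listing several predicates for one type corresponds to requiring all of them to pass.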
Tip
The Koryphe module provides lots of default functions that can be used to validate and aggregate data; see the predicate reference guide for more information.