FAQs
These are some previously asked questions about querying in Gaffer.
Iterable results
If I do queries such as GetElements or GetAdjacentIds, the response type is an Iterable - why?
To avoid loading all query results into memory, Gaffer Stores should return an Iterable which lazily loads and returns the data as a user iterates through the results. In the case of Accumulo, this means a connection to Accumulo must remain open whilst this iteration takes place. This connection should be closed automatically when the end of the results is reached. However, if you decide not to read all the results, e.g. you just check whether the results are empty using !results.iterator().hasNext(), or an exception is thrown whilst iterating, then the results iterable will not be closed and the connection to Accumulo will remain open. Therefore, to be safe, results should always be consumed in a try-with-resources block.
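As a minimal sketch (assuming Gaffer versions where the results implement CloseableIterable, and an existing graph and user):

// Sketch: try-with-resources guarantees the underlying Accumulo connection
// is closed, even if iteration stops early or an exception is thrown.
try (final CloseableIterable<? extends Element> results = graph.execute(
        new GetElements.Builder()
                .input(new EntitySeed("A"))
                .build(),
        user)) {
    for (final Element element : results) {
        // process each element as it is lazily loaded
    }
}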
Following on from the previous question, why can't I iterate through results in parallel?
As mentioned above, the results iterable holds a connection open to Accumulo. To avoid opening multiple connections and accidentally leaving them open, the Accumulo Store only allows one iterator to be active at any time. When you call .iterator() the connection is opened. If you call .iterator() again, the original connection is closed and a new connection is opened. This means you can't process the iterable in parallel using Java's streaming API if this involves making multiple calls to .iterator(). If your results fit in memory, you could add them to a Set or List and then process that collection in parallel.
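As a sketch (assuming the bounded results fit comfortably in memory):

// Sketch: materialise the results into a Set, bounded by a Limit so the
// query cannot exhaust memory, then process the collection in parallel.
final Set<? extends Element> resultSet = graph.execute(
        new OperationChain.Builder()
                .first(new GetAllElements())
                .then(new Limit<>(1000000))
                .then(new ToSet<>())
                .build(),
        user);
resultSet.parallelStream().forEach(element -> {
    // safe to process in parallel once the results are in memory
});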
Filtering Results
How do I return all my results summarised?
You need to provide a View that overrides the groupBy fields for all the element groups defined in the Schema. If you set the groupBy field to an empty array, no properties will be included in the element key, so all the properties will be summarised. You can do this by providing a View:
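A sketch, using placeholder group names from your Schema:

// Sketch: setting groupBy to an empty array for each group means no
// properties form part of the element key, so all properties are
// aggregated together. "yourEdge" and "yourEntity" are placeholders.
final GetAllElements getAllSummarised = new GetAllElements.Builder()
        .view(new View.Builder()
                .edge("yourEdge", new ViewElementDefinition.Builder()
                        .groupBy() // empty groupBy -> fully summarised
                        .build())
                .entity("yourEntity", new ViewElementDefinition.Builder()
                        .groupBy()
                        .build())
                .build())
        .build();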
My queries are returning duplicate results - why, and how can I deduplicate them?
For example, if you have a Graph containing the Edge A-B and you do a GetElements with a large number of seeds, with A as the first seed and B as the last seed, then you will get the Edge A-B back twice. This is because Gaffer Stores return the results for your query lazily, to avoid loading all the results into memory, so they will not realise that A-B has been queried for twice.
You can deduplicate your results in memory using the ToSet operation. But be careful to only use this when you have a small number of results. It may also be worth using the Limit operation prior to ToSet to ensure you don't run out of memory. For example:
new OperationChain.Builder()
        .first(new GetAllElements())
        .then(new Limit<>(1000000))
        .then(new ToSet<>())
        .build();
I want to filter the results of my query based on the destination of the result Edges
There are several ways of doing this; you will need to choose the most appropriate one for your needs.
If you are querying with just a single EntitySeed with a vertex value of X and require the destination to be Y, then you should change your query to use an EdgeSeed with source = X, destination = Y and directedType = EITHER. If you are querying with multiple EntitySeeds, then just change each seed into an EdgeSeed, as described above.
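A sketch of the single-seed case (vertex values "X" and "Y" are placeholders):

// Sketch: query directly for the edge X -> Y (in either direction) rather
// than fetching all edges around X and filtering afterwards.
final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("X", "Y", DirectedType.EITHER))
        .build();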
If you require your destination to match a provided regex, then you will need to use a regex filter: Regex or MultiRegex. The predicate can then be used in your Operation View to filter out elements which don't match the regex.
When the query is run and a seed matches an edge vertex, your seed may match either the source or the destination vertex. So, you need to tell the filter to apply to the opposite end of the edge. If you are running against a store that implements the MATCHED_VERTEX trait (e.g. Accumulo) then this is easy. The edges returned from the store will have a matchedVertex field, so you know which end of the edge your seed matched. This means you can select the vertex at the other end of the edge using the ADJACENT_MATCHED_VERTEX keyword. For example:
// Filter on the vertex at the opposite end of the edge from the seed
final GetElements query = new GetElements.Builder()
        .input(new EntitySeed("X"))
        .view(new View.Builder()
                .edge("yourEdge", new ViewElementDefinition.Builder()
                        .preAggregationFilter(
                                new ElementFilter.Builder()
                                        .select(IdentifierType.ADJACENT_MATCHED_VERTEX.name())
                                        .execute(new Regex("[yY]"))
                                        .build())
                        .build())
                .build())
        .build();
Without the matchedVertex field it is a bit more difficult. If you are using directed edges and you know your seed will always match the source, then you can select the DESTINATION in the filter. Otherwise, you will need to provide a filter that checks whether the SOURCE or the DESTINATION matches the regex. For example:
// Without matchedVertex, apply the regex to both ends of the edge using Or
final GetElements query = new GetElements.Builder()
        .input(new EntitySeed("X"))
        .view(new View.Builder()
                .edge("yourEdge", new ViewElementDefinition.Builder()
                        .preAggregationFilter(
                                new ElementFilter.Builder()
                                        .select(IdentifierType.SOURCE.name(), IdentifierType.DESTINATION.name())
                                        .execute(new Or.Builder<>()
                                                .select(0)
                                                .execute(new Regex("[yY]"))
                                                .select(1)
                                                .execute(new Regex("[yY]"))
                                                .build())
                                        .build())
                        .build())
                .build())
        .build();
For more information on filtering, see the Filtering Guide.
Second Hop
I've just done a GetElements and want to do another hop, but I get strange results doing two sequential GetElements operations.
You can seed a get related elements operation with vertices (EntityIds) or edges (EdgeIds). If you seed the operation with edges, you will get back the Entities at the source and destination of the provided edges, in addition to the edges which match your seeds. So you may get a lot of duplicates and unwanted results. What you really want to do is use the GetAdjacentIds operation to hop down the first edges and return just the vertices at the opposite end of the related edges. You can still provide a View and apply filters to the edges you traverse down. In addition, it is useful to add a direction to the query so you don't go back down the original edges. You can continue doing multiple GetAdjacentIds operations to traverse further around the Graph. If you want the properties on the edges to be returned, you can use GetElements as the final operation in your chain.
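A sketch of such a chain (assuming a starting vertex "X" and restricting to outgoing edges, via Gaffer's SeededGraphFilters.IncludeIncomingOutgoingType, so the query does not go back down the original edges):

// Sketch of a two-hop query: GetAdjacentIds does the first hop and returns
// only the vertices at the far end of the related edges; GetElements then
// does the second hop and returns the full edges with their properties.
final OperationChain<CloseableIterable<? extends Element>> twoHop =
        new OperationChain.Builder()
                .first(new GetAdjacentIds.Builder()
                        .input(new EntitySeed("X"))
                        .inOutType(IncludeIncomingOutgoingType.OUTGOING)
                        .build())
                .then(new GetElements.Builder()
                        .inOutType(IncludeIncomingOutgoingType.OUTGOING)
                        .build())
                .build();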
See the old version of this question for a further example.
Optimising Queries
Any tips for optimising my queries?
Limit the number of groups you query for using a View - this could result in a big improvement.
When defining filters in your View, try to use the preAggregationFilter for all your filters, as this is run before aggregation and means less work has to be done to aggregate properties that you would later just discard. On Accumulo, postTransformFilters are not distributed; they are computed on a single node, so they can be slow.
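As a sketch (the "count" property and the IsMoreThan threshold are hypothetical):

// Sketch: filtering on a hypothetical "count" property before aggregation,
// so unwanted elements are discarded before any summarisation is done.
final GetElements filtered = new GetElements.Builder()
        .input(new EntitySeed("X"))
        .view(new View.Builder()
                .edge("yourEdge", new ViewElementDefinition.Builder()
                        .preAggregationFilter(new ElementFilter.Builder()
                                .select("count")
                                .execute(new IsMoreThan(10))
                                .build())
                        .build())
                .build())
        .build();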
Some stores (e.g. Accumulo) store the properties in different columns and lazily deserialise a column as properties in that column are requested. So if you limit your filters to a single column, then less data needs to be deserialised. For Accumulo the columns are split up depending on whether the property is a groupBy property. So if you want to execute a time window query and your timestamp is a groupBy property, then depending on the store you are running against this may be optimised. On Accumulo this will be fast as it doesn't need to deserialise the entire Value, just the column qualifier containing your timestamp property.
Also, the order of Predicates in a Filter is important: the predicates are run in the order you provide, so order them so that the most efficient and most selective ones come first and filter out the most data. It is generally more efficient to load/deserialise the groupBy properties than the non-groupBy properties, as there are usually fewer of them. So if your filter applies to two properties, a groupBy and a non-groupBy property, we recommend putting the groupBy property filter first, as that will normally be more efficient.
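As a sketch (assuming a hypothetical groupBy property "timestamp" and non-groupBy property "description"):

// Sketch: the cheap groupBy-property predicate runs first, so most elements
// are discarded before the non-groupBy "description" property has to be
// deserialised. Property names and values are hypothetical.
new ElementFilter.Builder()
        .select("timestamp")
        .execute(new IsMoreThan(1609459200000L))
        .select("description")
        .execute(new Regex(".*error.*"))
        .build();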
When doing queries, if you don't specify any Pre or Post Aggregation filters, that entire filtering stage can be skipped. When running on Accumulo stores this means entire iterators can be skipped, which saves a lot of time. So, if applicable, you will save time by putting all your filtering in either the Pre or the Post section (in some cases this isn't possible).
Gaffer lets you specify validation predicates in your Schema to validate your data when it is added and continuously in the background for age-off. You can optimise this validation by removing any unnecessary validation. You can do most of the validation you require in your ElementGenerator class when you generate your elements. The validation you provide in the Schema should be just the validation that you must have, because it may be run a lot: on Accumulo, it is run in major/minor compactions and for every query. If you can, validate only properties that are in the groupBy; this means the store may not need to deserialise all the other properties just to perform the validation.
How can I optimise the GetAdjacentIds query?
When doing GetAdjacentIds, try to avoid using PostTransformFilters. If you don't specify these, then the final part of the query won't need to deserialise the properties; it can just extract the destination off the edge. Also see the answer above for general query optimisation.
How can I optimise AddElementsFromHdfs?
Try using the SampleDataForSplitPoints and SplitStoreFromFile operations to calculate split points. These can then be used to partition your data in the MapReduce job used to import the data. If adding elements into an empty Accumulo table, or a table without any splits, then the SampleDataForSplitPoints and SplitStoreFromFile operations will be executed automatically for you. You can also optionally provide your own split points for your AddElementsFromHdfs operation.
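A sketch of an explicit chain (all paths and the mapper generator class are placeholders; check the builder methods against your Gaffer version):

// Sketch: sample the input to estimate split points, apply them to the
// store, then run the import using the provided splits. YourMapperGenerator
// is a hypothetical MapperGenerator implementation for your input format.
final OperationChain<Void> addFromHdfs = new OperationChain.Builder()
        .first(new SampleDataForSplitPoints.Builder()
                .addInputMapperPair("/data/input", YourMapperGenerator.class.getName())
                .jobInitialiser(new TextJobInitialiser())
                .outputPath("/data/sample-output")
                .splitsFilePath("/data/splits.txt")
                .build())
        .then(new SplitStoreFromFile.Builder()
                .inputPath("/data/splits.txt")
                .build())
        .then(new AddElementsFromHdfs.Builder()
                .addInputMapperPair("/data/input", YourMapperGenerator.class.getName())
                .jobInitialiser(new TextJobInitialiser())
                .outputPath("/data/output")
                .failurePath("/data/failure")
                .splitsFilePath("/data/splits.txt")
                .useProvidedSplits(true)
                .build())
        .build();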