Class AccumuloStoreRelation

  • All Implemented Interfaces:
    org.apache.spark.sql.sources.PrunedFilteredScan, org.apache.spark.sql.sources.PrunedScan, org.apache.spark.sql.sources.TableScan

    public class AccumuloStoreRelation
    extends org.apache.spark.sql.sources.BaseRelation
    implements org.apache.spark.sql.sources.TableScan, org.apache.spark.sql.sources.PrunedScan, org.apache.spark.sql.sources.PrunedFilteredScan
    Allows Apache Spark to retrieve data from an AccumuloStore as a DataFrame. Spark's Java API does not expose the DataFrame class, but it is just a type alias for a Dataset of Rows. The schema of the DataFrame is formed from the schemas of the groups specified in the view.

    If two of the specified groups have properties with the same name, then the types of those properties must be the same.

    AccumuloStoreRelation implements the TableScan interface, which allows all Elements of the specified groups to be returned to the DataFrame.

    AccumuloStoreRelation implements the PrunedScan interface, which allows Elements of the specified groups to be returned to the DataFrame with only the specified columns included. Currently, AccumuloStore does not support projection of properties in the tablet servers, so this projection is performed within the Spark executors rather than in Accumulo. Once AccumuloStore supports projection in the tablet servers, this will become more efficient.

    AccumuloStoreRelation implements the PrunedFilteredScan interface, which allows only Elements that match the provided Filters to be returned. The majority of these Filters are implemented by adding them to the View, which causes them to be applied on Accumulo's tablet servers (i.e. before the data is sent to a Spark executor). A Filter on either the vertex of an Entity or the source or destination vertex of an Edge is applied using the appropriate range scan on Accumulo, so queries against this DataFrame that use such a Filter should be very quick.
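    As an illustration of how a relation such as this is typically exposed to users as a DataFrame, the sketch below uses Spark's standard SQLContext.baseRelationToDataFrame method. This is a minimal sketch: constructing the AccumuloStoreRelation itself is not covered on this page, so the relation is taken as a parameter rather than built here.

        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SQLContext;
        import org.apache.spark.sql.sources.BaseRelation;

        public final class RelationToDataFrameSketch {
            // Wraps a BaseRelation (such as an AccumuloStoreRelation built elsewhere)
            // in a DataFrame, i.e. a Dataset of Rows, using Spark's standard API.
            public static Dataset<Row> toDataFrame(final SQLContext sqlContext,
                                                   final BaseRelation relation) {
                return sqlContext.baseRelationToDataFrame(relation);
            }
        }

    The returned DataFrame can then be queried with select and filter calls; Spark routes such queries to the buildScan methods described below.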

    • Method Summary

      org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> buildScan()
        Creates a DataFrame of all Elements from the specified groups.
      org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> buildScan(String[] requiredColumns)
        Creates a DataFrame of all Elements from the specified groups, with columns that are not required filtered out.
      org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> buildScan(String[] requiredColumns, org.apache.spark.sql.sources.Filter[] filters)
        Creates a DataFrame of all Elements from the specified groups, with columns that are not required filtered out and with (some of) the supplied Filters applied.
      org.apache.spark.sql.types.StructType schema()
      org.apache.spark.sql.SQLContext sqlContext()
      • Methods inherited from class org.apache.spark.sql.sources.BaseRelation

        needConversion, sizeInBytes, unhandledFilters
    • Method Detail

      • sqlContext

        public org.apache.spark.sql.SQLContext sqlContext()
        Specified by:
        sqlContext in class org.apache.spark.sql.sources.BaseRelation
      • schema

        public org.apache.spark.sql.types.StructType schema()
        Specified by:
        schema in class org.apache.spark.sql.sources.BaseRelation
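        As a small illustrative sketch (not part of this class), the schema of any BaseRelation can be printed to see which columns are available for pruning and filtering; for AccumuloStoreRelation it is built from the schemas of the groups specified in the view.

          import org.apache.spark.sql.sources.BaseRelation;
          import org.apache.spark.sql.types.StructType;

          public final class SchemaSketch {
              // Prints the relation's schema as a tree of column names and types.
              public static void printRelationSchema(final BaseRelation relation) {
                  final StructType schema = relation.schema();
                  System.out.println(schema.treeString());
              }
          }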
      • buildScan

        public org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> buildScan()
        Creates a DataFrame of all Elements from the specified groups.
        Specified by:
        buildScan in interface org.apache.spark.sql.sources.TableScan
        Returns:
        An RDD of Rows containing Elements whose group is in groups.
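        As a hedged sketch of this method's contract, any TableScan can be asked directly for an RDD of all its rows; in normal use Spark calls buildScan() itself during query planning rather than user code invoking it.

          import org.apache.spark.rdd.RDD;
          import org.apache.spark.sql.Row;
          import org.apache.spark.sql.sources.TableScan;

          public final class TableScanSketch {
              // Asks a TableScan (such as an AccumuloStoreRelation) for an RDD
              // containing every Element of the specified groups as a Row.
              public static RDD<Row> allRows(final TableScan relation) {
                  return relation.buildScan();
              }
          }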
      • buildScan

        public org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> buildScan(String[] requiredColumns)
        Creates a DataFrame of all Elements from the specified groups with columns that are not required filtered out.

        Currently this does not push the projection down to the store: it is performed in the Spark transform, whereas ideally it would be implemented in an Accumulo iterator. Issue 320 refers to this.

        Specified by:
        buildScan in interface org.apache.spark.sql.sources.PrunedScan
        Parameters:
        requiredColumns - The columns to return.
        Returns:
        An RDD of Rows containing the requested columns.
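        To illustrate how column pruning reaches this method from the DataFrame API, the sketch below selects a subset of columns; Spark then passes the selected names to the relation as requiredColumns. The column names "group" and "count" are assumptions for illustration only, as the real columns depend on the schemas of the groups in the view.

          import org.apache.spark.sql.Dataset;
          import org.apache.spark.sql.Row;

          public final class ColumnPruningSketch {
              // Selecting a subset of columns lets Spark prune the scan to just
              // those names.  "group" and "count" are example column names; the
              // real names come from the schemas of the view's groups.
              public static Dataset<Row> projectColumns(final Dataset<Row> elements) {
                  return elements.select("group", "count");
              }
          }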
      • buildScan

        public org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> buildScan(String[] requiredColumns,
                                                                            org.apache.spark.sql.sources.Filter[] filters)
        Creates a DataFrame of all Elements from the specified groups with columns that are not required filtered out and with (some of) the supplied Filters applied.

        Note that Spark also applies the provided Filters - applying them here is an optimisation to reduce the amount of data transferred from the store to Spark's executors (this is known as "predicate pushdown").

        Currently this does not push the projection down to the store: it is performed in the Spark transform, whereas ideally it would be implemented in an Accumulo iterator. Issue 320 refers to this.

        Specified by:
        buildScan in interface org.apache.spark.sql.sources.PrunedFilteredScan
        Parameters:
        requiredColumns - The columns to return.
        filters - The Filters to apply (these are applied before aggregation).
        Returns:
        An RDD of Rows containing the requested columns.
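        To illustrate the predicate pushdown described above, the sketch below applies an equality filter to an assumed "vertex" column; such a Filter can be served by a range scan on Accumulo, as noted in the class description. The column name is illustrative only and depends on the schema built from the view's groups.

          import static org.apache.spark.sql.functions.col;

          import org.apache.spark.sql.Dataset;
          import org.apache.spark.sql.Row;

          public final class FilterPushdownSketch {
              // An equality filter on the vertex column can be pushed down to the
              // relation as a Filter and served by an Accumulo range scan.
              public static Dataset<Row> rowsForVertex(final Dataset<Row> elements,
                                                       final String vertexValue) {
                  return elements.filter(col("vertex").equalTo(vertexValue));
              }
          }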