
API Reference

Rules

gchq_data_quality.rules.uniqueness.UniquenessRule

Bases: BaseRule

Rule for assessing uniqueness in a column.

Measures the proportion of unique, non-null values in a specified column. This is useful for checking distinct identifiers or reference keys. Additional null-like values can be specified via na_values.

Attributes:

Name Type Description
field str

Column to evaluate for uniqueness.

na_values Any | list[Any] | None

Values to treat as missing.

data_quality_dimension DamaFramework

Data quality dimension (Uniqueness).

rule_id str | None

Optional rule identifier.

rule_description str | None

Optional description for this rule.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> import pandas as pd
>>> from gchq_data_quality.rules.uniqueness import UniquenessRule
>>> df = pd.DataFrame({'id': [1, 2, 3, 3, None]})

# Basic uniqueness check
>>> rule = UniquenessRule(field='id')
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75

# Specify additional NA values
>>> df = pd.DataFrame({'id': [1, 2, -1, 3, 3]})
>>> rule = UniquenessRule(field='id', na_values=[-1])
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
Note

The pass_rate metric is calculated as (number of unique non-null values) / (number of non-null records). Therefore, if every value in the column appears exactly twice, pass_rate will be 0.5 (not 0.0!). With heavier duplication, pass_rate decreases towards zero as the number of unique values becomes small relative to the total number of records. Only if every record is identical will pass_rate be 1 / N (where N is the number of non-null records).
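The arithmetic can be checked directly with plain pandas (a sketch of the formula above, not the rule's internals):

```python
import pandas as pd

def uniqueness_pass_rate(series: pd.Series) -> float:
    # pass_rate = (number of unique non-null values) / (number of non-null records)
    non_null = series.dropna()
    return non_null.nunique() / len(non_null)

# Every value appears exactly twice -> 3 unique / 6 non-null = 0.5
print(uniqueness_pass_rate(pd.Series([1, 1, 2, 2, 3, 3])))  # 0.5

# Every record identical -> 1 / N
print(uniqueness_pass_rate(pd.Series(['x', 'x', 'x', 'x'])))  # 0.25
```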

Returns:

Name Type Description
DataQualityResult

Contains the uniqueness score (pass_rate), identifiers for failed records,

a sample of duplicate values, the number of records evaluated, and rule metadata.

See DataQualityResult documentation for further attribute details.

gchq_data_quality.rules.completeness.CompletenessRule

Bases: BaseRule

Rule to calculate the completeness score for a field.

Completeness is measured as the proportion of non-null values in the specified column. Values specified in na_values are converted to nulls prior to calculation.

Attributes:

Name Type Description
field str

The column name to assess.

na_values str | list[Any] | None

Additional indicators to treat as missing.

rule_id str | None

Optional identifier for the rule.

rule_description str | None

Optional description of the rule.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates completeness for the chosen field on a Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> rule = CompletenessRule(field="column_name")
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)

>>> rule = CompletenessRule(field="column_name", na_values="missing")
>>> result = rule.evaluate(df)
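The underlying calculation is simple to verify with plain pandas (an illustrative sketch, not the rule's implementation):

```python
import pandas as pd

s = pd.Series(["a", "b", "missing", None])
# Entries listed in na_values are converted to nulls before scoring
s = s.replace("missing", pd.NA)
# completeness = non-null records / total records
pass_rate = s.notna().sum() / len(s)
print(pass_rate)  # 0.5
```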

Returns:

Name Type Description
DataQualityResult

Contains completeness score (pass_rate), field name,

number of records evaluated, and rule metadata. See DataQualityResult documentation

for further attribute details.

gchq_data_quality.rules.accuracy.AccuracyRule

Bases: BaseRule

Rule to check if values meet a list of valid (or invalid) values.

Skips NULLs, including those recognised via na_values. Instantiate this class and call .evaluate(df) to assess data quality for the chosen column.

Attributes:

Name Type Description
field str

The column to check for accuracy.

valid_values list[Any]

The set of acceptable values for the field.

inverse bool

If True, values in valid_values are considered invalid (exclusion list).

na_values str | list[Any] | None

Additional indicators to treat as missing values.

rule_id str | None

Optional identifier for the rule.

rule_description str | None

Optional description of the rule.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> rule = AccuracyRule(field="category", valid_values=["A", "B", "C"])
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> print(result.records_failed_ids)

>>> rule = AccuracyRule(
...     field="department",
...     valid_values=["HR", "IT", "Sales"],
...     na_values=["N/A", "N/K"]
... )
>>> result = rule.evaluate(df)

>>> rule = AccuracyRule(
...     field="status",
...     valid_values=["expired", "deleted"],
...     inverse=True # value must NOT be expired or deleted
... )
>>> result = rule.evaluate(df)
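Because NULLs are skipped, only the non-null records enter the pass-rate denominator. The arithmetic can be sketched outside the rule classes:

```python
import pandas as pd

s = pd.Series(["A", "B", "X", None])
# NULLs are skipped, so only 3 records are evaluated
evaluated = s.dropna()
pass_rate = evaluated.isin(["A", "B", "C"]).sum() / len(evaluated)
print(round(pass_rate, 2))  # 0.67
```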

Returns:

Name Type Description
DataQualityResult

An object containing the accuracy score (pass_rate),

the indices of failed rows (records_failed_ids), a sample of failed values

(records_failed_sample), the number of records evaluated, and further rule metadata.

See DataQualityResult documentation for full details.

gchq_data_quality.rules.consistency.ConsistencyRule

Bases: BaseRule

Rule for evaluating data consistency based on boolean expressions (with an optional condition).

Expressions may use any valid Pandas eval syntax that returns a boolean result. Backticks are required around all column names. Nulls and additional na_values are handled according to the skip policy.

Attributes:

Name Type Description
field str

The column to check for consistency.

expression str | dict[str, str]

A boolean expression, or a conditional {'if', 'then'} dictionary (with backticks for column names).

skip_if_null Literal['all', 'any', 'never']

Controls row skipping for null values in relevant columns.

na_values str | list[Any] | None

Additional values considered as missing.

data_quality_dimension DamaFramework

Associated data quality dimension - you may want to override it in this rule.

rule_id str | None

Optional identifier for the rule.

rule_description str | None

Optional description of the rule.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> rule = ConsistencyRule(
...     field="score",
...     expression="`score` >= 50"
... )
>>> result = rule.evaluate(df)

>>> rule = ConsistencyRule(
...     field="completion_date",
...     expression={"if": "`status` == 'completed'", "then": "`completion_date`.notnull()"},
...     data_quality_dimension='Completeness' # you can override the DAMA Dimension
... )
>>> result = rule.evaluate(df)

# all series .str. methods are available
>>> rule = ConsistencyRule(
...     field="postcode",
...     expression={
...         "if": "`country` == 'UK'",
...         "then": "`postcode`.str.match(r'^[A-Z]{2}[0-9]{2}$')"
...     }
... )
>>> result = rule.evaluate(df)

# Date parts and arithmetic using .dt accessor
>>> rule = ConsistencyRule(
...     field="report_year",
...     expression="`report_date`.dt.year == `report_year`"
... )
>>> result = rule.evaluate(df)

# Boolean logic (AND, OR, NOT) with grouping and comparisons
>>> rule = ConsistencyRule(
...     field="flag",
...     expression="(`score` > 90) & ((`status` == 'active') | ~`is_archived`)"
... )
>>> result = rule.evaluate(df)

# Using mathematical operations
>>> rule = ConsistencyRule(
...     field="predicted",
...     expression="abs(`actual` - `predicted`) < 10"
... )
>>> result = rule.evaluate(df)
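The expression syntax is standard pandas DataFrame.eval, so an expression can be prototyped directly against a DataFrame before wrapping it in a rule (a sketch, independent of the rule classes):

```python
import pandas as pd

df = pd.DataFrame({
    "score": [95, 40],
    "status": ["active", "inactive"],
    "is_archived": [False, True],
})
# Backticks quote column names, exactly as in ConsistencyRule expressions
mask = df.eval("(`score` > 90) & ((`status` == 'active') | ~`is_archived`)")
print(mask.tolist())  # [True, False]
```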

Returns:

Name Type Description
DataQualityResult

An object containing the consistency score (pass_rate),

number of records evaluated, a sample of inconsistent records, and details of failed row indices.

See DataQualityResult documentation for full attribute descriptions.

gchq_data_quality.rules.timeliness.TimelinessRelativeRule

Bases: TimelinessBaseRule

Rule to assess whether datetime values fall between relative time boundaries from a reference date (which can be a static value or come from a column in the data source).

Timedelta bounds are specified for start and end, relative to a reference date or reference column. All datetime comparisons are performed in UTC, with date-only values assumed midnight. Only one of reference_date or reference_column may be provided. If neither is given, current UTC time is used as the reference_date.

Attributes:

Name Type Description
field str

Name of the datetime column to assess.

start_timedelta timedelta | str | int | float | None

Lower offset from the reference.

end_timedelta timedelta | str | int | float | None

Upper offset from the reference.

reference_date str | datetime | Timestamp | None

Fixed reference date/time (UTC).

reference_column str | None

Per-row column providing reference dates/times.

dayfirst bool

If True, parses ALL dates as day/month/year.

na_values str | list[Any] | None

Values treated as missing.

data_quality_dimension DamaFramework

Associated data quality dimension.

rule_id str | None

Optional rule identifier.

rule_description str | None

Optional rule description.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Note

Integer or float values passed as start_timedelta or end_timedelta are interpreted as nanoseconds (the default pandas.to_timedelta() behaviour).
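This default is easy to confirm with pandas directly:

```python
import pandas as pd

# A bare integer is interpreted as nanoseconds...
print(pd.to_timedelta(5))     # 0 days 00:00:00.000000005
# ...so prefer explicit strings or timedelta objects
print(pd.to_timedelta("5d"))  # 5 days 00:00:00
```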

Example
>>> rule = TimelinessRelativeRule(
...     field="event_date",
...     start_timedelta="0d",
...     end_timedelta="30d",
...     reference_date="2024-01-01T00:00:00Z"
... )
>>> result = rule.evaluate(df)

>>> rule = TimelinessRelativeRule(
...     field="booking_date",
...     start_timedelta="-1d",
...     end_timedelta="5d",
...     reference_column="event_date"
... )
>>> result = rule.evaluate(df)

# Require event dates at least 5 days after the reference date
>>> rule = TimelinessRelativeRule(
...     field="event_date",
...     start_timedelta="5d",
...     end_timedelta=None,
...     reference_date="2023-06-01"
... )
>>> result = rule.evaluate(df)

>>> from datetime import timedelta
>>> rule = TimelinessRelativeRule(
...     field="sensor_timestamp",
...     start_timedelta=timedelta(hours=-12),
...     end_timedelta=timedelta(hours=12)
... )
>>> result = rule.evaluate(df)

Returns:

Name Type Description
DataQualityResult

Contains the timeliness score (pass_rate), indices and sample of failed records,

total records evaluated, and metadata. See DataQualityResult documentation for details.

gchq_data_quality.rules.timeliness.TimelinessStaticRule

Bases: TimelinessBaseRule

Rule to check whether datetime values in a column fall between absolute start and end date boundaries (inclusive).

Suitable where both boundaries are fixed or known in advance (e.g., events occurring in January 2024). All dates are treated as, or coerced to, UTC, with date-only strings assumed to be midnight. Invalid or unparsable datetime values are treated as missing. Combine with a validity rule and completeness rule on the same field for the best insights.

Attributes:

Name Type Description
field str

Name of the datetime column to assess.

start_date str | datetime | Timestamp | None

Inclusive lower boundary for valid values.

end_date str | datetime | Timestamp | None

Inclusive upper boundary for valid values.

dayfirst bool

If True, parses ALL dates and rule inputs as day/month/year, otherwise month/day/year.

na_values str | list[Any] | None

Values treated as missing.

data_quality_dimension DamaFramework

Associated data quality dimension. (You may want to override, e.g. perhaps 'Consistency' makes sense for some of these rules)

rule_id str | None

Optional rule identifier.

rule_description str | None

Optional rule description.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> rule = TimelinessStaticRule(
...     field="event_date",
...     start_date="2024-01-01T00:00:00Z",
...     end_date="2024-01-31T23:59:59Z"
... )
>>> result = rule.evaluate(df)

# Only require that dates are on or after 2023-06-01
>>> rule = TimelinessStaticRule(
...     field="date_col",
...     start_date="2023-06-01",
...     end_date=None
... )
>>> result = rule.evaluate(df)

# Using string-based boundaries with day-first format
>>> rule = TimelinessStaticRule(
...     field="timestamp",
...     start_date="01/06/2023",
...     end_date="30/06/2023",
...     dayfirst=True # also assumes dates in field 'timestamp' are dayfirst
... )
>>> result = rule.evaluate(df)

# Using Python datetime objects as boundaries
>>> from datetime import datetime, timezone
>>> rule = TimelinessStaticRule(
...     field="timestamp",
...     start_date=datetime(2023, 6, 1, 0, 0, tzinfo=timezone.utc),
...     end_date=datetime(2023, 6, 30, 23, 59, tzinfo=timezone.utc),
... )
>>> result = rule.evaluate(df)
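The UTC and midnight conventions mirror standard pandas parsing behaviour, which can be checked in isolation (a sketch of the convention, not the rule's code):

```python
import pandas as pd

# Date-only strings are assumed to be midnight; utc=True attaches UTC
ts = pd.to_datetime("2024-01-01", utc=True)
print(ts)  # 2024-01-01 00:00:00+00:00
print(ts.hour, str(ts.tz))  # 0 UTC
```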

Returns:

Name Type Description
DataQualityResult

Contains the timeliness score (pass_rate), indices of failed records,

a sample of those records, and metadata. See DataQualityResult documentation for details.

gchq_data_quality.rules.validity.ValidityNumericalRangeRule

Bases: BaseRule

Rule for validating numerical values against a specified range.

Considers only non-null values; values outside the range or failing coercion to numeric are considered invalid. Diagnostic samples and record indices are returned for values outside the allowed range.

Attributes:

Name Type Description
field str

Column to check for numerical range validity.

min_value float

Minimum allowed value (inclusive; defaults to -infinity).

max_value float

Maximum allowed value (inclusive; defaults to +infinity).

na_values str | list[Any] | None

Additional values to treat as missing.

data_quality_dimension DamaFramework

Data quality dimension (Validity).

rule_id str | None

Optional rule identifier.

rule_description str | None

Optional rule description.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> rule = ValidityNumericalRangeRule(
...     field="age",
...     min_value=0,
...     max_value=120
... )
>>> result = rule.evaluate(df)

# no upper limit
>>> rule = ValidityNumericalRangeRule(
...     field="temp_c",
...     min_value=0,
...     na_values=-999
... )
>>> result = rule.evaluate(df)

# no lower limit
>>> rule = ValidityNumericalRangeRule(
...     field="score",
...     max_value=100,
...     na_values=['missing', 'N/A']
... )
>>> result = rule.evaluate(df)
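The coercion behaviour described above can be illustrated with pd.to_numeric (a sketch, not the rule's implementation):

```python
import pandas as pd

s = pd.Series(["10", "abc", "150"])
# Values that fail numeric coercion become NaN ...
nums = pd.to_numeric(s, errors="coerce")
# ... and fall outside any allowed range, so they count as invalid
in_range = nums.between(0, 120)
print(in_range.tolist())  # [True, False, False]
```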

Returns:

Name Type Description
DataQualityResult

Contains the validity score (pass_rate),

sample and indices of failed records, total records evaluated,

and rule metadata. See DataQualityResult documentation for details.

gchq_data_quality.rules.validity.ValidityRegexRule

Bases: BaseRule

Rule for validating string values against a regular expression.

Considers only non-null entries, with additional missing-value patterns specified via na_values. A diagnostic sample of values failing the regex is returned if present.

Attributes:

Name Type Description
field str

Column to check for regex validity.

regex_pattern str

Regular expression pattern for validation.

na_values str | list[Any] | None

Additional values to treat as missing.

data_quality_dimension DamaFramework

Data quality dimension (Validity) by default.

rule_id str | None

Optional rule identifier.

rule_description str | None

Optional rule description.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation.

Example
>>> rule = ValidityRegexRule(
...     field="email",
...     regex_pattern=r'^[^@]+@[^@]+\.[^@]+$'
... )
>>> result = rule.evaluate(df)

>>> rule = ValidityRegexRule(
...     field="country_code",
...     regex_pattern=r'^[A-Z]{2}$'
... )
>>> result = rule.evaluate(df)
Note

To centrally manage and update regex patterns you can provide a separate YAML file containing named regex patterns (e.g., EMAIL_REGEX, POSTCODE_REGEX). Keys in this file are substituted in your main configuration files wherever referenced, enabling consistent and maintainable regex use.

When storing regex patterns in YAML, always use single quotes ('pattern') rather than double quotes to ensure correct handling of typical regex escape characters, such as \d or \w.

# regex_patterns.yaml
EMAIL_REGEX: '^[^@]+@[^@]+\.[^@]+$'
POSTCODE_REGEX: '^[A-Z]{2}[0-9]{2,3}\s?[0-9][A-Z]{2}$'

# In your DQ config YAML, use the key in place of the regex pattern:
rules:
  - function: validity_regex
    field: email
    regex_pattern: EMAIL_REGEX

# Python code to load with substitution:
>>> from gchq_data_quality.config import DataQualityConfig
>>> dq_config = DataQualityConfig.from_yaml(
...     'your_config.yaml',
...     regex_yaml_path='regex_patterns.yaml'
... )

Returns:

Name Type Description
DataQualityResult

Contains the validity score (pass_rate),

sample and indices of failed records, number of evaluated records,

and rule metadata. See DataQualityResult documentation for details.

Data Quality Configuration and Results

gchq_data_quality.config.DataQualityConfig

Bases: BaseModel

Configuration describing a set of data quality checks to be run on a dataset.

Typically constructed by loading a YAML file specifying the dataset and a list of rule definitions. Can also be created programmatically.

Attributes:

Name Type Description
dataset_name str | None

Dataset name or identifier.

measurement_sample str | None

Description of data sample.

lifecycle_stage str | None

The lifecycle stage at which data is measured.

measurement_time datetime | None

Measurement timestamp.

dataset_id str | int | float | None

Local data catalogue ID.

rules list[RuleType] | None

List of rule models.

Example
# Loading from YAML
config = DataQualityConfig.from_yaml("my_config.yaml")
# See the tutorial for how to specify the YAML file, or create a config
# programmatically and use .to_yaml() to generate a starting point.

# Override regex patterns
config = DataQualityConfig.from_yaml("my_config.yaml", regex_yaml_path='regex_patterns.yaml')

# Running data quality checks
report = config.execute(data_source=my_dataframe)

# Or, creating config programmatically from scratch
config2 = DataQualityConfig(
    dataset_name="my_data",
    rules=[
        ValidityRegexRule(field="email", regex_pattern='.+@example.com'),
    ],
)

Methods:

execute(data_source) -> DataQualityReport:
    Execute the measurement configuration against the provided data source
    (e.g., pandas DataFrame, Spark DataFrame).
    Runs each rule's evaluate() method and returns a DataQualityReport
    containing the results.

from_yaml(file_path: str | Path, regex_yaml_path: str | Path | None = None) -> DataQualityConfig:
    Load a configuration instance from a YAML file. If regex_yaml_path is provided,
    regex patterns in rule definitions can be overridden or supplemented by patterns
    from this separate YAML file.

to_yaml(file_path: str | Path, overwrite: bool = False) -> None:
    Save as YAML file.

from_report(report: DataQualityReport) -> DataQualityConfig:
Create a config instance from report results. This extracts the rule definition from the
rule_data field (which is a JSON dump of all rule metadata).

gchq_data_quality.results.models.DataQualityReport

Bases: DataQualityBaseModel

A collection of individual data quality results for a dataset. This object is typically returned by executing a DataQualityConfig object, rather than instantiated directly by the user.

Attributes:

Name Type Description
results list[DataQualityResult]

List of individual DataQualityResults for each rule applied.

Methods:

Name Description
to_dataframe

Converts report results to a pandas DataFrame for analysis.

to_json

Serialises the report to JSON, optionally saving to file.

from_dataframe

Constructs a DataQualityReport from a pandas DataFrame formatted as in to_dataframe().

Example
config = DataQualityConfig.from_yaml('quality_cfg.yaml')
report = config.execute(df) # <- this is the DataQualityReport object creation step
df_results = report.to_dataframe(decimals=3)
report.to_json('results.json')

gchq_data_quality.results.models.DataQualityResult

Bases: DataQualityBaseModel

Represents the outcome of a single data quality rule applied to a dataset column. Note that some rules, such as ConsistencyRule, may reference additional columns.

Attributes:

Name Type Description
dataset_name float | str | int | None

Common, human-readable name of the measured dataset.

dataset_id float | str | int | None

Machine-readable unique ID for the dataset.

measurement_sample str | None

Description of the sample measured.

lifecycle_stage Any | None

Stage of data lifecycle at the time of measurement (e.g., '01 ingest').

measurement_time UTCDateTimeStrict

UTC timestamp when measurement was taken. Defaults to 'now' in UTC.

field str

Name of the column the rule applies to.

data_quality_dimension DamaFramework

Data quality dimension evaluated (Uniqueness, Completeness, etc.).

records_evaluated int | None

Total records evaluated by this rule.

pass_rate float | None

Ratio (0-1) of passing records to evaluated records.

rule_id Any | None

Local identifier for the applied rule.

rule_description Any

Text, dict, or JSON describing rule parameters and logic.

rule_data str

JSON dump of rule metadata for reconstruction of rule.

records_failed_ids list | None

Up to 10 (default) identifiers for rows failing the rule.

records_failed_sample list[dict] | None

Sample output of failed records for diagnostics.

Example
# Typical user interaction is via DataQualityReport:
config = DataQualityConfig.from_yaml('config.yaml')
report = config.execute(df)
first_result = report.results[0]
print(first_result.pass_rate)  # Access result attributes
Note

Direct construction of DataQualityResult or DataQualityReport is rare; results are typically gathered in production using RuleType.evaluate(df) or DataQualityConfig.execute(data).

Spark Utilities

gchq_data_quality.spark.dataframe_operations.flatten_spark(df, flatten_cols)

Flattens arrays and nested fields in a Spark DataFrame to produce a Spark-safe, single-level table.

The columns to flatten may include array or struct paths, with two array selections:

'[*]' - explodes arrays into multiple rows
'[]' - selects the first non-null element from the array

Parameters:

Name Type Description Default
df DataFrame

The input Spark DataFrame containing nested or array fields.

required
flatten_cols list[str]

List of strings indicating nested columns to flatten. Paths may include array notation (e.g., 'orders[*].item', 'info.details[]').

required

Returns:

Name Type Description
DataFrame DataFrame

A Spark DataFrame with the specified columns flattened and Spark-safe

DataFrame

column names.

Raises:

Type Description
ValueError

If the column paths are inconsistent or not found in the schema, or if array notation is misapplied.

Example

Flatten three levels of orders in a customer DataFrame:

flat_df = flatten_spark(df, [
    "customer[*].orders[*].items[*].productId",
    "customer[*].name"
])
flat_df.show()
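As a rough pandas analogue of the two array selections (illustrative only — flatten_spark itself operates on Spark DataFrames):

```python
import pandas as pd

df = pd.DataFrame({"customer": ["a", "b"], "orders": [[1, 2], [None, 3]]})

# '[*]' -> explode arrays into multiple rows
exploded = df.explode("orders")
print(len(exploded))  # 4

# '[]' -> take the first non-null element of each array
first = df["orders"].apply(lambda xs: next((x for x in xs if x is not None), None))
print(first.tolist())  # [1, 3]
```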

Types and Base Rule

The way we categorise the data quality dimensions

gchq_data_quality.models.DamaFramework

Bases: str, Enum

Allowed names for data quality framework dimensions following DAMA (Data Management Association).

Members

Uniqueness: Value is "Uniqueness".
Completeness: Value is "Completeness".
Validity: Value is "Validity".
Consistency: Value is "Consistency".
Accuracy: Value is "Accuracy".
Timeliness: Value is "Timeliness".

Note

Any string case is accepted, but it will be coerced to title case.

Example
DamaFramework("uniqueness")   # Returns DamaFramework.Uniqueness
DamaFramework.Completeness.value  # "Completeness"
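A minimal sketch of how such case-insensitive coercion can be implemented with a str Enum (an illustration — the class name DamaFrameworkSketch is hypothetical, not the library's):

```python
from enum import Enum

class DamaFrameworkSketch(str, Enum):
    # Hypothetical stand-in illustrating case-insensitive lookup
    Uniqueness = "Uniqueness"
    Completeness = "Completeness"
    Validity = "Validity"
    Consistency = "Consistency"
    Accuracy = "Accuracy"
    Timeliness = "Timeliness"

    @classmethod
    def _missing_(cls, value):
        # Coerce any string case to title case before member lookup
        if isinstance(value, str):
            return cls.__members__.get(value.title())
        return None

print(DamaFrameworkSketch("uniqueness").value)    # "Uniqueness"
print(DamaFrameworkSketch("COMPLETENESS").value)  # "Completeness"
```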

The base rule is never called by a user, but serves as a parent for all data quality rules.

gchq_data_quality.rules.base.BaseRule

Bases: DataQualityBaseModel, ABC

Abstract base class for data quality rule definitions.

Not intended for direct use. Use a subclass with a specific rule type (e.g., AccuracyRule, CompletenessRule) for configuration or execution of data quality checks. BaseRule handles all generic configuration and evaluation steps, with rule-specific logic implemented via subclass overrides.

Attributes:

Name Type Description
field str

Column to check for rule evaluation.

rule_id str | None

Optional identifier for this rule.

rule_description str | None

Optional summary or explanation of the rule.

na_values str | int | float | list[Any] | None

Values to treat as NULL.

skip_if_null Literal['all', 'any', 'never']

Controls what records are skipped due to nulls.

data_quality_dimension DamaFramework

Linked DAMA data quality dimension.

Methods:

Name Description
evaluate

(data_source: pd.DataFrame | SparkDataFrame | Elasticsearch) -> DataQualityResult. Applies the rule to source data and returns evaluation metrics and diagnostics.

Note

This base class should not be instantiated directly. Use a rule subclass for actual configuration or evaluation.

Returns:

Name Type Description
DataQualityResult

Contains metrics of evaluation such as pass rate,

evaluated record count, indices/sample of failed records, and rule metadata.

See DataQualityResult documentation for details.

data_quality_dimension = Field(..., description='The Dama dimension for each rule') class-attribute instance-attribute

field = Field(..., description='Column to check') class-attribute instance-attribute

na_values = Field(default=None, description='Additional values to treat as null') class-attribute instance-attribute

rule_description = Field(default=None, description='Description of the rule') class-attribute instance-attribute

rule_id = Field(default=None, description='Identifier for this rule') class-attribute instance-attribute

skip_if_null = Field(default='any', description="Controls which rows are skipped when they contain null values. If 'all', a row is skipped only if all columns used are NULL. For most rules this applies just to the 'field' column, but some, like TimelinessRelativeRule, can use more than one column. If values aren't skipped, NULL values are passed into the calculations, so be cautious about what you allow through, as 3 > pd.NA evaluates to <NA>.") class-attribute instance-attribute
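The caution at the end is worth demonstrating: comparisons involving pd.NA propagate NA rather than returning False.

```python
import pandas as pd

s = pd.Series([3, pd.NA], dtype="Int64")
result = s > 2
print(result.tolist())  # [True, <NA>]
# An NA in a boolean mask blocks inversion and logical AND / OR,
# which is why unevaluated null rows are normally skipped first
print(pd.isna(result.iloc[1]))  # True
```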

_coerce_dataframe_type(df)

Some rules require values to be coerced to a different data type: Timeliness -> UTC datetime, ValidityNumericalRange -> numeric.

This function handles coercing to the relevant data type for the rule. Override if needed; the default behaviour is no coercion.

Returns:

Type Description
DataFrame

pd.DataFrame: If no coercion, the original df. If coerced, a modified dataframe.

_copy_and_subset_dataframe(df, columns_used)

Copies the dataframe to avoid later mutations when we replace NA values or coerce to a different data type.

Also ensures the dataframe columns are kept in the same order as the original df.

_evaluate_in_elastic(es, index_name, query=None)

_evaluate_in_pandas(df)

Evaluates the rule against the provided DataFrame.

Performs a field existence check, handles NA values and coercion, calculates the number of records evaluated and passing, computes the pass rate, and includes a sample of failed records if required. A subset of the steps below can be overridden to give any inherited rule the desired behaviour, without having to completely override the evaluate() function itself.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to evaluate

required

Returns:

Name Type Description
DataQualityResult DataQualityResult

A summary of data quality metrics for the rule such as

DataQualityResult

records evaluated, pass rate, and details of failed records if required.

_evaluate_in_pandas_output_dataframe(df)

Wrapper to ensure that when executing in Spark we return a DataFrame (a Spark requirement), while maintaining the behaviour that _evaluate_in_pandas returns a DataQualityResult (so we did not want to override that).

Returns:

Type Description
DataFrame

A DataFrame in a format that matches SparkDataQualityResultSchema

_evaluate_in_spark(spark_df)

By default we execute everything in pandas via mapInPandas: Spark partitions the data automatically and sends a pandas DataFrame to each worker, and we then aggregate the resulting data.

_get_columns_used_pandas()

The columns used in evaluating the rule. Defaults to just the field, but other rules such as ConsistencyRule may use more than one column and will override this.

_get_null_count(df, field)

_get_null_counts_all_columns(df)

Goes through each column and calculates the null count.

Returns:

Type Description
dict[str, int]

A dictionary of {column_name : null_count} e.g. {'name' : 7, 'age' : 0}

_get_records_evaluated_mask_pandas(df)

The boolean mask of whether a record is being evaluated. The majority of rules will not evaluate records that are NULL, with the exception of the CompletenessRule, so the default behaviour is to evaluate non-null values.

_get_records_evaluated_pandas(df)

Computes the number of records that are evaluated against the rule.

By default, counts non-null entries in the target field. Override this for rules involving multiple columns or different completeness logic.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to process.

required

Returns:

Name Type Description
int int

The count of records in the field being assessed.

_get_records_failed_mask_pandas(df)

Abstract method to generate a boolean mask for records failing the rule.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to process.

required

Returns:

Type Description
Series

pd.Series: Boolean mask where True indicates a failing record.

_get_records_failed_pandas(df)

Returns a list of unique records from the field that failed the rule.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame instance to process (assumes df has been filtered to just contain required columns).

required

Returns:

Name Type Description
list list[dict]

Unique records from the field corresponding to failed records. In format [{colA : valueA, colB : valueB}, {...etc}]

_get_records_passing_mask_pandas(df) abstractmethod

The boolean mask of which records are passing (i.e. this function is the main way we define our data quality rules). By definition it is also ANDed with the records_evaluated_mask, as we cannot pass a record that has not been evaluated.

_get_records_passing_pandas(df)

Abstract method to compute the number of records passing the data quality rule.

This must be customised for each specific rule.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to process.

required

Returns:

Name Type Description
int int

The count of records passing the rule's criteria.

_get_skip_if_null_mask(df)

Return mask for records to skip based on self.skip_if_null.

_get_spark_safe_rule()

Returns a modified (deep) copy of the rule with Spark-safe column names in every column used to evaluate the rule. This is required when working with nested data: if we want to measure 'customers.age', then after flattening the dataframe and extracting the age property from the 'customers' object, the column is renamed to customers_age before it is passed to _evaluate_in_pandas, because 'customers.age' is not a valid Spark column name once the data is flattened.

This is overridden for each subrule type if more than self.field is used

_handle_dataframe_coercion(df)

Coerce the dataframe to a new datatype (if required). We also check whether the null count changes upon coercion and raise a warning to the user if it does.

_handle_na_values_pandas(df, columns_used, na_values)

Replace specified values in a DataFrame with pd.NA if na_values is provided.

Parameters:

Name Type Description Default
df DataFrame

Input dataframe.

required
columns_used list

Columns to scan for null-like values.

required
na_values list

List of values to consider as missing.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A dataframe where the specified values are replaced with pd.NA, or the original if na_values is None.

_replace_na_in_bool_mask(mask)

If we get None in a boolean mask we cannot conduct mask operations such as inverting it or logical AND / OR, so this replaces None / NA with False.

This method can be overridden by child classes

_require_failed_records_sample(pass_rate)

Determines whether a diagnostic sample of failed records should be collected.

Parameters:

Name Type Description Default
pass_rate float | None

The rule pass rate, or None if no records were evaluated.

required

Returns:

Name Type Description
bool bool

True if failed record samples are required; otherwise, False.

_warn_if_null_counts_different(original_null_counts, new_null_counts)

Compares the null counts between the original and new dataframes (the keys will be the same); if the new has more nulls, raises a warning and names the column. Typically used during coercion to a new datatype.

evaluate(data_source, index_name='', query=None)

evaluate(data_source: pd.DataFrame) -> DataQualityResult
evaluate(data_source: SparkDataFrame) -> DataQualityResult
evaluate(
    data_source: Elasticsearch,
    index_name: str = ...,
    query: dict | None = ...,
) -> DataQualityResult

Evaluates this rule against the provided data source.

Supports both Pandas and Spark DataFrames as input. Applies all rule configuration, handles nulls and data coercion, and computes the relevant data quality metrics. Elasticsearch evaluation is not currently implemented; supplying an Elasticsearch client raises NotImplementedError.

Parameters:

Name Type Description Default
data_source pd.DataFrame | SparkDataFrame | Elasticsearch

The data to evaluate— can be a Pandas DataFrame, a Spark DataFrame, or an Elasticsearch client.

required
index_name str

Required if evaluating with Elasticsearch; the index to check.

''
query dict

Optional when evaluating with Elasticsearch; defaults to a query that matches all documents.

None

Returns:

Name Type Description
DataQualityResult

Contains the metrics and diagnostics of rule evaluation,

including pass rate, number of records evaluated, indices and sample of failed records,

and rule metadata. See DataQualityResult documentation for details.

Raises:

Type Description
ValueError

If an unsupported data source is provided.

NotImplementedError

If Elasticsearch evaluation is requested but not supported.