API Reference
Rules
gchq_data_quality.rules.uniqueness.UniquenessRule
Bases: BaseRule
Rule for assessing uniqueness in a column.
Measures the proportion of unique, non-null values in a specified column. This is
useful for checking distinct identifiers or reference keys. Additional null-like
values can be specified via na_values.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to evaluate for uniqueness. |
| na_values | Any \| list[Any] \| None | Values to treat as missing. |
| data_quality_dimension | DamaFramework | Data quality dimension (Uniqueness). |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional description for this rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> import pandas as pd
>>> from gchq_data_quality.rules.uniqueness import UniquenessRule
>>> df = pd.DataFrame({'id': [1, 2, 3, 3, None]})
# Basic uniqueness check
>>> rule = UniquenessRule(field='id')
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
# Specify additional NA values
>>> rule = UniquenessRule(field='id', na_values=[-1])
>>> df = pd.DataFrame({'id': [1, 2, -1, 3, 3]})
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
Note
The pass_rate metric is calculated as (number of unique values) / (number of non-null records). Therefore, if every value in the column appears exactly twice, pass_rate will be 0.5 (not 0.0!). For columns with even more duplication, pass_rate will decrease and approach zero as the number of unique values becomes small relative to the number of total records. Only if every record is identical will pass_rate be 1 / N (where N is the number of records).
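The arithmetic in this note can be reproduced with plain pandas. A minimal sketch (not the library's internal implementation; the helper name uniqueness_pass_rate is hypothetical):

```python
import pandas as pd

def uniqueness_pass_rate(series: pd.Series) -> float:
    # Uniqueness metric: unique non-null values / non-null records.
    non_null = series.dropna()
    return non_null.nunique() / len(non_null)

s = pd.Series([1, 2, 3, 3, None])
print(uniqueness_pass_rate(s))  # 3 unique values over 4 non-null records -> 0.75
print(uniqueness_pass_rate(pd.Series([1, 1, 2, 2])))  # every value twice -> 0.5
```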
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the uniqueness score (pass_rate), a sample of duplicate values, the number of records evaluated, and rule metadata. See DataQualityResult documentation for further attribute details. |
gchq_data_quality.rules.completeness.CompletenessRule
Bases: BaseRule
Rule to calculate the completeness score for a field.
Completeness is measured as the proportion of non-null values in the specified
column. Values specified in na_values are converted to nulls prior to calculation.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column name to assess. |
| na_values | str \| list[Any] \| None | Additional indicators to treat as missing. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates completeness for the chosen field on a Pandas or Spark DataFrame. Returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = CompletenessRule(field="column_name")
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> rule = CompletenessRule(field="column_name", na_values="missing")
>>> result = rule.evaluate(df)
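As a hedged illustration of the metric (not the library's code; the helper name completeness_pass_rate is hypothetical), completeness can be computed directly in pandas:

```python
import pandas as pd

def completeness_pass_rate(series: pd.Series, na_values=None) -> float:
    # Map any extra null-like values to NA, then take the non-null proportion.
    if na_values is not None:
        series = series.replace(na_values, pd.NA)
    return series.notna().sum() / len(series)

s = pd.Series(["a", "b", "missing", None])
print(completeness_pass_rate(s, na_values=["missing"]))  # 2 of 4 records present -> 0.5
```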
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the completeness score (pass_rate), number of records evaluated, and rule metadata. See DataQualityResult documentation for further attribute details. |
gchq_data_quality.rules.accuracy.AccuracyRule
Bases: BaseRule
Rule to check whether values match a list of valid (or invalid) values.
Skips NULLs, including those recognised via na_values. Instantiate this class and
call .evaluate(df) to assess data quality for the chosen column.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column to check for accuracy. |
| valid_values | list[Any] | The set of acceptable values for the field. |
| inverse | bool | If True, values in valid_values are treated as failures (the field must NOT contain them). |
| na_values | str \| list[Any] \| None | Additional indicators to treat as missing values. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = AccuracyRule(field="category", valid_values=["A", "B", "C"])
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> print(result.records_failed_ids)
>>> rule = AccuracyRule(
... field="department",
... valid_values=["HR", "IT", "Sales"],
... na_values=["N/A", "N/K"]
... )
>>> result = rule.evaluate(df)
>>> rule = AccuracyRule(
... field="status",
... valid_values=["expired", "deleted"],
... inverse=True # value must NOT be expired or deleted
... )
>>> result = rule.evaluate(df)
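The pass/fail logic described above can be sketched in plain pandas (an illustration, not the library's implementation; the helper name accuracy_pass_rate is hypothetical):

```python
import pandas as pd

def accuracy_pass_rate(series: pd.Series, valid_values, inverse: bool = False) -> float:
    # Nulls are skipped, mirroring the documented behaviour.
    non_null = series.dropna()
    hits = non_null.isin(valid_values)
    passing = ~hits if inverse else hits
    return passing.sum() / len(non_null)

s = pd.Series(["A", "B", "X", None])
print(accuracy_pass_rate(s, ["A", "B", "C"]))      # "A" and "B" pass out of 3 evaluated
print(accuracy_pass_rate(s, ["X"], inverse=True))  # values must NOT be "X"
```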
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | An object containing the accuracy score (pass_rate), the indices of failed rows (records_failed_ids), a sample of failed values (records_failed_sample), the number of records evaluated, and further rule metadata. See DataQualityResult documentation for full details. |
gchq_data_quality.rules.consistency.ConsistencyRule
Bases: BaseRule
Rule for evaluating data consistency based on boolean expressions (with an optional condition).
Expressions may use any valid Pandas eval syntax that returns a boolean result. Backticks are required around all column names. Nulls and additional na_values are handled according to the skip policy.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column to check for consistency. |
| expression | str \| dict[str, str] | A boolean expression, or a conditional {'if', 'then'} dictionary (with backticks for column names). |
| skip_if_null | Literal['all', 'any', 'never'] | Controls row skipping for null values in relevant columns. |
| na_values | str \| list[Any] \| None | Additional values considered as missing. |
| data_quality_dimension | DamaFramework | Associated data quality dimension; you may want to override it in this rule. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ConsistencyRule(
... field="score",
... expression="`score` >= 50"
... )
>>> result = rule.evaluate(df)
>>> rule = ConsistencyRule(
... field="completion_date",
... expression={"if": "`status` == 'completed'", "then": "`completion_date`.notnull()"},
... data_quality_dimension='Completeness'  # you can override the DAMA Dimension
... )
>>> result = rule.evaluate(df)
# all series .str. methods are available
>>> rule = ConsistencyRule(
... field="postcode",
... expression={
... "if": "`country` == 'UK'",
... "then": "`postcode`.str.match(r'^[A-Z]{2}[0-9]{2}$')"
... }
... )
>>> result = rule.evaluate(df)
# Date parts and arithmetic using .dt accessor
>>> rule = ConsistencyRule(
... field="report_year",
... expression="`report_date`.dt.year == `report_year`"
... )
>>> result = rule.evaluate(df)
# Boolean logic (AND, OR, NOT) with grouping and comparisons
>>> rule = ConsistencyRule(
... field="flag",
... expression="(`score` > 90) & ((`status` == 'active') | ~`is_archived`)"
... )
>>> result = rule.evaluate(df)
# Using mathematical operations
>>> rule = ConsistencyRule(
... field="predicted",
... expression="abs(`actual` - `predicted`) < 10"
... )
>>> result = rule.evaluate(df)
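The expressions above are standard pandas eval syntax; this is a minimal standalone sketch of applying such an expression outside the library, on hypothetical data:

```python
import pandas as pd

# The same backtick eval syntax used by the rule expressions above.
df = pd.DataFrame({"score": [40, 60, 95], "status": ["active", "closed", "active"]})
mask = df.eval("(`score` >= 50) & (`status` == 'active')")
print(mask.mean())  # proportion of rows satisfying the expression
```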
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | An object containing the consistency score (pass_rate), number of records evaluated, a sample of inconsistent records, and details of failed row indices. See DataQualityResult documentation for full attribute descriptions. |
gchq_data_quality.rules.timeliness.TimelinessRelativeRule
Bases: TimelinessBaseRule
Rule to assess whether datetime values fall between relative time boundaries from a reference date (which can be a static value or come from a column in the data source).
Timedelta bounds are specified for start and end, relative to a reference date or reference column.
All datetime comparisons are performed in UTC, with date-only values assumed midnight.
Only one of reference_date or reference_column may be provided. If neither is given,
current UTC time is used as the reference_date.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Name of the datetime column to assess. |
| start_timedelta | timedelta \| str \| int \| float \| None | Lower offset from the reference. |
| end_timedelta | timedelta \| str \| int \| float \| None | Upper offset from the reference. |
| reference_date | str \| datetime \| Timestamp \| None | Fixed reference date/time (UTC). |
| reference_column | str \| None | Per-row column providing reference dates/times. |
| dayfirst | bool | If True, parses ALL dates as day/month/year. |
| na_values | str \| list[Any] \| None | Values treated as missing. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Note
Integer or float values passed as start_timedelta or end_timedelta are interpreted as nanoseconds (the default pandas.to_timedelta() behaviour).
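This default is easy to verify directly with pandas:

```python
import pandas as pd

# A bare integer is nanoseconds, not days:
print(pd.to_timedelta(5))     # a Timedelta of 5 nanoseconds
print(pd.to_timedelta("5d"))  # a Timedelta of 5 days
```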
Example
>>> rule = TimelinessRelativeRule(
... field="event_date",
... start_timedelta="0d",
... end_timedelta="30d",
... reference_date="2024-01-01T00:00:00Z"
... )
>>> result = rule.evaluate(df)
>>> rule = TimelinessRelativeRule(
... field="booking_date",
... start_timedelta="-1d",
... end_timedelta="5d",
... reference_column="event_date"
... )
>>> result = rule.evaluate(df)
# Require event dates at least 5 days after the reference date
>>> rule = TimelinessRelativeRule(
... field="event_date",
... start_timedelta="5d",
... end_timedelta=None,
... reference_date="2023-06-01"
... )
>>> result = rule.evaluate(df)
>>> from datetime import timedelta
>>> rule = TimelinessRelativeRule(
... field="sensor_timestamp",
... start_timedelta=timedelta(hours=-12),
... end_timedelta=timedelta(hours=12)
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the timeliness score (pass_rate), total records evaluated, and metadata. See DataQualityResult documentation for details. |
gchq_data_quality.rules.timeliness.TimelinessStaticRule
Bases: TimelinessBaseRule
Rule to check whether datetime values in a column fall between absolute start and end date boundaries (inclusive).
Suitable where both boundaries are fixed or known in advance (e.g., events occurring in January 2024). All dates are treated as, or coerced to, UTC, with date-only strings assumed to be midnight. Invalid or unparsable datetime values are treated as missing. Combine with a validity rule and completeness rule on the same field for the best insights.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Name of the datetime column to assess. |
| start_date | str \| datetime \| Timestamp \| None | Inclusive lower boundary for valid values. |
| end_date | str \| datetime \| Timestamp \| None | Inclusive upper boundary for valid values. |
| dayfirst | bool | If True, parses ALL dates and rule inputs as day/month/year, otherwise month/day/year. |
| na_values | str \| list[Any] \| None | Values treated as missing. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. (You may want to override this; e.g. 'Consistency' may make sense for some of these rules.) |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = TimelinessStaticRule(
... field="event_date",
... start_date="2024-01-01T00:00:00Z",
... end_date="2024-01-31T23:59:59Z"
... )
>>> result = rule.evaluate(df)
# Only require that dates are on or after 2023-06-01
>>> rule = TimelinessStaticRule(
... field="date_col",
... start_date="2023-06-01",
... end_date=None
... )
>>> result = rule.evaluate(df)
# Using string-based boundaries with day-first format
>>> rule = TimelinessStaticRule(
... field="timestamp",
... start_date="01/06/2023",
... end_date="30/06/2023",
... dayfirst=True # also assumes dates in field 'timestamp' are dayfirst
... )
>>> result = rule.evaluate(df)
# Using Python datetime objects as boundaries
>>> from datetime import datetime, timezone
>>> rule = TimelinessStaticRule(
... field="timestamp",
... start_date=datetime(2023, 6, 1, 0, 0, tzinfo=timezone.utc),
... end_date=datetime(2023, 6, 30, 23, 59, tzinfo=timezone.utc),
... )
>>> result = rule.evaluate(df)
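A hedged sketch of the underlying comparison (not the library's code, using a hypothetical column): coerce to UTC, with invalid values becoming NaT, then apply inclusive bounds:

```python
import pandas as pd

# Invalid datetimes are coerced to NaT and fail the check, matching the
# documented "treated as missing" behaviour.
s = pd.to_datetime(
    pd.Series(["2024-01-15", "2024-02-01", "not a date"]),
    errors="coerce", utc=True,
)
start = pd.Timestamp("2024-01-01", tz="UTC")
end = pd.Timestamp("2024-01-31T23:59:59", tz="UTC")
mask = s.between(start, end)  # inclusive on both ends
print(mask.tolist())
```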
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the timeliness score (pass_rate), a sample of failing records, and metadata. See DataQualityResult documentation for details. |
gchq_data_quality.rules.validity.ValidityNumericalRangeRule
Bases: BaseRule
Rule for validating numerical values against a specified range.
Considers only non-null values; values outside the range or failing coercion to numeric are considered invalid. Diagnostic samples and record indices are returned for values outside the allowed range.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for numerical range validity. |
| min_value | float | Minimum allowed value (inclusive; defaults to -infinity). |
| max_value | float | Maximum allowed value (inclusive; defaults to +infinity). |
| na_values | str \| list[Any] \| None | Additional values to treat as missing. |
| data_quality_dimension | DamaFramework | Data quality dimension (Validity). |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValidityNumericalRangeRule(
... field="age",
... min_value=0,
... max_value=120
... )
>>> result = rule.evaluate(df)
# no upper limit
>>> rule = ValidityNumericalRangeRule(
... field="temp_c",
... min_value=0,
... na_values=-999
... )
>>> result = rule.evaluate(df)
# no lower limit
>>> rule = ValidityNumericalRangeRule(
... field="score",
... max_value=100,
... na_values=['missing', 'N/A']
... )
>>> result = rule.evaluate(df)
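As an illustration of the coercion behaviour described above (a sketch on hypothetical data, not the library's implementation):

```python
import pandas as pd

# "abc" fails numeric coercion and counts as invalid, while None is skipped.
s = pd.Series(["25", "130", "abc", None])
numeric = pd.to_numeric(s, errors="coerce")
evaluated = s.notna()                    # 3 records evaluated
passing = numeric.between(0, 120) & evaluated
print(passing.sum() / evaluated.sum())   # 1 of 3 values lies within [0, 120]
```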
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the validity score (pass_rate), sample and indices of failed records, total records evaluated, and rule metadata. See DataQualityResult documentation for details. |
gchq_data_quality.rules.validity.ValidityRegexRule
Bases: BaseRule
Rule for validating string values against a regular expression.
Considers only non-null entries, with additional missing-value patterns specified via
na_values. A diagnostic sample of values failing the regex is returned if present.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for regex validity. |
| regex_pattern | str | Regular expression pattern for validation. |
| na_values | str \| list[Any] \| None | Additional values to treat as missing. |
| data_quality_dimension | DamaFramework | Data quality dimension (Validity) by default. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValidityRegexRule(
... field="email",
... regex_pattern=r'^[^@]+@[^@]+\.[^@]+$'
... )
>>> result = rule.evaluate(df)
>>> rule = ValidityRegexRule(
... field="country_code",
... regex_pattern=r'^[A-Z]{2}$'
... )
>>> result = rule.evaluate(df)
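The check itself is ordinary pandas string matching. A standalone sketch on hypothetical data (the library may use str.match with explicit anchors; str.fullmatch is used here as the unanchored equivalent):

```python
import pandas as pd

s = pd.Series(["GB", "us", None, "FRA"])
non_null = s.dropna()  # nulls are not evaluated
matches = non_null.str.fullmatch(r"[A-Z]{2}")
print(matches.sum() / len(non_null))  # only "GB" matches the two-letter pattern
```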
Note
To centrally manage and update regex patterns you can provide a separate YAML file containing named regex patterns (e.g., EMAIL_REGEX, POSTCODE_REGEX). Keys in this file are substituted in your main configuration files wherever referenced, enabling consistent and maintainable regex use.
When storing regex patterns in YAML, always use single quotes ('pattern') rather than double quotes to ensure correct handling of typical regex escape characters, such as \d or \w.
# regex_patterns.yaml
EMAIL_REGEX: '^[^@]+@[^@]+\.[^@]+$'
POSTCODE_REGEX: '^[A-Z]{2}[0-9]{2,3}\s?[0-9][A-Z]{2}$'
# In your DQ config YAML, use the key in place of the regex pattern:
rules:
- function: validity_regex
field: email
regex_pattern: EMAIL_REGEX
# Python code to load with substitution:
>>> from gchq_data_quality.config import DataQualityConfig
>>> dq_config = DataQualityConfig.from_yaml(
... 'your_config.yaml',
... regex_yaml_path='regex_patterns.yaml'
... )
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the validity score (pass_rate), sample and indices of failed records, number of evaluated records, and rule metadata. See DataQualityResult documentation for details. |
Data Quality Configuration and Results
gchq_data_quality.config.DataQualityConfig
Bases: BaseModel
Configuration describing a set of data quality checks to be run on a dataset.
Typically constructed by loading a YAML file specifying the dataset and a list of rule definitions. Can also be created programmatically.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | str \| None | Dataset name or identifier. |
| measurement_sample | str \| None | Description of data sample. |
| lifecycle_stage | str \| None | The lifecycle stage at which data is measured. |
| measurement_time | datetime \| None | Measurement timestamp. |
| dataset_id | str \| int \| float \| None | Local data catalogue ID. |
| rules | list[RuleType] \| None | List of rule models. |
Example
# Loading from YAML
config = DataQualityConfig.from_yaml("my_config.yaml")
# See the tutorial for how to specify the YAML file, or create a config
# programmatically and use .to_yaml() to generate a starting point.
# Override regex patterns
config = DataQualityConfig.from_yaml("my_config.yaml", regex_yaml_path='regex_patterns.yaml')
# Running data quality checks
report = config.execute(data_source=my_dataframe)
# Or, creating config programmatically from scratch
config2 = DataQualityConfig(
dataset_name="my_data",
rules=[
ValidityRegexRule(field="email", regex_pattern='.+@example.com'),
],
)
Methods:
execute(data_source) -> DataQualityReport:
Execute the measurement configuration against the provided data source
(e.g., pandas DataFrame, Spark DataFrame).
Runs each rule's evaluate() method and returns a DataQualityReport
containing the results.
from_yaml(file_path: str | Path, regex_yaml_path: str | Path | None = None) -> DataQualityConfig:
Load a configuration instance from a YAML file. If regex_yaml_path is provided,
regex patterns in rule definitions can be overridden or supplemented by patterns
from this separate YAML file.
to_yaml(file_path: str | Path, overwrite: bool = False) -> None:
Save as YAML file.
from_report(report: DataQualityReport) -> DataQualityConfig:
Create a config instance from report results. This extracts the rule definition from the
rule_data field (which is a JSON dump of all rule metadata).
gchq_data_quality.results.models.DataQualityReport
Bases: DataQualityBaseModel
A collection of individual data quality results for a dataset. This object is typically returned by executing a DataQualityConfig object, rather than instantiated directly by the user.
Attributes:
| Name | Type | Description |
|---|---|---|
results |
list[DataQualityResult]
|
List of individual DataQualityResults for each rule applied. |
Methods:
| Name | Description |
|---|---|
to_dataframe |
Converts report results to a pandas DataFrame for analysis. |
to_json |
Serialises the report to JSON, optionally saving to file. |
from_dataframe |
Constructs a DataQualityReport from a pandas DataFrame formatted as in to_dataframe(). |
Example
config = DataQualityConfig.from_yaml('quality_cfg.yaml')
report = config.execute(df) # <- this is the DataQualityReport object creation step
df_results = report.to_dataframe(decimals=3)
report.to_json('results.json')
gchq_data_quality.results.models.DataQualityResult
Bases: DataQualityBaseModel
Represents the outcome of a single data quality rule applied to a dataset column. Note that some rules, such as ConsistencyRule, may reference additional columns.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | float \| str \| int \| None | Common, human-readable name of the measured dataset. |
| dataset_id | float \| str \| int \| None | Machine-readable unique ID for the dataset. |
| measurement_sample | str \| None | Description of the sample measured. |
| lifecycle_stage | Any \| None | Stage of data lifecycle at the time of measurement (e.g., '01 ingest'). |
| measurement_time | UTCDateTimeStrict | UTC timestamp when measurement was taken. Defaults to 'now' in UTC. |
| field | str | Name of the column the rule applies to. |
| data_quality_dimension | DamaFramework | Data quality dimension evaluated (Uniqueness, Completeness, etc.). |
| records_evaluated | int \| None | Total records evaluated by this rule. |
| pass_rate | float \| None | Ratio (0-1) of passing records to evaluated records. |
| rule_id | Any \| None | Local identifier for the applied rule. |
| rule_description | Any | Text, dict, or JSON describing rule parameters and logic. |
| rule_data | str | JSON dump of rule metadata for reconstruction of the rule. |
| records_failed_ids | list \| None | Up to 10 (default) identifiers for rows failing the rule. |
| records_failed_sample | list[dict] \| None | Sample output of failed records for diagnostics. |
Example
# Typical user interaction is via DataQualityReport:
config = DataQualityConfig.from_yaml('config.yaml')
report = config.execute(df)
first_result = report.results[0]
print(first_result.pass_rate) # Access result attributes
Note
Direct construction of DataQualityResult or DataQualityReport is rare; results are typically gathered in production using RuleType.evaluate(df) or DataQualityConfig.execute(data).
Spark Utilities
gchq_data_quality.spark.dataframe_operations.flatten_spark(df, flatten_cols)
Flattens arrays and nested fields in a Spark DataFrame to produce a Spark-safe, single-level table.
The columns to flatten may include array or struct paths, with two array selections: '[*]' explodes arrays into multiple rows; '[]' selects the first non-null element from the array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input Spark DataFrame containing nested or array fields. | required |
| flatten_cols | list[str] | List of strings indicating nested columns to flatten. Paths may include array notation (e.g., 'orders[*].item', 'info.details[]'). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | A Spark DataFrame with the specified columns flattened and Spark-safe column names. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the column paths are inconsistent or not found in the schema, or if array notation is misapplied. |
Example
Flatten three levels of orders in a customer DataFrame:
flat_df = flatten_spark(df, [
"customer[*].orders[*].items[*].productId",
"customer[*].name"
])
flat_df.show()
Types and Base Rule
The way we categorise the data quality dimensions
gchq_data_quality.models.DamaFramework
Bases: str, Enum
Allowed names for data quality framework dimensions following DAMA (Data Management Association).
Members
Uniqueness: "Uniqueness"
Completeness: "Completeness"
Validity: "Validity"
Consistency: "Consistency"
Accuracy: "Accuracy"
Timeliness: "Timeliness"
Note
It will accept any string case, but coerce to title case.
Example
DamaFramework("uniqueness") # Returns DamaFramework.Uniqueness
DamaFramework.Completeness.value # "Completeness"
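The case-coercion described in the note can be sketched with a plain str Enum. This is an illustrative reimplementation to show the mechanism, not the library's source:

```python
from enum import Enum

class DamaFramework(str, Enum):
    # Sketch of the documented enum members.
    Uniqueness = "Uniqueness"
    Completeness = "Completeness"
    Validity = "Validity"
    Consistency = "Consistency"
    Accuracy = "Accuracy"
    Timeliness = "Timeliness"

    @classmethod
    def _missing_(cls, value):
        # Accept any string case by coercing to title case,
        # e.g. "uniqueness" -> "Uniqueness".
        if isinstance(value, str):
            return cls.__members__.get(value.title())
        return None

assert DamaFramework("uniqueness") is DamaFramework.Uniqueness
assert DamaFramework.Completeness.value == "Completeness"
```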
The base rule is never called by a user, but serves as a parent for all data quality rules.
gchq_data_quality.rules.base.BaseRule
Bases: DataQualityBaseModel, ABC
Abstract base class for data quality rule definitions.
Not intended for direct use. Use a subclass with a specific rule type (e.g., AccuracyRule, CompletenessRule) for configuration or execution of data quality checks. BaseRule handles all generic configuration and evaluation steps, with rule-specific logic implemented via subclass overrides.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for rule evaluation. |
| rule_id | str \| None | Optional identifier for this rule. |
| rule_description | str \| None | Optional summary or explanation of the rule. |
| na_values | str \| int \| float \| list[Any] \| None | Values to treat as NULL. |
| skip_if_null | Literal['all', 'any', 'never'] | Controls which records are skipped due to nulls. |
| data_quality_dimension | DamaFramework | Linked DAMA data quality dimension. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame \| Elasticsearch) -> DataQualityResult. Applies the rule to source data and returns evaluation metrics and diagnostics. |
Note
This base class should not be instantiated directly. Use a rule subclass for actual configuration or evaluation.
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains metrics of evaluation such as pass rate, evaluated record count, indices/sample of failed records, and rule metadata. See DataQualityResult documentation for details. |
data_quality_dimension = Field(..., description='The Dama dimension for each rule')
field = Field(..., description='Column to check')
na_values = Field(default=None, description='Additional values to treat as null')
rule_description = Field(default=None, description='Description of the rule')
rule_id = Field(default=None, description='Identifier for this rule')
skip_if_null = Field(default='any', description="Controls which rows containing null values are skipped. If 'all', a row is skipped only when all columns used are NULL. For most rules this applies just to the 'field' column, but some, like TimelinessRelativeRule, can use more than one column. If values aren't skipped, NULL values are passed into the calculations, so be cautious about what you allow through, as 3 > pd.NA evaluates to <NA>.")
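The 3 > pd.NA caveat in the skip_if_null description can be demonstrated directly:

```python
import pandas as pd

# Comparisons against pd.NA propagate NA instead of returning False,
# so an unskipped null poisons a pass/fail boolean mask.
print(3 > pd.NA)  # <NA>
```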
_coerce_dataframe_type(df)
Some rules require values to be coerced to a different data type: Timeliness rules coerce to UTC datetime, ValidityNumericalRange to numeric.
This function handles coercing to the relevant data type for the rule. Override if needed; the default behaviour is no coercion.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | If no coercion, the original df; if coerced, a modified dataframe. |
_copy_and_subset_dataframe(df, columns_used)
Copies the dataframe to avoid later mutations when we replace NA values or coerce to a different data type.
Also ensures the dataframe columns are kept in the same order as the original df.
_evaluate_in_elastic(es, index_name, query=None)
_evaluate_in_pandas(df)
Evaluates the rule against the provided DataFrame.
Performs the field existence check, handles NA values and coercion, calculates the number of records evaluated and passing, computes the pass rate, and includes a sample of failed records if required. A subset of the steps below can be overridden to give an inherited rule the desired behaviour without having to completely override the evaluate() function itself.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to evaluate. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | A summary of data quality metrics for the rule, such as records evaluated, pass rate, and details of failed records if required. |
_evaluate_in_pandas_output_dataframe(df)
Wrapper to ensure that when executing in Spark we return a DataFrame (a Spark requirement), while maintaining the behaviour that _evaluate_in_pandas returns a DataQualityResult (so we did not want to override that).
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame in a format that matches SparkDataQualityResultSchema. |
_evaluate_in_spark(spark_df)
By default we execute everything in pandas via mapInPandas; this partitions the data automatically and sends DataFrames to each Spark worker, and we then aggregate the resulting data.
_get_columns_used_pandas()
The columns used in evaluating the rule. Defaults to just the field, but other rules, such as consistency, may use more than one column and will override this.
_get_null_count(df, field)
_get_null_counts_all_columns(df)
Goes through each column and calculates the null count.
Returns:
| Type | Description |
|---|---|
| dict[str, int] | A dictionary of {column_name: null_count}, e.g. {'name': 7, 'age': 0}. |
_get_records_evaluated_mask_pandas(df)
The boolean mask of whether a record is being evaluated. The majority of rules do not evaluate records that are NULL, with the exception of the CompletenessRule, so the default behaviour is to evaluate non-null values.
_get_records_evaluated_pandas(df)
Computes the number of records that are evaluated against the rule.
By default, counts non-null entries in the target field. Override this for rules involving multiple columns or different completeness logic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to process. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | The count of records in the field being assessed. |
_get_records_failed_mask_pandas(df)
Abstract method to generate a boolean mask for records failing the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to process. | required |
Returns:
| Type | Description |
|---|---|
| pd.Series | Boolean mask where True indicates a failing record. |
_get_records_failed_pandas(df)
Returns a list of unique records from the field that failed the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame instance to process (assumes df has been filtered to just the required columns). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| list | list[dict] | Unique records from the field corresponding to failed records, in the format [{colA: valueA, colB: valueB}, ...]. |
_get_records_passing_mask_pandas(df)
abstractmethod
The boolean mask of which records are passing (i.e. this function is the main way we define our data quality rules). By definition this is ANDed with the records_evaluated_mask, as a record cannot pass if it has not been evaluated.
_get_records_passing_pandas(df)
Abstract method to compute the number of records passing the data quality rule.
This must be customised for each specific rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to process. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | The count of records passing the rule's criteria. |
_get_skip_if_null_mask(df)
Return mask for records to skip based on self.skip_if_null.
_get_spark_safe_rule()
Returns a modified (deep copy) of the rule with Spark-safe column names for any column used to evaluate the rule. This is required when working with nested data: if we want to measure 'customers.age', then after we flatten the dataframe and extract the age property from the 'customers' object, the column is renamed to customers_age before being passed to _evaluate_in_pandas, because 'customers.age' is not a valid Spark column name once the data is flattened.
This is overridden for each subrule type if more than self.field is used.
_handle_dataframe_coercion(df)
Coerce the dataframe to a new datatype (if required). We also check whether the null count changes upon coercion and raise a warning for the user if it does.
_handle_na_values_pandas(df, columns_used, na_values)
Replace specified values in a DataFrame with pd.NA if na_values is provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input dataframe. | required |
| columns_used | list | Columns to scan for null-like values. | required |
| na_values | list | List of values to consider as missing. | required |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | A dataframe where the specified values are replaced with pd.NA, or the original if na_values is None. |
_replace_na_in_bool_mask(mask)
If we get None in a boolean mask we can't conduct mask operations such as inverting it or logical AND / OR, so this replaces None / NA with False.
This method can be overridden by child classes
_require_failed_records_sample(pass_rate)
Determines whether a diagnostic sample of failed records should be collected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pass_rate | float \| None | The rule pass rate, or None if no records were evaluated. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if failed record samples are required; otherwise, False. |
_warn_if_null_counts_different(original_null_counts, new_null_counts)
Compares the null counts between the original and new dataframes (the keys will be the same); if the new one has more nulls, raises a warning naming the column. Typically used during coercion to a new datatype.
evaluate(data_source, index_name='', query=None)
evaluate(data_source: pd.DataFrame) -> DataQualityResult
evaluate(data_source: SparkDataFrame) -> DataQualityResult
evaluate(
data_source: Elasticsearch,
index_name: str = ...,
query: dict | None = ...,
) -> DataQualityResult
Evaluates this rule against the provided data source.
Supports both Pandas and Spark DataFrames as input. Applies all rule configuration, handles nulls and data coercion, and computes the relevant data quality metrics. Elasticsearch evaluation is not currently implemented; supplying an Elasticsearch client raises NotImplementedError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_source | pd.DataFrame \| SparkDataFrame \| Elasticsearch | The data to evaluate: a Pandas DataFrame, a Spark DataFrame, or an Elasticsearch client. | required |
| index_name | str | Required if evaluating with Elasticsearch; the index to check. | '' |
| query | dict | Required if evaluating with Elasticsearch; defaults to a query that matches all documents. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the metrics and diagnostics of rule evaluation, including pass rate, number of records evaluated, indices and sample of failed records, and rule metadata. See DataQualityResult documentation for details. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an unsupported data source is provided. |
| NotImplementedError | If Elasticsearch evaluation is requested but not supported. |