API Reference
Rules
gchq_data_quality.rules.uniqueness.ValuesAreUnique
Bases: UniquenessBaseRule
Rule for assessing uniqueness in a column.
Preferred alias for UniquenessRule. Measures the proportion of unique, non-null
values in a specified column. This is useful for checking distinct identifiers or
reference keys. Additional null-like values can be specified via na_values.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to evaluate for uniqueness. |
| na_values | Any \| list[Any] \| None | Values to treat as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Must evaluate to bool and use backticks around column names. |
| data_quality_dimension | DamaFramework | Data quality dimension (Uniqueness). |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional description for this rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> import pandas as pd
>>> from gchq_data_quality import ValuesAreUnique
>>> df = pd.DataFrame({'id': [1, 2, 3, 3, None]})
# Basic uniqueness check
>>> rule = ValuesAreUnique(field='id')
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
# Specify additional NA values
>>> rule = ValuesAreUnique(field='id', na_values=[-1])
>>> df = pd.DataFrame({'id': [1, 2, -1, 3, 3]})
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
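# Evaluate only a filtered subset of rows (the DataFrame needs a 'region' column)
>>> df = pd.DataFrame({'id': [1, 2, 2, 3], 'region': ['UK', 'UK', 'FR', 'UK']})
>>> rule = ValuesAreUnique(
... field='id',
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)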
Note
The pass_rate metric is calculated as (number of unique values) / (number of non-null records). Therefore, if every value in the column appears exactly twice, pass_rate will be 0.5 (not 0.0!). For columns with even more duplication, pass_rate will decrease and approach zero as the number of unique values becomes small relative to the number of total records. Only if every record is identical will pass_rate be 1 / N (where N is the number of records).
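The formula in the note can be checked by hand with a small sketch that does not use the library. The helper `uniqueness_pass_rate` is hypothetical (it is not part of gchq_data_quality); it treats None and any supplied na_values as null, and returning None for an empty column is an assumption.

```python
def uniqueness_pass_rate(values, na_values=()):
    """Hypothetical helper: (number of unique values) / (number of non-null records)."""
    non_null = [v for v in values if v is not None and v not in na_values]
    if not non_null:
        return None  # assumption: nothing to evaluate means no score
    return len(set(non_null)) / len(non_null)


print(uniqueness_pass_rate([1, 2, 3, 3, None]))                # 0.75
print(uniqueness_pass_rate([1, 2, -1, 3, 3], na_values=[-1]))  # 0.75
print(uniqueness_pass_rate([1, 1, 2, 2, 3, 3]))                # 0.5 (every value twice)
print(uniqueness_pass_rate(["a"] * 4))                         # 0.25 (1 / N with N = 4)
```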
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains the uniqueness score (pass_rate), a sample of duplicate values, the number of records evaluated, and rule metadata. See DataQualityResult documentation for further attribute details. |
gchq_data_quality.rules.completeness.ValuesAreComplete
Bases: CompletenessBaseRule
Rule to calculate the completeness score for a field.
Preferred alias for CompletenessRule. Completeness is measured as the proportion
of non-null values in the specified column. Values specified in na_values are
converted to nulls prior to calculation.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column name to assess. |
| na_values | str \| list[Any] \| None | Additional indicators to treat as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Use backticks around column names. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates completeness for the chosen field on a pandas or Spark DataFrame. Returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValuesAreComplete(field="column_name")
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> rule = ValuesAreComplete(field="column_name", na_values="missing")
>>> result = rule.evaluate(df)
>>> rule = ValuesAreComplete(
... field="column_name",
... filter="`country` == 'UK'"
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains the completeness score (pass_rate), the number of records evaluated, and rule metadata. See DataQualityResult documentation for further attribute details. |
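As an illustration of the completeness measure described above (proportion of non-null values, with na_values converted to nulls first), here is a minimal pure-Python sketch. The helper `completeness_pass_rate` is hypothetical and is not the library's implementation; it expects na_values as a list.

```python
def completeness_pass_rate(values, na_values=()):
    """Hypothetical helper: proportion of values that are not null."""
    if not values:
        return None  # assumption: no records means no score
    # Values listed in na_values are converted to nulls before counting.
    cleaned = [None if v in na_values else v for v in values]
    return sum(v is not None for v in cleaned) / len(cleaned)


column = ["alice", "missing", None, "bob"]
print(completeness_pass_rate(column))                         # 0.75
print(completeness_pass_rate(column, na_values=["missing"]))  # 0.5
```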
gchq_data_quality.rules.accuracy.ValuesMatchList
Bases: AccuracyBaseRule
Rule to check if values meet a list of valid (or invalid) values.
Preferred alias for AccuracyRule. Skips NULLs, including those recognised via
na_values. Instantiate this class and call .evaluate(df) to assess data quality
for the chosen column.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column to check for accuracy. |
| valid_values | list[Any] | The set of acceptable values for the field. |
| inverse | bool | If True, values in valid_values are treated as invalid (the value must NOT be in the list). |
| na_values | str \| list[Any] \| None | Additional indicators to treat as missing values. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Use backticks around column names. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValuesMatchList(field="category", valid_values=["A", "B", "C"])
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> print(result.records_failed_ids)
>>> rule = ValuesMatchList(
... field="department",
... valid_values=["HR", "IT", "Sales"],
... na_values=["N/A", "N/K"]
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchList(
... field="status",
... valid_values=["expired", "deleted"],
... inverse=True # value must NOT be expired or deleted
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchList(
... field="category",
... valid_values=["A", "B", "C"],
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | An object containing the accuracy score (pass_rate), the indices of failed rows (records_failed_ids), a sample of failed values (records_failed_sample), the number of records evaluated, and further rule metadata. See DataQualityResult documentation for full details. |
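The interaction between valid_values, inverse, and na_values can be sketched without the library. The helper `match_list_pass_rate` below is hypothetical; it assumes nulls are removed before scoring (as the description states) and that an empty evaluation yields None.

```python
def match_list_pass_rate(values, valid_values, inverse=False, na_values=()):
    """Hypothetical helper: share of non-null values that pass the list check."""
    # Nulls, including anything listed in na_values, are skipped entirely.
    evaluated = [v for v in values if v is not None and v not in na_values]
    if not evaluated:
        return None  # assumption: nothing evaluated means no score
    # With inverse=True a value passes when it is NOT in the list.
    passed = [v for v in evaluated if (v in valid_values) != inverse]
    return len(passed) / len(evaluated)


departments = ["HR", "IT", "Finance", "N/A", None]
# Three values are evaluated ("N/A" and None are skipped); two of them pass.
print(match_list_pass_rate(departments, ["HR", "IT", "Sales"], na_values=["N/A"]))

statuses = ["active", "expired", "active", "deleted"]
print(match_list_pass_rate(statuses, ["expired", "deleted"], inverse=True))  # 0.5
```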
gchq_data_quality.rules.consistency.ValuesMatchExpression
Bases: ConsistencyBaseRule
Rule for evaluating data consistency based on boolean expressions (with an optional condition).
Preferred alias for ConsistencyRule. Expressions may use any valid Pandas eval syntax
that returns a boolean result. Backticks are required around all column names.
Nulls and additional na_values are handled according to the skip policy.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column to check for consistency. |
| expression | str \| dict[str, str] | A boolean expression, or a conditional {'if', 'then'} dictionary (with backticks for column names). |
| skip_if_null | Literal['all', 'any', 'never'] | Controls row skipping for null values in relevant columns. |
| na_values | str \| list[Any] \| None | Additional values considered as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Must evaluate to bool and use backticks around column names. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. You may want to override it in this rule. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValuesMatchExpression(
... field="score",
... expression="`score` >= 50"
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchExpression(
... field="completion_date",
... expression={"if": "`status` == 'completed'", "then": "`completion_date`.notnull()"},
... data_quality_dimension='Completeness' # you can override the DAMA Dimension
... )
>>> result = rule.evaluate(df)
# all series .str. methods are available
>>> rule = ValuesMatchExpression(
... field="postcode",
... expression={
... "if": "`country` == 'UK'",
... "then": "`postcode`.str.match(r'^[A-Z]{2}[0-9]{2}$')"
... }
... )
>>> result = rule.evaluate(df)
# Date parts and arithmetic using .dt accessor
>>> rule = ValuesMatchExpression(
... field="report_year",
... expression="`report_date`.dt.year == `report_year`"
... )
>>> result = rule.evaluate(df)
# Boolean logic (AND, OR, NOT) with grouping and comparisons
>>> rule = ValuesMatchExpression(
... field="flag",
... expression="(`score` > 90) & ((`status` == 'active') | ~`is_archived`)"
... )
>>> result = rule.evaluate(df)
# Using mathematical operations
>>> rule = ValuesMatchExpression(
... field="predicted",
... expression="abs(`actual` - `predicted`) < 10"
... )
>>> result = rule.evaluate(df)
# Evaluate only a filtered subset of rows
>>> rule = ValuesMatchExpression(
... field="score",
... expression="`score` >= 50",
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | An object containing the consistency score (pass_rate), number of records evaluated, a sample of inconsistent records, and details of failed row indices. See DataQualityResult documentation for full attribute descriptions. |
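Because an expression can involve several columns, the skip_if_null policy decides which rows are left out of the evaluation. The sketch below is a hypothetical illustration, not the library's code: 'all' follows the field description given for BaseRule (skip only if all columns used are NULL), while the meaning shown for 'any' is an assumption inferred from its name.

```python
def is_skipped(row, columns_used, skip_if_null="any"):
    """Hypothetical helper: should this row be excluded from evaluation?"""
    nulls = [row.get(col) is None for col in columns_used]
    if skip_if_null == "all":
        return all(nulls)  # skip only when every used column is null
    if skip_if_null == "any":
        return any(nulls)  # assumed meaning: skip when at least one is null
    if skip_if_null == "never":
        return False       # nulls are passed into the calculation
    raise ValueError(f"unknown skip_if_null value: {skip_if_null!r}")


row = {"status": "completed", "completion_date": None}
cols = ["status", "completion_date"]
for policy in ("all", "any", "never"):
    print(policy, is_skipped(row, cols, policy))
```

For a rule such as the completion_date example, where a null is exactly what should fail, a policy that does not skip the row is the one that lets the null reach the expression.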
gchq_data_quality.rules.timeliness.ValuesMatchRelativeTimeBounds
Bases: TimelinessRelativeBaseRule
Rule to assess whether datetime values fall between relative time boundaries from a reference date (which can be a static value or come from a column in the data source).
Preferred alias for TimelinessRelativeRule. Timedelta bounds are specified for start and end,
relative to a reference date or reference column. All datetime comparisons are performed in UTC,
with date-only values assumed midnight. Only one of reference_date or reference_column may be
provided. If neither is given, current UTC time is used as the reference_date.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Name of the datetime column to assess. |
| start_timedelta | timedelta \| str \| int \| float \| None | Lower offset from the reference. |
| end_timedelta | timedelta \| str \| int \| float \| None | Upper offset from the reference. |
| reference_date | str \| datetime \| Timestamp \| None | Fixed reference date/time (UTC). |
| reference_column | str \| None | Per-row column providing reference dates/times. |
| dayfirst | bool | If True, parses ALL dates as day/month/year. |
| na_values | str \| list[Any] \| None | Values treated as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Must evaluate to bool and use backticks around column names. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Note
Integer values passed as start_timedelta or end_timedelta are interpreted as nanoseconds (the default pandas.to_timedelta() behaviour).
Example
>>> rule = ValuesMatchRelativeTimeBounds(
... field="event_date",
... start_timedelta="0d",
... end_timedelta="30d",
... reference_date="2024-01-01T00:00:00Z"
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchRelativeTimeBounds(
... field="booking_date",
... start_timedelta="-1d",
... end_timedelta="5d",
... reference_column="event_date"
... )
>>> result = rule.evaluate(df)
# Require event dates at least 5 days after the reference date
>>> rule = ValuesMatchRelativeTimeBounds(
... field="event_date",
... start_timedelta="5d",
... end_timedelta=None,
... reference_date="2023-06-01"
... )
>>> result = rule.evaluate(df)
>>> from datetime import timedelta
>>> rule = ValuesMatchRelativeTimeBounds(
... field="sensor_timestamp",
... start_timedelta=timedelta(hours=-12),
... end_timedelta=timedelta(hours=12)
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchRelativeTimeBounds(
... field="booking_date",
... start_timedelta="-1d",
... end_timedelta="5d",
... reference_column="event_date",
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains the timeliness score (pass_rate), total records evaluated, and metadata. See DataQualityResult documentation for details. |
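The relative check, and the nanosecond caveat for integer offsets, can be sketched with the standard library only. Both helpers are hypothetical; the comparison is assumed to be inclusive at both ends (the documentation only states this explicitly for the static rule), and string offsets such as "30d" are not handled here.

```python
from datetime import datetime, timedelta, timezone


def to_timedelta(value):
    """Hypothetical helper: bare numbers are read as nanoseconds."""
    if isinstance(value, timedelta):
        return value
    # Python timedeltas have microsecond resolution, so tiny values round to zero.
    return timedelta(microseconds=value / 1000)


def within_relative_bounds(value, reference, start=None, end=None):
    """Assumed semantics: reference + start <= value <= reference + end."""
    if start is not None and value < reference + to_timedelta(start):
        return False
    if end is not None and value > reference + to_timedelta(end):
        return False
    return True


reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
event = datetime(2024, 1, 15, tzinfo=timezone.utc)

print(within_relative_bounds(event, reference, timedelta(days=0), timedelta(days=30)))  # True
# An integer 30 means 30 nanoseconds, not 30 days, so the same event now fails.
print(within_relative_bounds(event, reference, 0, 30))  # False
```

Passing a string with an explicit unit or a timedelta object, as the documented examples do, avoids the unit ambiguity.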
gchq_data_quality.rules.timeliness.ValuesMatchStaticTimeBounds
Bases: TimelinessStaticBaseRule
Rule to check whether datetime values in a column fall between absolute start and end date boundaries (inclusive).
Preferred alias for TimelinessStaticRule. Suitable where both boundaries are fixed or known in advance
(e.g., events occurring in January 2024). All dates are treated as, or coerced to, UTC, with date-only strings
assumed to be midnight. Invalid or unparsable datetime values are treated as missing.
Combine with a validity rule and completeness rule on the same field for the best insights.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Name of the datetime column to assess. |
| start_date | str \| datetime \| Timestamp \| None | Inclusive lower boundary for valid values. |
| end_date | str \| datetime \| Timestamp \| None | Inclusive upper boundary for valid values. |
| dayfirst | bool | If True, parses ALL dates and rule inputs as day/month/year, otherwise month/day/year. |
| na_values | str \| list[Any] \| None | Values treated as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Must evaluate to bool and use backticks around column names. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. You may want to override it, e.g. 'Consistency' may make sense for some of these rules. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValuesMatchStaticTimeBounds(
... field="event_date",
... start_date="2024-01-01T00:00:00Z",
... end_date="2024-01-31T23:59:59Z"
... )
>>> result = rule.evaluate(df)
# Only require that dates are on or after 2023-06-01
>>> rule = ValuesMatchStaticTimeBounds(
... field="date_col",
... start_date="2023-06-01",
... end_date=None
... )
>>> result = rule.evaluate(df)
# Using string-based boundaries with day-first format
>>> rule = ValuesMatchStaticTimeBounds(
... field="timestamp",
... start_date="01/06/2023",
... end_date="30/06/2023",
... dayfirst=True # also assumes dates in field 'timestamp' are dayfirst
... )
>>> result = rule.evaluate(df)
# Using Python datetime objects as boundaries
>>> from datetime import datetime, timezone
>>> rule = ValuesMatchStaticTimeBounds(
... field="timestamp",
... start_date=datetime(2023, 6, 1, 0, 0, tzinfo=timezone.utc),
... end_date=datetime(2023, 6, 30, 23, 59, tzinfo=timezone.utc),
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchStaticTimeBounds(
... field="event_date",
... start_date="2024-01-01",
... end_date="2024-01-31",
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains the timeliness score (pass_rate), a sample of failing records, and metadata. See DataQualityResult documentation for details. |
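One consequence of inclusive boundaries combined with date-only strings being read as midnight is easy to miss, so here is a standard-library sketch. The helpers `to_utc` and `within_static_bounds` are hypothetical; treating timezone-naive values as UTC is an assumption made for this illustration.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Hypothetical helper: parse an ISO string; date-only strings become midnight UTC."""
    parsed = datetime.fromisoformat(value) if isinstance(value, str) else value
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return parsed.astimezone(timezone.utc)


def within_static_bounds(value, start_date=None, end_date=None):
    """Inclusive check against optional lower and upper boundaries."""
    value = to_utc(value)
    if start_date is not None and value < to_utc(start_date):
        return False
    if end_date is not None and value > to_utc(end_date):
        return False
    return True


print(within_static_bounds("2024-01-31", "2024-01-01", "2024-01-31"))                    # True
print(within_static_bounds("2024-01-31T09:30:00", "2024-01-01", "2024-01-31"))           # False
print(within_static_bounds("2024-01-31T09:30:00", "2024-01-01", "2024-01-31T23:59:59"))  # True
```

Under these assumptions, an end_date given as a bare date excludes timestamps later on that same day, which is why a full end-of-day timestamp is the safer upper boundary for "all of January".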
gchq_data_quality.rules.validity.ValuesMatchNumericalRange
Bases: ValidityNumericalRangeBaseRule
Rule for validating numerical values against a specified range.
Preferred alias for ValidityNumericalRangeRule. Considers only non-null values;
values outside the range or failing coercion to numeric are considered invalid.
Diagnostic samples and record indices are returned for values outside the allowed range.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for numerical range validity. |
| min_value | float | Minimum allowed value (inclusive; defaults to -infinity). |
| max_value | float | Maximum allowed value (inclusive; defaults to +infinity). |
| na_values | str \| list[Any] \| None | Additional values to treat as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Must evaluate to bool and use backticks around column names. |
| data_quality_dimension | DamaFramework | Data quality dimension (Validity). |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValuesMatchNumericalRange(
... field="age",
... min_value=0,
... max_value=120
... )
>>> result = rule.evaluate(df)
# no upper limit
>>> rule = ValuesMatchNumericalRange(
... field="temp_c",
... min_value=0,
... na_values=-999
... )
>>> result = rule.evaluate(df)
# no lower limit
>>> rule = ValuesMatchNumericalRange(
... field="score",
... max_value=100,
... na_values=['missing', 'N/A']
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchNumericalRange(
... field="age",
... min_value=18,
... max_value=65,
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains the validity score (pass_rate), sample and indices of failed records, total records evaluated, and rule metadata. See DataQualityResult documentation for details. |
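The scoring described for this rule (nulls skipped, inclusive bounds, values that cannot be coerced to a number counted as invalid) can be sketched as follows. The helper `numeric_range_pass_rate` is hypothetical and uses float() as a stand-in for the library's numeric coercion.

```python
import math


def numeric_range_pass_rate(values, min_value=-math.inf, max_value=math.inf, na_values=()):
    """Hypothetical helper: share of non-null values that are numeric and in range."""
    evaluated = [v for v in values if v is not None and v not in na_values]
    if not evaluated:
        return None  # assumption: nothing evaluated means no score
    passed = 0
    for v in evaluated:
        try:
            number = float(v)
        except (TypeError, ValueError):
            continue  # failed coercion counts as invalid
        if min_value <= number <= max_value:  # both bounds inclusive
            passed += 1
    return passed / len(evaluated)


ages = [0, 35, "42", 120, 121, "unknown", None, -999]
# Six values are evaluated; 0, 35, "42" and 120 pass, 121 and "unknown" fail.
print(numeric_range_pass_rate(ages, min_value=0, max_value=120, na_values=[-999]))
```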
gchq_data_quality.rules.validity.ValuesMatchRegex
Bases: ValidityRegexBaseRule
Rule for validating string values against a regular expression.
Preferred alias for ValidityRegexRule. Considers only non-null entries, with
additional missing-value patterns specified via na_values. A diagnostic sample of
values failing the regex is returned if present.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for regex validity. |
| regex_pattern | str | Regular expression pattern for validation. |
| na_values | str \| list[Any] \| None | Additional values to treat as missing. |
| filter | str \| None | Optional pandas eval boolean expression used to filter rows before evaluation. Must evaluate to bool and use backticks around column names. |
| data_quality_dimension | DamaFramework | Data quality dimension (Validity) by default. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValuesMatchRegex(
... field="email",
... regex_pattern=r'^[^@]+@[^@]+\.[^@]+$'
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchRegex(
... field="country_code",
... regex_pattern=r'^[A-Z]{2}$'
... )
>>> result = rule.evaluate(df)
>>> rule = ValuesMatchRegex(
... field="email",
... regex_pattern=r'^[^@]+@[^@]+\.[^@]+$',
... filter="`region` == 'UK'"
... )
>>> result = rule.evaluate(df)
Note
To centrally manage and update regex patterns you can provide a separate YAML file containing named regex patterns (e.g., EMAIL_REGEX, POSTCODE_REGEX). Keys in this file are substituted in your main configuration files wherever referenced, enabling consistent and maintainable regex use.
When storing regex patterns in YAML, always use single quotes ('pattern') rather than double quotes to ensure correct handling of typical regex escape characters, such as \d or \w.
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains the validity score (pass_rate), sample and indices of failed records, number of evaluated records, and rule metadata. See DataQualityResult documentation for details. |
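A regex check of this kind can be approximated with Python's re module. This is a hypothetical sketch, not the library's implementation: whether the library uses search, match, or fullmatch semantics is not stated here, so the pattern is anchored with ^ and $ to make the choice irrelevant.

```python
import re


def regex_pass_rate(values, regex_pattern, na_values=()):
    """Hypothetical helper: (pass rate, failing values) for non-null entries."""
    pattern = re.compile(regex_pattern)
    evaluated = [v for v in values if v is not None and v not in na_values]
    if not evaluated:
        return None, []  # assumption: nothing evaluated means no score
    failed = [v for v in evaluated if not pattern.search(str(v))]
    return (len(evaluated) - len(failed)) / len(evaluated), failed


codes = ["GB", "FR", "gb", "USA", None, "N/K"]
rate, failed = regex_pass_rate(codes, r"^[A-Z]{2}$", na_values=["N/K"])
print(rate)    # 0.5
print(failed)  # ['gb', 'USA']
```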
Data Quality Configuration and Results
gchq_data_quality.config.DataQualityConfig
Bases: BaseModel
Configuration describing a set of data quality checks to be run on a dataset.
Typically constructed by loading a YAML file specifying the dataset and a list of rule definitions. Can also be created programmatically.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | str \| None | Dataset name or identifier. |
| measurement_sample | str \| None | Description of data sample. |
| lifecycle_stage | str \| None | The lifecycle stage at which data is measured. |
| measurement_time | datetime \| None | Measurement timestamp. |
| dataset_id | str \| int \| float \| None | Local data catalogue ID. |
| rules | list[BaseRule] \| None | List of rule models. |
Example
# Loading from YAML
config = DataQualityConfig.from_yaml("my_config.yaml")
# See the tutorial for how to specify the YAML file, or create a config programmatically and use .to_yaml() to generate a starting point.
# Override regex patterns
config = DataQualityConfig.from_yaml("my_config.yaml", regex_yaml_path='regex_patterns.yaml')
# Running data quality checks
report = config.execute(data_source=my_dataframe)
# Or, creating config programmatically from scratch
config2 = DataQualityConfig(
    dataset_name="my_data",
    rules=[
        ValuesMatchRegex(field="email", regex_pattern='.+@example.com'),
    ],
)
Methods:
execute(data_source) -> DataQualityReport:
Execute the measurement configuration against the provided data source
(e.g., pandas DataFrame, Spark DataFrame).
Runs each rule's evaluate() method and returns a DataQualityReport
containing the results.
from_yaml(file_path: str | Path, regex_yaml_path: str | Path | None = None) -> DataQualityConfig:
Load a configuration instance from a YAML file. If regex_yaml_path is provided,
regex patterns in rule definitions can be overridden or supplemented by patterns
from this separate YAML file.
to_yaml(file_path: str | Path, overwrite: bool = False) -> None:
Save as YAML file.
from_report(report: DataQualityReport) -> DataQualityConfig:
Create a config instance from report results. This extracts the rule definition from the
rule_data field (which is a JSON dump of all rule metadata).
gchq_data_quality.results.models.DataQualityReport
Bases: DataQualityBaseModel
A collection of individual data quality results for a dataset. This object is typically returned by executing a DataQualityConfig object, rather than instantiated directly by the user.
Attributes:
| Name | Type | Description |
|---|---|---|
| results | list[DataQualityResult] | List of individual DataQualityResults for each rule applied. |
Methods:
| Name | Description |
|---|---|
| to_dataframe | Converts report results to a pandas DataFrame for analysis. |
| to_json | Serialises the report to JSON, optionally saving to file. |
| from_dataframe | Constructs a DataQualityReport from a pandas DataFrame formatted as in to_dataframe(). |
Example
config = DataQualityConfig.from_yaml('quality_cfg.yaml')
report = config.execute(df) # <- this is the DataQualityReport object creation step
df_results = report.to_dataframe(decimals=3)
report.to_json('results.json')
gchq_data_quality.results.models.DataQualityResult
Bases: DataQualityBaseModel
Represents the outcome of a single data quality rule applied to a dataset column. Note that some rules, such as ConsistencyRule, may reference additional columns.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | float \| str \| int \| None | Common, human-readable name of the measured dataset. |
| dataset_id | float \| str \| int \| None | Machine-readable unique ID for the dataset. |
| measurement_sample | str \| None | Description of the sample measured. |
| lifecycle_stage | Any \| None | Stage of data lifecycle at the time of measurement (e.g., '01 ingest'). |
| measurement_time | UTCDateTimeStrict | UTC timestamp when measurement was taken. Defaults to 'now' in UTC. |
| field | str | Name of the column the rule applies to. |
| data_quality_dimension | DamaFramework | Data quality dimension evaluated (Uniqueness, Completeness, etc.). |
| records_evaluated | int \| None | Total records evaluated by this rule. |
| records_passed | int \| None | Total records that passed the rule. If records_evaluated is 0, then this is None by definition. |
| pass_rate | float \| None | Ratio (0-1) of passing records to evaluated records. |
| rule_id | Any \| None | Local identifier for the applied rule. |
| rule_description | Any | Text, dict, or JSON describing rule parameters and logic. |
| rule_data | str | JSON dump of rule metadata for reconstruction of rule. |
| records_failed_ids | list \| None | Up to 10 (default) identifiers for rows failing the rule. |
| records_failed_sample | list[dict] \| None | Sample output of failed records for diagnostics. |
Example
# Typical user interaction is via DataQualityReport:
config = DataQualityConfig.from_yaml('config.yaml')
report = config.execute(df)
first_result = report.results[0]
print(first_result.pass_rate) # Access result attributes
Note
Direct construction of DataQualityResult or DataQualityReport is rare; results are typically gathered in production using RuleType.evaluate(df) or DataQualityConfig.execute(data). records_passed can be missing at creation, in which case it is not serialised. This is for backwards compatibility with versions < 1.2.
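The relationship between records_evaluated, records_passed, and pass_rate listed in the attributes can be written out as a tiny sketch. The function `summarise` is hypothetical; returning None for pass_rate when nothing was evaluated is an assumption based on its float | None type.

```python
def summarise(records_evaluated, records_passed):
    """Hypothetical helper mirroring how the three metrics relate."""
    if records_evaluated == 0:
        # Nothing was evaluated, so there is no meaningful count or ratio.
        return {"records_evaluated": 0, "records_passed": None, "pass_rate": None}
    return {
        "records_evaluated": records_evaluated,
        "records_passed": records_passed,
        "pass_rate": records_passed / records_evaluated,
    }


print(summarise(4, 3))  # pass_rate is 0.75
print(summarise(0, 0))  # records_passed and pass_rate are None
```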
Spark Utilities
gchq_data_quality.spark.dataframe_operations.flatten_spark(df, flatten_cols)
Flattens arrays and nested fields in a Spark DataFrame to produce a Spark-safe, single-level table.
The columns to flatten may include array or struct paths, with array selections:
'[*]' - explodes arrays into multiple rows
'[]' - selects the first non-null element from the array
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input Spark DataFrame containing nested or array fields. | required |
| flatten_cols | list[str] | List of strings indicating nested columns to flatten. Paths may include array notation (e.g., 'orders[*].item', 'info.details[]'). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | A Spark DataFrame with the specified columns flattened and Spark-safe column names. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the column paths are inconsistent or not found in the schema, or if array notation is misapplied. |
Example
Flatten three levels of orders in a customer DataFrame:
flat_df = flatten_spark(df, [
    "customer[*].orders[*].items[*].productId",
    "customer[*].name"
])
flat_df.show()
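To make the difference between '[*]' and '[]' concrete without a Spark session, the sketch below resolves a path against a single nested Python record. The function `resolve` is hypothetical and only mimics the selection semantics described above; it does not rename columns or combine several paths into rows as flatten_spark does.

```python
def resolve(record, path):
    """Hypothetical helper: return the list of values a flatten path selects."""
    values = [record]
    for part in path.split("."):
        explode = part.endswith("[*]")
        first = not explode and part.endswith("[]")
        key = part[:-3] if explode else part[:-2] if first else part
        next_values = []
        for value in values:
            child = value.get(key) if isinstance(value, dict) else None
            if explode:
                next_values.extend(child or [])  # one output entry per element
            elif first:
                non_null = [c for c in (child or []) if c is not None]
                next_values.append(non_null[0] if non_null else None)
            else:
                next_values.append(child)
        values = next_values
    return values


record = {
    "name": "Ada",
    "orders": [
        {"items": [{"productId": "p1"}, {"productId": "p2"}]},
        {"items": [None, {"productId": "p3"}]},
    ],
}
print(resolve(record, "orders[*].items[*].productId"))  # ['p1', 'p2', None, 'p3']
print(resolve(record, "orders[*].items[].productId"))   # ['p1', 'p3']
print(resolve(record, "name"))                          # ['Ada']
```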
Types and Base Rule
The way we categorise the data quality dimensions
gchq_data_quality.models.DamaFramework
Bases: str, Enum
Allowed names for data quality framework dimensions following DAMA (Data Management Association).
Members
Uniqueness: Value is "Uniqueness".
Completeness: Value is "Completeness".
Validity: Value is "Validity".
Consistency: Value is "Consistency".
Accuracy: Value is "Accuracy".
Timeliness: Value is "Timeliness".
Note
It will accept any string case, but coerce to title case.
Example
DamaFramework("uniqueness") # Returns DamaFramework.Uniqueness
DamaFramework.Completeness.value # "Completeness"
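Case-insensitive lookup on a str-based Enum is commonly done with the standard `_missing_` hook. The class below is a hypothetical two-member stand-in, shown only to illustrate one way such coercion can work; it is not claimed to be how DamaFramework is implemented.

```python
from enum import Enum


class Dimension(str, Enum):
    """Hypothetical stand-in for DamaFramework, reduced to two members."""

    Uniqueness = "Uniqueness"
    Completeness = "Completeness"

    @classmethod
    def _missing_(cls, value):
        # Called when the plain lookup fails; retry with title case.
        if isinstance(value, str):
            for member in cls:
                if member.value == value.title():
                    return member
        return None  # Enum then raises ValueError for unknown names


print(Dimension("uniqueness") is Dimension.Uniqueness)  # True
print(Dimension("COMPLETENESS").value)                  # Completeness
```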
The base rule is never called by a user, but serves as a parent for all data quality rules.
gchq_data_quality.rules.base.BaseRule
Bases: DataQualityBaseModel, ABC
Abstract base class for data quality rule definitions.
Not intended for direct use. Use a subclass with a specific rule type (e.g., AccuracyRule, CompletenessRule) for configuration or execution of data quality checks. BaseRule handles all generic configuration and evaluation steps, with rule-specific logic implemented via subclass overrides.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for rule evaluation. |
| filter | str \| None | Boolean filter in pandas eval syntax to apply prior to rule evaluation. Defaults to None. |
| rule_id | str \| None | Optional identifier for this rule. |
| rule_description | str \| None | Optional summary or explanation of the rule. |
| na_values | str \| int \| float \| list[Any] \| None | Values to treat as NULL. |
| skip_if_null | Literal['all', 'any', 'never'] | Controls what records are skipped due to nulls. |
| data_quality_dimension | DamaFramework | Linked DAMA data quality dimension. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Applies the rule to source data and returns evaluation metrics and diagnostics. |
Note
This base class should not be instantiated directly. Use a rule subclass for actual configuration or evaluation. The order of operations is: dataframe type coercion > replace na_values > filter dataframe. There are edge cases where this order changes the result, e.g. if -1 is treated as NULL, then -1 values become NULL before any filtering happens.
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | | Contains metrics of evaluation such as pass rate, evaluated record count, indices/sample of failed records, and rule metadata. See DataQualityResult documentation for details. |
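The edge case mentioned in the note on the order of operations (na_values are replaced before the filter runs) can be reproduced with plain Python rows. Both functions are hypothetical illustrations; the filter is modelled as a predicate that excludes rows whose value is null, which is an assumption about how a comparison against NULL behaves.

```python
def evaluated_rows(rows, na_values, keep):
    """Hypothetical helper applying the documented order: replace na_values, then filter."""
    replaced = [
        {k: (None if v in na_values else v) for k, v in row.items()} for row in rows
    ]
    return [row for row in replaced if keep(row)]


def filter_first(rows, na_values, keep):
    """The opposite order, shown only for comparison."""
    kept = [row for row in rows if keep(row)]
    return [{k: (None if v in na_values else v) for k, v in row.items()} for row in kept]


def is_low(row):
    # Plays the role of a filter such as "`score` < 10"; null never matches.
    return row["score"] is not None and row["score"] < 10


rows = [{"score": -1}, {"score": 5}, {"score": 20}]
print(len(evaluated_rows(rows, [-1], is_low)))  # 1: the -1 row is null before filtering
print(len(filter_first(rows, [-1], is_low)))    # 2: the -1 row passes the filter first
```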
data_quality_dimension = Field(..., description='The Dama dimension for each rule')
class-attribute
instance-attribute
field = Field(..., description='Column to check')
class-attribute
instance-attribute
filter = Field(default=None, description='The boolean filter to apply, using pandas eval syntax, before evaluating each rule.')
class-attribute
instance-attribute
na_values = Field(default=None, description='Additional values to treat as null')
class-attribute
instance-attribute
rule_description = Field(default=None, description='Description of the rule')
class-attribute
instance-attribute
rule_id = Field(default=None, description='Identifier for this rule')
class-attribute
instance-attribute
skip_if_null = Field(default='any', description="Controls which rows are skipped that contain null values. If 'all' then it will only skip if all columns used are NULL. For most rules this will just apply to the 'field' column, but some like TimelinessRelativeRule can use more than one column. If values aren't skipped, then NULL values are passed into the calculations so be cautious as to what you allow through. Any logical expression compared with NA returns NA.")
class-attribute
instance-attribute
_coerce_dataframe_type(df)
Some rules require values to be coerced to a different data type, e.g. Timeliness rules coerce to UTC datetime and ValidityNumericalRange coerces to numeric.
This function handles coercing to the relevant data type for the rule. Override if needed; the default behaviour is no coercion.
Columns that appear only in the self.filter expression are not coerced by default.
Returns:
| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: If no coercion, the original df. If coerced, a modified dataframe. |
_copy_and_subset_dataframe(df, columns_used)
Copies the dataframe to avoid later mutations when we replace NA values or coerce to a different data type.
Also ensures the dataframe columns are kept in the same order as the original df.
_evaluate_in_pandas(df)
Evaluates the rule against the provided DataFrame.
Performs a field existence check, handles NA values and coercion, calculates the number of records evaluated and passing, computes the pass rate, and includes a sample of failed records if required. A subset of these steps can be overridden to give any inherited rule the desired behaviour, without having to completely override the evaluate() function itself.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to evaluate | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | A summary of data quality metrics for the rule such as records evaluated, pass rate, and details of failed records if required. |
_evaluate_in_pandas_output_dataframe(df)
Wrapper that ensures a DataFrame is returned when executing in Spark (a Spark requirement), while preserving the behaviour that _evaluate_in_pandas returns a DataQualityResult, so that method does not need to be overridden.
Returns:
| Type | Description |
|---|---|
DataFrame
|
A Dataframe in a format that matches SparkDataQualityResultSchema |
_evaluate_in_spark(spark_df)
By default we execute everything in pandas via mapInPandas. This partitions the data automatically and sends dataframes to each Spark worker; we then aggregate the resulting data.
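The partition-then-aggregate idea can be simulated without Spark. The sketch below is plain pandas only: the partitions are created by hand, the rule ("amount must be positive") is hypothetical, and summing per-partition counts is an assumption about how the aggregation might work rather than a description of the library's code.

```python
import pandas as pd


def evaluate_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-partition step: return counts as a one-row DataFrame,
    # mirroring the requirement that each partition yields a DataFrame.
    evaluated = part["amount"].notna()
    passed = (part["amount"] > 0) & evaluated
    return pd.DataFrame(
        {
            "records_evaluated": [int(evaluated.sum())],
            "records_passed": [int(passed.sum())],
        }
    )


df = pd.DataFrame({"amount": [10.0, -5.0, None, 3.0, 7.0, -1.0]})
partitions = [df.iloc[:3], df.iloc[3:]]  # stand-in for Spark's partitioning

partials = pd.concat(
    [evaluate_partition(part) for part in partitions], ignore_index=True
)
totals = partials.sum()
pass_rate = totals["records_passed"] / totals["records_evaluated"]
```

Note that simple count summation only suits rules whose result for a row does not depend on rows in other partitions.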
_filter_dataframe(df)
Filters the dataframe after it has been copied and subset via _copy_and_subset_dataframe. Uses self.filter as a boolean expression to return a filtered subset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The dataframe to filter (already copied from the original) |
required |
Returns:
| Type | Description |
|---|---|
DataFrame
|
pd.DataFrame: The filtered dataframe |
_get_columns_used_pandas()
The columns used in evaluating the rule. Defaults to just the field and any columns in the filter expression, but other rules such as consistency may use more than one column and will override this.
_get_null_count(df, field)
_get_null_counts_all_columns(df)
Goes through each column and calculates the null count.
Returns:
| Type | Description |
|---|---|
dict[str, int]
|
A dictionary of {column_name : null_count} e.g. {'name' : 7, 'age' : 0} |
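For illustration, a per-column null count of this shape can be produced in plain pandas as follows. The helper name null_counts_all_columns is hypothetical.

```python
import pandas as pd


def null_counts_all_columns(df: pd.DataFrame) -> dict[str, int]:
    # Hypothetical helper: count nulls per column and return plain Python ints.
    return {str(column): int(count) for column, count in df.isna().sum().items()}


people = pd.DataFrame({"name": ["Ann", None, None], "age": [31, 40, 27]})
counts = null_counts_all_columns(people)
print(counts)  # {'name': 2, 'age': 0}
```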
_get_records_evaluated_mask_pandas(df)
The boolean mask of whether a record is being evaluated. Most rules do not evaluate records that are NULL, with the exception of the CompletenessRule, so the default behaviour is to evaluate non-null values.
_get_records_evaluated_pandas(df)
Computes the number of records that are evaluated against the rule.
By default, counts non-null entries in the target field. Override this for rules involving multiple columns or different completeness logic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The DataFrame to process. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
int |
int
|
The count of records in the field being assessed. |
_get_records_failed_mask_pandas(df)
Abstract method to generate a boolean mask for records failing the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The DataFrame to process. |
required |
Returns:
| Type | Description |
|---|---|
Series
|
pd.Series: Boolean mask where True indicates a failing record. |
_get_records_failed_pandas(df)
Returns a list of unique records from the field that failed the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The DataFrame instance to process (assumes df has been filtered to just contain required columns). |
required |
Returns:
| Name | Type | Description |
|---|---|---|
list |
list[dict]
|
Unique records from the field corresponding to failed records. In format [{colA : valueA, colB : valueB}, {...etc}] |
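The return format can be illustrated with plain pandas. In this sketch the failed-record mask is supplied by hand and the variable names are assumptions; how the library builds the mask and limits the sample is not shown here.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "region": ["UK", "FR", "FR", "UK"]})

# Hypothetical mask marking the two duplicated rows as failing.
failed_mask = pd.Series([False, True, True, False])

# De-duplicate the failing rows, then convert to a list of {column: value} dicts.
failed_records = df[failed_mask].drop_duplicates().to_dict(orient="records")
print(failed_records)  # [{'id': 2, 'region': 'FR'}]
```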
_get_records_passed_mask_pandas(df)
abstractmethod
The boolean mask of which records are passing (this function is the main way we define our data quality rules). By definition it is also combined with the records_evaluated_mask using AND, as a record cannot pass if it has not been evaluated.
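A minimal sketch of how an evaluated mask and a passed mask relate, using a hypothetical rule ("amount must be positive") and hypothetical function names:

```python
import pandas as pd


def records_evaluated_mask(df: pd.DataFrame, field: str) -> pd.Series:
    # Default behaviour described above: only non-null values are evaluated.
    return df[field].notna()


def records_passed_mask(df: pd.DataFrame, field: str) -> pd.Series:
    # Hypothetical rule: the value must be positive. The result is ANDed with
    # the evaluated mask, so an unevaluated record can never pass.
    evaluated = records_evaluated_mask(df, field)
    return (df[field] > 0) & evaluated


df = pd.DataFrame({"amount": [10.0, -5.0, None, 3.0]})
evaluated = records_evaluated_mask(df, "amount")
passed = records_passed_mask(df, "amount")
pass_rate = passed.sum() / evaluated.sum()
```

Here three records are evaluated and two pass, so the null row affects neither the numerator nor the denominator.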
_get_records_passed_pandas(df)
Abstract method to compute the number of records passing the data quality rule.
This must be customised for each specific rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The DataFrame to process. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
int |
int
|
The count of records passing the rule's criteria. |
_get_skip_if_null_mask(df)
Return mask for records to skip based on self.skip_if_null.
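The difference between 'any' and 'all' can be sketched in plain pandas. The helper below is hypothetical, and rejecting other values with a ValueError is an assumption made for the example.

```python
import pandas as pd


def skip_if_null_mask(
    df: pd.DataFrame, columns_used: list[str], skip_if_null: str = "any"
) -> pd.Series:
    # Hypothetical helper: True marks a row that would be skipped.
    nulls = df[columns_used].isna()
    if skip_if_null == "any":
        return nulls.any(axis=1)  # skip when at least one used column is null
    if skip_if_null == "all":
        return nulls.all(axis=1)  # skip only when every used column is null
    raise ValueError(f"Unsupported skip_if_null value: {skip_if_null!r}")


df = pd.DataFrame({"start": [1.0, None, None], "end": [2.0, 3.0, None]})
skip_any = skip_if_null_mask(df, ["start", "end"], "any")
skip_all = skip_if_null_mask(df, ["start", "end"], "all")
```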
_get_spark_safe_rule()
Returns a modified deep copy of the rule with Spark-safe column names in any column used to evaluate the rule. This is required when working with nested data: if we want to measure 'customers.age', then after we flatten the dataframe and extract the age property from the 'customers' object, the column is renamed to customers_age when it gets passed to _evaluate_in_pandas, as 'customers.age' is not a valid Spark column name once the data is flattened.
This is overridden for each subrule type if more than self.field is used.
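The renaming described above can be sketched with a stand-in rule object. ExampleRule and spark_safe_rule are hypothetical names, and only the dot-to-underscore case mentioned in the text is handled.

```python
import copy
from dataclasses import dataclass


@dataclass
class ExampleRule:
    # Hypothetical stand-in for a rule; only the field name matters here.
    field: str


def spark_safe_rule(rule: ExampleRule) -> ExampleRule:
    # Deep copy so the caller's rule keeps its original nested field name.
    safe = copy.deepcopy(rule)
    safe.field = safe.field.replace(".", "_")
    return safe


rule = ExampleRule(field="customers.age")
safe_rule = spark_safe_rule(rule)
print(safe_rule.field)  # customers_age
```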
_handle_dataframe_coercion(df)
Coerces the dataframe to a new datatype (if required). Also checks whether the null count changes upon coercion and, if so, raises a warning to the user.
_handle_na_values_pandas(df, columns_used, na_values)
Replace specified values in a DataFrame with pd.NA if na_values is provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
Input dataframe. |
required |
columns_used
|
list
|
Columns to scan for null-like values. |
required |
na_values
|
list
|
List of values to consider as missing. |
required |
Returns:
| Type | Description |
|---|---|
DataFrame
|
pd.DataFrame: A dataframe where the specified values are replaced with pd.NA, or the original if na_values is None. |
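A minimal sketch of replacing null-like values with pd.NA in plain pandas, reusing the -1 example from the rule documentation. The helper name handle_na_values is hypothetical and the library's own implementation may differ.

```python
import pandas as pd


def handle_na_values(df: pd.DataFrame, columns_used: list[str], na_values) -> pd.DataFrame:
    # Hypothetical helper: with no na_values, return the input unchanged.
    if na_values is None:
        return df
    values = na_values if isinstance(na_values, list) else [na_values]
    cleaned = df.copy()
    cleaned[columns_used] = cleaned[columns_used].replace(values, pd.NA)
    return cleaned


df = pd.DataFrame({"id": [1, 2, -1, 3, 3]})
cleaned = handle_na_values(df, ["id"], [-1])
untouched = handle_na_values(df, ["id"], None)
```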
_replace_na_in_bool_mask(mask)
If a boolean mask contains None, we cannot conduct mask operations such as inverting it or logical AND / OR. This replaces None / NA with False.
This method can be overridden by child classes.
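To see why this matters, the sketch below builds a mask from a nullable integer column, where a comparison with NA yields NA, and then fills it with False. The variable names are illustrative only.

```python
import pandas as pd

ages = pd.Series([1, None, 3], dtype="Int64")

# Comparing against NA gives NA, so the mask is [False, <NA>, True].
mask = ages > 2

# Replace NA with False and convert to a plain bool mask.
clean = mask.fillna(False).astype(bool)
inverted = ~clean
```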
_require_failed_records_sample(pass_rate)
Determines whether a diagnostic sample of failed records should be collected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pass_rate
|
float | None
|
The rule pass rate, or None if no records were evaluated. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if failed record samples are required; otherwise, False. |
_warn_if_null_counts_different(original_null_counts, new_null_counts)
Compares the null counts between the original and new dictionaries (the keys will be the same). If the new one has more nulls for a column, raises a warning that mentions the column. This is typically done during coercion to a new datatype.
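A rough sketch of this comparison using the standard warnings module follows. The function name and the warning text are assumptions; the library's actual message and warning category are not stated here.

```python
import warnings


def warn_if_null_counts_different(
    original_null_counts: dict[str, int], new_null_counts: dict[str, int]
) -> None:
    # Hypothetical helper: warn once per column that gained nulls.
    for column, before in original_null_counts.items():
        after = new_null_counts[column]
        if after > before:
            warnings.warn(
                f"Column '{column}' has {after - before} more null value(s) "
                "after coercion."
            )


# Capture the warnings so they can be inspected in this example.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_null_counts_different({"age": 0, "name": 1}, {"age": 2, "name": 1})

messages = [str(item.message) for item in caught]
```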
evaluate(data_source)
Evaluates this rule against the provided data source.
Supports both Pandas and Spark DataFrames as input. Applies all rule configuration, handles nulls and data coercion, and computes relevant data quality metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_source
|
pd.DataFrame | SparkDataFrame
|
The data to evaluate; either a Pandas DataFrame or a Spark DataFrame. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
DataQualityResult |
DataQualityResult
|
Contains the metrics and diagnostics of rule evaluation, including pass rate, number of records evaluated, indices and sample of failed records, and rule metadata. See DataQualityResult documentation for details. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If an unsupported data source is provided. |