API Reference
Rules
gchq_data_quality.rules.uniqueness.UniquenessRule
Bases: BaseRule
Rule for assessing uniqueness in a column.
Measures the proportion of unique, non-null values in a specified column. This is
useful for checking distinct identifiers or reference keys. Additional null-like
values can be specified via na_values.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to evaluate for uniqueness. |
| na_values | Any \| list[Any] \| None | Values to treat as missing. |
| data_quality_dimension | DamaFramework | Data quality dimension (Uniqueness). |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional description for this rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> import pandas as pd
>>> from gchq_data_quality.rules.uniqueness import UniquenessRule
>>> df = pd.DataFrame({'id': [1, 2, 3, 3, None]})
# Basic uniqueness check
>>> rule = UniquenessRule(field='id')
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
# Specify additional NA values
>>> rule = UniquenessRule(field='id', na_values=[-1])
>>> df = pd.DataFrame({'id': [1, 2, -1, 3, 3]})
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
0.75
Note
The pass_rate metric is calculated as (number of unique values) / (number of non-null records). Therefore, if every value in the column appears exactly twice, pass_rate will be 0.5 (not 0.0!). For columns with even more duplication, pass_rate will decrease and approach zero as the number of unique values becomes small relative to the number of total records. Only if every record is identical will pass_rate be 1 / N (where N is the number of records).
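The arithmetic in this note can be reproduced with plain pandas. A minimal sketch (not the library's internal implementation; the helper name uniqueness_pass_rate is hypothetical):

```python
import pandas as pd

def uniqueness_pass_rate(series: pd.Series) -> float:
    # Uniqueness metric: unique non-null values / non-null records.
    non_null = series.dropna()
    return non_null.nunique() / len(non_null)

s = pd.Series([1, 2, 3, 3, None])
print(uniqueness_pass_rate(s))  # 3 unique values over 4 non-null records -> 0.75
print(uniqueness_pass_rate(pd.Series([1, 1, 2, 2])))  # every value twice -> 0.5
```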
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the uniqueness score (pass_rate), a sample of duplicate values, the number of records evaluated, and rule metadata. See DataQualityResult documentation for further attribute details. |
gchq_data_quality.rules.completeness.CompletenessRule
Bases: BaseRule
Rule to calculate the completeness score for a field.
Completeness is measured as the proportion of non-null values in the specified
column. Values specified in na_values are converted to nulls prior to calculation.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column name to assess. |
| na_values | str \| list[Any] \| None | Additional indicators to treat as missing. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates completeness for the chosen field on a Pandas or Spark DataFrame. Returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = CompletenessRule(field="column_name")
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> rule = CompletenessRule(field="column_name", na_values="missing")
>>> result = rule.evaluate(df)
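As a hedged illustration of the metric (not the library's code; the helper name completeness_pass_rate is hypothetical), completeness can be computed directly in pandas:

```python
import pandas as pd

def completeness_pass_rate(series: pd.Series, na_values=None) -> float:
    # Map any extra null-like values to NA, then take the non-null proportion.
    if na_values is not None:
        series = series.replace(na_values, pd.NA)
    return series.notna().sum() / len(series)

s = pd.Series(["a", "b", "missing", None])
print(completeness_pass_rate(s, na_values=["missing"]))  # 2 of 4 records present -> 0.5
```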
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the completeness score (pass_rate), number of records evaluated, and rule metadata. See DataQualityResult documentation for further attribute details. |
gchq_data_quality.rules.accuracy.AccuracyRule
Bases: BaseRule
Rule to check whether values match a list of valid (or invalid) values.
Skips NULLs, including those recognised via na_values. Instantiate this class and
call .evaluate(df) to assess data quality for the chosen column.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column to check for accuracy. |
| valid_values | list[Any] | The set of acceptable values for the field. |
| inverse | bool | If True, values in valid_values are treated as failures (the field must NOT contain them). |
| na_values | str \| list[Any] \| None | Additional indicators to treat as missing values. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = AccuracyRule(field="category", valid_values=["A", "B", "C"])
>>> result = rule.evaluate(df)
>>> print(result.pass_rate)
>>> print(result.records_failed_ids)
>>> rule = AccuracyRule(
... field="department",
... valid_values=["HR", "IT", "Sales"],
... na_values=["N/A", "N/K"]
... )
>>> result = rule.evaluate(df)
>>> rule = AccuracyRule(
... field="status",
... valid_values=["expired", "deleted"],
... inverse=True # value must NOT be expired or deleted
... )
>>> result = rule.evaluate(df)
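The pass/fail logic described above can be sketched in plain pandas (an illustration, not the library's implementation; the helper name accuracy_pass_rate is hypothetical):

```python
import pandas as pd

def accuracy_pass_rate(series: pd.Series, valid_values, inverse: bool = False) -> float:
    # Nulls are skipped, mirroring the documented behaviour.
    non_null = series.dropna()
    hits = non_null.isin(valid_values)
    passing = ~hits if inverse else hits
    return passing.sum() / len(non_null)

s = pd.Series(["A", "B", "X", None])
print(accuracy_pass_rate(s, ["A", "B", "C"]))      # "A" and "B" pass out of 3 evaluated
print(accuracy_pass_rate(s, ["X"], inverse=True))  # values must NOT be "X"
```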
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | An object containing the accuracy score (pass_rate), the indices of failed rows (records_failed_ids), a sample of failed values (records_failed_sample), the number of records evaluated, and further rule metadata. See DataQualityResult documentation for full details. |
gchq_data_quality.rules.consistency.ConsistencyRule
Bases: BaseRule
Rule for evaluating data consistency based on boolean expressions (with an optional condition).
Expressions may use any valid Pandas eval syntax that returns a boolean result. Backticks are required around all column names. Nulls and additional na_values are handled according to the skip policy.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | The column to check for consistency. |
| expression | str \| dict[str, str] | A boolean expression, or a conditional {'if', 'then'} dictionary (with backticks for column names). |
| skip_if_null | Literal['all', 'any', 'never'] | Controls row skipping for null values in relevant columns. |
| na_values | str \| list[Any] \| None | Additional values considered as missing. |
| data_quality_dimension | DamaFramework | Associated data quality dimension; you may want to override it in this rule. |
| rule_id | str \| None | Optional identifier for the rule. |
| rule_description | str \| None | Optional description of the rule. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ConsistencyRule(
... field="score",
... expression="`score` >= 50"
... )
>>> result = rule.evaluate(df)
>>> rule = ConsistencyRule(
... field="completion_date",
... expression={"if": "`status` == 'completed'", "then": "`completion_date`.notnull()"},
... data_quality_dimension='Completeness'  # you can override the DAMA Dimension
... )
>>> result = rule.evaluate(df)
# all series .str. methods are available
>>> rule = ConsistencyRule(
... field="postcode",
... expression={
... "if": "`country` == 'UK'",
... "then": "`postcode`.str.match(r'^[A-Z]{2}[0-9]{2}$')"
... }
... )
>>> result = rule.evaluate(df)
# Date parts and arithmetic using .dt accessor
>>> rule = ConsistencyRule(
... field="report_year",
... expression="`report_date`.dt.year == `report_year`"
... )
>>> result = rule.evaluate(df)
# Boolean logic (AND, OR, NOT) with grouping and comparisons
>>> rule = ConsistencyRule(
... field="flag",
... expression="(`score` > 90) & ((`status` == 'active') | ~`is_archived`)"
... )
>>> result = rule.evaluate(df)
# Using mathematical operations
>>> rule = ConsistencyRule(
... field="predicted",
... expression="abs(`actual` - `predicted`) < 10"
... )
>>> result = rule.evaluate(df)
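The expressions above are standard pandas eval syntax; this is a minimal standalone sketch of applying such an expression outside the library, on hypothetical data:

```python
import pandas as pd

# The same backtick eval syntax used by the rule expressions above.
df = pd.DataFrame({"score": [40, 60, 95], "status": ["active", "closed", "active"]})
mask = df.eval("(`score` >= 50) & (`status` == 'active')")
print(mask.mean())  # proportion of rows satisfying the expression
```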
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | An object containing the consistency score (pass_rate), number of records evaluated, a sample of inconsistent records, and details of failed row indices. See DataQualityResult documentation for full attribute descriptions. |
gchq_data_quality.rules.timeliness.TimelinessRelativeRule
Bases: TimelinessBaseRule
Rule to assess whether datetime values fall between relative time boundaries from a reference date (which can be a static value or come from a column in the data source).
Timedelta bounds are specified for start and end, relative to a reference date or reference column.
All datetime comparisons are performed in UTC, with date-only values assumed midnight.
Only one of reference_date or reference_column may be provided. If neither is given,
current UTC time is used as the reference_date.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Name of the datetime column to assess. |
| start_timedelta | timedelta \| str \| int \| float \| None | Lower offset from the reference. |
| end_timedelta | timedelta \| str \| int \| float \| None | Upper offset from the reference. |
| reference_date | str \| datetime \| Timestamp \| None | Fixed reference date/time (UTC). |
| reference_column | str \| None | Per-row column providing reference dates/times. |
| dayfirst | bool | If True, parses ALL dates as day/month/year. |
| na_values | str \| list[Any] \| None | Values treated as missing. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Note
Integer or float values passed as start_timedelta or end_timedelta are interpreted as nanoseconds (the default pandas.to_timedelta() behaviour).
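This default is easy to verify directly with pandas:

```python
import pandas as pd

# A bare integer is nanoseconds, not days:
print(pd.to_timedelta(5))     # a Timedelta of 5 nanoseconds
print(pd.to_timedelta("5d"))  # a Timedelta of 5 days
```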
Example
>>> rule = TimelinessRelativeRule(
... field="event_date",
... start_timedelta="0d",
... end_timedelta="30d",
... reference_date="2024-01-01T00:00:00Z"
... )
>>> result = rule.evaluate(df)
>>> rule = TimelinessRelativeRule(
... field="booking_date",
... start_timedelta="-1d",
... end_timedelta="5d",
... reference_column="event_date"
... )
>>> result = rule.evaluate(df)
# Require event dates at least 5 days after the reference date
>>> rule = TimelinessRelativeRule(
... field="event_date",
... start_timedelta="5d",
... end_timedelta=None,
... reference_date="2023-06-01"
... )
>>> result = rule.evaluate(df)
>>> from datetime import timedelta
>>> rule = TimelinessRelativeRule(
... field="sensor_timestamp",
... start_timedelta=timedelta(hours=-12),
... end_timedelta=timedelta(hours=12)
... )
>>> result = rule.evaluate(df)
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the timeliness score (pass_rate), total records evaluated, and metadata. See DataQualityResult documentation for details. |
gchq_data_quality.rules.timeliness.TimelinessStaticRule
Bases: TimelinessBaseRule
Rule to check whether datetime values in a column fall between absolute start and end date boundaries (inclusive).
Suitable where both boundaries are fixed or known in advance (e.g., events occurring in January 2024). All dates are treated as, or coerced to, UTC, with date-only strings assumed to be midnight. Invalid or unparsable datetime values are treated as missing. Combine with a validity rule and completeness rule on the same field for the best insights.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Name of the datetime column to assess. |
| start_date | str \| datetime \| Timestamp \| None | Inclusive lower boundary for valid values. |
| end_date | str \| datetime \| Timestamp \| None | Inclusive upper boundary for valid values. |
| dayfirst | bool | If True, parses ALL dates and rule inputs as day/month/year, otherwise month/day/year. |
| na_values | str \| list[Any] \| None | Values treated as missing. |
| data_quality_dimension | DamaFramework | Associated data quality dimension. (You may want to override this; e.g. 'Consistency' may make sense for some of these rules.) |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = TimelinessStaticRule(
... field="event_date",
... start_date="2024-01-01T00:00:00Z",
... end_date="2024-01-31T23:59:59Z"
... )
>>> result = rule.evaluate(df)
# Only require that dates are on or after 2023-06-01
>>> rule = TimelinessStaticRule(
... field="date_col",
... start_date="2023-06-01",
... end_date=None
... )
>>> result = rule.evaluate(df)
# Using string-based boundaries with day-first format
>>> rule = TimelinessStaticRule(
... field="timestamp",
... start_date="01/06/2023",
... end_date="30/06/2023",
... dayfirst=True # also assumes dates in field 'timestamp' are dayfirst
... )
>>> result = rule.evaluate(df)
# Using Python datetime objects as boundaries
>>> from datetime import datetime, timezone
>>> rule = TimelinessStaticRule(
... field="timestamp",
... start_date=datetime(2023, 6, 1, 0, 0, tzinfo=timezone.utc),
... end_date=datetime(2023, 6, 30, 23, 59, tzinfo=timezone.utc),
... )
>>> result = rule.evaluate(df)
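A hedged sketch of the underlying comparison (not the library's code, using a hypothetical column): coerce to UTC, with invalid values becoming NaT, then apply inclusive bounds:

```python
import pandas as pd

# Invalid datetimes are coerced to NaT and fail the check, matching the
# documented "treated as missing" behaviour.
s = pd.to_datetime(
    pd.Series(["2024-01-15", "2024-02-01", "not a date"]),
    errors="coerce", utc=True,
)
start = pd.Timestamp("2024-01-01", tz="UTC")
end = pd.Timestamp("2024-01-31T23:59:59", tz="UTC")
mask = s.between(start, end)  # inclusive on both ends
print(mask.tolist())
```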
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the timeliness score (pass_rate), a sample of failing records, and metadata. See DataQualityResult documentation for details. |
gchq_data_quality.rules.validity.ValidityNumericalRangeRule
Bases: BaseRule
Rule for validating numerical values against a specified range.
Considers only non-null values; values outside the range or failing coercion to numeric are considered invalid. Diagnostic samples and record indices are returned for values outside the allowed range.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for numerical range validity. |
| min_value | float | Minimum allowed value (inclusive; defaults to -infinity). |
| max_value | float | Maximum allowed value (inclusive; defaults to +infinity). |
| na_values | str \| list[Any] \| None | Additional values to treat as missing. |
| data_quality_dimension | DamaFramework | Data quality dimension (Validity). |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValidityNumericalRangeRule(
... field="age",
... min_value=0,
... max_value=120
... )
>>> result = rule.evaluate(df)
# no upper limit
>>> rule = ValidityNumericalRangeRule(
... field="temp_c",
... min_value=0,
... na_values=-999
... )
>>> result = rule.evaluate(df)
# no lower limit
>>> rule = ValidityNumericalRangeRule(
... field="score",
... max_value=100,
... na_values=['missing', 'N/A']
... )
>>> result = rule.evaluate(df)
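As an illustration of the coercion behaviour described above (a sketch on hypothetical data, not the library's implementation):

```python
import pandas as pd

# "abc" fails numeric coercion and counts as invalid, while None is skipped.
s = pd.Series(["25", "130", "abc", None])
numeric = pd.to_numeric(s, errors="coerce")
evaluated = s.notna()                    # 3 records evaluated
passing = numeric.between(0, 120) & evaluated
print(passing.sum() / evaluated.sum())   # 1 of 3 values lies within [0, 120]
```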
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the validity score (pass_rate), sample and indices of failed records, total records evaluated, and rule metadata. See DataQualityResult documentation for details. |
gchq_data_quality.rules.validity.ValidityRegexRule
Bases: BaseRule
Rule for validating string values against a regular expression.
Considers only non-null entries, with additional missing-value patterns specified via
na_values. A diagnostic sample of values failing the regex is returned if present.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for regex validity. |
| regex_pattern | str | Regular expression pattern for validation. |
| na_values | str \| list[Any] \| None | Additional values to treat as missing. |
| data_quality_dimension | DamaFramework | Data quality dimension (Validity) by default. |
| rule_id | str \| None | Optional rule identifier. |
| rule_description | str \| None | Optional rule description. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame) -> DataQualityResult. Evaluates the rule on the provided Pandas or Spark DataFrame and returns the metrics and diagnostics of the rule evaluation. |
Example
>>> rule = ValidityRegexRule(
... field="email",
... regex_pattern=r'^[^@]+@[^@]+\.[^@]+$'
... )
>>> result = rule.evaluate(df)
>>> rule = ValidityRegexRule(
... field="country_code",
... regex_pattern=r'^[A-Z]{2}$'
... )
>>> result = rule.evaluate(df)
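The check itself is ordinary pandas string matching. A standalone sketch on hypothetical data (the library may use str.match with explicit anchors; str.fullmatch is used here as the unanchored equivalent):

```python
import pandas as pd

s = pd.Series(["GB", "us", None, "FRA"])
non_null = s.dropna()  # nulls are not evaluated
matches = non_null.str.fullmatch(r"[A-Z]{2}")
print(matches.sum() / len(non_null))  # only "GB" matches the two-letter pattern
```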
Note
To centrally manage and update regex patterns you can provide a separate YAML file containing named regex patterns (e.g., EMAIL_REGEX, POSTCODE_REGEX). Keys in this file are substituted in your main configuration files wherever referenced, enabling consistent and maintainable regex use.
When storing regex patterns in YAML, always use single quotes ('pattern') rather than double quotes to ensure correct handling of typical regex escape characters, such as \d or \w.
# regex_patterns.yaml
EMAIL_REGEX: '^[^@]+@[^@]+\.[^@]+$'
POSTCODE_REGEX: '^[A-Z]{2}[0-9]{2,3}\s?[0-9][A-Z]{2}$'
# In your DQ config YAML, use the key in place of the regex pattern:
rules:
- function: validity_regex
field: email
regex_pattern: EMAIL_REGEX
# Python code to load with substitution:
>>> from gchq_data_quality.config import DataQualityConfig
>>> dq_config = DataQualityConfig.from_yaml(
... 'your_config.yaml',
... regex_yaml_path='regex_patterns.yaml'
... )
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the validity score (pass_rate), sample and indices of failed records, number of evaluated records, and rule metadata. See DataQualityResult documentation for details. |
Data Quality Configuration and Results
gchq_data_quality.config.DataQualityConfig
Bases: BaseModel
Configuration describing a set of data quality checks to be run on a dataset.
Typically constructed by loading a YAML file specifying the dataset and a list of rule definitions. Can also be created programmatically.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | str \| None | Dataset name or identifier. |
| measurement_sample | str \| None | Description of data sample. |
| lifecycle_stage | str \| None | The lifecycle stage at which data is measured. |
| measurement_time | datetime \| None | Measurement timestamp. |
| dataset_id | str \| int \| float \| None | Local data catalogue ID. |
| rules | list[RuleType] \| None | List of rule models. |
Example
# Loading from YAML
config = DataQualityConfig.from_yaml("my_config.yaml")
# See the tutorial for how to specify the YAML file, or create a config
# programmatically and use .to_yaml() to generate a starting point.
# Override regex patterns
config = DataQualityConfig.from_yaml("my_config.yaml", regex_yaml_path='regex_patterns.yaml')
# Running data quality checks
report = config.execute(data_source=my_dataframe)
# Or, creating config programmatically from scratch
config2 = DataQualityConfig(
dataset_name="my_data",
rules=[
ValidityRegexRule(field="email", regex_pattern='.+@example.com'),
],
)
Methods:
execute(data_source) -> DataQualityReport:
Execute the measurement configuration against the provided data source
(e.g., pandas DataFrame, Spark DataFrame).
Runs each rule's evaluate() method and returns a DataQualityReport
containing the results.
from_yaml(file_path: str | Path, regex_yaml_path: str | Path | None = None) -> DataQualityConfig:
Load a configuration instance from a YAML file. If regex_yaml_path is provided,
regex patterns in rule definitions can be overridden or supplemented by patterns
from this separate YAML file.
to_yaml(file_path: str | Path, overwrite: bool = False) -> None:
Save as YAML file.
from_report(report: DataQualityReport) -> DataQualityConfig:
Create a config instance from report results. This extracts the rule definition from the
rule_data field (which is a JSON dump of all rule metadata).
gchq_data_quality.results.models.DataQualityReport
Bases: DataQualityBaseModel
A collection of individual data quality results for a dataset. This object is typically returned by executing a DataQualityConfig object, rather than instantiated directly by the user.
Attributes:
| Name | Type | Description |
|---|---|---|
results |
list[DataQualityResult]
|
List of individual DataQualityResults for each rule applied. |
Methods:
| Name | Description |
|---|---|
to_dataframe |
Converts report results to a pandas DataFrame for analysis. |
to_json |
Serialises the report to JSON, optionally saving to file. |
from_dataframe |
Constructs a DataQualityReport from a pandas DataFrame formatted as in to_dataframe(). |
Example
config = DataQualityConfig.from_yaml('quality_cfg.yaml')
report = config.execute(df) # <- this is the DataQualityReport object creation step
df_results = report.to_dataframe(decimals=3)
report.to_json('results.json')
gchq_data_quality.results.models.DataQualityResult
Bases: DataQualityBaseModel
Represents the outcome of a single data quality rule applied to a dataset column. Note that some rules, such as ConsistencyRule, may reference additional columns.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | float \| str \| int \| None | Common, human-readable name of the measured dataset. |
| dataset_id | float \| str \| int \| None | Machine-readable unique ID for the dataset. |
| measurement_sample | str \| None | Description of the sample measured. |
| lifecycle_stage | Any \| None | Stage of data lifecycle at the time of measurement (e.g., '01 ingest'). |
| measurement_time | UTCDateTimeStrict | UTC timestamp when measurement was taken. Defaults to 'now' in UTC. |
| field | str | Name of the column the rule applies to. |
| data_quality_dimension | DamaFramework | Data quality dimension evaluated (Uniqueness, Completeness, etc.). |
| records_evaluated | int \| None | Total records evaluated by this rule. |
| pass_rate | float \| None | Ratio (0-1) of passing records to evaluated records. |
| rule_id | Any \| None | Local identifier for the applied rule. |
| rule_description | Any | Text, dict, or JSON describing rule parameters and logic. |
| rule_data | str | JSON dump of rule metadata for reconstruction of the rule. |
| records_failed_ids | list \| None | Up to 10 (default) identifiers for rows failing the rule. |
| records_failed_sample | list[dict] \| None | Sample output of failed records for diagnostics. |
Example
# Typical user interaction is via DataQualityReport:
config = DataQualityConfig.from_yaml('config.yaml')
report = config.execute(df)
first_result = report.results[0]
print(first_result.pass_rate) # Access result attributes
Note
Direct construction of DataQualityResult or DataQualityReport is rare; results are typically gathered in production using RuleType.evaluate(df) or DataQualityConfig.execute(data).
Spark Utilities
gchq_data_quality.spark.dataframe_operations.flatten_spark(df, flatten_cols)
Flattens arrays and nested fields in a Spark DataFrame to produce a Spark-safe, single-level table.
The columns to flatten may include array or struct paths, with two array selections: '[*]' explodes arrays into multiple rows; '[]' selects the first non-null element from the array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input Spark DataFrame containing nested or array fields. | required |
| flatten_cols | list[str] | List of strings indicating nested columns to flatten. Paths may include array notation (e.g., 'orders[*].item', 'info.details[]'). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | A Spark DataFrame with the specified columns flattened and Spark-safe column names. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the column paths are inconsistent or not found in the schema, or if array notation is misapplied. |
Example
Flatten three levels of orders in a customer DataFrame:
flat_df = flatten_spark(df, [
"customer[*].orders[*].items[*].productId",
"customer[*].name"
])
flat_df.show()
Types and Base Rule
The way we categorise the data quality dimensions
gchq_data_quality.models.DamaFramework
Bases: str, Enum
Allowed names for data quality framework dimensions following DAMA (Data Management Association).
Members
Uniqueness: "Uniqueness"
Completeness: "Completeness"
Validity: "Validity"
Consistency: "Consistency"
Accuracy: "Accuracy"
Timeliness: "Timeliness"
Note
It will accept any string case, but coerce to title case.
Example
DamaFramework("uniqueness") # Returns DamaFramework.Uniqueness
DamaFramework.Completeness.value # "Completeness"
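The case-coercion described in the note can be sketched with a plain str Enum. This is an illustrative reimplementation to show the mechanism, not the library's source:

```python
from enum import Enum

class DamaFramework(str, Enum):
    # Sketch of the documented enum members.
    Uniqueness = "Uniqueness"
    Completeness = "Completeness"
    Validity = "Validity"
    Consistency = "Consistency"
    Accuracy = "Accuracy"
    Timeliness = "Timeliness"

    @classmethod
    def _missing_(cls, value):
        # Accept any string case by coercing to title case,
        # e.g. "uniqueness" -> "Uniqueness".
        if isinstance(value, str):
            return cls.__members__.get(value.title())
        return None

assert DamaFramework("uniqueness") is DamaFramework.Uniqueness
assert DamaFramework.Completeness.value == "Completeness"
```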
The base rule is never called by a user, but serves as a parent for all data quality rules.
gchq_data_quality.rules.base.BaseRule
Bases: DataQualityBaseModel, ABC
Abstract base class for data quality rule definitions.
Not intended for direct use. Use a subclass with a specific rule type (e.g., AccuracyRule, CompletenessRule) for configuration or execution of data quality checks. BaseRule handles all generic configuration and evaluation steps, with rule-specific logic implemented via subclass overrides.
Attributes:
| Name | Type | Description |
|---|---|---|
| field | str | Column to check for rule evaluation. |
| rule_id | str \| None | Optional identifier for this rule. |
| rule_description | str \| None | Optional summary or explanation of the rule. |
| na_values | str \| int \| float \| list[Any] \| None | Values to treat as NULL. |
| skip_if_null | Literal['all', 'any', 'never'] | Controls which records are skipped due to nulls. |
| data_quality_dimension | DamaFramework | Linked DAMA data quality dimension. |
Methods:
| Name | Description |
|---|---|
| evaluate | (pd.DataFrame \| SparkDataFrame \| Elasticsearch) -> DataQualityResult. Applies the rule to source data and returns evaluation metrics and diagnostics. |
Note
This base class should not be instantiated directly. Use a rule subclass for actual configuration or evaluation.
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains metrics of evaluation such as pass rate, evaluated record count, indices/sample of failed records, and rule metadata. See DataQualityResult documentation for details. |
data_quality_dimension = Field(..., description='The Dama dimension for each rule')
field = Field(..., description='Column to check')
na_values = Field(default=None, description='Additional values to treat as null')
rule_description = Field(default=None, description='Description of the rule')
rule_id = Field(default=None, description='Identifier for this rule')
skip_if_null = Field(default='any', description="Controls which rows containing null values are skipped. If 'all', a row is skipped only when all columns used are NULL. For most rules this applies just to the 'field' column, but some, like TimelinessRelativeRule, can use more than one column. If values aren't skipped, NULL values are passed into the calculations, so be cautious about what you allow through, as 3 > pd.NA evaluates to <NA>.")
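The 3 > pd.NA caveat in the skip_if_null description can be demonstrated directly:

```python
import pandas as pd

# Comparisons against pd.NA propagate NA instead of returning False,
# so an unskipped null poisons a pass/fail boolean mask.
print(3 > pd.NA)  # <NA>
```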
_coerce_dataframe_type(df)
Some rules require values to be coerced to a different data type: Timeliness rules coerce to UTC datetime, ValidityNumericalRange to numeric.
This function handles coercing to the relevant data type for the rule. Override if needed; the default behaviour is no coercion.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | If no coercion, the original df; if coerced, a modified dataframe. |
_copy_and_subset_dataframe(df, columns_used)
Copies the dataframe to avoid later mutations when we replace NA values or coerce to a different data type.
Also ensures the dataframe columns are kept in the same order as the original df.
_evaluate_in_elastic(es, index_name, query=None)
_evaluate_in_pandas(df)
Evaluates the rule against the provided DataFrame.
Performs the field existence check, handles NA values and coercion, calculates the number of records evaluated and passing, computes the pass rate, and includes a sample of failed records if required. A subset of the steps below can be overridden to give an inherited rule the desired behaviour without having to completely override the evaluate() function itself.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to evaluate. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | A summary of data quality metrics for the rule, such as records evaluated, pass rate, and details of failed records if required. |
_evaluate_in_pandas_output_dataframe(df)
Wrapper to ensure that when executing in Spark we return a DataFrame (a Spark requirement), while maintaining the behaviour that _evaluate_in_pandas returns a DataQualityResult (so we did not want to override that).
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame in a format that matches SparkDataQualityResultSchema. |
_evaluate_in_spark(spark_df)
By default we execute everything in pandas via mapInPandas; this partitions the data automatically and sends DataFrames to each Spark worker, and we then aggregate the resulting data.
_get_columns_used_pandas()
The columns used in evaluating the rule. Defaults to just the field, but other rules, such as consistency, may use more than one column and will override this.
_get_null_count(df, field)
_get_null_counts_all_columns(df)
Goes through each column and calculates the null count.
Returns:
| Type | Description |
|---|---|
| dict[str, int] | A dictionary of {column_name: null_count}, e.g. {'name': 7, 'age': 0}. |
_get_records_evaluated_mask_pandas(df)
The boolean mask of whether a record is being evaluated. The majority of rules do not evaluate records that are NULL, with the exception of the CompletenessRule, so the default behaviour is to evaluate non-null values.
_get_records_evaluated_pandas(df)
Computes the number of records that are evaluated against the rule.
By default, counts non-null entries in the target field. Override this for rules involving multiple columns or different completeness logic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to process. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | The count of records in the field being assessed. |
_get_records_failed_mask_pandas(df)
Abstract method to generate a boolean mask for records failing the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to process. | required |
Returns:
| Type | Description |
|---|---|
| pd.Series | Boolean mask where True indicates a failing record. |
_get_records_failed_pandas(df)
Returns a list of unique records from the field that failed the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame instance to process (assumes df has been filtered to just the required columns). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| list | list[dict] | Unique records from the field corresponding to failed records, in the format [{colA: valueA, colB: valueB}, ...]. |
_get_records_passing_mask_pandas(df)
abstractmethod
The boolean mask of which records are passing (i.e. this function is the main way we define our data quality rules). By definition this is ANDed with the records_evaluated_mask, as a record cannot pass if it has not been evaluated.
_get_records_passing_pandas(df)
Abstract method to compute the number of records passing the data quality rule.
This must be customised for each specific rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to process. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | The count of records passing the rule's criteria. |
_get_skip_if_null_mask(df)
Return mask for records to skip based on self.skip_if_null.
_get_spark_safe_rule()
Returns a modified (deep copy) of the rule with Spark-safe column names for any column used to evaluate the rule. This is required when working with nested data: if we want to measure 'customers.age', then after we flatten the dataframe and extract the age property from the 'customers' object, the column is renamed to customers_age before being passed to _evaluate_in_pandas, because 'customers.age' is not a valid Spark column name once the data is flattened.
This is overridden for each subrule type if more than self.field is used.
_handle_dataframe_coercion(df)
Coerce the dataframe to a new datatype (if required). We also check whether the null count changes upon coercion and raise a warning for the user if it does.
_handle_na_values_pandas(df, columns_used, na_values)
Replace specified values in a DataFrame with pd.NA if na_values is provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input dataframe. | required |
| columns_used | list | Columns to scan for null-like values. | required |
| na_values | list | List of values to consider as missing. | required |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | A dataframe where the specified values are replaced with pd.NA, or the original if na_values is None. |
_replace_na_in_bool_mask(mask)
If we get None in a boolean mask we can't conduct mask operations such as inverting it or logical AND / OR, so this replaces None / NA with False.
This method can be overridden by child classes
_require_failed_records_sample(pass_rate)
Determines whether a diagnostic sample of failed records should be collected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pass_rate | float \| None | The rule pass rate, or None if no records were evaluated. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if failed record samples are required; otherwise, False. |
_warn_if_null_counts_different(original_null_counts, new_null_counts)
Compares the null counts between the original and new dataframes (the keys will be the same); if the new one has more nulls, raises a warning naming the column. Typically used during coercion to a new datatype.
evaluate(data_source, index_name='', query=None)
evaluate(data_source: pd.DataFrame) -> DataQualityResult
evaluate(data_source: SparkDataFrame) -> DataQualityResult
evaluate(
data_source: Elasticsearch,
index_name: str = ...,
query: dict | None = ...,
) -> DataQualityResult
Evaluates this rule against the provided data source.
Supports both Pandas and Spark DataFrames as input. Applies all rule configuration, handles nulls and data coercion, and computes the relevant data quality metrics. Elasticsearch evaluation is not currently implemented; supplying an Elasticsearch client raises NotImplementedError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_source | pd.DataFrame \| SparkDataFrame \| Elasticsearch | The data to evaluate: a Pandas DataFrame, a Spark DataFrame, or an Elasticsearch client. | required |
| index_name | str | Required if evaluating with Elasticsearch; the index to check. | '' |
| query | dict | Required if evaluating with Elasticsearch; defaults to a query that matches all documents. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| DataQualityResult | DataQualityResult | Contains the metrics and diagnostics of rule evaluation, including pass rate, number of records evaluated, indices and sample of failed records, and rule metadata. See DataQualityResult documentation for details. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an unsupported data source is provided. |
| NotImplementedError | If Elasticsearch evaluation is requested but not supported. |