Data Quality Tutorial - Python 2

Prerequisites

  • Completion of the Python 1 Tutorial (comfortable running data quality functions on DataFrames).

Coding:
You should know about pandas DataFrames and basic Python syntax.

Aim

  • Create Data Quality config files (YAML-based rule lists).
  • Run these configs directly against your DataFrames.
  • Manage your regex patterns from a single YAML file.

1. Reusing Example Data

import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    "id": [1, 2, 3, 3, 5],
    "name": ["John", "Jane", "Dave", None, "Missing"],
    "age": [30, 25, 102, 15, -5],
    "email": [
        "john@example.com",
        "jane@example.com",
        "dave@example",
        "test@test.com",
        "alice@example.com",
    ],
    "category": ["A", "B", "C", "D", "X"],
    "score": [10, 20, 30, 40, -1],
    "date": [
        datetime(2023, 1, 1),
        datetime(2023, 2, 1),
        datetime(2023, 3, 1),
        datetime(2021, 1, 1),
        datetime(2023, 5, 1),
    ]
})

2. YAML Config Files for Data Quality Rules

Why YAML?

YAML is readable by both people and machines, and most people find it easier to read than JSON.

Key structure:

  • Overall metadata (e.g. dataset_name, measurement_time). All of it is optional.
  • A list of rules. You must have at least one rule before you run the config.

Example:

dataset_name: My Source Data
measurement_sample: 10% of records
lifecycle_stage: null
rules:
  - field: id
    function: values_are_unique
  - field: name
    na_values: ''
    function: values_match_regex
    regex_pattern: '[A-Za-z0-9_]'
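When writing a character class such as the one in regex_pattern above, note that in ASCII the range A-z also covers the six punctuation characters that sit between Z and a, so A-Za-z is usually what is intended. The following standard-library sketch (independent of gchq_data_quality) shows the difference:

```python
import re

# Characters that sit between "Z" and "a" in ASCII.
between = [chr(code) for code in range(ord("Z") + 1, ord("a"))]

wide = re.compile(r"[A-z]")       # spans the punctuation as well as letters
strict = re.compile(r"[A-Za-z]")  # letters only

wide_hits = [ch for ch in between if wide.fullmatch(ch)]
strict_hits = [ch for ch in between if strict.fullmatch(ch)]

print(between)      # ['[', '\\', ']', '^', '_', '`']
print(wide_hits)    # the same six characters
print(strict_hits)  # []
```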

Lists in YAML

# Inline form, handy for short lists
valid_values: [A, B, C, D]

# Block form, equivalent to the above and easier to read for long lists
valid_values:
  - A
  - B
  - C
  - D

Regex in YAML

Regular expressions often contain characters (\, ', :, {}, etc.) that YAML may misinterpret. The safest approach is to treat the regex as a string literal. We recommend one of two approaches:

1. Single‑quoted string (recommended for most simple regex)

Inside single quotes, YAML takes every character literally, including backslashes. The only special case is the single quote itself, which you escape by doubling it.

regex_pattern: '[A-Za-z]+'
regex_pattern: '\d{4}-\d{2}-\d{2}'
regex_pattern: 'don''t'   # escape single quote by doubling it
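To see what the regex engine receives, the sketch below feeds the same three patterns to Python's re module. A YAML single-quoted string behaves much like a Python raw string here: backslashes pass through untouched, and '' collapses to a single quote. The sample inputs are made up for illustration.

```python
import re

# Python equivalents of the YAML single-quoted values above.
letters = r"[A-Za-z]+"
iso_date = r"\d{4}-\d{2}-\d{2}"
dont = "don't"  # YAML 'don''t' yields the five characters don't

assert re.fullmatch(letters, "John")
assert re.fullmatch(iso_date, "2023-01-01")
assert re.fullmatch(iso_date, "01/01/2023") is None
assert re.fullmatch(dont, "don't")
```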

2. Literal block (|)

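Preserves the string as written, with no escaping. Useful for long or complex regex patterns, or where the regex contains many single quotes. Note: you do not need to surround the pattern with any quote character. A plain | keeps the line break at the end of the block, so use |- to strip it and avoid a stray newline at the end of your pattern.

regex_pattern: |-
  ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$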
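One YAML detail worth knowing: a plain | block keeps the final line break, so the loaded string ends in a newline, whereas |- strips it. Whether gchq_data_quality trims the pattern before use is not stated here, so the sketch below only shows, using Python's re module, why a stray trailing newline matters.

```python
import re

pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

with_newline = pattern + "\n"  # what a "|" block scalar would load as
stripped = pattern             # what a "|-" block scalar would load as

email = "john@example.com"
matches_with_newline = re.search(with_newline, email) is not None
matches_stripped = re.search(stripped, email) is not None

print(matches_with_newline)  # False: the pattern now demands a newline after the address
print(matches_stripped)      # True
```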

3. Loading and Validating Config Files

from gchq_data_quality import DataQualityConfig

config = DataQualityConfig.from_yaml("your_config.yaml")

You can load multiple files.

We find it makes sense to split your rules into separate files based on what you are measuring. For example, if you own a pet shop with rules covering pet details, customer details and orders, you might keep all date-related rules in dates.yaml and all rules for the names of pets and owners in names.yaml:

config = DataQualityConfig.from_yaml(['dates.yaml', 'names.yaml'])

4. Running a Config Against Your Data

Once your config is loaded:

report = config.execute(df)
print(report.to_dataframe(measurement_time_format="%Y-%m-%d %H:%M"))

You can adjust config metadata programmatically. Overriding measurement_time is useful when you want the report to record the date the data was ingested rather than the date you actually measured it. This can make later analysis easier, because quality is then tracked against when the data landed.

from datetime import datetime, timezone

config.measurement_sample = "Test Sample"
config.dataset_name = "Overwrite Dataset Name"
config.measurement_time = datetime.now(tz=timezone.utc)
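As a sketch of the ingest-date idea, the snippet below builds a timezone-aware timestamp for a hypothetical ingest time and formats it with the same pattern passed to to_dataframe earlier. The date itself is invented for illustration, and the assignment to config.measurement_time appears only as a comment because it needs a loaded config.

```python
from datetime import datetime, timezone

# Hypothetical ingest time: when the data landed, not when it was measured.
ingest_time = datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc)

# With a loaded config you would then set:
#   config.measurement_time = ingest_time

formatted = ingest_time.strftime("%Y-%m-%d %H:%M")
print(formatted)  # 2023-05-01 09:30
```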

5. Building a Config Incrementally

DataQualityConfig supports len(), +, and += so you can inspect and grow a config programmatically.

Checking the rule count

config = DataQualityConfig.from_yaml("config.yaml")
print(len(config))   # number of rules currently in the config

Adding rules

| Operator | Behaviour | Original modified? |
| --- | --- | --- |
| config + rule | Returns a new DataQualityConfig with the rule appended; original unchanged | No |
| config += rule | Appends the rule to config in place | Yes |
| config += [rule1, rule2] | Appends all rules from the list in place | Yes |
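The difference between + and += follows Python's usual __add__ / __iadd__ convention. The toy class below is a stand-in written for illustration, not the library's implementation: ToyConfig and its string "rules" are hypothetical and only mirror the behaviour described above.

```python
class ToyConfig:
    """Hypothetical stand-in that mimics the operator behaviour described above."""

    def __init__(self, rules=None):
        self.rules = list(rules or [])

    def __len__(self):
        return len(self.rules)

    def __add__(self, rule):
        # "+" builds a new object and leaves self untouched.
        return ToyConfig(self.rules + [rule])

    def __iadd__(self, rule_or_rules):
        # "+=" mutates self; accept a single rule or a list of rules.
        if isinstance(rule_or_rules, list):
            self.rules.extend(rule_or_rules)
        else:
            self.rules.append(rule_or_rules)
        return self


config = ToyConfig(["rule_a"])
bigger = config + "rule_b"       # new object, original unchanged
print(len(config), len(bigger))  # 1 2

same_object = config
config += "rule_c"               # appended in place
config += ["rule_d", "rule_e"]   # list appended in place
print(len(config), config is same_object)  # 4 True
```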

6. Creating a Config File From a Report

A typical workflow:

  • Experiment with rules in Python.
  • Produce a DataQualityReport.
  • Extract those rules back into a deployable YAML config, then modify it as needed.

This saves you writing out the entire YAML file from scratch.

config_from_report = DataQualityConfig.from_report(report)
config_from_report.to_yaml("yaml_from_report.yaml", overwrite=True)

7. Managing Regular Expressions

Use a separate YAML file for regex patterns, to keep config rules readable and maintainable.

regex_patterns.yaml:

EMAIL_REGEX: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
PHONE_REGEX: '[0-9]+'

Reference pattern names instead of raw regex in your rules:

- field: email
  function: values_match_regex
  regex_pattern: EMAIL_REGEX

When loading your config:

config = DataQualityConfig.from_yaml(
    "config_with_regex_refs.yaml",
    regex_yaml_path="regex_patterns.yaml"
)

This is a 'dumb' find-and-replace operation: EMAIL_REGEX is replaced with the corresponding regex value from your regex file.
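The substitution idea can be sketched in plain Python. This is an illustration of the concept only, not the library's code: the resolve_patterns helper is hypothetical, and it works on already-parsed dictionaries, whereas the library may well operate on the raw YAML text.

```python
import re

# Contents of regex_patterns.yaml, written out as a Python dict.
patterns = {
    "EMAIL_REGEX": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "PHONE_REGEX": r"[0-9]+",
}

# Rules as they might look after parsing the config YAML.
rules = [
    {"field": "email", "function": "values_match_regex", "regex_pattern": "EMAIL_REGEX"},
    {"field": "name", "function": "values_match_regex", "regex_pattern": "[A-Za-z]+"},
]


def resolve_patterns(rules, patterns):
    """Hypothetical helper: swap known pattern names for their regex values."""
    resolved = []
    for rule in rules:
        rule = dict(rule)  # copy so the input list is not mutated
        name = rule.get("regex_pattern")
        if name in patterns:
            rule["regex_pattern"] = patterns[name]
        resolved.append(rule)
    return resolved


resolved = resolve_patterns(rules, patterns)
for rule in resolved:
    re.compile(rule["regex_pattern"])  # raises re.error on an invalid pattern
    print(rule["field"], "->", rule["regex_pattern"])
```

Compiling each resolved pattern up front is a cheap way to catch a mistyped pattern name, since an unresolved name would otherwise be treated as a literal regex.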

8. Tweak Output Display (Advanced)

Control sample output size globally:

from gchq_data_quality.globals import SampleConfig
SampleConfig.RECORDS_FAILED_SAMPLE_SIZE = 25

This value can be made very large. To list every failed record in a dataset of 100,000 rows, you could in theory set it to 100,000. Be aware that the sampled values are held in memory, so you will need enough RAM, although this is only likely to become an issue at around 1 million records or more.