Data Quality Tutorial - Python 2

Prerequisites

Completion of the Python 1 Tutorial (comfortable running data quality functions on DataFrames).

Coding:
You should know about pandas DataFrames and basic Python syntax.

Aim

Create Data Quality config files (YAML-based rule lists).
Run these configs directly against your DataFrames.
Manage your regex patterns from a single YAML file

1. Reusing Example Data

import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    "id": [1, 2, 3, 3, 5],
    "name": ["John", "Jane", "Dave", None, "Missing"],
    "age": [30, 25, 102, 15, -5],
    "email": [
        "john@example.com",
        "jane@example.com",
        "dave@example",
        "test@test.com",
        "alice@example.com",
    ],
    "category": ["A", "B", "C", "D", "X"],
    "score": [10, 20, 30, 40, -1],
    "date": [
        datetime(2023, 1, 1),
        datetime(2023, 2, 1),
        datetime(2023, 3, 1),
        datetime(2021, 1, 1),
        datetime(2023, 5, 1),
    ]
})

2. YAML Config Files for Data Quality Rules

Why YAML?

YAML is human-readable and machine-readable, well easier for a human to read than JSON anyway.

Key structure:

Overall metadata (e.g. dataset_name, measurement_time) all optional
List of rules you must have at least one rule before you run the config.

Example:

dataset_name: My Source Data
measurement_sample: 10% of records
lifecycle_stage: null
rules:
  - field: id
    function: uniqueness
  - field: name
    na_values: ''
    function: validity_regex
    regex_pattern: '[A-z0-9_]'

Lists in YAML

valid_values: [A, B, C, D]   # simple
valid_values:
  - A
  - B
  - C
  - D               # verbose, useful for long lists

Regex in YAML

Always surround regex_pattern with single quotes:

regex_pattern: '[A-Za-z]+'
regex_pattern: '\d{4}-\d{2}-\d{2}'
regex_pattern: 'don''t'    # To include a single quote

3. Loading and Validating Config Files

from gchq_data_quality import DataQualityConfig

config = DataQualityConfig.from_yaml("your_config.yaml")

You can load multiple files.

We find it makes sense to split your rules into separate files based on what you are measuring. For example, if you own a pet shop and have rules around pet details and customer details and orders, you might want a dates.yaml for storing rules relating to all your dates and names.yaml relating to rules for names of pets and owners:

config = DataQualityConfig.from_yaml(['dates.yaml', 'names.yaml'])

4. Running a Config Against Your Data

Once your config is loaded:

report = config.execute(df)
print(report.to_dataframe(measurement_time_format="%Y-%m-%d %H:%M"))

You can adjust config metadata programmatically. It can be useful to override measurement_time, as you may want to pretend the data was measured at the date of ingest, rather than when you actually measured it. It can help make sense of your analysis later to understand the quality of the data based on what it landed.

from datetime import timezone

config.measurement_sample = "Test Sample"
config.dataset_name = "Overwrite Dataset Name"
config.measurement_time = datetime.now(tz=timezone.utc)

5. Creating a Config File From a Report

A typical workflow: - Experiment with rules in Python - Produce a DataQualityReport - Extract those rules back into a deployable YAML config and modify - saves you writing out the entire YAML file from scratch

config_from_report = DataQualityConfig.from_report(report)
config_from_report.to_yaml("yaml_from_report.yaml", overwrite=True)

6. Mangaing Regular Expressions

Use a separate YAML file for regex patterns, to keep config rules readable and maintainable.

regex_patterns.yaml:

EMAIL_REGEX: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
PHONE_REGEX: '[0-9]+'

Reference pattern names instead of raw regex in your rules:

- field: email
  function: validity_regex
  regex_pattern: EMAIL_REGEX

When loading your config:

config = DataQualityConfig.from_yaml(
    "config_with_regex_refs.yaml", 
    regex_yaml_path="regex_patterns.yaml"
)

7. Tweak Output Display (Advanced)

Control sample output size globally:

from gchq_data_quality.globals import SampleConfig
SampleConfig.RECORDS_FAILED_SAMPLE_SIZE = 25