Data Quality Tutorial - Python 2
Prerequisites
- Completion of the Python 1 Tutorial (comfortable running data quality functions on DataFrames).
Coding:
You should know about pandas DataFrames and basic Python syntax.
Aim
- Create Data Quality config files (YAML-based rule lists).
- Run these configs directly against your DataFrames.
- Manage your regex patterns from a single YAML file.
1. Reusing Example Data
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 5],
    "name": ["John", "Jane", "Dave", None, "Missing"],
    "age": [30, 25, 102, 15, -5],
    "email": [
        "john@example.com",
        "jane@example.com",
        "dave@example",
        "test@test.com",
        "alice@example.com",
    ],
    "category": ["A", "B", "C", "D", "X"],
    "score": [10, 20, 30, 40, -1],
    "date": [
        datetime(2023, 1, 1),
        datetime(2023, 2, 1),
        datetime(2023, 3, 1),
        datetime(2021, 1, 1),
        datetime(2023, 5, 1),
    ],
})
2. YAML Config Files for Data Quality Rules
Why YAML?
YAML is both human-readable and machine-readable, and it is generally easier for a person to read than JSON.
Key structure:
- Overall metadata (e.g. dataset_name, measurement_time), all of which is optional.
- A list of rules. You must have at least one rule before you run the config.
Example:
dataset_name: My Source Data
measurement_sample: 10% of records
lifecycle_stage: null
rules:
  - field: id
    function: values_are_unique
  - field: name
    na_values: ''
    function: values_match_regex
    regex_pattern: '[A-Za-z0-9_]'
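One detail worth checking in any character class is the letter range. As a minimal sketch using only Python's standard re module (it does not call the data quality library, and it makes no assumption about how values_match_regex applies the pattern internally), the following shows that the range A-z covers more than letters, because six ASCII punctuation characters sit between Z and a:

```python
import re

# ASCII characters that fall between 'Z' (90) and 'a' (97).
between = [chr(code) for code in range(ord("Z") + 1, ord("a"))]

loose = re.compile(r"[A-z]")       # one range running from 'A' straight through to 'z'
strict = re.compile(r"[A-Za-z]")   # two separate ranges: letters only

for ch in between:
    # Each of these punctuation characters matches the loose class
    # and does not match the strict one.
    print(repr(ch), bool(loose.fullmatch(ch)), bool(strict.fullmatch(ch)))
```

Writing the two ranges separately, as in A-Za-z, avoids accepting those extra characters by accident.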
Lists in YAML
valid_values: [A, B, C, D] # simple
valid_values:
- A
- B
- C
- D # verbose, useful for long lists
Regex in YAML
Regular expressions often contain characters (\, ', :, {}, etc.) that YAML may misinterpret. The safest approach is to write the regex as a quoted or literal string. We recommend one of the following two forms:
1. Single-quoted string (recommended for most simple regexes)
Inside single quotes, YAML treats every character literally, including backslashes. The only escape is a doubled single quote.
regex_pattern: '[A-Za-z]+'
regex_pattern: '\d{4}-\d{2}-\d{2}'
regex_pattern: 'don''t' # escape single quote by doubling it
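As a rough illustration, a single-quoted YAML value behaves much like a Python raw string: the backslashes arrive in the pattern unchanged. The sketch below uses only the standard re module and hard-codes the strings a YAML loader would return for the examples above; the sample values are made up for illustration and are not part of the tutorial data.

```python
import re

# These strings stand in for the values a YAML loader would return
# from the single-quoted examples above.
date_pattern = r"\d{4}-\d{2}-\d{2}"
quote_pattern = "don't"  # YAML 'don''t' loads as the five characters don't

# Hypothetical sample values, chosen only to exercise the pattern.
samples = ["2023-01-01", "01/02/2023", "2023-1-1"]

for value in samples:
    # fullmatch requires the whole value to fit the pattern.
    print(value, bool(re.fullmatch(date_pattern, value)))

print(bool(re.search(quote_pattern, "I don't know")))
```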
2. Literal block (|)
Preserves the string as written, with no quoting or escaping, so you do not need to surround the pattern with any quote character. This is useful for long or complex regex patterns, or where the regex contains many single quotes.
regex_pattern: |
  ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
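One thing to watch with block scalars: under the YAML specification a plain | block ends with a single newline, while the |- form strips it. The sketch below does not load any YAML. It hard-codes the two strings a loader would return and uses the standard re module to show how a stray trailing newline could change matching. This tutorial does not say whether the data quality library trims patterns before use, so treat this as something to check rather than a known problem.

```python
import re

email = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

pattern_clip = email + "\n"   # what a plain `|` block yields (newline kept)
pattern_strip = email         # what a `|-` block yields (newline stripped)

value = "john@example.com"

# With the trailing newline, the pattern demands a literal newline after
# the end-of-string anchor, so an ordinary value no longer matches.
print(bool(re.search(pattern_clip, value)))
print(bool(re.search(pattern_strip, value)))
```

If patterns loaded from a | block fail unexpectedly, switching the indicator to |- is a cheap first thing to try.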
3. Loading and Validating Config Files
from gchq_data_quality import DataQualityConfig
config = DataQualityConfig.from_yaml("your_config.yaml")
You can load multiple files at once by passing a list of paths.
It often makes sense to split your rules into separate files based on what you are measuring. For example, a pet shop with rules covering pet details, customer details and orders might keep a dates.yaml for rules about all its dates and a names.yaml for rules about the names of pets and owners:
config = DataQualityConfig.from_yaml(['dates.yaml', 'names.yaml'])
4. Running a Config Against Your Data
Once your config is loaded:
report = config.execute(df)
print(report.to_dataframe(measurement_time_format="%Y-%m-%d %H:%M"))
You can adjust config metadata programmatically. Overriding measurement_time is useful when you want the report to record the date the data was ingested rather than the time you actually ran the measurement. This makes later analysis easier, because data quality can be understood relative to when the data landed.
from datetime import timezone
config.measurement_sample = "Test Sample"
config.dataset_name = "Overwrite Dataset Name"
config.measurement_time = datetime.now(tz=timezone.utc)
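To record quality as of the ingest date rather than the current time, a fixed timezone-aware datetime can be used in place of datetime.now. The sketch below builds such a value with the standard library only. The ingest timestamp is a made-up example, and the assignment to config.measurement_time is left as a comment so the block runs without a loaded config.

```python
from datetime import datetime, timezone

# Hypothetical ingest timestamp; in practice this would come from your
# own pipeline metadata.
ingest_time = datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc)

# With a config loaded as shown earlier, the override would be:
# config.measurement_time = ingest_time

# Same format string as used with report.to_dataframe above.
print(ingest_time.strftime("%Y-%m-%d %H:%M"))
```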
5. Building a Config Incrementally
DataQualityConfig supports len(), +, and += so you can inspect and grow a config programmatically.
Checking the rule count
config = DataQualityConfig.from_yaml("config.yaml")
print(len(config)) # number of rules currently in the config
Adding rules
| Operator | Behaviour | Original modified? |
|---|---|---|
| config + rule | Returns a new DataQualityConfig with the rule appended; original unchanged | No |
| config += rule | Appends the rule to config in place | Yes |
| config += [rule1, rule2] | Appends all rules from the list in place | Yes |
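The difference between + and += follows the usual Python operator protocol. The toy class below is a stand-in written for illustration, not the library's implementation: ToyConfig and its string "rules" are hypothetical, and it only mirrors the behaviour described in the table (len(), a copying +, and an in-place += that accepts one rule or a list).

```python
class ToyConfig:
    """Hypothetical stand-in for illustration; NOT DataQualityConfig."""

    def __init__(self, rules=None):
        self.rules = list(rules or [])

    def __len__(self):
        return len(self.rules)

    @staticmethod
    def _as_list(other):
        # Accept either a single rule or a list of rules.
        return list(other) if isinstance(other, list) else [other]

    def __add__(self, other):
        # `+` builds a new object and leaves self untouched.
        return ToyConfig(self.rules + self._as_list(other))

    def __iadd__(self, other):
        # `+=` extends the existing rule list and returns the same object.
        self.rules.extend(self._as_list(other))
        return self


config = ToyConfig(["rule_a"])
bigger = config + "rule_b"          # new object; config still has 1 rule
print(len(config), len(bigger))

same = config
config += "rule_c"                  # in place: config now has 2 rules
config += ["rule_d", "rule_e"]      # in place: config now has 4 rules
print(len(config), same is config)
```

The practical consequence is that any other variable referring to the same config sees rules added with +=, but not rules added with +.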
6. Creating a Config File From a Report
A typical workflow:
- Experiment with rules in Python
- Produce a DataQualityReport
- Extract those rules back into a deployable YAML config and modify it as needed
This saves you writing out the entire YAML file from scratch.
config_from_report = DataQualityConfig.from_report(report)
config_from_report.to_yaml("yaml_from_report.yaml", overwrite=True)
7. Managing Regular Expressions
Use a separate YAML file for regex patterns, to keep config rules readable and maintainable.
regex_patterns.yaml:
EMAIL_REGEX: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
PHONE_REGEX: '[0-9]+'
Reference pattern names instead of raw regex in your rules:
- field: email
  function: values_match_regex
  regex_pattern: EMAIL_REGEX
When loading your config:
config = DataQualityConfig.from_yaml(
    "config_with_regex_refs.yaml",
    regex_yaml_path="regex_patterns.yaml",
)
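Conceptually, a named pattern is a lookup from a name to a regex. The sketch below is an assumption about that idea, not the library's code: resolve_pattern is a hypothetical helper, the dictionary stands in for a loaded regex_patterns.yaml, and the email pattern is written anchored at both ends. It then applies the resolved pattern to the email column from the example data using the standard re module.

```python
import re

# Stand-in for the contents of regex_patterns.yaml once loaded.
patterns = {
    "EMAIL_REGEX": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "PHONE_REGEX": r"[0-9]+",
}


def resolve_pattern(value, known):
    """Hypothetical helper: swap a known name for its regex, else keep value."""
    return known.get(value, value)


rule = {
    "field": "email",
    "function": "values_match_regex",
    "regex_pattern": "EMAIL_REGEX",
}
regex = resolve_pattern(rule["regex_pattern"], patterns)

emails = [
    "john@example.com",
    "jane@example.com",
    "dave@example",
    "test@test.com",
    "alice@example.com",
]
# Collect the values that do not fit the resolved pattern.
failed = [e for e in emails if not re.search(regex, e)]
print(failed)
```

Keeping the pattern in one place means a fix to EMAIL_REGEX reaches every rule that refers to it by name.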
8. Tweak Output Display (Advanced)
Control sample output size globally:
from gchq_data_quality.globals import SampleConfig
SampleConfig.RECORDS_FAILED_SAMPLE_SIZE = 25
This value can be made very large. To list every failed record in a dataset of 100,000 rows, you could in theory set it to 100,000. Be aware that the sampled values are held in memory, so you will need enough RAM, although this is only likely to become an issue at around 1 million records or more.