ClickHouse Restore Validation
This article explains how to use the new `validate_restore_consistency.yml` playbook in `clickhouse_ansible` to run cross-cluster data validation between the source cluster and the recovery target cluster.
This is not a built-in step of `restore_cluster.yml` but a standalone playbook, for three reasons:
- It needs to read two inventories at the same time.
- It is essentially a comparison task between two clusters.
- A production source cluster that keeps receiving writes offers no stable window for strongly consistent post-restore acceptance; only write-stopped or static drill scenarios suit this kind of validation.
1. Applicable scenarios
- Drill environments
- A static source cluster whose writes have been stopped
- A restore target cluster on which `restore_cluster.yml` has completed
Not recommended for:
- A production source cluster that is still receiving writes
- A recovery target cluster whose data is still being changed by the business
2. Why a separate playbook?
restore_cluster.yml is responsible for:
- Reading the manifest
- Locating the backup path
- Executing the ClickHouse `RESTORE`
- Verifying replication health on the restore target
It should not also own the "source cluster vs. target cluster comparison", because:
- The source cluster may not be part of the current inventory
- The source cluster may still be receiving writes
- TTL tables, asynchronously merged tables, and materialized-view tables each need different acceptance criteria

Therefore, the restore action and the restore acceptance are kept separate.
3. How to run it
`validate_restore_consistency.yml` must be given both inventories at the same time:

```bash
cd /usr/local/dbbot/clickhouse_ansible/playbooks
ansible-playbook \
  -i ../inventory/hosts.backup.ini \
  -i ../inventory/hosts.restore.ini \
  validate_restore_consistency.yml
```
Default conventions:
- Source cluster group: `clickhouse_backup`
- Target cluster group: `clickhouse_restore`
If you use custom group names, override `validate_source_group` and `validate_target_group` in `playbooks/vars/validate_restore_config.yml`.
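For example, assuming hypothetical inventory group names `ck_source` and `ck_target` (replace with whatever your two inventory files actually define), the override might look like:

```yaml
# playbooks/vars/validate_restore_config.yml
# Hypothetical custom group names -- must match the groups
# declared in hosts.backup.ini and hosts.restore.ini.
validate_source_group: ck_source
validate_target_group: ck_target
```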
4. Key variables
File: playbooks/vars/validate_restore_config.yml
| Variable | Default value | Purpose |
|---|---|---|
| `validate_source_group` | `clickhouse_backup` | Source cluster inventory group |
| `validate_target_group` | `clickhouse_restore` | Target cluster inventory group |
| `validate_source_query_host` | `127.0.0.1` | Address queried on the source side |
| `validate_target_query_host` | `127.0.0.1` | Address queried on the target side |
| `validate_clickhouse_user` | `default` | Query user name |
| `validate_clickhouse_password` | `{{ clickhouse_default_password }}` | Query password |
| `validate_fail_on_missing_pairs` | `true` | Fail when a shard/replica on one side has no counterpart on the other |
| `validate_checks` | `[]` | List of validation checks |
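Putting the variables together, a minimal vars file might look like this (a sketch; the check shown reuses the example from section 5.1 below):

```yaml
# playbooks/vars/validate_restore_config.yml -- illustrative sketch
validate_source_group: clickhouse_backup
validate_target_group: clickhouse_restore
validate_source_query_host: 127.0.0.1
validate_target_query_host: 127.0.0.1
validate_clickhouse_user: default
validate_clickhouse_password: "{{ clickhouse_default_password }}"
validate_fail_on_missing_pairs: true
validate_checks:
  - name: "order_items_full_count"
    database: "lab_ck_biz"
    local_table: "fact_order_item_local"
```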
5. How to write check items
5.1 Whole-table count validation for regular local tables
```yaml
validate_checks:
  - name: "order_items_full_count"
    database: "lab_ck_biz"
    local_table: "fact_order_item_local"
```
When `query` is omitted, the playbook generates the query automatically:

```sql
SELECT toString(count()) FROM lab_ck_biz.fact_order_item_local
```
5.2 Fixed time-window validation for TTL tables
This is currently the recommended form for TTL tables.
```yaml
validate_checks:
  - name: "mysql_slowlog_recent_window"
    database: "lab_ck_biz"
    local_table: "mysql_slowlog_raw_local"
    where: "log_time >= toDateTime('2026-02-07 00:00:00')"
```
The playbook automatically generates:

```sql
SELECT toString(count())
FROM lab_ck_biz.mysql_slowlog_raw_local
WHERE log_time >= toDateTime('2026-02-07 00:00:00')
```
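The generation rule for both forms can be sketched in a few lines of Python (a hypothetical re-implementation of the convention described above, not the playbook's actual template code):

```python
def build_check_query(check: dict) -> str:
    """Build the single-value SQL for one check item.

    An explicit `query` wins; otherwise a count() over
    database.local_table is generated, with an optional
    WHERE clause taken from `where`.
    """
    if check.get("query"):
        return check["query"]
    sql = f"SELECT toString(count()) FROM {check['database']}.{check['local_table']}"
    if check.get("where"):
        sql += f" WHERE {check['where']}"
    return sql

# Example: the TTL-window check from above
check = {
    "name": "mysql_slowlog_recent_window",
    "database": "lab_ck_biz",
    "local_table": "mysql_slowlog_raw_local",
    "where": "log_time >= toDateTime('2026-02-07 00:00:00')",
}
print(build_check_query(check))
```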
5.3 Custom single-value SQL
For more complex acceptance logic you can write the full SQL yourself, but the result must be a single value:
```yaml
validate_checks:
  - name: "slowlog_recent_sum_rows_examined"
    query: "SELECT toString(sum(rows_examined)) FROM lab_ck_biz.mysql_slowlog_raw_local WHERE log_time >= toDateTime('2026-02-07 00:00:00')"
```
6. Why TTL tables cannot be compared by whole-table counts
For a table like mysql_slowlog_raw_local, the defaults are:
- Data span: 60 days
- Table TTL: `30 DAY DELETE`
Even if such a table restores successfully, comparing whole-table `count()` between source and target is not a valid strict consistency check, because:
- `RESTORE` only restores what is in the backup.
- `TTL DELETE` relies on asynchronous background merges and is not completed instantly during recovery.
- Even if `MATERIALIZE TTL` is run manually, the source and target run it at different times, so data near the TTL boundary may keep drifting.
Therefore:
- For TTL tables, prefer comparison over a fixed time window
- For non-TTL tables, whole-table count comparison is acceptable
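Following that guidance, a mixed check list might look like this (table names reuse the examples above):

```yaml
validate_checks:
  # Non-TTL table: whole-table count is safe to compare
  - name: "order_items_full_count"
    database: "lab_ck_biz"
    local_table: "fact_order_item_local"
  # TTL table: pin the comparison to a fixed time window
  - name: "mysql_slowlog_recent_window"
    database: "lab_ck_biz"
    local_table: "mysql_slowlog_raw_local"
    where: "log_time >= toDateTime('2026-02-07 00:00:00')"
```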
7. Results
When validation passes, the playbook outputs a summary containing `source_group`, `target_group`, `checks`, and `pairs`.
When validation fails, the mismatches are listed directly, including:
- which `check`
- which `shard`/`replica`
- the source-side value
- the target-side value
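The pairing-and-comparison step can be sketched as follows (a hypothetical illustration of the logic, not the playbook's actual code; the inputs are the single-value query results keyed by `(shard, replica)`):

```python
def compare_pairs(source: dict, target: dict, fail_on_missing: bool = True):
    """Compare single-value results keyed by (shard, replica).

    Returns mismatch records in the spirit of the failure
    output described above: the pair, both values, and why.
    """
    mismatches = []
    for key in sorted(set(source) | set(target)):
        src = source.get(key)
        tgt = target.get(key)
        if src is None or tgt is None:
            # One side has no counterpart for this shard/replica;
            # reported only when validate_fail_on_missing_pairs-style
            # behaviour is enabled.
            if fail_on_missing:
                mismatches.append({"pair": key, "source": src, "target": tgt,
                                   "reason": "missing counterpart"})
            continue
        if src != tgt:
            mismatches.append({"pair": key, "source": src, "target": tgt,
                               "reason": "value mismatch"})
    return mismatches

# Example: shard 1 matches, shard 2 differs
source = {("1", "replica1"): "1000", ("2", "replica1"): "980"}
target = {("1", "replica1"): "1000", ("2", "replica1"): "975"}
print(compare_pairs(source, target))
```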
8. Recommended order
1. ClickHouse backup
2. ClickHouse restore
3. After the restore completes, run `validate_restore_consistency.yml`
4. If you need to clean up the environment, run the uninstall script afterwards