ClickHouse Restore Validation

This article explains how to use the new validate_restore_consistency.yml playbook in clickhouse_ansible to perform cross-cluster data validation between a source cluster and a restore target cluster.

This is not a built-in step of restore_cluster.yml but a standalone playbook, for three reasons:

  1. It needs to read two inventories at the same time.
  2. It is essentially a comparison task between two clusters.
  3. A production source cluster that is still taking writes offers no stable window for strong-consistency validation after a restore; only write-stopped or static drill scenarios suit this kind of check.

1. Applicable scenarios

  • Drill environments
  • A source cluster that has been stopped and is static
  • A target cluster on which restore_cluster.yml has completed a restore

Not recommended for:

  • A production source cluster that is still taking writes
  • A restore target cluster whose data is still changing

2. Why a separate playbook?

restore_cluster.yml is responsible for:

  • Read the manifest
  • Locate the backup path
  • Execute ClickHouse RESTORE
  • Verify replication health on the restore target

It is not well suited to also owning the "compare source cluster against target cluster" step, because:

  • The source cluster may not be part of the current inventory
  • The source cluster may still be taking writes
  • TTL tables, asynchronously merged tables, and materialized-view tables each call for different validation criteria

Therefore, the restore action and restore validation are kept as separate steps.

3. How to run it

validate_restore_consistency.yml must be given both inventories at the same time:

cd /usr/local/dbbot/clickhouse_ansible/playbooks
ansible-playbook \
  -i ../inventory/hosts.backup.ini \
  -i ../inventory/hosts.restore.ini \
  validate_restore_consistency.yml

Default convention:

  • Source cluster group: clickhouse_backup
  • Target cluster group: clickhouse_restore
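
The two inventory files are therefore expected to define these groups. A minimal sketch, where the host names and addresses are placeholders rather than values from any real environment:

# ../inventory/hosts.backup.ini (placeholder hosts)
[clickhouse_backup]
ck-src-01 ansible_host=10.0.0.11
ck-src-02 ansible_host=10.0.0.12

# ../inventory/hosts.restore.ini (placeholder hosts)
[clickhouse_restore]
ck-dst-01 ansible_host=10.0.0.21
ck-dst-02 ansible_host=10.0.0.22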

If you use custom group names, you can override them in playbooks/vars/validate_restore_config.yml:

  • validate_source_group
  • validate_target_group
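
For example, a minimal override might look like this (the group names ck_source and ck_target are placeholders):

# playbooks/vars/validate_restore_config.yml
validate_source_group: "ck_source"
validate_target_group: "ck_target"

Ansible's standard -e extra-vars flag would also work for a one-off run, since extra vars take precedence over vars files.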

4. Key variables

File: playbooks/vars/validate_restore_config.yml

Variable                         Default value                       Function
validate_source_group            clickhouse_backup                   Source cluster group
validate_target_group            clickhouse_restore                  Target cluster group
validate_source_query_host       127.0.0.1                           Source-side query address
validate_target_query_host       127.0.0.1                           Target-side query address
validate_clickhouse_user         default                             Query user name
validate_clickhouse_password     {{ clickhouse_default_password }}   Query password
validate_fail_on_missing_pairs   true                                Fail when a source/target shard/replica pair is missing
validate_checks                  []                                  List of validation checks
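
Putting the defaults together, a full vars file would look roughly like this sketch (the values shown are simply the defaults from the table, with validate_checks still empty):

# playbooks/vars/validate_restore_config.yml
validate_source_group: "clickhouse_backup"
validate_target_group: "clickhouse_restore"
validate_source_query_host: "127.0.0.1"
validate_target_query_host: "127.0.0.1"
validate_clickhouse_user: "default"
validate_clickhouse_password: "{{ clickhouse_default_password }}"
validate_fail_on_missing_pairs: true
validate_checks: []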

5. How to write check items

5.1 Whole-table count validation for ordinary local tables

validate_checks:
  - name: "order_items_full_count"
    database: "lab_ck_biz"
    local_table: "fact_order_item_local"

When query is omitted, the playbook generates the SQL automatically:

SELECT toString(count()) FROM lab_ck_biz.fact_order_item_local

5.2 TTL table fixed time window validation

This is currently the recommended approach for TTL tables.

validate_checks:
  - name: "mysql_slowlog_recent_window"
    database: "lab_ck_biz"
    local_table: "mysql_slowlog_raw_local"
    where: "log_time >= toDateTime('2026-02-07 00:00:00')"

The playbook automatically generates:

SELECT toString(count())
FROM lab_ck_biz.mysql_slowlog_raw_local
WHERE log_time >= toDateTime('2026-02-07 00:00:00')

5.3 Custom single-value SQL

For more complex validation, you can write the full SQL yourself, but the query must return a single value:

validate_checks:
  - name: "slowlog_recent_sum_rows_examined"
    query: "SELECT toString(sum(rows_examined)) FROM lab_ck_biz.mysql_slowlog_raw_local WHERE log_time >= toDateTime('2026-02-07 00:00:00')"

6. Why TTL tables cannot be compared by whole-table counts

For tables like mysql_slowlog_raw_local, the defaults are:

  • Data spanning 60 days
  • A table TTL of 30 DAY DELETE

Even if such a table restores successfully, comparing whole-table count() on the source and target sides is not a reliable consistency check, because:

  1. RESTORE only restores what is in the backup.
  2. TTL DELETE relies on asynchronous background merges for cleanup and will not complete instantly during a restore.
  3. Even if MATERIALIZE TTL is run manually, it runs at different times on the source and target, so rows near the TTL boundary can keep drifting.

Therefore:

  • For TTL tables, prefer comparing a fixed time window (see the sketch after this list)
  • For non-TTL tables, a whole-table count comparison is reasonable
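
One practical way to choose a safe window, again a sketch rather than part of the playbook, is to inspect the actual time range on the source before writing the check and pick a lower bound comfortably inside the TTL boundary:

-- Inspect the live time range and where the 30-day TTL boundary currently sits
SELECT
    min(log_time),
    max(log_time),
    now() - INTERVAL 30 DAY AS ttl_boundary
FROM lab_ck_biz.mysql_slowlog_raw_local

Any WHERE lower bound clearly newer than ttl_boundary (a day or more inside it) avoids rows that TTL cleanup may have removed on one side but not yet on the other.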

7. Results

When validation passes, the playbook outputs:

  • source_group
  • target_group
  • checks
  • pairs

When validation fails, the mismatches are listed directly, including:

  • which check failed
  • which shard/replica
  • the source-side value
  • the target-side value

8. Overall workflow

  1. ClickHouse backup
  2. ClickHouse restore
  3. After the restore completes, run validate_restore_consistency.yml
  4. If the environment needs to be cleaned up, run the uninstall script