ClickHouse Deployment, Backup, and Restore Specification

This document is the shared dbbot specification for ClickHouse scenarios. It unifies the basic requirements for deployment, backup, restore, and daily operations.

1. Goal

Establish unified ClickHouse cluster delivery specifications.
Reduce the impact of backup and recovery on production business.
Reduce high-risk operator errors through inventory guards, manual confirmation, and manifests.
Provide consistent execution standards for in-place cluster restore and cross-site disaster recovery.

2. Basic environment requirements

2.1 Operating system and hardware

Use the x86_64 architecture with SSE4.2 instruction set support (recommended).
Disable transparent huge pages and swap (recommended).
Standardize the time zone, file descriptor limits, and system security baseline across all nodes (recommended).
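
The host checks above can be scripted as a pre-flight step. The following is a minimal sketch, not part of dbbot: the function names are hypothetical, and it assumes a Linux host that exposes the standard /proc and /sys files. The parsing functions take text as input so they can be exercised without touching the host.

```python
# Sketch: host pre-flight checks for SSE4.2, transparent huge pages, and swap.
# Function names are hypothetical; file paths are the standard Linux locations.
from pathlib import Path


def has_sse42(cpuinfo_text: str) -> bool:
    """True if a 'flags' line of /proc/cpuinfo lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and "sse4_2" in line.split():
            return True
    return False


def thp_disabled(thp_enabled_text: str) -> bool:
    """The active THP mode is the bracketed token, e.g. 'always madvise [never]'."""
    return "[never]" in thp_enabled_text.split()


def swap_disabled(proc_swaps_text: str) -> bool:
    """/proc/swaps has one header line; any further non-empty line is an active swap area."""
    lines = [ln for ln in proc_swaps_text.splitlines() if ln.strip()]
    return len(lines) <= 1


def preflight() -> dict:
    """Read the live files on a Linux host and return one boolean per check."""
    return {
        "sse4_2": has_sse42(Path("/proc/cpuinfo").read_text()),
        "thp_disabled": thp_disabled(
            Path("/sys/kernel/mm/transparent_hugepage/enabled").read_text()
        ),
        "swap_disabled": swap_disabled(Path("/proc/swaps").read_text()),
    }
```

Calling preflight() on a target node returns a dictionary such as {"sse4_2": True, ...}; any False value marks a recommendation that the host does not yet meet.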

2.2 Network and time

A stable and low-latency network should be maintained between cluster nodes.
The entire cluster should use the same time synchronization service (NTP or chrony).
The control node must be able to reach both the ClickHouse nodes and the coordination nodes.

2.3 Storage

SSD is preferred for production.
It is recommended to separate the data disk and system disk.
The backup disk should be isolated from the data disk, and the ClickHouse process must be able to read from and write to it.

3. Topology and object specifications

The default documentation example uses a 3x2 topology, but this is only a reference model.
Distributed tables are used for unified access, and local tables carry actual data.
In production, prefer the Replicated*MergeTree family of table engines.
Naming conventions should be consistent, for example:
- <domain>_<topic>_local: local table
- <domain>_<topic>: distributed table
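
The naming convention can be enforced with a small helper. This is an illustrative sketch only: the function name is hypothetical, and the rule that domain and topic are lowercase snake_case tokens is an assumption, since the convention above fixes only the overall shape of the names.

```python
# Sketch: derive the local and distributed table names for one domain/topic pair.
import re

# Assumption: domain and topic are lowercase snake_case tokens.
_TOKEN = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def table_names(domain: str, topic: str) -> tuple:
    """Return (local_table, distributed_table) following <domain>_<topic>[_local]."""
    for part in (domain, topic):
        if not _TOKEN.fullmatch(part):
            raise ValueError(f"invalid name component: {part!r}")
    distributed = f"{domain}_{topic}"
    return f"{distributed}_local", distributed
```

For example, table_names("sales", "orders") yields the pair sales_orders_local and sales_orders.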

4. Execution control specifications

4.1 Inventory separation

Deployment uses hosts.deploy.ini.
Backup uses hosts.backup.ini.
Restore uses hosts.restore.ini.

4.2 Inventory purpose

Set the purpose explicitly in [all:vars] of each inventory, using the one value that matches the inventory's role:

dbbot_inventory_purpose=deploy
dbbot_inventory_purpose=backup
dbbot_inventory_purpose=restore
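
To illustrate how a purpose guard might work, the sketch below reads dbbot_inventory_purpose from an inventory and refuses to continue when it does not match the requested action. How dbbot actually implements the guard is not described here, so the function names, the exception class, and the sample inventory are all hypothetical. Python's configparser is used as an approximate reader for simple INI inventories; it is not a full Ansible inventory parser.

```python
# Sketch: refuse to run an action against an inventory declared for another purpose.
import configparser


class InventoryPurposeError(Exception):
    """Raised when the inventory purpose does not match the requested action."""


def read_inventory_purpose(inventory_text: str):
    """Return the value of dbbot_inventory_purpose in [all:vars], or None if absent."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # tolerate bare host lines
        delimiters=("=",),
        interpolation=None,
        strict=False,
    )
    parser.read_string(inventory_text)
    if not parser.has_section("all:vars"):
        return None
    return parser.get("all:vars", "dbbot_inventory_purpose", fallback=None)


def assert_inventory_purpose(inventory_text: str, action: str) -> None:
    """Raise unless the inventory explicitly declares the requested action."""
    purpose = read_inventory_purpose(inventory_text)
    if purpose != action:
        raise InventoryPurposeError(
            f"inventory purpose is {purpose!r}, but the requested action is {action!r}"
        )


# Placeholder inventory for illustration only.
SAMPLE_INVENTORY = """
[clickhouse]
ch-node-1 ansible_host=<host-1>

[all:vars]
dbbot_inventory_purpose=backup
"""
```

With this sample, assert_inventory_purpose(SAMPLE_INVENTORY, "backup") passes silently, while requesting "restore" raises InventoryPurposeError.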

4.3 Manual confirmation

Deployment enables manual confirmation by default.
Restore enables manual confirmation by default.
Do not leave this safeguard disabled for extended periods in a production environment.
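
One common way to implement such a gate is a typed confirmation, where the operator must re-enter the action and target before anything runs. The sketch below is an assumption about how a gate of this kind could look, not dbbot's actual prompt; the function name and the prompt wording are hypothetical.

```python
# Sketch: a typed confirmation gate. The reader is injectable so it can be tested.
def confirm(action: str, target: str, read_line=input) -> bool:
    """Return True only if the operator types '<action> <target>' exactly."""
    expected = f"{action} {target}"
    answer = read_line(f"Type '{expected}' to continue: ")
    return answer.strip() == expected
```

Requiring the full phrase rather than a single "y" makes it harder to confirm an operation against the wrong cluster by reflex.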

5. Backup and recovery principles

5.1 Backup Principles

Only one replica per shard performs the physical backup, which avoids redundant I/O.
Use safe_ts as the common point-in-time reference across shards.
Record the batch number, path, status, time, and object scope in the manifest.
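
The backup principles can be sketched as two small steps: pick one replica per shard, then record the batch in a manifest. The code below is illustrative only. The field names, the rule of picking the first replica in sorted order, and the treatment of safe_ts as an opaque caller-supplied value are assumptions; the actual dbbot manifest layout is not specified in this document.

```python
# Sketch: choose one backup replica per shard and build a manifest record.
import json
from datetime import datetime, timezone


def pick_backup_replicas(topology: dict) -> dict:
    """topology maps shard name -> list of replica hosts; returns shard -> one host."""
    picked = {}
    for shard, replicas in topology.items():
        if not replicas:
            raise ValueError(f"shard {shard!r} has no replicas")
        # Assumption: any deterministic choice works; take the first in sorted order.
        picked[shard] = sorted(replicas)[0]
    return picked


def build_manifest(batch_id, backup_path, safe_ts, objects, picked, status="success"):
    """Collect batch number, path, status, time, and object scope in one record."""
    return {
        "batch_id": batch_id,
        "path": backup_path,
        "status": status,
        "safe_ts": safe_ts,  # opaque value supplied by the caller
        "created_at": datetime.now(timezone.utc).isoformat(),
        "objects": sorted(objects),
        "replicas": picked,
    }


def manifest_to_json(manifest: dict) -> str:
    """Serialize the manifest so it can be stored next to the backup."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

A caller would pass the cluster topology to pick_backup_replicas, run the physical backup on the chosen hosts, and then write the output of manifest_to_json alongside the backup data.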

5.2 Recovery Principles

The primary replica of each shard performs a full restore.
The other replicas restore the schema only and then catch up through replication.
Object cleanup, release, and business resumption are required steps of the restore runbook and must not be omitted.
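
The restore principles can be expressed as a plan that assigns a mode to every replica. The sketch below is a hypothetical illustration: the mode labels, the step labels, and the choice to list all full restores before the schema-only restores are assumptions, not dbbot behavior.

```python
# Sketch: full restore on each shard's primary replica, schema-only on the others.
# Hypothetical labels for the runbook steps that must not be skipped.
REQUIRED_RUNBOOK_STEPS = ("object_cleanup", "release", "business_resumption")


def build_restore_plan(topology: dict, primaries: dict) -> list:
    """topology: shard -> replica hosts; primaries: shard -> primary host."""
    full, schema_only = [], []
    for shard in sorted(topology):
        replicas = topology[shard]
        primary = primaries[shard]
        if primary not in replicas:
            raise ValueError(f"{primary!r} is not a replica of shard {shard!r}")
        full.append({"shard": shard, "host": primary, "mode": "full"})
        for host in sorted(replicas):
            if host != primary:
                schema_only.append({"shard": shard, "host": host, "mode": "schema_only"})
    # Assumption: data is restored first; the other replicas then sync via replication.
    return full + schema_only


def missing_runbook_steps(runbook_steps) -> list:
    """Return the required steps that are absent from a runbook."""
    return [step for step in REQUIRED_RUNBOOK_STEPS if step not in runbook_steps]
```

A reviewer can run missing_runbook_steps against a draft runbook to confirm that none of the required steps was dropped.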

6. Security and Documentation Standards

Documentation examples must use placeholders and must never contain real passwords or real intranet addresses.
Production operations should retain batch numbers, execution logs, and manifests to support auditing.
Two people must review the restore objects and the inventory scope before any high-risk command is executed.

7. Recommended acceptance items

After deployment, and after each backup or restore, verify at least the applicable items below:

Whether the topology in system.clusters is complete.
Whether system.replicas shows read-only replicas, queue backlog, or replication delay.
Whether the backup batch generated its manifest successfully.
After a restore, whether the total data volume, per-shard data, and replication status are as expected.
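
The first two acceptance items can be checked from query results. The sketch below shows the queries as strings and evaluates rows that a client has already fetched. The column names (is_readonly, queue_size, absolute_delay, shard_num, replica_num) are standard columns of these system tables; the thresholds and function names are assumptions to be tuned per environment.

```python
# Sketch: evaluate rows fetched from system.clusters and system.replicas.
CLUSTER_TOPOLOGY_SQL = """
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
"""

REPLICA_HEALTH_SQL = """
SELECT database, table, is_readonly, queue_size, absolute_delay
FROM system.replicas
"""


def topology_complete(rows, cluster, shards, replicas_per_shard) -> bool:
    """True if the cluster has exactly the expected shards and replicas per shard."""
    seen = {}
    for row in rows:
        if row["cluster"] == cluster:
            seen.setdefault(row["shard_num"], set()).add(row["replica_num"])
    return len(seen) == shards and all(
        len(replicas) == replicas_per_shard for replicas in seen.values()
    )


def unhealthy_replicas(rows, max_queue_size=100, max_delay_seconds=60) -> list:
    """Return (database, table, reasons) for each replica that fails a check."""
    problems = []
    for row in rows:
        reasons = []
        if row["is_readonly"]:
            reasons.append("read-only")
        if row["queue_size"] > max_queue_size:  # assumed threshold
            reasons.append("queue backlog")
        if row["absolute_delay"] > max_delay_seconds:  # assumed threshold
            reasons.append("replication delay")
        if reasons:
            problems.append((row["database"], row["table"], reasons))
    return problems
```

Each row is expected as a dictionary keyed by column name; an empty list from unhealthy_replicas together with a True result from topology_complete means both items pass.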