ClickHouse Deployment, Backup, and Restore Specification
This document defines the common dbbot specifications for ClickHouse scenarios. It unifies the basic requirements for deployment, backup, restore, and daily operations.
1. Goal
- Establish unified ClickHouse cluster delivery specifications.
- Reduce the impact of backup and recovery on production business.
- Reduce high-risk operator errors through inventory guards, manual confirmation, and manifests.
- Provide consistent execution standards for in-place cluster restore and cross-site disaster recovery.
2. Basic environment requirements
2.1 Operating system and hardware
- It is recommended to use the x86_64 architecture with support for the SSE4.2 instruction set.
- It is recommended to disable transparent huge pages and swap.
- Use a unified time zone, file descriptor limits, and system security baseline across all nodes.
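The host requirements above can be checked before deployment. The following is a minimal sketch, not part of dbbot: the function names are hypothetical, and it assumes a Linux node where the standard /proc/cpuinfo, /sys/kernel/mm/transparent_hugepage/enabled, and /proc/meminfo files are readable. The functions take file contents as text so they can be run against output collected from any node.

```python
def has_sse42(cpuinfo_text):
    """Return True if the first 'flags' line of /proc/cpuinfo lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return "sse4_2" in line.split(":", 1)[1].split()
    return False


def thp_mode(thp_text):
    """Return the active THP mode: the bracketed token in
    /sys/kernel/mm/transparent_hugepage/enabled, e.g. 'always madvise [never]'."""
    for token in thp_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo, or None if the line is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None


def precheck(cpuinfo_text, thp_text, meminfo_text):
    """Return a list of findings; an empty list means every check passed."""
    findings = []
    if not has_sse42(cpuinfo_text):
        findings.append("CPU does not report sse4_2")
    if thp_mode(thp_text) != "never":
        findings.append("transparent huge pages are not disabled")
    swap = swap_total_kb(meminfo_text)
    if swap is None or swap > 0:
        findings.append("swap is enabled or could not be determined")
    return findings
```

On a target node, the three inputs can be read with pathlib.Path(...).read_text() from the paths named above and passed to precheck.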
2.2 Network and time
- A stable and low-latency network should be maintained between cluster nodes.
- Time synchronization (NTP or Chrony) should be configured uniformly across the entire cluster.
- The control node must be able to reach both the ClickHouse nodes and the coordination nodes.
2.3 Storage
- SSD is preferred for production.
- It is recommended to separate the data disk and system disk.
- The backup disk should be isolated from the data disk, and the ClickHouse process must be able to read from and write to it.
3. Topology and object specifications
- The default documentation example uses a 3x2 topology, but this is only a reference model.
- Distributed tables are used for unified access, and local tables carry actual data.
- In production scenarios, it is recommended to use the Replicated*MergeTree series of tables first.
- Naming conventions should be consistent, for example:
  - <domain>_<topic>_local: local table
  - <domain>_<topic>: distributed table
4. Execution control specifications
4.1 Inventory division of labor
- Deploy using hosts.deploy.ini
- Backup using hosts.backup.ini
- Restore using hosts.restore.ini
4.2 Inventory purpose
It is recommended to set it explicitly in [all:vars]:
- dbbot_inventory_purpose=deploy
- dbbot_inventory_purpose=backup
- dbbot_inventory_purpose=restore
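To illustrate how such a purpose marker can guard against running an operation with the wrong inventory, here is a minimal sketch. It is an assumption about how a guard could work, not dbbot's actual implementation: the function names are hypothetical, the parser handles only the simple key=value lines of an INI-style inventory, and the host address is a documentation placeholder.

```python
def read_inventory_purpose(inventory_text):
    """Return the value of dbbot_inventory_purpose in [all:vars], or None."""
    section = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "all:vars" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "dbbot_inventory_purpose":
                return value.strip()
    return None


def check_inventory_purpose(inventory_text, expected):
    """Raise ValueError unless the inventory declares the expected purpose."""
    actual = read_inventory_purpose(inventory_text)
    if actual != expected:
        raise ValueError(
            f"inventory purpose is {actual!r}, but this operation requires {expected!r}"
        )


# Example: a backup inventory must not be accepted by a restore run.
BACKUP_INVENTORY = """\
[clickhouse]
192.0.2.10

[all:vars]
dbbot_inventory_purpose=backup
"""

check_inventory_purpose(BACKUP_INVENTORY, "backup")  # passes silently
```

Treating a missing marker as a failure, as this sketch does, means an unlabeled inventory is rejected rather than silently accepted.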
4.3 Manual confirmation
- Deployments enable manual confirmation by default.
- Restores enable manual confirmation by default.
- Disabling these confirmation gates for extended periods in a production environment is not recommended.
5. Backup and recovery principles
5.1 Backup Principles
- Only one replica of each shard is selected to perform the physical backup, which avoids duplicate I/O.
- Use safe_ts as the cross-shard time reference.
- Record the batch number, path, status, time, and object scope in the manifest.
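The backup principles above can be sketched as follows. This is an illustration under stated assumptions, not the dbbot manifest format: all field names, host names, and the path layout are hypothetical, safe_ts is assumed to be a single cutoff timestamp shared by every shard in a batch, and the replica choice (first host in sorted order) is just one deterministic option.

```python
import json
from datetime import datetime, timezone


def pick_backup_replicas(topology):
    """Pick exactly one replica per shard (here: the first host in sorted order)."""
    return {shard: sorted(replicas)[0] for shard, replicas in topology.items()}


def build_manifest(batch_id, safe_ts, topology, backup_root, objects):
    """Assemble a manifest dict; every field name here is illustrative."""
    chosen = pick_backup_replicas(topology)
    return {
        "batch_id": batch_id,
        "safe_ts": safe_ts.isoformat(),  # shared time reference for all shards
        "status": "pending",
        "created_at": datetime.now(timezone.utc).isoformat(),
        "objects": sorted(objects),  # tables covered by this batch
        "shards": [
            {
                "shard": shard,
                "replica": replica,
                "path": f"{backup_root}/{batch_id}/shard_{shard}",
            }
            for shard, replica in sorted(chosen.items())
        ],
    }


# Hypothetical 3x2 topology, read here as 3 shards with 2 replicas each.
TOPOLOGY = {
    1: ["ch-s1-r1", "ch-s1-r2"],
    2: ["ch-s2-r1", "ch-s2-r2"],
    3: ["ch-s3-r1", "ch-s3-r2"],
}

manifest = build_manifest(
    "20240101-001",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    TOPOLOGY,
    "/backup",
    ["sales_orders_local"],
)
print(json.dumps(manifest, indent=2))
```

Writing the manifest as JSON keeps it easy to retain alongside execution logs for auditing.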
5.2 Recovery Principles
- The primary replica of each shard performs a full restore.
- Other replicas restore the table structure only, then catch up through replication.
- Object cleanup, release, and business resumption are required steps of the restore runbook and should not be omitted.
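As a rough sketch of the recovery principles above, the following assigns a restore mode to each host and checks that the required runbook steps were carried out. The mode names, step identifiers, and function names are hypothetical and do not come from dbbot. The sketch assumes one designated primary per shard.

```python
# Steps the specification names as required; the identifiers are illustrative.
REQUIRED_STEPS = ["object_cleanup", "release", "business_resumption"]


def build_restore_plan(topology, primaries):
    """Map each host to 'full' (the shard's primary) or 'schema_only' (other replicas)."""
    plan = {}
    for shard, replicas in topology.items():
        primary = primaries[shard]
        if primary not in replicas:
            raise ValueError(f"primary {primary!r} is not a replica of shard {shard}")
        for host in replicas:
            plan[host] = "full" if host == primary else "schema_only"
    return plan


def missing_steps(executed_steps):
    """Return the required runbook steps that were not executed, in runbook order."""
    done = set(executed_steps)
    return [step for step in REQUIRED_STEPS if step not in done]


topology = {1: ["ch-s1-r1", "ch-s1-r2"], 2: ["ch-s2-r1", "ch-s2-r2"]}
plan = build_restore_plan(topology, {1: "ch-s1-r1", 2: "ch-s2-r2"})
for host, mode in sorted(plan.items()):
    print(host, mode)
```

Hosts marked schema_only receive no restored data in this model. They are expected to catch up through replication once their tables exist.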
6. Security and Documentation Standards
- Documentation examples must use placeholders and must not contain real passwords or real intranet addresses.
- Production operations should retain batch numbers, execution logs and manifests to facilitate auditing.
- Two people must review the restore targets and inventory scope before high-risk commands are executed.
7. Recommended acceptance items
After deployment is complete, verify at least the following:
- Whether the topology in system.clusters is complete.
- Whether system.replicas shows any read-only, queue backlog, or replication delay anomalies.
- Whether the backup batch successfully generated the manifest.
- After a restore completes, whether the total data volume, per-shard data, and replication status are as expected.
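The first two acceptance items can be evaluated from rows queried out of the ClickHouse system tables. The sketch below is one possible approach, not dbbot's implementation: the function names, the thresholds, and the cluster name placeholder are hypothetical. It uses the shard_num and replica_num columns of system.clusters and the is_readonly, queue_size, and absolute_delay columns of system.replicas, and it expects rows as dicts from whichever client is in use.

```python
# Queries to run with any ClickHouse client; <cluster_name> is a placeholder.
CLUSTERS_SQL = (
    "SELECT shard_num, replica_num, host_name "
    "FROM system.clusters WHERE cluster = '<cluster_name>'"
)
REPLICAS_SQL = (
    "SELECT database, table, is_readonly, queue_size, absolute_delay "
    "FROM system.replicas"
)


def topology_complete(cluster_rows, shards, replicas_per_shard):
    """Check that system.clusters lists the expected shards and replicas."""
    per_shard = {}
    for row in cluster_rows:
        per_shard.setdefault(row["shard_num"], set()).add(row["replica_num"])
    return len(per_shard) == shards and all(
        len(replicas) == replicas_per_shard for replicas in per_shard.values()
    )


def replica_anomalies(replica_rows, max_queue=20, max_delay_s=60):
    """Return (table, reason) pairs; the thresholds are illustrative defaults."""
    problems = []
    for row in replica_rows:
        name = f'{row["database"]}.{row["table"]}'
        if row["is_readonly"]:
            problems.append((name, "read-only"))
        if row["queue_size"] > max_queue:
            problems.append((name, "queue backlog"))
        if row["absolute_delay"] > max_delay_s:
            problems.append((name, "replication delay"))
    return problems
```

For the reference topology, assuming 3x2 means 3 shards with 2 replicas each, the check would be topology_complete(rows, shards=3, replicas_per_shard=2).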