Creating a ClickHouse cluster - Part II: Replication

In the previous article I showed how to run ClickHouse in cluster mode using only sharding. It's enough for load distribution, but we also need to ensure fault tolerance via replication.

ZooKeeper

Apache ZooKeeper is a coordination service which provides distributed synchronization of config information across nodes. It's obligatory to have a cluster or a single node of it (above 3.4.5) if you want to enable replication on your ClickHouse cluster.

It's pretty easy to launch several nodes of ZooKeeper and assemble them into a cluster. Here's the configuration snippet for docker-compose.yml:

zk-1:
    container_name: zk_1
    image: bitnami/zookeeper:3.6.1
    environment:
      - ZOO_SERVER_ID=1
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_SERVERS=zk-1:2888:3888,zk-2:2888:3888

  zk-2:
    container_name: zk_2
    image: bitnami/zookeeper:3.6.1
    environment:
      - ZOO_SERVER_ID=2
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_SERVERS=zk-1:2888:3888,zk-2:2888:3888

And here's my configuration for a single node:

ch-zookeeper:
    container_name: ch_zookeeper
    image: zookeeper:3.6.1
    ports:
      - 2181:2181

Pay attention that ClickHouse can work with ZooKeeper cluster only if its nodes are based on the Docker image by bitnami.

ClickHouse does about 10 write operations on ZooKeeper for each INSERT query.

Now you need to tell all the ClickHouse nodes about your ZooKeeper cluster. For this, the config file has a zookeeper section:

<zookeeper>
        <node index="1">
            <host>zk-1</host>
            <port>2181</port>
        </node>
        <node index="2">
            <host>zk-2</host>
            <port>2181</port>
        </node>
    </zookeeper>

Cluster configuration

I will use 1 master with 2 shards, 2 replicas for each shard. Here's the deployments configuration:

ch-master:
    container_name: ch_master
    image: yandex/clickhouse-server:19.14.13.4
    depends_on:
      - ch-zookeeper
      - ch-sub-1
      - ch-sub-2
      - ch-sub-3
      - ch-sub-4
    ports:
      - 9000:9000
      - 8123:8123
    volumes:
      - type: volume
        source: ch-master-data
        target: /var/lib/clickhouse
      - type: volume
        source: ch-master-logs
        target: /var/log/clickhouse-server
      - ./master-config.xml:/etc/clickhouse-server/config.xml

  ch-sub-1:
    container_name: ch_sub_1
    image: yandex/clickhouse-server:19.14.13.4
    depends_on:
      - ch-zookeeper
    ports:
      - 9001:9000
      - 8124:8123
    volumes:
      - type: volume
        source: ch-sub-1-data
        target: /var/lib/clickhouse
      - type: volume
        source: ch-sub-1-logs
        target: /var/log/clickhouse-server
      - ./sub-1-config.xml:/etc/clickhouse-server/config.xml

  ch-sub-2:
    container_name: ch_sub_2
    image: yandex/clickhouse-server:19.14.13.4
    depends_on:
      - ch-zookeeper
    ports:
      - 9002:9000
      - 8125:8123
    volumes:
      - type: volume
        source: ch-sub-2-data
        target: /var/lib/clickhouse
      - type: volume
        source: ch-sub-2-logs
        target: /var/log/clickhouse-server
      - ./sub-2-config.xml:/etc/clickhouse-server/config.xml

  ch-sub-3:
    container_name: ch_sub_3
    image: yandex/clickhouse-server:19.14.13.4
    depends_on:
      - ch-zookeeper
    ports:
      - 9003:9000
      - 8126:8123
    volumes:
      - type: volume
        source: ch-sub-3-data
        target: /var/lib/clickhouse
      - type: volume
        source: ch-sub-3-logs
        target: /var/log/clickhouse-server
      - ./sub-3-config.xml:/etc/clickhouse-server/config.xml

  ch-sub-4:
    container_name: ch_sub_4
    image: yandex/clickhouse-server:19.14.13.4
    depends_on:
      - ch-zookeeper
    ports:
      - 9004:9000
      - 8127:8123
    volumes:
      - type: volume
        source: ch-sub-4-data
        target: /var/lib/clickhouse
      - type: volume
        source: ch-sub-4-logs
        target: /var/log/clickhouse-server
      - ./sub-4-config.xml:/etc/clickhouse-server/config.xml

volumes:
  ch-master-data:
  ch-master-logs:
  ch-sub-1-data:
  ch-sub-1-logs:
  ch-sub-2-data:
  ch-sub-2-logs:
  ch-sub-3-data:
  ch-sub-3-logs:
  ch-sub-4-data:
  ch-sub-4-logs:

The remote_servers section in master-config.xml will now look like this:

<remote_servers>
        <example_cluster>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ch-sub-1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>ch-sub-2</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ch-sub-3</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>ch-sub-4</host>
                    <port>9000</port>
                </replica>
            </shard>
        </example_cluster>
    </remote_servers>

As you can see, the internal_replication parameter is now specified for each shard. It's set to true so the replicas themselves will send all their data to other replicas of the shard instead of doing it through the master node. If the parameter is set to false, the master node will replicate all the data on it's own which may lead to an inconsistent state when all the replicas contain different data.

As for configuration files for subordinate nodes, a new section must be defined in them in order to enable replication. This section is called macros.

<macros>
        <shard>01</shard>
        <replica>ch-sub-1</replica>
    </macros>

It contains parameters that will be accessible through SQL queries related to the node. In the above source, shard and replica are parameters defined by the user, and "01" and "ch-sub-1" are their corresponding values. The first parameter is a cluster identifier and the second one is a replica name which coincides with the domain name specified in docker-compose.yml.

After all the config files are set up, we can finally use scripts to create a cluster and run it. When all the database nodes are up and running, we should first execute our Python scripts for subordinate nodes. All of them look like this:

from clickhouse_driver import Client
from datetime import datetime

if __name__ == "__main__":
    client = Client("127.0.0.1", port="9001")

    client.execute("CREATE DATABASE IF NOT EXISTS billing")

    client.execute(r'''CREATE TABLE IF NOT EXISTS billing.transactions(
                      timestamp DateTime,
                      currency String,
                      value Float64)
                      ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/billing.transactions', '{replica}')
                      PARTITION BY currency
                      ORDER BY timestamp''')

As you can see, the subordinate table now uses ReplicatedMergeTree engine. Its constructor takes the path to the table records in ZooKeeper as the first parameter and the replica name as the second parameter. The path to the table in ZooKeeper should be unique. All the parameters in {} are taken from the aforementioned macros section of the replica config file.

When all the subordinate tables are created, it's time to create a master table. There's no difference from the previous case when only sharding was utilized:

from clickhouse_driver import Client
from datetime import datetime

if __name__ == "__main__":
    client = Client("127.0.0.1", port="9000")

    client.execute("CREATE DATABASE IF NOT EXISTS billing")

    client.execute('''CREATE TABLE IF NOT EXISTS billing.transactions(
                      timestamp DateTime,
                      currency String,
                      value Float64)
                      ENGINE = Distributed(example_cluster, billing, transactions, rand())''')

If you set up all the things properly, you will get a working ClickHouse cluster with replication enabled. The shard is alive if at least one of its replicas is up. Table replication strengthens fault tolerance of the cluster.

All the necessary source code is here.

22