Setting up Causal Cluster Fails

Question

I am trying to set up a Neo4j Causal Cluster with 3 cores (core only). I have three Debian servers, all running Debian 8.5. I have installed Java 8 and Neo4j Enterprise 3.4.0 (package source deb https://debian.neo4j.org/repo stable/) on each server.

My hosts are 192.168.20.163, 192.168.20.164 and 192.168.20.165. The config is the same on each host, with the obvious change for the IP address. The following is for the .163 host:

dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=192.168.20.163
dbms.mode=CORE
causal_clustering.expected_core_cluster_size=3
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
causal_clustering.initial_discovery_members=192.168.20.163:5000,192.168.20.164:5000,192.168.20.165:5000
causal_clustering.discovery_type=LIST
causal_clustering.discovery_listen_address=192.168.20.163:5000
causal_clustering.transaction_listen_address=192.168.20.163:6000
causal_clustering.raft_listen_address=192.168.20.163:7000
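
Since the three configs differ only in the IP address, they can be generated from one template, which rules out copy-and-paste slips between hosts. The following is a minimal sketch, not part of the original setup: the function name core_config and its parameters are hypothetical, and it only reproduces the settings listed above.

```python
# Hypothetical helper: renders the settings shown above for one core member.
HOSTS = ["192.168.20.163", "192.168.20.164", "192.168.20.165"]


def core_config(host, hosts, discovery_port=5000, tx_port=6000, raft_port=7000):
    """Return the neo4j.conf lines for `host`, given all core `hosts`."""
    # The discovery member list is identical on every host.
    members = ",".join(f"{h}:{discovery_port}" for h in hosts)
    size = len(hosts)
    return [
        "dbms.connectors.default_listen_address=0.0.0.0",
        f"dbms.connectors.default_advertised_address={host}",
        "dbms.mode=CORE",
        f"causal_clustering.expected_core_cluster_size={size}",
        f"causal_clustering.minimum_core_cluster_size_at_formation={size}",
        f"causal_clustering.minimum_core_cluster_size_at_runtime={size}",
        f"causal_clustering.initial_discovery_members={members}",
        "causal_clustering.discovery_type=LIST",
        f"causal_clustering.discovery_listen_address={host}:{discovery_port}",
        f"causal_clustering.transaction_listen_address={host}:{tx_port}",
        f"causal_clustering.raft_listen_address={host}:{raft_port}",
    ]


if __name__ == "__main__":
    # Print the fragment for each host so the three can be compared side by side.
    for h in HOSTS:
        print(f"# {h}")
        print("\n".join(core_config(h, HOSTS)))
        print()
```

Only the advertised address and the three listen addresses should differ between the printed fragments.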

The servers go through the election process, but the LEADER keeps switching back to FOLLOWER and triggering a new election.

The non-leader servers or 'members' each get the following error:

ERROR [o.n.c.c.s.s.CoreStateDownloader] Store copy failed due to store ID mismatch

The server that was started first becomes LEADER but, as described above, switches back to FOLLOWER:

2018-05-30 14:58:22.808+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election
2018-05-30 14:58:22.825+0000 INFO [o.n.c.m.SenderService] Creating channel to: [192.168.20.165:7000]
2018-05-30 14:58:22.827+0000 INFO [o.n.c.m.SenderService] Creating channel to: [192.168.20.164:7000]
2018-05-30 14:58:22.838+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null
2018-05-30 14:58:22.848+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null
2018-05-30 14:58:22.861+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x2ee2e930, L:/192.168.20.163:50169 - R:/192.168.20.165:7000]
2018-05-30 14:58:22.862+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /192.168.20.163:50169 remote /192.168.20.165:7000
2018-05-30 14:58:22.863+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x3d670ef3, L:/192.168.20.163:38239 - R:/192.168.20.164:7000]
2018-05-30 14:58:22.863+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /192.168.20.163:38239 remote /192.168.20.164:7000
2018-05-30 14:58:22.928+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}
2018-05-30 14:58:22.929+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}
2018-05-30 14:58:22.965+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:7000 remote /192.168.20.164:41725
2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 111 (I am MemberId{fbdff840}), voted for by [MemberId{4fe121e0}]
2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{fbdff840}
2018-05-30 14:58:23.044+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{f202d023}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:23.045+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{4fe121e0}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:23.045+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Idle{}
2018-05-30 14:58:23.046+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Leader MemberId{fbdff840} updating leader info for database default and term 111
2018-05-30 14:58:24.105+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:6000 remote /192.168.20.164:58041
2018-05-30 14:58:26.841+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:7000 remote /192.168.20.165:48317
2018-05-30 14:58:30.881+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /192.168.20.163:6000 remote /192.168.20.165:47015
2018-05-30 14:58:38.462+0000 INFO [o.n.c.c.c.m.MembershipWaiter] Leader commit unknown
2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.RaftMachine] Moving to FOLLOWER state after not receiving heartbeat responses in this election timeout period. Heartbeats received: []
2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.s.RaftState] Leader changed from MemberId{fbdff840} to null
2018-05-30 14:58:40.412+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{f202d023}[matchIndex: -1, lastSentIndex: 3, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:40.413+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{4fe121e0}[matchIndex: -1, lastSentIndex: 3, localAppendIndex: 3, mode: MISMATCH]
2018-05-30 14:58:40.413+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Inactive{}
2018-05-30 14:58:40.413+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Step down event detected. This topology member, with MemberId MemberId{fbdff840}, was leader in term 111, now moving to follower.
2018-05-30 14:58:48.342+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered
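
To see the leader flapping at a glance, the RaftMachine state changes can be pulled out of the log. This is a rough sketch that assumes the line format shown in the excerpt above; the function name raft_transitions is hypothetical and the regular expression covers only the "Moving to ... state" messages.

```python
import re

# Matches lines such as:
# 2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 111 ...
TRANSITION = re.compile(
    r"^(?P<ts>\S+ \S+) INFO \[o\.n\.c\.c\.c\.RaftMachine\] "
    r"Moving to (?P<state>[A-Z]+) state"
)


def raft_transitions(lines):
    """Return (timestamp, state) pairs for every RaftMachine state change."""
    result = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            result.append((match.group("ts"), match.group("state")))
    return result


if __name__ == "__main__":
    # Three lines copied from the excerpt above, shortened only by omission.
    sample = [
        "2018-05-30 14:58:22.808+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election",
        "2018-05-30 14:58:23.036+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 111 (I am MemberId{fbdff840}), voted for by [MemberId{4fe121e0}]",
        "2018-05-30 14:58:40.411+0000 INFO [o.n.c.c.c.RaftMachine] Moving to FOLLOWER state after not receiving heartbeat responses in this election timeout period. Heartbeats received: []",
    ]
    for ts, state in raft_transitions(sample):
        print(ts, state)
```

Run against a full log file, a repeating CANDIDATE, LEADER, FOLLOWER pattern corresponds to the behaviour described here.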

Eventually the servers fail with:

ERROR [o.n.c.c.c.m.MembershipWaiterLifecycle] Server failed to join cluster within catchup time limit [600000 ms]

Answer 1:

Based on the messages you posted, I assume you are trying to seed the cluster with a backup from somewhere? Here's what you should do:

1. Check whether the cluster forms correctly with no seeding (that is, with an empty database). That way you verify that all your settings are correct.
2. When seeding the cluster with a backup, you need to run neo4j-admin unbind on each of the instances before starting them. Check https://neo4j.com/docs/operations-manual/current/clustering/causal-clustering/seed-cluster/ for the specific instructions for your case. The store ID mismatch is what you get if you don't unbind.
3. If 1. and 2. don't solve your problem, check with Neo4j support (since you are using Enterprise Edition, I assume you have support).
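
The unbind step can be laid out as an ordered plan across the three hosts. The sketch below only builds and prints that plan; it does not execute anything. The systemctl service commands are an assumption about a Debian package install, the function name unbind_plan is hypothetical, and the restore of the backup itself is left out because it depends on your case (see the linked documentation).

```python
# Hosts taken from the question.
HOSTS = ["192.168.20.163", "192.168.20.164", "192.168.20.165"]

# Assumption: the Debian package runs Neo4j as a systemd service named "neo4j".
STOP = "systemctl stop neo4j"
# neo4j-admin unbind removes the cluster state; the instance must be stopped first.
UNBIND = "neo4j-admin unbind"
START = "systemctl start neo4j"


def unbind_plan(hosts):
    """Return (host, command) pairs: stop everywhere, unbind everywhere, then start."""
    plan = []
    # Work phase by phase so no instance restarts while another still
    # holds the old cluster state.
    for command in (STOP, UNBIND, START):
        for host in hosts:
            plan.append((host, command))
    return plan


if __name__ == "__main__":
    for host, command in unbind_plan(HOSTS):
        print(f"{host}: {command}")
```

If the cluster is being seeded, the backup would be restored on the chosen instances between the unbind and start phases.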

Hope this helps.

Regards, Tom

Source: https://stackoverflow.com/questions/50609521/setting-up-causal-cluster-fails

Tags

neo4j