Cassandra cluster: why does a node show as UNREACHABLE?

风流意气都作罢 posted on 2021-01-28 09:07:30

Question


I have a three-node Cassandra v3.9 cluster on CentOS, in a single data center, with replication factor 2 and NetworkTopologyStrategy as the replication strategy.

One of the nodes, which is not a seed node, frequently shows as UNREACHABLE when I run "nodetool describecluster". This lasts for some time, up to 10 minutes in some cases, after which the node shows up as normal again. While this is happening, /var/log/cassandra/debug.log on one of the seed nodes contains the following line:

"DEBUG [RMI TCP Connection(118)-127.0.0.1] 2017-07-06 09:15:40,519 StorageProxy.java:2254 - Hosts not in agreement. Didn't get a response from everybody:10.0.0.113"
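To record how often and for how long the node drops out, the "nodetool describecluster" output could be captured periodically and scanned for the UNREACHABLE entry. The following is a minimal sketch, not a definitive tool. It assumes the output lists unreachable hosts on a line of the form "UNREACHABLE: [ip1, ip2]" under "Schema versions". The function name unreachable_hosts, the sample text, and the schema version ID in it are made up for illustration.

```python
import re
from typing import List

# Hypothetical sample in the layout "nodetool describecluster" is assumed to
# print; the schema version ID is invented for this example.
SAMPLE_OUTPUT = """\
Cluster Information:
    Name: MyCluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.111, 10.0.0.112]

        UNREACHABLE: [10.0.0.113]
"""


def unreachable_hosts(describecluster_output: str) -> List[str]:
    """Return the addresses listed on any 'UNREACHABLE: [...]' line."""
    hosts: List[str] = []
    for line in describecluster_output.splitlines():
        match = re.match(r"\s*UNREACHABLE:\s*\[(.*)\]\s*$", line)
        if match:
            # Split the bracketed, comma-separated address list.
            hosts.extend(h.strip() for h in match.group(1).split(",") if h.strip())
    return hosts


if __name__ == "__main__":
    print(unreachable_hosts(SAMPLE_OUTPUT))
```

In practice the text would come from running the command itself (for example through subprocess.run) at a fixed interval and logging a timestamp whenever the returned list is non-empty.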

Here is my cassandra.yaml configuration. The other two nodes use the same configuration, with only the IP addresses changed.

# Cassandra storage config YAML

# NOTE:
#   See http://wiki.apache.org/cassandra/StorageConfiguration for
#   full explanations of configuration directives
# /NOTE

# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'MyCluster'

num_tokens: 256

<<some other default settings>>

# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
# hints_directory: /var/lib/cassandra/hints
hints_directory: /var/lib/cassandra/data/hints

<<some other default settings>>

# any class that implements the SeedProvider interface and has a
# constructor that takes a Map<String, String> of parameters will do.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.0.0.111,10.0.0.112"

# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them. Same applies to
# "concurrent_counter_writes", since counter writes read the current
# values before incrementing and writing them back.
#
# On the other hand, since writes are almost never IO bound, the ideal
# number of "concurrent_writes" is dependent on the number of cores in
# your system; (8 * number_of_cores) is a good rule of thumb.
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32

<<some other default settings>>

# TCP port, for commands and data
# For security reasons, you should not expose this port to the internet.  Firewall it if needed.
storage_port: 7000

# SSL port, for encrypted communication.  Unused unless enabled in
# encryption_options
# For security reasons, you should not expose this port to the internet.  Firewall it if needed.
ssl_storage_port: 7001

listen_address: 10.0.0.113
listen_interface_prefer_ipv6: false

#broadcast_address: 134.207.255.11

#listen_on_broadcast_address: true

start_native_transport: true
native_transport_port: 9042

native_transport_max_frame_size_in_mb: 256

native_transport_max_concurrent_connections: -1

# The maximum number of concurrent client connections per source ip.
# The default is -1, which means unlimited.
native_transport_max_concurrent_connections_per_ip: -1

# Whether to start the thrift rpc server.
start_rpc: false


rpc_address: 10.0.0.113

rpc_interface_prefer_ipv6: false

# port for Thrift to listen for clients on
rpc_port: 9160

# RPC address to broadcast to drivers and other Cassandra nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
broadcast_rpc_address: 134.207.255.11

# enable or disable keepalive on rpc/native connections
rpc_keepalive: true

<<some other default settings>>

# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 1800000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 1800000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 200000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000

# Enable operation timeout information exchange between nodes to accurately
# measure request timeouts.  If disabled, replicas will assume that requests
# were forwarded to them instantly by the coordinator, which means that
# under overload conditions we will waste that much extra time processing
# already-timed-out requests.
#
# Warning: before enabling this property make sure ntp is installed
# and the times are synchronized between the nodes.
cross_node_timeout: true

endpoint_snitch: GossipingPropertyFileSnitch

# controls how often to perform the more expensive part of host score
# calculation
dynamic_snitch_update_interval_in_ms: 100
# controls how often to reset all host scores, allowing a bad host to
# possibly recover
dynamic_snitch_reset_interval_in_ms: 600000

request_scheduler: org.apache.cassandra.scheduler.NoScheduler

<<some other default settings>>
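For a sense of scale, the timeout values in the configuration above can be converted from milliseconds to minutes and seconds. The sketch below only restates the numbers already shown. The helper name format_duration is made up for illustration, and nothing here is a recommended setting.

```python
# Timeout values copied from the cassandra.yaml above, in milliseconds.
TIMEOUTS_MS = {
    "read_request_timeout_in_ms": 1800000,
    "range_request_timeout_in_ms": 1800000,
    "write_request_timeout_in_ms": 200000,
    "counter_write_request_timeout_in_ms": 5000,
    "cas_contention_timeout_in_ms": 1000,
    "truncate_request_timeout_in_ms": 60000,
    "request_timeout_in_ms": 10000,
}


def format_duration(ms: int) -> str:
    """Render a millisecond count as whole minutes and seconds."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes} min {seconds} s"


if __name__ == "__main__":
    for name, value in TIMEOUTS_MS.items():
        print(f"{name}: {format_duration(value)}")
```

The read and range timeouts come out at 30 minutes each and the write timeout at 3 minutes 20 seconds, while the remaining timeouts in the same file are between 1 and 60 seconds.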

Here is the system.log file from the 10.0.0.113 server

When the above error occurs, my application behaves erratically and fails. I am running NTP on all three nodes.

My question is why node 10.0.0.113 shows as UNREACHABLE so often, and how I can fix it permanently. I do not want to remove the node from the cluster; I want to fix this error so that the node is available all the time.

Thank you in advance.

Source: https://stackoverflow.com/questions/44954630/cassandra-cluster-why-node-shows-unreachable
