As a heavyweight search engine, Elasticsearch (ES) needs more than a complete feature set and high performance: it must also be able to scale out on demand. To a large extent it is a big-data product, and scalability implies clustering. A standalone cluster definition is not enough, though; the service must be able to form and coordinate the cluster itself, meaning instances need to discover and coordinate with each other automatically. Of course, that is only the first step toward scalability.

So how does ES implement automatic discovery between cluster nodes? Let's explore that in this article.
Although what we want to study is how ES discovers nodes without any explicit configuration, if you look at a recent ES release you will be surprised to find that this feature is gone: newer versions no longer support implicit (zero-config) cluster discovery. It was removed after 5.0.

As for why it was removed, I suspect reliability was a major factor. We will set that question aside and discuss the mechanism purely from a theoretical angle.

ES 2.1 still ships the automatic discovery feature, so we will use it as our reference. Even in the versions that had it, it existed only as a plugin; later versions dropped support simply by no longer shipping that plugin.
The core mechanism is, naturally, multicast (or broadcast).

We actually run into service auto-discovery all the time: RPC invocation, MQ message dispatch, Docker cluster management, and so on. Auto-discovery is a fairly ordinary requirement, then; how is it usually solved? Typically with a registry: each component registers itself with the registry on startup, the registry pushes the change to consumers, and the consumers thereby become aware of the new instance, completing discovery. This is practically the universal solution, and it is easy to understand.
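The registry pattern described above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the API of any real registry; the class and method names are invented for the example.

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal in-memory service registry: components register themselves,
// and subscribed consumers are notified of every change.
class ServiceRegistry {
    private final Map<String, Set<String>> services = new HashMap<>();
    private final List<Consumer<String>> listeners = new ArrayList<>();

    // A component calls this on startup to announce itself.
    synchronized void register(String serviceName, String address) {
        services.computeIfAbsent(serviceName, k -> new HashSet<>()).add(address);
        // Push the change to every consumer so they "discover" the new instance.
        for (Consumer<String> l : listeners) {
            l.accept(serviceName + "@" + address);
        }
    }

    // Consumers query the registry for the current instances of a service.
    synchronized Set<String> lookup(String serviceName) {
        return services.getOrDefault(serviceName, Collections.emptySet());
    }

    // Consumers subscribe to be notified of registrations.
    synchronized void subscribe(Consumer<String> listener) {
        listeners.add(listener);
    }
}
```

A real registry (ZooKeeper, etcd, Eureka, ...) adds persistence, health checks, and replication on top of this basic register/lookup/notify shape.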
A registry, however, introduces an extra service. If you want to avoid that, the nodes must coordinate among themselves; in other words, every node must be able to act as a potential registry.

The registry does play the discovery role, but what happens after discovery is application-specific. Besides the registry acting as the mailman, the upstream and downstream must cooperate.

Auto-discovery exists so that you can scale out at any time and, to some degree, for high availability. That is why the registry itself must not become a single point of failure; components of this kind are usually deployed as highly available clusters themselves, born for exactly this scenario.

Finally there is the approach in this article's title: auto-discovery via multicast. How exactly does it work? Read on.
ES configuration is fairly minimal: almost everything has a sensible default and only a few settings are needed. For a single-node deployment you can run the download without changing anything at all. Below are two simple cluster configuration samples for `elasticsearch.yml`.

With multicast discovery, a node sends a multicast ping to the group and the other nodes respond. The relevant settings are:

# multicast discovery settings
discovery.zen.ping.multicast.group: 224.5.6.7
discovery.zen.ping.multicast.port: 1234
discovery.zen.ping.multicast.ttl: 3
discovery.zen.ping.multicast.address: null
discovery.zen.ping.multicast.enabled: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 3s

The alternative is unicast discovery, with multicast disabled and an explicit host list:

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.16.0.2:9300","172.16.0.3:9300","172.16.0.5:9300"]
A slightly more complete configuration file, for reference:
cluster.name: elasticsearch
node.name: "node1"
node.master: true
node.data: true
# Initial list of master nodes in the cluster, used to discover newly joined nodes.
# Below are some slow-log settings for queries.
In short, a simple configuration is all it takes; with the above you can already deploy an ES cluster with automatic discovery. With multicast discovery in particular you barely need to configure anything: nodes with the same cluster name automatically form one cluster. Neat, isn't it?

This article covers only the multicast implementation, nothing else.

It is wired in as a plugin, with `discovery.zen.ping.multicast.enabled` as the switch.
// org.elasticsearch.plugin.discovery.multicast.MulticastDiscoveryPlugin
public class MulticastDiscoveryPlugin extends Plugin {

    private final Settings settings;

    public MulticastDiscoveryPlugin(Settings settings) {
        this.settings = settings;
    }

    @Override
    public String name() {
        return "discovery-multicast";
    }

    @Override
    public String description() {
        return "Multicast Discovery Plugin";
    }

    public void onModule(DiscoveryModule module) {
        // the multicast discovery module is wired in only when the switch is on
        if (settings.getAsBoolean("discovery.zen.ping.multicast.enabled", false)) {
            module.addZenPing(MulticastZenPing.class);
        }
    }
}
So everything related to the multicast implementation falls to `MulticastZenPing`. From its constructors we can see which settings are supported and what their defaults are.
// org.elasticsearch.plugin.discovery.multicast.MulticastZenPing
public MulticastZenPing(ThreadPool threadPool, TransportService transportService, ClusterName clusterName, Version version) {
    this(EMPTY_SETTINGS, threadPool, transportService, clusterName, new NetworkService(EMPTY_SETTINGS), version);
}

@Inject
public MulticastZenPing(Settings settings, ThreadPool threadPool, TransportService transportService, ClusterName clusterName, NetworkService networkService, Version version) {
    super(settings);
    this.threadPool = threadPool;
    this.transportService = transportService;
    this.clusterName = clusterName;
    this.networkService = networkService;
    this.version = version;
    // read the multicast settings, falling back to their defaults
    this.address = this.settings.get("discovery.zen.ping.multicast.address");
    this.port = this.settings.getAsInt("discovery.zen.ping.multicast.port", 54328);
    this.group = this.settings.get("discovery.zen.ping.multicast.group", "224.2.2.4");
    this.bufferSize = this.settings.getAsInt("discovery.zen.ping.multicast.buffer_size", 2048);
    this.ttl = this.settings.getAsInt("discovery.zen.ping.multicast.ttl", 3);
    this.pingEnabled = this.settings.getAsBoolean("discovery.zen.ping.multicast.ping.enabled", true);
    logger.debug("using group [{}], with port [{}], ttl [{}], and address [{}]", group, port, ttl, address);
    // register MulticastPingResponseRequestHandler to handle "internal:discovery/zen/multicast" requests
    this.transportService.registerRequestHandler(ACTION_NAME, MulticastPingResponse.class, ThreadPool.Names.SAME, new MulticastPingResponseRequestHandler());
}
Once the instance is constructed, it waits for the ES process to call start. Only then is the multicast channel created, i.e. multicast listening and sending begin.
// org.elasticsearch.plugin.discovery.multicast.MulticastZenPing.doStart
@Override
protected void doStart() {
    try {
        // we know OSX has bugs in the JVM when creating multiple instances of multicast sockets
        // causing for "socket close" exceptions when receive and/or crashes
        boolean shared = settings.getAsBoolean("discovery.zen.ping.multicast.shared", Constants.MAC_OS_X);
        // OSX does not correctly send multicasts FROM the right interface
        boolean deferToInterface = settings.getAsBoolean("discovery.zen.ping.multicast.defer_group_to_set_interface", Constants.MAC_OS_X);
        // delegate to the module's channel utility class; all channel operations go through it
        multicastChannel = MulticastChannel.getChannel(nodeName(), shared,
                new MulticastChannel.Config(port, group, bufferSize, ttl,
                        // don't use publish address, the use case for that is e.g. a firewall or proxy and
                        // may not even be bound to an interface on this machine! use the first bound address.
                        networkService.resolveBindHostAddress(address)[0],
                        deferToInterface),
                new Receiver());
    } catch (Throwable t) {
        String msg = "multicast failed to start [{}], disabling. Consider using IPv4 only (by defining env. variable `ES_USE_IPV4`)";
        if (logger.isDebugEnabled()) {
            logger.debug(msg, t, ExceptionsHelper.detailedMessage(t));
        } else {
            logger.info(msg, ExceptionsHelper.detailedMessage(t));
        }
    }
}
// multicast.MulticastChannel.getChannel
/**
 * Builds a channel based on the provided config, allowing to control if sharing a channel that uses
 * the same config is allowed or not.
 */
public static MulticastChannel getChannel(String name, boolean shared, Config config, Listener listener) throws Exception {
    if (!shared) {
        return new Plain(listener, name, config);
    }
    return Shared.getSharedChannel(listener, config);
}
// the plain (non-shared) implementation shows the process
/**
 * Simple implementation of a channel.
 */
@SuppressForbidden(reason = "I bind to wildcard addresses. I am a total nightmare")
private static class Plain extends MulticastChannel {

    private final ESLogger logger;
    private final Config config;
    private volatile MulticastSocket multicastSocket;
    private final DatagramPacket datagramPacketSend;
    private final DatagramPacket datagramPacketReceive;
    private final Object sendMutex = new Object();
    private final Object receiveMutex = new Object();
    private final Receiver receiver;
    private final Thread receiverThread;

    Plain(Listener listener, String name, Config config) throws Exception {
        super(listener);
        this.logger = ESLoggerFactory.getLogger(name);
        this.config = config;
        this.datagramPacketReceive = new DatagramPacket(new byte[config.bufferSize], config.bufferSize);
        this.datagramPacketSend = new DatagramPacket(new byte[config.bufferSize], config.bufferSize, InetAddress.getByName(config.group), config.port);
        // the multicast functionality is built on MulticastSocket
        this.multicastSocket = buildMulticastSocket(config);
        this.receiver = new Receiver();
        this.receiverThread = daemonThreadFactory(Settings.builder().put("name", name).build(), "discovery#multicast#receiver").newThread(receiver);
        this.receiverThread.start();
    }

    private MulticastSocket buildMulticastSocket(Config config) throws Exception {
        SocketAddress addr = new InetSocketAddress(InetAddress.getByName(config.group), config.port);
        MulticastSocket multicastSocket = new MulticastSocket(config.port);
        try {
            multicastSocket.setTimeToLive(config.ttl);
            // OSX is not smart enough to tell that a socket bound to the
            // 'lo0' interface needs to make sure to send the UDP packet
            // out of the lo0 interface, so we need to do some special
            // workarounds to fix it.
            if (config.deferToInterface) {
                // 'null' here tells the socket to defer to the interface set
                // with .setInterface
                multicastSocket.joinGroup(addr, null);
                multicastSocket.setInterface(config.multicastInterface);
            } else {
                multicastSocket.setInterface(config.multicastInterface);
                multicastSocket.joinGroup(InetAddress.getByName(config.group));
            }
            multicastSocket.setReceiveBufferSize(config.bufferSize);
            multicastSocket.setSendBufferSize(config.bufferSize);
            multicastSocket.setSoTimeout(60000);
        } catch (Throwable e) {
            IOUtils.closeWhileHandlingException(multicastSocket);
            throw e;
        }
        return multicastSocket;
    }

    public Config getConfig() {
        return this.config;
    }

    // send a multicast message
    @Override
    public void send(BytesReference data) throws Exception {
        synchronized (sendMutex) {
            datagramPacketSend.setData(data.toBytes());
            multicastSocket.send(datagramPacketSend);
        }
    }

    @Override
    protected void close(Listener listener) {
        receiver.stop();
        receiverThread.interrupt();
        if (multicastSocket != null) {
            IOUtils.closeWhileHandlingException(multicastSocket);
            multicastSocket = null;
        }
        try {
            receiverThread.join(10000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // receive multicast messages
    private class Receiver implements Runnable {

        private volatile boolean running = true;

        public void stop() {
            running = false;
        }

        @Override
        public void run() {
            while (running) {
                try {
                    synchronized (receiveMutex) {
                        try {
                            multicastSocket.receive(datagramPacketReceive);
                        } catch (SocketTimeoutException ignore) {
                            continue;
                        } catch (Exception e) {
                            if (running) {
                                if (multicastSocket.isClosed()) {
                                    logger.warn("multicast socket closed while running, restarting...");
                                    multicastSocket = buildMulticastSocket(config);
                                } else {
                                    logger.warn("failed to receive packet, throttling...", e);
                                    Thread.sleep(500);
                                }
                            }
                            continue;
                        }
                    }
                    // once a message arrives, hand it to the listener for business processing
                    if (datagramPacketReceive.getData().length > 0) {
                        listener.onMessage(new BytesArray(datagramPacketReceive.getData()), datagramPacketReceive.getSocketAddress());
                    }
                } catch (Throwable e) {
                    if (running) {
                        logger.warn("unexpected exception in multicast receiver", e);
                    }
                }
            }
        }
    }
}
As we can see, multicast messaging is built on Java's `MulticastSocket`, which also means it runs over UDP: delivery is not guaranteed. A receiver thread blocks on the socket in an endless loop, listening for multicast messages, and hands each one to a listener for business processing. The multicast channel is only the framework; the core business logic lives in the listener implementation.
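The underlying mechanism can be shown with a tiny standalone program. It joins a multicast group, sends one datagram to the group, and receives it back on the same socket (the OS loops multicast packets back to local group members by default, which is exactly how co-located ES nodes hear each other). The group address and port here happen to match the ES defaults, but this sketch is otherwise unrelated to ES code.

```java
import java.io.IOException;
import java.net.*;
import java.nio.charset.StandardCharsets;

public class MulticastDemo {

    // Join the group, send one datagram to it, and return the first message
    // received back. With IP_MULTICAST_LOOP enabled (the default), the sender
    // hears its own packet -- just like every node in a multicast group.
    static String pingAndReceive(String groupAddress, int port, String message) throws IOException {
        InetAddress group = InetAddress.getByName(groupAddress);
        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.setTimeToLive(0);   // keep the packet on this host
            socket.joinGroup(group);
            socket.setSoTimeout(2000); // don't block forever if multicast is unavailable

            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(payload, payload.length, group, port));

            byte[] buf = new byte[2048];
            DatagramPacket received = new DatagramPacket(buf, buf.length);
            socket.receive(received);
            return new String(received.getData(), 0, received.getLength(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("received: " + pingAndReceive("224.2.2.4", 54328, "ping:my-cluster"));
    }
}
```

Note that this depends on the local network stack permitting multicast; in some containerized or firewalled environments the receive will time out instead.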
Here, that listener is the `Receiver` implemented inside `MulticastZenPing`.
// multicast.MulticastZenPing.Receiver
private class Receiver implements MulticastChannel.Listener {

    // entry point for handling multicast messages
    @Override
    public void onMessage(BytesReference data, SocketAddress address) {
        int id = -1;
        DiscoveryNode requestingNodeX = null;
        ClusterName clusterName = null;
        Map<String, Object> externalPingData = null;
        XContentType xContentType = null;
        try {
            boolean internal = false;
            if (data.length() > 4) {
                int counter = 0;
                for (; counter < INTERNAL_HEADER.length; counter++) {
                    if (data.get(counter) != INTERNAL_HEADER[counter]) {
                        break;
                    }
                }
                if (counter == INTERNAL_HEADER.length) {
                    internal = true;
                }
            }
            if (internal) {
                StreamInput input = StreamInput.wrap(new BytesArray(data.toBytes(), INTERNAL_HEADER.length, data.length() - INTERNAL_HEADER.length));
                Version version = Version.readVersion(input);
                input.setVersion(version);
                id = input.readInt();
                clusterName = ClusterName.readClusterName(input);
                requestingNodeX = readNode(input);
            } else {
                xContentType = XContentFactory.xContentType(data);
                if (xContentType != null) {
                    // an external ping
                    try (XContentParser parser = XContentFactory.xContent(xContentType).createParser(data)) {
                        externalPingData = parser.map();
                    }
                } else {
                    throw new IllegalStateException("failed multicast message, probably message from previous version");
                }
            }
            if (externalPingData != null) {
                handleExternalPingRequest(externalPingData, xContentType, address);
            } else {
                handleNodePingRequest(id, requestingNodeX, clusterName);
            }
        } catch (Exception e) {
            if (!lifecycle.started() || (e instanceof EsRejectedExecutionException)) {
                logger.debug("failed to read requesting data from {}", e, address);
            } else {
                logger.warn("failed to read requesting data from {}", e, address);
            }
        }
    }
    @SuppressWarnings("unchecked")
    private void handleExternalPingRequest(Map<String, Object> externalPingData, XContentType contentType, SocketAddress remoteAddress) {
        if (externalPingData.containsKey("response")) {
            // ignoring responses sent over the multicast channel
            logger.trace("got an external ping response (ignoring) from {}, content {}", remoteAddress, externalPingData);
            return;
        }
        if (multicastChannel == null) {
            logger.debug("can't send ping response, no socket, from {}, content {}", remoteAddress, externalPingData);
            return;
        }
        Map<String, Object> request = (Map<String, Object>) externalPingData.get("request");
        if (request == null) {
            logger.warn("malformed external ping request, no 'request' element from {}, content {}", remoteAddress, externalPingData);
            return;
        }
        // read the sender's cluster_name; if it matches ours, we are in the same cluster
        final String requestClusterName = request.containsKey("cluster_name") ? request.get("cluster_name").toString() : request.containsKey("clusterName") ? request.get("clusterName").toString() : null;
        if (requestClusterName == null) {
            logger.warn("malformed external ping request, missing 'cluster_name' element within request, from {}, content {}", remoteAddress, externalPingData);
            return;
        }
        if (!requestClusterName.equals(clusterName.value())) {
            logger.trace("got request for cluster_name {}, but our cluster_name is {}, from {}, content {}",
                    requestClusterName, clusterName.value(), remoteAddress, externalPingData);
            return;
        }
        if (logger.isTraceEnabled()) {
            logger.trace("got external ping request from {}, content {}", remoteAddress, externalPingData);
        }
        try {
            DiscoveryNode localNode = contextProvider.nodes().localNode();
            XContentBuilder builder = XContentFactory.contentBuilder(contentType);
            builder.startObject().startObject("response");
            builder.field("cluster_name", clusterName.value());
            builder.startObject("version").field("number", version.number()).field("snapshot_build", version.snapshot).endObject();
            builder.field("transport_address", localNode.address().toString());
            if (contextProvider.nodeService() != null) {
                for (Map.Entry<String, String> attr : contextProvider.nodeService().attributes().entrySet()) {
                    builder.field(attr.getKey(), attr.getValue());
                }
            }
            builder.startObject("attributes");
            for (Map.Entry<String, String> attr : localNode.attributes().entrySet()) {
                builder.field(attr.getKey(), attr.getValue());
            }
            builder.endObject();
            builder.endObject().endObject();
            multicastChannel.send(builder.bytes());
            if (logger.isTraceEnabled()) {
                logger.trace("sending external ping response {}", builder.string());
            }
        } catch (Exception e) {
            logger.warn("failed to send external multicast response", e);
        }
    }
    private void handleNodePingRequest(int id, DiscoveryNode requestingNodeX, ClusterName requestClusterName) {
        if (!pingEnabled || multicastChannel == null) {
            return;
        }
        final DiscoveryNodes discoveryNodes = contextProvider.nodes();
        final DiscoveryNode requestingNode = requestingNodeX;
        if (requestingNode.id().equals(discoveryNodes.localNodeId())) {
            // that's me, ignore
            return;
        }
        if (!requestClusterName.equals(clusterName)) {
            if (logger.isTraceEnabled()) {
                logger.trace("[{}] received ping_request from [{}], but wrong cluster_name [{}], expected [{}], ignoring",
                        id, requestingNode, requestClusterName.value(), clusterName.value());
            }
            return;
        }
        // don't connect between two client nodes, no need for that...
        if (!discoveryNodes.localNode().shouldConnectTo(requestingNode)) {
            if (logger.isTraceEnabled()) {
                logger.trace("[{}] received ping_request from [{}], both are client nodes, ignoring", id, requestingNode, requestClusterName);
            }
            return;
        }
        final MulticastPingResponse multicastPingResponse = new MulticastPingResponse();
        multicastPingResponse.id = id;
        multicastPingResponse.pingResponse = new PingResponse(discoveryNodes.localNode(), discoveryNodes.masterNode(), clusterName, contextProvider.nodeHasJoinedClusterOnce());
        if (logger.isTraceEnabled()) {
            logger.trace("[{}] received ping_request from [{}], sending {}", id, requestingNode, multicastPingResponse.pingResponse);
        }
        // connect to the requesting node (if not yet connected) and send the response
        if (!transportService.nodeConnected(requestingNode)) {
            // do the connect and send on a thread pool
            threadPool.generic().execute(new Runnable() {
                @Override
                public void run() {
                    // connect to the node if possible
                    try {
                        transportService.connectToNode(requestingNode);
                        transportService.sendRequest(requestingNode, ACTION_NAME, multicastPingResponse, new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {
                            @Override
                            public void handleException(TransportException exp) {
                                logger.warn("failed to receive confirmation on sent ping response to [{}]", exp, requestingNode);
                            }
                        });
                    } catch (Exception e) {
                        if (lifecycle.started()) {
                            logger.warn("failed to connect to requesting node {}", e, requestingNode);
                        }
                    }
                }
            });
        } else {
            transportService.sendRequest(requestingNode, ACTION_NAME, multicastPingResponse, new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {
                @Override
                public void handleException(TransportException exp) {
                    if (lifecycle.started()) {
                        logger.warn("failed to receive confirmation on sent ping response to [{}]", exp, requestingNode);
                    }
                }
            });
        }
    }
}
So the handling is: on receiving a node's multicast ping, read the cluster name; if it matches, treat the sender as part of the same cluster, send back our node information, and connect to that node so that the two maintain a communication link.

Many other details are omitted here, but we now have the overall answer to how ES discovers cluster nodes automatically: a node sends a multicast ping; instances in the same multicast group receive it, read the `cluster_name`, decide from it whether they belong to the same cluster, and thereby form the network automatically.
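The decision each node makes boils down to something like the following sketch. The class, constants, and the `"clusterName|nodeId"` string encoding are all invented for illustration; real ES pings are binary-encoded and also carry version and transport-address information.

```java
// Hypothetical sketch of the per-node discovery decision when a multicast
// ping arrives: ignore our own pings and foreign clusters, otherwise
// respond and connect.
class DiscoverySketch {
    static final String LOCAL_NODE_ID = "node-1";
    static final String LOCAL_CLUSTER = "elasticsearch";

    // Returns the action taken for an incoming ping "clusterName|nodeId".
    static String onPing(String ping) {
        String[] parts = ping.split("\\|");
        String clusterName = parts[0];
        String nodeId = parts[1];
        if (nodeId.equals(LOCAL_NODE_ID)) {
            return "ignore: that's me";                       // our own ping looped back
        }
        if (!clusterName.equals(LOCAL_CLUSTER)) {
            return "ignore: wrong cluster_name " + clusterName; // same group, different cluster
        }
        return "respond+connect: " + nodeId;                  // same cluster: reply and link up
    }
}
```

This is also why multicast discovery needs no configuration at all: the group address provides the rendezvous point, and the cluster name provides the membership test.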