IPVS ICMP Packet Handling: Outside to Inside
Date: 2023-07-08

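  This article mainly describes ICMP packet handling related to the NAT/Masq forwarding mode, and also mentions the ICMP messages that IPVS itself sends when an error occurs.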

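  The entry function ip_vs_in is in effect attached to two netfilter hook points: NF_INET_LOCAL_IN and NF_INET_LOCAL_OUT. The first applies to packets destined for the local host; the second applies to packets sent by the local host. ip_vs_in handles IPVS request packets travelling from outside to inside, ICMP messages included. When the protocol number is IPPROTO_ICMP or IPPROTO_ICMPV6, the packet is passed to ip_vs_in_icmp or ip_vs_in_icmp_v6 respectively.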

static unsigned int ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int af)
{
    struct ip_vs_iphdr iph;
    struct ip_vs_protocol *pp;
    struct ip_vs_proto_data *pd;
    struct ip_vs_conn *cp;

    /* ... */

#ifdef CONFIG_IP_VS_IPV6
    if (af == AF_INET6) {
        if (unlikely(iph.protocol == IPPROTO_ICMPV6)) {
            int related;
            int verdict = ip_vs_in_icmp_v6(ipvs, skb, &related, hooknum, &iph);

            if (related)
                return verdict;
        }
    } else
#endif
        if (unlikely(iph.protocol == IPPROTO_ICMP)) {
            int related;
            int verdict = ip_vs_in_icmp(ipvs, skb, &related, hooknum);

            if (related)
                return verdict;
        }

    /* Protocol supported? */
    pd = ip_vs_proto_data_get(ipvs, iph.protocol);
    if (unlikely(!pd))
        return NF_ACCEPT;

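  If ip_vs_in_icmp above does not handle the ICMP message (related is 0), the protocol lookup that follows fails as well, because IPVS does not support ICMP as a protocol, and the packet is accepted.

  At present ip_vs_in_icmp handles only three ICMP types: ICMP_DEST_UNREACH, ICMP_SOURCE_QUENCH and ICMP_TIME_EXCEEDED. For any other type it marks the message as unrelated (*related = 0) and stops processing.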

static int ip_vs_in_icmp(struct netns_ipvs *ipvs, struct sk_buff *skb, int *related, unsigned int hooknum)
{
    struct icmphdr _icmph, *ic;
    struct iphdr _ciph, *cih; /* The ip header contained within the ICMP */
    struct ip_vs_iphdr ciph;
    struct ip_vs_conn *cp;
    struct ip_vs_protocol *pp;
    struct ip_vs_proto_data *pd;

    *related = 1;
    iph = ip_hdr(skb);
    offset = ihl = iph->ihl * 4;
    ic = skb_header_pointer(skb, offset, sizeof(_icmph), &_icmph);
    /*
     * Work through seeing if this is for us.
     * These checks are supposed to be in an order that means easy things are checked first to speed up processing.... however
     * this means that some packets will manage to get a long way down this stack and then be rejected, but that's life.
     */
    if ((ic->type != ICMP_DEST_UNREACH) && (ic->type != ICMP_SOURCE_QUENCH) && (ic->type != ICMP_TIME_EXCEEDED)) {
        *related = 0;
        return NF_ACCEPT;
    }
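
  The type filter above can be tried out in user space. The following is a minimal sketch, not kernel code: the EX_ names and the function ex_icmp_type_is_candidate are made up for illustration, and the numeric values are the ICMP type numbers from RFC 792.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ICMP type numbers (RFC 792). The EX_ names are illustrative only. */
enum {
    EX_ICMP_DEST_UNREACH  = 3,
    EX_ICMP_SOURCE_QUENCH = 4,
    EX_ICMP_TIME_EXCEEDED = 11,
};

/* True when the type is one of the three error types that
 * ip_vs_in_icmp goes on to examine; anything else is "unrelated". */
bool ex_icmp_type_is_candidate(uint8_t type)
{
    return type == EX_ICMP_DEST_UNREACH ||
           type == EX_ICMP_SOURCE_QUENCH ||
           type == EX_ICMP_TIME_EXCEEDED;
}

/* Usage: an error type passes, an echo request (type 8) does not. */
void ex_icmp_filter_demo(void)
{
    assert(ex_icmp_type_is_candidate(EX_ICMP_DEST_UNREACH));
    assert(!ex_icmp_type_is_candidate(8));
}
```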

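  Next, the function locates the IP packet embedded in the ICMP message. It first checks whether the embedded packet is an IPIP packet; if so, it validates it and then moves the offset to the innermost IP header.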

/* Now find the contained IP header */
offset += sizeof(_icmph);
cih = skb_header_pointer(skb, offset, sizeof(_ciph), &_ciph);

/* Special case for errors for IPIP packets */
ipip = false;
if (cih->protocol == IPPROTO_IPIP) {
    if (unlikely(cih->frag_off & htons(IP_OFFSET)))
        return NF_ACCEPT;
    /* Error for our IPIP must arrive at LOCAL_IN */
    if (!(skb_rtable(skb)->rt_flags & RTCF_LOCAL))
        return NF_ACCEPT;
    offset += cih->ihl * 4;
    cih = skb_header_pointer(skb, offset, sizeof(_ciph), &_ciph);
    if (cih == NULL)
        return NF_ACCEPT; /* The packet looks wrong, ignore */
    ipip = true;
}
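
  To make the offset arithmetic concrete, here is a rough user-space sketch that walks a raw byte buffer in the same way. It is an illustration under assumptions: ex_find_embedded_ip and the EX_ constants are hypothetical names, the buffer is assumed to start at the outer IPv4 header, and the kernel's RTCF_LOCAL route check is left out because it has no user-space counterpart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_IPPROTO_IPIP   4       /* IP-in-IP protocol number */
#define EX_ICMP_HDR_LEN   8       /* fixed part of the ICMP header */
#define EX_IP_MIN_HDR_LEN 20      /* IPv4 header without options */
#define EX_IP_OFFSET_MASK 0x1FFF  /* fragment-offset bits, host order */

/* Returns the byte offset of the innermost embedded IPv4 header, or -1
 * when the packet should be left alone. *is_ipip is set to 1 when an
 * IPIP tunnel header was skipped on the way. */
int ex_find_embedded_ip(const uint8_t *pkt, size_t len, int *is_ipip)
{
    assert(pkt != NULL && is_ipip != NULL);
    *is_ipip = 0;
    if (len < EX_IP_MIN_HDR_LEN)
        return -1;

    size_t offset = (size_t)(pkt[0] & 0x0F) * 4;  /* outer ihl * 4 */
    if (offset < EX_IP_MIN_HDR_LEN)
        return -1;
    offset += EX_ICMP_HDR_LEN;                    /* skip the ICMP header */
    if (len < offset + EX_IP_MIN_HDR_LEN)
        return -1;

    const uint8_t *cih = pkt + offset;
    if (cih[9] == EX_IPPROTO_IPIP) {
        uint16_t frag = (uint16_t)((cih[6] << 8) | cih[7]);
        if (frag & EX_IP_OFFSET_MASK)
            return -1;                            /* not a first fragment */
        offset += (size_t)(cih[0] & 0x0F) * 4;    /* skip the tunnel header */
        if (len < offset + EX_IP_MIN_HDR_LEN)
            return -1;                            /* packet looks wrong */
        *is_ipip = 1;
    }
    return (int)offset;
}
```

  With 20-byte headers throughout, a plain ICMP error yields offset 28, and an error for an IPIP packet yields offset 48.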

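  The protocol field of the innermost IP header is then used to look up the IPVS per-protocol data (ip_vs_proto_data), and from it the protocol structure (ip_vs_protocol). Because AH/ESP need the complete packet for encryption and decryption, those protocols set dont_defrag; for such a protocol, an ICMP message whose embedded IP header belongs to a non-first fragment is not processed.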

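  The IP header information is then used to look up an IPVS connection. If one is found, the ICMP message relates to a packet of an existing connection (the embedded header carries the source and destination in reverse order), and it is forwarded by ip_vs_icmp_xmit at the end of the function.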

pd = ip_vs_proto_data_get(ipvs, cih->protocol);
if (!pd)
    return NF_ACCEPT;
pp = pd->pp;

/* Is the embedded protocol header present? */
if (unlikely(cih->frag_off & htons(IP_OFFSET) && pp->dont_defrag))
    return NF_ACCEPT;

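  By default, an ICMP message with no associated IPVS connection is not processed. This can be changed through the proc file /proc/sys/net/ipv4/vs/schedule_icmp: when it is true, IPVS tries to schedule the ICMP message to a selected destination server.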

offset2 = offset;
ip_vs_fill_iph_skb_icmp(AF_INET, skb, offset, !ipip, &ciph);
offset = ciph.len;

/* The embedded headers contain source and dest in reverse order. For IPIP this is error for request, not for reply.
 */
cp = pp->conn_in_get(ipvs, AF_INET, skb, &ciph);
if (!cp) {
    if (!sysctl_schedule_icmp(ipvs))
        return NF_ACCEPT;
    if (!ip_vs_try_to_schedule(ipvs, AF_INET, skb, pd, &v, &cp, &ciph))
        return v;
    new_cp = true;
}
verdict = NF_DROP;

/* Ensure the checksum is correct */
if (!skb_csum_unnecessary(skb) && ip_vs_checksum_complete(skb, ihl)) {
    /* Failed checksum! */
    IP_VS_DBG(1, "Incoming ICMP: failed checksum from %pI4!\n", &iph->saddr);
    goto out;
}

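  A special case arises when the original packet is an IPIP packet, that is, when IPVS in tunnel forwarding mode receives an ICMP error. If the ICMP type is ICMP_DEST_UNREACH and the code is ICMP_FRAG_NEEDED (fragmentation needed), the MTU requested in the ICMP message is extracted and recorded as the path MTU in the corresponding route entry.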

if (ipip) {
    __be32 info = ic->un.gateway;
    __u8 type = ic->type;
    __u8 code = ic->code;

    /* Update the MTU */
    if (ic->type == ICMP_DEST_UNREACH && ic->code == ICMP_FRAG_NEEDED) {
        struct ip_vs_dest *dest = cp->dest;
        u32 mtu = ntohs(ic->un.frag.mtu);
        __be16 frag_off = cih->frag_off;

        /* Strip outer IP and ICMP, go to IPIP header */
        if (pskb_pull(skb, ihl + sizeof(_icmph)) == NULL)
            goto ignore_ipip;
        offset2 -= ihl + sizeof(_icmph);
        skb_reset_network_header(skb);
        IP_VS_DBG(12, "ICMP for IPIP %pI4->%pI4: mtu=%u\n", &ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr, mtu);
        ipv4_update_pmtu(skb, ipvs->net, mtu, 0, 0, 0, 0);
        /* Client uses PMTUD? */
        if (!(frag_off & htons(IP_DF)))
            goto ignore_ipip;
        /* Prefer the resulting PMTU */
        if (dest) {
            struct ip_vs_dest_dst *dest_dst;

            dest_dst = rcu_dereference(dest->dest_dst);
            if (dest_dst)
                mtu = dst_mtu(dest_dst->dst_cache);
        }
        if (mtu > 68 + sizeof(struct iphdr))
            mtu -= sizeof(struct iphdr);
        info = htonl(mtu);
    }

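  Here the outermost IP header, the ICMP header and the IPIP header are stripped from the ICMP message so that only the client's original IP request packet remains, and icmp_send is used to send an ICMP message to the original client. Apart from the fragmentation-needed handling above, other ICMP types receive no special processing and are relayed with their original type, code and info.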

    /* Strip outer IP, ICMP and IPIP, go to IP header of original request. */
    if (pskb_pull(skb, offset2) == NULL)
        goto ignore_ipip;
    skb_reset_network_header(skb);
    IP_VS_DBG(12, "Sending ICMP for %pI4->%pI4: t=%u, c=%u, i=%u\n", &ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr, type, code, ntohl(info));
    icmp_send(skb, type, code, info);
    /* ICMP can be shorter but anyways, account it */
    ip_vs_out_stats(cp, skb);

ignore_ipip:
    consume_skb(skb);
    verdict = NF_STOLEN;
    goto out;
}
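
  The MTU value relayed to the client in the IPIP case is the tunnel path MTU minus the outer IP header that IPVS adds, unless the result would fall too close to the IPv4 minimum. The following small sketch reproduces only that arithmetic; ex_mtu_for_client and the EX_ constants are illustrative names, and 20 bytes is assumed for sizeof(struct iphdr).

```c
#include <assert.h>
#include <stdint.h>

#define EX_IPV4_HDR_LEN 20u  /* sizeof(struct iphdr), no options */
#define EX_IPV4_MIN_MTU 68u  /* the constant 68 used in the kernel check */

/* MTU to report to the client for a given tunnel path MTU. */
uint32_t ex_mtu_for_client(uint32_t tunnel_pmtu)
{
    if (tunnel_pmtu > EX_IPV4_MIN_MTU + EX_IPV4_HDR_LEN)
        tunnel_pmtu -= EX_IPV4_HDR_LEN;
    return tunnel_pmtu;
}

/* Usage: a 1500-byte tunnel path leaves 1480 bytes for client packets. */
void ex_mtu_demo(void)
{
    assert(ex_mtu_for_client(1500) == 1480);
    assert(ex_mtu_for_client(88) == 88);  /* at the threshold: unchanged */
}
```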

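  At the end of the function, if the protocol field of the embedded IP header is IPPROTO_TCP, IPPROTO_UDP or IPPROTO_SCTP, offset is advanced by 2 * sizeof(__u16) to cover the source and destination ports of the layer-4 header, and ip_vs_icmp_xmit is called to forward the ICMP message.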

/* do the statistics and put it back */
ip_vs_in_stats(cp, skb);
if (IPPROTO_TCP == cih->protocol || IPPROTO_UDP == cih->protocol || IPPROTO_SCTP == cih->protocol)
    offset += 2 * sizeof(__u16);
verdict = ip_vs_icmp_xmit(skb, cp, pp, offset, hooknum, &ciph);

out:

  For forwarding modes other than NAT/Masq, no address or port translation is needed, so the connection's transmit function packet_xmit is called directly.

int ip_vs_icmp_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
                    struct ip_vs_protocol *pp, int offset, unsigned int hooknum, struct ip_vs_iphdr *iph)
{
    /* The ICMP packet for VS/TUN, VS/DR and LOCALNODE will be forwarded directly here, because there is no need to
       translate address/port back */
    if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ) {
        if (cp->packet_xmit)
            rc = cp->packet_xmit(skb, cp, pp, iph);
        else
            rc = NF_ACCEPT;
        /* do not touch skb anymore */
        atomic_inc(&cp->in_pkts);
        goto out;
    }

  For the NF_INET_FORWARD hook point, the route lookup uses only the IP_VS_RT_MODE_NON_LOCAL flag, meaning that a route to the local host is not an acceptable result.

/* mangle and send the packet here (only for VS/NAT) */
was_input = rt_is_input_route(skb_rtable(skb));

/* LOCALNODE from FORWARD hook is not supported */
rt_mode = (hooknum != NF_INET_FORWARD) ?
          IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL | IP_VS_RT_MODE_RDR :
          IP_VS_RT_MODE_NON_LOCAL;
local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip, rt_mode, NULL, iph);
if (local < 0)
    goto tx_error;
rt = skb_rtable(skb);
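
  The effect of rt_mode can be sketched in user space as follows. This is an illustration under assumptions: the EX_ flag values are placeholders rather than values copied from the kernel headers, and ex_route_allowed is a simplified stand-in for the local/non-local check assumed to happen inside __ip_vs_get_out_rt. The hook numbers follow the standard netfilter numbering.

```c
#include <assert.h>

/* Netfilter hook numbers (standard values); EX_ names are illustrative. */
enum { EX_NF_INET_LOCAL_IN = 1, EX_NF_INET_FORWARD = 2, EX_NF_INET_LOCAL_OUT = 3 };

/* Placeholder route-mode bits, modelled on IP_VS_RT_MODE_*. */
enum { EX_RT_MODE_LOCAL = 1, EX_RT_MODE_NON_LOCAL = 2, EX_RT_MODE_RDR = 4 };

/* Same selection as the excerpt: FORWARD allows non-local routes only. */
int ex_rt_mode_for_hook(unsigned int hooknum)
{
    if (hooknum != EX_NF_INET_FORWARD)
        return EX_RT_MODE_LOCAL | EX_RT_MODE_NON_LOCAL | EX_RT_MODE_RDR;
    return EX_RT_MODE_NON_LOCAL;
}

/* Assumed acceptance rule: a local route needs the LOCAL bit,
 * a non-local route needs the NON_LOCAL bit. */
int ex_route_allowed(int rt_mode, int route_is_local)
{
    int needed = route_is_local ? EX_RT_MODE_LOCAL : EX_RT_MODE_NON_LOCAL;
    return (rt_mode & needed) != 0;
}

/* Usage: a local route is rejected on FORWARD, accepted on LOCAL_IN. */
void ex_rt_mode_demo(void)
{
    assert(!ex_route_allowed(ex_rt_mode_for_hook(EX_NF_INET_FORWARD), 1));
    assert(ex_route_allowed(ex_rt_mode_for_hook(EX_NF_INET_LOCAL_IN), 1));
}
```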

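  If the connection was received by the sync process (IP_VS_CONN_F_SYNC), the route lookup above resolved to the local host, and netfilter has already created a conntrack entry for the packet, processing stops and the function returns.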

/* Avoid duplicate tuple in reply direction for NAT traffic to local address when connection is sync-ed */

#if IS_ENABLED(CONFIG_NF_CONNTRACK)
if (cp->flags & IP_VS_CONN_F_SYNC && local) {
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
    if (ct) {
        IP_VS_DBG(10, "%s(): stopping DNAT to local address %pI4\n", __func__, &cp->daddr.ip);
        goto tx_error;
    }
}
#endif

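  The following check stops DNAT when the original packet arrived from the network on an input route (was_input), the destination IP is a loopback address, and the output route found above also points to the local host.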

/* From world but DNAT to loopback address? */
if (local && ipv4_is_loopback(cp->daddr.ip) && was_input) {
    IP_VS_DBG(1, "%s(): stopping DNAT to loopback %pI4\n", __func__, &cp->daddr.ip);
    goto tx_error;
}
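
  The loopback condition is easy to reproduce outside the kernel. The sketch below assumes addresses are held in network byte order, as in cp->daddr.ip; the ex_ function names are hypothetical, and the 127.0.0.0/8 mask test mirrors what ipv4_is_loopback is understood to do.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* True for any address in 127.0.0.0/8 (address in network byte order). */
int ex_ipv4_is_loopback(uint32_t addr_be)
{
    return (addr_be & htonl(0xff000000u)) == htonl(0x7f000000u);
}

/* Combined condition from the excerpt: local output route, loopback
 * destination, and a packet that came in from the network. */
int ex_stop_dnat_to_loopback(int local, uint32_t daddr_be, int was_input)
{
    return local && ex_ipv4_is_loopback(daddr_be) && was_input;
}

/* Usage: 127.0.0.1 reached from the network is refused. */
void ex_loopback_demo(void)
{
    assert(ex_stop_dnat_to_loopback(1, htonl(0x7f000001u), 1));
    assert(!ex_stop_dnat_to_loopback(1, htonl(0xc0a80001u), 1)); /* 192.168.0.1 */
}
```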

  ip_vs_nat_icmp performs the DNAT translation of the ICMP message, and ip_vs_nat_send_or_cont finally sends it.

/* copy-on-write the packet before mangling it */
if (!skb_make_writable(skb, offset))
    goto tx_error;

if (skb_cow(skb, rt->dst.dev->hard_header_len))
    goto tx_error;

ip_vs_nat_icmp(skb, pp, cp, 0);

/* Another hack: avoid icmp_send in ip_fragment */
skb->ignore_df = 1;

rc = ip_vs_nat_send_or_cont(NFPROTO_IPV4, skb, cp, local);
goto out;

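  ip_vs_nat_icmp performs DNAT on the ICMP message. Because the packet being processed travels from outside to inside, the inout argument is 0. The function rewrites the destination address of the outer IP header and the source address of the IP packet embedded in the ICMP message (the embedded packet represents the original direction), and updates both IP header checksums.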

void ip_vs_nat_icmp(struct sk_buff *skb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp, int inout)
{
    struct iphdr *iph = ip_hdr(skb);
    unsigned int icmp_offset = iph->ihl*4;
    struct icmphdr *icmph = (struct icmphdr *)(skb_network_header(skb) + icmp_offset);
    struct iphdr *ciph = (struct iphdr *)(icmph + 1);

    if (inout) {
        iph->saddr = cp->vaddr.ip;
        ip_send_check(iph);
        ciph->daddr = cp->vaddr.ip;
        ip_send_check(ciph);
    } else {
        iph->daddr = cp->daddr.ip;
        ip_send_check(iph);
        ciph->saddr = cp->daddr.ip;
        ip_send_check(ciph);
    }

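  Next, for the layer-4 protocols IPPROTO_TCP, IPPROTO_UDP and IPPROTO_SCTP, if the packet travels from outside to inside, the source port in the embedded layer-4 header is rewritten (restored to the real server's port used when the packet was sent). The ICMP checksum is recomputed last.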

    /* the TCP/UDP/SCTP port */
    if (IPPROTO_TCP == ciph->protocol || IPPROTO_UDP == ciph->protocol || IPPROTO_SCTP == ciph->protocol) {
        __be16 *ports = (void *)ciph + ciph->ihl*4;

        if (inout)
            ports[1] = cp->vport;
        else
            ports[0] = cp->dport;
    }

    /* And finally the ICMP checksum */
    icmph->checksum = 0;
    icmph->checksum = ip_vs_checksum_complete(skb, icmp_offset);
    skb->ip_summed = CHECKSUM_UNNECESSARY;
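
  The outside-to-inside branch (inout == 0) can be mimicked on a plain byte buffer. The following is a user-space sketch, not the kernel implementation: all ex_ names are made up, the buffer is assumed to hold a well-formed IPv4 + ICMP error packet, and the checksum helper is a straightforward RFC 1071 sum standing in for ip_send_check and ip_vs_checksum_complete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 1071 Internet checksum over len bytes (host-order result). */
uint16_t ex_csum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    if (len & 1)
        sum += (uint32_t)(p[len - 1] << 8);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Zero a 2-byte checksum field, then store the sum of [start, start+len). */
static void ex_store_csum(uint8_t *field, const uint8_t *start, size_t len)
{
    field[0] = field[1] = 0;
    uint16_t c = ex_csum(start, len);
    field[0] = (uint8_t)(c >> 8);
    field[1] = (uint8_t)(c & 0xFF);
}

/* inout == 0 case: outer destination and embedded source become the real
 * server address; the embedded source port becomes the real server port. */
void ex_nat_icmp_in(uint8_t *pkt, size_t len, const uint8_t daddr[4], uint16_t dport)
{
    assert(pkt != NULL && len >= 20);
    size_t icmp_off = (size_t)(pkt[0] & 0x0F) * 4;
    assert(len >= icmp_off + 8 + 20);
    uint8_t *icmph = pkt + icmp_off;
    uint8_t *ciph = icmph + 8;                 /* embedded IP header */
    size_t l4_off = (size_t)(ciph[0] & 0x0F) * 4;
    assert(len >= icmp_off + 8 + l4_off);

    memcpy(pkt + 16, daddr, 4);                /* iph->daddr */
    ex_store_csum(pkt + 10, pkt, icmp_off);
    memcpy(ciph + 12, daddr, 4);               /* ciph->saddr */
    ex_store_csum(ciph + 10, ciph, l4_off);

    uint8_t proto = ciph[9];                   /* 6 TCP, 17 UDP, 132 SCTP */
    if ((proto == 6 || proto == 17 || proto == 132) &&
        len >= icmp_off + 8 + l4_off + 2) {
        ciph[l4_off] = (uint8_t)(dport >> 8);  /* ports[0]: source port */
        ciph[l4_off + 1] = (uint8_t)(dport & 0xFF);
    }

    /* And finally the ICMP checksum, over the ICMP header and payload. */
    ex_store_csum(icmph + 2, icmph, len - icmp_off);
}
```

  After the call, running ex_csum over each header (and over the whole ICMP part) returns 0, which is the usual way to confirm that a stored Internet checksum is consistent.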

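  ip_vs_nat_send_or_cont performs the final send. At this stage, if the connection does not have the conntrack flag IP_VS_CONN_F_NFCT set, the conntrack entry that was created is released; otherwise the conntrack information is updated. By default IPVS does not set IP_VS_CONN_F_NFCT on new connections, so conntrack information is not kept; this default can be changed through the proc file /proc/sys/net/ipv4/vs/conntrack.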

/* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */
static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, struct ip_vs_conn *cp, int local)
{
    int ret = NF_STOLEN;

    skb->ipvs_property = 1;
    if (likely(!(cp->flags & IP_VS_CONN_F_NFCT)))
        ip_vs_notrack(skb);
    else
        ip_vs_update_conntrack(skb, cp, 1);

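  If the destination is not local, or the destination port changes, or the destination address changes, the cached sock structure (the early_demux association) is no longer valid and is dropped. Finally, a packet whose destination is not local is passed through the hook functions at NF_INET_LOCAL_OUT and then sent out by dst_output.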

    /* Remove the early_demux association unless it's bound for the exact same port and address on this host after translation.
     */
    if (!local || cp->vport != cp->dport || !ip_vs_addr_equal(cp->af, &cp->vaddr, &cp->daddr))
        ip_vs_drop_early_demux_sk(skb);

    if (!local) {
        skb_forward_csum(skb);
        NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output);
    } else
        ret = NF_ACCEPT;
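
  The early_demux condition reads more easily as a standalone predicate. This is a small sketch with hypothetical names; it assumes IPv4 only, so plain 32-bit address values stand in for ip_vs_addr_equal.

```c
#include <assert.h>
#include <stdint.h>

/* True when the cached socket association must be dropped: the packet
 * leaves this host, or translation changed the port or the address. */
int ex_must_drop_early_demux(int local, uint16_t vport, uint16_t dport,
                             uint32_t vaddr, uint32_t daddr)
{
    return !local || vport != dport || vaddr != daddr;
}

/* Usage: only a local destination with identical port and address
 * keeps the association. */
void ex_early_demux_demo(void)
{
    assert(!ex_must_drop_early_demux(1, 80, 80, 0x0a000001u, 0x0a000001u));
    assert(ex_must_drop_early_demux(1, 80, 8080, 0x0a000001u, 0x0a000001u));
}
```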

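  Finally, consider ip_vs_ops, the structure that defines the netfilter hooks IPVS registers. In addition to the ip_vs_in hooks above, the function ip_vs_forward_icmp is registered at the NF_INET_FORWARD hook point to handle ICMP messages destined for 0.0.0.0/0.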

static const struct nf_hook_ops ip_vs_ops[] = {
    /* ... */
    /* After packet filtering (but before ip_vs_out_icmp), catch icmp destined for 0.0.0.0/0, which is for incoming IPVS connections */
    {
        .hook = ip_vs_forward_icmp,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_FORWARD,
        .priority = 99,
    },
};

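  For an IPVS virtual service configured with fwmark, the iptables MARK function cannot mark such ICMP messages, so they are handled at NF_INET_FORWARD. An example MARK rule for such a service: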

# iptables -A PREROUTING -t mangle -d 207.175.44.110/31 -j MARK --set-mark 1

  Kernel version: 4.15

  Reposted from: https://blog.csdn.net/sinat_20184565/article/details/102410231