VPP with Babel - Part 1

About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Thanks to the [Linux ControlPlane ] plugin, higher level control plane software becomes available, that is to say: things like BGP, OSPF, LDP, VRRP and so on become quite natural for VPP.

IPng Networks is a small service provider that has built a network based entirely on open source: [Debian ] servers with widely available Intel and Mellanox 10G/25G/100G network cards, paired with [VPP ] for the dataplane, and [Bird2 ] for the control plane.

As a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at which an initial allocation was a /19, and subsequent allocations usually a /20 based on justification. Then it watered down to a /22 for new Local Internet Registries, then that became a /24 for new LIRs, and ultimately we ran out. What was once a plentiful resource, has now become a very constrained resource.

In this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring one of the newer routing protocols: Babel.

A sad waste

I have to go back to something very fundamental about routing. When RouterA holds a routing table, it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a packet, it’ll look up the destination address, and then forward the packet on to RouterB which is the next router in the path towards the destination:

  1. RouterA does a route lookup in its routing table. For destination 192.0.2.1, the covering prefix is 192.0.2.0/24 and it might find that it can reach it via IPv4 next hop 100.64.0.1.
  2. RouterA then does another lookup in its routing table, to figure out how can it reach 100.64.0.1. It may find that this address is directly connected, say to interface eth0, on which RouterA is 100.64.0.2/30.
  3. Assuming that eth0 is an ethernet device, which the vast majority of interfaces are, then RouterA can look up the link-layer address for that IPv4 address 100.64.0.1, by using ARP.
  4. The ARP request asks, quite literally who-has 100.64.0.1? using a broadcast message on eth0, to which the other RouterB will answer 100.64.0.1 is-at 90:e2:ba:3f:ca:d5.
  5. Now that RouterA knows that, it can forward along the IP packet out on its eth0 device and towards 90:e2:ba:3f:ca:d5. Huzzah.
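The five steps above can be sketched in a few lines of Python. This is only an illustrative model: the `ROUTES` and `ARP_TABLE` dictionaries and the `resolve()` helper are hypothetical names of mine, filled with the example addresses from the list, and a real router does this in its FIB rather than with a linear scan.

```python
import ipaddress

# Hypothetical tables mirroring the example above (not a real FIB).
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): {"via": ipaddress.ip_address("100.64.0.1")},
    ipaddress.ip_network("100.64.0.0/30"): {"dev": "eth0"},  # directly connected
}
ARP_TABLE = {("eth0", ipaddress.ip_address("100.64.0.1")): "90:e2:ba:3f:ca:d5"}


def longest_prefix_match(addr):
    """Return the most specific route entry covering addr, or None."""
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]


def resolve(dst):
    """Follow 'via' next hops until a connected route is found, then consult ARP."""
    hop = ipaddress.ip_address(dst)
    for _ in range(8):  # guard against next-hop loops
        route = longest_prefix_match(hop)
        if route is None:
            return None
        if "dev" in route:
            # Steps 3 and 4: map the on-link next hop to a MAC address.
            return route["dev"], ARP_TABLE.get((route["dev"], hop))
        hop = route["via"]  # step 2: look up the next hop itself
    return None


print(resolve("192.0.2.1"))  # ('eth0', '90:e2:ba:3f:ca:d5')
```

Note that the transit prefix only shows up in the second lookup, which is exactly the part the rest of this article tries to get rid of.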

A clever trick

I can’t help but notice that the only purpose of having the 100.64.0.0/30 transit network between these two routers is to:

  1. provide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses, using ARP resolution.
  2. provide a means for the routers to send ICMP messages; for example, in a traceroute each hop along the way responds with a TTL exceeded message. And I do like traceroutes!

Let me discuss these two purposes in more detail:

1. IPv4 ARP, née IPv6 NDP

One really neat trick is simply replacing ARP resolution by something that can resolve the link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that’s called Neighbor Discovery Protocol, with which a router can determine the link-layer address of a neighbor, or verify that a neighbor is still reachable via a cached link-layer address. This uses ICMPv6 to send out a query with the Neighbor Solicitation, which is followed by a response in the form of a Neighbor Advertisement.

Why am I talking about IPv6 neighbor discovery when I’m explaining IPv4 forwarding, you may be wondering? Well, because of this neat trick that the IPv4 prefix brokers don’t want you to know:

pim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1

pim@vpp0-0:~$ ip -br a show e1
e1               UP             fe80::5054:ff:fef0:1101/64 
pim@vpp0-0:~$ ip ro get 192.0.2.0
192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0 
    cache 
pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110
fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE

pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0
tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:21:30.002878 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98:
  (tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.10.0 > 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64        

While it looks counter-intuitive at first, this is actually pretty straight forward. When the router gets a packet destined for 192.0.2.0/24, it will know that the next hop is some link-local IPv6 address, which it can resolve by NDP on ethernet interface e1. It can then simply forward the IPv4 datagram to the MAC address it found.
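As an aside, the link-local next hop and the MAC address in the neighbor table above are visibly related. A minimal sketch, assuming the interface uses the classic modified EUI-64 scheme (Linux can also be configured to generate link-local addresses differently, in which case this mapping does not hold); `mac_to_link_local()` is a hypothetical helper name:

```python
import ipaddress


def mac_to_link_local(mac: str) -> str:
    """Derive the modified EUI-64 link-local IPv6 address for a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC to form the interface ID.
    eui64 = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80::/64 prefix: fe80 followed by six zero bytes, then the interface ID.
    return str(ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + eui64))


print(mac_to_link_local("52:54:00:f0:11:10"))  # fe80::5054:ff:fef0:1110
```

The router never needs this derivation to forward traffic, since NDP provides the mapping; it only explains why the addresses in the output look so similar.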

Who would’ve thunk that you do not need ARP or even IPv4 on the interface at all?

2. Originating ICMP messages

The Internet Control Message Protocol is described in [RFC792 ]. It’s mostly used to carry diagnostic and debugging information, either originated by end hosts, for example the “destination unreachable, port unreachable” types of messages, but they may also be originated by intermediate routers, for example with most other kinds of “destination unreachable” packets.

Path MTU Discovery, described in [RFC1191 ] allows a host to discover the maximum packet size that a route is able to carry. There’s a few different types of PMTUd, but the most common one uses ICMPv4 packets coming from these intermediate routers, informing the sender that packets marked as un-fragmentable cannot be transmitted because they are too large.

Without the ability for a router to signal these ICMPv4 packets, end to end connectivity quality might break undetected. So, every router that is able to forward IPv4 traffic SHOULD be able to originate ICMPv4 traffic.

If you’re curious, you can read more in this [IETF Draft ] from Juliusz Chroboczek et al. It’s really insightful, yet elegant.

Introducing Babel

I’ve learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets, as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to originate ICMPv4 packets, therefore it needs at least one IPv4 address.

These two claims mean that I need at most one IPv4 address on each router. Could it be?!
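To put a rough number on that, here is a back-of-the-envelope sketch. The topology (four routers in a chain, so three links) is an assumption of mine for illustration, and `transit_addresses()` is a hypothetical helper:

```python
def transit_addresses(links: int, prefix_len: int) -> int:
    """IPv4 addresses consumed when every link gets its own transit prefix."""
    return links * 2 ** (32 - prefix_len)


routers, links = 4, 3  # assumed: four routers in a chain

print(routers + transit_addresses(links, 30))  # 16: loopbacks plus /30 transit nets
print(routers + transit_addresses(links, 31))  # 10: loopbacks plus /31 transit nets
print(routers)                                 # 4: one loopback address per router
```

The saving grows with the number of links, because the per-router cost stays at one address no matter how many neighbors a router has.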

Babel is a loop-avoiding distance-vector routing protocol that is designed to be robust and efficient both in networks using prefix-based routing and in networks using flat routing (“mesh networks”), and both in relatively stable wired networks and in highly dynamic wireless networks.

The definitive [RFC8966 ] describes it in great detail, and previous work is in [RFC7557 ] and [RFC6126 ]. Lots of reading :) Babel is a hybrid routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4 and IPv6), regardless of which protocol the Babel packets are themselves being carried over.

I quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops across address families, which is described in [RFC9229 ]:

When a packet is routed according to a given routing table entry, the forwarding plane typically uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol [RFC4861 ] in the case of IPv6 and the Address Resolution Protocol (ARP) [RFC826 ] in the case of IPv4) to map the next-hop address to a link-layer address (a “Media Access Control (MAC) address”), which is then used to construct the link-layer frames that encapsulate forwarded packets.

Babel and Bird2

There’s an implementation of Babel in Bird2, the routing solution that I use at AS8298. What made me extra enthusiastic is that the functionality described in RFC9229 was committed to Bird2 about a year ago [ref ], with a hat-tip to Toke Høiland-Jørgensen.

The Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than this commit, so my first order of business is to get a Debian package:

pim@summer:~/src$ sudo apt install devscripts
pim@summer:~/src$ wget https://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz
pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz 
pim@summer:~/src/bird-2.14$ wget https://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz 
pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i
pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us        

And that yields me a fresh Bird 2.14 package. I can’t help but wonder though, why did the semantic versioning [ref ] of 2.0.X change to 2.14? I found an answer in the NEWS file of the 2.13 release [link ]. It’s a little bit of a disappointment, but I quickly get over myself because I want to take this Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer!

Babel and the LAB

I decide to take an IPng [lab ] out for a spin. These labs come with four VPP routers and two Debian machines connected like so:

The configuration snippet for Bird2 is very simple, as most of the defaults are sensible:

pim@vpp0-0:~$ cat << EOF | sudo tee -a /etc/bird/bird.conf
protocol babel {
  interface "e*" {
    type wired;
    extended next hop on;
  };    
  ipv6 { import all; export all; };
  ipv4 { import all; export all; };
}
EOF

pim@vpp0-0:~$ birdc show babel interfaces
BIRD 2.14 ready.
babel1:
Interface  State  Auth  RX cost   Nbrs   Timer Next hop (v4)   Next hop (v6)
e1         Up     No         96      1   0.958 ::              fe80::5054:ff:fef0:1101

pim@vpp0-0:~$ birdc show babel neigh
BIRD 2.14 ready.
babel1:
IP address                Interface  Metric Routes Hellos Expires Auth  RTT (ms)
fe80::5054:ff:fef0:1110   e1             96      8     16   5.003 No       4.831

pim@vpp0-0:~$ birdc show babel entries
BIRD 2.14 ready.
babel1:
Prefix                   Router ID               Metric Seqno  Routes Sources
192.168.10.0/32          00:00:00:00:c0:a8:0a:00      0     1       0       0
192.168.10.0/24          00:00:00:00:c0:a8:0a:00      0     1       1       0
192.168.10.1/32          00:00:00:00:c0:a8:0a:01     96     7       1       0
2001:678:d78:200::/128   00:00:00:00:c0:a8:0a:00      0     1       0       0
2001:678:d78:200::/60    00:00:00:00:c0:a8:0a:00      0     1       1       0
2001:678:d78:200::1/128  00:00:00:00:c0:a8:0a:01     96     7       1       0        

Based on this simple configuration, Bird2 will start the babel protocol on e0 and e1, and it quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol database (called entries), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and 2001:678:d78:200::), the neighbor’s IPv4 and IPv6 loopbacks (192.168.10.1 and 2001:678:d78:200::1), and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60).
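One detail in the entries table is easy to miss: in this lab output, the last four bytes of each Router ID spell out the originating router's IPv4 loopback. A small sketch that decodes them follows; `router_id_to_ipv4()` is a hypothetical helper, and this is an observation about the output above rather than a claim about how Bird derives Babel router IDs in general.

```python
import ipaddress


def router_id_to_ipv4(router_id: str) -> str:
    """Interpret the low four bytes of a colon-separated 8-byte ID as IPv4."""
    raw = bytes(int(part, 16) for part in router_id.split(":"))
    return str(ipaddress.IPv4Address(raw[-4:]))


print(router_id_to_ipv4("00:00:00:00:c0:a8:0a:00"))  # 192.168.10.0
print(router_id_to_ipv4("00:00:00:00:c0:a8:0a:01"))  # 192.168.10.1
```

That makes it easy to tell at a glance which entries are my own (metric 0) and which were learned from the neighbor (metric 96).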

The coolest part is the extended next hop on statement, which enables Babel to set the nexthop to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table:

pim@vpp0-0:~$ ip ro
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 
unreachable 192.168.10.0/24 proto bird metric 32 

pim@vpp0-0:~$ ip -6 ro
2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium
unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e1 proto kernel metric 256 pref medium        

Setting IPv4 routes over IPv6 nexthops works!

Babel and VPP

For the [VPP ] configuration, I start off with a pretty much empty configuration, creating only a loopback interface called loop0, setting the interfaces up, and exposing them in LinuxCP:

vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128

vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface state GigabitEthernet10/0/1 up

vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1        

Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and IPv6 loopbacks across the network.

Adding support to VPP

IPv6 pings and looks good. However, IPv4 endpoints do not ping yet. The first thing I check is whether VPP understands how to interpret an IPv4 route with an IPv6 nexthop. I think it does, because I remember reviewing a change from Adrian during our MPLS [project ], which he submitted in this [Gerrit ]. His change allows VPP to use routes with rtnl_route_nh_get_via() to map them to a different address family, exactly what I am looking for. The routes are correctly installed in the FIB:

pim@vpp0-0:~$ vppctl show ip fib 192.168.10.1
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ]
192.168.10.1/32 fib:0 index:31 locks:2
  lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
    path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ]
      path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop:  oper-flags:resolved,
        fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1
      [@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd

 forwarding:   unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]]
    [0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800        
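The hex string at the end of each adjacency line is the Ethernet header that gets prepended to outgoing packets. Assuming it is a plain 14-byte untagged header (destination MAC, source MAC, ethertype), it can be decoded like this; `decode_rewrite()` is a hypothetical helper and not part of VPP:

```python
def decode_rewrite(hex_string: str) -> dict:
    """Split a 14-byte Ethernet rewrite string into its three fields."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 14:
        raise ValueError("expected an untagged 14-byte Ethernet header")

    def as_mac(chunk: bytes) -> str:
        return ":".join(f"{byte:02x}" for byte in chunk)

    return {
        "dst": as_mac(raw[0:6]),
        "src": as_mac(raw[6:12]),
        "ethertype": int.from_bytes(raw[12:14], "big"),
    }


v4 = decode_rewrite("525400f01110525400f011010800")
v6 = decode_rewrite("525400f01110525400f0110186dd")
print(v4["dst"], hex(v4["ethertype"]))  # 52:54:00:f0:11:10 0x800
print(v6["dst"], hex(v6["ethertype"]))  # 52:54:00:f0:11:10 0x86dd
```

Both adjacencies point at the same neighbor MAC address; only the ethertype differs. In other words, the IPv4 forwarding entry reuses the link-layer address learned via the IPv6 neighbor and sends an ordinary IPv4 frame (0x0800) to it.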

Using the Open vSwitch tap, I can clearly see the packets go out from vpp0-0.e1 and into vpp0-1.e0, but there is no response, so they are getting lost in vpp0-1 somewhere. I take a look at a packet trace on vpp0-1, where I’m expecting to find the ICMP packet:

pim@vpp0-1:~$ vppctl show trace
07:42:53:178694: dpdk-input
  GigabitEthernet10/0/0 rx queue 0
  buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
  PKT MBUF: port 0, nb_segs 1, pkt_len 98
    buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
  ICMP: 192.168.10.0 -> 192.168.10.1
    tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
    fragment id 0xf52b, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178765: ethernet-input
  frame: flags 0x1, hw-if-index 1, sw-if-index 1
  IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
07:42:53:178791: ip4-input
  ICMP: 192.168.10.0 -> 192.168.10.1
    tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
    fragment id 0xf52b, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178810: ip4-not-enabled
    ICMP: 192.168.10.0 -> 192.168.10.1
      tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
      fragment id 0xf52b, flags DONT_FRAGMENT
    ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178833: error-drop
  rx:GigabitEthernet10/0/0
07:42:53:178835: drop
  dpdk-input: no error        

Okay, that checks out! Going over this packet trace, the ip4-input node indeed got handed a packet, which it promptly rejected by forwarding it to ip4-not-enabled which drops it. It kind of makes sense, the VPP dataplane doesn’t think it’s logical to handle IPv4 traffic on an interface which does not have an IPv4 address. Except – I’m bending the rules a little bit by doing exactly that.

Approach 1: force-enable ip4 in VPP

There’s an internal function ip4_sw_interface_enable_disable() which is called to enable IPv4 processing on an interface once the first IPv4 address is added. So my first fix is to force this to be enabled for any interface that is exposed via Linux Control Plane, notably in lcp_itf_pair_create() [here ].

This approach is partially effective:

pim@vpp0-0:~$ ip ro get 192.168.10.1
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0 
    cache 
pim@vpp0-0:~$ ping -c5 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms
64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms
64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms
^C
--- 192.168.10.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms

pim@vpp0-0:~$ traceroute 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  192.168.10.3 (192.168.10.3)  10.418 ms  10.343 ms  11.362 ms        

I take a moment to think about why the traceroutes are not responding in the routers in the middle, and it dawns on me that when the router needs to send an ICMPv4 TTL Exceeded message, it can’t select an IPv4 address to originate the message from, as the interface has none.

Forwarding works, but PMTUd does not!

Approach 2: Use unnumbered interfaces

Looking at my options, I see that VPP is capable of using so-called unnumbered interfaces. These can be left unconfigured, but borrow an address from another interface. It’s a good idea to borrow from loop0, which has a valid IPv4 and IPv6 address. It looks like this in VPP:

vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0

vpp0-0# show interface address
GigabitEthernet10/0/0 (dn):
  unnumbered, use loop0
  L3 192.168.10.0/32
  L3 2001:678:d78:200::/128
GigabitEthernet10/0/1 (up): 
  unnumbered, use loop0
  L3 192.168.10.0/32
  L3 2001:678:d78:200::/128
loop0 (up):
  L3 192.168.10.0/32
  L3 2001:678:d78:200::/128        
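The difference between Approach 1 and Approach 2 comes down to which source address a router can pick when it has to originate an ICMPv4 message. A simplified model of that choice follows; the `icmp_source()` function and its data structures are hypothetical illustrations of mine, not VPP's actual source selection code.

```python
from typing import Dict, List, Optional


def icmp_source(
    interface: str,
    addresses: Dict[str, List[str]],
    unnumbered: Dict[str, str],
) -> Optional[str]:
    """Pick an IPv4 source for ICMP originated on `interface`, if any exists."""
    own = addresses.get(interface, [])
    if own:
        return own[0]
    # An unnumbered interface borrows from the interface it points at.
    lender = unnumbered.get(interface)
    if lender is not None and addresses.get(lender):
        return addresses[lender][0]
    return None  # nothing to source from: the ICMP message cannot be sent


addresses = {"loop0": ["192.168.10.0"], "GigabitEthernet10/0/1": []}

# Approach 1: IPv4 is force-enabled, but there is no address to use.
print(icmp_source("GigabitEthernet10/0/1", addresses, {}))  # None

# Approach 2: the interface is unnumbered and borrows from loop0.
print(icmp_source("GigabitEthernet10/0/1", addresses,
                  {"GigabitEthernet10/0/1": "loop0"}))  # 192.168.10.0
```

The `None` case corresponds to the silent hops in the earlier traceroute.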

The Linux ControlPlane configuration will always synchronize interface information from VPP to Linux, as I described back then when I [worked on the plugin ]. Babel starts and sets next hops for IPv4 that look like this:

pim@vpp0-2:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0               UP             192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1               UP             192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64

pim@vpp0-2:~$ ip ro
192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink
unreachable 192.168.10.0/24 proto bird metric 32
192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink
192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink
        

While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors (192.168.10.1 and 192.168.10.3) are not reachable:

pim@vpp0-2:~# ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
From 192.168.10.2 icmp_seq=1 Destination Host Unreachable
From 192.168.10.2 icmp_seq=2 Destination Host Unreachable
From 192.168.10.2 icmp_seq=3 Destination Host Unreachable

--- 192.168.10.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms
        

I take a look at why that might be, and I notice this on the neighbor vpp0-1 when I try to ping it from vpp0-2:

vpp0-1# show err
   Count                  Node                              Reason               Severity
         5             arp-reply             IP4 source address not local to sub   error
         1             arp-reply             IP4 source address matches local in   error
        

Oh, snap! I traced this down to src/vnet/arp/arp.c around line 522 where I can see that VPP, when it receives an ARP request, wants that to be coming from a peer that is in its own subnet. But with a point to point link like this one, there is nobody else in the 192.168.10.1/32 subnet! I think this error should not be returned if the interface is arp_unnumbered(), defined further up in the same source file. I write a small patch in Gerrit [40482 ] which removes this requirement and the test that asserts the previous behavior, allowing the ARP request to succeed, and things shoot to life:

pim@vpp0-2:~$ ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms

--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms
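To make the ARP check more concrete, here is a simplified model of the decision before and after the relaxation. It is a sketch under my own assumptions: `accept_arp_request()` is a hypothetical function, and it models the exemption as applying to unnumbered interfaces only, which may not match the exact shape of the Gerrit.

```python
import ipaddress
from typing import List


def accept_arp_request(
    sender_ip: str,
    local_prefixes: List[str],
    unnumbered: bool,
    relaxed: bool,
) -> bool:
    """Decide whether to answer an ARP request from sender_ip."""
    sender = ipaddress.ip_address(sender_ip)
    in_subnet = any(
        sender in ipaddress.ip_network(prefix, strict=False)
        for prefix in local_prefixes
    )
    # Strict behavior: the sender must live in one of our own subnets.
    # Relaxed behavior (assumed): unnumbered interfaces skip that check.
    return in_subnet or (relaxed and unnumbered)


# vpp0-1 holds 192.168.10.1/32 and is asked by vpp0-2 (192.168.10.2).
print(accept_arp_request("192.168.10.2", ["192.168.10.1/32"], True, relaxed=False))  # False
print(accept_arp_request("192.168.10.2", ["192.168.10.1/32"], True, relaxed=True))   # True
```

With a /32 on each side, no peer can ever be inside the local subnet, so the strict check rejects every request on such a link.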
        

I make a mental note to discuss this ARP relaxation Gerrit with [vpp-dev ], and I’ll see where that takes me. [Edit: Gerrit was accepted and merged]

Forwarding IPv4 routes over IPv4 point-to-point nexthops works!

Approach 3: VPP Unnumbered Hack

At this point, I think I’m good, but one of the cool features of Babel is that it can use IPv6 next hops for IPv4 destinations. Setting GigabitEthernet10/0/X to unnumbered will make 192.168.10.X/32 reappear on the e0 and e1 interfaces, which will make Babel prefer the more classic IPv4 next-hops. So can I trick it somehow to use IPv6 anyway?

One option is to ask Babel to use extended next hop even when IPv4 is available, which would be a change to Bird (and possibly a violation of the Babel specification, I should read up on that).

But I think there’s another way, so I take a look at the VPP code which prints out the unnumbered, use loop0 message, and I find a way to know if an interface is borrowing addresses in this way. I decide to change the LCP plugin to inhibit sync’ing the addresses if they belong to an interface which is unnumbered. Because I don’t know for sure if everybody would find this behavior desirable, I make sure to guard the behavior behind a backwards compatible configuration option.

If you’re curious, please take a look at the change in my [GitHub repo ], in which I:

  1. add a new configuration option, lcp-sync-unnumbered, which defaults to on. That would be what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
  2. add a CLI call to change the value, lcp lcp-sync-unnumbered [on|enable|off|disable]
  3. extend the CLI call to show the LCP plugin state, as an additional output of lcp show
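The intended effect of the new option can be modeled in a few lines. This is a sketch of the behavior described above, not the plugin's code: `addresses_to_sync()` and its arguments are hypothetical names, and only the flag name lcp-sync-unnumbered comes from the change itself.

```python
from typing import Dict, List


def addresses_to_sync(
    interface: str,
    addresses: Dict[str, List[str]],
    unnumbered: Dict[str, str],
    sync_unnumbered: bool = True,
) -> List[str]:
    """Return the addresses that would be copied to the Linux interface."""
    lender = unnumbered.get(interface)
    if lender is None:
        # Regular interface: its own addresses are always synced.
        return list(addresses.get(interface, []))
    # Unnumbered interface: borrowed addresses are copied only if allowed.
    return list(addresses.get(lender, [])) if sync_unnumbered else []


addresses = {"loop0": ["192.168.10.0/32", "2001:678:d78:200::/128"]}
unnumbered = {"GigabitEthernet10/0/1": "loop0"}

print(addresses_to_sync("GigabitEthernet10/0/1", addresses, unnumbered, True))
# ['192.168.10.0/32', '2001:678:d78:200::/128']
print(addresses_to_sync("GigabitEthernet10/0/1", addresses, unnumbered, False))
# []
```

With the option off, Linux sees no borrowed IPv4 address on the interface, so Babel has no reason to prefer an IPv4 next hop over the IPv6 link-local one.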

And with that, the VPP configuration becomes:

vpp0-0# lcp lcp-sync on
vpp0-0# lcp lcp-sync-unnumbered off

vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128

vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# set interface state GigabitEthernet10/0/1 up

vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1        

Results

I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have to admit:

pim@vpp0-0:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128 
loop0            UNKNOWN        192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64 
e0               UP             fe80::5054:ff:fef0:1100/64 
e1               UP             fe80::5054:ff:fef0:1101/64 
e2               DOWN           
e3               DOWN           

pim@vpp0-0:~$ traceroute -n 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
 1  192.168.10.1  1.882 ms  2.231 ms  1.472 ms
 2  192.168.10.2  4.243 ms  3.492 ms  2.797 ms
 3  192.168.10.3  6.689 ms  5.925 ms  5.157 ms

pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3
traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets
 1  2001:678:d78:200::1  2.543 ms  1.762 ms  2.154 ms
 2  2001:678:d78:200::2  4.943 ms  3.063 ms  3.562 ms
 3  2001:678:d78:200::3  6.273 ms  6.694 ms  7.086 ms        

Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!

I recorded a little [screencast ] that shows my work, so far:

(See the screencast on IPng's website)

Additional thoughts

Comparing OSPFv2 and Babel

Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid use of IPv4 transit networks, by making use of this peer pattern, which is similar but not quite the same as what I discussed in Approach 2 above:

$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0
$ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1        

The Linux ControlPlane plugin is not currently capable of accepting the peer netlink message, and I can see a problem: VPP does not allow for two interfaces to have the same IP address, unless one is borrowing from another using unnumbered. I wonder why that is …

I could certainly give implementing that peer pattern in Netlink a go, but I’m not enthusiastic. To consume the netlink message correctly, the plugin would need to assert that the left hand (source) IPv4 address strictly corresponds to a loopback, and then internally rewrite the address addition into an unnumbered use, and also somehow reject (delete?) the netlink configuration otherwise. Ick!

I think there’s a more idiomatic way of doing this in VPP. OSPFv2 doesn’t really need to use the peer pattern, as long as the point to point peer is reachable. Babel is emitting a static route over the interface after using IPv6 to learn its peer’s IPv4 address, which is really neat! I suppose for OSPFv2 setting a manual static route for the peer into the device would do the trick as well.

The VPP idiom for the peer pattern above, which Babel does naturally, and OSPFv2 could be manually configured to do, would look like this:

vpp0-2# set interface ip address loop0 192.168.10.2/32
vpp0-2# set interface state loop0 up

vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-2# set interface state GigabitEthernet10/0/0 up
vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0

vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-2# set interface state GigabitEthernet10/0/1 up
vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1        

Either way, using point to point connections (like these explicit static routes, or the implied static routes that the peer pattern will yield) over an ethernet broadcast medium will require the ARP [Gerrit ] to be merged. This one seems reasonably straight forward, because many popular vendors successfully allow point to point to work over an Ethernet broadcast medium, and I can’t find any RFC that forbids it. Perhaps VPP is being a bit too strict.

To Unnumbered or Not To Unnumbered

I’m torn between Approach 2 and Approach 3. While on the one hand, setting the unnumbered interface would be best reflected in Linux, it is not without problems. If the operator subsequently tries to remove one of the addresses on e0 or e1, that will yield a desync between Linux and VPP (Linux will have removed the address, but VPP will still be unnumbered). On the other hand, tricking Linux (and the operator) to believe there isn’t an IPv4 (and IPv6) address configured on the interface, is also not great.

Of the two approaches, I think I prefer Approach 3 (changing the Linux CP plugin to not sync unnumbered addresses), because it minimizes the chance of operator error. If you’re reading this and have an Opinion, would you please let me know?

What’s Next

I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see if it makes sense to upstream my changes. Personally, I think it’s a reasonable direction, because (a) both changes are backwards compatible and (b) its semantics are pretty straight forward. I’ll also add some configuration knobs to [vppcfg ] to make it easier to configure VPP in this way.

Of course, migrating AS8298 won’t be overnight, I need to gain a bit more confidence, and obviously upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of peer review. And finally I need to roll this new IPv4-less IGP out very carefully and without interruptions, which considering the IGP is the most fundamental building block of the network, may be tricky.

But, I am uncomfortably excited by the prospect of having my network go entirely without backbone transit networks. By the way: Babel is amazing!
