VPP MPLS - (bonus) Part 4

About this series

Special Thanks: Adrian 'vifino' Pistol for writing this code and for the wonderful collaboration!

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (Aggregation Services Router), VPP will look and feel quite familiar, as many of the approaches are shared between the two.

In the last three articles, I thought I had described “all we need to know” to perform MPLS using the Linux Controlplane in VPP:

  1. In the [first article] of this series, I took a look at MPLS in general.
  2. In the [second article] of the series, I demonstrated a few special case labels (such as Explicit Null and Implicit Null, the latter of which enables the fabled Penultimate Hop Popping behavior of MPLS).
  3. Then, in the [third article], I worked with @vifino to implement the plumbing for MPLS in the Linux Control Plane plugin for VPP. He did most of the work, I just watched :)

As if in a state of premonition, I mentioned:

Caveat emptor: outside of a modest functional and load-test, this MPLS functionality hasn’t seen a lot of mileage, as it’s only a few weeks old at this point, so it could definitely contain some rough edges. Use at your own risk, but if you do want to discuss issues, the [vpp-dev@] mailing list is a good first stop.

Introduction


As a reminder, the LAB we built is running VPP with a feature added to Linux Control Plane Plugin, which lets it consume MPLS routes and program the IPv4/IPv6 routing table as well as the MPLS forwarding table in VPP. At this point, we are running [Gerrit 38702, PatchSet 10].

First, let me specify the problem statement: @vifino and I both noticed that sometimes, pinging from one VPP node to another worked fine, while SSHing did not. This article describes an issue I diagnosed, and provided a fix for, in the Linux Controlplane plugin implementation.

Clue 1: Intermittent ping

My first finding was that our LAB machines run all the VPP plugins, notably the ping plugin. This means VPP itself was responding to ping/ping6; the Linux controlplane plugin sometimes did not receive any traffic at all, while at other times it did receive the echo-request and dutifully responded to it, but the ICMP echo-reply was never seen back.

With the ping plugin disabled, pinging between seemingly random pairs of vpp0-[0123] indeed no longer works, while pinging direct neighbors (eg. vpp0-0.e1 to vpp0-1.e0) consistently works well.

Clue 2: Corrupted MPLS packets

Using the tap0-0 virtual machine, which sees a copy of all packets on the Open vSwitch underlay in our lab, I started tcpdumping and noticed two curious packets from time to time:

09:23:00.357583 52:54:00:01:10:00 > 52:54:00:00:10:01, ethertype 802.1Q (0x8100), length 160: vlan 20, p 0,
  ethertype MPLS unicast (0x8847), MPLS (label 0 (IPv4 explicit NULL), tc 0, [S], ttl 61)
    IP6, wrong link-layer encapsulation (invalid)

09:22:55.349977 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0,
  ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63)
    version error: 4 != 6        

Looking at the payload of these broken packets, they are DNS packets coming from the vpp0-3 Linux Control Plane, being sent to either the IPv4 address 192.168.10.4 or the IPv6 address 2001:678:d78:201::ffff. Interestingly, these are the lab’s resolvers, so I think vpp0-3 is just trying to resolve something.

Clue 3: Vanishing MPLS packets

As I mentioned, some source/destination pairs in the lab do not seem to pass traffic, while others are fine. One such case of packet loss is any traffic from vpp0-3 to the IPv4 address of vpp0-1.e0. The path from vpp0-3 to that IPv4 address should go out on vpp0-3.e0 and into vpp0-2.e1, but tcpdump shows absolutely no such traffic between vpp0-3 and vpp0-2, while I’d expect to see it on VLAN 22!

Diagnosis

Well, based on Clue 3, I take a look at what is happening on vpp0-3. I start by looking at the Linux controlplane view, where the route to the lab’s resolver looks like this:

root@vpp0-3:~$ ip route get 192.168.10.4
192.168.10.4/31 nhid 154  encap mpls  36 via 192.168.10.10 dev e0 proto ospf src 192.168.10.3 metric 20

root@vpp0-3:~$ tcpdump -evni e0 mpls 36
15:07:50.864605 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype MPLS unicast (0x8847), length 136:
  MPLS (label 36, tc 0, [S], ttl 64)
    (tos 0x0, ttl 64, id 15752, offset 0, flags [DF], proto UDP (17), length 118)
    192.168.10.3.36954 > 192.168.10.4.53: 20950+ PTR?
      1.9.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa. (90)        

Yes indeed, Linux is sending a DNS packet out on e0, so what am I seeing on the switch fabric? In the LAB diagram above, I can look up that traffic from vpp0-3 destined to vpp0-2 should show up on VLAN 22:

root@tap0-0:~$ tcpdump -evni enp16s0f0 -s 1500 -X vlan 22 and mpls
15:19:56.453521 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0,
  ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63)
        version error: 4 != 6
        0x0000:  0000 213f 4500 0076 d17e 4000 4011 d3a0  ..!?E..v.~@.@...
        0x0010:  c0a8 0a03 c0a8 0a04 e139 0035 0062 0dde  .........9.5.b..
        0x0020:  079e 0100 0001 0000 0000 0000 0131 0139  .............1.9
        0x0030:  0130 0130 0130 0130 0130 0130 0130 0130  .0.0.0.0.0.0.0.0
        0x0040:  0130 0130 0130 0130 0130 0130 0133 0130  .0.0.0.0.0.0.3.0
        0x0050:  0130 0130 0138 0137 0164 0130 0138 0137  .0.0.8.7.d.0.8.7
        0x0060:  0136 0130 0131 0130 0130 0132 0369 7036  .6.0.1.0.0.2.ip6
        0x0070:  0461 7270 6100 000c 0001                 .arpa.....        

MPLS Corruption

Ouch, that hurts my eyes! Linux sent the packet into the TAP device carrying label value 36, so why is it being observed as an IPv6 Explicit Null with label value 2? That can’t be right. In an attempt to learn more, I ask VPP to give me a packet trace. I happen to remember that on the way from Linux to VPP, the virtio-input driver is used (while on the way from the wire to VPP, dpdk-input is used).

The trace teaches me something really valuable:

vpp0-3# trace add virtio-input 100
vpp0-3# show trace

00:03:27:192490: virtio-input
  virtio: hw_if_index 7 next-index 4 vring 0 len 136
    hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1
00:03:27:192500: ethernet-input
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
00:03:27:192504: mpls-input
  MPLS: next mpls-lookup[1]  label 36 ttl 64 exp 0
00:03:27:192506: mpls-lookup
  MPLS: next [6], lookup fib index 0, LB index 92 hash 0 label 36 eos 1
00:03:27:192510: mpls-label-imposition-pipe
    mpls-header:[ipv6-explicit-null:63:0:eos]
00:03:27:192512: mpls-output
  adj-idx 21 : mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847 flow hash: 0x00000000
00:03:27:192515: GigabitEthernet10/0/0-output
  GigabitEthernet10/0/0 flags 0x00180005
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
  label 2 exp 0, s 1, ttl 63
00:03:27:192517: GigabitEthernet10/0/0-tx
  GigabitEthernet10/0/0 tx queue 0
  buffer 0x4c2ea1: current data 0, length 136, buffer-pool 0, ref-count 1, trace handle 0x7
                   l2-hdr-offset 0 l3-hdr-offset 14 
  PKT MBUF: port 65535, nb_segs 1, pkt_len 136
    buf_len 2176, data_len 136, ol_flags 0x0, data_off 128, phys_addr 0x730ba8c0
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
  label 2 exp 0, s 1, ttl 63        

At this point, I think I’ve figured it out. I can see clearly that the MPLS packet is seen coming from Linux, and it has label value 36. But it is then offered to graph node mpls-input, which does what it is designed to do, namely look up the label in the FIB:

vpp0-3# show mpls fib 36
MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]
36:neos/21 fib:0 index:88 locks:2
  lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
    path-list:[50] locks:24 flags:shared, uPRF-list:38 len:1 itfs:[1, ]
      path:[66] pl-index:50 ip6 weight=1 pref=0 attached-nexthop:  oper-flags:resolved,
        fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0
      [@0]: ipv6 via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:4 flags:[] 52540002100152540003100086dd
    Extensions:
     path:66  labels:[[ipv6-explicit-null pipe ttl:0 exp:0]]
 forwarding:   mpls-neos-chain
  [@0]: dpo-load-balance: [proto:mpls index:91 buckets:1 uRPF:38 to:[0:0]]
    [0] [@6]: mpls-label[@34]:[ipv6-explicit-null:64:0:neos]
        [@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847        


Haha, I love it when the brain-lightbulb goes to the on position. What’s happening is that when we turned on the MPLS feature on the VPP tap that is connected to e0, and VPP saw an MPLS packet, it looked up label 36 in the MPLS FIB, which told it to SWAP the label for IPv6 Explicit NULL (which is label value 2), and send the packet out on Gi10/0/0 to an IPv6 nexthop. Yeah, that’ll break all right.

MPLS Drops

OK, that explains the garbled packets, but what about the ones that I never even saw on the wire (Clue 3)? Well, now that I’ve enjoyed my lightbulb moment, I know exactly where to look. Consider the following route in Linux, which sends traffic out encapsulated with MPLS label value 37; and consider also what happens when mpls-input receives an MPLS frame with that value:

root@vpp0-3:~# ip ro get 192.168.10.6
192.168.10.6  encap mpls  37 via 192.168.10.10 dev e0 src 192.168.10.3 uid 0 

vpp0-3# show mpls fib 37
MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]        

.. that’s right, there IS no entry. As such, I would expect VPP to not know what to do with such a mislabeled packet, and drop it. Unsurprisingly at this point, here’s a nice proof:

00:10:31:107882: virtio-input
  virtio: hw_if_index 7 next-index 4 vring 0 len 102
    hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1
00:10:31:107891: ethernet-input
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
00:10:31:107897: mpls-input
  MPLS: next mpls-lookup[1]  label 37 ttl 64 exp 0
00:10:31:107898: mpls-lookup
  MPLS: next [0], lookup fib index 0, LB index 22 hash 0 label 37 eos 1
00:10:31:107901: mpls-drop
  drop
00:10:31:107902: error-drop
  rx:tap1
00:10:31:107905: drop
  mpls-input: MPLS DROP DPO        

Conclusion: tadaa.wav! When VPP receives the MPLS packet from Linux, it has already been routed (encapsulated and put in an MPLS packet that’s meant to be sent to the next router), so it should be left alone. Instead, VPP is forcing the packet through the MPLS FIB, where if I’m lucky (and I’m not, clearly …) the right thing happens. But sometimes the MPLS FIB has instructions that are different from what Linux had intended, bad things happen, and kittens get hurt. I can’t allow that to happen. I like kittens!

Fixing Linux CP + MPLS

Now that I know what’s actually going on, the fix comes quickly into focus. Of course, when Linux sends an MPLS packet, VPP must not do a FIB lookup. Instead, it should emit the packet on the correct interface as-is. It sounds a little bit like re-arranging the directed graph that VPP uses internally. I’ve never done this before, but why not give it a go .. you know, for science :)

VPP has a concept called feature arcs. These are points in the packet processing graph where features can be inserted and turned on/off per interface. There’s a feature arc for MPLS called mpls-input. I can create a graph node that does anything I’d like to the packets at this point, and what I want to do is take the packet and, instead of offering it to the mpls-lookup node, just emit it on its egress interface using interface-output.

First, I use VLIB_NODE_FN to define a new node in VPP, which I call lcp_xc_mpls(). I register this node with VLIB_REGISTER_NODE, giving it the symbolic name linux-cp-xc-mpls, which extends the existing code in this plugin for ARP and IPv4/IPv6 forwarding. Once the packet enters my new node, there are two possible places for it to go, defined by the next_nodes field:

  1. LCP_XC_MPLS_NEXT_DROP: If I can’t figure out where this packet is headed (there should be an existing adjacency for it), I will send it to error-drop, where it will be discarded.
  2. LCP_XC_MPLS_NEXT_IO: If I do know, however, I ask VPP to send this packet simply to interface-output, where it will be marshalled on the wire, unmodified.

Taking this shortcut for MPLS packets avoids them being looked up in the FIB, and in hindsight this is no different from how IPv4 and IPv6 packets are also short-circuited: for those, ip4-lookup and ip6-lookup are also not called; instead, lcp_xc_inline() does the business.

I can inform VPP that my new node should be attached as a feature on the mpls-input arc by calling VNET_FEATURE_INIT with it.

Implementing the VPP node is a bit of fiddling, but I take inspiration from the existing function lcp_xc_inline(), which does this for IPv4 and IPv6. Really, all I must do is two things:

  1. Using the Linux Interface Pair (LIP) entry, figure out which physical interface corresponds to the TAP interface I just received the packet on, and then set the TX interface to that.
  2. Retrieve the ethernet adjacency based on the destination MAC address, and use it to set the correct L2 nexthop. If I don’t know which adjacency to use, set LCP_XC_MPLS_NEXT_DROP as the next node; otherwise set LCP_XC_MPLS_NEXT_IO.

The finishing touch on the graph node is to make sure that it’s trace-aware. I use packet tracing a lot, as this article shows, so I’ll detect if tracing is turned on for a given packet, and if so, tack on an lcp_xc_trace_t object, so traces will reveal my new node in use.

Once the node is ready, I have one final step. When constructing the Linux Interface Pair in lcp_itf_pair_add(), I enable the newly created linux-cp-xc-mpls feature on the mpls-input feature arc for the TAP interface, by calling vnet_feature_enable_disable(). Conversely, I disable the feature when removing the LIP in lcp_itf_pair_del().

Results

After rebasing @vifino’s change, I add my code in [Gerrit 38702, PatchSet 11-14]. The simplest way to show the effect of the change is to take a look at these MPLS packets coming in from the Linux Controlplane, and how they now get handed to linux-cp-xc-mpls instead of mpls-lookup:

00:04:12:846748: virtio-input
  virtio: hw_if_index 7 next-index 4 vring 0 len 102
    hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1
00:04:12:846804: ethernet-input
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
00:04:12:846811: mpls-input
  MPLS: next BUG![3]  label 37 ttl 64 exp 0
00:04:12:846812: linux-cp-xc-mpls
  lcp-xc: itf:1 adj:21
00:04:12:846844: GigabitEthernet10/0/0-output
  GigabitEthernet10/0/0 flags 0x00180005
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
  label 37 exp 0, s 1, ttl 64
00:04:12:846846: GigabitEthernet10/0/0-tx
  GigabitEthernet10/0/0 tx queue 0
  buffer 0x4be948: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   l2-hdr-offset 0 l3-hdr-offset 14 
  PKT MBUF: port 65535, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1f9a5280
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
  label 37 exp 0, s 1, ttl 64        

The same is true for the original DNS packet with MPLS label 36 – it just goes out of Gi10/0/0 with the same label, which is dope! Indeed, no more garbled MPLS packets are seen, and the following simple acceptance test shows that all machines can reach all other machines on the LAB cluster:

ipng@vpp0-3:~$ fping -g 192.168.10.0 192.168.10.3
192.168.10.0 is alive
192.168.10.1 is alive
192.168.10.2 is alive
192.168.10.3 is alive

ipng@vpp0-3:~$ fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3
2001:678:d78:200:: is alive
2001:678:d78:200::1 is alive
2001:678:d78:200::2 is alive
2001:678:d78:200::3 is alive        

My ping test here from vpp0-3 tries to ping (via the Linux controlplane) each of the other routers, including itself. It first does this with IPv4 and then with IPv6, showing that all eight possible destinations are alive. Progress, sweet sweet progress.

I then expand that with this nice one-liner:

pim@lab:~$ for af in 4 6; do \
  for node in $(seq 0 3); do \
    ssh -$af ipng@vpp0-$node "fping -g 192.168.10.0 192.168.10.3; \
      fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3"; \
    done \
  done | grep -c alive

64        

Taking both IPv4 and IPv6, I log in to all four nodes (so in total I invoke SSH 8 times), and then perform both fping operations, each time receiving eight responses, sixty-four in total. This checks out. I am very pleased with my work.

What’s next

I joined forces with @vifino, who has effectively added MPLS handling to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR’s label distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)

Our work is mostly complete; there are two pending Gerrits which should be ready to review, and certainly ready to play with:

  1. [Gerrit 38826]: This adds the ability to listen to internal state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the LIP interfaces and set the Linux sysctl for MPLS input.
  2. [Gerrit 38702/10]: This adds the ability to listen to Netlink messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6 and MPLS FIB in the VPP dataplane.
  3. [Gerrit 38702/14]: This Gerrit now also adds the ability to directly output MPLS packets from Linux on the correct interface, without pulling them through the MPLS FIB.

Finally, a note from your friendly neighborhood developers: this code is brand-new and has had very limited peer review from the VPP developer community. It adds a significant feature to the Linux Controlplane plugin, so make sure you understand the semantics, the differences between Linux and VPP, and the overall implementation before attempting to use it in production. We’re pretty sure we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on [GitHub]. If you’d like to test this, reach out to the VPP Developer [mailing list] any time!
