Case Study: NAT64 in AS8298

Introduction

IPng’s network is built up in two main layers: (1) an MPLS transport layer, which is disconnected from the Internet, and (2) a VPP overlay, which carries the Internet. I created a BGP-free core transport network using MPLS switches from a company called Centec. These switches offer IPv4, IPv6, VxLAN, GENEVE and GRE all in silicon, draw very little power, and are relatively affordable per port.

Centec switches allow for a modest, but not huge, number of routes in the hardware forwarding tables. I loadtested them in [a previous article ] at line rate (well, at least 8x10G at 64 byte packets and around 110Mpps), and they forward IPv4, IPv6 and MPLS traffic effortlessly, at 45 watts.

I wrote more about the Centec switches in [my review ] of them back in 2022.

IPng Site Local

I leverage this internal transport network for more than just MPLS. The transport switches are perfectly capable of line rate (at 100G+) IPv4 and IPv6 forwarding as well. When designing IPng Site Local, I created a number plan that assigns IPv4 from the 198.19.0.0/16 prefix, and IPv6 from the 2001:678:d78:500::/56 prefix. Within these, I allocate blocks for loopback addresses, point-to-point subnets, and hypervisor networks for VMs and internal traffic.

Take a look at the diagram to the right. Each site has one or more Centec switches (in red), and there are three redundant gateways that connect the IPng Site Local network to the Internet (in orange). I run lots of services in this red portion of the network: site to site backups [Borgbackup ], ZFS replication [ZRepl ], a message bus using [Nats ], and of course monitoring with SNMP and Prometheus all make use of this network. But it’s not only internal services like management traffic: I also actively use this private network to expose public services!

For example, I operate a bunch of [NGINX Frontends ] that have a public IPv4/IPv6 address, and reverse proxy for webservices (like [ublog.tech ] or [Rallly ]) which run on VMs and Docker hosts that don’t have public IP addresses. Another example, which I wrote about [last week ], is a bunch of mail services that run on VMs without public access, but are each carefully exposed via reverse proxies (like Postfix, Dovecot, or [Roundcube ]). It’s an incredibly versatile network design!

Border Gateways

Seeing as IPng Site Local uses native IPv6, it’s rather straightforward to give each hypervisor and VM an IPv6 address, and configure IPv4 only on the externally facing NGINX Frontends. As a reverse proxy, NGINX will create a new TCP session to the internal server, and that’s a fine solution. However, I also want my internal hypervisors and servers to have full Internet connectivity. For IPv6, this feels pretty straightforward, as I can just route the 2001:678:d78:500::/56 through a firewall that blocks incoming traffic, and call it a day. For IPv4, I can similarly use classic NAT, just like one would in a residential network.

But what if I wanted to go IPv6-only? This poses a small challenge, because while IPng is fully IPv6 capable, and has been since the early 2000s, the rest of the internet is not quite there yet. For example, the quite popular [GitHub ] hosting site still has only an IPv4 address. Come on, folks, what’s taking you so long?! It is for this purpose that NAT64 was invented. Described in [RFC6146 ]:

Stateful NAT64 translation allows IPv6-only clients to contact IPv4 servers using unicast UDP, TCP, or ICMP. One or more public IPv4 addresses assigned to a NAT64 translator are shared among several IPv6-only clients. When stateful NAT64 is used in conjunction with DNS64, no changes are usually required in the IPv6 client or the IPv4 server.

The rest of this article describes version 2 of the IPng SL border gateways, which opens the path for IPng to go IPv6-only. By the way, I thought it would be super complicated, but in hindsight: I should have done this years ago!

Gateway Design


Let me take a closer look at the orange boxes that I drew in the network diagram above. I call these machines Border Gateways. Their job is to sit between IPng Site Local and the Internet. They’ll each have one network interface connected to the Centec switch, and another connected to the VPP routers at AS8298. They will provide two main functions: firewalling, so that no unwanted traffic enters IPng Site Local, and NAT translation, so that:

  1. IPv4 users from 198.19.0.0/16 can reach external IPv4 addresses,
  2. IPv6 users from 2001:678:d78:500::/56 can reach external IPv6,
  3. IPv6-only users can reach external IPv4 addresses, a neat trick.

IPv4 and IPv6 NAT

Let me start off with the basic table stakes. You’ll likely be familiar with masquerading: a NAT technique in Linux that uses the public IPv4 address assigned by your provider, allowing many internal clients, often using [RFC1918 ] addresses, to access the internet via that shared IPv4 address. You may not have come across IPv6 masquerading, but it’s equally possible to take an internal (private, non-routable) IPv6 network and access the internet via a shared IPv6 address.
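
In its simplest form that is a one-liner per address family; here’s a minimal sketch, assuming enp1s0f0 is the external interface (as it is later in this article):

  # Rewrite outbound traffic to whatever address happens to be on the egress interface
  iptables  -t nat -A POSTROUTING -o enp1s0f0 -j MASQUERADE
  ip6tables -t nat -A POSTROUTING -o enp1s0f0 -j MASQUERADE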

I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:
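
The exact assignments aren’t reproduced in this repost, but think of each gateway getting a /30 out of the 194.126.235.0/24 supernet and a /125 out of 2001:678:d78::/48, for example:

  IPv4 pool: 194.126.235.4/30       (.4 through .7, four addresses; illustrative)
  IPv6 pool: 2001:678:d78:3::/125   (::0 through ::7, eight addresses; illustrative)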

Linux iptables masquerading will only work with the IP addresses assigned to the external interface, so I will need to use a slightly different approach to be able to use these pools. In case you’re wondering: IPng’s internal network has grown to the point where I cannot expose it all behind a single IPv4 address; there would not be enough TCP/UDP ports. Luckily, NATing via a pool is pretty easy using the SNAT module:
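
The exact rules aren’t reproduced in this repost; a minimal sketch of what the next paragraph describes, reusing the illustrative pools from above, could look like this:

  # Default: refuse to FORWARD anything that is not explicitly accepted
  iptables  -P FORWARD DROP
  ip6tables -P FORWARD DROP

  # Allow traffic entering on the internal interface, from the Site Local prefixes only
  iptables  -A FORWARD -i enp1s0f1 -s 198.19.0.0/16         -j ACCEPT
  ip6tables -A FORWARD -i enp1s0f1 -s 2001:678:d78:500::/56 -j ACCEPT

  # Allow return traffic matching state created on the way out
  iptables  -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
  ip6tables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

  # NAT44 and NAT66: source-NAT onto the (illustrative) pool ranges
  iptables  -t nat -A POSTROUTING -s 198.19.0.0/16 \
            -j SNAT --to-source 194.126.235.4-194.126.235.7
  ip6tables -t nat -A POSTROUTING -s 2001:678:d78:500::/56 \
            -j SNAT --to-source 2001:678:d78:3::-2001:678:d78:3::7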

From the top – I’ll first make it the default for the kernel to refuse to FORWARD any traffic that is not explicitly accepted. I then allow traffic that comes in via enp1s0f1 (the internal interface), but only if it is sourced from the assigned IPv4 and IPv6 Site Local prefixes. On the way back, I’ll allow traffic that matches states created on the way out. This is the firewalling portion of the setup.

Then, two POSTROUTING rules turn on network address translation. If the source address is any of the site local prefixes, I’ll rewrite it to come from the IPv4 or IPv6 pool addresses, respectively. This is the NAT44 and NAT66 portion of the setup.

NAT64: Jool

So far, so good. But this article is about NAT64 :-) Here’s where I grossly overestimated how difficult it might be – and if there’s one takeaway from my story here, it should be that NAT64 is as straightforward as the others! Enter [Jool ], an Open Source SIIT and NAT64 implementation for Linux. It’s available in Debian as a DKMS kernel module and userspace tools, and it integrates cleanly with both iptables and netfilter.

Jool is a network address and port translating (NAPT) implementation, just like regular IPv4 NAT. When an internal IPv6 client tries to reach an external endpoint, Jool will make note of the internal src6:port, select an external IPv4 address:port, and rewrite the packet; on the way back, it correlates the src4:port with the internal src6:port and rewrites the packet again. If this sounds an awful lot like NAT, you’re not wrong! The only difference is that Jool also translates the address family: it rewrites the internal IPv6 addresses to external IPv4 addresses.

Installing Jool is as simple as this:
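
The original commands aren’t included in this repost; on Debian, a sketch could look like the following (the package names, the atomic-config layout and the pool4 entries reusing the illustrative /30 are my assumptions, so double-check against Jool’s documentation):

  apt install jool-dkms jool-tools

  # /etc/jool/jool.conf, loaded with: jool file handle /etc/jool/jool.conf
  {
    "instance": "default",
    "framework": "netfilter",
    "global": {
      "pool6": "2001:678:d78:564::/96"
    },
    "pool4": [
      { "protocol": "TCP",  "prefix": "194.126.235.4/30", "port range": "10000-65535" },
      { "protocol": "UDP",  "prefix": "194.126.235.4/30", "port range": "10000-65535" },
      { "protocol": "ICMP", "prefix": "194.126.235.4/30" }
    ]
  }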

... and that, as they say, is all there is to it! There are two things I make note of here:

  1. I have assigned 2001:678:d78:564::/96 as NAT64 pool6, which means that if this machine sees any traffic destined to that prefix, it’ll activate Jool, select an available IPv4 address:port from the pool4, and send the packet to the IPv4 destination address which it takes from the last 32 bits of the original IPv6 destination address.
  2. Cool trick: I am reusing the same IPv4 pool as for regular NAT. The Jool kernel module happily coexists with the iptables implementation!

DNS64: Unbound

There’s one vital piece of information missing, and it took me a little while to appreciate it. If I take an IPv6-only host, like Summer, and I try to connect to an IPv4-only host, how does that even work?

Now comes the really clever reveal – NAT64 works by assigning an IPv6 prefix that snugly fits the entire IPv4 address space, typically 64:ff9b::/96, but operators can choose any prefix they’d like. For IPng’s Site Local network, I decided to assign 2001:678:d78:564::/96 for this purpose (this is the global.pool6 attribute in Jool’s config file I described above). A resolver can then tweak DNS lookups for IPv6-only hosts to return addresses from that IPv6 range. This tweaking is called DNS64, described in [RFC6147 ]:

DNS64 is a mechanism for synthesizing AAAA records from A records. DNS64 is used with an IPv6/IPv4 translator to enable client-server communication between an IPv6-only client and an IPv4-only server, without requiring any changes to either the IPv6 or the IPv4 node, for the class of applications that work through NATs.

I run the popular [Unbound ] resolver at IPng, deployed as a set of anycasted instances across the network. With two lines of configuration only, I can turn on this feature:
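
The snippet itself isn’t in this repost; the two directives in question are Unbound’s module-config (to enable the dns64 module in front of the usual validator and iterator) and dns64-prefix, along these lines:

  server:
    module-config: "dns64 validator iterator"
    dns64-prefix: 2001:678:d78:564::/96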

The behavior of the resolver now changes in a very subtle but cool way:
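
The original shows dig output from Summer; as an illustration (prompt and output abridged), the synthesized answer for github.com is the one mentioned further down:

  pim@summer:~$ dig +short AAAA github.com
  2001:678:d78:564::8c52:7903

  # A PTR query for that synthesized address is internally rewritten by Unbound
  # into a query for 3.121.82.140.in-addr.arpa, i.e. the original 140.82.121.3:
  pim@summer:~$ dig -x 2001:678:d78:564::8c52:7903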

Before, [github.com ] did not return an AAAA record, so there was no way for Summer to connect to it. Now, not only does an AAAA record come back, Unbound also rewrites PTR requests: when it sees a query for something in the DNS64 range of 2001:678:d78:564::/96, it strips off the last 32 bits (8c52:7903, which is the hex encoding of the original IPv4 address) and returns the answer for a PTR lookup of the original 3.121.82.140.in-addr.arpa instead. Game changer!

DNS64 + NAT64

What I learned from this, is that the combination of these two tools provides the magic:

  1. When an IPv6-only client asks for AAAA for an IPv4-only hostname, Unbound will synthesize an AAAA from the IPv4 address, casting it into the last 32 bits of its NAT64 prefix 2001:678:d78:564::/96
  2. When an IPv6-only client tries to send traffic to 2001:678:d78:564::/96, Jool will do the address family (and address/port) translation. This is represented by the red (ipv6) flow in the diagram to the right turning into a green (ipv4) flow to the left.

What’s left for me to do is to ensure that (a) the NAT64 prefix is routed from IPng Site Local to the gateways and (b) the IPv4 and IPv6 NAT address pools are routed from the Internet to the gateways.

Internal: OSPF

I use Bird2 to accomplish the dynamic routing - and considering the Centec switch network is by design BGP-free, I will use OSPF and OSPFv3 for these announcements. Using OSPF has an important benefit: I can selectively turn the Bird announcements into the Centec IPng Site Local network on and off. Seeing as there will be multiple redundant gateways, if one of them goes down (either due to failure or because of maintenance), the network will quickly reconverge on another replica. Neat!

Here’s how I configure the OSPF import and export filters:
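
The filter bodies aren’t reproduced in this repost; a condensed Bird2 sketch that matches the description below (interface name and cost are illustrative) might look like:

  filter ospf4_import {
    if net = 198.19.0.0/16 then accept;
    reject;
  }
  filter ospf4_export {
    if net = 0.0.0.0/0 then accept;           # IPv4 default route
    if net = 198.19.0.255/32 then accept;     # anycast resolver address
    reject;
  }
  filter ospf6_import {
    if net = 2001:678:d78:500::/56 then accept;
    reject;
  }
  filter ospf6_export {
    if net = ::/0 then accept;                        # IPv6 default route
    if net = 2001:678:d78:500::1:0/128 then accept;   # anycast resolver address
    if net = 2001:678:d78:564::/96 then accept;       # NAT64 prefix
    reject;
  }

  protocol ospf v2 ospf4 {
    ipv4 { import filter ospf4_import; export filter ospf4_export; };
    area 0 { interface "enp1s0f1" { cost 5; }; };
  }
  protocol ospf v3 ospf6 {
    ipv6 { import filter ospf6_import; export filter ospf6_export; };
    area 0 { interface "enp1s0f1" { cost 5; }; };
  }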

When learning prefixes from the Centec switch, I will only accept precisely the IPng Site Local IPv4 (198.19.0.0/16) and IPv6 (2001:678:d78:500::/56) supernets. On sending prefixes to the Centec switches, I will announce:

  • 198.19.0.255/32 and 2001:678:d78:500::1:0/128: These are the anycast addresses of the Unbound resolver.
  • 0.0.0.0/0 and ::/0: These are default routes for IPv4 and IPv6 respectively
  • 2001:678:d78:564::/96: This is the NAT64 prefix, which will attract the IPv6-only traffic towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation of github.com, which is reachable only at legacy address 140.82.121.3.

I have to be careful with the announcements into OSPF. The cost of E1 routes is the external metric in addition to the internal OSPF cost to reach that network, whereas the cost of E2 routes is always just the external metric; it takes no notice of the internal cost to reach the announcing router. Therefore, I emit these prefixes as type 1 (E1) externals, so that the closest border gateway is always used.

With that in place, the default routes, the anycast resolver addresses and the NAT64 prefix show up in the IPng Site Local routing tables, pointing at the nearest border gateway.

I’m not quite there yet; I have one more step to go. What’s happening at the Border Gateway? Let me take a look at this, while I ping6 github.com:
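
The original includes tcpdump output from the gateway; roughly, this is how one would observe it (hostname and exact capture filters illustrative):

  # On the border gateway: the IPv6 side of the flow arrives on the internal interface ...
  sudo tcpdump -ni enp1s0f1 host 2001:678:d78:564::8c52:7903
  # ... and leaves, translated, as IPv4 on the external interface
  sudo tcpdump -ni enp1s0f0 icmp and host 140.82.121.3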

Unbound and Jool are doing great work. Unbound saw my DNS request for the IPv4-only github.com, and synthesized a DNS64 response for me. Jool then saw the inbound packet on enp1s0f1, the internal interface pointed at IPng Site Local; this is because the 2001:678:d78:564::/96 prefix is announced in OSPFv3, so every host knows to route traffic for that prefix to this border gateway. Then I see the NAT64 in action on the outbound interface enp1s0f0, where one of the IPv4 pool addresses is selected as the source address. But there is no return packet yet, because there is no route back from the Internet.

External: BGP

The final step is to allow return traffic from the Internet, destined to the IPv4 and IPv6 pools, to reach this Border Gateway instance. For this, I configure BGP with the following Bird2 configuration snippet:
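
The actual snippet isn’t reproduced here; a Bird2 sketch along the lines of what the next paragraph describes (neighbor addresses and pool prefixes are illustrative) could be:

  # Statics exist only so that BGP has something to originate; don't export these to the kernel
  protocol static pools4 { ipv4; route 194.126.235.4/30 blackhole; }
  protocol static pools6 { ipv6; route 2001:678:d78:3::/125 blackhole; }

  filter bgp_export {
    if source = RTS_STATIC then {
      bgp_community.add((65535, 65281));   # well-known no-export, i.e. FFFF:FF01
      accept;
    }
    reject;
  }

  protocol bgp core1_v4 {
    local as 64513;
    neighbor 194.126.235.1 as 8298;        # one of the two AS8298 core routers
    ipv4 { import none; export filter bgp_export; };
  }
  protocol bgp core1_v6 {
    local as 64513;
    neighbor 2001:678:d78::1 as 8298;
    ipv6 { import none; export filter bgp_export; };
  }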

I then establish an eBGP session from private AS64513 to two of IPng Networks’ core routers at AS8298. I add the well-known BGP no-export community (FFFF:FF01) so that these prefixes are learned in AS8298, but never propagated. It’s not strictly necessary, because AS8298 won’t announce more specifics like these anyway, but it’s a nice way to really assert that these are meant to stay local. Because AS8298 is already announcing the 194.126.235.0/24 and 2001:678:d78::/48 supernets, return traffic will already be able to reach IPng’s routers upstream. With these more specific announcements of the /30 and /125 pools, the upstream VPP routers will be able to route the return traffic to this specific server.

And with that, the ping to Unbound’s DNS64 provided IPv6 address for github.com shoots to life.

Results

I deployed four of these Border Gateways using Ansible: one at my office in Brüttisellen, one in Zurich, one in Geneva and one in Amsterdam. They perform all three types of NAT, and serve DNS64:

  • Announcing the IPv4 default 0.0.0.0/0 will allow them to serve as NAT44 gateways for 198.19.0.0/16
  • Announcing the IPv6 default ::/0 will allow them to serve as NAT66 gateway for 2001:678:d78:500::/56
  • Announcing the IPv6 nat64 prefix 2001:678:d78:564::/96 will allow them to serve as NAT64 gateway
  • Announcing the IPv4 and IPv6 anycast addresses for nscache.net.ipng.ch allows them to serve DNS64

Each individual service can be turned on or off. For example, no longer announcing the IPv4 default into the Centec network means a replica no longer attracts NAT44 traffic. Similarly, no longer announcing the NAT64 prefix means that replica no longer attracts NAT64 traffic. OSPF in the IPng Site Local network will automatically select an alternative replica in such cases. Shutting down Bird2 altogether immediately drains the machine of all traffic, which is rerouted to the remaining replicas.

If you’re curious, here’s a few minutes of me playing with failover, while watching YouTube videos concurrently.

A recording of this screencast is on [IPng's website ].

What’s Next

I’ve added an Ansible module in which I can configure the individual instances’ IPv4 and IPv6 NAT pools, and turn on/off the three NAT types by means of steering the OSPF announcements. I can also turn on/off the Anycast Unbound announcements, in much the same way.

If you’re a regular reader of my stories, you’ll maybe be asking: Why didn’t you use VPP? And that would be an excellent question. I need to noodle a little bit more with respect to having all three NAT types concurrently working alongside Linux CP for the Bird and Unbound stuff, but I think in the future you might see a followup article on how to do all of this in VPP. Stay tuned!

Comments

Jeff F.

Systems and Network Consultant

1mo

Think you can answer that question pretty easily -- laziness. When regular old NAT/PAT and load balancers came on the scene, everybody basically de-prioritized lighting up v6 space because you could just use private space on the back end. The other contributor to poor adoption -- having to manage two different address spaces, especially if some of your older tech didn't support v6.
Mark White

Cybersecurity Research and Pilot Program Director

1mo

IPv6 supports extension packets, each one of which can have a payload of up to four gigabytes. Think about how network monitoring and IPv6 requires thoughtful convergence.
Jody Lemoine

Network Greasemonkey, Packet Macrame Specialist, Virtual Pneumatic Tube Transport Designer and Connectivity Nerfherder.

1mo

There’s a good logic to it. If we’re doing NAT from private to public on IPv4 anyway, why not do NAT64 and go single stack on the internal network? It’s the same idea and eliminates a point of management.

Jeff Cooper

I like cloud security and I cannot lie... Cloud Security Architect. Zero Trust Architect.

1mo

I recently came across the GitHub case. I was blown away. I would imagine smaller providers still only offering IPv4, but completely shocked to see GitHub only offer IPv4.

Savvas Bout

Digital Innovator

2mo

Great share Pim!
