NVMe Over TCP

NVMe over TCP is an extension of NVMe over Fabrics. It uses the standard TCP/IP network stack over Ethernet and does not require any additional RDMA-capable network device.

NVMe-oF provides low-latency networked storage by combining the NVMe protocol with highly efficient fabric technologies such as RDMA, Fibre Channel, or TCP. The Linux kernel community has focused on improving the storage stack for the latest generations of CPUs, NVMe SSDs, and network interfaces to increase the performance and efficiency of storage applications. The recent growth in data-processing requirements of AI/ML applications demands a more efficient data-access architecture, and in such scenarios NVMe over TCP plays a crucial role.

NVMe/TCP inherits the strengths of the Transmission Control Protocol, which offers the following benefits:

1. Easy to deploy: It integrates with existing TCP/IP infrastructure and has no special hardware requirements such as RDMA-capable NICs.

2. Scalable: TCP/IP infrastructure is widely available and can be scaled easily as storage requirements grow.

3. Reliable: TCP/IP provides packet acknowledgement, retransmission, congestion control, and similar mechanisms that make data access via NVMe/TCP more reliable.

4. Manageable: No special libraries or drivers are required to implement or deploy it.

Figure 1: Evolution of NVMe-oF

NVMe IP Based Storage Area Network:

Let us consider a basic view of an NVMe IP-based storage area network. It clarifies the terminology used in the NVMe/TCP architecture discussed later.

1. End points: Hosts and storage systems are the end points; each is identified by an NVMe Qualified Name (NQN, e.g., nqn.2014-08.com.example:array1) and an IP address.

2. IP network: Modern switches that connect the end points (e.g., 25GbE-capable).

3. Subsystem: The storage array, analogous to a SCSI target, identified by an NQN. The NQN serves a similar function to the WWNN (World Wide Node Name) in Fibre Channel and the IQN in iSCSI.

4. CDC (Centralized Discovery Controller): Each CDC instance provides a discovery controller for the end points participating in a particular NVMe IP-based SAN instance.

5. DDC (Direct Discovery Controller): An NVMe discovery controller that resides on a subsystem. Hosts can connect directly to storage via a DDC, but they lose the advantage of centralized discovery.

Figure 2: NVMe IP Based SAN

NVMe Storage Architecture:

NVMe uses PCIe to access NVMe SSDs. An NVMe SSD consists of a host interface, an SSD controller (I/O controller, flash controller, and processor, as in the diagram below), and NAND (non-volatile) flash memory. The NVMe driver on the host uses the MMIO controller registers (doorbells) and system DRAM for the I/O submission queues (SQ) and completion queues (CQ).

NVMe uses a small number of optimized commands and command completions. A command uses a fixed-size 64-byte data structure, and a completion uses a fixed-size 16-byte data structure. There are two types of NVMe commands: Admin commands and I/O commands. Admin commands are sent to the Admin Queue (a single SQ/CQ pair), and I/O commands are sent to I/O queues (each of which has its own SQ/CQ pair, or shares a CQ that handles completions for commands submitted via multiple SQs). For more information, refer to the NVMe over TCP specification.

Figure 3: NVMe Storage Architecture
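To make the fixed sizes concrete, here is a minimal C sketch of the common fields of a submission queue entry (SQE) and a completion queue entry (CQE). Field names follow the NVMe Base Specification, but this is a simplified illustration rather than a complete definition:

```c
#include <stdint.h>

/* Simplified 64-byte NVMe submission queue entry (SQE). */
struct nvme_sqe {
    uint8_t  opcode;     /* command opcode (e.g., 0x02 = Read for NVM commands) */
    uint8_t  flags;      /* FUSE and PSDT bits */
    uint16_t cid;        /* command identifier, echoed back in the completion */
    uint32_t nsid;       /* namespace identifier */
    uint64_t rsvd2;
    uint64_t mptr;       /* metadata pointer */
    uint64_t dptr[2];    /* data pointer (PRP entries or an SGL descriptor) */
    uint32_t cdw10;      /* command-specific double words 10..15 */
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

/* Simplified 16-byte NVMe completion queue entry (CQE). */
struct nvme_cqe {
    uint32_t result;     /* command-specific result (Dword 0) */
    uint32_t rsvd;       /* Dword 1 */
    uint16_t sq_head;    /* current SQ head pointer */
    uint16_t sq_id;      /* which SQ the completed command came from */
    uint16_t cid;        /* command identifier from the SQE */
    uint16_t status;     /* phase bit (bit 0) plus status field */
};

/* Compile-time checks of the fixed sizes called out in the text. */
_Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
_Static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");
```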

NVMe TCP Ports:

• TCP port 4420 has been assigned for use by NVMe over Fabrics.

• TCP port 8009 is the default TCP port for NVMe/TCP discovery controllers.

There is no default TCP port for NVMe/TCP I/O controllers; the Transport Service Identifier (TRSVCID) field in the Discovery Log Entry indicates the TCP port to use.

The TCP ports that may be used for NVMe/TCP I/O controllers include TCP port 4420, and the Dynamic and/or Private TCP ports (i.e., ports in the TCP port number range from 49152 to 65535). NVMe/TCP I/O controllers should not use TCP port 8009. TCP port 4420 shall not be used for both NVMe/iWARP and NVMe/TCP at the same IP address on the same network.

The TRSVCID field in a Discovery Log Entry for the NVMe/TCP transport shall contain a TCP port number in decimal representation as an ASCII string. If such a TRSVCID field does not contain a TCP port number in decimal representation as an ASCII string, then the host shall not use the information in that Discovery Log Entry to connect to a controller.
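A host-side check of this rule might look like the following sketch. The helper name is hypothetical (not part of nvme-cli or any library), and it assumes the TRSVCID field has already been trimmed of its space padding:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true and stores the port only if the
 * TRSVCID string is a TCP port number in decimal ASCII, per the rule
 * above; otherwise the Discovery Log Entry must not be used. */
static bool trsvcid_to_port(const char *trsvcid, unsigned int *port_out)
{
    unsigned int port = 0;
    size_t len = strlen(trsvcid);

    if (len == 0 || len > 5)            /* 1..5 digits covers 0..65535 */
        return false;

    for (size_t i = 0; i < len; i++) {
        if (trsvcid[i] < '0' || trsvcid[i] > '9')
            return false;               /* not decimal ASCII: reject */
        port = port * 10 + (unsigned int)(trsvcid[i] - '0');
    }

    if (port == 0 || port > 65535)      /* outside the valid TCP port range */
        return false;

    *port_out = port;
    return true;
}
```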

The following shows the packet structure and details of the PDUs (protocol data units) used in NVMe over TCP communication.

Figure 4: NVMe over TCP PDU Information
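Every NVMe/TCP PDU begins with an 8-byte common header. A minimal C rendering of that header, with field names as in the NVMe/TCP specification:

```c
#include <stdint.h>

/* 8-byte common header (CH) that starts every NVMe/TCP PDU. */
struct nvme_tcp_hdr {
    uint8_t  type;   /* PDU type: 0x00 ICReq, 0x01 ICResp, 0x04 CapsuleCmd,
                      * 0x05 CapsuleResp, 0x06 H2CData, 0x07 C2HData, 0x09 R2T */
    uint8_t  flags;  /* e.g., header/data digest enable bits */
    uint8_t  hlen;   /* PDU header length in bytes */
    uint8_t  pdo;    /* PDU data offset from the start of the PDU */
    uint32_t plen;   /* total PDU length: header, digests, and data */
};

_Static_assert(sizeof(struct nvme_tcp_hdr) == 8, "common header is 8 bytes");
```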

NVMe over TCP Connection Explained:

The following lists the sequence of operations involved in an NVMe over TCP data read from the target.

1. Discover the storage appliance: The user triggers the nvme discover CLI command on the host to query the storage appliance's subsystems.

Eg: nvme discover -t tcp -a 1.1.1.1 -s 4420

a. Initiate a Connection Request (ICReq) to the Discovery Controller (see the sketch after this list)

b. Create the NVMe Admin Queue

c. Get controller capabilities

d. Set controller configuration

e. Get controller status

f. Identify Controller (CNS 01h)

g. Identify data buffer returned

h. Get Discovery Log Page

i. Discovery log entries returned
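As a sketch of step (a), the following C fragment builds an ICReq, the fixed 128-byte PDU that opens every NVMe/TCP connection, and writes it over an already-connected TCP socket. Error handling and socket setup are omitted, and real code must serialize the multi-byte fields as little-endian rather than relying on host byte order:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Initiate Connection Request (ICReq): the first PDU the host sends
 * on a new NVMe/TCP connection (common header plus ICReq fields). */
struct nvme_tcp_icreq {
    uint8_t  type;      /* 0x00 = ICReq */
    uint8_t  flags;
    uint8_t  hlen;      /* 128 */
    uint8_t  pdo;       /* 0: ICReq carries no data */
    uint32_t plen;      /* 128 */
    uint16_t pfv;       /* PDU format version (0) */
    uint8_t  hpda;      /* host PDU data alignment */
    uint8_t  digest;    /* header/data digest enable bits */
    uint32_t maxr2t;    /* max outstanding R2Ts (0-based) */
    uint8_t  rsvd[112]; /* pad the PDU to 128 bytes */
};

_Static_assert(sizeof(struct nvme_tcp_icreq) == 128, "ICReq is 128 bytes");

static int send_icreq(int sock)
{
    struct nvme_tcp_icreq req;

    memset(&req, 0, sizeof(req));
    req.type = 0x00;
    req.hlen = sizeof(req);
    req.plen = sizeof(req);  /* correct on little-endian hosts only */

    ssize_t n = write(sock, &req, sizeof(req));
    return n == (ssize_t)sizeof(req) ? 0 : -1;
}
```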

2. Connect to the I/O subsystem.

The user issues the following nvme command to connect to the NVMe subsystem.

Eg: nvme connect -t tcp -a 1.1.1.1 -n subNQN -s 4420

a. Connection Request (ICReq) to the I/O subsystem

b. Create the NVMe Admin Queue

c. Get I/O controller capabilities

d. Set I/O controller configuration

e. Get I/O controller status after the configuration change

f. Get I/O controller version

g. Identify the I/O controller (CNS 01h)

h. Request creation of 64 I/O queues

i. Controller accepts only 8 I/O queues

j. Create the 8 I/O queues

k. NVMe I/O data queues ready

l. Set notification flag

m. Send notification request

n. Identify active namespaces

o. Active namespace list returned

p. Identify Namespace (CNS 00h)

q. Namespace information returned

r. Issue an NVMe Read (a capsule sketch follows the figures below)

s. NVMe Read data returned

Figure 5: Discover the Storage Appliance
Figure 6: Connect I/O Subsystem
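To illustrate steps (r) and (s), an NVMe Read travels as a CapsuleCmd PDU (type 0x04) that wraps the 64-byte SQE shown earlier; the read data then returns in C2HData PDUs. The helper below is a hypothetical, simplified sketch (digests, SGL details, and endianness handling omitted):

```c
#include <stdint.h>
#include <string.h>

/* Simplified CapsuleCmd PDU: the 8-byte common header followed by the
 * 64-byte submission queue entry (digests and in-capsule data omitted). */
struct nvme_tcp_cmd_capsule {
    uint8_t  type;    /* 0x04 = CapsuleCmd */
    uint8_t  flags;
    uint8_t  hlen;    /* 72 = 8-byte header + 64-byte SQE */
    uint8_t  pdo;
    uint32_t plen;
    uint8_t  sqe[64]; /* the NVMe command itself */
};

_Static_assert(sizeof(struct nvme_tcp_cmd_capsule) == 72, "capsule is 72 bytes");

/* Hypothetical helper: fill a capsule with an NVM Read for `nlb` logical
 * blocks starting at `slba`. Real code must serialize every multi-byte
 * field as little-endian rather than assuming host byte order. */
static void build_read_capsule(struct nvme_tcp_cmd_capsule *c,
                               uint16_t cid, uint32_t nsid,
                               uint64_t slba, uint16_t nlb)
{
    memset(c, 0, sizeof(*c));
    c->type = 0x04;
    c->hlen = sizeof(*c);
    c->plen = sizeof(*c);

    c->sqe[0] = 0x02;                          /* opcode: NVM Read */
    memcpy(&c->sqe[2], &cid, sizeof(cid));     /* command identifier */
    memcpy(&c->sqe[4], &nsid, sizeof(nsid));   /* namespace ID */
    memcpy(&c->sqe[40], &slba, sizeof(slba));  /* CDW10-11: starting LBA */
    uint16_t nlb0 = nlb - 1;                   /* CDW12: 0-based block count */
    memcpy(&c->sqe[48], &nlb0, sizeof(nlb0));
}
```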

From the above details, the following key points can be drawn:

1. NVMe/TCP uses the existing TCP infrastructure to access fast storage.

2. NVMe over TCP traffic essentially consists of commands and data.

3. It is faster and more flexible than traditional networked storage methods.


NVMe over TCP provides better performance than legacy architectures, but the following pain points also need to be considered:

a. NVMe/TCP command processing runs entirely on the host CPU, which can drive up CPU utilization and limit the processing capability of the server.

b. Heavy traffic over the TCP connection can stress congestion control and the transmission of large packets.

c. Encrypting data over NVMe/TCP has a performance impact.

d. New authentication and authorization mechanisms are required to secure access to storage arrays over TCP.

e. Diagnosis and failure-recovery processes need additional tooling, and those tools are still under development.

Conclusion:

NVMe over TCP delivers immense performance compared to legacy technologies, and it offers more flexibility for growing networked storage in terms of implementation and scalability. It is a cost-effective solution that is easy to maintain, and ongoing research and improvements are reducing the risks and challenges associated with NVMe over TCP.




