OpenStack interview Q&A
This article contains a series of Q&A for OpenStack operations engineers involved in configuring VNFs, performance monitoring, network troubleshooting, and technical support roles. The series is divided into four major sections: resource utilization, networking, Ceph storage, and general troubleshooting scenarios. Very basic theoretical questions are omitted.
Section 1 - Resource Utilization
Q1 - How will you check the CPU partitioning (isolated & vCPU cores), NUMA isolation & Hugepage allocation of the compute node?
Ans - CPU partitioning
Host CPUs (hypervisor dedicated) can be found as affinity CPUs in
#cat /etc/systemd/system.conf | grep -i affinity
CPUAffinity=<list of CPU IDs reserved for host processes>
vCPU-set (CPUs available for VMs) can be found in
#cat /etc/nova/nova.conf | grep -i vcpu_pin
vcpu_pin_set = <CPU ranges dedicated to VMs>
Depending on the OpenStack version, the nova.conf file could be found either under /etc or /var/lib (search for the filename using the find command).
NUMA isolation can be checked via the command below.
It will give you the number of CPUs & the memory per NUMA node on a compute host.
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 130950 MB
node 0 free: 125143 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 131072 MB
node 1 free: 125682 MB
Hugepages - The command below gives you the size of each hugepage (1 GB in this case) & the total number of hugepages (215) on the compute node.
$ cat /proc/meminfo | grep -i huge
AnonHugePages: 0 kB
HugePages_Total: 215
HugePages_Free: 213
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
All these settings can also be verified in /proc/cmdline on each server.
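For a quick look at the boot-time tuning in one shot, the kernel command line can be filtered for the usual parameter names (your deployment may use only a subset of these):
# split the boot parameters one per line and keep only the CPU/hugepage tuning entries
cat /proc/cmdline | tr ' ' '\n' | grep -Ei 'isolcpus|hugepage|nohz_full|rcu_nocbs'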
Q2 - How will you create a VM with pinned CPUs from a single NUMA node & also allocate huge-pages in the same VM?
Ans - It can be achieved by setting the appropriate properties in the flavor from which the VM is created. The following flavor contains 8 CPUs, all pinned to a single NUMA node (hw:cpu_policy='dedicated', hw:numa_nodes='1'), & 16 GB of memory backed by hugepages (hw:mem_page_size='1GB').
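As a minimal sketch, such a flavor could be created with the OpenStack CLI as shown below (the flavor name and sizes are just examples):
# 8 pinned vCPUs, 1 NUMA node, 16 GB RAM backed by 1 GB hugepages
openstack flavor create --vcpus 8 --ram 16384 --disk 40 \
  --property hw:cpu_policy=dedicated \
  --property hw:numa_nodes=1 \
  --property hw:mem_page_size=1GB \
  pinned_hugepage_flavor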
Q3 - How can you identify pinned & non-pinned VMs on compute nodes? What is the best practice to avoid performance issues (like app-level reboots or sluggish performance) of the VMs in case you have pinned & non-pinned VMs together in your cluster?
Ans - Pinned & non-pinned VMs can be easily identified using the "virsh vcpupin" command. A pinned VM will show a one-to-one mapping between physical & virtual CPU cores, while a non-pinned VM will show a range (as shown in the picture below).
Now, there is a difference between a VM reboot & an application reboot. In scenarios where the application is getting sluggish, always check first whether the VM itself is rebooting or not.
This can be checked via "nova instance-action-list <VM UUID>".
Most probably the VM is not rebooting & it is a case of CPUs being shared between VMs. The picture below shows two scenarios where a CPU can be shared by more than one VM & so affect the performance of the application.
Scenario-1: It is a bad practice to host pinned & non-pinned VMs on the same compute node because the non-pinned VMs use a range of CPUs & may consume a CPU core that is pinned to another VM. The Nova scheduler does not check this placement by itself; the administrator has to keep an eye on it.
Scenario-2: Even if all the VMs on a compute node are pinned, some CPUs may still end up shared between more than one VM. This happens because of software issues in OpenStack Nova during migrations or evacuations.
The solution in both cases is to migrate any one of the VMs to another compute.
Pro tip: Here is a bash for loop to quickly check the vCPU allocation of all the VMs on a compute node.
for i in $(sudo virsh list --name); do echo; echo "$i:"; sudo virsh dumpxml $i | grep cpu; done
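To spot the overlaps described in Scenario-1 & Scenario-2 faster, here is a rough sketch that prints the pinning of every running domain, so duplicate physical CPU IDs stand out at a glance:
# list the vCPU-to-pCPU pinning of all running VMs on this compute node
for d in $(sudo virsh list --name); do echo "== $d =="; sudo virsh vcpupin $d; done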
Q4 - You have 250 GB of RAM in a compute node. This compute node has only one VM, with 50 GB RAM consumption. Still, you are getting a "Lack of memory resources" alarm in your monitoring system for this compute node. What can be the reason?
Ans - Checking the "free -g" output, we can see that only 24 GB of RAM is available
$ free -g
total used free shared buff/cache available
Mem: 250 226 20 0 4 24
But there is only one VM running on this compute node, with roughly 50 GB of RAM allocated (the flavor shows 52,400 MB, i.e. ~51 GB)
$ openstack flavor show TEST_VM
+------------+------------------------+
| Field      | Value                  |
+------------+------------------------+
| properties | hw:cpu_policy='shared' |
| ram        | 52400                  |
| vcpus      | 8                      |
+------------+------------------------+
Remember that the "free -g" command does not show hugepage-related information.
Checking the hugepage configuration, we find that out of the total 250 GB of memory on this compute node, around 70% (176 GB) is reserved i.e. occupied by hugepages. So only the remaining 74 GB is actually available for normal VMs i.e. VMs without a hugepage config.
[root@overcloud-ovscompute-5 ~]# more /proc/meminfo | grep -i huge
AnonHugePages: 0 kB
HugePages_Total: 176
HugePages_Free: 176
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
The VM consumed 50 GB out of this 74 GB & so the "free -g" command is showing 24 GB available.
Now, 24 GB is roughly 10% of the total capacity, & most monitoring systems are configured to raise an alarm if available memory drops below 20%, and that is why the alert was raised.
The solution - do not mix hugepage & non-hugepage VMs on the same compute node. In this case, migrating the VM to another compute node with no hugepages configured is the best option.
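The same arithmetic can be done straight from /proc/meminfo; a quick sketch (MemTotal includes the hugepage pool, while the free/available figures do not):
# memory actually usable by non-hugepage VMs, in GB
total_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
hp_count=$(awk '/^HugePages_Total/ {print $2}' /proc/meminfo)
hp_size_kb=$(awk '/^Hugepagesize/ {print $2}' /proc/meminfo)
echo "Non-hugepage memory: $(( (total_kb - hp_count * hp_size_kb) / 1024 / 1024 )) GB"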
Q5 - What are the common causes of "No valid host available" during VNF instantiation?
Ans - Actually, there is no fixed answer to this question; there can be many. Below is a summary of the most common causes -
The first thing to check is the error in the nova_conductor log (grep for the <stack ID>) on the controller node, because the error may not have been written to nova_compute at all if the scheduler never found a valid host for placement & the stack failed (a sample grep is shown after this list).
i. The top candidate for this failure is a CPU or RAM mismatch between the demand placed in the VNF template vs the resources available in the cluster.
ii. The required CPU/RAM is available in the cluster, but the flavors demand it from a particular NUMA node, so you need to check the available CPUs per NUMA node on the compute nodes (use numactl -H & the virsh capabilities output to check this allocation).
iii. This one is silly but it happens - spelling mistakes in object names (flavors, networks, subnets, etc.), either in the templates or on the cluster side.
iv. Networks, physnets, or SR-IOV virtual functions are requested in the VNF template but not available on the cluster side, OR if available, the NIC doesn't belong to the requested NUMA node. Check this in /sys/class/net/<NIC_name>/device/numa_node.
v. Resources are requested from a newly scaled-out host aggregate & all the compute nodes of that host aggregate share a common fault, like a service failure (nova, neutron, cinder, etc.), SDN topology not defined, etc.
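A minimal sketch of the log check mentioned above; the paths assume a containerized (TripleO-style) deployment and will differ on package-based installs, and the request/stack ID is a placeholder:
grep -i "No valid host" /var/log/containers/nova/nova-conductor.log
# which scheduler filter eliminated all the hosts for this request?
grep -i "<request-or-stack-ID>" /var/log/containers/nova/nova-scheduler.log | grep -iE "filter|returned 0 hosts"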
Want to learn OpenStack in depth? Here is my bestselling course on Udemy which has helped thousands of people across the world to change their career in the TelcoCloud/NFV area. For Discount coupons connect with me on LinkedIn or email ([email protected])
Section 2 - Networking & Architecture
Q6 - If a VM UUID is available with you, What is that single command that can provide the neutron port ID, IP address & MAC address of all the network interfaces of that VM?
Ans - As shown in the picture below, (nova interface-list <VM UUID>) shows all three in a single output. This is a legacy command from the OpenStack CLI, but I haven't come across any single openstack command that shows all three properties in one output.
Bonus - To check this information for all the VMs in your cluster, use the bash "for loop" given below -
for i in $(openstack server list -f value -c ID); do echo; echo "$i:"; nova interface-list $i; done
Q7 - Now that you have the neutron port ID of the VM interface: as we know, there is a tap interface on the compute node associated with every VM interface. How will you identify that tap interface in the "ifconfig" output on the compute node where the VM is hosted?
Ans - The first 11 characters of the neutron port ID match the tap interface name at the compute level. Check the picture below -
The same logic applies to the other OVS-related interfaces i.e. qbr, qvb & qvo (their names also carry the first 11 characters of the neutron port ID). Identifying the tap interface is crucial for tracing traffic with "tcpdump" during troubleshooting & for checking PM counters on the tap.
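A quick sketch of that derivation (the port UUID below is made up for illustration):
PORT_ID=12345678-90ab-cdef-1234-567890abcdef      # example neutron port ID
echo "tap${PORT_ID:0:11}"                          # prints tap12345678-90
ip link show | grep "${PORT_ID:0:11}"              # finds the tap/qbr/qvb/qvo devices carrying this prefix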
Q8 - If a VM has multiple network interfaces, how will you identify which interface belongs to a tenant network & which one is a provider network interface, using CLI only?
Ans - You already have the network ID from (nova interface-list). As shown in the picture below, perform an "openstack network show <net ID>" & grep for "physical_network". In the case of a tenant network, physical_network will always be "None", & in the case of a provider network it will be one of the "physnet" values like physnet_1, physnet_2, etc.
Bonus - To print all VM UUIDs, their corresponding network IDs and the physical_network type, use the bash "for loop" shown below -
This is a nested for loop in which the VM UUIDs are stored in variable 'i', the net IDs of each VM are stored in variable 'j', & the physical_network type is printed for every 'j'.
for i in $(openstack server list -f value -c ID); do for j in $(nova interface-list $i | grep -vE '^\+|Port' | awk '{print $6}'); do echo; openstack network show $j | grep physical_network; echo; echo "VM ID" $i : "network ID" $j; done; done
Q9 - Once you have identified a provider network interface then, from the same output, how will you identify whether that is an OVS provider network or an SRIOV provider network?
Ans - In this case, you need to check the bridge mappings of the openvswitch & sriov agents on the compute node, where you will find a physnet-to-bridge/NIC mapping for OVS & SR-IOV. Now see the picture below - the physical_network this time is one of the physnets mapped to the OVS bridge br-ex in the openvswitch agent.
If it had been some other physnet value (physnet5 or physnet6), then this network would have been an SR-IOV network, as shown in the sriov agent config.
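The mappings themselves sit in the agent configuration files; a sketch assuming the default (non-containerized) paths, which differ under TripleO/containerized deployments:
# physnet-to-bridge mapping used by the OVS agent
grep -ri bridge_mappings /etc/neutron/plugins/ml2/openvswitch_agent.ini
# physnet-to-NIC mapping used by the SR-IOV agent
grep -ri physical_device_mappings /etc/neutron/plugins/ml2/sriov_agent.ini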
Q10 - This is a comparatively easy but very important one. The compute nodes have multiple NIC cards. The HW engineer has configured the servers in the rack & the network connections are also completed. There is nobody at the site. How can you check remotely which NIC port has a cable connected & which is disconnected? If connected, is it a fiber channel or a CAT5/CAT6 connection? What are their speeds, NIC drivers, and FW versions?
Ans - In the picture below you can see that "ethtool" is the command that gives you all of this information. You can get all the NIC device names from the "ifconfig" command on the compute server.
In the example below, one port is visible as a 10G fiber port with a link detected, while another is a disconnected port with unknown speed. Similarly, "eno5" is a 1G twisted-pair port with a link connected. The "ethtool -i" option gives you the driver name, driver version, FW version, etc.
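A rough sketch to survey every NIC on a compute node in one go ("eno5" is just an example for the driver query):
# link state, port type and speed for all interfaces except loopback
for nic in $(ls /sys/class/net/ | grep -v '^lo$'); do
  echo "== $nic =="
  ethtool $nic 2>/dev/null | grep -E 'Speed|Port|Link detected'
done
ethtool -i eno5      # driver name, driver version, firmware version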
Q11 - Let's check the network reachability problems. There are 2 VMs hosted on two different compute nodes. Both VMs are part of the same subnet but are somehow not reachable from each other. There are many devices involved in between: tap interfaces, Linux bridges, OVS bridges, bonds, physical NICs, and leaf/spine switches. What approach will you follow to isolate this fault?
Ans - The packet flow between the VMs is shown in the picture below (follow the orange arrows). The packets travel from VM --> tap interface --> OVS bridges --> physical NIC --> leaf switch, & the same path on the other compute node.
Start a PING from VM1 to VM2, start capturing tcpdump at all the red-circled points & check up to which point you can see the "ICMP request" messages passing through. This way you find the problematic point. It can be the VM itself (incorrect security group), the OVS system (neutron problems), the physical NIC (Linux kernel of the compute node), or the physical networking section (incorrect config on the leaf switches).
Kindly note that you cannot capture a tcpdump on the OVS devices (br-int, br-ex, etc.) because those are not Linux devices. If you suspect a problem in the OVS system, then either restart the openvswitch agent service or use the "ovs-ofctl dump-flows br-int" & "ovs-ofctl dump-ports br-int" commands for more advanced checks on the flow entries & packet drops inside these devices.
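A couple of additional OVS-side checks that often help here (run on the compute node; br-int is the usual integration bridge name):
ovs-vsctl show                  # bridges, ports, bonds and the VLAN tag assigned to each port
ovs-appctl fdb/show br-int      # MAC learning table, to confirm the VM's MAC is actually being learned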
Q12 - What is the need to integrate an external SDN solution (like Contrail, Nuage, CISCO ACI, etc) with OpenStack when the native SDN component "Neutron" is already there?
Ans - There are mainly three reasons -
i. Overhead - The Neutron server is generally hosted on the controller node, which already handles many of OpenStack's control functions, so it can become a choke point for the machine's overall processing capacity.
ii. Capacity - Neutron is based on Open vSwitch, & OVS was not originally designed for large-scale data centers. When the number of compute nodes grows beyond 60-70 servers & the flow-table entries expand accordingly, OVS can start misbehaving in some cases & only recovers after an OVS restart.
iii. Automation - You cannot make any changes on the leaf/TOR switch using the native SDN (Neutron), i.e. the leaf switch has to be configured manually for the provider VLANs used in OpenStack. With an external SDN plugin, this task can be automated.
To overcome these problems & achieve better scaling, an external SDN plugin is used, wherein the SDN controllers are hosted on separate servers (in the same rack or another) & the OVS agents are replaced by the respective SDN agents on each compute node.
Q13 - Redundancy is a mandatory concept in system design. At what levels redundancy can be maintained in an OpenStack-based datacenter?
Ans - Redundancy is maintained at all the levels mentioned below & shown in the diagram -
1. Power system at Rack level
2. VM level redundancy (Hot and standby VMs created in separate compute servers)
3. Network level redundancy - The most important thing (at the application level & at the infrastructure level)
4. Storage level redundancy for Data - managed by Ceph
5. Controller redundancy - managed by pacemaker & HA proxy
Q14 - OVS vs SRIOV
For what kind of network functions is OVS a good choice & where does SRIOV win over OVS?
Ans - Open vSwitch (OVS) is a highly intelligent, multilayer software switch, managed by Neutron, the native SDN component of OpenStack. The feature that distinguishes OVS is its capability to manage the networking between all the VMs hosted on a compute server without involving the physical NIC of the host (if the VMs are part of the same subnet).
That doesn't mean OVS doesn't work well for external communication, but it does add an overhead inside the compute node for every packet processed.
Therefore, OVS is suitable for all the control-plane functions like CSCFs, TAS, PCRF, HSS, MSS servers, etc., i.e. for VNFs that are signaling-heavy rather than throughput-heavy.
SRIOV - The trick here is to bypass the hypervisor's virtual switching altogether and have the VM access the physical NIC directly & get connected to the leaf/TOR switch, thus enabling almost line-rate throughput. This increases performance compared to OVS.
SRIOV is suitable for all the data-plane functions like UPF, media gateways, SGW, PGW, vDU/vCU, etc., i.e. for VNFs that need maximum throughput with relatively little signaling.
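For context, this is roughly how an SR-IOV interface is requested at VM creation time; the network, flavor and image names below are made-up examples, and --vnic-type direct is what tells Neutron to allocate a VF instead of an OVS tap:
openstack port create --network provider_net_1 --vnic-type direct sriov_port_1
openstack server create --flavor dpdk_flavor --image vnf_image \
  --nic port-id=$(openstack port show sriov_port_1 -f value -c id) vm_sriov_test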
The original credit for this answer goes to Faisal Khan who has been a great mentor to the TelcoCloud community.
Q15 - How to analyze network latency between two endpoints?
Ans - Watch this short video to learn how to analyze the latency between two endpoints from a simple pcap capture, using a cool Wireshark trick.
Q16 - Which command options do you use to capture "tcpdump" traces at various interfaces across the OpenStack environment? How do you rotate the traces & manage disk space for bulky trace files?
Ans - Save the following commands & thank me later.
i. Capture a trace on a tap interface (ex. tap12345-ab), inside a compute node
tcpdump -peni tap12345-ab
ii. Filter a specific protocol from this trace, say ICMP
tcpdump -peni tap12345-ab icmp
iii. Save this trace in a file
tcpdump -peni tap12345-ab -w trace.pcap
iv. Keep the trace running, make 4 files of 100 MB each & include the current date & time in the filename
tcpdump -peni tap12345-ab -w trace-%m-%d-%H-%M.pcap -C 100 -W 4
v. Rotate files every 1 hour or 500 MB (whichever is earlier) & keep the last 4 files only
tcpdump -peni tap12345-ab -w trace-%m-%d-%H-%M.pcap -G 3600 -C 500 -W 4
vi. If you want to check the same conversation on the physical NIC of the compute node (say ens1f0), then apply "-T vxlan", otherwise the packets encapsulated in VXLAN tunnels won't be decoded properly.
tcpdump -nnvve -i ens1f0 -T vxlan -w trace.pcap
Section 3 - Ceph storage
Q17 - What is Ceph? Describe its various components and features.
Ans - Whoever has worked in an OpenStack environment must have worked on Ceph; it is almost impossible to skip Ceph on an OpenStack-based infrastructure. It is a scalable, open-source, community-focused, software-defined storage platform, extensively used in private cloud datacenters. The picture below shows -
1. The basic architecture of ceph & its components i.e. OSDs, MONs (monitors) & MGR (manager)
2. How data replication works - logically, similar chunks of data are grouped into pools; pools contain placement groups (PGs); PGs contain objects; & PGs are replicated over OSDs (disk drives) as defined by the replication factor. (Pools and PGs are defined at deployment time.)
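A few commands worth remembering for a first look at any Ceph cluster (run from a MON node or wherever the admin keyring is available):
ceph -s           # overall health, MON quorum, OSD up/in counts
ceph osd tree     # OSD-to-host layout and up/down status
ceph df           # raw and per-pool capacity usage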
Some other facts about Ceph to remember are as follows -
Q18 - What are some common problems encountered in Ceph and how they can be resolved?
Ans - There can be many problems in your Ceph storage system, including OSD down, MON down, storage latency, disk problems, low disk space, etc. Below are some of the problematic scenarios identified via commands and log outputs, along with their resolutions. Obviously, they are not the only problems you can face in Ceph.
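A rough first-response checklist for such scenarios (the OSD ID is a placeholder, and containerized Ceph deployments manage the daemons differently):
ceph health detail                   # names the exact PGs/OSDs behind a HEALTH_WARN or HEALTH_ERR
ceph osd tree | grep -i down         # locate down OSDs and the hosts they belong to
systemctl status ceph-osd@<OSD_ID>   # on that host, check (and if needed restart) the OSD service
ceph osd df                          # spot nearly-full OSDs behind nearfull/backfillfull warnings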
Section 4 - Generic concepts & troubleshooting
Q19 - A very basic question generally asked in the interviews - "Summarize the VM creation flow in simple terms"
Ans - This picture summarizes the steps in a very easy way. I am not sure about the source of the image, but I have had it since my initial days in OpenStack.
Q20 - What are the steps involved and time taken in the OpenStack deployment for a production-grade Telco data center?
Ans - This seems like a very basic question, but it is very hard to answer. You will not find this information in any of the official OpenStack documentation. Deploying OpenStack in a production environment is not an easy task at all; it requires a lot of planning and execution at various levels. The picture shows the steps involved in the process along with the "rough timings" for each task; however, it doesn't include the physical installation of the rack, servers, and network devices.
This setup was a high-availability system deployed by two of my team members using TripleO (undercloud/director-based deployment) on baremetal blade servers, with the following details -
3 controllers, 3 storage nodes, 12 compute nodes, 1 Undercloud/Director machine, a single rack with 4 leaf/TOR switches, a CentOS 7 x86 base image, and an OpenStack (Stein) image for the Overcloud installation.
Q21 - Why are there always an odd number (3, 5, 7) of controllers in any production-grade OpenStack environment?
Ans - Any highly available (HA) system, whether OpenStack or Kubernetes, works on the principle of the RAFT consensus algorithm for quorum. It needs a fault tolerance of at least "one" to maintain high availability in any production-grade system.
How fault tolerance is calculated? See below -
Fault tolerance = (No of control nodes - Quorum)
Where Quorum = (n/2 + 1), rounded down to the nearest whole number (i.e. floor(n/2) + 1, with n being the number of control nodes).
Quorum is the minimum number of nodes required to commit any changes to the database.
In the picture below you can see that 3, 5 & 7 are the best choices for an HA system, while 4 & 6 are not, because they give the same fault tolerance with an extra node. Similarly, 1 & 2 give 'no' fault tolerance and so are not applicable for providing HA.
More than 7 master nodes will result in an overhead for determining cluster membership and quorum and so, it is not recommended. Depending on your needs, you typically end up with 3 or 5 master nodes.
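The same arithmetic as a quick shell check (bash integer division already does the rounding down):
for n in 1 2 3 4 5 6 7; do q=$(( n/2 + 1 )); echo "nodes=$n quorum=$q fault_tolerance=$(( n - q ))"; done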
Q22 - Performance management in OpenStack via the SAR utility.
Ans - Although every OpenStack-based datacenter already has some PM tool (Zabbix, Prometheus, etc.) integrated, in the background all these PM tools derive their metrics from the PM counters generated by the Linux kernel of the compute nodes.
The SAR (System Activity Report) command in Linux is a powerful tool for monitoring and analyzing system resources, & if you clearly understand the SAR utility then you understand the root of PM, as it can be used even when no PM tool is integrated with your system. It generally comes bundled with every Linux distribution like RHEL, CentOS, etc.
The SAR commands below will help you understand system resource utilization in detail. With these outputs, you can quickly conclude whether a resource problem is coming from the application or from the infrastructure. Remember that these commands should be run inside the compute servers as the root user. The granularity of the report is generally set to 10 minutes but can be configured to other values.
The syntax has the following meanings - put in the values as per your requirement.
sa05 refers to the 5th day of the month (these daily files live under /var/log/sa/)
-s is the start time
-e is the end time
1. For example, the following command will give you the "RAM" related metrics (used_mem, free_mem, buffer_mem) for the 10th day of the month between 9-11 AM
sar -r -f /var/log/sa/sa10 -s 09:00:00 -e 11:00:00
2. Similarly, this one will give you all the "CPU" metrics (%CPU idle, used by the system, used by applications, etc.)
sar -P ALL -f /var/log/sa/sa10 -s 09:00:00 -e 11:00:00
3. This one will fetch all the "block I/O stats" (await time, %util, reads/sec, writes/sec)
sar -p -d -f /var/log/sa/sa10 -s 09:00:00 -e 11:00:00
4. This will get the "packet drops & errors" for all the interfaces (physical or virtual) on the compute node
sar -f /var/log/sa/sa10 -n EDEV -s 09:00:00 -e 11:00:00
5. This will get the "network traffic" stats (TX & RX packets/sec, speed in kbps) for all the interfaces (physical or virtual) on the compute node
sar -f /var/log/sa/sa29 -n DEV -s 09:00:00 -e 09:30:00
Now you can combine these commands with other Linux utilities like 'grep', 'cut', and 'awk' to get the desired output of your choice.
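For example, to pull the traffic counters of a single NIC out of the full report (ens1f0 is just an example interface name):
sar -n DEV -f /var/log/sa/sa10 -s 09:00:00 -e 09:30:00 | grep -E "IFACE|ens1f0"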
Q23 - Is there a generic troubleshooting guide for OpenStack?
Ans - Yes, there is :-)
I used this guide during my initial OpenStack days and found it useful in some cases. However, it doesn't claim to cover all possible faults and problems.
Written back in 2019, it might not contain the latest updates, but its generic approach toward problem-solving is useful for building your own.
The document is available through a public link here - https://docs.google.com/presentation/d/e/2PACX-1vSYFGHMy8iIEkhc_2F0V_N8OYGbPWlxGmCZ6WJfXZeCRYh3kl3PBQRNWWgvoWmhGbsae8XM0FzztGzP/pub?start=false&loop=false&delayms=5000&pli=1&slide=id.p2
PS - I don't know the authors personally.
This marks the end of this Q&A series. In the future, if I find some more useful answers then I will include them here. Keep following me for more useful stuff.
Your next-door cloud trainer
Asad Khan