[Case Study] Kernel Panic @0x0 from xfrm_local_error+0x4c
Background
In real-world projects, unexpected system crashes or freezes, such as kernel panics, do occur. These incidents are stressful for embedded software engineers, but the way through them is to stay calm and analyze the issue systematically. By carefully reviewing kernel logs and memory dumps, the root cause can be identified and resolved.
Log Analysis
The first step is to inspect the kernel log at the moment of the crash. Below is the log signature captured during the crash:
[ 262.401303] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 262.401365] pgd = dbdc4000
[ 262.401389] [00000000] *pgd=00000000
[ 262.401433] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM
[ 262.401459] Modules linked in:
[ 262.401495] CPU: 0 PID: 7107 Comm: Framework Tainted: G W 3.10.49-g356bd9f-00007-gadca646 #1
[ 262.401522] task: da6b0540 ti: d9412000 task.ti: d9412000
[ 262.401549] PC is at 0x0
[ 262.401590] LR is at xfrm_local_error+0x4c/0x58
[ 262.401619] pc : [<00000000>] lr : [<c0adc274>] psr: a00f0013
[ 262.401619] sp : d9413c68 ip : c0ac6c20 fp : 0000dd86
[ 262.401654] r10: 0000010e r9 : 0000010a r8 : de0ddc20
[ 262.401678] r7 : c13ddf00 r6 : 00000500 r5 : d9094540 r4 : c13e3780
[ 262.401703] r3 : 00000000 r2 : 00000001 r1 : 00000500 r0 : d9094540
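Analyzing kernel logs and memory dumps starts with the crash signature: the fault type, the call stack, and the register values. Here the PC is 0x0 and the LR points to xfrm_local_error+0x4c, so the CPU branched to address 0 from inside xfrm_local_error.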
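The "Oops: 80000005" value in the log can be decoded as well. The following is a minimal sketch, assuming the fault status layout that 32-bit ARM Linux uses with 2-level page tables (bit 31 is set by the kernel for prefetch aborts, bit 11 marks a write, and bits [3:0] plus bit 10 form the fault status). The struct and function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout (32-bit ARM Linux, 2-level page tables).
 * Verify against your kernel's fault handling code before relying on it. */
#define FSR_LNX_PF (1u << 31) /* set by Linux for prefetch (instruction fetch) aborts */
#define FSR_WRITE  (1u << 11) /* set when the faulting access was a write */
#define FSR_FS4    (1u << 10) /* bit 4 of the fault status field */
#define FSR_FS3_0  0xFu       /* bits 3..0 of the fault status field */

/* Hypothetical helper type, for illustration only. */
struct oops_code {
    bool prefetch_abort;   /* fault came from an instruction fetch */
    bool write;            /* fault came from a data write */
    unsigned fault_status; /* 5-bit fault status value */
};

/* Split the number printed after "Oops:" into its fields. */
static struct oops_code decode_oops_code(uint32_t code)
{
    struct oops_code d;

    d.prefetch_abort = (code & FSR_LNX_PF) != 0;
    d.write = (code & FSR_WRITE) != 0;
    d.fault_status = (code & FSR_FS3_0) | ((code & FSR_FS4) >> 6);
    return d;
}
```

Under these assumptions, decode_oops_code(0x80000005) reports a prefetch abort with fault status 5 (a first-level translation fault). That is consistent with the CPU trying to fetch an instruction from the unmapped address 0x0.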
Initial Observations
Using TRACE32, the call stack was extracted based on the ARM calling convention:
-000|xfrm_local_error(skb = 0xDCD376C0, mtu = -611987776)
-001|__xfrm4_output(skb = 0xD9094540)
-002|xfrm_output_resume(skb = 0xD9094540, err = 1)
-003|__xfrm4_output(skb = 0xD9094540)
-004|ip_local_out(skb = 0xD9094540)
-005|ip_send_skb(net = 0xC13DDF00, ?)
-006|udp_send_skb(skb = 0xD9094540, ?)
-007|udp_sendmsg(?, sk = 0xDE0F1680, msg = 0xD9413EE0, len = 1300)
-008|inet_sendmsg(iocb = 0xD9413E58, ?, msg = 0xD9413EE0, size = 1300)
-009|sock_sendmsg(sock = 0xDA9A5500, msg = 0xD9413EE0, size = 1300)
-010|SYSC_sendto(inline)
-011|sys_sendto(?, [...] addr_len = 16)
-012|ret_fast_syscall(asm)
From the call stack, the crash occurred in xfrm_local_error.
Assembly Instruction Analysis
The xfrm_local_error function was analyzed at the assembly level:
NSR:C0ADC264|E1A00005 cpy r0, r5 ; proto, skb
NSR:C0ADC268|E594344C ldr r3, [r4, #0x44C]
NSR:C0ADC26C|E1A01006 cpy r1, r6 ; r1, mtu
NSR:C0ADC270|E12FFF33 blx r3 // Branch to function pointer
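From the kernel log, the R3 register value is 0x0, so the blx r3 instruction branches to address 0x0. The instruction fetch at that address faults, which is why the log reports a NULL pointer dereference with the PC at 0x0. The LR value 0xC0ADC274 is the return address written by blx, that is, the instruction immediately after the branch at 0xC0ADC270.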
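The address arithmetic that links the log to the disassembly can be sketched as follows. This assumes ARM state (fixed 4-byte instructions), which matches the disassembly above. The helper names are hypothetical.

```c
#include <stdint.h>

/* ARM (not Thumb) state: every instruction is 4 bytes long. */
#define ARM_INSN_SIZE 4u

/* blx stores the address of the next instruction in LR, so the branch
 * itself sits one instruction before the LR value. */
static uint32_t branch_insn_addr(uint32_t lr)
{
    return lr - ARM_INSN_SIZE;
}

/* The log prints LR as <function>+<offset>, so subtracting the offset
 * gives the start address of the function. */
static uint32_t function_base(uint32_t lr, uint32_t offset)
{
    return lr - offset;
}
```

With the values from the log, branch_insn_addr(0xC0ADC274) gives 0xC0ADC270, the address of blx r3, and function_base(0xC0ADC274, 0x4C) gives 0xC0ADC228 as the start of xfrm_local_error.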
C Code Review
The corresponding C code for the assembly instructions was identified:
kernel/net/xfrm/xfrm_output.c
234 void xfrm_local_error(struct sk_buff *skb, int mtu)
235 {
236     unsigned int proto;
237     struct xfrm_state_afinfo *afinfo;
238     // [snip]
245
246     afinfo = xfrm_state_get_afinfo(proto);
247     if (!afinfo)
248         return;
249
250     afinfo->local_error(skb, mtu); //<<--
251     xfrm_state_put_afinfo(afinfo);
In line 250, local_error is called through a function pointer, which corresponds to the assembly code above. Placing the C statement next to the assembly instructions gives the following mapping:
250 afinfo->local_error(skb, mtu); //<<--
NSR:C0ADC268|E594344C ldr r3, [r4, #0x44C]
NSR:C0ADC270|E12FFF33 blx r3 // Branch to function pointer
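Note that the call to afinfo->local_error corresponds to blx r3 at address C0ADC270. The preceding ldr loads the local_error pointer from the afinfo structure (base address in r4) into r3.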
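The way a member call turns into "load from base plus offset, then branch" can be sketched in plain C. The struct below is a hypothetical, much smaller stand-in for struct xfrm_state_afinfo, so its member offset will not be 0x44C. It only illustrates the pattern.

```c
#include <stddef.h>
#include <string.h>

typedef void (*local_error_fn)(void *skb, int mtu);

/* Hypothetical, reduced stand-in for struct xfrm_state_afinfo. */
struct demo_afinfo {
    unsigned int family;
    void *type_map[4];
    int (*output)(void *skb);
    local_error_fn local_error;
};

/* Sample callback used only to populate the table in this sketch. */
static void demo_local_error(void *skb, int mtu)
{
    (void)skb;
    (void)mtu;
}

/* Mirrors "ldr r3, [r4, #offset]": read the function pointer stored at
 * base address + member offset. The caller would then branch to it. */
static local_error_fn load_local_error(const struct demo_afinfo *afinfo)
{
    const unsigned char *base = (const unsigned char *)afinfo; /* plays the role of r4 */
    local_error_fn fn;                                         /* plays the role of r3 */

    memcpy(&fn, base + offsetof(struct demo_afinfo, local_error), sizeof fn);
    return fn;
}
```

If the member was never set, the loaded value is NULL, and branching to it is exactly the situation seen in the log.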
TRACE32 Debugging
Register r4 holds the afinfo pointer (0xC13E3780 in the register dump). Inspecting the struct xfrm_state_afinfo at that address with TRACE32 shows that its local_error member is NULL:
$ v.v %all (struct xfrm_state_afinfo *)0xc13e3780
(unsigned int) family = 2 = 0x2 = '....',
(unsigned int) proto = 4 = 0x4 = '....',
(__be16) eth_proto = 8 = 0x8 = '..',
(struct module *) owner = 0x0 = -> NULL,
... [snip]...
(int (*)()) tmpl_sort = 0x0 = -> NULL,
(int (*)()) state_sort = 0x0 = -> NULL,
(int (*)()) output = 0xC0AD12E4 = xfrm4_output -> ,
(int (*)()) output_finish = 0xC0AD129C = xfrm4_output_finish -> ,
(int (*)()) extract_input = 0xC0AD0D8C = xfrm4_extract_input -> ,
(int (*)()) extract_output = 0xC0AD11E0 = xfrm4_extract_output -> ,
(int (*)()) transport_finish = 0xC0AD0D94 = xfrm4_transport_finish -> ,
(void (*)()) local_error = 0x0 = -> NULL) //<<--
Resolution
During a code review, it was observed that the local_error member in the xfrm4_state_afinfo variable was not initialized:
//kernel/net/ipv4/xfrm4_state.c
static struct xfrm_state_afinfo xfrm4_state_afinfo = {
    .family           = AF_INET,
    .proto            = IPPROTO_IPIP,
    .eth_proto        = htons(ETH_P_IP),
    .owner            = THIS_MODULE,
    .init_flags       = xfrm4_init_flags,
    .init_tempsel     = __xfrm4_init_tempsel,
    .init_temprop     = xfrm4_init_temprop,
    .output           = xfrm4_output,
    .output_finish    = xfrm4_output_finish,
    .extract_input    = xfrm4_extract_input,
    .extract_output   = xfrm4_extract_output,
    .transport_finish = xfrm4_transport_finish,
};
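The reason the missing member ends up as 0x0 is a C language rule: members that a designated initializer does not mention are zero-initialized, so an omitted function pointer is NULL. The sketch below shows this on a hypothetical two-member table and adds a guarded call helper. The guard is only an illustration of defensive calling and is not part of the fix described in this post.

```c
#include <stddef.h>

/* Hypothetical ops table, for illustration only. */
struct demo_ops {
    int (*output)(int value);
    void (*local_error)(int *sink, int mtu);
};

static int demo_output(int value)
{
    return value + 1;
}

/* Sample callback: records the reported MTU in *sink. */
static void demo_record_error(int *sink, int mtu)
{
    *sink = mtu;
}

/* local_error is not mentioned here, so C initializes it to NULL. */
static struct demo_ops demo_table = {
    .output = demo_output,
};

/* Guarded call: returns -1 instead of branching to address 0 when the
 * callback is missing, and 0 after a successful call. */
static int demo_report_error(const struct demo_ops *ops, int *sink, int mtu)
{
    if (!ops->local_error)
        return -1;
    ops->local_error(sink, mtu);
    return 0;
}
```

Calling demo_report_error on demo_table as defined returns -1. Once local_error is assigned, the same call reaches the callback and returns 0.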
After adding a callback function for the local_error member, the issue was resolved:
diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c
index 9258e75..0b2a064 100644
--- a/net/ipv4/xfrm4_state.c
+++ b/net/ipv4/xfrm4_state.c
@@ -83,6 +83,7 @@ static struct xfrm_state_afinfo xfrm4_state_afinfo = {
 	.extract_input		= xfrm4_extract_input,
 	.extract_output		= xfrm4_extract_output,
 	.transport_finish	= xfrm4_transport_finish,
+	.local_error		= xfrm4_local_error,
 };
Hope this post is helpful for troubleshooting.
Senior Embedded Software Engineer
3 months ago: Good post. Crisp and simple explanation of how to approach this. I believe LKM crashes can also be analyzed in a similar way.
Senior Software Engineer at Ciena
3 months ago: Very useful, thanks for posting. Could you please share how to get the TRACE32 tool source? Thanks.
Professional contract instructor (Linux, programming topics, many more) --
3 months ago: It's a good walkthru of the process of debugging a panic. A few things were left out or glossed over. First, the LR register identifies the location of the instructions that triggered the problem, <funcname>+0x4c. That "+0x4c" can take you straight to the instruction in the disassembled code. (If the code is in a module, you'll have to be careful as the init_module symbol will have two no-ops in front of it that should be included in the "+0x4c" offset. At least, it does on kernel 6.10. YMMV.) Second, the PC is the address at which the next instruction will be fetched, so it's pretty clear that this is the result of a jump/branch/call instruction that tried to go to that address. On most architectures, it could also be the result of stack corruption, since the CPU may store the return address on the stack. This is particularly difficult to debug, as the corruption could've occurred from anywhere in the kernel and the stack frame itself no longer exists, so it might take some advanced sleuthing skills to find the code that caused it. (Continued in the next comment)
Technologist & Believer in Systems for People and People for Systems
3 months ago: Great advice for the good
Independent Consultant | Team starter | Moving embedded builds to the cloud
3 months ago: Very useful. Thanks for posting. When I look at the xfrm_local_error function, it seems to be a function that could be tested on a host machine (e.g. Linux). It does not contain any low-level direct memory or device register access. This opens up the option of testing the code by building a reduced version of the application and running it in a controlled environment under a host debugger. That might also have drawn attention to the null pointer before it caused a crash on the target.