[Case Study] Kernel Panic @0x0 from xfrm_local_error+0x4c

Background

In real-world projects, unexpected system crashes or freezes, such as kernel panics, do occur. These incidents are challenging, and crash issues put embedded software engineers under significant pressure. Even so, we must remain calm and analyze the issue systematically. By carefully reviewing kernel logs and memory dumps, the root cause can be identified and resolved.

Log Analysis

The first step is to inspect the kernel log at the moment of the crash. Below is the log signature captured during the crash:

[  262.401303] Unable to handle kernel NULL pointer dereference at virtual address 00000000  
[  262.401365] pgd = dbdc4000  
[  262.401389] [00000000] *pgd=00000000  
[  262.401433] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM  
[  262.401459] Modules linked in:  
[  262.401495] CPU: 0 PID: 7107 Comm: Framework Tainted: G        W    3.10.49-g356bd9f-00007-gadca646 #1  
[  262.401522] task: da6b0540 ti: d9412000 task.ti: d9412000  
[  262.401549] PC is at 0x0  
[  262.401590] LR is at xfrm_local_error+0x4c/0x58  
[  262.401619] pc : [<00000000>]    lr : [<c0adc274>]    psr: a00f0013  
[  262.401619] sp : d9413c68  ip : c0ac6c20  fp : 0000dd86  
[  262.401654] r10: 0000010e  r9 : 0000010a  r8 : de0ddc20  
[  262.401678] r7 : c13ddf00  r6 : 00000500  r5 : d9094540  r4 : c13e3780  
[  262.401703] r3 : 00000000  r2 : 00000001  r1 : 00000500  r0 : d9094540          

Analysis starts from the crash signature: the fault message, the register contents, and the call stack. Together they show where execution went wrong and which values were involved.
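The number in "Internal error: Oops: 80000005" can also be decoded. The sketch below assumes the bit layout that the 32-bit ARM fault handler uses on a kernel with classic (non-LPAE) page tables: bit 31 is a Linux-specific flag for prefetch (instruction fetch) aborts, and the remaining bits carry the hardware fault status. The helper names are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout: arch/arm/mm/fault.h, classic (non-LPAE) page tables. */
#define FSR_LNX_PF (1u << 31) /* set by Linux when the abort was a prefetch abort */
#define FSR_FS4    (1u << 10) /* fault status bit 4 */
#define FSR_FS3_0  0xfu       /* fault status bits 3..0 */

/* Short-descriptor fault status for a first-level (section) translation fault. */
#define FS_SECTION_TRANSLATION 0x05u

/* True if the fault happened while fetching an instruction. */
static bool fsr_is_prefetch_abort(uint32_t fsr)
{
    return (fsr & FSR_LNX_PF) != 0;
}

/* Assemble the 5-bit fault status FS[4:0] from its two fields. */
static unsigned int fsr_fault_status(uint32_t fsr)
{
    return (fsr & FSR_FS3_0) | ((fsr & FSR_FS4) >> 6);
}

/* Apply the helpers to the value printed in the log above. */
static void check_oops_code(void)
{
    const uint32_t oops = 0x80000005u;

    assert(fsr_is_prefetch_abort(oops));
    assert(fsr_fault_status(oops) == FS_SECTION_TRANSLATION);
}
```

Under these assumptions the code says that an instruction fetch hit an unmapped first-level entry, which matches "PC is at 0x0" and "*pgd=00000000" in the log.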

Initial Observations

  • Program Counter (PC): Points to 0x0, so the CPU tried to fetch an instruction from address 0. This usually means a call or branch through a NULL function pointer.
  • Link Register (LR): Points to xfrm_local_error+0x4c.

Using TRACE32, the call stack was extracted based on the ARM calling convention:

-000|xfrm_local_error(skb = 0xDCD376C0, mtu = -611987776)  
-001|__xfrm4_output(skb = 0xD9094540)  
-002|xfrm_output_resume(skb = 0xD9094540, err = 1)  
-003|__xfrm4_output(skb = 0xD9094540)  
-004|ip_local_out(skb = 0xD9094540)  
-005|ip_send_skb(net = 0xC13DDF00, ?)  
-006|udp_send_skb(skb = 0xD9094540, ?)  
-007|udp_sendmsg(?, sk = 0xDE0F1680, msg = 0xD9413EE0, len = 1300)  
-008|inet_sendmsg(iocb = 0xD9413E58, ?, msg = 0xD9413EE0, size = 1300)  
-009|sock_sendmsg(sock = 0xDA9A5500, msg = 0xD9413EE0, size = 1300)  
-010|SYSC_sendto(inline)  
-011|sys_sendto(?, [...] addr_len = 16)  
-012|ret_fast_syscall(asm)         

From the call stack, the crash occurred in xfrm_local_error.

Assembly Instruction Analysis

The xfrm_local_error function was analyzed at the assembly level:

NSR:C0ADC264|E1A00005 cpy     r0, r5          ; proto, skb  
NSR:C0ADC268|E594344C ldr     r3, [r4, #0x44C]  
NSR:C0ADC26C|E1A01006 cpy     r1, r6          ; r1, mtu  
NSR:C0ADC270|E12FFF33 blx     r3              // Branch to function pointer          

The kernel log shows that R3 is 0x0, so the blx r3 instruction branches to address 0x0. The CPU cannot fetch an instruction from that address, which the kernel reports as a NULL pointer dereference and which leads to the kernel panic.
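The link between the log and this disassembly can be checked with simple arithmetic. The sketch below is a minimal illustration with made-up helper names. It assumes the code runs in ARM state (4-byte instructions), which the PSR value in the log supports because its T bit is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PSR_T_BIT 0x20u /* CPSR bit 5: set in Thumb state, clear in ARM state */

/* True if the saved PSR says the CPU was executing Thumb code. */
static bool psr_is_thumb(uint32_t psr)
{
    return (psr & PSR_T_BIT) != 0;
}

/* Start address of the function, from "LR is at <func>+<offset>". */
static uint32_t func_base(uint32_t lr, uint32_t offset)
{
    return lr - offset;
}

/* In ARM state the branch-with-link that set LR is the 4-byte
 * instruction immediately before the return address. */
static uint32_t arm_call_site(uint32_t lr)
{
    return lr - 4u;
}

/* Apply the helpers to the lr and psr values from the log. */
static void check_lr_values(void)
{
    const uint32_t lr = 0xc0adc274u;
    const uint32_t psr = 0xa00f0013u;

    assert(!psr_is_thumb(psr));
    assert(func_base(lr, 0x4cu) == 0xc0adc228u);
    assert(arm_call_site(lr) == 0xc0adc270u);
}
```

The computed call site, 0xc0adc270, is the address of the blx r3 instruction in the listing, so LR and the disassembly agree on which call faulted.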

C Code Review

The corresponding C code for the assembly instructions was identified:

kernel/net/xfrm/xfrm_output.c
234 void xfrm_local_error(struct sk_buff *skb, int mtu)  
235 {  
236     unsigned int proto;  
237     struct xfrm_state_afinfo *afinfo;  
238     // [snip]  
245  
246     afinfo = xfrm_state_get_afinfo(proto);  
247     if (!afinfo)  
248         return;  
249  
250     afinfo->local_error(skb, mtu);  //<<--  
251     xfrm_state_put_afinfo(afinfo);           

Line 250 calls local_error through a function pointer, and this call corresponds to the assembly shown earlier. Placing the C statement next to the relevant instructions gives the following mapping:

250 afinfo->local_error(skb, mtu);  //<<--  
NSR:C0ADC268|E594344C ldr     r3, [r4, #0x44C]  
NSR:C0ADC270|E12FFF33 blx     r3              // Branch to function pointer          

Note that the ldr at C0ADC268 loads afinfo->local_error into r3, and the blx r3 at address C0ADC270 is the call through that pointer.
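The same two-step pattern, loading a member at a fixed offset from the base pointer and then branching to the loaded value, can be sketched in plain C. The struct below is a hypothetical, heavily reduced stand-in for struct xfrm_state_afinfo, and the NULL check exists only to make the failure visible in the sketch; the kernel code shown above has no such check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced stand-in for struct xfrm_state_afinfo. */
struct demo_afinfo {
    unsigned int family;
    unsigned int proto;
    int  (*output)(void *skb);
    void (*local_error)(void *skb, int mtu);
};

/* Address read by "ldr r3, [r4, #offset]": base register plus member offset. */
static uintptr_t member_address(uintptr_t base, size_t offset)
{
    return base + offset;
}

static int demo_last_mtu; /* records the argument seen by the demo callback */

static void demo_local_error(void *skb, int mtu)
{
    (void)skb;
    demo_last_mtu = mtu;
}

/* The steps behind "afinfo->local_error(skb, mtu)".
 * Returns 0 if the callback ran, -1 if the slot was NULL. */
static int call_local_error(const struct demo_afinfo *afinfo, void *skb, int mtu)
{
    void (*fn)(void *, int) = afinfo->local_error; /* the ldr step */

    if (!fn)
        return -1;  /* without this check, the call would jump to address 0 */
    fn(skb, mtu);   /* the blx step */
    return 0;
}

/* Exercise an empty slot, then a populated one. */
static void check_member_call(void)
{
    struct demo_afinfo a = { .family = 2 };

    assert(member_address((uintptr_t)&a, offsetof(struct demo_afinfo, local_error))
           == (uintptr_t)&a.local_error);
    assert(call_local_error(&a, NULL, 1280) == -1);
    a.local_error = demo_local_error;
    assert(call_local_error(&a, NULL, 1280) == 0);
    assert(demo_last_mtu == 1280);
}
```

With the values from the log, r4 is 0xc13e3780 and the offset is 0x44C, so the ldr reads the word at 0xc13e3bcc, which is where the NULL came from.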

TRACE32 Debugging

By inspecting the local_error member of the struct xfrm_state_afinfo using TRACE32, it was found to be NULL:

$ v.v %all (struct xfrm_state_afinfo *)0xc13e3780 
    (unsigned int) family = 2 = 0x2 = '....',
    (unsigned int) proto = 4 = 0x4 = '....',
    (__be16) eth_proto = 8 = 0x8 = '..',
    (struct module *) owner = 0x0 =  -> NULL,
	... [snip]...
    (int (*)()) tmpl_sort = 0x0 =  -> NULL,
    (int (*)()) state_sort = 0x0 =  -> NULL,
    (int (*)()) output = 0xC0AD12E4 = xfrm4_output -> ,
    (int (*)()) output_finish = 0xC0AD129C = xfrm4_output_finish -> ,
    (int (*)()) extract_input = 0xC0AD0D8C = xfrm4_extract_input -> ,
    (int (*)()) extract_output = 0xC0AD11E0 = xfrm4_extract_output -> ,
    (int (*)()) transport_finish = 0xC0AD0D94 = xfrm4_transport_finish -> ,
    (void (*)()) local_error = 0x0 =  -> NULL)  //<<--        

Resolution

During a code review, it was observed that the local_error member in the xfrm4_state_afinfo variable was not initialized:

//kernel/net/ipv4/xfrm4_state.c   
static struct xfrm_state_afinfo xfrm4_state_afinfo = {  
        .family                 = AF_INET,  
        .proto                  = IPPROTO_IPIP,  
        .eth_proto              = htons(ETH_P_IP),  
        .owner                  = THIS_MODULE,  
        .init_flags             = xfrm4_init_flags,  
        .init_tempsel           = __xfrm4_init_tempsel,  
        .init_temprop           = xfrm4_init_temprop,  
        .output                 = xfrm4_output,  
        .output_finish          = xfrm4_output_finish,  
        .extract_input          = xfrm4_extract_input,  
        .extract_output         = xfrm4_extract_output,  
        .transport_finish       = xfrm4_transport_finish,  
};          
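The reason the pointer is NULL rather than garbage is that C zero-initializes every member omitted from a designated initializer. The sketch below shows this with a hypothetical two-member ops table, plus an illustrative helper that reports the first missing callback. None of these names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced ops table; the names are illustrative only. */
struct demo_ops {
    int  (*output)(int value);
    void (*local_error)(int mtu);
};

static int demo_output(int value)
{
    return value;
}

/* local_error is not listed, so C initializes it to a null pointer. */
static const struct demo_ops demo_ops_v4 = {
    .output = demo_output,
};

/* Name of the first missing callback, or NULL if the table is complete. */
static const char *demo_ops_first_missing(const struct demo_ops *ops)
{
    if (!ops->output)
        return "output";
    if (!ops->local_error)
        return "local_error";
    return NULL;
}

/* The omitted member is NULL, and the helper reports it by name. */
static void check_demo_ops(void)
{
    assert(demo_ops_v4.output == demo_output);
    assert(demo_ops_v4.local_error == NULL);
    assert(strcmp(demo_ops_first_missing(&demo_ops_v4), "local_error") == 0);
}
```

A check of this kind, run in a host-side test or when the table is registered, is one possible way to catch an incomplete table before the first packet reaches the missing callback.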

After adding a callback function for the local_error member, the issue was resolved:

diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c  
index 9258e75..0b2a064 100644  
--- a/net/ipv4/xfrm4_state.c  
+++ b/net/ipv4/xfrm4_state.c  
@@ -83,6 +83,7 @@ static struct xfrm_state_afinfo xfrm4_state_afinfo = {  
        .extract_input          = xfrm4_extract_input,  
        .extract_output         = xfrm4_extract_output,  
        .transport_finish       = xfrm4_transport_finish,  
+       .local_error            = xfrm4_local_error,  
};          

Hope this post is helpful for troubleshooting.


Sachin Devasia

Senior Embedded Software Engineer

3 months ago

Good post. Crisp and simple explanation on how to approach. I believe LKM crashes can also be done in a similar way.

Van-Quyen Do

Senior Software Engineer at Ciena

3 months ago

Very useful, Thanks for posting. Could you please share how to get TRACE32 tool source. Thanks.

Frank Edwards

Professional contract instructor (Linux, programming topics, many more) --

3 months ago

It's a good walkthru of the process of debugging a panic. A few things were left out or glossed over. First, the LR register identifies the location of the instructions that triggered the problem, <funcname>+0x4c. That "+0x4c" can take you straight to the instruction in the disassembled code. (If the code is in a module, you'll have to be careful as the init_module symbol will have two noop's in front of it that should be included in the "+0x4c" offset. At least, it does on kernel 6.10. YMMV.) Second, the PC is the address at which the next instruction will be fetched, so it's pretty clear that this is the result of a jump/branch/call instruction that tried to go to that address. On most architectures, it could also be the result of stack corruption, since the cpu may store the return address on the stack. This is particularly difficult to debug, as the corruption could've occurred from anywhere in the kernel and the stack frame itself no longer exists, so it might take some advanced sleuthing skills to find the code that caused it. (Continued in the next comment)

Meenakshi A.

Technologist & Believer in Systems for People and People for Systems

3 months ago

Great advice for the good.

Stephan Erbs Korsholm

Independent Consultant | Team starter | Moving embedded builds to the cloud

3 months ago

Very useful. Thanks for posting. When I look at the xfrm_local_error function it seems to be a function that could be tested on a host machine (e.g. Linux). It does not contain any low-level direct memory or device register access etc. This opens up for the option of testing the code by building a version of a reduced application and running it in a controlled environment on a host debugger. That might also direct the attention to the null pointer exception before it happened on target.
