[Case Study] Kernel Panic @0x0 from xfrm_local_error+0x4c

Austin Kim

Linux kernel Engineer | Author | RISC-V | Arm | 4x LinkedIn Top Voice

Published: November 25, 2024

Background

In real-world projects, systems can crash or freeze unexpectedly, for example with a kernel panic. These incidents are challenging, and embedded software teams often work under significant pressure while a crash issue is open. Even so, we must remain calm and analyze the issue systematically. By carefully reviewing the kernel log and the memory dump, the root cause can be identified and resolved.

Log Analysis

The first step is to inspect the kernel log at the moment of the crash. Below is the log signature captured during the crash:

[  262.401303] Unable to handle kernel NULL pointer dereference at virtual address 00000000  
[  262.401365] pgd = dbdc4000  
[  262.401389] [00000000] *pgd=00000000  
[  262.401433] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM  
[  262.401459] Modules linked in:  
[  262.401495] CPU: 0 PID: 7107 Comm: Framework Tainted: G        W    3.10.49-g356bd9f-00007-gadca646 #1  
[  262.401522] task: da6b0540 ti: d9412000 task.ti: d9412000  
[  262.401549] PC is at 0x0  
[  262.401590] LR is at xfrm_local_error+0x4c/0x58  
[  262.401619] pc : [<00000000>]    lr : [<c0adc274>]    psr: a00f0013  
[  262.401619] sp : d9413c68  ip : c0ac6c20  fp : 0000dd86  
[  262.401654] r10: 0000010e  r9 : 0000010a  r8 : de0ddc20  
[  262.401678] r7 : c13ddf00  r6 : 00000500  r5 : d9094540  r4 : c13e3780  
[  262.401703] r3 : 00000000  r2 : 00000001  r1 : 00000500  r0 : d9094540

Analyzing kernel logs and memory dumps involves checking the signature, including the call stack and registers, to understand the crash.

Initial Observations

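Program Counter (PC): Points to 0x0, which means the CPU tried to fetch an instruction from address 0. This is typical of a branch through a NULL function pointer.
Link Register (LR): Points to xfrm_local_error+0x4c, the return address of the most recent call, so the failing branch was issued from xfrm_local_error.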
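The "Oops: 80000005" value in the log can be decoded to back up these observations. The sketch below is a minimal host-side illustration, assuming the bit layout used by the 32-bit ARM fault code in arch/arm/mm/fault.h as I understand it (bit 31 is set by Linux for prefetch aborts, FS[4] lives in bit 10, FS[3:0] in bits 3..0). The function names are hypothetical and not part of the kernel.

```c
#include <stdint.h>

/* Assumed layout of the fault status value printed after "Oops:". */
#define FSR_LNX_PF  (1u << 31)  /* Linux-specific flag: abort came from an instruction fetch */
#define FSR_FS4     (1u << 10)  /* bit 4 of the fault status field */
#define FSR_FS3_0   0xFu        /* bits 3..0 of the fault status field */

/* Returns 1 when the abort was raised by an instruction fetch (prefetch abort). */
int oops_is_prefetch_abort(uint32_t fsr)
{
    return (fsr & FSR_LNX_PF) != 0;
}

/* Reassembles the 5-bit fault status field FS[4:0]. */
unsigned int oops_fault_status(uint32_t fsr)
{
    return (fsr & FSR_FS3_0) | ((fsr & FSR_FS4) >> 6);
}
```

For 0x80000005 the first helper reports a prefetch abort and the second returns 5, which in the short-descriptor format is a section translation fault. Both match a CPU that tried to execute from the unmapped address 0x0.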

Using TRACE32, the call stack was extracted based on the ARM calling convention:

-000|xfrm_local_error(skb = 0xDCD376C0, mtu = -611987776)  
-001|__xfrm4_output(skb = 0xD9094540)  
-002|xfrm_output_resume(skb = 0xD9094540, err = 1)  
-003|__xfrm4_output(skb = 0xD9094540)  
-004|ip_local_out(skb = 0xD9094540)  
-005|ip_send_skb(net = 0xC13DDF00, ?)  
-006|udp_send_skb(skb = 0xD9094540, ?)  
-007|udp_sendmsg(?, sk = 0xDE0F1680, msg = 0xD9413EE0, len = 1300)  
-008|inet_sendmsg(iocb = 0xD9413E58, ?, msg = 0xD9413EE0, size = 1300)  
-009|sock_sendmsg(sock = 0xDA9A5500, msg = 0xD9413EE0, size = 1300)  
-010|SYSC_sendto(inline)  
-011|sys_sendto(?, [...] addr_len = 16)  
-012|ret_fast_syscall(asm)

From the call stack, the crash occurred in xfrm_local_error.

Assembly Instruction Analysis

The xfrm_local_error function was analyzed at the assembly level:

NSR:C0ADC264|E1A00005 cpy     r0, r5          ; proto, skb  
NSR:C0ADC268|E594344C ldr     r3, [r4, #0x44C]  
NSR:C0ADC26C|E1A01006 cpy     r1, r6          ; r1, mtu  
NSR:C0ADC270|E12FFF33 blx     r3              // Branch to function pointer

From the kernel log, the R3 register value is 0x0, so the blx r3 instruction branches to address 0x0. The instruction fetch at that address faults, which leads to the kernel panic.
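The link between the LR value in the log and the blx instruction in the listing is plain address arithmetic. The following sketch shows it under the assumption that the CPU was in ARM (A32) state, where every instruction is 4 bytes wide. The psr value a00f0013 from the log has the T bit (bit 5) clear, which is consistent with that assumption. All function names are hypothetical.

```c
#include <stdint.h>

#define ARM_INSN_SIZE 4u  /* fixed instruction width in ARM (A32) state */

/* Returns 1 if the T bit (bit 5) of the saved program status register is set. */
int cpsr_is_thumb(uint32_t cpsr)
{
    return (int)((cpsr >> 5) & 1u);
}

/* Address of the bl/blx that wrote LR; only valid for ARM state. */
uint32_t call_site_from_lr(uint32_t lr)
{
    return lr - ARM_INSN_SIZE;
}

/* Start of the function, given the "symbol+offset" form printed in the log. */
uint32_t function_base(uint32_t lr, uint32_t offset)
{
    return lr - offset;
}
```

With lr = 0xC0ADC274 the call site is 0xC0ADC270, exactly the address of blx r3 in the listing, and subtracting the 0x4c offset places the start of xfrm_local_error at 0xC0ADC228.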

C Code Review

The corresponding C code for the assembly instructions was identified:

kernel/net/xfrm/xfrm_output.c
234 void xfrm_local_error(struct sk_buff *skb, int mtu)  
235 {  
236     unsigned int proto;  
237     struct xfrm_state_afinfo *afinfo;  
238     // [snip]  
245  
246     afinfo = xfrm_state_get_afinfo(proto);  
247     if (!afinfo)  
248         return;  
249  
250     afinfo->local_error(skb, mtu);  //<<--  
251     xfrm_state_put_afinfo(afinfo);

In line 250, local_error is called through a function pointer, which corresponds to the assembly code above. Placing the assembly instructions next to the C statement gives the following mapping:

250 afinfo->local_error(skb, mtu);  //<<--  
NSR:C0ADC268|E594344C ldr     r3, [r4, #0x44C]  
NSR:C0ADC270|E12FFF33 blx     r3              // Branch to function pointer

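Note that afinfo->local_error corresponds to blx r3 at address C0ADC270: r4 holds afinfo (0xC13E3780 in the register dump), and ldr r3, [r4, #0x44C] loads the local_error member into r3 before the branch.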
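To make the mapping concrete, here is a small host-side sketch of what the ldr/blx pair does: read a function pointer stored at a fixed offset from the structure base, then branch to it. The structure below is a hypothetical, heavily reduced stand-in for struct xfrm_state_afinfo, so its member offset differs from the 0x44C seen on the target.

```c
#include <stddef.h>
#include <string.h>

struct sk_buff_stub;  /* hypothetical stand-in for struct sk_buff */

typedef void (*local_error_fn)(struct sk_buff_stub *skb, int mtu);

/* Reduced stand-in; the real structure has many more members. */
struct afinfo_stub {
    unsigned int   family;
    void         (*output)(struct sk_buff_stub *skb);
    local_error_fn local_error;
};

/* Example callback so the pointer can be non-NULL in a test. */
void demo_local_error(struct sk_buff_stub *skb, int mtu)
{
    (void)skb;
    (void)mtu;
}

/*
 * Equivalent of "ldr r3, [r4, #offset]": fetch the pointer stored at
 * base + offsetof(local_error). The caller would then "blx" to it.
 */
local_error_fn load_local_error(const struct afinfo_stub *afinfo)
{
    const unsigned char *base = (const unsigned char *)afinfo;
    local_error_fn fn;

    memcpy(&fn, base + offsetof(struct afinfo_stub, local_error), sizeof fn);
    return fn;
}
```

If the loaded value is NULL, the subsequent indirect branch lands on address 0, which is the situation captured in the crash log.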

TRACE32 debugging

By inspecting the local_error member of the struct xfrm_state_afinfo using TRACE32, it was found to be NULL:

$ v.v %all (struct xfrm_state_afinfo *)0xc13e3780 
    (unsigned int) family = 2 = 0x2 = '....',
    (unsigned int) proto = 4 = 0x4 = '....',
    (__be16) eth_proto = 8 = 0x8 = '..',
    (struct module *) owner = 0x0 =  -> NULL,
	... [snip]...
    (int (*)()) tmpl_sort = 0x0 =  -> NULL,
    (int (*)()) state_sort = 0x0 =  -> NULL,
    (int (*)()) output = 0xC0AD12E4 = xfrm4_output -> ,
    (int (*)()) output_finish = 0xC0AD129C = xfrm4_output_finish -> ,
    (int (*)()) extract_input = 0xC0AD0D8C = xfrm4_extract_input -> ,
    (int (*)()) extract_output = 0xC0AD11E0 = xfrm4_extract_output -> ,
    (int (*)()) transport_finish = 0xC0AD0D94 = xfrm4_transport_finish -> ,
    (void (*)()) local_error = 0x0 =  -> NULL)  //<<--

Resolution

During a code review, it was observed that the local_error member in the xfrm4_state_afinfo variable was not initialized:

//kernel/net/ipv4/xfrm4_state.c   
static struct xfrm_state_afinfo xfrm4_state_afinfo = {  
        .family                 = AF_INET,  
        .proto                  = IPPROTO_IPIP,  
        .eth_proto              = htons(ETH_P_IP),  
        .owner                  = THIS_MODULE,  
        .init_flags             = xfrm4_init_flags,  
        .init_tempsel           = __xfrm4_init_tempsel,  
        .init_temprop           = xfrm4_init_temprop,  
        .output                 = xfrm4_output,  
        .output_finish          = xfrm4_output_finish,  
        .extract_input          = xfrm4_extract_input,  
        .extract_output         = xfrm4_extract_output,  
        .transport_finish       = xfrm4_transport_finish,  
};
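The NULL value follows directly from C initialization rules: any member that a designated initializer does not name is zero-initialized, so an omitted callback becomes a null pointer. The sketch below reproduces this on a host machine with hypothetical stand-in types (demo_skb, demo_afinfo_ops) and adds a simple completeness check that could run in a host-side test.

```c
#include <stddef.h>

struct demo_skb { int len; };  /* hypothetical stand-in for struct sk_buff */

/* Hypothetical, reduced stand-in for struct xfrm_state_afinfo. */
struct demo_afinfo_ops {
    unsigned int family;
    int  (*output)(struct demo_skb *skb);
    void (*local_error)(struct demo_skb *skb, int mtu);
};

int demo_output(struct demo_skb *skb)
{
    return skb->len;
}

/* .local_error is not listed, so C initializes it to a null pointer. */
struct demo_afinfo_ops demo_ops = {
    .family = 2,
    .output = demo_output,
};

/* Returns 1 only if every callback in the table is populated. */
int demo_ops_complete(const struct demo_afinfo_ops *ops)
{
    return ops->output != NULL && ops->local_error != NULL;
}
```

Here demo_ops_complete(&demo_ops) returns 0, flagging the missing callback before any code tries to call through it.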

After adding a callback function for the local_error member, the issue was resolved:

diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c  
index 9258e75..0b2a064 100644  
--- a/net/ipv4/xfrm4_state.c  
+++ b/net/ipv4/xfrm4_state.c  
@@ -83,6 +83,7 @@ static struct xfrm_state_afinfo xfrm4_state_afinfo = {  
        .extract_input          = xfrm4_extract_input,  
        .extract_output         = xfrm4_extract_output,  
        .transport_finish       = xfrm4_transport_finish,  
+       .local_error            = xfrm4_local_error,  
};

I hope this post is helpful for troubleshooting.

Sachin Devasia

Senior Embedded Software Engineer

3 months ago

Good post. Crisp and simple explanation on how to approach. I believe LKM crashes can also be done in a similar way.


Van-Quyen Do

Senior Software Engineer at Ciena

3 months ago

Very useful, Thanks for posting. Could you please share how to get TRACE32 tool source. Thanks.


Frank Edwards

Professional contract instructor (Linux, programming topics, many more) --

3 months ago

It's a good walkthru of the process of debugging a panic. A few things were left out or glossed over. First, is the LR register identifies the location of the instructions that triggered the problem, <funcname>+0x4c. That "+0x4c" can take you straight to the instruction in the disassembled code. (If the code is in a module, you'll have to be careful as the init_module symbol will have two noop's in front of it that should be included in the "+0x4c" offset. At least, it does on kernel 6.10. YMMV.) Second, the PC is the address at which the next instruction will be fetched, so it's pretty clear that this is the result of a jump/branch/call instruction that tried to go that address. On most architectures, it could also be the result of stack corruption, since the cpu may store the return address on the stack. This is particularly difficult to debug, as the corruption could've occurred from anywhere in the kernel and the stack frame itself no longer exists, so it might take some advanced sleuthing skills to find the code that caused it. (Continued in the next comment)


Meenakshi A.

Technologist & Believer in Systems for People and People for Systems

3 months ago

Great advice for the good


Stephan Erbs Korsholm

Independent Consultant | Team starter | Moving embedded builds to the cloud

3 months ago

Very useful. Thanks for posting. When I look at the xfrm_local_error function it seems to be a function that could be tested on a host machine (e.g. Linux). It does not contain any low-level direct memory or device register access etc. This opens up for the option of testing the code by building a version of a reduced application and running it in a controlled environment on a host debugger. That might also direct the attention to the null pointer exception before it happened on target.

