Recently, I was working on a VMKernel core dump in which GDB’s backtrace information for some threads was not reliable. This was because the data structure holding the register values had not been updated before the kernel crashed. However, since whatever was running in the affected threads was probably still writing to the call stack (or had been until shortly before the crash), I thought there might be something interesting to see below the last stored rsp.
Thankfully, GDB has a command called info frame, which takes an address and attempts to reconstruct a stack frame from it. However, that command was failing; in fact, it was causing GDB to terminate. I assume this was because the DWARF data was corrupt (see GDB Internals, chapter 7), but I’m not really sure, as no information about the nature of the exception was printed to the screen. Sadly, GDB handled the exception without generating a core dump, so the ironic exercise of debugging GDB using GDB eluded me.
The bottom line is that I needed a way to reconstruct the call stack for dozens of CPU cores and had no way to do it. The internet wasn’t much help, either. For a while, I used x/64gx <address> to dump the stack and went through the hexdump by hand, picking out values that looked like they could be addresses inside the .text segment and checking whether each could be a valid return address. This way of looking for needles in haystacks quickly gets old when you have close to 50 haystacks in front of you.
So, I automated it (crudely). At first, I tried to use GDB’s built-in scripting language, which I’d used a few times before. However, parsing command output isn’t feasible with it and, frankly, it’s a huge pain to use for anything that isn’t trivial.
So I turned to GDB’s Python API instead, which allows you to write new commands that behave much like built-in ones. The output of the resulting stackwalk command is shown below, alongside the output of bt for comparison:
(gdb) bt
#0 0x000055c1aa4bf135 in baz () at ./stackwalk.c:6
#1 0x000055c1aa4bf14c in bar () at ./stackwalk.c:11
#2 0x000055c1aa4bf15d in foo () at ./stackwalk.c:16
#3 0x000055c1aa4bf179 in main () at ./stackwalk.c:24
(gdb) p/x $rsp
$5 = 0x7fffbdd516b0
(gdb) stackwalk 0x7fffbdd516b0
6 possible stack frames found.
Note: the frame boundary is assumed to be the location of the return address.
Frame #0 baz(...) at 0x55c1aa4bf125
Returns to bar + 0xe at 0x55c1aa4bf14c
Call at 0x55c1aa4bf147
Return address at 0x7fffbdd516b8
Hex Dump:
0x7fffbdd516b0: 0x00007fffbdd516c0 0x000055c1aa4bf14c
Frame #1 bar(...) at 0x55c1aa4bf13e
Returns to foo + 0xe at 0x55c1aa4bf15d
Call at 0x55c1aa4bf158
Return address at 0x7fffbdd516c8
Hex Dump:
0x7fffbdd516c0: 0x00007fffbdd516d0 0x000055c1aa4bf15d
Frame #2 foo(...) at 0x55c1aa4bf14f
Returns to main + 0x19 at 0x55c1aa4bf179
Call at 0x55c1aa4bf174
Return address at 0x7fffbdd516d8
Hex Dump:
0x7fffbdd516d0: 0x00007fffbdd516f0 0x000055c1aa4bf179
Frame #3 ??(...) at *%rax
Returns to __libc_start_main + 0xeb at 0x7f2148d98bbb
Call at 0x7f2148d98bb9
Return address at 0x7fffbdd516f8
Hex Dump:
0x7fffbdd516e0: 0x00007fffbdd517d0 0x0000007b00000000
0x7fffbdd516f0: 0x000055c1aa4bf180 0x00007f2148d98bbb
Frame #4 call_init(...) at 0x7f2148f57520
Returns to _dl_init + 0x79 at 0x7f2148f576b9
Call at 0x7f2148f576b4
Return address at 0x7fffbdd51788
Hex Dump:
0x7fffbdd51700: 0x0000000000000000 0x00007fffbdd517d8
0x7fffbdd51710: 0x0000000100040000 0x000055c1aa4bf160
0x7fffbdd51720: 0x0000000000000000 0xfaa0761ec9b36810
0x7fffbdd51730: 0x000055c1aa4bf040 0x00007fffbdd517d0
0x7fffbdd51740: 0x0000000000000000 0x0000000000000000
0x7fffbdd51750: 0xaedc592304b36810 0xaf61b33a3c556810
0x7fffbdd51760: 0x0000000000000000 0x0000000000000000
0x7fffbdd51770: 0x0000000000000000 0x00007fffbdd517e8
0x7fffbdd51780: 0x00007f2148f72190 0x00007f2148f576b9
Frame #5 ??(...) at *0x2f76(%rip)
Returns to _start + 0x2a at 0x55c1aa4bf06a
Call at 0x55c1aa4bf064
Return address at 0x7fffbdd517b8
Hex Dump:
0x7fffbdd51790: 0x0000000000000000 0x0000000000000000
0x7fffbdd517a0: 0x000055c1aa4bf040 0x00007fffbdd517d0
0x7fffbdd517b0: 0x0000000000000000 0x000055c1aa4bf06a
(gdb)
Since the core dump I was working on is of proprietary software, I wrote a toy program to demonstrate the script’s output instead. Even with this toy example, the script manages to find more stack frames than bt prints. Since it relies only on data on the stack, it keeps going until it cannot find a return address within a set number of bytes of the previous return address’s location. The default is 512 bytes, but it can be adjusted by supplying the desired value as a second argument, like so: stackwalk 0x7fffbdd516b0 1024. Beware that larger values mean the script will run for longer.
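To make that stopping rule concrete, here is a minimal sketch of such a scan loop in plain Python. It is not the actual script: the function name find_return_addresses, the is_code_address predicate and the 8-byte little-endian slot size are all assumptions made for illustration, and the memory is passed in as a bytes object so the snippet runs outside GDB.

```python
import struct

WORD = 8  # assumed stack slot size in bytes (x86-64)


def find_return_addresses(memory, base, is_code_address, window=512):
    """Scan `memory` (raw bytes that start at address `base`) slot by slot.

    A slot is reported when the hypothetical predicate `is_code_address`
    accepts its value. Scanning stops at the end of `memory`, or once more
    than `window` bytes have gone by since the previous hit (or since the
    start, if there has been no hit yet).
    Returns a list of (slot_address, value) pairs.
    """
    hits = []
    last_hit = 0
    offset = 0
    while offset + WORD <= len(memory) and offset - last_hit <= window:
        # "<Q" = one little-endian unsigned 64-bit value
        (value,) = struct.unpack_from("<Q", memory, offset)
        if is_code_address(value):
            hits.append((base + offset, value))
            last_hit = offset
        offset += WORD
    return hits
```

Inside GDB, the bytes could come from something like gdb.selected_inferior().read_memory(address, length), and the window would be the optional second argument of the command. A larger window means more slots are read and checked after each hit, which is why the run time grows with it.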
The source code can be found in my GitHub repo. Currently, the code is still very “warts and all”. One day, I’ll fix it so it looks presentable (you and I both know that’s a lie).
Note that in the above listing, the function name could not always be resolved. If there is a frame above the one we are currently printing, we could take the name from that frame’s return address. This is not done, as the script aims to make few assumptions about what the data should look like. The use case here is to investigate a region of memory that may contain remnants of past call stacks, so we have to be able to differentiate between a stack frame that belongs with the rest of the call stack and one that doesn’t.
There is one key assumption that the script makes: that an address is a return address if and only if it points to an instruction inside a function. That also means that any address found on the stack that fulfills that requirement is assumed to be a return address. This assumption seems to be good enough, but I’d gladly be told otherwise by someone more knowledgeable on the matter.
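As an illustration of that assumption, here is a small sketch that checks whether a value lands inside a known function and, if so, renders it the way the listing does ("bar + 0xe"). It is not taken from the script. The function start addresses come from the listing above, but the end addresses are assumed to be the start of the next function (and 0x55c1aa4bf180 for main), and the names FUNCTIONS, resolve and looks_like_return_address are made up for this example.

```python
import bisect

# (start, end, name), sorted by start address.
# Starts are from the listing; ends are ASSUMED (next function's start).
FUNCTIONS = [
    (0x55C1AA4BF125, 0x55C1AA4BF13E, "baz"),
    (0x55C1AA4BF13E, 0x55C1AA4BF14F, "bar"),
    (0x55C1AA4BF14F, 0x55C1AA4BF160, "foo"),
    (0x55C1AA4BF160, 0x55C1AA4BF180, "main"),
]
STARTS = [start for start, _end, _name in FUNCTIONS]


def resolve(address):
    """Return (name, offset) if `address` lies inside a known function, else None."""
    # Find the last function whose start is <= address.
    index = bisect.bisect_right(STARTS, address) - 1
    if index < 0:
        return None
    start, end, name = FUNCTIONS[index]
    if address >= end:
        return None
    return name, address - start


def looks_like_return_address(address):
    """The key assumption: any value that lands inside a function counts."""
    return resolve(address) is not None


def describe(address):
    """Format a hit roughly like the 'Returns to' lines in the listing."""
    name, offset = resolve(address)
    return f"{name} + {offset:#x} at {address:#x}"
```

With this table, describe(0x55c1aa4bf14c) gives "bar + 0xe at 0x55c1aa4bf14c", and a stack address such as 0x7fffbdd516c0 is rejected. Note that a function's own start address (for example a stored function pointer) also passes this check, so a stricter filter may be needed in practice; the "Call at" lines in the listing suggest the real script looks at the preceding call instruction as well, but that is a guess. Inside GDB, the table lookup would typically be replaced by something like gdb.block_for_pc(address), which returns None when no block covers the address.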
Note: I’ve only tested this script with C core dumps / inferiors, but C++ should work as well.