Notes on Using GDB

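GDB is a powerful debugging tool, and I assumed everyone knew it, at least in embedded development. Day-to-day work shows otherwise: many people still add log statements to track down problems such as segmentation faults. Logging is not bad, but GDB is sometimes more efficient.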

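Plenty of material on GDB usage is already available online. This is just a brief summary, plus the debugging techniques I use most often. I split it into three parts: common commands, hard-to-reproduce problems, and debugging binaries built without -g.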

Common commands

1. set args: sets the program's arguments, for programs that need arguments passed in.

#include <stdio.h>

int main(int argc, char *argv[])
{
    int i = 0;
    /* Print every command-line argument together with its index. */
    for (i = 1; i < argc; i++)
    {
        printf("%d %s\n", i, argv[i]);
    }
    return 0;
}
(gdb) set args how are you?
(gdb) r
Starting program: /home/luogf/20210109/a.out how are you?
1 how
2 are
3 you?

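2. break, abbreviated to b, sets a breakpoint. The argument can be a function name or a line in a file (line breakpoints depend on the -g compile option).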

3. list, abbreviated to l, prints the source code around the current location (depends on the -g compile option).

4. continue, abbreviated to c, resumes execution when the program is stopped.

(gdb) b check
Breakpoint 1 at 0x40047b: file gdb.c, line 4.
(gdb) r
Starting program: /home/luogf/20210109/a.out
Breakpoint 1, check (n=0) at gdb.c:4
4 int c = 10 / n;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64
(gdb) l
1 #include
2 int check(int n)
3 {
4 int c = 10 / n;
5 return 0;
6 }
7 int main()
8 {
9 int num = 0;
10 check(num);
(gdb) b 4
Note: breakpoint 1 also set at pc 0x40047b.
Breakpoint 2 at 0x40047b: file gdb.c, line 4.
(gdb) c
Continuing.
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400485 in check (n=0) at gdb.c:4
4 int c = 10 / n;
(gdb)

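5. Beyond these, the p command prints a variable's value, and set var changes it. (Some people drop the var. That is a bad habit, because set has other subcommands in gdb and a variable name can clash with them.) The program below, for example, crashes when run normally. Changing a variable this way is usually done to verify a program's internal logic.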

(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/luogf/20210109/a.out
Breakpoint 1, check (n=0) at gdb.c:4
4 int c = 10 / n;
(gdb) p n
$2 = 0
(gdb) set var n = 2
(gdb) c
Continuing.
Program exited normally.

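If you know assembly, disassemble shows the assembly code of the current function and i r shows the current register state. These two commands are mostly needed when the program was built without -g or has been stripped.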

Breakpoint 1, check (n=0) at gdb.c:4
4 int c = 10 / n;
(gdb) disassemble
Dump of assembler code for function check:
0x0000000000400474 <+0>: push %rbp
0x0000000000400475 <+1>: mov %rsp,%rbp
0x0000000000400478 <+4>: mov %edi,-0x14(%rbp)
=> 0x000000000040047b <+7>: mov $0xa,%eax
0x0000000000400480 <+12>: mov %eax,%edx
0x0000000000400482 <+14>: sar $0x1f,%edx
0x0000000000400485 <+17>: idivl -0x14(%rbp)
0x0000000000400488 <+20>: mov %eax,-0x4(%rbp)
0x000000000040048b <+23>: mov $0x0,%eax
0x0000000000400490 <+28>: leaveq
0x0000000000400491 <+29>: retq
End of assembler dump.
(gdb) i r
rax 0x0 0
rbx 0x0 0
rcx 0x400492 4195474
rdx 0x7fffffffe668 140737488348776
rsi 0x7fffffffe658 140737488348760
rdi 0x0 0
rbp 0x7fffffffe550 0x7fffffffe550
rsp 0x7fffffffe550 0x7fffffffe550
r8 0x7ffff7dd7ba0 140737351875488
r9 0x7ffff7deae20 140737351953952
r10 0x7fffffffe3c0 140737488348096
r11 0x7ffff7a66c20 140737348267040
r12 0x400390 4195216
r13 0x7fffffffe650 140737488348752
r14 0x0 0
r15 0x0 0
rip 0x40047b 0x40047b
eflags 0x206 [ PF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

Hard-to-reproduce problems

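Suppose a program crashes (resets) after running for some unknown length of time. For this kind of problem, let the program produce a core file and then analyze it with gdb. Prerequisite: you need a binary built with -g that matches the one on which the problem occurred.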

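First make sure a coredump file can actually be generated. I won't go into the details here: the setup differs between environments and is well documented online.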

Adding -q stops gdb from printing its lengthy introduction.

Usage: gdb <program file> <coredump file>

[luogf@VM-0-2-centos 20210109]$ gdb -q ./a.out core.1097
Reading symbols from /home/luogf/20210109/a.out…done.
[New Thread 1097]
Reading symbols from /lib64/libc.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./a.out'.
Program terminated with signal 8, Arithmetic exception.
#0  0x0000000000400485 in check (n=0) at gdb.c:4
4 int c = 10 / n;

Debugging binaries built without -g

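Sometimes, for whatever reason, the binary in the target environment was built without -g. A coredump it produces can still be analyzed with the method above, by pairing it with a local binary built with -g.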

But what if you need to debug live, and for some reason you cannot upload your -g build to the environment to replace the binary there?

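In that case you can disassemble the local -g build with objdump and match it against the assembly seen in the environment to locate the problem.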

[luogf@VM-0-2-centos 20210109]$ gdb -q ./a.out
Reading symbols from /home/luogf/20210109/a.out…(no debugging symbols found)…done.
(gdb) r
Starting program: /home/luogf/20210109/a.out
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400485 in check ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64
(gdb) bt
#0  0x0000000000400485 in check ()
#1  0x00000000004004ab in main ()
(gdb) l
No symbol table is loaded. Use the "file" command.
(gdb) disassemble
Dump of assembler code for function check:
0x0000000000400474 <+0>: push %rbp
0x0000000000400475 <+1>: mov %rsp,%rbp
0x0000000000400478 <+4>: mov %edi,-0x14(%rbp)
0x000000000040047b <+7>: mov $0xa,%eax
0x0000000000400480 <+12>: mov %eax,%edx
0x0000000000400482 <+14>: sar $0x1f,%edx
=> 0x0000000000400485 <+17>: idivl -0x14(%rbp)
0x0000000000400488 <+20>: mov %eax,-0x4(%rbp)
0x000000000040048b <+23>: mov $0x0,%eax
0x0000000000400490 <+28>: leaveq
0x0000000000400491 <+29>: retq
End of assembler dump.

The output above shows that the problem is at the line 0x0000000000400485 <+17>:    idivl  -0x14(%rbp). The => marker indicates the instruction currently being executed.

Next, disassemble the binary built with -g.

objdump -d -S -l a.out

An excerpt covering only the non-system code:

check():
/home/luogf/20210109/gdb.c:3
#include
int check(int n)
{
400474: 55 push %rbp
400475: 48 89 e5 mov %rsp,%rbp
400478: 89 7d ec mov %edi,-0x14(%rbp)
/home/luogf/20210109/gdb.c:4
int c = 10 / n;
40047b: b8 0a 00 00 00 mov $0xa,%eax
400480: 89 c2 mov %eax,%edx
400482: c1 fa 1f sar $0x1f,%edx
400485: f7 7d ec idivl -0x14(%rbp)
400488: 89 45 fc mov %eax,-0x4(%rbp)
/home/luogf/20210109/gdb.c:5
return 0;
40048b: b8 00 00 00 00 mov $0x0,%eax
/home/luogf/20210109/gdb.c:6
}
400490: c9 leaveq
400491: c3 retq
0000000000400492 <main>:
main():
/home/luogf/20210109/gdb.c:8
int main()
{
400492: 55 push %rbp
400493: 48 89 e5 mov %rsp,%rbp
400496: 48 83 ec 10 sub $0x10,%rsp
/home/luogf/20210109/gdb.c:9
int num = 0;
40049a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
/home/luogf/20210109/gdb.c:10
check(num);
4004a1: 8b 45 fc mov -0x4(%rbp),%eax
4004a4: 89 c7 mov %eax,%edi
4004a6: e8 c9 ff ff ff callq 400474 <check>
/home/luogf/20210109/gdb.c:11
return 0;
4004ab: b8 00 00 00 00 mov $0x0,%eax
/home/luogf/20210109/gdb.c:12
}
4004b0: c9 leaveq
4004b1: c3 retq
4004b2: 90 nop
4004b3: 90 nop
4004b4: 90 nop
4004b5: 90 nop
4004b6: 90 nop
4004b7: 90 nop
4004b8: 90 nop
4004b9: 90 nop
4004ba: 90 nop
4004bb: 90 nop
4004bc: 90 nop
4004bd: 90 nop
4004be: 90 nop
4004bf: 90 nop

Comparing the two listings shows that the problem is on the line int c = 10 / n;.

/home/luogf/20210109/gdb.c:4
int c = 10 / n;
40047b: b8 0a 00 00 00 mov $0xa,%eax
400480: 89 c2 mov %eax,%edx
400482: c1 fa 1f sar $0x1f,%edx
400485: f7 7d ec idivl -0x14(%rbp)
400488: 89 45 fc mov %eax,-0x4(%rbp)

Extras

Since we are here anyway, let's keep analyzing the assembly.

Print the register values at the moment the problem occurs:

 (gdb) i r
rax 0xa 10
rbx 0x0 0
rcx 0x400492 4195474
rdx 0x0 0
rsi 0x7fffffffe658 140737488348760
rdi 0x0 0
rbp 0x7fffffffe550 0x7fffffffe550
rsp 0x7fffffffe550 0x7fffffffe550
r8 0x7ffff7dd7ba0 140737351875488
r9 0x7ffff7deae20 140737351953952
r10 0x7fffffffe3c0 140737488348096
r11 0x7ffff7a66c20 140737348267040
r12 0x400390 4195216
r13 0x7fffffffe650 140737488348752
r14 0x0 0
r15 0x0 0
rip 0x400485 0x400485
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

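We know the problem is at the line 0x0000000000400485 <+17>:    idivl  -0x14(%rbp). idivl is the signed division instruction, and -0x14(%rbp) is the divisor (the dividend is held in edx:eax). From the dump above, rbp holds 0x7fffffffe550, so let's use gdb to see what value -0x14(%rbp) actually contains.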

(gdb) x (int*)0x7fffffffe550-0x14
0x7fffffffe500: 0x00000000

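Yes, it is 0. (Strictly speaking, the (int*) cast makes gdb scale the offset by sizeof(int), so this command read 0x7fffffffe500. x/wx $rbp-0x14 reads the exact slot, 0x7fffffffe53c, which holds n.)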

As for the x command, it prints the value stored at a given address.