Friday, January 13, 2017

Valgrind: An Analysis Tool for C/C++

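Valgrind is an open-source framework for dynamic analysis. Its tools can be used to analyze memory allocation, cache usage, and multithreading bugs.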

Installation

sudo pacman -Sy valgrind

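Basic usage

$ valgrind program args
By default Valgrind analyzes the program with the Memcheck tool. With this tool it reports heap usage, memory leaks, and a partial backtrace for each memory-usage error.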

Advanced usage

$ valgrind --tool=toolname program args
This is where Valgrind is most convenient: it ships nine user-facing tools, plus a few tools aimed at developers.

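1. Memcheck: a memory error detector.

2. Cachegrind: simulates how your program uses the cache.

3. Callgrind: records how many times each function is called and builds the call graph; it can also help with cache analysis.

4. Helgrind: a thread error detector, including race condition detection.

5. DRD: another thread error detector.

6. Massif: a heap profiler; it takes multiple measurements over a single run of the program.

7. DHAT: another kind of heap profiler.

8. SGcheck: an experimental tool that checks global variables and the stack.

9. BBV: an experimental tool related to SimPoint.

The first seven are the ones used most often. Let's look at how to use each of them.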

Memcheck

$ valgrind --tool=memcheck program args
First, I wrote a throwaway program to experiment with. The code below calls new and never calls delete.
#include <iostream>
using namespace std;
int main()
{
    int *ptr;
    ptr = new int;   // allocated here and never deleted: this is the leak
    *ptr = 2;
    cout << *ptr << endl;
}
Now let's look at the analysis result.
==12313== Memcheck, a memory error detector
==12313== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==12313== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==12313== Command: ./garbage
==12313==
2
==12313==
==12313== HEAP SUMMARY:
==12313== in use at exit: 4 bytes in 1 blocks
==12313== total heap usage: 3 allocs, 2 frees, 73,732 bytes allocated
==12313==
==12313== LEAK SUMMARY:
==12313== definitely lost: 4 bytes in 1 blocks
==12313== indirectly lost: 0 bytes in 0 blocks
==12313== possibly lost: 0 bytes in 0 blocks
==12313== still reachable: 0 bytes in 0 blocks
==12313== suppressed: 0 bytes in 0 blocks
==12313== Rerun with --leak-check=full to see details of leaked memory
==12313==
==12313== For counts of detected and suppressed errors, rerun with: -v
==12313== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

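It reports definitely lost: 4 bytes in 1 blocks. An int is 4 bytes, and because I never released it after new, 4 bytes leaked. This program is simple, so the source of the leak is easy to spot. Once a program becomes complicated, we can add --leak-check=full to trace the leak back to its origin. Here is the result with that option added.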
==14268== Memcheck, a memory error detector
==14268== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==14268== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==14268== Command: ./garbage
==14268==
2
==14268==
==14268== HEAP SUMMARY:
==14268== in use at exit: 4 bytes in 1 blocks
==14268== total heap usage: 3 allocs, 2 frees, 73,732 bytes allocated
==14268==
==14268== 4 bytes in 1 blocks are definitely lost in loss record 1 of 1
==14268== at 0x4C2B1EC: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14268== by 0x4007C7: main (in /home/tommycc/garbage)
==14268==
==14268== LEAK SUMMARY:
==14268== definitely lost: 4 bytes in 1 blocks
==14268== indirectly lost: 0 bytes in 0 blocks
==14268== possibly lost: 0 bytes in 0 blocks
==14268== still reachable: 0 bytes in 0 blocks
==14268== suppressed: 0 bytes in 0 blocks
==14268==
==14268== For counts of detected and suppressed errors, rerun with: -v
==14268== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Notice these two lines, which point to where the leaked block was allocated.
==14268== at 0x4C2B1EC: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14268== by 0x4007C7: main (in /home/tommycc/garbage)
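
One way to get rid of the leak reported above is to let a smart pointer own the allocation. This is only a minimal sketch, not the original program: print_owned_value is a hypothetical helper name, and std::unique_ptr is swapped in for the raw new. Compiling with -g should also make Memcheck print file names and line numbers in the backtrace instead of bare addresses.

```cpp
#include <iostream>
#include <memory>

// Hypothetical helper (not from the original post): does the same work as the
// leaky example, but the int is owned by a std::unique_ptr.
int print_owned_value()
{
    std::unique_ptr<int> ptr(new int);  // released automatically when ptr goes out of scope
    *ptr = 2;
    std::cout << *ptr << std::endl;
    return *ptr;
}
```

If you call print_owned_value() from main and rerun valgrind --leak-check=full, the 4 bytes should no longer show up as definitely lost.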

Cachegrind

$ valgrind --tool=cachegrind program
It analyzes how cache-friendly your program is. Its results fall into the following categories, quoted from the Valgrind manual:
  • I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr).
  • D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr).
  • D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
  • Conditional branches executed (Bc) and conditional branches mispredicted (Bcm).
  • Indirect branches executed (Bi) and indirect branches mispredicted (Bim).
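In general, the I and D figures are the first things to look at. The example below accesses a two-dimensional array. A two-dimensional array is really laid out as one dimension in memory, so traversing it row by row (row first, then column) is faster than traversing it column by column (column first, then row). We start with the row-by-row code: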
for (int i = 0 ; i < 1000 ; i++ ){
    for (int j = 0 ; j < 1000 ; j++ ){
        a[i][j] = i+j;
    }
}
The result is as follows.
==11719== Cachegrind, a cache and branch-prediction profiler
==11719== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11719== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==11719== Command: ./tt
==11719== 
--11719-- warning: L3 cache found, using its data for the LL simulation.
==11719== 
==11719== I   refs:      16,204,946
==11719== I1  misses:         1,401
==11719== LLi misses:         1,357
==11719== I1  miss rate:       0.01%
==11719== LLi miss rate:       0.01%
==11719== 
==11719== D   refs:       7,729,575  (6,536,262 rd   + 1,193,313 wr)
==11719== D1  misses:        78,407  (   13,698 rd   +    64,709 wr)
==11719== LLd misses:        71,617  (    7,734 rd   +    63,883 wr)
==11719== D1  miss rate:        1.0% (      0.2%     +       5.4%  )
==11719== LLd miss rate:        0.9% (      0.1%     +       5.4%  )
==11719== 
==11719== LL refs:           79,808  (   15,099 rd   +    64,709 wr)
==11719== LL misses:         72,974  (    9,091 rd   +    63,883 wr)
==11719== LL miss rate:         0.3% (      0.0%     +       5.4%  )
Note in particular that the D1 miss rate is 1.0%. Next comes the column-by-column code:
for (int i = 0 ; i < 1000 ; i++ ){
    for (int j = 0 ; j < 1000 ; j++ ){
        a[j][i] = i+j;
    }
}
The result is as follows.
==11522== Cachegrind, a cache and branch-prediction profiler
==11522== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11522== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==11522== Command: ./tt
==11522== 
--11522-- warning: L3 cache found, using its data for the LL simulation.
==11522== 
==11522== I   refs:      16,204,946
==11522== I1  misses:         1,401
==11522== LLi misses:         1,357
==11522== I1  miss rate:       0.01%
==11522== LLi miss rate:       0.01%
==11522== 
==11522== D   refs:       7,729,575  (6,536,262 rd   + 1,193,313 wr)
==11522== D1  misses:     1,015,906  (   13,698 rd   + 1,002,208 wr)
==11522== LLd misses:        71,617  (    7,734 rd   +    63,883 wr)
==11522== D1  miss rate:       13.1% (      0.2%     +      84.0%  )
==11522== LLd miss rate:        0.9% (      0.1%     +       5.4%  )
==11522== 
==11522== LL refs:        1,017,307  (   15,099 rd   + 1,002,208 wr)
==11522== LL misses:         72,974  (    9,091 rd   +    63,883 wr)
==11522== LL miss rate:         0.3% (      0.0%     +       5.4%  )
The D1 miss rate has jumped to 13.1%. Like the other tools, Cachegrind also writes an output file (cachegrind.out.<pid>), which you can analyze with cg_annotate or with KCachegrind.
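
To reproduce the comparison, the two loops can be put into a small self-contained sketch. The names below are hypothetical, and a flat std::vector stands in for the a[1000][1000] array, whose declaration the post does not show. lines_spanned assumes a 64-byte cache line, which is not stated anywhere in the output above.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1000;  // same dimensions as the loops above

// Row by row: consecutive writes touch adjacent addresses.
void fill_row_major(std::vector<int>& a)
{
    for (std::size_t i = 0; i < N; i++)
        for (std::size_t j = 0; j < N; j++)
            a[i * N + j] = static_cast<int>(i + j);
}

// Column by column: consecutive writes are N * sizeof(int) bytes apart.
void fill_column_major(std::vector<int>& a)
{
    for (std::size_t i = 0; i < N; i++)
        for (std::size_t j = 0; j < N; j++)
            a[j * N + i] = static_cast<int>(i + j);
}

// Number of cache lines a buffer of `bytes` spans, i.e. the fewest write
// misses possible when filling it once. line_bytes = 64 is an assumption.
constexpr std::size_t lines_spanned(std::size_t bytes, std::size_t line_bytes = 64)
{
    return (bytes + line_bytes - 1) / line_bytes;
}
```

Both functions produce the same array contents; only the order of the writes differs. With 4-byte ints the array is 4,000,000 bytes, so lines_spanned(4000000) is 62,500. That is close to the 64,709 D1 write misses of the row-by-row run. The column-by-column run shows about 1,000,000 write misses, roughly one per store.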

Callgrind

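Callgrind analyzes the number of function calls made across the whole program.
$ valgrind --tool=callgrind program
This starts your program, which usually runs more slowly than normal. While it is running, you can use $ callgrind_control -e -b or $ callgrind_control -b to see the function-call backtrace at that moment. Below is the result for a recursive Fibonacci implementation.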
    PID 6989: ./fb
sending command status internal to pid 6989

  Totals:           Ir 
   Th 1  2,560,275,485 

  Frame:             Ir Backtrace for Thread 1
   [ 0]  23,685,553,356 fb(long long) (58138276 x)
   [ 1]  41,379,358,433 fb(long long) (58138297 x)
   [ 2]  41,379,358,449 fb(long long) (58138297 x)
   [ 3]  41,379,358,465 fb(long long) (58138297 x)
   [ 4]  41,379,358,481 fb(long long) (58138297 x)
   [ 5]  41,379,358,497 fb(long long) (58138297 x)
   [ 6]  41,379,358,513 fb(long long) (58138297 x)
   [ 7]  41,379,358,529 fb(long long) (58138297 x)
   [ 8]  41,379,358,545 fb(long long) (58138297 x)
   [ 9]  23,685,553,523 fb(long long) (58138276 x)
   [10]  41,379,364,892 fb(long long) (58138297 x)
   [11]  41,379,364,908 fb(long long) (58138297 x)
   [12]  41,379,364,924 fb(long long) (58138297 x)
   [13]  23,685,559,902 fb(long long) (58138276 x)
   [14]  23,685,630,165 fb(long long) (58138276 x)
   [15]  41,379,619,162 fb(long long) (58138297 x)
   [16]  41,379,619,178 fb(long long) (58138297 x)
   [17]  41,379,619,194 fb(long long) (58138297 x)
   [18]  41,379,619,210 fb(long long) (58138297 x)
   [19]  23,685,814,188 fb(long long) (58138276 x)
   [20]  41,382,920,321 fb(long long) (58138297 x)
   [21]  41,382,920,337 fb(long long) (58138297 x)
   [22]  23,689,115,315 fb(long long) (58138276 x)
   [23]  41,405,546,424 fb(long long) (58138297 x)
   [24]  41,405,546,440 fb(long long) (58138297 x)
   [25]  23,711,741,418 fb(long long) (58138276 x)
   [26]  41,560,627,883 fb(long long) (58138297 x)
   [27]  23,866,822,861 fb(long long) (58138276 x)
   [28]  24,523,758,344 fb(long long) (58138276 x)
   [29]  43,937,442,813 fb(long long) (58138297 x)
   [30]   2,558,084,449 fb(long long) (1 x)
   [31]   2,558,084,465 fb(long long) (1 x)
   [32]   2,558,084,472 main (1 x)
   [33]   2,558,187,428 (below main) (1 x)
   [34]   2,558,187,439 _start (1 x)
   [35]               . 0x0000000000000d70 
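You can see how deep the recursion is at that moment. Because the Fibonacci recursion moves up and down the stack quickly, the depth shown by callgrind_control -e -b differs from one invocation to the next. To see the call counts for every function, use the callgrind.out.<pid> file that Callgrind writes when the program finishes. There are two ways to analyze this file: run callgrind_annotate callgrind.out.<pid>, or open it in KCachegrind (a KDE application). In the KCachegrind result, the box at the lower left shows that fb was called 1+331160280 times (the first-level call plus the calls from the second level down).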
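
The call count can be cross-checked without Callgrind by counting invocations directly. The post does not show the source of fb, so the version below is a hypothetical reconstruction: a naive recursion with base case n < 2, plus a counter parameter that the original almost certainly did not have.

```cpp
#include <cstdint>

// Hypothetical reconstruction of fb: naive recursion with base case n < 2.
// `calls` is incremented once per invocation.
long long fb(long long n, std::uint64_t& calls)
{
    ++calls;
    if (n < 2)
        return n;
    return fb(n - 1, calls) + fb(n - 2, calls);
}

// Total number of invocations needed to compute fb(n).
std::uint64_t count_calls(long long n)
{
    std::uint64_t calls = 0;
    fb(n, calls);
    return calls;
}
```

For this recursion the number of invocations is 2*F(n+1) - 1, where F is the Fibonacci sequence. count_calls(10) gives 177, and for n = 40 the formula gives 2*165,580,141 - 1 = 331,160,281. That equals the 1+331160280 reported by KCachegrind, so the figure is consistent with a call to fb(40), although the post does not state the argument.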


Helgrind

$ valgrind --tool=helgrind program
This time we use the example from the official manual.
#include <pthread.h>

int var = 0;

void* child_fn ( void* arg ) {
   var++; /* Unprotected relative to parent */ /* this is line 6 */
   return NULL;
}

int main ( void ) {
   pthread_t child;
   pthread_create(&child, NULL, child_fn, NULL);
   var++; /* Unprotected relative to child */ /* this is line 13 */
   pthread_join(child, NULL);
   return 0;
}
Clearly, var is not protected by a mutex lock, so there is a race. The result is as follows.
==19156== Helgrind, a thread error detector
==19156== Copyright (C) 2007-2015, and GNU GPL'd, by OpenWorks LLP et al.
==19156== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==19156== Command: ./race
==19156== 
==19156== ---Thread-Announcement------------------------------------------
==19156== 
==19156== Thread #1 is the program's root thread
==19156== 
==19156== ---Thread-Announcement------------------------------------------
==19156== 
==19156== Thread #2 was created
==19156==    at 0x51427AE: clone (in /usr/lib/libc-2.24.so)
==19156==    by 0x4E431A9: create_thread (in /usr/lib/libpthread-2.24.so)
==19156==    by 0x4E44C12: pthread_create@@GLIBC_2.2.5 (in /usr/lib/libpthread-2.24.so)
==19156==    by 0x4C31810: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==19156==    by 0x4C328FD: pthread_create@* (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==19156==    by 0x4005C6: main (in /home/tommycc/race)
==19156== 
==19156== ----------------------------------------------------------------
==19156== 
==19156== Possible data race during read of size 4 at 0x60103C by thread #1
==19156== Locks held: none
==19156==    at 0x4005C7: main (in /home/tommycc/race)
==19156== 
==19156== This conflicts with a previous write of size 4 by thread #2
==19156== Locks held: none
==19156==    at 0x400597: child_fn (in /home/tommycc/race)
==19156==    by 0x4C31A04: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==19156==    by 0x4E44453: start_thread (in /usr/lib/libpthread-2.24.so)
==19156==  Address 0x60103c is 0 bytes inside data symbol "var"
==19156== 
==19156== ----------------------------------------------------------------
==19156== 
==19156== Possible data race during write of size 4 at 0x60103C by thread #1
==19156== Locks held: none
==19156==    at 0x4005D0: main (in /home/tommycc/race)
==19156== 
==19156== This conflicts with a previous write of size 4 by thread #2
==19156== Locks held: none
==19156==    at 0x400597: child_fn (in /home/tommycc/race)
==19156==    by 0x4C31A04: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==19156==    by 0x4E44453: start_thread (in /usr/lib/libpthread-2.24.so)
==19156==  Address 0x60103c is 0 bytes inside data symbol "var"
==19156== 
==19156== 
==19156== For counts of detected and suppressed errors, rerun with: -v
==19156== Use --history-level=approx or =none to gain increased speed, at
==19156== the cost of reduced accuracy of conflicting-access information
==19156== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Helgrind tells us clearly that there is a possible data race, and that the variable involved is named var. The result is very intuitive.
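
For comparison, here is a sketch of how the two increments could be protected. It is not the official example: it uses C++11 std::thread and std::mutex in place of the pthread calls, and increment_from_two_threads and var_mutex are hypothetical names. On many systems it needs to be built with -pthread.

```cpp
#include <mutex>
#include <thread>

// Hypothetical variant of the example above: both increments of the shared
// counter happen while holding the same mutex.
int increment_from_two_threads()
{
    int var = 0;
    std::mutex var_mutex;

    std::thread child([&] {
        std::lock_guard<std::mutex> lock(var_mutex);
        var++;  // protected relative to parent
    });

    {
        std::lock_guard<std::mutex> lock(var_mutex);
        var++;  // protected relative to child
    }

    child.join();
    return var;  // read after join, so the child can no longer touch var
}
```

With both accesses made under the same lock, Helgrind should no longer report a conflict on var, and the function returns 2 regardless of which thread runs first.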

Massif

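$ valgrind --tool=massif program
Massif analyzes a program's heap usage.
Heap memory is usually created by functions such as malloc. Reducing heap usage through appropriate optimization can cut down on paging and helps avoid touching swap space.
When Massif finishes it leaves a massif.out.<pid> file, which you can print with ms_print. However, in a typical fast-running program, malloc and free happen so quickly that the resulting graph looks like a single line. To avoid that, run valgrind --tool=massif --time-unit=B program, which makes Massif plot against the number of bytes allocated and deallocated. The example program is below.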
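#include <stdio.h>
#include <stdlib.h>

int main()
{
    malloc(1000);                       /* never freed: still allocated at exit */
    int* a[10];
    for (int i = 0 ; i < 10 ; i++ )
        a[i] = malloc(1000);            /* heap grows by one block per iteration */
    for (int i = 0 ; i < 10 ; i++ )
        free(a[i]);                     /* heap shrinks by one block per iteration */
}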
The result is as follows.
--------------------------------------------------------------------------------
Command:            ./heap
Massif arguments:   --time-unit=B
ms_print arguments: massif.out.2213
--------------------------------------------------------------------------------


    KB
10.91^                                     ####                               
     |                                     #                                  
     |                                  :::#   :::                            
     |                                  :  #   :                              
     |                              @@@@:  #   :  ::::                        
     |                              @   :  #   :  :                           
     |                           :::@   :  #   :  :   :::                     
     |                           :  @   :  #   :  :   :                       
     |                        ::::  @   :  #   :  :   :  :::                  
     |                        :  :  @   :  #   :  :   :  :                    
     |                    :::::  :  @   :  #   :  :   :  :  ::::              
     |                 ::::   :  :  @   :  #   :  :   :  :  :   :::           
     |                 :  :   :  :  @   :  #   :  :   :  :  :   :             
     |             :::::  :   :  :  @   :  #   :  :   :  :  :   :  ::::       
     |             :   :  :   :  :  @   :  #   :  :   :  :  :   :  :          
     |          ::::   :  :   :  :  @   :  #   :  :   :  :  :   :  :   :::    
     |          :  :   :  :   :  :  @   :  #   :  :   :  :  :   :  :   :      
     |      :::::  :   :  :   :  :  @   :  #   :  :   :  :  :   :  :   :  ::: 
     |      :   :  :   :  :   :  :  @   :  #   :  :   :  :  :   :  :   :  :   
     |   ::::   :  :   :  :   :  :  @   :  #   :  :   :  :  :   :  :   :  :  @
   0 +----------------------------------------------------------------------->KB
     0                                                                   20.84

Number of snapshots: 23
 Detailed snapshots: [9, 12 (peak), 22]

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1          1,016            1,016            1,000            16            0
  2          2,032            2,032            2,000            32            0
  3          3,048            3,048            3,000            48            0
  4          4,064            4,064            4,000            64            0
  5          5,080            5,080            5,000            80            0
  6          6,096            6,096            6,000            96            0
  7          7,112            7,112            7,000           112            0
  8          8,128            8,128            8,000           128            0
  9          9,144            9,144            9,000           144            0
98.43% (9,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->87.49% (8,000B) 0x400569: main (in /home/tommycc/heap)
| 
->10.94% (1,000B) 0x400556: main (in /home/tommycc/heap)

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 10         10,160           10,160           10,000           160            0
 11         11,176           11,176           11,000           176            0
 12         11,176           11,176           11,000           176            0
98.43% (11,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->89.48% (10,000B) 0x400569: main (in /home/tommycc/heap)
| 
->08.95% (1,000B) 0x400556: main (in /home/tommycc/heap)

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 13         12,192           10,160           10,000           160            0
 14         13,208            9,144            9,000           144            0
 15         14,224            8,128            8,000           128            0
 16         15,240            7,112            7,000           112            0
 17         16,256            6,096            6,000            96            0
 18         17,272            5,080            5,000            80            0
 19         18,288            4,064            4,000            64            0
 20         19,304            3,048            3,000            48            0
 21         20,320            2,032            2,000            32            0
 22         21,336            1,016            1,000            16            0
98.43% (1,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->98.43% (1,000B) 0x400556: main (in /home/tommycc/heap)
| 
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
You can see the effect of each malloc and free.
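
The numbers on the graph can be reproduced from the snapshot table with a little arithmetic. The sketch below uses hypothetical names. The 16 bytes of overhead per block is simply what this run reported in the extra-heap(B) column, not a general constant.

```cpp
#include <cstddef>

// Figures read off the snapshot table above.
constexpr std::size_t kRequested = 1000;  // bytes passed to each malloc
constexpr std::size_t kExtra = 16;        // extra-heap(B) per block in this run
constexpr std::size_t kBlock = kRequested + kExtra;

// Heap size in bytes while `live` blocks are allocated.
constexpr std::size_t heap_bytes(std::size_t live)
{
    return live * kBlock;
}

// Position on the --time-unit=B axis after the given numbers of malloc and free calls.
constexpr std::size_t time_bytes(std::size_t mallocs, std::size_t frees)
{
    return (mallocs + frees) * kBlock;
}
```

heap_bytes(11) is 11,176 bytes, about 10.91 KB, which is the peak on the vertical axis. time_bytes(11, 10) is 21,336 bytes, about 20.84 KB, which is the right-hand end of the horizontal axis after 11 allocations and 10 frees.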

Other tools

Some of the remaining tools are meant for Valgrind development, so I skipped them.
Reference: http://valgrind.org/docs/manual/manual.html


