Re: GCC Bug

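Ha, the assignment my teacher gave me this time is a tough one. I haven't touched gdb for quite a while, and on top of that I have to deal with the FPU. Below is my test program. It differs from the original post only in minor details; the essential part is the same: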

#include <cstdio>
using namespace std;

double f(double z)
{
    return z;
}

void foo(double fraction)
{
    double z = fraction * 20.0;
    int t1 = z;
    int t2 = f(z);
    printf("t1=%d, t2=%d\n", t1, t2);
}

int main()
{
    foo(6.0/20.0);
    return 0;
}

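The problem

The output

t1=6, t2=6

is what we expect. In some situations, however, the program prints

t1=5, t2=6

instead. Compiling with -ffloat-store gives the correct result again. The gcc optimization options documentation describes that flag as follows: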

Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.

We suspected that an unexpected rounding occurs when the value is taken from a register, and that this is what changes the result.

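Experiment

The first attempts to reproduce the problem already ran into trouble. Only after discussing it with aaa did I realize that our gcc versions and compile options were not the same. We finally pinned down the environment in which the problem occurs: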

cygwin + gcc 3.4.4
gcc -O text.c

These are the platforms I tried, and whether the problem (t1=5) shows up on each:

Compiler             -O     -O3
gcc 4.4.1            No     No
gcc 3.4.4 (cygwin)   Yes    No

An interesting start!

The -O3 build

g++ -Wall -O3 -S -masm=intel main.cpp

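This build is the easiest to understand. The compiler has optimized to the point where:

foo() and f() are not called at all; their results are replaced with constants. (If I remember correctly, this falls under the compiler's constant folding and constant propagation.)

A look at the assembly makes this clear: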

LC2:
	.ascii "t1=%d, t2=%d\12\0"
	.align 4
...
_main:
    push   ebp
    mov    eax, 16
    mov    ebp, esp
    sub    esp, 24
    and    esp, -16
    call   __alloca
    call   ___main
    mov    DWORD PTR [esp], OFFSET FLAT:LC2 ; note LC2: this takes its address :)
    mov    edx, 6
    mov    eax, 6
    mov    DWORD PTR [esp+8], edx
    mov    DWORD PTR [esp+4], eax
    call   _printf
    leave
    xor    eax, eax
    ret

The -O build

g++ -Wall -O -S -masm=intel main.cpp

The -O build is not as clever and really does call foo():

_main:
    push   ebp
    mov    ebp, esp
    sub    esp, 8
    and    esp, -16
    mov    eax, 16
    call   __alloca
    call   ___main
    fld    QWORD PTR LC4
    fstp   QWORD PTR [esp]
    call   __Z3food
    mov    eax, 0
    leave
    ret

We can also see that LC4 is the argument passed to foo():

LC4:
	.long	858993459
	.long	1070805811
	.text
	.align 2


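The IEEE 754 double-precision (64-bit) representation of 0.3 (sign, exponent, then the fraction split into its high 20 bits and low 32 bits):
0 01111111101 00110011001100110011  00110011001100110011001100110011
      high word = 1070805811             low word = 858993459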
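The two .long values can be cross-checked without an assembler. The following is a minimal sketch, assuming a platform with 64-bit IEEE 754 doubles; split_double and DoubleWords are hypothetical helper names, not part of the original program.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The two 32-bit halves of a double, in the order gcc emits them as
// .long directives on little-endian x86.
struct DoubleWords {
    std::uint32_t low;   // first .long: low 32 bits of the fraction
    std::uint32_t high;  // second .long: sign, exponent, top 20 fraction bits
};

// Copy the bit pattern out with memcpy (no aliasing tricks) and split it.
DoubleWords split_double(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return DoubleWords{static_cast<std::uint32_t>(bits & 0xffffffffu),
                       static_cast<std::uint32_t>(bits >> 32)};
}
```

Calling split_double(6.0 / 20.0) should give back the same pair of numbers that appears under LC4.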

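So the -O build has already folded 6.0/20.0 into the constant 0.3. Before diving into the floating-point computation in foo(), though, we need a basic picture of the x87 FPU:

Figure excerpted from: Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture

The x87 registers that hold data are organized as a stack, and they are accompanied by control and status registers used to configure and query the FPU. The stack works as illustrated in the manual:

Figure excerpted from: Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture

The top of the stack is called ST(0), and it is an implicit operand of many FPU instructions. The fld, fstp, and fmul instructions we are about to see in foo() all operate on ST(0) implicitly.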
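To make the stack behaviour concrete, here is a toy model of the x87 data-register stack. It is only a sketch: X87Stack and its members are hypothetical names, and the model ignores the stack overflow and underflow exceptions of real hardware. It does reproduce one detail visible in the gdb sessions in this post: with TOP at 0, a load lands in R7.

```cpp
#include <array>
#include <cassert>

// Toy model: eight physical registers R0..R7 plus the 3-bit TOP field
// of the status word, which selects the register currently named ST(0).
struct X87Stack {
    std::array<long double, 8> reg{};  // physical registers R0..R7
    std::array<bool, 8> valid{};       // false means "tagged Empty"
    unsigned top = 0;                  // TOP field

    // A load such as fld: decrement TOP (wrapping 0 -> 7), then fill
    // the register that has just become ST(0).
    void push(long double value) {
        top = (top + 7) % 8;
        reg[top] = value;
        valid[top] = true;
    }

    // A store-and-pop such as fstp: read ST(0), tag it Empty, increment TOP.
    long double pop() {
        assert(valid[top]);
        long double value = reg[top];
        valid[top] = false;
        top = (top + 1) % 8;
        return value;
    }

    // ST(i) is physical register (TOP + i) mod 8.
    long double& st(unsigned i) { return reg[(top + i) % 8]; }
};
```

In this model an instruction like fmul with a memory operand is simply an update of st(0), which is why the compiler can chain fld, fmul, and fist without naming a register.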

__Z3food:
    push    ebp
    mov     ebp, esp
    push    ebx
    sub     esp, 20
    fld     DWORD PTR LC1
    fmul    QWORD PTR [ebp+8]
    fnstcw  WORD PTR [ebp-6]
    movzx   eax, WORD PTR [ebp-6]
    or      ax, 3072
    mov     WORD PTR [ebp-8], ax
    fldcw   WORD PTR [ebp-8]
    fist    DWORD PTR [ebp-12]      ; store integer
    fldcw   WORD PTR [ebp-6]
    mov     ebx, DWORD PTR [ebp-12]
    fstp    QWORD PTR [esp]
    call    __Z1fd
    fnstcw  WORD PTR [ebp-6]
    movzx   eax, WORD PTR [ebp-6]
    or      ax, 3072
    mov     WORD PTR [ebp-8], ax
    fldcw   WORD PTR [ebp-8]
    fistp   DWORD PTR [esp+8]
    fldcw   WORD PTR [ebp-6]
    mov     DWORD PTR [esp+4], ebx
    mov     DWORD PTR [esp], OFFSET FLAT:LC2
    call    _printf
    add     esp, 20
    pop     ebx
    pop     ebp
    ret

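Working backwards, we can locate t1 and the instructions that produce it in the assembly:

printf receives the value of t1 from ebx,
ebx is loaded from [ebp-12],
[ebp-12] is the result of fist, that is, the FPU's ST(0) converted to an integer,
tracing further up, the only load is the fld on line 6 of the listing, so ST(0) here holds the result of the fmul.

The backward trace matches the logic of the program, so the next step is to follow it in a debugger: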

Breakpoint 1, 0x0040104f in foo ()
1: x/i $pc
0x40104f <_Z3food+7>:   flds   0x402030
(gdb) info float
  R7: Empty   0x3ffd9999999999999800
  R6: Empty   0x00e07c93022200240000
  R5: Empty   0x51c27c92fe956123a74c
  R4: Empty   0x79d37c9a06007c930460
  R3: Empty   0x0a887c9479870022bf80
  R2: Empty   0xbf9800010a886123debc
  R1: Empty   0x00006123a74c7c93043e
=>R0: Empty   0xdee46123de0c6123d8cc

Status Word:         0xffff0000
                       TOP: 0
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0xffffffff
Instruction Pointer: 0x1b:0x004010c8
Operand Pointer:     0xffff0023:0x0022cd10
Opcode:              0xdd1c

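We set the breakpoint in foo() and then use info float to dump the FPU registers (the FPU registers apparently cannot be added to gdb's automatic display, so the command has to be typed each time). The noteworthy points are:

R0 to R7 are the FPU data register stack mentioned earlier. The stack is currently empty: TOP points at R0, but every register is tagged Empty.
The RC (rounding control) field of the Control Word determines how floating-point results are rounded.

Figure excerpted from: Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture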
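The fields gdb prints can also be decoded by hand. The sketch below assumes the control word layout from the Intel manual (PC in bits 8-9, RC in bits 10-11); the enum and function names are made up for illustration. It also shows what the or ax, 3072 in foo() does to the control word 0x037f seen above.

```cpp
#include <cassert>
#include <cstdint>

// RC field values (bits 10-11 of the x87 control word).
enum class Rounding { Nearest = 0, Down = 1, Up = 2, TowardZero = 3 };

// PC field values (bits 8-9 of the x87 control word).
enum class Precision { Single = 0, Reserved = 1, Double = 2, Extended = 3 };

Rounding rounding_control(std::uint16_t control_word) {
    return static_cast<Rounding>((control_word >> 10) & 0x3);
}

Precision precision_control(std::uint16_t control_word) {
    return static_cast<Precision>((control_word >> 8) & 0x3);
}

// What `or ax, 3072` does: 3072 == 0x0c00 sets both RC bits, which
// selects round toward zero (truncation) and leaves PC untouched.
std::uint16_t force_truncation(std::uint16_t control_word) {
    return static_cast<std::uint16_t>(control_word | 0x0c00);
}
```

Applied to 0x037f, force_truncation yields 0x0f7f: the precision stays at extended, and only the rounding mode changes.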

Now that we can read gdb's FPU output, we can close in on the heart of the problem, [ebp-12], step by step.

foo() begins by loading the value at 0x402030 onto the FPU stack:

Breakpoint 1, 0x0040104f in foo ()
1: x/i $pc
0x40104f <_Z3food+7>:   flds   0x402030

If you are curious where this magic number comes from, it can be found in the earlier assembly output:

LC1:
	.long	1101004800    ; = 0x41a00000
	.text
	.align 2

or inspected dynamically in the debugger:

(gdb) x 0x402030
0x402030 <_data_start__+48>:    0x41a00000 ; = 1101004800

0x41a00000 is the IEEE 754 single-precision (32-bit) representation of 20.0, which is why it is loaded with a DWORD-sized fld (flds). Let's confirm in the debugger:

Breakpoint 1, 0x0040104f in foo ()
1: x/i $pc
0x40104f <_Z3food+7>:   flds   0x402030
(gdb) info float
  R7: Empty   0x3ffd9999999999999800
  R6: Empty   0x00e07c93022200240000
  R5: Empty   0x51c27c92fe956123a74c
  R4: Empty   0x79d37c9a06007c930460
  R3: Empty   0x0a887c9479870022bf80
  R2: Empty   0xbf9800010a886123debc
  R1: Empty   0x00006123a74c7c93043e
=>R0: Empty   0xdee46123de0c6123d8cc

(gdb) si
0x00401055 in foo ()
1: x/i $pc
0x401055 <_Z3food+13>:  fmull  0x8(%ebp)

(gdb) info float
=>R7: Valid   0x4003a000000000000000 +20
  R6: Empty   0x00e07c93022200240000
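The 32-bit pattern that flds loaded can be checked independently of the debugger. This is a small sketch assuming 32-bit IEEE 754 floats; float_bits and biased_exponent are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the raw bit pattern of a single-precision float.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Bits 23-30 hold the exponent, biased by 127.
unsigned biased_exponent(std::uint32_t bits) {
    return (bits >> 23) & 0xffu;
}
```

For 20.0f the pattern is 0x41a00000: 20 = 1.25 x 2^4, so the biased exponent is 127 + 4 = 131 and the fraction encodes the .25.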

After that instruction executes, ST(0) holds 20. Next comes the multiplication by 0.3 ([ebp+0x8] is the argument of foo(), that is, 0.3):

0x401055 <_Z3food+13>:  fmull  0x8(%ebp)

(gdb) si
0x00401058 in foo ()
1: x/i $pc
0x401058 <_Z3food+16>:  fnstcw -0x6(%ebp)
(gdb) info float
=>R7: Valid   0x4001bffffffffffffe00 +6
  R6: Empty   0x00e07c93022200240000

gdb shows the result as +6, which looks right. What happens next to the FPU control word, however, is surprising:

(gdb) info float
...
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest

0x40105f <_Z3food+23>:  or     $0xc00,%ax
0x401063 <_Z3food+27>:  mov    %ax,-0x8(%ebp)
0x401067 <_Z3food+31>:  fldcw  -0x8(%ebp)

(gdb) info float
...
Control Word:        0xffff0f7f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round toward zero

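The -O build changes RC from Round to nearest to Round toward zero, and that affects how fist behaves. Before executing fist, let's work out the expected value: 6.0 is 0x4018000000000000 in IEEE 754 double precision, which corresponds to 0x4001c000000000000000 in the 80-bit extended format used by the x87 registers. R7 currently holds 0x4001bffffffffffffe00, which is smaller. In other words, R7 is slightly less than 6.0, and since the conversion to an integer now uses Round toward zero, a result of 5 is exactly what we should expect.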

(gdb) si
0x0040106a in foo ()
1: x/i $pc
0x40106a <_Z3food+34>:  fistl  -0xc(%ebp)
(gdb) info float
=>R7: Valid   0x4001bffffffffffffe00 +6
  R6: Empty   0x4bc000010101e259acd0
...
Control Word:        0xffff0f7f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round toward zero

(gdb) x 0x22ccfc              ; ebp - 0xc = 0x22ccfc
0x22ccfc:       0x61100049

(gdb) si
0x0040106d in foo ()
1: x/i $pc
0x40106d <_Z3food+37>:  fldcw  -0x6(%ebp)
(gdb) x 0x22ccfc
0x22ccfc:       0x00000005

The result confirms our expectation.
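The effect can be reproduced without relying on a particular gcc version. The sketch below is not the original test program: it uses long double as a stand-in for the x87 register, and the function names are hypothetical. It assumes long double has at least a 64-bit significand, which holds for gcc on x86 but not on every platform; kLongDoubleIsWide records that assumption.

```cpp
#include <cassert>
#include <limits>

// True when long double carries at least the 64 significand bits of the
// x87 extended format (an assumption about the build target).
constexpr bool kLongDoubleIsWide =
    std::numeric_limits<long double>::digits >= 64;

// 0.3 (as a double) times 20, kept at long double precision, the way the
// -O build keeps the product in an x87 register.
long double product_in_register() {
    volatile double fraction = 6.0 / 20.0;  // volatile: block constant folding
    return static_cast<long double>(fraction) * 20.0L;
}

// The same product forced through a 64-bit double, as happens when the
// value is stored to memory; the store rounds to nearest.
double product_in_memory() {
    volatile double z = static_cast<double>(product_in_register());
    return z;
}

// C++ floating-point to integer conversion truncates toward zero.
int to_int(long double value) { return static_cast<int>(value); }
```

With a wide long double, to_int(product_in_register()) truncates a value just below 6, while to_int(product_in_memory()) sees a value that has already been rounded to exactly 6.0.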

The -ffloat-store build

We obtain the assembly for -ffloat-store the same way as for the -O build:

g++ -Wall -O -ffloat-store -S -masm=intel main.cpp

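Like the -O build, the -ffloat-store build calls foo(). Comparing the two versions of foo(), however, shows that the -ffloat-store build has just two extra instructions: right after the multiplication by 0.3, it writes the value back to the call stack and then loads it again.

That is, when the value is written to memory as a 64-bit double, it is rounded with Round to nearest: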

(gdb) si
0x00401058 in foo ()
1: x/i $pc
0x401058 <_Z3food+16>:  fstpl  -0x10(%ebp)
(gdb) info float
=>R7: Valid   0x4001bffffffffffffe00 +6
  R6: Empty   0xdbc000010101e4ada9d8
...
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest

and by the time it is loaded back onto the FPU stack, it is exactly 6.0:

(gdb) si
0x0040105e in foo ()
1: x/i $pc
0x40105e <_Z3food+22>:  fnstcw -0x12(%ebp)
(gdb) info float
=>R7: Valid   0x4001c000000000000000 +6
  R6: Empty   0x10000000000400010000
...
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest

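WinDbg experiment: how command line arguments are passed

A C/C++ program can obtain the command line arguments entered by the user through the argument list of main():

int main( int argc, char* argv[] ) { ... }

If you are curious how these come into being out of nothing, you can write a small program and trace it.

First, prepare the sample code: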

#include <iostream>
using namespace std;

int main( int argc, char* argv[] )
{   // the body of the program is not the point
    cout << "hello world";
    cin.get();
    return 0;
}

Set a breakpoint at main() with VC++ or WinDbg and trace from there:

Compiler: VC++ 2005
OS: Windows XP
Run CrtDemo05 test in cmd

0:000> bp CrtDemo05!main
*** WARNING: Unable to verify checksum for CrtDemo05.exe

0:000> bl
 0 e 004377d0     0001 (0001)  0:**** CrtDemo05!main

0:000> g
Breakpoint 0 hit
eax=00383088 ebx=7ffda000 ecx=00383028 edx=00000001 esi=010df742 edi=010df6f2
eip=004377d0 esp=0012ff58 ebp=0012ffb8 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
CrtDemo05!main:
004377d0 55              push    ebp

0:000> k
ChildEBP RetAddr  
0012ff54 0044a203 CrtDemo05!main [f:\src\_experiment\crtdemo05\crtdemo05\main.cpp @ 6]
0012ffb8 00449fbd CrtDemo05!__tmainCRTStartup+0x233 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c @ 327]
0012ffc0 7c816d4f CrtDemo05!mainCRTStartup+0xd [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c @ 196]
0012fff0 00000000 kernel32!BaseProcessStart+0x23

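The two functions that run before main() are both part of the CRT. They perform the basic but necessary initialization: global variables, the heap, I/O, and so on all fall into this category. Only then can those of us living after main() work smoothly. So if we want to know where the argument list comes from, we have to trace these two functions. Fortunately, the source code of the VC++ CRT ships with the installer and is usually found at: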

C:\Program Files\Microsoft Visual Studio 8\VC\crt

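depending on the installation path. You can also double-click a frame in the VC++ Call Stack window to jump into the source.

If you want to trace or experiment with the CRT in a debugger, consider linking the debug CRT statically. This saves time and makes tracing more readable and convenient. A debug build with dynamic linking, for example, goes through the ILT, and the indirection it introduces makes locating functions and symbols take longer, which gets in the way of the experiment.

VC++ implements what is called Microsoft C++, which has its own ways of defining a program's entry point. By name alone there are four: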

main
wmain
WinMain
wWinMain

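They all share the same CRT source code, which uses the same technique as TCHAR.h to tell them apart, except that here the switch is controlled by the WPRFLAG macro. Any name that starts with t, _t, or __t may be substituted through it, although __tmainCRTStartup() is an exception. The table below lists the t, _t, and __t symbols related to the command line: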

Generic name        ansi/console      ansi/GUI            wide/console           wide/GUI
_tmainCRTStartup    mainCRTStartup    WinMainCRTStartup   wmainCRTStartup        wWinMainCRTStartup
_tcmdln             _acmdln           _acmdln             _wcmdln                _wcmdln
_targv              __argv            __argv              __wargv                __wargv
GetCommandLineT()   GetCommandLineA   GetCommandLineA     __crtGetCommandLineW   __crtGetCommandLineW
_tsetargv()         _setargv          _setargv            _wsetargv              _wsetargv
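The substitution mechanism can be sketched in a few lines. This is not the CRT's code: DEMO_WIDE is a hypothetical stand-in for WPRFLAG, and every demo_ name is invented for illustration. The point is only that generic code is written once against the t-names and the preprocessor picks the narrow or wide symbol.

```cpp
#include <cassert>
#include <cstddef>

// DEMO_WIDE plays the role of WPRFLAG: define it to get the wide build.
#ifdef DEMO_WIDE
typedef wchar_t demo_tchar;
#define DEMO_T(text) L##text
#define demo_tcmdln demo_wcmdln
#else
typedef char demo_tchar;
#define DEMO_T(text) text
#define demo_tcmdln demo_acmdln
#endif

// Both variables exist, but a given build refers to only one of them
// through the generic name demo_tcmdln.
const char*    demo_acmdln = "CrtDemo05.exe test";
const wchar_t* demo_wcmdln = L"CrtDemo05.exe test";

// Generic code written once against the t-names.
std::size_t demo_tcslen(const demo_tchar* text) {
    std::size_t length = 0;
    while (text[length] != 0) {
        ++length;
    }
    return length;
}
```

Compiling the same file with and without -DDEMO_WIDE produces the two variants from one source, which is the arrangement the table above describes for the CRT startup code.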

We start with the ansi version of main(), which is also the main() described by the C/C++ Standard:

mainCRTStartup()

mainCRTStartup() is essentially a forwarding function:

int _tmainCRTStartup( void )
{
    __security_init_cookie();
    return __tmainCRTStartup();
}

__tmainCRTStartup()

As mentioned earlier, __tmainCRTStartup() really is named __tmainCRTStartup(); it has no ansi or wide substitution. The code related to the command line is quickly picked out:

int __tmainCRTStartup( void )
{
    // ...

    __try {
        // ...
    
        /* get wide cmd line info */
        _tcmdln = (_TSCHAR *)GetCommandLineT();

        /* get wide environ info */
        _tenvptr = (_TSCHAR *)GetEnvironmentStringsT();

        if ( _tsetargv() < 0 )
            _amsg_exit(_RT_SPACEARG);
        if ( _tsetenvp() < 0 )
            _amsg_exit(_RT_SPACEENV);

        // ...

    #ifdef _WINMAIN_
        lpszCommandLine = _twincmdln();
        mainret = _tWinMain( (HINSTANCE)&__ImageBase,
                             NULL,
                             lpszCommandLine,
                             StartupInfo.dwFlags & STARTF_USESHOWWINDOW
                                  ? StartupInfo.wShowWindow
                                  : SW_SHOWDEFAULT
                            );
    #else  /* _WINMAIN_ */
        _tinitenv = _tenviron;
        mainret = _tmain(__argc, _targv, _tenviron);
    #endif  /* _WINMAIN_ */

        // ...
    }
    __except ( _XcptFilter(GetExceptionCode(), GetExceptionInformation()) ) {
        // ...
    }

    return mainret;
}

From the code above, the key lies in two functions:

GetCommandLineT()
_tsetargv()

However, when we set a breakpoint on GetCommandLineT(), we cannot step into it. At this point we can switch to assembly mode and keep tracing with the WinDbg disassembly window or the uf command:

0:000> uf __tmainCRTStartup
...
CrtDemo05!__tmainCRTStartup+0x1b4 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c @ 300]:
  300 0044a184 ff1564924b00    call    dword ptr [CrtDemo05!_imp__GetCommandLineA (004b9264)]
  300 0044a18a a3a4814b00      mov     dword ptr [CrtDemo05!_acmdln (004b81a4)],eax
  303 0044a18f e8a89efeff      call    CrtDemo05!ILT+55(___crtGetEnvironmentStringsA) (0043403c)
  303 0044a194 a3d45e4b00      mov     dword ptr [CrtDemo05!_aenvptr (004b5ed4)],eax
  305 0044a199 e87a9ffeff      call    CrtDemo05!ILT+275(__setargv) (00434118)
  305 0044a19e 85c0            test    eax,eax
  305 0044a1a0 7d0a            jge     CrtDemo05!__tmainCRTStartup+0x1dc (0044a1ac)
...

On Windows, a name decorated with a leading _imp or __imp marks a symbol that is exported by a DLL, so the function behind CrtDemo05!_imp__GetCommandLineA lives in another DLL. Dereferencing that address yields the real symbol.

0:000> dd 004b9264
004b9264  7c812c8d 7c93043d 7c812851 7c9305d4
004b9274  7c80aa49 7c80a0c7 7c809cad 7c832e2b

Since CrtDemo05 makes its call through this address, the value stored there must point to executable instructions. We use uf to look it up:

0:000> uf 7c812c8d 
kernel32!GetCommandLineA:
7c812c8d a1f435887c      mov     eax,dword ptr [kernel32!BaseAnsiCommandLine+0x4 (7c8835f4)]
7c812c92 c3              ret

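It turns out to live in kernel32.dll, one of the base system libraries, and its implementation is remarkably simple: it merely reads from the address kernel32!BaseAnsiCommandLine+0x4. To verify, we dereference that address and check whether it really holds the arguments passed to CrtDemo05.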

0:000> dd 7c8835f4
7c8835f4  00151ee0 000a0008 7c80e300 ffff02ff
7c883604  00000001 000a0008 7c831874 000a0008

and then:

0:000> da 00151ee0 
00151ee0  "C:\test\CrtDemo05.exe test"

If the two-level dereference with dd feels cumbersome, the dp* commands can be used for strings:

kd> dpa 7c8835f4 L1
7c8835f4  001522f8 "CrtDemo05.exe"

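Either way, this confirms our idea: kernel32!GetCommandLineA() is the function that fetches the command line, and its implementation is extremely simple: it reads a variable whose value was filled in beforehand.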
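The shape of what we just traced, an indirect call through an import slot to a function that only returns a pre-filled pointer, can be modelled in portable C++. Everything below is a hypothetical stand-in (the demo_ names are not kernel32 symbols), and the struct layout is an assumption modelled on ANSI_STRING, whose Buffer field sits at +0x4 on 32-bit Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for the global that kernel32 keeps for the ANSI command line.
struct DemoAnsiString {
    unsigned short length;
    unsigned short maximum_length;
    const char*    buffer;  // the field the real function reads at +0x4
};

DemoAnsiString demo_base_ansi_command_line = {0, 0, nullptr};

// Stand-in for the DLL initialization that fills the global in advance.
void demo_dll_initialize(const char* command_line) {
    std::size_t length = std::strlen(command_line);
    demo_base_ansi_command_line.length = static_cast<unsigned short>(length);
    demo_base_ansi_command_line.maximum_length =
        static_cast<unsigned short>(length + 1);
    demo_base_ansi_command_line.buffer = command_line;
}

// Equivalent of "mov eax, [BaseAnsiCommandLine+0x4]; ret".
const char* demo_get_command_line_a() {
    return demo_base_ansi_command_line.buffer;
}

// Stand-in for the import slot _imp__GetCommandLineA: a pointer that the
// loader fills with the address of the real function.
const char* (*demo_imp_get_command_line_a)() = &demo_get_command_line_a;

// Equivalent of "call dword ptr [_imp__GetCommandLineA]" in the CRT startup.
const char* demo_crt_fetch_command_line() {
    return demo_imp_get_command_line_a();
}
```

The getter does no work of its own, so the interesting question becomes who calls the equivalent of demo_dll_initialize, and when.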

The wmain() version works the same way:

0:000> uf CrtDemo05!__tmainCRTStartup
...
CrtDemo05!__tmainCRTStartup+0x1b4 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c @ 300]:
  300 0044a194 e8ad9efeff      call    CrtDemo05!ILT+65(___crtGetCommandLineW) (00434046)
  300 0044a199 a3e4824b00      mov     dword ptr [CrtDemo05!_wcmdln (004b82e4)],eax
  303 0044a19e e8439ffeff      call    CrtDemo05!ILT+225(___crtGetEnvironmentStringsW) (004340e6)
  303 0044a1a3 a3d85e4b00      mov     dword ptr [CrtDemo05!_wenvptr (004b5ed8)],eax
  305 0044a1a8 e845b1feff      call    CrtDemo05!ILT+4845(__wsetargv) (004352f2)
  305 0044a1ad 85c0            test    eax,eax
  305 0044a1af 7d0a            jge     CrtDemo05!__tmainCRTStartup+0x1db (0044a1bb)
...

You may wonder why the wide version does not call _imp__GetCommandLineW() directly. It does, but the call is wrapped inside ___crtGetCommandLineW().

0:000> uf 00434046
...
CrtDemo05!__crtGetCommandLineW+0xf [f:\dd\vctools\crt_bld\self_x86\crt\src\aw_com.c @ 52]:
   52 0046336f ff1530934b00    call    dword ptr [CrtDemo05!_imp__GetCommandLineW (004b9330)]
   52 00463375 85c0            test    eax,eax
   52 00463377 740c            je      CrtDemo05!__crtGetCommandLineW+0x25 (00463385)
...

And it can be traced in the same way as the ansi version:

0:000> dd 004b9330
004b9330  7c816cfb 7c80c6cf 7c801eee 7c873d83
004b9340  7c81abe4 7c80180e 7c810da6 7c80cd58

0:000> uf 7c816cfb 
kernel32!GetCommandLineW:
7c816cfb a10430887c      mov     eax,dword ptr [kernel32!BaseUnicodeCommandLine+0x4 (7c883004)]
7c816d00 c3              ret

0:000> dd 7c883004
7c883004  0002062c 00000000 7c809784 7c80979d
7c883014  7c8097d2 7c8097e7 7c8097b7 00000000

0:000> du 0002062c 
0002062c  "C:\test\CrtDemo05.exe test"

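Using Kernel Debugging

You may wonder why tracing how the command line gets filled into a process requires kernel debugging. The reason is simple: user-mode debugging is not enough. Take the argument passing in this case: when we load a program with WinDbg, the program has to reach a certain state before the user-mode debugger can step in, and by then the arguments have already been set up.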

Microsoft (R) Windows Debugger Version 6.11.0001.404 X86
Copyright (c) Microsoft Corporation. All rights reserved.

CommandLine: C:\test\CrtDemo05.exe
Symbol search path is: srv*C:\sym_cache*http://msdl.microsoft.com/download/symbols;C:\test
Executable search path is: 
ModLoad: 00400000 004bb000   CrtDemo05.exe
ModLoad: 7c920000 7c9b5000   ntdll.dll
ModLoad: 7c800000 7c91d000   C:\WINDOWS\system32\kernel32.dll
(7d8.568): Break instruction exception - code 80000003 (first chance)
eax=00251eb4 ebx=7ffd8000 ecx=00000001 edx=00000002 esi=00251f48 edi=00251eb4
eip=7c921230 esp=0012fb20 ebp=0012fc94 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
ntdll!DbgBreakPoint:
7c921230 cc              int     3

0:000> kvn
 # ChildEBP RetAddr  Args to Child              
00 0012fb1c 7c95edc0 7ffdf000 7ffd8000 00000000 ntdll!DbgBreakPoint (FPO: [0,0,0])
01 0012fc94 7c941639 0012fd30 7c920000 0012fce0 ntdll!LdrpInitializeProcess+0xffa (FPO: [5,89,4])
02 0012fd1c 7c92eac7 0012fd30 7c920000 00000000 ntdll!_LdrpInitialize+0x183 (FPO: [Non-Fpo])
03 00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7

0:000> da 00151ee0 
00151ee0  "C:\test\CrtDemo05.exe test"

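Notice? While CrtDemo05 is still being initialized by the OS loader, before __tmainCRTStartup has even run, the arguments are already in place. This is why the case calls for kernel debugging, so that we can trace from an earlier point. There are many ways to do kernel debugging; using VMWare is one of the more convenient ones. See Windows Debugging – Kernel Debugging with WinDbg and VMWare for setting up the environment.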

Catch Me If You Can

To catch how kernel32!BaseUnicodeCommandLine and kernel32!BaseAnsiCommandLine get filled in, we can use WinDbg's break on access. We quickly run into the first problem, though:

kd> x kernel32!BaseUnicodeCommandLine
7c885730 kernel32!BaseUnicodeCommandLine = <no type information>
7c883000 kernel32!BaseUnicodeCommandLine = <no type information>

There are two symbols with the same name, which in theory makes symbol resolution ambiguous. We have no explanation for it yet, but fortunately the earlier trace showed:

7c816cfb a10430887c mov eax,dword ptr [kernel32!BaseUnicodeCommandLine+0x4 (7c883004)]

so we can simply pick the one at 7c883000 and break on access to 7c883004:

kd> ba r4 7c883004
kd> g
Breakpoint 0 hit
nt!MiCopyOnWrite+0x148:
0008:805222bc f3a5            rep movs dword ptr es:[edi],dword ptr [esi]
kd> k
ChildEBP RetAddr  
b22bad00 8051d019 nt!MiCopyOnWrite+0x148
b22bad4c 805406ec nt!MmAccessFault+0x9f9
b22bad4c 7c93a100 nt!KiTrap0E+0xcc
0012f938 7c93d8a0 ntdll!__security_init_cookie_ex+0x31
0012f944 7c93d83a ntdll!LdrpInitSecurityCookie+0x2f
0012fa40 7c939b78 ntdll!LdrpRunInitializeRoutines+0x124
0012faf0 7c939ba0 ntdll!LdrpGetProcedureAddress+0x1c6
0012fb0c 7c942334 ntdll!LdrGetProcedureAddress+0x18
0012fc94 7c941639 ntdll!LdrpInitializeProcess+0xd7a
0012fd1c 7c92eac7 ntdll!_LdrpInitialize+0x183
00000000 00000000 ntdll!KiUserApcDispatcher+0x7

This first hit is not the stack we are looking for, so we continue:

kd> g
Breakpoint 0 hit
kernel32!_BaseDllInitialize+0x20f:
7c817ea8 ff157c10807c    call    dword ptr [kernel32!_imp__RtlUnicodeStringToAnsiString (7c80107c)]
kd> bl
 0 e 7c883004 r 4 0001 (0001) kernel32!BaseUnicodeCommandLine+0x4
kd> k
ChildEBP RetAddr  
0012f918 7c9211a7 kernel32!_BaseDllInitialize+0x20f
0012f938 7c93cbab ntdll!LdrpCallInitRoutine+0x14
0012fa40 7c939b78 ntdll!LdrpRunInitializeRoutines+0x344
0012faf0 7c939ba0 ntdll!LdrpGetProcedureAddress+0x1c6
0012fb0c 7c942334 ntdll!LdrGetProcedureAddress+0x18
0012fc94 7c941639 ntdll!LdrpInitializeProcess+0xd7a
0012fd1c 7c92eac7 ntdll!_LdrpInitialize+0x183
00000000 00000000 ntdll!KiUserApcDispatcher+0x7

This stack looks far more promising. Let's disassemble the code just before the breakpoint, either in WinDbg's disassembly window or with a command:

kd> ub 7c817ea8 L10
kernel32!_BaseDllInitialize+0x1cc:
...
7c817e7f 64a118000000    mov     eax,dword ptr fs:[00000018h]
7c817e85 8b4030          mov     eax,dword ptr [eax+30h]
7c817e88 8b4010          mov     eax,dword ptr [eax+10h]
7c817e8b 8b4840          mov     ecx,dword ptr [eax+40h]
7c817e8e 6a01            push    1
7c817e90 890d0030887c    mov     dword ptr [kernel32!BaseUnicodeCommandLine (7c883000)],ecx
7c817e96 8b4044          mov     eax,dword ptr [eax+44h]
7c817e99 680030887c      push    offset kernel32!BaseUnicodeCommandLine (7c883000)
7c817e9e 68f035887c      push    offset kernel32!BaseAnsiCommandLine (7c8835f0)
7c817ea3 a30430887c      mov     dword ptr [kernel32!BaseUnicodeCommandLine+0x4 (7c883004)],eax

Interpreting fs:[00000018h]

We can see that kernel32!BaseUnicodeCommandLine is set from ecx, and ecx is ultimately derived from fs:[00000018h]. Reading this part takes some background knowledge: on x86, Windows user mode makes each thread's TEB (Thread Environment Block) addressable at fs:[0], while in kernel mode fs:[0] addresses the KPCR (Kernel Processor Control Region).

kd> dt _TEB
ntdll!_TEB
   +0x000 NtTib            : _NT_TIB
   +0x01c EnvironmentPointer : Ptr32 Void
   +0x020 ClientId         : _CLIENT_ID
   +0x028 ActiveRpcHandle  : Ptr32 Void
   +0x02c ThreadLocalStoragePointer : Ptr32 Void
   +0x030 ProcessEnvironmentBlock : Ptr32 _PEB
   ...

The whole ntdll!_TEB is rather large, but according to the ub output, fs:[00000018h] falls inside the first field, _NT_TIB (Thread Information Block), so we run dt once more:

kd> dt _NT_TIB
ntdll!_NT_TIB
   +0x000 ExceptionList    : Ptr32 _EXCEPTION_REGISTRATION_RECORD
   +0x004 StackBase        : Ptr32 Void
   +0x008 StackLimit       : Ptr32 Void
   +0x00c SubSystemTib     : Ptr32 Void
   +0x010 FiberData        : Ptr32 Void
   +0x010 Version          : Uint4B
   +0x014 ArbitraryUserPointer : Ptr32 Void
   +0x018 Self             : Ptr32 _NT_TIB

Then why not use fs:[0] directly instead of fs:[00000018h]? Because fs:[0] is a segment-relative address, not a virtual address in the process's flat address space:

kd> dd fs:0
003b:00000000  0012fa30 00130000 0012f000 00000000
003b:00000010  00001e00 00000000 7ffdd000 00000000
003b:00000020  00000778 00000128 00000000 00000000

In practice, therefore, code usually reads fs:[00000018h], the Self field, to obtain the flat address of the TEB (or TIB).

Re-reading the ub output

7c817e7f 64a118000000    mov     eax,dword ptr fs:[00000018h]
7c817e85 8b4030          mov     eax,dword ptr [eax+30h]
7c817e88 8b4010          mov     eax,dword ptr [eax+10h]
7c817e8b 8b4840          mov     ecx,dword ptr [eax+40h]

Knowing what fs:[0] means, we can follow the instructions, decode each offset with dt, and write the corresponding pseudocode:

_TEB*           teb = fs:[00000018h];
UNICODE_STRING  kernel32!BaseUnicodeCommandLine =
    teb->ProcessEnvironmentBlock->ProcessParameters->CommandLine;
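The same pointer walk can be replayed outside the debugger on a toy 32-bit address space. In this sketch Memory32, UnicodeString32, and read_command_line are hypothetical names, memory is modelled as a sparse map from address to dword, and the offsets (18h, 30h, 10h, 40h, 44h) are the ones from the ub output.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sparse model of a 32-bit address space: address -> dword stored there.
using Memory32 = std::map<std::uint32_t, std::uint32_t>;

// 32-bit layout of UNICODE_STRING: two 16-bit lengths and a pointer.
struct UnicodeString32 {
    std::uint16_t length;
    std::uint16_t maximum_length;
    std::uint32_t buffer;
};

// Follow the chain TEB -> PEB -> ProcessParameters -> CommandLine.
// fs_base is the flat address that the fs segment maps to (the TEB).
UnicodeString32 read_command_line(const Memory32& memory, std::uint32_t fs_base) {
    std::uint32_t teb     = memory.at(fs_base + 0x18);  // mov eax, fs:[18h]   (NT_TIB.Self)
    std::uint32_t peb     = memory.at(teb + 0x30);      // mov eax, [eax+30h]  (ProcessEnvironmentBlock)
    std::uint32_t params  = memory.at(peb + 0x10);      // mov eax, [eax+10h]  (ProcessParameters)
    std::uint32_t lengths = memory.at(params + 0x40);   // mov ecx, [eax+40h]  (Length, MaximumLength)
    std::uint32_t buffer  = memory.at(params + 0x44);   // mov eax, [eax+44h]  (Buffer)
    return UnicodeString32{static_cast<std::uint16_t>(lengths & 0xffffu),
                           static_cast<std::uint16_t>(lengths >> 16),
                           buffer};
}
```

Because the two dwords at +40h and +44h are copied as they are, kernel32 ends up with its own UNICODE_STRING header that points at the same character buffer as the one in the process parameters.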

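At this point we know that even an initialization as low-level as kernel32!_BaseDllInitialize() merely reads the value of a variable in the PEB. So who is responsible for filling in that value in the PEB? Remember CreateProcess() from the Win32 API? It may well be one of the suspects.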

CreateProcess()

kd> dt _RTL_USER_PROCESS_PARAMETERS
ntdll!_RTL_USER_PROCESS_PARAMETERS
   +0x000 MaximumLength    : Uint4B
   +0x004 Length           : Uint4B
   +0x008 Flags            : Uint4B
   +0x00c DebugFlags       : Uint4B
   +0x010 ConsoleHandle    : Ptr32 Void
   +0x014 ConsoleFlags     : Uint4B
   +0x018 StandardInput    : Ptr32 Void
   +0x01c StandardOutput   : Ptr32 Void
   +0x020 StandardError    : Ptr32 Void
   +0x024 CurrentDirectory : _CURDIR
   +0x030 DllPath          : _UNICODE_STRING
   +0x038 ImagePathName    : _UNICODE_STRING
   +0x040 CommandLine      : _UNICODE_STRING
   +0x048 Environment      : Ptr32 Void
   +0x04c StartingX        : Uint4B
   +0x050 StartingY        : Uint4B
   +0x054 CountX           : Uint4B
   +0x058 CountY           : Uint4B
   +0x05c CountCharsX      : Uint4B
   +0x060 CountCharsY      : Uint4B
   +0x064 FillAttribute    : Uint4B
   +0x068 WindowFlags      : Uint4B
   +0x06c ShowWindowFlags  : Uint4B
   +0x070 WindowTitle      : _UNICODE_STRING
   +0x078 DesktopInfo      : _UNICODE_STRING
   +0x080 ShellInfo        : _UNICODE_STRING
   +0x088 RuntimeData      : _UNICODE_STRING
   +0x090 CurrentDirectores : [32] _RTL_DRIVE_LETTER_CURDIR
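The offsets in this dump can be double-checked by mirroring the first few fields in a struct and asking the compiler. This is a sketch under assumptions: the struct names are hypothetical, every Ptr32 is modelled as a 32-bit integer so the layout does not depend on the host, and _CURDIR is assumed to be a UNICODE_STRING followed by a handle, which matches its 0xc size in the dump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct UnicodeString32 {
    std::uint16_t Length;
    std::uint16_t MaximumLength;
    std::uint32_t Buffer;
};

// Assumed layout of _CURDIR on 32-bit Windows (12 bytes).
struct Curdir32 {
    UnicodeString32 DosPath;
    std::uint32_t   Handle;
};

// The leading fields of _RTL_USER_PROCESS_PARAMETERS, up to Environment.
struct ProcessParameters32 {
    std::uint32_t   MaximumLength;
    std::uint32_t   Length;
    std::uint32_t   Flags;
    std::uint32_t   DebugFlags;
    std::uint32_t   ConsoleHandle;
    std::uint32_t   ConsoleFlags;
    std::uint32_t   StandardInput;
    std::uint32_t   StandardOutput;
    std::uint32_t   StandardError;
    Curdir32        CurrentDirectory;
    UnicodeString32 DllPath;
    UnicodeString32 ImagePathName;
    UnicodeString32 CommandLine;
    std::uint32_t   Environment;
};

// Offset of CommandLine as the compiler lays the struct out.
std::size_t command_line_offset() {
    return offsetof(ProcessParameters32, CommandLine);
}
```

If the mirrored layout is right, command_line_offset() returns 0x40 and the Buffer pointer of CommandLine sits at 0x44, the two offsets that _BaseDllInitialize reads.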

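We now know that a program's command line is stored in the _RTL_USER_PROCESS_PARAMETERS struct, so what remains is to find out when it gets filled in. There are two approaches:

1. Trace CreateProcess().
2. Look for functions related to ProcessParameter.

We take approach 2 to speed things up.

WinDbg's x command can do the search: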

kd> x *!*processparameter*
7c819f9e kernel32!BasePushProcessParameters = <no type information>
7c801488 kernel32!_imp__RtlCreateProcessParameters = <no type information>
7c801484 kernel32!_imp__RtlDestroyProcessParameters = <no type information>
7c9432ec ntdll!RtlDestroyProcessParameters = <no type information>
7c950ea3 ntdll!RtlCheckProcessParameters = <no type information>
7c9433c1 ntdll!RtlCreateProcessParameters = <no type information>

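All of these functions look like candidates. Judging by its name, BasePushProcessParameters() is the most likely to be the function with which the creator sets up the process parameters of the process being created. To test this idea, we switch WinDbg's current context to cmd.exe and set a breakpoint: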

kd> !process 0 0
**** NT ACTIVE PROCESS DUMP ****
...
PROCESS 8200d7e8  SessionId: 0  Cid: 0194    Peb: 7ffdf000  ParentCid: 0608
    DirBase: 08940240  ObjectTable: e1533828  HandleCount:  33.
    Image: cmd.exe

kd> .process /r /P /p 8200d7e8

kd> bp 7c819f9e

kd> g
Breakpoint 1 hit
kernel32!BasePushProcessParameters:
7c819f9e 68c0020000      push    2C0h

kd> k
ChildEBP RetAddr  
0013f028 7c8199bc kernel32!BasePushProcessParameters
0013fa88 7c80235e kernel32!CreateProcessInternalW+0x184e
0013fac0 4ad031dd kernel32!CreateProcessW+0x2c
0013fc20 4ad02db0 cmd!ExecPgm+0x22b
0013fc54 4ad02e0e cmd!ECWork+0x84
0013fc6c 4ad05f9f cmd!ExtCom+0x40
0013fe9c 4ad013eb cmd!FindFixAndRun+0xcf
0013fee0 4ad0bbba cmd!Dispatch+0x137
0013ff44 4ad05164 cmd!main+0x216
0013ffc0 7c816d4f cmd!mainCRTStartup+0x125
0013fff0 00000000 kernel32!BaseProcessStart+0x23

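So far the stack supports our guess: cmd.exe creates CrtDemo05.exe through CreateProcess(), which calls BasePushProcessParameters() to initialize the process parameters. At this point we can also switch to CrtDemo05.exe and check its state, to confirm that it is still being created rather than fully created: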

kd> !process 0 0
**** NT ACTIVE PROCESS DUMP ****
...

PROCESS 8200d7e8  SessionId: 0  Cid: 0194    Peb: 7ffdf000  ParentCid: 0608
    DirBase: 08940240  ObjectTable: e1533828  HandleCount:  35.
    Image: cmd.exe

PROCESS 82476d00  SessionId: 0  Cid: 00c4    Peb: 7ffd4000  ParentCid: 0194
    DirBase: 089402c0  ObjectTable: e1d3bf68  HandleCount:  36.
    Image: conime.exe

PROCESS 8244e980  SessionId: 0  Cid: 05b4    Peb: 7ffdc000  ParentCid: 0400
    DirBase: 08940300  ObjectTable: e16e0700  HandleCount: 123.
    Image: wuauclt.exe

PROCESS 8244a460  SessionId: 0  Cid: 0224    Peb: 7ffd5000  ParentCid: 0194
    DirBase: 08940380  ObjectTable: e17871a0  HandleCount:   1.
    Image: CrtDemo0

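It is a little surprising to see the image name truncated, but that may be because we interrupted the kernel while it was filling the name in. Now let's see whether the command line in the PEB is ready: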

kd> !peb 7ffd5000
PEB at 7ffd5000
error 1 InitTypeRead( nt!_PEB at 7ffd5000)...

kd> .process /r /P /p 8244a460 
Implicit process is now 8244a460
.cache forcedecodeuser done
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffd500c).  Type ".hh dbgerr001" for details
kd> dt _PEB 7ffd5000
nt!_PEB
   +0x000 InheritedAddressSpace : 0 ''
   +0x001 ReadImageFileExecOptions : 0 ''
   +0x002 BeingDebugged    : 0 ''
   +0x003 SpareBool        : 0 ''
   +0x004 Mutant           : 0xffffffff Void
   +0x008 ImageBaseAddress : 0x00400000 Void
   +0x00c Ldr              : (null) 
   +0x010 ProcessParameters : (null) 
   +0x014 SubSystemData    : (null) 
   +0x018 ProcessHeap      : (null) 
   +0x01c FastPebLock      : (null) 
   ...

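Neither method can read the PEB properly, which matches our expectation: CrtDemo05.exe is still being created, and its PEB appears to be in an unformed state as well. Back on the cmd.exe stack, what follows takes some patience. All we know for now is that BasePushProcessParameters() is related to _RTL_USER_PROCESS_PARAMETERS, but we have no good place to set a breakpoint. Stepping through the assembly is one way, but reading the disassembly directly and looking for instructions related to ProcessParameters may be faster, and a quick way to narrow the search is to look for keywords. Using uf to disassemble here may cause some trouble, because CreateProcessInternalW() has been optimized with OMAP and cannot be displayed according to its layout as mapped into memory: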

kd> uf CreateProcessInternalW
Flow analysis was incomplete, some code may be missing
kernel32!CreateProcessInternalW+0x308:
...
...

When this happens, we have to give up disassembling the function as a whole and specify the memory address directly instead:

kd> x kernel32!CreateProcessInternalW
7c8191eb kernel32!CreateProcessInternalW = 

kd> u 7c8191eb L600
kernel32!CreateProcessInternalW:
...
; ProcessParameter
; ebp-234h: _RTL_USER_PROCESS_PARAMETERS*
7c81a11d 8d85ccfdffff    lea     eax,[ebp-234h]
7c81a123 50              push    eax                        ; pProcessParameters
7c81a124 ff158814807c    call    dword ptr [kernel32!_imp__RtlCreateProcessParameters (7c801488)]
7c81a12a 33d2            xor     edx,edx                    ; edx = 0
7c81a12c 3bc2            cmp     eax,edx

...
...
7c81a538 ff158414807c    call    dword ptr [kernel32!_imp__RtlDestroyProcessParameters (7c801484)]

Fortunately, we find:

7c9432ec ntdll!RtlDestroyProcessParameters
7c9433c1 ntdll!RtlCreateProcessParameters

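These two functions let us narrow the suspicious range. The range 7c81a11d ~ 7c81a538 gives us a hint: ebp-234h is a local variable whose address is passed to RtlCreateProcessParameters(), and the return value is compared against zero once the call finishes. So ebp-234h is very likely a pointer to _RTL_USER_PROCESS_PARAMETERS. Going further back, we can also look for other arguments passed to RtlCreateProcessParameters():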

7c81a0d7 ffd6            call    esi
7c81a0d9 8d85b4fdffff    lea     eax,[ebp-24Ch]
7c81a0df 50              push    eax                    ; parameter10
7c81a0e0 8d859cfdffff    lea     eax,[ebp-264h]
7c81a0e6 50              push    eax                    ; parameter9
7c81a0e7 8d85a4fdffff    lea     eax,[ebp-25Ch]
7c81a0ed 50              push    eax                    ; parameter8
7c81a0ee 8d8594fdffff    lea     eax,[ebp-26Ch]
7c81a0f4 50              push    eax                    ; parameter7
7c81a0f5 ffb570fdffff    push    dword ptr [ebp-290h]   ; parameter6
7c81a0fb 8d8584fdffff    lea     eax,[ebp-27Ch]
7c81a101 50              push    eax                    ; parameter5
7c81a102 f7db            neg     ebx
7c81a104 1bdb            sbb     ebx,ebx
7c81a106 8d8578fdffff    lea     eax,[ebp-288h]
7c81a10c 23d8            and     ebx,eax
7c81a10e 53              push    ebx                    ; parameter4
7c81a10f 8d85acfdffff    lea     eax,[ebp-254h]
7c81a115 50              push    eax                    ; parameter3
7c81a116 8d858cfdffff    lea     eax,[ebp-274h]
7c81a11c 50              push    eax                    ; parameter2
7c81a11d 8d85ccfdffff    lea     eax,[ebp-234h]
7c81a123 50              push    eax                    ; parameter1
7c81a124 ff158814807c    call    dword ptr [kernel32!_imp__RtlCreateProcessParameters (7c801488)]

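We see 10 arguments being passed to RtlCreateProcessParameters(). We could disassemble it and look for operations related to the command line, or keep searching the range 7c81a11d ~ 7c81a538 for other operations on ebp-234h. There is a quick filter that might work here, because CommandLine sits at offset 0x40 in _RTL_USER_PROCESS_PARAMETERS: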

kd> dt _RTL_USER_PROCESS_PARAMETERS
ntdll!_RTL_USER_PROCESS_PARAMETERS
   ...
   +0x040 CommandLine      : _UNICODE_STRING
   ...

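So we search for the keyword 40h. The search does turn up a few operations involving 40h but, unfortunately, none of them has anything to do with CommandLine. So we turn our attention back to RtlCreateProcessParameters():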

kd> x ntdll!RtlCreateProcessParameters
7c9433c1 ntdll!RtlCreateProcessParameters

kd> u 7c9433c1 L200
ntdll!RtlCreateProcessParameters:
...
7c9435e1 6a04            push    4                          ; Protect
7c9435e3 6800100000      push    1000h                      ; AllocationType
7c9435e8 8d45cc          lea     eax,[ebp-34h]
7c9435eb 50              push    eax                        ; RegionSize
7c9435ec 53              push    ebx                        ; ZeroBits
7c9435ed 8d45e4          lea     eax,[ebp-1Ch]
7c9435f0 50              push    eax                        ; BaseAddress
7c9435f1 6aff            push    0FFFFFFFFh                 ; ProcessHandle
7c9435f3 e8e69efeff      call    ntdll!ZwAllocateVirtualMemory (7c92d4de)
...

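Because ZwAllocateVirtualMemory() is a documented API, its parameters are easy to map. It quickly becomes clear that the local variable at ebp-1Ch receives the result of the allocation, and shortly afterwards there is a +40h operation on it: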

kd> u 7c9433c1 L200
ntdll!RtlCreateProcessParameters:
...
7c94369a 40              inc     eax
7c94369b 40              inc     eax
7c94369c 50              push    eax
7c94369d 8b45e4          mov     eax,dword ptr [ebp-1Ch]    ; pBase
7c9436a0 83c040          add     eax,40h                    ; pBase + 40h = CommandLine
7c9436a3 57              push    edi
7c9436a4 50              push    eax
7c9436a5 8d45d8          lea     eax,[ebp-28h]
7c9436a8 50              push    eax
7c9436a9 e8ab000000      call    ntdll!RtlpCopyProcString (7c943759)
...

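This appears to match an operation on CommandLine, but we still need more direct evidence. We start with dynamic evidence, that is, proving it with a breakpoint. We first set the breakpoint at add eax,40h: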

kd> bp 7c9436a0

kd> g
Breakpoint 2 hit
ntdll!RtlCreateProcessParameters+0x2f5:
7c9436a0 83c040          add     eax,40h

kd> dd ebp-1c
0013ed00  00380000 0013ecd4 0013edf8 0013f018
0013ed10  7c92ee18 7c943748 00000000 0013f028

kd> dd 00380040 
00380040  00000000 00000000 00010000 00000000
00380050  00000000 00000000 00000000 00000000

Then, after stepping through a few more instructions until RtlpCopyProcString() has returned:

kd> dd 00380040 
00380040  00260024 003805c0 00010000 00000000
00380050  00000000 00000000 00000000 00000000

kd> du 003805c0
003805c0  "CrtDemo05.exe test"
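The two dwords at 00380040 read more easily once they are decoded as a 32-bit UNICODE_STRING (USHORT Length, USHORT MaximumLength, then the Buffer pointer). Below is a minimal sketch of that decoding; UnicodeString32 and decode() are names made up for this illustration, and it assumes the little-endian dword order that dd prints.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit layout of UNICODE_STRING as it appears in the dump (8 bytes).
// The struct name is hypothetical; the field order follows the documented type.
struct UnicodeString32 {
    std::uint16_t Length;         // bytes in use, without the terminating null
    std::uint16_t MaximumLength;  // bytes reserved for Buffer
    std::uint32_t Buffer;         // 32-bit address of the UTF-16 text
};

// Decode two consecutive dwords exactly as `dd` prints them.
inline UnicodeString32 decode(std::uint32_t dword0, std::uint32_t dword1)
{
    UnicodeString32 s;
    s.Length        = static_cast<std::uint16_t>(dword0 & 0xFFFFu); // low word
    s.MaximumLength = static_cast<std::uint16_t>(dword0 >> 16);     // high word
    s.Buffer        = dword1;
    return s;
}
```

Feeding it 00260024 and 003805c0 gives Length = 24h (18 UTF-16 characters, which matches "CrtDemo05.exe test"), MaximumLength = 26h and Buffer = 003805c0. The 26h is the Length plus 2 that the two inc eax instructions produce.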

Next, we can do some static analysis:

kd> u 7c9433c1 L200
ntdll!RtlCreateProcessParameters:
...
7c94370b 8b45e4          mov     eax,dword ptr [ebp-1Ch]
7c94370e 6689988a000000  mov     word ptr [eax+8Ah],bx
7c943715 ff75e4          push    dword ptr [ebp-1Ch]
7c943718 e8f6fbffff      call    ntdll!RtlDeNormalizeProcessParams (7c943313)
7c94371d 8b4d08          mov     ecx,dword ptr [ebp+8]
7c943720 8901            mov     dword ptr [ecx],eax
...

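A suspicious piece of evidence shows up quickly: ebp+8 is the first parameter of RtlCreateProcessParameters(), and the location it points to is assigned from eax. Although eax is loaded from [ebp-1Ch] on the first line, which seems to match our hypothesis, there is a call to RtlDeNormalizeProcessParams() in between, so we cannot be careless and have to look inside: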

kd> uf RtlDeNormalizeProcessParams
Flow analysis was incomplete, some code may be missing
ntdll!RtlDeNormalizeProcessParams:
7c943313 8bff            mov     edi,edi
7c943315 55              push    ebp
7c943316 8bec            mov     ebp,esp
7c943318 8b4508          mov     eax,dword ptr [ebp+8]
7c94331b 85c0            test    eax,eax
7c94331d 7478            je      ntdll!RtlDeNormalizeProcessParams+0x84 (7c943397)
...

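Fortunately, RtlDeNormalizeProcessParams() has no substantial implementation. At the start of the function it only loads eax from [ebp+8], its own first argument, which is exactly the value the caller pushed from [ebp-1Ch]; the remaining operations on it are reads. Having confirmed that RtlDeNormalizeProcessParams() does not change [ebp-1Ch], we can safely go back to RtlCreateProcessParameters():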

kd> u 7c9433c1 L200
ntdll!RtlCreateProcessParameters:
...
7c94345b 8b7d18          mov     edi,dword ptr [ebp+18h]
...
7c943685 e8cf000000      call    ntdll!RtlpCopyProcString (7c943759)
7c94368a 668b07          mov     ax,word ptr [edi]
7c94368d 663b4702        cmp     ax,word ptr [edi+2]
7c943691 0f84f79a0200    je      ntdll!RtlCreateProcessParameters+0x2e9 (7c96d18e)
7c943697 0fb7c0          movzx   eax,ax
7c94369a 40              inc     eax
7c94369b 40              inc     eax
7c94369c 50              push    eax
7c94369d 8b45e4          mov     eax,dword ptr [ebp-1Ch]    ; pBase
7c9436a0 83c040          add     eax,40h                    ; pBase + 40h = CommandLine
7c9436a3 57              push    edi
7c9436a4 50              push    eax
7c9436a5 8d45d8          lea     eax,[ebp-28h]
7c9436a8 50              push    eax
7c9436a9 e8ab000000      call    ntdll!RtlpCopyProcString (7c943759)

The suspicious arguments are:

7c94369a inc eax
7c94369b inc eax
7c94369c push eax
7c94345b mov edi,dword ptr [ebp+18h]
7c9436a3 push edi
7c9436a5 lea eax,[ebp-28h]
7c9436a8 push eax

To find out what is actually assigned to [ebp-1Ch]+40h, we can check with breakpoints, one for each candidate:

kd> bp 7c94369c
kd> g
Breakpoint 5 hit
ntdll!RtlCreateProcessParameters+0x2f1:
7c94369c 50              push    eax
kd> r
eax=00000026 ebx=00000000 ecx=00380034 edx=00000000 esi=00000208 edi=0013edac
eip=7c94369c esp=0013ecd4 ebp=0013ed1c iopl=0         nv up ei pl nz na po cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000203

kd> dpu 0013edac+4 L1
0013edb0  0015adc8 "CrtDemo05.exe test"

kd> bp 7c9436a8
kd> g
Breakpoint 6 hit
kd> dd ebp-28h
0013ecf4  003805c0 00000000 00000634 00380000
0013ed04  0013ecd4 0013edf8 0013f018 7c92ee18

kd> dpu 003805c0
003805c0  00000000

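The breakpoint experiments show that edi is the source of the value assigned to [ebp-1Ch]+40h, and that it is also the fifth parameter of RtlCreateProcessParameters(). We can therefore work out where edi comes from and keep tracing backwards until we reach the original source of the command line.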

kd> uf kernel32!CreateProcessW
kernel32!CreateProcessW:
7c802332 8bff            mov     edi,edi
7c802334 55              push    ebp
7c802335 8bec            mov     ebp,esp
7c802337 6a00            push    0                      ; parameter12
7c802339 ff752c          push    dword ptr [ebp+2Ch]    ; lpProcessInformation
7c80233c ff7528          push    dword ptr [ebp+28h]    ; lpStartupInfo
7c80233f ff7524          push    dword ptr [ebp+24h]    ; lpCurrentDirectory
7c802342 ff7520          push    dword ptr [ebp+20h]    ; lpEnvironment
7c802345 ff751c          push    dword ptr [ebp+1Ch]    ; dwCreationFlags
7c802348 ff7518          push    dword ptr [ebp+18h]    ; bInheritHandles
7c80234b ff7514          push    dword ptr [ebp+14h]    ; lpThreadAttributes
7c80234e ff7510          push    dword ptr [ebp+10h]    ; lpProcessAttributes
7c802351 ff750c          push    dword ptr [ebp+0Ch]    ; lpCommandLine
7c802354 ff7508          push    dword ptr [ebp+8]      ; lpApplicationName
7c802357 6a00            push    0                      ; parameter1
7c802359 e88d6e0100      call    kernel32!CreateProcessInternalW (7c8191eb)
7c80235e 5d              pop     ebp
7c80235f c22800          ret     28h

kd> x kernel32!CreateProcessInternalW
7c8191eb kernel32!CreateProcessInternalW = <no type information>

kd> u 7c8191eb L600
kernel32!CreateProcessInternalW:
...
7c819214 8b4510          mov     eax,dword ptr [ebp+10h]    ; lpCommandLine
7c819217 8985e0f8ffff    mov     dword ptr [ebp-720h],eax
...
7c819958 8b8de0f8ffff    mov     ecx,dword ptr [ebp-720h]   ; lpCommandLine
...
7c819964 ffb56cf9ffff    push    dword ptr [ebp-694h]    ; parameter13
7c81996a ffb500f7ffff    push    dword ptr [ebp-900h]    ; parameter12
7c819970 8a85b7f8ffff    mov     al,byte ptr [ebp-749h]
7c819976 f6d8            neg     al
7c819978 1bc0            sbb     eax,eax
7c81997a 83e002          and     eax,2
7c81997d 50              push    eax                    ; parameter11
7c81997e ff751c          push    dword ptr [ebp+1Ch]    ; parameter10
7c819981 8b85f4f6ffff    mov     eax,dword ptr [ebp-90Ch]
7c819987 0b4520          or      eax,dword ptr [ebp+20h]
7c81998a 50              push    eax                     ; parameter9
7c81998b 8d8560f7ffff    lea     eax,[ebp-8A0h]
7c819991 50              push    eax                     ; parameter8
7c819992 ffb5b0f8ffff    push    dword ptr [ebp-750h]    ; parameter7
7c819998 51              push    ecx                     ; parameter6
7c819999 ffb550f7ffff    push    dword ptr [ebp-8B0h]    ; parameter5
7c81999f ffb5e4f8ffff    push    dword ptr [ebp-71Ch]    ; parameter4
7c8199a5 ffb5b8f7ffff    push    dword ptr [ebp-848h]    ; parameter3
7c8199ab ffb594f9ffff    push    dword ptr [ebp-66Ch]    ; parameter2
7c8199b1 ffb534f7ffff    push    dword ptr [ebp-8CCh]    ; parameter1
7c8199b7 e8e2050000      call    kernel32!BasePushProcessParameters (7c819f9e)

kernel32!BasePushProcessParameters:
...
7c819fd3 8b4d1c          mov     ecx,dword ptr [ebp+1Ch]    ;
7c819fd6 898d5cfdffff    mov     dword ptr [ebp-2A4h],ecx   ; srcCmdLine
...
7c81a045 8d85d0fdffff    lea     eax,[ebp-230h]
7c81a04b 50              push    eax
7c81a04c 8d858cfdffff    lea     eax,[ebp-274h]
7c81a052 50              push    eax
7c81a053 ffd6            call    esi
7c81a055 ffb55cfdffff    push    dword ptr [ebp-2A4h]   ; srcCmdLine
7c81a05b 8d8584fdffff    lea     eax,[ebp-27Ch]
7c81a061 50              push    eax                    ; commandLine
7c81a062 ffd6            call    esi                    ; RtlInitUnicodeString
...
7c81a0cf 50              push    eax
7c81a0d0 8d8594fdffff    lea     eax,[ebp-26Ch]
7c81a0d6 50              push    eax
7c81a0d7 ffd6            call    esi
7c81a0d9 8d85b4fdffff    lea     eax,[ebp-24Ch]
7c81a0df 50              push    eax                        ; runtimeData
7c81a0e0 8d859cfdffff    lea     eax,[ebp-264h]
7c81a0e6 50              push    eax                        ; shellInfo
7c81a0e7 8d85a4fdffff    lea     eax,[ebp-25Ch]
7c81a0ed 50              push    eax                        ; desktopInfo
7c81a0ee 8d8594fdffff    lea     eax,[ebp-26Ch]
7c81a0f4 50              push    eax                        ; windowTitle
7c81a0f5 ffb570fdffff    push    dword ptr [ebp-290h]       ; environment
7c81a0fb 8d8584fdffff    lea     eax,[ebp-27Ch]
7c81a101 50              push    eax                        ; commandLine, parameter5
7c81a102 f7db            neg     ebx
7c81a104 1bdb            sbb     ebx,ebx
7c81a106 8d8578fdffff    lea     eax,[ebp-288h]
7c81a10c 23d8            and     ebx,eax
7c81a10e 53              push    ebx                        ; currentDir
7c81a10f 8d85acfdffff    lea     eax,[ebp-254h]
7c81a115 50              push    eax                        ; dllPath
7c81a116 8d858cfdffff    lea     eax,[ebp-274h]
7c81a11c 50              push    eax                        ; imagePath
; ProcessParameter
; ebp-234h: _RTL_USER_PROCESS_PARAMETERS*
7c81a11d 8d85ccfdffff    lea     eax,[ebp-234h]
7c81a123 50              push    eax                        ; pProcessParameters
7c81a124 ff158814807c    call    dword ptr [kernel32!_imp__RtlCreateProcessParameters (7c801488)]
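The neg/sbb/and sequences in the two listings above (at 7c819976 and at 7c81a102) are a common branchless idiom: neg sets the carry flag when its operand is nonzero, sbb reg,reg turns that carry into a mask of all zeros or all ones, and the and picks the final value. A rough C++ equivalent, as a sketch only; the function names are made up for this illustration:

```cpp
#include <cassert>
#include <cstdint>

// neg al ; sbb eax,eax ; and eax,2
// Result: 2 when al is nonzero, otherwise 0.
inline std::uint32_t flag_to_two(std::uint8_t al)
{
    std::uint32_t carry = (al != 0) ? 1u : 0u;  // neg sets CF for a nonzero operand
    std::uint32_t mask  = 0u - carry;           // sbb eax,eax -> 0 or 0xFFFFFFFF
    return mask & 2u;                           // and eax,2
}

// neg ebx ; sbb ebx,ebx ; and ebx,eax
// Result: eax when ebx is nonzero, otherwise 0 (for a pointer, NULL).
inline std::uint32_t select_if_nonzero(std::uint32_t ebx, std::uint32_t eax)
{
    std::uint32_t carry = (ebx != 0) ? 1u : 0u;
    std::uint32_t mask  = 0u - carry;
    return mask & eax;
}
```

Read this way, parameter11 is either 0 or 2, and the currentDir argument is either the address ebp-288h or NULL depending on whether ebx was nonzero.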

Finally, we have not yet looked at how the creator finishes building the PEB of the process being created. That happens shortly after RtlCreateProcessParameters(), through NtWriteVirtualMemory().

7c81a1eb 8b35f413807c    mov     esi,dword ptr [kernel32!_imp__NtWriteVirtualMemory (7c8013f4)]
...
7c81a38b 6a04            push    4                          ; Protect
7c81a38d bb00100000      mov     ebx,1000h
7c81a392 53              push    ebx                        ; AllocationType
7c81a393 8d85c8fdffff    lea     eax,[ebp-238h]
7c81a399 50              push    eax                        ; RegionSize
7c81a39a 52              push    edx                        ; ZeroBits
7c81a39b 8d85c4fdffff    lea     eax,[ebp-23Ch]
7c81a3a1 50              push    eax                        ; BaseAddress
7c81a3a2 ffb580fdffff    push    dword ptr [ebp-280h]       ; hNewProcessHandle
7c81a3a8 8b3d9011807c    mov     edi,dword ptr [kernel32!_imp__NtAllocateVirtualMemory (7c801190)]
7c81a3ae ffd7            call    edi
7c81a3b0 898574fdffff    mov     dword ptr [ebp-28Ch],eax
7c81a3b6 8b85c8fdffff    mov     eax,dword ptr [ebp-238h]
7c81a3bc 898558fdffff    mov     dword ptr [ebp-2A8h],eax
7c81a3c2 83bd74fdffff00  cmp     dword ptr [ebp-28Ch],0
7c81a3c9 0f8c04950200    jl      kernel32!BasePushProcessParameters+0x504 (7c8438d3)
7c81a3cf 8b8dccfdffff    mov     ecx,dword ptr [ebp-234h]
7c81a3d5 8901            mov     dword ptr [ecx],eax
7c81a3d7 f6452b10        test    byte ptr [ebp+2Bh],10h
7c81a3db 0f85fa940200    jne     kernel32!BasePushProcessParameters+0x51d (7c8438db)
7c81a3e1 f6452b20        test    byte ptr [ebp+2Bh],20h
7c81a3e5 0f85ff940200    jne     kernel32!BasePushProcessParameters+0x52d (7c8438ea)
7c81a3eb f6452b40        test    byte ptr [ebp+2Bh],40h
7c81a3ef 0f8504950200    jne     kernel32!BasePushProcessParameters+0x53d (7c8438f9)
7c81a3f5 6a00            push    0                          ; nBytesWritten
7c81a3f7 8b85ccfdffff    mov     eax,dword ptr [ebp-234h]
7c81a3fd ff7004          push    dword ptr [eax+4]          ; nBytesToWrite    : pProcParameters->Length
7c81a400 50              push    eax                        ; Buffer           : pProcParameters
7c81a401 ffb5c4fdffff    push    dword ptr [ebp-23Ch]       ; BaseAddress      :
7c81a407 ffb580fdffff    push    dword ptr [ebp-280h]       ; hNewProcessHandle:
                                 ; NtWriteVirtualMemory( *(ebp-280h), *(ebp-23Ch), eax, ???, 0 )
7c81a40d ffd6            call    esi                    ; _imp__NtWriteVirtualMemory

A reverse-engineered prototype of RtlCreateProcessParameters() can be found on the web, and it speeds up the tracing:

NTSTATUS RtlCreateProcessParameters( 
    PRTL_USER_PROCESS_PARAMETERS *ProcessParameters,
    PUNICODE_STRING     ImagePathName,
    PUNICODE_STRING     DllPath,
    PUNICODE_STRING     CurrentDirectory,
    PUNICODE_STRING     CommandLine,
    PWSTR               Environment, // Not sure
    PUNICODE_STRING     WindowTitle,
    PUNICODE_STRING     DesktopInfo,
    PUNICODE_STRING     ShellInfo,
    PUNICODE_STRING     RuntimeData );

If you prefer the hardcore route and want to work it out yourself, that is possible too; you can try the approach in this article : )

Summary

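A Trip to Yilan

Another excuse for the urge to take tomorrow off just occurred to me: sorting photos. This is something that runs on impulse and feeling, and once the moment has passed it is gone!

The first stop was the Lanyang Museum: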

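The Lanyang Museum has a very distinctive exterior, a modern shape set in a plain, unpretentious place, which brings to mind the Sydney Opera House and the Shihsanhang Museum! The Opera House, though, with its arcs like slices of orange, is a little gentler, not as sharp and sky-piercing as the Lanyang Museum; the Shihsanhang Museum, for its part, looks like an armada.

Inside, the museum was anything but harmonious: a sea of people, heads bobbing everywhere. It feels as though in recent years everyone has come to realize that summer is museum season! Parents bring their kids to the museum for a walk, and couples can come for an educational outing.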

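A crowd that dense can thoroughly crush a person's resolve to queue! Walking out to the terrace outside the museum, we could see bubbles drifting through the air from far away: a little kid outside was playing with an electric bubble machine. Ha, my friend and I could not help exclaiming together: so cool! All my childhood memories of bubbles are packed into a green bottle that cost ten dollars. Seeing bubbles dance in the wind was impossible on your own, since with one small hand holding the bottle and the other darting between mouth and bottle, you could never be faster than the bubbles vanished. Blow as hard as you can! With poor stamina, you would blow until you ran out of oxygen, saw stars, and foamed at the mouth. The grandpa next to him even had a proper refill bottle; when this big brother was little, he always sneaked his mom's dish soap. Guess how many eyeballs this kid's secret weapon drew?

I photographed the little boy for a while and he left a rather good impression. He just played, waving his arms and dancing, with soft laughter mixed in. When his grandpa or his sister tried to take the bubble machine from his hands, he did not flip out and cry at you or roll around on the ground. What a well-behaved, clever kid! Compared with the children I have run into on the bus lately, with their screaming and yelling, it is a world of difference, and with the parents standing by and not correcting them at all, it is really ooxx. I am ashamed to say that my sense of justice only spread in my heart; I am just a little coward too : (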

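He sprays Grandpa with the bubble machine, then sprays me, a stranger!

(By the way, this picture was made in PowerPoint. After trying Photoshop and Illustrator, I found PowerPoint is still the handier tool. What can I say, I am an engineer!)

The second stop was the King Car Kavalan Whisky Distillery in Yilan.

The scenery may not be majestic enough to reach the level of "the waters of the Yellow River come down from heaven", but there is probably a slight resemblance.

Presumably this valley is the legendary water source for the whisky?! King Car using Yilan's fine mountains and waters to make whisky is a perfect match in its own way.

Poetic scenery paired with whisky, not bad at all!

(As a good citizen I should echo the public service message: don't drink if you drive, and don't drive if you have been drinking!)

If you arrive at the distillery around the top of the hour you can join the guided tour, which mainly walks everyone through the distillation plant and the spirits castle (酒堡).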

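Accommodation: Kite B&B (風箏民宿), http://kite.ilanbnb.tw/

Apart from a ceiling built for hobbits, Kite is a nice B&B: a distinctive design, close to the Dongshan River, bicycles provided, barbecue and camping allowed, and the owner and his wife are both friendly, and Yilan natives at that!

ps. These little cabins are all built over the water!

That one, a truly outstanding guest!!

Time to cycle along the Dongshan River.

It was hot, so we had to look for shade. Look at this clever friend: while the tour bus driver took a smoke break, he hid in the shadow of the bus, peeking out now and then to see whether the driver had finished his cigarette.

Beyond this big mountain must be Taipei, right?!

Not such a pretty thought: from now on Yilan is Taipei's private preserve! Come on then, day-trippers! (ps. We stayed for two days.)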

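We happened to run into the water games on the lawn!

Goofing Around

Probably the most fun part of sorting the photos!

A grad student's troubles!

A handsome guy's troubles!

Catching a blade bare-handed

Cast

My awesome colleague, the driver and photographer, and the sentimental grad student. Oh, and the mysterious colleague's girlfriend (not shown, to protect the ladies). Without noticing it, we have known each other for more than ten years. Eason Chan: "Ten years ago I did not know you... ten years later we are friends..." The song does not fit at all, but whenever I think of knowing someone for ten years, I start humming it in my head like a broken record!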

About Sharing

Sharing means emptying yourself out first, then filling yourself up again from your audience.

Reduce NRV to RV

It probably does not take paranoia for people to feel, to some degree, uneasy about a function that returns by value!

class MyClass {
    // ...
};

MyClass foo()
{
    // ...
}

int main()
{
    MyClass obj = foo();
}

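The concern is not unfounded. Read literally, when foo() finishes, its result is placed in a temporary MyClass object, and that temporary is then passed to MyClass's copy constructor to initialize obj. If that is really what happens, we have to be careful with return-by-value functions, because they introduce a temporary object and involve a copy. Creating and destroying the temporary can have a large impact on performance. For example, if MyClass is a 1000x1000 array, its construction and destruction involve a lot of heap allocation and deallocation, not to mention that the data still has to be transferred to another object through the copy constructor.

Things are not as bad as they seem, though. Many compilers today can optimize return-by-value to some extent and suppress the temporary object and the copy. The C++ Standard calls this copy elision; it is more commonly known as return value optimization (RVO).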

Basic RVO

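Our goal is to suppress the copy, so the best case is that the computation happens on the destination object that stores the result from the very start. To achieve this, most compilers use a simple approach: they rewrite the prototype of the return-by-value function, turning it from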

MyClass foo()
{
   // ...
}

into:

void foo( MyClass& result )
{
    // ...
}

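This makes it possible to operate directly on the destination object inside the return-by-value function. Some teams, to avoid a performance drop, adopt a coding standard like this:

Do not let a function return an object; if you really need to, pass the object to the function as an argument and let the function work on it.

Moving the destination object from the return value to the argument list is like optimizing by hand. But even when the destination object can be operated on directly, making the right optimization decision is still not easy. Consider this example from N1377: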

A f( bool b )  
{  
    A a1, a2;  
    // ...  
    return b ? a1 : a2;  
}

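The value of the destination object may come from different control flows and different source objects. If the compiler has no good strategy for generating efficient operations on the destination object, it may be forced to give up on RVO. For this reason, communities and compilers often split RVO into two categories when discussing it: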

Named return value optimization (NRVO): the destination object comes from a named variable. The N1377 example above falls under NRVO.
Unnamed return value optimization: the name URVO does not seem to be in common use. The destination object comes from an unnamed variable, usually a temporary object that appears directly in the return statement.
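To make the three shapes of return statement concrete, here is a minimal sketch. It is not taken from N1377 or from any compiler's documentation; Tracked, make_unnamed(), make_named() and make_conditional() are made-up names, and the move constructor means it needs a C++11 compiler. It simply counts how often the copy and move constructors run.

```cpp
#include <cassert>

// A tiny class that counts its own copies and moves (hypothetical helper).
struct Tracked {
    static int copies;
    static int moves;
    int value;

    explicit Tracked(int v = 0) : value(v) {}
    Tracked(const Tracked& other) : value(other.value) { ++copies; }
    Tracked(Tracked&& other) noexcept : value(other.value) { ++moves; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

// Unnamed return value: the temporary in the return statement can be
// constructed directly in the destination object.
Tracked make_unnamed()
{
    return Tracked(1);
}

// Named return value: whether the copy/move is elided is up to the compiler.
Tracked make_named()
{
    Tracked t(2);
    t.value += 1;
    return t;
}

// Two named candidates chosen at run time: the compiler cannot decide up
// front which one should live in the destination, so one copy is made.
Tracked make_conditional(bool b)
{
    Tracked a1(10), a2(20);
    return b ? a1 : a2;
}
```

Calling make_conditional() costs one copy construction, because the conditional expression is an lvalue and neither a1 nor a2 can be picked as the destination in advance. What make_named() costs depends on the compiler and its settings, which is a good reason to measure on your own toolchain.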

In general, compilers support URV optimization better than NRV optimization, because URV cases rarely involve complicated control flow. The code usually looks like:

MyClass foo()
{
    return MyClass( "foo()" );
}

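That simplicity makes it easy to optimize, so URVO is readily available even with g++ without optimization or in VC 2005 debug mode (where optimization is off by default). Still, how RVO is applied and how much it helps depends on the compiler. The best strategy is to get to know the compiler you work with and write some programs to test it. While writing this article I tested code like this: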

Config get( const std::string& input, const std::string& value )
{
    Config config;

    config.setName( input );
    config.setValue( value );
    config.setValue( 10 );

    return config;
}

This get() is really just a contrived variant of:

Config get( const std::string& input, const std::string& value )
{
    return Config( input, value );
}

Even so, the NRV quickly makes VC 2005 debug mode give up on RVO, while g++ 4.4.1 -O0 still performs RVO. For more details on NRVO in VC 2005, see the article Named Return Value Optimization in Visual C++ 2005.

Reduce NRV to RV

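As mentioned above, NRVO is a relatively complex optimization and not every compiler provides it. Does that mean we can only compromise and turn return-by-value into pass-by-reference, manually imitating the compiler's transformation to reduce copies? No. There is an interesting alternative: convert the NRV into a URV and let the compiler perform URVO. All it takes is a special constructor that carries out the computation. Recall the principle mentioned at the start:

The best case is that the computation happens on the destination object that stores the result from the very start.

Reduce NRV to RV is an extension of that principle: the class designer gives the class the ability to compute directly on a class object. In practice, the class designer provides one constructor, or a set of constructors, so that the computation is carried out directly by the class object.

For example, an NRV function: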

MyClass foo( int input, int filter )
{
    MyClass result;
    // Operations that involve result, input, filter
    return result;
}

becomes:

MyClass foo( int input, int filter )
{
    return MyClass( input, filter );
}

because the designer of MyClass has moved the computation that foo() should perform into a constructor:

class MyClass {
public: 
    MyClass( int input, int filter )
    {
        // Initialize by input and filter
    }
};

So what is the downside? Well, a rather unnatural constructor that somewhat violates data abstraction, and MyClass may end up littered with special-purpose constructors of this kind.
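As a small end-to-end illustration of the reduction, here is a sketch with a made-up class. Samples, foo_named() and foo_unnamed() are hypothetical names, and the "computation" (filling eight slots with (input + i) % filter) is only a stand-in for real work. The copy counter shows what the constructor form buys.

```cpp
#include <cassert>
#include <cstddef>

class Samples {
public:
    static const std::size_t kSize = 8;
    static int copies;  // counts copy constructions

    Samples()
    {
        for (std::size_t i = 0; i < kSize; ++i) data_[i] = 0;
    }

    // The "special" constructor: the computation runs directly on the
    // object being constructed. filter must be nonzero.
    Samples(int input, int filter)
    {
        for (std::size_t i = 0; i < kSize; ++i)
            data_[i] = (input + static_cast<int>(i)) % filter;
    }

    Samples(const Samples& other)
    {
        for (std::size_t i = 0; i < kSize; ++i) data_[i] = other.data_[i];
        ++copies;
    }

    int  at(std::size_t i) const  { return data_[i]; }
    void set(std::size_t i, int v) { data_[i] = v; }

private:
    int data_[kSize];
};
int Samples::copies = 0;

// NRV form: a named local is filled in and then returned.
Samples foo_named(int input, int filter)
{
    Samples result;
    for (std::size_t i = 0; i < Samples::kSize; ++i)
        result.set(i, (input + static_cast<int>(i)) % filter);
    return result;
}

// URV form: the same computation, moved into the constructor.
Samples foo_unnamed(int input, int filter)
{
    return Samples(input, filter);
}
```

Both functions produce the same contents. The difference is that foo_unnamed() relies only on URVO, which the text above notes is available even in unoptimized builds, while foo_named() avoids the copy only if the compiler performs NRVO.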

A Mat8x8 Sample

If you want an example, take a look at Faster C++ Operations [1], an article describing this manual optimization technique. It uses Mat8x8 as its example:

class Mat8x8 {
    double  M[ 8*8 ];
public:
    Mat8x8()                      { memset( M, 0, 64*sizeof(double) ); }
    Mat8x8( const Mat8x8& other ) { memcpy( M, other.M, 512 ); }

    double& operator() ( int row, int col ) { return M[ row*8 + col ]; }

    Mat8x8& operator = ( const Mat8x8& rhs );
    Mat8x8  operator + ( const Mat8x8& rhs );
    Mat8x8  operator - ( const Mat8x8& rhs );
    Mat8x8  operator * ( double rhs );
    Mat8x8  operator / ( double rhs );
};

Mat8x8 supports many of the built-in operators; we will only look at +:

Mat8x8 Mat8x8::operator + ( const Mat8x8& other ) {
    Mat8x8 Sum;                          // named local result, hence an NRV
    for( int i=0; i<64; ++i )
        Sum.M[i] = M[i] + other.M[i];
    return Sum;
}

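operator +() is a typical NRV function. On a compiler without NRVO it may produce code that hurts performance, so following the reduction described earlier we need to give Mat8x8 a special constructor. Based on the requirements, this constructor takes two arguments. Mat8x8 offers more than just +, though, so it is best to handle -, * and / as well. The approach in Faster C++ Operations [1] is therefore to use the classically elegant function pointer. After the rewrite it becomes: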

class Mat8x8 {
    typedef void (*PFnInitMat)( Mat8x8& Mat, void* pLHS, void* pRHS );

    // The special constructor help us to reduce NRV to RV
    Mat8x8( PFnInitMat Init, void* pLHS, void* pRHS )
    {
        Init( *this, pLHS, pRHS );
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };
    Mat8x8()                      { memset( A, 0, 64*sizeof(double) ); }
    Mat8x8( const Mat8x8& other ) { memcpy( A, other.A, 512 ); }

    Mat8x8& operator = ( const Mat8x8& rhs );
    Mat8x8  operator + ( const Mat8x8& rhs );
    Mat8x8  operator - ( const Mat8x8& rhs );
    Mat8x8  operator * ( double rhs );
    Mat8x8  operator / ( double rhs );
};

void AddMat88( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

void SubMat88( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

void DivMat88scalar( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

void MulMat88scalar( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

PFnInitMat is a function pointer type. It stands for a family of functions that take two operands and write the result into a given object; the family members are AddMat88(), SubMat88(), DivMat88scalar() and MulMat88scalar(). Mat8x8 now has a special constructor which, in addition to the two operands, takes a function pointer that performs the actual operation. With the help of this constructor, the operators become simple forwarding functions:

Mat8x8 Mat8x8::operator + ( const Mat8x8& rhs ) {
    return Mat8x8( AddMat88, this, (void*)&rhs );
    }

Mat8x8 Mat8x8::operator - ( const Mat8x8& rhs ) {
    return Mat8x8( SubMat88, this, (void*)&rhs );
    }

Mat8x8 Mat8x8::operator * ( double rhs ) {
    return Mat8x8( MulMat88scalar, this, (void*)&rhs );
    }

Mat8x8 Mat8x8::operator / ( double rhs ) {
    return Mat8x8( DivMat88scalar, this, (void*)&rhs );
    }

Generic Revision

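Having seen the approach in Faster C++ Operations [1], is there anything you are curious about? COdE fr3@K and I felt that a few things were worth experimenting with:

Type safety: Faster C++ Operations [1] uses void* as the parameter type of the functions that do the actual computation. Inside functions such as AddMat88(), the void* therefore has to be cast to the appropriate type, which loses a bit of type safety. Perhaps this is nitpicking, but to quote the author: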
I just used void* for this example - they make programming flexible (under great responsibility).
Generic and parameterized operations: the functions that do the actual computation are passed to the constructor as function pointers. What if we used some of C++'s generic mechanisms instead?

Generic and Parameterization Operations

class Mat8x8 {
    struct plus_tag {
    };

    struct minus_tag {
    };

    struct division_tag {
    };

    struct multiplication_tag {
    };

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, plus_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] + rhs.A[ i ];
        }
    }

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, minus_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] - rhs.A[ i ];
        }
    }

    Mat8x8( const Mat8x8& lhs, double rhs, multiplication_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] * rhs;
        }
    }

    Mat8x8( const Mat8x8& lhs, double rhs, division_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] / rhs;
        }
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };

    Mat8x8()
    {
        memset( A, 0, 64*sizeof(double) );
    }

    Mat8x8( const Mat8x8& other )
    {
        cout << "Copy ctor\n";
        memcpy( A, other.A, 512 );
    }

    Mat8x8 operator + ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, plus_tag() );
    }
    Mat8x8 operator - ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, minus_tag() );
    }
    Mat8x8 operator * ( double rhs )
    {
        return Mat8x8( *this, rhs, multiplication_tag() );
    }
    Mat8x8 operator / ( double rhs )
    {
        return Mat8x8( *this, rhs, division_tag() );
    }
};

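A few tag classes help us indicate which operation the constructor is asked to perform. This is only the first step, though: the current implementation does not yet let users of Mat8x8 customize the operations. We still need a change like this: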

template <typename OperationT>
class Mat8x8 : public OperationT {
    struct plus_tag {
    };

    struct minus_tag {
    };

    struct division_tag {
    };

    struct multiplication_tag {
    };

    // this-> makes each call a dependent name, so a conforming compiler
    // looks it up in the OperationT base class at instantiation time.
    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, plus_tag )
    {
        this->add( A, lhs.A, rhs.A );
    }

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, minus_tag )
    {
        this->minus( A, lhs.A, rhs.A );
    }

    Mat8x8( const Mat8x8& lhs, double rhs, multiplication_tag )
    {
        this->multiply( A, lhs.A, rhs );
    }

    Mat8x8( const Mat8x8& lhs, double rhs, division_tag )
    {
        this->divide( A, lhs.A, rhs );
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };

    Mat8x8()
    {
        memset( A, 0, 64*sizeof(double) );
    }

    Mat8x8( const Mat8x8& other )
    {
        cout << "Copy ctor\n";
        memcpy( A, other.A, 512 );
    }

    Mat8x8 operator + ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, plus_tag() );
    }
    Mat8x8 operator - ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, minus_tag() );
    }
    Mat8x8 operator * ( double rhs )
    {
        return Mat8x8( *this, rhs, multiplication_tag() );
    }
    Mat8x8 operator / ( double rhs )
    {
        return Mat8x8( *this, rhs, division_tag() );
    }
};

A policy-based class provides the operations and lets users of Mat8x8 choose which matrix computation to use. For example, the most basic operations can be provided by the MatOp class:

struct MatOp {
    template <typename ValueT,
              size_t size>
    static void add( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], const ValueT (&rhs)[ size ] )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] + rhs[ i ];
        }
    }

    template <typename ValueT,
              size_t size>
    static void minus( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], const ValueT (&rhs)[ size ] )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] - rhs[ i ];
        }
    }

    template <typename ValueT,
              size_t size>
    static void multiply( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], ValueT rhs )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] * rhs;
        }
    }

    template <typename ValueT,
              size_t size>
    static void divide( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], ValueT rhs )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] / rhs;
        }
    }
};
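To show what choosing the matrix computation could look like in practice, here is a deliberately small sketch. PlainOp mirrors the shape of MatOp::add above, while ClampedOp and the three-element Vec3 are made up for this illustration and are not part of the original design. The array size is deduced from the reference-to-array parameters, just as in MatOp.

```cpp
#include <cassert>
#include <cstddef>

// Same shape as MatOp::add: element-wise addition of fixed-size arrays.
struct PlainOp {
    template <typename ValueT, std::size_t size>
    static void add( ValueT (&result)[ size ],
                     const ValueT (&lhs)[ size ],
                     const ValueT (&rhs)[ size ] )
    {
        for ( std::size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] + rhs[ i ];
        }
    }
};

// Hypothetical alternative policy: addition that saturates at 100.
struct ClampedOp {
    template <typename ValueT, std::size_t size>
    static void add( ValueT (&result)[ size ],
                     const ValueT (&lhs)[ size ],
                     const ValueT (&rhs)[ size ] )
    {
        const ValueT limit = ValueT( 100 );
        for ( std::size_t i = 0; i < size; ++i ) {
            ValueT sum = lhs[ i ] + rhs[ i ];
            result[ i ] = ( sum > limit ) ? limit : sum;
        }
    }
};

// A stand-in for Mat8x8, parameterized by the operation policy.
template <typename OperationT>
struct Vec3 {
    double A[ 3 ];

    Vec3 plus( const Vec3& rhs ) const
    {
        Vec3 out;
        OperationT::add( out.A, A, rhs.A );  // the policy does the arithmetic
        return out;
    }
};
```

Switching from Vec3<PlainOp> to Vec3<ClampedOp> changes the arithmetic without touching the container, which is the kind of customization the policy parameter is meant to allow.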

2010/05/03 Update

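I ran into COdE fr3@K today, and he gave me a few suggestions:

Avoid inheritance, a high-coupling mechanism, especially when the relationship you want to express is not is-a but is-implemented-in-terms-of.
Give the class template its own name, and use a typedef to declare a separate name for everyday use.

So, today's homework XD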

template <typename OperationT>
class Mat8x8Base {
    struct plus_tag {
    };

    struct minus_tag {
    };

    struct division_tag {
    };

    struct multiplication_tag {
    };

    Mat8x8Base( const Mat8x8Base& lhs, const Mat8x8Base& rhs, plus_tag )
    {
        OperationT::add( A, lhs.A, rhs.A );
    }

    Mat8x8Base( const Mat8x8Base& lhs, const Mat8x8Base& rhs, minus_tag )
    {
        OperationT::minus( A, lhs.A, rhs.A );
    }

    Mat8x8Base( const Mat8x8Base& lhs, double rhs, multiplication_tag )
    {
        OperationT::multiply( A, lhs.A, rhs );
    }

    Mat8x8Base( const Mat8x8Base& lhs, double rhs, division_tag )
    {
        OperationT::divide( A, lhs.A, rhs );
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };

    Mat8x8Base()
    {
        memset( A, 0, 64*sizeof(double) );
    }

    Mat8x8Base( const Mat8x8Base& other )
    {
        cout << "Copy ctor\n";
        memcpy( A, other.A, 512 );
    }

    // Inside the template the typedef Mat8x8 is not declared yet,
    // so the operators construct Mat8x8Base directly.
    Mat8x8Base operator + ( const Mat8x8Base& rhs )
    {
        return Mat8x8Base( *this, rhs, plus_tag() );
    }
    Mat8x8Base operator - ( const Mat8x8Base& rhs )
    {
        return Mat8x8Base( *this, rhs, minus_tag() );
    }
    Mat8x8Base operator * ( double rhs )
    {
        return Mat8x8Base( *this, rhs, multiplication_tag() );
    }
    Mat8x8Base operator / ( double rhs )
    {
        return Mat8x8Base( *this, rhs, division_tag() );
    }
};

typedef Mat8x8Base<MatOp> Mat8x8;

Mat8x8Base replaces the original class template name Mat8x8, and a typedef declares the commonly used name, Mat8x8. Inheritance is gone; add(), minus() and the other functions are now called through qualified names.

Salvation in C++0x

Really, all that talk! The real point is that C++0x provides move semantics to help us escape from copy hell. From N1377:

Move semantics is mostly about performance optimization: the ability to move an expensive object from one address in memory to another, while pilfering resources of the source in order to construct the target with minimum expense.

If you are interested, see the series of articles COdE fr3@K wrote on move and r-value references:

C++0x: Rvalue References: http://fsfoundry.org/codefreak/2008/11/16/cpp0x-rvalue-references/
C++0x: More on Rvalue References: http://fsfoundry.org/codefreak/2008/11/16/cpp0x-more-on-rvalue-references/
I Like to Move It: http://fsfoundry.org/codefreak/2009/05/19/i-like-to-move-it/
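For a quick taste of what "pilfering resources of the source" means, here is a minimal sketch using the syntax as it was eventually standardized in C++11. Buffer is a made-up class standing in for something expensive like the 1000x1000 array mentioned earlier; it is not taken from N1377 or from the articles above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

class Buffer {
    double*     data_;
    std::size_t size_;

public:
    explicit Buffer(std::size_t n) : data_(new double[n]()), size_(n) {}
    ~Buffer() { delete[] data_; }

    // Copy: allocate a new block and duplicate every element.
    Buffer(const Buffer& other)
        : data_(new double[other.size_]), size_(other.size_)
    {
        for (std::size_t i = 0; i < size_; ++i) data_[i] = other.data_[i];
    }

    // Move: steal the block and leave the source empty but destructible.
    Buffer(Buffer&& other) noexcept
        : data_(other.data_), size_(other.size_)
    {
        other.data_ = nullptr;
        other.size_ = 0;
    }

    // Assignment via copy-and-swap; handles both copy and move assignment.
    Buffer& operator=(Buffer other) noexcept
    {
        std::swap(data_, other.data_);
        std::swap(size_, other.size_);
        return *this;
    }

    std::size_t   size() const { return size_; }
    const double* data() const { return data_; }
    double& operator[](std::size_t i) { return data_[i]; }
};
```

Constructing one Buffer from std::move of another transfers the heap block without touching its elements, so even when a compiler cannot elide a return-by-value copy, the fallback is a cheap pointer swap rather than a full duplication.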

Furthermore

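RVO vs. NRVO vs. URVO?
As mentioned earlier, the term URVO is rarely used. This article makes the distinction simply because RVO seems to be the broader optimization that covers both NRVO and URVO, with URVO being just one of the simpler forms of RVO. Many articles that discuss RVO may only be dealing with URVO, or they describe it as applying RVO to unnamed temporaries [6].
Why does the C++ Standard define copy elision specifically?
Both C++03 and C++0x address copy elision explicitly. The C++0x draft words it as follows: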

When certain criteria are met, an implementation is allowed to omit the copy/move construction of a class object, even if the copy/move constructor and/or destructor for the object have side effects. … . This elision of copy/move operations, called copy elision, is permitted in the following circumstances

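Once copy elision is applied, the side effects that would otherwise occur when copying the return value no longer happen. It is rare, but some code may really depend on such side effects; object counting might be one example?! For the sake of efficiency, though, RVO is necessary, and the way to settle the dispute is to spell it out clearly in the standard.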