原文：GCC Bug

哈，這次老師給我的作業可是難題，有好一陣子沒有碰觸 gdb ，更何況這次還要跟 FPU 打交道，下面是我的測試程式，和原文只有小部份差異，但重點部份是一樣的：

#include <cstdio>
using namespace std;

double f(double z)
{
    return z;
}

void foo(double fraction)
{
    double z = fraction * 20.0;
    int t1 = z;
    int t2 = f(z);
    printf("t1=%d, t2=%d\n", t1, t2);
}

int main()
{
    foo(6.0/20.0);
    return 0;
}

問題

輸出：

6, 6

是我們預期的結果。但在某些情況下卻會產生

5, 6

這樣的輸出。但若是在 compile 時加上 -ffloat-store 則又能得到正確的結果。根據 gcc optimization options 的說明：

Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.

我們懷疑是從 register 讀出時發生意想不到的進位，導致結果有所出入。

實驗

一開始在嘗試 reproduce 時就遇上問題，在和 aaa 討論後才發現，我們使用的 gcc 版本和 compile options 都不盡相同。最後確定了發生問題的環境：

cygwin + gcc 3.4.4
gcc –O text.c

下面是我嘗試過的平台和結果：

	-O	-O3
gcc 4.4.1	No	No
gcc 3.4.4 (cygwin)	Yes	No

有趣的開始！

O3 版本

g++ -Wall -O3 -S -masm=intel main.cpp

這個版本是最好理解的， compiler 已經最佳化到：

不呼叫 foo()、f()，直接以常數取代。（沒記錯，這算是 compiler const folding 和 const propagation 的範疇）

看一下 assembly 就很清楚：

LC2:
	.ascii "t1=%d, t2=%d\12\0"
	.align 4
...
_main:
    push   ebp
    mov    eax, 16
    mov    ebp, esp
    sub    esp, 24
    and    esp, -16
    call   __alloca
    call   ___main
    mov    DWORD PTR [esp], OFFSET FLAT:LC2 ; 看一眼 LC2 ，一個取址動作 :)
    mov    edx, 6
    mov    eax, 6
    mov    DWORD PTR [esp+8], edx
    mov    DWORD PTR [esp+4], eax
    call   _printf
    leave
    xor    eax, eax
    ret

O 版本

g++ -Wall -O -S -masm=intel main.cpp

O 版本就沒那麼聰明，而是真正的去呼叫 foo() ：

_main:
    push   ebp
    mov    ebp, esp
    sub    esp, 8
    and    esp, -16
    mov    eax, 16
    call   __alloca
    call   ___main
    fld    QWORD PTR LC4
    fstp   QWORD PTR [esp]
    call   __Z3food
    mov    eax, 0
    leave
    ret

而且可以發現，LC4 是 foo() 的參數：

LC4:
	.long	858993459
	.long	1070805811
	.text
	.align 2


0.3 的 IEEE 754 Double precision (64 bits) 表示式：
0 01111111101 00110011001100110011  110011001100110011001100110011
                        1070805811                       858993459

嗯嗯，O 版本先做了簡單的運算展開。不過在進入 foo() 的 floating point 計算前，得先對 x87 FPU 有簡單的概念：

節錄自： Intel® 64 and IA-32 Architectures Software Developer’s Manual ─ Volume 1: Basic Architecture

x87 上用來儲存資料的 registers 被組織成一塊 stack 的，並且搭配了 control, status 等 registers 可供控制、查詢之用。 Stack 以下圖的方式運做：

節錄自： Intel® 64 and IA-32 Architectures Software Developer’s Manual ─ Volume 1: Basic Architecture

Stack 的 top 又稱為 ST(0)，是許多 FPU 指令的隱藏 operand 之一，像是待會在 foo() 看見的 fld、fstp、fmul 指令都會 implicitly 操作 ST(0) 。

__Z3food:
    push    ebp
    mov     ebp, esp
    push    ebx
    sub     esp, 20
    fld     DWORD PTR LC1
    fmul    QWORD PTR [ebp+8]
    fnstcw  WORD PTR [ebp-6]
    movzx   eax, WORD PTR [ebp-6]
    or      ax, 3072
    mov     WORD PTR [ebp-8], ax
    fldcw   WORD PTR [ebp-8]
    fist    DWORD PTR [ebp-12]      ; store integer
    fldcw   WORD PTR [ebp-6]
    mov     ebx, DWORD PTR [ebp-12]
    fstp    QWORD PTR [esp]
    call    __Z1fd
    fnstcw  WORD PTR [ebp-6]
    movzx   eax, WORD PTR [ebp-6]
    or      ax, 3072
    mov     WORD PTR [ebp-8], ax
    fldcw   WORD PTR [ebp-8]
    fistp   DWORD PTR [esp+8]
    fldcw   WORD PTR [ebp-6]
    mov     DWORD PTR [esp+4], ebx
    mov     DWORD PTR [esp], OFFSET FLAT:LC2
    call    _printf
    add     esp, 20
    pop     ebx
    pop     ebp
    ret

這邊使用倒推法，來找到 t1 及其在 assembly 的表示，尋著上頭的 highlight 部份，可以發現：

printf 從 ebx 拿到 t1 的值，也就是 ebx 、
ebx 由 ebp-12 得到、
ebp-12 則是 fist 的結果，也就是 FPU 的 ST(0)、
往上回溯，只有 L6 處有個 fld ，因此，ST(0) 指的是 fmul 的結果、

反推的過程也符合程式的邏輯，所以下一步就是用 debugger 去追蹤：

Breakpoint 1, 0x0040104f in foo ()
1: x/i $pc
0x40104f <_Z3food+7>:   flds   0x402030
(gdb) info float
  R7: Empty   0x3ffd9999999999999800
  R6: Empty   0x00e07c93022200240000
  R5: Empty   0x51c27c92fe956123a74c
  R4: Empty   0x79d37c9a06007c930460
  R3: Empty   0x0a887c9479870022bf80
  R2: Empty   0xbf9800010a886123debc
  R1: Empty   0x00006123a74c7c93043e
=>R0: Empty   0xdee46123de0c6123d8cc

Status Word:         0xffff0000
                       TOP: 0
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0xffffffff
Instruction Pointer: 0x1b:0x004010c8
Operand Pointer:     0xffff0023:0x0022cd10
Opcode:              0xdd1c

我們將中斷點設在 foo() 函式，接著使用 info float 去 dump FPU 相關的 register (FPU 相關 register 似乎無法加到 gdb 的 Automatic Display！？所以每次都需手打 )，比較特別的幾個要點都被 highlight 了：

R0 ～ R7 即剛剛提到的 FPU data register stack ： stack 目前是空的，雖然指向 R0 ，但顯示為 Empty
Control Word 裡頭的 RC （rounding-control）決定了 floating-point 的進位（round）方式。

節錄自： Intel® 64 and IA-32 Architectures Software Developer’s Manual ─ Volume 1: Basic Architecture

認識了 gdb 對於 FPU 的輸出後，可以開始一步步逼近問題核心 ── ebp-12 了。

foo() 的一開始，是讀入 0x402030 上的值到 FPU stack 上：

Breakpoint 1, 0x0040104f in foo ()
1: x/i $pc
0x40104f <_Z3food+7>:   flds   0x402030

好奇這 magic number 是何來的，可以透過之前的 disassembly 結果：

LC1:
	.long	1101004800    ; = 0x41a00000
	.text
	.align 2

或是動態的用 debugger 觀看：

(gdb) x 0x402030
0x402030 <_data_start__+48>:    0x41a00000 ; = 1101004800

而 0x41a00000 便是 20.0 的 IEEE 754 0.3 的 IEEE 754 Double precision (64 bits) 表示式。來用 debugger 確認一下：

Breakpoint 1, 0x0040104f in foo ()
1: x/i $pc
0x40104f <_Z3food+7>:   flds   0x402030
(gdb) info float
  R7: Empty   0x3ffd9999999999999800
  R6: Empty   0x00e07c93022200240000
  R5: Empty   0x51c27c92fe956123a74c
  R4: Empty   0x79d37c9a06007c930460
  R3: Empty   0x0a887c9479870022bf80
  R2: Empty   0xbf9800010a886123debc
  R1: Empty   0x00006123a74c7c93043e
=>R0: Empty   0xdee46123de0c6123d8cc

(gdb) si
0x00401055 in foo ()
1: x/i $pc
0x401055 <_Z3food+13>:  fmull  0x8(%ebp)

(gdb) info float
=>R7: Valid   0x4003a000000000000000 +20
  R6: Empty   0x00e07c93022200240000

執行完上面的指令後， FPU 的 data stack 現在是 20 ，接著進行 x 0.3 的乘法（ebp+0x8 是 foo() 的參數，即 0.3）：

0x401055 <_Z3food+13>:  fmull  0x8(%ebp)

(gdb) si
0x00401058 in foo ()
1: x/i $pc
0x401058 <_Z3food+16>:  fnstcw -0x6(%ebp)
(gdb) info float
=>R7: Valid   0x4001bffffffffffffe00 +6
  R6: Empty   0x00e07c93022200240000

得到了結果：+6 ，無誤。不過接下來的 FPU control word 卻令人意外：

(gdb) info float
...
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest

0x40105f <_Z3food+23>:  or     $0xc00,%ax
0x401063 <_Z3food+27>:  mov    %ax,-0x8(%ebp)
0x401067 <_Z3food+31>:  fldcw  -0x8(%ebp)

(gdb) info float
...
Control Word:        0xffff0f7f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round toward zero

O 版將 RC 從 Round to nearest 改變成 Round toward zero 。這會影響到 fist 的運作。在執行 fist 前，我們把預期的結果： 6.0 的 IEEE 754 Double precision 求出來：0x4018000000000000 ；而目前的 R7 值是 0x4001bffffffffffffe00 ，小於該值。表示 R7 現在是不足 6.0 ，而此時又改以 Round toward zero 來進行取整數，那結果會是 5 則是預期了。

(gdb) si
0x0040106a in foo ()
1: x/i $pc
0x40106a <_Z3food+34>:  fistl  -0xc(%ebp)
(gdb) info float
=>R7: Valid   0x4001bffffffffffffe00 +6
  R6: Empty   0x4bc000010101e259acd0
...
Control Word:        0xffff0f7f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round toward zero

(gdb) x 0x22ccfc              ; ebp - 0xc = 0x22ccfc
0x22ccfc:       0x61100049

(gdb) si
0x0040106d in foo ()
1: x/i $pc
0x40106d <_Z3food+37>:  fldcw  -0x6(%ebp)
(gdb) x 0x22ccfc
0x22ccfc:       0x00000005

執行結果印證了我們的預期。

-ffloat-store 版

模仿取得 O 版 disassembly 的方式來取得 -ffloat-store 的結果：

g++ -Wall -O -ffloat-store -S -masm=intel main.cpp

-ffloat-store 版和 O 版相同，會去呼叫 foo() ，不過比較兩者 foo() 可以發現： -ffloat-store 版只多了兩行，在進行 x 0.3 後，會馬上將值寫回 call stack 上，並再讀入。

也就是，當 6.0 被寫入記憶體時，是以 Round to nearest 的方式寫進：

(gdb) si
0x00401058 in foo ()
1: x/i $pc
0x401058 <_Z3food+16>:  fstpl  -0x10(%ebp)
(gdb) info float
=>R7: Valid   0x4001bffffffffffffe00 +6
  R6: Empty   0xdbc000010101e4ada9d8
...
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest

而讀回 FPU stack 時，已經是個精確的 6.0 ：

(gdb) si
0x0040105e in foo ()
1: x/i $pc
0x40105e <_Z3food+22>:  fnstcw -0x12(%ebp)
(gdb) info float
=>R7: Valid   0x4001c000000000000000 +6
  R6: Empty   0x10000000000400010000
...
Control Word:        0xffff037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest

就愛 2B 筆芯

Re: GCC Bug

問題

實驗

O3 版本

O 版本

-ffloat-store 版

3 則留言:

Windows + Visual Studio + VSCode + CMake 的疑難雜症