或許不用到偏執的地步，人們就會某種程度上、或多或少地對於一個 return-by-value 的 function 感到憂心！

class MyClass {
    // ...
};

MyClass foo()
{
    // ...
}

int main()
{
    MyClass obj = foo();
}

憂慮其來有自：按照字面解讀，當 foo() 結束時，會將其結果放置於一個 temporary MyClass object 裡頭，並且這個 temporary object 會被作為 MyClass 的 copy constructor 的參數來初始化 obj 。若真如此，我們就不得不小心這類 return-by-value 的 function ，因為它引入了一個 temporary object 並且牽涉到 copy，temporary object 的建立與消滅可能對 performance 帶來極大的影響，舉例來說： MyClass 若是一個 1000x1000 的 array ，它的建立與消滅牽涉到大量的 heap allocation/deallocation ，遑論它還需要透過 copy constructor 把資料傳遞到另一個 object 上。

不過事情並非想像中的那麼嚴重，現在許多的 compilers 都已經能某種程度地對 return-by-value 做出最佳化，以抑制 temporary object 和 copy 的發生，C++ Standard 中稱呼為 copy elision ；一般則叫做 return value optimization （RVO）。

Basic RVO

我們的目標是希望能夠抑制 copy 的發生，因此最好的情形就是：打從一開始，所作的運算就是在儲存結果的 destination object 上發生。要做到這點， compiler 大多使用一種簡單明瞭的方式：對 return-by-value 的 function prototype 動手腳，將它從

MyClass foo()
{
   // ...
}

轉換成：

void foo( MyClass& result )
{
    // ...
}

如此一來，便有機會在 return-by-value 的 function 中直接操作 destination object 。在某些團隊裡頭，為了避免 performance drop ，往往會制定這樣的 coding standard ：

不要讓一個 function 回傳一個 object，若真要如此，改將 object 當作 argument 傳給 function 進行處理。

將 destination object 從 return value 轉移到 argument list 上，就像是手工式的最佳化。不過即使能直接操作 destination object ，要能做出正確的最佳化判斷仍然不是件簡單的事，參考一下這個 N1377 的例子：

A f( bool b )  
{  
    A a1, a2;  
    // ...  
    return b ? a1 : a2;  
}

Destination object 的 value 可能來自不同的 control flows 、不同的 source objects，若是 compiler 無法有好的對策來在 destination object 上生成有效率的運算，便可能迫使 compiler 放棄 RVO 。為此， communities 和 compilers 往往又在討論 RVO 時，將它分為兩類：

Named return value optimization (NRVO)： Destination object 來自於一個 named （具名的）的變數，上述 N1377 的例子就屬於 NRVO 的範疇。
Unnamed return value optimization：名稱上似乎沒有 URVO 這說法。 Destination object 來自於一個 unnamed （不具名的）的變數，通常就是那些直接存在於 return statement 上的 temporary objects。

一般而言， compilers 對於 URV 的最佳化支援會比 NRV 好，因為它們往往不具備有太複雜的 control flow ，通常程式碼結構上就像：

MyClass foo()
{
    return MyClass( "foo()" );
}

這般簡單，易於最佳化，因此即使是 g++ 不開啟最佳化或是 VC 2005 的 debug mode （預設是不啟用最佳化）， URVO 都是唾手可得的。不過 RVO 到底如何施行、能帶來多少效益，還是得視 compiler 而定，最好的策略還是得多瞭解自己工作的 compiler ，寫些程式去測試測試。當我寫這篇文章的時候，我測試了一下這樣的的程式碼：

Config get( const std::string& input, const std::string& value )
{
    Config config;

    config.setName( input );
    config.setValue( value );
    config.setValue( 10 );

    return config;
}

即使這個 get() 只是：

Config get( const std::string& input, const std::string& value )
{
    return Config( input, value );
}

的惡搞版，會發現 NRV 很快地就讓 VC2005 debug mode 放棄 RVO ，但 g++4.4.1 -o0 還是能正常執行 RVO 。你可以參考這篇文章：Named Return Value Optimization in Visual C++ 2005，知道更多關於 VC 2005 NRVO 的細節。

Reduce NRV to RV

如前文所說，NRVO 屬於比較複雜的最佳化技術，並非每個 compiler 都提供。那麼在這種情況下，是不是代表著我們就只能妥協，將 return-by-value 變成 pass-by-reference 來手工模擬 compiler 的 transformation ，以減少 copy 的發生呢？不是的，有一個有趣的方式：將 NRV 轉換成 URV ，然後讓 compiler 去執行 URVO 。方法只要我們能提供特殊的 constructor ，並將運算放在裡頭。還記得我們一開始提到的精神：

最好的情形就是：打從一開始，所作的運算就是在儲存結果的 destination object 上發生。

Reduce NRV to RV 就像是這個精神的衍生， class 的設計者為 class 提供能夠直接在 class object 上運算的能力。實作上， class 的設計者會提供一個或一組 constructors 讓運算直接由 class object 完成。

舉例來說，一個 NRV function ：

MyClass foo( int input, int filter )
{
    MyClass result;
    // Operations that involves result, intput, filter
    return result;
}

轉變為：

MyClass foo( int input, int filter )
{
    return MyClass( input, filter );
}

因為 MyClass class 的設計者將 foo() 該執行的運算搬到了 constructor 裡頭：

class MyClass {
public: 
    MyClass( int input, int filter )
    {
        // Initialize by input and filter
    }
};

那麼這麼做的缺點是什麼？嗯，一個不太自然的 constructor ，有點違反 data abstraction，而這個 MyClass 的可能充斥著各種具有這特殊功能的 constructors 。

A Mat8x8 Sample

如果想找個例子，不妨看看 Faster C++ Operations [1]，它是一篇描述這種手工最佳化手法的文章，裡頭使用 Mat8x8 作為例子：

class Mat8x8 {
    double  M[ 8*8 ];
public:
    Mat8x8()                      { memset( M, 0, 64*sizeof(double) ); }
    Mat8x8( const Mat8x8& other ) { memcpy( M, other.M, 512 ); }

    double& operator() ( int row, int col ) { return M[ row*8 + col ]; }

    Mat8x8& operator = ( const Mat8x8& rhs );
    Mat8x8  operator + ( const Mat8x8& rhs );
    Mat8x8  operator - ( const Mat8x8& rhs );
    Mat8x8  operator * ( double rhs );
    Mat8x8  operator / ( double rhs );
};

Mat8x8 支援了許多 built-in operators 的能力，我們只取 + 來看：

Mat8x8::Mat8x8 operator + ( const Mat8x8& other ) {
    double Sum[64];
    for( int i=0; i<64; ++i )
        Sum[i] = M[i] + other.M[i];
    return Mat8x8( Sum );
    }

operator +() 是個典型的 NRV function ，在未能提供 NRVO 的 compiler 上，可能會產生對 performance 有所危害的程式，因此按照先前提到的 reduction ，我們需要為 Mat8x8 提供特殊的 constructor ，依照需求，這個 constructor 會是個接受兩個 arguments 的 constructors ，不過 Mat8x8 提供了不只是提供了 + 的運算，所以最好能夠連 –、*、/ 也一併處理，因此 Faster C++ Operations [1] 提供的方法是利用具有古典優雅的氣息的 function pointer ，改寫之後變成：

class Mat8x8 {
    typedef void (*PFnInitMat)( Mat8x8& Mat, void* pLHS, void* pRHS );

    // The special constructor help us to reduce NRV to RV
    Mat8x8( PFnInitMat Init, void* pLHS, void* pRHS )
    {
        Init( *this, pLHS, pRHS );
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };
    Mat8x8()                      { memset( A, 0, 64*sizeof(double) ); }
    Mat8x8( const Mat8x8& other ) { memcpy( A, other.A, 512 ); }

    Mat8x8& operator = ( const Mat8x8& rhs );
    Mat8x8  operator + ( const Mat8x8& rhs );
    Mat8x8  operator - ( const Mat8x8& rhs );
    Mat8x8  operator * ( double rhs );
    Mat8x8  operator / ( double rhs );
};

void AddMat88( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

void SubMat88( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

void DivMat88scalar( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

void MulMat88scalar( Mat8x8& Mat, void* pLHS, void* pRHS )
{
    // ...
}

PFnInitMat 是個 function pointer type ，它代表了一族：可以接受兩個 Mat8x8 objects 並將結果寫於特定 object 的能力，家族成員包括： AddMat88()、SubMat88()、DivMat88()、MulMat88()。Mat8x8 class 現在有了一個特殊的 constructor ，它除了接受兩個 Mat8x8 的 arguments 外，還會接受一個 function pointer 來實際執行不同的運算。有了這樣的 constructor 幫忙，現在 operators 就會變成一個個簡單的 forward functions：

Mat8x8 Mat8x8::operator + ( const Mat8x8& rhs ) {
    return Mat8x8( AddMat88, this, (void*)&rhs );
    }

Mat8x8 Mat8x8::operator - ( const Mat8x8& rhs ) {
    return Mat8x8( SubMat88, this, (void*)&rhs );
    }

Mat8x8 Mat8x8::operator * ( double rhs ) {
    return Mat8x8( MulMat88scalar, this, (pdouble)&rhs );
    }

Mat8x8 Mat8x8::operator / ( double rhs ) {
    return Mat8x8( DivMat88scalar, this, (pdouble)&rhs );
    }

Generic Revision

看過了 Faster C++ Operations [1] 的方式，不知道您有沒有什麼好奇的地方呢？ COdE fr3@K 跟我覺得有些地方可以來作些實驗：

Type safety：Faster C++ Operations [1] 使用 void* 作為那些實際運算的 functions 的參數。因此在 AddMat88() 這類實際運算的 functions 裡頭，必須將 void* cast 成適當的 type ，少了那麼一點 type safety 的味道。或許有點吹毛求疵，不過引用作者的話：
I just used void* for this example - they make programming flexible (under great responsibility).
Generic and parameterization operation ：實際運算的 functions 是透過 function pointers 傳遞 constructor 去執行任務，如果能改用 C++ 的一些泛型機制呢？

Generic and Parameterization Operations

class Mat8x8 {
    struct plus_tag {
    };

    struct minus_tag {
    };

    struct division_tag {
    };

    struct multiplication_tag {
    };

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, plus_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] + rhs.A[ i ];
        }
    }

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, minus_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] - rhs.A[ i ];
        }
    }

    Mat8x8( const Mat8x8& lhs, double rhs, multiplication_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] * rhs;
        }
    }

    Mat8x8( const Mat8x8& lhs, double rhs, division_tag )
    {
        for ( size_t i = 0; i < 64; ++i ) {
            A[ i ] = lhs.A[ i ] / rhs;
        }
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };

    Mat8x8()
    {
        memset( A, 0, 64*sizeof(double) );
    }

    Mat8x8( const Mat8x8& other )
    {
        cout << "Copy ctor\n";
        memcpy( A, other.A, 512 );
    }

    Mat8x8 operator + ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, plus_tag() );
    }
    Mat8x8 operator - ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, minus_tag() );
    }
    Mat8x8 operator * ( double rhs )
    {
        return Mat8x8( *this, rhs, multiplication_tag() );
    }
    Mat8x8 operator / ( double rhs )
    {
        return Mat8x8( *this, rhs, division_tag() );
    }
};

使用了幾個 tag classes 來幫助我們標示實際要求 constructor 執行的運算。不過這只是第一步，目前的實作還無法支援使用 Mat8x8 時，可以客製化運算，我們還需要做出這樣的修改：

template <typename OperationT>
class Mat8x8 : public OperationT {
    struct plus_tag {
    };

    struct minus_tag {
    };

    struct division_tag {
    };

    struct multiplication_tag {
    };

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, plus_tag )
    {
        add( A, lhs.A, rhs.A );
    }

    Mat8x8( const Mat8x8& lhs, const Mat8x8& rhs, minus_tag )
    {
        minus( A, lhs.A, rhs.A );
    }

    Mat8x8( const Mat8x8& lhs, double rhs, multiplication_tag )
    {
        multiply( A, lhs.A, rhs );
    }

    Mat8x8( const Mat8x8& lhs, double rhs, division_tag )
    {
        divide( A, lhs.A, rhs );
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };

    Mat8x8()
    {
        memset( A, 0, 64*sizeof(double) );
    }

    Mat8x8( const Mat8x8& other )
    {
        cout << "Copy ctor\n";
        memcpy( A, other.A, 512 );
    }

    Mat8x8 operator + ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, plus_tag() );
    }
    Mat8x8 operator - ( const Mat8x8& rhs )
    {
        return Mat8x8( *this, rhs, minus_tag() );
    }
    Mat8x8 operator * ( double rhs )
    {
        return Mat8x8( *this, rhs, multiplication_tag() );
    }
    Mat8x8 operator / ( double rhs )
    {
        return Mat8x8( *this, rhs, division_tag() );
    }
};

一個 policy-based class 來提供運算，並讓使用 Mat8x8 的人可以選擇要使用的矩陣計算方式；例如說最基本的運算可透過 MatOp class 來提供：

struct MatOp {
    template <typename ValueT,
              size_t size>
    static void add( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], const ValueT (&rhs)[ size ] )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] + rhs[ i ];
        }
    }

    template <typename ValueT,
              size_t size>
    static void minus( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], const ValueT (&rhs)[ size ] )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] - rhs[ i ];
        }
    }

    template <typename ValueT,
              size_t size>
    static void multiply( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], ValueT rhs )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] * rhs;
        }
    }

    template <typename ValueT,
              size_t size>
    static void divide( ValueT (&result)[ size ], const ValueT (&lhs)[ size ], ValueT rhs )
    {
        for ( size_t i = 0; i < size; ++i ) {
            result[ i ] = lhs[ i ] / rhs;
        }
    }
};

2010/05/03 Update

今天巧遇 COdE fr3@K ，他給了我幾個建議：

少用繼承這種高耦合性的機制，尤其當你想描述的關係並非 is-a 而是 is-implemented-in-terms-of 。
給 class template 一個名稱，並用 typedef 宣告另一個使用時的名稱。

所以今天的作業 XD

template <typename OperationT>
class Mat8x8Base {
    struct plus_tag {
    };

    struct minus_tag {
    };

    struct division_tag {
    };

    struct multiplication_tag {
    };

    Mat8x8Base( const Mat8x8Base& lhs, const Mat8x8Base& rhs, plus_tag )
    {
        OperationT::add( A, lhs.A, rhs.A );
    }

    Mat8x8Base( const Mat8x8Base& lhs, const Mat8x8Base& rhs, minus_tag )
    {
        OperationT::minus( A, lhs.A, rhs.A );
    }

    Mat8x8Base( const Mat8x8Base& lhs, double rhs, multiplication_tag )
    {
        OperationT::multiply( A, lhs.A, rhs );
    }

    Mat8x8Base( const Mat8x8Base& lhs, double rhs, division_tag )
    {
        OperationT::divide( A, lhs.A, rhs );
    }

public:
    union {
        double  M[8][8];
        double  A[ 64 ];
    };

    Mat8x8Base()
    {
        memset( A, 0, 64*sizeof(double) );
    }

    Mat8x8Base( const Mat8x8Base& other )
    {
        cout << "Copy ctor\n";
        memcpy( A, other.A, 512 );
    }

    Mat8x8Base operator + ( const Mat8x8Base& rhs )
    {
        return Mat8x8( *this, rhs, plus_tag() );
    }
    Mat8x8Base operator - ( const Mat8x8Base& rhs )
    {
        return Mat8x8( *this, rhs, minus_tag() );
    }
    Mat8x8Base operator * ( double rhs )
    {
        return Mat8x8( *this, rhs, multiplication_tag() );
    }
    Mat8x8Base operator / ( double rhs )
    {
        return Mat8x8( *this, rhs, division_tag() );
    }
};

typedef Mat8x8Base<MatOp> Mat8x8;

以 Mat8x8Base 取代原本的 class template 名稱 Mat8x8 ，使用 typedef 宣告常用的名字 – Mat8x8 ；取消繼承，改用 qualified names 去呼叫 add(), minus() 之類的 functions。

C++0x 的救贖

真是的，說得口沫橫飛！其實重點在 C++0x 提供了 move semantics 幫助我們從 copy 的地獄中離開，N1377 ：

Move semantics is mostly about performance optimization: the ability to move an expensive object from one address in memory to another, while pilfering resources of the source in order to construct the target with minimum expense.

有興趣的，可以參考一下 COdE fr3@K 寫的一連串關於 move、r-value reference 的文章：

C++0x: Rvalue References：http://fsfoundry.org/codefreak/2008/11/16/cpp0x-rvalue-references/
C++0x: More on Rvalue References：http://fsfoundry.org/codefreak/2008/11/16/cpp0x-more-on-rvalue-references/
I Like to Move It：http://fsfoundry.org/codefreak/2009/05/19/i-like-to-move-it/

Furthermore

RVO vs. NRVO vs. URVO？
前面提過，比較少有 URVO 這樣的稱呼，這篇文章會做這樣的區別，只是單純覺得 RVO 應該是包涵性較 NRVO 和 URVO 廣泛的最佳化，URVO 只能算是 RVO 裡頭較簡單的一種形式，不過許多文章談及 RVO 可能只牽涉到 URVO ，或者它們說是 apply RVO to unnamed temporaries [6]。
為什麼 C++ Standard 會特別的定義 copy elision 呢？
無論是 C++03 或是 C++0x 都會特別提及 copy elision ：

When certain criteria are met, an implementation is allowed to omit the copy/move construction of a class object, even if the copy/move constructor and/or destructor for the object have side effects. … . This elision of copy/move operations, called copy elision, is permitted in the following circumstances

因為一旦 copy elision 施行，原本發生在 return value 上的 side effect 也就不可得，雖然少見，但或許真有程式碼倚賴這樣的 side effect ，object counting 可能算是一類？！但為了效率的追求， RVO 是必要的，弭平爭議的方法就是在 standard 中好好說明。

Reference

Faster C++ Operators: http://www.codeproject.com/KB/recipes/FasterCPPOperators.aspx
Named Return Value Optimization in Visual C++ 2005: http://msdn.microsoft.com/en-us/library/ms364057%28VS.80%29.aspx
N1377 A Proposal to Add Move Semantics Support to the C++ Language: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2002/n1377.htm
The Name Return Value Optimization: http://blogs.msdn.com/slippman/archive/2004/02/03/66739.aspx
C++ FAQ – 10. Constructors: http://www.parashift.com/c++-faq-lite/ctors.html
Move Constructors: http://www.drdobbs.com/cpp/184403855#3

就愛 2B 筆芯

Reduce NRV to RV

Basic RVO

Reduce NRV to RV

A Mat8x8 Sample

Generic Revision

Generic and Parameterization Operations

2010/05/03 Update

C++0x 的救贖

Furthermore

Reference

Windows + Visual Studio + VSCode + CMake 的疑難雜症