    Parallelize uncompress() and splitIntoPieces().
    Rui Ueyama authored
    Uncompressing section contents and splitting mergeable section contents
    into smaller chunks are heavy tasks. They scan entire section contents
    and do CPU-intensive work such as uncompressing zlib-compressed data
    or computing a hash value for each section piece.
    
    Luckily, these tasks are independent of each other, so we can run them
    with parallel_for_each. The number of input sections is large (as
    opposed to the number of output sections), so there is a lot of
    parallelism to exploit here.
    
    In fact, the current design, which calls uncompress() and
    splitIntoPieces() in batch, was chosen with this parallelization in
    mind. Essentially, all we need to do here is replace `for` with
    `parallel_for_each`.
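    The transformation described above can be sketched as follows. This is
    not LLD's actual code: the Section struct, the hash loop, and the
    thread-chunking helper are all simplified stand-ins, written here only
    to show the shape of the `for` to `parallel_for_each` change on
    per-section work that is safe to run concurrently because each
    iteration touches only its own section.

    ```cpp
    #include <algorithm>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for an input section; the real LLD types are richer.
    struct Section {
      std::vector<unsigned char> data;
      size_t hash = 0;
    };

    std::vector<Section> sections;

    // A tiny parallel_for_each in the spirit of LLD's helper: split the range
    // into one contiguous chunk per hardware thread and process the chunks
    // concurrently.
    template <class It, class Fn>
    void parallel_for_each(It begin, It end, Fn fn) {
      size_t nthreads = std::max(1u, std::thread::hardware_concurrency());
      size_t total = end - begin;
      size_t chunk = (total + nthreads - 1) / nthreads;
      std::vector<std::thread> workers;
      for (size_t i = 0; i < total; i += chunk) {
        It first = begin + i;
        It last = begin + std::min(i + chunk, total);
        workers.emplace_back([=] { std::for_each(first, last, fn); });
      }
      for (std::thread &t : workers)
        t.join();
    }

    int main() {
      sections.resize(1000);
      for (size_t i = 0; i < sections.size(); ++i)
        sections[i].data.assign(64, (unsigned char)i);

      // Before: for (Section &s : sections) { ...hash s... }
      // After: the same per-section work runs in parallel. This is safe
      // because each iteration reads and writes only its own section.
      parallel_for_each(sections.begin(), sections.end(), [](Section &s) {
        size_t h = 0;
        for (unsigned char c : s.data)
          h = h * 31 + c; // stand-in for uncompress()/splitIntoPieces() work
        s.hash = h;
      });

      std::printf("hashed %zu sections\n", sections.size());
      return 0;
    }
    ```

    The key property making this a one-line change in the linker is that
    the batched design already grouped all the independent per-section work
    into a single loop with no cross-iteration state.
    
    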
    
    It seems this patch improves latency significantly if linked programs
    contain debug info (which in turn contains lots of mergeable strings).
    For example, the latency to link Clang (debug build) improved by 20% on
    my machine, as shown below. Note that ld.gold took 19.2 seconds to do
    the same thing.
    
    Before:
        30801.782712 task-clock (msec)         #    3.652 CPUs utilized            ( +-  2.59% )
             104,084 context-switches          #    0.003 M/sec                    ( +-  1.02% )
               5,063 cpu-migrations            #    0.164 K/sec                    ( +- 13.66% )
           2,528,130 page-faults               #    0.082 M/sec                    ( +-  0.47% )
      85,317,809,130 cycles                    #    2.770 GHz                      ( +-  2.62% )
      67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )
     <not supported> stalled-cycles-backend
      44,295,945,493 instructions              #    0.52  insns per cycle
                                               #    1.52  stalled cycles per insn  ( +-  0.44% )
       8,572,384,877 branches                  #  278.308 M/sec                    ( +-  0.66% )
         141,806,726 branch-misses             #    1.65% of all branches          ( +-  0.13% )
    
         8.433424003 seconds time elapsed                                          ( +-  1.20% )
    
    After:
        35523.764575 task-clock (msec)         #    5.265 CPUs utilized            ( +-  2.67% )
             159,107 context-switches          #    0.004 M/sec                    ( +-  0.48% )
               8,123 cpu-migrations            #    0.229 K/sec                    ( +- 23.34% )
           2,372,483 page-faults               #    0.067 M/sec                    ( +-  0.36% )
      98,395,342,152 cycles                    #    2.770 GHz                      ( +-  2.62% )
      79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )
     <not supported> stalled-cycles-backend
      46,274,151,813 instructions              #    0.47  insns per cycle
                                               #    1.71  stalled cycles per insn  ( +-  0.47% )
       8,987,621,670 branches                  #  253.003 M/sec                    ( +-  0.60% )
         148,900,624 branch-misses             #    1.66% of all branches          ( +-  0.27% )
    
         6.747548004 seconds time elapsed                                          ( +-  0.40% )
    
    llvm-svn: 287946