[ELF] Parallelize --compress-debug-sections=zlib
When linking a Debug build of clang (265MiB SHF_ALLOC sections, 920MiB uncompressed
debug info), "Compress debug sections" takes 2/3 of the time in a --threads=1 link
and ~70% of the time in a --threads=8 link.

This patch splits a section into 1MiB shards and calls zlib `deflate` in parallel.

DEFLATE blocks form a bit stream, so we need to ensure every shard's output
starts at a byte boundary for concatenation. We use Z_SYNC_FLUSH for all shards
but the last to flush the output to a byte boundary. (Z_FULL_FLUSH can be used
as well, but Z_FULL_FLUSH also clears the hash table, which just wastes time.)

The last block requires the BFINAL flag. We call deflate with Z_FINISH to set
the flag as well as flush the output to a byte boundary. Under the hood, all of
Z_SYNC_FLUSH, Z_FULL_FLUSH, and Z_FINISH emit a non-compressed block (called a
stored block in zlib). RFC1951 says "Any bits of input up to the next byte
boundary are ignored", i.e. a stored block always begins on a byte boundary,
which is what aligns the output.

In a --threads=8 link, "Compress debug sections" is 5.7x as fast and the
overall link is 2.54x as fast. Because the hash table for one shard is not
shared with the next shard, the output is slightly larger. A better compression
ratio could be achieved by preloading the previous shard's last window as a
dictionary (`deflateSetDictionary`), but that is overkill.

```
# 1MiB shards
% bloaty clang.new -- clang.old
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.3%  +129Ki  [ = ]       0    .debug_str
  +0.1%  +105Ki  [ = ]       0    .debug_info
  +0.3%  +101Ki  [ = ]       0    .debug_line
  +0.2% +2.66Ki  [ = ]       0    .debug_abbrev
  +0.0% +1.19Ki  [ = ]       0    .debug_ranges
  +0.1%  +341Ki  [ = ]       0    TOTAL

# 2MiB shards
% bloaty clang.new -- clang.old
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.2% +74.2Ki  [ = ]       0    .debug_line
  +0.1% +72.3Ki  [ = ]       0    .debug_str
  +0.0% +69.9Ki  [ = ]       0    .debug_info
  +0.1%    +976  [ = ]       0    .debug_abbrev
  +0.0%    +882  [ = ]       0    .debug_ranges
  +0.0%  +218Ki  [ = ]       0    TOTAL
```

Bonuses of not using `zlib::compress`:

* we can compress a debug section larger than 4GiB
* peak memory usage is lower, because for most shards the output size is less
  than 50% of the input size (all shards were below 55% for a large binary I
  tested; decreasing the initial output buffer size further does not decrease
  memory usage)

Reviewed By: ikudrin

Differential Revision: https://reviews.llvm.org/D117853
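For illustration, below is a minimal single-threaded sketch of the sharding
scheme described above; it is not the code from this patch, and the function
names are made up. It uses raw deflate streams (windowBits = -15) so that
shards carry no per-stream header or trailer, and assembles the zlib wrapper
(2-byte header plus big-endian Adler-32 trailer) by hand; `adler32_z` assumes
zlib >= 1.2.9. In the real patch the per-shard deflate calls run in parallel.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>
#include <zlib.h>

// Compress one shard with raw deflate (no zlib header/trailer). `flush` is
// Z_SYNC_FLUSH for every shard but the last (flushes to a byte boundary
// without setting BFINAL) and Z_FINISH for the last (sets BFINAL and also
// flushes to a byte boundary).
static std::vector<uint8_t> deflateShard(const uint8_t *in, size_t size,
                                         int flush) {
  z_stream s = {};
  // windowBits = -15 selects raw deflate; 8 is the default memLevel.
  if (deflateInit2(&s, Z_BEST_SPEED, Z_DEFLATED, -15, 8,
                   Z_DEFAULT_STRATEGY) != Z_OK)
    abort();
  // deflateBound plus a small margin covers the flush's empty stored block.
  std::vector<uint8_t> out(deflateBound(&s, size) + 16);
  s.next_in = const_cast<uint8_t *>(in);
  s.avail_in = static_cast<uInt>(size); // shards are 1MiB, so uInt suffices
  s.next_out = out.data();
  s.avail_out = static_cast<uInt>(out.size());
  if (deflate(&s, flush) < Z_OK) // expect Z_OK or Z_STREAM_END
    abort();
  out.resize(s.total_out);
  deflateEnd(&s);
  return out;
}

// Compress `data` (size != 0) as one zlib stream built from independently
// deflated 1MiB shards. The loop is where the patch parallelizes.
std::vector<uint8_t> shardedZlibCompress(const uint8_t *data, size_t size) {
  constexpr size_t shardSize = 1 << 20;
  size_t numShards = (size + shardSize - 1) / shardSize;
  std::vector<std::vector<uint8_t>> shards(numShards);
  for (size_t i = 0; i != numShards; ++i) {
    size_t begin = i * shardSize;
    size_t n = std::min(shardSize, size - begin);
    shards[i] = deflateShard(data + begin, n,
                             i + 1 == numShards ? Z_FINISH : Z_SYNC_FLUSH);
  }
  // Every shard's output starts and ends on a byte boundary, so plain
  // concatenation yields a valid DEFLATE stream. Wrap it in a zlib stream:
  // 2-byte header (0x78 0x01: deflate with a 32KiB window; CMF*256+FLG is
  // divisible by 31), raw deflate data, then the big-endian Adler-32 of the
  // uncompressed input.
  std::vector<uint8_t> out{0x78, 0x01};
  for (const auto &shard : shards)
    out.insert(out.end(), shard.begin(), shard.end());
  uint32_t adler = static_cast<uint32_t>(adler32_z(1, data, size));
  for (int shift = 24; shift >= 0; shift -= 8)
    out.push_back(static_cast<uint8_t>(adler >> shift));
  return out;
}
```

Each Z_SYNC_FLUSH costs only a few bytes (an empty stored block, visible in
the output as a 00 00 FF FF marker); the size increase measured above comes
almost entirely from the hash table not being shared across shard boundaries.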