[SamplePGO] Stale profile matching(part 2)
Part 2 of https://reviews.llvm.org/D147456 Use callee name on IR as an anchor to match the call target/inlinee name in the profile. The advantages of this in particular: - Different from the traditional way of encoding hash signatures to every block that would affect binary/profile size and build speed, it doesn't require any additional information for this, all the data is already in the IR and profiles. - Effective for current nested profile layout in which once a callsite is mismatched all the inlinee's profiles are dropped. **The input of the algorithm:** - IR locations: the anchor is the callee name of direct callsite. - Profile locations: the anchor is the call target name for `BodySample`s or inlinee's profile name for `CallsiteSamples`. The two lists are populated by parsing the IR and profile and both can be generalized as a sequence of locations with an optional anchor. For example: say location `1.2(foo)` refers to a callsite at `1.2` with callee name `foo` and `1.3` refers to a non-directcall location `1.3`. ``` // The current build source code: int main() { 1. ... 2. foo(); 3. ... 4 ... 5. ... 6. bar(); 7. ... } ``` IR locations are populated and simplified as: `[1, 2(foo), 3, 5, 6(bar), 7]`. ``` ; The "stale" profile: main:350:1 1: 1 2: 3 3: 100 foo:100 4: 2 7: 2 8: 200 bar:200 9: 30 ``` Profile locations are populated and simplified as `[1, 2, 3(foo), 4, 7, 8(bar), 9]` **Matching heuristic:** - Match all the anchors in lexical order first. - Match non-anchors evenly between two anchors: Split the non-anchor range, the first half is matched based on the start anchor, the second half is matched based on the end anchor. So the example above is matched like: ``` [1, 2(foo), 3, 5, 6(bar), 7] | | | | | | [1, 2, 3(foo), 4, 7, 8(bar), 9] ``` 3 -> 4 matching is based on anchor `foo`, 5 -> 7 matching is based on anchor `bar`. The output mapping of matching is [2->3, 3->4, 5->7, 6->8, 7->9]. For the implementation, the anchors are saved in a map for fast look-up. The result mapping is saved into `IRToProfileLocationMap`(see https://reviews.llvm.org/D147456) and distributed to all FunctionSamples(`distributeIRToProfileLocationMap`) **Clang-self build benchmark: ** Current build version: clang-10 The profiled version: clang-9 Results compared to a refresh profile(collected profile on clang-10) and to be fair, we invalidated new functions' profiles(both refresh and stale profile use the same profile list). 1) Regression to using refresh profile with this off : -3.93% 2) Regression to using refresh profile with this on : -1.1% So this algorithm can recover ~72% of the regression. **Internal(Meta) large-scale services.** we saw one real instance of a 3 week stale profile., it delivered a ~1.8% win. **Notes or future work:** - Classic AutoFDO support: the current version only supports pseudo-probe, but I believe it's not hard to extend to classic line-number based AutoFDO since pseudo-probe and line-number are shared the LineLocation structure. - The fuzzy matching is an open-ended area and there could be more heuristics to try out, but since the current version already recovers a reasonable percentage of regression(with some pseudo probe order change, it can recover close to 90%), I'm submitting the patch for review and we will try more heuristics in future. - Profile call target name are only available when the call is hit by samples, the missing anchor might mislead the matching, this can be mitigated in llvm-profgen to generate the call target for the zero samples. - This doesn't handle function name mismatch, we plan to solve it in future. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D147545
Loading
Please sign in to comment