close
The Wayback Machine - https://web.archive.org/web/20260123174234/https://github.com/github/codeql/pull/4750
Skip to content

Conversation

@geoffw0
Copy link
Contributor

@geoffw0 geoffw0 commented Nov 30, 2020

This PR represents a prototype for what I'm thinking of doing for https://github.com/github/codeql-c-analysis-team/issues/167. The problem with it is that the changes I've made so far don't add up to a measurable difference to performance in my testing. And I'm not aware of a specific concern in the query log to focus in on there. So I'm left a little unconvinced that making these changes more widely will be worthwhile.

@jbj has previously expressed a preference where I've written this = any(StdBasicString s).getAnInstMemberNamed("push_back"), for this.getClassForMember("push_back") instanceof StdBasicString instead. I think that's for performance reasons but I don't currently understand why - and I find it significantly more difficult to read. getClassForMember really means getClassAndAlsoIAmAMemberCalled(name).

@geoffw0 geoffw0 added the C++ label Nov 30, 2020
@geoffw0 geoffw0 requested a review from a team as a code owner November 30, 2020 15:16
@MathiasVP
Copy link
Contributor

The problem with it is that the changes I've made so far don't add up to a measurable difference to performance in my testing. And I'm not aware of a specific concern in the query log to focus in on there. So I'm left a little unconvinced that making these changes more widely will be worthwhile.

Does it also not show any difference in the tuple counts? On Slack you mentioned having to do an unbind when joining with hasQualifiedName to get good performance (which leaves me with the impression that the current tuple counts when using hasQualifiedName is bad). Is it any better with this = any(StdBasicString s).getAnInstMemberNamed("push_back") or this.getClassForMember("push_back") instanceof StdBasicString?

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 8, 2021

Fixed merge (conflict with #4822).

@jbj
Copy link
Contributor

jbj commented Jan 11, 2021

@jbj has previously expressed a preference where I've written this = any(StdBasicString s).getAnInstMemberNamed("push_back"), for this.getClassForMember("push_back") instanceof StdBasicString instead. I think that's for performance reasons but I don't currently understand why - and I find it significantly more difficult to read. getClassForMember really means getClassAndAlsoIAmAMemberCalled(name).

I think the two options are equivalent in performance, and I think I prefer the getClassForMember option.

@jbj
Copy link
Contributor

jbj commented Jan 11, 2021

To move this PR forward, I'd like to see some tuple counts before/after for the predicates being changed.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 11, 2021

I'd like to see some tuple counts before/after for the predicates being changed.

Yep, I've been meaning to produce some but it's not quite trivial enough to fit in while I'm doing other work.

@rdmarsh2
Copy link
Contributor

I agree with @geoffw0 that getClassForMember is a confusing name for the behavior it actually has.

@rdmarsh2
Copy link
Contributor

rdmarsh2 commented Jan 14, 2021

It looks like the class models have the same hasQualifiedName issue that we saw with function models previously, but the function models are now using the class models correctly.

The optimizer is using magic sets and sharing with getAnInstMemberNamed to generate a common predicate for all classes, and then filtering it for each individual class:

EVALUATE NONRECURSIVE RELATION:
  SYNTHETIC StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb#shared(int arg0, int arg1, cached string arg2) :- 
    {3} r1 = JOIN Memcpy::MemcpyFunction#class#b#shared AS L WITH Class::Class::getAMember_dispred#fb_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, L.<0>, L.<1>
    {3} r2 = JOIN r1 WITH Class::TemplateClass::getAnInstantiation_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<2>
    return r2

I'd advocate defining it centrally rather than adding a separate rootdef in each class model.

Uses seem to be optimized correctly:

[2021-01-13 21:42:36] (55s) Tuple counts for project#StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb#4:
                      1 ~0%     {1} r1 = CONSTANT(unique string)["operator>>"]
                      0 ~0%     {1} r2 = JOIN r1 WITH project#StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb AS R ON FIRST 1 OUTPUT R.<1>
                                return r2

With this definition of getClassForMember in Declaration, it seems to have the same linear behavior we have currently, but that might be an issue with my definition:

 Class getClassForMember(string name) {
    (
      result = this.getDeclaringType()
      or
      result.(TemplateClass).getAnInstantiation() = this.getDeclaringType()
    ) and
    this.getName() = name
  }
[2021-01-13 22:30:54] (1s) Tuple counts for StdString::StdIStreamIn#class#b:
                      6735  ~6%     {2} r1 = SCAN project#Instruction::FunctionInstruction#3#ff AS I OUTPUT I.<0>, "operator>>"
                      0     ~0%     {2} r2 = JOIN r1 WITH Declaration::Declaration::getClassForMember_dispred#bff AS R ON FIRST 2 OUTPUT R.<2>, r1.<0>
                      0     ~0%     {5} r3 = JOIN r2 WITH Class::TemplateClass#f AS R ON FIRST 1 OUTPUT "basic_string", "", "std", r2.<0>, r2.<1>
                      0     ~0%     {1} r4 = JOIN r3 WITH QualifiedName::declarationHasQualifiedName#ffff@staged_ext AS R ON FIRST 4 OUTPUT r3.<4>
                                    return r4

…es or 'any'. The classes for string objects now match instantiations directly rather than the template.
@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 14, 2021

I've just pushed a significant change, so that it's now the instantiations of library classes which are modelled, rather than the templates. This gives us much more natural code in the function models, without any weird custom predicates (getAnInstMemberNamed) or excessive use of any. I haven't looked into the impact on performance yet, though it sounds like we hadn't solved that problem yet anyway.

@rdmarsh2
Copy link
Contributor

There are a few predicates that are still doing a full scan of the function table. All of them are joining single strings against the function name, but not all such predicates do a full scan. I haven't noticed any other common factors.

[2021-01-14 14:46:39] (0s) Tuple counts for StdString::StdStreamFunction#class#b:
                      0     ~0%     {2} r1 = SELECT Memcpy::MemcpyFunction#class#b#shared AS I ON I.<1> = "ignore" OR I.<1> = "unget" OR I.<1> = "seekg"
                      0     ~0%     {1} r2 = SCAN r1 OUTPUT r1.<0>
                      0     ~0%     {2} r3 = JOIN r2 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r2.<0>
                      0     ~0%     {1} r4 = JOIN r3 WITH project#StdString::StdBasicIStream#class#ff AS R ON FIRST 1 OUTPUT r3.<1>
                      2     ~0%     {2} r5 = SELECT Memcpy::MemcpyFunction#class#b#shared AS I ON I.<1> = "seekp" OR I.<1> = "flush"
                      2     ~0%     {1} r6 = SCAN r5 OUTPUT r5.<0>
                      1     ~0%     {2} r7 = JOIN r6 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r6.<0>
                      1     ~0%     {1} r8 = JOIN r7 WITH project#StdString::StdBasicOStream#class#ff AS R ON FIRST 1 OUTPUT r7.<1>
                      1     ~0%     {1} r9 = r4 \/ r8
                      6735  ~4%     {2} r10 = SCAN project#Instruction::FunctionInstruction#3#ff AS I OUTPUT I.<0>, "copyfmt"
                      0     ~0%     {1} r11 = JOIN r10 WITH Declaration::Declaration::hasName_dispred#bf AS R ON FIRST 2 OUTPUT r10.<0>
                      0     ~0%     {2} r12 = JOIN r11 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r11.<0>
                      0     ~0%     {2} r13 = JOIN r12 WITH project#Class::TemplateClass::getAnInstantiation_dispred#ff AS R ON FIRST 1 OUTPUT r12.<0>, r12.<1>
                      0     ~0%     {5} r14 = JOIN r13 WITH Class::TemplateClass::getAnInstantiation_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT "basic_ios", "", "std", R.<1>, r13.<1>
                      0     ~0%     {1} r15 = JOIN r14 WITH QualifiedName::declarationHasQualifiedName#ffff@staged_ext AS R ON FIRST 4 OUTPUT r14.<4>
                      1     ~0%     {1} r16 = r9 \/ r15
                                    return r16

There's a few cases in hasTaintFlow and hasDataFlow that have similar issues (Including a regression of the operator>> example from the previous commit):

                      6735  ~0%       {4} r56 = JOIN r16 WITH project#Instruction::FunctionInstruction#3#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, "operator>>", r16.<0>, r16.<1>
                      2     ~0%       {3} r57 = JOIN r56 WITH Declaration::Declaration::hasName_dispred#bf AS R ON FIRST 2 OUTPUT r56.<0>, r56.<2>, r56.<3>
                      0     ~0%       {4} r58 = JOIN r57 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r57.<1>, r57.<2>, r57.<0>
                      0     ~0%       {3} r59 = JOIN r58 WITH project#StdString::StdBasicIStream#class#ff AS R ON FIRST 1 OUTPUT r58.<3>, r58.<1>, r58.<2>

And some cases that look scary, but are just using the Cartesian product to join FunctionInput and FunctionOutput with the specific function or class:

                      77     ~0%       {3} r261 = JOIN DataFlow::DataFlowFunction::hasDataFlow_dispred#fff#shared AS L WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, L.<0>, L.<1>
                      263    ~3%       {4} r262 = JOIN r261 WITH Declaration::Declaration::getDeclaringType_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, "push_back", r261.<1>, r261.<2>
                      1      ~0%       {3} r263 = JOIN r262 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r262.<0>, r262.<2>, r262.<3>

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 15, 2021

Both "copyfmt" and "operator>>" are cases where the model only models one function, so the [] syntax is not used for these strings. I remember previously finding that [] causes the optimizer to do things differently and avoids the problem. I don't remember a lot more than that and I don't think I found a good solution (short of ["copyfmt", "something-that-is-not-a-legal-function-name"]).

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 19, 2021

I've just pushed five commits containing two performance improvements, designed by myself and @jbj:

(1) pragma[nomagic] on hasTaintFlow and hasDataFlow. This is a barrier to prevent the implementation details inside taint models from being affected by the context in which hasTaintFlow / hasDataFlow is used. In particular, in context we have information about FunctionCalls as well as Functions that are being modelled, allowing evaluation to start from the Call rather than the Function - which turns out to be a poor strategy. This is now prevented.

(2) even with the first change we still get a mediocre evaluation strategy, where we narrow down to all of the functions in the modelled class before joining on name. This is the 'wrong' order, producing a fair number of rows we don't need. The getClassAndName predicate (a concept previously referred to as getClassForMember) blocks this strategy, though it does create a bit of work up front (@jbj we didn't discuss the up front cost of computing the new predicate; it seems like it will be bigger than the cost of doing the 'wrong' ordering on one model, but it isn't per-model, which means we don't need to worry about worst cases and future expansion so much).

Notes:

  • the first change (in two places) applies to everything, whereas the second change only works for models where the change has been made in their code. We may forget some models now or in the future, which is less than ideal, but with the first change in place performance should be acceptable.
  • the pragma[nomagic]s appear to be necessary for things to work as intended.

Query used for most of the testing:

import semmle.code.cpp.dataflow.TaintTracking
import DataFlow

select strictcount(Node n1, Node n2 | TaintTracking::localTaintStep(n1, n2))

Looking at StdStringCStr::StdStringCStr ("c_str") Before the latest batch of changes:

[2021-01-18 18:01:24] (8s) Tuple counts for StdString::StdStringCStr#class#b/1@52e292:
                      1953  ~1%     {2} r1 = SCAN m#FormattingFunction::FormattingFunction::getFirstFormatArgumentIndex_dispred#bf AS I OUTPUT I.<0> 'this', "c_str"
                      4     ~0%     {1} r2 = JOIN r1 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r1.<0> 'this'
                      4     ~0%     {1} r3 = STREAM DEDUP r2
                      4     ~0%     {2} r4 = JOIN r3 WITH Declaration::Declaration::getDeclaringType_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<0> 'this'
                      2     ~0%     {1} r5 = JOIN r4 WITH project#StdString::StdBasicString#class#ff AS R ON FIRST 1 OUTPUT r4.<1> 'this'
                                    return r5

[2021-01-18 18:01:24] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#bbf/3@e4424b:
...
                      1      ~0%       {2} r44 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
                      1953   ~3%       {3} r45 = JOIN r44 WITH project#Call::Call::getTarget_dispred#ff AS R CARTESIAN PRODUCT OUTPUT R.<0> 'this', r44.<0> 'input', r44.<1> 'output'
                      2      ~0%       {3} r46 = JOIN r45 WITH StdString::StdStringCStr#class#b AS R ON FIRST 1 OUTPUT r45.<0> 'this', r45.<1> 'input', r45.<2> 'output'

With the first commit only (pragma[nomagic] on hasTaintFlow / hasDataFlow):

StdString::StdStringCStr
[not explicitly evaluated]

[2021-01-18 18:11:11] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#fff/3@b9de32:
...
                      1      ~0%       {2} r72 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
...
                      51     ~0%       {3} r317 = JOIN r72 WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, r72.<0> 'input', r72.<1> 'output'
                      216    ~0%       {4} r318 = JOIN r317 WITH Declaration::Declaration::getDeclaringType_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1> 'this', "c_str", r317.<1> 'input', r317.<2> 'output'
                      2      ~0%       {3} r319 = JOIN r318 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r318.<0> 'this', r318.<2> 'input', r318.<3> 'output'

and after:

StdString::StdStringCStr
[not explicitly evaluated]

[2021-01-18 17:59:01] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#fff/3@896546:
...
                      1      ~0%       {2} r72 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
...
                      51     ~4%       {4} r228 = JOIN r72 WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT "c_str", R.<0>, r72.<0> 'input', r72.<1> 'output'
                      2      ~0%       {3} r229 = JOIN r228 WITH Function::Function::getClassAndName#fff_120#join_rhs AS R ON FIRST 2 OUTPUT R.<2> 'this', r228.<2> 'input', r228.<3> 'output'

There are similar results in DataFlow, e.g. for StdIStreamIn::StdIStreamIn ("operator>>").

jbj
jbj previously approved these changes Jan 19, 2021
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes LGTM, and I witnessed most of the tuple-count improvements in our screen sharing session yesterday. Even if we should run into performance problems when expanding these changes to the rest of the library, I think the getClassAndName approach is flexible enough that most performance problems could be addressed centrally, without changing every single model.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well now!

@jbj jbj merged commit 24947f2 into github:main Jan 19, 2021
@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 19, 2021

Thanks for the close collaboration, reviews and merge everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants