C++: Model classes in StdString.qll. #4750

geoffw0 · 2020-11-30T15:16:07Z

This PR represents a prototype for what I'm thinking of doing for https://github.com/github/codeql-c-analysis-team/issues/167. The problem with it is that the changes I've made so far don't add up to a measurable difference to performance in my testing. And I'm not aware of a specific concern in the query log to focus in on there. So I'm left a little unconvinced that making these changes more widely will be worthwhile.

@jbj has previously expressed a preference where I've written this = any(StdBasicString s).getAnInstMemberNamed("push_back"), for this.getClassForMember("push_back") instanceof StdBasicString instead. I think that's for performance reasons but I don't currently understand why - and I find it significantly more difficult to read. getClassForMember really means getClassAndAlsoIAmAMemberCalled(name).

MathiasVP · 2020-12-01T09:36:31Z

The problem with it is that the changes I've made so far don't add up to a measurable difference to performance in my testing. And I'm not aware of a specific concern in the query log to focus in on there. So I'm left a little unconvinced that making these changes more widely will be worthwhile.

Does it also not show any difference in the tuple counts? On Slack you mentioned having to do an unbind when joining with hasQualifiedName to get good performance (which leaves me with the impression that the current tuple counts when using hasQualifiedName is bad). Is it any better with this = any(StdBasicString s).getAnInstMemberNamed("push_back") or this.getClassForMember("push_back") instanceof StdBasicString?

geoffw0 · 2021-01-08T17:16:12Z

Fixed merge (conflict with #4822).

jbj · 2021-01-11T13:59:32Z

@jbj has previously expressed a preference where I've written this = any(StdBasicString s).getAnInstMemberNamed("push_back"), for this.getClassForMember("push_back") instanceof StdBasicString instead. I think that's for performance reasons but I don't currently understand why - and I find it significantly more difficult to read. getClassForMember really means getClassAndAlsoIAmAMemberCalled(name).

I think the two options are equivalent in performance, and I think I prefer the getClassForMember option.

jbj · 2021-01-11T14:00:14Z

To move this PR forward, I'd like to see some tuple counts before/after for the predicates being changed.

geoffw0 · 2021-01-11T17:45:00Z

I'd like to see some tuple counts before/after for the predicates being changed.

Yep, I've been meaning to produce some but it's not quite trivial enough to fit in while I'm doing other work.

cpp/ql/src/semmle/code/cpp/models/implementations/StdString.qll

rdmarsh2 · 2021-01-13T01:50:41Z

I agree with @geoffw0 that getClassForMember is a confusing name for the behavior it actually has.

rdmarsh2 · 2021-01-14T06:44:30Z

It looks like the class models have the same hasQualifiedName issue that we saw with function models previously, but the function models are now using the class models correctly.

The optimizer is using magic sets and sharing with getAnInstMemberNamed to generate a common predicate for all classes, and then filtering it for each individual class:

EVALUATE NONRECURSIVE RELATION:
  SYNTHETIC StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb#shared(int arg0, int arg1, cached string arg2) :- 
    {3} r1 = JOIN Memcpy::MemcpyFunction#class#b#shared AS L WITH Class::Class::getAMember_dispred#fb_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, L.<0>, L.<1>
    {3} r2 = JOIN r1 WITH Class::TemplateClass::getAnInstantiation_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<2>
    return r2

I'd advocate defining it centrally rather than adding a separate rootdef in each class model.

Uses seem to be optimized correctly:

[2021-01-13 21:42:36] (55s) Tuple counts for project#StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb#4:
                      1 ~0%     {1} r1 = CONSTANT(unique string)["operator>>"]
                      0 ~0%     {1} r2 = JOIN r1 WITH project#StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb AS R ON FIRST 1 OUTPUT R.<1>
                                return r2

With this definition of getClassForMember in Declaration, it seems to have the same linear behavior we have currently, but that might be an issue with my definition:

 Class getClassForMember(string name) {
    (
      result = this.getDeclaringType()
      or
      result.(TemplateClass).getAnInstantiation() = this.getDeclaringType()
    ) and
    this.getName() = name
  }

[2021-01-13 22:30:54] (1s) Tuple counts for StdString::StdIStreamIn#class#b:
                      6735  ~6%     {2} r1 = SCAN project#Instruction::FunctionInstruction#3#ff AS I OUTPUT I.<0>, "operator>>"
                      0     ~0%     {2} r2 = JOIN r1 WITH Declaration::Declaration::getClassForMember_dispred#bff AS R ON FIRST 2 OUTPUT R.<2>, r1.<0>
                      0     ~0%     {5} r3 = JOIN r2 WITH Class::TemplateClass#f AS R ON FIRST 1 OUTPUT "basic_string", "", "std", r2.<0>, r2.<1>
                      0     ~0%     {1} r4 = JOIN r3 WITH QualifiedName::declarationHasQualifiedName#ffff@staged_ext AS R ON FIRST 4 OUTPUT r3.<4>
                                    return r4

…es or 'any'. The classes for string objects now match instantiations directly rather than the template.

geoffw0 · 2021-01-14T15:18:27Z

I've just pushed a significant change, so that it's now the instantiations of library classes which are modelled, rather than the templates. This gives us much more natural code in the function models, without any weird custom predicates (getAnInstMemberNamed) or excessive use of any. I haven't looked into the impact on performance yet, though it sounds like we hadn't solved that problem yet anyway.

rdmarsh2 · 2021-01-15T00:46:43Z

There are a few predicates that are still doing a full scan of the function table. All of them are joining single strings against the function name, but not all such predicates do a full scan. I haven't noticed any other common factors.

[2021-01-14 14:46:39] (0s) Tuple counts for StdString::StdStreamFunction#class#b:
                      0     ~0%     {2} r1 = SELECT Memcpy::MemcpyFunction#class#b#shared AS I ON I.<1> = "ignore" OR I.<1> = "unget" OR I.<1> = "seekg"
                      0     ~0%     {1} r2 = SCAN r1 OUTPUT r1.<0>
                      0     ~0%     {2} r3 = JOIN r2 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r2.<0>
                      0     ~0%     {1} r4 = JOIN r3 WITH project#StdString::StdBasicIStream#class#ff AS R ON FIRST 1 OUTPUT r3.<1>
                      2     ~0%     {2} r5 = SELECT Memcpy::MemcpyFunction#class#b#shared AS I ON I.<1> = "seekp" OR I.<1> = "flush"
                      2     ~0%     {1} r6 = SCAN r5 OUTPUT r5.<0>
                      1     ~0%     {2} r7 = JOIN r6 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r6.<0>
                      1     ~0%     {1} r8 = JOIN r7 WITH project#StdString::StdBasicOStream#class#ff AS R ON FIRST 1 OUTPUT r7.<1>
                      1     ~0%     {1} r9 = r4 \/ r8
                      6735  ~4%     {2} r10 = SCAN project#Instruction::FunctionInstruction#3#ff AS I OUTPUT I.<0>, "copyfmt"
                      0     ~0%     {1} r11 = JOIN r10 WITH Declaration::Declaration::hasName_dispred#bf AS R ON FIRST 2 OUTPUT r10.<0>
                      0     ~0%     {2} r12 = JOIN r11 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r11.<0>
                      0     ~0%     {2} r13 = JOIN r12 WITH project#Class::TemplateClass::getAnInstantiation_dispred#ff AS R ON FIRST 1 OUTPUT r12.<0>, r12.<1>
                      0     ~0%     {5} r14 = JOIN r13 WITH Class::TemplateClass::getAnInstantiation_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT "basic_ios", "", "std", R.<1>, r13.<1>
                      0     ~0%     {1} r15 = JOIN r14 WITH QualifiedName::declarationHasQualifiedName#ffff@staged_ext AS R ON FIRST 4 OUTPUT r14.<4>
                      1     ~0%     {1} r16 = r9 \/ r15
                                    return r16

There's a few cases in hasTaintFlow and hasDataFlow that have similar issues (Including a regression of the operator>> example from the previous commit):

                      6735  ~0%       {4} r56 = JOIN r16 WITH project#Instruction::FunctionInstruction#3#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, "operator>>", r16.<0>, r16.<1>
                      2     ~0%       {3} r57 = JOIN r56 WITH Declaration::Declaration::hasName_dispred#bf AS R ON FIRST 2 OUTPUT r56.<0>, r56.<2>, r56.<3>
                      0     ~0%       {4} r58 = JOIN r57 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r57.<1>, r57.<2>, r57.<0>
                      0     ~0%       {3} r59 = JOIN r58 WITH project#StdString::StdBasicIStream#class#ff AS R ON FIRST 1 OUTPUT r58.<3>, r58.<1>, r58.<2>

And some cases that look scary, but are just using the Cartesian product to join FunctionInput and FunctionOutput with the specific function or class:

                      77     ~0%       {3} r261 = JOIN DataFlow::DataFlowFunction::hasDataFlow_dispred#fff#shared AS L WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, L.<0>, L.<1>
                      263    ~3%       {4} r262 = JOIN r261 WITH Declaration::Declaration::getDeclaringType_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, "push_back", r261.<1>, r261.<2>
                      1      ~0%       {3} r263 = JOIN r262 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r262.<0>, r262.<2>, r262.<3>

geoffw0 · 2021-01-15T19:49:19Z

Both "copyfmt" and "operator>>" are cases where the model only models one function, so the [] syntax is not used for these strings. I remember previously finding that [] causes the optimizer to do things differently and avoids the problem. I don't remember a lot more than that and I don't think I found a good solution (short of ["copyfmt", "something-that-is-not-a-legal-function-name"]).

…atters.

… well, for consistency.

geoffw0 · 2021-01-19T09:08:18Z

I've just pushed five commits containing two performance improvements, designed by myself and @jbj:

(1) pragma[nomagic] on hasTaintFlow and hasDataFlow. This is a barrier to prevent the implementation details inside taint models from being affected by the context in which hasTaintFlow / hasDataFlow is used. In particular, in context we have information about FunctionCalls as well as Functions that are being modelled, allowing evaluation to start from the Call rather than the Function - which turns out to be a poor strategy. This is now prevented.

(2) even with the first change we still get a mediocre evaluation strategy, where we narrow down to all of the functions in the modelled class before joining on name. This is the 'wrong' order, producing a fair number of rows we don't need. The getClassAndName predicate (a concept previously referred to as getClassForMember) blocks this strategy, though it does create a bit of work up front (@jbj we didn't discuss the up front cost of computing the new predicate; it seems like it will be bigger than the cost of doing the 'wrong' ordering on one model, but it isn't per-model, which means we don't need to worry about worst cases and future expansion so much).

Notes:

the first change (in two places) applies to everything, whereas the second change only works for models where the change has been made in their code. We may forget some models now or in the future, which is less than ideal, but with the first change in place performance should be acceptable.
the pragma[nomagic]s appear to be necessary for things to work as intended.

Query used for most of the testing:

import semmle.code.cpp.dataflow.TaintTracking
import DataFlow

select strictcount(Node n1, Node n2 | TaintTracking::localTaintStep(n1, n2))

Looking at StdStringCStr::StdStringCStr ("c_str") Before the latest batch of changes:

[2021-01-18 18:01:24] (8s) Tuple counts for StdString::StdStringCStr#class#b/1@52e292:
                      1953  ~1%     {2} r1 = SCAN m#FormattingFunction::FormattingFunction::getFirstFormatArgumentIndex_dispred#bf AS I OUTPUT I.<0> 'this', "c_str"
                      4     ~0%     {1} r2 = JOIN r1 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r1.<0> 'this'
                      4     ~0%     {1} r3 = STREAM DEDUP r2
                      4     ~0%     {2} r4 = JOIN r3 WITH Declaration::Declaration::getDeclaringType_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<0> 'this'
                      2     ~0%     {1} r5 = JOIN r4 WITH project#StdString::StdBasicString#class#ff AS R ON FIRST 1 OUTPUT r4.<1> 'this'
                                    return r5

[2021-01-18 18:01:24] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#bbf/3@e4424b:
...
                      1      ~0%       {2} r44 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
                      1953   ~3%       {3} r45 = JOIN r44 WITH project#Call::Call::getTarget_dispred#ff AS R CARTESIAN PRODUCT OUTPUT R.<0> 'this', r44.<0> 'input', r44.<1> 'output'
                      2      ~0%       {3} r46 = JOIN r45 WITH StdString::StdStringCStr#class#b AS R ON FIRST 1 OUTPUT r45.<0> 'this', r45.<1> 'input', r45.<2> 'output'

With the first commit only (pragma[nomagic] on hasTaintFlow / hasDataFlow):

StdString::StdStringCStr
[not explicitly evaluated]

[2021-01-18 18:11:11] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#fff/3@b9de32:
...
                      1      ~0%       {2} r72 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
...
                      51     ~0%       {3} r317 = JOIN r72 WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, r72.<0> 'input', r72.<1> 'output'
                      216    ~0%       {4} r318 = JOIN r317 WITH Declaration::Declaration::getDeclaringType_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1> 'this', "c_str", r317.<1> 'input', r317.<2> 'output'
                      2      ~0%       {3} r319 = JOIN r318 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r318.<0> 'this', r318.<2> 'input', r318.<3> 'output'

and after:

StdString::StdStringCStr
[not explicitly evaluated]

[2021-01-18 17:59:01] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#fff/3@896546:
...
                      1      ~0%       {2} r72 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
...
                      51     ~4%       {4} r228 = JOIN r72 WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT "c_str", R.<0>, r72.<0> 'input', r72.<1> 'output'
                      2      ~0%       {3} r229 = JOIN r228 WITH Function::Function::getClassAndName#fff_120#join_rhs AS R ON FIRST 2 OUTPUT R.<2> 'this', r228.<2> 'input', r228.<3> 'output'

There are similar results in DataFlow, e.g. for StdIStreamIn::StdIStreamIn ("operator>>").

jbj

All these changes LGTM, and I witnessed most of the tuple-count improvements in our screen sharing session yesterday. Even if we should run into performance problems when expanding these changes to the rest of the library, I think the getClassAndName approach is flexible enough that most performance problems could be addressed centrally, without changing every single model.

cpp/ql/src/semmle/code/cpp/Function.qll

MathiasVP

LGTM as well now!

geoffw0 · 2021-01-19T12:48:33Z

Thanks for the close collaboration, reviews and merge everyone!

C++: Model classes in StdString.qll.

905b04a

geoffw0 added the C++ label Nov 30, 2020

geoffw0 requested a review from a team as a code owner November 30, 2020 15:16

geoffw0 added 2 commits December 1, 2020 10:51

C++: Autoformat.

79f32b0

Merge branch 'main' into modelclasses

7f0209f

rdmarsh2 reviewed Jan 13, 2021

View reviewed changes

cpp/ql/src/semmle/code/cpp/models/implementations/StdString.qll Outdated Show resolved Hide resolved

C++: Change to more natural expressions without use of weird predicat…

13d0efe

…es or 'any'. The classes for string objects now match instantiations directly rather than the template.

geoffw0 added 3 commits January 14, 2021 15:20

C++: Correct QLDoc comments.

54bd36d

C++: Simplification.

7012bc0

Merge branch 'main' into modelclasses

15089c4

geoffw0 added 5 commits January 19, 2021 08:46

C++: Fix performance issue in hasTaintFlow / hasDataFlow.

a5632b2

C++: Introduce new predicate for better performance in models.

b8e6ad8

C++: Apply the new pattern in other parts of StdString.qll where it m…

bfef1a2

…atters.

C++: Apply the new pattern where it doesn't matter for performance as…

f8a1fb1

… well, for consistency.

C++: It turns out .getTemplate() is not necessary.

b4a5346

jbj previously approved these changes Jan 19, 2021

View reviewed changes

MathiasVP reviewed Jan 19, 2021

View reviewed changes

cpp/ql/src/semmle/code/cpp/Function.qll Outdated Show resolved Hide resolved

C++: Fix getClassAndName parameter name.

cff5635

geoffw0 dismissed jbj’s stale review via cff5635 January 19, 2021 10:35

MathiasVP approved these changes Jan 19, 2021

View reviewed changes

jbj merged commit 24947f2 into github:main Jan 19, 2021

geoffw0 mentioned this pull request Feb 12, 2021

C++: StdString BSL support #5158

Merged

Dec	JAN	Feb
	23
2025	2026	2027

C++: Model classes in StdString.qll. #4750

C++: Model classes in StdString.qll. #4750

Uh oh!

Conversation

geoffw0 commented Nov 30, 2020

Uh oh!

MathiasVP commented Dec 1, 2020

Uh oh!

geoffw0 commented Jan 8, 2021

Uh oh!

jbj commented Jan 11, 2021

Uh oh!

jbj commented Jan 11, 2021

Uh oh!

geoffw0 commented Jan 11, 2021

Uh oh!

Uh oh!

rdmarsh2 commented Jan 13, 2021

Uh oh!

rdmarsh2 commented Jan 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geoffw0 commented Jan 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rdmarsh2 commented Jan 15, 2021

Uh oh!

geoffw0 commented Jan 15, 2021

Uh oh!

geoffw0 commented Jan 19, 2021

Uh oh!

jbj left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MathiasVP left a comment

Choose a reason for hiding this comment

Uh oh!

geoffw0 commented Jan 19, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rdmarsh2 commented Jan 14, 2021 •

edited

Loading

geoffw0 commented Jan 14, 2021 •

edited

Loading