Introduce jitter metric based on RFC 1889 Appendix A#8586

Open
GGraziadei wants to merge 8 commits into apache:master from
GGraziadei:8538-rfc-1889a-jitter-metric

@GGraziadei

What is the purpose of the change

In deterministic real-time processing, predictable latency is as important as low latency: predictability is a prerequisite for building a deterministic system. A jitter metric supports this in several ways:

  • Micro-burst detection: high jitter reveals short spikes that average latency smooths out.
  • Compliance: modern SLAs rely on percentiles (e.g., P99). Jitter is a strong leading indicator of tail-latency degradation.
  • Root cause analysis: high per-component jitter points to GC pressure or resource contention, whereas high global jitter with stable components suggests network congestion or shuffle bottlenecks.
  • Bottleneck identification: jitter enables precise identification of where bottlenecks occur in the topology and helps distinguish their underlying causes, making performance issues easier to diagnose and resolve.

To ensure negligible performance impact, I propose to use an Exponentially Weighted Moving Average (EWMA), following RFC 1889 logic https://www.rfc-editor.org/rfc/rfc1889#appendix-A.8

Mathematical Model:
J_new = J_old + (|D_current - D_previous| - J_old) * smoothing_factor
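
As a minimal sketch of the update rule above (not the PR's actual EwmaGauge), the following Java class keeps D_previous and J as two fields and applies the formula once per sample. The class name JitterSketch, its fields, and the way the first sample only seeds D_previous are assumptions made for illustration; the sketch is also single-threaded, unlike a production gauge.

```java
/** Hypothetical illustration of the EWMA jitter update; not the PR's EwmaGauge. */
final class JitterSketch {
    private final double smoothingFactor; // 1/16 in the RFC
    private double lastLatency;           // D_previous
    private double jitter;                // J
    private boolean seeded;               // false until the first sample arrives

    JitterSketch(double smoothingFactor) {
        this.smoothingFactor = smoothingFactor;
    }

    /** Feeds one latency sample and returns the updated jitter estimate. */
    double addLatency(double latency) {
        if (seeded) {
            // J_new = J_old + (|D_current - D_previous| - J_old) * smoothing_factor
            double d = Math.abs(latency - lastLatency);
            jitter += (d - jitter) * smoothingFactor;
        } else {
            // Assumption: the first sample has no predecessor, so it only seeds D_previous.
            seeded = true;
        }
        lastLatency = latency;
        return jitter;
    }
}
```

With smoothing_factor = 1/16, feeding latencies 0 and then 16 gives a jitter of 1.0, since 0 + (16 - 0) / 16 = 1.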

Performance impact

  • Minimal computational overhead: the EWMA update is a constant-time computation per sample.
  • Memory efficiency: only two persistent variables (8 bytes) per task.
  • System calls: no system calls required to track the latency (the latencies are already computed).

How was the change tested

  • Unit test: introduced new test cases for Config, TaskMetrics, EwmaGauge
  • Local smoke test: registered a topology metrics reporter and saved the captured metrics in the attached file
  • The package metrics2 doesn't affect it.
    worker_log.zip

Example results in worker logs

2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__emit-count-default.m1_rate
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 30.0
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__execute-count-split:default.m1_rate
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 30.0
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__execute-latency-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.0
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__execute-rfc1889a-jitter-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.2557194505051832
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__process-latency-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.3333333333333333
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__process-rfc1889a-jitter-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.145830156234796

In the context of #8583

@rzo1
Contributor

rzo1 commented May 6, 2026

Question on the d <= 0 short-circuit in EwmaGauge.addValue

if (d <= 0) {
    return;
}

Is this skip intentional? Reading RFC 3550 §A.8 (which supersedes RFC 1889 with the same text):

d = transit - s->transit;
s->transit = transit;
if (d < 0) d = -d;
s->jitter += (1./16.) * ((double)d - s->jitter);

The update is unconditional. When d == 0 it collapses to J ← J · (1 − α) = J · 15/16, i.e. the EWMA decays toward zero. The RFC explicitly chooses 1/16 for its "noise reduction ratio while maintaining a reasonable rate of convergence"; convergence back toward zero during quiet periods is part of the spec's intent.

For comparison, the major RTP stacks all apply the update unconditionally:

  • GStreamer, rtpsource.c:997
    src->stats.jitter += diff - ((src->stats.jitter + 8) >> 4);
  • PJSIP, rtcp.c:434
    sess->jitter += d - ((sess->jitter + 8) >> 4);
  • WebRTC, receive_statistics_impl.cc:165
    int32_t jitter_diff_q4 = (time_diff_samples << 4) - jitter_q4_;
    jitter_q4_ += ((jitter_diff_q4 + 8) >> 4);
    (the only guard here is a < 450000 anomaly cap, not a d == 0 short-circuit)
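
The integer one-liners above keep the jitter scaled by 16, so the 1/16 gain becomes a right shift. A rough Java sketch of that fixed-point form follows; the class and method names are hypothetical, and d is assumed to be already non-negative.

```java
/** Hypothetical fixed-point (scaled by 16) jitter accumulator, in the style of the C snippets. */
final class FixedPointJitterSketch {
    private long jitterQ4; // jitter estimate multiplied by 16

    /** Applies the update for every sample, including d == 0. */
    void update(long d) {
        // 16*J_new = 16*J_old + d - J_old, with +8 for round-to-nearest before the shift.
        jitterQ4 += d - ((jitterQ4 + 8) >> 4);
    }

    /** Scaled state, exposed only to make the decay visible. */
    long scaled() {
        return jitterQ4;
    }

    /** Jitter in the same unit as d. */
    long jitter() {
        return jitterQ4 >> 4;
    }
}
```

Starting from zero, update(16) leaves the scaled state at 16 (a jitter of 1), and a following update(0) lowers it to 15, so a zero deviation still moves the estimate.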

The test EwmaGaugeTest.zeroDeviationDecays appears to lock in the current behavior, but its display name ("Zero deviation decays jitter toward zero") suggests the original intent matched the spec while the assertion encodes the opposite. Worth double-checking which one was meant.

Suggested fix: drop the if (d <= 0) return; block — the CAS loop below is already correct for d == 0. (Optionally short-circuit the math with updatedJitter = currentJitter * (1.0 - alpha), but you must still write it back.)
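
One way the unconditional update could look in a lock-free gauge is sketched below. It assumes the jitter is stored as raw double bits in an AtomicLong and updated with compareAndSet; the class name CasJitterSketch and its methods are illustrative and may not match EwmaGauge in the PR.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical lock-free EWMA holder; shows the update with no d <= 0 short-circuit. */
final class CasJitterSketch {
    private final double alpha;
    // Assumption: the double estimate is kept as raw bits so it can be swapped atomically.
    private final AtomicLong jitterBits = new AtomicLong(Double.doubleToRawLongBits(0.0));

    CasJitterSketch(double alpha) {
        this.alpha = alpha;
    }

    /** Applies J <- J + alpha * (d - J) for every d >= 0, including d == 0. */
    void addDeviation(double d) {
        long currentBits;
        long updatedBits;
        do {
            currentBits = jitterBits.get();
            double currentJitter = Double.longBitsToDouble(currentBits);
            double updatedJitter = currentJitter + alpha * (d - currentJitter);
            updatedBits = Double.doubleToRawLongBits(updatedJitter);
            // Retry if another thread changed the estimate between get() and the swap.
        } while (!jitterBits.compareAndSet(currentBits, updatedBits));
    }

    double getValue() {
        return Double.longBitsToDouble(jitterBits.get());
    }
}
```

Because the write-back always happens, a run of d == 0 samples multiplies the estimate by (1 - alpha) each time instead of freezing it.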

Was the skip intentional?

@GGraziadei
Author

Hello @rzo1, thank you for your comment.
You are right: this is an implementation error, and I am fixing it.
If the jitter is not updated when the latency is stable (d == 0), it never decreases, which is wrong.
For the test case "Zero deviation decays jitter toward zero" (alpha = 0.5), the correct sequence of states is:

  • lastLat = UNSEED; lat=0; j=0
  • lastLat=0; lat=10; d=5; j=0+alpha* (d-j)=2.5
  • lastLat=10; lat=10; d=0; j=2.5 + alpha * (d-j) = 2.5 - alpha * 2.5 = 1.25

rzo1 added the enhancement and java labels May 6, 2026
rzo1 added this to the 3.0.0 milestone May 6, 2026
Contributor

@rzo1 rzo1 left a comment

Follow-up review now that the d <= 0 skip is fixed. Most items are nits; the ones I'd want resolved before merge are the config-key naming (smoothing_factor vs the rest of Storm's dot-case keys), the RFC 1889a metric-name suffix (load-bearing for downstream dashboards once shipped), and the validator/runtime mismatch.

One general note that doesn't fit any single line: enabling jitter triples the per-(component:stream) gauge count on every task. On topologies with lots of streams that's a meaningful cardinality bump for the metrics backend (and most TSDBs charge per series). Worth a sentence in docs/Metrics.md warning operators that enabling this multiplies stream-keyed metrics 3×.

Comment thread conf/defaults.yaml
topology.max.spout.pending: null # ideally should be larger than topology.producer.batch.size. (esp. if topology.batch.flush.interval.millis=0)
topology.state.synchronization.timeout.secs: 60
topology.stats.sample.rate: 0.05
topology.stats.ewma.enable: false

topology.stats.ewma.smoothing_factor isn't represented here even though it's a tunable. Either add topology.stats.ewma.smoothing.factor: 0.0625 (or null with a comment about the RFC1889_ALPHA fallback) so operators can discover the knob via defaults.yaml, or call out the default in the Config.java Javadoc — right now the 1/16 default is invisible outside the source.
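
To illustrate the fallback being discussed, here is a small sketch of how configuration-reading code might resolve the smoothing factor. The key uses the dotted name proposed in this review, RFC1889_ALPHA mirrors the constant mentioned above, and the resolve helper is hypothetical; this is not Storm's ConfigUtils code.

```java
import java.util.Map;

/** Hypothetical resolution of the smoothing factor with a 1/16 fallback. */
final class SmoothingFactorDefaultSketch {
    // Assumption: dotted key name as suggested in the review, not the name currently in the PR.
    static final String KEY = "topology.stats.ewma.smoothing.factor";
    static final double RFC1889_ALPHA = 1.0 / 16.0; // 0.0625

    /** Returns the configured value, or the RFC default when the key is absent or null. */
    static double resolve(Map<String, Object> conf) {
        Object value = conf.get(KEY);
        if (value == null) {
            return RFC1889_ALPHA;
        }
        return ((Number) value).doubleValue();
    }
}
```

With an explicit 0.0625 entry in defaults.yaml the fallback branch would rarely run, but operators could then see the default without reading the source.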

* @see <a href="https://www.rfc-editor.org/rfc/rfc1889#appendix-A.8">RFC 1889 Appendix A.8</a>
*/
@CustomValidator(validatorClass = ConfigValidation.EwmaSmoothingFactorValidator.class)
public static final String TOPOLOGY_STATS_EWMA_SMOOTHING_FACTOR = "topology.stats.ewma.smoothing_factor";

Underscore breaks Storm's dot.case config convention. Every neighbour key uses dots (topology.stats.sample.rate, topology.builtin.metrics.bucket.size.secs, …). Suggest topology.stats.ewma.smoothing.factor. Worth changing now — once released it's a public surface.

@IsPositiveNumber
public static final String TOPOLOGY_STATS_SAMPLE_RATE = "topology.stats.sample.rate";
/**
* Enabling jitter streaming calculation (RFC 1889a).

Minor: "RFC 1889a" isn't a real RFC label — the algorithm lives in Appendix A.8 of RFC 1889 (and was carried unchanged into RFC 3550 §A.8). Suggest "RFC 1889 §A.8" or "RFC 3550 §A.8" everywhere it appears in this PR (Javadoc, metric names, docs). RFC 1889 has been obsolete since 2003, so a forward link to RFC 3550 is friendlier to readers.

* Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version
* 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
* <p>

<p> tags in the license header are inconsistent with the rest of storm-client (the standard ASF header uses blank lines between paragraphs — see e.g. TaskMetrics.java in this same PR). Same applies to EwmaGaugeTest.java (<p/>) and TaskMetricsTest.java (<p/>). Looks like an IDE auto-formatter artifact.

import java.util.concurrent.atomic.AtomicLong;

/**
* Lock-free jitter estimator following RFC 1889 Section 6.3.1.

RFC 1889 §6.3.1 is the Sender Report packet section, which only defines the interarrival jitter field; the algorithm itself is in §A.8. Suggest: Lock-free jitter estimator following RFC 1889 §A.8 / RFC 3550 §A.8.

Comment on lines +32 to +36
private static final String METRIC_NAME_PROCESS_RFC_1889a_JITTER = "__process-rfc1889a-jitter";
private static final String METRIC_NAME_COMPLETE_LATENCY = "__complete-latency";
private static final String METRIC_NAME_COMPLETE_RFC_1889a_JITTER = "__complete-rfc1889a-jitter";
private static final String METRIC_NAME_EXECUTE_LATENCY = "__execute-latency";
private static final String METRIC_NAME_EXECUTE_RFC_1889a_JITTER = "__execute-rfc1889a-jitter";

Two issues with these metric names:

  1. rfc1889a suffix isn't a real RFC label — the algorithm is in Appendix A.8 of RFC 1889. The a reads like a version suffix. Suggest __complete-rfc1889-jitter etc., or simply __complete-jitter (the RFC reference belongs in Metrics.md, not in every metric series name).
  2. Inconsistent identifier casing: the constants spell it RFC_1889a (a lowercase a inside an otherwise upper-case constant name). Convention in this file is all-uppercase: RFC_1889_JITTER or RFC1889_JITTER.

These metric names become public API the moment the PR ships — much easier to fix now than after dashboards are wired up.


  private final ConcurrentMap<String, RateCounter> rateCounters = new ConcurrentHashMap<>();
- private final ConcurrentMap<String, RollingAverageGauge> gauges = new ConcurrentHashMap<>();
+ private final ConcurrentMap<String, Gauge> gauges = new ConcurrentHashMap<>();

Going to a raw ConcurrentMap<String, Gauge> loses type info that the previous ConcurrentMap<String, RollingAverageGauge> had. Cleaner would be ConcurrentMap<String, Gauge<?>> — that keeps the wildcard typing, lets getOrCreateGauge work unchanged, and the only raw cast moves to registerGauge where you already have @SuppressWarnings({"unchecked", "rawtypes"}).
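
A rough sketch of the wildcard-typed map is shown below. SketchGauge is a local stand-in for the metrics library's generic gauge interface, and the two gauge classes, the registry class, and the getOrCreateGauge signature are all hypothetical; the PR's TaskMetrics may differ.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Local stand-in for a generic gauge interface (assumption, not the real library type). */
interface SketchGauge<T> {
    T getValue();
}

final class ConstantGauge implements SketchGauge<Double> {
    private final double value;

    ConstantGauge(double value) {
        this.value = value;
    }

    @Override
    public Double getValue() {
        return value;
    }
}

final class CounterGauge implements SketchGauge<Long> {
    @Override
    public Long getValue() {
        return 0L;
    }
}

final class GaugeRegistrySketch {
    // Wildcard value type: heterogeneous gauges share one map without raw types.
    private final ConcurrentMap<String, SketchGauge<?>> gauges = new ConcurrentHashMap<>();

    /** Returns the gauge registered under name, creating it with the factory if absent. */
    <G extends SketchGauge<?>> G getOrCreateGauge(String name, Class<G> type, Supplier<G> factory) {
        SketchGauge<?> gauge = gauges.computeIfAbsent(name, key -> factory.get());
        // Checked cast: throws ClassCastException if the name is bound to another gauge type.
        return type.cast(gauge);
    }
}
```

The checked cast through Class<G> keeps the lookup free of unchecked warnings, so any remaining raw cast stays confined to the registration call.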

return getOrCreateGauge(metricName, streamId, RollingAverageGauge.class, this.rollingAverageGaugeFactory);
}

private EwmaGauge getExponentialWeightedMobileAverageGauge(String metricName, String streamId) {

Typo: "Mobile" → "Moving". EWMA = Exponentially Weighted Moving Average. (The field name ewmaSmoothingFactor is fine; just the helper method.)

Comment on lines +858 to +865
if (o instanceof Number) {
double alpha = ((Number) o).doubleValue();
if (alpha > 0.0 && alpha < 1.0) {
return;
}
}
throw new IllegalArgumentException(
"Field " + name + " must be a number in the open interval (0, 1), got: " + o);

This validator only accepts Number, but ConfigUtils.ewmaSmoothingFactor calls ObjectReader.getDouble(value) which also parses strings like "0.5". Other numeric configs in Storm (e.g. topology.stats.sample.rate) accept stringified numbers from YAML, so the asymmetry is surprising — a value would be rejected at validation time but accepted at runtime if validation is bypassed (programmatic conf, tests, etc.).

Suggest mirroring what the runtime does: try ObjectReader.getDouble(o) inside a try/catch, then range-check. That way the validator and the runtime parser agree.
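
A minimal sketch of that suggestion, under the assumption that the runtime parser accepts both Number and numeric String values. The parseDouble helper below is a hypothetical stand-in for ObjectReader.getDouble so the example stays self-contained, and the class name is illustrative.

```java
/** Hypothetical validator that parses the value the same way as the runtime, then range-checks. */
final class SmoothingFactorValidatorSketch {

    /** Stand-in for the runtime parser (assumption): accepts Number or numeric String. */
    static double parseDouble(Object o) {
        if (o instanceof Number) {
            return ((Number) o).doubleValue();
        }
        if (o instanceof String) {
            return Double.parseDouble(((String) o).trim());
        }
        throw new IllegalArgumentException("Not a number: " + o);
    }

    /** Throws IllegalArgumentException unless o parses to a value in the open interval (0, 1). */
    static void validateField(String name, Object o) {
        String message = "Field " + name + " must be a number in the open interval (0, 1), got: " + o;
        double alpha;
        try {
            alpha = parseDouble(o);
        } catch (RuntimeException e) {
            // Covers NumberFormatException from malformed strings as well as unsupported types.
            throw new IllegalArgumentException(message, e);
        }
        // Written as a negated conjunction so that NaN is rejected too.
        if (!(alpha > 0.0 && alpha < 1.0)) {
            throw new IllegalArgumentException(message);
        }
    }
}
```

With this shape, 0.5 and "0.5" both pass, while 0, 1.0, and "abc" are rejected with the same message.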

Comment on lines +53 to +54
@Mock private RollingAverageGauge rollingAverageGauge;
@Mock private EwmaGauge ewmaGauge;

These two @Mock fields are declared but never referenced anywhere in the test class — Mockito still spends init cost on them. Drop both.

