
fix(csharp/src/Drivers/Databricks): Correct DatabricksCompositeReader and StatusPoller to Stop/Dispose Appropriately#3217

Merged
CurtHagenlocher merged 23 commits into apache:main from toddmeng-db:toddmeng-db/operation-status-poller-error-handling on Aug 8, 2025

Conversation

Contributor

@toddmeng-db toddmeng-db commented Jul 29, 2025

Motivation

In the following cases, the status poller is not stopped or disposed properly:

  1. The DatabricksCompositeReader is explicitly disposed by the user
  2. The CloudFetchReader is done returning results
  3. The operation reaches an edge-case terminal status (timedout_state, unknown_state)

In addition:

  • When DatabricksOperationStatusPoller.Dispose() is called, it may cancel the in-flight GetOperationStatusRequest in the client. If the input buffer has data when cancellation is triggered, the TCLI client is left with unconsumed/unsent data in its buffer, which breaks subsequent requests (fixed in this PR)

Fixes

The DatabricksOperationStatusPoller logic is now managed by DatabricksCompositeReader (moved out of BaseDatabricksReader), so that every case where null results (indicating completion) are returned is handled.

Disposing DatabricksCompositeReader appropriately disposes the activeReader and statusPoller
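
The ownership described above can be sketched roughly as follows. This is not the code from the PR: `CompositeReaderSketch`, `OnResultsExhausted`, and `CountingDisposable` are hypothetical names, and the reader and poller are reduced to plain `IDisposable` instances. It only illustrates stopping the poller when results run out and making `Dispose()` idempotent.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for a composite reader that owns a reader and a poller.
public sealed class CompositeReaderSketch : IDisposable
{
    private IDisposable? _activeReader;
    private IDisposable? _statusPoller;
    private bool _disposed;

    public CompositeReaderSketch(IDisposable activeReader, IDisposable statusPoller)
    {
        _activeReader = activeReader;
        _statusPoller = statusPoller;
    }

    // Assumed hook: called when the underlying reader returns null (no more results).
    public void OnResultsExhausted() => StopPoller();

    private void StopPoller()
    {
        _statusPoller?.Dispose();
        _statusPoller = null; // later calls become no-ops
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        StopPoller();              // stop polling before releasing the reader
        _activeReader?.Dispose();
        _activeReader = null;
    }
}

// Test double that counts how often Dispose() is called.
public sealed class CountingDisposable : IDisposable
{
    public int DisposeCount { get; private set; }
    public void Dispose() => DisposeCount++;
}

public static class Program
{
    public static void Main()
    {
        var reader = new CountingDisposable();
        var poller = new CountingDisposable();
        var composite = new CompositeReaderSketch(reader, poller);

        composite.OnResultsExhausted(); // poller stops as soon as results run out
        composite.Dispose();
        composite.Dispose();            // second call is a no-op

        Console.WriteLine($"reader: {reader.DisposeCount}, poller: {poller.DisposeCount}");
    }
}
```

In this sketch each owned object is disposed exactly once, whether the reader finishes on its own, the user disposes it early, or both happen.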

TODO

Follow-up PR: when the statement is disposed, it should also dispose the reader (the poller is currently stopped when the operation handle is set to null, but this should also happen explicitly).

Need to add some unit tests (follow-up PR: #3243)

@toddmeng-db toddmeng-db changed the title from "Error handling for operation status poller" to "fix(csharp/src/Drivers/Databricks): Error handling for operation status poller" Jul 29, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 2 times, most recently from ec41720 to 004a5a7 Compare July 29, 2025 17:50
@jadewang-db
Contributor

Can you confirm that, even without this fix, polling stops once the statement is disposed? If not, we need to fix that as well.

Comment thread csharp/src/Drivers/Apache/Hive2/HiveServer2Statement.cs
@toddmeng-db toddmeng-db changed the title from "fix(csharp/src/Drivers/Databricks): Error handling for operation status poller" to "fix(csharp/src/Drivers/Databricks): Tighten OperationStatusPoller Disposal" Jul 29, 2025
Comment thread csharp/src/Drivers/Apache/Hive2/HiveServer2Statement.cs
@@ -247,6 +247,10 @@ private async Task FetchResultsAsync(CancellationToken cancellationToken)
_downloadQueue.Add(EndOfResultsGuard.Instance, cancellationToken);
_isCompleted = true;
Contributor Author

@toddmeng-db toddmeng-db Jul 30, 2025

From testing:
Small nit, but I think we need to avoid this here: if the download queue is full, the exception handling would get stuck. Should I modify the exception handling below, or was there a reason why it is like this? (line 262) @jadewang-db

catch (Exception ex)
{
    try
    {
        _downloadQueue.Add(EndOfResultsGuard.Instance, CancellationToken.None);
    }
}

Alternatively, we can create a new CancellationToken with a timeout for this attempt:

CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);

Contributor

thanks, a cancellation token looks good

Contributor Author

Oh just saw this comment, let me implement

Contributor Author

Actually looks like TryAdd is better suited here
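
To illustrate why TryAdd fits this case, here is a minimal sketch. It assumes the download queue is a bounded `BlockingCollection<T>` (the excerpt above does not show its declaration), and it uses a hypothetical string sentinel in place of `EndOfResultsGuard.Instance`. `EndOfResultsSignal` and `TrySignalEnd` are illustrative names, not driver APIs.

```csharp
using System;
using System.Collections.Concurrent;

public static class EndOfResultsSignal
{
    // Attempts to enqueue the sentinel without blocking forever.
    // Returns false if the queue stayed full for the whole timeout,
    // or if it was already completed for adding or disposed.
    public static bool TrySignalEnd(BlockingCollection<string> queue, TimeSpan timeout)
    {
        try
        {
            return queue.TryAdd("END_OF_RESULTS", timeout);
        }
        catch (InvalidOperationException)
        {
            // Covers CompleteAdding() having been called, and
            // ObjectDisposedException, which derives from this type.
            return false;
        }
    }

    public static void Main()
    {
        using var queue = new BlockingCollection<string>(boundedCapacity: 1);
        queue.Add("batch-0"); // the queue is now full

        // Add(item, CancellationToken.None) would block here; TryAdd gives up instead.
        Console.WriteLine(TrySignalEnd(queue, TimeSpan.FromMilliseconds(50)));

        queue.Take(); // a consumer drains one item
        Console.WriteLine(TrySignalEnd(queue, TimeSpan.FromMilliseconds(50)));
    }
}
```

With a full queue the first attempt returns false after the timeout rather than hanging the error handler; once a slot frees up, the second attempt succeeds.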

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from 9caf8db to 9fd9fea Compare July 30, 2025 04:32
@toddmeng-db toddmeng-db changed the title from "fix(csharp/src/Drivers/Databricks): Tighten OperationStatusPoller Disposal" to "fix(csharp/src/Drivers/Databricks): Tighten Statement Disposal" Jul 30, 2025
@toddmeng-db toddmeng-db changed the title from "fix(csharp/src/Drivers/Databricks): Tighten Statement Disposal" to "fix(csharp/src/Drivers/Databricks): Tighten Statement, Reader, Poller Disposal" Jul 30, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 9 times, most recently from d55808c to 74c6ee8 Compare July 30, 2025 22:49
@@ -69,13 +72,18 @@ private async Task PollOperationStatus(CancellationToken cancellationToken)
var operationHandle = _statement.OperationHandle;
if (operationHandle == null) break;

Contributor Author

@toddmeng-db toddmeng-db Jul 30, 2025

We need to use a timeout token here instead of cancelling when the caller's cancellation token is triggered. If a request is interrupted prematurely, the TCLI client may still have unsent/unconsumed data in its buffers, which affects subsequent calls with that client (that is, any future call in the same session).
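
A rough sketch of that idea, under stated assumptions: `PollerSketch`, `PollAsync`, and the `sendRequest` delegate are hypothetical stand-ins for the poller and the Thrift RPC, and a plain `CancellationTokenSource` with a timeout is used in place of `ApacheUtility.GetCancellationToken`, whose implementation is not shown here. The stop token is only observed between requests, so an in-flight request is never interrupted by it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PollerSketch
{
    // Polls until stopToken is cancelled, but never cancels a request that is
    // already in flight: each request gets its own timeout-only token.
    public static async Task<int> PollAsync(
        Func<CancellationToken, Task> sendRequest, // hypothetical stand-in for the RPC
        TimeSpan interval,
        TimeSpan requestTimeout,
        CancellationToken stopToken)
    {
        int completed = 0;
        while (!stopToken.IsCancellationRequested)
        {
            using (var timeoutSource = new CancellationTokenSource(requestTimeout))
            {
                // Deliberately not linked to stopToken.
                await sendRequest(timeoutSource.Token);
                completed++;
            }

            try
            {
                // Only the idle wait between requests observes stopToken.
                await Task.Delay(interval, stopToken);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
        return completed;
    }

    public static async Task Main()
    {
        using var stop = new CancellationTokenSource();
        bool cancelledMidRequest = false;

        int count = await PollAsync(
            async token =>
            {
                stop.Cancel();          // a stop is requested while the request is in flight
                await Task.Delay(20);   // simulated round trip
                cancelledMidRequest |= token.IsCancellationRequested;
            },
            TimeSpan.FromMilliseconds(10),
            TimeSpan.FromSeconds(5),
            stop.Token);

        Console.WriteLine($"requests completed: {count}, cancelled mid-request: {cancelledMidRequest}");
    }
}
```

The trade-off in this sketch is that stopping can take up to one request timeout, in exchange for never abandoning a half-sent request on a shared client.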

Contributor

are you able to repro this? should we do this to all the thrift rpc calls in the driver?

Contributor Author

@toddmeng-db toddmeng-db Aug 1, 2025

I think it is because in THTTPTransport (used by SparkHttpConnection -> DatabricksHttpconnection), a new Stream is created when the request is flushed. If cancellation happens before this, that stream doesn't get discarded:
https://github.com/apache/thrift/blob/master/lib/netstd/Thrift/Transport/Client/THttpTransport.cs#L281

Yes, during testing, got some errors. In the proxy logs, I remember seeing requests sent out with both GetOperationStatus and CloseOperationStatus (in the same request) while testing another PR

I think we are safe in HiveServer2Statement, but we might need to adjust CancellationToken in DatabricksReader, CloudFetchResultFetcher, and DatabricksCompositeReader

Contributor Author

Actually, I think this depends a bit on how CancellationToken could be used by PBI, too
@CurtHagenlocher will mashup ever trigger cancellationTokens passed into IArrowStreamReader.ReadNextBatchAsync? Do we need to ensure that the connection still remains usable for subsequent statements?

Contributor Author

@toddmeng-db toddmeng-db Aug 2, 2025

At least for now, I think we can operate this way:

  1. If the user cancels the token passed in to ReadNextBatchAsync, we should not break the client
  2. Dispose() should not break the client either

Contributor

> @CurtHagenlocher will mashup ever trigger cancellationTokens passed into IArrowStreamReader.ReadNextBatchAsync? Do we need to ensure that the connection still remains usable for subsequent statements?

This is currently unimplemented but we'll need to implement it before GA for parity with the ODBC implementation. What is probably most important for cancellation is query execution, and unless we manage to push forward the proposed ADBC 1.1 API, currently the only way to cancel a running query is to call AdbcStatement.Cancel. There is currently no implementation of this method for any of the C#-implemented drivers :(.

Contributor

From a Power BI perspective, the most important use of cancellation is for Direct Query because users can generate a lot of queries simply by clicking around in a visual and in-progress queries will need to be cancelled if their output is no longer needed. DQ output tends to be relatively small, so being able to cancel in the middle of reading the output is arguably less important than being able to cancel before the results start coming back.

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 4 times, most recently from ecb0771 to 3263cef Compare July 31, 2025 04:52
@toddmeng-db toddmeng-db changed the title from "fix(csharp/src/Drivers/Databricks): Tighten Statement, Reader, Poller Disposal" to "fix(csharp/src/Drivers/Databricks): Correct StatusPoller to Stop/Dispose Appropriately" Aug 1, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 4 times, most recently from 8b88019 to 8e54490 Compare August 1, 2025 16:44
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from 579e26d to be06c48 Compare August 2, 2025 01:30
@toddmeng-db toddmeng-db requested a review from jadewang-db August 4, 2025 17:03
Comment thread csharp/src/Drivers/Databricks/DatabricksCompositeReader.cs
Comment thread csharp/src/Drivers/Databricks/DatabricksOperationStatusPoller.cs Outdated
@CurtHagenlocher CurtHagenlocher changed the title fix(csharp/src/Drivers/Databricks): Correct DatabricksCompositeReader and StatusPoller to Stop/Dispose Appropriately fix(csharp/src/Drivers/Databricks): Correct DatabricksCompositeReader and StatusPoller to Stop/Dispose Appropriately Aug 5, 2025
@toddmeng-db toddmeng-db marked this pull request as ready for review August 6, 2025 21:35
@github-actions github-actions Bot added this to the ADBC Libraries 20 milestone Aug 6, 2025
request.StartRowOffset = offset;

// Cancelling mid-request breaks the client; Dispose() should not break the underlying client
CancellationToken expiringToken = ApacheUtility.GetCancellationToken(DatabricksConstants.DefaultCloudFetchRequestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);
Contributor

should you respect the connection parameter DatabricksParameters.CloudFetchTimeoutMinutes instead of the default value?

Contributor Author

Do you mean I shouldn't create a new constant here?

Contributor

no, what I meant is if you should check the value of the connection parameter CloudFetchTimeoutMinutes (adbc.databricks.cloudfetch.timeout_minutes) which can be set by the client and customer.

Contributor Author

@toddmeng-db toddmeng-db Aug 7, 2025

Oh got it, that makes sense, it should be a configurable parameter. To be consistent with the rest of HiveServer2Statement, I'm just using the QueryTimeout parameter (which is what other FetchResultsRequest uses)

I have some changes in a follow-up PR that will make this change easier to do for DatabricksReader, will leave this as a TODO
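
For reference, the reviewer's suggestion of honoring a client-supplied timeout could look roughly like the sketch below. It is an illustration only and not what this PR does (the PR uses the QueryTimeout parameter). `TimeoutConfigSketch`, `ResolveTimeout`, and the 5-minute fallback are hypothetical; only the key string `adbc.databricks.cloudfetch.timeout_minutes` comes from the discussion above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TimeoutConfigSketch
{
    // Key quoted in the review discussion.
    public const string TimeoutMinutesKey = "adbc.databricks.cloudfetch.timeout_minutes";

    // Returns the configured timeout when it parses as a positive integer,
    // otherwise the supplied fallback.
    public static TimeSpan ResolveTimeout(
        IReadOnlyDictionary<string, string> properties,
        TimeSpan fallback)
    {
        if (properties.TryGetValue(TimeoutMinutesKey, out var raw)
            && int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out int minutes)
            && minutes > 0)
        {
            return TimeSpan.FromMinutes(minutes);
        }
        return fallback;
    }

    public static void Main()
    {
        var fallback = TimeSpan.FromMinutes(5); // hypothetical default, not the driver's
        var configured = new Dictionary<string, string> { [TimeoutMinutesKey] = "10" };

        Console.WriteLine(ResolveTimeout(configured, fallback).TotalMinutes);
        Console.WriteLine(ResolveTimeout(new Dictionary<string, string>(), fallback).TotalMinutes);
    }
}
```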

Comment thread csharp/src/Drivers/Databricks/DatabricksCompositeReader.cs Outdated
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 2 times, most recently from 9242fd2 to efecc82 Compare August 7, 2025 19:40
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from efecc82 to 65f9d0d Compare August 7, 2025 19:41
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Thanks! The linter error needs to be fixed and I made a few small low-priority suggestions.

Comment thread csharp/test/Drivers/Databricks/Unit/DatabricksOperationStatusPollerTests.cs Outdated
Comment thread csharp/src/Drivers/Databricks/DatabricksCompositeReader.cs Outdated
Comment thread csharp/src/Drivers/Databricks/DatabricksCompositeReader.cs
Comment thread csharp/src/Drivers/Databricks/DatabricksOperationStatusPoller.cs Outdated
Comment thread csharp/test/Drivers/Databricks/Unit/DatabricksOperationStatusPollerTests.cs Outdated
Comment thread csharp/test/Drivers/Databricks/Unit/DatabricksOperationStatusPollerTests.cs Outdated
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 4 times, most recently from f559692 to 5a48ef2 Compare August 8, 2025 19:54
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from 5a48ef2 to 4130c83 Compare August 8, 2025 19:55
@CurtHagenlocher CurtHagenlocher merged commit f0f36da into apache:main Aug 8, 2025
7 checks passed

5 participants