Wednesday, June 27, 2018

Speeding up Salesforce unit testing performance

For many years I've had this thorn in my side with Apex test cases. It goes by the name of "Disable Parallel Apex Testing" and since Spring '13 (v27.0) it has needed to stay enabled, otherwise I'd get an UNABLE_TO_LOCK_ROW error due to the custom hierarchy settings that get updated in the test cases.

Leap forward to Spring/Summer '18 (v42.0/v43.0) and I'm still tangling with this. That's five years of waiting for all the test classes to run sequentially, one after another. That's about the right amount of procrastination to finally fix this problem! So hold on, we're going to raise a support case and keep at this until the bitter end¹.

What follows is a true story. The namespaces and class names have been changed to protect the innocent, but the snippets of messages with support are real.

Step 1 - Turning off "Disable Parallel Apex Testing"

This was simple enough. Uncheck a checkbox, run all the tests, and... Uh oh.

Could not run tests on class 01p400000000001 because: connection was cancelled here

With that setting checked all my tests were passing. Now some of my previously passing test classes are falling over when run in parallel. Worse still, the failures were intermittent. It wasn't just one specific class causing a problem; any one of a dozen or so classes could fail in this manner, and the culprits changed from run to run.

Step 2 - Winter '18 @isTest isParallel annotation to the rescue

Winter '18 added the new @isTest(isParallel=true) annotation to:

indicate test classes that can run in parallel and aren’t restricted by the default limits on the number of concurrent tests.

That's great, except I don't want to modify 90% of my test classes to deal with the 10% that are having issues. No, I'll just use @isTest(isParallel=false) and explicitly exclude the problem cases... except that doesn't work. At least not at the time of writing. Please vote for the idea Parallel Tests Option (isParallel) on the @IsTest Annotation to exclude tests to make this a viable approach.
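For reference, opting a class in looks like this. A minimal sketch: the class and method names are placeholders, only the annotation matters.

@isTest(isParallel=true)
private class QuoteCalculatorTest {
    // Placeholder test class: the class-level annotation is the only change needed
    // to let this class run outside the default concurrent test limits.
    @isTest
    static void calculatesLineTotals() {
        // ... arrange, act, assert as usual
        System.assert(true, 'placeholder assertion');
    }
}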

Step 3 - Once more unto the support case, dear friends, once more

[Day 1 - 2018-05-09] Desperate times. Let's raise a support case to see if they can isolate the underlying issue.

[Day 6 - 2018-05-14] After a bit of back and forth with Tier 2 to establish how to reproduce the problem (run all the test classes in the org), the initial advice back was:

The root cause of this issue?
Answer: - Parallel test execution is the root cause of this issue.

Solution?
Answer: - Disable Parallel test execution.

Question: - Why do you need to Disable Parallel Apex Testing?
Answer: - As per the salesforce document, Tests that are started from the Salesforce user interface (including the Developer Console) run in parallel. Parallel test execution can speed up test run time. Sometimes, parallel test execution results in data contention issues, and you can turn off parallel execution in those cases.

Huh. Well, yes, I knew that already (as per the case description when I raised it). That would work, but it's been five years of this situation, so let's push a bit harder. SHIBBOLEET!

Step 4 - How bad is the problem empirically?

Before responding to support with a Kiwi yeah-nah, I timed the total test run time for both parallel and serial test execution.

As it stood, the parallel test execution was marginally faster, as long as I re-ran any initially failed test classes immediately after the first run completed. That was an interesting result, and I think I can explain the similar timing between the two later on.
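If you want to repeat the comparison in your own org, the totals can be pulled from the test run records rather than a stopwatch. A rough sketch, assuming the standard ApexTestRunResult object and these fields are queryable from Apex in your org:

// Rough sketch: wall-clock duration and failure count for the five most recent runs.
// Assumes ApexTestRunResult (and these fields) are available to SOQL in the org.
for (ApexTestRunResult run : [
        SELECT AsyncApexJobId, StartTime, EndTime, MethodsCompleted, MethodsFailed
        FROM ApexTestRunResult
        WHERE Status = 'Completed'
        ORDER BY StartTime DESC
        LIMIT 5]) {
    Long wallClockSeconds = (run.EndTime.getTime() - run.StartTime.getTime()) / 1000;
    System.debug(run.AsyncApexJobId + ': ' + wallClockSeconds + 's, '
        + run.MethodsFailed + ' failed of '
        + (run.MethodsCompleted + run.MethodsFailed) + ' methods');
}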

The general consensus on Twitter was that updates to custom hierarchy settings were probably to blame for the contention and subsequent timeouts.

Step 5 - Isolating Custom Hierarchy Settings using the Stubbing API

I use a number of hierarchy custom settings to toggle various functions in the app. All interactions with those settings from Apex are done via a single class, which allows for sensible defaults, etc.

Those with particular feelings on how mock testing should be performed or a sensitive testing disposition may want to look away now...

To prevent any potentially blocking DML operations on the custom hierarchy settings I've injected a StubProvider when in a testing context. The StubProvider prevents any DML occurring when altering the settings in Apex tests. This isn't the typical usage for a test mocking framework, but it serves my needs here to help avoid database locking issues.

Here is a shortened version of how it looks:
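In sketch form only, with stand-in names rather than the real classes and settings:

// Minimal sketch of the pattern. FeatureSettings and FeatureToggles__c are stand-in
// names; the real class wraps several hierarchy custom settings.
public class FeatureSettings {

    // Replaced with a stub from test code so no DML ever hits the setting rows.
    @TestVisible
    private static FeatureSettings instance = new FeatureSettings();

    public static void enable(String toggleField) {
        instance.writeToggle(toggleField, true);
    }

    public void writeToggle(String toggleField, Boolean value) {
        FeatureToggles__c settings = FeatureToggles__c.getOrgDefaults();
        settings.put(toggleField, value);
        upsert settings; // the DML that could block under parallel test execution
    }
}

// Separate test class: swaps the real instance for a do-nothing stub.
@isTest
public class FeatureSettingsStub implements System.StubProvider {

    public static void inject() {
        FeatureSettings.instance = (FeatureSettings) Test.createStub(
            FeatureSettings.class, new FeatureSettingsStub());
    }

    public Object handleMethodCall(Object stubbedObject, String stubbedMethodName,
            Type returnType, List<Type> paramTypes, List<String> paramNames,
            List<Object> args) {
        // Swallow the call: the "setting change" happens without any insert/update,
        // so parallel tests never contend for the custom setting rows.
        return null;
    }
}

A test (or its @testSetup method) calls FeatureSettingsStub.inject() before exercising any code that flips the toggles; from then on the settings appear to change without a single row being written.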

And after all that, the result on the parallel test execution was... not much, really. The tests were marginally faster, but the underlying problem of test classes having their connection closed remains. At least I can rule out DML on the custom settings as the cause.

Step 6 - Back to the drawing board

Blaming the problem on the custom hierarchy settings made sense from a historical perspective, but it doesn't appear to be the source of my challenges.

I went back and had a closer look at the debug logs from the tests in the run. That's when I saw it:

"What?", you may ask, "Am I currently looking at?"

That colorful image is a selection of debug log timelines from various parallel test runs. Ignoring the majority of the markings, the important thing is a purple line representing a DML operation that is taking more and more of each transaction. Right up until the point that Salesforce starts terminating it. If we drill into one of those logs we can see it is a DML operation to insert 6 records taking 3 minutes 52 seconds (for 6 records!):

14:14:15.0 (325578801)|DML_BEGIN|[992]|Op:Insert|Type:DFB__OpportunityPriceType__c|Rows:6
14:18:07.376 (232376043557)|DML_END|[992]

Step 7 - Returning to the support case with proof

[Day 9 - 2018-05-17] If that isn't a smoking gun then I don't know what is. Let's take this new evidence back to the support case.

Me to support:

I've just noticed something else odd in the debug logs for the running tests. Inserts to DFB__OpportunityPriceType__c for 6 rows are taking an excessive amount of time.

I'm seeing times for the CommonTestSetup.createPriceTypeMappings() method of between 20 and 45 seconds to complete.
Attached Log1.txt - Look for the lines:
16:18:58.0 (150146460)|DML_BEGIN|[941]|Op:Insert|Type:DFB__OpportunityPriceType__c|Rows:6
16:19:19.214 (21214892238)|DML_END|[941]

It isn't clear to me why the inserts for those records are taking so long. The fields on it are [reasonably] basic and it is only 6 rows.
The only thing I can think of is that all the test classes are trying to use the same records and they all have the same indexed values in the PriceTypeID__c field.

A bit more context, as you, dear reader, can't see into the org like support can.

I have a custom object DFB__OpportunityPriceType__c with a unique external ID field. That field is indexed. Just about every single test class I have requires data in this custom object, as it contains configuration that gets linked to from Opportunity records. I thought I was being clever and used a @TestSetup method to insert 6 of these records once for all the other test methods to use. There are no triggers, workflow outbound messages, or other automation hanging off this custom object. None of the test classes use SeeAllData=true.
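In trimmed-down form, that setup looks something like the sketch below. The real helper does more than this, and everything other than the object and the PriceTypeID__c field is illustrative.

// Trimmed-down sketch of the shared setup helper; fields other than PriceTypeID__c
// are illustrative.
@isTest
public class CommonTestSetupSketch {

    // Called from each test class's @testSetup method.
    public static void createPriceTypeMappings() {
        List<DFB__OpportunityPriceType__c> priceTypes = new List<DFB__OpportunityPriceType__c>();
        for (Integer i = 1; i <= 6; i++) {
            priceTypes.add(new DFB__OpportunityPriceType__c(
                Name = 'Price Type ' + i,
                // Unique, indexed external ID - but the same six values in every test class.
                PriceTypeID__c = 'PT-' + i
            ));
        }
        // The 6-row insert that appears between DML_BEGIN and DML_END in the logs above.
        insert priceTypes;
    }
}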

Step 8 - More steps in the Support dance

[Day 13 - 2018-05-21] Support responds:

I reviewed your case and ran the test classes to troubleshoot the issue further. I noticed the error in the server log "RunningForTooLongException: connection was canceled here".
The error is thrown because - for asynchronous Apex tests, any test method execution has a timeout limit. Once an Apex test run exceeds 6 minutes, an internal process "kills" it.

After reviewing the logs, it seems that this particular method is taking time to execute when all the tests classes are run asynchronously - CommonTestSetup.minimalSetupWithSettings();

Can you please try to refactor this method and see if it can be optimized?
I see there are many methods being called from this function.

Another frustrating response. They pointed out the method that gets called directly in the @testSetup method. I'd already identified the inner method it calls, and even the specific line in question. That it is a RunningForTooLongException is new information, but I'm no closer to knowing what is blocking those records from inserting.

[Day 15 - 2018-05-23] We try a GoToMeeting. I work through with Tier 2 support to ensure the external IDs that get assigned to DFB__OpportunityPriceType__c records in the test cases are unique per @testSetup execution. It didn't seem to help. Beyond that, I think support has a good grasp of the problem now.

[Day 20 - 2018-05-28] The support case has been escalated to Tier 3.

[Day 30 - 2018-06-07] Tier 3 responds:

- When 'Disable Parallel Apex Testing' checkbox is checked in Developer Console.
Result: Running Perfect

- When 'Disable Parallel Apex Testing' checkbox is unchecked in Developer Console.
Result: Exception is thrown

- In Serial Mode, it's working perfectly because it's running in Sync mode and no data contention issue.
However, in Parallel mode, multiple classes and trigger have been invoked in the same context/transaction where performing DML Operation and also making callouts and waiting for a response in Async mode. This may cause some data contention. Also since there is a lot of custom code involved it becomes very difficult for us to analyze the scenario.

- This is WAD as per Salesforce documentation. If Apex Class will take more than 6 minutes of time to complete the transaction, the connection will be closed by Salesforce DB.

- Test classes are run in serial mode during deployment. This error will not occur when you are trying to deploy your package.

Possible workarounds -

- Try to modify the code and remove that many dependencies.
- Else run in Serial Mode as a Workaround.

This seems distinctly like they've put it in the too-hard basket and are trying to avoid investigating the actual issue. I'll push back to see if I can get a better answer.

Also, I'm not sure what they mean by "making callouts and waiting for a response in Async mode". This is all happening in a test context; there are no callouts and nothing waiting on async responses. I'm not going to pursue that as I don't want to get distracted.

[Day 35 - 2018-06-12] Update from support.

My T3 has officially logged an investigation with our RnD team on this issue. My RnD team is currently reviewing the case.

[Day 36 - 2018-06-13] An update from R&D, as communicated via Tier 2.

I have received an update from the DB team on this case.

As per the DB team, it seems that the parallel sessions are working on common recordsets and causing the wait events.
Because of this waitime the DB team is high and eventually the classes are getting timed out.

May be this explains why the insertion to DFB__OpportunityPriceType__c is taking time. We cannot exactly comment on the common recordsets but DFB__OpportunityPriceType__c records is one of the potential candidate.

Solution -
Have the parallel sessions work on different recordset/dependencies to avoid this situation.

So it appears that all the test cases are trying to work with the same record sets and they are all blocking until they can get exclusive access to those records. They either stack up and run one after another or wait so long they time out. This explains why the parallel test performance is so similar to the serial performance.

I'm fairly certain the DFB__OpportunityPriceType__c records are unique between @testSetup executions, so it must be something else...

Step 9 - Time to regroup

[Day 42 - 2018-06-19] Much to my shame I've closed the case out. I need time to regroup and revisit this with a fresh approach. I'm certainly not giving up on it yet and have a few ideas to try:

  • Try adjusting the logging levels used when capturing the logs, particularly the Database log category. Also try capturing no logs at all.
  • Push the entire package's metadata into a Scratch Org and retry the tests. The dev org doesn't have much data loaded, but the scratch org can provide a completely empty environment to ensure it isn't a data siloing issue.
  • If it is reproducible in the scratch org, start hacking parts out until it becomes easier to isolate.
  • Remove the creation of the DFB__OpportunityPriceType__c records from the @testSetup. Only create them when required in the individual test methods (roughly sketched after this list).
  • Simplify the test cases where possible. The challenge is that almost all of them work with Opportunities and OpportunityLineItems at some level, which means a large overhead to construct all the dependent records as well.
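As a rough sketch of that per-test-method idea (class, method, and field values below are placeholders, not the real test code):

// Placeholder sketch: create the price type rows on demand in the test method that
// needs them, instead of every test class inserting the same rows via @testSetup.
@isTest
private class OpportunityPricingTest {

    @isTest
    static void appliesPriceTypeToOpportunity() {
        // Only this test pays for (and contends on) these rows. A random suffix keeps
        // the indexed external ID unique across concurrent test runs.
        DFB__OpportunityPriceType__c priceType = new DFB__OpportunityPriceType__c(
            Name = 'Standard',
            PriceTypeID__c = 'PT-' + Math.abs(Crypto.getRandomInteger())
        );
        insert priceType;

        // ... build the Opportunity and OpportunityLineItems, then assert as usual.
        System.assertNotEquals(null, priceType.Id);
    }
}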

Footnotes

  1. The bitter end may or may not occur within the timeframe of this blog post.