Wednesday, June 27, 2018

Speeding up Salesforce unit testing performance

For many years I've had this thorn in my side with Apex test cases. It goes by the name of "Disable Parallel Apex Testing" and since Spring `13 (v27.0) it has needed to be enabled constantly else I'd get a UNABLE_TO_LOCK_ROW error due to the custom hierarchy settings that get updated in the test cases.

Leap forward to Summer/Winter `18 (v42.0/v43.0) and I'm still tangling with this. That's five years of waiting for all the test classes to run sequentially one after another. That's about the right amount of procrastination to finally fix this problem! So hold on, were going to raise a support case and keep at this until the bitter end1.

What follows is a true story. The namespaces and class names have been changed to protect the innocent, but the snippets of messages with support are real.

Step 1 - Turning off "Disable Parallel Apex Testing"

This was simple enough. Uncheck a checkbox, run all the tests, and... Uh oh.

Could not run tests on class 01p400000000001 because: connection was cancelled here

With that setting checked all my tests were passing. Now some of my previously passing test classes are falling over when run in parallel. Worse still, the specific test classes to fail were intermittent. It wasn't just one specific class causing a problem. Any one of a dozen or so classes could fail in this manner and they would change from run to run.

Step 2 - Winter `18 @isTest isParallel annotation to the rescue

Winter 18 added the new @isTest(isParallel=true) annotation to:

indicate test classes that can run in parallel and aren’t restricted by the default limits on the number of concurrent tests.

That's great, except I don't want to modify 90% of my test classes to deal with the 10% that are having issues. No, I'll just use @isTest(isParallel=false) and explicitly exclude the problem cases... except that doesn't work. At least not at the time of writing. Please vote for the idea Parallel Tests Option (isParallel) on the @IsTest Annotation to exclude tests to make this a viable approach.

Step 3 - Once more unto the support case, dear friends, once more

[Day 1 - 2018-05-09] Desperate times. Let's raise a support case to see if they can isolate the underlying issue.

[Day 6 - 2018-05-14] After a bit of back and forth with tier 2 to establish how to reproduce the problem (press the run all test cases in the org) the initial advice back was:

The root cause of this issue?
Answer: - Parallel test execution is the root cause of this issue.

Answer: - Disable Parallel test execution.

Question: - Why do you need to Disable Parallel Apex Testing?
Answer: - As per the salesforce document, Tests that are started from the Salesforce user interface (including the Developer Console) run in parallel. Parallel test execution can speed up test run time. Sometimes, parallel test execution results in data contention issues, and you can turn off parallel execution in those cases.

Huh. Well, yes I knew that already (as per the case description when I raised it). That would work, but it's been five years of this situation, so lets push a bit harder. SHIBBOLEET!

Step 4 - How bad is the problem empirically?

Before responding with a kiwi yeah-nah to support I timed the total test run time for both parallel and synchronous test execution:

As it stood, the parallel test execution was marginally faster as long as I reran any initially failed test cases immediately after the first run completed. That was an interesting result, and I think I can explain the similar timing between the two later on.

The general consensus on twitter was that updates to custom hierarchy settings were probably to blame for the contention and subsequent timeouts.

Step 5 - Isolating Custom Hierarchy Settings using the Stubbing API

I use a number of hierarchy custom settings to toggle various functions in the app. All interactions with those settings from Apex are done via a single class. This allows for sensible defaults etc...

Those with particular feelings on how mock testing should be performed or a sensitive testing disposition may want to look away now...

To prevent any potentially blocking DML operations on the custom hierarchy settings I've injected a StubProvider when in a testing context. The StubProvider prevents any DML occuring when altering the settings in Apex tests. This isn't the typical usage for a test mocking framework, but it serves my needs here to help avoid database locking issues.

Here is a shortened version of how it looks:

And after all that the result on the asynchronous test execution was..... not much really. The tests were marginally faster, but the underlying problem with the test classes having the connection closed remains. At least I can rule out DML on the custom settings as being the problem.

Step 6 - Back to the drawing board

Blaming the problem on the custom hierarchy settings made sense from a historical perspective, but it doesn't appear to be the source of my challenges.

I went back and had a closer look at the debug logs from the tests in the run. That's when I saw it:

"What?", you may ask, "Am I currently looking at?"

That colorful image is a selection of debug log timelines from various parallel test runs. Ignoring the majority of the markings, the important thing is a purple line representing a DML operation that is taking more and more of each transaction. Right up until the point that Salesforce starts terminating it. If we drill into one of those logs we can see it is a DML operation to insert 6 records taking 3 minutes 52 seconds (for 6 records!):

14:14:15.0 (325578801)|DML_BEGIN|[992]|Op:Insert|Type:DFB__OpportunityPriceType__c|Rows:6
14:18:07.376 (232376043557)|DML_END|[992]

Step 7 - Returning to the support case with proof

[Day 9 - 2018-05-17] If that isn't a smoking gun then I don't know what is. Let's take this new evidence back to the support case.

Me to support:

I've just noticed something else odd in the debug logs for the running tests. Inserts to DFB__OpportunityPriceType__c for 6 rows are taking an excessive amount of time.

I'm seeing times for the CommonTestSetup.createPriceTypeMappings() method of between 20 and 45 seconds to complete.
Attached Log1.txt - Look for the lines:
16:18:58.0 (150146460)|DML_BEGIN|[941]|Op:Insert|Type:DFB__OpportunityPriceType__c|Rows:6
16:19:19.214 (21214892238)|DML_END|[941]

It isn't clear to me why the inserts for those records are taking so long. The fields on it are [reasonably] basic and it is only 6 rows.
The only thing I can think of is that all the test classes are trying to use the same records and they all have the same indexed values in the PriceTypeID__c field.

A bit more context as you dear reader can't see into the org like support can.

I have a custom object DFB__OpportunityPriceType__c with a unique external id field. That field is indexed. Just about every single test class I have requires data in this custom object as it contains configuration that gets linked to from Opportunity records. I thought I was being clever and used a @TestSetup method to insert 6 of these records once for all the other test methods to use. There are no triggers, workflow outbound messages or other automatons hanging off this custom object. None of the test classes use SeeAllData=true.

Step 8 - More steps in the Support dance

[Day 13 - 2018-05-21] Support responds:

I reviewed your case and ran the test classes to troubleshoot the issue further. I noticed the error in the server log "RunningForTooLongException: connection was canceled here".
The error is thrown because - for asynchronous Apex tests, any test method execution has a timeout limit. Once an Apex test run exceeds 6 minutes, an internal process "kills" it.

After reviewing the logs, it seems that this particular method is taking time to execute when all the tests classes are run asynchronously - CommonTestSetup.minimalSetupWithSettings();

Can you please try to refactor this method and see if it can be optimized?
I see there are many methods being called from this function.

Another frustrating response. They pointed out the method that gets called directly in the @testSetup method. I'd already identified the inner method from that one and even the line in question. That it is a RunningForTooLongException is new information, but I'm no closer to what is blocking those records from inserting.

[Day 15 - 2018-05-23] We try a GoToMeeting. I work through with Tier 2 support in ensuring the external ID's that get assigned to DFB__OpportunityPriceType__c records in the test cases are unique per @testSetup. It didn't seem to help. Beyond that, I think support has a good grasp of the problem now.

[Day 20 - 2018-05-28] The support case has been escalated to Tier 3.

[Day 30 - 2018-06-07] Tier 3 responds:

- When 'Disable Parallel Apex Testing' checkbox is checked in Developer Console.
Result: Running Perfect

- When 'Disable Parallel Apex Testing' checkbox is unchecked in Developer Console.
Result: Exception is thrown

- In Serial Mode, it's working perfectly because it's running in Sync mode and no data contention issue.
However, in Parallel mode, multiple classes and trigger have been invoked in the same context/transaction where performing DML Operation and also making callouts and waiting for a response in Async mode. This may cause some data contention. Also since there is a lot of custom code involved it becomes very difficult for us to analyze the scenario.

- This is WAD as per Salesforce documentation. If Apex Class will take more than 6 minutes of time to complete the transaction, the connection will be closed by Salesforce DB.

- Test classes are run in serial mode during deployment. This error will not occur when you are trying to deploy your package.

Possible workarounds -

- Try to modify the code and remove that many dependencies.
- Else run in Serial Mode as a Workaround.

This seems distinctly like they put it in the too hard basket and are trying to avoid investigating the actual issue. I'll push back to see if I can get a better answer.

Also, I'm not sure what they mean by "making callouts and waiting for a response in Async mode". This is all happening in a test context. There are no callouts or waiting for async responses. I'm not going to pursue that as I don't want to get distracted.

[Day 35 - 2018-06-12] Update from support.

My T3 has officially logged an investigation with our RnD team on this issue. My RnD team is currently reviewing the case.

[Day 36 - 2018-06-13] An update from R&D, as communicated via Tier 2.

I have received an update from the DB team on this case.

As per the DB team, it seems that the parallel sessions are working on common recordsets and causing the wait events.
Because of this waitime the DB team is high and eventually the classes are getting timed out.

May be this explains why the insertion to DFB__OpportunityPriceType__c is taking time. We cannot exactly comment on the common recordsets but DFB__OpportunityPriceType__c records is one of the potential candidate.

Solution -
Have the parallel sessions work on different recordset/dependencies to avoid this situation.

So it appears that all the test cases are trying to work with the same record sets and they are all blocking until they can get exclusive access to those records. They either stack up and run one after another or wait so long they time out. This explains why the parallel test performance is so similar to the serial performance.

I'm fairly certain the DFB__OpportunityPriceType__c records are unique between @testSteup executions, so it must be something else...

Step 9 - Time to regroup

[Day 42 - 2018-06-19] Much to my shame I've closed the case out. I need time to regroup and revisit this from a new approach. I'm certainly not giving up on it yet and have a few ideas to try:

  • Try adjusting the Logging levels used when capturing the logs. Particularly around the Database logging levels. Also try capturing no logs at all.
  • Push the entire packages metadata into a Scratch Org and and retry the tests. The dev org doesn't have much data loaded, but the scratch org can provide a completely empty environment to ensure it isn't a data siloing issue.
  • If it is reproducable in the scratch org, start hacking parts out until it becomes easier to reproduce.
  • Remove the creation of the DFB__OpportunityPriceType__c records from the @testSetup. Only create them when required in the individual test methods.
  • Simplify the test cases where possible. The challenge is almost all of them work with Opportunities and OpportunityLineItems at some level. Which means a large overhead to construct all the dependent records as well.


  1. The bitter end may or may not occur within the timeframe of this blog post.


  1. What is the tool used to visualize debug logs?

    1. It's part of the FuseIT SFDC Explorer. In particular the Apex Logs tab.
      Most recent post I made about it was

  2. Thanks for your inputs it has been really helpful. I was able to get a clear understanding of does abd donts. However i seem to stuck at the exact same problem. I am certain records i am inserting have unique external ids. Not sure why data cotention is still happening. Just wondering if you were able to identify a solution ??

    1. Nothing definitive as yet. I've been thinking about trying Test.loadData() as another way of loading in the contentious records.

  3. Thanks for you post. We are experiencing the same issues as you described here and also after a long back and forth with SF support (tier 2 and 3) we feel like they avoiding dealing with the real issue and they rather to send us back to "rewrite the tests based on their best practice".

    I have an open question that I could not understand why does it happen so maybe you can help me.

    We have around ~1200 test methods, a total deployment takes us 12 hours!
    After every deployment Salesforce provides me a list of the top 5 tests (the ones that run the longer. For example
    Test_class1 test_method_1 - 4 minutes and 57 seconds.

    However, when I take the exact same test class and run it from the "Test Execution" page with checking the checkbox of "Disable Parallel Apex Testing", the specific method runs for 15 seconds!

    What am I missing here?


    1. Can you get the debug log from when the test method ran? Or try it with parallel testing enabled. I'm finding all my slow tests are a result of record locking between the tests when run in parallel. Although that doesn't explain why it would be slow during a deployment where all the tests run synchronously. There may be other locking issues at play.

  4. Hi Daniel, thank you for sharing with us your case and for the detailed post.

    My team and I experience the same exact issues and unfortunately, we also feel like Salesforce support does not try to solve the real issue but rather to send us back to "rewrite the tests based on their best practice". We had a long conversation with Salesforce team (tier 2 and tier 3) and we didn't get any satisfying answer.

    I have an open question that I could not understand the answer to it and would love to ask you, maybe you can help me.
    We have around ~1200 test methods in our org and a deployment takes us ~12 hours. After every deployment we get from Salesforce a list of the top 5 test methods (the ones that their run time is the longest).

    For example, after the last run we got:
    Test_class1, test_method_1 - 4 minutes and 57 seconds.

    When I try to run the same exact test method in the same environment from the "Apex Test Execution" page with the checkbox "Disable Parallel Apex Testing" checked, it takes just 15 seconds for this method and a minute for the entire class 5-6 methods.

    Why the difference is so big? what am I missing here?


    1. This sounds very much like the issue I'm having where running the tests in parallel causes contention that slows all the running tests down.

      Although you mentioned deployment to another org. In that case it may be a difference between the orgs. Especially if SeeAllData=true is involved.

  5. Great post, really interesting to see how the SF-support works (or doesn't work)

    What tool are you using for the logfile time visualization? It looks really usefull :-)

    We also have an UNABLE_TO_LOCK_ROW error in our tests, even though they run somewhat quicker (whole suite runs in 2:40). About 3 are on a hierarchical object, but 2 are also on a really simple object that is highly contended (but no test has SeeAllData=true, so they should be isolated and it shouldn't matter?). The object is an outbound queue that is used to record most object changes, usually via triggers (for then sending it to Kafka in a production setup). The triggers only append to the queue object, so no idea what the problem could be.

    If you ever fix your problem I would really be interested in it. Our experience with Salesforce so far is that we make much quicker progress with some workarounds, instead of waiting for fixes from SF.


    1. I'm using the FuseIT SFDC Explorer for the log visualizations. -

      It does sound like you are seeing contention on the common between the tests when run in parallel. In theory the data should be fully isolated. However, that doesn't appear to be the case in practice.

  6. A few years back, I tried to make things so that every test created records with different unique keys and that did reduce the number of failures in my case but unfortunately it did not fix the issue entirely. I was guessing that data was not entirely isolated and that maybe ensuring unique keys across the board would help prevent locking because there could not be any problems due to the system wrestling with duplicates (even though the error was always that UNABLE_TO_LOCK_ROW, not anything to do with dupes). Decided against dedicating any more chunks of life to solving it but it's been entertaining to read your account - thanks for sharing and well done perservering for so long!

  7. Great read. Keep up the good fight!

  8. This was a great read Daniel, April 2020 and nothing has changed. In our case we have just come to rely a lot more on apex mocking to allows us to run these test classes in an efficient manner as part of our CI process. As for the random connection errors -, we query against the apexTestResult objects to look for errors that are NOT connection issues, we have the provision to auto-run all the connection related test classes by querying ApexTestQueueItem, but so far those connection errors have been false positives.
    If you do find anything new in this space, do let me know, thanks!!

  9. I think the main cause is that @testsetup is used to create records which are somehow shared across parallel transactions. What if we create separate records (as old days) and try. Will this work... Any thoughts ?

  10. Out of curiousity, did your test class create and INSERT a User record in an @testSetup method? I'm experiencing similar issues with timeouts or UNABLE_TO_LOCK_ROW errors in some test classes. If I move the creation of my test User records out of the @testSetup to be in my test methods (with or without INSERT) then using the User instance in my runAs(), the timesouts go away and the row lock issues go away