Best practice for troubleshooting 100% CPU on EPiServer site?

Hi!

We're currently experiencing intermittent CPU spikes on one of our EPiServer sites.

Can anyone share some best practices when troubleshooting these problems?

I've tried LogParserStudio for IIS logs and DebugDiag for memory dumps, but with no luck.

Is WinDbg the way to go or some 3rd party application?

#203525
Apr 26, 2019 8:54

WinDbg is still the gold standard for analysing such issues. Remember you can always reach out to the developer support service for further assistance.

#203526
Apr 26, 2019 10:00

Nothing has beaten WinDbg yet. You can also try opening the process dump in VS (2019 actually has some nice updates: you can mix and match process architectures and open 64-bit dumps as well). Look at the threads, their running time and the CLR stack. That would be a perfect starting position :)

#203529
Apr 26, 2019 11:06

Agree with Valdis, WinDbg is the greatest (if you learn how to use it..).
But I would say that VS2019 is a great alternative and easier to use.

#203575
Apr 29, 2019 9:25

If you are able to reproduce the bad performance locally you can use the performance profiler tool in VS 2017. That has found plenty of performance issues for me.  

If it occurs in production, I usually start by analyzing the IIS logs to see if any particular page / service has an absurd mean execution time. I use Log Parser Studio for this. There is a pretty good query called top 25 slowest URLs that can get you started in the right direction. Also check the scheduled jobs and their execution times in Episerver admin.
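If Log Parser Studio isn't at hand, the same idea can be roughly sketched in a few lines of Python. This is only a sketch, not the Log Parser Studio query itself: it assumes the logs are in W3C extended format with the `cs-uri-stem` and `time-taken` fields enabled, and the function name `slowest_urls` and the sample lines are made up for illustration.

```python
from collections import defaultdict


def slowest_urls(log_lines, top=25):
    """Return (url, mean_ms, hits) tuples, slowest mean time-taken first."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # url -> [total ms, hit count]
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns of the entries that follow it.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        if "cs-uri-stem" not in row or "time-taken" not in row:
            continue  # assumption: both fields are enabled in IIS logging
        try:
            elapsed_ms = int(row["time-taken"])
        except ValueError:
            continue  # skip malformed entries
        entry = totals[row["cs-uri-stem"]]
        entry[0] += elapsed_ms
        entry[1] += 1
    ranked = [(url, total / hits, hits) for url, (total, hits) in totals.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Made-up sample entries, only to show the expected input shape.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2019-04-26 08:54:01 GET /start 200 120",
    "2019-04-26 08:54:02 GET /search 200 4000",
    "2019-04-26 08:54:03 GET /search 200 6000",
]
print(slowest_urls(sample, top=2))
```

For a real log you would pass an open file object instead of the sample list. A high mean with only a handful of hits can be a one-off, so look at the hit count as well.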

If all else fails...

WinDbg 

#203582
Apr 29, 2019 11:50

I am obviously biased, but more often than not I find it easier to use WinDbg to find the root cause of a 100% CPU problem. Knowing a few basic commands like !threadpool, !runaway or !clrstack will help you narrow down, if not pinpoint, where the problem is.

#203583
Apr 29, 2019 11:58

Thanks everyone for sharing thoughts and ideas!

Should I use an "ordinary" process dump or a crash dump for troubleshooting?

Right now we have an "ordinary" process dump from when the process has exceeded 80% CPU for 20 seconds (DebugDiag).
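To make the trigger condition concrete, here is a minimal Python sketch of the rule as I understand it. It is not how DebugDiag implements it; the function name and the (timestamp, CPU percent) sample format are assumptions for illustration only.

```python
def sustained_over_threshold(samples, threshold=80.0, duration=20.0):
    """Return the timestamp at which CPU has stayed above `threshold`
    for at least `duration` seconds, or None if that never happens.

    `samples` is an iterable of (timestamp_seconds, cpu_percent) in time order.
    """
    started = None  # timestamp of the first sample in the current high-CPU run
    for timestamp, cpu in samples:
        if cpu > threshold:
            if started is None:
                started = timestamp
            if timestamp - started >= duration:
                return timestamp
        else:
            started = None  # the run was interrupted, start over
    return None
```

The dump is taken at the returned moment, so it should show the threads while they are still busy, which is what matters for a CPU investigation.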

#203623
Edited, Apr 30, 2019 10:02

Every dump is valuable (most of the time). Don't forget to also bring mscordacwks.dll from the server (WinDbg needs it to match the runtime info properly).

#203624
Apr 30, 2019 10:07

It is valuable to set up load tests and hopefully get a controlled way to reproduce the problem. That also makes it a lot easier to get a well-timed memory dump.

For local profiling dotTrace is still my pick. Can be easily combined with local load test if a single request doesn't reveal things.

#203649
Apr 30, 2019 21:50

Like Johan, I usually do several local profiling sessions (using dotTrace and sometimes dotMemory) to measure, improve and repeat. I usually do this for the app startup, for isolated flows or pages, and for complicated logic. The idea is that if the code is improved for the load of one user, it should definitely also perform better for many users.

To perform simple local "load tests" of single URLs, I use the Bombardier HTTP benchmark tool.
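For anyone without Bombardier installed, the same kind of single-URL test can be approximated with the Python standard library. This is a rough sketch and not a replacement for Bombardier: the helper names (`http_get`, `timed_call`, `load_test`) are made up, and a failed request simply raises instead of being counted.

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url):
    # Read the whole body so the timing covers the full response.
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()


def timed_call(fetch, url):
    # Return the wall-clock seconds one call to fetch(url) took.
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


def load_test(url, requests=100, concurrency=10, fetch=http_get):
    """Send `requests` GETs to `url` using `concurrency` worker threads
    and return simple latency statistics in seconds."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_call(fetch, url), range(requests)))
    latencies.sort()
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": latencies[-1],
    }
```

Something like `load_test("http://localhost:8080/", requests=200, concurrency=20)` (the URL is just a placeholder for a local site) gives a baseline to compare against after each round of improvements.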

But I also look at Application Insights after doing load tests on load-balanced servers. Then I try to replicate the findings locally for yet another round of measure, improve and repeat.

For production issues, I would first take a look at the Application Insights (for sites that use it). Mainly to determine pages or dependencies to investigate further.

#203655
May 01, 2019 9:17

You are all lucky when you can replicate the issue locally :)

#203667
May 02, 2019 8:11

Yeah, I always feel so lucky. And have great bug hunters on my team.

#203668
May 02, 2019 8:42

I guess the conclusion is to get more information in the form of either logs or crash dumps and focus on analyzing those.
Avoid starting by reading code and guessing what the problem might be, or applying random fixes.

There is another psychological advantage to this as well. If you can measure the problem first, ask for time, then fix it and measure it again, it is much easier to explain to the end customer why they should pay you.

If you can state that it was feature X that was the problem, and that you have now improved performance by 50%, which resolved the issue, you are less likely to have a bouncing invoice at the end of the day. Coding is fun, but getting paid for coding is even more fun!

#203670
May 02, 2019 9:41

Thank you once again!

I decided to upload our memory dump to the EPiServer Developer support. They know more about this than I :-)

//Markus

#203773
May 07, 2019 10:13

If you give the ticket number I will try to look into it if time permits :)

#203827
May 08, 2019 8:12

I did look into it and provided my thoughts to support. Apparently you received it. So let's try that way to see how it goes.

#203843
May 08, 2019 16:01

I received it, yes.

Thank you for taking your time looking into the problem! It's greatly appreciated.

#203844
May 08, 2019 16:02

Awesome. If the case permits, you could follow up here and share what's shareable: what was wrong, any gotchas and things others should keep in mind..

thx!

#204095
May 20, 2019 0:07