The mysterious case of the 1 minute pauses
I spend a lot of time debugging issues on our service - we have a big team and periodically we hit some really weird issues that get escalated to me. In this particular case, we deployed our new sprint, full of exciting features to our dogfood server, and after a day or two, people started complaining of sluggishness.
A quick check of the dashboard showed our availability was slowly getting worse:
Our DRIs had been tracking this issue but couldn't get a usable PerfView log - so I got called in to help out - and I saw the same behavior - i.e. the ETL traces that we auto-collect were not helpful - mostly because they either didn't show high CPU or because the symbols were messed up - the dreaded ?!? in PerfView (that happens sometimes with dynamically generated code, like XML Serializers or compiled regex) but this was worse than usual.
So I hopped onto the machines and captured a more detail log of requests/sec and queued requests:
And lo and behold, we have a one minute gap where nothing happens, followed by a massive increase in queued requests, then everything goes back to normal. I was pretty sure I was onto something so I figured I had to get a trace during that exact minute. So I fired up PerfView and waited - once I saw requests/sec plummet to zero, I clicked "Collect" and this is what I saw in the CPU Stacks view - just one thread doing work (and lot of it - basically burning an entire core):
That was still puzzling - we have one thread getting called by an ETW callback and doing rundown (rundown is basically the process that tells the CLR to dump type information in the ETW stream so that profilers can make sense of things like dynamic types/assemblies) and absolutely nothing else going on... I decided a dump might be helpful, so I waited again and collected a reflected process dump with ProcDump.
I opened it up and saw tons of these types of threads:
ntdll!ZwWaitForSingleObject+0x14KERNELBASE!WaitForSingleObjectEx+0x8fclr!CLREventWaitHelper2+0x3cclr!CLREventWaitHelper+0x1fclr!CLREventBase::WaitEx+0x7cclr!CLREventBase::Wait+0x15clr!SVR::GCHeap::WaitUntilGCComplete+0x2fclr!Thread::RareDisablePreemptiveGC+0x180clr!Thread::DisablePreemptiveGC+0x17bafcclr!GCHolderBase::EnterInternalCoop+0x17bb0cclr!GCCoop::GCCoop+0x3dclr!Thread::OnThreadTerminate+0x7eclr!DestroyThread+0xd2clr!ThreadpoolMgr::WorkerThreadStart+0x1ba4d0clr!Thread::intermediateThreadProc+0x86kernel32!BaseThreadInitThunk+0x14ntdll!RtlUserThreadStart+0x21
Except for one thread whose stack didn't fully resolve but did match the thread id of my PerfView above (i.e the rundown thread):
clr!ExecutionManager::GetRangeSection+0x59clr!ExecutionManager::FindCodeRangeWithLock+0xa70x00007ff9`58fe28b00x00000253`eecb0848clr!ClrFreeInProcessHeap+0x17clr!operator delete[]+0x20clr!SBuffer::DeleteBuffer+0x24clr!SBuffer::{dtor}+0xffffffff`ffeefbbbclr!SString::~SString+0x3e
So here's what we know so far:
- All threads are stuck waiting for a GC
- One thread is burning an entire core doing rundown, enumerating all of the types in the process.
The questions that remain are:
- Why is the rundown taking so long?
- Why is it holding up the GC threads?
- Why is there a rundown at all?
Since I have a process dump, I am going to look at the types in there:
0:092> !DumpDomain--------------------------------------normal output followed by thousands of these guys:Assembly: 0000025879e43270 (Dynamic) []ClassLoader: 000002587c924da0SecurityDescriptor: 0000025879cd5220Module Name00007ff939994ea8 Dynamic ModuleAssembly: 0000025879e44230 (Dynamic) []ClassLoader: 000002587c924e50SecurityDescriptor: 0000025879cd5d60Module Name00007ff939995b08 Dynamic Module
So that confirms the dynamic types going crazy theory - now we just have to figure out what they are and what is causing the problem - so I pick a couple of random modules and dump them (I also notice there's a cool -MT option to !DumpModule)
0:092> !DumpModule -MT 00007ff93bc7f7a0Name: Unknown ModuleAttributes: Reflection Supports<snip>Types defined in this moduleMT TypeDef Name------------------------------------------------------------------------------00007ff93bc7fed0 0x02000002 <Unloaded Type>00007ff93bcc0068 0x02000003 <Unloaded Type>00007ff93bcc0150 0x02000004 <Unloaded Type>00007ff93bcc0260 0x02000005 <Unloaded Type>00007ff93bcc03b0 0x02000006 <Unloaded Type>
Then I dump the types themselves and lo and behold, it's a bunch of XML Serializers:
0:092> !DumpMT -MD 00007ff93bcc03b0EEClass: 00007ff93bcbdac8Module: 00007ff93bc7f7a0Name: <Unloaded Type>mdToken: 0000000002000006File: Unknown ModuleBaseSize: 0x28ComponentSize: 0x0Slots in VTable: 12Number of IFaces in IFaceMap: 0--------------------------------------MethodDesc TableEntry MethodDesc JIT Name00007ff987692c20 00007ff9871f7de0 PreJIT System.Object.ToString()00007ff987696f40 00007ff9871f7de8 PreJIT System.Object.Equals(System.Object)00007ff9876960a0 00007ff9871f7e10 PreJIT System.Object.GetHashCode()00007ff9876bb7b0 00007ff9871f7e28 PreJIT System.Object.Finalize()00007ff93b0f1e58 00007ff93bcc0358 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_Reader()00007ff93b0f2280 00007ff93bcc0360 JIT Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_Writer()00007ff93b0f1e68 00007ff93bcc0368 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_ReadMethods()00007ff93b0f22c0 00007ff93bcc0370 JIT Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_WriteMethods()00007ff93b0f1e78 00007ff93bcc0378 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_TypedSerializers()00007ff93b0f1e80 00007ff93bcc0380 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.CanSerialize(System.Type)00007ff93b0f1e88 00007ff93bcc0388 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.GetSerializer(System.Type)00007ff93b0f2260 00007ff93bcc0390 JIT Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract..ctor()
But wait, those serializers are supposed to be cached right? Well, yes and no, in this case, we're using the constructor that does not use the built-in cache so a new serializer is created on every call to Serialize which means that we're generating a new assembly every couple of seconds...
But why does the CLR rundown block the GC? Well, that was a bug and that was fixed last fall! We just hadn't picked up the new build yet.
But wait, why are we even doing rundown? That's because we're running a super sweet background profiler which wakes up every so often and collects some data that gets aggregated later.



Commentaires
Enregistrer un commentaire