Views: 1687
Number of votes: 7
Average rating:

Two bugs in ASPNET that break Cache memory management

TL;DR: A bug in ASPNET memory monitoring prevents the cache monitoring / scavenging tread from running if the memory usage in the host at the time of application startup happens to be in a specific range in relation to the configured cache thresholds. Another bug causes the same behavior if no cache threshold is explicitly configured (falling back to default). This can further cause problems in Azure Web Apps because it triggers the Proactive Auto Heal feature, repeatedly recycling the application for no reason. The workaround is to set the privateBytesPollTime option to less than 30 seconds and also provide a value for percentagePhysicalMemoryUsedLimit (90 % or less to not conflict with Proactive Auto Heal).

Intro and previous investigations

In my previous blog post I started to investigate a problem with the ASPNET HTTP runtime cache and Azure Proactive Auto Heal and the buggy behavior of the configured memory usage threshold. I found an apparent correlation between certain settings for percentagePhysicalMemoryUsedLimit and the application failing to keep the cache memory usage under control, causing Proactive Auto Heal to recycle the application. But I also acknowledged that the behavior was intermittent, and I now know why.

A quick overview of ASPNET memory management internals

ASPNET (this refers to .net Framework 4.7.1) uses the AspNetMemoryMonitor class to monitor the application's memory usage. Internally it uses the LowPhysicalMemoryMonitor class to periodically compare the current memory pressure to the configured (or default) thresholds, and trigger cache scavenging/trimming if needed. Or, at least it is supposed to do this...

Bug 1: Periodic checks fail to start under certain conditions

For the periodic memory checks, LowPhysicalMemoryMonitor uses a System.Threading.Timer instance that is initialized like this in its constructor:

this._timer = new Timer(new TimerCallback(this.MonitorThread), (object) null, -1, this._currentPollInterval);

Note the third argument, -1, which means that the timer is initialized with a due date (next time it will trigger the callback) "infinitely" far off in the future. So essentially the timer is disabled from the start.

The framework will then call the LowPhysicalMemoryMonitor.Start method which calls AdjustTimer(false). AdjustTimer is actually called within the timer callback, to adjust the time for the next callback according to the current memory pressure. The callback will run more often if the memory pressure is high. So AdjustTimer is designed to do more than just start the timer for the Start method. It looks like this (reflected code):

internal void AdjustTimer(bool disable = false)
{
    lock (this._timerLock)
    {
        if (this._timer == null)
            return;
        if (disable)
            this._timer.Change(-1, -1);
        else if (this.PressureLast >= this.PressureHigh)
        {
            if (this._currentPollInterval <= 5000)
                return;
            this._currentPollInterval = 5000;
            this._timer.Change(this._currentPollInterval, this._currentPollInterval);
        }
        else if (this.PressureLast > this.PressureLow / 2)
        {
            int num = Math.Min(LowPhysicalMemoryMonitor.s_pollInterval, 30000);
            if (this._currentPollInterval == num)
                return;
            this._currentPollInterval = num;
            this._timer.Change(this._currentPollInterval, this._currentPollInterval);
        }
        else
        {
            if (this._currentPollInterval == CacheSizeMonitor.PollInterval)
                return;
            this._currentPollInterval = CacheSizeMonitor.PollInterval;
            this._timer.Change(this._currentPollInterval, this._currentPollInterval);
        }
    }
}

PressureLast is the last observed memory pressure (% physical memory used).
PressureHigh is the configured (or default) percentagePhysicalMemoryUsedLimit.
PressureLow is a value calculated from PressureHigh, lower than PressureHigh (generally it is PressureHigh - 9).
s_pollInterval is a default timer interval that is initialized to the value of the privateBytesPollTime setting, converted to milliseconds. The default for privateBytesPollTime is 2 minutes.
_currentPollInterval is the timer interval to use, which is also set to the Timer using the Change method. This field is initialized to 30 seconds.

Now, assume AdjustTimer is called by Start to start the timer. The fields will have the values described above, and most importantly, PressureLast is some value it has actually measured. So it can be anything. It can fulfill PressureHigh > PressureLast > PressureLow / 2, going into the second else if. That will get the minimum of 30000 and s_pollInterval, which by default is 120000, so 30000 is used. Then it compares 30000 to _currentPollInterval and does an early exit if they are the same, which they are because that is what _currentPollInterval was initialized with! This means the timer is never activated, AdjustTimer is never called again, and the LowPhysicalMemoryMonitor and the cache scavenging handled by it will remain inactive for the lifetime of the application.

How does this explain the apparent correlation with percentagePhysicalMemoryUsedLimit I described in the previous blog post? Because values in the range identified in that post are likely to arrange PressureLow and PressureHigh in such a way that a typical PressureLast value at application startup would match the scenario described above. At low values for percentagePhysicalMemoryUsedLimit it is much more likely to end up in the PressureLast >= PressureHigh scenario, and at high values it is more likely to end up in the PressureLast <= PressureLow / 2 scenario.

Bug 2: Periodic checks are stopped if default memory threshold is used

Let's take a look at LowPhysicalMemoryMonitor.ReadConfig, which is called from the constructor:

private void ReadConfig(CacheSection cacheSection)
{
  int val2 = cacheSection != null ? cacheSection.PercentagePhysicalMemoryUsedLimit : 0;
  if (val2 == 0)
    return;
  this._pressureHigh = Math.Max(3, val2);
  this._pressureLow = Math.Max(1, this._pressureHigh - 9);
  LowPhysicalMemoryMonitor.s_pollInterval = (int) Math.Min(cacheSection.PrivateBytesPollTime.TotalMilliseconds, (double) int.MaxValue);
}

Note the first line. The default for percentagePhysicalMemoryUsedLimit if it is omitted from config is actually 0. So unless you set that setting, you will end up in the early exit in line 3. The most important implication of that is that the last line never runs, so s_pollInterval retains its default value of 0.

Now say that the Bug 1 was not triggered and the timer is now running, periodically calling into AdjustTimer. Then take look again at the PressureLast > PressureLow / 2 case. Because s_pollInterval is now 0, that will be used as the new interval set to the timer. The call chain from Timer.Change ends up in System.Threading.TimerQueue.UpdateTimer where the interval (period) is used like this:

timer.m_period = (period == 0) ? Timeout.UnsignedInfinite : period;

So passing 0 disables the timer. And again we are in a situation where the memory monitoring and cache scavenging will be off for the lifetime of the application.

Workaround

Bug 1 can be avoided by specifying a privateBytesPollTime setting of less than 30 seconds, forcing the first call to AdjustTimer to adjust the timer and thereby starting it, even in the PressureLast > PressureLow / 2 scenario.
Bug 2 can be avoided simply by specifying a setting for percentagePhysicalMemoryUsedLimit other than the default of 0.

    (By Mark Hall , 08 November 2017 19:09, Permanent link)

    Nice work!! Did you file bugs to Microsoft?

    (By Magnus Rahl , 08 November 2017 20:00, Permanent link)

    Thanks! I posted the issues here initially: https://github.com/Microsoft/dotnet/issues/552

    I think we will use our direct Microsoft contacts to "promote" these issues as well.

    (By valdis iljuconoks , 08 November 2017 22:33, Permanent link)

    nice catch

    (By Magnus Rahl , 08 December 2017 11:43, Permanent link)

    Update: Microsoft have confirmed these bugs will be fixed in .NET Framework 4.7.2. They will also discuss if anything can be done to make the default settings/behavior work better with Proactive Auto Heal.

    (By Tommi Raunio , 15 January 2018 09:05, Permanent link)

    Woah! Nice catch!

    Do you by any chance happen to know if
    1) This concerns other framework versions than 4.7.1?
    2) What happens if we run Episerver Commerce on virtual machines with framework 4.7.1, with memory based application recycling off? Does this bug cause performance problems in there?

    (By Magnus Rahl , 16 January 2018 08:36, Permanent link)

    @Tommi:

    Thanks!

    1) My guess is the bugs are present in 4.7.0 as well. A number of changes related to cache and memory monitoring are listed in the release notes for that version, so I'm guessing that is when the bugs were introduced.

    2) I'm not entirely sure. I think in most cases you will be fine. The reason for that is that in addition to the LowPhysicalMemoryMonitor that is impeded by these bugs, there is also a RecycleLimitMonitor which is still working and seems to trim the cache in such a way that the machine doesn't run out of memory completely. However, if your application is running at high memory pressure and suddenly has a burst of cache inserts (or other memory allocations), it could run out of memory before a cache trim and GC kicks in. If an applicaiton exhibits this behavior you would typically use the percentagePhysicalMemoryUsedLimit to maintain a larger margin. And you can still do that despite these bugs if you follow the workaround I present.

    (By Magnus Rahl , 17 April 2018 10:17, Permanent link)

    The bugs are fixed in .net framework 4.7.2 (currently in the early access program, general avaliability unknown), see bugs 531575, 531576 in the changelog here:

    https://github.com/Microsoft/dotnet-framework-early-access/blob/master/release-notes/build-3052/dotnet-build-3052-changes.md

    (By Magnus Rahl , 30 April 2018 21:10, Permanent link)

    .NET Framework 4.7.2, which includes fixes for these bugs, was released today. It will take some time before it is deployed to Azure Web Apps though. More info here: https://world.episerver.com/blogs/Magnus-Rahl/Dates/2018/4/asp-net-cache-memory-bug-fixes-released/

  Please login to post a comment