Published: Nov 16, 2019 by Matt Wood
TL;DR: Azure timer trigger functions can end up making their own schedule, regardless of your wishes, due to a possible bug in scale-out functionality.
Azure Functions (Functions-as-a-Service) are a great way to build event-driven microservices for many business applications. We’re using them with one of our clients to implement many of the transformation processes needed for a cloud-based data warehouse. This sort of architecture allows us to properly scale compute resources to handle the “bursty” data flows that are typical when consuming data from dozens of providers. For instance, we might receive data from 30 different feeds all at once or spread throughout the day. A traditional system with fixed resources would have a tough time slogging through all of those flows (and the associated calculations, etc.) if they arrived all at once, but in our case the system scales up to meet demand and back down again once the work completes, saving cost. We’ve been happy with the performance and maintainability of the system, along with the resilience and high availability (backups, uptime SLAs, etc.) you get by leveraging cloud resources.
Recently, we ran into an interesting issue with an Azure timer trigger function. We use these time-triggered functions (scheduled tasks) for both on-demand and scheduled data transformations like calculations and conversions. The issue was that the function should have fired only once per day, yet it fired multiple times, seemingly at random. A quick Google search turns up plenty of Stack Overflow posts attributing this sort of behavior to leaving “RunOnStartup” set to true, but that wasn’t the case here. Our timer trigger, which was supposed to fire only at 1 AM each day, also fired at many (>10) other times at apparently random intervals. This behavior was causing unnecessary cost and some data problems, so it needed to be fixed.
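For reference, a timer trigger scheduled for 1 AM daily looks roughly like the following in `function.json` (the binding `name` here is illustrative). The NCRONTAB schedule expression has six fields, with seconds first, and `runOnStartup` defaults to false, so the usual Stack Overflow fix didn’t apply to us:

```json
{
  "bindings": [
    {
      "name": "timer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 1 * * *",
      "runOnStartup": false
    }
  ]
}
```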
From my log analysis, it seemed like the function was firing whenever additional instances were brought online to scale up to meet the demand of inbound data flows. I dug into the available documentation to try to explain this. According to the docs, this sort of problem shouldn’t occur (sources in brackets are linked at the bottom of this article):
- The timer trigger is supposed to run only a single instance even when scaling out [1,3].
- The runtime is not supposed to retry after a failed invocation.
- The code (on GitHub) has new server instances check on startup for anything that is overdue, which includes any run that hasn’t yet completed successfully [2,3].
Therefore, it would seem our problem (multiple invocations of a timer trigger) is likely caused by long-running invocations that are still in progress while new server instances are brought online (due to scale out, which can spin up 15+ additional instances). When the new instances come up, they see that the function is overdue (since it hasn’t completed yet) and fire it again. The log entries support this: multiple log items with “Trigger Details: UnscheduledInvocationReason: IsPastDue”, each with a different host ID.
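The overdue check described above can be sketched as follows. This is a simplified model of the runtime’s behavior, not its actual code: a freshly started host compares the most recent scheduled occurrence against the last *completed* run, so an in-flight 1 AM invocation looks “past due” to every new instance.

```python
from datetime import datetime, timedelta

def last_scheduled_run(now, hour=1):
    """Most recent daily occurrence of the schedule (1 AM) at or before `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate > now:
        candidate -= timedelta(days=1)
    return candidate

def is_past_due(now, last_completed, hour=1):
    """A freshly started host sees the trigger as overdue if no run has
    completed since the last scheduled occurrence."""
    return last_completed < last_scheduled_run(now, hour)

# A long-running 1 AM invocation hasn't completed when a new instance starts
# at 1:30 AM, so the new instance considers the timer past due and re-fires it.
now = datetime(2019, 11, 16, 1, 30)
print(is_past_due(now, last_completed=datetime(2019, 11, 15, 1, 5)))   # True
print(is_past_due(now, last_completed=datetime(2019, 11, 16, 1, 10)))  # False
```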
However, this doesn’t explain why the singleton behavior fails: the blob-store lock should be held by the first instance running the function. Perhaps the lease couldn’t be renewed due to high CPU, or perhaps the host ID needs to be set explicitly for the lock.
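To illustrate why a missed lease renewal could break singleton behavior, here is a toy model of a lease-based lock (class and method names are mine, not the SDK’s): a second host is blocked only while the lease is current, so if the first host is too starved to renew in time, the “singleton” timer can fire twice.

```python
class BlobLease:
    """Toy model of the blob lease the Functions runtime uses for
    timer-trigger singleton behavior (names are illustrative)."""

    def __init__(self, duration=60):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, host_id, now):
        # Succeeds if the lease is free or the previous holder
        # failed to renew before it expired.
        if self.holder is None or now >= self.expires_at:
            self.holder = host_id
            self.expires_at = now + self.duration
            return True
        return False

    def renew(self, host_id, now):
        # Renewal only works while the lease is still current.
        if self.holder == host_id and now < self.expires_at:
            self.expires_at = now + self.duration
            return True
        return False

lease = BlobLease(duration=60)
assert lease.try_acquire("host-A", now=0)        # host A starts the 1 AM run
assert not lease.try_acquire("host-B", now=30)   # scaled-out host B is blocked
# If host A is too CPU-starved to renew before t=60, the lease lapses and
# host B can acquire it, producing a second invocation of the timer.
assert lease.try_acquire("host-B", now=90)
```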
While waiting to hear back from Microsoft, we worked around the issue by running the function on its own App Service Plan with no scale-out configured, so that’s a possible short-term solution. This prevents the system from spinning up additional instances, which is what triggers the faulty “overdue” logic.
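The workaround can be provisioned roughly like this with the Azure CLI (the plan, app, and resource-group names are placeholders; treat this as a sketch rather than a recipe):

```shell
# Create a dedicated (non-dynamic) App Service plan with a single worker.
az appservice plan create \
  --name timer-fn-plan \
  --resource-group my-rg \
  --sku B1 \
  --number-of-workers 1

# Move the timer function app onto that plan so Consumption-plan scale-out
# can no longer spin up extra instances that re-run the trigger.
az functionapp update \
  --name my-timer-fn \
  --resource-group my-rg \
  --plan timer-fn-plan
```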