The distributed cache was a great idea introduced with SharePoint 2013. It truly allows for a load-balanced, highly available SharePoint farm where a user no longer needs to retain server affinity. If I log in on web front end 1 and somehow get routed to web front end 2, no issues: SharePoint just pulls my login tokens, view state, and news feeds from this new cache shared across all the servers.
That said, so far maintaining it in a large enterprise environment has been a challenge. I have the opportunity (curse) of working in an environment of hundreds of SharePoint servers on premises. Our 2013 footprint is growing quickly; quickly enough that we can’t keep on top of, say, manually and gracefully shutting down our distributed cache services every time an OS patch window rolls around.
Oh, and what about those times when Windows just doesn’t want to play nice and a server blue screens, or worse, just hangs? At that point we’re left with a server with an empty cache, which every other server in the farm thinks should have all of our logon tokens!
SP: Access Denied!!!
Browser: but I have a valid FedAuth cookie still!
SP: Yeah well I have no idea what happened to your claims so I’m not sharing
SP: I’m just trying to be nice, but I just don’t think this site has been shared with you.
So where do we go from here? The ideal situation is that before a cache server is restarted, the service is gracefully stopped and then removed from the cache cluster.
*thank the MS Gods for autocomplete…
And after reboot, re-added.
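For reference, a minimal sketch of that sequence using the stock SharePoint 2013 cmdlets. It assumes an elevated SharePoint Management Shell on the cache host itself, running as an account with farm administrator rights; all three cmdlets act on the local server only.

```powershell
# Run locally on the cache host, before the reboot.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hand cached items off to the other cache hosts, then stop the service.
Stop-SPDistributedCacheServiceInstance -Graceful

# Unregister this server from the cache cluster.
Remove-SPDistributedCacheServiceInstance

# ...reboot...

# Run locally again, in a new session, once the server is back up
# (the snap-in has to be loaded again first).
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
Add-SPDistributedCacheServiceInstance
```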
But can we fire that automatically for all reboot situations? Well, no, not easily.
1) Any shutdown tasks will not save you in the case of a blue screen, hardware failure, or a guy in the data center tripping over the power cord. The other servers in the farm will not know to rebuild the cache; they’ll just keep sending requests to the bad server.
2) There is no ideal way to run the PowerShell on startup as a farm account, and please don’t tell me your farm account is Local Service or Network Service on a multi-server farm. The closest you can come is a startup script assigned in the local policy that calls a secondary script with hardcoded credentials to impersonate a farm admin account.
What we’re doing:
1) Record the last time each server had its distributed cache gracefully started. SPServer.Properties is a good place for that.
2) Every 30 minutes or so, check this against the server’s last boot time (WMI).
3) If the last reboot is more recent than the last graceful distributed cache startup, tell the farm to rebuild that server’s cache.
Sprinkle in some (lots of) error handling and call it a day.
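The bookkeeping in steps 1 to 3 could look roughly like the sketch below. It is not our production script: the property key and both function names are made up for illustration, treating a missing timestamp as stale is an assumption, and the account running it is assumed to have farm admin rights plus WMI access to the target server. The rebuild itself is left out.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical property key; any unique name works.
$StampKey = "DistCacheLastGracefulStart"

# Step 1: call this right after the cache service has been gracefully started.
function Set-DistCacheStamp([string]$ServerName) {
    $server = Get-SPServer -Identity $ServerName
    # Store UTC in round-trip format so it parses back unambiguously.
    $server.Properties[$StampKey] = (Get-Date).ToUniversalTime().ToString("o")
    $server.Update()
}

# Steps 2 and 3: returns $true when the server booted after its last graceful cache start.
function Test-DistCacheStale([string]$ServerName) {
    $server = Get-SPServer -Identity $ServerName
    $stamp = $server.Properties[$StampKey]
    if (-not $stamp) { return $true }   # assumption: never recorded counts as stale

    $style = [System.Globalization.DateTimeStyles]::RoundtripKind
    $lastStart = [datetime]::Parse($stamp, $null, $style)

    # LastBootUpTime comes back as a WMI date string; convert it, then normalize to UTC.
    $os = Get-WmiObject -Class Win32_OperatingSystem -ComputerName $ServerName
    $lastBoot = $os.ConvertToDateTime($os.LastBootUpTime).ToUniversalTime()

    return $lastBoot -gt $lastStart
}
```

A job that runs every 30 minutes would call Test-DistCacheStale for each cache host, rebuild the cache on any host that returns $true, and then call Set-DistCacheStamp for it.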