Tips and Tricks: Distributed Cache Service in SharePoint 2013

If you, like me, are playing with SharePoint 2013 or if you have plans to migrate/deploy SharePoint 2013, you may have already heard about Distributed Cache (a.k.a. Velocity or AppFabric).  In this post, I’d like to make you aware of some tips from the field that may help you avoid some serious issues in your production Farm.

First things first, see the following articles to learn about planning and managing Distributed Cache on SharePoint 2013:

Plan for feeds and the Distributed Cache service in SharePoint Server 2013

Manage the Distributed Cache service in SharePoint Server 2013

As you know, real-world scenarios are always different from, and more challenging than, the TechNet “ideal world,” so the following tips from Premier support cases are really valuable:

  • When you run the SharePoint 2013 Products Configuration Wizard (a.k.a. psconfig), the Distributed Cache service is enabled by default on that server.  If you run the wizard on all SharePoint servers in the Farm, the service will be running on all of them, which is not the ideal configuration for a production environment.  To avoid this problem, configure your servers via PowerShell instead of the wizard.  After the first Farm server is configured, join the remaining servers with Connect-SPConfigurationDatabase and the -SkipRegisterAsDistributedCacheHost parameter.
  • Plan to have a dedicated server or servers run only the Distributed Cache service.  Avoid sharing that server(s) with any other service, even Central Administration, because Distributed Cache needs special considerations with respect to resources and maintenance activities.
  • Recommended resources for dedicated servers are:
    • 4-core processor
    • 24 GB RAM (8-16 GB dedicated to Distributed Cache)
    • 1 Gbps network interface
    • Physical and virtual environments are both supported; however, dynamic memory is not supported on virtual machines
  • Distributed Cache must be configured manually to use dedicated resources, so please run the following actions during the Farm Configuration process before starting the User Profile Service:
    1. Stop the Distributed Cache service on every server running it, and wait on each one until the service has stopped:
      Stop-SPDistributedCacheServiceInstance -Graceful
      (the -Graceful parameter moves the cached items on that server to another available server)
    2. Then run the cmdlet:
      Update-SPDistributedCacheSize -CacheSizeInMB <size in MB>
      Remember to use between 8 GB and 16 GB (16 GB is used in real-world scenarios on servers with 24 GB RAM).
    3. Restart the service on all dedicated servers from Central Administration > Services on Server
  • If you need to run a maintenance window or remove a server from the Cache Cluster (the name for the set of servers dedicated to the Distributed Cache service), then you need to stop and remove the service as follows:
    1. Stop the service using the following cmdlet
      Stop-SPDistributedCacheServiceInstance -Graceful
      on the server to be removed or on all dedicated servers (a.k.a. Cache Hosts)
      TIP: If you need Distributed Cache to always be available then leave the service running on one server.
    2. Run the following cmdlet on all the servers except the one left running for availability:
      Remove-SPDistributedCacheServiceInstance
      If all servers have the service stopped then leave one without running this cmdlet, which will be your first server to restart.
    3. When your maintenance is over, go to Central Administration and start Distributed Cache service from Services on Server page, then wait until service is listed as “started.”
    4. Finally go to each Cache Host and run cmdlet:
      Add-SPDistributedCacheServiceInstance
    5. To verify everything is ok, run the following cmdlets from any Cache Host to see if all Cache Hosts are listed and service status is “UP”:
      Use-CacheCluster
      Get-CacheHost
  • Never stop the AppFabric service from the Services applet in Windows or restart servers running AppFabric without gracefully stopping the Distributed Cache service.
  • The Distributed Cache service is based on AppFabric, which is a prerequisite when you install SharePoint 2013.  AppFabric has its own administration via PowerShell and developers can use it to deploy new features, however direct management and development on AppFabric in a SharePoint Farm is not supported.  If you have issues with AppFabric or Distributed Cache then get support from Microsoft, do not use the AppFabric management directly.  If you want to develop new features, use a dedicated AppFabric environment outside the SharePoint Farm.
  • AppFabric has its own updates, so SharePoint Administrators must be aware of those updates and how they interact with the SharePoint Farm.  Follow the AppFabric Team Blog to learn more about it.
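
As a rough sketch of the resize procedure from the list above (not a definitive script): the 16384 MB value is only the 16 GB example mentioned in the list, and restarting through Provision() is an assumed alternative to the Central Administration step. Run the stop and start parts on each cache host, and run the resize only once.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# 1. On EVERY cache host: stop the service gracefully and wait until it has stopped.
Stop-SPDistributedCacheServiceInstance -Graceful

# 2. On ONE cache host only, once the service is stopped everywhere.
#    16384 MB is an assumed example (16 GB on a server with 24 GB RAM).
Update-SPDistributedCacheSize -CacheSizeInMB 16384

# 3. On EVERY cache host: start the local service instance again.
#    (Assumption: Provision() is used here instead of Central Administration.)
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | Where-Object {
    $_.Service.ToString() -eq $instanceName -and $_.Server.Name -eq $env:COMPUTERNAME
}
$serviceInstance.Provision()
```

Afterwards, Use-CacheCluster followed by Get-CacheHost should list every cache host as “UP”.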

Distributed Cache needs ping

Issue

The Distributed Cache service does not install correctly on additional farm servers.

Symptoms

  1. When you join a server to the farm the Distributed Cache service on the server does not start. When you try to manually start or provision the service, you receive an error or the exception:
    cacheHostInfo is null
  2. When you try to create a new Distributed Cache instance on a server that is not part of the Distributed Cache cluster using the Add-SPDistributedCacheServiceInstance cmdlet, you receive the exception:
    ErrorCode<ERRCAdmin040>:SubStatus<ES0001>:Failed to connect to hosts in the cluster

In both cases:

  • The Distributed Cache service has been created and is running on one or more other servers in the farm
  • The AppFabric ports (TCP 22233-22236) are permitted between all servers in the farm
  • SharePoint has created a new Distributed Cache SPServiceInstance on the server, but it is Disabled
  • The AppFabric Windows service (AppFabric Caching Service) is not running on the server and has a Disabled startup type

Cause

Internet Control Message Protocol v4 (ICMPv4, or “ping”) traffic between the server and the first cache host in the farm is not permitted. The source of the blocked ICMP traffic could be due to:

  • One or more firewalls between SharePoint servers are not allowing ICMP traffic. e.g. a hardware firewall, Windows Firewall, or other software-based firewall
  • For servers in different networks, ICMP packets are not routed between the networks
  • Some other network policy that blocks ICMP traffic

Resolution

Allow ICMPv4 traffic between all servers running distributed cache and attempt recreating Distributed Cache instances on the additional servers or disconnecting and re-joining the servers to the farm.
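
If Windows Firewall is what blocks the traffic, a rule along these lines is one way to allow it. This is a minimal sketch: the rule name and the server name SP-CACHE01 are hypothetical, and it does nothing for hardware firewalls or routing between networks.

```powershell
# Requires Windows Server 2012 or later (NetSecurity module).
# ICMPv4 type 8 is the echo request used by ping. The display name is arbitrary.
New-NetFirewallRule -DisplayName "Allow ICMPv4 Echo Request (Distributed Cache)" `
    -Direction Inbound -Protocol ICMPv4 -IcmpType 8 -Action Allow

# On Windows Server 2008 R2, netsh offers an equivalent:
# netsh advfirewall firewall add rule name="Allow ICMPv4 Echo Request" protocol=icmpv4:8,any dir=in action=allow

# From another cache host, confirm that ping now gets through.
# "SP-CACHE01" is a hypothetical server name.
Test-Connection -ComputerName "SP-CACHE01" -Count 2 -Quiet
```

Test-Connection with -Quiet returns True or False, which makes it easy to script the check before retrying the service instance.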

Details

You’ve been selected to set up a new SharePoint Server 2013 farm to support a new company-wide portal. The stakeholders have a vision that the SharePoint farm will “never get hacked.” In an effort to achieve this goal, you’ve spent a considerable amount of time figuring out what you’ll need to do to harden SharePoint. Thankfully, there’s the Plan security hardening for SharePoint 2013 TechNet article that details the networking and service requirements. In fact, you’ve spent so much time dissecting this guide that it’s a mainstay of your most visited sites thumbnails when you open a new browser tab.

The guide details the requirements for Distributed Cache: Open the ports for AppFabric on the servers hosting the service and allow inbound connections. These are TCP ports 22233, 22234, 22235, and 22236 (i.e. TCP ports 22233-22236).

The day has come and you’re setting up the farm. You start the process on one of your servers by creating the configuration database and Central Administration site. Next you join some other servers to the farm without issue. You carry on setting up web applications and services.

You reach a point where you need to configure the Distributed Cache service. The first thing you want to do is change which servers are running the service. For some reason, you notice the only server running the service is the server you used to originally create the farm. This is unusual because normally Distributed Cache is created and started on a server when you join it to the farm unless you explicitly provide the -SkipRegisterAsDistributedCacheHost switch to the Connect-SPConfigurationDatabase cmdlet. Of course, in this case you did not use the switch. You expect to see Distributed Cache running on other servers.
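
For reference, joining a server with the switch might look like the sketch below. The SQL alias and database name are hypothetical placeholders, and the remaining post-join configuration steps are left out.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical values: replace with your own SQL alias and configuration database name.
$passphrase = Read-Host -Prompt "Farm passphrase" -AsSecureString

# Join the farm without registering this server as a Distributed Cache host.
Connect-SPConfigurationDatabase -DatabaseServer "SQLALIAS01" `
    -DatabaseName "SharePoint_Config" `
    -Passphrase $passphrase `
    -SkipRegisterAsDistributedCacheHost
```

Leaving the switch off, as in this story, means the server tries to become a cache host as part of the join.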

(Screenshot: Distributed Cache)

So you click on the server and confirm the Distributed Cache service instance is stopped.

(Screenshot: service on the server)

You click Start and after a few seconds it says there was an error.

(Screenshot: error starting the service)

If you try this in PowerShell (as you should have in the first place) you see the service instance exists, but it’s disabled.
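
A check along these lines is one way to see that state. It is a sketch only, reusing the service name string that appears in the cleanup scripts later in this document.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# List every Distributed Cache service instance in the farm with its status.
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
Get-SPServiceInstance |
    Where-Object { $_.Service.ToString() -eq $instanceName } |
    Select-Object Server, Status, Id
```

An instance in the state described here would show a Status of Disabled.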

(Screenshot: PowerShell showing the Distributed Cache service instance as disabled)

When you go to provision it, you get the excellent “cache host info is null” error which is the technical way to say the Distributed Cache configuration is messed up.

(Screenshot: PowerShell provision error)

At this point the only thing you think to do is to delete the service instance and manually create it again.

Delete the service instance:

(Screenshot: PowerShell, service instance deleted)

Add the instance by running the Add-SPDistributedCacheServiceInstance cmdlet directly on the server:

(Screenshot: PowerShell, Add-SPDistributedCacheServiceInstance error)

And there we g…?

Failed to connect to hosts in the cluster? How can that be? In this case the servers are on the same network, they’re even on the same VM host. We can use PortQry to validate the server can connect to the AppFabric ports:

(Screenshot: PortQry output for the Distributed Cache ports)

That checks out, the cache (22233), cluster (22234), and replication (22236) ports are listening so what’s the deal?

The Deal

The deal is there is a minimally documented requirement for the Distributed Cache service. Unfortunately this requirement is not mentioned in either the hardening guide or the Manage the Distributed Cache service in SharePoint Server 2013 articles. But it does appear in the final note at the very bottom of the Plan for feeds and the Distributed Cache service in SharePoint Server 2013 page:

If you are using more than one cache host in your server farm, you must configure the first cache host running the Distributed Cache service to allow Inbound ICMP (ICMPv4) traffic through the firewall … If an administrator removes the first cache host from the cluster which was configured to allow Inbound ICMP (ICMPv4) traffic through the firewall, you must configure the first server of the new cluster to allow Inbound ICMP (ICMPv4) traffic through the firewall.

To set up Distributed Cache, the cache hosts must be able to ping the initial cache host. Normally this is the first server you set up in the farm provided you haven’t removed the service instance.

Sure enough, when we ping the server, it fails:

(Screenshot: ping failed)

The new server can’t ping the server that is already running Distributed Cache. In this case, Windows Firewall blocked incoming ICMPv4 ping requests. By creating a rule to allow ping to the server, it becomes possible to add a new Distributed Cache instance:

(Screenshot: firewall rule allowing ping)

But it gets better. If you follow the documentation exactly and enable ICMP only on the first cache host, so that none of the other servers respond to pings, attempts to administer the AppFabric cluster fail and report the other hosts as unavailable. If you then allow ping on the other hosts, the instances appear online.

(Screenshot: ping allowed on the other servers)

This means the actual networking requirements for Distributed Cache are allowing inbound TCP ports 22233-22236 and inbound ICMPv4 on all cache hosts in the farm.
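
Both requirements can be checked from any farm server with something like the sketch below. The host names are hypothetical, and a failed TCP connection only means that nothing accepted the connection on that port at that moment: the PortQry check earlier reported 22233, 22234, and 22236 as listening, so a miss on 22235 alone is not necessarily a firewall problem.

```powershell
# Hypothetical host names: replace with the cache hosts in your farm.
$cacheHosts = "SP-CACHE01", "SP-CACHE02"
$ports = 22233..22236

foreach ($cacheHost in $cacheHosts) {
    # ICMPv4 echo request (ping)
    $pingOk = Test-Connection -ComputerName $cacheHost -Count 2 -Quiet
    $pingText = if ($pingOk) { "ok" } else { "FAILED" }
    Write-Output ("{0}: ping {1}" -f $cacheHost, $pingText)

    # AppFabric TCP ports
    foreach ($port in $ports) {
        $client = New-Object System.Net.Sockets.TcpClient
        try {
            $client.Connect($cacheHost, $port)
            Write-Output ("{0}: TCP {1} open" -f $cacheHost, $port)
        } catch {
            Write-Output ("{0}: TCP {1} FAILED" -f $cacheHost, $port)
        } finally {
            $client.Close()
        }
    }
}
```

Running it from each cache host in turn covers both directions of traffic.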

Adding the service to a server that didn’t have it to begin with

Let’s pretend you originally joined a server to the farm using the -SkipRegisterAsDistributedCacheHost switch and later decided you want to run Distributed Cache. If ICMP isn’t enabled on the first cache host you will encounter the issue as well. When you run Add-SPDistributedCacheServiceInstance you’ll receive the “Failed to connect to hosts in the cluster” exception. The resolution is the same. Allow ICMP and retry.

In both scenarios you may need to delete and recreate the new service instance a number of times before it works. I find after enabling ICMP the first attempt doesn’t always succeed so I need to delete the instance and add it again.

Of course, if your SharePoint servers can ping each other before you join them, you’ll never run into this issue.

Repairing distributed cache with PowerShell

  • Recently we had issues with the distributed cache on our farm, which I had built quite some time ago with AutoSPInstaller.  The problems may have come from rolling out cumulative updates or something similar.  There is very little documentation on the web for this.

  • In our case we had 4 servers (2 web front-ends and 2 application servers), all with the distributed cache enabled.  Only one server was actually running the distributed cache.

  • For our topology we wanted the distributed cache to run on the web front-ends only, so we made some changes to the farm.

Clean up all 4 Servers using the following commands:

#Stopping the service on local host

Stop-SPDistributedCacheServiceInstance -Graceful

#Removing the service from SharePoint on local host.

Remove-SPDistributedCacheServiceInstance

#Cleanup left over pieces from SharePoint

$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

$serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $env:computername}

$serviceInstance.delete()

Then we added the cache host back to WEB01:

#Re-add the server back to the cluster

Add-SPDistributedCacheServiceInstance

We then checked the SPDistributedCacheClientSettings and found that “MaxConnectionsToServer” was set to 16 for all containers.

$DLTC = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache

$DLTC

We used the following script to change  “MaxConnectionsToServer” back to 1 and increase the timeout for each container.

Add-PSSnapin Microsoft.Sharepoint.Powershell

#DistributedLogonTokenCache

$DLTC = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache

$DLTC.MaxConnectionsToServer = 1

$DLTC.requestTimeout = "3000"

$DLTC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache -DistributedCacheClientSettings $DLTC

#DistributedViewStateCache

$DVSC = Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache

$DVSC.MaxConnectionsToServer = 1

$DVSC.requestTimeout = "3000"

$DVSC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache $DVSC

#DistributedAccessCache

$DAC = Get-SPDistributedCacheClientSetting -ContainerType DistributedAccessCache

$DAC.MaxConnectionsToServer = 1

$DAC.requestTimeout = "3000"

$DAC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedAccessCache $DAC

#DistributedActivityFeedCache

$DAF = Get-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedCache

$DAF.MaxConnectionsToServer = 1

$DAF.requestTimeout = "3000"

$DAF.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedCache $DAF

#DistributedActivityFeedLMTCache

$DAFC = Get-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedLMTCache

$DAFC.MaxConnectionsToServer = 1

$DAFC.requestTimeout = "3000"

$DAFC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedLMTCache $DAFC

#DistributedBouncerCache

$DBC = Get-SPDistributedCacheClientSetting -ContainerType DistributedBouncerCache

$DBC.MaxConnectionsToServer = 1

$DBC.requestTimeout = "3000"

$DBC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedBouncerCache $DBC

#DistributedDefaultCache

$DDC = Get-SPDistributedCacheClientSetting -ContainerType DistributedDefaultCache

$DDC.MaxConnectionsToServer = 1

$DDC.requestTimeout = "3000"

$DDC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedDefaultCache $DDC

#DistributedSearchCache

$DSC = Get-SPDistributedCacheClientSetting -ContainerType DistributedSearchCache

$DSC.MaxConnectionsToServer = 1

$DSC.requestTimeout = "3000"

$DSC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedSearchCache $DSC

#DistributedSecurityTrimmingCache

$DTC = Get-SPDistributedCacheClientSetting -ContainerType DistributedSecurityTrimmingCache

$DTC.MaxConnectionsToServer = 1

$DTC.requestTimeout = "3000"

$DTC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedSecurityTrimmingCache $DTC

#DistributedServerToAppServerAccessTokenCache

$DSTAC = Get-SPDistributedCacheClientSetting -ContainerType DistributedServerToAppServerAccessTokenCache

$DSTAC.MaxConnectionsToServer = 1

$DSTAC.requestTimeout = "3000"

$DSTAC.channelOpenTimeOut = "3000"

Set-SPDistributedCacheClientSetting -ContainerType DistributedServerToAppServerAccessTokenCache $DSTAC
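
Since every container gets the same three values, the per-container blocks could also be collapsed into a loop. The following is a sketch of that idea rather than the script we actually ran; the container list is simply the ten container types used above.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# The ten container types handled one by one in the script above.
$containerTypes = @(
    "DistributedLogonTokenCache",
    "DistributedViewStateCache",
    "DistributedAccessCache",
    "DistributedActivityFeedCache",
    "DistributedActivityFeedLMTCache",
    "DistributedBouncerCache",
    "DistributedDefaultCache",
    "DistributedSearchCache",
    "DistributedSecurityTrimmingCache",
    "DistributedServerToAppServerAccessTokenCache"
)

foreach ($containerType in $containerTypes) {
    # Read the current client settings, change them, and write them back.
    $settings = Get-SPDistributedCacheClientSetting -ContainerType $containerType
    $settings.MaxConnectionsToServer = 1
    $settings.RequestTimeout = 3000
    $settings.ChannelOpenTimeOut = 3000
    Set-SPDistributedCacheClientSetting -ContainerType $containerType -DistributedCacheClientSettings $settings
}
```

A loop also avoids copy-and-paste slips, such as setting a value on the wrong variable or reading one container and writing another.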

  • We then stopped and restarted Distributed Cache from Central Admin on WEB01
  • We then attempted to start “Distributed Cache” on WEB02 and received error “failed to connect to hosts in the cluster”
  • Performing a TRACERT from WEB01 to WEB02, we can see a device is in the middle (10.21.1.5).
  • We installed the Telnet client:

Import-Module servermanager

Add-WindowsFeature telnet-client

  • We used Telnet from WEB01 to WEB02 on port 22233, and the connection was established.
  • We then stopped, cleaned up, and re-added WEB02 to the cache cluster:
    #Stopping the service on local host

    Stop-SPDistributedCacheServiceInstance -Graceful

    #Removing the service from SharePoint on local host.

    Remove-SPDistributedCacheServiceInstance

    #Cleanup left over pieces from SharePoint

    $instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

    $serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $env:computername}

    $serviceInstance.delete()

    Then we added the cache host back to WEB02:

    #Re-add the server back to the cluster

    Add-SPDistributedCacheServiceInstance

    This time it started!

    • Now WEB01 and WEB02 are both serving the distributed cache
  • We checked the ULS Logs with ULSViewer and found all successful events for Distributed Cache.
  • Status

    =======

    Distributed cache is now healthy and in a working state on both WFE Servers.