Recently, a few of my SQL Jobs started failing in intermediate steps. By checking the logs, I saw the error message: "Delay in opening server connection"
These jobs were running fine for a year but this sudden failure/time out issue made me curious to find out the root cause.
Background: Before proceeding, let me give some background on infrastructure where these SQL Jobs were running. We have a HADR (High Availability Disaster Recovery) solution in place. That means, we are having one primary and one secondary SQL node where jobs are running based on primary state of the nodes.
Here is representation of current setup is in US Region (local data center)
Connection string used in SQL jobs.
Data Source=DEVLISTENER;Initial Catalog=Reports;Provider=SQLNCCLI11.1;Integrated Security=SSPI;Auto Translate=True
Note
As you see here, we are pointing to the listener name of the Availability Group in Always On HA setup in SQL Server. Listener is associated with IP address and DNS resolver will return the IP whenever requested. In my case we had only one IP associated based on above setup. So all the SSIS packages executed without any issues until I got the above error recently!
Who is the culprit?
Changes to the infrastructure
We have recently added new node as secondary in EMEA Region-Disaster Recovery (DR) Site as shown below.
By doing this change, I had to add one more IP from EMEA region to the listener in the cluster configuration. This IP will be Offline as long as node in EMEA region is secondary. This IP will come online when we fail over to EMEA region in case of disaster.
So, whenever a request comes to the listener, by default DNS returns all the IPs associated with the listener one by one serially. This is very important to note!. This might return the Offline IP first and we will get timeout!!!
Solution
The parameter MultiSubnetFailover
This parameter should be used in connection string and set to “TRUE.” When set to TRUE, the connection attempt behavior changes. It will no longer attempt all of the IP addresses serially, but it will try in parallel and online IP will be returned, and no more timeout issues will come!
I have made changes to the connecting string as follow
Data Source=DEVLISTENER;Initial Catalog=Reports;Provider=SQLNCCLI11.1;Integrated Security=SSPI;Auto Translate=True; MultiSubnetFailover=True
And it worked!!!
Helpful Link
https://blogs.msdn.microsoft.com/alwaysonpro/2014/06/03/connection-timeouts-in-multi-subnet-availability-group/