Cross vCenter NSX failover and failback

21 May

So what does NSX cross vCenter failover look like in a real world scenario?

In this setup, we have 2 NSX managers across two sites, with Site A hosting the primary NSX manager and the DR Site hosting the secondary. To simulate a DR scenario, the primary NSX manager, all 3 controllers and all the universal DLR control-VMs are shut down in Site A.

You will see the following after the primary NSX manager is shut down.

nsx1-primshutdown

To fail over NSX to the DR site:

 1.  Make the secondary NSX manager take over the primary role.

From the secondary NSX manager, select action “Disconnect from the Primary NSX manager”, then select action “Assign Primary role”

nsx2-disconnectpri

2.  Deploy 3 new controllers in the DR site

I am assuming that you already have the IP pools and port groups needed to deploy the controllers. Remember that you must deploy them one at a time, not all at once. While a controller is being deployed, you may see a momentary disconnected status as it joins the cluster.
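If you script the deployment, the one-at-a-time rule can be enforced with a simple loop. The following is a minimal sketch, not an NSX API client: `deploy` and `is_ready` are hypothetical callables that you would supply yourself (for example, wrappers around whatever tooling you use to deploy a controller and check its status), and the polling interval and limit are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, List


def deploy_one_at_a_time(
    names: Iterable[str],
    deploy: Callable[[str], None],      # hypothetical: starts deployment of one controller
    is_ready: Callable[[str], bool],    # hypothetical: True once that controller has joined the cluster
    poll_seconds: float = 30.0,         # assumed polling interval
    max_polls: int = 40,                # assumed upper bound on status checks
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy controllers strictly in sequence, waiting for each to be ready."""
    deployed = []
    for name in names:
        deploy(name)
        for _ in range(max_polls):
            if is_ready(name):
                break
            sleep(poll_seconds)
        else:
            # Never start the next controller while this one is still pending.
            raise TimeoutError(
                f"{name} did not become ready; stopping before the next deployment"
            )
        deployed.append(name)
    return deployed
```

The loop refuses to move on to the next controller until the current one reports ready, so a momentary disconnected status only delays the sequence rather than overlapping two deployments.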

nsx-controller-disconnetc

Important: Check the cluster and ESXi channel health after you complete the deployment. During one of our tests, 2 of the ESXi hosts had issues communicating with the NSX manager. As a result, some of the VMs on NSX networks could not be pinged over North-South connections, because the new ARP and MAC tables were not updated on the controllers. See KB 2146873. The solution is simply to put each affected host into maintenance mode and reboot it.

3.  Next, deploy all the universal DLR control-VMs in the DR Site

You will see that all the uDLRs show an “active” status instead of “deployed”. Just right-click each uDLR and select deploy.

nsx3-uDLR-active

4.  Lastly, verify your network. Duh!

To fail back to Site A from the DR Site:

1.  Power up the primary NSX manager and all 3 controllers in Site A.

You should see 2 primary NSX managers and 6 controllers in the GUI.

VMware design documentation actually recommends that when you perform a failover and failback of cross-vCenter NSX, you re-create the controller cluster on the primary NSX manager. The reason is to prevent the cluster from getting out of sync when you power the old controllers back up; re-creating the controller cluster is the only sure way to avoid that.

2.  Set the role of the NSX manager in the DR site to Transit

On the NSX manager in the DR site, select action “Remove primary role”; this puts the NSX manager into the Transit role.

nsx-mgr-failback

3.  Delete all 3 deployed controllers in the DR site

Delete them one after another. For the last controller, you need to select “force”.

4.  Add the NSX manager back as secondary

On the primary NSX manager in Site A, select action “Remove secondary NSX manager”. Then select action “Add secondary NSX manager” and choose the NSX manager in the DR site, which is now in the Transit role. Ensure that both NSX managers show as connected to all 3 controllers.

nsx-mgr-syncissue

5.  Shut down and delete all universal DLR control-VMs in the DR site

The current version of NSX does not allow you to delete the uDLR control-VMs directly from the GUI; you need to use the REST API, through a tool such as Postman, to do it. If you have an HA pair, you need to run the following once for each appliance (index 0 and index 1) of each uDLR:

DELETE https://{nsxmgr}/api/4.0/edges/{edgeid}/appliances/0 (or 1)
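If you would rather script the call than use a GUI client, the request above can be built with the Python standard library. This is a minimal sketch under a few assumptions: the manager hostname, edge ID and credentials are placeholders, HTTP basic authentication is assumed, and only the endpoint path comes from the step above. The function builds the requests without sending them, so you can inspect them first.

```python
import base64
import urllib.request
from typing import List


def appliance_delete_requests(
    nsx_mgr: str,          # placeholder: NSX manager hostname or IP
    edge_id: str,          # placeholder: ID of the uDLR edge
    user: str,
    password: str,
    ha_pair: bool = True,  # an HA pair has appliances at index 0 and 1
) -> List[urllib.request.Request]:
    """Build one DELETE request per control-VM appliance of a uDLR."""
    # Assumption: the manager accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    indexes = (0, 1) if ha_pair else (0,)
    return [
        urllib.request.Request(
            f"https://{nsx_mgr}/api/4.0/edges/{edge_id}/appliances/{i}",
            method="DELETE",
            headers={"Authorization": f"Basic {token}"},
        )
        for i in indexes
    ]
```

To actually send them, pass each request to `urllib.request.urlopen`. If your NSX manager uses a self-signed certificate, you will also need to supply a suitable SSL context.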

6.  Power up the original uDLRs in Site A.

7.  Lastly, verify your network. Duh!

The failover of NSX managers is pretty straightforward. Do note that there are obviously more tasks in a DR scenario, like redeploying your load balancers and ensuring your ESGs are routing correctly and advertising the new routes, but those are not topics that I am addressing here.

What I have described here is a planned DR scenario. In a real DR, Site A may be totally isolated by a loss of network connectivity for a prolonged period of time. In this scenario, you must remember that all your NSX components are still up and running in the isolated Site A. Before you re-establish the connection to Site A after you have done the failover, ensure that all uDLRs and controllers in the DR site are shut down first. This is to avoid having uDLRs from both sites pushing different updates to your ESXi hosts (through each NSX manager’s controller cluster), causing route conflicts or confusion.
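In a scripted runbook, that last check can be made an explicit gate before the inter-site link is restored. The sketch below is only an illustration: the component names are made up, and how you collect the power state (vCenter inventory, a spreadsheet, manual confirmation) is up to you.

```python
from typing import Dict, List


def still_running(power_state: Dict[str, bool]) -> List[str]:
    """Return the names of components that are still powered on."""
    return sorted(name for name, powered_on in power_state.items() if powered_on)


def safe_to_reconnect(dr_site_components: Dict[str, bool]) -> bool:
    """True only when every DR-site uDLR control-VM and controller is shut down."""
    return not still_running(dr_site_components)


# Hypothetical inventory: True means the VM is powered on.
dr_site = {
    "dr-controller-1": False,
    "dr-controller-2": False,
    "dr-controller-3": True,
    "dr-udlr-1-0": False,
}
print(still_running(dr_site))
print(safe_to_reconnect(dr_site))
```

With the sample inventory, the gate stays closed until `dr-controller-3` is shut down as well.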

Posted on May 21, 2017 in Cloud, vmware