Service_node_take_over

Table of Contents

Overview
Implementation

Overview

In a hierarchical cluster where many service nodes manage the compute nodes, a service node may fail or may need to be shut down for maintenance. The admin may also decide to move some nodes from one service node to another. To minimize the impact on the compute nodes, the responsibilities of the source service node, say sn1, need to move smoothly to the destination service node, say sn2. This design assumes that:

- both sn1 and sn2 are installed as service nodes.
- nodes managed by sn1 are reachable by sn2 via Ethernet.
- sn2 is currently running.
- sn1 may or may not be running.

A utility will be developed for both AIX and Linux to perform the takeover.

Implementation

Exploring service node pool concept

This design explores the service node pool concept. To define a service node pool for a node, noderes.servicenode is set to a comma-separated list of service nodes, and noderes.xcatmaster is usually left blank. This design adds support for a non-blank noderes.xcatmaster. It can be the host name of the service node adapter that faces the compute nodes. In this case, the first service node listed in noderes.servicenode must be the same service node that noderes.xcatmaster refers to. Here is how it works:

- If noderes.xcatmaster is blank, the MASTER environment variable for the postscripts is the DHCP server that responded to the node's DHCP request during node deployment.
- If noderes.xcatmaster is not blank, the MASTER environment variable for the postscripts uses this value regardless of the DHCP server.
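The two rules above can be sketched as a small Python function. This is only an illustration of the selection logic; the function name `resolve_master` and its parameters are hypothetical and are not part of xCAT.

```python
def resolve_master(xcatmaster, dhcp_server):
    """Return the value the MASTER environment variable would take.

    xcatmaster  -- value of noderes.xcatmaster for the node (may be blank or None)
    dhcp_server -- the DHCP server that answered the node during deployment
    (Both parameter names are assumptions made for this sketch.)
    """
    # A non-blank noderes.xcatmaster always wins, regardless of the DHCP server.
    if xcatmaster and xcatmaster.strip():
        return xcatmaster.strip()
    # Otherwise fall back to whichever DHCP server responded.
    return dhcp_server


if __name__ == "__main__":
    print(resolve_master("", "sn1n"))      # blank xcatmaster: DHCP server is used
    print(resolve_master("sn2n", "sn1n"))  # non-blank xcatmaster: its value is used
```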

The postscripts use MASTER to set up the server for some services, for example the syslog server. The current code already works this way, but it needs to be reviewed to fix the places that assume noderes.xcatmaster is blank for a service node pool. The service node takeover only works when:

- there is a service node pool defined for the nodes.
- both sn1 and sn2 are in the pool and sn1 is the first one in the servicenode list.
- noderes.xcatmaster is not blank.
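A minimal sketch of these preconditions as a check, assuming the noderes values for a node are available as plain strings. The function `takeover_problems` and its argument names are hypothetical; the real check would read the xCAT tables.

```python
def takeover_problems(servicenode, xcatmaster, sn1, sn2):
    """Return a list of unmet takeover preconditions (empty list means OK)."""
    problems = []
    # noderes.servicenode is a comma-separated list of service nodes.
    pool = [s.strip() for s in (servicenode or "").split(",") if s.strip()]
    if not pool:
        problems.append("no service node pool is defined")
    if sn1 not in pool or sn2 not in pool:
        problems.append("sn1 and sn2 must both be in the pool")
    if pool and pool[0] != sn1:
        problems.append("sn1 must be first in the servicenode list")
    if not (xcatmaster and xcatmaster.strip()):
        problems.append("noderes.xcatmaster must not be blank")
    return problems


if __name__ == "__main__":
    print(takeover_problems("sn1,sn2", "sn1n", "sn1", "sn2"))
    print(takeover_problems("sn2,sn1", "", "sn1", "sn2"))
```

Returning the list of problems rather than a single boolean makes it easier to report to the admin exactly which condition blocks the move.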

snmove command

A command called snmove will be implemented. It takes two service node names as input and transfers the responsibilities from the source service node to the destination service node.

   **snmove -v|-h**
   **snmove noderange -d sn2 -D sn2n [-i]**  
        Move management responsibilities for the given nodes to sn2. 
   **snmove -s sn1 [-S sn1n] -d sn2 -D sn2n [-i]**  
        Move management responsibilities for all the nodes managed by sn1 to sn2.
                sn1 is the hostname of the source service node adapter facing the mn.
                sn1n is the hostname of the source service node adapter facing the nodes.
                sn2 is the hostname of the destination service node adapter facing the mn.
                sn2n is the hostname of the destination service node adapter facing the nodes.
                -i: No action will be done on the nodes. 
                If -i is not specified, the syslog and setupntp postscripts will be rerun on the nodes to switch the syslog and NTP servers.
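To make the two invocation forms concrete, here is a rough Python sketch of how such a command line could be parsed with `argparse`. It is not the snmove implementation; the destination attribute names (`source`, `dest`, and so on) are made up for this example, `-h` is left to argparse's built-in help, and `-v` is omitted.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="snmove",
        description="Move node management from one service node to another.",
    )
    # Form 1: snmove noderange -d sn2 -D sn2n [-i]
    parser.add_argument("noderange", nargs="?", help="nodes to move")
    # Form 2: snmove -s sn1 [-S sn1n] -d sn2 -D sn2n [-i]
    parser.add_argument("-s", dest="source", help="sn1, adapter facing the mn")
    parser.add_argument("-S", dest="source_n", help="sn1n, adapter facing the nodes")
    parser.add_argument("-d", dest="dest", required=True, help="sn2, adapter facing the mn")
    parser.add_argument("-D", dest="dest_n", required=True, help="sn2n, adapter facing the nodes")
    parser.add_argument("-i", dest="ignore_nodes", action="store_true",
                        help="take no action on the nodes")
    return parser


def parse_snmove_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Exactly one of the two forms must be used.
    if (args.noderange is None) == (args.source is None):
        parser.error("give either a noderange or -s sn1, but not both")
    return args


if __name__ == "__main__":
    print(parse_snmove_args(["compute01-compute10", "-d", "sn2", "-D", "sn2n"]))
    print(parse_snmove_args(["-s", "sn1", "-d", "sn2", "-D", "sn2n", "-i"]))
```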

snmove must be invoked on the management node (mn). It performs the following functions:

For Linux:

Top down
- change the noderes.servicenode setting, move sn2 to the first one in the list
Bottom up
- change the noderes.xcatmaster setting for the nodes from sn1 to sn2.
- on the nodes, switch the syslog and NTP server (MASTER) to sn2 (using updatenode)
DHCP
- add the nodes to the .leases files on sn2. (This step can be skipped because makedhcp has already done it.)
- change the dhcpserver on the nodes to point to sn2
TFTP and NFS
- if noderes.tftpserver and noderes.nfsserver point to sn1, change them to sn2 for the nodes. Rerun nodeset for the nodes so that they use the latest settings the next time they boot.
Name Server
- change /etc/resolv.conf on the nodes to point to sn2 (open question: what puts the value there? DHCP?)
Conserver
- change the nodehm.conserver setting for the nodes from sn1 to sn2, rerun makeconservercf for the nodes.
Monitoring??
- change the noderes.monserver setting for the nodes from sn1 to sn2.
- How about monitoring software like RMC, SNMP, Ganglia? No, this will be done by the user running the updatenode command.
Hardware control point?
- What if sn1 is the service node for the hardware control point of the nodes that need to be moved over to sn2? (Not handled yet)
Applications?
- LL config file? (No, the user needs to run updatenode manually)
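The table changes in the Linux steps above can be sketched as a pure function over one node's attributes. This is an illustration only: representing the attributes as a Python dict keyed by `table.column`, and the function name `move_node_attrs`, are assumptions made for the sketch. It also assumes an attribute holding the mn-facing name sn1 becomes sn2 and one holding the node-facing name sn1n becomes sn2n. The follow-up commands (updatenode, nodeset, makeconservercf) are not modelled.

```python
def move_node_attrs(attrs, sn1, sn1n, sn2, sn2n):
    """Return a copy of one node's attributes with sn1's duties moved to sn2."""
    new = dict(attrs)

    # Top down: make sn2 the first entry in the service node pool,
    # keeping the remaining service nodes in their original order.
    pool = [s.strip() for s in attrs.get("noderes.servicenode", "").split(",") if s.strip()]
    new["noderes.servicenode"] = ",".join([sn2] + [s for s in pool if s != sn2])

    # Bottom up: xcatmaster is the adapter that faces the compute nodes.
    new["noderes.xcatmaster"] = sn2n

    # TFTP/NFS, conserver, monitoring: change only values that point at sn1.
    replacement = {sn1: sn2, sn1n: sn2n}
    for key in ("noderes.tftpserver", "noderes.nfsserver",
                "nodehm.conserver", "noderes.monserver"):
        if attrs.get(key) in replacement:
            new[key] = replacement[attrs[key]]
    return new


if __name__ == "__main__":
    before = {
        "noderes.servicenode": "sn1,sn2,sn3",
        "noderes.xcatmaster": "sn1n",
        "noderes.tftpserver": "sn1n",
        "noderes.nfsserver": "nfs01",   # not served by sn1, so left alone
        "nodehm.conserver": "sn1",
    }
    for key, value in sorted(move_node_attrs(before, "sn1", "sn1n", "sn2", "sn2n").items()):
        print(key, "=", value)
```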

For AIX:

TBD.

Direct bootp or broadcast bootp

On System p, either a direct bootp or a broadcast bootp can be requested during node deployment. A flag (-d) will be added to the rnetboot command to request a direct bootp. If -d is set, noderes.xcatmaster, which defaults to site.master if blank, will be used as the bootp server for the nodes to be deployed.
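The server selection for a direct bootp can be sketched as follows. The function `bootp_server` and its parameters are hypothetical, and returning `None` to stand for "broadcast" is a convention of this sketch only.

```python
def bootp_server(direct, xcatmaster, site_master):
    """Pick the bootp server for a node, or None when broadcast bootp is used.

    direct      -- True when rnetboot was given the -d flag
    xcatmaster  -- noderes.xcatmaster for the node (may be blank or None)
    site_master -- the site.master value
    """
    if not direct:
        # Broadcast bootp: no specific server is targeted.
        return None
    # Direct bootp: noderes.xcatmaster, defaulting to site.master if blank.
    if xcatmaster and xcatmaster.strip():
        return xcatmaster.strip()
    return site_master


if __name__ == "__main__":
    print(bootp_server(True, "sn2n", "mn1"))
    print(bootp_server(True, "", "mn1"))
    print(bootp_server(False, "sn2n", "mn1"))
```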
