
OS_Deployment_Monitoring_And_Automatic_Retry

ligc edited this page Jul 30, 2015 · 5 revisions


1. Overview
This feature adds a monitoring and automatic retry mechanism for node OS installation. A monitoring process is started when the OS installation is initiated. If the node status has not reached the expected value after a specified amount of time, the monitoring process notifies the user that the node failed to transition to the expected status and, if the user has requested it, re-initiates the OS installation. If the node status still has not reached the expected value after several retries, the monitoring process notifies the user that the node failed to transition to the expected status after several retries and stops monitoring that node. The feature covers both stateful and stateless nodes, in both non-hierarchical and hierarchical environments.
2. External Interface
The monitoring process is started by the rpower or rnetboot command that is used to initiate the node OS installation.
A new flag "-m" will be added to rpower and rnetboot to specify the node attribute to monitor and its expected status. The -m flag can be followed by one table.column==expectedstatus or table.column=~expectedstatus pair; use multiple -m flags to specify multiple pairs. table.column specifies the attribute that will be monitored, for example, nodelist.appstatus. The operator "==" indicates that expectedstatus is a simple string; the operator "=~" indicates that expectedstatus is a regular expression. expectedstatus specifies the status that the user is waiting for. If the -m flag is not followed by any string, the default nodelist.status==booted is used.
A new flag "-t" will be added to rpower and rnetboot to specify the timeout, in minutes, to wait for the expectedstatus. This flag is required if the -m flag is specified.
A new flag "-r" will be added to rpower and rnetboot to specify the number of retries that the monitoring process will perform before declaring failure. The default value is 3. Setting the retry count to 0 means that the OS installation progress is only monitored; the installation is not re-initiated if the node status has not changed to the expected value when the timeout expires.
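
The condition syntax accepted by -m can be illustrated with a small Python sketch. This is not the xCAT implementation (xCAT itself is written in Perl); the names `Condition`, `parse_condition`, and `is_satisfied` are hypothetical. It also assumes that a "=~" expression matches if it is found anywhere in the attribute value, which the design does not spell out.

```python
import re
from typing import NamedTuple

# Default used when -m is given without a value (from the design text).
DEFAULT_CONDITION = "nodelist.status==booted"


class Condition(NamedTuple):
    """One -m condition (hypothetical helper type, not part of xCAT)."""
    table: str
    column: str
    operator: str   # "==" for a simple string, "=~" for a regular expression
    expected: str


def parse_condition(text: str = DEFAULT_CONDITION) -> Condition:
    """Split 'table.column==value' or 'table.column=~regex' into its parts."""
    m = re.fullmatch(r"(\w+)\.(\w+)(==|=~)(.+)", text)
    if m is None:
        raise ValueError(f"invalid -m condition: {text!r}")
    return Condition(*m.groups())


def is_satisfied(cond: Condition, actual: str) -> bool:
    """Compare the current attribute value with the expected status."""
    if cond.operator == "==":
        return actual == cond.expected
    # Assumption: a regular expression may match anywhere in the value.
    return re.search(cond.expected, actual) is not None
```

For example, `parse_condition("nodelist.appstatus=~sshd")` yields a condition that is satisfied by any appstatus value containing "sshd", while `parse_condition()` yields the default exact-match condition on nodelist.status.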

3. Function Flow
If the -m flag is specified with rpower or rnetboot, a monitoring process is started to monitor the OS installation progress. The attribute specified as "table.column", or the default nodelist.status, is used to determine the node status after the OS installation is initiated by rpower or rnetboot. The monitoring process queries the node status periodically. If the node status has changed to the expected value, monitoring for this node is done. If the node status has not changed to the expected value within the number of minutes specified by the timeout, the monitoring process prints a message indicating that the node has not reached the expected status. If retry is enabled (the retry count specified with -r, or its default, is greater than 0), the monitoring process re-initiates the OS installation and restarts the timeout countdown, repeating until the node status changes to the expected value or the retry limit is reached. If the retry limit is reached, the monitoring process prints a message indicating that the node OS installation failed. Here is the function flow chart of the monitoring:
[Image:Automaticretry.jpg]
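
The flow described above can be sketched in Python as follows. This is a minimal illustration rather than the xCAT implementation: `monitor_node`, `status_ok`, and `reinitiate` are hypothetical names, and the `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time. The 60-second default polling interval follows the 1-minute default discussed in the design.

```python
import time
from typing import Callable


def monitor_node(
    node: str,
    status_ok: Callable[[str], bool],    # hypothetical: True once the expected status is reached
    reinitiate: Callable[[str], None],   # hypothetical: re-initiates the OS installation
    timeout_minutes: float,
    retries: int = 3,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if the node reaches the expected status, False otherwise."""
    attempts_left = retries
    while True:
        deadline = clock() + timeout_minutes * 60
        # Poll the node status until the timeout for this attempt expires.
        while clock() < deadline:
            if status_ok(node):
                return True
            sleep(poll_seconds)
        print(f"{node}: expected status not reached within {timeout_minutes} minutes")
        if attempts_left == 0:
            if retries > 0:
                print(f"{node}: OS installation failed after {retries} retries")
            return False
        # Retry: re-initiate the installation and restart the countdown.
        attempts_left -= 1
        reinitiate(node)
```

With `retries=0` the function only monitors: it reports the timeout once and returns without calling `reinitiate`.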
4. Some considerations for the monitoring

  1. Installation environment changes: If the installation hangs somewhere after the OS installation has started, for example in the postscripts, the installation environment on the management node or service node may already have changed. For example, the NIM status may have changed to some value other than "Ready", or the pxe or tftp settings may have been reset to the default to make the node boot from hard disk. It would be very difficult for the retry process to restore the initial environment settings on the management node or service node, so the monitoring process simply checks the node status and re-initiates the OS installation if necessary.

  2. When to start the monitoring process: The monitoring process can be started either as soon as the rpower/rnetboot command is dispatched, or after the hardware control command invoked by rpower or rnetboot, such as the HMC chsysstate or lparnetboot command, has returned. If the monitoring process is started as soon as the rpower/rnetboot command is dispatched, it can cover chsysstate or lparnetboot hangs. However, if chsysstate or lparnetboot hangs, a retry probably will not help, and starting the monitoring process earlier requires forking an additional process and brings in the monitoring workload earlier. So it is better to start the monitoring process after the hardware control command returns.

  3. Status polling interval: The monitoring process queries the node status periodically; the default polling interval can be 1 minute. We do not see an obvious need for users to change the polling interval, so it is not in the list of configurable parameters. If customers require a configurable polling interval, we may add it to the configurable parameters in the future.
5. Examples
Here are some examples of how to use OS installation monitoring and automatic retry:

  1. Use rnetboot to start the node stateless netboot and expect nodelist.status to change to "booted" within 10 minutes. If nodelist.status has not changed to "booted" after 10 minutes, re-initiate the netboot process. The retry limit is 2.
    rnetboot <noderange> -m -t 10 -r 2

  2. Use rpower to start the node stateful installation and expect the nodes' appstatus to change to "rmcd" within 60 minutes. If the appstatus has not changed to "rmcd" after 60 minutes, just print a message indicating that the nodes' appstatus failed to change to "rmcd"; do not re-initiate the OS installation process.
    rpower <noderange> -m "nodelist.appstatus==rmcd" -t 60 -r 0

  3. Use rpower to start the node stateless netboot and expect nodelist.appstatus to include "sshd" and nodelist.status to change to "booting" within 10 minutes. If these conditions are not met after 10 minutes, re-initiate the netboot process. The retry limit is 2.
    rpower <noderange> -m "nodelist.appstatus=~sshd" -m "nodelist.status==booting" -t 10 -r 2
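
To show how the flags in these examples combine, here is a hedged Python sketch that parses the same argument lists with the standard `argparse` module. It is only an illustration of the proposed flag semantics: the real rpower and rnetboot commands are Perl, accept other arguments not modeled here, and the helper names `build_parser` and `parse_monitor_args` are hypothetical.

```python
import argparse


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Model only the proposed -m, -t and -r flags (illustrative, not xCAT code)."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("noderange")
    # -m may be repeated; a bare -m falls back to the documented default.
    parser.add_argument("-m", dest="conditions", action="append", nargs="?",
                        const="nodelist.status==booted")
    parser.add_argument("-t", dest="timeout", type=int)             # minutes
    parser.add_argument("-r", dest="retries", type=int, default=3)  # 0 = monitor only
    return parser


def parse_monitor_args(prog: str, argv: list) -> argparse.Namespace:
    parser = build_parser(prog)
    args = parser.parse_args(argv)
    # The design requires -t whenever -m is specified.
    if args.conditions and args.timeout is None:
        parser.error("-t is required when -m is specified")
    return args
```

Parsing the arguments of the first example, `["<noderange>", "-m", "-t", "10", "-r", "2"]`, gives the default condition nodelist.status==booted with a 10 minute timeout and 2 retries; omitting -r leaves the retry count at its default of 3.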
