16. Appendix - H Recovering from Genesis Issues

16.1. Playbook “lxc-create.yml” fails to create lxc container.

  • Verify python virtual environment is activated by running which ansible-playbook. This should return the path */cluster-genesis/deployenv/bin/ansible-playbook. If something else is returned (including nothing) cd into the cluster-genesis directory and re-run source scripts/setup-env.

Verify network bridge named “br0” is up and connected to the management network. Verify that the bridge_ports interface has a carrier:

cat /sys/class/net/p1p1/carrier


Verify bridge br0 has an ip address assigned;

ip a show br0

16.2. Switch connectivity Issues:

  • Verify connectivity from deployer container to managment interfaces of both management and data switches. Be sure to use values assigned to the [ipaddr,userid,password]-[mgmt,data]-switch keys in the config.yml. These switches can be on any subnet except the one to be used for your cluster managment network, as long as they’re accessible to the deployer system.
  • Verify SNMP is enabled on both switches. Run show snmp-server on the switch command line interface (command could vary).
  • Verify SSH is enabled on the data switch and that the user can ssh directly from deployer to the switch using the ipaddr,userid, and password keys defined in the config.yml

16.3. Missing Hardware

Hardware can fail to show up for various reasons. Most of the time these are do to miscabling or mistakes in the config.yml file. The Node discovery process starts with discovery of mac addresses and DHCP hand out of ip addresses to the BMC ports of the cluster nodes. This process can be monitored by checking the DHCP lease table after booting the BMCs of the cluster nodes. During execution of the install.yml playbook, at the prompt;

“Please reset BMC interfaces to obtain DHCP leases. Press <enter> to continue”

After rebooting the BMCs and before pressing <enter>, you can execute from a second shell;

deployer@ubuntu-14-04-deployer:~$ cat /var/lib/misc/dnsmasq.leases

1471870835 a0:42:3f:30:61:cc * 01:a0:42:3f:30:61:cc

1471870832 70:e2:84:14:0a:10 * 01:70:e2:84:14:0a:10

1471870838 a0:42:3f:32:6f:3f * 01:a0:42:3f:32:6f:3f

1471870865 a0:42:3f:30:61:fe * 01:a0:42:3f:30:61:fe

To follow the progress you can execute;

deployer@ubuntu-14-04-deployer:~$ tail -f /var/lib/misc/dnsmasq.leases

You can also check what switch ports these mac addresses are connected to by logging into the management switch and executing;

RS G8052>show mac-address-table

  • MAC address VLAN Port Trnk State Permanent Openflow*
  • —————– ——– ——- —- —– ——— ——–*
  • 00:00:5e:00:01:99 1 48 FWD N *
  • 00:16:3e:53:ae:19 1 20 FWD N *
  • 0c:c4:7a:76:c8:ec 1 37 FWD N *
  • 40:f2:e9:23:82:be 1 11 FWD N *
  • 40:f2:e9:24:96:5e 1 1 FWD N *
  • 5c:f3:fc:31:05:f0 1 15 FWD N *
  • 5c:f3:fc:31:06:2a 1 18 FWD N *
  • 5c:f3:fc:31:06:2c 1 17 FWD N *
  • 5c:f3:fc:31:06:ec 1 13 FWD N *
  • 70:e2:84:14:02:92 1 3 FWD N *

For missing mac addresses, verify that port numbers in the above printout match the ports specified in the config.yml file. Mistakes can be corrected by correcting cabling, correcting the config.yml file and rebooting the BMCs.

Mistakes in the config.yml file require a restart of the install.yml playbook. Before restarting, make a backup of any existing inventory.yml files and then create an empty inventory.yml file.

mv inventory.yml inventory.yml.bak

> inventory.yml

Once all the BMC mac addresses have been given leases, press return in the genesis execution window.

16.4. Common Supermicro PXE bootdev Failure

Supermicro servers often fail to boot PXE devices on first try. In order to get the MAC addresses of the PXE ports our code sets the bootdev on all nodes to pxe and initiates a power on. Supermicro servers do ***not* reliably boot pxe (usually will instead choose one of the disks). This *will usually show up as a python key error in the “container/inv_add_pxe_ports.yml” playbook. The only remedy is to retry the PXE boot until it’s successful (usually **within* 2-3 tries). To retry use ipmitool from the deployer. The tricky part, however, is determining 1) which systems failed to PXE boot and 2) what the current BMC IP address is. **

To determine which systems have failed to boot, go through the following bullets in this section (starting with “Verify port lists...”)

To determine what the corresponding BMC addresss is view the inventory.yml file. At this point the BMC ipv4 and mac address will already be populated in the inventory.yml within the container. To find out:

ubuntu@bloom-deployer:~/cluster-genesis/playbooks$ grep “^deployer” hosts

deployer ansible_user=deployer ansible_ssh_private_key_file=/home/ubuntu/.ssh/id_rsa_ansible-generated ansible_host=

ubuntu@bloom-deployer:~/cluster-genesis/playbooks$ ssh -i /home/ubuntu/.ssh/id_rsa_ansible-generated deployer@

Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-42-generic x86_64)

Last login: Mon Aug 22 12:14:17 2016 from

deployer@ubuntu-14-04-deployer:~$ grep -e hostname -e ipmi cluster-genesis/inventory.yml

    • hostname: mgmtswitch1*
    • hostname: dataswitch1*
    • hostname: controller-1*
  • userid-ipmi: ADMIN*
  • password-ipmi: ADMIN*
  • port-ipmi: 29*
  • mac-ipmi: 0c:c4:7a:4d:88:26*
  • ipv4-ipmi:*
    • hostname: controller-2*
  • userid-ipmi: ADMIN*
  • password-ipmi: ADMIN*
  • port-ipmi: 27*
  • mac-ipmi: 0c:c4:7a:4d:87:30*
  • ipv4-ipmi:*


Verify port lists within cluster-genesis/config.yml are correct:





  • ports:*
  • ipmi:*
  • rack1:*
    • 9*
    • 11*
    • 13*
  • pxe:*
  • rack1:*
    • 10*
    • 12*
    • 14*
  • eth10:*
  • rack1:*
    • 5*
    • 7*
    • 3*
  • eth11:*
  • rack1:*
    • 6*
    • 8*
    • 4*


On the management switch;

RS G8052>show mac-address-table

in the mac address table, look for the missing pxe ports. Also note the mac address for the corresponding BMC port. Use ipmitool to reboot the nodes which have not pxe booted succesfully.

16.5. Stopping and resuming progress

In general, to resume progress after a play stops on error (presumably after the error has been understood and corrected!) the failed playbook should be re-run and subsequent plays run as normal. In the case of “cluster-genesis/playbooks/install.yml” around 20 playbooks are included. If one of these playbooks fail then edit “cluster-genesis/playbooks/install.yml” and comment plays that have passed by writing a “#” at the front of the line. Be sure not to comment out the playbook that failed so that it will re-run. Here’s an example of a modified “cluster-genesis/playbooks/install.yml” where the user wishes to resume after a data switch connectivity problem caused the “container/set_data_switch_config.yml” playbook to fail:

  • 1 —*
  • 2 # Copyright 2016, IBM US, Inc.*
  • 3 *

~ 4 #- include: lxc-update.yml

~ 5 #- include: container/cobbler/cobbler_install.yml

~ 6 #- include: pause.yml message=”Please reset BMC interfaces to obtain DHCP leases. Press <enter> to continue”

  • 7 - include: container/set_data_switch_config.yml log_level=info*
  • 8 - include: container/inv_add_switches.yml log_level=info*
  • 9 - include: container/inv_add_ipmi_ports.yml log_level=info*
  • 10 - include: container/ipmi_set_bootdev.yml log_level=info
    bootdev=network persistent=False*
  • 11 - include: container/ipmi_power_on.yml log_level=info*
  • 12 - include: pause.yml minutes=5 message=”Power-on Nodes”*
  • 13 - include: container/inv_add_ipmi_data.yml log_level=info*
  • 14 - include: container/inv_add_pxe_ports.yml log_level=info*
  • 15 - include: container/ipmi_power_off.yml log_level=info*
  • 16 - include: container/inv_modify_ipv4.yml log_level=info*
  • 17 - include: container/cobbler/cobbler_add_distros.yml*
  • 18 - include: container/cobbler/cobbler_add_profiles.yml*
  • 19 - include: container/cobbler/cobbler_add_systems.yml*
  • 20 - include: container/inv_add_config_file.yml*
  • 21 - include: container/allocate_ip_addresses.yml*
  • 22 - include: container/get_inv_file.yml dest=/var/oprc*
  • 23 - include: container/ipmi_set_bootdev.yml log_level=info
    bootdev=network persistent=False*
  • 24 - include: container/ipmi_power_on.yml log_level=info*
  • 25 - include: pause.yml minutes=5 message=”Power-on Nodes”*
  • 26 - include: container/ipmi_set_bootdev.yml log_level=info
    bootdev=default persistent=True*

16.6. Recovering from Wrong IPMI userid and /or password

If the userid or password for the ipmi ports are wrong, genesis will fail. To fix this, first correct the userid and or password in the config.yml file (~/cluster-genesis/config.yml in both the host OS and the container). Also correct the userid and or password in the container at ~/cluster-genesis/inventory.yml. Then modify the ~/cluster-genesis/playbooks/install.yml file, commenting out the playbooks shown below. Then rerstart genesis from step 15(rerun the install playbook)

# Copyright 2016 IBM Corp.


# All Rights Reserved.


# Licensed under the Apache License, Version 2.0 (the “License”);

# you may not use this file except in compliance with the License.

# You may obtain a copy of the License at


# http://www.apache.org/licenses/LICENSE-2.0


# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an “AS IS” BASIS,


# See the License for the specific language governing permissions and

# limitations under the License.

#- include: lxc-update.yml

#- include: container/cobbler/cobbler_install.yml

- include: pause.yml message=”Please reset BMC interfaces to obtain DHCP leases”

#- include: container/set_data_switch_config.yml

#- include: container/inv_add_switches.yml

#- include: container/inv_add_ipmi_ports.yml

- include: container/ipmi_set_bootdev.yml bootdev=network persistent=False

- include: container/ipmi_power_on.yml

- include: pause.yml minutes=20 message=”Power-on Nodes”

- include: container/inv_add_ipmi_data.yml

- include: container/inv_add_pxe_ports.yml

- include: container/ipmi_power_off.yml

- include: container/inv_modify_ipv4.yml

- include: container/cobbler/cobbler_add_distros.yml

- include: container/cobbler/cobbler_add_profiles.yml

- include: container/cobbler/cobbler_add_systems.yml

- include: container/inv_add_config_file.yml

- include: container/allocate_ip_addresses.yml

- include: container/get_inv_file.yml dest=/var/oprc

- include: container/ipmi_set_bootdev.yml bootdev=network persistent=False

- include: container/ipmi_power_on.yml

- include: pause.yml minutes=5 message=”Power-on Nodes”

- include: container/ipmi_set_bootdev.yml bootdev=default persistent=True

16.7. Recreating the Genesis Container

To destroy the Genesis container and restart Genesis from that point;

sudo lxc-ls –fancy

sudo lxc-stop **-n deployer-container-name

sudo lxc-destroy -n deployer-container-name

Restart genesis from step 15 of the step by step instructions. **`5.1 <#anchor-7>`__** **`Installing and Running the Genesis code. Step by Step Instructions <#anchor-7>`__

NOTE: if you have exited the shell session from which you previously created the container, be sure to execute the following setup scripts;

source ~/cluster-genesis/scripts/setup-env


After recreating the container, you will need to remove the old key from the known_hosts file in order to be able to ssh into the recreated container;

ssh-keygen -f “/home/ubuntu/.ssh/known_hosts” -R

16.8. Reinstalling Genesis

Before reinstalling genesis, stop and destroy the deployer container;

sudo lxc-ls –fancy

sudo lxc-stop **-n deployer-container-name

sudo lxc-destroy -n deployer-container-name

Then remove the cluster-genesis directory. Follow instructions of section 5 Running the OpenPOWER Cluster Configuration Software

16.9. OpenPOWER Node issues

Specifying the target drive for operating system install;

In the config.yml file, the os-disk key is the disk to which the operating system will be installed. Specifying this disk is not always obvious because Linux naming is insconsistent between boot and final OS install. For OpenPOWER S812LC, the two drives in the rear of the unit are typically used for OS install. These drives should normally be specified as /dev/sdj and /dev/sdk

PXE boot: OpenPOWER nodes need to have the Ethernet port used for PXE booting enabled for DHCP in petitboot.

Be sure to specify a disk configured for boot as the bootOS drive in the config.yml file.

When using IPMI, be sure to specify the right user id and password. IPMI will generate an “unable to initiate IPMI session errors” if the password is not correct.

ipmitool -I lanplus -H 192.168.x.y -U ADMIN -P ADMIN chassis power off
ipmitool -I lanplus -H 192.168.x.y -U ADMIN -P ADMIN chassis bootdev pxe
ipmitool -I lanplus -H 192.168.x.y -U ADMIN -P ADMIN chassis power on

ipmitool -I lanplus -H 192.168.x.y -U ADMIN -P ADMIN chassis power status

To monitor the boot window using the serial over lan capability;

ipmitool -H -I lanplus -U ADMIN -P admin sol activate

Be sure to use the correct password.

You can press Ctrl-D during petit boot to bring up a terminal.

To exit the sol window, enter “~.” enter (no quotes)