Ticket #375 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

Server-based storage intermittent failure

Reported by: ibaldin
Owned by: pruth
Priority: major
Milestone:
Component: External: NEuca-Py VM tools
Version: baseline
Keywords:
Cc: vjo, pruth, jonmills, anirban

Description

This affects TAMU and newer racks with server-based storage.

NEuca tools in the guest try to list available targets and pick the first one from the list. On IBM units the ACLs prevent wrong targets from being listed. On server storage we currently don't apply ACLs and multiple targets are listed, with the right one not necessarily being the first one. This results in intermittent failure of storage at TAMU.

There are multiple possible solutions:

  1. Fix ACLs in storage server (assuming tgtd allows)
  2. Fix controller to provide more information to neuca tools to pick the right target (and fix neuca tools accordingly)
  3. Fix neuca tools to try targets in order until success

Victor is investigating option 1, as it is the easiest to implement.

Change History

Changed 5 years ago by ibaldin

Question from the audience: What happens when multiple storage units are attached to the same node?

Changed 5 years ago by vjo

I think solution (3) also needs to be implemented, in part.

The current neuca guest tools only pick the first target returned from discovery for mounting.
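
A minimal sketch of what "try targets in order until success" could look like in the guest tools. This is not the actual NEuca-Py code: the function names and the `try_attach` callback are hypothetical, and the parser assumes discovery output in the usual open-iscsi `portal,tpgt iqn` line format.

```python
from typing import Callable, List, Optional, Tuple

Target = Tuple[str, str]  # (portal, iqn)


def parse_discovery(output: str) -> List[Target]:
    """Parse discovery output lines of the form 'ip:port,tpgt iqn'.

    Assumption: one target per line, as printed by open-iscsi's
    sendtargets discovery. Lines that do not match are skipped.
    """
    targets = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        portal = parts[0].split(",")[0]  # drop the portal group tag
        targets.append((portal, parts[1]))
    return targets


def attach_first_working(
    targets: List[Target],
    try_attach: Callable[[str, str], bool],
) -> Optional[Target]:
    """Try each target in discovery order; return the first that attaches.

    `try_attach` is a hypothetical callback that logs in to the target
    and reports whether the attach succeeded. Returns None if every
    target fails.
    """
    for portal, iqn in targets:
        if try_attach(portal, iqn):
            return (portal, iqn)
    return None
```

Note that this only helps if a wrong target actually fails to attach (for example, because an ACL rejects the login). Without ACLs on the storage server, the first target may attach successfully and still be the wrong one.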

Changed 5 years ago by ibaldin

  • owner changed from jonmills to pruth
  • component changed from Infrastructure: ExoGENI Racks OS to External: NEuca-Py VM tools

Changed 5 years ago by ibaldin

Tested on RCI with Paul: two storage volumes attached to one node mostly worked. Both got mounted, but in the opposite order of the desired mount points. There could be issues with the controller setting dependencies, or with the neuca INI file generator creating the INI data (where LUNs were flipped in the storage definition).

Changed 5 years ago by vjo

OK - tested tgtd from CentOS 7 with WSU.
Bad news: initiator-name does not work for ACLs. We will need to use the VM's storage IP for the ACL.

Furthermore, controller is expecting to assign a LUN number to the target LUN; storage_service handler is not honoring this.

So - modifications to storage_service handler now required:
1) Pass in initiator IQN - intended for ACLs, but may be unused by script used to allocate target
2) Pass in initiator storage IP address - intended for ACLs, but may be unused by script used to allocate target
3) Pass in LUN number that is expected for target LUN
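
For illustration, a rough sketch of how a handler script might turn the LUN number and storage IP into tgtadm invocations. The function name and parameters are hypothetical and this is not the storage_service handler itself; it only builds the argument lists and does not run them. The initiator IQN is omitted because initiator-name ACLs did not work in the test above.

```python
from typing import List


def tgtadm_lun_and_acl(
    tid: int, lun: int, backing_store: str, initiator_ip: str
) -> List[List[str]]:
    """Build tgtadm commands to add a LUN and bind an IP-based ACL.

    Hypothetical helper: returns argument lists suitable for
    subprocess.run(), without executing anything.
    """
    # Assumption: tgtd keeps LUN 0 for its controller device, so
    # data LUNs start at 1.
    if lun < 1:
        raise ValueError("lun must be >= 1")
    return [
        # Create the LUN with the number the controller expects.
        ["tgtadm", "--lld", "iscsi", "--mode", "logicalunit", "--op", "new",
         "--tid", str(tid), "--lun", str(lun),
         "--backing-store", backing_store],
        # Restrict access to the VM's storage IP address.
        ["tgtadm", "--lld", "iscsi", "--mode", "target", "--op", "bind",
         "--tid", str(tid), "--initiator-address", initiator_ip],
    ]
```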

Changed 5 years ago by vjo

OK - question...
Does the controller *currently* pass to the storage handlers, as a property, the storage IP addresses of the VMs/bare metal nodes that will be accessing it?

I need that information, to be able to successfully do ACLs on tgtd.

Changed 5 years ago by ibaldin

The new version of the controller will provide the necessary IP address information; however, we need to reconsider how storage is done in general and who does the allocation of IP addresses and LUN numbers. See ticket #378.

Changed 4 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed

This appears fixed now in r7120 and beyond, except for the neuca tools issue, which is now #400.
