Ticket #377 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

FIU & OSF ORCA 5 upgrade

Reported by: jonmills Owned by: jonmills
Priority: major Milestone:
Component: Don't Know Version: baseline
Keywords: Cc: vjo, jonmills, ckh, anirban, yxin, ibaldin

Description (last modified by jonmills)

FIU & OSF have been updated to ORCA 5.0.

TODO:

* update xCAT stateless image at FIU (OSF is already done)
* get correct RDF in place at both sites
* fix quantum network names on worker nodes

Change History

  Changed 5 years ago by jonmills

  • description modified

  Changed 5 years ago by jonmills

Quantum network names are now fixed on both FIU and OSF.

follow-up: ↓ 4   Changed 5 years ago by jonmills

OSF & Starlight are unique in being the only racks that have a dual-port Mellanox 40G card and a dual-port Chelsio card. (TAMU has a dual-port Mellanox and a single-port 10G card.)

What this means is that OSF & Starlight each have three active connections to the dataplane: one 40G & two 10G connections.

Coincidentally, we also have three quantum network labels to work with. I've set up OSF baremetal nodes with quantum network labels as follows:

of-data: Chelsio 10G port 2 (p3p6)
vlan-storage: Chelsio 10G port 1 (p3p5)
vlan-data: Mellanox 40G port 1 (p2p1)

This is the best way to achieve performance isolation, because the vlan-storage traffic won't eat into the vlan-data traffic. In a theoretical, optimized world, we could hit the full 40G of a provisioned vlan-data network (for example, connecting to Starlight from OSF by VLAN).
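
For quick reference, the same mapping can be written down as plain data. This is only an illustrative Python sketch; the dict name, structure, and speed field are mine, not from any ORCA configuration - only the label-to-interface assignment comes from the comment above.

# Illustrative only: quantum network label -> physical NIC on OSF baremetal nodes.
OSF_BAREMETAL_NETWORKS = {
    "of-data":      {"nic": "p3p6", "card": "Chelsio 10G",  "speed_gbps": 10},
    "vlan-storage": {"nic": "p3p5", "card": "Chelsio 10G",  "speed_gbps": 10},
    "vlan-data":    {"nic": "p2p1", "card": "Mellanox 40G", "speed_gbps": 40},
}

# Performance isolation: storage and data traffic land on different physical ports,
# so vlan-storage cannot eat into the bandwidth of a provisioned vlan-data path.
assert OSF_BAREMETAL_NETWORKS["vlan-storage"]["nic"] != OSF_BAREMETAL_NETWORKS["vlan-data"]["nic"]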

in reply to: ↑ 3   Changed 5 years ago by jonmills

Replying to jonmills:

OSF & Starlight are unique in being the only racks that have a dual-port Mellanox 40G card and a dual-port Chelsio card. (TAMU has a dual-port Mellanox and a single-port 10G card.)

What this means is that OSF & Starlight each have three active connections to the dataplane: one 40G & two 10G connections.

Coincidentally, we also have three quantum network labels to work with. I've set up OSF baremetal nodes with quantum network labels as follows:

of-data: Chelsio 10G port 2 (p3p6)
vlan-storage: Chelsio 10G port 1 (p3p5)
vlan-data: Mellanox 40G port 1 (p2p1)

This is the best way to achieve performance isolation, because the vlan-storage traffic won't eat into the vlan-data traffic. In a theoretical, optimized world, we could hit the full 40G of a provisioned vlan-data network (for example, connecting to Starlight from OSF by VLAN).

And I should clarify this statement by adding that I'm only talking about baremetal here, not OpenStack worker nodes. There are zero OpenStack workers with 40G interfaces.

  Changed 5 years ago by jonmills

OSF dataplane ports of interest:

Port 38: vlan side of the patch cable for hybrid mode
Port 37: openflow side of the patch cable for hybrid mode

Port 9: 40G port in vlan mode, interface p2p1 on baremetal worker osf-w9 (vlan-data)
Port 13: 40G port in vlan mode, interface p2p1 on baremetal worker osf-w10 (vlan-data)

Port 25: 10G vlan-storage port, interface p3p5 on osf-w9
Port 26: 10G vlan-storage port, interface p3p5 on osf-w10

OpenFlow ports are 37, 41-60

  Changed 5 years ago by jonmills

  • cc yxin, ibaldin added

  Changed 5 years ago by jonmills

Baremetal at FIU should be happy now.

follow-up: ↓ 9   Changed 5 years ago by yxin

1. I am still a little unclear, especially about ports 37 & 38, so please confirm for OSF:
baremetal: vlan-data (9, 13); of-data (37, 38)
workers: vlan-data (17-24); of-data (41-50)

2. FIU?

in reply to: ↑ 8   Changed 5 years ago by jonmills

Replying to yxin:

1. I am still a little unclear, especially about ports 37 & 38, so please confirm for OSF:
baremetal: vlan-data (9, 13); of-data (37, 38)
workers: vlan-data (17-24); of-data (41-50)

Ports 37 and 38 are just the patch cable bridging the VLAN half of the switch with the OpenFlow half. Port 37 goes into the OF side.

OSF baremetal vlan-data (9, 13)
OSF baremetal vlan-storage (25, 26)
OSF baremetal of-data (49, 50)
OSF workers vlan-data,vlan-storage (17-24)
OSF workers of-data (41-48)


2. FIU?

FIU baremetal vlan-data,vlan-storage (25, 26)
FIU baremetal of-data (49, 50)
FIU workers vlan-data,vlan-storage (17-24)
FIU workers of-data (41-48)
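
To keep the two racks straight, here is the same port assignment as an illustrative Python snippet. The structure and names are mine; only the port numbers come from the comments above.

# Illustrative summary of the dataplane switch port assignments (hypothetical structure).
PORT_MAP = {
    "OSF": {
        "baremetal": {"vlan-data": [9, 13], "vlan-storage": [25, 26], "of-data": [49, 50]},
        "workers": {"vlan-data+vlan-storage": list(range(17, 25)), "of-data": list(range(41, 49))},
        "hybrid-patch": {"vlan-side": 38, "openflow-side": 37},  # patch cable, not in the RDF topology
    },
    "FIU": {
        "baremetal": {"vlan-data+vlan-storage": [25, 26], "of-data": [49, 50]},
        "workers": {"vlan-data+vlan-storage": list(range(17, 25)), "of-data": list(range(41, 49))},
    },
}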

  Changed 5 years ago by ibaldin

Ports 37,38 for now don't show up in the RDF topology.

  Changed 5 years ago by yxin

Updated, please test.

  Changed 5 years ago by ibaldin

RDF update on r6946

  Changed 5 years ago by yxin

Ilya found an error in the FIU RDF; I just checked in the fix. Please update with the new RDF.

  Changed 5 years ago by ibaldin

BBN has been upgraded to the latest tag. RDF files are up to date, and it has been clean-restarted.

  Changed 5 years ago by ibaldin

oops, wrong ticket

  Changed 5 years ago by ibaldin

  • cc ttoll removed

FIU and OSF have been upgraded to tag 6952. Both need testing before being added to ExoLayer. I'm hoping Anirban can do it next week?

  Changed 5 years ago by ibaldin

It should be straightforward to test our regular scenarios (storage should work, since these are 'old style' racks) and the limited OpenFlow testing that we do; if all is well, declare them open and claim them on ExoSM.

  Changed 5 years ago by anirban

Simple VLANs are failing at FIU; dumbbells are not working. The VLAN reservation fails with the following error. Subsequent VLAN reservations are stuck in ticketed for 5 minutes and then fail with the same error. This is probably a switch configuration/state issue.

Reservation d323f3df-9e94-4c5e-b621-6b07a5c97cfb (Slice n-ng-1) is in state [Failed,None]

Last lease update: all units failed priming: Exception during join for unit: 62F21173 The following error occurred while executing this line:
/etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:342: /etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:420: An error occurred: Unable to update configuration of netconf device 192.168.105.4 due to: net.juniper.netconf.LoadException: Load operation returned error:

<rpc-reply message-id="101">
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>partial-operation</error-tag>
    <error-severity>error</error-severity>
    <error-message>stop-on-error</error-message>
    <error-info>
      <err-element>switchport mode trunk</err-element>
    </error-info>
  </rpc-error>
</rpc-reply>
]]>]]>

INFO   | jvm 1    | 2014/10/17 20:41:39 |      [echo] Quantum VLAN Handler: JOIN on 10/17/2014 08:41
INFO   | jvm 1    | 2014/10/17 20:41:39 |      [echo] Performing native vlan provisioning on 192.168.105.4 of type g8264
INFO   | jvm 1    | 2014/10/17 20:41:39 |      [echo] Starting atomic sequence for 192.168.105.4
INFO   | jvm 1    | 2014/10/17 20:41:39 |      [echo] enabling vlan 2 on router 192.168.105.4 bw=0 burst=0
INFO   | jvm 1    | 2014/10/17 20:41:39 |      [echo] router.user: noradius
INFO   | jvm 1    | 2014/10/17 20:41:41 |      [echo] vlan 2 created successfully on router 192.168.105.4
INFO   | jvm 1    | 2014/10/17 20:41:41 |      [echo] adding ports 41-50 to vlan 2 on router 192.168.105.4
INFO   | jvm 1    | 2014/10/17 20:41:51 |
INFO   | jvm 1    | 2014/10/17 20:41:51 | BUILD FAILED
INFO   | jvm 1    | 2014/10/17 20:41:51 | /etc/orca/am+broker-12080/handlers/providers/quantum-vlan/handler.xml:259: The following error occurred while executing this line:
INFO   | jvm 1    | 2014/10/17 20:41:51 | /etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:342: /etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:420: An error occurred: Unable to update configuration of netconf device 192.168.105.4 due to: net.juniper.netconf.LoadException: Load operation returned error:
INFO   | jvm 1    | 2014/10/17 20:41:51 | <rpc-reply message-id="101">
INFO   | jvm 1    | 2014/10/17 20:41:51 |   <rpc-error>
INFO   | jvm 1    | 2014/10/17 20:41:51 |     <error-type>application</error-type>
INFO   | jvm 1    | 2014/10/17 20:41:51 |     <error-tag>partial-operation</error-tag>
INFO   | jvm 1    | 2014/10/17 20:41:51 |     <error-severity>error</error-severity>
INFO   | jvm 1    | 2014/10/17 20:41:51 |     <error-message>stop-on-error</error-message>
INFO   | jvm 1    | 2014/10/17 20:41:51 |     <error-info>
INFO   | jvm 1    | 2014/10/17 20:41:51 |       <err-element>switchport mode trunk</err-element>
INFO   | jvm 1    | 2014/10/17 20:41:51 |     </error-info>
INFO   | jvm 1    | 2014/10/17 20:41:51 |   </rpc-error>
INFO   | jvm 1    | 2014/10/17 20:41:51 | </rpc-reply>
INFO   | jvm 1    | 2014/10/17 20:41:51 | ]]>]]>
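
For anyone decoding the raw reply, here is a minimal, hypothetical Python sketch (not part of the ORCA handler) that pulls the interesting fields out of the <rpc-reply> above:

# Hypothetical helper: extract the error tag and offending element from the netconf reply.
import xml.etree.ElementTree as ET

RPC_REPLY = """
<rpc-reply message-id="101">
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>partial-operation</error-tag>
    <error-severity>error</error-severity>
    <error-message>stop-on-error</error-message>
    <error-info>
      <err-element>switchport mode trunk</err-element>
    </error-info>
  </rpc-error>
</rpc-reply>
"""

root = ET.fromstring(RPC_REPLY)
for err in root.findall("rpc-error"):
    tag = err.findtext("error-tag")
    element = err.findtext("error-info/err-element")
    print(f"netconf error '{tag}' while applying: {element}")
    # -> netconf error 'partial-operation' while applying: switchport mode trunk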

  Changed 5 years ago by ibaldin

This is an issue with the dataplane switch not being configured properly for hybrid mode.

  Changed 5 years ago by anirban

At OSF, a dumbbell slice comes up fine, but it can't pass packets because the dataplane interfaces don't show up on the VMs. It is likely an OpenStack/neuca issue; please take a look.

Waiting on resolution of the dataplane switch configuration at FIU to resume testing on that rack.

  Changed 5 years ago by ibaldin

Please bring FIU to Chris's attention. OSF is likely a neuca plugin problem at the worker nodes. See the release notes at https://geni-orca.renci.org/trac/wiki/releases/Eastsound-5.0 for where the network names should be specified.

I noticed the RDF files at both OSF and FIU are not the same as the latest in the repo. The easiest way to check is to run md5sum on the version in the repo and on the rack.
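
A small, hypothetical Python equivalent of that md5sum check (the file paths below are only examples, not the actual RDF locations):

# Hypothetical sketch: compare the checksum of the repo copy and the on-rack copy of an RDF file.
import hashlib

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

repo_copy = "ndl/substrate/fiuNet.rdf"                 # example path in a repo checkout
rack_copy = "/etc/orca/am+broker-12080/ndl/fiuNet.rdf" # example path on the rack
print("match" if md5sum(repo_copy) == md5sum(rack_copy) else "RDF differs - update the rack copy")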

  Changed 5 years ago by anirban

I did some testing on the FIU rack. I tested a few slices in the coverage tests - node, nodegroup, storage, complex topologies, modify, extend and recovery. The issues observed were the following.

1. Confirmation of what was reported in #380 - after recovery, VM reservations expire before the new expiry time. #380 has been updated with a description of the scenario.

2. For the first time, I observed that for a slice with a node connected to a nodegroup, when the nodegroup size is increased after recovery, the new nodes don't show up in the Flukes query for the manifest. The new nodes actually come up fine (as observed in the handler output and by pinging the new nodes' IP addresses). So the reservations, substrates, etc. work, but the manifest is still the old manifest, which shows only some of the nodes. For another test with exactly the same scenario, the new nodes show up in the manifest and come up fine. So I don't know whether this is a one-off aberration; I don't like the non-deterministic behavior. I have saved the manifest for the problematic case and will be emailing it to yxin.

Barring the above two issues, I don't see any problems in the intra-rack testing for FIU.

Moving on to OSF rack testing later tonight.

  Changed 5 years ago by vjo

OK.

When you're done w/ OSF, please let me know if you believe that FIU and OSF are suitable for any tutorials/demos at GEC21.

If they are, I will open the whitelists on the rack local controllers, and claim them at ndl-broker.

  Changed 5 years ago by anirban

vjo, if the tutorials and demos are not going to exercise extend or modify scenarios, we should be good with FIU. The normal things are working fine. The two issues are mostly with corner cases of recovery. If we are not recovering during the tutorials, I would be quite confident about opening up FIU. My judgement call is leaning toward opening it up.

Will report on OSF soon..

  Changed 5 years ago by vjo

OK - will wait for your report on OSF.

I do not expect tutorials to make use of extend or modify.

Will Paul need extend or modify for the demos?

  Changed 5 years ago by vjo

Anirban,

Any problem w/ using extend, if recovery does not take place?

Have spoken w/ Niky; they plan on showing how extend can be used in the tutorial.
The extend issue occurs *only* when an actor has to be restarted w/ recovery (i.e., the new end time is not honored) - correct?

If so - that should not cause a problem for tutorials.

  Changed 5 years ago by anirban

vjo,

If recovery does not take place, extend seems to be working. There is one very minor kink, which I noticed just now. When a slice is allowed to expire on its own, the slivers go away at expiry time, i.e. all the handler leave actions proceed normally. But a manifest query shows the VLAN reservation in the active state (VM reservations show closed) for a few minutes. After a few minutes, the slice indeed disappears from the system. I think this is all OK, because the slice is unusable at expiry and eventually everything disappears.

I also tested renew with omni just now, and it seems to do the right thing. The slice and slivers transition to the "unknown" state on expiry. The slice is eventually garbage collected after a few minutes.

So, I feel confident about extend without recovery. Even with recovery, it breaks or not depending on the timing of the recovery.

- Anirban

  Changed 5 years ago by anirban

I ran tests at the OSF rack. Things seem to be working fine - node, nodegroup, multiple storage, complex topology, extend, renew with omni, modify, recovery. I could not reproduce the modify-on-recovery issue that I encountered while testing at FIU earlier today, where modify after recovery was not showing the added nodes in the manifest even though the new slivers were being brought up. This did not happen again after that one time on the FIU rack.

I feel confident about opening up the OSF rack, with the caveat on extend with recovery.

  Changed 5 years ago by vjo

OK - thanks!

Caveat noted; opening up FIU and OSF, and claiming at ndl-broker.

  Changed 5 years ago by vjo

FIU and OSF open; FIU released to GPO for GEC tutorials.

  Changed 5 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed