Ticket #316 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

Maintenance window 3/6/2014 - closure of several tickets

Reported by: vjo
Owned by: yxin
Priority: blocker
Milestone:
Component: External: Testing and Redeployment
Version: baseline
Keywords:
Cc: ibaldin, yxin, anirban, pruth, ckh

Description (last modified by vjo)

Maintenance is touching RCI, BBN, FIU, UFL, SL.
Emergency maintenance is occurring against ExoSM as well.

ExoSM requires a controller re-deploy, due to the exit of SliceDeferThread (NPE in run()).

Tickets addressed (in whole or in part) by this maintenance:
#300
#301
#313
#315

Change History

Changed 5 years ago by vjo

Affected sites/actors have been placed into maintenance.
Proceeding with addressing the racks first, while Ilya attacks the controller issue.

Changed 5 years ago by vjo

All reservations closed; waiting for confirmation of closure, then will un-delegate resources.

Changed 5 years ago by vjo

All reservations confirmed closed.
One VM reservation was leaked from an old, closed slice on ExoSM; this was manually closed at the AM at BBN.

All resources undelegated for RCI, BBN, UFL, FIU, SL.

Changed 5 years ago by vjo

RDF updated at UFL, FIU.
Config modified at UFL, already done at FIU.
Addressing tickets #300 and #301.

Changed 5 years ago by vjo

Yufeng-

Is the NDL ready for SL and RCI, to address tickets #313 and #315?
Also - how do you want me to arrange the resource delegations for BBN, to address #301?
I am not addressing #301 for RCI, since we have determined it to not be necessary.

Changed 5 years ago by vjo

Noting per Ilya's request:
A race condition was discovered between a slice being in the defer queue and having been closed. This race made it possible for demandSlices() to hit an NPE, thereby terminating the run() method of SliceDeferThread.
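
For context, a minimal sketch of the kind of guard that closes this race; this is not ORCA's actual SliceDeferThread code, and the queue and slice types below are hypothetical. The idea is to re-check that a dequeued slice is still live before demanding it, and to catch per-slice errors so one bad entry cannot terminate run().

    // Illustrative sketch only -- not ORCA's SliceDeferThread implementation.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class DeferredSliceWorker implements Runnable {
        // hypothetical minimal slice type, just for the sketch
        interface DeferredSlice {
            boolean isClosed();
            void demand();
            String getId();
        }

        private final BlockingQueue<DeferredSlice> deferQueue =
                new LinkedBlockingQueue<DeferredSlice>();

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                DeferredSlice slice;
                try {
                    slice = deferQueue.poll(1, TimeUnit.SECONDS);
                } catch (InterruptedException e) {
                    return; // asked to shut down
                }
                if (slice == null) {
                    continue; // nothing deferred right now
                }
                try {
                    // The race: the slice may have been closed between enqueue
                    // and dequeue, leaving internal state null. Check first.
                    if (slice.isClosed()) {
                        continue;
                    }
                    slice.demand();
                } catch (RuntimeException e) {
                    // Log and keep going rather than letting an NPE kill run().
                    System.err.println("error demanding deferred slice "
                            + slice.getId() + ": " + e);
                }
            }
        }
    }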

Changed 5 years ago by vjo

Noting per Ilya's request:
ibaldin -> ckh:
Please have I2/NOC look at two circuit requests in ION: ion.internet2.edu-73771 failed, but a similar one, ion.internet2.edu-73761, succeeded (it was cancelled after the slice was verified). Why? They were placed within minutes of each other.

ckh -> ibaldin:
The NOC identified VLAN 203 as active in OESS, but we're not using it. They need to correct it, which will require an interruption in service, so it may not happen immediately. Additionally, I went through the ION and AL2S cancellations, and similar VLANs include 118, 119, 208, 232, 233, 234, 235, and 236. I've asked them to see if those are still active in OESS and, if so, to remove them.

Changed 5 years ago by vjo

Controller fix checked in as r6209. Re-building.

Changed 5 years ago by vjo

Since ExoSM is down, and all slices using it are down - updating ion.rdf and restarting geni2, per Jonathan's request.

Changed 5 years ago by vjo

UFL, RCI updated with RDF from r6010.
Does anything further need to be done to complete RCI against #313?

Changed 5 years ago by vjo

Correction: RDF from r6210.

Changed 5 years ago by vjo

  • description modified

Changed 5 years ago by vjo

RPM built against r6210.
Awaiting:
- Answers re: BBN for #301 (VLAN allocation)
- Answer re: RCI for #313 (anything further required, besides RDF change)
- RDF for SL for #315

Will install RPM, and restart ExoSM.

Changed 5 years ago by vjo

Rebuilding RPM against r6211, which names the SliceDeferThread (for ease of future debugging).
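
For reference, naming the thread is a one-line change, and the payoff is that the thread appears under that name in jstack output and thread dumps instead of a generic "Thread-N". A sketch only; the actual r6211 change may differ, and the worker Runnable here is a placeholder.

    // Illustrative only: either form gives the thread a readable name.
    public class ThreadNamingExample {
        public static void main(String[] args) {
            Runnable sliceDeferRunnable = new Runnable() {   // placeholder worker
                public void run() { /* deferred-slice loop would go here */ }
            };
            // Naming at construction time:
            Thread deferThread = new Thread(sliceDeferRunnable, "SliceDeferThread");
            deferThread.setDaemon(true);
            deferThread.start();
            // or later, on an existing Thread object:
            deferThread.setName("SliceDeferThread");
        }
    }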

Changed 5 years ago by vjo

Updating SL with RDF from r6212, to address #315.

Changed 5 years ago by vjo

RPM rebuilt against r6211, and installed on ExoSM.
The SliceDeferThread named thread has not yet appeared; it may not yet have been started.
Will re-verify during testing.

Changed 5 years ago by ibaldin

Thread likely gets started whenever the controller XMLRPC gets poked for the first time.
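
That would be consistent with a lazy-start pattern: the worker thread is created on the first incoming XMLRPC request rather than at controller startup, so it will not show up in thread dumps until the controller has been poked at least once. A hedged sketch of the pattern; the class and method names are illustrative, not the controller's actual entry points.

    public class ControllerEndpoint {
        private Thread sliceDeferThread;

        private synchronized void ensureDeferThreadStarted() {
            if (sliceDeferThread == null || !sliceDeferThread.isAlive()) {
                sliceDeferThread = new Thread(new Runnable() {
                    public void run() {
                        // ... deferred-slice worker loop ...
                    }
                }, "SliceDeferThread");
                sliceDeferThread.setDaemon(true);
                sliceDeferThread.start();
            }
        }

        // hypothetical XMLRPC entry point; any handler would call the same guard
        public Object handleRequest(Object request) {
            ensureDeferThreadStarted();
            // ... process the request ...
            return null;
        }
    }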

Changed 5 years ago by vjo

The SliceDeferThread named thread was started by the UvA-NL automatic slice checker doing a check-in.
r6211 verified.

Changed 5 years ago by vjo

Current status:
- Awaiting answers re: BBN and RCI.
- Need to restart/re-claim: RCI, BBN, UFL, FIU, SL, ION/NLR.
- Need to test to validate issues in named tickets are resolved.

Changed 5 years ago by yxin

1. #301: bbnNet: please delegate 10 to ndl-broker, 106 to the rack controller to support stitching. In config.xml, for the bbnNet.vlan control, please use the NDLInterfaceVlanControl (see the config sketch below).

2. #313: No further change to RCI, except for the RDF.

3. #315: For slNet.vlan, please delegate 9 to ndl-broker, 2 to the rack broker. Please make sure to use the NDLInterfaceVlanControl policy.
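
A rough sketch of what items 1 and 3 imply for config.xml. The element names and nesting below are illustrative, not the exact ORCA config.xml schema; only the control class name is grounded in the stack trace quoted later in this ticket.

    <!-- Illustrative fragment only; element layout is a sketch, not the
         exact ORCA config.xml schema. -->
    <controls>
      <!-- #301: bbnNet.vlan handled by the NDL interface VLAN control -->
      <control type="bbnNet.vlan"
               class="orca.plugins.ben.control.NdlInterfaceVLANControl" />
      <!-- #315: same control class for slNet.vlan -->
      <control type="slNet.vlan"
               class="orca.plugins.ben.control.NdlInterfaceVLANControl" />
    </controls>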

Changed 5 years ago by vjo

UCD RDF has been updated for #313.

Changed 5 years ago by vjo

slNet.vlan currently claims 50 tags available - should this be 11 instead?

Changed 5 years ago by vjo

BBN modified according to answer.

Changed 5 years ago by vjo

slNet.vlan delegation to rack broker had previously been disabled; I presume we want to re-enable it?

Changed 5 years ago by ibaldin

SL should delegate 9 VLANs (1700-1708 are the actual tags).

I don't think we should delegate any to the local broker from slNet for now.

Changed 5 years ago by vjo

SL *also* needs modification of Quantum, ISTR.
Which VLAN tag should I set up as similar to "mesoscale"?

Changed 5 years ago by vjo

config.xml at SL modified per instructions.
Awaiting tag at SL to set as "mesoscale."

Changed 5 years ago by vjo

Per Ilya:
1655 is the tag to be marked as "static"/"mesoscale" in Quantum for SL.

Changed 5 years ago by vjo

Tag added at SL to Quantum.
Restart in progress on affected racks.
Preparing to re-claim.

Changed 5 years ago by yxin

slNet.vlan, yes, 11 total in the pool.

Changed 5 years ago by ibaldin

Suggested testing:

1. Test exo controller for basic connectivity across sites
2. Test SL to other sites
3. Test that VMs can be hung on 1655 at SL and, with an external controller, can see each other.
4. Test FIU to UFL (may or may not actually work due to untested plumbing)

Changed 5 years ago by vjo

All racks restarted and claimed.
UCD is *not* claimed.

Awaiting results of testing for further instructions.

Changed 5 years ago by ibaldin

Additionally, test UCD reachability - if it works, we can declare it conditionally open (as in, open via our interfaces; GPO can test whenever they care).

Changed 5 years ago by vjo

UCD claimed, per Ilya's request.

Changed 5 years ago by vjo

Unsure how to test:
Test that VMs can be hung on 1655 at SL and, with an external controller, can see each other.

Do I use the (what I understood to be broken) OF slice functionality in Flukes for this?

Changed 5 years ago by vjo

Have thrown 3 dumbbells so far. Results not promising:
RCI VM <-> BBN VM - packets not passed, all reservations active
FIU VM <-> UFL VM - packets not passed, all reservations active
FIU VM <-> SL VM - packets not passed, all reservations active

Changed 5 years ago by vjo

RCI VM <-> FIU VM works. Retrying BBN <-> RCI.

Changed 5 years ago by vjo

Second BBN <-> RCI DB fails to pass packets.
RCI VM <-> SL VM DB - also fails to pass packets, despite all reservations coming up active.

Changed 5 years ago by vjo

In testing:
Did two RCI<->BBN dumbbells simultaneously.
First did not work, second did.
First used tag 2601 in NoX/BBN; expect stuck tag.

Not *one* of the SL tags were we able to make work (1700 to 1708).

Changed 5 years ago by vjo

  • description modified

Changed 5 years ago by vjo

Exiting maintenance.

Known issues:
- Stuck tag of 2601 (at least) at BBN/NoX
- UFL to FIU not functioning at plumbing layer
- UCD to anywhere not functioning at plumbing layer
- SL to anywhere not functioning at plumbing layer
- Discovered issue in code:
java.lang.IllegalStateException: item is already in allocated:-1
    at orca.shirako.util.FreeAllocatedSet.allocate(FreeAllocatedSet.java:90)
    at orca.plugins.ben.control.NdlInterfaceVLANControl.assign(NdlInterfaceVLANControl.java:126)
    at orca.policy.core.AuthorityCalendarPolicy.assign(AuthorityCalendarPolicy.java:108)
    at orca.policy.core.AuthorityCalendarPolicy.map(AuthorityCalendarPolicy.java:586)
    at orca.policy.core.AuthorityCalendarPolicy.mapGrowing(AuthorityCalendarPolicy.java:676)
    at orca.policy.core.AuthorityCalendarPolicy.mapForCycle(AuthorityCalendarPolicy.java:643)
    at orca.policy.core.AuthorityCalendarPolicy.assign(AuthorityCalendarPolicy.java:135)
    at orca.shirako.core.Authority.tickHandler(Authority.java:331)
    at orca.shirako.core.Actor.actorTick(Actor.java:431)
    at orca.shirako.core.Actor.access$000(Actor.java:51)
    at orca.shirako.core.Actor$1.process(Actor.java:341)
    at orca.shirako.core.Actor.actorMain(Actor.java:384)
    at orca.shirako.core.Actor$4.run(Actor.java:944)
    at java.lang.Thread.run(Thread.java:662)

on am+broker at SL when attempting to use all tags.
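
For context, the message means a tag was moved into the allocated set while already present there, and the value -1 suggests the tag was never resolved to a real VLAN id before allocation. A minimal sketch, not ORCA's FreeAllocatedSet, of a free/allocated set that fails in exactly this way on a double allocation:

    // Minimal sketch of a free/allocated set; not the ORCA implementation.
    // It throws IllegalStateException on double allocation, which matches
    // the failure above -- there the doubly-allocated item is -1.
    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class SimpleFreeAllocatedSet {
        private final Set<Integer> free = new LinkedHashSet<Integer>();
        private final Set<Integer> allocated = new HashSet<Integer>();

        public synchronized void addInventory(int item) {
            free.add(item);
        }

        public synchronized void allocate(int item) {
            if (allocated.contains(item)) {
                throw new IllegalStateException("item is already in allocated:" + item);
            }
            if (!free.remove(item)) {
                throw new IllegalStateException("item is not free:" + item);
            }
            allocated.add(item);
        }

        public synchronized void release(int item) {
            if (allocated.remove(item)) {
                free.add(item);
            }
        }
    }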

Changed 5 years ago by vjo

  • status changed from new to closed
  • resolution set to fixed