Ticket #296 (closed task: fixed)

Opened 5 years ago

Last modified 5 years ago

Redeployment Testing Feb 3, 2014

Reported by: yxin Owned by: yxin
Priority: blocker Milestone:
Component: External: Testing and Redeployment Version: baseline
Keywords: Cc: ibaldin, vjo, anirban, pruth, ckh

Description

Attached a few complex request RDFs for redeployment testing before releasing.

Attachments

mp-request.rdf (7.1 kB) - added by yxin 5 years ago.
A broadcast link with two leaves in two different racks, one single node and one node group; there is a post-boot script to show the interfaces
complex.rdf (25.5 kB) - added by yxin 5 years ago.
An interdomain mesh topology: 5 nodes
SC13-demo.rdf (16.4 kB) - added by yxin 5 years ago.

Change History

Changed 5 years ago by yxin

A broadcast link with two leaves in two different racks, one single node and one node group; there is a post-boot script to show the interfaces

Changed 5 years ago by yxin

An interdomain mesh topology: 5 nodes

  Changed 5 years ago by ibaldin

  • component changed from ORCA: controllers - ORCA API to External: Testing and Redeployment

  Changed 5 years ago by ibaldin

  • summary changed from Redeployment Testing to Redeployment Testing Feb 3, 2014

  Changed 5 years ago by ibaldin

Task list

1. Controller on ExoSM + test cases attached
2. RDF update for ION, NLR, BBN, FIU + AM restarts
3. Optional: update controller on BBN and FIU + test local embedding.

Make sure blowhole is running afterwards.

  Changed 5 years ago by vjo

We will be deploying code from HEAD today, which is revision 6108.
Should we need to check the code for sources of regressions, the diff for today's maintenance can be obtained by running:
svn diff -r 6094:6108

  Changed 5 years ago by vjo

As requested by Ilya, here's my "maintenance checklist" for the person performing the re-deploy:

1) Send a notice email/tweet to users at least 24 hours prior to the maintenance.
2) Send a reminder email/tweet to users 15 minutes prior to maintenance.
3) Lock users out at control.exogeni.net at start of maintenance window.
4) Use pequod to shut down slices and reservations on affected systems.
5) Use pequod to undelegate resources from ExoLayer for affected systems.
6) Ensure code is updated via RPM on all affected systems.
7) Shut down containers on all affected systems.
8) Clean up substrate for all containers managing an aggregate.
9) Clean restart containers on all affected systems.
10) Use pequod to claim resources for ExoLayer on affected systems.
11) Run any required tests to verify code changes are functioning "as expected" and to ensure that regressions have not crept in.
12) If necessary, re-build RPMs and repeat steps 6-12, until code is operating as desired.
13) Open up testbed at control.exogeni.net.
14) [OPTIONAL] Run puppet-agent -tv manually on affected systems to ensure whitelist push.
15) Send email/tweet to notify users that maintenance has ended, and that the testbed is open.

  Changed 5 years ago by vjo

Modifications to procedure, as suggested by Ilya:
1) Send a notice email/tweet to users at least 24 hours prior to the maintenance.
2) Send a reminder email/tweet to users 15 minutes prior to maintenance.
3) Lock users out at control.exogeni.net at start of maintenance window.
4) Run puppet-agent -tv manually on affected systems to ensure whitelist push.
5) Use pequod to shut down slices and reservations on affected systems.
6) Use pequod to undelegate resources from ExoLayer for affected systems.
7) Ensure that RDF is up-to-date on all affected systems.
8) Ensure code is updated via RPM on all affected systems.
9) Shut down containers on all affected systems.
10) Clean up substrate for all containers managing an aggregate.
11) Clean restart containers on all affected systems.
12) Use pequod to claim resources for ExoLayer on affected systems.
13) Run any required tests to verify code changes are functioning "as expected" and to ensure that regressions have not crept in.
14) If necessary, re-build RPMs and repeat steps 6-12, until code is operating as desired.
15) Ensure that blowholed is running and reporting on control.exogeni.net.
16) Open up testbed at control.exogeni.net.
17) Run puppet-agent -tv manually on affected systems to ensure whitelist push.
18) Send email/tweet to notify users that maintenance has ended, and that the testbed is open.

  Changed 5 years ago by ibaldin

If OSF comes back on line before the end of maintenance, we should test connectivity to it from a couple of places:

1. RCI
2. UFL

  Changed 5 years ago by ibaldin

  • cc ckh added

DOE is working on the cert issue. It appears all OSCARS servers are affected at this time.

  Changed 5 years ago by ibaldin

Add UH to maintenance, while the certificate issue is resolved.

  Changed 5 years ago by ibaldin

We can now call ION OSCARS. Waiting on resolution from AL2S and ESnet.

  Changed 5 years ago by anirban

Tried a dumb-bell from nicta to uva. ION worked fine.. The vm at UVA is failing.. OpenStack needs to be checked out at uva..

"Last lease update: all units failed priming: Error code 1 during join for unit: 73DF8A40 with message: unable to create instance: exit code 1, "

  Changed 5 years ago by vjo

Please tear down your slice, and I will clean and restart the substrate.

  Changed 5 years ago by vjo

Clocks out of sync between head node and workers at UvA. Working on resolving.

  Changed 5 years ago by vjo

NTP issue resolved; was common to all racks.
Have resolved in puppet.

follow-up: ↓ 16   Changed 5 years ago by anirban

For nicta-uva dumb-bell, all slivers go to active state, but VMs can't ping each other..

in reply to: ↑ 15   Changed 5 years ago by ibaldin

Replying to anirban:

For nicta-uva dumb-bell, all slivers go to active state, but VMs can't ping each other..

This could be a problem with SURFnet or the undersea link. But I don't know.

  Changed 5 years ago by ibaldin

AL2S is now working. Please test other sites.

It should also be possible to do any combination of RCI, BBN, FIU, UFL.

Test NICTA-RCI and UvA-RCI to see which one of them is having issues.

  Changed 5 years ago by anirban

For complex.rdf , with four nodes at RCI and one node at BBN, ION vlan failing with error:

Last lease update: all units failed priming: Error code 1 during join for unit: 8026DB1D with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to There are no VLANs available on link ion.internet2.edu:rtr.newy:ae0:bbn on reservation ion.internet2.edu-63921 in VLAN PCE ", exiting

follow-up: ↓ 20   Changed 5 years ago by anirban

I tested dumb-bells from rci to every other site. Here are the outcomes:

RCI-UFL: fine
RCI-FIU: fine
RCI-UVA: Slivers come up but VMs not pingable
RCI-NICTA: Slivers come up but VMs not pingable
RCI-BBN: ION fails with the following -

Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to There are no VLANs available on link ion.internet2.edu:rtr.newy:ae0:bbn on reservation ion.internet2.edu-63971 in VLAN PCE ", exiting

in reply to: ↑ 19   Changed 5 years ago by ibaldin

Replying to anirban:

I tested dumb-bells from rci to every other site. Here are the outcomes:

RCI-UFL: fine
RCI-FIU: fine
RCI-UVA: Slivers come up but VMs not pingable
RCI-NICTA: Slivers come up but VMs not pingable
RCI-BBN: ION fails with the following -

Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to There are no VLANs available on link ion.internet2.edu:rtr.newy:ae0:bbn on reservation ion.internet2.edu-63971 in VLAN PCE ", exiting

Chris, we need to talk to IU about BBN vlans on AL2S - I don't think they are allowing us to use those vlans

Anirban: do intra-rack slices at NICTA and UvA succeed?

  Changed 5 years ago by anirban

Ilya: Intra rack tests succeed both at NICTA and BBN. I tried the intra-rack version of complex.rdf.

  Changed 5 years ago by anirban

Threw complex.rdf with four nodes at RCI and one node at UFL. It resulted in partial success with 2/3 interdomain links working. One ION link failed:

Last lease update: all units failed priming: Error code 1 during join for unit: EA65E34B with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

  Changed 5 years ago by anirban

complex.rdf passed with four nodes at RCI and one node at FIU. All three interdomain links came up fine.. The only problem was with the manifest. There are multi-edges between some nodes on the inter-domain path.

  Changed 5 years ago by anirban

A subsequent request (while not closing the previous request) with complex.rdf with four nodes at RCI and one node at FIU fails with embedding workflow error:

java.lang.Exception: Unable to create slice: Embedding workflow ERROR: 1:Insufficient resources or Unknown domain: http://geni-orca.renci.org/owl/fiuNet.rdf#fiuNet/Domain/vlan:0!.

There were no other slices using ExoSM other than the existing slice that worked before. Don't we have enough vlan tags to cover two simultaneous complex.rdf requests with RCI and FIU ??

  Changed 5 years ago by yxin

Yes, I saw fiuNet delegates 5 vlans each, out of 10 total, to the ndl-broker and the rack broker.

  Changed 5 years ago by anirban

Yes, I verified the limit of 5 fiuNet vlans.. Any request needing the 6th vlan throws the "embedding workflow" error, which is the correct behavior..

Now the question is why would the rack broker need any fiuNet vlans ? afaik, fiuNet vlans are used only for interdomain slices.. Shouldn't all the 10 fiuNet vlans be delegated to ndl-broker ?

  Changed 5 years ago by yxin

fiuNet, uhNet, and uflNet all delegate 5 to each broker now. We should delegate all 10 to the ndl-broker. bbnNet should delegate 10 (AL2S vlans) to the ndl-broker and 105 (5 for ION, 100 for InstaGENI) to the rack broker to support GENI stitching.

And, we need to remove NLR from this diagram:
https://wiki.exogeni.net/lib/exe/detail.php?id=public%3Aexperimenters%3Atopology&media=public:users:exogeni-topo.png

  Changed 5 years ago by vjo

Um. We have to leave some VLANs to the rack SM for GENI stitching.

If we don't, we can expect tickets from GPO.

  Changed 5 years ago by ibaldin

Let them get their own vlans. We're taking all of our vlans back. Please reconfigure the delegations.

follow-up: ↓ 31   Changed 5 years ago by ibaldin

From Chris:

I checked the ION database and the Vlan availability includes 2601-2610, which is correct (see below). In Yufeng's message further below, I see the urn as "ion.internet2.edu:rtr.newy:ae0:bbn". If it's literal, then it should be "urn:ogf:network:domain=ion.internet2.edu:node=rtr.newy:port=ae0:link=bbn". Are we using the right urn, syntactically?

<pref887:link id="urn:ogf:network:domain=ion.internet2.edu:node=rtr.newy:port=ae0:link=bbn">
<pref887:remoteLinkId>

urn:ogf:network:domain=bbn.com:node=*:port=*:link=*

</pref887:remoteLinkId>
<pref887:trafficEngineeringMetric>10</pref887:trafficEngineeringMetric>
<pref887:capacity>10000000000</pref887:capacity>
<pref887:maximumReservableCapacity>2000000000</pref887:maximumReservableCapacity>
<pref887:minimumReservableCapacity>1000000</pref887:minimumReservableCapacity>
<pref887:granularity>1000000</pref887:granularity>
<pref887:SwitchingCapabilityDescriptors>
<pref887:switchingcapType>l2sc</pref887:switchingcapType>
<pref887:encodingType>packet</pref887:encodingType>
<pref887:switchingCapabilitySpecificInfo>
<pref887:interfaceMTU>9000</pref887:interfaceMTU>
<pref887:vlanRangeAvailability>533,546,667,670,1755-1759,2601-2650,3701-3750</pref887:vlanRangeAvailability>
<pref887:vlanTranslation>true</pref887:vlanTranslation>
</pref887:switchingCapabilitySpecificInfo>
</pref887:SwitchingCapabilityDescriptors>
</pref887:link>
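(Purely as an illustration of Chris's point about URN syntax, a minimal Java sketch mapping the short form seen in the error message to the fully qualified OSCARS link URN; the class and method names here are made up for the example, not ORCA code:)

public class OscarsUrnSketch {
    // Builds urn:ogf:network:domain=<domain>:node=<node>:port=<port>:link=<link>
    static String linkUrn(String domain, String node, String port, String link) {
        return String.format("urn:ogf:network:domain=%s:node=%s:port=%s:link=%s",
                domain, node, port, link);
    }

    public static void main(String[] args) {
        // Short form quoted in the OSCARS error: ion.internet2.edu:rtr.newy:ae0:bbn
        // Fully qualified form from the topology snippet above:
        System.out.println(linkUrn("ion.internet2.edu", "rtr.newy", "ae0", "bbn"));
        // -> urn:ogf:network:domain=ion.internet2.edu:node=rtr.newy:port=ae0:link=bbn
    }
}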

in reply to: ↑ 30   Changed 5 years ago by ibaldin

Replying to ibaldin:

I think Chris is right - we are not using the correct URN.

  Changed 5 years ago by anirban

I tested the multi-point request, mp-request.rdf with RCI and FIU.. All slivers go to active but VMs are not pingable.. No other slices existed simultaneously during the test.. complex.rdf and dumb-bells between RCI and FIU worked fine yesterday..

I have taken down all slices, and would be waiting on new delegations and fixing of connectivity issues..

  Changed 5 years ago by ibaldin

Please any time you have a case like this, immediately try a dumbbell to see if this is a connectivity or an embedding problem...

  Changed 5 years ago by anirban

Tried a dumb-bell between RCI and FIU. That still works..

follow-up: ↓ 36   Changed 5 years ago by anirban

I tested the multi-point request, mp-request.rdf with RCI and FIU.. All slivers including ION go to active but VMs are not pingable..

Then I tried a dumb-bell between RCI and UFL.. It resulted in an ION failure that we are seeing intermittently for RCI-UFL interdomain links (reported yesterday):

Last lease update: all units failed priming: Error code 1 during join for unit: F3DF2F73 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

in reply to: ↑ 35   Changed 5 years ago by anirban

Sorry, the request was a multipoint request between RCI and UFL, not RCI and FIU..

Replying to anirban:

I tested the multi-point request, mp-request.rdf with RCI and FIU.. All slivers including ION go to active but VMs are not pingable..

Then I tried a dumb-bell between RCI and UFL.. It resulted in an ION failure that we are seeing intermittently for RCI-UFL interdomain links (reported yesterday):

Last lease update: all units failed priming: Error code 1 during join for unit: F3DF2F73 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

  Changed 5 years ago by ibaldin

Let's redeploy BBN with a new VLAN policy orca.plugins.ben.control.NdlInterfaceVLANControl (same as the one that was deployed at UvA before we switched it back). Then restart the rack, the transit net authorities and the ExoController.

  Changed 5 years ago by anirban

dumb-bells between FIU and UFL always fail in ION with an error like:

Last lease update: all units failed priming: Error code 1 during join for unit: 3F0F7A27 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to Index: 0, Size: 0 on reservation al2s.net.internet2.edu-8461 in Dijkstra PCE ", exiting

follow-up: ↓ 40   Changed 5 years ago by anirban

A multipoint request, mp-request.rdf, works fine with UFL and FIU..

in reply to: ↑ 39   Changed 5 years ago by ibaldin

Replying to anirban:

This is apparently an AL2S limitation, as FIU and UFL share the same AL2S port. I moved this issue to ticket #300 for long-term resolution.

  Changed 5 years ago by vjo

OK - redeployed. Hit it again.

  Changed 5 years ago by anirban

Dumb-bell tests:

rci-bbn: check

rci-ufl: check

rci-fiu: check

bbn-fiu: check

bbn-ufl: check

rci-uva: slivers up; vms not pingable

rci-nicta: slivers don’t come up; nictaNet remains in Ticketed [behavior different from what I saw yesterday when all slivers went to active but vms were not pingable]

rci-osf: controller Exception when querying ORCA for slice manifest — java.lang.Exception: Unable to get sliver status: ERROR: ControllerException encountered: null

fiu-ufl: ion reservation fails with
Last lease update: all units failed priming: Error code 1 during join for unit: C504AC87 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to Index: 0, Size: 0 on reservation al2s.net.internet2.edu-8531 in Dijkstra PCE ", exiting

Moving onto more complex inter-domain cases that include rci, fiu, bbn, and ufl

follow-up: ↓ 45   Changed 5 years ago by vjo

Could you double-check UvA? I have a suspicion as to why it was broken, and may have just fixed it.

  Changed 5 years ago by ibaldin

We now have a cert that works with AL2S, ION and ESnet. Connections to OSF *should* work, if the other problems with it are fixed.
Cert has been installed and ORCA configured to use it.

in reply to: ↑ 43   Changed 5 years ago by anirban

rci-uva dumb-bell still doesn't work

Replying to vjo:

Could you double-check UvA? I have a suspicion as to why it was broken, and may have just fixed it.

  Changed 5 years ago by anirban

Multi-point tests: mp-request.rdf

rci-fiu: check

rci-bbn: ion reservation fails with
Last lease update: all units failed priming: Error code 1 during join for unit: 2946EEA6 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to local teardown succeeded ", exiting

rci-ufl: nlr remains ticketed and ion reservation fails with
Last lease update: all units failed priming: Error code 1 during join for unit: E8AB5501 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

bbn-fiu: nlr remains Ticketed and ion reservation fails with
Last lease update: all units failed priming: Error code 1 during join for unit: C40CD66D with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

bbn-ufl: check

fiu-ufl: check

  Changed 5 years ago by vjo

I have also checked RCI<->UvA, and see the same issue (all resources "up", but no ping).

RCI<->OSF is stuck, with the ION reservation not proceeding from "Ticketed."

  Changed 5 years ago by vjo

Stay off of NICTA for a moment - trying to resolve its issues.

  Changed 5 years ago by vjo

NICTA update:
Performed a RCI<->NICTA dumbbell; all resources came up active.
Logged into both VMs, but could not pass a ping.

  Changed 5 years ago by vjo

RCI<->NICTA still failing, despite all resources coming up active.
Here's the path that's not functioning for me:
1)
Node name: rciNet/Domain/vlan/7a9c471d-7e02-45d5-996c-8421de6bcc90/vlan
Label/Tag: 1016

2)
Node name: ben/Domain/vlan/a02ad832-4618-44a8-9e9f-29ae92b47947/vlan
Label/Tag: 100

3)
Node name: nlr/Domain/vlan/07fca4b1-8537-4c59-b04a-e00da62702fe/vlan
Label/Tag: 103

4)
Node name: ion/Domain/vlan/2a85e30e-0ff5-4e14-bb21-890bb521df11/vlan
Label/Tag: 274

5)
Node name: nictaNet/Domain/vlan/56426465-a472-4685-ad19-0b09420fdc3e/vlan
Label/Tag: 3199

  Changed 5 years ago by vjo

RCI<->UvA is now working!

  Changed 5 years ago by anirban

multi-point with rci and uva doesn't work. rci-uva dumb-bell still works.

Slivers go to active state but vm's not pingable. The vlan tag trail is

rciNet: 1016
ben: 100
NLR Net: vlan tag unspecified
I2 ION/AL2S: 201
uvaNet: 3200

  Changed 5 years ago by anirban

Summary of testing (02/06):

The following dumb-bells work: [rci-bbn, rci-ufl, rci-fiu, bbn-fiu, bbn-ufl, rci-uva]
rci-nicta dumb-bell doesn't work and needs debugging. See vjo's note on this.
fiu-ufl dumb-bell doesn't work and a ticket (#300) has been created for future reference.
rci-osf dumb-bell doesn't work and ESNet folks have been notified about it, since it is a probable internal OSCARS issue

The following '2-point' multipoint slices work: [rci & fiu; bbn & ufl; fiu & ufl].
[rci & ufl; bbn & fiu] didn't work because of a similar failure of the ion reservation - "FAILED PSSReplyRequest.execute". These might go away when the above OSCARS issue is fixed ? Or, might need more debugging.
[rci & bbn] didn't work because of a failure of the ion reservation - "FAILED due to local teardown succeeded" . Might go away when the above OSCARS issue is fixed ? Or, might need more debugging.
[rci & uva] didn't work because vms were not pingable. See anirban's note above. Needs debugging.

Testing plan for tomorrow (02/07):

1. Postboot scripts [** Need documentation for expression of IP address on a broadcast link in the velocity template]
2. Storage
3. Stitchport
4. complex.rdf covering as many domains as possible
5. SC13-demo.rdf covering as many domains as possible
6. Anything else ?

  Changed 5 years ago by yxin

There was a problem in the nlr control in synchronizing the closing. Checked in the fix. Please rebuild and redeploy nlr.

-Yufeng

  Changed 5 years ago by ibaldin

When you test, please use the new version of Flukes and report any problems on this ticket.

http://geni-images.renci.org/webstart/0.4-SNAPSHOT/flukes.jnlp

Currently one known problem is disconnected stitchports in the manifest.

  Changed 5 years ago by ckh

OESS and OSCARS inconsistencies have been reconciled and both are synchronized at the moment. It's an ongoing issue that needs persistent oversight.

follow-up: ↓ 58   Changed 5 years ago by ibaldin

Deployed updated NDL-RSpec converter v.0.7-SNAPSHOT.build-6054.

in reply to: ↑ 57 ; follow-up: ↓ 59   Changed 5 years ago by pruth

Tested 1,2,and 3 from anirban's list.

Summary:

1. Postboot scripts are missing IP addresses (and likely other info). I will look into this a bit more but I suspect the properties are not being passed to the handler correctly.
2. Storage works for the cases I tried.
3. Stitchport to OSG works.

in reply to: ↑ 58   Changed 5 years ago by ibaldin

Replying to pruth:

Tested 1,2,and 3 from anirban's list.

Under what conditions are the parameters missing? Can you say anything about the types of requests you tried?

follow-up: ↓ 62   Changed 5 years ago by ckh

Testing connectivity to NICTA from Departure Drive (DD) using stitchport. Able to implement Vlans 4001-4005 between DD(AL2S) and Los Angeles(ION) using OSCARS GUI - they're still provisioned. Able to provision stitchports for Vlans 3195-3199 at NICTA. The mappings are as follows, as well as the test result
3195 - 4001, ping successful
3196 - 4002, ping successful
3197 - 4003, unsuccessful ping
3198 - 4004, ping successful
3199 - 4005, unsuccessful ping
Need to troubleshoot unsuccessful pings.

  Changed 5 years ago by ibaldin

Removed test ION circuits to NICTA for now.

in reply to: ↑ 60   Changed 5 years ago by ibaldin

Replying to ckh:

Testing connectivity to NICTA from Departure Drive (DD) using stitchport. Able to implement Vlans 4001-4005 between DD(AL2S) and Los Angeles(ION) using OSCARS GUI - they're still provisioned. Able to provision stitchports for Vlans 3195-3199 at NICTA. The mappings are as follows, as well as the test result
3195 - 4001, ping successful
3196 - 4002, ping successful
3197 - 4003, unsuccessful ping
3198 - 4004, ping successful
3199 - 4005, unsuccessful ping
Need to troubleshoot unsuccessful pings.

Created separate ticket #302 to track this issue.

  Changed 5 years ago by vjo

To validate Chris's testing, I threw 3 RCI<->NICTA dumbbells.
The first used 4005.
The second used 4004.
The third used 4003.

Both the first and third did not ping successfully, but the second *did* ping successfully.

  Changed 5 years ago by ibaldin

ESnet has requested that the ION and AL2S OSCARS instances be updated to reflect the new certificate signing structure at DoE; this is what is preventing requests from OSF from reaching ION and AL2S. Hopefully once that is done OSF will become reachable.

  Changed 5 years ago by anirban

dumb-bell:

rci-ufl: check

rci-fiu: check

rci-uva: check

Multipoint:

rci & ufl: all slivers up but not pingable (rciNet:1016, ben:100, nlr:unspecified, ion:201, uflNet:1411) [dumb-bell fine]

rci & bbn: check

fiu & ufl: slice fine, but manifest consists of islands of nodes

rci & uva: all slivers are up but not pingable (rciNet:1017, ben:105, nlr:unspecified, ion:205, uvaNet:3200) [dumb-bell fine]

rci & fiu: all slivers are up but not pingable (rciNet:1016, ben:100, nlr:unspecified, ion:206, fiuNet:1762) [dumb-bell fine]

bbn & ufl: ion reservation failed and manifest consists of islands [Manifest attached]
Last lease update: all units failed priming: Error code 1 during join for unit: 7C2189B7 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

Complex:

fiu & bbn: check

rci & bbn: check

rci & ufl: One out of the 3 ion reservations fail with
Last lease update: all units failed priming: Error code 1 during join for unit: C9A081AF with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

  Changed 5 years ago by ibaldin

I extended the ExoSM maintenance until Monday evening. We are also waiting on resolution of ESnet OSCARS to ION/AL2S issue.

  Changed 5 years ago by ibaldin

For Chris:

OSF-to-RCI gets created with no errors, but pings do not pass.

  Changed 5 years ago by yxin

For Victor:

I checked in something to make label freeing more explicit. Please recompile and redeploy the Exo-controller, nlr, and ben (nobody uses (or can use) BEN anyway now).

I'll run tests afterwards. Thanks.

-Yufeng

follow-up: ↓ 70   Changed 5 years ago by vjo

OK; first thing in AM OK, or do you want now?

in reply to: ↑ 69   Changed 5 years ago by yxin

Replying to vjo:

OK; first thing in AM OK, or do you want now?

Tomorrow morning is good. Thanks.

-Yufeng

  Changed 5 years ago by vjo

Re-deployment from trunk at HEAD (revision 6147) is complete.

  Changed 5 years ago by ibaldin

r6148, r6149 and r6150 contain needed updates to the controller and controls. Redeployment of NLR, ION, BEN and restart of exo controller is necessary.

  Changed 5 years ago by vjo

D'oh. Meant to put this here.
OK - everybody take down your slices; I’m starting the re-deploy.

  Changed 5 years ago by vjo

OK - re-deployed.
Please proceed w/ testing.

  Changed 5 years ago by ibaldin

The fix for the problem is in r6160.

What the fix does: it delays creation of new slices (by returning a 'busy' error message) for slices that come one after another and affect NLR, ION or BEN. So if a slice has a reservation on one of those, the controller will make you wait until this reservation (not the whole slice) is finished or failed.

How to test the fix: try throwing a mix of MP and PP slice requests. Observe the busy message. Make sure everything comes up. Try throwing single-rack requests one after another and observe that there is NO busy message - since these slices don't touch NLR, ION or BEN, they should not be delayed.
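(Roughly, the gating described above could look like the sketch below. This is an illustration only, with made-up class and method names; it is not the actual controller/SliceDeferThread code:)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TransitDomainGateSketch {
    // Domains whose reservations must not overlap across back-to-back slices, per the fix above.
    private static final Set<String> TRANSIT = new HashSet<>(Arrays.asList("nlr", "ion", "ben"));
    private final Set<String> pending = new HashSet<>();

    // Returns false (caller answers 'busy') if an earlier slice still has a non-final
    // reservation on NLR, ION or BEN; single-rack slices pass straight through.
    public synchronized boolean tryAdmit(Set<String> domainsInRequest) {
        for (String d : domainsInRequest) {
            if (TRANSIT.contains(d) && pending.contains(d)) {
                return false;
            }
        }
        for (String d : domainsInRequest) {
            if (TRANSIT.contains(d)) {
                pending.add(d);
            }
        }
        return true;
    }

    // Called when the transit reservation (not the whole slice) becomes active or failed.
    public synchronized void reservationFinal(String domain) {
        pending.remove(domain);
    }
}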

  Changed 5 years ago by yxin

Tested in the emulator; it works.

vjo, please rebuild and redeploy the controller, and redeploy nlr/ion and ben, for a real test.

  Changed 5 years ago by vjo

Re-deployed; proceed.

  Changed 5 years ago by ibaldin

r6164 has additional fixes to discriminate between inter- and intra-domain requests.

After testing whether it works properly (see above), we need to test the GENI API now as well as pub sub, because the fix potentially affects those two.

  Changed 5 years ago by vjo

Re-deployed r6165 - I added a couple of minor safety checks.

  Changed 5 years ago by vjo

And - if that wasn't clear - we're ready to test in the morning.

  Changed 5 years ago by anirban

Tested multipoint involving rci-bbn-fiu followed by two dumb-bells (rci-fiu and then rci-bbn) in quick succession.. The mp request and the rci-fiu dumb-bell requests came back with the following exception in flukes:

ERROR: Exception encountered: com.hp.hpl.jena.shared.QueryStageException: com.hp.hpl.jena.shared.ClosedException: already closed

The rci-bbn dumb-bell request worked fine..

Trying the same sequence again.

  Changed 5 years ago by anirban

Tested another sequence of multipoint between rci-bbn-fiu followed by a rci-fiu dumb-bell in quick succession.

The multipoint request went through fine.. Everything became active.. VMs were pingable between bbn and fiu, but neither could ping the vm on rci.

The rci-fiu dumb-bell request failed because ben reservation failed with:

Last lease update: all units failed priming: Exception during join for unit: 35C64016 The following error occurred while executing this line:
/etc/orca/am+broker-12080/handlers/providers/ben/ben.xml:192: The following error occurred while executing this line:
/etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:547: /etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:581: An error occurred: XML Reply returned error:
configuration check-out failed

in stage (sending commit)?

  Changed 5 years ago by ibaldin

r6167 has the fix for the problem. Yufeng testing emulation, then if OK, need a rebuild.

BEN is also cleaned up.

  Changed 5 years ago by ibaldin

It is working, please rebuild, redeploy and test.

  Changed 5 years ago by vjo

On it. Everybody get out.

  Changed 5 years ago by vjo

test.renci.uh.23 is awaiting closure...

  Changed 5 years ago by ckh

test.renci.uh.23 is deleted

  Changed 5 years ago by vjo

Re-deployed; please test.

  Changed 5 years ago by anirban

First try of one multipoint followed by dumb-bell came up with this exception for the dumb-bell request:

java.lang.Exception: Unable to create slice: ERROR: createSlice(): discoverTypes() failed to populate typesMap and abstractModels

Doing more tests now..

  Changed 5 years ago by vjo

Preserving Exception backtrace from Anirban's report:
INFO | jvm 1 | 2014/02/12 12:48:36 | 2014-02-12 12:48:36,582 [qtp1228283922-33] ERROR controller.orca.controllers.xmlrpc.OrcaXmlrpcHandler - createSlice(): discoverTypes() failed to populate typesMap and abstractModels: java.lang.ArrayIndexOutOfBoundsException: 159
INFO | jvm 1 | 2014/02/12 12:48:36 | java.lang.ArrayIndexOutOfBoundsException: 159
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.mem.HashCommon.findSlot(HashCommon.java:152)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.mem.HashedBunchMap.get(HashedBunchMap.java:42)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.mem.faster.NodeToTriplesMapFaster.iterator(NodeToTriplesMapFaster.java:110)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.mem.GraphTripleStoreBase.find(GraphTripleStoreBase.java:143)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.mem.faster.GraphMemFaster.graphBaseFind(GraphMemFaster.java:141)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.impl.GraphBase.find(GraphBase.java:240)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.compose.MultiUnion.multiGraphFind(MultiUnion.java:187)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.compose.MultiUnion.graphBaseFind(MultiUnion.java:166)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.impl.GraphBase.find(GraphBase.java:240)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FGraph.findWithContinuation(FGraph.java:61)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FinderUtil$Cascade.find(FinderUtil.java:90)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FGraph.findWithContinuation(FGraph.java:61)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FinderUtil$Cascade.find(FinderUtil.java:90)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FinderUtil$Cascade.findWithContinuation(FinderUtil.java:106)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FinderUtil$Cascade.find(FinderUtil.java:90)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FinderUtil$Cascade.findWithContinuation(FinderUtil.java:106)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.FinderUtil$Cascade.find(FinderUtil.java:90)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.findDataMatches(FBRuleInfGraph.java:217)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETERuleContext.find(RETERuleContext.java:121)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETERuleContext.contains(RETERuleContext.java:108)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETERuleContext.contains(RETERuleContext.java:100)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEConflictSet.execute(RETEConflictSet.java:160)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEConflictSet.add(RETEConflictSet.java:76)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.requestRuleFiring(RETEEngine.java:228)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETETerminal.fire(RETETerminal.java:73)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEClauseFilter.fire(RETEClauseFilter.java:220)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.inject(RETEEngine.java:422)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.runAll(RETEEngine.java:404)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit(RETEEngine.java:150)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare(FBRuleInfGraph.java:476)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.findWithContinuation(FBRuleInfGraph.java:572)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.graphBaseFind(FBRuleInfGraph.java:604)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.impl.GraphBase.find(GraphBase.java:257)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.query.SimpleQueryHandler.subjectsFor(SimpleQueryHandler.java:60)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.graph.query.SimpleQueryHandler.subjectsFor(SimpleQueryHandler.java:44)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.rdf.model.impl.ModelCom.listSubjectsFor(ModelCom.java:1010)
INFO | jvm 1 | 2014/02/12 12:48:36 | at com.hp.hpl.jena.rdf.model.impl.ModelCom.listResourcesWithProperty(ModelCom.java:1024)
INFO | jvm 1 | 2014/02/12 12:48:36 | at orca.embed.workflow.Domain.getDomainResources(Domain.java:96)
INFO | jvm 1 | 2014/02/12 12:48:36 | at orca.embed.workflow.Domain.getDomainResources(Domain.java:89)
INFO | jvm 1 | 2014/02/12 12:48:36 | at orca.controllers.xmlrpc.OrcaXmlrpcHandler.updateModel(OrcaXmlrpcHandler.java:958)
INFO | jvm 1 | 2014/02/12 12:48:36 | at orca.controllers.xmlrpc.OrcaXmlrpcHandler.discoverTypes(OrcaXmlrpcHandler.java:944)
INFO | jvm 1 | 2014/02/12 12:48:36 | at orca.controllers.xmlrpc.OrcaXmlrpcHandler.createSlice(OrcaXmlrpcHandler.java:296)

  Changed 5 years ago by ibaldin

If we don't lose power, it would be nice to continue testing tomorrow. Seems some of these are Jena errors that I don't quite understand - the last one comes from part of the code in createSlice that was not affected by the changes we made...

  Changed 5 years ago by anirban

OK, I did some tests with multipoint (rci-ufl-bbn) followed by point-to-point requests, sometimes in rapid sequence and sometimes not. I got the following Jena exception 4 out of 6 times:

ERROR: Exception encountered: com.hp.hpl.jena.shared.QueryStageException: com.hp.hpl.jena.shared.ClosedException: already closed

The slices, when they come up, work fine. I feel that the probability of this exception occurring is higher if the requests come very close to each other.

  Changed 5 years ago by ibaldin

Regarding the out-of-bounds exception - I put a guard in Domain.java so we return an empty resource set rather than throw an exception.

Regarding the other exception - the 'already closed', I have a question - Yufeng - you do use the request parser and you do close the model using the parser's close method, right? Where is it?

In general, I think we may be running into well-known Jena concurrency issues and the problem may not be in our code. To test the scenarios we have it is sufficient to separate the requests by about 5 seconds - the requests don't have to be near-simultaneous.
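(A trivial sketch of that spacing, purely as a testing aid; submit() here stands in for whatever client call is actually used to create the slice, and the names are hypothetical:)

public class SpacedSubmitterSketch {
    private static final long MIN_GAP_MS = 5000; // ~5 seconds between requests, per the note above
    private long lastSubmitMs = 0;

    public synchronized void submit(Runnable createSliceCall) throws InterruptedException {
        long wait = lastSubmitMs + MIN_GAP_MS - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait); // keep successive requests from being near-simultaneous
        }
        createSliceCall.run();
        lastSubmitMs = System.currentTimeMillis();
    }
}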

  Changed 5 years ago by anirban

Ok, spacing requests doesn't result in that exception. Basically, if you wait for the controller to return before submitting the next request, the exception doesn't occur.

I threw three multipoint requests (rci-fiu-bbn) in sequence.. For the first and third slices, everything worked fine. For the second slice, everything went to active, but the vm at rci couldn't ping the vms at bbn or fiu, but the vms at bbn and fiu could ping each other.

I saw an ion reservation failure on a rci-ufl dumb-bell request, which we have seen before intermittently.

Last lease update: all units failed priming: Error code 1 during join for unit: 4C542F02 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to local teardown succeeded ", exiting

Also tested a slice using GENI API with ExoSM. Worked fine.

  Changed 5 years ago by yxin

XmlrpcControllerSlice.close() calls RequestWorkflow.close(), which calls NdlCommons.closeModel() to close all the models within the slice.

  Changed 5 years ago by ibaldin

r6168 reduces the concurrency in Jena and catches the array out of bounds exceptions. We should try it.

What is the state of the substrate? Is anything torqued (e.g. BEN commit problems)?

  Changed 5 years ago by anirban

My slices are closed.. I did not see any BEN commit errors.. Waiting on redeploy..

  Changed 5 years ago by ibaldin

If I were to summarize, the last problem we are dealing with is Jena exceptions, is this correct? We are no longer seeing broken slices or lost labels?

  Changed 5 years ago by anirban

The only slice that was partly broken was a multi-point slice thrown between two multi-point slice requests. I threw three multipoint requests (rci-fiu-bbn) in sequence.. For the first and third slices, everything worked fine. For the second slice, everything went to active, but the vm at rci couldn't ping the vms at bbn or fiu, but the vms at bbn and fiu could ping each other. I don't know whether this broken slice was a result of lost labels or not.

The Jena exceptions are the only other outstanding issue. I believe that it is triggered by very rapid subsequent requests to ExoSM.

  Changed 5 years ago by ibaldin

Let's do another rebuild/redeploy of the controller with a concomitant restart of the net authorities. If you get another slice with non-pingable resources, stop and let us know so Yufeng can inspect the logs to see if this is a concurrency issue or just bad luck.

In general alternating MP and PP requests is what triggers the problems. But feel free to throw stuff at it in whatever order you think tests it best.

  Changed 5 years ago by anirban

I am now seeing the BEN commit failure.. I threw 3 multi-point slices and one p-p after the third.. The first one had vms not pingable.. The second and third slice threw the ben failure.. I didn't wait for the p-p slice to come up..

/etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:547: /etc/orca/am+broker-12080/handlers/providers/ben.no-na.tasks.xml:581: An error occurred: XML Reply returned error:
configuration check-out failed

in stage (sending commit)?

I have taken down my slices.. I will test after the next redeploy..

  Changed 5 years ago by vjo

Will get to re-deploy shortly.

  Changed 5 years ago by ibaldin

This is odd. I checked the defer queue log - everything seems to have gone correctly and nothing should've been concurrent.

Yet in BEN tag 103 was reused twice between two vlans:

orca_vlan_102 {
    vlan-id 102;
    interface {
        xe-0/0/3.0 {
            mapping {
                103 {
                    swap;
                }
            }
        }
        xe-0/0/6.0 {
            mapping {
                1018 {
                    swap;
                }
            }
        }
    }
    filter {
        input orca_policy_102-filter;
    }
}
orca_vlan_103 {
    vlan-id 103;
    interface {
        xe-0/0/6.0 {
            mapping {
                1019 {
                    swap;
                }
            }
        }
        xe-0/0/3.0 {
            mapping {
                103 {
                    swap;
                }
            }
        }
    }
    filter {
        input orca_policy_103-filter;
    }
}

which points to the old problem of racing reservations...

  Changed 5 years ago by ibaldin

Cleaned out BEN switch.

  Changed 5 years ago by ibaldin

BTW, the conflict between MP slices points to a different problem I think. The pp slice was submitted to orca controller at 14:58:31, but didn't get submitted to SM demand until 15:14:09.

  Changed 5 years ago by ibaldin

Looks like vlan 102 (the top of the two above) was provisioned way before vlan 103. Vlan 103 was done around 15:01, vlan 102 was done around noon:

INFO | jvm 1 | 2014/02/13 12:00:21 | join:
INFO | jvm 1 | 2014/02/13 12:00:21 | [echo] BEN HANDLER: SETUP on 02/13/2014 12:00
INFO | jvm 1 | 2014/02/13 12:00:21 | [echo] Starting atomic sequence for ALL-OF-BEN
INFO | jvm 1 | 2014/02/13 12:00:21 | [echo] performing setup at renci
INFO | jvm 1 | 2014/02/13 12:00:21 | [echo] enabling vlan 102 on router qfx3500.renci.ben bw=10000000 burst=1250000
INFO | jvm 1 | 2014/02/13 12:00:21 | [echo] router.user: geni-orca
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] vlan 102 created successfully on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] adding ports xe-0/0/6,xe-0/0/3 to vlan 102 on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] vlan 102 added ports xe-0/0/6,xe-0/0/3 on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] performing setup at unc
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] performing setup at duke
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] performing setup at ncsu
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] Finished setting up BEN vlan. code=0
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] Performing map operations
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] Mapping BEN and nlr vlans: 102 103 on qfx3500.renci.ben: xe-0/0/3
INFO | jvm 1 | 2014/02/13 12:00:32 | [echo] mapping vlan tags 103:102 on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:43 | [echo] successfully mapped 103:102 on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:43 | [echo] Mapping BEN and rciNet vlans: 102 1018 on qfx3500.renci.ben: xe-0/0/6
INFO | jvm 1 | 2014/02/13 12:00:43 | [echo] mapping vlan tags 1018:102 on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:54 | [echo] successfully mapped 1018:102 on router qfx3500.renci.ben
INFO | jvm 1 | 2014/02/13 12:00:54 | [echo] Finished performing map operations
INFO | jvm 1 | 2014/02/13 12:00:54 | [echo] Stopping atomic sequence for ALL-OF-BEN
INFO | jvm 1 | 2014/02/13 12:00:54 | [echo] join exit code: 0

  Changed 5 years ago by ibaldin

I wonder if this is a case of a label being lost due to Jena exception? Yufeng - is that possible? Because this wasn't a race. Same tag was picked twice but with a 3 hour interval. I can't readily tell if the request at noon was PP or MP. The request at 3pm was definitely MP.

  Changed 5 years ago by ibaldin

Judging by the name it was a point to point slice (and it was the only slice at that time):

INFO | jvm 1 | 2014/02/13 11:59:41 | INFO [qtp1228283922-28] (SliceDeferThread.java:227) - SliceDeferThread: checkComputedReservaions test.renci.fiu.10/158a2810-5f31-47f2-9a50-5c5ea87dd7f8 has delayed domain
INFO | jvm 1 | 2014/02/13 11:59:41 | INFO [qtp1228283922-28] (SliceDeferThread.java:288) - SliceDeferThread: Slice test.renci.uh.10/2d446e95-b586-498d-8329-030b35158cc1 has no non-final reservations (7)
INFO | jvm 1 | 2014/02/13 11:59:41 | INFO [qtp1228283922-28] (SliceDeferThread.java:137) - SliceDeferThread: Processing slice test.renci.fiu.10/158a2810-5f31-47f2-9a50-5c5ea87dd7f8 immediately
INFO | jvm 1 | 2014/02/13 11:59:41 | INFO [qtp1228283922-28] (SliceDeferThread.java:227) - SliceDeferThread: checkComputedReservaions test.renci.fiu.10/158a2810-5f31-47f2-9a50-5c5ea87dd7f8 has delayed domain
INFO | jvm 1 | 2014/02/13 11:59:41 | INFO [qtp1228283922-28] (SliceDeferThread.java:114) - SliceDeferThread: updating last slice with test.renci.fiu.10/158a2810-5f31-47f2-9a50-5c5ea87dd7f8
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: 09fb4210-4057-4856-bee6-68ff574d4d9f
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: ec9237ea-c210-405b-98be-abcd7140217c
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: 2150e101-1122-47f7-a7b7-a5612bcbc14c
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: 56f815d5-9648-4c69-ada1-0b9948eb20c2
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: 1bced593-35f0-484c-9523-db532d3aac01
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: 14ecbcbb-5732-4eea-a8ef-5cf990024066
INFO | jvm 1 | 2014/02/13 11:59:41 | DEBUG [qtp1228283922-28] (SliceDeferThread.java:199) - demandSlice(): Issuing demand for reservation: 424df18b-759e-485f-b4ae-8fa32e5b47bd

  Changed 5 years ago by anirban

By the way, this was not my slice - test.renci.fiu.10. Who else is testing interdomain?

  Changed 5 years ago by ibaldin

I think it was Chris maybe, but it doesn't matter, since it broke.

  Changed 5 years ago by vjo

"Everybody out of the pool. I don't care if it's frozen."

  Changed 5 years ago by vjo

Re-deployed; hit it.

  Changed 5 years ago by ibaldin

I'm pushing a new version of Flukes (limited by the speed of my home Internet connection). It *visually* addresses #297 for both pp and mp cases, HOWEVER, the manifest as displayed cannot be fully trusted - because the path as given in the manifest includes real interface names, which are reused across connections, I cannot verifiably compute the path as was provisioned by ORCA.

If there are say two racks in a slice that are connected via ION to DD/NLR, the paths from them to NLR via ION may lie through two ION vlans that share common interfaces - I have no way of telling which one is which without further verifying labels as assigned, which I also cannot do, since manifest doesn't include information about labels provisioned on specific interfaces.

It will have to stay this way until we change the manifest model to use virtual interfaces.

  Changed 5 years ago by anirban

I sent requests for 8 multi-point and 4 point-to-point slices involving rci, bbn, fiu and ufl, in a random sequence. 10 of them worked fine. For one of the point-to-point requests, which I submitted too fast, I got the following exception. I should have waited for the controller to come back before issuing that request.

java.lang.Exception: Unable to create slice: ERROR: createSlice(): discoverTypes() failed to populate typesMap and abstractModels

For another rci-ufl dumb-bell, I got the ion reservation failure, which we see once in a while.

Last lease update: all units failed priming: Error code 1 during join for unit: 86CAA881 with message: Unable to create circuit: start-oscars-v06.sh: OSCARS did not return a GRI to createReservation request due to: "Error: Generic exception: OSCARSA reservation failed with status FAILED due to PSS called Coordinator with FAILED PSSReplyRequest.execute no CreatePathRequest,TearDownPathRequest or CancelReservation associated with this PSSReply ", exiting

I haven't seen any other exceptions or BEN commit failures.

  Changed 5 years ago by ibaldin

I suggest we continue testing. The first exception - I added this catch statement. I do not know the exact cause of it, but it is not anything we introduced with these changes, I don't think (Yufeng can correct me). The second one is obvious - this isn't our problem.

Paul - can you also start testing - let's make sure things work for you.

Also - can we make sure that

1. Storage works

2. Slices are properly published and caught by blowhole

  Changed 5 years ago by ibaldin

Also, after some thinking, I do not believe we have solved the problem. While we addressed it for simultaneous opens, there is still an issue when a close (done through reservation expiry, not explicit close through the controller) coincides with the open.

In this case the controller can assign some unused label A to NLR. While this redeem is traveling to NLR, a close may occur that will cause NLR to free label B, so by the time the redeem arrives at NLR, it will pick B. Since freeing up labels is tied to lazy slice garbage collection the controller won't know B is available for some time.

Feel free to contradict me.
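(As a toy illustration of that timing, under the stated assumption that the controller's free-tag view is refreshed only by lazy slice garbage collection; the numbers and names below are made up:)

import java.util.Arrays;
import java.util.TreeSet;

public class LabelRaceSketch {
    public static void main(String[] args) {
        // Controller's view of free NLR tags at the moment it assigns label A.
        TreeSet<Integer> controllerFree = new TreeSet<>(Arrays.asList(104, 105));
        // Aggregate's view at the same moment.
        TreeSet<Integer> aggregateFree = new TreeSet<>(Arrays.asList(104, 105));

        // Controller picks label A and sends the redeem toward NLR.
        int suggestedA = controllerFree.first(); // 104

        // While the redeem is in flight, an expiry-driven close frees label B = 103
        // at the aggregate; lazy GC means the controller does not learn this yet.
        aggregateFree.add(103);

        // When the redeem arrives, the aggregate picks its own lowest free tag,
        // which is B rather than the controller-suggested A -> mismatch.
        int assignedB = aggregateFree.first(); // 103

        System.out.println("controller suggested " + suggestedA + ", aggregate assigned " + assignedB);
    }
}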

  Changed 5 years ago by yxin

(for some reason, my comment was not posted last night.)

I was actually breaking it, as BEN could fail the same way.

1. The direct reason is that the current BEN QFX handler always does a self-mapping when there is no tag from the upstream domain (NLR), i.e. the MP case. It happened to use a tag that had been passed in from NLR in a previous p2p slice, and then the self-mapping failed.

2. In theory, this tag assignment should not happen. But somehow NLR didn't get back a label in closing, and that caused the mismatch. I need to test it more.

3. I've closed my slices and cleaned up BEN switch.

Paul, you can go ahead and test, especially storage, postboot scripts and pubsub, but just do not use the RCI/BEN site for now, as it may fail.

-Yufeng

  Changed 5 years ago by vjo

OK folks,
Haven't seen any real activity in the logs since ~10:30.
Do we need me to do any re-deployments today, or hold off for tomorrow?

  Changed 5 years ago by yxin

I will check in the new RDF for SL ION connectivity tonight, so redeploy tomorrow to test altogether.
Thanks.

  Changed 5 years ago by vjo

Will do; thanks!

  Changed 5 years ago by yxin

slNet.rdf and ion.rdf, but I need to confirm the port and vlan range information from StarLight. So no redeployment yet.

  Changed 5 years ago by yxin

hi, vjo,

I checked in a fix in an attempt to address the missed-tag issue at the NLR site. Please rebuild and redeploy nlr/ion, ben, and the controller.

I traced Anirban's tests, followed by my tests, leading to the problem in the following sequence: (1) create mp-1, p-1, mp-2, mp-3; (2) close mp-3, mp-2, p-1, mp-1; (3) create mp-4, mp-5, mp-6, mp-7, p-2; (4) close p-2, mp-6, mp-5, mp-7, mp-4; (5) create mp-8, p-9; (6) close mp-8, p-9. Then a tag got lost. The debugging difficulty was that I could not repeat it in the emulator.

Hopefully it's caused by this minor bug, which kept an extra tag property that might come up first in a Jena query and block a tag from being returned later.
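(A rough sketch of the kind of cleanup that description suggests, using plain Jena calls; the property URI and resource below are hypothetical, not the actual ORCA NDL vocabulary:)

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.StmtIterator;

public class TagPropertyCleanupSketch {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        Property usedLabel = m.createProperty("http://example.org/ndl#usedLabel"); // hypothetical
        Resource nlrIf = m.createResource("http://example.org/ndl#nlr-interface"); // hypothetical

        nlrIf.addLiteral(usedLabel, 103); // stale tag left over from an earlier slice

        // Remove any leftover tag statements before recording the new assignment,
        // so a later query cannot pick up the stale value first.
        m.removeAll(nlrIf, usedLabel, null);
        nlrIf.addLiteral(usedLabel, 105);

        StmtIterator it = m.listStatements(nlrIf, usedLabel, (RDFNode) null);
        while (it.hasNext()) {
            System.out.println(it.nextStatement());
        }
    }
}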


  Changed 5 years ago by vjo

Will do, around 1-2 PM; am tied up w/ out-of-town folks until then.

  Changed 5 years ago by vjo

All right - everybody out.
Who owns: test.renci.houston.10
?

Will close in ~10 minutes, if I don't hear back.

  Changed 5 years ago by vjo

Also - where do:
ndl/src/main/resources/orca/ndl/substrate/instageni.rdf

and

ndl/src/main/resources/orca/ndl/substrate/slNet.rdf

get put?

  Changed 5 years ago by vjo

Code built; awaiting re-deploy on:
1) Close of single slice.
2) Answer on where to put new RDF.

  Changed 5 years ago by ckh

Is the redeploy complete?

  Changed 5 years ago by vjo

No - I still don't know where:
slNet.rdf
instageni.rdf

should go.

Are they meant for the StarLight rack?
I *think* so - but I'd like confirmation.

  Changed 5 years ago by vjo

Proceeding on the above assumption (since none of the config.xml files reference the new RDF, in the racks in question).

Re-deploy complete. Proceed with testing.

  Changed 5 years ago by ibaldin

I'm pretty sure slNet is for SL. I don't know where instageni goes.

  Changed 5 years ago by yxin

slNet is incomplete, waiting for port and vlan range. instageni.rdf is an aux RDF holding stitching links to InstaGENI and does not go anywhere.

follow-up: ↓ 136   Changed 5 years ago by ibaldin

Build from r6173. This adds support for GENI speaks-for credentials.

r6172 causes NLR to throw an exception if it is unable to match the controller-suggested tag.

The following things need to be tested prior to reopening:

1. Test mp/pp sequences as before (Paul)

2. Test GENI AM API - that we haven't broken credential checking and if possible test speaks for support (Anirban)

3. Test VLAN 533 stitchport reachability (Chris, ticket #308)

Things that may wait to be resolved after reopening:

Non-working, unconfirmed VLANs on NICTA, OSF and ISI (#302, #304, #305).

VLANs

Also, need to make sure all racks delegate full VLAN ranges to ExoSM, not leaving anything to rack broker anymore (UH appears to still have a split delegation).

  Changed 5 years ago by yxin

The stitching port to ISI is added to ion.rdf.

  Changed 5 years ago by vjo

r6174 deployed, and ION has RDF from r6175.

  Changed 5 years ago by ibaldin

OK, please commence testing as in my comment above.

in reply to: ↑ 132   Changed 5 years ago by anirban

Replying to ibaldin:


2. Test GENI AM API - that we haven't broken credential checking and if possible test speaks for support (Anirban)

Tested GENI AM API.. The changes for speaks-for haven't broken normal credential checking.. The credential parsing exception that used to occur before - "Error: URI=null Line=1: cvc-elt.1: Cannot find the declaration of element 'Signature'" - isn't present any more.

Speaks-for has been tested only with unit tests. The end-to-end test for speaks-for has to wait until (a) we know the process to obtain tool certificates/credentials using a user cert from the GENI portal, CH etc., and (b) we know how to use OMNI clients in the speaks-for mode. The test certs/credentials that were used during unit testing were verified against a test truststore. These can't be used to test against the production truststore, geni-trusted.jks. So, unless we have tool certificates/credentials issued by proper GENI sources, I can't test speaks-for end-to-end in the production setting.

  Changed 5 years ago by pruth

Storage at FIU seems to have a problem and I can't log in to fiu-hn to check on it.

Also, I submitted 2 separate complicated slices that disappeared in flukes as if all the slivers failed. I tried another slightly different one and it went through but it has been in nascent for a while. I think that the first two slices are trying to come up but are not visible in flukes.

  Changed 5 years ago by pruth

The slice eventually came up but several slivers are in the failed state. I'm not sure what happened.

  Changed 5 years ago by yxin

Can you be more specific about the failures - network or edge? Insufficient resources or exceptions?

  Changed 5 years ago by yxin

There were two problems:

1. You used the same stitching tag in your two complex slices, which caused the BEN QFX failure. However, the controller failed to block the second one from going through. The function was there but was broken somehow recently; I have a small fix ready for the controller and will check it in.

2. There also seems to be a race condition in the controller when updating the globally assigned controller tag, especially when two slices are complex and their embedding computation takes a while. I'll work on this issue this morning (a rough sketch of the kind of guarded tag update I have in mind is at the end of this comment).

Paul, could you please close your slices and try them again, but in sequence: (1) first submit the same one with the stitching port; (2) after you see the popup window in Flukes for the first one, submit the second one, replacing the stitching port with a node in RCI. That way we can confirm it is the race condition issue.

Thanks

-Yufeng
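
A minimal sketch of that guarded tag update, purely illustrative (the class, field, and range values here are hypothetical, not the actual controller code):

import java.util.BitSet;

// Hypothetical illustration of serializing the controller's global tag assignment
// so two concurrently embedded slices cannot grab the same VLAN tag.
public class GlobalTagAllocator {
    private final BitSet assigned = new BitSet(4096); // one bit per possible VLAN tag

    // All reads and updates of the shared tag state go through one lock.
    public synchronized int allocateTag(int low, int high) {
        int tag = assigned.nextClearBit(low);
        if (tag > high) {
            throw new IllegalStateException("no free VLAN tag in range " + low + "-" + high);
        }
        assigned.set(tag);
        return tag;
    }

    public synchronized void releaseTag(int tag) {
        assigned.clear(tag);
    }
}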

  Changed 5 years ago by pruth

The first problem was that the first two slices disappeared from flukes after I submitted them. There was/is no way for a user to delete them.

Maybe whatever state they were in that caused them to disappear from flukes also allowed them to be processed concurrently. I don't think I will be able to reproduce this error.

Paul

  Changed 5 years ago by pruth

I closed these slices in pequod, but several of the ION slivers are still in "ticketed/redeeming". I will wait until folks get in today before I try anything.

  Changed 5 years ago by ibaldin

There are two VLANs in the BEN QFX:

orca_vlan_100 {
    vlan-id 100;
    interface {
        ae0.0 {
            mapping {
                1499 {
                    swap;
                }
            }
        }
        xe-0/0/3.0 {
            mapping {
                100 {
                    swap;
                }
            }
        }
    }
    filter {
        input orca_policy_100-filter;
    }
}
orca_vlan_102 {
    vlan-id 102;
    filter {
        input orca_policy_102-filter;
    }
}

And there are two active circuits:

AL2S: starting tag 1411, ending tag 201
urn:ogf:network:domain=al2s.net.internet2.edu:node=sdn-sw.jack.net.internet2.edu:port=e1/2:link=*
urn:ogf:network:domain=al2s.net.internet2.edu:node=sdn-sw.rale.net.internet2.edu:port=xe-8/0/0.0:link=RENCI

and ION: starting tag 2601, ending tag 202
urn:ogf:network:domain=ion.internet2.edu:node=rtr.newy:port=ae0:link=bbn
urn:ogf:network:domain=al2s.net.internet2.edu:node=sdn-sw.rale.net.internet2.edu:port=xe-8/0/0.0:link=RENCI

  Changed 5 years ago by ibaldin

r6178 is ready to redeploy. Requires redeploying the controller, NLR, ION and BEN. Requires cleanup of stale entries on BEN, DD and ION/AL2S.

Changed 5 years ago by yxin

  Changed 5 years ago by vjo

All right; everybody clear out.

  Changed 5 years ago by vjo

Should I clear:

pruth.rci-bbn-ufl-storage.1

in pequod?

  Changed 5 years ago by pruth

Yes. That is one that disappeared in flukes.

  Changed 5 years ago by vjo

Redeployed @ r6178; proceed.

  Changed 5 years ago by ckh

do you need me to clean up the QFXs at RENCI and DD?

  Changed 5 years ago by ckh

Received the following error for a RENCI-UFL dumbbell. The UFL VM piece looks like the part that failed:

Reservation e78d7785-b0d8-4a66-8eb2-f91c9f948e2a (Slice test.renci.ufl.99) is in state [Failed,None]

Last lease update: all units failed priming: Error code 1 during join for unit: 391A577D with message: unable to create instance: exit code 1,

  Changed 5 years ago by ibaldin

This is an OpenStack issue. Looks like UFL may have OpenStack and storage issues (that should not be related to them swapping out optics). Paul is testing.

  Changed 5 years ago by yxin

slNet is ready to redeploy with the new RDF, in config.xml:
(1) use this control policy:

<controls>
  <control type="slNet.vlan" class="orca.plugins.ben.control.NdlInterfaceVLANControl" />
</controls>

(2) delegate 50 tags to nil-broker.

  Changed 5 years ago by vjo

OK - AIUI:
1) Check UFL rack for issues
2) Re-configure, re-deploy, and claim SL rack
3) Check on FIU rack?

  Changed 5 years ago by vjo

SL rack re-configured, re-deployed, claimed.
Still working on UFL.

  Changed 5 years ago by ibaldin

Summary of findings: the fix we put in to deconflict mp from pp slices via queuing offers only a partial solution. A slice may contain both types of connections (as is the case with the SC13 example above), in which case the race condition still persists, this time not between slices but within a single slice.

For now we will reopen with the caveat that a slice cannot reliably combine mp and pp connections that go through DD (e.g. to BEN).

Some suggested solutions:

1. Have 'split' label sets between the controller and the controls so they never operate on overlapping label sets. The question is how to make this efficient so labels aren't wasted (see the sketch below).

2. Change the way DD operates to allow it to remap tags, which would remove the need to reverse the dependency order between the MP and PP cases, eliminating the possibility of collisions.
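
To make option 1 concrete, a minimal sketch of what 'split' label sets could look like; the range boundaries, class name and pool names are illustrative only, not the actual ORCA code:

import java.util.BitSet;

// Hypothetical illustration of 'split' label sets: the controller and the network
// controls draw from disjoint halves of a delegated VLAN range, so they can never
// pick the same tag.
public class SplitLabelSets {
    static int allocate(BitSet pool, int low, int high) {
        int tag = pool.nextClearBit(low);
        if (tag > high) throw new IllegalStateException("pool exhausted");
        pool.set(tag);
        return tag;
    }

    public static void main(String[] args) {
        int low = 1000, high = 1099, mid = (low + high) / 2;
        BitSet controllerPool = new BitSet();  // tags the controller may suggest
        BitSet controlPool = new BitSet();     // tags the per-domain controls may pick

        int mpTag = allocate(controllerPool, low, mid);     // e.g. a multipoint path through DD
        int ppTag = allocate(controlPool, mid + 1, high);   // e.g. a point-to-point path

        System.out.println("controller tag=" + mpTag + ", control tag=" + ppTag);
        // The open question from the comment: half of each pool may sit idle, wasting labels.
    }
}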

  Changed 5 years ago by vjo

So - open ExoSM on control.exogeni.net?

  Changed 5 years ago by ibaldin

No, not yet. Waiting on some last tests. Also, we will restart the controller and network controls before reopening.

  Changed 5 years ago by ibaldin

OK, please restart the controller and net authorities. Claim SL and OSF (we will put up a warning that these don't actually work quite yet).

I think we need to make sure there are no issues with OpenStack/the NEuca plugin/OVS, particularly on NICTA, FIU and UFL, as Chris has reported problems reaching VMs there even though everything succeeded.

So let's restart and then test reachability to those sites (NICTA, UFL, FIU, but remember you can't do UFL-FIU). If things check out, we can reopen.

Paul - can you check reachability after Victor's restart? No super fancy slices - we now know the limitations.

  Changed 5 years ago by vjo

Close your slices, everybody...

  Changed 5 years ago by vjo

I see 4 test.fiu.nicta* slices...

  Changed 5 years ago by vjo

Slices still present...waiting 5 minutes, then killing...

  Changed 5 years ago by vjo

Restart complete.

  Changed 5 years ago by yxin

r6181

Requires rebuilding/redeploying the NLR site.

Sorry, one last fix in error handling before reopening:

When this race condition error happens, the NLR control doesn't close the subrequest model properly, which may cause problems in subsequent requests.

This fix doesn't affect anything else; it just removes the model in the error condition.
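
The shape of the fix, roughly; this is a sketch only, assuming a Jena-style model with a close() method (the actual NLR control code, class names and model type may differ):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

// Rough illustration of the r6181 change: release the per-request NDL model even when
// tag matching fails, so a failed request cannot poison subsequent ones.
public class SubrequestHandlerSketch {
    void handleSubrequest(String ndl) {
        Model subrequest = ModelFactory.createDefaultModel();
        try {
            // ... parse the NDL and try to honor the controller-suggested tag;
            // this may throw if the suggested tag cannot be matched (the r6172 behavior) ...
        } finally {
            subrequest.close();   // previously skipped on the error path
        }
    }
}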

  Changed 5 years ago by ibaldin

  1. Redelegate the full range of VLANs for UH
  2. Delegate SL to ExoSM; make sure ION.RDF and SLNet.RDF are correct - we should be able to test it.
  3. OSF - delegate the full range of VLANs
  4. Make a list of other racks that have split delegations - we need to eventually move to full delegation to ExoSM
  5. Let Chris test OSF, NICTA and SL reachability end-to-end

  Changed 5 years ago by pruth

Should UvA and SL have layer 2 connectivity? I can't ping dumbbells that include these endpoints.

  Changed 5 years ago by ibaldin

UvA should work. SL has been plumbed but not tested.

  Changed 5 years ago by pruth

I tried several dumbbells and have concluded the following about layer 2 connectivity:

rci - success
bbn - success
fiu - success
ufl - fail, usually active but not pingable (sometimes I2 error)
uva - success
uh - no path (rdf problem?)
nicta - vjo is testing
sl - shouldn't work yet

  Changed 5 years ago by yxin

UH has not been redeployed with the new RDF and config.xml yet.

  Changed 5 years ago by pruth

uh - success for dumbbells

  Changed 5 years ago by vjo

RCI<->UFL confirmed as not 'doing', with all slices active.

Path is:

rciNet/Domain/vlan/b4871038-aa3a-4ee6-93ab-173871725a7b/vlan Label/Tag: 1017
ben/Domain/vlan/be732560-45c8-4081-8ee9-6b298fc23f47/vlan Label/Tag: 101
nlr/Domain/vlan/bddb707f-6de9-4f51-85bc-7d3c4adab1e8/vlan Label/Tag: 101
ion/Domain/vlan/a5246b3e-b3fc-4c90-851f-1bc721d6179a/vlan Label/Tag: 275
uflNet/Domain/vlan/d34dc8c9-7fcf-469f-8286-c90f1ba63930/vlan Label/Tag: 1411

  Changed 5 years ago by vjo

RCI<->OSF "does".

  Changed 5 years ago by vjo

RCI<->UH "does" after re-configuring to grab all 10 tags.

  Changed 5 years ago by vjo

Summary of what doesn't "do":
UFL
2 VLANs at NICTA (4003, 4005 - either switch or upstream)
SL

  Changed 5 years ago by ibaldin

I believe UFL is starting to 'do' for some VLANs. Yufeng and Chris are looking into it. The other two will be dealt with after reopening.

  Changed 5 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed

Opening up. Unresolved issues have been documented in #301, #302, #297, #299 and #300.
