Ticket #380 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

Extend issues and recovery

Reported by: ibaldin Owned by: ibaldin
Priority: major Milestone:
Component: ORCA: controllers - ORCA API Version: baseline
Keywords: Cc: yxin, anirban, ibaldin

Description

This is a testing reminder. There *may* be issues with extends and recovery:

  1. Occasionally the end date of the slice gets corrupted in the model (see exception below)
  2. Need to test that extended slices actually stay extended across recovery - there was a complaint from a user whose slice expired on the original deadline after recovery, even though he thought he had extended it

Exception for issue 1:

java.lang.NullPointerException
INFO | jvm 1 | 2014/10/14 16:22:44 | at java.util.Calendar.setTime(Calendar.java:1106)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.ndl.elements.OrcaReservationTerm._setEnd(OrcaReservationTerm.java:59)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.ndl.elements.OrcaReservationTerm.setStart(OrcaReservationTerm.java:83)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.embed.workflow.RequestWorkflow.recover(RequestWorkflow.java:350)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.embed.workflow.RequestWorkflow.recover(RequestWorkflow.java:316)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlrpcControllerSlice.recover(XmlrpcControllerSlice.java:464)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlrpcOrcaState.recoverSlice(XmlrpcOrcaState.java:452)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlrpcOrcaState.recover(XmlrpcOrcaState.java:413)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlRpcController._recover(XmlRpcController.java:196)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.OrcaController.recover(OrcaController.java:62)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlRpcController.access$100(XmlRpcController.java:24)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlRpcController$ControllerContextListener.start(XmlRpcController.java:174)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlRpcController.start(XmlRpcController.java:156)
INFO | jvm 1 | 2014/10/14 16:22:44 | at orca.controllers.xmlrpc.XmlRpcController.main(XmlRpcController.java:201)
INFO | jvm 1 | 2014/10/14 16:22:44 | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO | jvm 1 | 2014/10/14 16:22:44 | at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
INFO | jvm 1 | 2014/10/14 16:22:44 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO | jvm 1 | 2014/10/14 16:22:44 | at java.lang.reflect.Method.invoke(Method.java:606)
INFO | jvm 1 | 2014/10/14 16:22:44 | at org.tanukisoftware.wrapper.WrapperSimpleApp.run(WrapperSimpleApp.java:240)
INFO | jvm 1 | 2014/10/14 16:22:44 | at java.lang.Thread.run(Thread.java:745)
INFO | jvm 1 | 2014/10/14 16:22:44 | 2014-10-14 16:22:44,900 [WrapperSimpleAppMain] ERROR controller.orca.controllers.xmlrpc.XmlrpcOrcaState - Unable to recover slice b97052ca-b908-4955-b581-8e5110e3b5ef/adamant due to: java.lang.NullPointerException
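
For reference, java.util.Calendar.setTime() dereferences the Date it is given, so this NPE points at a null end date coming out of the recovered model into OrcaReservationTerm._setEnd(). A minimal sketch of a defensive check, using simplified stand-in names rather than the actual ORCA code:

import java.util.Calendar;
import java.util.Date;

// Illustrative only, not ORCA code: the field and method are simplified stand-ins
// for the term handling in orca.ndl.elements.OrcaReservationTerm.
public class TermEndSketch {

    private Date end;

    // Defensive variant: fail with a clear message instead of letting
    // Calendar.setTime(null) throw a bare NullPointerException during recovery.
    public void setEnd(Date newEnd) {
        if (newEnd == null) {
            throw new IllegalStateException("recovered term has no end date");
        }
        Calendar cal = Calendar.getInstance();
        cal.setTime(newEnd); // this is the call that NPEs when newEnd is null
        this.end = cal.getTime();
    }

    public static void main(String[] args) {
        TermEndSketch term = new TermEndSketch();
        term.setEnd(new Date());  // fine
        // term.setEnd(null);     // would now fail with a descriptive message
    }
}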

Change History

Changed 5 years ago by anirban

  • cc ibaldin added

Yes, issue #2 is indeed what happens. On recovery, only the older (pre-extend) end time is honored for VM-type reservations, while the expiry time for VLAN-type reservations gets reset, most probably to the default expiry time.

Run a slice with expiry set to t1 -> extend to t2 -> recover -> (VM reservations are closed at t1, VLAN reservations are not closed even at t2)
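
A minimal sketch of the invariant being tested here, using stand-in types rather than the real ORCA classes: after recovery, every reservation in an extended slice should carry the extended end time t2, regardless of reservation type.

import java.util.Date;
import java.util.List;

// Stand-in types for illustration; these are not the actual ORCA interfaces.
interface Reservation {
    String getType();   // e.g. "vm" or "vlan"
    Date getEnd();
}

interface Slice {
    List<Reservation> getReservations();
}

class ExtendRecoveryCheck {
    // After recovery, every reservation should expire at the extended end t2.
    static void assertExtendedAcrossRecovery(Slice slice, Date t2) {
        for (Reservation r : slice.getReservations()) {
            if (!t2.equals(r.getEnd())) {
                throw new AssertionError(r.getType() + " reservation expires at "
                        + r.getEnd() + " instead of the extended end " + t2);
            }
        }
    }
}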

I did not see the controller exception described in issue #1, i.e. the model corruption on recovery.

Changed 5 years ago by ibaldin

In the scenario above (double recovery), did the first extend actually take place by the time recovery was triggered? In other words, did the recovery take place before t1 or between t1 and t2?

Changed 5 years ago by ibaldin

Double renew is what I meant.

Changed 5 years ago by anirban

Recovery took place before t1 in the above scenario, and there was only one renew done before recovery, not a double renew. The original slice end time was t1; it was extended to t2; then recovery was triggered before t1. After recovery, the VM reservations expired at t1 instead of t2, while the VLAN reservation stayed in the Active state beyond t2.

I am going to test the scenario where recovery is triggered between t1 and t2, and report back in this thread.

By the way, the manifest reports the correct expiry time in all cases.

Changed 5 years ago by anirban

OK, I tested the scenario where recovery happens between t1 and t2: run a slice with original expiry t1; before t1, extend to t2; between t1 and t2, do recovery. After recovery, the slice stays up until t2, when reservations start closing. All VM and VLAN reservations show in the "Closed" state, but the slice still stays in the system, i.e. one can keep querying the slice status even though everything is closed. This means the slice ID can't be reused.

Changed 5 years ago by anirban

I spoke too soon - the slice was garbage collected after a few minutes. So, when recovery is done between t1 and t2, things work as expected.

Changed 5 years ago by ibaldin

Ticket #358 may be related.

Changed 4 years ago by ibaldin

The main issue is resolved in r7092; however, a smaller issue remains: if an extend is done and the original deadline passes while the SM is down, then on recovery the replay closes the reservation before the extend is applied.
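
A conceptual sketch of that ordering problem, with stand-in types rather than ORCA's actual recovery code: if the replay applied a recovered extend before the expiry check, the reservation would not be closed at the old deadline.

import java.util.Date;

// Conceptual sketch only (stand-in types, not ORCA's recovery code).
class RecoveryReplaySketch {

    interface Reservation {
        Date getEnd();
        boolean hasPendingExtend();   // hypothetical: an extend was requested before the SM went down
        void applyPendingExtend();
        void close();
    }

    // One possible guard: apply a recovered extend before the expiry check,
    // so the reservation is not closed at the pre-extend deadline.
    static void replay(Reservation r, Date now) {
        if (r.hasPendingExtend()) {
            r.applyPendingExtend();
        }
        if (r.getEnd().before(now)) {
            r.close();
        }
    }
}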

Changed 4 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed

I think the other issue is a won't-fix for now. Closing. It is possible to modify the SM policy not to close, but the AM closes the reservation anyway.

Changed 4 years ago by ibaldin

  • status changed from closed to reopened
  • resolution fixed deleted

Another issue cropped up during a recent redeployment: SM slivers were marked failed, while on the AM side things stayed open:

Start: Wed Feb 11 00:36:18 EST 2015 End: Wed Feb 25 00:26:17 EST 2015 Requested end: Thu Mar 05 00:26:17 EST 2015

eec070de-a0f6-4fb5-82a8-ab11151641ad exo-sm

Slice: 585c2cd6-e653-4545-a534-a8042c18a626
1 ucdvmsite.vlan [ failed, nascent]
Notices: Reservation eec070de-a0f6-4fb5-82a8-ab11151641ad (Slice urn:publicid:IDN+ch.geni.net:idms+slice+idms-g22-ucd-eg) is in state [Failed,None], err=unexpected ticket failure, message=An error occurred during extend ticket for reservation #63C170B, stack=Exception stack trace:

orca.shirako.kernel.Kernel.error(Kernel.java:279)
orca.shirako.kernel.Kernel.extendTicket(Kernel.java:427)
orca.shirako.kernel.KernelWrapper.extendTicket(KernelWrapper.java:436)
orca.shirako.core.ServiceManager.bid(ServiceManager.java:165)
orca.shirako.core.ServiceManager.tickHandler(ServiceManager.java:415)
orca.shirako.core.Actor.actorTick(Actor.java:428)
orca.shirako.core.Actor.access$000(Actor.java:60)
orca.shirako.core.Actor$1.process(Actor.java:333)
orca.shirako.core.Actor.actorMain(Actor.java:377)
orca.shirako.core.Actor$4.run(Actor.java:1018)
java.lang.Thread.run(Thread.java:745)

Changed 4 years ago by ibaldin

Also another:

At the SL rack, during the restart-with-recovery for the recent certificate upgrade, the following exception was encountered.

2015-02-22 19:54:48,723 [sl-sm] ERROR orca.sl-sm - An error occurred during extend ticket for reservation #C5353F0D
java.lang.RuntimeException: The reservation state prevents it from extending its ticket.

at orca.shirako.kernel.Kernel.extendTicket(Kernel.java:416)
at orca.shirako.kernel.KernelWrapper.extendTicket(KernelWrapper.java:436)
at orca.shirako.core.ServiceManager.bid(ServiceManager.java:165)
at orca.shirako.core.ServiceManager.tickHandler(ServiceManager.java:415)
at orca.shirako.core.Actor.actorTick(Actor.java:428)
at orca.shirako.core.Actor.access$000(Actor.java:60)
at orca.shirako.core.Actor$1.process(Actor.java:333)
at orca.shirako.core.Actor.actorMain(Actor.java:377)
at orca.shirako.core.Actor$4.run(Actor.java:1018)
at java.lang.Thread.run(Unknown Source)

The reservation was moved from the active to the failed state as a result of the above.

Changed 4 years ago by ibaldin

The last two error messages are the same issue, which happens if you recover an extended reservation twice.

Changed 4 years ago by ibaldin

r7146 fixes it by disabling the check for the renewable flag in the canRenew method; however, it is not clear this is the best solution.

The problem was that setting renewable to true temporarily had the side effect of renewable=true being saved in the database, so things failed on the second recovery because isExtendingTicket was already the pending state. Somehow the recovery code no longer erases the extending pending state, and that causes the problem in the call to extend.
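
A conceptual sketch of that failure mode, with a stand-in enum and fields rather than the real Shirako kernel types: the temporarily-set renewable flag gets persisted, and a stale ExtendingTicket pending state that survives recovery makes the next extend fail.

// Conceptual sketch only; not the actual Shirako kernel code.
class PendingStateSketch {

    enum Pending { NONE, EXTENDING_TICKET }

    boolean renewable;               // ends up saved to the database, including the "temporary" true
    Pending pending = Pending.NONE;  // a stale EXTENDING_TICKET here blocks a second extend

    boolean canRenew() {
        // r7146 worked around the problem by effectively dropping the renewable check here.
        return renewable && pending == Pending.NONE;
    }

    void recover() {
        // If recovery does not clear a stale pending state, the reservation still
        // looks busy and the extend attempted after the second recovery fails.
        pending = Pending.NONE;
    }
}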

Changed 4 years ago by ibaldin

  • status changed from reopened to closed
  • resolution set to fixed

Fixed in r7212 by undoing r7146 and instead making the extending phase very short, so it doesn't fall across recovery boundaries.
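
A conceptual illustration of that idea, not the actual r7212 change: hold the extending pending state only around the extend request itself, so a recovery snapshot is unlikely to capture it.

// Conceptual illustration only, not the r7212 diff.
class ShortExtendWindowSketch {

    enum Pending { NONE, EXTENDING_TICKET }

    private Pending pending = Pending.NONE;

    void extendTicket(Runnable issueExtend) {
        pending = Pending.EXTENDING_TICKET;  // enter the phase only around the actual request
        try {
            issueExtend.run();               // hypothetical stand-in for issuing the extend
        } finally {
            pending = Pending.NONE;          // leave the phase right away instead of across restarts
        }
    }
}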
