Ticket #210 (closed defect: fixed)

Opened 8 years ago

Last modified 8 years ago

Possible race condition in SM

Reported by: ibaldin
Owned by: aydan
Priority: major
Milestone: Camano 3.1
Component: ORCA: Shirako Core
Version: Bella 2.1
Keywords:
Cc:


Happens with Asic's controller and very occasionally with XMLRPC controller:

Change History

Changed 8 years ago by ibaldin

There is an apparent close bug out there that should be fixed...I'm not
aware that it has been fixed. It might not be on the bug tracker.

This is the problem of crashes in the state machine, driven by Asic's SM
controller, which does dynamic adaptation. There are close operations
that sit in a wait state for some time, and then eventually fail, and
then the AM receives a lease renew request while the close is still in
progress, and then there is an apparent race of the close completing
just as the renew executes...or something.

I think Victor has been trying to get some more log info in place that
would shed light on what is going on.

I think there will be a simple fix, at least as a band-aid, if my
reading of the problem is correct. E.g., the SM state machine should
not be renewing a lease that has a close in progress, and the AM should
not be accepting a renew that has a close in progress.
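
The band-aid described above could look roughly like the following. This is only a sketch under assumptions: LeaseGuard and its methods are hypothetical names made up for illustration and are not part of the Shirako code. The idea is that the SM would consult such a guard before issuing a renew, and the AM would consult it before accepting one.

```java
// Hypothetical sketch: LeaseGuard is NOT an ORCA/Shirako class.
// It only illustrates "no renew while a close is in progress".
final class LeaseGuard {
    private boolean closeInProgress = false;
    private boolean renewInProgress = false;

    // Record that a close has been issued and has not completed yet.
    synchronized void beginClose() {
        closeInProgress = true;
    }

    // Returns true only if a renew may start now. Once a close has begun,
    // every later renew attempt is refused.
    synchronized boolean tryBeginRenew() {
        if (closeInProgress || renewInProgress) {
            return false;
        }
        renewInProgress = true;
        return true;
    }

    // Record that the outstanding renew has finished.
    synchronized void finishRenew() {
        renewInProgress = false;
    }
}
```

Both checks go through one synchronized object so that the "is a close in progress?" test and the "start a renew" step cannot interleave with beginClose().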


Changed 8 years ago by ibaldin

Something like this:

Target join: finished Tue Jul 26 18:36:44 EDT 2011 (139649ms)

Total time: 2 minutes 19 seconds
ticket(): ticket = Ticket [units = 1 oldUnits = 0]
Exception in thread "duke-vm-site3" java.lang.AssertionError

at orca.shirako.kernel.AuthorityReservation.serviceExtendLease(AuthorityReservation.java:724)
at orca.shirako.kernel.AuthorityReservation.serviceProbe(AuthorityReservation.java:758)
at orca.shirako.kernel.Kernel.probePending(Kernel.java:729)
at orca.shirako.kernel.Kernel.tick(Kernel.java:1168)
at orca.shirako.kernel.KernelWrapper.tick(KernelWrapper.java:731)
at orca.shirako.core.Actor.externalTick(Actor.java:365)
at orca.shirako.kernel.RealtimeTick$TickWrapper.run(RealtimeTick.java:139)

Changed 8 years ago by aydan

One possible source of trouble is the nothingPending predicate on the AM side: it does not consider the mustClose flag. The mustClose flag looks like a flag from a while back, but it goes deep, so I will not modify it. What I did instead was to include the mustClose flag in the evaluation of the nothingPending predicate. This way, no operation can be issued on a reservation that is about to be closed.
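
For readers following along, the change amounts to something like the sketch below. It is an illustration only: ReservationFlags, PendingState and the method names other than nothingPending and mustClose are assumptions and do not reproduce the actual AuthorityReservation code.

```java
// Hypothetical sketch: not the real Shirako reservation class.
final class ReservationFlags {
    enum PendingState { NONE, EXTENDING_LEASE, CLOSING }

    private PendingState pendingState = PendingState.NONE;
    private boolean mustClose = false;

    void setPendingState(PendingState state) {
        pendingState = state;
    }

    // Mark the reservation as about to be closed.
    void setMustClose() {
        mustClose = true;
    }

    // Old behaviour as described in the comment: mustClose is ignored.
    boolean nothingPendingIgnoringMustClose() {
        return pendingState == PendingState.NONE;
    }

    // New behaviour: a reservation that must close never reports
    // "nothing pending", so no further operation is issued on it.
    boolean nothingPending() {
        return pendingState == PendingState.NONE && !mustClose;
    }
}
```

With this shape, callers that already gate operations on nothingPending() pick up the mustClose check without any change at the call sites.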

The state machines need some cleanup and that should be part of the work needed to make the actor code single threaded.

Changed 8 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed