Ticket #358 (closed defect: duplicate)

Opened 5 years ago

Last modified 4 years ago

SM recovery loop

Reported by: ibaldin Owned by: ibaldin
Priority: major Milestone:
Component: ORCA: Shirako Core Version: baseline
Keywords: Cc:

Description

I have now hit this a couple of times: recovering the SM throws an NPE in the function below, in ReservationClient. Note the check for p.getReservation() returning null, which I added for now. I don't believe this is the solution, only an NPE fix.

The recovery scenario is as follows: submit a slice via the controller, retrieve a manifest to confirm that the reservations are ticketed, then immediately shut down the SM and restart it (before the reservations go active). After that the SM enters some sort of loop, emitting a large amount of logging along with the NPEs described above.

After adding this check, the code starts throwing an NPE elsewhere, where it tries to get the unit properties of the predecessor (which are also null).

Here is a link to the log files I collected SM-side (no errors detected elsewhere): https://www.dropbox.com/s/5261ojxn6szcpb5/sm-recovery-logs.tgz?dl=0

Here is the function that first reported the problem:

/**
 * Redeem predicate: invoked internally to determine if the reservation
 * should be redeemed. This gives subclasses an opportunity to sequence install
 * configuration actions at the authority side.
 * <p>
 * If false, the reservation enters a "BlockedRedeem" sub-state until a
 * subsequent approveRedeem returns true. When true, the reservation can
 * manipulate the current reservation's property lists and attributes to
 * facilitate configuration. Note that approveRedeem may be polled multiple
 * times, and should be idempotent.
 * </p>
 *
 * @return true if the reservation should be redeemed, false if it must wait
 */

protected boolean approveRedeem() throws Exception {
    boolean approved = true;

    for (PredecessorState p : redeemPredecessors.values()) {
        // somehow in certain types of recovery (ticketed reservations on SM)
        // we're ending up with getReservation() returning null /ib 08/27/14
        if (p.getReservation() == null) {
            logger.error("redeem predecessor reservation does not have a reservation object. ignoring it");
            continue;
        }

        if (p.getReservation().isFailed() || p.getReservation().isClosed()) {
            logger.error("redeem predecessor reservation is in a terminal state. ignoring it: "
                    + p.getReservation());
            continue;
        }

        // FIXME: the incoming resources are not applied to the reservation
        // until the reservation transitions into the Joining state. We must
        // use isActiveJoined to make sure that prepareRedeem is going to see
        // the units inside the predecessor reservation.
        if (!p.getReservation().isActiveJoined()) {
            approved = false;
            break;
        }
    }

    if (approved) {
        prepareRedeem();
    }

    return approved;
}
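
To illustrate why the null check only moves the failure, here is a minimal, self-contained sketch. The classes Reservation and PredecessorState in it are hypothetical, simplified stand-ins, not the real ORCA types, and approveBlockingOnUnresolved is an assumed alternative policy, not a confirmed fix. It contrasts skipping an unresolved predecessor (what the workaround does) with treating it as not yet redeemable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model for illustration only; none of these are the real ORCA classes.
final class RedeemRecoverySketch {

    /** Simplified predecessor reservation: just the three flags the predicate reads. */
    static final class Reservation {
        final boolean failed;
        final boolean closed;
        final boolean activeJoined;

        Reservation(boolean failed, boolean closed, boolean activeJoined) {
            this.failed = failed;
            this.closed = closed;
            this.activeJoined = activeJoined;
        }
    }

    /** Simplified predecessor state; reservation is null when recovery did not restore it. */
    static final class PredecessorState {
        final Reservation reservation;

        PredecessorState(Reservation reservation) {
            this.reservation = reservation;
        }
    }

    /** Mirrors the workaround in this ticket: an unresolved predecessor is ignored. */
    static boolean approveSkippingUnresolved(Map<String, PredecessorState> predecessors) {
        for (PredecessorState p : predecessors.values()) {
            Reservation r = p.reservation;
            if (r == null) {
                continue; // ignored, so it can no longer hold the redeem back
            }
            if (r.failed || r.closed) {
                continue; // terminal predecessors are ignored, as in the original
            }
            if (!r.activeJoined) {
                return false;
            }
        }
        return true;
    }

    /** Assumed alternative: an unresolved predecessor blocks the redeem until it is restored. */
    static boolean approveBlockingOnUnresolved(Map<String, PredecessorState> predecessors) {
        for (PredecessorState p : predecessors.values()) {
            Reservation r = p.reservation;
            if (r == null) {
                return false; // not restored yet: stay blocked instead of proceeding
            }
            if (r.failed || r.closed) {
                continue;
            }
            if (!r.activeJoined) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, PredecessorState> predecessors = new LinkedHashMap<>();
        predecessors.put("pred-1", new PredecessorState(null)); // lost during recovery

        System.out.println("skip:  " + approveSkippingUnresolved(predecessors));
        System.out.println("block: " + approveBlockingOnUnresolved(predecessors));
    }
}
```

With a single predecessor whose reservation was not restored, the skipping variant approves the redeem, so a prepareRedeem-style step would go on to read the missing predecessor's units, which matches the second NPE described above. The blocking variant keeps the reservation waiting. Whether blocking is appropriate depends on whether recovery eventually re-links the predecessor, which this ticket does not establish.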

Change History

Changed 4 years ago by ibaldin

  • status changed from new to closed
  • resolution set to duplicate

Moved to GitHub.
