Ticket #388 (new defect)

Opened 4 years ago

Last modified 4 years ago

DAR local cache in actor does not appear to update when status changes in DAR

Reported by: vjo
Owned by: claris
Priority: major
Milestone:
Component: External: Actor Registry
Version: baseline
Keywords:
Cc: ibaldin

Description

During the most recent maintenance, I had to verify a new set of actors in DAR (nicta-vm-am, nicta-net-am, nicta-broker, nicta-sm).

When I started these actors, ndl-broker was running.

I then verified these actors, but could not get nicta-sm and nicta-broker to recognize one another until I restarted all of the NICTA actors.

I then attempted to claim the resources delegated from nicta-vm-am and nicta-net-am to ndl-broker, but was not able to do so, despite trying for over an hour.

After looking at the code, I decided to restart ndl-broker with recovery.
As a result of the restart, ndl-broker was forced to re-build its local cache with the new information from DAR, and I was finally able to claim the delegated resources.

The bug being reported: the actor-local DAR cache needs to properly update for changes in status from DAR, without requiring a restart.

Change History

Changed 4 years ago by claris

Hey Victor,

The only change that triggers an update in the local cache (I need to confirm this, but I feel pretty confident about it) is the remote DAR reporting the existence of an actor that is verified and is not yet in the local cache.
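
To make that rule concrete, here is a minimal Python sketch. It is not the actual ORCA code: `merge_remote_actors`, the dict-based cache, and the `verified` and `cert` fields are hypothetical names chosen for illustration. It shows an add-only merge, where a verified actor is inserted only if it is absent from the local cache, so an entry that already exists is never refreshed.

```python
def merge_remote_actors(local_cache, remote_actors):
    """Add-only merge: insert verified remote actors missing from the cache.

    local_cache: dict mapping actor name -> entry dict (hypothetical layout).
    remote_actors: iterable of entry dicts with 'name', 'verified', 'cert'.
    Returns the list of actor names that were added.
    """
    added = []
    for actor in remote_actors:
        name = actor["name"]
        if not actor.get("verified", False):
            continue  # unverified actors are never cached
        if name in local_cache:
            continue  # existing entries are left untouched, even if stale
        local_cache[name] = dict(actor)
        added.append(name)
    return added


# Illustrative data only; the certificate strings are placeholders.
cache = {"nicta-sm": {"name": "nicta-sm", "verified": True, "cert": "old"}}
remote = [
    {"name": "nicta-sm", "verified": True, "cert": "new"},
    {"name": "nicta-broker", "verified": True, "cert": "abc"},
    {"name": "nicta-vm-am", "verified": False, "cert": "def"},
]
added = merge_remote_actors(cache, remote)
```

Under this rule the pre-existing `nicta-sm` entry keeps its old certificate no matter what the remote registry reports, which is one way a stale entry could survive until a restart.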

Is the bug being reported that this is not happening? Or is it that there is a type of change in the remote DAR that is not being detected, or that is not triggering the right action in the local cache? If the former, I doubt it, because we would have hit this bug much earlier (it is basic functionality). If the latter, please elaborate.

Also, the last time we had this exact incident, it was because the thread that fetches changes from the remote DAR had crashed in a particular actor. We found that the thread was dying because I was not handling all the exceptions properly (there was a connection timeout exception that was being ignored). I fixed that. Did you get a chance to check whether the thread in ndl-broker was running? If not, I can check. Please remind me where ndl-broker is sitting so I can log in to the host.
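
For reference, the general shape of a polling loop that survives a failed cycle looks roughly like the Python sketch below. This is an illustration of the pattern, not the ORCA implementation: `poll_forever`, `start_poller`, `fetch_changes`, `apply_changes`, and the 60-second default interval are all assumed names and values.

```python
import logging
import threading

log = logging.getLogger("dar-poller")


def poll_forever(fetch_changes, apply_changes, stop_event, interval_s=60.0):
    """Poll until stop_event is set; a failed cycle never ends the loop."""
    while not stop_event.is_set():
        try:
            apply_changes(fetch_changes())
        except Exception:
            # Covers timeouts and any other error raised during one cycle.
            # The error is logged with its traceback and the loop continues.
            log.exception("DAR poll cycle failed; retrying after the interval")
        # Sleeps for interval_s, but wakes early if stop_event is set.
        stop_event.wait(interval_s)


def start_poller(fetch_changes, apply_changes, interval_s=60.0):
    """Run poll_forever in a daemon thread; returns (thread, stop_event)."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=poll_forever,
        args=(fetch_changes, apply_changes, stop_event, interval_s),
        daemon=True,
    )
    thread.start()
    return thread, stop_event
```

Catching at the top of each cycle and logging the traceback keeps the thread alive and also leaves evidence in the logs, which makes it easier to tell a dead poller from one that is running but failing.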

Thanks,
Claris

Changed 4 years ago by vjo

So - to get it out of the way - I'm on PTO through the rest of the year, so my responses may be delayed.

Now - as to the issue:

I believe that this has been happening for some time, though I have not had time to properly investigate the issue, until now.

I can report on the symptoms I saw, and the actions I took to resolve. I am *speculating* as to what was causing the issue. If the local cache is only updated when the actor is verified in the DAR, then something other than what I speculated is going on.

The symptoms:
1) I was unable to claim resources delegated to ndl-broker from nicta-vm-am and nicta-net-am.
2) I checked the logs at ndl-broker. I was seeing errors containing the following text:
establishEdgePrivate(): Could not decode certificate for actor:

The actions I took:
1) I ensured that the NICTA actors were verified at DAR.
2) I attempted to claim several times from ndl-broker, over the course of an hour, while I waited for ndl-broker to decide that the NICTA actors were valid.
3) After trying and failing for an hour, I decided to see if the ORCA keystore files at ndl-broker contained entries for the NICTA actors. They did.
4) I deleted the entries for the NICTA actors in the ndl-broker keystores, and tried claiming several times, without success.
5) I finally decided to risk restarting ndl-broker with recovery. Upon restart with recovery, I was finally able to successfully claim the resources delegated to ndl-broker by the NICTA actors.

As far as I can tell, the thread fetching changes never died - it was constantly logging about not being able to decode the certificates for the NICTA actors.

There's not much to check, at this point, since the logs have rolled and the actor has been restarted.
For the future, ndl-broker is defined in /etc/orca/am+broker-12080 on geni.renci.org.

I have two theories as to what caused this issue:
1) ndl-broker somehow inserted entries into its local cache for the unverified NICTA actors.
Based on what you have said, this is unlikely.

2) ndl-broker's local cache was polluted by the presence of entries for the NICTA actors in the keystore files.
Once the local cache was polluted by the pre-existing entries, there was no way for the update from DAR to fix it.
This may be more likely.

Changed 4 years ago by claris

Ok. This fix is easy. At some point a decision to give precedence to old entries in the cache was made. I can easily change that to give precedence to the new value. Now, my concern is that I may be missing some consideration taken into account before.
Ilya, does this ring a bell to you?

Changed 4 years ago by ibaldin

I think this depends on the replacement policy. The important consideration is we have to be absolutely sure it is the same actor that is trying to change the entry and not a new actor masquerading as the old one (due to misconfiguration or malice). If this condition is satisfied, then it should be fine. Which part of the code are you referring to? Can you point to file and line numbers?
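
As an illustration of such a policy, here is a minimal Python sketch. It is an assumption-laden example, not the code under discussion: `replace_entry` is a hypothetical name, and it assumes each entry carries a stable `guid` that identifies the actor independently of its certificate, plus the registry's `verified` flag. How identity is really established is exactly the open question above.

```python
def replace_entry(local_cache, remote_entry):
    """Prefer the remote value, but only for the same, verified actor.

    Assumes (hypothetically) that 'guid' is a stable actor identity.
    Returns "added", "replaced", "unchanged", or "rejected".
    """
    if not remote_entry.get("verified", False):
        return "rejected"  # never cache or update from an unverified entry
    name = remote_entry["name"]
    current = local_cache.get(name)
    if current is None:
        local_cache[name] = dict(remote_entry)
        return "added"
    if current["guid"] != remote_entry["guid"]:
        # Same name, different identity: misconfiguration or masquerade.
        return "rejected"
    if current == remote_entry:
        return "unchanged"
    local_cache[name] = dict(remote_entry)  # new value takes precedence
    return "replaced"
```

With this shape, a changed certificate for a known actor replaces the old entry, while an entry that reuses the name under a different identity is refused and left for an operator to investigate.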

Changed 4 years ago by claris

The fix is ready to check in.

Changed 4 years ago by ibaldin

Ping - what is the decision on when to check in and deploy this?

Changed 4 years ago by ibaldin

If this has been checked in, please indicate the revision number.
