Ticket #247 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

Rack sm VM reservations sometimes get stuck in 'Closing'

Reported by: ibaldin Owned by: ibaldin
Priority: major Milestone: Dungeness 4.0
Component: ORCA: Shirako Core Version: baseline
Keywords: Cc: aydan.yumerefendi@…

Description

VMs properly go away, no errors in handler or AM, however SM continues to think they are closing and broker thinks they are ticketed. Requires broker restart to clear this out.

It is intermittent.

Attachments

am.log (0.6 MB) - added by anirban 7 years ago.
AM log
sm.log (234.7 kB) - added by anirban 7 years ago.
broker.log (459.5 kB) - added by anirban 7 years ago.
Broker log
am.catalina.out (11.8 kB) - added by anirban 7 years ago.
AM catalina.out
sm.catalina.out (12.8 kB) - added by anirban 7 years ago.
SM catalina.out
broker.catalina.out (3.3 kB) - added by anirban 7 years ago.
Broker catalina.out

Change History

Changed 7 years ago by ibaldin

Definitely related to GENI AM API - no suck errors with Flukes. Possibly culprit is XMLSEC validation code because catalina.out contains cryptic statements

Warning: validation was turned on but an org.xml.sax.ErrorHandler? was not
set, which is probably not what is desired. Parser will use a default
ErrorHandler? to print the first 10 errors. Please call
the 'setErrorHandler' method to fix this.
Error: URI=null Line=1: cvc-elt.1: Cannot find the declaration of element 'Signature'.
List of strings: [urn:publicid:IDN+pgeni.gpolab.bbn.com+user+lnevers, lnevers@pgeni.gpolab.bbn.com]

Changed 7 years ago by ibaldin

Also seen in ExoSM...

Changed 7 years ago by anirban

Sometimes, this is what I see in the orca log, which points to exceptions during credential verification.

2012-05-03 15:40:44,292 [http-bio-11443-exec-6] DEBUG orca.bbn-sm - GENI AM2 XmlrpcHandler? constructor called
2012-05-03 15:40:44,292 [http-bio-11443-exec-6] INFO orca.bbn-sm - GENI credential verification is turned ON
2012-05-03 15:40:44,292 [http-bio-11443-exec-6] INFO orca.bbn-sm - Using NDL-RSpec converters at http://geni.renci.org:12080/ndl-conversion/
2012-05-03 15:40:44,292 [http-bio-11443-exec-6] INFO orca.bbn-sm - GENI AM v2 ListResources?() invoked
2012-05-03 15:40:44,292 [http-bio-11443-exec-6] INFO orca.bbn-sm - validateGeniCredential(): Client presented a certificate with subject: EMAILADDRESS=lnevers@…, CN=23c6477d-d231-11df-92ee-000c29f89f7b, OU=bbn-pgeni.lnevers, O=GENI Project Office, ST=Massachusetts, C=US
2012-05-03 15:40:44,293 [http-bio-11443-exec-6] ERROR orca.controllers.xmlrpc.x509util.CredentialValidator? - Content is not allowed in prolog.
2012-05-03 15:40:44,293 [http-bio-11443-exec-6] DEBUG orca.bbn-sm - Exception parsing credential xml
2012-05-03 15:40:44,293 [http-bio-11443-exec-6] ERROR orca.bbn-sm - GENI List Resources: Credential Exception: javax.security.auth.login.CredentialException?: No credential was found with appropriate privileges.
2012-05-03 15:40:44,327 [http-bio-11443-exec-4] DEBUG orca - Client 128.89.91.19 is using secure communications

Changed 7 years ago by ibaldin

The problem has nothing to do with verification.

The problem occurs when a VM reservation that is in redeeming gets quickly closed.

Here are relevant logs for the reservation (D914B1E2) that gets stuck in closing:

2012-05-03 19:30:06,870 [bbn-sm] DEBUG orca.bbn-sm - externalTick(1320) actor bbn-sm enter
2012-05-03 19:30:06,871 [bbn-sm] INFO orca.bbn-sm - Processing updateLease from <bbn-vm-am>: res: D914B1E2 slice: urn:publicid:IDN+emulab.net+slice+test-ilia-3 Na
scent None rset: units 1 concrete 1 orca.shirako.core.UnitSet?@4fc23996 term=[0:0:0]
2012-05-03 19:30:06,871 [bbn-sm] INFO orca.bbn-sm - inbound lease is: orca.shirako.core.UnitSet?@4fc23996
2012-05-03 19:30:06,871 [bbn-sm] DEBUG orca.bbn-sm - updateLease: Incoming termterm=[0:0:0]
2012-05-03 19:30:06,871 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 lease arrival blocked join transition for joinstate: NoJoin?->BlockedJoin?
2012-05-03 19:30:06,871 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 lease arrival blocked join transition: Ticketed->Active, Redeeming->None
2012-05-03 19:30:06,871 [bbn-sm] INFO orca.bbn-sm - Received updateLease for a reservation closed in the Redeeming state. Issuing close.
2012-05-03 19:30:06,871 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 close transition: Active->CloseWait?, None->None
2012-05-03 19:30:06,871 [bbn-sm] INFO orca.bbn-vm-am - Outbound close request from <bbn-sm>: res: D914B1E2 slice: urn:publicid:IDN+emulab.net+slice+test-ilia-3 Cl
oseWait None rset: units 1 concrete 1 Ticket [units = 1 oldUnits = 1 Slice=urn:publicid:IDN+emulab.net+slice+test-ilia-3] term=[0:0:0]
2012-05-03 19:30:06,874 [RPC] DEBUG orca - Performing RPC: type=Close to:bbn-vm-am
2012-05-03 19:30:06,901 [Finalizer] INFO orca.bbn-sm - Disconnecting from XMPP server
2012-05-03 19:30:06,902 [bbn-sm] INFO orca.bbn-sm - Updating reservation D914B1E2 in slice urn:publicid:IDN+emulab.net+slice+test-ilia-3(7E183BFA) from [Active, N
one, BlockedJoin?] to [CloseWait?, None, BlockedJoin?]
2012-05-03 19:30:06,931 [RPC] DEBUG orca - Completed RPC: type=Close to:bbn-vm-am
2012-05-03 19:30:06,976 [bbn-sm] DEBUG orca.bbn-sm - bidForSources: cycle 1320 bids 0
2012-05-03 19:30:06,976 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 unblocked join transition for joinstate: BlockedJoin?->Joining
2012-05-03 19:30:06,976 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 unblocked join transition: CloseWait?->CloseWait?, None->None
2012-05-03 19:30:06,998 [bbn-sm] INFO orca.bbn-sm - Updating reservation D914B1E2 in slice urn:publicid:IDN+emulab.net+slice+test-ilia-3(7E183BFA) from [CloseWait?
, None, Joining] to [CloseWait?, None, Joining]
2012-05-03 19:30:07,118 [bbn-sm] DEBUG orca.bbn-sm - externalTick(1320) actor bbn-sm exit
2012-05-03 19:30:07,118 [Thread-17] DEBUG orca.bbn-sm - Executing ant driver: class orca.shirako.plugins.config.AntConfig?$RunConfig?
2012-05-03 19:30:07,248 [Thread-17] DEBUG orca.bbn-sm - join.ms=130
2012-05-03 19:30:07,870 [bbn-sm] DEBUG orca.bbn-sm - externalTick(1321) actor bbn-sm enter
2012-05-03 19:30:07,871 [bbn-sm] DEBUG orca.bbn-sm - (join complete) incoming (2) local: (2)
2012-05-03 19:30:07,891 [bbn-sm] DEBUG orca.bbn-sm - bidForSources: cycle 1321 bids 0
2012-05-03 19:30:07,891 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 join complete transition for joinstate: Joining->NoJoin?
2012-05-03 19:30:07,891 [bbn-sm] DEBUG orca.bbn-sm - Reservation #D914B1E2 join complete transition: CloseWait?->CloseWait?, None->None
2012-05-03 19:30:07,913 [bbn-sm] INFO orca.bbn-sm - Updating reservation D914B1E2 in slice urn:publicid:IDN+emulab.net+slice+test-ilia-3(7E183BFA) from [CloseWait?
, None, NoJoin?] to [CloseWait?, None, NoJoin?]
2012-05-03 19:30:07,982 [bbn-sm] DEBUG orca.bbn-sm - externalTick(1321) actor bbn-sm exit
2012-05-03 19:30:08,870 [bbn-sm] DEBUG orca.bbn-sm - externalTick(1322) actor bbn-sm enter
2012-05-03 19:30:08,871 [bbn-sm] DEBUG orca.bbn-sm - bidForSources: cycle 1322 bids 0
2012-05-03 19:30:08,871 [bbn-sm] DEBUG orca.bbn-sm - No need to update reservation D914B1E2 in slice urn:publicid:IDN+emulab.net+slice+test-ilia-3(7E183BFA)
2012-05-03 19:30:08,871 [bbn-sm] DEBUG orca.bbn-sm - externalTick(1322) actor bbn-sm exit

Changed 7 years ago by ibaldin

Aydan says:

From what I can see the sequence is:

  • Someone closed the reservation before redeem completed
  • The reservation was marked for closing and when the lease came back it transitioned to CloseWait? and sent the close request to the site.

- For some reason the reservation also executed the join scripts on the sm side, which is a bug, but not likely the one causing the reservation to be stuck in CloseWait?.
- I do not see the close reply from the AM.

The AM seems to be running in a different JVM. Do you have its logs?

I think the issue here is that the AM did not reply to the close request and the reservation will remain in CloseWait? until the reply comes back. The reservation state machine does not implement timeouts in this case, and it probably should.

So to diagnose this fully, I need the logs from the site as well.

Changed 7 years ago by anirban

AM log

Changed 7 years ago by anirban

Changed 7 years ago by anirban

Broker log

Changed 7 years ago by anirban

AM catalina.out

Changed 7 years ago by anirban

SM catalina.out

Changed 7 years ago by anirban

Broker catalina.out

Changed 7 years ago by anirban

Attached logs.

Changed 7 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed

Appears fixed as of r4436

Note: See TracTickets for help on using tickets.