Ticket #349 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

controller recovery - slice deletion

Reported by: anirban Owned by: ibaldin
Priority: major Milestone:
Component: Don't Know Version: baseline
Keywords: Cc: yxin, anirban, pruth

Description

Intermittently unable to delete slices after recovery. This can happen after SM recovery and/or controller recovery. I see "Command Delete cannot transition from state NULL" even though I can query the manifest in Flukes and see the slivers in the "Active" state. I saw this mainly for slices that had been up for some time (~1 hour?). Once this happens, the slice can no longer be deleted because of the exception.

INFO | jvm 1 | 2014/08/15 21:15:48 | 2014-08-15 21:15:48,450 [qtp2062835395-26] DEBUG controller.orca.controllers.xmlrpc.XmlrpcOrcaState - This slice ID=db72ecb7-edd3-4592-acb7-4f6e01256cbc for urn=ani-pb-ng-2
INFO | jvm 1 | 2014/08/15 21:15:48 | 2014-08-15 21:15:48,450 [qtp2062835395-26] ERROR controller.orca.controllers.xmlrpc.OrcaXmlrpcHandler - deleteSlice(): Exception encountered: Command Delete cannot transition from state NULL
INFO | jvm 1 | 2014/08/15 21:15:48 | orca.controllers.xmlrpc.SliceStateMachine$SliceTransitionException: Command Delete cannot transition from state NULL
INFO | jvm 1 | 2014/08/15 21:15:48 | at orca.controllers.xmlrpc.SliceStateMachine.transitionSlice(SliceStateMachine.java:163)
INFO | jvm 1 | 2014/08/15 21:15:48 | at orca.controllers.xmlrpc.OrcaXmlrpcHandler.deleteSlice(OrcaXmlrpcHandler.java:745)
INFO | jvm 1 | 2014/08/15 21:15:48 | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO | jvm 1 | 2014/08/15 21:15:48 | at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
INFO | jvm 1 | 2014/08/15 21:15:48 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
INFO | jvm 1 | 2014/08/15 21:15:48 | at java.lang.reflect.Method.invoke(Unknown Source)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.server.ReflectiveXmlRpcHandler.invoke(ReflectiveXmlRpcHandler.java:115)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.server.ReflectiveXmlRpcHandler.execute(ReflectiveXmlRpcHandler.java:106)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.server.XmlRpcServerWorker.execute(XmlRpcServerWorker.java:46)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.server.XmlRpcServer.execute(XmlRpcServer.java:86)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.server.XmlRpcStreamServer.execute(XmlRpcStreamServer.java:200)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.webserver.XmlRpcServletServer.execute(XmlRpcServletServer.java:112)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.apache.xmlrpc.webserver.XmlRpcServlet.doPost(XmlRpcServlet.java:196)
INFO | jvm 1 | 2014/08/15 21:15:48 | at orca.controllers.OrcaXmlrpcServlet.doPost(OrcaXmlrpcServlet.java:151)
INFO | jvm 1 | 2014/08/15 21:15:48 | at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
INFO | jvm 1 | 2014/08/15 21:15:48 | at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:527)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:423)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:493)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:926)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:358)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:860)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.Server.handle(Server.java:335)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:588)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:1046)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:764)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:217)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:418)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:476)
INFO | jvm 1 | 2014/08/15 21:15:48 | at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
INFO | jvm 1 | 2014/08/15 21:15:48 | at java.lang.Thread.run(Unknown Source)
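
For context on where this comes from: the stack trace shows deleteSlice() calling SliceStateMachine.transitionSlice(), which rejects the Delete command because the recovered slice's state is still NULL even though its slivers are Active. Below is a minimal, purely illustrative Java sketch of the failure and of one possible defensive fix (re-deriving the slice state from the slivers before transitioning); the names SliceState, SliceStateMachineSketch and anySliverActive are hypothetical and not taken from the ORCA source.

// Hypothetical sketch only -- not the actual ORCA controller code.
// Illustrates re-deriving a slice state that was lost across recovery
// before attempting the Delete transition.
enum SliceState { NULL, ACTIVE, DEAD }

class SliceStateMachineSketch {
    private SliceState state = SliceState.NULL;

    // In the failing case, state is still NULL after recovery even though
    // the slivers are Active, so Delete is rejected.
    void transitionSlice(String command) {
        if (state == SliceState.NULL) {
            throw new IllegalStateException(
                "Command " + command + " cannot transition from state NULL");
        }
        // ... perform the transition ...
    }

    // Possible defensive fix: reconstruct the state from the reservations
    // (slivers) before applying the transition.
    void deleteSlice(boolean anySliverActive) {
        if (state == SliceState.NULL) {
            state = anySliverActive ? SliceState.ACTIVE : SliceState.DEAD;
        }
        transitionSlice("Delete");
    }
}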

A similar problem with the same exception showed up in the following scenario:

After restarting the SM, an omni sliverstatus call on an old slice reports the overall slice status as "unknown", but the individual slivers show the correct "ready" state. In Flukes, the slivers show up as Active, which is correct.

Result Summary: Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc expires on 2014-08-17 21:13:17 UTC

Slice omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc has overall SliverStatus: unknown.

Resources in slice omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc expire at 2014-08-17T21:13:17 UTC.
Returned status of slivers on 1 of 1 possible aggregates.

On a renew, it throws:

Result Summary: Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc expires on 2014-08-17 21:13:17 UTC
Failed to renew sliver urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc at unspecified_AM_URN (got result 'None'). Error from Aggregate: code 1: ERROR: unable to modify slice that is not yet stable, try again later.

This is because of the same underlying exception.

INFO | jvm 1 | 2014/08/15 21:15:48 | 2014-08-15 21:15:48,450 [qtp2062835395-26] DEBUG controller.orca.controllers.xmlrpc.XmlrpcOrcaState - This slice ID=db72ecb7-edd3-4592-acb7-4f6e01256cbc for urn=ani-pb-ng-2
INFO | jvm 1 | 2014/08/15 21:15:48 | 2014-08-15 21:15:48,450 [qtp2062835395-26] ERROR controller.orca.controllers.xmlrpc.OrcaXmlrpcHandler - deleteSlice(): Exception encountered: Command Delete cannot transition from state NULL
INFO | jvm 1 | 2014/08/15 21:15:48 | orca.controllers.xmlrpc.SliceStateMachine$SliceTransitionException: Command Delete cannot transition from state NULL

Change History

Changed 5 years ago by ibaldin

I will look at it on Monday

Changed 5 years ago by anirban

I have been unable to reproduce this since last night. I created the same slices last night and restarted the SM and the controller. In the morning, I could delete them without any problem. I don't know whether any of the latest controller check-ins fixed this. I will keep trying.

Changed 5 years ago by anirban

I saw an issue today that is also related to slice state and garbage collection. I had a slice that should have expired on Saturday, 2014-08-16T21:04:00 UTC. I believe no actor or controller was restarted after that slice became ready. I used the same slice id to create another slice this morning, but it complained of a duplicate slice urn (see the sketch after the command output):

bash-3.2$ omni -a https://unc-hn.unc.ben:11443/orca/xmlrpc createsliver omniSlice-ani-unc two-node-unc.rspec
12:49:19 INFO : Loading agg_nick_cache file '/Users/anirban/.gcf/agg_nick_cache'
.......
Result Summary: Failed CreateSliver for slice omniSlice-ani-unc at https://unc-hn.unc.ben:11443/orca/xmlrpc. Error from Aggregate: code 2: ERROR: duplicate slice urn urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc.
12:49:21 INFO : ======================================================
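
A plausible reading of this failure, based on the rest of this ticket: the expired slice is still registered in the controller's slice table, so the duplicate-urn check in createSliver rejects the new request even though the old slice has expired. The following Java sketch is purely illustrative; the names ControllerStateSketch, SliceEntry and sliceByUrn are hypothetical and not taken from the ORCA source.

// Hypothetical sketch only -- illustrates how an expired-but-retained slice
// can trigger the "duplicate slice urn" error on a new createSliver call.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ControllerStateSketch {
    static class SliceEntry {
        final String urn;
        final long expiresMillis;
        SliceEntry(String urn, long expiresMillis) {
            this.urn = urn;
            this.expiresMillis = expiresMillis;
        }
        boolean isExpired() { return System.currentTimeMillis() > expiresMillis; }
    }

    private final Map<String, SliceEntry> sliceByUrn = new ConcurrentHashMap<>();

    String createSliver(String urn, long expiresMillis) {
        SliceEntry existing = sliceByUrn.get(urn);
        // The observed behavior suggests the duplicate check does not consider
        // expiration: the entry is retained for ~24h after expiry (see the GC
        // discussion later in this ticket), so this fires even though the old
        // slice has already expired (existing.isExpired() would be true).
        if (existing != null) {
            return "ERROR: duplicate slice urn " + urn;
        }
        sliceByUrn.put(urn, new SliceEntry(urn, expiresMillis));
        return "OK";
    }
}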

When I did a sliverstatus, it showed all the slivers in the "unknown" state, and the overall slice state was also "unknown":

bash-3.2$ omni -a https://unc-hn.unc.ben:11443/orca/xmlrpc sliverstatus omniSlice-ani-unc
12:49:51 INFO : Loading agg_nick_cache file '/Users/anirban/.gcf/agg_nick_cache'
12:49:51 INFO : Loading config file omni_config
12:49:51 INFO : Using control framework portal
12:49:51 INFO : Member Authority is https://ch.geni.net/MA (from config)
12:49:51 INFO : Slice Authority is https://ch.geni.net/SA (from config)
12:49:51 INFO : Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc expires within 1 day on 2014-08-17 21:13:17 UTC
12:49:51 INFO : Status of Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc:
12:49:52 INFO : Slice omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc has overall SliverStatus: unknown
12:49:52 INFO : Resources in slice omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc expire at 2014-08-16T21:04:00 UTC
12:49:52 INFO : Sliver status for Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc
12:49:52 INFO : {
  "geni_status": "unknown",
  "geni_urn": "urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc",
  "geni_resources": [
    {
      "orca_expires": "Sat Aug 16 21:04:00 UTC 2014",
      "geni_urn": "urn:publicid:IDN+exogeni.net:unknownvmsite+sliver+140a79c4-efa9-4b3d-8bb1-2650461a9d2f:center",
      "geni_error": "",
      "geni_status": "unknown"
    },
    {
      "orca_expires": "Sat Aug 16 21:04:00 UTC 2014",
      "geni_urn": "urn:publicid:IDN+exogeni.net:unknownvmsite+sliver+140a79c4-efa9-4b3d-8bb1-2650461a9d2f:geni1",
      "geni_error": "",
      "geni_status": "unknown"
    },
    {
      "orca_expires": "Sat Aug 16 21:04:00 UTC 2014",
      "geni_urn": "urn:publicid:IDN+exogeni.net:unknownvmsite+sliver+140a79c4-efa9-4b3d-8bb1-2650461a9d2f:geni2",
      "geni_error": "",
      "geni_status": "unknown"
    }
  ]
}
12:49:52 INFO : ------------------------------------------------------
12:49:52 INFO : Completed sliverstatus:

Options as run:

aggregate: https://unc-hn.unc.ben:11443/orca/xmlrpc
framework: portal
project: ADAMANT

Args: sliverstatus omniSlice-ani-unc

Result Summary: Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc expires within 1 day(s) on 2014-08-17 21:13:17 UTC

Slice omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc has overall SliverStatus: unknown.

Resources in slice omniSlice-ani-unc at AM https://unc-hn.unc.ben:11443/orca/xmlrpc expire at 2014-08-16T21:04:00 UTC.
Returned status of slivers on 1 of 1 possible aggregates.

12:49:52 INFO : ======================================================
bash-3.2$ date
Sun Aug 17 12:56:23 EDT 2014

From the logs, I do see:

INFO | jvm 1 | 2014/08/17 17:14:37 | 2014-08-17 17:14:37,367 [qtp1868373866-29 - /orca/xmlrpc] DEBUG controller.orca.controllers.xmlrpc.XmlrpcOrcaState - This slice ID=8a313ee5-3675-4080-bcb5-b37a4820bc61 for urn=urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc
INFO | jvm 1 | 2014/08/17 17:14:37 | 2014-08-17 17:14:37,509 [qtp1868373866-29 - /orca/xmlrpc] DEBUG controller - Slice urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc transitioned to state DEAD
INFO | jvm 1 | 2014/08/17 17:14:37 | 2014-08-17 17:14:37,628 [qtp1868373866-29 - /orca/xmlrpc] DEBUG controller.orca.controllers.xmlrpc.XmlrpcOrcaState - This slice ID=8a313ee5-3675-4080-bcb5-b37a4820bc61 for urn=urn:publicid:IDN+ch.geni.net:ADAMANT+slice+omniSlice-ani-unc

Basically, every sliverstatus call produces a similar line in the log file. So, somehow, the slice still exists in the system, and its manifest does too (since the controller still knows about the slivers).

Changed 5 years ago by ibaldin

It's worth checking what the ORCA API was saying about this slice (you can just use the slice URN as its name). The current GC code keeps slices that expired (but were not deliberately deleted by the user) for 24 hours so we can provide information about failures. This covers the case where every sliver in a slice fails and the user wants to know what happened.

Perhaps we need to narrow this logic down so that expired slices are retained only when all of their slivers have failed, as sketched below.
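
A minimal Java sketch of that narrowed retention rule; the names SliceGcPolicySketch, SliverInfo and shouldRetainAfterExpiry are hypothetical and not drawn from the ORCA source.

// Hypothetical sketch only -- retain an expired (not user-deleted) slice for
// 24 hours only if every sliver in it failed; otherwise let GC reclaim it
// immediately.
import java.util.List;

class SliceGcPolicySketch {
    static final long RETENTION_MILLIS = 24L * 60 * 60 * 1000;

    static class SliverInfo {
        final boolean failed;
        SliverInfo(boolean failed) { this.failed = failed; }
    }

    static boolean shouldRetainAfterExpiry(List<SliverInfo> slivers,
                                           boolean deletedByUser,
                                           long millisSinceExpiry) {
        if (deletedByUser) {
            return false; // user deletions are cleaned up right away
        }
        boolean allFailed = !slivers.isEmpty()
                && slivers.stream().allMatch(s -> s.failed);
        // Keep the slice around only while there is failure information
        // worth reporting, and only within the 24-hour window.
        return allFailed && millisSinceExpiry < RETENTION_MILLIS;
    }
}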

Changed 5 years ago by ibaldin

  • summary changed from controller recovery - slice state to controller recovery - slice deletion

Changed 5 years ago by ibaldin

  • status changed from new to closed
  • resolution set to fixed

Appears fixed now.
