Ticket #232 (closed defect: wontfix)

Opened 8 years ago

Last modified 6 years ago

BenVlanControl concurrency/label agreement bug

Reported by: ibaldin
Owned by: yxin
Priority: major
Milestone: Dungeness 4.0
Component: ORCA: Policies and their components
Version: baseline
Keywords:
Cc:

Description

Overlapping joins/leaves for complex slices cause silent failures at the DTN layer: crossconnects are not always set up properly.

This is a result of either:

a) a race condition in the BenVlanControl model update caused by interleaving joins/leaves, or
b) a bug in label agreement for the DTN layer.

Change History

Changed 8 years ago by yxin

Another problem:
1) 1st request: RENCI to UNC;
2) 2nd request: RENCI to UNC and RENCI to Duke, but the RENCI-UNC leg appeared to go through the DTNs again, instead of only adding a VLAN connection on top of what the first request set up.

Changed 8 years ago by yxin

In summary, based on recent tests, the symptoms and possible causes are:

(1) There may be a bug in releasing a complicated slice with multiple cross-layer connections, which leaves unreleased crossconnects in a DTN and prevents successful configuration of subsequent requests.
(2) There may be a race condition in the BEN control/handler that hangs the BEN configuration process.

The next step is to create more complicated multi-connection requests in the hope of reproducing the problem more often.

Changed 7 years ago by ibaldin

  • milestone changed from Camano 3.1 to Dungeness 4.0

Changed 7 years ago by ibaldin

Aydan,

This is primarily directed at you as a question. We are seeing non-deterministic behavior in the BEN AM when it is bombarded with fast sequences of requests to create and take down connections. There clearly is a race condition of some sort where the network model in the AM thinks one thing, while the substrate is in a different state.

The main issue, it appears, is that the BEN handler (and to a lesser extent the network handler for the Euca sites) is not idempotent. The join part is, but the leave part is not: if leave handler execution is delayed (or runs out of order), it may take down a VLAN or a circuit that was set up by someone else in the meantime (since tags and circuits are reused).

The question is how to deal with it. One possible solution is to serialize everything in the BEN AM and the Euca Net AM. I believe a thread pool is used to fire off handlers in the AMs (correct me if I'm wrong). We may need either to set the pool size to 1, assuming FIFO discipline is guaranteed for handler invocations, or to implement a separate queue with a single server thread (if the thread pool relies on e.g. condition variables, which I don't think have a well-defined queuing discipline).
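
For illustration, a minimal sketch of the single-server-thread idea, assuming handler invocations can be wrapped as Runnable tasks; the class name below is hypothetical and not part of the ORCA code:

{{{
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: funnel all handler invocations for an AM through a
// single worker thread so joins and leaves execute strictly in FIFO order.
public class SerializedHandlerExecutor {
    // A single-threaded executor guarantees sequential, submission-order execution.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public void submitJoin(Runnable joinHandler) {
        worker.submit(joinHandler);
    }

    public void submitLeave(Runnable leaveHandler) {
        worker.submit(leaveHandler);
    }

    public void shutdown() {
        worker.shutdown();
    }
}
}}}

Setting the existing pool size to 1 would achieve the same effect only if the pool's work queue is FIFO.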

What do you think?

-ilia

Changed 7 years ago by ibaldin

The Ben NDL control already processes one request at a time, i.e., it queues new requests until the setup handler for the current one completes its execution. This means that all BEN setup actions are strictly ordered.

The teardown actions can happen, however, in any order, depending on lease length and scheduling order. To deal with this, the BEN control receives a call for each reservation that is about to be closed. These calls happen on the actor thread and are strictly ordered. See BenControl.close(IReservation). When I wrote this method, the intent was to consult the NDL model and determine what is safe to tear down and what is not. This method sets properties that are passed down to the handler and can control its behavior. The goal is to keep track of resources and to prevent the teardown of a resource that is shared among multiple reservations.
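
For illustration, a minimal sketch of that shared-resource bookkeeping, assuming resources can be keyed by an identifier such as a crossconnect or VLAN tag; the class and method names are hypothetical, not the actual NDL model API:

{{{
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: count how many reservations use a shared resource so
// close() can tell the handler whether tearing that resource down is safe.
public class SharedResourceTracker {
    private final ConcurrentHashMap<String, Integer> refCounts = new ConcurrentHashMap<>();

    // Called when a reservation starts using a resource (e.g. during setup).
    public void acquire(String resourceId) {
        refCounts.merge(resourceId, 1, Integer::sum);
    }

    // Called from close(): returns true only when no other reservation still
    // uses the resource, i.e. the teardown handler may actually remove it.
    public boolean releaseAndIsTeardownSafe(String resourceId) {
        Integer remaining = refCounts.merge(resourceId, -1, Integer::sum);
        if (remaining != null && remaining <= 0) {
            refCounts.remove(resourceId);
            return true;
        }
        return false;
    }
}
}}}

The boolean result could then be passed down to the handler as one of the properties mentioned above, e.g. a flag saying whether the crossconnect should really be torn down.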

Note that close(IReservation) is an indication that a close is about to happen. The NDL model should mark these resources as "about to be closed", but not really closed. Once the teardown handler completes, free(Units) will be invoked, which ends up calling releaseResources(Unit).

As written, I can see a race between a close() and a new allocation: if a resource is about to be destroyed, the model should not assume the resource exists when it processes new requests. But to satisfy a new request, we might need to create a link that already exists and is in the process of being destroyed.

It seems that the safest thing to do is to defer allocating new requests while there is an outstanding close. One way to do this is to keep a counter: increment it in close and decrement it in release; since we have only one Unit in a BEN reservation, that would work fine. A more elaborate approach would be for the NDL model to keep track of whether a close is in progress and to defer allocations until there are no more closes in progress.
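
For illustration, a minimal sketch of the counter idea, assuming the allocation path consults it before admitting new requests; the class and method names are hypothetical, while the actual hook points would be close(IReservation) and releaseResources(Unit):

{{{
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: defer new allocations while any close is outstanding.
public class CloseInProgressGuard {
    private final AtomicInteger closesInProgress = new AtomicInteger(0);

    // Call when close(IReservation) signals that a teardown is pending.
    public void onClose() {
        closesInProgress.incrementAndGet();
    }

    // Call when releaseResources(Unit) runs, after the teardown handler completes.
    public void onRelease() {
        closesInProgress.decrementAndGet();
    }

    // New requests should stay queued (not be allocated) while this returns false.
    public boolean canAllocate() {
        return closesInProgress.get() == 0;
    }
}
}}}

With a single Unit per BEN reservation, this behaves exactly as described: one increment per close, one decrement per release.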

Let me know if you have more questions.

--aydan

Changed 6 years ago by ibaldin

  • status changed from new to closed
  • resolution set to wontfix