Version 29 (modified by ibaldin, 8 years ago)

Camano 3.0

Issues

* Re-deploying site requires redeploying broker - VERY BAD!

  • When you redeploy, don't wipe the state recovery file; this preserves existing reservations. HOWEVER: there is currently a bug that prevents the packages from being expanded if the state recovery lock file is present. Aydan will fix #145. Every actor is designed to work this way. If a broker is blown away, service managers will continue working until they need to extend. Basically, we need to fix any recovery bugs, but by design redeploying should be fine. Recovery is most problematic on SMs and authorities. Eucalyptus is made to recover, but this is not well tested.

See #187 first, then look at #182, #183, #184, #185, #186

* How can we pass port configuration from the switches into the log from failure of operations?

  • If you specify a task property as 'shirako.save.XXXX' (see the Config class), it will be passed back with the 'shirako.save' prefix stripped (shirako.save.blah -> blah), and the property will be attached to the unit for which the handler was executed. However, failed units aren't sent back to e.g. the SM. Logging it in the handler is possible and is probably the easiest option. It is also possible to attach it to the reservation and pass it as a property. There is also a notice mechanism that can be used to pass it back to the reservation.
  • The problem is related to using Ant as the basis for handler scripts. To look at switching to Jython as the engine, start with the Config class. One issue may be checking the progress of the operation. Jeff has details on this (added for one of the demos). Some code in the AntConfig class may be generically useful and would have to be pushed up. Implement a generic config; look at javax.script and BSF (Bean Scripting Framework).
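
The 'shirako.save' convention described above can be sketched roughly as follows. This is an illustration in Python, not ORCA code: the function name, the sample property names, and the sample values are all hypothetical, and the real Config class may differ in details such as how it treats the dot after the prefix.

```python
# Prefix named in the notes; the trailing dot is an assumption based on
# the example "shirako.save.blah -> blah".
SAVE_PREFIX = "shirako.save."


def extract_saved_properties(task_properties):
    """Return the properties marked for saving, with the prefix stripped."""
    saved = {}
    for key, value in task_properties.items():
        # Skip keys without the prefix and keys that are only the prefix.
        if key.startswith(SAVE_PREFIX) and len(key) > len(SAVE_PREFIX):
            saved[key[len(SAVE_PREFIX):]] = value
    return saved


# Hypothetical handler output: one property to save, one ordinary property.
handler_output = {
    "shirako.save.port.config": "ge-0/0/1 trunk vlan 100",
    "unit.vlan.tag": "100",
}
saved = extract_saved_properties(handler_output)
print(saved)
```

Only the stripped properties would be attached to the unit; for failed units the same dictionary could instead be written to the log from inside the handler.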

* Do we still have an inter-actor deadlock problem?

  • Yes. Needs fixing. Every inter-actor call holds the big lock. Across multiple containers, the call and the response may be handled in different threads, which may cause deadlock. The solution is to make calls across actors without holding the big lock. The remaining issue is handling exceptions.
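
A minimal sketch of the proposed fix, assuming a single coarse lock per actor: outgoing requests are collected under the lock, the lock is released before the remote call is made, and it is re-acquired only to record the reply or the exception. The Actor class, its fields, and fake_remote_call are hypothetical stand-ins, not the ORCA implementation.

```python
import threading


class Actor:
    """Hypothetical actor guarded by one coarse-grained ("big") lock."""

    def __init__(self, name):
        self.name = name
        self.big_lock = threading.Lock()
        self.pending = []   # outgoing requests, queued under the lock
        self.results = []   # replies, recorded under the lock
        self.errors = []    # (payload, exception) pairs, recorded under the lock

    def queue_request(self, payload):
        with self.big_lock:
            self.pending.append(payload)

    def flush(self, remote_call):
        # Step 1: take a snapshot of the outgoing work while holding the lock.
        with self.big_lock:
            outgoing, self.pending = self.pending, []
        # Step 2: make the inter-actor calls with the lock released.
        for payload in outgoing:
            try:
                reply = remote_call(payload)
            except Exception as exc:
                # Step 3a: re-acquire the lock only to record the failure.
                with self.big_lock:
                    self.errors.append((payload, exc))
            else:
                # Step 3b: re-acquire the lock only to record the reply.
                with self.big_lock:
                    self.results.append(reply)


actor = Actor("sm")
lock_states = []


def fake_remote_call(payload):
    # Record whether the caller's big lock is held during the call.
    lock_states.append(actor.big_lock.locked())
    if payload == "bad":
        raise RuntimeError("remote failure")
    return payload.upper()


actor.queue_request("ticket")
actor.queue_request("bad")
actor.flush(fake_remote_call)
```

Because exceptions are caught outside the lock and recorded afterwards, a failed remote call cannot leave the lock held; deciding what state to roll back on failure is the part that remains open.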

* If a parent reservation fails, should children be allowed to go on? #189

  • There is a concept of 'deferral': if you don't have enough resources, you can send back partial resources, which is why this makes sense. We may need to add a new property, 'non-deferrable', on the control for the resource pool. There is a binary predicate that determines whether a redeem is possible (yes redeem/no redeem/no redeem and release). Child reservations get tickets, and these need to be cleaned up if a parent fails on redeem. This needs to be a transaction. Predecessor reservations need to be closed; however, we may need to wait until they are done redeeming, as closing in the redeeming state is difficult.
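
The three outcomes of the redeem predicate could be modeled as below. This is only a sketch under assumptions: the state names ("active", "failed"), the function names, and the rule that maps a parent state to an outcome are hypothetical and are not taken from the ORCA code.

```python
from enum import Enum


class RedeemDecision(Enum):
    REDEEM = "yes redeem"
    WAIT = "no redeem"
    RELEASE = "no redeem and release"


def decide_redeem(parent_state):
    """Map a (hypothetical) parent reservation state to a redeem outcome."""
    if parent_state == "active":
        return RedeemDecision.REDEEM
    if parent_state == "failed":
        return RedeemDecision.RELEASE
    # Any other state (e.g. still redeeming): do not redeem yet.
    return RedeemDecision.WAIT


def process_children(parent_state, child_tickets):
    """Return (decision, tickets to redeem, tickets to release)."""
    decision = decide_redeem(parent_state)
    if decision is RedeemDecision.REDEEM:
        return decision, list(child_tickets), []
    if decision is RedeemDecision.RELEASE:
        # The parent failed, so every child ticket is cleaned up together.
        return decision, [], list(child_tickets)
    return decision, [], []


print(process_children("failed", ["ticket-a", "ticket-b"]))
```

Returning the redeem and release lists from one call is meant to echo the point that cleanup should behave as a transaction: either all child tickets proceed or all are released.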

* Need to re-enable ticket validation #181

* Site consolidation (is it needed?):

  • Can/should we get rid of the RENCI-net and UNC-net authorities, so that a single authority controls the EX3200? With NDL exporting multiple resource types, we can now do the consolidation.
  • We could modify configuration processor to create resource pools based on NDL input to avoid repeating delegation in config.xml (and NDL). To be discussed.

* NDL code:

* GENI AM API #190, #191

* Reuse controller code between XMLRPC and ID controllers (Avoid copying).

* Old bugs:

  • Adding resource pools from GUI/extending management API #120
  • Referencing actors by GUIDs instead of names #121
  • Emulation mode build #192
  • Node Agent install fail #193
  • Close reservation from reservation screen (for single VMs w/o controller use) fails #196, #116 (related)

ToDos

  • Aydan to look at (1) inter-actor deadlock, (2) broker/state recovery and (3) failing reservations/parent reservations
  • Victor/Ilia to look at handlers/Jython/javax/BSF and logging + old bugs
  • Prateek/Muzhi - come up with draft API for image management supporting multiple cloud providers (Euca, XCat, Nimbus etc). Continue developing image management function as independent of ORCA, running as a webapp
  • Prateek/Victor - put Shibboleth support into trunk, help investigate shib at UNC, test against two IDPs (UNC/Duke)

Wishlist

  • Bugs
    • updateTicket throws null pointer exceptions after initial reservation request fails because broker cannot fulfill it (per Prateek)
    • In at least one instance a site container was restarted between a join and a leave, and the leave did not work after that (Euca at RENCI).
    • The container with the interdomain controller stopped ticking abruptly. Two requests were issued and closed successfully two days before this happened. When a new request was issued, after the interdomain path was computed, everything halted. orca.log wasn't growing. 'View Reservations' on the portal was showing only one reservation in 'Nascent' state.
  • Miscellaneous/core
    • Clean up SM policy and controller APIs to avoid problems like with close()
    • Add 'exportAll' to the config file
    • Add instantiating a controller from config file vs. GUI
    • Can CXF replace Axis2 or can we upgrade Axis2?
  • Network drivers
    • Improve 6509 driver performance by caching login sessions
    • Consider separating adding a QoS profile to vlan from vlan creation. This may be needed to deal with vlan delays and in general give more flexibility.
  • NDL policies
    • Should improve the performance of the label assignment policy and label range update utility in the model
    • Review the code for static members and general structure
    • Multipoint BEN and Sherpa
    • Investigate persistent triple store from BBN http://parliament.semwebcentral.org/
    • Can we have the controller (ID controller) query NDLs on demand instead of only at the beginning?
    • May need to get rid of the domain name entry in the config.xml, and get it from the NDL when delegating resource pools.
    • May need to get rid of the build.property in the handler to get the device information (name/address) out of the NDL file.
  • Utilities
    • Sanity checking script for container actor configuration files (check guids, check locations, check edges)
  • Can we use cytoscape to visualize our RDFs in a useful way (example: in the registry add an option to show a visualization of the delegated resources)?
  • Look at switching away from MySQL.
  • Ability to change passwords for simple container authentication from web gui