Ticket #272 (closed defect: fixed)

Opened 6 years ago

Last modified 4 years ago

BBN-W6 offline, due to hardware error

Reported by: vjo
Owned by: ttoll
Priority: major
Milestone:
Component: Infrastructure: Rack hardware
Version: baseline
Keywords:
Cc:

Description

bbn-w6 spontaneously rebooted and is reporting hardware errors. The log says:

Jun 30 05:05:58 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:07:27 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:11:07 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:12:27 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:12:35 bbn-w6 mcelog: Corrected memory errors on page fae23a000 exceed threshold 10 in 24h: 10 in 24h
Jun 30 05:12:35 bbn-w6 mcelog: Location SOCKET:1 CHANNEL:0 DIMM:? []
Jun 30 05:12:35 bbn-w6 mcelog: Offlining page fae23a000

It has been taken offline in OpenStack and xCAT, but has been left powered on to aid the investigation. The openstack-nova-compute service has also been disabled.
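
For anyone triaging similar reports, the log excerpt above can be summarized with a short script. This is only a sketch: the regular expressions are written against the exact line shapes quoted in this ticket and are an assumption about the mcelog/kernel message format in general, and the names SAMPLE_LOG and summarize are hypothetical helpers, not part of any tool on the rack.

```python
import re

# Lines copied verbatim from the log excerpt in this ticket.
SAMPLE_LOG = """\
Jun 30 05:05:58 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:07:27 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:11:07 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:12:27 bbn-w6 kernel: [Hardware Error]: Machine check events logged
Jun 30 05:12:35 bbn-w6 mcelog: Corrected memory errors on page fae23a000 exceed threshold 10 in 24h: 10 in 24h
Jun 30 05:12:35 bbn-w6 mcelog: Location SOCKET:1 CHANNEL:0 DIMM:? []
Jun 30 05:12:35 bbn-w6 mcelog: Offlining page fae23a000
"""

# Patterns are assumptions based only on the lines quoted above.
MCE_EVENT = re.compile(r"kernel: \[Hardware Error\]: Machine check events logged")
THRESHOLD = re.compile(
    r"mcelog: Corrected memory errors on page (?P<page>[0-9a-f]+) exceed threshold"
)
LOCATION = re.compile(
    r"mcelog: Location SOCKET:(?P<socket>\S+) CHANNEL:(?P<channel>\S+) DIMM:(?P<dimm>\S+)"
)
OFFLINED = re.compile(r"mcelog: Offlining page (?P<page>[0-9a-f]+)")


def summarize(log_text):
    """Count machine check events and collect the pages and locations mcelog names."""
    summary = {
        "mce_events": 0,
        "threshold_pages": [],
        "locations": [],
        "offlined_pages": [],
    }
    for line in log_text.splitlines():
        if MCE_EVENT.search(line):
            summary["mce_events"] += 1
            continue
        match = THRESHOLD.search(line)
        if match:
            summary["threshold_pages"].append(match.group("page"))
            continue
        match = LOCATION.search(line)
        if match:
            summary["locations"].append(
                (match.group("socket"), match.group("channel"), match.group("dimm"))
            )
            continue
        match = OFFLINED.search(line)
        if match:
            summary["offlined_pages"].append(match.group("page"))
    return summary


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_LOG).items():
        print(key, value)
```

Note that the DIMM field in the excerpt is "?", so a summary like this can narrow the fault to a socket and channel but cannot identify the DIMM slot from this log alone.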

Change History

Changed 6 years ago by ibaldin

Still reporting same problems

Changed 5 years ago by ibaldin

Please open a ticket with IBM and note it on the ticket.

Changed 5 years ago by ibaldin

ping

Changed 5 years ago by ibaldin

  • owner changed from jonmills to ttoll
  • status changed from new to assigned

Changed 5 years ago by vjo

Ping.
I recently had to clean up the MCE log in /var/log, since it had grown to 15G(!).

Changed 4 years ago by wardag31

I've contacted IBM support and opened a new case. The case ID is A08RQH7. IBM sent a replacement for DIMM 13 to BBN. Tim Upthegrove installed the new DIMM, but BBN-W6 reported the new DIMM as failed. I upgraded the UEFI firmware to the latest version (1.80). The DIMM error cleared, and BBN-W6 can now create OpenStack instances successfully.

Changed 4 years ago by wardag31

  • status changed from assigned to closed
  • resolution set to fixed