MCE Recovery During Acquisition

From MCEWiki

This page describes how to recover the MCE from common failure scenarios at ACT. An MCE can be fixed while the others carry out their regularly scheduled programming.

Recover MCE after acquisition system lock-up

Scenarios

  1. An MCE command is accidentally issued during an acquisition, confusing the MCE or the scripting system.
  2. An auto-tuning or IV has failed, and the MCE needs to be properly configured for subsequent acquisitions.
  3. The logs report many ERRORs of unknown cause, and we want to reset the MCE.

Solution: Disconnect PC from the scheduler; reset the MCE; reset the script system; reconfigure the MCE; reconnect to scheduler.

Steps

All the commands below should be run at the shell on the MCE PC, as user "mce". It is not necessary to stop the schedule from running on the other MCE machines.

1. Disable MCE_control, so that no scheduler commands interfere with the recovery commands:

sudo /etc/init.d/MCE_control stop

2. Reset the MCE and tell any acquisition sessions to exit:

mce_reset_clean
mce_cmd -x fakestop
mce_cmd -x empty
mce_reset_clean

Both runs of mce_reset_clean should produce this output:

ok : 41 41 41 41 41 41 41 41 411
ok : 41 41 41 41 41 41 41 41 412
ok : 41 41 41 41 41 41 41 41 413
MCE reset Successful.
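
If the recovery is scripted, the success check can be automated. The following is a minimal sketch, assuming mce_reset_clean prints the success line verbatim as shown above; reset_succeeded is a hypothetical helper and not part of the MCE software.

```shell
# Hypothetical helper: succeed only if the text on stdin contains the
# success line quoted above. Assumes mce_reset_clean prints it verbatim.
reset_succeeded() {
    grep -q 'MCE reset Successful'
}
```

It could be used as `mce_reset_clean 2>&1 | reset_succeeded || echo "reset failed"`.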

If mce_reset_clean fails, you have some kind of hardware problem. The sure-fire solution is to power cycle the MCE and the PC, but try this sequence first:

mce_cmd -x mce_reset
mce_cmd -x dsp_reset
killall mce_cmd
killall mce_run
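
Either mce_cmd call in this sequence can hang, so a scripted version needs a time limit. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed on the MCE PC; try_briefly is a hypothetical helper and not part of the MCE software. Note that `timeout` sends SIGTERM by default, whereas Ctrl-C sends SIGINT.

```shell
# Hypothetical helper: run a command, give up after $1 seconds, and carry
# on regardless of the outcome (mirrors "Ctrl-C them and carry on").
try_briefly() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return 0
}
```

For example, `try_briefly 2 mce_cmd -x mce_reset` followed by `try_briefly 2 mce_cmd -x dsp_reset`.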

If either of the mce_cmds hangs for longer than 2 seconds, just Ctrl-C it and carry on. If mce_reset_clean still does not work at this point, determine whether the MCE or the PC (probably the PCI card) is at fault. Run

mce_status -s

This should give you "ERROR" replies to every command. This will happen either

  • a) slowly, each command timing out (~1s per command; you can Ctrl-C this if it's taking too long)
    • In this case, the MCE is probably dead. Power cycle the MCE using the sync box. If this doesn't help, do the steps in (b) as well.
  • b) very quickly, each command returning an error immediately
    • In this case, the PCI card is probably in trouble. Power cycle the PC by running

sudo poweroff

and then powering the PC off (for ~5 seconds) and then on again using the ibootbar power switch.
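
The slow-versus-fast distinction can be made mechanical by timing the mce_status run and counting the ERROR replies. The following is a rough sketch. The cut-off of 0.5 s per reply is an arbitrary assumption sitting between "~1s per command" and "immediately", and classify_error_speed is a hypothetical helper, not part of the MCE software.

```shell
# Hypothetical helper: classify an all-ERROR mce_status run by its speed.
# $1 = elapsed wall time in whole seconds, $2 = number of ERROR replies.
classify_error_speed() {
    elapsed=$1
    errors=$2
    if [ "$errors" -eq 0 ]; then
        echo "no-errors"
    elif [ $((elapsed * 2)) -ge "$errors" ]; then
        # At least ~0.5 s per reply: consistent with per-command timeouts.
        echo "slow: suspect the MCE"
    else
        echo "fast: suspect the PCI card"
    fi
}
```

One way to feed it is to record `date +%s` before and after `mce_status -s 2>&1 | grep -c ERROR` and pass the difference and the count. This assumes the ERROR replies appear as text in the command's output.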


3. Reconfigure the MCE.

Either:

  • a) If the last auto-tuning and IV were probably ok, simply run

auto_reconfig

This will set the MCE to ACT configuration and rebias the MCE to the last known bias points.

  • or b) If the last auto-tuning/IV failed (in the sense that the auto_* scripts returned "ERROR" in auto_log.txt, or you simply do not trust the results), do an auto-tuning/IV by hand:

auto_setup_squids 0
auto_iv_and_bias_squids_sh 1 man
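
The choice between (a) and (b) depends on whether auto_log.txt shows recent ERRORs. The following is a minimal sketch for checking that. The 50-line window is an arbitrary assumption, the log format is not specified here beyond the word "ERROR", and recent_errors is a hypothetical helper, not part of the MCE software.

```shell
# Hypothetical helper: print how many of the last $2 lines (default 50)
# of the log file $1 mention ERROR. The window size is an assumption.
recent_errors() {
    log=$1
    lines=${2:-50}
    tail -n "$lines" "$log" | grep -c 'ERROR'
}
```

A non-zero count for your auto_log.txt suggests option (b), and zero suggests option (a) may be enough.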

4. Test; make sure acqs run smoothly:

auto_acquire test 1000

This should print the names of the data and run files for a 1000-frame acquisition. The acq should take about 5 seconds, and the resulting data file should be 4400000 bytes. If this fails, start over at step 2.
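
The file-size check can be scripted. The figures above work out to 4400000 / 1000 = 4400 bytes per frame. The following sketch assumes that per-frame size also holds for other frame counts, which depends on the MCE configuration. acq_size_ok is a hypothetical helper, not part of the MCE software.

```shell
# Hypothetical helper: succeed if data file $1 has the size expected for
# $2 frames (default 1000), assuming 4400 bytes per frame as derived
# from the figures quoted in the text.
acq_size_ok() {
    file=$1
    frames=${2:-1000}
    expected=$((frames * 4400))
    # Arithmetic expansion strips any padding some wc versions emit.
    actual=$(($(wc -c < "$file")))
    [ "$actual" -eq "$expected" ]
}
```

Pass it the data file name printed by auto_acquire, for example `acq_size_ok "$datafile" || echo "unexpected size"`, where `$datafile` is a placeholder for that name.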

If the acquisition hangs, and if you successfully reset the MCE in step 2 (i.e. you didn't have to power cycle anything), the driver might be in some weird state. Kill the acq (Ctrl-C), then reload the driver:

sudo rmmod mce_dsp
sudo modprobe mce_dsp

Then try the acq again.
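
Before retrying, it can help to confirm that the driver module actually came back. The following is a minimal sketch that checks lsmod-style output for a module name; module_listed is a hypothetical helper, not part of the MCE software.

```shell
# Hypothetical helper: read lsmod-style text on stdin and succeed only if
# the module named in $1 appears in the first column.
module_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

For example: `lsmod | module_listed mce_dsp || echo "mce_dsp is not loaded"`.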

5. Reconnect the MCE PC to the scheduler:

sudo /etc/init.d/MCE_control start

The PC should pick up the next scheduler command; this will be apparent in auto_log.txt and in the MCE_control log.


Matthew Hasselfield, 3 Sept 2008