STOP Command

From MCEWiki
Revision as of 14:36, 16 April 2012 by Mandana (talk | contribs) (Quick Reference)

Background

The stop command was invented to allow users to stop data acquisitions in mid-stream. There are a variety of reasons for wanting to do so:

  • Malfunction of other subsystems at the telescope
  • Not receiving any DV pulses from the Sync Box or other triggering software
  • Closing off a long data acquisition
  • A hang of the Clock Card firmware

How to Issue STOP Command

From a MAS shell:

> set_directory 
> mce_run mce_run_1042 10000 s &
> sleep 2
> mce_cmd -x stop rcs ret_dat
> sleep 1
> ps | grep mce_run
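
The final check above can also be done by polling instead of a single ps call. The following is a minimal sketch, not part of MAS: wait_for_exit is a hypothetical helper, and it assumes that the acquisition shows up under the process name mce_run (as in the ps check above) and that pgrep is installed.

```shell
# Hypothetical helper (not part of MAS): wait until no process with the
# given name remains, checking once per second.
# Usage: wait_for_exit <process name> <timeout in seconds>
wait_for_exit() {
    name=$1
    remaining=$2
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$remaining" -le 0 ]; then
            return 1    # still running after the timeout
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0            # no matching process left
}
```

For example, after issuing the stop, `wait_for_exit mce_run 5 || echo "mce_run still running"` would flag an acquisition that did not end within about five seconds.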

In MAS' interactive mode, the stop command can be replaced by the following:

> stop <card_addr> ret_dat

In order to stop the MAS data process only (from a shell):

> mce_cmd -x fakestop
> mce_cmd -x mce_reset
> mce_cmd -x dsp_reset
> mce_cmd -x acq_flush

If that doesn't work, try unloading and reloading the PCI driver:

> modprobe...

If that doesn't work, try killing processes:

> ps
> kill <#>

STOP Command Replies:

The STOP command is supported as a special command in the Clock Card firmware. Unlike WB and RB commands, the MCE replies to the STOP command at its leisure, and the reply is not necessarily in order with the data packets being returned.

STOP Command Data Packets:

Data packets continue to be returned following the reply to the STOP command until all of the remaining ret_dat commands are flushed from the MCE. This means that either one or two data packets are returned after the Clock Card receives a STOP command. The last data packet has the 'stop' and 'last_frame' bits set in the status frame header. With MAS, a certain amount of dead-time is required between the reply to the STOP command and the next frame of data. This dead-time is hard-coded as 10 ms in the Clock Card firmware. With a delay of 1 ms, the PCI card was not ready to receive the final data packet in 50% of STOP trials. The delay can be adjusted with the 'stop_dly' command, whose value is given in microseconds (us).
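
Because the delay is quoted in milliseconds but written in microseconds, the conversion is easy to get wrong. The following is a small sketch of it; ms_to_us and stop_dly_cmd are hypothetical helper names, and the sketch only prints the command line rather than running mce_cmd.

```shell
# Hypothetical helpers: convert a delay in milliseconds to the microsecond
# value that stop_dly expects, and print (not run) the matching command.
ms_to_us() {
    echo $(( $1 * 1000 ))
}

stop_dly_cmd() {
    echo "mce_cmd -x wb cc stop_dly $(ms_to_us "$1")"
}

stop_dly_cmd 10    # the 10 ms dead-time expressed in us
```

The printed line for 10 ms is `mce_cmd -x wb cc stop_dly 10000`.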

STOP Commands Outside of Data Runs:

When a STOP command is issued outside of a data run, no data packets are returned. When a STOP command is issued during a data run, the timing of the last data frame does not generally follow the timing that is specified by the '> rb cc data_rate' parameter. In general, the last ret_dat is queued up as quickly as possible, irrespective of the status of '> rb cc use_dv'. For example, when the Clock Card is sourcing its DV pulses from the Sync Box, and a STOP command arrives, it does not wait for the next DV pulse -- instead it issues the last ret_dat immediately. If the Clock Card waited, it would hang whenever the STOP was issued because the source of the DV pulses was not functioning correctly to begin with.

Testing

There is a test script for STOP commands that instantiates a 'runner' process and a 'stopper' process. The runner issues mce_run commands, and the stopper issues mce_stop commands 'n' seconds after the mce_run. To run the test, do the following:

> cd stop_test
> ./test.bash <# trials> <# seconds of data>
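
The runner/stopper idea can be sketched in a few lines of shell. This is not the contents of test.bash, which are not reproduced here; stop_trial is a hypothetical function, and the run and stop commands are passed in as strings so that the sketch can be tried without an MCE attached.

```shell
# Hypothetical sketch of one runner/stopper trial (not the real test.bash).
# Usage: stop_trial <delay in seconds> <run command> <stop command>
stop_trial() {
    delay=$1
    sh -c "$2" &                        # runner: starts the acquisition
    runner=$!
    ( sleep "$delay"; sh -c "$3" ) &    # stopper: fires after the delay
    stopper=$!
    rc=0
    wait "$runner" || rc=$?             # keep the runner's exit status
    wait "$stopper" || true
    return "$rc"
}
```

A dry run such as `stop_trial 2 'echo Running 0' 'echo Stopping 0'` exercises the sequencing; on a real system the two strings would be replaced by the run and stop commands used in the test.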


If an error occurs, the leftover processes must be killed and the driver and MCE reset before testing can continue. In another window, enter the following:

> ps aux | grep runner
> kill <"/bin/bash" process>
> killall mce_cmd (do this 6 times)
> mce_cmd -x dsp_reset
> sudo /etc/init.d/mas restart
> mce_reset_clean
> mce_reconfig
> mce_cmd -x wb cc stop_dly 10000
> mce_cmd -x wb cc data_rate 6


The runner script outputs the following information with each START/STOP:

  • Healthy PCI Card:
mce@mce-ubc-2:~/stop_test$ ./test.bash 10
Running 0
Time is 4
Stopping 0
GROUP basic
STATUS               X   0x0000 = 0x0000
MODE                 X   0x0001 = 0x001c
FRAME_COUNT          X   0x0002 = 0x0eaf
REV_NUMBER           X   0x0003 = 0x550105
NUM_DUMPED           X   0x0006 = 0x9e1ffb
  • Unhealthy PCI Card:
mce@mce-ubc-2:~/stop_test$ ./test.bash 10
Running 0
Time is 4
Stopping 0
GROUP basic
STATUS               X   0x0000 = 0x0040
MODE                 X   0x0001 = 0x001c
FRAME_COUNT          X   0x0002 = 0x1395
REV_NUMBER           X   0x0003 = 0x550105
NUM_DUMPED           X   0x0006 = 0x9e1ffb
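
The two dumps above differ in the STATUS line (0x0000 for the healthy card, 0x0040 for the unhealthy one). The following sketch picks that line out of the runner output automatically. check_pci_status is a hypothetical helper, and treating every nonzero STATUS as unhealthy is an assumption: only the two values above are documented here.

```shell
# Hypothetical helper: read the runner output on stdin and classify the
# first STATUS line. Assumption: any nonzero STATUS means unhealthy.
check_pci_status() {
    status=$(awk '$1 == "STATUS" && !seen { print $NF; seen = 1 }')
    case "$status" in
        "")     echo "no STATUS line found"; return 2 ;;
        0x0000) echo "healthy (STATUS = $status)"; return 0 ;;
        *)      echo "unhealthy (STATUS = $status)"; return 1 ;;
    esac
}
```

It can be used as `./test.bash 10 | check_pci_status`, or fed a saved copy of the output.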


There are a couple of other tools that one can use to check the integrity of packets being returned:

  • You can see frame size information by running the following:
> cd /home/mce/dsp_dump
> python dsp_ram.py header
  • It shows the preamble, packet type, and size:
HEAD_W1_1            X   0x000f = 0xa5a5
HEAD_W1_0            X   0x0010 = 0xa5a5
HEAD_W2_1            X   0x0011 = 0x5a5a
HEAD_W2_0            X   0x0012 = 0x5a5a
HEAD_W3_1            X   0x0013 = 0x2020
HEAD_W3_0            X   0x0014 = 0x5250
HEAD_W4_1            X   0x0015 = 0x0000
HEAD_W4_0            X   0x0016 = 0x0004 
  • The command/reply log for MAS is stored in /data/cryo/current_data:
> cd /data/cryo/current_data
> tail -n 50 log
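
The header words in the dsp_ram.py output above come in 16-bit pairs. The following sketch joins a pair into one 32-bit value and decodes the packet type as ASCII. Both function names are hypothetical, and the sketch assumes that the _1 word is the upper half and the _0 word the lower half, which is consistent with the values shown but is not stated on this page.

```shell
# Join two 16-bit header words into one 32-bit value.
# Assumption: the _1 word is the upper half, the _0 word the lower half.
join_words() {
    printf '0x%04x%04x\n' "$1" "$2"
}

# Print the printable ASCII characters encoded by an even-length hex
# string given without the 0x prefix.
hex_to_ascii() {
    s=$1
    out=""
    while [ "${#s}" -ge 2 ]; do
        rest=${s#??}            # everything after the first two digits
        byte=${s%"$rest"}       # the first two digits
        out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
        s=$rest
    done
    printf '%s\n' "$out"
}
```

With the values above, `join_words 0x2020 0x5250` gives 0x20205250, and `hex_to_ascii 20205250` gives two spaces followed by RP, which is presumably the marker for a reply packet. Likewise, `join_words 0x0000 0x0004` gives 0x00000004 for the size field.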

Useful Signal Tap Signals

Signal tap is useful for capturing the behavior of the MCE during STOP commands. I found that the following settings are most useful for STOP testing:

  • Nodes:
    • issue_reply:issue_reply_block|cmd_translator:i_cmd_translator|ret_dat_stop_req
    • issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_data_o
    • issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_nena_o
    • issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_clk_i
    • issue_reply:issue_reply_block|cmd_translator:i_cmd_translator|current_state.REQ_LAST_DATA_PACKET
  • Signal Configuration:
    • Clock: _clk0
    • Sample Depth: 64K
    • RAM Type: Auto
    • Trigger: Sequential, Center Trigger Position, Trigger Conditions = 2

Test Results and Timing Diagrams

During the testing of STOP commands in the sys_v05000000 tag of firmware, it was found that whenever a malfunction with stopping occurred, the Clock Card had been in the process of sending a data packet to the PCI card when a STOP command was issued by the PCI Card. Further investigation revealed that the PCI Card required an inordinate amount of time to process the reply to the STOP command, which caused an overflow in the PCI Card buffer space. By making changes to both the PCI Firmware and Linux Driver, we were able to increase the STOP Reply processing bandwidth to a level where STOP and On-The-Fly errors no longer occurred.

Test Cases

The cmd_translator block on the clock card is the block that nominally runs data acquisitions. It is a complicated piece of code, and requires simulation of at least the following cases:

  • Acquisition of one frame of data
  • Acquisition of multiple frames of data
  • Acquisition while sourcing the DV from the Sync Box (use_sync=2, use_dv=2, select_clk=1)
  • Acquisition while sourcing the DV from the Sync Box's input (use_sync=2, use_dv=2, select_clk=1)
  • Acquisition while sourcing the DV from the Sync Box and disconnecting the Sync Box fibre.
  • Acquisition while sourcing the DV from the Sync Box with the fibre initially disconnected
  • Acquisition while turning the Sync Box output off and then on


All the cases above should be repeated in the following scenarios:

  • a STOP command should be issued before the first frame is returned
  • a STOP command should be issued during the acquisition
  • a STOP command should be issued after the acquisition