Difference between revisions of "STOP Command"

From MCEWiki
 
= Testing =
 
There is a test script for STOP commands that instantiates a runner and a stopper process.  The runner issues an mce_run command, and the stopper issues an mce_stop command 'n' seconds after the mce_run.  To run the test, do the following:
 
> cd stop_test
> ./test.bash <# trials> <# seconds of data>

If an error occurs, the leftover test processes must be killed and the driver state flushed before the next run.  In another window, enter the following:

> ps aux | grep runner
> kill <"/bin/bash" process>
> killall mce_cmd (do this 6 times)
> mce_cmd -x dsp_reset
> sudo /etc/init.d/mas restart
> mce_reset_clean
> mce_reconfig
> mce_cmd -x wb cc stop_dly 10000
> mce_cmd -x wb cc data_rate 6

The runner script outputs the following information with each START/STOP:

* Legend:  http://e-mode.phas.ubc.ca/mcewiki/index.php/PCI_card_hacking

* Healthy PCI Card:

mce@mce-ubc-2:~/stop_test$ ./test.bash 10
Running 0
Time is 4
Stopping 0
GROUP basic
STATUS       X  0x0000 = 0x0000
MODE         X  0x0001 = 0x001c
FRAME_COUNT  X  0x0002 = 0x0eaf
REV_NUMBER   X  0x0003 = 0x550105
NUM_DUMPED   X  0x0006 = 0x9e1ffb

* Unhealthy PCI Card:

mce@mce-ubc-2:~/stop_test$ ./test.bash 10
Running 0
Time is 4
Stopping 0
GROUP basic
STATUS       X  0x0000 = 0x0040
MODE         X  0x0001 = 0x001c
FRAME_COUNT  X  0x0002 = 0x1395
REV_NUMBER   X  0x0003 = 0x550105
NUM_DUMPED   X  0x0006 = 0x9e1ffb

 
There are a couple of other tools for checking the integrity of the packets being returned:

* Frame size information is available from the DSP dump script:

> cd /home/mce/dsp_dump
> python dsp_ram.py header

* It shows the preamble, packet type, and size:

HEAD_W1_1  X  0x000f = 0xa5a5
HEAD_W1_0  X  0x0010 = 0xa5a5
HEAD_W2_1  X  0x0011 = 0x5a5a
HEAD_W2_0  X  0x0012 = 0x5a5a
HEAD_W3_1  X  0x0013 = 0x2020
HEAD_W3_0  X  0x0014 = 0x5250
HEAD_W4_1  X  0x0015 = 0x0000
HEAD_W4_0  X  0x0016 = 0x0004

* The command/reply log for MAS is stored in /data/cryo/current_data:

> cd /data/cryo/current_data
> tail -n 50 log
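
The header words listed above are dumped as 16-bit halves. As a rough sketch, they can be joined into 32-bit words for easier reading. This assumes that the _1 register holds the upper half and the _0 register the lower half, and that W1/W2 are the preamble, W3 the packet type, and W4 the size; neither assumption is confirmed on this page. The helper name join_halves is made up for illustration.

```shell
#!/bin/bash
# Sketch: join the 16-bit halves from the header dump into 32-bit words.
# Assumption: the _1 register holds the upper half and _0 the lower half.
join_halves() {
    printf '0x%08x\n' $(( ($1 << 16) | $2 ))
}

join_halves 0xa5a5 0xa5a5   # HEAD_W1 (assumed preamble word)
join_halves 0x5a5a 0x5a5a   # HEAD_W2 (assumed preamble word)
join_halves 0x2020 0x5250   # HEAD_W3 (assumed type); 0x52 0x50 are ASCII "RP"
join_halves 0x0000 0x0004   # HEAD_W4 (assumed size)
```

Under these assumptions the dump above reads as a preamble of 0xa5a5a5a5 0x5a5a5a5a, a packet type of 0x20205250, and a size of 4.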
 
 
 
= Signal Tap =

Signal Tap is useful for capturing the behavior of the MCE during STOP commands.  I found that the following settings are most useful for STOP testing:

* Nodes:
** issue_reply:issue_reply_block|cmd_translator:i_cmd_translator|ret_dat_stop_req
** issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_data_o
** issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_nena_o
** issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_clk_i
** issue_reply:issue_reply_block|cmd_translator:i_cmd_translator|current_state.REQ_LAST_DATA_PACKET
* Signal Configuration:
** Clock: _clk0
** Sample Depth: 64K
** RAM Type: Auto
** Trigger: Sequential, Center Trigger Position, Trigger Conditions = 2

 

Latest revision as of 18:59, 31 August 2016

Background

The stop command was invented to allow users to stop data acquisitions in mid-stream. There are a variety of reasons for wanting to do so:

  • Malfunction of other subsystems at the telescope
  • Not receiving any DV pulses from the Sync Box or other triggering software
  • Closing off a long data acquisition
  • A hang of the Clock Card firmware

How to issue a STOP command

From a MAS shell:

> mce_run mce_run_1042 10000 s &
> sleep 2
> mce_cmd -x stop rcs ret_dat
> sleep 1
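
The sequence above can be wrapped in a small helper, shown here only as a sketch. The function name run_then_stop and the MCE_RUN / MCE_CMD override variables are made up for illustration and are not part of MAS; by default they point at the same mce_run and mce_cmd tools used above. All arguments after the delay are passed to mce_run unchanged.

```shell
#!/bin/bash
# Sketch only: start an acquisition, let it run, then issue a STOP.
# MCE_RUN and MCE_CMD are hypothetical override hooks (useful for dry runs);
# by default they name the real tools from the manual sequence.
MCE_RUN=${MCE_RUN:-mce_run}
MCE_CMD=${MCE_CMD:-mce_cmd}

run_then_stop() {
    local delay=$1
    shift
    # Start the acquisition in the background with the remaining arguments.
    "$MCE_RUN" "$@" &
    local run_pid=$!
    # Let the acquisition run for a while before stopping it.
    sleep "$delay"
    # Issue the STOP to the readout cards, as in the manual sequence.
    "$MCE_CMD" -x stop rcs ret_dat
    local stop_status=$?
    # Wait for the acquisition process to exit before returning.
    wait "$run_pid"
    return "$stop_status"
}

# Example (equivalent to the manual sequence above):
#   run_then_stop 2 mce_run_1042 10000 s
```

Waiting on the background process keeps the helper from returning while mce_run is still writing data; the return value is the exit status of the stop command.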

In mce_cmd interactive mode, the stop command can be issued as:

> stop <card_addr> ret_dat

In order to stop the MAS data process only (from a shell):

> mce_cmd -x fakestop
> mce_cmd -x mce_reset
> mce_cmd -x dsp_reset
> mce_cmd -x acq_flush

If that doesn't work, try unloading and reloading the PCI driver.

How does the MCE handle a STOP command

The STOP command is supported as a special command in the Clock Card firmware. Unlike for WB and RB commands, the MCE replies to the STOP command at its leisure, and not necessarily in order with data packets being returned.

Data packets continue to be returned following the reply to the STOP command until all of the remaining ret_dat commands are flushed from the MCE. This means that either one or two data packets are returned after the Clock Card receives a STOP command. The last data packet has the 'stop' and 'last_frame' bits set in the status frame header. With MAS, a certain amount of dead-time is required between the reply to the STOP command and the next frame of data. This dead-time is hard-coded as 10 ms in the Clock Card firmware. With a delay of 1 ms, the PCI card was not ready to receive the final data packet in 50% of STOP trials. The delay can be adjusted with the 'stop_dly' command, whose units are microseconds (us).
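
Because stop_dly is given in microseconds, a delay expressed in milliseconds has to be multiplied by 1000 before it is written. The following is a minimal sketch that only builds the command string; stop_dly_cmd is a made-up helper name, and it assumes stop_dly is written with 'wb cc' like other Clock Card parameters.

```shell
#!/bin/bash
# Sketch: build the stop_dly write command from a delay given in milliseconds.
# stop_dly_cmd is a hypothetical helper; stop_dly itself takes microseconds.
stop_dly_cmd() {
    local ms=$1
    echo "mce_cmd -x wb cc stop_dly $(( ms * 1000 ))"
}

stop_dly_cmd 10   # prints: mce_cmd -x wb cc stop_dly 10000
```

The helper prints the command rather than running it, so the value can be checked before anything is sent to the Clock Card.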

When a STOP command is issued outside of a data run, no data packets are returned. When a STOP command is issued during a data run, the timing of the last data frame does not generally follow the timing that is specified by the '> rb cc data_rate' parameter. In general, the last ret_dat is queued up as quickly as possible, irrespective of the status of '> rb cc use_dv'. For example, when the Clock Card is sourcing its DV pulses from the Sync Box and a STOP command arrives, it does not wait for the next DV pulse; instead it issues the last ret_dat immediately. If the Clock Card waited, it would hang whenever the STOP was issued because the source of the DV pulses had stopped functioning in the first place.

Test Cases

The cmd_translator block on the Clock Card is the block that nominally runs data acquisitions. It is a complicated piece of code, and at least the following cases need to be simulated:

  • Acquisition of one frame of data
  • Acquisition of multiple frames of data
  • Acquisition while sourcing the DV from the Sync Box (use_sync=2, use_dv=2, select_clk=1)
  • Acquisition while sourcing the DV from the Sync Box's input (use_sync=2, use_dv=2, select_clk=1)
  • Acquisition while sourcing the DV from the Sync Box and disconnecting the Sync Box fibre.
  • Acquisition while sourcing the DV from the Sync Box with the fibre initially disconnected
  • Acquisition while turning the Sync Box output off and then on.


All the cases above should be repeated in the following scenarios:

  • a STOP command should be issued before the first frame is returned
  • a STOP command should be issued during the acquisition
  • a STOP command should be issued after the acquisition
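
Crossing the seven acquisition cases with the three STOP scenarios gives 21 simulations. The following sketch simply enumerates that matrix so that none is skipped; the labels are shorthand for the bullets above and are not names used by any simulation tool.

```shell
#!/bin/bash
# Sketch: enumerate every acquisition case against every STOP scenario.
# The labels are shorthand for the bullets above, not tool-defined names.
cases=(
    "one frame"
    "multiple frames"
    "DV from Sync Box"
    "DV from Sync Box input"
    "Sync Box fibre disconnected during run"
    "Sync Box fibre initially disconnected"
    "Sync Box output turned off then on"
)
stops=("before first frame" "during acquisition" "after acquisition")

list_test_matrix() {
    local n=0 c s
    for c in "${cases[@]}"; do
        for s in "${stops[@]}"; do
            n=$(( n + 1 ))
            printf '%2d. %s / STOP %s\n' "$n" "$c" "$s"
        done
    done
}

list_test_matrix
```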

Note: During the testing of STOP commands in the sys_v05000000 tag of firmware, it was found that whenever a malfunction with stopping occurred, the Clock Card had been in the process of sending a data packet to the PCI card when a STOP command was issued by the PCI Card. Further investigation revealed that the PCI Card required an inordinate amount of time to process the reply to the STOP command, which caused an overflow in the PCI Card buffer space. By making changes to both the PCI Firmware and Linux Driver, we were able to increase the STOP Reply processing bandwidth to a level where STOP and On-The-Fly errors no longer occurred.