STOP Command
Revision as of 17:26, 15 November 2011
Background
The stop command was invented to allow users to stop data acquisitions in mid-stream. There are a variety of reasons for wanting to do so:
- Malfunction of other subsystems at the telescope
- Not receiving any DV pulses from the Sync Box or other triggering software
- Closing off a long data acquisition
- A hang of the Clock Card firmware
How to Issue STOP Command
From a MAS shell:
> set_directory
> mce_run mce_run_1042 10000 s &
> sleep 2
> mce_cmd -x stop rcs ret_dat
> sleep 1
> ps | grep mce_run
In MAS' interactive mode, the stop command can be replaced by the following:
> stop <card_addr> ret_dat
In order to stop the MAS data process only (from a shell):
> mce_cmd -x fakestop
> mce_cmd -x mce_reset
> mce_cmd -x dsp_reset
> mce_cmd -x acq_flush
If that doesn't work, try unloading and reloading the PCI driver:
> modprobe...
If that doesn't work, try killing processes:
> ps
> kill <#>
STOP Command Replies:
The STOP command is supported as a special command in the Clock Card firmware. Unlike for WB and RB commands, the MCE replies to the STOP command at its leisure, and not necessarily in order with the data packets being returned.
STOP Command Data Packets:
Data packets continue to be returned following the reply to the STOP command until all of the remaining ret_dat commands are flushed from the MCE. This means that either one or two data packets are returned after the Clock Card receives a STOP command. The last data packet has the 'stop' and 'last_frame' bits set in the status frame header. With MAS, a certain amount of dead-time is required between the reply to the STOP command and the next frame of data. This dead-time is hard-coded to a default of 10 ms in the Clock Card firmware. With a delay of 1 ms, the PCI card was not ready to receive the final data packet in 50% of STOP trials. The delay can be adjusted with the 'stop_dly' command, whose value is given in microseconds (us).
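Because 'stop_dly' is specified in microseconds while the dead-times above are quoted in milliseconds, a small conversion helper can prevent unit mistakes. The following is a minimal sketch, not part of MAS: the function name stop_dly_command is hypothetical, and the command string it builds simply mirrors the 'mce_cmd -x wb cc stop_dly 10000' form used in the Testing section.

```python
def stop_dly_command(dead_time_ms: float) -> str:
    """Build a 'wb cc stop_dly' command line for a dead-time given in ms.

    Hypothetical helper: it only formats the string, it does not run mce_cmd.
    """
    if dead_time_ms < 0:
        raise ValueError("dead-time must be non-negative")
    # stop_dly takes its argument in microseconds.
    dead_time_us = round(dead_time_ms * 1000)
    return f"mce_cmd -x wb cc stop_dly {dead_time_us}"


# 10 ms dead-time, matching the value restored in the Testing section.
print(stop_dly_command(10))
```

For a 10 ms dead-time this yields the same command that the recovery sequence in the Testing section uses.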
STOP Commands Outside of Data Runs:
When a STOP command is issued outside of a data run, no data packets are returned. When a STOP command is issued during a data run, the timing of the last data frame does not generally follow the timing specified by the '> rb cc data_rate' parameter. In general, the last ret_dat is queued up as quickly as possible, irrespective of the status of '> rb cc use_dv'. For example, when the Clock Card is sourcing its DV pulses from the Sync Box and a STOP command arrives, it does not wait for the next DV pulse; instead, it issues the last ret_dat immediately. If the Clock Card waited, it would hang whenever the STOP was issued because the source of the DV pulses was not functioning correctly to begin with.
Testing
There is a test script for STOP commands that instantiates a 'runner' process and a 'stopper' process. The runner issues mce_run commands, and the stopper issues mce_stop commands 'n' seconds after the mce_run. To run the test, do the following:
> cd stop_test
> ./test.bash <# trials> <# seconds of data>
If an error occurs, the drivers will need to be flushed, etc. In another window, enter the following:
> ps aux | grep runner
> kill <"/bin/bash" process>
> killall mce_cmd (do this 6 times)
> mce_cmd -x dsp_reset
> sudo /etc/init.d/mas restart
> mce_reset_clean
> mce_reconfig
> mce_cmd -x wb cc stop_dly 10000
> mce_cmd -x wb cc data_rate 6
The runner script outputs the following information with each START/STOP:
- Legend: see the PCI card hacking page
- Healthy PCI Card:
mce@mce-ubc-2:~/stop_test$ ./test.bash 10
Running 0
Time is 4
Stopping 0
GROUP basic
STATUS X 0x0000 = 0x0000
MODE X 0x0001 = 0x001c
FRAME_COUNT X 0x0002 = 0x0eaf
REV_NUMBER X 0x0003 = 0x550105
NUM_DUMPED X 0x0006 = 0x9e1ffb
- Unhealthy PCI Card:
mce@mce-ubc-2:~/stop_test$ ./test.bash 10
Running 0
Time is 4
Stopping 0
GROUP basic
STATUS X 0x0000 = 0x0040
MODE X 0x0001 = 0x001c
FRAME_COUNT X 0x0002 = 0x1395
REV_NUMBER X 0x0003 = 0x550105
NUM_DUMPED X 0x0006 = 0x9e1ffb
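When many trials are run, it can help to check the register dump automatically. The following is a minimal sketch that parses lines of the form "NAME X 0xADDR = 0xVALUE" and flags a trial by its STATUS value. The function names are hypothetical, and the rule "STATUS equal to zero means healthy" is an assumption drawn only from the two sample outputs above; consult the legend for the meaning of individual STATUS bits.

```python
import re

# Matches register lines such as "STATUS X 0x0000 = 0x0040".
REGISTER_LINE = re.compile(
    r"^\s*(\w+)\s+X\s+0x[0-9a-fA-F]+\s*=\s*0x([0-9a-fA-F]+)\s*$"
)


def parse_registers(dump: str) -> dict:
    """Return {register name: integer value} for every register line."""
    registers = {}
    for line in dump.splitlines():
        match = REGISTER_LINE.match(line)
        if match:
            registers[match.group(1)] = int(match.group(2), 16)
    return registers


def looks_healthy(dump: str) -> bool:
    """Assumption: a STATUS of 0 indicates a healthy PCI card."""
    return parse_registers(dump).get("STATUS") == 0


healthy = """GROUP basic
STATUS X 0x0000 = 0x0000
MODE X 0x0001 = 0x001c
FRAME_COUNT X 0x0002 = 0x0eaf
"""

unhealthy = """GROUP basic
STATUS X 0x0000 = 0x0040
MODE X 0x0001 = 0x001c
FRAME_COUNT X 0x0002 = 0x1395
"""

print(looks_healthy(healthy), looks_healthy(unhealthy))
```

Lines that do not match the register pattern (the prompt, "Running 0", "GROUP basic", and so on) are skipped, so the full runner output can be passed in unchanged.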
There are a couple of other tools that one can use to check the integrity of packets being returned:
- You can see frame size information by going to /home/mce/dsp_dump and running:
> python dsp_ram.py header
- It shows the preamble, packet type, and size:
HEAD_W1_1 X 0x000f = 0xa5a5
HEAD_W1_0 X 0x0010 = 0xa5a5
HEAD_W2_1 X 0x0011 = 0x5a5a
HEAD_W2_0 X 0x0012 = 0x5a5a
HEAD_W3_1 X 0x0013 = 0x2020
HEAD_W3_0 X 0x0014 = 0x5250
HEAD_W4_1 X 0x0015 = 0x0000
HEAD_W4_0 X 0x0016 = 0x0004
- The command/reply log for MAS is stored in /data/cryo/current_data. To see the most recent entries, run the following from that directory:
> tail -n 50 log
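The header dump from dsp_ram.py lists each header word as two 16-bit halves. The sketch below recombines them into 32-bit words and reads out the preamble, packet type, and size in that order. Several points are assumptions rather than documented behaviour: that the "_1" entry is the upper half and "_0" the lower half, that the expected preamble values are the ones seen in the sample dump, and that the packet type can be read as four ASCII characters. The function names are hypothetical.

```python
def combine_words(high: int, low: int) -> int:
    """Join two 16-bit halves into one 32-bit word (assumes high half first)."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def decode_header(words: dict) -> dict:
    """Decode HEAD_W1..HEAD_W4 into preamble check, packet type, and size."""
    full = [
        combine_words(words[f"HEAD_W{i}_1"], words[f"HEAD_W{i}_0"])
        for i in range(1, 5)
    ]
    # Preamble values taken from the sample dump, not from a specification.
    preamble_ok = full[0] == 0xA5A5A5A5 and full[1] == 0x5A5A5A5A
    # Assumption: the packet type is four ASCII characters, big-endian.
    packet_type = full[2].to_bytes(4, "big").decode("ascii", errors="replace")
    return {"preamble_ok": preamble_ok, "packet_type": packet_type, "size": full[3]}


sample = {
    "HEAD_W1_1": 0xA5A5, "HEAD_W1_0": 0xA5A5,
    "HEAD_W2_1": 0x5A5A, "HEAD_W2_0": 0x5A5A,
    "HEAD_W3_1": 0x2020, "HEAD_W3_0": 0x5250,
    "HEAD_W4_1": 0x0000, "HEAD_W4_0": 0x0004,
}

decoded = decode_header(sample)
print(decoded)
```

Under these assumptions the sample dump decodes to a valid preamble, a packet type whose bytes read as two spaces followed by "RP", and a size of 4.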
Useful Signal Tap Signals
Signal Tap is useful for capturing the behavior of the MCE during STOP commands. I found the following settings to be the most useful for STOP testing:
- Nodes:
  - issue_reply:issue_reply_block|cmd_translator:i_cmd_translator|ret_dat_stop_req
  - issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_data_o
  - issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_nena_o
  - issue_reply:issue_reply_block|fibre_tx:i_fibre_tx|fibre_clk_i
  - issue_reply:issue_reply_block|cmd_translator:i_cmd_translator|current_state.REQ_LAST_DATA_PACKET
- Signal Configuration:
  - Clock: _clk0
  - Sample Depth: 64K
  - RAM Type: Auto
  - Trigger: Sequential, Center Trigger Position, Trigger Conditions = 2
Test Results and Timing Diagrams
During the testing of STOP commands in the sys_v05000000 tag of firmware, it was found that whenever a malfunction with stopping occurred, the Clock Card had been in the process of sending a data packet to the PCI card when a STOP command was issued by the PCI Card. Further investigation revealed that the PCI Card required an inordinate amount of time to process the reply to the STOP command, which caused an overflow in the PCI Card buffer space. By making changes to both the PCI Firmware and Linux Driver, we were able to increase the STOP Reply processing bandwidth to a level where STOP and On-The-Fly errors no longer occurred.
Test Cases
The cmd_translator block on the clock card is the block that nominally runs data acquisitions. It is a complicated piece of code, and requires simulation of at least the following cases:
- Acquisition of one frame of data
- Acquisition of multiple frames of data
- Acquisition while sourcing the DV from the Sync Box (use_sync=2, use_dv=2, select_clk=1)
- Acquisition while sourcing the DV from the Sync Box's input (use_sync=2, use_dv=2, select_clk=1)
- Acquisition while sourcing the DV from the Sync Box and disconnecting the Sync Box fibre.
- Acquisition while sourcing the DV from the Sync Box with the fibre initially disconnected
- Acquisition while turning the Sync Box output off and then on.
All the cases above should be repeated in the following scenarios:
- a STOP command should be issued before the first frame is returned
- a STOP command should be issued during the acquisition
- a STOP command should be issued after the acquisition
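Crossing the seven acquisition cases with the three STOP scenarios gives the full set of simulations to cover. The following is a minimal sketch that enumerates that matrix; the short labels are hypothetical paraphrases of the list items above, not identifiers from the firmware test bench.

```python
from itertools import product

# Labels paraphrase the acquisition cases listed above.
ACQUISITION_CASES = [
    "one frame",
    "multiple frames",
    "DV from Sync Box",
    "DV from Sync Box input",
    "DV from Sync Box, fibre disconnected during run",
    "DV from Sync Box, fibre initially disconnected",
    "Sync Box output turned off and then on",
]

# Labels paraphrase the STOP scenarios listed above.
STOP_SCENARIOS = [
    "STOP before first frame is returned",
    "STOP during acquisition",
    "STOP after acquisition",
]

# Every acquisition case paired with every STOP scenario.
test_matrix = list(product(ACQUISITION_CASES, STOP_SCENARIOS))

for number, (case, scenario) in enumerate(test_matrix, start=1):
    print(f"{number:2d}. {case} / {scenario}")
```

This produces 21 combinations, which can serve as a checklist when running the simulations.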