RSP Vector Instructions

by rasky on 28 Mar 2020

Within the Nintendo 64, the RSP (Reality Signal Processor) is the computation unit used for mathematical calculations. It is made of a stripped-down R4300 core (lacking a few of the more advanced opcodes), referred to as the Scalar Unit (SU), paired with a coprocessor (configured as COP2) that can perform SIMD operations on a separate set of vector registers, referred to as the Vector Unit (VU).

The RSP has two dedicated banks of zero-wait-state onboard memory: IMEM (4KB) for instructions, and DMEM (4KB) for data. It has no external memory buses, but it has a DMA engine capable of copying code/data between DMEM/IMEM and the main RDRAM. The DMA engine can be driven either by the main CPU or by the RSP itself.

The code running on the RSP is usually called “microcode”, but it is a standard MIPS program, which simply contains the dedicated COP2 instructions that drive the VU.

Excluding stalls in the pipeline, the RSP is able to perform in parallel one SU and one VU opcode in a single clock cycle. For best performance, the microcode should thus interleave SU and VU opcodes.

This article is originally from github.com/rasky/r64emu

Vector registers

VU contains 32 128-bit SIMD registers, each organized as 8 lanes of 16 bits each. Most VU opcodes perform the same operation in parallel on each of the 8 lanes. The arrangement is thus similar to x86 SSE2 registers in EPI16 format.

The vector registers array is called VPR in this document, so VPR[4] refers to the fifth register (usually called v4 in assembly). When referring to specific portions of the register, we use the following convention:

Ranges are specified using the beg..end inclusive notation (that is, both beg and end are part of the range).

The concatenation of disjoint ranges is written with a comma; for instance: [0..3,8..11] means 8 bytes formed by concatenating the 4 bytes starting at 0 with the 4 bytes starting at 8.

Accumulator

The RSP contains an 8-lane SIMD accumulator, which is used implicitly by multiplication opcodes. Each of the 8 lanes is 48 bits wide, which allows intermediate results to be accumulated without the loss of precision that would occur when storing them into a 16-bit lane of a vector register.

It is possible to extract the contents of the accumulator through the VSAR opcode; one call to this opcode can extract a 16-bit portion of each lane and store it into the specified vector register. The three portions are conventionally called ACCUM_LO (bits 15..0 of each lane), ACCUM_MD (bits 31..16 of each lane), and ACCUM_HI (bits 47..32 of each lane).

If you exclude the VSAR instruction, which extracts the accumulator piecewise, it is better to think of it as a single register where each lane is 48 bits wide.

Clamping

Multiplication opcodes perform a clamping step when extracting the accumulator into a vector register. Notice that each lane of the accumulator is always treated as a signed 48-bit number.

This is the pseudo-code for signed clamping (no surprises):

function clamp_signed(accum)
    if accum < -32768  => return -32768
    if accum > 32767   => return 32767
    return accum

The returned value is thus always within the signed 16-bit range.

This is the pseudo-code for unsigned clamping:

function clamp_unsigned(accum)
    if accum < 0       => return 0
    if accum > 32767   => return 65535
    return accum

Notice that in unsigned clamping, the saturating threshold is 15-bit, but the saturated value is 16-bit.
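As a sanity check, the two clamping rules can be written as plain functions; this is a direct transcription of the pseudo-code above (the Python names mirror the pseudo-code, they are not hardware state):

```python
def clamp_signed(accum):
    """Clamp a signed accumulator value to the signed 16-bit range."""
    if accum < -32768:
        return -32768
    if accum > 32767:
        return 32767
    return accum

def clamp_unsigned(accum):
    """Clamp to [0, 65535]: the threshold is 15-bit, the saturated value 16-bit."""
    if accum < 0:
        return 0
    if accum > 32767:
        return 65535
    return accum
```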

Loads and stores

31..26 25..21 20..16 15..11 10..7 6..0
LWC2 or SWC2 base vt opcode element offset

The instructions perform a load/store from DMEM into/from a vector register.

8/16/32/64-bit vector loads/stores

These instructions can be used to load/store up to 64 bits of data to/from a vector register:

Insn opcode Desc
LBV 0x00 load 1 byte into vector
SBV 0x00 store 1 byte from vector
LSV 0x01 load (up to) 2 bytes into vector
SSV 0x01 store 2 bytes from vector
LLV 0x02 load (up to) 4 bytes into vector
SLV 0x02 store 4 bytes from vector
LDV 0x03 load (up to) 8 bytes into vector
SDV 0x03 store 8 bytes from vector

The address in DMEM is computed as GPR[base] + (offset * access_size), where access_size is the number of bytes being accessed (eg: 4 for SLV). The address can be unaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is VPR[vt][element..element+access_size-1], that is, element selects the first accessed byte within the vector register. When element+access_size would go past byte 15, the access is truncated: bytes beyond the end of the register are not transferred.

Loads affect only a portion of the vector register (which is 128-bit); other bytes in the register are not modified.
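The addressing rule for the short loads can be sketched as follows (the function name and the dict-based return value are illustrative conventions, not hardware state); it maps register byte indices to the DMEM addresses being read:

```python
def short_load_bytes(base, offset, element, size):
    """Map register byte indices to the DMEM addresses read by LBV/LSV/LLV/LDV.

    `size` is the access size in bytes (1, 2, 4 or 8); the access is
    truncated when it would run past byte 15 of the 128-bit register.
    """
    addr = base + offset * size        # may be unaligned: these loads allow it
    return {element + i: addr + i for i in range(size) if element + i <= 15}
```

For instance, `short_load_bytes(0x100, 2, 0, 8)` describes an LDV reading 8 bytes starting at 0x110 into the first half of the register.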

128-bit vector loads

These instructions can be used to load up to 128 bits of data into a vector register:

Insn opcode Desc
LQV 0x04 load (up to) 16 bytes into vector, left-aligned
LRV 0x05 load (up to) 16 bytes into vector, right-aligned

Roughly, these instructions behave like LWL and LWR: combined, they allow reading 128 bits of data into a vector register, irrespective of alignment. For instance, this code fills v0 with 128 bits of data starting at the possibly-unaligned address $08(a0).

// a0 is 128-bit aligned in this example
LQV v0[e0],$08(a0)     // read bytes $08(a0)-$0F(a0) into left part of the vector (VPR[0][0..7])
LRV v0[e0],$18(a0)     // read bytes $10(a0)-$17(a0) into right part of the vector (VPR[0][8..15])

Notice that if the data is 128-bit aligned, LQV is sufficient to read the whole vector (LRV in this case is redundant because it becomes a no-op).

The actual bytes accessed in DMEM depend on the instruction: for LQV, the bytes are those starting at GPR[base] + (offset * 16), up to and excluding the next 128-bit aligned byte ($10(a0) in the above example); for LRV, the bytes are those starting at the previous 128-bit aligned byte ($10(a0) in the above example) up to and excluding GPR[base] + (offset * 16). Again, this is exactly the same behavior of LWL and LWR, but for 128-bit aligned loads.

element is used as a byte offset within the vector register to specify the first byte affected by the operation; that is, the part of the vector being loaded with the instruction pair is VPR[vt][element..15]. Thus a non-zero element means that fewer bytes are loaded; for instance, this code loads 12 unaligned bytes into the lower part of the vector starting at byte 4:

LQV v1[e4],$08(a0)     // read bytes $08(a0)-$0F(a0) into VPR[1][4..11]
LRV v1[e4],$18(a0)     // read bytes $10(a0)-$13(a0) into VPR[1][12..15]
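The pair can be modeled in a few lines. This sketch (the helper names and the dict-based return are made up for clarity) returns which register bytes receive which DMEM bytes, and reproduces the examples above:

```python
def lqv(mem, addr, e):
    """Register bytes loaded by LQV: from addr up to the next 128-bit boundary,
    starting at register byte `e`, truncated at byte 15."""
    end = (addr | 15) + 1              # next 128-bit boundary (exclusive)
    out = {}
    for i, a in enumerate(range(addr, end)):
        if e + i > 15:
            break                      # truncated at the end of the register
        out[e + i] = mem[a]
    return out

def lrv(mem, addr, e):
    """Register bytes loaded by LRV: from the previous 128-bit boundary up to
    (excluding) addr, right-aligned in the register, shifted by `e`."""
    start = addr & ~15                 # previous 128-bit boundary
    count = addr - start
    out = {}
    for i, a in enumerate(range(start, addr)):
        j = (16 - count) + e + i       # destination register byte
        if j > 15:
            break
        out[j] = mem[a]
    return out
```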

128-bit vector stores

These instructions can be used to store up to 128 bits of data from a vector register:

Insn opcode Desc
SQV 0x04 store (up to) 16 bytes from vector, left-aligned
SRV 0x05 store (up to) 16 bytes from vector, right-aligned

These instructions behave like SWL and SWR and are thus the counterpart to LQV and LRV. For instance:

// a0 is 128-bit aligned in this example
SQV v0[e0],$08(a0)     // store left (higher) part of the vector into bytes $08(a0)-$0F(a0)
SRV v0[e0],$18(a0)     // store right (lower) part of the vector into bytes $10(a0)-$17(a0)

The main difference from the load instructions is how element is used: it still refers to the first byte being accessed in the vector register, but SQV/SRV always perform a full-width write (128 bits in total when used together), and the data is fetched from VPR[vt][element..element+15], wrapping around the vector. For instance:

SQV v1[e4],$08(a0)     // write bytes $08(a0)-$0F(a0) from VPR[1][4..11]
SRV v1[e4],$18(a0)     // write bytes $10(a0)-$17(a0) from VPR[1][12..15,0..3]
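The store side can be sketched symmetrically (again, function names and the dict-based return are illustrative); note how the register byte index wraps around with `& 15`, matching the example above:

```python
def sqv(reg, addr, e):
    """DMEM bytes written by SQV: from addr up to the next 128-bit boundary,
    sourced from register bytes starting at `e`, wrapping within the register."""
    end = (addr | 15) + 1
    return {a: reg[(e + i) & 15] for i, a in enumerate(range(addr, end))}

def srv(reg, addr, e):
    """DMEM bytes written by SRV: from the previous 128-bit boundary up to
    (excluding) addr, sourced from the right part of the register."""
    start = addr & ~15
    count = addr - start
    return {a: reg[(e + (16 - count) + i) & 15]
            for i, a in enumerate(range(start, addr))}
```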

128-bit vector transpose

These instructions are used to read/write lanes across a group of registers, to help implement the transposition of a matrix:

Insn opcode Desc
LTV 0x0B load 8 lanes to 8 different registers
STV 0x0B store 8 lanes from 8 different registers
SWV 0x0A store 16 bytes from vector, wrapped

The 8-register group is identified by vt, ignoring its lowest 3 bits. This means that the 32 registers are logically divided into 4 groups (0-7, 8-15, 16-23, 24-31).

The lanes affected within the register group are laid out diagonally; for instance, if vt is zero, the lanes will be: VREG[0]<0>, VREG[1]<1>, …, VREG[7]<7>. element(3..1) specifies the first register affected within the register group, and thus identifies the diagonal. For instance, if vt is 0 and element(3..1) is 5, the lanes will be: VREG[5]<0>, VREG[6]<1>, VREG[7]<2>, VREG[0]<3>, etc. Notice that element(0) is ignored.

The following table shows the numbering of the 8 diagonals present in an 8-register group; each cell of the table contains the diagonal that lane belongs to (and thus shows which element(3..1) value will trigger an access to it):

Reg Lane 0 Lane 1 Lane 2 Lane 3 Lane 4 Lane 5 Lane 6 Lane 7
v0 0 7 6 5 4 3 2 1
v1 1 0 7 6 5 4 3 2
v2 2 1 0 7 6 5 4 3
v3 3 2 1 0 7 6 5 4
v4 4 3 2 1 0 7 6 5
v5 5 4 3 2 1 0 7 6
v6 6 5 4 3 2 1 0 7
v7 7 6 5 4 3 2 1 0
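The table follows a simple closed form: the diagonal of a given lane in a given register is (reg - lane) mod 8. A quick generator (illustrative, not hardware):

```python
def diagonal(reg, lane):
    # Diagonal index of `lane` within register `reg` of an 8-register group.
    return (reg - lane) % 8

# Rebuild the table above, one row per register.
table = [[diagonal(r, l) for l in range(8)] for r in range(8)]
```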

STV writes lane 0 of the specified diagonal to the address GPR[base] + (offset * 16); following lanes are written to subsequent memory addresses, wrapping around at the second 64-bit boundary. For instance, STV v0[e2],$1E(r0) writes diagonal 1, starting with VPR[1]<0>, to the following addresses: $1E, $20, $22, $24, $26, $18, $1A, $1C.
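The STV address sequence can be sketched from an effective DMEM address (the function takes the already-computed address for simplicity; the name is made up). It reproduces the $1E(r0) example above:

```python
def stv_addresses(addr):
    """Addresses of the 8 16-bit lanes written by STV, starting at `addr` and
    wrapping within the 16-byte window anchored at the previous 64-bit boundary."""
    anchor = addr & ~7
    return [anchor + ((addr - anchor + 2 * i) & 15) for i in range(8)]
```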

LTV fetches two subsequent 64-bit aligned words, starting from (GPR[base] + (offset * 16)) & ~7, and loads them into the lanes of the diagonal specified by element(3..1), mirroring the wrapping behavior of STV.

By combining STV and LTV, it is possible to transpose a matrix because diagonals are symmetric; for instance, assuming a 8x8 matrix is stored in VPR[0..7]<0..7>, the following sequence transposes it:

// a0 is 128-bit aligned
STV v0[e2],$10(a0)  // store diagonal 1
STV v0[e4],$20(a0)  // store diagonal 2
STV v0[e6],$30(a0)  // store diagonal 3
STV v0[e8],$40(a0)  // store diagonal 4
STV v0[e10],$50(a0) // store diagonal 5
STV v0[e12],$60(a0) // store diagonal 6
STV v0[e14],$70(a0) // store diagonal 7

LTV v0[e14],$10(a0) // load back diagonal 1 into diagonal 7
LTV v0[e12],$20(a0) // load back diagonal 2 into diagonal 6
LTV v0[e10],$30(a0) // load back diagonal 3 into diagonal 5
LTV v0[e8],$40(a0)  // load back diagonal 4 into diagonal 4
LTV v0[e6],$50(a0)  // load back diagonal 5 into diagonal 3
LTV v0[e4],$60(a0)  // load back diagonal 6 into diagonal 2
LTV v0[e2],$70(a0)  // load back diagonal 7 into diagonal 1

It is also possible to transpose a matrix stored in memory by combining LTV and SWV. SWV is much simpler than the other transpose instructions. It writes byte element(3..0) of vt to the address GPR[base] + (offset * 16), and writes subsequent bytes of vt to subsequent addresses, wrapping around within vt. Addresses also wrap at the second 64-bit boundary. The following sequence transposes a matrix stored at $00(a0)..$7F(a0):

// a0 is 128-bit aligned
LTV v0[e0], $00(a0)   // load diagonal 0
LTV v0[e14], $10(a0)  // load diagonal 7
LTV v0[e12], $20(a0)  // load diagonal 6
LTV v0[e10], $30(a0)  // load diagonal 5
LTV v0[e8], $40(a0)   // load diagonal 4
LTV v0[e6], $50(a0)   // load diagonal 3
LTV v0[e4], $60(a0)   // load diagonal 2
LTV v0[e2], $70(a0)   // load diagonal 1

SWV v0[e0], $00(a0)   // store column 0 to row 0
SWV v1[e2], $10(a0)   // store column 1 to row 1
SWV v2[e4], $20(a0)   // store column 2 to row 2
SWV v3[e6], $30(a0)   // store column 3 to row 3
SWV v4[e8], $40(a0)   // store column 4 to row 4
SWV v5[e10], $50(a0)  // store column 5 to row 5
SWV v6[e12], $60(a0)  // store column 6 to row 6
SWV v7[e14], $70(a0)  // store column 7 to row 7

8-bit packed loads and stores

These instructions can be used to load or store distinct 8-bit signed/unsigned values into a vector register, moving each 8-bit value into/from its own lane.

Insn opcode Desc
LPV 0x06 load 8 signed 8-bit values into 8 lanes
LUV 0x07 load 8 unsigned 8-bit values into 8 lanes
SPV 0x06 store 8 signed 8-bit values from 8 lanes
SUV 0x07 store 8 unsigned 8-bit values from 8 lanes

The only difference between the signed and unsigned versions is how the 8-bit values are mapped into the 16-bit lanes. Signed opcodes (LPV, SPV) map the value to bits (15..8) (effectively producing a signed number), while unsigned opcodes (LUV, SUV) map the value to bits (14..7). Load instructions zero the bits outside the mapped range, while store instructions effectively ignore the other bits.
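The two bit mappings can be written down directly (function names are illustrative); note how the unsigned mapping halves the value's weight so that 0xFF lands just below 1.0 in the 1.15 lane format:

```python
def lane_from_signed_byte(b):
    # LPV: the byte goes to bits (15..8) of the lane; other bits are zeroed.
    return (b & 0xFF) << 8

def lane_from_unsigned_byte(b):
    # LUV: the byte goes to bits (14..7) of the lane; other bits are zeroed.
    return (b & 0xFF) << 7
```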

The packed loads first create a 128-bit intermediate value W by reading 16 bytes from DMEM, starting at GPR[base] + (offset * 8) and wrapping at the second 64-bit (8-byte) boundary. The first byte read is loaded into byte offset element of W, with subsequent byte offsets wrapping around within W. Bytes (0..7) of W are then mapped to the appropriate bits of each 16-bit lane within the target register.

The packed stores generally behave as you would expect, mapping the appropriate bits of each lane to consecutive bytes in memory, starting with the lane specified by element. However, instead of wrapping at 8 lanes, packed stores wrap at 16, and change the mapping bits for “lanes” 8-15. SPV when a lane index is in the range [8..15] behaves like SUV when its lane index is in the range [0..7], and vice versa.

For instance:

// a0 is 64-bit aligned
LUV v1[e5],$02(a0)     // load bytes $00(a0)-$04(a0) into VPR[1]<3..7>,
                          and $0d(a0)-$0f(a0) into VPR[1]<0..2>
SUV v1[e5],$02(a0)     // write bytes $02(a0)-$04(a0) from VPR[1]<5..7> (14..7),
                          and $05(a0)-$09(a0) from VPR[1]<0..5> (15..8)

8-bit strided loads and stores

Like the packed load/store instructions, these instructions load or store 8-bit unsigned values into a vector register, moving each 8-bit value from/into its own lane. Unlike the packed instructions, however, the addresses of these 8-bit values are not consecutive: they fall on every other address or on every fourth address.

Insn opcode Desc
LHV 0x08 load 8 unsigned 8-bit values into 8 lanes
LFV 0x09 load 8 unsigned 8-bit values into 8 lanes
SHV 0x08 store 8 unsigned 8-bit values from 8 lanes
SFV 0x09 store 8 unsigned 8-bit values from 8 lanes

Similar to (LUV, SUV), these handle unsigned values, and map each value to bits (14..7) of the 16-bit lanes. Load instructions zero the bits of each lane outside the mapped range, while store instructions effectively ignore the other bits.

Like the packed loads, the strided loads create an intermediate value W by reading 16 bytes from DMEM, this time starting at GPR[base] + (offset * 16). After loading W as above, however, instead of mapping the leftmost 8 bytes into the lanes, LHV uses every other byte, while LFV uses every fourth byte (repeated in a pattern).

LFV, since it doesn’t write an entire register, behaves differently from other loads at this last step. It creates a second 128-bit temporary, and loads (14..7) of each lane in this temporary from a different byte in W, in the pattern 0,4,8,12,8,12,0,4. 64 bits of the original register starting at byte index element (NOT wrapping around) are then replaced with the corresponding bits of the second temporary.

SHV stores the appropriate bits of each lane into every other byte in memory, like SUV. However, instead of storing one lane at a time starting from lane element, it stores every other byte in the register, beginning at byte index element (after rotating the entire register left by one bit to align the mapping). Mapping bits are not affected by element, and addresses wrap at the second 64-bit boundary.

SFV is more complex. It first creates a new 128-bit temporary, and loads each byte of the temporary from (14..7) of a different lane in the source register, using the pattern 0,6,X,X,1,7,X,X,2,4,X,X,3,5,X,X (X represents a zeroed byte). It then increments element values in the range [8..15] by 1, so they become [9..15,0]. Finally, every fourth byte of the temporary, beginning at byte index element, is written to every fourth byte in memory, with addresses wrapping at the second 64-bit boundary.

For instance:

// a0 is 64-bit aligned
LHV v1[e7],$06(a0)     // load bytes $01(a0)..$0d(a0) (odd) into VPR[1]<1..7>,
                          then byte $0f(a0) into VPR[1]<0>
SHV v1[e3],$06(a0)     // write bytes $06(a0)..$0e(a0) (even) from VPR[1][3..11] (odd)
                          then bytes $00(a0)..$04(a0) (even) from VPR[1][13,15,1]
LFV v1[e3],$06(a0)     // load byte $07(a0) into VPR[1]<1>
                          then bytes $0b(a0),$0f(a0) into VPR[1]<2..3>
                          then bytes $0b(a0),$0f(a0) into VPR[1]<4..5>
SFV v1[e5],$06(a0)     // write bytes $06(a0),$0a(a0),$0e(a0) from VPR[1]<7,4,5>
                          then byte $02(a0) from VPR[1]<6>

Vector move instructions

31..21 20..16 15..11 10..8 7..0
COP2 Move rt vs vs_elem 0

Vector moves follow the same format as other coprocessor moves, but use part of the lower 11 bits to specify which lane of the vector register is accessed. mtc2 moves the lower 16 bits of the general purpose register rt to the vector register lane VS<vs_elem>, while mfc2 moves VS<vs_elem> to GPR rt, sign-extending to 64 bits.

ctc2 moves the lower 16 bits of GPR rt into the control register specified by vs, while cfc2 does the reverse, moving the control register specified by vs into GPR rt, sign extending to 64 bits. Note that both ctc2 and cfc2 ignore the vs_elem field. For these instructions, the control register is specified as follows:

vs Register
0 VCO
1 VCC
2 VCE

Single-lane instructions

31..26 25 24..21 20..16 15..11 10..6 5..0
COP2 1 vt_elem vt vd_elem vd opcode

Single-lane instructions are a group of instructions that perform an operation on a single lane of a single input register (VT<vt_elem>), and store the result into a single lane of a single output register (VD<vd_elem>). Only the lowest 3 bits of vt_elem and vd_elem are used to compute the source lane se and destination lane de, respectively.

VMOV

Copy a lane from vt to vd, after broadcast:

VMOV vd[de],vt[de]

Pseudo-code:

VD<de> = VT<de>

As a side-effect, ACCUM_LO is loaded with VT. Note that the source and destination lanes are both de, and vt_elem is only being used as a broadcast modifier. See the section on computational instructions for more details about how vt_elem modifies how vt is accessed.

VRCP

Computes a 32-bit reciprocal of the 16-bit input lane, and stores it into the output lane:

VRCP vd[de],vt[se]

The reciprocal is computed using a lookup table of 512 elements of 16 bits each. The table is burned into an internal ROM of the RSP and cannot be directly accessed nor modified.

The function computes a 32-bit reciprocal; the lower 16 bits of the result are stored into the destination lane, while the higher 16 bits are stored into the DIV_OUT special register, from which they can subsequently be read using VRCPH.

Pseudo-code:

function rcp(input(31..0))
    result = 0
    if input == 0
        return NOT result
    endif
    x = abs(input)
    scale_out = highest_set_bit(x)
    scale_in = 32 - scale_out
    result(scale_out..scale_out-16) = 1 || RCP_ROM[x(scale_in-1..scale_in-9)]
    if input < 0
        result = NOT result
    endif
    return result

result = rcp(sign_extend(VT<se>))
VD<de> = result(15..0)
DIV_OUT = result(31..16)
for i in 0..7
    ACCUM<i>(15..0) = VT<i>(15..0)
endfor

As a side-effect, ACCUM_LO is loaded with VT (all lanes).

This is the RCP_ROM table:

ffff  ff00  fe01  fd04  fc07  fb0c  fa11  f918  f81f  f727  f631  f53b  f446  f352  f25f  f16d
f07c  ef8b  ee9c  edae  ecc0  ebd3  eae8  e9fd  e913  e829  e741  e65a  e573  e48d  e3a9  e2c5
e1e1  e0ff  e01e  df3d  de5d  dd7e  dca0  dbc2  dae6  da0a  d92f  d854  d77b  d6a2  d5ca  d4f3
d41d  d347  d272  d19e  d0cb  cff8  cf26  ce55  cd85  ccb5  cbe6  cb18  ca4b  c97e  c8b2  c7e7
c71c  c652  c589  c4c0  c3f8  c331  c26b  c1a5  c0e0  c01c  bf58  be95  bdd2  bd10  bc4f  bb8f
bacf  ba10  b951  b894  b7d6  b71a  b65e  b5a2  b4e8  b42e  b374  b2bb  b203  b14b  b094  afde
af28  ae73  adbe  ad0a  ac57  aba4  aaf1  aa40  a98e  a8de  a82e  a77e  a6d0  a621  a574  a4c6
a41a  a36e  a2c2  a217  a16d  a0c3  a01a  9f71  9ec8  9e21  9d79  9cd3  9c2d  9b87  9ae2  9a3d
9999  98f6  9852  97b0  970e  966c  95cb  952b  948b  93eb  934c  92ad  920f  9172  90d4  9038
8f9c  8f00  8e65  8dca  8d30  8c96  8bfc  8b64  8acb  8a33  899c  8904  886e  87d8  8742  86ad
8618  8583  84f0  845c  83c9  8336  82a4  8212  8181  80f0  8060  7fd0  7f40  7eb1  7e22  7d93
7d05  7c78  7beb  7b5e  7ad2  7a46  79ba  792f  78a4  781a  7790  7706  767d  75f5  756c  74e4
745d  73d5  734f  72c8  7242  71bc  7137  70b2  702e  6fa9  6f26  6ea2  6e1f  6d9c  6d1a  6c98
6c16  6b95  6b14  6a94  6a13  6993  6914  6895  6816  6798  6719  669c  661e  65a1  6524  64a8
642c  63b0  6335  62ba  623f  61c5  614b  60d1  6058  5fdf  5f66  5eed  5e75  5dfd  5d86  5d0f
5c98  5c22  5bab  5b35  5ac0  5a4b  59d6  5961  58ed  5879  5805  5791  571e  56ac  5639  55c7
5555  54e3  5472  5401  5390  5320  52af  5240  51d0  5161  50f2  5083  5015  4fa6  4f38  4ecb
4e5e  4df1  4d84  4d17  4cab  4c3f  4bd3  4b68  4afd  4a92  4a27  49bd  4953  48e9  4880  4817
47ae  4745  46dc  4674  460c  45a5  453d  44d6  446f  4408  43a2  433c  42d6  4270  420b  41a6
4141  40dc  4078  4014  3fb0  3f4c  3ee8  3e85  3e22  3dc0  3d5d  3cfb  3c99  3c37  3bd6  3b74
3b13  3ab2  3a52  39f1  3991  3931  38d2  3872  3813  37b4  3755  36f7  3698  363a  35dc  357f
3521  34c4  3467  340a  33ae  3351  32f5  3299  323e  31e2  3187  312c  30d1  3076  301c  2fc2
2f68  2f0e  2eb4  2e5b  2e02  2da9  2d50  2cf8  2c9f  2c47  2bef  2b97  2b40  2ae8  2a91  2a3a
29e4  298d  2937  28e0  288b  2835  27df  278a  2735  26e0  268b  2636  25e2  258d  2539  24e5
2492  243e  23eb  2398  2345  22f2  22a0  224d  21fb  21a9  2157  2105  20b4  2063  2012  1fc1
1f70  1f1f  1ecf  1e7f  1e2e  1ddf  1d8f  1d3f  1cf0  1ca1  1c52  1c03  1bb4  1b66  1b17  1ac9
1a7b  1a2d  19e0  1992  1945  18f8  18ab  185e  1811  17c4  1778  172c  16e0  1694  1648  15fd
15b1  1566  151b  14d0  1485  143b  13f0  13a6  135c  1312  12c8  127f  1235  11ec  11a3  1159
1111  10c8  107f  1037  0fef  0fa6  0f5e  0f17  0ecf  0e87  0e40  0df9  0db2  0d6b  0d24  0cdd
0c97  0c50  0c0a  0bc4  0b7e  0b38  0af2  0aad  0a68  0a22  09dd  0998  0953  090f  08ca  0886
0842  07fd  07b9  0776  0732  06ee  06ab  0668  0624  05e1  059e  055c  0519  04d6  0494  0452
0410  03ce  038c  034a  0309  02c7  0286  0245  0204  01c3  0182  0141  0101  00c0  0080  0040

VRSQ

Computes a 32-bit reciprocal of the square root of the input lane, and stores it into the output lane:

VRSQ vd[de],vt[se]

The reciprocal of the square root is computed using a lookup table similar to that used by VRCP (512 elements of 16 bits each), stored within the same ROM. The higher part of the result is stored into the same DIV_OUT special register used by VRCP.

Pseudo-code:

function rsq(input(31..0))
    result = 0
    if input == 0
        return NOT result
    endif
    x = abs(input)
    scale_out = highest_set_bit(x)
    scale_in = 32 - scale_out
    scale_out = scale_out / 2
    result(scale_out..scale_out-16) = 1 || RSQ_ROM[scale_in(0) || x(scale_in-1..scale_in-8)]
    if input < 0
        result = NOT result
    endif
    return result

result = rsq(sign_extend(VT<se>))
VD<de> = result(15..0)
DIV_OUT = result(31..16)

This is the RSQ_ROM table:

ffff  ff00  fe02  fd06  fc0b  fb12  fa1a  f923  f82e  f73b  f648  f557  f467  f379  f28c  f1a0
f0b6  efcd  eee5  edff  ed19  ec35  eb52  ea71  e990  e8b1  e7d3  e6f6  e61b  e540  e467  e38e
e2b7  e1e1  e10d  e039  df66  de94  ddc4  dcf4  dc26  db59  da8c  d9c1  d8f7  d82d  d765  d69e
d5d7  d512  d44e  d38a  d2c8  d206  d146  d086  cfc7  cf0a  ce4d  cd91  ccd6  cc1b  cb62  caa9
c9f2  c93b  c885  c7d0  c71c  c669  c5b6  c504  c453  c3a3  c2f4  c245  c198  c0eb  c03f  bf93
bee9  be3f  bd96  bced  bc46  bb9f  baf8  ba53  b9ae  b90a  b867  b7c5  b723  b681  b5e1  b541
b4a2  b404  b366  b2c9  b22c  b191  b0f5  b05b  afc1  af28  ae8f  adf7  ad60  acc9  ac33  ab9e
ab09  aa75  a9e1  a94e  a8bc  a82a  a799  a708  a678  a5e8  a559  a4cb  a43d  a3b0  a323  a297
a20b  a180  a0f6  a06c  9fe2  9f59  9ed1  9e49  9dc2  9d3b  9cb4  9c2f  9ba9  9b25  9aa0  9a1c
9999  9916  9894  9812  9791  9710  968f  960f  9590  9511  9492  9414  9397  931a  929d  9221
91a5  9129  90af  9034  8fba  8f40  8ec7  8e4f  8dd6  8d5e  8ce7  8c70  8bf9  8b83  8b0d  8a98
8a23  89ae  893a  88c6  8853  87e0  876d  86fb  8689  8618  85a7  8536  84c6  8456  83e7  8377
8309  829a  822c  81bf  8151  80e4  8078  800c  7fa0  7f34  7ec9  7e5e  7df4  7d8a  7d20  7cb6
7c4d  7be5  7b7c  7b14  7aac  7a45  79de  7977  7911  78ab  7845  77df  777a  7715  76b1  764d
75e9  7585  7522  74bf  745d  73fa  7398  7337  72d5  7274  7213  71b3  7152  70f2  7093  7033
6fd4  6f76  6f17  6eb9  6e5b  6dfd  6da0  6d43  6ce6  6c8a  6c2d  6bd1  6b76  6b1a  6abf  6a64
6a09  6955  68a1  67ef  673e  668d  65de  6530  6482  63d6  632b  6280  61d7  612e  6087  5fe0
5f3a  5e95  5df1  5d4e  5cac  5c0b  5b6b  5acb  5a2c  598f  58f2  5855  57ba  5720  5686  55ed
5555  54be  5427  5391  52fc  5268  51d5  5142  50b0  501f  4f8e  4efe  4e6f  4de1  4d53  4cc6
4c3a  4baf  4b24  4a9a  4a10  4987  48ff  4878  47f1  476b  46e5  4660  45dc  4558  44d5  4453
43d1  434f  42cf  424f  41cf  4151  40d2  4055  3fd8  3f5b  3edf  3e64  3de9  3d6e  3cf5  3c7c
3c03  3b8b  3b13  3a9c  3a26  39b0  393a  38c5  3851  37dd  3769  36f6  3684  3612  35a0  352f
34bf  344f  33df  3370  3302  3293  3226  31b9  314c  30df  3074  3008  2f9d  2f33  2ec8  2e5f
2df6  2d8d  2d24  2cbc  2c55  2bee  2b87  2b21  2abb  2a55  29f0  298b  2927  28c3  2860  27fd
279a  2738  26d6  2674  2613  25b2  2552  24f2  2492  2432  23d3  2375  2317  22b9  225b  21fe
21a1  2145  20e8  208d  2031  1fd6  1f7b  1f21  1ec7  1e6d  1e13  1dba  1d61  1d09  1cb1  1c59
1c01  1baa  1b53  1afc  1aa6  1a50  19fa  19a5  1950  18fb  18a7  1853  17ff  17ab  1758  1705
16b2  1660  160d  15bc  156a  1519  14c8  1477  1426  13d6  1386  1337  12e7  1298  1249  11fb
11ac  115e  1111  10c3  1076  1029  0fdc  0f8f  0f43  0ef7  0eab  0e60  0e15  0dca  0d7f  0d34
0cea  0ca0  0c56  0c0c  0bc3  0b7a  0b31  0ae8  0aa0  0a58  0a10  09c8  0981  0939  08f2  08ab
0865  081e  07d8  0792  074d  0707  06c2  067d  0638  05f3  05af  056a  0526  04e2  049f  045b
0418  03d5  0392  0350  030d  02cb  0289  0247  0206  01c4  0183  0142  0101  00c0  0080  0040

VRCPH/VRSQH

Reads the higher part of the result of a previous 32-bit reciprocal instruction, and stores the higher part of the input for a following 32-bit reciprocal.

VRCPH vd[de],vt[se]

VRSQH is meant to be used for the reciprocal of the square root, but its behavior is identical to VRCPH, as neither performs an actual calculation, and there is a single pair of DIV_IN and DIV_OUT registers used for both kinds of reciprocals.

This opcode performs two separate steps: first, the output of a previous reciprocal is read from DIV_OUT and stored into the output lane VD<de>; second, the input lane VT<se> is loaded into the special register DIV_IN, ready for a following full-width 32-bit reciprocal that can be invoked with VRCPL.

Pseudo-code:

VD<de>(15..0) = DIV_OUT(15..0)
DIV_IN(15..0) = VT<se>(15..0)
for i in 0..7
    ACCUM<i>(15..0) = VT<i>(15..0)
endfor

As a side-effect, ACCUM_LO is loaded with VT (all lanes).

VRCPL/VRSQL

Performs a full 32-bit reciprocal, combining the input lane with the special register DIV_IN, which must have been loaded by a previous VRCPH/VRSQH instruction.

VRCPL vd[de],vt[se]
VRSQL vd[de],vt[se]

The RSP remembers whether DIV_IN was loaded by a previous VRCPH or VRSQH instruction. If VRCPL/VRSQL is executed without DIV_IN being loaded, it behaves exactly like its 16-bit counterpart VRCP/VRSQ (that is, the input lane is sign-extended). After VRCPL/VRSQL, DIV_IN is unloaded.

Pseudo-code:

result = rcp(DIV_IN(15..0) || VT<se>(15..0))  // or rsq()
VD<de> = result(15..0)
DIV_OUT = result(31..16)
DIV_IN = <null>
for i in 0..7
    ACCUM<i>(15..0) = VT<i>(15..0)
endfor

As a side-effect, ACCUM_LO is loaded with VT (all lanes).
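The selection of the 32-bit input for the reciprocal can be sketched as follows (`rcp_input` is an illustrative name; DIV_IN is modeled as an optional argument standing in for the loaded/unloaded state):

```python
def rcp_input(vt_lane, div_in=None):
    """32-bit input to rcp()/rsq(): DIV_IN || VT<se> when DIV_IN is loaded
    (VRCPL/VRSQL), otherwise the sign-extended 16-bit lane (VRCP/VRSQ)."""
    if div_in is None:
        # sign-extend the 16-bit lane to 32 bits
        return vt_lane - 0x10000 if vt_lane & 0x8000 else vt_lane
    return (div_in << 16) | vt_lane
```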

Computational instructions

31..26 25 24..21 20..16 15..11 10..6 5..0
COP2 1 element vt vs vd opcode

Instructions have this general format:

VINSN vd, vs, vt[element]

where element is a “broadcast modifier” (as found in other SIMD architectures) that modifies the access to vt, duplicating some lanes and hiding others.

element Lanes being accessed Description
0 0,1,2,3,4,5,6,7 Normal register access (no broadcast)
1 0,1,2,3,4,5,6,7 Normal register access (no broadcast)
2 0,0,2,2,4,4,6,6 Broadcast 4 of 8 lanes
3 1,1,3,3,5,5,7,7 Broadcast 4 of 8 lanes
4 0,0,0,0,4,4,4,4 Broadcast 2 of 8 lanes
5 1,1,1,1,5,5,5,5 Broadcast 2 of 8 lanes
6 2,2,2,2,6,6,6,6 Broadcast 2 of 8 lanes
7 3,3,3,3,7,7,7,7 Broadcast 2 of 8 lanes
8 0,0,0,0,0,0,0,0 Broadcast single lane
9 1,1,1,1,1,1,1,1 Broadcast single lane
10 2,2,2,2,2,2,2,2 Broadcast single lane
11 3,3,3,3,3,3,3,3 Broadcast single lane
12 4,4,4,4,4,4,4,4 Broadcast single lane
13 5,5,5,5,5,5,5,5 Broadcast single lane
14 6,6,6,6,6,6,6,6 Broadcast single lane
15 7,7,7,7,7,7,7,7 Broadcast single lane
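The broadcast table has a compact closed form, which can be handy in an emulator; this sketch (names are illustrative) computes, for a given element value, which source lane of vt feeds each of the 8 output lanes:

```python
def broadcast_lanes(element):
    """Source lane of vt accessed for each of the 8 output lanes."""
    def src(i):
        if element <= 1:
            return i                         # normal access, no broadcast
        if element <= 3:
            return (i & ~1) | (element - 2)  # broadcast 4 of 8 lanes
        if element <= 7:
            return (i & ~3) | (element - 4)  # broadcast 2 of 8 lanes
        return element - 8                   # broadcast a single lane
    return [src(i) for i in range(8)]
```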

This is the list of opcodes in this group:

Opcode Instruction
0x00 VMULF
0x01 VMULU
0x02 VRNDP
0x03 VMULQ
0x04 VMUDL
0x05 VMUDM
0x06 VMUDN
0x07 VMUDH
0x08 VMACF
0x09 VMACU
0x0A VRNDN
0x0B VMACQ
0x0C VMADL
0x0D VMADM
0x0E VMADN
0x0F VMADH
0x10 VADD
0x11 VSUB
0x14 VADDC
0x15 VSUBC
0x1D VSAR
0x28 VAND
0x29 VNAND
0x2A VOR
0x2B VNOR
0x2C VXOR
0x2D VNXOR

VADD/VSUB

Vector addition or subtraction, with signed saturation:

vadd vd, vs, vt[e]
vsub vd, vs, vt[e]

Pseudo-code for vadd:

for i in 0..7
    result(16..0) = VS<i>(15..0) + VT<i>(15..0) + VCO(i)
    ACC<i>(15..0) = result(15..0)
    VD<i>(15..0) = clamp_signed(result(16..0))
    VCO(i) = 0
    VCO(i + 8) = 0
endfor

Pseudo-code for vsub:

for i in 0..7
    result(16..0) = VS<i>(15..0) - VT<i>(15..0) - VCO(i)
    ACC<i>(15..0) = result(15..0)
    VD<i>(15..0) = clamp_signed(result(16..0))
    VCO(i) = 0
    VCO(i + 8) = 0
endfor

Both instructions use the carry bits in VCO_LO, and clear them after usage. VCO_HI is also cleared (though it is not used).
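A single lane of vadd can be sketched in Python, following the pseudo-code above (sign16 and the tuple return are illustrative conventions, not hardware state):

```python
def sign16(x):
    """Interpret a 16-bit value as a signed integer."""
    return x - 0x10000 if x & 0x8000 else x

def vadd_lane(vs, vt, vco):
    """One lane of VADD: returns (ACC_LO, VD) as 16-bit values.
    The 17-bit intermediate sum goes to the accumulator unclamped;
    the destination lane gets the signed-saturated value."""
    result = sign16(vs) + sign16(vt) + vco
    acc_lo = result & 0xFFFF
    vd = max(-32768, min(32767, result)) & 0xFFFF
    return acc_lo, vd
```

Note how the two outputs diverge on overflow: the accumulator keeps the wrapped sum while the destination saturates.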

VADDC/VSUBC

Vector addition or subtraction, with unsigned carry computation:

vaddc vd, vs, vt[e]
vsubc vd, vs, vt[e]

Pseudo-code for vaddc:

for i in 0..7
    result(16..0) = VS<i>(15..0) + VT<i>(15..0)
    ACC<i>(15..0) = result(15..0)
    VD<i>(15..0) = result(15..0)
    VCO(i) = result(16)
    VCO(i + 8) = 0
endfor

Pseudo-code for vsubc:

for i in 0..7
    result(16..0) = VS<i>(15..0) - VT<i>(15..0)
    ACC<i>(15..0) = result(15..0)
    VD<i>(15..0) = result(15..0)
    VCO(i) = result(16)
    VCO(i + 8) = result(16..0) != 0
endfor

Both instructions store the carry produced by the unsigned overflow (bit 16) into VCO_LO, but vaddc clears VCO_HI, while vsubc uses it to record whether the result is non-zero (that is, whether the operands differ). Note that VCO_LO is not used as input.

VAND/VNAND/VOR/VNOR/VXOR/VNXOR

Logical bitwise operations:

vand vd,vs,vt[e]     // VS AND VT
vnand vd,vs,vt[e]    // NOT (VS AND VT)
vor vd,vs,vt[e]      // VS OR VT
vnor vd,vs,vt[e]     // NOT (VS OR VT)
vxor vd,vs,vt[e]     // VS XOR VT
vnxor vd,vs,vt[e]    // NOT (VS XOR VT)

Pseudo-code for all instructions:

for i in 0..7
    ACC<i>(15..0) = VS<i>(15..0) <LOGICAL_OP> VT<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
endfor

VMULF

Vector multiply of signed fractions:

vmulf vd, vs, vt[e]

For each lane, this instruction multiplies two fixed-point 1.15 operands (in the range [-1, 1]) and produces a 1.15 result (rounding to nearest). Overflow can happen when doing 0x8000*0x8000, but it is correctly handled by saturating to the positive maximum (0x7FFF).

Pseudo-code:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0) * 2   // signed multiplication
    ACC<i>(47..0) = sign_extend(prod(31..0) + 0x8000)
    VD<i>(15..0) = clamp_signed(ACC<i>(47..16))
endfor
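A single lane of VMULF can be sketched as follows (names are illustrative; the product is kept at full precision before masking to 48 bits, so the 0x8000*0x8000 case is handled correctly):

```python
def sign16(x):
    """Interpret a 16-bit value as a signed integer."""
    return x - 0x10000 if x & 0x8000 else x

def vmulf_lane(vs, vt):
    """One lane of VMULF: 1.15 x 1.15 -> rounded, clamped 1.15 result."""
    prod = sign16(vs) * sign16(vt) * 2
    acc = (prod + 0x8000) & 0xFFFFFFFFFFFF   # 48-bit accumulator
    hi32 = acc >> 16                          # ACC(47..16)
    if hi32 & 0x80000000:                     # interpret as signed 32-bit
        hi32 -= 0x100000000
    return max(-32768, min(32767, hi32)) & 0xFFFF
```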

VMULU

Vector multiply of signed fractions with unsigned result:

vmulu vd, vs, vt[e]

For each lane, this instruction multiplies two fixed-point 1.15 operands (in the range [-1, 1]) and produces a 0.15 result (rounding to nearest). Negative results are clipped to zero. Overflow can only happen when doing 0x8000*0x8000, and it produces 0xFFFF.

Pseudo-code:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0) * 2   // signed multiplication
    ACC<i>(47..0) = sign_extend(prod(31..0) + 0x8000)
    VD<i>(15..0) = clamp_unsigned(ACC<i>(47..16))
endfor

NOTE: name notwithstanding, this opcode performs a signed multiplication of the incoming vectors. The only difference with VMULF is the clamping step.

VMACF

Vector multiply of signed fractions with accumulation:

vmacf vd, vs, vt[e]

For each lane, this instruction multiplies 2 fixed-point 1.15 operands (in the range [-1, 1]) and adds the 1.31 result into the accumulator (treated as 17.31). The current value of the accumulator is then returned as a 1.15 result (with saturation), while the full-width value is kept in the accumulator so that subsequent VMACF instructions can continue accumulating without precision loss.

Notice that, contrary to VMULF, there is no rounding-to-nearest performed while saturating the intermediate high-precision value into the result.

Pseudo-code:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0) * 2   // signed multiplication
    ACC<i>(47..0) += sign_extend(prod(31..0))
    VD<i>(15..0) = clamp_signed(ACC<i>(47..16))
endfor
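A sketch of the accumulation in Python (hypothetical helper; acc is the 48-bit accumulator lane kept as a signed integer):

```python
# One lane of vmacf: multiply-accumulate into a wide accumulator.

def to_s16(x):
    return x - 0x10000 if x & 0x8000 else x

def clamp_s16(x):
    return max(-0x8000, min(0x7FFF, x))

def vmacf_lane(acc, vs, vt):
    """Returns (updated accumulator, 16-bit lane result)."""
    acc += to_s16(vs) * to_s16(vt) * 2
    return acc, clamp_s16(acc >> 16) & 0xFFFF
```

Accumulating 0.5 * 0.5 four times reaches 1.0, which does not fit in 1.15: the returned lane saturates to 0x7FFF while the accumulator keeps the exact value.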

VMACU

Vector multiply of signed fractions with accumulation and unsigned result:

vmacu vd, vs, vt[e]

For each lane, this instruction multiplies 2 fixed-point 1.15 operands (in the range [-1, 1]) and adds the 1.31 result into the accumulator (treated as 17.31). The current value of the accumulator is then returned as a 0.15 result, clipping negative results to zero and reporting positive overflow with the out-of-band value 0xFFFF.

Notice that, contrary to VMULU, there is no rounding-to-nearest performed while saturating the intermediate high-precision value into the result.

Pseudo-code:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0) * 2   // signed multiplication
    ACC<i>(47..0) += sign_extend(prod(31..0))
    VD<i>(15..0) = clamp_unsigned(ACC<i>(47..16))
endfor

NOTE: name notwithstanding, this opcode performs a signed multiplication of the incoming vectors. The only difference with VMACF is the clamping step.

VMUDN/VMADN

Vector multiply of mid partial products with unsigned result:

vmudn vd, vs, vt[e]
vmadn vd, vs, vt[e]

For each lane, this instruction multiplies an unsigned fixed-point number (vs) by a signed fixed-point number (vt), returning the lower 16 bits of the result. The full result is stored in the lower 32 bits of the accumulator.

Pseudo-code for vmadn:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0)   // unsigned by signed
    ACC<i>(47..0) += sign_extend(prod(31..0))
    VD<i>(15..0) = clamp_unsigned(ACC<i>(31..0))
endfor

In this case, the unsigned clamp returns ACC_LO if ACC_HI is the sign extension of ACC_MD; otherwise, it returns 0 when ACC_HI is negative and 65535 when it is positive. vmudn operates similarly, but clears the accumulator beforehand.
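The clamping rule just described can be sketched as a standalone Python helper (invented name, shown for illustration):

```python
# Unsigned clamp of the 48-bit accumulator, as used by vmudn/vmadn
# (and vmudl/vmadl). acc48 is the accumulator lane as a signed integer.

def clamp_unsigned_acc(acc48):
    acc_hi = (acc48 >> 32) & 0xFFFF
    acc_md = (acc48 >> 16) & 0xFFFF
    acc_lo = acc48 & 0xFFFF
    sign_ext = 0xFFFF if acc_md & 0x8000 else 0x0000
    if acc_hi == sign_ext:
        return acc_lo                                # value fits: return ACC_LO
    return 0x0000 if acc_hi & 0x8000 else 0xFFFF     # saturate
```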

VMUDL/VMADL

Vector multiply of low partial products:

vmudl vd, vs, vt[e]
vmadl vd, vs, vt[e]

For each lane, this instruction multiplies 2 unsigned fixed-point operands and produces an unsigned fixed-point result after removing 16 bits of precision. The result is stored in the lower 16 bits of the accumulator.

Pseudo-code for vmadl:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0)   // unsigned multiplication
    ACC<i>(47..0) += prod(31..16)
    VD<i>(15..0) = clamp_unsigned(ACC<i>(31..0))
endfor

The unsigned clamp works the same way as for vmudn. vmudl operates similarly, but clears the accumulator beforehand. Note that the lower bits of the product are discarded, and no sign extension is performed.

VMUDM/VMADM

Vector multiply of mid partial products with signed result:

vmudm vd, vs, vt[e]
vmadm vd, vs, vt[e]

For each lane, this instruction multiplies a signed fixed-point number (vs) by an unsigned fixed-point number (vt), returning the upper 16 bits of the result. The full result is stored in the lower 32 bits of the accumulator.

Pseudo-code for vmadm:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0)   // signed by unsigned
    ACC<i>(47..0) += sign_extend(prod(31..0))
    VD<i>(15..0) = clamp_signed(ACC<i>(47..16))
endfor

vmudm operates similarly, but clears the accumulator beforehand. These instructions differ from vmudn and vmadn in the bits of the result returned as well as the type of clamping.

VMUDH/VMADH

Vector multiply of high partial products:

vmudh vd, vs, vt[e]
vmadh vd, vs, vt[e]

For each lane, this instruction multiplies 2 signed fixed-point operands and returns the 32-bit signed result saturated to 16 bits. The full result is stored in the upper 32 bits of the accumulator.

Pseudo-code for vmadh:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0)   // signed multiplication
    ACC<i>(47..16) += prod(31..0)
    VD<i>(15..0) = clamp_signed(ACC<i>(47..16))
endfor

vmudh operates similarly, but clears the accumulator beforehand.
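The four partial-product opcodes are designed to be chained into wider multiplies. As a sketch under the usual conventions (plain Python integers, invented helper names), the classic vmudl/vmadm/vmadn/vmadh sequence for a signed 32x32-bit multiply leaves (a*b) >> 16 in the accumulator, assuming no 48-bit wraparound:

```python
# Model of the accumulator after the vmudl/vmadm/vmadn/vmadh sequence.

def split32(x):
    """Split a signed 32-bit value into a signed high and unsigned low half."""
    return x >> 16, x & 0xFFFF

def mul32_acc(a, b):
    a_hi, a_lo = split32(a)
    b_hi, b_lo = split32(b)
    acc = (a_lo * b_lo) >> 16    # vmudl: low * low, upper 16 bits only
    acc += a_hi * b_lo           # vmadm: signed high * unsigned low
    acc += a_lo * b_hi           # vmadn: unsigned low * signed high
    acc += (a_hi * b_hi) << 16   # vmadh: high * high into ACC_HI/ACC_MD
    return acc                   # equals (a * b) >> 16
```

The result is then read back 16 bits at a time, for instance through the clamped ACC slices returned by vmadn/vmadh, or via vsar.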

VRNDP/VRNDN

Vector accumulator MPEG DCT round:

vrndp vd, vs, vt[e]
vrndn vd, vs, vt[e]

For each lane, this instruction computes VT shifted left by 16 bits if the VS field (not the register, but the instruction bits) equals 1. This value is then added to the accumulator if and only if the accumulator is positive. The upper 32 bits of the accumulator are then clamped and returned.

Pseudo-code for vrndp:

for i in 0..7
    prod(47..0) = sign_extend(VT<i>(15..0))
    if VS<i>(0)     => prod(47..0) <<= 16
    if !ACC<i>(47)  => ACC<i>(47..0) += prod(47..0)
    VD<i>(15..0) = clamp_signed(ACC<i>(47..16))
endfor

vrndn behaves similarly, but the value is added to the accumulator if and only if the accumulator is negative.

VMULQ

Vector multiply with MPEG inverse quantization:

vmulq vd, vs, vt[e]

For each lane, this instruction multiplies two signed operands to produce a signed result, with negative values rounded up by 31. This result is shifted up by 16 bits and loaded into the accumulator. The returned value is the result shifted right by 1 bit, clamped, and AND'd with 0xFFF0.

Pseudo-code:

for i in 0..7
    prod(31..0) = VS<i>(15..0) * VT<i>(15..0)  // signed multiplication
    if prod(31)  => prod(31..0) += 0x1F
    ACC<i>(47..0) = prod(31..0) << 16
    VD<i>(15..0) = clamp_signed(prod(31..1)) & 0xFFF0
endfor
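A hedged single-lane model in Python (invented helper names; the accumulator store is shown only as a comment):

```python
# One lane of vmulq with plain integer math.

def to_s16(x):
    return x - 0x10000 if x & 0x8000 else x

def clamp_s16(x):
    return max(-0x8000, min(0x7FFF, x))

def vmulq_lane(vs, vt):
    prod = to_s16(vs) * to_s16(vt)
    if prod < 0:
        prod += 31                        # round negative results toward zero
    # ACC would be loaded with prod << 16 here.
    return clamp_s16(prod >> 1) & 0xFFF0  # shift, clamp, mask the low nibble
```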

VMACQ

Vector accumulator oddification:

vmacq vd, vs, vt[e]

This instruction ignores its two input operands and performs MPEG-1 oddification of bits 47..16 of the accumulator. If the upper 32 bits of the accumulator are negative and bit 5 is zero, it rounds up by 32; if they are positive and bit 5 is zero, it rounds down by 32. The returned value is these 32 bits shifted right by 1 bit, clamped, and AND'd with 0xFFF0.

Pseudo-code:

for i in 0..7
    prod(31..0) = ACC<i>(47..16)
    if  prod(31) & !prod(5)  => prod(31..0) += 0x20
    if !prod(31) & !prod(5)  => prod(31..0) -= 0x20
    ACC<i>(47..0) = prod(31..0) << 16
    VD<i>(15..0) = clamp_signed(prod(31..1)) & 0xFFF0
endfor

VSAR

Vector accumulator read:

vsar vd, vs, vt[e]

This instruction loads each lane of vd with the 16-bit portion of the accumulator specified by e: ACC_LO if e is 0, ACC_MD if e is 1, and ACC_HI if e is 2. The values of vs and vt are not used.

Pseudo-code:

for i in 0..7
    a = 16 * e + 15
    b = 16 * e
    VD<i>(15..0) = ACC<i>(a..b)
endfor
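In Python terms (invented helper name), each lane simply extracts one 16-bit slice of the 48-bit accumulator:

```python
# vsar lane model: e selects ACC_LO (0), ACC_MD (1) or ACC_HI (2).

def vsar_lane(acc48, e):
    return (acc48 >> (16 * e)) & 0xFFFF
```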

Select instructions

| 31..26 | 25 | 24..21  | 20..16 | 15..11 | 10..6 | 5..0   |
|--------|----|---------|--------|--------|-------|--------|
| COP2   | 1  | element | vt     | vs     | vd    | opcode |

Instructions have this general format:

VINSN vd, vs, vt[element]

where element is a “broadcast modifier” (as found in other SIMD architectures), that modifies the access to vt duplicating some lanes and hiding others. See the Computational instructions section for details.

This is the list of opcodes in this group:

| Opcode | Instruction |
|--------|-------------|
| 0x00   | VLT         |
| 0x01   | VEQ         |
| 0x02   | VNE         |
| 0x03   | VGE         |
| 0x04   | VCL         |
| 0x05   | VCH         |
| 0x06   | VCR         |
| 0x07   | VMRG        |

VLT/VEQ/VNE/VGE

Vector select comparison operations:

vlt vd, vs, vt[e]     // VS < VT
veq vd, vs, vt[e]     // VS == VT
vne vd, vs, vt[e]     // VS != VT
vge vd, vs, vt[e]     // VS >= VT

These instructions compare the elements of two vector registers in parallel. Comparisons not directly provided can be obtained by swapping the operands, or by decrementing vt when it is a scalar operand. The result of each comparison is stored in VCC_LO, while VCC_HI, VCO_HI, and VCO_LO are cleared at the end. Depending on the instruction, the incoming values of VCO_HI and VCO_LO may be used as inputs.

Pseudo-code for vge:

for i in 0..7
    eql = VS<i>(15..0) == VT<i>(15..0)
    neg = !(VCO(i + 8) & VCO(i)) & eql
    VCC(i) = neg | (VS<i>(15..0) > VT<i>(15..0))
    ACC<i>(15..0) = VCC(i) ? VS<i>(15..0) : VT<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
    VCC(i + 8) = VCO(i + 8) = VCO(i) = 0
endfor

Pseudo-code for vne:

for i in 0..7
    VCC(i) = VCO(i + 8) | (VS<i>(15..0) != VT<i>(15..0))
    ACC<i>(15..0) = VCC(i) ? VS<i>(15..0) : VT<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
    VCC(i + 8) = VCO(i + 8) = VCO(i) = 0
endfor

Pseudo-code for veq:

for i in 0..7
    VCC(i) = !VCO(i + 8) & (VS<i>(15..0) == VT<i>(15..0))
    ACC<i>(15..0) = VCC(i) ? VS<i>(15..0) : VT<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
    VCC(i + 8) = VCO(i + 8) = VCO(i) = 0
endfor

Pseudo-code for vlt:

for i in 0..7
    eql = VS<i>(15..0) == VT<i>(15..0)
    neg = VCO(i + 8) & VCO(i) & eql
    VCC(i) = neg | (VS<i>(15..0) < VT<i>(15..0))
    ACC<i>(15..0) = VCC(i) ? VS<i>(15..0) : VT<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
    VCC(i + 8) = VCO(i + 8) = VCO(i) = 0
endfor
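When VCO is clear, which is the common single-precision case, all four listings above reduce to a plain signed compare plus select. A minimal Python sketch (invented helper names):

```python
# One lane of vlt/veq/vne/vge with VCO assumed clear.

def to_s16(x):
    return x - 0x10000 if x & 0x8000 else x

def select_lane(op, vs, vt):
    """Returns (selected 16-bit element, VCC bit) for this lane."""
    a, b = to_s16(vs), to_s16(vt)
    vcc = {"vlt": a < b, "veq": a == b,
           "vne": a != b, "vge": a >= b}[op]
    return (vs if vcc else vt), int(vcc)
```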

VCH/VCR

Vector select clip test, single precision or high half of double precision:

vch vd, vs, vt[e]

For each lane, this instruction sets the corresponding bit of VCO_LO if the operands have opposite signs. VCO_HI indicates whether the two operands are definitely not equal: it is only set when vs is not equal to vt, -vt, or -vt - 1. The last condition (vs == -vt - 1) is also recorded separately in the VCE bit, so that a later vcl instruction can read it to determine whether the lower halves of the double-precision operands need to be compared for an accurate result.

The actual results of the clip test are stored in VCC. VCC_HI is set if vs >= vt, and VCC_LO is set if vs <= -vt. When vs and vt have opposite signs, ACC_LO is loaded with vs if vs > -vt and -vt otherwise. When vs and vt have the same sign, ACC_LO is loaded with vs if vs < vt and vt otherwise. This results in ACC_LO containing vs clamped within the range (-vt, vt) when vt is positive, and excluded from the range (vt, -vt) when vt is negative. The value of ACC_LO is then returned in vd.

Pseudo-code for vch:

for i in 0..7
    VCO(i) = VS<i>(15) != VT<i>(15)
    vt_abs(15..0) = VCO(i) ? -VT<i>(15..0) : VT<i>(15..0)
    VCE(i) = VCO(i) & (VS<i>(15..0) == -VT<i>(15..0) - 1)
    VCO(i + 8) = !VCE(i) & (VS<i>(15..0) != vt_abs(15..0))
    VCC(i) = VS<i>(15..0) <= -VT<i>(15..0)
    VCC(i + 8) = VS<i>(15..0) >= VT<i>(15..0)
    clip = VCO(i) ? VCC(i) : VCC(i + 8)
    ACC<i>(15..0) = clip ? vt_abs(15..0) : VS<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
endfor

vcr operates in exactly the same manner as vch, but assumes the inputs are in one's complement rather than two's complement, and clears VCO and VCE at the end. This changes the representation of -vt and how comparisons between operands of different signs are carried out, but the general algorithm remains the same.

VCL

Vector select clip test, low half of double precision:

vcl vd, vs, vt[e]

This instruction is used in conjunction with vch for double-precision clip compares. The numbers to be compared must each have their upper 16 bits in one vector register and their lower 16 bits in another. The registers containing the upper halves are then compared using vch, and the registers with the lower halves using vcl.

For each lane, this instruction recomputes VCC_HI as vs >= vt only if the previous vch operands had the same sign and were approximately equal, as shown by the stored VCO_LO and VCO_HI bits respectively. VCC_LO is only recomputed if the previous operands had opposite signs and were approximately equal; the computed value depends on VCE: if it is set, VCC_LO is set when vs <= -vt, otherwise VCC_LO is set only when vs == -vt. If the previous operands had opposite signs, ACC_LO is loaded with -vt if VCC_LO is set and with vs otherwise; if they had the same sign, ACC_LO is loaded with vt if VCC_HI is set and with vs otherwise. The value of ACC_LO is then returned in vd.

Pseudo-code:

for i in 0..7
    if !VCO(i) & !VCO(i + 8)
        VCC(i + 8) = VS<i>(15..0) >= VT<i>(15..0)
    endif
    if VCO(i) & !VCO(i + 8)
        lte = VS<i>(15..0) <= -VT<i>(15..0)
        eql = VS<i>(15..0) == -VT<i>(15..0)
        VCC(i) = VCE(i) ? lte : eql
    endif
    clip = VCO(i) ? VCC(i) : VCC(i + 8)
    vt_abs(15..0) = VCO(i) ? -VT<i>(15..0) : VT<i>(15..0)
    ACC<i>(15..0) = clip ? vt_abs(15..0) : VS<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
endfor

Note that all comparisons are unsigned. This instruction is intended for the low halves of double-precision operands, so the sign bits are handled by vch. Negatives are represented in two's complement form.

VMRG

Vector select merge:

vmrg vd, vs, vt[e]

For each lane, this instruction selects one of its operands based on the value of VCC for that lane. The values of VCC, VCO, and VCE remain unchanged. Note that only the lower 8 bits of VCC are considered.

Pseudo-code:

for i in 0..7
    ACC<i>(15..0) = VCC(i) ? VS<i>(15..0) : VT<i>(15..0)
    VD<i>(15..0) = ACC<i>(15..0)
endfor
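A whole-register sketch in Python (invented helper; vcc holds the low 8 bits of VCC, one per lane):

```python
# vmrg across all 8 lanes: pick VS where the VCC bit is set, VT elsewhere.

def vmrg(vcc, vs, vt):
    return [vs[i] if (vcc >> i) & 1 else vt[i] for i in range(8)]
```

This is the usual way to consume the per-lane mask produced by a preceding select comparison.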