Angrylion RDP Comments

by cqcumbers on 21 Sep 2021

The RDP is the Nintendo 64’s equivalent to the back half of a typical GPU pipeline, taking screen-space triangle and rectangle span coordinates and turning them into appropriately textured, depth-tested, and anti-aliased pixels in memory. The official RDP command reference lists every RDP instruction, and the N64 Programming Manual describes the architecture of the RDP and how many effects work from a programmer’s point of view. Neither resource documents many of the details that matter to emulator authors, such as bit widths and nonstandard formats, and their organization can make it difficult to understand the RDP’s full capabilities. This document attempts to address these shortcomings by commenting on the internal workings of Angrylion’s RDP plugin, generally acknowledged as the standard for accuracy.

The original code is from cen64’s copy of n64video.c. The original contained no comments except for license information, so everything in English is based only on close reading of the (often rather cryptic) code and my own understanding of the RDP. You can download the Markdown for this article from GitHub to better explore the code in a text editor or IDE; see the link in the footer.

/*

MAME Legal Information
License

Redistribution and use of the MAME code or any derivative works are permitted provided that the following conditions are met:

    * Redistributions may not be sold, nor may they be used in a commercial product or activity.
    * Redistributions that are modified from the original source must include the complete source code, including the source code for all components used by a binary built from the modified sources. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
    * Redistributions must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Trademark

MAME® is a registered trademark of Nicola Salmoria. The "MAME" name and MAME logo may not be used without first obtaining permission of the trademark holder.
Copyright

The code in MAME is the work of hundreds of developers, each of whom owns the copyright to the code they wrote. There is no central copyright authority you can license the code from. The proper way to use the MAME source code is to examine it, using it to understand how the games worked, and then write your own emulation code. Sorry, there is no free lunch here.
Common Questions

Q. Can I include MAME with my product?
A. No. MAME is not licensed for commercial use. Using MAME as a "freebie" or including it at "no cost" with your product still constitutes commerical usage and is forbidden by the license.

Q. Can I sell my product with the MAME logo on it?
A. No. Putting the logo on your product makes it appear that the product is something officially endorsed by Nicola Salmoria, and constitutes trademark infringement.

Q. Can I use the MAME logo to advertise my product?
A. No. Using the logo in your advertising makes it appear that the product is something officially endorsed by Nicola Salmoria, and constitutes trademark infringement.

Q. Can I use the term "MAME" in the name of my software?
A. Generally, no, especially if it is something that is sold. However, if you are producing a free MAME-related piece of software, it is common that permission is granted. Send a query to double-check first, please.

Q. Can I put an arcade cabinet running MAME in a public location?
A. No. This this a commercial use of MAME and is prohibited by the license. Even if you don't charge money, putting a machine in a public location is "operating" an arcade machine and falls under commercial rules in most locations.

Q. Can my non-profit use MAME or an arcade cabinet running MAME to help raise money?
A. No, sorry. Even for the most worthwhile cause, this still is a commercial use of MAME and is prohibited by the license.

Q. How do I obtain a license to the MAME source code?
A. You can't. See the Copyright section above.

Q. Is it legal to download ROMs for a game when I own the PCB?
A. This is unclear and depends on where you live. In most cases you would need to obtain permission from the original manufacturer to do so.

Q. What about the free ROMs on the MAME site? Can I use those with my product?
A. Almost all of the free ROMs on the MAME site are licensed only for non-commercial use, and only for distribution from the MAME site. Just because they are available for "free" here does not grant further redistribution rights, nor does it allow you to treat them as "freebies" for commercial use.

Q. If I obtain a license from an original manufacturer to distribute the ROMs can I use MAME to run them?
A. Generally, no, because it constitutes a commercial use of MAME. However, we have in the past made a couple of exceptions for this particular case. We will not consider making any further exceptions without proof that such a license has already been obtained.

Q. Can I use a PC running MAME to replace a real arcade PCB?
A. In order to do this you would have to use a copy of the original ROMs, which would require obtaining permission from the original manufacturer. Once you had permission from them, if it was used for non-commercial purposes, then you would not technically be violating the MAME license. However we still do not explicitly give permission to use MAME in this way because of the possibility of the game being sold sometime later, which would constitute commercial use of MAME. If you sell your game later you must sell it without MAME included.

Q. Can I ask for donations for the work I did on my port of MAME to platform X?
A. No. You would be earning money from the MAME trademark and copyrights, and that would be a commercial use, which is prohibited by the license. It is our wish that MAME remain free.
*/

/*
This little RDP plugin was initially based on MESS 0.128 source code.
Many thanks to Ville Linde, MooglyGuy and other people who wrote the RDP implementation in MESS 0.128. The rest of the code is by me, angrylion.
Many thanks to people who helped me in various ways: olivieryuyu, marshallh, LaC, oman, pinchy, ziggy, FatCat and other folks I forgot.
The code comes under MAME license.
Sorry for my terrible English.

angrylion
*/

/*
I tried to keep angrylion's plugin as unmodified as possible while making it compatible with CEN64.
This version of n64video was forked from angrylion's googlecode repository (r83) and aligned to r107.

MarathonMan
*/

Interface Definitions

#include "common.h"
#include "bus/controller.h"
#include "device/device.h"
#include "ri/controller.h"
#include "tctables.h"
#include "vr4300/interface.h"
#include <stdint.h>
#include <string.h>

#define byteswap_16(x) ((uint16_t) (((uint8_t) ((x) >> 8)) | ((uint16_t) ((x) << 8))))

#define SP_INTERRUPT    0x1
#define SI_INTERRUPT    0x2
#define AI_INTERRUPT    0x4
#define VI_INTERRUPT    0x8
#define PI_INTERRUPT    0x10
#define DP_INTERRUPT    0x20

#define SP_STATUS_HALT          0x0001
#define SP_STATUS_BROKE         0x0002
#define SP_STATUS_DMABUSY       0x0004
#define SP_STATUS_DMAFULL       0x0008
#define SP_STATUS_IOFULL        0x0010
#define SP_STATUS_SSTEP         0x0020
#define SP_STATUS_INTR_BREAK    0x0040
#define SP_STATUS_SIGNAL0       0x0080
#define SP_STATUS_SIGNAL1       0x0100
#define SP_STATUS_SIGNAL2       0x0200
#define SP_STATUS_SIGNAL3       0x0400
#define SP_STATUS_SIGNAL4       0x0800
#define SP_STATUS_SIGNAL5       0x1000
#define SP_STATUS_SIGNAL6       0x2000
#define SP_STATUS_SIGNAL7       0x4000

#define DP_STATUS_XBUS_DMA      0x01
#define DP_STATUS_FREEZE        0x02
#define DP_STATUS_FLUSH         0x04
#define DP_STATUS_START_GCLK    0x008
#define DP_STATUS_TMEM_BUSY     0x010
#define DP_STATUS_PIPE_BUSY     0x020
#define DP_STATUS_CMD_BUSY      0x040
#define DP_STATUS_CBUF_READY    0x080
#define DP_STATUS_DMA_BUSY      0x100
#define DP_STATUS_END_VALID     0x200
#define DP_STATUS_START_VALID   0x400

#define R4300i_SP_Intr 1

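Angrylion expects 32-bit byteswapped memory by default. The N64 is a big-endian system but most emulator hosts are little-endian, so many older emulators store each 32-bit word of N64 memory in host byte order, which reverses the bytes within the word. Aligned 32-bit accesses, the most common width, then need no swap, while byte and halfword accesses compensate by XORing the address with the constants below.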

#undef WORD_ADDR_XOR
#define LSB_FIRST 1
#ifdef LSB_FIRST
    #define BYTE_ADDR_XOR       3
    #define WORD_ADDR_XOR       1
    #define BYTE4_XOR_BE(a)     ((a) ^ 3)
#else
    #define BYTE_ADDR_XOR       0
    #define WORD_ADDR_XOR       0
    #define BYTE4_XOR_BE(a)     (a)
#endif

#ifdef LSB_FIRST
#define BYTE_XOR_DWORD_SWAP 7
#define WORD_XOR_DWORD_SWAP 3
#else
#define BYTE_XOR_DWORD_SWAP 4
#define WORD_XOR_DWORD_SWAP 2
#endif
#define DWORD_XOR_DWORD_SWAP 1
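
To make the XOR constants concrete, here is a minimal sketch of byte and halfword reads from word-swapped memory. The helper names (swap_words32, read_be_byte, read_be_halfword) and the EX_BYTE_ADDR_XOR constant are hypothetical and do not appear in n64video.c; the sketch only assumes that every aligned group of four bytes has been reversed, as described above.

```c
#include <stdint.h>
#include <stddef.h>

/* Same value as BYTE_ADDR_XOR under LSB_FIRST (illustrative copy). */
#define EX_BYTE_ADDR_XOR 3

/* Build word-swapped memory from big-endian bytes: byte i of the source
   lands at index i ^ 3, which reverses each aligned 4-byte group.
   len is assumed to be a multiple of 4. */
static void swap_words32(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i ^ EX_BYTE_ADDR_XOR] = src[i];
}

/* Read the byte that lives at big-endian address addr. */
static uint8_t read_be_byte(const uint8_t *swapped, uint32_t addr)
{
    return swapped[addr ^ EX_BYTE_ADDR_XOR];
}

/* Read the big-endian halfword at halfword index idx, assembled from two
   byte reads so the result does not depend on host endianness. On a
   little-endian host, a native 16-bit load at index (idx ^ 1) yields the
   same value, which is what a WORD_ADDR_XOR of 1 expresses. */
static uint16_t read_be_halfword(const uint8_t *swapped, uint32_t idx)
{
    uint32_t addr = idx * 2;
    return (uint16_t)((read_be_byte(swapped, addr) << 8) |
                       read_be_byte(swapped, addr + 1));
}
```

With the big-endian bytes 11 22 33 44, the swapped copy holds 44 33 22 11, and read_be_byte(swapped, 0) still returns 0x11.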

#define PRESCALE_WIDTH 640
#define PRESCALE_HEIGHT 625
extern const int screen_width, screen_height;

typedef unsigned int offs_t;

static struct cen64_device *cen64;

#define rsp_imem ((uint32_t*)(cen64->rsp.mem+0x1000))
#define rsp_dmem ((uint32_t*)cen64->rsp.mem)

#define rdram ((uint32_t*)cen64->ri.ram)
#define rdram16 ((uint16_t*)cen64->ri.ram)
#define rdram8 (cen64->ri.ram)

#define vi_width (cen64->vi.regs[VI_WIDTH_REG])

#define dp_start (cen64->rdp.regs[DPC_START_REG])
#define dp_end (cen64->rdp.regs[DPC_END_REG])
#define dp_current (cen64->rdp.regs[DPC_CURRENT_REG])
#define dp_status (cen64->rdp.regs[DPC_STATUS_REG])

#define SIGN16(x)   ((int16_t)(x))
#define SIGN8(x)    ((int8_t)(x))

#define SIGN(x, numb)   (((x) & ((1 << numb) - 1)) | -((x) & (1 << (numb - 1))))
#define SIGNF(x, numb)  ((x) | -((x) & (1 << (numb - 1))))
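
The SIGN macro sign-extends a field that is numb bits wide. A function form may be easier to read; the sketch below is a hypothetical equivalent (sign_extend is not a name from the source), written with unsigned arithmetic and assuming a field width between 1 and 31 bits.

```c
#include <stdint.h>

/* Hypothetical function form of SIGN(x, numb): keep the low `bits` bits
   of x and sign-extend from bit (bits - 1). Assumes 1 <= bits <= 31. */
static int32_t sign_extend(uint32_t x, unsigned bits)
{
    uint32_t field = x & ((1u << bits) - 1u);   /* low `bits` bits        */
    uint32_t sign  = x & (1u << (bits - 1u));   /* isolated sign bit      */

    /* 0 - sign is all ones from the sign bit upward when the sign bit is
       set, and zero otherwise. The final cast relies on the usual
       two's-complement conversion. */
    return (int32_t)(field | (0u - sign));
}
```

For an 11-bit field, sign_extend(0x7ff, 11) gives -1 and sign_extend(0x3ff, 11) gives 1023. SIGNF differs only in skipping the initial mask, so it expects the bits above the field to be clear already.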

#define GET_LOW(x)  (((x) & 0x3e) << 2)
#define GET_MED(x)  (((x) & 0x7c0) >> 3)
#define GET_HI(x)   (((x) >> 8) & 0xf8)

#define GET_LOW_RGBA16_TMEM(x)  (replicated_rgba[((x) >> 1) & 0x1f])
#define GET_MED_RGBA16_TMEM(x)  (replicated_rgba[((x) >> 6) & 0x1f])
#define GET_HI_RGBA16_TMEM(x)   (replicated_rgba[(x) >> 11])
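
These macros pull the 5-bit channels out of a 16-bit RGBA5551 pixel. The framebuffer variants (GET_LOW, GET_MED, GET_HI) leave the low three bits of each 8-bit result at zero, while the TMEM variants go through the replicated_rgba table. The sketch below shows one plausible way such a table expands 5 bits to 8; the formula, the 0xff alpha value, and the names ExampleRGBA8, replicate5to8, and unpack_rgba5551 are assumptions for illustration, since the table's initialization is not shown in this section.

```c
#include <stdint.h>

/* Hypothetical output type for the example. */
typedef struct { uint8_t r, g, b, a; } ExampleRGBA8;

/* Expand a 5-bit channel to 8 bits by copying its top three bits into
   the low three bits, so 0x1f maps to 0xff rather than 0xf8. This is an
   assumed definition of what replicated_rgba[] contains. */
static uint8_t replicate5to8(uint32_t c5)
{
    c5 &= 0x1f;
    return (uint8_t)((c5 << 3) | (c5 >> 2));
}

/* Unpack an RGBA5551 pixel: red in bits 15..11, green in 10..6,
   blue in 5..1, one alpha bit in bit 0. */
static ExampleRGBA8 unpack_rgba5551(uint16_t px)
{
    ExampleRGBA8 out;
    out.r = replicate5to8((uint32_t)px >> 11);
    out.g = replicate5to8((uint32_t)px >> 6);
    out.b = replicate5to8((uint32_t)px >> 1);
    out.a = (px & 1) ? 0xff : 0x00;   /* assumed full/zero alpha */
    return out;
}
```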

#define LOG_RDP_EXECUTION 0
#define DETAILED_LOGGING 0

Internal Structs

FILE *rdp_exec;

uint32_t rdp_cmd_data[0x10000];
uint32_t rdp_cmd_ptr = 0;
uint32_t rdp_cmd_cur = 0;
uint32_t ptr_onstart = 0;

extern FILE* zeldainfo;

int32_t oldvstart = 1337;
uint32_t oldhstart = 0;
uint32_t oldsomething = 0;
uint32_t prevwasblank = 0;
uint32_t double_stretch = 0;
int blshifta = 0, blshiftb = 0, pastblshifta = 0, pastblshiftb = 0;
int32_t pastrawdzmem = 0;
uint32_t plim = 0x7fffff;
uint32_t idxlim16 = 0x3fffff;
uint32_t idxlim32 = 0x1fffff;
uint8_t* rdram_8;
uint16_t* rdram_16;
uint32_t brightness = 0;
int32_t iseed = 1;

typedef struct
{
    int lx, rx;
    int unscrx;
    int validline;
    int32_t r, g, b, a, s, t, w, z;
    int32_t majorx[4];
    int32_t minorx[4];
    int32_t invalyscan[4];
} SPAN;

cen64_align(static SPAN span[1024], 16);
uint8_t cvgbuf[1024];

static int spans_ds;
static int spans_dt;
static int spans_dw;
static int spans_dr;
static int spans_dg;
static int spans_db;
static int spans_da;
static int spans_dz;
static int spans_dzpix;

int spans_drdy, spans_dgdy, spans_dbdy, spans_dady, spans_dzdy;
int spans_cdr, spans_cdg, spans_cdb, spans_cda, spans_cdz;

static int spans_dsdy, spans_dtdy, spans_dwdy;

typedef struct
{
    int32_t r, g, b, a;
} COLOR;

typedef struct
{
    uint8_t r, g, b;
} FBCOLOR;

typedef struct
{
    uint8_t r, g, b, cvg;
} CCVG;

typedef struct
{
    uint16_t xl, yl, xh, yh;
} RECTANGLE;

typedef struct
{
    int tilenum;
    uint16_t xl, yl, xh, yh;
    int16_t s, t;
    int16_t dsdx, dtdy;
    uint32_t flip;
} TEX_RECTANGLE;

typedef struct
{
    int clampdiffs, clampdifft;
    int clampens, clampent;
    int masksclamped, masktclamped;
    int notlutswitch, tlutswitch;
} FAKETILE;

typedef struct
{
    int format;
    int size;
    int line;
    int tmem;
    int palette;
    int ct, mt, cs, ms;
    int mask_t, shift_t, mask_s, shift_s;

    uint16_t sl, tl, sh, th;

    FAKETILE f;
} TILE;

typedef struct
{
    int sub_a_rgb0;
    int sub_b_rgb0;
    int mul_rgb0;
    int add_rgb0;
    int sub_a_a0;
    int sub_b_a0;
    int mul_a0;
    int add_a0;

    int sub_a_rgb1;
    int sub_b_rgb1;
    int mul_rgb1;
    int add_rgb1;
    int sub_a_a1;
    int sub_b_a1;
    int mul_a1;
    int add_a1;
} COMBINE_MODES;

typedef struct
{
    int stalederivs;
    int dolod;
    int partialreject_1cycle;
    int partialreject_2cycle;
    int special_bsel0;
    int special_bsel1;
    int rgb_alpha_dither;
    int realblendershiftersneeded;
    int interpixelblendershiftersneeded;
} MODEDERIVS;

typedef struct
{
    int cycle_type;
    int persp_tex_en;
    int detail_tex_en;
    int sharpen_tex_en;
    int tex_lod_en;
    int en_tlut;
    int tlut_type;
    int sample_type;
    int mid_texel;
    int bi_lerp0;
    int bi_lerp1;
    int convert_one;
    int key_en;
    int rgb_dither_sel;
    int alpha_dither_sel;
    int blend_m1a_0;
    int blend_m1a_1;
    int blend_m1b_0;
    int blend_m1b_1;
    int blend_m2a_0;
    int blend_m2a_1;
    int blend_m2b_0;
    int blend_m2b_1;
    int force_blend;
    int alpha_cvg_select;
    int cvg_times_alpha;
    int z_mode;
    int cvg_dest;
    int color_on_cvg;
    int image_read_en;
    int z_update_en;
    int z_compare_en;
    int antialias_en;
    int z_source_sel;
    int dither_alpha_en;
    int alpha_compare_en;
    MODEDERIVS f;
} OTHER_MODES;

#define PIXEL_SIZE_4BIT         0
#define PIXEL_SIZE_8BIT         1
#define PIXEL_SIZE_16BIT        2
#define PIXEL_SIZE_32BIT        3

#define CYCLE_TYPE_1            0
#define CYCLE_TYPE_2            1
#define CYCLE_TYPE_COPY         2
#define CYCLE_TYPE_FILL         3

#define FORMAT_RGBA             0
#define FORMAT_YUV              1
#define FORMAT_CI               2
#define FORMAT_IA               3
#define FORMAT_I                4

#define TEXEL_RGBA4             0
#define TEXEL_RGBA8             1
#define TEXEL_RGBA16            2
#define TEXEL_RGBA32            3
#define TEXEL_YUV4              4
#define TEXEL_YUV8              5
#define TEXEL_YUV16             6
#define TEXEL_YUV32             7
#define TEXEL_CI4               8
#define TEXEL_CI8               9
#define TEXEL_CI16              0xa
#define TEXEL_CI32              0xb
#define TEXEL_IA4               0xc
#define TEXEL_IA8               0xd
#define TEXEL_IA16              0xe
#define TEXEL_IA32              0xf
#define TEXEL_I4                0x10
#define TEXEL_I8                0x11
#define TEXEL_I16               0x12
#define TEXEL_I32               0x13

#define CVG_CLAMP               0
#define CVG_WRAP                1
#define CVG_ZAP                 2
#define CVG_SAVE                3

#define ZMODE_OPAQUE            0
#define ZMODE_INTERPENETRATING  1
#define ZMODE_TRANSPARENT       2
#define ZMODE_DECAL             3

COMBINE_MODES combine;
OTHER_MODES other_modes;

COLOR blend_color;
COLOR prim_color;
COLOR env_color;
COLOR fog_color;
COLOR combined_color;
COLOR texel0_color;
COLOR texel1_color;
COLOR nexttexel_color;
COLOR shade_color;
COLOR key_scale;
COLOR key_center;
COLOR key_width;
static int32_t noise = 0;
static int32_t primitive_lod_frac = 0;
static int32_t one_color = 0x100;
static int32_t zero_color = 0x00;

int32_t keyalpha;

static int32_t blenderone   = 0xff;

static int32_t *combiner_rgbsub_a_r[2];
static int32_t *combiner_rgbsub_a_g[2];
static int32_t *combiner_rgbsub_a_b[2];
static int32_t *combiner_rgbsub_b_r[2];
static int32_t *combiner_rgbsub_b_g[2];
static int32_t *combiner_rgbsub_b_b[2];
static int32_t *combiner_rgbmul_r[2];
static int32_t *combiner_rgbmul_g[2];
static int32_t *combiner_rgbmul_b[2];
static int32_t *combiner_rgbadd_r[2];
static int32_t *combiner_rgbadd_g[2];
static int32_t *combiner_rgbadd_b[2];

static int32_t *combiner_alphasub_a[2];
static int32_t *combiner_alphasub_b[2];
static int32_t *combiner_alphamul[2];
static int32_t *combiner_alphaadd[2];

static int32_t *blender1a_r[2];
static int32_t *blender1a_g[2];
static int32_t *blender1a_b[2];
static int32_t *blender1b_a[2];
static int32_t *blender2a_r[2];
static int32_t *blender2a_g[2];
static int32_t *blender2a_b[2];
static int32_t *blender2b_a[2];

COLOR pixel_color;
COLOR inv_pixel_color;
COLOR blended_pixel_color;
COLOR memory_color;
COLOR pre_memory_color;

uint32_t fill_color;

uint32_t primitive_z;
uint16_t primitive_delta_z;

static int fb_format = FORMAT_RGBA;
static int fb_size = PIXEL_SIZE_4BIT;
static int fb_width = 0;
static uint32_t fb_address = 0;

static int ti_format = FORMAT_RGBA;
static int ti_size = PIXEL_SIZE_4BIT;
static int ti_width = 0;
static uint32_t ti_address = 0;

static uint32_t zb_address = 0;

static TILE tile[8];

static RECTANGLE clip = {0,0,0x2000,0x2000};
static int scfield = 0;
static int sckeepodd = 0;
int oldscyl = 0;

uint8_t TMEM[0x1000];

#define tlut ((uint16_t*)(&TMEM[0x800]))

#define PIXELS_TO_BYTES(pix, siz) (((pix) << (siz)) >> 1)
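
PIXELS_TO_BYTES works because the PIXEL_SIZE_* codes are the base-2 logarithm of the number of nibbles per pixel: shifting left by the code counts nibbles, and the final shift right by one converts nibbles to bytes. A hypothetical function form (pixels_to_bytes is not a name from the source) makes the arithmetic explicit:

```c
#include <stdint.h>

/* Hypothetical function form of PIXELS_TO_BYTES. size_code is one of the
   PIXEL_SIZE_* values: 0 = 4-bit, 1 = 8-bit, 2 = 16-bit, 3 = 32-bit. */
static uint32_t pixels_to_bytes(uint32_t pixels, unsigned size_code)
{
    uint32_t nibbles = pixels << size_code;  /* 1, 2, 4 or 8 nibbles each */
    return nibbles >> 1;                     /* two nibbles per byte      */
}
```

A 320-pixel row is 640 bytes at 16 bits per pixel and 1280 bytes at 32. For 4-bit pixels an odd count rounds down, so 7 pixels yield 3 bytes.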

typedef struct{
    int startspan;
    int endspan;
    int preendspan;
    int nextspan;
    int midspan;
    int longspan;
    int onelessthanmid;
}SPANSIGS;

Function Declarations

The functions in this file are not necessarily organized in a logical order for following a primitive through the pipeline. I recommend starting with rdp_process_list and tracing through the calls in edgewalker_for_prims and edgewalker_for_loads to understand the rendering and texture-loading portions of the RDP, respectively. The VI functions included with Angrylion are commented as well, but they are not part of the RDP and are called separately through vi_fetch_filter.

static void rdp_set_other_modes(uint32_t w1, uint32_t w2);
static void fetch_texel(COLOR *color, int s, int t, uint32_t tilenum);
static void fetch_texel_entlut(COLOR *color, int s, int t, uint32_t tilenum);
static void fetch_texel_quadro(COLOR *color0, COLOR *color1, COLOR *color2, COLOR *color3, int s0, int s1, int t0, int t1, uint32_t tilenum);
static void fetch_texel_entlut_quadro(COLOR *color0, COLOR *color1, COLOR *color2, COLOR *color3, int s0, int s1, int t0, int t1, uint32_t tilenum);
static void tile_tlut_common_cs_decoder(uint32_t w1, uint32_t w2);
static void loading_pipeline(int start, int end, int tilenum, int coord_quad, int ltlut);
static void get_tmem_idx(int s, int t, uint32_t tilenum, uint32_t* idx0, uint32_t* idx1, uint32_t* idx2, uint32_t* idx3, uint32_t* bit3flipped, uint32_t* hibit);
static void sort_tmem_idx(uint32_t *idx, uint32_t idxa, uint32_t idxb, uint32_t idxc, uint32_t idxd, uint32_t bankno);
static void sort_tmem_shorts_lowhalf(uint32_t* bindshort, uint32_t short0, uint32_t short1, uint32_t short2, uint32_t short3, uint32_t bankno);
static void compute_color_index(uint32_t* cidx, uint32_t readshort, uint32_t nybbleoffset, uint32_t tilenum);
static void read_tmem_copy(int s, int s1, int s2, int s3, int t, uint32_t tilenum, uint32_t* sortshort, int* hibits, int* lowbits);
static void replicate_for_copy(uint32_t* outbyte, uint32_t inshort, uint32_t nybbleoffset, uint32_t tilenum, uint32_t tformat, uint32_t tsize);
static void fetch_qword_copy(uint32_t* hidword, uint32_t* lowdword, int32_t ssss, int32_t ssst, uint32_t tilenum);
static void render_spans_1cycle_complete(int start, int end, int tilenum, int flip);
static void render_spans_1cycle_notexel1(int start, int end, int tilenum, int flip);
static void render_spans_1cycle_notex(int start, int end, int tilenum, int flip);
static void render_spans_2cycle_complete(int start, int end, int tilenum, int flip);
static void render_spans_2cycle_notexelnext(int start, int end, int tilenum, int flip);
static void render_spans_2cycle_notexel1(int start, int end, int tilenum, int flip);
static void render_spans_2cycle_notex(int start, int end, int tilenum, int flip);
static void render_spans_fill(int start, int end, int flip);
static void render_spans_copy(int start, int end, int tilenum, int flip);
static inline void combiner_1cycle(int adseed, uint32_t* curpixel_cvg);
static inline void combiner_2cycle(int adseed, uint32_t* curpixel_cvg, int32_t* acalpha);
static inline int blender_1cycle(uint32_t* fr, uint32_t* fg, uint32_t* fb, int dith, uint32_t blend_en, uint32_t prewrap, uint32_t curpixel_cvg, uint32_t curpixel_cvbit);
static inline int blender_2cycle(uint32_t* fr, uint32_t* fg, uint32_t* fb, int dith, uint32_t blend_en, uint32_t prewrap, uint32_t curpixel_cvg, uint32_t curpixel_cvbit, int32_t acalpha);
static inline void texture_pipeline_cycle(COLOR* TEX, COLOR* prev, int32_t SSS, int32_t SST, uint32_t tilenum, uint32_t cycle);
static inline void tc_pipeline_copy(int32_t* sss0, int32_t* sss1, int32_t* sss2, int32_t* sss3, int32_t* sst, int tilenum);
static inline void tc_pipeline_load(int32_t* sss, int32_t* sst, int tilenum, int coord_quad);
static inline void tcclamp_generic(int32_t* S, int32_t* T, int32_t* SFRAC, int32_t* TFRAC, int32_t maxs, int32_t maxt, int32_t num);
static inline void tcclamp_cycle(int32_t* S, int32_t* T, int32_t* SFRAC, int32_t* TFRAC, int32_t maxs, int32_t maxt, int32_t num);
static inline void tcclamp_cycle_light(int32_t* S, int32_t* T, int32_t maxs, int32_t maxt, int32_t num);
static inline void tcshift_cycle(int32_t* S, int32_t* T, int32_t* maxs, int32_t* maxt, uint32_t num);
static inline void tcshift_copy(int32_t* S, int32_t* T, uint32_t num);
cen64_cold static void precalculate_everything(void);
static inline int alpha_compare(int32_t comb_alpha);
static inline int32_t color_combiner_equation(int32_t a, int32_t b, int32_t c, int32_t d);
static inline int32_t alpha_combiner_equation(int32_t a, int32_t b, int32_t c, int32_t d);
static inline void blender_equation_cycle0(int* r, int* g, int* b);
static inline void blender_equation_cycle0_2(int* r, int* g, int* b);
static inline void blender_equation_cycle1(int* r, int* g, int* b);
static inline uint32_t rightcvghex(uint32_t x, uint32_t fmask);
static inline uint32_t leftcvghex(uint32_t x, uint32_t fmask);
static inline void compute_cvg_noflip(int32_t scanline);
static inline void compute_cvg_flip(int32_t scanline);
static void fbwrite_4(uint32_t curpixel, uint32_t r, uint32_t g, uint32_t b, uint32_t blend_en, uint32_t curpixel_cvg, uint32_t curpixel_memcvg);
static void fbwrite_8(uint32_t curpixel, uint32_t r, uint32_t g, uint32_t b, uint32_t blend_en, uint32_t curpixel_cvg, uint32_t curpixel_memcvg);
static void fbwrite_16(uint32_t curpixel, uint32_t r, uint32_t g, uint32_t b, uint32_t blend_en, uint32_t curpixel_cvg, uint32_t curpixel_memcvg);
static void fbwrite_32(uint32_t curpixel, uint32_t r, uint32_t g, uint32_t b, uint32_t blend_en, uint32_t curpixel_cvg, uint32_t curpixel_memcvg);
static void fbfill_4(uint32_t curpixel);
static void fbfill_8(uint32_t curpixel);
static void fbfill_16(uint32_t curpixel);
static void fbfill_32(uint32_t curpixel);
static void fbread_4(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread_8(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread_16(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread_32(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread2_4(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread2_8(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread2_16(uint32_t num, uint32_t* curpixel_memcvg);
static void fbread2_32(uint32_t num, uint32_t* curpixel_memcvg);
static inline uint32_t z_decompress(uint32_t rawz);
static inline uint32_t dz_decompress(uint32_t compresseddz);
static inline uint32_t dz_compress(uint32_t value);
cen64_cold static void z_build_com_table(void);
cen64_cold static void precalc_cvmask_derivatives(void);
static inline uint16_t decompress_cvmask_frombyte(uint8_t byte);
static inline void lookup_cvmask_derivatives(uint32_t mask, uint8_t* offx, uint8_t* offy, uint32_t* curpixel_cvg, uint32_t* curpixel_cvbit);
static inline void z_store(uint32_t zcurpixel, uint32_t z, int dzpixenc);
static inline uint32_t z_compare(uint32_t zcurpixel, uint32_t sz, uint16_t dzpix, int dzpixenc, uint32_t* blend_en, uint32_t* prewrap, uint32_t* curpixel_cvg, uint32_t curpixel_memcvg);
static inline int finalize_spanalpha(uint32_t blend_en, uint32_t curpixel_cvg, uint32_t curpixel_memcvg);
static inline int32_t normalize_dzpix(int32_t sum);
static inline int32_t CLIP(int32_t value,int32_t min,int32_t max);
static inline void video_filter16(int* r, int* g, int* b, uint32_t fboffset, uint32_t num, uint32_t hres, uint32_t centercvg, uint32_t fetchstate);
static inline void video_filter32(int* endr, int* endg, int* endb, uint32_t fboffset, uint32_t num, uint32_t hres, uint32_t centercvg, uint32_t fetchstate);
static inline void divot_filter(CCVG* final, CCVG centercolor, CCVG leftcolor, CCVG rightcolor);
static inline void restore_filter16(int* r, int* g, int* b, uint32_t fboffset, uint32_t num, uint32_t hres, uint32_t fetchstate);
static inline void restore_filter32(int* r, int* g, int* b, uint32_t fboffset, uint32_t num, uint32_t hres, uint32_t fetchstate);
static inline void gamma_filters(int* r, int* g, int* b, int gamma_and_dither);
static inline void adjust_brightness(int* r, int* g, int* b, int brightcoeff);
static void clearfb16(uint16_t* fb, uint32_t width,uint32_t height);
static void tcdiv_persp(int32_t ss, int32_t st, int32_t sw, int32_t* sss, int32_t* sst);
static void tcdiv_nopersp(int32_t ss, int32_t st, int32_t sw, int32_t* sss, int32_t* sst);
static inline void tclod_4x17_to_15(int32_t scurr, int32_t snext, int32_t tcurr, int32_t tnext, int32_t previous, int32_t* lod);
static inline void tclod_tcclamp(int32_t* sss, int32_t* sst);
static inline void lodfrac_lodtile_signals(int lodclamp, int32_t lod, uint32_t* l_tile, uint32_t* magnify, uint32_t* distant, int32_t* lfdst);
static inline void tclod_1cycle_current(int32_t* sss, int32_t* sst, int32_t nexts, int32_t nextt, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t scanline, int32_t prim_tile, int32_t* t1, SPANSIGS* sigs);
static inline void tclod_1cycle_current_simple(int32_t* sss, int32_t* sst, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t scanline, int32_t prim_tile, int32_t* t1, SPANSIGS* sigs);
static inline void tclod_1cycle_next(int32_t* sss, int32_t* sst, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t scanline, int32_t prim_tile, int32_t* t1, SPANSIGS* sigs, int32_t* prelodfrac);
static inline void tclod_2cycle_current(int32_t* sss, int32_t* sst, int32_t nexts, int32_t nextt, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t prim_tile, int32_t* t1, int32_t* t2);
static inline void tclod_2cycle_current_simple(int32_t* sss, int32_t* sst, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t prim_tile, int32_t* t1, int32_t* t2);
static inline void tclod_2cycle_current_notexel1(int32_t* sss, int32_t* sst, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t prim_tile, int32_t* t1);
static inline void tclod_2cycle_next(int32_t* sss, int32_t* sst, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t prim_tile, int32_t* t1, int32_t* t2, int32_t* prelodfrac);
static inline void tclod_copy(int32_t* sss, int32_t* sst, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t prim_tile, int32_t* t1);
static inline void get_texel1_1cycle(int32_t* s1, int32_t* t1, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc, int32_t scanline, SPANSIGS* sigs);
static inline void get_nexttexel0_2cycle(int32_t* s1, int32_t* t1, int32_t s, int32_t t, int32_t w, int32_t dsinc, int32_t dtinc, int32_t dwinc);
static inline void video_max_optimized(uint32_t* Pixels, uint32_t* penumin, uint32_t* penumax, int numofels);
static void calculate_clamp_diffs(uint32_t tile);
static void calculate_tile_derivs(uint32_t tile);
static void rgb_dither_complete(int* r, int* g, int* b, int dith);
static void rgb_dither_nothing(int* r, int* g, int* b, int dith);
static void get_dither_noise_complete(int x, int y, int* cdith, int* adith);
static void get_dither_only(int x, int y, int* cdith, int* adith);
static void get_dither_nothing(int x, int y, int* cdith, int* adith);
static inline void vi_vl_lerp(CCVG* up, CCVG down, uint32_t frac);
static inline void rgbaz_correct_clip(int offx, int offy, int r, int g, int b, int a, int* z, uint32_t curpixel_cvg);
static inline void vi_fetch_filter16(CCVG* res, uint32_t fboffset, uint32_t cur_x, uint32_t fsaa, uint32_t dither_filter, uint32_t vres, uint32_t fetchstate);
static inline void vi_fetch_filter32(CCVG* res, uint32_t fboffset, uint32_t cur_x, uint32_t fsaa, uint32_t dither_filter, uint32_t vres, uint32_t fetchstate);
cen64_cold static uint32_t vi_integer_sqrt(uint32_t a);
cen64_cold static void deduce_derivatives(void);
static inline int32_t irand();

static int32_t k0_tf = 0, k1_tf = 0, k2_tf = 0, k3_tf = 0;
static int32_t k4 = 0, k5 = 0;
static int32_t lod_frac = 0;
uint32_t DebugMode = 0, DebugMode2 = 0; int32_t DebugMode3 = 0;
int debugcolor = 0;
uint8_t hidden_bits[0x400000];
struct {uint32_t shift; uint32_t add;} z_dec_table[8] = {
    {6, 0x00000},
    {5, 0x20000},
    {4, 0x30000},
    {3, 0x38000},
    {2, 0x3c000},
    {1, 0x3e000},
    {0, 0x3f000},
    {0, 0x3f800},
};
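
Each z_dec_table entry pairs a shift with an additive base, which suggests a small floating-point style encoding in which a 3-bit exponent selects the entry and a mantissa is shifted and offset to produce an 18-bit depth. The sketch below decodes a 16-bit Z-buffer word under that reading. The bit layout (3-bit exponent, 11-bit mantissa, two low bits reserved for delta-Z) is an assumption inferred from the table sizes here, and example_z_decode and ex_z_dec_table are hypothetical names.

```c
#include <stdint.h>

/* Illustrative copy of z_dec_table. */
static const struct { uint32_t shift; uint32_t add; } ex_z_dec_table[8] = {
    {6, 0x00000}, {5, 0x20000}, {4, 0x30000}, {3, 0x38000},
    {2, 0x3c000}, {1, 0x3e000}, {0, 0x3f000}, {0, 0x3f800},
};

/* Decode a 16-bit Z-buffer word into an 18-bit depth value.
   Assumed layout: bits 15..13 exponent, bits 12..2 mantissa,
   bits 1..0 unused here (part of delta-Z). */
static uint32_t example_z_decode(uint16_t zword)
{
    uint32_t z14      = ((uint32_t)zword >> 2) & 0x3fff;
    uint32_t exponent = (z14 >> 11) & 7;
    uint32_t mantissa = z14 & 0x7ff;

    return ((mantissa << ex_z_dec_table[exponent].shift)
            + ex_z_dec_table[exponent].add) & 0x3ffff;
}
```

Under this reading the ranges are contiguous: the largest value for exponent 0 is 0x1ffc0, and exponent 1 starts at 0x20000. Each higher exponent halves the step size, so precision increases as the decoded value grows.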

Function Pointers

To handle all the possible configurations of the RDP, Angrylion defines variants of each part of the pipeline as separate functions. Most callers then invoke a pipeline component through a function pointer, which is swapped to point to a different variant when the RDP configuration changes.

static void (*vi_fetch_filter_func[2])(CCVG*, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t) =
{
    vi_fetch_filter16, vi_fetch_filter32
};

static void (*fbread_func[4])(uint32_t, uint32_t*) =
{
    fbread_4, fbread_8, fbread_16, fbread_32
};

static void (*fbread2_func[4])(uint32_t, uint32_t*) =
{
    fbread2_4, fbread2_8, fbread2_16, fbread2_32
};

static void (*fbwrite_func[4])(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t) =
{
    fbwrite_4, fbwrite_8, fbwrite_16, fbwrite_32
};

static void (*fbfill_func[4])(uint32_t) =
{
    fbfill_4, fbfill_8, fbfill_16, fbfill_32
};

static void (*get_dither_noise_func[3])(int, int, int*, int*) =
{
    get_dither_noise_complete, get_dither_only, get_dither_nothing
};

static void (*rgb_dither_func[2])(int*, int*, int*, int) =
{
    rgb_dither_complete, rgb_dither_nothing
};

static void (*tcdiv_func[2])(int32_t, int32_t, int32_t, int32_t*, int32_t*) =
{
    tcdiv_nopersp, tcdiv_persp
};

static void (*render_spans_1cycle_func[3])(int, int, int, int) =
{
    render_spans_1cycle_notex, render_spans_1cycle_notexel1, render_spans_1cycle_complete
};

static void (*render_spans_2cycle_func[4])(int, int, int, int) =
{
    render_spans_2cycle_notex, render_spans_2cycle_notexel1, render_spans_2cycle_notexelnext, render_spans_2cycle_complete
};

void (*fbread1_ptr)(uint32_t, uint32_t*) = fbread_4;
void (*fbread2_ptr)(uint32_t, uint32_t*) = fbread2_4;
void (*fbwrite_ptr)(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t) = fbwrite_4;
void (*fbfill_ptr)(uint32_t) = fbfill_4;
void (*get_dither_noise_ptr)(int, int, int*, int*) = get_dither_noise_complete;
void (*rgb_dither_ptr)(int*, int*, int*, int) = rgb_dither_complete;
void (*tcdiv_ptr)(int32_t, int32_t, int32_t, int32_t*, int32_t*) = tcdiv_nopersp;
void (*render_spans_1cycle_ptr)(int, int, int, int) = render_spans_1cycle_complete;
void (*render_spans_2cycle_ptr)(int, int, int, int) = render_spans_2cycle_notexel1;
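
The pattern in the tables above can be reduced to a minimal sketch: a table of variants, a "current" pointer used by the hot path, and a setter that swaps the pointer when the configuration changes. All names here (`bytes_for_*`, `span_bytes_func`, `set_pixel_size`) are hypothetical and do not appear in Angrylion; only the table-plus-pointer shape is taken from the code.

```c
#include <stdint.h>

/* Hypothetical variants of one pipeline stage, one per pixel size. */
static uint32_t bytes_for_4(uint32_t pixels)  { return (pixels + 1) >> 1; }
static uint32_t bytes_for_8(uint32_t pixels)  { return pixels; }
static uint32_t bytes_for_16(uint32_t pixels) { return pixels << 1; }
static uint32_t bytes_for_32(uint32_t pixels) { return pixels << 2; }

/* Table of variants, indexed by a 2-bit size code (0 = 4-bit ... 3 = 32-bit). */
static uint32_t (*const span_bytes_func[4])(uint32_t) = {
    bytes_for_4, bytes_for_8, bytes_for_16, bytes_for_32
};

/* Pointer used by the hot path; starts on the 4-bit variant. */
static uint32_t (*span_bytes_ptr)(uint32_t) = bytes_for_4;

/* Called only when the (hypothetical) configuration changes. */
static void set_pixel_size(unsigned size_code)
{
    span_bytes_ptr = span_bytes_func[size_code & 3];
}

/* Hot path: one indirect call, no per-pixel switch on the size code. */
static uint32_t span_bytes(uint32_t pixels)
{
    return span_bytes_ptr(pixels);
}
```

The trade-off is that the configuration check is paid once per state change rather than once per pixel, at the cost of an indirect call that the compiler generally cannot inline.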

typedef struct{
    uint8_t cvg;
    uint8_t cvbit;
    uint8_t xoff;
    uint8_t yoff;
} CVtcmaskDERIVATIVE;

uint32_t gamma_table[0x100];
uint32_t gamma_dither_table[0x4000];
uint16_t z_com_table[0x40000];
uint32_t z_complete_dec_table[0x4000];
uint8_t replicated_rgba[32];
int vi_restore_table[0x400];
int32_t maskbits_table[16];
uint32_t special_9bit_clamptable[512];
int32_t special_9bit_exttable[512];
int32_t ge_two_table[128];
int32_t log2table[256];
int32_t tcdiv_table[0x8000];
uint8_t bldiv_hwaccurate_table[0x8000];
uint16_t deltaz_comparator_lut[0x10000];
int32_t clamp_t_diff[8];
int32_t clamp_s_diff[8];
CVtcmaskDERIVATIVE cvarray[0x100];

#define RDRAM_MASK 0x007fffff

#define RREADADDR8(rdst, in) {(in) &= RDRAM_MASK; (rdst) = ((in) <= plim) ? (rdram_8[(in)]) : 0;}
#define RREADIDX16(rdst, in) {(in) &= (RDRAM_MASK >> 1); (rdst) = ((in) <= idxlim16) ? (byteswap_16(rdram_16[(in)])) : 0;}
#define RREADIDX32(rdst, in) {(in) &= (RDRAM_MASK >> 2); (rdst) = ((in) <= idxlim32) ? (byteswap_32(rdram[(in)])) : 0;}

#define RWRITEADDR8(in, val) {(in) &= RDRAM_MASK; if ((in) <= plim) rdram_8[(in)] = (val);}
#define RWRITEIDX16(in, val) {(in) &= (RDRAM_MASK >> 1); if ((in) <= idxlim16) rdram_16[(in)] = byteswap_16(val);}
#define RWRITEIDX32(in, val) {(in) &= (RDRAM_MASK >> 2); if ((in) <= idxlim32) rdram[(in)] = byteswap_32(val);}

#define PAIRREAD16(rdst, hdst, in)      \
{                                       \
    (in) &= (RDRAM_MASK >> 1);          \
    if ((in) <= idxlim16) {(rdst) = byteswap_16(rdram_16[(in)]); (hdst) = hidden_bits[(in)];}   \
    else {(rdst) = (hdst) = 0;}         \
}

#define PAIRWRITE16(in, rval, hval)     \
{                                       \
    (in) &= (RDRAM_MASK >> 1);          \
    if ((in) <= idxlim16) {rdram_16[(in)] = byteswap_16(rval); hidden_bits[(in)] = (hval);} \
}

#define PAIRWRITE32(in, rval, hval0, hval1) \
{                                           \
    (in) &= (RDRAM_MASK >> 2);              \
    if ((in) <= idxlim32) {rdram[(in)] = byteswap_32(rval); hidden_bits[(in) << 1] = (hval0); hidden_bits[((in) << 1) + 1] = (hval1);}  \
}

#define PAIRWRITE8(in, rval, hval)  \
{                                   \
    (in) &= RDRAM_MASK;             \
    if ((in) <= plim) {rdram_8[(in)] = (rval); if ((in) & 1) hidden_bits[(in) >> 1] = (hval);}  \
}

struct onetime
{
    int nolerp, copymstrangecrashes, fillmcrashes, fillmbitcrashes, syncfullcrash, vbusclock;
} onetimewarnings;

uint32_t z64gl_command = 0;
uint32_t command_counter = 0;
int SaveLoaded = 0;
uint32_t max_level = 0;
int32_t min_level = 0;
int32_t* PreScale;
uint32_t tvfadeoutstate[625];
int rdp_pipeline_crashed = 0;

Texture Shift/Clamp/Mask

After shifting and clamping, the top bits of the texture coordinates can also be masked off to create repeating patterns. The mask S and mask T parameters of each tile descriptor determine how many low-order bits of the integer portion are kept. A mask parameter of 0 disables the step for that axis: the functions below skip both masking and mirroring, so the coordinate passes through unchanged. Mirroring, if enabled, tests the least significant bit that is about to be masked off and bitwise-inverts the coordinate when that bit is 1, so a symmetric pattern only needs to be stored once, saving texture memory.

static inline void tcmask(int32_t* S, int32_t* T, int32_t num);
static inline void tcmask(int32_t* S, int32_t* T, int32_t num)
{
    int32_t wrap;

    if (tile[num].mask_s)
    {
        if (tile[num].ms)
        {
            wrap = *S >> tile[num].f.masksclamped;
            wrap &= 1;
            *S ^= (-wrap);
        }
        *S &= maskbits_table[tile[num].mask_s];
    }

    if (tile[num].mask_t)
    {
        if (tile[num].mt)
        {
            wrap = *T >> tile[num].f.masktclamped;
            wrap &= 1;
            *T ^= (-wrap);
        }

        *T &= maskbits_table[tile[num].mask_t];
    }
}

static inline void tcmask_coupled(int32_t* S, int32_t* S1, int32_t* T, int32_t* T1, int32_t num);
static inline void tcmask_coupled(int32_t* S, int32_t* S1, int32_t* T, int32_t* T1, int32_t num)
{
    int32_t wrap;
    int32_t maskbits;
    int32_t wrapthreshold;

    if (tile[num].mask_s)
    {
        if (tile[num].ms)
        {
            wrapthreshold = tile[num].f.masksclamped;

            wrap = (*S >> wrapthreshold) & 1;
            *S ^= (-wrap);

            wrap = (*S1 >> wrapthreshold) & 1;
            *S1 ^= (-wrap);
        }

        maskbits = maskbits_table[tile[num].mask_s];
        *S &= maskbits;
        *S1 &= maskbits;
    }

    if (tile[num].mask_t)
    {
        if (tile[num].mt)
        {
            wrapthreshold = tile[num].f.masktclamped;

            wrap = (*T >> wrapthreshold) & 1;
            *T ^= (-wrap);

            wrap = (*T1 >> wrapthreshold) & 1;
            *T1 ^= (-wrap);
        }
        maskbits = maskbits_table[tile[num].mask_t];
        *T &= maskbits;
        *T1 &= maskbits;
    }
}

static inline void tcmask_copy(int32_t* S, int32_t* S1, int32_t* S2, int32_t* S3, int32_t* T, int32_t num);
static inline void tcmask_copy(int32_t* S, int32_t* S1, int32_t* S2, int32_t* S3, int32_t* T, int32_t num)
{
    int32_t wrap;
    int32_t maskbits_s;
    int32_t swrapthreshold;

    if (tile[num].mask_s)
    {
        if (tile[num].ms)
        {
            swrapthreshold = tile[num].f.masksclamped;

            wrap = (*S >> swrapthreshold) & 1;
            *S ^= (-wrap);

            wrap = (*S1 >> swrapthreshold) & 1;
            *S1 ^= (-wrap);

            wrap = (*S2 >> swrapthreshold) & 1;
            *S2 ^= (-wrap);

            wrap = (*S3 >> swrapthreshold) & 1;
            *S3 ^= (-wrap);
        }

        maskbits_s = maskbits_table[tile[num].mask_s];
        *S &= maskbits_s;
        *S1 &= maskbits_s;
        *S2 &= maskbits_s;
        *S3 &= maskbits_s;
    }

    if (tile[num].mask_t)
    {
        if (tile[num].mt)
        {
            wrap = *T >> tile[num].f.masktclamped;
            wrap &= 1;
            *T ^= (-wrap);
        }

        *T &= maskbits_table[tile[num].mask_t];
    }
}
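
To make the mirror-then-mask step concrete, here is a minimal standalone sketch for a single coordinate. It assumes that `maskbits_table[n]` holds the low `n` bits set (capped at 10 bits, with entry 0 being `0x3ff`) and that `masksclamped` is the mask parameter capped at 10; neither definition appears in the code shown so far. The names `mask_bits` and `mask_coord` are hypothetical.

```c
#include <stdint.h>

/* Assumed contents of maskbits_table: the low `mask` bits, capped at 10. */
static int32_t mask_bits(int mask)
{
    int bits = (mask == 0 || mask > 10) ? 10 : mask;
    return (1 << bits) - 1;
}

/* Mirror-then-mask for one integer texel coordinate.
 * mask   : tile mask parameter (0 = step skipped, as in tcmask)
 * mirror : nonzero if mirroring is enabled for this axis */
static int32_t mask_coord(int32_t coord, int mask, int mirror)
{
    if (mask == 0)
        return coord;
    if (mirror) {
        int shift = (mask > 10) ? 10 : mask;  /* assumed meaning of masksclamped */
        int32_t wrap = (coord >> shift) & 1;  /* relies on arithmetic right shift */
        coord ^= -wrap;                       /* invert every bit when wrap == 1 */
    }
    return coord & mask_bits(mask);
}
```

With a mask of 3 (an 8-texel period) and mirroring on, coordinates 0 to 7 map to themselves, 8 to 15 map to 7 down to 0, and 16 starts again at 0. With mirroring off, 8 simply wraps back to 0.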

An additional shift can be applied to texture coordinates after perspective correction to ensure that mipmaps or other textures of different sizes are combined at the right scale. The shift amounts in the horizontal (S) and vertical (T) directions are set separately as part of each tile descriptor. Angrylion also computes here whether the shifted texture coordinates reach or exceed the coordinates of the lower right corner of the tile (SH/TH), though this information is only used for clamping later.

static inline void tcshift_cycle(int32_t* S, int32_t* T, int32_t* maxs, int32_t* maxt, uint32_t num)
{

    int32_t coord = *S;
    int32_t shifter = tile[num].shift_s;

    if (shifter < 11)
    {
        coord = SIGN16(coord);
        coord >>= shifter;
    }
    else
    {
        coord <<= (16 - shifter);
        coord = SIGN16(coord);
    }
    *S = coord;

    *maxs = ((coord >> 3) >= tile[num].sh);

    coord = *T;
    shifter = tile[num].shift_t;

    if (shifter < 11)
    {
        coord = SIGN16(coord);
        coord >>= shifter;
    }
    else
    {
        coord <<= (16 - shifter);
        coord = SIGN16(coord);
    }
    *T = coord;
    *maxt = ((coord >> 3) >= tile[num].th);
}

static inline void tcshift_copy(int32_t* S, int32_t* T, uint32_t num)
{
    int32_t coord = *S;
    int32_t shifter = tile[num].shift_s;

    if (shifter < 11)
    {
        coord = SIGN16(coord);
        coord >>= shifter;
    }
    else
    {
        coord <<= (16 - shifter);
        coord = SIGN16(coord);
    }
    *S = coord;

    coord = *T;
    shifter = tile[num].shift_t;

    if (shifter < 11)
    {
        coord = SIGN16(coord);
        coord >>= shifter;
    }
    else
    {
        coord <<= (16 - shifter);
        coord = SIGN16(coord);
    }
    *T = coord;

}
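
The shift encoding used by `tcshift_cycle` and `tcshift_copy` can be isolated into a small sketch: values below 11 are right shifts, while values 11 to 15 are left shifts by `16 - shifter`. This assumes `SIGN16` sign-extends the low 16 bits, and that the tile's `sh` field carries 2 fractional bits, so dropping 3 of the coordinate's 5 fractional bits makes the two comparable. The helper names are hypothetical, and the left shift is done on an unsigned value to avoid shifting a negative signed integer.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits (what SIGN16 is assumed to do). */
static int32_t sign16(uint32_t x)
{
    x &= 0xffff;
    return (x & 0x8000) ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Apply a tile shift value (0..15) to one texture coordinate. */
static int32_t shift_coord(int32_t coord, int shifter)
{
    if (shifter < 11)
        return sign16((uint32_t)coord) >> shifter;    /* arithmetic shift assumed */
    return sign16((uint32_t)coord << (16 - shifter)); /* 11..15: left by 5..1 */
}

/* Flag later consumed by the clamp stage: does the shifted coordinate
 * reach the tile's lower right corner? */
static int exceeds_max(int32_t shifted, int32_t sh)
{
    return (shifted >> 3) >= sh;
}
```

For example, `shift_coord(0x0400, 1)` halves the coordinate to `0x200`, while `shift_coord(0x0400, 15)` doubles it to `0x800`. A left shift that carries into bit 15 wraps to a negative value because the result is sign-extended again.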

Clamping is automatically applied to texture coordinates after shifting, except when coordinate masking is enabled, in which case it can be enabled manually as part of the tile descriptor. The 17-bit (s.11.5) relative texture coordinates are also reduced to separate 12-bit integer and 5-bit fractional parts at this point. If clamping is applied, the fractional part is always 0, even if SH and SL have different fractional components.

static inline void tcclamp_cycle(int32_t* S, int32_t* T, int32_t* SFRAC, int32_t* TFRAC, int32_t maxs, int32_t maxt, int32_t num)
{

    int32_t locs = *S, loct = *T;
    if (tile[num].f.clampens)
    {

        if (maxs)
        {
            *S = tile[num].f.clampdiffs;
            *SFRAC = 0;
        }
        else if (!(locs & 0x10000))
            *S = locs >> 5;
        else
        {
            *S = 0;
            *SFRAC = 0;
        }
    }
    else
        *S = (locs >> 5);

    if (tile[num].f.clampent)
    {
        if (maxt)
        {
            *T = tile[num].f.clampdifft;
            *TFRAC = 0;
        }
        else if (!(loct & 0x10000))
            *T = loct >> 5;
        else
        {
            *T = 0;
            *TFRAC = 0;
        }
    }
    else
        *T = (loct >> 5);
}

static inline void tcclamp_cycle_light(int32_t* S, int32_t* T, int32_t maxs, int32_t maxt, int32_t num)
{
    int32_t locs = *S, loct = *T;
    if (tile[num].f.clampens)
    {
        if (maxs)
            *S = tile[num].f.clampdiffs;
        else if (!(locs & 0x10000))
            *S = locs >> 5;
        else
            *S = 0;
    }
    else
        *S = (locs >> 5);

    if (tile[num].f.clampent)
    {
        if (maxt)
            *T = tile[num].f.clampdifft;
        else if (!(loct & 0x10000))
            *T = loct >> 5;
        else
            *T = 0;
    }
    else
        *T = (loct >> 5);
}
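
The per-axis logic of `tcclamp_cycle` can be sketched for a single coordinate as follows. This assumes the caller has already stored the 5-bit fraction in `*frac`, that `over_max` is the flag produced by the shift stage, and that `clamp_diff` plays the role of `clampdiffs`/`clampdifft` (presumably the integer distance between the tile's upper left and lower right corners, which is computed elsewhere). The function name and parameter names are hypothetical.

```c
#include <stdint.h>

/* Clamp one relative s.11.5 coordinate and return its integer part.
 * *frac is zeroed whenever the coordinate is actually clamped. */
static int32_t clamp_coord(int32_t coord, int32_t *frac,
                           int clamp_en, int over_max, int32_t clamp_diff)
{
    if (!clamp_en)
        return coord >> 5;        /* no clamping: just drop the fraction */
    if (over_max) {               /* at or past the lower right corner */
        *frac = 0;
        return clamp_diff;
    }
    if (coord & 0x10000) {        /* sign bit of the 17-bit value: negative */
        *frac = 0;
        return 0;
    }
    return coord >> 5;            /* in range: fraction is left untouched */
}
```

Note that the fraction survives only for in-range coordinates, which is why a clamped edge texel is never bilinearly blended with its neighbor.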

RDP Initialization

This sets the initial configuration of the RDP, before any commands are executed. It also initializes the various temporary buffers used, and precalculates all the LUTs needed for division, clamping, and other operations throughout the RDP.

cen64_cold int angrylion_rdp_init(struct cen64_device *device)
{
    cen64 = device;

    if (LOG_RDP_EXECUTION)
        rdp_exec = fopen("rdp_execute.txt", "wt");

    combiner_rgbsub_a_r[0] = combiner_rgbsub_a_r[1] = &one_color;
    combiner_rgbsub_a_g[0] = combiner_rgbsub_a_g[1] = &one_color;
    combiner_rgbsub_a_b[0] = combiner_rgbsub_a_b[1] = &one_color;
    combiner_rgbsub_b_r[0] = combiner_rgbsub_b_r[1] = &one_color;
    combiner_rgbsub_b_g[0] = combiner_rgbsub_b_g[1] = &one_color;
    combiner_rgbsub_b_b[0] = combiner_rgbsub_b_b[1] = &one_color;
    combiner_rgbmul_r[0] = combiner_rgbmul_r[1] = &one_color;
    combiner_rgbmul_g[0] = combiner_rgbmul_g[1] = &one_color;
    combiner_rgbmul_b[0] = combiner_rgbmul_b[1] = &one_color;
    combiner_rgbadd_r[0] = combiner_rgbadd_r[1] = &one_color;
    combiner_rgbadd_g[0] = combiner_rgbadd_g[1] = &one_color;
    combiner_rgbadd_b[0] = combiner_rgbadd_b[1] = &one_color;

    combiner_alphasub_a[0] = combiner_alphasub_a[1] = &one_color;
    combiner_alphasub_b[0] = combiner_alphasub_b[1] = &one_color;
    combiner_alphamul[0] = combiner_alphamul[1] = &one_color;
    combiner_alphaadd[0] = combiner_alphaadd[1] = &one_color;

    rdp_set_other_modes(0, 0);
    other_modes.f.stalederivs = 1;

    memset(TMEM, 0, 0x1000);

    memset(hidden_bits, 3, sizeof(hidden_bits));

    memset(tile, 0, sizeof(tile));

    for (int i = 0; i < 8; i++)
    {
        calculate_tile_derivs(i);
        calculate_clamp_diffs(i);
    }

    memset(&combined_color, 0, sizeof(COLOR));
    memset(&prim_color, 0, sizeof(COLOR));
    memset(&env_color, 0, sizeof(COLOR));
    memset(&key_scale, 0, sizeof(COLOR));
    memset(&key_center, 0, sizeof(COLOR));

    rdp_pipeline_crashed = 0;
    memset(&onetimewarnings, 0, sizeof(onetimewarnings));

    precalculate_everything();

    // TODO: Set limits based on RDRAM size.
    plim = 0x7fffff;
    idxlim16 = 0x3fffff;
    idxlim32 = 0x1fffff;

    rdram_8 = (uint8_t*)rdram;
    rdram_16 = (uint16_t*)rdram;
    return 0;
}
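
The three hard-coded limits at the end of `angrylion_rdp_init` correspond to 8 MB of RDRAM viewed as bytes, 16-bit words, and 32-bit words. One possible way to address the TODO is sketched below; the `rdram_limits` struct and `limits_for_size` function are hypothetical, and the sketch assumes the RDRAM size is a nonzero multiple of 4 bytes.

```c
#include <stdint.h>

/* Hypothetical container for the three limits used by the RDRAM macros. */
typedef struct {
    uint32_t plim;      /* highest valid byte address */
    uint32_t idxlim16;  /* highest valid index into rdram_16 */
    uint32_t idxlim32;  /* highest valid index into rdram */
} rdram_limits;

/* Derive the limits from the RDRAM size in bytes. */
static rdram_limits limits_for_size(uint32_t rdram_bytes)
{
    rdram_limits l;
    l.plim     = rdram_bytes - 1;
    l.idxlim16 = (rdram_bytes >> 1) - 1;
    l.idxlim32 = (rdram_bytes >> 2) - 1;
    return l;
}
```

For `0x800000` bytes this reproduces the constants in the code (`0x7fffff`, `0x3fffff`, `0x1fffff`); a 4 MB console without the Expansion Pak would get `0x3fffff`, `0x1fffff`, and `0xfffff`.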

static inline void vi_fetch_filter16(CCVG* res, uint32_t fboffset, uint32_t cur_x, uint32_t fsaa, uint32_t dither_filter, uint32_t vres, uint32_t fetchstate)
{
    int r, g, b;
    uint32_t idx = (fboffset >> 1) + cur_x;
    uint32_t pix, hval;
    uint32_t cur_cvg;
    if (fsaa)
    {
        PAIRREAD16(pix, hval, idx);
        cur_cvg = ((pix & 1) << 2) | hval;
    }
    else
    {
        RREADIDX16(pix, idx);
        cur_cvg = 7;
    }
    r = GET_HI(pix);
    g = GET_MED(pix);
    b = GET_LOW(pix);

    uint32_t fbw = vi_width & 0xfff;

    if (cur_cvg == 7)
    {
        if (dither_filter)
            restore_filter16(&r, &g, &b, fboffset, cur_x, fbw, fetchstate);
    }
    else
    {
        video_filter16(&r, &g, &b, fboffset, cur_x, fbw, cur_cvg, fetchstate);
    }

    res->r = r;
    res->g = g;
    res->b = b;
    res->cvg = cur_cvg;
}

static inline void vi_fetch_filter32(CCVG* res, uint32_t fboffset, uint32_t cur_x, uint32_t fsaa, uint32_t dither_filter, uint32_t vres, uint32_t fetchstate)
{
    int r, g, b;
    uint32_t pix, addr = (fboffset >> 2) + cur_x;
    RREADIDX32(pix, addr);
    uint32_t cur_cvg;
    if (fsaa)
        cur_cvg = (pix >> 5) & 7;
    else
        cur_cvg = 7;
    r = (pix >> 24) & 0xff;
    g = (pix >> 16) & 0xff;
    b = (pix >> 8) & 0xff;

    uint32_t fbw = vi_width & 0xfff;

    if (cur_cvg == 7)
    {
        if (dither_filter)
            restore_filter32(&r, &g, &b, fboffset, cur_x, fbw, fetchstate);
    }
    else
    {
        video_filter32(&r, &g, &b, fboffset, cur_x, fbw, cur_cvg, fetchstate);
    }

    res->r = r;
    res->g = g;
    res->b = b;
    res->cvg = cur_cvg;
}

Color Combiner

static void SET_SUBA_RGB_INPUT(int32_t **input_r, int32_t **input_g, int32_t **input_b, int code)
{
    switch (code & 0xf)
    {
        case 0:     *input_r = &combined_color.r;   *input_g = &combined_color.g;   *input_b = &combined_color.b;   break;
        case 1:     *input_r = &texel0_color.r;     *input_g = &texel0_color.g;     *input_b = &texel0_color.b;     break;
        case 2:     *input_r = &texel1_color.r;     *input_g = &texel1_color.g;     *input_b = &texel1_color.b;     break;
        case 3:     *input_r = &prim_color.r;       *input_g = &prim_color.g;       *input_b = &prim_color.b;       break;
        case 4:     *input_r = &shade_color.r;      *input_g = &shade_color.g;      *input_b = &shade_color.b;      break;
        case 5:     *input_r = &env_color.r;        *input_g = &env_color.g;        *input_b = &env_color.b;        break;
        case 6:     *input_r = &one_color;          *input_g = &one_color;          *input_b = &one_color;          break;
        case 7:     *input_r = &noise;              *input_g = &noise;              *input_b = &noise;              break;
        case 8: case 9: case 10: case 11: case 12: case 13: case 14: case 15:
        {
            *input_r = &zero_color;     *input_g = &zero_color;     *input_b = &zero_color;     break;
        }
    }
}

static void SET_SUBB_RGB_INPUT(int32_t **input_r, int32_t **input_g, int32_t **input_b, int code)
{
    switch (code & 0xf)
    {
        case 0:     *input_r = &combined_color.r;   *input_g = &combined_color.g;   *input_b = &combined_color.b;   break;
        case 1:     *input_r = &texel0_color.r;     *input_g = &texel0_color.g;     *input_b = &texel0_color.b;     break;
        case 2:     *input_r = &texel1_color.r;     *input_g = &texel1_color.g;     *input_b = &texel1_color.b;     break;
        case 3:     *input_r = &prim_color.r;       *input_g = &prim_color.g;       *input_b = &prim_color.b;       break;
        case 4:     *input_r = &shade_color.r;      *input_g = &shade_color.g;      *input_b = &shade_color.b;      break;
        case 5:     *input_r = &env_color.r;        *input_g = &env_color.g;        *input_b = &env_color.b;        break;
        case 6:     *input_r = &key_center.r;       *input_g = &key_center.g;       *input_b = &key_center.b;       break;
        case 7:     *input_r = &k4;                 *input_g = &k4;                 *input_b = &k4;                 break;
        case 8: case 9: case 10: case 11: case 12: case 13: case 14: case 15:
        {
            *input_r = &zero_color;     *input_g = &zero_color;     *input_b = &zero_color;     break;
        }
    }
}

static void SET_MUL_RGB_INPUT(int32_t **input_r, int32_t **input_g, int32_t **input_b, int code)
{
    switch (code & 0x1f)
    {
        case 0:     *input_r = &combined_color.r;   *input_g = &combined_color.g;   *input_b = &combined_color.b;   break;
        case 1:     *input_r = &texel0_color.r;     *input_g = &texel0_color.g;     *input_b = &texel0_color.b;     break;
        case 2:     *input_r = &texel1_color.r;     *input_g = &texel1_color.g;     *input_b = &texel1_color.b;     break;
        case 3:     *input_r = &prim_color.r;       *input_g = &prim_color.g;       *input_b = &prim_color.b;       break;
        case 4:     *input_r = &shade_color.r;      *input_g = &shade_color.g;      *input_b = &shade_color.b;      break;
        case 5:     *input_r = &env_color.r;        *input_g = &env_color.g;        *input_b = &env_color.b;        break;
        case 6:     *input_r = &key_scale.r;        *input_g = &key_scale.g;        *input_b = &key_scale.b;        break;
        case 7:     *input_r = &combined_color.a;   *input_g = &combined_color.a;   *input_b = &combined_color.a;   break;
        case 8:     *input_r = &texel0_color.a;     *input_g = &texel0_color.a;     *input_b = &texel0_color.a;     break;
        case 9:     *input_r = &texel1_color.a;     *input_g = &texel1_color.a;     *input_b = &texel1_color.a;     break;
        case 10:    *input_r = &prim_color.a;       *input_g = &prim_color.a;       *input_b = &prim_color.a;       break;
        case 11:    *input_r = &shade_color.a;      *input_g = &shade_color.a;      *input_b = &shade_color.a;      break;
        case 12:    *input_r = &env_color.a;        *input_g = &env_color.a;        *input_b = &env_color.a;        break;
        case 13:    *input_r = &lod_frac;           *input_g = &lod_frac;           *input_b = &lod_frac;           break;
        case 14:    *input_r = &primitive_lod_frac; *input_g = &primitive_lod_frac; *input_b = &primitive_lod_frac; break;
        case 15:    *input_r = &k5;                 *input_g = &k5;                 *input_b = &k5;                 break;
        case 16: case 17: case 18: case 19: case 20: case 21: case 22: case 23:
        case 24: case 25: case 26: case 27: case 28: case 29: case 30: case 31:
        {
            *input_r = &zero_color;     *input_g = &zero_color;     *input_b = &zero_color;     break;
        }
    }
}

static void SET_ADD_RGB_INPUT(int32_t **input_r, int32_t **input_g, int32_t **input_b, int code)
{
    switch (code & 0x7)
    {
        case 0:     *input_r = &combined_color.r;   *input_g = &combined_color.g;   *input_b = &combined_color.b;   break;
        case 1:     *input_r = &texel0_color.r;     *input_g = &texel0_color.g;     *input_b = &texel0_color.b;     break;
        case 2:     *input_r = &texel1_color.r;     *input_g = &texel1_color.g;     *input_b = &texel1_color.b;     break;
        case 3:     *input_r = &prim_color.r;       *input_g = &prim_color.g;       *input_b = &prim_color.b;       break;
        case 4:     *input_r = &shade_color.r;      *input_g = &shade_color.g;      *input_b = &shade_color.b;      break;
        case 5:     *input_r = &env_color.r;        *input_g = &env_color.g;        *input_b = &env_color.b;        break;
        case 6:     *input_r = &one_color;          *input_g = &one_color;          *input_b = &one_color;          break;
        case 7:     *input_r = &zero_color;         *input_g = &zero_color;         *input_b = &zero_color;         break;
    }
}

static void SET_SUB_ALPHA_INPUT(int32_t **input, int code)
{
    switch (code & 0x7)
    {
        case 0:     *input = &combined_color.a; break;
        case 1:     *input = &texel0_color.a; break;
        case 2:     *input = &texel1_color.a; break;
        case 3:     *input = &prim_color.a; break;
        case 4:     *input = &shade_color.a; break;
        case 5:     *input = &env_color.a; break;
        case 6:     *input = &one_color; break;
        case 7:     *input = &zero_color; break;
    }
}

static void SET_MUL_ALPHA_INPUT(int32_t **input, int code)
{
    switch (code & 0x7)
    {
        case 0:     *input = &lod_frac; break;
        case 1:     *input = &texel0_color.a; break;
        case 2:     *input = &texel1_color.a; break;
        case 3:     *input = &prim_color.a; break;
        case 4:     *input = &shade_color.a; break;
        case 5:     *input = &env_color.a; break;
        case 6:     *input = &primitive_lod_frac; break;
        case 7:     *input = &zero_color; break;
    }
}

In addition to computing the color combiner equation, Angrylion performs several other computations as part of the combiner phase of the RDP pipeline, including chroma keying, the cvg_times_alpha and alpha_cvg_select adjustments, and alpha dithering.

static inline void combiner_1cycle(int adseed, uint32_t* curpixel_cvg)
{

    int32_t redkey, greenkey, bluekey, temp;
    COLOR chromabypass;

    if (other_modes.key_en)
    {
        chromabypass.r = *combiner_rgbsub_a_r[1];
        chromabypass.g = *combiner_rgbsub_a_g[1];
        chromabypass.b = *combiner_rgbsub_a_b[1];
    }

    // apply color combiner equation
    if (combiner_rgbmul_r[1] != &zero_color)
    {

        combined_color.r = color_combiner_equation(*combiner_rgbsub_a_r[1],*combiner_rgbsub_b_r[1],*combiner_rgbmul_r[1],*combiner_rgbadd_r[1]);
        combined_color.g = color_combiner_equation(*combiner_rgbsub_a_g[1],*combiner_rgbsub_b_g[1],*combiner_rgbmul_g[1],*combiner_rgbadd_g[1]);
        combined_color.b = color_combiner_equation(*combiner_rgbsub_a_b[1],*combiner_rgbsub_b_b[1],*combiner_rgbmul_b[1],*combiner_rgbadd_b[1]);
    }
    else
    {
        combined_color.r = ((special_9bit_exttable[*combiner_rgbadd_r[1]] << 8) + 0x80) & 0x1ffff;
        combined_color.g = ((special_9bit_exttable[*combiner_rgbadd_g[1]] << 8) + 0x80) & 0x1ffff;
        combined_color.b = ((special_9bit_exttable[*combiner_rgbadd_b[1]] << 8) + 0x80) & 0x1ffff;
    }

    // apply alpha combiner equation
    if (combiner_alphamul[1] != &zero_color)
        combined_color.a = alpha_combiner_equation(*combiner_alphasub_a[1],*combiner_alphasub_b[1],*combiner_alphamul[1],*combiner_alphaadd[1]);
    else
        combined_color.a = special_9bit_exttable[*combiner_alphaadd[1]] & 0x1ff;

    pixel_color.a = special_9bit_clamptable[combined_color.a];
    if (pixel_color.a == 0xff)
        pixel_color.a = 0x100;

    if (!other_modes.key_en)
    {
        // clamp asymmetric 9-bit signed
        // RGB values to unsigned 8 bits
        combined_color.r >>= 8;
        combined_color.g >>= 8;
        combined_color.b >>= 8;
        pixel_color.r = special_9bit_clamptable[combined_color.r];
        pixel_color.g = special_9bit_clamptable[combined_color.g];
        pixel_color.b = special_9bit_clamptable[combined_color.b];
    }
    else
    {
        // set per-channel key alpha to
        // key width - abs(current color - key color)
        redkey = SIGN(combined_color.r, 17);
        if (redkey >= 0)
            redkey = (key_width.r << 4) - redkey;
        else
            redkey = (key_width.r << 4) + redkey;
        greenkey = SIGN(combined_color.g, 17);
        if (greenkey >= 0)
            greenkey = (key_width.g << 4) - greenkey;
        else
            greenkey = (key_width.g << 4) + greenkey;
        bluekey = SIGN(combined_color.b, 17);
        if (bluekey >= 0)
            bluekey = (key_width.b << 4) - bluekey;
        else
            bluekey = (key_width.b << 4) + bluekey;
        // use minimum of per-channel keys
        keyalpha = (redkey < greenkey) ? redkey : greenkey;
        keyalpha = (bluekey < keyalpha) ? bluekey : keyalpha;
        keyalpha = CLIP(keyalpha, 0, 0xff);

        // use equation input A as
        // combiner color output
        pixel_color.r = special_9bit_clamptable[chromabypass.r];
        pixel_color.g = special_9bit_clamptable[chromabypass.g];
        pixel_color.b = special_9bit_clamptable[chromabypass.b];

        combined_color.r >>= 8;
        combined_color.g >>= 8;
        combined_color.b >>= 8;
    }

    // set coverage to product of
    // coverage and combined alpha
    if (other_modes.cvg_times_alpha)
    {
        temp = (pixel_color.a * (*curpixel_cvg) + 4) >> 3;
        *curpixel_cvg = (temp >> 5) & 0xf;
    }

    // apply alpha options, always
    // clamp to unsigned 8-bit value
    if (!other_modes.alpha_cvg_select)
    {
        if (!other_modes.key_en)
        {
            // apply alpha dithering to
            // combiner output alpha
            pixel_color.a += adseed;
            if (pixel_color.a & 0x100)
                pixel_color.a = 0xff;
        }
        else
            // use chroma key as alpha
            pixel_color.a = keyalpha;
    }
    else
    {
        // set alpha to (scaled) coverage
        if (other_modes.cvg_times_alpha)
            pixel_color.a = temp;
        else
            pixel_color.a = (*curpixel_cvg) << 5;
        if (pixel_color.a > 0xff)
            pixel_color.a = 0xff;
    }

    // apply alpha dithering to
    // interpolated shade alpha
    shade_color.a += adseed;
    if (shade_color.a & 0x100)
        shade_color.a = 0xff;
}
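
Two of the side computations in `combiner_1cycle` are easier to follow in isolation. The sketch below restates the chroma key alpha (key width minus the absolute per-channel difference, minimum across channels, clipped to 8 bits) and the cvg_times_alpha scaling. The helper names are hypothetical, `sign17` is assumed to match what `SIGN(x, 17)` does, and the `0xff` to `0x100` promotion is folded into `cvg_times_alpha` for convenience.

```c
#include <stdint.h>

/* Sign-extend a 17-bit value (assumed behavior of SIGN(x, 17)). */
static int32_t sign17(int32_t x)
{
    x &= 0x1ffff;
    return (x & 0x10000) ? x - 0x20000 : x;
}

/* Per-channel key: (width << 4) - abs(combined). */
static int32_t key_channel(int32_t combined, int32_t width)
{
    int32_t d = sign17(combined);
    if (d < 0)
        d = -d;
    return (width << 4) - d;
}

/* Key alpha: minimum of the three channel keys, clipped to 0..0xff. */
static int32_t key_alpha(int32_t r, int32_t g, int32_t b,
                         int32_t wr, int32_t wg, int32_t wb)
{
    int32_t kr = key_channel(r, wr);
    int32_t kg = key_channel(g, wg);
    int32_t kb = key_channel(b, wb);
    int32_t k = (kr < kg) ? kr : kg;
    k = (kb < k) ? kb : k;
    if (k < 0)
        k = 0;
    if (k > 0xff)
        k = 0xff;
    return k;
}

/* cvg_times_alpha: returns the new coverage and stores the scaled
 * intermediate (used as alpha when alpha_cvg_select is also set). */
static uint32_t cvg_times_alpha(int32_t alpha, uint32_t cvg, int32_t *scaled)
{
    if (alpha == 0xff)
        alpha = 0x100;            /* full alpha is promoted to 1.0 */
    int32_t temp = (alpha * (int32_t)cvg + 4) >> 3;
    *scaled = temp;
    return (uint32_t)(temp >> 5) & 0xf;
}
```

With full alpha a coverage of 8 is preserved, while an alpha of `0x80` halves it to 4. The promotion of `0xff` to `0x100` is what makes the full-alpha case exact.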

static inline void combiner_2cycle(int adseed, uint32_t* curpixel_cvg, int32_t* acalpha)
{
    int32_t redkey, greenkey, bluekey, temp;
    COLOR chromabypass;

    if (combiner_rgbmul_r[0] != &zero_color)
    {
        combined_color.r = color_combiner_equation(*combiner_rgbsub_a_r[0],*combiner_rgbsub_b_r[0],*combiner_rgbmul_r[0],*combiner_rgbadd_r[0]);
        combined_color.g = color_combiner_equation(*combiner_rgbsub_a_g[0],*combiner_rgbsub_b_g[0],*combiner_rgbmul_g[0],*combiner_rgbadd_g[0]);
        combined_color.b = color_combiner_equation(*combiner_rgbsub_a_b[0],*combiner_rgbsub_b_b[0],*combiner_rgbmul_b[0],*combiner_rgbadd_b[0]);
    }
    else
    {
        combined_color.r = ((special_9bit_exttable[*combiner_rgbadd_r[0]] << 8) + 0x80) & 0x1ffff;
        combined_color.g = ((special_9bit_exttable[*combiner_rgbadd_g[0]] << 8) + 0x80) & 0x1ffff;
        combined_color.b = ((special_9bit_exttable[*combiner_rgbadd_b[0]] << 8) + 0x80) & 0x1ffff;
    }

    if (combiner_alphamul[0] != &zero_color)
        combined_color.a = alpha_combiner_equation(*combiner_alphasub_a[0],*combiner_alphasub_b[0],*combiner_alphamul[0],*combiner_alphaadd[0]);
    else
        combined_color.a = special_9bit_exttable[*combiner_alphaadd[0]] & 0x1ff;

    if (other_modes.alpha_compare_en)
    {
        if (other_modes.key_en)
        {
            redkey = SIGN(combined_color.r, 17);
            if (redkey >= 0)
                redkey = (key_width.r << 4) - redkey;
            else
                redkey = (key_width.r << 4) + redkey;
            greenkey = SIGN(combined_color.g, 17);
            if (greenkey >= 0)
                greenkey = (key_width.g << 4) - greenkey;
            else
                greenkey = (key_width.g << 4) + greenkey;
            bluekey = SIGN(combined_color.b, 17);
            if (bluekey >= 0)
                bluekey = (key_width.b << 4) - bluekey;
            else
                bluekey = (key_width.b << 4) + bluekey;
            keyalpha = (redkey < greenkey) ? redkey : greenkey;
            keyalpha = (bluekey < keyalpha) ? bluekey : keyalpha;
            keyalpha = CLIP(keyalpha, 0, 0xff);
        }

        int32_t preacalpha = special_9bit_clamptable[combined_color.a];
        if (preacalpha == 0xff)
            preacalpha = 0x100;

        if (other_modes.cvg_times_alpha)
            temp = (preacalpha * (*curpixel_cvg) + 4) >> 3;

        if (!other_modes.alpha_cvg_select)
        {
            if (!other_modes.key_en)
            {
                preacalpha += adseed;
                if (preacalpha & 0x100)
                    preacalpha = 0xff;
            }
            else
                preacalpha = keyalpha;
        }
        else
        {
            if (other_modes.cvg_times_alpha)
                preacalpha = temp;
            else
                preacalpha = (*curpixel_cvg) << 5;
            if (preacalpha > 0xff)
                preacalpha = 0xff;
        }

        *acalpha = preacalpha;
    }

    combined_color.r >>= 8;
    combined_color.g >>= 8;
    combined_color.b >>= 8;

    texel0_color = texel1_color;
    texel1_color = nexttexel_color;

    if (other_modes.key_en)
    {
        chromabypass.r = *combiner_rgbsub_a_r[1];
        chromabypass.g = *combiner_rgbsub_a_g[1];
        chromabypass.b = *combiner_rgbsub_a_b[1];
    }

    if (combiner_rgbmul_r[1] != &zero_color)
    {
        combined_color.r = color_combiner_equation(*combiner_rgbsub_a_r[1],*combiner_rgbsub_b_r[1],*combiner_rgbmul_r[1],*combiner_rgbadd_r[1]);
        combined_color.g = color_combiner_equation(*combiner_rgbsub_a_g[1],*combiner_rgbsub_b_g[1],*combiner_rgbmul_g[1],*combiner_rgbadd_g[1]);
        combined_color.b = color_combiner_equation(*combiner_rgbsub_a_b[1],*combiner_rgbsub_b_b[1],*combiner_rgbmul_b[1],*combiner_rgbadd_b[1]);
    }
    else
    {
        combined_color.r = ((special_9bit_exttable[*combiner_rgbadd_r[1]] << 8) + 0x80) & 0x1ffff;
        combined_color.g = ((special_9bit_exttable[*combiner_rgbadd_g[1]] << 8) + 0x80) & 0x1ffff;
        combined_color.b = ((special_9bit_exttable[*combiner_rgbadd_b[1]] << 8) + 0x80) & 0x1ffff;
    }

    if (combiner_alphamul[1] != &zero_color)
        combined_color.a = alpha_combiner_equation(*combiner_alphasub_a[1],*combiner_alphasub_b[1],*combiner_alphamul[1],*combiner_alphaadd[1]);
    else
        combined_color.a = special_9bit_exttable[*combiner_alphaadd[1]] & 0x1ff;

    if (!other_modes.key_en)
    {

        combined_color.r >>= 8;
        combined_color.g >>= 8;
        combined_color.b >>= 8;

        pixel_color.r = special_9bit_clamptable[combined_color.r];
        pixel_color.g = special_9bit_clamptable[combined_color.g];
        pixel_color.b = special_9bit_clamptable[combined_color.b];
    }
    else
    {
        redkey = SIGN(combined_color.r, 17);
        if (redkey >= 0)
            redkey = (key_width.r << 4) - redkey;
        else
            redkey = (key_width.r << 4) + redkey;
        greenkey = SIGN(combined_color.g, 17);
        if (greenkey >= 0)
            greenkey = (key_width.g << 4) - greenkey;
        else
            greenkey = (key_width.g << 4) + greenkey;
        bluekey = SIGN(combined_color.b, 17);
        if (bluekey >= 0)
            bluekey = (key_width.b << 4) - bluekey;
        else
            bluekey = (key_width.b << 4) + bluekey;
        keyalpha = (redkey < greenkey) ? redkey : greenkey;
        keyalpha = (bluekey < keyalpha) ? bluekey : keyalpha;
        keyalpha = CLIP(keyalpha, 0, 0xff);

        pixel_color.r = special_9bit_clamptable[chromabypass.r];
        pixel_color.g = special_9bit_clamptable[chromabypass.g];
        pixel_color.b = special_9bit_clamptable[chromabypass.b];

        combined_color.r >>= 8;
        combined_color.g >>= 8;
        combined_color.b >>= 8;
    }

    pixel_color.a = special_9bit_clamptable[combined_color.a];
    if (pixel_color.a == 0xff)
        pixel_color.a = 0x100;

    if (other_modes.cvg_times_alpha)
    {
        temp = (pixel_color.a * (*curpixel_cvg) + 4) >> 3;
        *curpixel_cvg = (temp >> 5) & 0xf;
    }

    if (!other_modes.alpha_cvg_select)
    {
        if (!other_modes.key_en)
        {
            pixel_color.a += adseed;
            if (pixel_color.a & 0x100)
                pixel_color.a = 0xff;
        }
        else
            pixel_color.a = keyalpha;
    }
    else
    {
        if (other_modes.cvg_times_alpha)
            pixel_color.a = temp;
        else
            pixel_color.a = (*curpixel_cvg) << 5;
        if (pixel_color.a > 0xff)
            pixel_color.a = 0xff;
    }

    shade_color.a += adseed;
    if (shade_color.a & 0x100)
        shade_color.a = 0xff;
}
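
The coverage-times-alpha step near the end of the function above is easy to try in isolation. The following is a minimal sketch, not Angrylion's code: the helper name cvg_times_alpha and its scaled out-parameter are hypothetical, and it assumes the alpha passed in has already been promoted from 0xff to 0x100 as the code above does.

```c
#include <stdint.h>

/* Hypothetical helper restating the cvg_times_alpha branch above.
   alpha:  9-bit pixel alpha, where 0x100 stands for fully opaque.
   cvg:    coverage count of the pixel (0..8).
   scaled: receives the intermediate product, which the original code
           can reuse as the pixel alpha when alpha_cvg_select is set.
   Returns the new coverage count. */
static uint32_t cvg_times_alpha(uint32_t alpha, uint32_t cvg, uint32_t *scaled)
{
    uint32_t temp = (alpha * cvg + 4) >> 3;  /* alpha * cvg / 8, rounded */
    if (scaled)
        *scaled = temp;
    return (temp >> 5) & 0xf;                /* back to a coverage count */
}
```

With alpha 0x100 and a full coverage of 8 the result stays 8, whereas an unpromoted 0xff would give 7. That is one visible effect of replacing 0xff with 0x100 before the multiply.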

static void precalculate_everything(void)
{
    int i = 0, k = 0, j = 0;

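    /* Gamma tables: each entry is twice vi_integer_sqrt of its index
       (the 8-bit table scales its index by 64 first). */
    for (i = 0; i < 256; i++)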
    {
        gamma_table[i] = vi_integer_sqrt(i << 6);
        gamma_table[i] <<= 1;
    }
    for (i = 0; i < 0x4000; i++)
    {
        gamma_dither_table[i] = vi_integer_sqrt(i);
        gamma_dither_table[i] <<= 1;
    }

    z_build_com_table();

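    /* Expand every 14-bit index (3-bit exponent, 11-bit mantissa) into an
       18-bit value using the per-exponent shift and offset in z_dec_table. */
    uint32_t exponent;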
    uint32_t mantissa;
    for (i = 0; i < 0x4000; i++)
    {
        exponent = (i >> 11) & 7;
        mantissa = i & 0x7ff;
        z_complete_dec_table[i] = ((mantissa << z_dec_table[exponent].shift) + z_dec_table[exponent].add) & 0x3ffff;
    }

    precalc_cvmask_derivatives();

    i = 0;
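    /* log2table[i] is the position of the highest set bit of i,
       i.e. floor(log2(i)); entries 0 and 1 are both 0. */
    log2table[0] = log2table[1] = 0;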
    for (i = 2; i < 256; i++)
    {
        for (k = 7; k > 0; k--)
        {
            if((i >> k) & 1)
            {
                log2table[i] = k;
                break;
            }
        }
    }

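    /* The index packs two 5-bit values: bits 9..5 are compared with
       bits 4..0 and the table stores +1, -1 or 0. */
    for (i = 0; i < 0x400; i++)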
    {
        if (((i >> 5) & 0x1f) < (i & 0x1f))
            vi_restore_table[i] = 1;
        else if (((i >> 5) & 0x1f) > (i & 0x1f))
            vi_restore_table[i] = -1;
        else
            vi_restore_table[i] = 0;
    }

    /* Expand 5-bit channels to 8 bits by copying the top three bits
       into the low three bits. */
    for (i = 0; i < 32; i++)
        replicated_rgba[i] = (i << 3) | ((i >> 2) & 7);

    /* maskbits_table[i] keeps the low i bits, capped at 10 bits;
       entry 0 keeps all 10 bits. */
    maskbits_table[0] = 0x3ff;
    for (i = 1; i < 16; i++)
        maskbits_table[i] = ((uint16_t)(0xffff) >> (16 - i)) & 0x3ff;

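    /* 9-bit clamp: 0x000-0x0ff pass through, 0x100-0x17f saturate to
       0xff, and 0x180-0x1ff (negative values) clamp to 0. */
    for(i = 0; i < 0x200; i++)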
    {
        switch((i >> 7) & 3)
        {
        case 0:
        case 1:
            special_9bit_clamptable[i] = i & 0xff;
            break;
        case 2:
            special_9bit_clamptable[i] = 0xff;
            break;
        case 3:
            special_9bit_clamptable[i] = 0;
            break;
        }
    }

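    /* 9-bit sign extension: only values with both top bits set
       (0x180-0x1ff) are negative; everything else stays positive. */
    for(i = 0; i < 0x200; i++)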
    {
        special_9bit_exttable[i] = ((i & 0x180) == 0x180) ? (i | ~0x1ff) : (i & 0x1ff);
    }

    int temppoint, tempslope;
    int normout;
    int wnorm;
    int shift, tlu_rcp;

    for (i = 0; i < 0x8000; i++)
    {
        for (k = 1; k <= 14 && !((i << k) & 0x8000); k++)
            ;
        shift = k - 1;
        normout = (i << shift) & 0x3fff;
        wnorm = (normout & 0xff) << 2;
        normout >>= 8;

        temppoint = norm_point_table[normout];
        tempslope = norm_slope_table[normout];

        tempslope = (tempslope | ~0x3ff) + 1;

        tlu_rcp = (((tempslope * wnorm) >> 10) + temppoint) & 0x7fff;

        tcdiv_table[i] = shift | (tlu_rcp << 4);
    }

    int d = 0, n = 0, temp = 0, res = 0, invd = 0, nbit = 0;
    int ps[9];

    for (i = 0; i < 0x8000; i++)
    {
        res = 0;
        d = (i >> 11) & 0xf;
        n = i & 0x7ff;
        invd = (~d) & 0xf;

        temp = invd + (n >> 8) + 1;
        ps[0] = temp & 7;
        for (k = 0; k < 8; k++)
        {
            nbit = (n >> (7 - k)) & 1;
            if (res & (0x100 >> k))
                temp = invd + (ps[k] << 1) + nbit + 1;
            else
                temp = d + (ps[k] << 1) + nbit;
            ps[k + 1] = temp & 7;
            if (temp & 0x10)
                res |= (1 << (7 - k));
        }
        bldiv_hwaccurate_table[i] = res;
    }

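    /* deltaz_comparator_lut[i] is i rounded down to a power of two
       (only its highest set bit is kept). */
    deltaz_comparator_lut[0] = 0;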
    for (i = 1; i < 0x10000; i++)
    {
        for (k = 15; k >= 0; k--)
        {
            if (i & (1 << k))
            {
                deltaz_comparator_lut[i] = 1 << k;
                break;
            }
        }
    }

}
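
Several of the tables built above are simple enough to restate as pure functions, which makes the unusual 9-bit color format easier to experiment with. This is only a sketch: the names clamp9, ext9, replicate5 and highest_bit are hypothetical, and each function is meant to return the same value as the corresponding table entry (special_9bit_clamptable, special_9bit_exttable, replicated_rgba and deltaz_comparator_lut).

```c
#include <stdint.h>

/* Clamp a 9-bit combiner value to 8 bits, like special_9bit_clamptable. */
static int32_t clamp9(uint32_t i)
{
    switch ((i >> 7) & 3) {
    case 2:  return 0xff;              /* 0x100..0x17f: saturate */
    case 3:  return 0;                 /* 0x180..0x1ff: negative, clamp to 0 */
    default: return (int32_t)(i & 0xff); /* 0x000..0x0ff: already in range */
    }
}

/* Sign-extend the 9-bit format, like special_9bit_exttable.
   The representable range is -128..383 rather than -256..255. */
static int32_t ext9(uint32_t i)
{
    int32_t v = (int32_t)(i & 0x1ff);
    return ((i & 0x180) == 0x180) ? v - 0x200 : v;
}

/* Expand a 5-bit channel to 8 bits, like replicated_rgba. */
static uint32_t replicate5(uint32_t v)
{
    return (v << 3) | ((v >> 2) & 7);
}

/* Keep only the highest set bit of a 16-bit value,
   like deltaz_comparator_lut. */
static uint32_t highest_bit(uint32_t i)
{
    for (int k = 15; k >= 0; k--)
        if (i & (1u << k))
            return 1u << k;
    return 0;
}
```

The asymmetric range in ext9 follows directly from the table: a value is treated as negative only when both of its top two bits are set.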

Blender

static void SET_BLENDER_INPUT(int cycle, int which, int32_t **input_r, int32_t **input_g, int32_t **input_b, int32_t **input_a, int a, int b)
{

    switch (a & 0x3)
    {
        case 0:
        {
            if (cycle == 0)
            {
                *input_r = &pixel_color.r;
                *input_g = &pixel_color.g;
                *input_b = &pixel_color.b;
            }
            else
            {
                *input_r = &blended_pixel_color.r;
                *input_g = &blended_pixel_color.g;
                *input_b = &blended_pixel_color.b;
            }
            break;
        }

        case 1:
        {
            *input_r = &memory_color.r;
            *input_g = &memory_color.g;
            *input_b = &memory_color.b;
            break;
        }

        case 2:
        {
            *input_r = &blend_color.r;      *input_g = &blend_color.g;      *input_b = &blend_color.b;
            break;
        }

        case 3:
        {
            *input_r = &fog_color.r;        *input_g = &fog_color.g;        *input_b = &fog_color.b;
            break;
        }
    }

    if (which == 0)
    {
        switch (b & 0x3)
        {
            case 0:     *input_a = &pixel_color.a; break;
            case 1:     *input_a = &fog_color.a; break;
            case 2:     *input_a = &shade_color.a; break;
            case 3:     *input_a = &zero_color; break;
        }
    }
    else
    {
        switch (b & 0x3)
        {
            case 0:     *input_a = &inv_pixel_color.a; break;
            case 1:     *input_a = &memory_color.a; break;
            case 2:     *input_a = &blenderone; break;
            case 3:     *input_a = &zero_color; break;
        }
    }
}
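
The selector values handled by SET_BLENDER_INPUT can be summarised as two small lookup functions. This is a sketch for reference only, with hypothetical enum and function names; it returns a label instead of the pointer the original stores.

```c
/* Color sources selectable by the 2-bit "a" field. */
typedef enum { SRC_PIXEL, SRC_BLENDED, SRC_MEMORY, SRC_BLEND_REG, SRC_FOG } rgb_source;

/* Alpha factors selectable by the 2-bit "b" field. FAC_INV_A stands for
   inv_pixel_color.a, which the blender functions fill with the
   complement of the first factor. */
typedef enum { FAC_PIXEL_A, FAC_FOG_A, FAC_SHADE_A, FAC_ZERO,
               FAC_INV_A, FAC_MEMORY_A, FAC_ONE } alpha_factor;

static rgb_source blender_rgb_source(int cycle, int sel)
{
    switch (sel & 3) {
    case 0:  return cycle == 0 ? SRC_PIXEL : SRC_BLENDED; /* second cycle sees first cycle's result */
    case 1:  return SRC_MEMORY;
    case 2:  return SRC_BLEND_REG;
    default: return SRC_FOG;
    }
}

static alpha_factor blender_alpha_factor(int which, int sel)
{
    static const alpha_factor first[4]  = { FAC_PIXEL_A, FAC_FOG_A, FAC_SHADE_A, FAC_ZERO };
    static const alpha_factor second[4] = { FAC_INV_A, FAC_MEMORY_A, FAC_ONE, FAC_ZERO };
    return (which == 0 ? first : second)[sel & 3];
}
```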

static const uint8_t bayer_matrix[16] =
{
     0,  4,  1, 5,
     4,  0,  5, 1,
     3,  7,  2, 6,
     7,  3,  6, 2
};

static const uint8_t magic_matrix[16] =
{
     0,  6,  1, 7,
     4,  2,  5, 3,
     3,  5,  2, 4,
     7,  1,  6, 0
};

In addition to evaluating the blender equation, Angrylion performs several other computations in the blender stage of the pipeline. Before anything is written, the alpha comparison and coverage checks must both pass. If color on coverage is selected and coverage did not overflow at the Z comparison phase, the blender M input (assumed to be the color already stored in memory) is written out directly. This allows coverage to be updated without updating color, and helps ensure transparent surfaces are only blended once (see Section 15.7). Otherwise, if the Z comparator determined blending to be unnecessary, the blender P input is written out directly, ensuring only pixels on internal edges are actually blended. The P input is likewise written directly when partial reject is enabled and the pixel alpha is 0xff or greater. RGB dithering is applied as the final step of this phase.
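
The order of these checks can be condensed into a small decision function. The sketch below follows the structure of the one-cycle blender; the type and function names (blend_inputs, blend_path, choose_blend_path) are hypothetical, and the flags are passed in explicitly rather than read from global state.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { PATH_REJECT, PATH_WRITE_P, PATH_WRITE_M, PATH_BLEND } blend_path;

typedef struct {
    bool     alpha_pass;     /* result of the alpha comparison */
    bool     antialias_en;
    uint32_t cvg;            /* coverage count */
    uint32_t cvbit;          /* coverage bit used when antialiasing is off */
    bool     color_on_cvg;
    bool     prewrap;        /* coverage overflowed at the Z comparison */
    bool     blend_en;       /* Z comparator decided blending is needed */
    bool     partial_reject;
    int32_t  alpha;          /* pixel alpha */
} blend_inputs;

static blend_path choose_blend_path(const blend_inputs *in)
{
    if (!in->alpha_pass)
        return PATH_REJECT;                  /* alpha compare failed */
    if ((in->antialias_en ? in->cvg : in->cvbit) == 0)
        return PATH_REJECT;                  /* no coverage */
    if (in->color_on_cvg && !in->prewrap)
        return PATH_WRITE_M;                 /* keep the stored color */
    if (!in->blend_en || (in->partial_reject && in->alpha >= 0xff))
        return PATH_WRITE_P;                 /* write the pixel unblended */
    return PATH_BLEND;                       /* evaluate the blender equation */
}
```

Every path other than PATH_REJECT is followed by RGB dithering before the result is returned.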

static inline int blender_1cycle(uint32_t* fr, uint32_t* fg, uint32_t* fb, int dith, uint32_t blend_en, uint32_t prewrap, uint32_t curpixel_cvg, uint32_t curpixel_cvbit)
{
    int r, g, b, dontblend;

    if (alpha_compare(pixel_color.a))
    {

        if (other_modes.antialias_en ? (curpixel_cvg) : (curpixel_cvbit))
        {

            if (!other_modes.color_on_cvg || prewrap)
            {
                dontblend = (other_modes.f.partialreject_1cycle && pixel_color.a >= 0xff);
                if (!blend_en || dontblend)
                {
                    r = *blender1a_r[0];
                    g = *blender1a_g[0];
                    b = *blender1a_b[0];
                }
                else
                {
                    inv_pixel_color.a =  (~(*blender1b_a[0])) & 0xff;

                    blender_equation_cycle0(&r, &g, &b);
                }
            }
            else
            {
                r = *blender2a_r[0];
                g = *blender2a_g[0];
                b = *blender2a_b[0];
            }

            rgb_dither_ptr(&r, &g, &b, dith);
            *fr = r;
            *fg = g;
            *fb = b;
            return 1;
        }
        else
            return 0;
    }
    else
        return 0;
}

static inline int blender_2cycle(uint32_t* fr, uint32_t* fg, uint32_t* fb, int dith, uint32_t blend_en, uint32_t prewrap, uint32_t curpixel_cvg, uint32_t curpixel_cvbit, int32_t acalpha)
{
    int r, g, b, dontblend;

    if (alpha_compare(acalpha))
    {
        if (other_modes.antialias_en ? (curpixel_cvg) : (curpixel_cvbit))
        {

            inv_pixel_color.a =  (~(*blender1b_a[0])) & 0xff;

            blender_equation_cycle0_2(&r, &g, &b);

            memory_color = pre_memory_color;

            blended_pixel_color.r = r;
            blended_pixel_color.g = g;
            blended_pixel_color.b = b;
            blended_pixel_color.a = pixel_color.a;

            if (!other_modes.color_on_cvg || prewrap)
            {
                dontblend = (other_modes.f.partialreject_2cycle && pixel_color.a >= 0xff);
                if (!blend_en || dontblend)
                {
                    r = *blender1a_r[1];
                    g = *blender1a_g[1];
                    b = *blender1a_b[1];
                }
                else
                {
                    inv_pixel_color.a =  (~(*blender1b_a[1])) & 0xff;
                    blender_equation_cycle1(&r, &g, &b);
                }
            }
            else
            {
                r = *blender2a_r[1];
                g = *blender2a_g[1];
                b = *blender2a_b[1];
            }

            rgb_dither_ptr(&r, &g, &b, dith);
            *fr = r;
            *fg = g;
            *fb = b;
            return 1;
        }
        else
        {
            memory_color = pre_memory_color;
            return 0;
        }
    }
    else
    {
        memory_color = pre_memory_color;
        return 0;
    }
}

Texture Fetching

The N64 RDP has several options that determine the format of a texture, including pixel size, format type, and TLUT enable, but only a few combinations actually produce sensible results: 32-bit RGBA, 16-bit YUV/RGBA/IA, 8-bit CI/I/IA, and 4-bit CI/I/IA. The remaining combinations are still decoded by the functions below rather than rejected. For diagrams of each format’s layout in TMEM see Section 13.8.
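
The list of sensible combinations can be written down as a small predicate. This is a sketch with hypothetical names; the numeric format and size codes are assumed to follow the usual N64 numbering (RGBA, YUV, CI, IA, I as 0 to 4, and 4, 8, 16, 32 bits as 0 to 3).

```c
#include <stdbool.h>

/* Assumed format and size codes. */
enum { FMT_RGBA = 0, FMT_YUV = 1, FMT_CI = 2, FMT_IA = 3, FMT_I = 4 };
enum { SIZ_4B = 0, SIZ_8B = 1, SIZ_16B = 2, SIZ_32B = 3 };

/* Returns true for the format/size pairs listed as producing
   sensible results. */
static bool is_sensible_texture(int format, int size)
{
    switch (size) {
    case SIZ_32B:
        return format == FMT_RGBA;
    case SIZ_16B:
        return format == FMT_YUV || format == FMT_RGBA || format == FMT_IA;
    case SIZ_8B:
    case SIZ_4B:
        return format == FMT_CI || format == FMT_I || format == FMT_IA;
    default:
        return false;
    }
}
```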

static void fetch_texel(COLOR *color, int s, int t, uint32_t tilenum)
{
    uint32_t tbase = tile[tilenum].line * (t & 0xff) + tile[tilenum].tmem;

    uint32_t tpal   = tile[tilenum].palette;

    uint16_t *tc16 = (uint16_t*)TMEM;
    uint32_t taddr = 0;

    switch (tile[tilenum].f.notlutswitch)
    {
    case TEXEL_RGBA4:
        {
            taddr = ((tbase << 4) + s) >> 1;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            uint8_t byteval, c;

            byteval = TMEM[taddr & 0xfff];
            c = ((s & 1)) ? (byteval & 0xf) : (byteval >> 4);
            c |= (c << 4);
            color->r = c;
            color->g = c;
            color->b = c;
            color->a = c;
        }
        break;
    case TEXEL_RGBA8:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t p;

            p = TMEM[taddr & 0xfff];
            color->r = p;
            color->g = p;
            color->b = p;
            color->a = p;
        }
        break;
    case TEXEL_RGBA16:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = GET_HI_RGBA16_TMEM(c);
            color->g = GET_MED_RGBA16_TMEM(c);
            color->b = GET_LOW_RGBA16_TMEM(c);
            color->a = (c & 1) ? 0xff : 0;
        }
        break;
    case TEXEL_RGBA32:
        {

            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

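            /* Red/green come from the low half of TMEM, blue/alpha from the
               same word offset in the high half (index | 0x400). */
            taddr &= 0x3ff;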
            c = tc16[taddr];
            color->r = c >> 8;
            color->g = c & 0xff;
            c = tc16[taddr | 0x400];
            color->b = c >> 8;
            color->a = c & 0xff;
        }
        break;
    case TEXEL_YUV4:
    case TEXEL_YUV8:
        {
            taddr = (tbase << 3) + s;

            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            int32_t u, save;

            save = u = TMEM[taddr & 0x7ff];

            u = u - 0x80;

            color->r = u;
            color->g = u;
            color->b = save;
            color->a = save;
        }
        break;
    case TEXEL_YUV16:
    case TEXEL_YUV32:
        {
            taddr = (tbase << 3) + s;
            int taddrlow = taddr >> 1;

            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            taddrlow ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            taddr &= 0x7ff;
            taddrlow &= 0x3ff;

            uint16_t c = tc16[taddrlow];

            int32_t y, u, v;
            y = TMEM[taddr | 0x800];
            u = c >> 8;
            v = c & 0xff;

            u = u - 0x80;
            v = v - 0x80;

            color->r = u;
            color->g = v;
            color->b = y;
            color->a = y;
        }
        break;
    case TEXEL_CI4:
        {
            taddr = ((tbase << 4) + s) >> 1;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t p;

            p = TMEM[taddr & 0xfff];
            p = (s & 1) ? (p & 0xf) : (p >> 4);
            p = (tpal << 4) | p;
            color->r = color->g = color->b = color->a = p;
        }
        break;
    case TEXEL_CI8:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t p;

            p = TMEM[taddr & 0xfff];
            color->r = p;
            color->g = p;
            color->b = p;
            color->a = p;
        }
        break;
    case TEXEL_CI16:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = c >> 8;
            color->g = c & 0xff;
            color->b = color->r;
            color->a = (c & 1) ? 0xff : 0;
        }
        break;
    case TEXEL_CI32:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = c >> 8;
            color->g = c & 0xff;
            color->b = color->r;
            color->a = (c & 1) ? 0xff : 0;

        }
        break;
    case TEXEL_IA4:
        {
            taddr = ((tbase << 4) + s) >> 1;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t p, i;

            p = TMEM[taddr & 0xfff];
            p = (s & 1) ? (p & 0xf) : (p >> 4);
            i = p & 0xe;
            i = (i << 4) | (i << 1) | (i >> 2);
            color->r = i;
            color->g = i;
            color->b = i;
            color->a = (p & 0x1) ? 0xff : 0;
        }
        break;
    case TEXEL_IA8:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t p, i;

            p = TMEM[taddr & 0xfff];
            i = p & 0xf0;
            i |= (i >> 4);
            color->r = i;
            color->g = i;
            color->b = i;
            color->a = ((p & 0xf) << 4) | (p & 0xf);
        }
        break;
    case TEXEL_IA16:
        {

            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = color->g = color->b = (c >> 8);
            color->a = c & 0xff;
        }
        break;
    case TEXEL_IA32:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = c >> 8;
            color->g = c & 0xff;
            color->b = color->r;
            color->a = (c & 1) ? 0xff : 0;
        }
        break;
    case TEXEL_I4:
        {
            taddr = ((tbase << 4) + s) >> 1;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t byteval, c;

            byteval = TMEM[taddr & 0xfff];
            c = (s & 1) ? (byteval & 0xf) : (byteval >> 4);
            c |= (c << 4);
            color->r = c;
            color->g = c;
            color->b = c;
            color->a = c;
        }
        break;
    case TEXEL_I8:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

            uint8_t c;

            c = TMEM[taddr & 0xfff];
            color->r = c;
            color->g = c;
            color->b = c;
            color->a = c;
        }
        break;
    case TEXEL_I16:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = c >> 8;
            color->g = c & 0xff;
            color->b = color->r;
            color->a = (c & 1) ? 0xff : 0;
        }
        break;
    case TEXEL_I32:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);

            uint16_t c;

            c = tc16[taddr & 0x7ff];
            color->r = c >> 8;
            color->g = c & 0xff;
            color->b = color->r;
            color->a = (c & 1) ? 0xff : 0;
        }
        break;
    default:
        debug("fetch_texel: unknown texture format %d, size %d, tilenum %d\n", tile[tilenum].format, tile[tilenum].size, tilenum);
        break;
    }
}
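
The bit manipulation used to widen the small intensity formats to 8 bits per channel is worth seeing on its own. The sketch below copies the arithmetic of the I4, IA4 and IA8 cases above into standalone helpers; the rgba8 struct and the function names are hypothetical, and TMEM addressing is left out entirely.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba8;

/* Pick the 4-bit texel for coordinate s out of a TMEM byte:
   even s uses the high nibble, odd s the low nibble. */
static uint8_t pick_nibble(uint8_t byteval, int s)
{
    return (s & 1) ? (byteval & 0xf) : (byteval >> 4);
}

/* I4: the nibble (0..15) is copied into both halves of every channel. */
static rgba8 expand_i4(uint8_t n)
{
    uint8_t c = (uint8_t)(n | (n << 4));
    rgba8 out = { c, c, c, c };
    return out;
}

/* IA4: three intensity bits repeated to fill 8 bits, one alpha bit. */
static rgba8 expand_ia4(uint8_t n)
{
    uint8_t i = n & 0xe;
    i = (uint8_t)((i << 4) | (i << 1) | (i >> 2));
    rgba8 out = { i, i, i, (uint8_t)((n & 1) ? 0xff : 0) };
    return out;
}

/* IA8: high nibble is intensity, low nibble is alpha, each doubled. */
static rgba8 expand_ia8(uint8_t p)
{
    uint8_t i = (uint8_t)((p & 0xf0) | (p >> 4));
    uint8_t a = (uint8_t)(((p & 0xf) << 4) | (p & 0xf));
    rgba8 out = { i, i, i, a };
    return out;
}
```

Repeating the source bits, rather than shifting and padding with zeros, means the largest input maps to exactly 0xff and the smallest to 0.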

static void fetch_texel_entlut(COLOR *color, int s, int t, uint32_t tilenum)
{
    uint32_t tbase = tile[tilenum].line * (t & 0xff) + tile[tilenum].tmem;
    uint32_t tpal   = tile[tilenum].palette << 4;
    uint16_t *tc16 = (uint16_t*)TMEM;
    uint32_t taddr = 0;
    uint32_t c;

    switch(tile[tilenum].f.tlutswitch)
    {
    case 0:
    case 1:
    case 2:
        {
            taddr = ((tbase << 4) + s) >> 1;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            c = TMEM[taddr & 0x7ff];
            c = (s & 1) ? (c & 0xf) : (c >> 4);
            c = tlut[((tpal | c) << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 3:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            c = TMEM[taddr & 0x7ff];
            c = (s & 1) ? (c & 0xf) : (c >> 4);
            c = tlut[((tpal | c) << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 4:
    case 5:
    case 6:
    case 7:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            c = TMEM[taddr & 0x7ff];
            c = tlut[(c << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 8:
    case 9:
    case 10:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);
            c = tc16[taddr & 0x3ff];
            c = tlut[((c >> 6) & ~3) ^ WORD_ADDR_XOR];

        }
        break;
    case 11:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            c = TMEM[taddr & 0x7ff];
            c = tlut[(c << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 12:
    case 13:
    case 14:
        {
            taddr = (tbase << 2) + s;
            taddr ^= ((t & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR);
            c = tc16[taddr & 0x3ff];
            c = tlut[((c >> 6) & ~3) ^ WORD_ADDR_XOR];
        }
        break;
    case 15:
        {
            taddr = (tbase << 3) + s;
            taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
            c = TMEM[taddr & 0x7ff];
            c = tlut[(c << 2) ^ WORD_ADDR_XOR];
        }
        break;
    default:
        debug("fetch_texel_entlut: unknown texture format %d, size %d, tilenum %d\n", tile[tilenum].format, tile[tilenum].size, tilenum);
        break;
    }

    if (!other_modes.tlut_type)
    {
        color->r = GET_HI_RGBA16_TMEM(c);
        color->g = GET_MED_RGBA16_TMEM(c);
        color->b = GET_LOW_RGBA16_TMEM(c);
        color->a = (c & 1) ? 0xff : 0;
    }
    else
    {
        color->r = color->g = color->b = c >> 8;
        color->a = c & 0xff;
    }

}
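
The palette lookup above has two parts: forming an index into the 16-bit tlut array, and decoding the fetched entry according to tlut_type. The sketch below restates both under some assumptions. The host-endianness XOR (WORD_ADDR_XOR) is omitted, the helper names are hypothetical, and the GET_HI/MED/LOW_RGBA16_TMEM macros, which are defined outside this section, are assumed to extract the 5-bit fields of a 5/5/5/1 word and widen them by bit replication.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba8;

/* 4-bit texel: the tile's palette number supplies the upper four bits.
   Entries are spaced four 16-bit words apart, hence the shift by 2. */
static uint32_t tlut_index_ci4(uint32_t palette, uint32_t nibble)
{
    return ((palette << 4) | (nibble & 0xf)) << 2;
}

/* 8-bit texel: the byte itself selects the entry. */
static uint32_t tlut_index_ci8(uint32_t texel)
{
    return (texel & 0xff) << 2;
}

/* 16-bit texel: only the upper eight bits select the entry. */
static uint32_t tlut_index_16(uint32_t texel)
{
    return (texel >> 6) & ~3u;
}

/* Widen a 5-bit field to 8 bits by repeating its top bits. */
static uint8_t replicate5(uint32_t v)
{
    v &= 0x1f;
    return (uint8_t)((v << 3) | (v >> 2));
}

/* tlut_type == 0: entry is RGBA 5/5/5/1; otherwise it is IA 8/8. */
static rgba8 decode_tlut_entry(uint16_t c, int tlut_type)
{
    rgba8 out;
    if (!tlut_type) {
        out.r = replicate5(c >> 11);
        out.g = replicate5(c >> 6);
        out.b = replicate5(c >> 1);
        out.a = (c & 1) ? 0xff : 0;
    } else {
        out.r = out.g = out.b = (uint8_t)(c >> 8);
        out.a = (uint8_t)(c & 0xff);
    }
    return out;
}
```

For example, the entry 0xF801 decodes to opaque pure red when treated as RGBA, and to intensity 0xF8 with alpha 0x01 when treated as IA.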

static void fetch_texel_quadro(COLOR *color0, COLOR *color1, COLOR *color2, COLOR *color3, int s0, int s1, int t0, int t1, uint32_t tilenum)
{

    uint32_t tbase0 = tile[tilenum].line * (t0 & 0xff) + tile[tilenum].tmem;
    uint32_t tbase2 = tile[tilenum].line * (t1 & 0xff) + tile[tilenum].tmem;
    uint32_t tpal   = tile[tilenum].palette;
    uint32_t xort = 0, ands = 0;

    uint16_t *tc16 = (uint16_t*)TMEM;
    uint32_t taddr0 = 0, taddr1 = 0, taddr2 = 0, taddr3 = 0;
    uint32_t taddrlow0 = 0, taddrlow1 = 0, taddrlow2 = 0, taddrlow3 = 0;

    switch (tile[tilenum].f.notlutswitch)
    {
    case TEXEL_RGBA4:
        {
            taddr0 = ((tbase0 << 4) + s0) >> 1;
            taddr1 = ((tbase0 << 4) + s1) >> 1;
            taddr2 = ((tbase2 << 4) + s0) >> 1;
            taddr3 = ((tbase2 << 4) + s1) >> 1;
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t byteval, c;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            ands = s0 & 1;
            byteval = TMEM[taddr0];
            c = (ands) ? (byteval & 0xf) : (byteval >> 4);
            c |= (c << 4);
            color0->r = c;
            color0->g = c;
            color0->b = c;
            color0->a = c;
            byteval = TMEM[taddr2];
            c = (ands) ? (byteval & 0xf) : (byteval >> 4);
            c |= (c << 4);
            color2->r = c;
            color2->g = c;
            color2->b = c;
            color2->a = c;

            ands = s1 & 1;
            byteval = TMEM[taddr1];
            c = (ands) ? (byteval & 0xf) : (byteval >> 4);
            c |= (c << 4);
            color1->r = c;
            color1->g = c;
            color1->b = c;
            color1->a = c;
            byteval = TMEM[taddr3];
            c = (ands) ? (byteval & 0xf) : (byteval >> 4);
            c |= (c << 4);
            color3->r = c;
            color3->g = c;
            color3->b = c;
            color3->a = c;
        }
        break;
    case TEXEL_RGBA8:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            p = TMEM[taddr0];
            color0->r = p;
            color0->g = p;
            color0->b = p;
            color0->a = p;
            p = TMEM[taddr2];
            color2->r = p;
            color2->g = p;
            color2->b = p;
            color2->a = p;
            p = TMEM[taddr1];
            color1->r = p;
            color1->g = p;
            color1->b = p;
            color1->a = p;
            p = TMEM[taddr3];
            color3->r = p;
            color3->g = p;
            color3->b = p;
            color3->a = p;
        }
        break;
    case TEXEL_RGBA16:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            c1 = tc16[taddr1];
            c2 = tc16[taddr2];
            c3 = tc16[taddr3];
            color0->r = GET_HI_RGBA16_TMEM(c0);
            color0->g = GET_MED_RGBA16_TMEM(c0);
            color0->b = GET_LOW_RGBA16_TMEM(c0);
            color0->a = (c0 & 1) ? 0xff : 0;
            color1->r = GET_HI_RGBA16_TMEM(c1);
            color1->g = GET_MED_RGBA16_TMEM(c1);
            color1->b = GET_LOW_RGBA16_TMEM(c1);
            color1->a = (c1 & 1) ? 0xff : 0;
            color2->r = GET_HI_RGBA16_TMEM(c2);
            color2->g = GET_MED_RGBA16_TMEM(c2);
            color2->b = GET_LOW_RGBA16_TMEM(c2);
            color2->a = (c2 & 1) ? 0xff : 0;
            color3->r = GET_HI_RGBA16_TMEM(c3);
            color3->g = GET_MED_RGBA16_TMEM(c3);
            color3->b = GET_LOW_RGBA16_TMEM(c3);
            color3->a = (c3 & 1) ? 0xff : 0;
        }
        break;
    case TEXEL_RGBA32:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x3ff;
            taddr1 &= 0x3ff;
            taddr2 &= 0x3ff;
            taddr3 &= 0x3ff;
            c0 = tc16[taddr0];
            color0->r = c0 >> 8;
            color0->g = c0 & 0xff;
            c0 = tc16[taddr0 | 0x400];
            color0->b = c0 >>  8;
            color0->a = c0 & 0xff;
            c1 = tc16[taddr1];
            color1->r = c1 >> 8;
            color1->g = c1 & 0xff;
            c1 = tc16[taddr1 | 0x400];
            color1->b = c1 >>  8;
            color1->a = c1 & 0xff;
            c2 = tc16[taddr2];
            color2->r = c2 >> 8;
            color2->g = c2 & 0xff;
            c2 = tc16[taddr2 | 0x400];
            color2->b = c2 >>  8;
            color2->a = c2 & 0xff;
            c3 = tc16[taddr3];
            color3->r = c3 >> 8;
            color3->g = c3 & 0xff;
            c3 = tc16[taddr3 | 0x400];
            color3->b = c3 >>  8;
            color3->a = c3 & 0xff;
        }
        break;
    case TEXEL_YUV4:
    case TEXEL_YUV8:
        {
            taddr0 = (tbase0 << 3) + s0;
            taddr1 = (tbase0 << 3) + s1;
            taddr2 = (tbase2 << 3) + s0;
            taddr3 = (tbase2 << 3) + s1;

            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            int32_t u0, u1, u2, u3, save0, save1, save2, save3;

            save0 = u0 = TMEM[taddr0 & 0x7ff];
            u0 = u0 - 0x80;
            save1 = u1 = TMEM[taddr1 & 0x7ff];
            u1 = u1 - 0x80;
            save2 = u2 = TMEM[taddr2 & 0x7ff];
            u2 = u2 - 0x80;
            save3 = u3 = TMEM[taddr3 & 0x7ff];
            u3 = u3 - 0x80;

            color0->r = u0;
            color0->g = u0;
            color0->b = save0;
            color0->a = save0;
            color1->r = u1;
            color1->g = u1;
            color1->b = save1;
            color1->a = save1;
            color2->r = u2;
            color2->g = u2;
            color2->b = save2;
            color2->a = save2;
            color3->r = u3;
            color3->g = u3;
            color3->b = save3;
            color3->a = save3;
        }
        break;
    case TEXEL_YUV16:
    case TEXEL_YUV32:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            taddrlow0 = taddr0 >> 1;
            taddrlow1 = taddr1 >> 1;
            taddrlow2 = taddr2 >> 1;
            taddrlow3 = taddr3 >> 1;

            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddrlow0 ^= xort;
            taddrlow1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddrlow2 ^= xort;
            taddrlow3 ^= xort;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            taddrlow0 &= 0x3ff;
            taddrlow1 &= 0x3ff;
            taddrlow2 &= 0x3ff;
            taddrlow3 &= 0x3ff;

            uint16_t c0, c1, c2, c3;
            int32_t y0, y1, y2, y3, u0, u1, u2, u3, v0, v1, v2, v3;

            c0 = tc16[taddrlow0];
            c1 = tc16[taddrlow1];
            c2 = tc16[taddrlow2];
            c3 = tc16[taddrlow3];

            y0 = TMEM[taddr0 | 0x800];
            u0 = c0 >> 8;
            v0 = c0 & 0xff;
            y1 = TMEM[taddr1 | 0x800];
            u1 = c1 >> 8;
            v1 = c1 & 0xff;
            y2 = TMEM[taddr2 | 0x800];
            u2 = c2 >> 8;
            v2 = c2 & 0xff;
            y3 = TMEM[taddr3 | 0x800];
            u3 = c3 >> 8;
            v3 = c3 & 0xff;

            u0 = u0 - 0x80;
            v0 = v0 - 0x80;
            u1 = u1 - 0x80;
            v1 = v1 - 0x80;
            u2 = u2 - 0x80;
            v2 = v2 - 0x80;
            u3 = u3 - 0x80;
            v3 = v3 - 0x80;

            color0->r = u0;
            color0->g = v0;
            color0->b = y0;
            color0->a = y0;
            color1->r = u1;
            color1->g = v1;
            color1->b = y1;
            color1->a = y1;
            color2->r = u2;
            color2->g = v2;
            color2->b = y2;
            color2->a = y2;
            color3->r = u3;
            color3->g = v3;
            color3->b = y3;
            color3->a = y3;
        }
        break;
    case TEXEL_CI4:
        {
            taddr0 = ((tbase0 << 4) + s0) >> 1;
            taddr1 = ((tbase0 << 4) + s1) >> 1;
            taddr2 = ((tbase2 << 4) + s0) >> 1;
            taddr3 = ((tbase2 << 4) + s1) >> 1;
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            ands = s0 & 1;
            p = TMEM[taddr0];
            p = (ands) ? (p & 0xf) : (p >> 4);
            p = (tpal << 4) | p;
            color0->r = color0->g = color0->b = color0->a = p;
            p = TMEM[taddr2];
            p = (ands) ? (p & 0xf) : (p >> 4);
            p = (tpal << 4) | p;
            color2->r = color2->g = color2->b = color2->a = p;

            ands = s1 & 1;
            p = TMEM[taddr1];
            p = (ands) ? (p & 0xf) : (p >> 4);
            p = (tpal << 4) | p;
            color1->r = color1->g = color1->b = color1->a = p;
            p = TMEM[taddr3];
            p = (ands) ? (p & 0xf) : (p >> 4);
            p = (tpal << 4) | p;
            color3->r = color3->g = color3->b = color3->a = p;
        }
        break;
    case TEXEL_CI8:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            p = TMEM[taddr0];
            color0->r = p;
            color0->g = p;
            color0->b = p;
            color0->a = p;
            p = TMEM[taddr2];
            color2->r = p;
            color2->g = p;
            color2->b = p;
            color2->a = p;
            p = TMEM[taddr1];
            color1->r = p;
            color1->g = p;
            color1->b = p;
            color1->a = p;
            p = TMEM[taddr3];
            color3->r = p;
            color3->g = p;
            color3->b = p;
            color3->a = p;
        }
        break;
    case TEXEL_CI16:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            color0->r = c0 >> 8;
            color0->g = c0 & 0xff;
            color0->b = c0 >> 8;
            color0->a = (c0 & 1) ? 0xff : 0;
            c1 = tc16[taddr1];
            color1->r = c1 >> 8;
            color1->g = c1 & 0xff;
            color1->b = c1 >> 8;
            color1->a = (c1 & 1) ? 0xff : 0;
            c2 = tc16[taddr2];
            color2->r = c2 >> 8;
            color2->g = c2 & 0xff;
            color2->b = c2 >> 8;
            color2->a = (c2 & 1) ? 0xff : 0;
            c3 = tc16[taddr3];
            color3->r = c3 >> 8;
            color3->g = c3 & 0xff;
            color3->b = c3 >> 8;
            color3->a = (c3 & 1) ? 0xff : 0;

        }
        break;
    case TEXEL_CI32:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            color0->r = c0 >> 8;
            color0->g = c0 & 0xff;
            color0->b = c0 >> 8;
            color0->a = (c0 & 1) ? 0xff : 0;
            c1 = tc16[taddr1];
            color1->r = c1 >> 8;
            color1->g = c1 & 0xff;
            color1->b = c1 >> 8;
            color1->a = (c1 & 1) ? 0xff : 0;
            c2 = tc16[taddr2];
            color2->r = c2 >> 8;
            color2->g = c2 & 0xff;
            color2->b = c2 >> 8;
            color2->a = (c2 & 1) ? 0xff : 0;
            c3 = tc16[taddr3];
            color3->r = c3 >> 8;
            color3->g = c3 & 0xff;
            color3->b = c3 >> 8;
            color3->a = (c3 & 1) ? 0xff : 0;

        }
        break;
    case TEXEL_IA4:
        {
            taddr0 = ((tbase0 << 4) + s0) >> 1;
            taddr1 = ((tbase0 << 4) + s1) >> 1;
            taddr2 = ((tbase2 << 4) + s0) >> 1;
            taddr3 = ((tbase2 << 4) + s1) >> 1;
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p, i;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            ands = s0 & 1;
            p = TMEM[taddr0];
            p = ands ? (p & 0xf) : (p >> 4);
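            // expand the 3-bit intensity to 8 bits by repeating its bits;
            // the low bit of the nibble is the 1-bit alpha
            i = p & 0xe;
            i = (i << 4) | (i << 1) | (i >> 2);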
            color0->r = i;
            color0->g = i;
            color0->b = i;
            color0->a = (p & 0x1) ? 0xff : 0;
            p = TMEM[taddr2];
            p = ands ? (p & 0xf) : (p >> 4);
            i = p & 0xe;
            i = (i << 4) | (i << 1) | (i >> 2);
            color2->r = i;
            color2->g = i;
            color2->b = i;
            color2->a = (p & 0x1) ? 0xff : 0;

            ands = s1 & 1;
            p = TMEM[taddr1];
            p = ands ? (p & 0xf) : (p >> 4);
            i = p & 0xe;
            i = (i << 4) | (i << 1) | (i >> 2);
            color1->r = i;
            color1->g = i;
            color1->b = i;
            color1->a = (p & 0x1) ? 0xff : 0;
            p = TMEM[taddr3];
            p = ands ? (p & 0xf) : (p >> 4);
            i = p & 0xe;
            i = (i << 4) | (i << 1) | (i >> 2);
            color3->r = i;
            color3->g = i;
            color3->b = i;
            color3->a = (p & 0x1) ? 0xff : 0;

        }
        break;
    case TEXEL_IA8:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p, i;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            p = TMEM[taddr0];
            i = p & 0xf0;
            i |= (i >> 4);
            color0->r = i;
            color0->g = i;
            color0->b = i;
            color0->a = ((p & 0xf) << 4) | (p & 0xf);
            p = TMEM[taddr1];
            i = p & 0xf0;
            i |= (i >> 4);
            color1->r = i;
            color1->g = i;
            color1->b = i;
            color1->a = ((p & 0xf) << 4) | (p & 0xf);
            p = TMEM[taddr2];
            i = p & 0xf0;
            i |= (i >> 4);
            color2->r = i;
            color2->g = i;
            color2->b = i;
            color2->a = ((p & 0xf) << 4) | (p & 0xf);
            p = TMEM[taddr3];
            i = p & 0xf0;
            i |= (i >> 4);
            color3->r = i;
            color3->g = i;
            color3->b = i;
            color3->a = ((p & 0xf) << 4) | (p & 0xf);

        }
        break;
    case TEXEL_IA16:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            color0->r = color0->g = color0->b = c0 >> 8;
            color0->a = c0 & 0xff;
            c1 = tc16[taddr1];
            color1->r = color1->g = color1->b = c1 >> 8;
            color1->a = c1 & 0xff;
            c2 = tc16[taddr2];
            color2->r = color2->g = color2->b = c2 >> 8;
            color2->a = c2 & 0xff;
            c3 = tc16[taddr3];
            color3->r = color3->g = color3->b = c3 >> 8;
            color3->a = c3 & 0xff;

        }
        break;
    case TEXEL_IA32:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            color0->r = c0 >> 8;
            color0->g = c0 & 0xff;
            color0->b = c0 >> 8;
            color0->a = (c0 & 1) ? 0xff : 0;
            c1 = tc16[taddr1];
            color1->r = c1 >> 8;
            color1->g = c1 & 0xff;
            color1->b = c1 >> 8;
            color1->a = (c1 & 1) ? 0xff : 0;
            c2 = tc16[taddr2];
            color2->r = c2 >> 8;
            color2->g = c2 & 0xff;
            color2->b = c2 >> 8;
            color2->a = (c2 & 1) ? 0xff : 0;
            c3 = tc16[taddr3];
            color3->r = c3 >> 8;
            color3->g = c3 & 0xff;
            color3->b = c3 >> 8;
            color3->a = (c3 & 1) ? 0xff : 0;

        }
        break;
    case TEXEL_I4:
        {
            taddr0 = ((tbase0 << 4) + s0) >> 1;
            taddr1 = ((tbase0 << 4) + s1) >> 1;
            taddr2 = ((tbase2 << 4) + s0) >> 1;
            taddr3 = ((tbase2 << 4) + s1) >> 1;
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p, c0, c1, c2, c3;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            ands = s0 & 1;
            p = TMEM[taddr0];
            c0 = ands ? (p & 0xf) : (p >> 4);
            c0 |= (c0 << 4);
            color0->r = color0->g = color0->b = color0->a = c0;
            p = TMEM[taddr2];
            c2 = ands ? (p & 0xf) : (p >> 4);
            c2 |= (c2 << 4);
            color2->r = color2->g = color2->b = color2->a = c2;

            ands = s1 & 1;
            p = TMEM[taddr1];
            c1 = ands ? (p & 0xf) : (p >> 4);
            c1 |= (c1 << 4);
            color1->r = color1->g = color1->b = color1->a = c1;
            p = TMEM[taddr3];
            c3 = ands ? (p & 0xf) : (p >> 4);
            c3 |= (c3 << 4);
            color3->r = color3->g = color3->b = color3->a = c3;

        }
        break;
    case TEXEL_I8:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint32_t p;

            taddr0 &= 0xfff;
            taddr1 &= 0xfff;
            taddr2 &= 0xfff;
            taddr3 &= 0xfff;
            p = TMEM[taddr0];
            color0->r = p;
            color0->g = p;
            color0->b = p;
            color0->a = p;
            p = TMEM[taddr1];
            color1->r = p;
            color1->g = p;
            color1->b = p;
            color1->a = p;
            p = TMEM[taddr2];
            color2->r = p;
            color2->g = p;
            color2->b = p;
            color2->a = p;
            p = TMEM[taddr3];
            color3->r = p;
            color3->g = p;
            color3->b = p;
            color3->a = p;
        }
        break;
    case TEXEL_I16:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            color0->r = c0 >> 8;
            color0->g = c0 & 0xff;
            color0->b = c0 >> 8;
            color0->a = (c0 & 1) ? 0xff : 0;
            c1 = tc16[taddr1];
            color1->r = c1 >> 8;
            color1->g = c1 & 0xff;
            color1->b = c1 >> 8;
            color1->a = (c1 & 1) ? 0xff : 0;
            c2 = tc16[taddr2];
            color2->r = c2 >> 8;
            color2->g = c2 & 0xff;
            color2->b = c2 >> 8;
            color2->a = (c2 & 1) ? 0xff : 0;
            c3 = tc16[taddr3];
            color3->r = c3 >> 8;
            color3->g = c3 & 0xff;
            color3->b = c3 >> 8;
            color3->a = (c3 & 1) ? 0xff : 0;
        }
        break;
    case TEXEL_I32:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            uint16_t c0, c1, c2, c3;

            taddr0 &= 0x7ff;
            taddr1 &= 0x7ff;
            taddr2 &= 0x7ff;
            taddr3 &= 0x7ff;
            c0 = tc16[taddr0];
            color0->r = c0 >> 8;
            color0->g = c0 & 0xff;
            color0->b = c0 >> 8;
            color0->a = (c0 & 1) ? 0xff : 0;
            c1 = tc16[taddr1];
            color1->r = c1 >> 8;
            color1->g = c1 & 0xff;
            color1->b = c1 >> 8;
            color1->a = (c1 & 1) ? 0xff : 0;
            c2 = tc16[taddr2];
            color2->r = c2 >> 8;
            color2->g = c2 & 0xff;
            color2->b = c2 >> 8;
            color2->a = (c2 & 1) ? 0xff : 0;
            c3 = tc16[taddr3];
            color3->r = c3 >> 8;
            color3->g = c3 & 0xff;
            color3->b = c3 >> 8;
            color3->a = (c3 & 1) ? 0xff : 0;
        }
        break;
    default:
        debug("fetch_texel_quadro: unknown texture format %d, size %d, tilenum %d\n", tile[tilenum].format, tile[tilenum].size, tilenum);
        break;
    }
}

static void fetch_texel_entlut_quadro(COLOR *color0, COLOR *color1, COLOR *color2, COLOR *color3, int s0, int s1, int t0, int t1, uint32_t tilenum)
{
    uint32_t tbase0 = tile[tilenum].line * (t0 & 0xff) + tile[tilenum].tmem;
    uint32_t tbase2 = tile[tilenum].line * (t1 & 0xff) + tile[tilenum].tmem;
    uint32_t tpal   = tile[tilenum].palette << 4;
    uint32_t xort = 0, ands = 0;

    uint16_t *tc16 = (uint16_t*)TMEM;
    uint32_t taddr0 = 0, taddr1 = 0, taddr2 = 0, taddr3 = 0;
    uint16_t c0, c1, c2, c3;

    switch(tile[tilenum].f.tlutswitch)
    {
    case 0:
    case 1:
    case 2:
        {
            taddr0 = ((tbase0 << 4) + s0) >> 1;
            taddr1 = ((tbase0 << 4) + s1) >> 1;
            taddr2 = ((tbase2 << 4) + s0) >> 1;
            taddr3 = ((tbase2 << 4) + s1) >> 1;
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            ands = s0 & 1;
            c0 = TMEM[taddr0 & 0x7ff];
            c0 = (ands) ? (c0 & 0xf) : (c0 >> 4);
            c0 = tlut[((tpal | c0) << 2) ^ WORD_ADDR_XOR];
            c2 = TMEM[taddr2 & 0x7ff];
            c2 = (ands) ? (c2 & 0xf) : (c2 >> 4);
            c2 = tlut[((tpal | c2) << 2) ^ WORD_ADDR_XOR];

            ands = s1 & 1;
            c1 = TMEM[taddr1 & 0x7ff];
            c1 = (ands) ? (c1 & 0xf) : (c1 >> 4);
            c1 = tlut[((tpal | c1) << 2) ^ WORD_ADDR_XOR];
            c3 = TMEM[taddr3 & 0x7ff];
            c3 = (ands) ? (c3 & 0xf) : (c3 >> 4);
            c3 = tlut[((tpal | c3) << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 3:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            ands = s0 & 1;
            c0 = TMEM[taddr0 & 0x7ff];
            c0 = (ands) ? (c0 & 0xf) : (c0 >> 4);
            c0 = tlut[((tpal | c0) << 2) ^ WORD_ADDR_XOR];
            c2 = TMEM[taddr2 & 0x7ff];
            c2 = (ands) ? (c2 & 0xf) : (c2 >> 4);
            c2 = tlut[((tpal | c2) << 2) ^ WORD_ADDR_XOR];

            ands = s1 & 1;
            c1 = TMEM[taddr1 & 0x7ff];
            c1 = (ands) ? (c1 & 0xf) : (c1 >> 4);
            c1 = tlut[((tpal | c1) << 2) ^ WORD_ADDR_XOR];
            c3 = TMEM[taddr3 & 0x7ff];
            c3 = (ands) ? (c3 & 0xf) : (c3 >> 4);
            c3 = tlut[((tpal | c3) << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 4:
    case 5:
    case 6:
    case 7:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            c0 = TMEM[taddr0 & 0x7ff];
            c0 = tlut[(c0 << 2) ^ WORD_ADDR_XOR];
            c2 = TMEM[taddr2 & 0x7ff];
            c2 = tlut[(c2 << 2) ^ WORD_ADDR_XOR];
            c1 = TMEM[taddr1 & 0x7ff];
            c1 = tlut[(c1 << 2) ^ WORD_ADDR_XOR];
            c3 = TMEM[taddr3 & 0x7ff];
            c3 = tlut[(c3 << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 8:
    case 9:
    case 10:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            c0 = tc16[taddr0 & 0x3ff];
            c0 = tlut[((c0 >> 6) & ~3) ^ WORD_ADDR_XOR];
            c1 = tc16[taddr1 & 0x3ff];
            c1 = tlut[((c1 >> 6) & ~3) ^ WORD_ADDR_XOR];
            c2 = tc16[taddr2 & 0x3ff];
            c2 = tlut[((c2 >> 6) & ~3) ^ WORD_ADDR_XOR];
            c3 = tc16[taddr3 & 0x3ff];
            c3 = tlut[((c3 >> 6) & ~3) ^ WORD_ADDR_XOR];
        }
        break;
    case 11:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            c0 = TMEM[taddr0 & 0x7ff];
            c0 = tlut[(c0 << 2) ^ WORD_ADDR_XOR];
            c2 = TMEM[taddr2 & 0x7ff];
            c2 = tlut[(c2 << 2) ^ WORD_ADDR_XOR];
            c1 = TMEM[taddr1 & 0x7ff];
            c1 = tlut[(c1 << 2) ^ WORD_ADDR_XOR];
            c3 = TMEM[taddr3 & 0x7ff];
            c3 = tlut[(c3 << 2) ^ WORD_ADDR_XOR];
        }
        break;
    case 12:
    case 13:
    case 14:
        {
            taddr0 = ((tbase0 << 2) + s0);
            taddr1 = ((tbase0 << 2) + s1);
            taddr2 = ((tbase2 << 2) + s0);
            taddr3 = ((tbase2 << 2) + s1);
            xort = (t0 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? WORD_XOR_DWORD_SWAP : WORD_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            c0 = tc16[taddr0 & 0x3ff];
            c0 = tlut[((c0 >> 6) & ~3) ^ WORD_ADDR_XOR];
            c1 = tc16[taddr1 & 0x3ff];
            c1 = tlut[((c1 >> 6) & ~3) ^ WORD_ADDR_XOR];
            c2 = tc16[taddr2 & 0x3ff];
            c2 = tlut[((c2 >> 6) & ~3) ^ WORD_ADDR_XOR];
            c3 = tc16[taddr3 & 0x3ff];
            c3 = tlut[((c3 >> 6) & ~3) ^ WORD_ADDR_XOR];
        }
        break;
    case 15:
        {
            taddr0 = ((tbase0 << 3) + s0);
            taddr1 = ((tbase0 << 3) + s1);
            taddr2 = ((tbase2 << 3) + s0);
            taddr3 = ((tbase2 << 3) + s1);
            xort = (t0 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr0 ^= xort;
            taddr1 ^= xort;
            xort = (t1 & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR;
            taddr2 ^= xort;
            taddr3 ^= xort;

            c0 = TMEM[taddr0 & 0x7ff];
            c0 = tlut[(c0 << 2) ^ WORD_ADDR_XOR];
            c2 = TMEM[taddr2 & 0x7ff];
            c2 = tlut[(c2 << 2) ^ WORD_ADDR_XOR];
            c1 = TMEM[taddr1 & 0x7ff];
            c1 = tlut[(c1 << 2) ^ WORD_ADDR_XOR];
            c3 = TMEM[taddr3 & 0x7ff];
            c3 = tlut[(c3 << 2) ^ WORD_ADDR_XOR];
        }
        break;
    default:
        debug("fetch_texel_entlut_quadro: unknown texture format %d, size %d, tilenum %d\n", tile[tilenum].format, tile[tilenum].size, tilenum);
        break;
    }

    // tlut_type 0: TLUT entries are RGBA5551; otherwise they are IA88
    if (!other_modes.tlut_type)
    {
        color0->r = GET_HI_RGBA16_TMEM(c0);
        color0->g = GET_MED_RGBA16_TMEM(c0);
        color0->b = GET_LOW_RGBA16_TMEM(c0);
        color0->a = (c0 & 1) ? 0xff : 0;
        color1->r = GET_HI_RGBA16_TMEM(c1);
        color1->g = GET_MED_RGBA16_TMEM(c1);
        color1->b = GET_LOW_RGBA16_TMEM(c1);
        color1->a = (c1 & 1) ? 0xff : 0;
        color2->r = GET_HI_RGBA16_TMEM(c2);
        color2->g = GET_MED_RGBA16_TMEM(c2);
        color2->b = GET_LOW_RGBA16_TMEM(c2);
        color2->a = (c2 & 1) ? 0xff : 0;
        color3->r = GET_HI_RGBA16_TMEM(c3);
        color3->g = GET_MED_RGBA16_TMEM(c3);
        color3->b = GET_LOW_RGBA16_TMEM(c3);
        color3->a = (c3 & 1) ? 0xff : 0;
    }
    else
    {
        color0->r = color0->g = color0->b = c0 >> 8;
        color0->a = c0 & 0xff;
        color1->r = color1->g = color1->b = c1 >> 8;
        color1->a = c1 & 0xff;
        color2->r = color2->g = color2->b = c2 >> 8;
        color2->a = c2 & 0xff;
        color3->r = color3->g = color3->b = c3 >> 8;
        color3->a = c3 & 0xff;
    }
}

The 4KB of TMEM in the N64 is split into 8 banks, each holding 256 16-bit words. The first 4 banks are considered low TMEM and the last 4 banks high TMEM. Angrylion represents this as an array of 16-bit words indexed by an 11-bit value: the most significant bit (bit 10) distinguishes high from low TMEM, bits 2-9 give the address within the bank, and the lowest 2 bits select the bank within high or low TMEM.
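As a rough illustration of this layout, the sketch below splits an 11-bit word index into its three fields. The `tmem_index_parts` struct and the `split_tmem_index` helper are hypothetical names introduced here for illustration; they do not appear in n64video.c.

```c
#include <stdint.h>

/* Hypothetical helper type, not part of n64video.c. */
typedef struct {
    uint32_t high;    /* bit 10: 0 = low TMEM, 1 = high TMEM */
    uint32_t offset;  /* bits 2-9: word address within the bank (0..255) */
    uint32_t bank;    /* bits 0-1: bank within the selected half (0..3) */
} tmem_index_parts;

/* Split an index into the 16-bit TMEM array into its fields. */
static tmem_index_parts split_tmem_index(uint32_t idx)
{
    tmem_index_parts p;
    idx &= 0x7ff;                 /* 2048 words in total */
    p.high = (idx >> 10) & 1;
    p.offset = (idx >> 2) & 0xff;
    p.bank = idx & 3;
    return p;
}
```

For example, index `0x406` decodes to high TMEM, bank 2, word 1 within that bank.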

void get_tmem_idx(int s, int t, uint32_t tilenum, uint32_t* idx0, uint32_t* idx1, uint32_t* idx2, uint32_t* idx3, uint32_t* bit3flipped, uint32_t* hibit)
{
    uint32_t tbase = (tile[tilenum].line * t) & 0x1ff;
    tbase += tile[tilenum].tmem;
    uint32_t tsize = tile[tilenum].size;
    uint32_t tformat = tile[tilenum].format;
    uint32_t sshorts = 0;

    // convert horizontal position
    // from texels to 16-bit words
    if (tsize == PIXEL_SIZE_8BIT || tformat == FORMAT_YUV)
        sshorts = s >> 1;
    else if (tsize >= PIXEL_SIZE_16BIT)
        sshorts = s;
    else
        sshorts = s >> 2;
    sshorts &= 0x7ff;

    // check if bit 1 of the horizontal word offset is set,
    // inverting the result on odd rows
    *bit3flipped = ((sshorts & 2) ? 1 : 0) ^ (t & 1);

    int tidx_a = ((tbase << 2) + sshorts) & 0x7fd;
    int tidx_b = (tidx_a + 1) & 0x7ff;
    int tidx_c = (tidx_a + 2) & 0x7ff;
    int tidx_d = (tidx_a + 3) & 0x7ff;

    // check if starting address in high TMEM
    *hibit = (tidx_a & 0x400) ? 1 : 0;

    // swap each pair of 4-byte blocks on odd rows
    // ensures 2x2 blocks of 16bpp texels use 4 banks
    if (t & 1)
    {
        tidx_a ^= 2;
        tidx_b ^= 2;
        tidx_c ^= 2;
        tidx_d ^= 2;
    }

    // order TMEM indexes by bank
    sort_tmem_idx(idx0, tidx_a, tidx_b, tidx_c, tidx_d, 0);
    sort_tmem_idx(idx1, tidx_a, tidx_b, tidx_c, tidx_d, 1);
    sort_tmem_idx(idx2, tidx_a, tidx_b, tidx_c, tidx_d, 2);
    sort_tmem_idx(idx3, tidx_a, tidx_b, tidx_c, tidx_d, 3);
}

void read_tmem_copy(int s, int s1, int s2, int s3, int t, uint32_t tilenum, uint32_t* sortshort, int* hibits, int* lowbits)
{
    uint32_t tbase = (tile[tilenum].line * t) & 0x1ff;
    tbase += tile[tilenum].tmem;
    uint32_t tsize = tile[tilenum].size;
    uint32_t tformat = tile[tilenum].format;
    uint32_t shbytes = 0, shbytes1 = 0, shbytes2 = 0, shbytes3 = 0;
    int32_t delta = 0;
    uint32_t sortidx[8];

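    // convert horizontal positions from texels to nibbles
    // (despite their names, the shbytes values count 4-bit units)
    if (tsize == PIXEL_SIZE_8BIT || tformat == FORMAT_YUV)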
    {
        shbytes = s << 1;
        shbytes1 = s1 << 1;
        shbytes2 = s2 << 1;
        shbytes3 = s3 << 1;
    }
    else if (tsize >= PIXEL_SIZE_16BIT)
    {
        shbytes = s << 2;
        shbytes1 = s1 << 2;
        shbytes2 = s2 << 2;
        shbytes3 = s3 << 2;
    }
    else
    {
        shbytes = s;
        shbytes1 = s1;
        shbytes2 = s2;
        shbytes3 = s3;
    }

    shbytes &= 0x1fff;
    shbytes1 &= 0x1fff;
    shbytes2 &= 0x1fff;
    shbytes3 &= 0x1fff;

    int tidx_a, tidx_blow, tidx_bhi, tidx_c, tidx_dlow, tidx_dhi;

    tbase <<= 4;
    tidx_a = (tbase + shbytes) & 0x1fff;
    tidx_bhi = (tbase + shbytes1) & 0x1fff;
    tidx_c = (tbase + shbytes2) & 0x1fff;
    tidx_dhi = (tbase + shbytes3) & 0x1fff;

    if (tformat == FORMAT_YUV)
    {
        delta = shbytes1 - shbytes;
        tidx_blow = (tidx_a + (delta << 1)) & 0x1fff;
        tidx_dlow = (tidx_blow + shbytes3 - shbytes) & 0x1fff;
    }
    else
    {
        tidx_blow = tidx_bhi;
        tidx_dlow = tidx_dhi;
    }

    // swap the 32-bit halves of each 64-bit word on odd rows
    // (bit 3 of the nibble address)
    if (t & 1)
    {
        tidx_a ^= 8;
        tidx_blow ^= 8;
        tidx_bhi ^= 8;
        tidx_c ^= 8;
        tidx_dlow ^= 8;
        tidx_dhi ^= 8;
    }

    hibits[0] = (tidx_a & 0x1000) ? 1 : 0;
    hibits[1] = (tidx_blow & 0x1000) ? 1 : 0;
    hibits[2] = (tidx_bhi & 0x1000) ? 1 : 0;
    hibits[3] = (tidx_c & 0x1000) ? 1 : 0;
    hibits[4] = (tidx_dlow & 0x1000) ? 1 : 0;
    hibits[5] = (tidx_dhi & 0x1000) ? 1 : 0;
    lowbits[0] = tidx_a & 0xf;
    lowbits[1] = tidx_blow & 0xf;
    lowbits[2] = tidx_bhi & 0xf;
    lowbits[3] = tidx_c & 0xf;
    lowbits[4] = tidx_dlow & 0xf;
    lowbits[5] = tidx_dhi & 0xf;

    uint16_t* tmem16 = (uint16_t*)TMEM;
    uint32_t short0, short1, short2, short3;

    tidx_a >>= 2;
    tidx_blow >>= 2;
    tidx_bhi >>= 2;
    tidx_c >>= 2;
    tidx_dlow >>= 2;
    tidx_dhi >>= 2;

    sort_tmem_idx(&sortidx[0], tidx_a, tidx_blow, tidx_c, tidx_dlow, 0);
    sort_tmem_idx(&sortidx[1], tidx_a, tidx_blow, tidx_c, tidx_dlow, 1);
    sort_tmem_idx(&sortidx[2], tidx_a, tidx_blow, tidx_c, tidx_dlow, 2);
    sort_tmem_idx(&sortidx[3], tidx_a, tidx_blow, tidx_c, tidx_dlow, 3);

    short0 = tmem16[sortidx[0] ^ WORD_ADDR_XOR];
    short1 = tmem16[sortidx[1] ^ WORD_ADDR_XOR];
    short2 = tmem16[sortidx[2] ^ WORD_ADDR_XOR];
    short3 = tmem16[sortidx[3] ^ WORD_ADDR_XOR];

    sort_tmem_shorts_lowhalf(&sortshort[0], short0, short1, short2, short3, lowbits[0] >> 2);
    sort_tmem_shorts_lowhalf(&sortshort[1], short0, short1, short2, short3, lowbits[1] >> 2);
    sort_tmem_shorts_lowhalf(&sortshort[2], short0, short1, short2, short3, lowbits[3] >> 2);
    sort_tmem_shorts_lowhalf(&sortshort[3], short0, short1, short2, short3, lowbits[4] >> 2);

    if (other_modes.en_tlut)
    {

        compute_color_index(&short0, sortshort[0], lowbits[0] & 3, tilenum);
        compute_color_index(&short1, sortshort[1], lowbits[1] & 3, tilenum);
        compute_color_index(&short2, sortshort[2], lowbits[3] & 3, tilenum);
        compute_color_index(&short3, sortshort[3], lowbits[4] & 3, tilenum);

        sortidx[4] = (short0 << 2);
        sortidx[5] = (short1 << 2) | 1;
        sortidx[6] = (short2 << 2) | 2;
        sortidx[7] = (short3 << 2) | 3;
    }
    else
    {
        sort_tmem_idx(&sortidx[4], tidx_a, tidx_bhi, tidx_c, tidx_dhi, 0);
        sort_tmem_idx(&sortidx[5], tidx_a, tidx_bhi, tidx_c, tidx_dhi, 1);
        sort_tmem_idx(&sortidx[6], tidx_a, tidx_bhi, tidx_c, tidx_dhi, 2);
        sort_tmem_idx(&sortidx[7], tidx_a, tidx_bhi, tidx_c, tidx_dhi, 3);
    }

    short0 = tmem16[(sortidx[4] | 0x400) ^ WORD_ADDR_XOR];
    short1 = tmem16[(sortidx[5] | 0x400) ^ WORD_ADDR_XOR];
    short2 = tmem16[(sortidx[6] | 0x400) ^ WORD_ADDR_XOR];
    short3 = tmem16[(sortidx[7] | 0x400) ^ WORD_ADDR_XOR];

    if (other_modes.en_tlut)
    {
        sort_tmem_shorts_lowhalf(&sortshort[4], short0, short1, short2, short3, 0);
        sort_tmem_shorts_lowhalf(&sortshort[5], short0, short1, short2, short3, 1);
        sort_tmem_shorts_lowhalf(&sortshort[6], short0, short1, short2, short3, 2);
        sort_tmem_shorts_lowhalf(&sortshort[7], short0, short1, short2, short3, 3);
    }
    else
    {
        sort_tmem_shorts_lowhalf(&sortshort[4], short0, short1, short2, short3, lowbits[0] >> 2);
        sort_tmem_shorts_lowhalf(&sortshort[5], short0, short1, short2, short3, lowbits[2] >> 2);
        sort_tmem_shorts_lowhalf(&sortshort[6], short0, short1, short2, short3, lowbits[3] >> 2);
        sort_tmem_shorts_lowhalf(&sortshort[7], short0, short1, short2, short3, lowbits[5] >> 2);
    }
}

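// select whichever of the four word indexes lies in bank bankno (low 2 bits),
// dropping bit 10 so the result addresses one half of TMEM; 0 if none match
void sort_tmem_idx(uint32_t *idx, uint32_t idxa, uint32_t idxb, uint32_t idxc, uint32_t idxd, uint32_t bankno)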
{
    if ((idxa & 3) == bankno)
        *idx = idxa & 0x3ff;
    else if ((idxb & 3) == bankno)
        *idx = idxb & 0x3ff;
    else if ((idxc & 3) == bankno)
        *idx = idxc & 0x3ff;
    else if ((idxd & 3) == bankno)
        *idx = idxd & 0x3ff;
    else
        *idx = 0;
}
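For the indexes produced by `get_tmem_idx`, the final `else` branch above should never be reached: four consecutive word indexes always land in four different banks, and the odd-row XOR with 2 only permutes them. The standalone check below is a sketch of that reasoning; `banks_touched` and `every_start_covers_all_banks` are hypothetical helpers, not functions from Angrylion.

```c
#include <stdint.h>

/* Bit mask of the banks (low 2 index bits) touched by the four word
   indexes that get_tmem_idx derives from one starting index. */
static uint32_t banks_touched(uint32_t start, int odd_row)
{
    uint32_t a = start & 0x7fd;          /* same mask as tidx_a */
    uint32_t mask = 0;
    uint32_t k;
    for (k = 0; k < 4; k++) {
        uint32_t idx = (a + k) & 0x7ff;  /* tidx_a .. tidx_d */
        if (odd_row)
            idx ^= 2;                    /* odd-row swap */
        mask |= 1u << (idx & 3);
    }
    return mask;
}

/* Returns 1 if every start index, on both row parities, hits all four banks. */
static int every_start_covers_all_banks(void)
{
    uint32_t start;
    for (start = 0; start < 0x800; start++) {
        if (banks_touched(start, 0) != 0xf || banks_touched(start, 1) != 0xf)
            return 0;
    }
    return 1;
}
```

The same guarantee does not obviously hold for `read_tmem_copy`, whose four indexes come from separate s coordinates, so the zero fallback may matter there.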

// select the word that was read from bank bankno
void sort_tmem_shorts_lowhalf(uint32_t* bindshort, uint32_t short0, uint32_t short1, uint32_t short2, uint32_t short3, uint32_t bankno)
{
    switch(bankno)
    {
    case 0:
        *bindshort = short0;
        break;
    case 1:
        *bindshort = short1;
        break;
    case 2:
        *bindshort = short2;
        break;
    case 3:
        *bindshort = short3;
        break;
    }
}

void compute_color_index(uint32_t* cidx, uint32_t readshort, uint32_t nybbleoffset, uint32_t tilenum)
{
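    uint32_t lownib, hinib;
    if (tile[tilenum].size == PIXEL_SIZE_4BIT)
    {
        // shift that brings the nibble at nybbleoffset
        // (counted from the MSB) down to bits 0-3
        lownib = (nybbleoffset ^ 3) << 2;
        // upper half of the index comes from the tile palette
        hinib = tile[tilenum].palette;
    }
    else
    {
        // shift of 8 for the high byte, 0 for the low byte
        lownib = ((nybbleoffset & 2) ^ 2) << 2;
        // upper nibble of the selected byte
        hinib = lownib ? ((readshort >> 12) & 0xf) : ((readshort >> 4) & 0xf);
    }
    // lower nibble of the index
    lownib = (readshort >> lownib) & 0xf;
    *cidx = (hinib << 4) | lownib;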
}
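The shift arithmetic in `compute_color_index` counts nibbles from the most significant end of the 16-bit word. The sketch below restates it as standalone helpers to make the two cases easier to follow. The function names are hypothetical, and the `is_4bit` and `palette` parameters stand in for the `tile[]` fields read by the original.

```c
#include <stdint.h>

/* Nibble n of a 16-bit word, counting from the most significant nibble (n = 0..3). */
static uint32_t nibble_from_msb(uint32_t word, uint32_t n)
{
    return (word >> ((n ^ 3) << 2)) & 0xf;
}

/* Byte containing nibble n: the high byte for n = 0 or 1, the low byte for n = 2 or 3. */
static uint32_t byte_from_msb(uint32_t word, uint32_t n)
{
    uint32_t shift = ((n & 2) ^ 2) << 2;   /* 8 for the high byte, 0 for the low byte */
    return (word >> shift) & 0xff;
}

/* 4-bit texels: palette in the upper nibble, texel nibble in the lower.
   Any other size: the whole byte is the index. */
static uint32_t color_index(uint32_t word, uint32_t n, int is_4bit, uint32_t palette)
{
    if (is_4bit)
        return ((palette & 0xf) << 4) | nibble_from_msb(word, n);
    return byte_from_msb(word, n);
}
```

With the word `0xABCD`, nibble offset 2 and palette 5, the 4-bit case yields index `0x5C`; for any other size, offsets 2 and 3 both yield `0xCD`.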

void replicate_for_copy(uint32_t* outbyte, uint32_t inshort, uint32_t nybbleoffset, uint32_t tilenum, uint32_t tformat, uint32_t tsize)
{
    uint32_t lownib, hinib;
    switch(tsize)
    {
    case PIXEL_SIZE_4BIT:
        // extract nibble at given offset
        // (in nibbles from MSB) of inshort
        lownib = (nybbleoffset ^ 3) << 2;
        lownib = hinib = (inshort >> lownib) & 0xf;
        if (tformat == FORMAT_CI)
        {
            // get TLUT index from nibble and tile palette
            *outbyte = (tile[tilenum].palette << 4) | lownib;
        }
        else if (tformat == FORMAT_IA)
        {
            // triplicate I part of IA31
            lownib = (lownib << 4) | lownib;
            *outbyte = (lownib & 0xe0) | ((lownib & 0xe0) >> 3) | ((lownib & 0xc0) >> 6);
        }
        else
            // duplicate I4 value in outbyte
            *outbyte = (lownib << 4) | lownib;
        break;
    case PIXEL_SIZE_8BIT:
        // quantize nibble index to
        // bytes from MSB of inshort
        hinib = ((nybbleoffset ^ 3) | 1) << 2;
        if (tformat == FORMAT_IA)
        {
            // duplicate I part of IA44
            lownib = (inshort >> hinib) & 0xf;
            *outbyte = (lownib << 4) | lownib;
        }
        else
        {
            // extract byte at offset
            lownib = (inshort >> (hinib & ~4)) & 0xf;
            hinib = (inshort >> hinib) & 0xf;
            *outbyte = (hinib << 4) | lownib;
        }
        break;
    default:
        // extract R part of RGBA8888
        // (inshort only contains RG)
        *outbyte = (inshort >> 8) & 0xff;
        break;
    }
}
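
The bit replication used for the 4-bit formats in replicate_for_copy can be hard to read from the masks alone. The sketch below restates it with hypothetical helper names (`expand_ia31_intensity`, `expand_i4`), assuming the IA31 layout implied by the comments above: three intensity bits in the top of the nibble and one alpha bit at the bottom.

```c
#include <stdint.h>

// Hypothetical helpers for illustration, not part of n64video.c.
// nib is a 4-bit texel value.

// IA31: repeat the 3-bit intensity to fill 8 bits (iii iii ii).
// Equivalent to the 0xe0 / 0xe0 >> 3 / 0xc0 >> 6 masks in the original.
static uint32_t expand_ia31_intensity(uint32_t nib)
{
    uint32_t i3 = (nib >> 1) & 0x7;   // drop the alpha bit
    return (i3 << 5) | (i3 << 2) | (i3 >> 1);
}

// I4: repeat the nibble twice to fill 8 bits.
static uint32_t expand_i4(uint32_t nib)
{
    nib &= 0xf;
    return (nib << 4) | nib;
}
```

Replicating the high bits into the low bits, rather than padding with zeros, means the maximum input maps to 0xff and the minimum to 0x00.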

void fetch_qword_copy(uint32_t* hidword, uint32_t* lowdword, int32_t ssss, int32_t ssst, uint32_t tilenum)
{
    uint32_t shorta, shortb, shortc, shortd;
    uint32_t sortshort[8];
    int hibits[6];
    int lowbits[6];
    int32_t sss = ssss, sst = ssst, sss1 = 0, sss2 = 0, sss3 = 0;
    int largetex = 0;

    uint32_t tformat, tsize;
    if (other_modes.en_tlut)
    {
        tsize = PIXEL_SIZE_16BIT;
        tformat = other_modes.tlut_type ? FORMAT_IA : FORMAT_RGBA;
    }
    else
    {
        tsize = tile[tilenum].size;
        tformat = tile[tilenum].format;
    }

    // apply shift and mask to coords
    tc_pipeline_copy(&sss, &sss1, &sss2, &sss3, &sst, tilenum);
    read_tmem_copy(sss, sss1, sss2, sss3, sst, tilenum, sortshort, hibits, lowbits);
    largetex = (tformat == FORMAT_YUV || (tformat == FORMAT_RGBA && tsize == PIXEL_SIZE_32BIT));

    if (other_modes.en_tlut)
    {
        // if TLUT enabled use
        // values from high TMEM
        shorta = sortshort[4];
        shortb = sortshort[5];
        shortc = sortshort[6];
        shortd = sortshort[7];
    }
    else if (largetex)
    {
        // if format splits high/low TMEM
        // (YUV/RGBA32) only use low half
        shorta = sortshort[0];
        shortb = sortshort[1];
        shortc = sortshort[2];
        shortd = sortshort[3];
    }
    else
    {
        // otherwise use high or low TMEM
        // depending on texel coords
        shorta = hibits[0] ? sortshort[4] : sortshort[0];
        shortb = hibits[1] ? sortshort[5] : sortshort[1];
        shortc = hibits[3] ? sortshort[6] : sortshort[2];
        shortd = hibits[4] ? sortshort[7] : sortshort[3];
    }

    *lowdword = (shortc << 16) | shortd;

    if (tsize == PIXEL_SIZE_16BIT)
        *hidword = (shorta << 16) | shortb;
    else
    {
        replicate_for_copy(&shorta, shorta, lowbits[0] & 3, tilenum, tformat, tsize);
        replicate_for_copy(&shortb, shortb, lowbits[1] & 3, tilenum, tformat, tsize);
        replicate_for_copy(&shortc, shortc, lowbits[3] & 3, tilenum, tformat, tsize);
        replicate_for_copy(&shortd, shortd, lowbits[4] & 3, tilenum, tformat, tsize);
        *hidword = (shorta << 24) | (shortb << 16) | (shortc << 8) | shortd;
    }
}

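The RDP texture pipeline samples one texture per cycle. Using the tile and coordinates provided, it reads four adjacent texels from TMEM and blends them into an output color with an unusual three-texel filter: depending on which half of the 2x2 footprint the sample falls in, only the three nearest texels are interpolated.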

static inline void texture_pipeline_cycle(COLOR* TEX, COLOR* prev, int32_t SSS, int32_t SST, uint32_t tilenum, uint32_t cycle)
{
#define TRELATIVE(x, y)     ((x) - ((y) << 3))

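// nonzero when sfrac + tfrac >= 0x20, i.e. the sample lies
// in the bottom right half of the 2x2 texel footprint
#define UPPER ((sfrac + tfrac) & 0x20)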

    int32_t maxs, maxt, invt0r, invt0g, invt0b, invt0a;
    int32_t sfrac, tfrac, invsf, invtf;
    int upper = 0;
    int bilerp = cycle ? other_modes.bi_lerp1 : other_modes.bi_lerp0;
    int convert = other_modes.convert_one && cycle;
    COLOR t0, t1, t2, t3;
    int sss1, sst1, sss2, sst2;

    sss1 = SSS;
    sst1 = SST;

    // apply shift to S/T
    tcshift_cycle(&sss1, &sst1, &maxs, &maxt, tilenum);

    // get s.11.5 offset of s.11.5 S/T from 10.2
    // SL/TL of upper left corner of texture
    sss1 = TRELATIVE(sss1, tile[tilenum].sl);
    sst1 = TRELATIVE(sst1, tile[tilenum].tl);

    if (other_modes.sample_type)
    {
        sfrac = sss1 & 0x1f;
        tfrac = sst1 & 0x1f;

        // clamp to 11-bit unsigned to
        // get rounded-down integer S/T
        tcclamp_cycle(&sss1, &sst1, &sfrac, &tfrac, maxs, maxt, tilenum);

        // get rounded-up integer S/T
        if (tile[tilenum].format != FORMAT_YUV)
            sss2 = sss1 + 1;
        else
            sss2 = sss1 + 2;

        sst2 = sst1 + 1;

        // mask both rounded-up and rounded-down
        // integer S/T to (at most) 10-bit coords
        tcmask_coupled(&sss1, &sss2, &sst1, &sst2, tilenum);

        if (bilerp)
        {
            // sample 4 texels at S/T coords
            if (!other_modes.en_tlut)
                fetch_texel_quadro(&t0, &t1, &t2, &t3, sss1, sss2, sst1, sst2, tilenum);
            else
                fetch_texel_entlut_quadro(&t0, &t1, &t2, &t3, sss1, sss2, sst1, sst2, tilenum);

            // use the 3-texel filter unless mid_texel is set
            // and the fractional coords are exactly centered
            if (!other_modes.mid_texel || sfrac != 0x10 || tfrac != 0x10)
            {
                if (!convert)
                {
                    if (UPPER)
                    {
                        // interpolate 3 bottom right texels
                        // to get signed 9-bit output color
                        invsf = 0x20 - sfrac;
                        invtf = 0x20 - tfrac;
                        TEX->r = t3.r + ((((invsf * (t2.r - t3.r)) + (invtf * (t1.r - t3.r))) + 0x10) >> 5);
                        TEX->g = t3.g + ((((invsf * (t2.g - t3.g)) + (invtf * (t1.g - t3.g))) + 0x10) >> 5);
                        TEX->b = t3.b + ((((invsf * (t2.b - t3.b)) + (invtf * (t1.b - t3.b))) + 0x10) >> 5);
                        TEX->a = t3.a + ((((invsf * (t2.a - t3.a)) + (invtf * (t1.a - t3.a))) + 0x10) >> 5);
                    }
                    else
                    {
                        // interpolate 3 top left texels
                        TEX->r = t0.r + ((((sfrac * (t1.r - t0.r)) + (tfrac * (t2.r - t0.r))) + 0x10) >> 5);
                        TEX->g = t0.g + ((((sfrac * (t1.g - t0.g)) + (tfrac * (t2.g - t0.g))) + 0x10) >> 5);
                        TEX->b = t0.b + ((((sfrac * (t1.b - t0.b)) + (tfrac * (t2.b - t0.b))) + 0x10) >> 5);
                        TEX->a = t0.a + ((((sfrac * (t1.a - t0.a)) + (tfrac * (t2.a - t0.a))) + 0x10) >> 5);
                    }
                }
                else
                {
                    // convert_one in cycle 1: the previous cycle's R/G
                    // act as weights and its B as the base value
                    if (UPPER)
                    {
                        TEX->r = prev->b + ((((prev->r * (t2.r - t3.r)) + (prev->g * (t1.r - t3.r))) + 0x80) >> 8);
                        TEX->g = prev->b + ((((prev->r * (t2.g - t3.g)) + (prev->g * (t1.g - t3.g))) + 0x80) >> 8);
                        TEX->b = prev->b + ((((prev->r * (t2.b - t3.b)) + (prev->g * (t1.b - t3.b))) + 0x80) >> 8);
                        TEX->a = prev->b + ((((prev->r * (t2.a - t3.a)) + (prev->g * (t1.a - t3.a))) + 0x80) >> 8);
                    }
                    else
                    {
                        TEX->r = prev->b + ((((prev->r * (t1.r - t0.r)) + (prev->g * (t2.r - t0.r))) + 0x80) >> 8);
                        TEX->g = prev->b + ((((prev->r * (t1.g - t0.g)) + (prev->g * (t2.g - t0.g))) + 0x80) >> 8);
                        TEX->b = prev->b + ((((prev->r * (t1.b - t0.b)) + (prev->g * (t2.b - t0.b))) + 0x80) >> 8);
                        TEX->a = prev->b + ((((prev->r * (t1.a - t0.a)) + (prev->g * (t2.a - t0.a))) + 0x80) >> 8);
                    }
                }

            }
            else
            {
                invt0r  = ~t0.r; invt0g = ~t0.g; invt0b = ~t0.b; invt0a = ~t0.a;
                if (!convert)
                {
                    // box filter if fractional coords exactly
                    // centered (average all 4 texel colors)
                    sfrac <<= 2;
                    tfrac <<= 2;
                    TEX->r = t0.r + ((((sfrac * (t1.r - t0.r)) + (tfrac * (t2.r - t0.r))) + ((invt0r + t3.r) << 6) + 0xc0) >> 8);
                    TEX->g = t0.g + ((((sfrac * (t1.g - t0.g)) + (tfrac * (t2.g - t0.g))) + ((invt0g + t3.g) << 6) + 0xc0) >> 8);
                    TEX->b = t0.b + ((((sfrac * (t1.b - t0.b)) + (tfrac * (t2.b - t0.b))) + ((invt0b + t3.b) << 6) + 0xc0) >> 8);
                    TEX->a = t0.a + ((((sfrac * (t1.a - t0.a)) + (tfrac * (t2.a - t0.a))) + ((invt0a + t3.a) << 6) + 0xc0) >> 8);
                }
                else
                {
                    TEX->r = prev->b + ((((prev->r * (t1.r - t0.r)) + (prev->g * (t2.r - t0.r))) + ((invt0r + t3.r) << 6) + 0xc0) >> 8);
                    TEX->g = prev->b + ((((prev->r * (t1.g - t0.g)) + (prev->g * (t2.g - t0.g))) + ((invt0g + t3.g) << 6) + 0xc0) >> 8);
                    TEX->b = prev->b + ((((prev->r * (t1.b - t0.b)) + (prev->g * (t2.b - t0.b))) + ((invt0b + t3.b) << 6) + 0xc0) >> 8);
                    TEX->a = prev->b + ((((prev->r * (t1.a - t0.a)) + (prev->g * (t2.a - t0.a))) + ((invt0a + t3.a) << 6) + 0xc0) >> 8);
                }
            }

        }
        else
        {
            // use previously sampled color if
            // cycle 1 and convert_one set
            if (!other_modes.en_tlut)
                fetch_texel(&t0, sss1, sst1, tilenum);
            else
                fetch_texel_entlut(&t0, sss1, sst1, tilenum);
            if (convert)
                t0 = *prev;

            // perform first part of YUV-to-RGB conversion
            // instead of bilerp (see gDPSetConvert docs)
            TEX->r = t0.b + ((k0_tf * t0.g + 0x80) >> 8);
            TEX->g = t0.b + ((k1_tf * t0.r + k2_tf * t0.g + 0x80) >> 8);
            TEX->b = t0.b + ((k3_tf * t0.r + 0x80) >> 8);
            TEX->a = t0.b;
        }

        TEX->r &= 0x1ff;
        TEX->g &= 0x1ff;
        TEX->b &= 0x1ff;
        TEX->a &= 0x1ff;

    }
    else
    {
        // perform point sample instead of 2x2
        tcclamp_cycle_light(&sss1, &sst1, maxs, maxt, tilenum);

        tcmask(&sss1, &sst1, tilenum);

        if (!other_modes.en_tlut)
            fetch_texel(&t0, sss1, sst1, tilenum);
        else
            fetch_texel_entlut(&t0, sss1, sst1, tilenum);

        if (bilerp)
        {
            if (!convert)
            {
                TEX->r = t0.r & 0x1ff;
                TEX->g = t0.g & 0x1ff;
                TEX->b = t0.b;
                TEX->a = t0.a;
            }
            else
                TEX->r = TEX->g = TEX->b = TEX->a = prev->b;
        }
        else
        {
            if (convert)
                t0 = *prev;
            t0.r = SIGN(t0.r, 9);
            t0.g = SIGN(t0.g, 9);
            t0.b = SIGN(t0.b, 9);

            TEX->r = t0.b + ((k0_tf * t0.g + 0x80) >> 8);
            TEX->g = t0.b + ((k1_tf * t0.r + k2_tf * t0.g + 0x80) >> 8);
            TEX->b = t0.b + ((k3_tf * t0.r + 0x80) >> 8);
            TEX->a = t0.b;
            TEX->r &= 0x1ff;
            TEX->g &= 0x1ff;
            TEX->b &= 0x1ff;
            TEX->a &= 0x1ff;
        }
    }

}
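
The filter arithmetic in texture_pipeline_cycle is easier to see for a single channel. The following is a minimal sketch, not the plugin's code: `filter3` and `box4` are hypothetical names, texels are passed as plain integers, and t0..t3 are assumed to be the top left, top right, bottom left, and bottom right texels, which is what the weights in the original suggest. Like the original, it relies on `>>` being an arithmetic shift for negative values, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

// Hypothetical single-channel helpers, not part of n64video.c.
// sfrac and tfrac are 5-bit fractions in the range 0..31.

// Three-texel filter: pick the triangle containing the sample,
// then interpolate from that triangle's corner texel.
static int32_t filter3(int32_t t0, int32_t t1, int32_t t2, int32_t t3,
                       int32_t sfrac, int32_t tfrac)
{
    int32_t out;
    if ((sfrac + tfrac) & 0x20) {
        // bottom right triangle: start from t3, t0 is ignored
        int32_t invsf = 0x20 - sfrac;
        int32_t invtf = 0x20 - tfrac;
        out = t3 + ((invsf * (t2 - t3) + invtf * (t1 - t3) + 0x10) >> 5);
    } else {
        // top left triangle: start from t0, t3 is ignored
        out = t0 + ((sfrac * (t1 - t0) + tfrac * (t2 - t0) + 0x10) >> 5);
    }
    return out & 0x1ff;   // keep 9 bits, as the pipeline does
}

// mid_texel case (sfrac == tfrac == 0x10). Working through the
// original expression, it appears to reduce to a rounded average.
static int32_t box4(int32_t t0, int32_t t1, int32_t t2, int32_t t3)
{
    return ((t0 + t1 + t2 + t3 + 2) >> 2) & 0x1ff;
}
```

Because only three texels contribute in either branch, the result has a visible seam along the diagonal of each texel quad when the fourth texel differs sharply; the mid_texel path is the one case where all four are used.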

static inline void tc_pipeline_copy(int32_t* sss0, int32_t* sss1, int32_t* sss2, int32_t* sss3, int32_t* sst, int tilenum)
{
    int ss0 = *sss0, ss1 = 0, ss2 = 0, ss3 = 0, st = *sst;

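    // apply tile shift to S/T
    tcshift_copy(&ss0, &st, tilenum);

    // make coords relative to tile origin,
    // then drop the 5 fractional bits
    ss0 = TRELATIVE(ss0, tile[tilenum].sl);
    st = TRELATIVE(st, tile[tilenum].tl);
    ss0 = (ss0 >> 5);
    st = (st >> 5);

    // copy mode reads 4 consecutive texels along S
    ss1 = ss0 + 1;
    ss2 = ss0 + 2;
    ss3 = ss0 + 3;

    // apply mask to all four S coords and to T
    tcmask_copy(&ss0, &ss1, &ss2, &ss3, &st, tilenum);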

    *sss0 = ss0;
    *sss1 = ss1;
    *sss2 = ss2;
    *sss3 = ss3;
    *sst = st;
}

static inline void tc_pipeline_load(int32_t* sss, int32_t* sst, int tilenum, int coord_quad)
{
    int sss1 = *sss, sst1 = *sst;
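    // sign-extend 16-bit S/T
    sss1 = SIGN16(sss1);
    sst1 = SIGN16(sst1);

    // make coords relative to tile origin
    sss1 = TRELATIVE(sss1, tile[tilenum].sl);
    sst1 = TRELATIVE(sst1, tile[tilenum].tl);

    if (!coord_quad)
    {
        // drop all 5 fractional bits
        sss1 = (sss1 >> 5);
        sst1 = (sst1 >> 5);
    }
    else
    {
        // keep 2 fractional bits
        sss1 = (sss1 >> 3);
        sst1 = (sst1 >> 3);
    }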

    *sss = sss1;
    *sst = sst1;
}
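
The fixed-point bookkeeping in tc_pipeline_copy and tc_pipeline_load comes down to aligning two formats: texture coordinates with 5 fractional bits and tile origins (SL/TL) with 2 fractional bits. The sketch below illustrates this with hypothetical helpers (`tile_relative`, `load_coord`); it omits the SIGN16 step and assumes the inputs are already sign-extended.

```c
#include <stdint.h>

// Hypothetical helpers for illustration, not part of n64video.c.

// coord has 5 fractional bits, tile_origin (SL or TL) has 2,
// so the origin is shifted left by 3 before subtracting.
static int32_t tile_relative(int32_t coord, int32_t tile_origin)
{
    return coord - (tile_origin << 3);
}

// Mirrors the two branches of tc_pipeline_load: either an integer
// texel coordinate, or a result that keeps 2 fractional bits.
static int32_t load_coord(int32_t coord, int32_t tile_origin, int coord_quad)
{
    int32_t rel = tile_relative(coord, tile_origin);
    return coord_quad ? (rel >> 3) : (rel >> 5);
}
```

For example, S = 10.5 texels is (10 << 5) + 16 = 336 and SL = 2.0 texels is 2 << 2 = 8. The relative offset is 336 - 64 = 272, which becomes texel 8 with coord_quad clear, or 34 (8.5 with 2 fractional bits) with it set.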

1 Cycle and 2 Cycle Modes

After determining which pixels a primitive affects, Angrylion sends them through the complete shading pipeline one span (scanline) at a time. First, accurate coverage values are calculated from the subpixel bounds computed earlier. Attributes such as shade color, depth, and texture coordinates are then interpolated across the span by adding a fixed increment at each pixel.

void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    SPANSIGS sigs;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;

    int prim_tile = tilenum;
    int tile1 = tilenum;
    int newtile = tilenum;
    int news, newt;

    int i, j;

    int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;

    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        // use dz calculated from DzDx/DzDy
        dzpix = spans_dzpix;
    else
    {
        // use primitive dz, disable
        // primitive z interpolation
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z, s, t, w;
    int sr, sg, sb, sa, sz, ss, st, sw;
    int xstart, xend, xendsc;
    int sss = 0, sst = 0;
    int32_t prelodfrac;
    int curpixel = 0;
    int x, length, scdiff, lodlength;
    uint32_t fir, fig, fib;

    // iterate over scanlines within
    // scissored scanline bounds
    for (i = start; i <= end; i++)
    {
        // skip invalid scanlines
        if (span[i].validline)
        {

        // get precalculated values for scanline
        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;
        s = span[i].s;
        t = span[i].t;
        w = span[i].w;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        // use per-subscanline subpixel bounds to
        // compute coverage masks in cvgbuf[lx, rx]
        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        // interpolate across difference between
        // unscissored xright and render bound
        if (scdiff)
        {
            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
            s += (dsinc * scdiff);
            t += (dtinc * scdiff);
            w += (dwinc * scdiff);
        }

        // include unrendered but interpolated
        // pixels in lodlength calculation
        lodlength = length + scdiff;

        sigs.longspan = (lodlength > 7);
        sigs.midspan = (lodlength == 7);
        sigs.onelessthanmid = (lodlength == 6);

        sigs.startspan = 1;

        // iterate over pixels in scanline
        for (j = 0; j <= length; j++)
        {
            // truncate depth to 22 bits (s.18.3)
            // S/T/W coords to 16 bits (s.15)
            // R/G/B/A values to 13 bits (s.10.2)
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            ss = s >> 16;
            st = t >> 16;
            sw = w >> 16;
            sz = (z >> 10) & 0x3fffff;

            sigs.endspan = (j == length);
            sigs.preendspan = (j == (length - 1));

            // get total coverage, etc. from coverage mask
            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            // get perspective-corrected coordinates for texel1
            get_texel1_1cycle(&news, &newt, s, t, w, dsinc, dtinc, dwinc, i, &sigs);

            // check if not first pixel in scanline
            if (!sigs.startspan)
            {
                // use texel1 color from last pixel
                // as texel0 color for this pixel
                texel0_color = texel1_color;
                lod_frac = prelodfrac;
            }
            else
            {
                tcdiv_ptr(ss, st, sw, &sss, &sst);

                // compute lod of texel0 (for mipmapping)
                tclod_1cycle_current(&sss, &sst, news, newt, s, t, w, dsinc, dtinc, dwinc, i, prim_tile, &tile1, &sigs);
                // get texel0 color for first pixel
                texture_pipeline_cycle(&texel0_color, &texel0_color, sss, sst, tile1, 0);

                sigs.startspan = 0;
            }

            sigs.nextspan = sigs.endspan;
            sigs.endspan = sigs.preendspan;
            sigs.preendspan = (j == (length - 2));

            s += dsinc;
            t += dtinc;
            w += dwinc;

            // compute lod of texel1
            tclod_1cycle_next(&news, &newt, s, t, w, dsinc, dtinc, dwinc, i, prim_tile, &newtile, &sigs, &prelodfrac);
            // get texel1 color for this pixel
            texture_pipeline_cycle(&texel1_color, &texel1_color, news, newt, newtile, 0);
            // correct shade of partially covered pixels
            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);
            // get color and alpha dither for pixel
            get_dither_noise_ptr(x, i, &cdith, &adith);
            // get new pixel RGBA color, coverage from color combiner
            combiner_1cycle(adith, &curpixel_cvg);
            // check current depth against Z buffer
            fbread1_ptr(curpixel, &curpixel_memcvg);
            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                // blender checks alpha and coverage against threshold
                if (blender_1cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit))
                {
                    // write new color/depth to framebuffer
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);
                }
            }

            // increment attributes at each new pixel
            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}
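
The per-pixel attribute stepping in the loop above can be reduced to a few lines for a single attribute. This is a sketch under simplifying assumptions: `step_attr` is a hypothetical name, it handles one color channel only, and it ignores the scissor adjustment (scdiff). It shows how the flip flag selects the sign of the increment and how the accumulator is truncated with `>> 14` before use.

```c
#include <stdint.h>

// Hypothetical helper for illustration, not part of n64video.c.
// start: attribute value at the first rendered pixel of the span
// d:     per-pixel delta from the primitive setup (e.g. spans_dr)
// flip:  nonzero if the span is walked left to right
// j:     pixel index within the span (0 = first rendered pixel)
// Returns the truncated value that pixel j would see.
static int32_t step_attr(int32_t start, int32_t d, int flip, int j)
{
    int32_t inc = flip ? d : -d;   // direction of traversal sets the sign
    int32_t v = start;
    for (int k = 0; k < j; k++)
        v += inc;                  // one increment per pixel, as in the loop
    return v >> 14;                // same truncation as sr = r >> 14
}
```

The accumulator keeps its full precision between pixels; only the copy handed to the rest of the pipeline is truncated, so rounding error does not build up along the span.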

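// variant of render_spans_1cycle_complete without the
// texel1 lookahead: texel0 is sampled at every pixel
void render_spans_1cycle_notexel1(int start, int end, int tilenum, int flip)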
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    SPANSIGS sigs;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;

    int prim_tile = tilenum;
    int tile1 = tilenum;

    int i, j;

    int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;
    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        dzpix = spans_dzpix;
    else
    {
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z, s, t, w;
    int sr, sg, sb, sa, sz, ss, st, sw;
    int xstart, xend, xendsc;
    int sss = 0, sst = 0;
    int curpixel = 0;
    int x, length, scdiff, lodlength;
    uint32_t fir, fig, fib;

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;
        s = span[i].s;
        t = span[i].t;
        w = span[i].w;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        if (scdiff)
        {

            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
            s += (dsinc * scdiff);
            t += (dtinc * scdiff);
            w += (dwinc * scdiff);
        }

        lodlength = length + scdiff;

        sigs.longspan = (lodlength > 7);
        sigs.midspan = (lodlength == 7);

        for (j = 0; j <= length; j++)
        {
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            ss = s >> 16;
            st = t >> 16;
            sw = w >> 16;
            sz = (z >> 10) & 0x3fffff;

            sigs.endspan = (j == length);
            sigs.preendspan = (j == (length - 1));

            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            tcdiv_ptr(ss, st, sw, &sss, &sst);

            tclod_1cycle_current_simple(&sss, &sst, s, t, w, dsinc, dtinc, dwinc, i, prim_tile, &tile1, &sigs);

            texture_pipeline_cycle(&texel0_color, &texel0_color, sss, sst, tile1, 0);

            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);

            get_dither_noise_ptr(x, i, &cdith, &adith);
            combiner_1cycle(adith, &curpixel_cvg);

            fbread1_ptr(curpixel, &curpixel_memcvg);
            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                if (blender_1cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit))
                {
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);
                }
            }

            s += dsinc;
            t += dtinc;
            w += dwinc;
            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}

// variant with no texture sampling: only shade
// color, coverage, and depth are processed
void render_spans_1cycle_notex(int start, int end, int tilenum, int flip)
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;

    int i, j;

    int drinc, dginc, dbinc, dainc, dzinc;
    int xinc;

    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        dzpix = spans_dzpix;
    else
    {
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z;
    int sr, sg, sb, sa, sz;
    int xstart, xend, xendsc;
    int curpixel = 0;
    int x, length, scdiff;
    uint32_t fir, fig, fib;

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        if (scdiff)
        {

            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
        }

        for (j = 0; j <= length; j++)
        {
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            sz = (z >> 10) & 0x3fffff;

            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);

            get_dither_noise_ptr(x, i, &cdith, &adith);
            combiner_1cycle(adith, &curpixel_cvg);

            fbread1_ptr(curpixel, &curpixel_memcvg);
            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                if (blender_1cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit))
                {
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);
                }
            }
            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}

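// 2-cycle version: samples tilenum and tilenum + 1
// for both the current pixel and the next pixel
void render_spans_2cycle_complete(int start, int end, int tilenum, int flip)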
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    SPANSIGS sigs;
    int32_t prelodfrac;
    COLOR nexttexel1_color;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;
    int32_t acalpha;

    int tile2 = (tilenum + 1) & 7;
    int tile1 = tilenum;
    int prim_tile = tilenum;

    int newtile1 = tile1;
    int newtile2 = tile2;
    int news, newt;

    int i, j;

    int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;
    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        dzpix = spans_dzpix;
    else
    {
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z, s, t, w;
    int sr, sg, sb, sa, sz, ss, st, sw;
    int xstart, xend, xendsc;
    int sss = 0, sst = 0;
    int curpixel = 0;

    int x, length, scdiff;
    uint32_t fir, fig, fib;

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;
        s = span[i].s;
        t = span[i].t;
        w = span[i].w;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        if (scdiff)
        {

            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
            s += (dsinc * scdiff);
            t += (dtinc * scdiff);
            w += (dwinc * scdiff);
        }
        sigs.startspan = 1;

        for (j = 0; j <= length; j++)
        {
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            ss = s >> 16;
            st = t >> 16;
            sw = w >> 16;
            sz = (z >> 10) & 0x3fffff;

            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            get_nexttexel0_2cycle(&news, &newt, s, t, w, dsinc, dtinc, dwinc);

            if (!sigs.startspan)
            {
                lod_frac = prelodfrac;
                texel0_color = nexttexel_color;
                texel1_color = nexttexel1_color;
            }
            else
            {
                tcdiv_ptr(ss, st, sw, &sss, &sst);

                tclod_2cycle_current(&sss, &sst, news, newt, s, t, w, dsinc, dtinc, dwinc, prim_tile, &tile1, &tile2);

                texture_pipeline_cycle(&texel0_color, &texel0_color, sss, sst, tile1, 0);
                texture_pipeline_cycle(&texel1_color, &texel0_color, sss, sst, tile2, 1);

                sigs.startspan = 0;
            }

            s += dsinc;
            t += dtinc;
            w += dwinc;

            tclod_2cycle_next(&news, &newt, s, t, w, dsinc, dtinc, dwinc, prim_tile, &newtile1, &newtile2, &prelodfrac);

            texture_pipeline_cycle(&nexttexel_color, &nexttexel_color, news, newt, newtile1, 0);
            texture_pipeline_cycle(&nexttexel1_color, &nexttexel_color, news, newt, newtile2, 1);

            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);

            get_dither_noise_ptr(x, i, &cdith, &adith);
            combiner_2cycle(adith, &curpixel_cvg, &acalpha);

            fbread2_ptr(curpixel, &curpixel_memcvg);

            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                if (blender_2cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit, acalpha))
                {
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);

                }
            }

            else
                memory_color = pre_memory_color;

            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}

void render_spans_2cycle_notexelnext(int start, int end, int tilenum, int flip)
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;
    int32_t acalpha;

    int tile2 = (tilenum + 1) & 7;
    int tile1 = tilenum;
    int prim_tile = tilenum;

    int i, j;

    int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;
    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        dzpix = spans_dzpix;
    else
    {
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z, s, t, w;
    int sr, sg, sb, sa, sz, ss, st, sw;
    int xstart, xend, xendsc;
    int sss = 0, sst = 0;
    int curpixel = 0;

    int x, length, scdiff;
    uint32_t fir, fig, fib;

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;
        s = span[i].s;
        t = span[i].t;
        w = span[i].w;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        if (scdiff)
        {

            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
            s += (dsinc * scdiff);
            t += (dtinc * scdiff);
            w += (dwinc * scdiff);
        }

        for (j = 0; j <= length; j++)
        {
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            ss = s >> 16;
            st = t >> 16;
            sw = w >> 16;
            sz = (z >> 10) & 0x3fffff;

            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            tcdiv_ptr(ss, st, sw, &sss, &sst);

            tclod_2cycle_current_simple(&sss, &sst, s, t, w, dsinc, dtinc, dwinc, prim_tile, &tile1, &tile2);

            texture_pipeline_cycle(&texel0_color, &texel0_color, sss, sst, tile1, 0);
            texture_pipeline_cycle(&texel1_color, &texel0_color, sss, sst, tile2, 1);

            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);

            get_dither_noise_ptr(x, i, &cdith, &adith);
            combiner_2cycle(adith, &curpixel_cvg, &acalpha);

            fbread2_ptr(curpixel, &curpixel_memcvg);

            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                if (blender_2cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit, acalpha))
                {
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);
                }
            }
            else
                memory_color = pre_memory_color;

            s += dsinc;
            t += dtinc;
            w += dwinc;
            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}

void render_spans_2cycle_notexel1(int start, int end, int tilenum, int flip)
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;
    int32_t acalpha;

    int tile1 = tilenum;
    int prim_tile = tilenum;

    int i, j;

    int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;
    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        dzpix = spans_dzpix;
    else
    {
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z, s, t, w;
    int sr, sg, sb, sa, sz, ss, st, sw;
    int xstart, xend, xendsc;
    int sss = 0, sst = 0;
    int curpixel = 0;

    int x, length, scdiff;
    uint32_t fir, fig, fib;

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;
        s = span[i].s;
        t = span[i].t;
        w = span[i].w;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        if (scdiff)
        {

            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
            s += (dsinc * scdiff);
            t += (dtinc * scdiff);
            w += (dwinc * scdiff);
        }

        for (j = 0; j <= length; j++)
        {
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            ss = s >> 16;
            st = t >> 16;
            sw = w >> 16;
            sz = (z >> 10) & 0x3fffff;

            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            tcdiv_ptr(ss, st, sw, &sss, &sst);

            tclod_2cycle_current_notexel1(&sss, &sst, s, t, w, dsinc, dtinc, dwinc, prim_tile, &tile1);

            texture_pipeline_cycle(&texel0_color, &texel0_color, sss, sst, tile1, 0);

            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);

            get_dither_noise_ptr(x, i, &cdith, &adith);
            combiner_2cycle(adith, &curpixel_cvg, &acalpha);

            fbread2_ptr(curpixel, &curpixel_memcvg);

            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                if (blender_2cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit, acalpha))
                {
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);
                }
            }
            else
                memory_color = pre_memory_color;

            s += dsinc;
            t += dtinc;
            w += dwinc;
            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}

void render_spans_2cycle_notex(int start, int end, int tilenum, int flip)
{
    int zb = zb_address >> 1;
    int zbcur;
    uint8_t offx, offy;
    int i, j;
    uint32_t blend_en;
    uint32_t prewrap;
    uint32_t curpixel_cvg, curpixel_cvbit, curpixel_memcvg;
    int32_t acalpha;

    int drinc, dginc, dbinc, dainc, dzinc;
    int xinc;
    if (flip)
    {
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        xinc = 1;
    }
    else
    {
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        xinc = -1;
    }

    int dzpix;
    if (!other_modes.z_source_sel)
        dzpix = spans_dzpix;
    else
    {
        dzpix = primitive_delta_z;
        dzinc = spans_cdz = spans_dzdy = 0;
    }
    int dzpixenc = dz_compress(dzpix);

    int cdith = 7, adith = 0;
    int r, g, b, a, z;
    int sr, sg, sb, sa, sz;
    int xstart, xend, xendsc;
    int curpixel = 0;

    int x, length, scdiff;
    uint32_t fir, fig, fib;

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        r = span[i].r;
        g = span[i].g;
        b = span[i].b;
        a = span[i].a;
        z = other_modes.z_source_sel ? primitive_z : span[i].z;

        x = xendsc;
        curpixel = fb_width * i + x;
        zbcur = zb + curpixel;

        if (!flip)
        {
            length = xendsc - xstart;
            scdiff = xend - xendsc;
            compute_cvg_noflip(i);
        }
        else
        {
            length = xstart - xendsc;
            scdiff = xendsc - xend;
            compute_cvg_flip(i);
        }

        if (scdiff)
        {

            scdiff &= 0xfff;
            r += (drinc * scdiff);
            g += (dginc * scdiff);
            b += (dbinc * scdiff);
            a += (dainc * scdiff);
            z += (dzinc * scdiff);
        }

        for (j = 0; j <= length; j++)
        {
            sr = r >> 14;
            sg = g >> 14;
            sb = b >> 14;
            sa = a >> 14;
            sz = (z >> 10) & 0x3fffff;

            lookup_cvmask_derivatives(cvgbuf[x], &offx, &offy, &curpixel_cvg, &curpixel_cvbit);

            rgbaz_correct_clip(offx, offy, sr, sg, sb, sa, &sz, curpixel_cvg);

            get_dither_noise_ptr(x, i, &cdith, &adith);
            combiner_2cycle(adith, &curpixel_cvg, &acalpha);

            fbread2_ptr(curpixel, &curpixel_memcvg);

            if (z_compare(zbcur, sz, dzpix, dzpixenc, &blend_en, &prewrap, &curpixel_cvg, curpixel_memcvg))
            {
                if (blender_2cycle(&fir, &fig, &fib, cdith, blend_en, prewrap, curpixel_cvg, curpixel_cvbit, acalpha))
                {
                    fbwrite_ptr(curpixel, fir, fig, fib, blend_en, curpixel_cvg, curpixel_memcvg);
                    if (other_modes.z_update_en)
                        z_store(zbcur, sz, dzpixenc);
                }
            }
            else
                memory_color = pre_memory_color;

            r += drinc;
            g += dginc;
            b += dbinc;
            a += dainc;
            z += dzinc;

            x += xinc;
            curpixel += xinc;
            zbcur += xinc;
        }
        }
    }
}
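
The 2-cycle span loops above all begin the same way: when the scissored end of a span (span[i].rx) differs from the unscissored end (span[i].unscrx), each attribute is advanced by that difference before the first pixel is shaded. Below is a minimal sketch of that pre-step for a single attribute. The helper name `prestep_attribute` is hypothetical, and reading the difference as "pixels skipped because of the scissor" is an interpretation of the code, not something the source states.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the `if (scdiff)` block in the span loops.
 * attr:   attribute value at the unscissored span end
 * dattr:  per-pixel delta along +x (e.g. spans_dr)
 * xend:   unscissored span end (span[i].unscrx)
 * xendsc: scissored span end (span[i].rx), where rendering starts
 * flip:   nonzero when the span is walked in the +x direction */
static int32_t prestep_attribute(int32_t attr, int32_t dattr,
                                 int xend, int xendsc, int flip)
{
    int32_t dinc = flip ? dattr : -dattr;
    int scdiff = flip ? (xendsc - xend) : (xend - xendsc);

    if (scdiff)
    {
        scdiff &= 0xfff;        /* same 12-bit mask as the original code */
        attr += dinc * scdiff;  /* jump over the skipped pixels at once */
    }
    return attr;
}
```

For a non-flipped span the loop walks toward -x, so skipping three pixels moves the attribute by three negated deltas.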

Fill Mode

Fill mode bypasses most of the RDP hardware, allowing it to fill pixels with a solid color as fast as possible. On each scanline, all pixels in [lx, rx] are completely overwritten using the fill color register. No color format conversion is performed, and coverage is also overwritten. If the framebuffer format is narrower than 32bpp, the pixel address in memory determines which part of the fill color register is used: in an 8bpp framebuffer, for example, if ((pixel_addr & 3) == 0) the top 8 bits of the register are used. The 2 hidden bits duplicate the LSB of each 16-bit word in fill mode, just like they do on writes from the CPU and other components.
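
The address-dependent selection can be sketched as follows. This is an illustration only: the function names are hypothetical, the 16bpp case and the byte-lane ordering are extrapolated from the 8bpp example above (big-endian, lane 0 is the top byte), and the real work happens inside the fbfill_ptr implementations.

```c
#include <stdint.h>

/* 8bpp: the low two address bits pick a byte lane; lane 0 is the top byte. */
static uint8_t fill_value_8(uint32_t fill_color, uint32_t byte_addr)
{
    unsigned shift = (3u - (byte_addr & 3u)) * 8u;
    return (uint8_t)(fill_color >> shift);
}

/* 16bpp (assumed): even half-word addresses take the upper 16 bits,
 * odd half-word addresses take the lower 16 bits. */
static uint16_t fill_value_16(uint32_t fill_color, uint32_t halfword_addr)
{
    return (halfword_addr & 1u) ? (uint16_t)(fill_color & 0xffffu)
                                : (uint16_t)(fill_color >> 16);
}

/* Both hidden bits copy the LSB of the 16-bit word being written. */
static uint8_t hidden_bits_16(uint16_t word)
{
    return (word & 1u) ? 3u : 0u;
}
```

With a fill color of 0x11223344, an 8bpp framebuffer would therefore receive the repeating byte pattern 11 22 33 44 across consecutive addresses.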

void render_spans_fill(int start, int end, int flip)
{
    if (unlikely(fb_size == PIXEL_SIZE_4BIT))
    {
        rdp_pipeline_crashed = 1;
        return;
    }

    int i, j;

    int fastkillbits = other_modes.image_read_en || other_modes.z_compare_en;
    int slowkillbits = other_modes.z_update_en && !other_modes.z_source_sel && !fastkillbits;

    int xinc = flip ? 1 : -1;

    int xstart = 0, xendsc;
    int prevxstart;
    int curpixel = 0;
    int x, length;

    for (i = start; i <= end; i++)
    {
        prevxstart = xstart;
        xstart = span[i].lx;
        xendsc = span[i].rx;

        x = xendsc;
        curpixel = fb_width * i + x;
        length = flip ? (xstart - xendsc) : (xendsc - xstart);

        if (span[i].validline)
        {
            if (unlikely(fastkillbits && length >= 0))
            {
                if (!onetimewarnings.fillmbitcrashes)
                    debug("render_spans_fill: image_read_en %x z_update_en %x z_compare_en %x. RDP crashed",
                    other_modes.image_read_en, other_modes.z_update_en, other_modes.z_compare_en);
                onetimewarnings.fillmbitcrashes = 1;
                rdp_pipeline_crashed = 1;
                return;
            }

            for (j = 0; j <= length; j++)
            {
                fbfill_ptr(curpixel);
                x += xinc;
                curpixel += xinc;
            }

            if (unlikely(slowkillbits && length >= 0))
            {
                if (!onetimewarnings.fillmbitcrashes)
                    debug("render_spans_fill: image_read_en %x z_update_en %x z_compare_en %x z_source_sel %x. RDP crashed",
                    other_modes.image_read_en, other_modes.z_update_en, other_modes.z_compare_en, other_modes.z_source_sel);
                onetimewarnings.fillmbitcrashes = 1;
                rdp_pipeline_crashed = 1;
                return;
            }
        }
    }
}

Copy Mode

Copy mode also bypasses most of the RDP hardware, but keeps the texture coordinates and alpha comparison to allow fast blitting of textures to the screen.
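
As an illustration of the alpha comparison in copy mode, here is a minimal sketch of how the byte write mask is built for a 16bpp framebuffer when alpha compare is enabled. It restates the four alphamask lines of the function below as a loop; the name `copy_alpha_mask_16` is hypothetical. Each of the four 16-bit texels in the 64-bit word contributes its lowest bit (the alpha bit, assuming RGBA5551), and that bit enables or disables both bytes of the texel.

```c
#include <stdint.h>

/* Build an 8-bit byte-enable mask from four 16-bit texels packed into
 * a 64-bit word, most significant texel first. Bit 7 of the result
 * gates the most significant byte of the word. */
static int copy_alpha_mask_16(uint64_t copyqword)
{
    int mask = 0;
    for (int texel = 0; texel < 4; texel++)
    {
        /* alpha is the LSB of each 16-bit texel */
        int alpha = (int)((copyqword >> (48 - 16 * texel)) & 1);
        if (alpha)
            mask |= 0xC0 >> (2 * texel);  /* two bytes per texel */
    }
    return mask;
}
```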

void render_spans_copy(int start, int end, int tilenum, int flip)
{
    int i, j, k;

    if (unlikely(fb_size == PIXEL_SIZE_32BIT))
    {
        rdp_pipeline_crashed = 1;
        return;
    }

    int tile1 = tilenum;
    int prim_tile = tilenum;

    int dsinc, dtinc, dwinc;
    int xinc;
    if (flip)
    {
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        xinc = 1;
    }
    else
    {
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        xinc = -1;
    }

    int xstart = 0, xendsc;
    int s = 0, t = 0, w = 0, ss = 0, st = 0, sw = 0, sss = 0, sst = 0, ssw = 0;
    int fb_index, length;
    int diff = 0;

    uint32_t hidword = 0, lowdword = 0;
    uint32_t hidword1 = 0, lowdword1 = 0;
    int fbadvance = (fb_size == PIXEL_SIZE_4BIT) ? 8 : 16 >> fb_size;
    uint32_t fbptr = 0;
    int fbptr_advance = flip ? 8 : -8;
    uint64_t copyqword = 0;
    uint32_t tempdword = 0, tempbyte = 0;
    int copywmask = 0, alphamask = 0;
    int bytesperpixel = (fb_size == PIXEL_SIZE_4BIT) ? 1 : (1 << (fb_size - 1));
    uint32_t fbendptr = 0;
    int32_t threshold, currthreshold;

#define PIXELS_TO_BYTES_SPECIAL4(pix, siz) ((siz) ? PIXELS_TO_BYTES(pix, siz) : (pix))

    for (i = start; i <= end; i++)
    {
        if (span[i].validline)
        {

        s = span[i].s;
        t = span[i].t;
        w = span[i].w;

        xstart = span[i].lx;
        xendsc = span[i].rx;

        fb_index = fb_width * i + xendsc;
        fbptr = fb_address + PIXELS_TO_BYTES_SPECIAL4(fb_index, fb_size);
        fbendptr = fb_address + PIXELS_TO_BYTES_SPECIAL4((fb_width * i + xstart), fb_size);
        length = flip ? (xstart - xendsc) : (xendsc - xstart);

        for (j = 0; j <= length; j += fbadvance)
        {
            ss = s >> 16;
            st = t >> 16;
            sw = w >> 16;

            // apply perspective correction
            tcdiv_ptr(ss, st, sw, &sss, &sst);

            // compute lod of texel (ignores detail/sharpen)
            tclod_copy(&sss, &sst, s, t, w, dsinc, dtinc, dwinc, prim_tile, &tile1);

            // get texel color for 64 bits worth of pixels
            // (shift/mask applied, no clamp or filtering)
            fetch_qword_copy(&hidword, &lowdword, sss, sst, tile1);

            if (fb_size == PIXEL_SIZE_16BIT || fb_size == PIXEL_SIZE_8BIT)
                copyqword = ((uint64_t)hidword << 32) | ((uint64_t)lowdword);
            else
                copyqword = 0;

            // perform alpha comparison if enabled
            // assumes texture is RGBA5551 or IA88
            // depending on framebuffer format
            if (!other_modes.alpha_compare_en)
                alphamask = 0xff;
            else if (fb_size == PIXEL_SIZE_16BIT)
            {
                alphamask = 0;
                alphamask |= (((copyqword >> 48) & 1) ? 0xC0 : 0);
                alphamask |= (((copyqword >> 32) & 1) ? 0x30 : 0);
                alphamask |= (((copyqword >> 16) & 1) ? 0xC : 0);
                alphamask |= ((copyqword & 1) ? 0x3 : 0);
            }
            else if (fb_size == PIXEL_SIZE_8BIT)
            {
                alphamask = 0;
                threshold = (other_modes.dither_alpha_en) ? (irand() & 0xff) : blend_color.a;
                if (other_modes.dither_alpha_en)
                {
                    // rotate threshold right 2 bits for each IA88 texel
                    currthreshold = threshold;
                    alphamask |= (((copyqword >> 24) & 0xff) >= currthreshold ? 0xC0 : 0);
                    currthreshold = ((threshold & 3) << 6) | (threshold >> 2);
                    alphamask |= (((copyqword >> 16) & 0xff) >= currthreshold ? 0x30 : 0);
                    currthreshold = ((threshold & 0xf) << 4) | (threshold >> 4);
                    alphamask |= (((copyqword >> 8) & 0xff) >= currthreshold ? 0xC : 0);
                    currthreshold = ((threshold & 0x3f) << 2) | (threshold >> 6);
                    alphamask |= ((copyqword & 0xff) >= currthreshold ? 0x3 : 0);
                }
                else
                {
                    alphamask |= (((copyqword >> 24) & 0xff) >= threshold ? 0xC0 : 0);
                    alphamask |= (((copyqword >> 16) & 0xff) >= threshold ? 0x30 : 0);
                    alphamask |= (((copyqword >> 8) & 0xff) >= threshold ? 0xC : 0);
                    alphamask |= ((copyqword & 0xff) >= threshold ? 0x3 : 0);
                }
            }
            else
                alphamask = 0;

            // find number of bytes to write
            copywmask = (flip) ? (fbendptr - fbptr + bytesperpixel) : (fbptr - fbendptr + bytesperpixel);

            // write each byte if alpha comparison passed
            if (copywmask > 8)
                copywmask = 8;
            tempdword = fbptr;
            k = 7;
            while(copywmask > 0)
            {
                tempbyte = (uint32_t)((copyqword >> (k << 3)) & 0xff);
                if (alphamask & (1 << k))
                {
                    PAIRWRITE8(tempdword, tempbyte, (tempbyte & 1) ? 3 : 0);
                }
                k--;
                tempdword += xinc;
                copywmask--;
            }

            s += dsinc;
            t += dtinc;
            w += dwinc;
            fbptr += fbptr_advance;
        }
        }
    }
}
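
The dithered 8bpp path above compares each of the four texels against a differently rotated copy of the same random threshold. The three hand-written rotations can be expressed as one 8-bit rotate right by 2 bits per texel, sketched below. Both function names are hypothetical.

```c
#include <stdint.h>

/* Rotate the low 8 bits of value right by `bits` (taken modulo 8). */
static uint32_t rotate_right_8(uint32_t value, unsigned bits)
{
    value &= 0xffu;
    bits &= 7u;
    return ((value >> bits) | (value << (8u - bits))) & 0xffu;
}

/* Threshold used for texel 0..3 of a copy-mode word when
 * dither_alpha_en is set: rotated right by 2 bits per texel. */
static uint32_t dither_threshold_for_texel(uint32_t threshold, unsigned texel)
{
    return rotate_right_8(threshold, 2u * texel);
}
```

For texel 1 this gives ((threshold & 3) << 6) | (threshold >> 2), matching the expression in the function above.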

Texture Loading

Loading a texture into TMEM from RAM works very similarly to rendering a textured rectangle into RAM using a texture in TMEM, except that rather than writing to the framebuffer specified by SetColorImage, the RDP reads from the texture buffer specified by SetTextureImage. Note that S/T texture coordinates can still be used to index TMEM, so the rectangle read from the texture buffer can write a differently shaped area in TMEM (this allows Load Block, for example, to write multiple rows in TMEM with one row of the texture buffer).
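
To make the Load Block case concrete, here is a rough sketch of how a single linear row of the texture buffer can fold into several TMEM rows. It assumes that T is advanced by a fixed-point increment with 11 fractional bits (usually called dxt) for every 64-bit word transferred, and that the integer part of T selects the TMEM row. The function names are hypothetical, and the sketch ignores the coordinate masking and odd-row swapping done by the real pipeline.

```c
#include <stdint.h>

#define DXT_FRAC_BITS 11

/* Smallest dxt that reaches the next row after words_per_row words
 * (rounded up so accumulated error never leaves T one row short). */
static uint32_t calc_dxt(uint32_t words_per_row)
{
    return ((1u << DXT_FRAC_BITS) + words_per_row - 1u) / words_per_row;
}

/* TMEM row written by the n-th 64-bit word of a block load,
 * with T starting at zero and dxt added after each word. */
static uint32_t tmem_row_for_word(uint32_t word_index, uint32_t dxt)
{
    return (word_index * dxt) >> DXT_FRAC_BITS;
}
```

With four 64-bit words per row, dxt is 512, so words 0 to 3 land in row 0 and words 4 to 7 land in row 1.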

void loading_pipeline(int start, int end, int tilenum, int coord_quad, int ltlut)
{

    int localdebugmode = 0, cnt = 0;
    int i, j;

    int dsinc, dtinc;
    dsinc = spans_ds;
    dtinc = spans_dt;

    int s, t;
    int ss, st;
    int xstart, xend, xendsc;
    int sss = 0, sst = 0;
    int ti_index, length;

    uint32_t tmemidx0 = 0, tmemidx1 = 0, tmemidx2 = 0, tmemidx3 = 0;
    int dswap = 0;
    uint16_t* tmem16 = (uint16_t*)TMEM;
    uint32_t readval0, readval1, readval2, readval3;
    uint32_t readidx32;
    uint64_t loadqword;
    uint16_t tempshort;
    int tmem_formatting = 0;
    uint32_t bit3fl = 0, hibit = 0;

    if (unlikely(end > start && ltlut))
    {
        rdp_pipeline_crashed = 1;
        return;
    }

    if (tile[tilenum].format == FORMAT_YUV)
        tmem_formatting = 0;
    else if (tile[tilenum].format == FORMAT_RGBA && tile[tilenum].size == PIXEL_SIZE_32BIT)
        tmem_formatting = 1;
    else
        tmem_formatting = 2;

    // tiadvance = bytes per DRAM read
    // spanadvance = pixels per DRAM read
    int tiadvance = 0, spanadvance = 0;
    int tiptr = 0;
    switch (ti_size)
    {
    case PIXEL_SIZE_4BIT:
        rdp_pipeline_crashed = 1;
        return;
        break;
    case PIXEL_SIZE_8BIT:
        tiadvance = 8;
        spanadvance = 8;
        break;
    case PIXEL_SIZE_16BIT:
        if (!ltlut)
        {
            tiadvance = 8;
            spanadvance = 4;
        }
        else
        {
            tiadvance = 2;
            spanadvance = 1;
        }
        break;
    case PIXEL_SIZE_32BIT:
        tiadvance = 8;
        spanadvance = 2;
        break;
    }

    // perform separate transfers
    // for each texture buffer row
    for (i = start; i <= end; i++)
    {
        xstart = span[i].lx;
        xend = span[i].unscrx;
        xendsc = span[i].rx;
        s = span[i].s;
        t = span[i].t;

        // get DRAM address of transfer row
        ti_index = ti_width * i + xend;
        tiptr = ti_address + PIXELS_TO_BYTES(ti_index, ti_size);
        // find length of row in pixels
        length = (xstart - xend + 1) & 0xfff;

        // for each DRAM read in row transfer
        for (j = 0; j < length; j+= spanadvance)
        {
            ss = s >> 16;
            st = t >> 16;

            sss = ss & 0xffff;
            sst = st & 0xffff;

            tc_pipeline_load(&sss, &sst, tilenum, coord_quad);

            dswap = sst & 1;

            // calculate TMEM addresses from
            // tile index and texture coordinates
            get_tmem_idx(sss, sst, tilenum, &tmemidx0, &tmemidx1, &tmemidx2, &tmemidx3, &bit3fl, &hibit);

            // read 4 big-endian 32-bit words
            // from 64-bit aligned row address
            readidx32 = (tiptr >> 2) & ~1;
            RREADIDX32(readval0, readidx32);
            readidx32++;
            RREADIDX32(readval1, readidx32);
            readidx32++;
            RREADIDX32(readval2, readidx32);
            readidx32++;
            RREADIDX32(readval3, readidx32);

            // get unaligned 64-bit word at row address
            // if TLUT load of even address, quadruplicate data
            // (duplicate 16-bit word at row address 4 times)
            switch(tiptr & 7)
            {
            case 0:
                if (!ltlut)
                    loadqword = ((uint64_t)readval0 << 32) | readval1;
                else
                {
                    tempshort = readval0 >> 16;
                    loadqword = ((uint64_t)tempshort << 48) | ((uint64_t) tempshort << 32) | ((uint64_t) tempshort << 16) | tempshort;
                }
                break;
            case 1:
                loadqword = ((uint64_t)readval0 << 40) | ((uint64_t)readval1 << 8) | (readval2 >> 24);
                break;
            case 2:
                if (!ltlut)
                    loadqword = ((uint64_t)readval0 << 48) | ((uint64_t)readval1 << 16) | (readval2 >> 16);
                else
                {
                    tempshort = readval0 & 0xffff;
                    loadqword = ((uint64_t)tempshort << 48) | ((uint64_t) tempshort << 32) | ((uint64_t) tempshort << 16) | tempshort;
                }
                break;
            case 3:
                loadqword = ((uint64_t)readval0 << 56) | ((uint64_t)readval1 << 24) | (readval2 >> 8);
                break;
            case 4:
                if (!ltlut)
                    loadqword = ((uint64_t)readval1 << 32) | readval2;
                else
                {
                    tempshort = readval1 >> 16;
                    loadqword = ((uint64_t)tempshort << 48) | ((uint64_t) tempshort << 32) | ((uint64_t) tempshort << 16) | tempshort;
                }
                break;
            case 5:
                loadqword = ((uint64_t)readval1 << 40) | ((uint64_t)readval2 << 8) | (readval3 >> 24);
                break;
            case 6:
                if (!ltlut)
                    loadqword = ((uint64_t)readval1 << 48) | ((uint64_t)readval2 << 16) | (readval3 >> 16);
                else
                {
                    tempshort = readval1 & 0xffff;
                    loadqword = ((uint64_t)tempshort << 48) | ((uint64_t) tempshort << 32) | ((uint64_t) tempshort << 16) | tempshort;
                }
                break;
            case 7:
                loadqword = ((uint64_t)readval1 << 56) | ((uint64_t)readval2 << 24) | (readval3 >> 8);
                break;
            }

            switch(tmem_formatting)
            {
            case 0:
                // for YUV textures, UV in low TMEM, Y in high TMEM
                // assumes texture buffer in UYVY 8 bit per channel format
                readval0 = (uint32_t)((((loadqword >> 56) & 0xff) << 24) | (((loadqword >> 40) & 0xff) << 16) | (((loadqword >> 24) & 0xff) << 8) | (((loadqword >> 8) & 0xff) << 0));
                readval1 = (uint32_t)((((loadqword >> 48) & 0xff) << 24) | (((loadqword >> 32) & 0xff) << 16) | (((loadqword >> 16) & 0xff) << 8) | (((loadqword >> 0) & 0xff) << 0));

                // use banks 2/3 or 0/1 depending on coords
                if (bit3fl)
                {
                    tmem16[tmemidx2 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 >> 16);
                    tmem16[tmemidx3 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 & 0xffff);
                    tmem16[(tmemidx2 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 >> 16);
                    tmem16[(tmemidx3 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 & 0xffff);
                }
                else
                {
                    tmem16[tmemidx0 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 >> 16);
                    tmem16[tmemidx1 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 & 0xffff);
                    tmem16[(tmemidx0 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 >> 16);
                    tmem16[(tmemidx1 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 & 0xffff);
                }
                break;
            case 1:
                // for 32bpp RGBA textures, RG in low TMEM, BA in high TMEM
                readval0 = (uint32_t)(((loadqword >> 48) << 16) | ((loadqword >> 16) & 0xffff));
                readval1 = (uint32_t)((((loadqword >> 32) & 0xffff) << 16) | (loadqword & 0xffff));

                if (bit3fl)
                {
                    tmem16[tmemidx2 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 >> 16);
                    tmem16[tmemidx3 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 & 0xffff);
                    tmem16[(tmemidx2 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 >> 16);
                    tmem16[(tmemidx3 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 & 0xffff);
                }
                else
                {
                    tmem16[tmemidx0 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 >> 16);
                    tmem16[tmemidx1 ^ WORD_ADDR_XOR] = (uint16_t)(readval0 & 0xffff);
                    tmem16[(tmemidx0 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 >> 16);
                    tmem16[(tmemidx1 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(readval1 & 0xffff);
                }
                break;
            case 2:
                // other formats use low or high TMEM depending on coords
                // see Section 13.8 for TMEM layout diagrams
                if (!dswap)
                {
                    if (!hibit)
                    {
                        tmem16[tmemidx0 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 48);
                        tmem16[tmemidx1 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 32);
                        tmem16[tmemidx2 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 16);
                        tmem16[tmemidx3 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword & 0xffff);
                    }
                    else
                    {
                        tmem16[(tmemidx0 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 48);
                        tmem16[(tmemidx1 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 32);
                        tmem16[(tmemidx2 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 16);
                        tmem16[(tmemidx3 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword & 0xffff);
                    }
                }
                else
                {
                    // swap each pair of 4-byte blocks if odd T coord
                    // could omit swap/bit3fl logic by not sorting tmemidx?
                    if (!hibit)
                    {
                        tmem16[tmemidx0 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 16);
                        tmem16[tmemidx1 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword & 0xffff);
                        tmem16[tmemidx2 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 48);
                        tmem16[tmemidx3 ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 32);
                    }
                    else
                    {
                        tmem16[(tmemidx0 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 16);
                        tmem16[(tmemidx1 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword & 0xffff);
                        tmem16[(tmemidx2 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 48);
                        tmem16[(tmemidx3 | 0x400) ^ WORD_ADDR_XOR] = (uint16_t)(loadqword >> 32);
                    }
                }
            break;
            }

            // advance coords for next read
            s = (s + dsinc) & ~0x1f;
            t = (t + dtinc) & ~0x1f;
            tiptr += tiadvance;
        }
    }
}
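
The eight-way switch on (tiptr & 7) in loading_pipeline assembles an unaligned 64-bit word from four aligned 32-bit reads. For the non-TLUT cases it is equivalent to a 128-bit shift, which can be sketched compactly as below. The function name is hypothetical, and the TLUT path (which replicates one 16-bit value four times) is deliberately left out.

```c
#include <stdint.h>

/* words[0..3] are four consecutive big-endian 32-bit words starting at
 * the 64-bit aligned address at or below byte_addr. Returns the 8 bytes
 * beginning at byte_addr. */
static uint64_t read_unaligned_qword(const uint32_t words[4], uint32_t byte_addr)
{
    uint64_t hi = ((uint64_t)words[0] << 32) | words[1];
    uint64_t lo = ((uint64_t)words[2] << 32) | words[3];
    unsigned shift = (byte_addr & 7u) * 8u;

    if (shift == 0)
        return hi;  /* avoid an undefined 64-bit shift of lo */
    return (hi << shift) | (lo >> (64u - shift));
}
```

For an offset of 5, for instance, this yields (readval1 << 40) | (readval2 << 8) | (readval3 >> 24), the same value as case 5 of the switch.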

Primitive Rendering

The first thing Angrylion does upon reading a new primitive command is to compute which pixels are affected, and calculate the pixel bounds, starting value for each attribute, and other properties needed to render each scanline. These are stored in the span buffer to be retrieved by the render_spans call of the appropriate mode immediately afterward.
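
Before any spans can be computed, the command words have to be unpacked. Two patterns recur in the function below: sign-extending a narrow field, and stitching a 32-bit attribute together from an integer half and a fractional half that live in different command words. The following sketch shows both. The helper names are hypothetical stand-ins (the original uses a SIGN macro and inline expressions), and the final casts assume the usual two's complement conversion.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of value (1 <= bits <= 31). */
static int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t mask = (1u << bits) - 1u;
    uint32_t sign = 1u << (bits - 1u);
    value &= mask;
    return (int32_t)(value ^ sign) - (int32_t)sign;
}

/* Two attributes share each pair of words: one word carries both
 * 16-bit integer halves, the other both 16-bit fractional halves.
 * The attribute packed in the upper halves (e.g. red): */
static int32_t unpack_upper(uint32_t int_word, uint32_t frac_word)
{
    return (int32_t)((int_word & 0xffff0000u) | (frac_word >> 16));
}

/* The attribute packed in the lower halves (e.g. green): */
static int32_t unpack_lower(uint32_t int_word, uint32_t frac_word)
{
    return (int32_t)((int_word << 16) | (frac_word & 0xffffu));
}
```

Using these, the red start value would be unpack_upper(ewdata[8], ewdata[12]) and yl would be sign_extend(ewdata[0], 14).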

static void edgewalker_for_prims(int32_t* ewdata)
{
    int j = 0;
    int xleft = 0, xright = 0, xleft_inc = 0, xright_inc = 0;
    int r = 0, g = 0, b = 0, a = 0, z = 0, s = 0, t = 0, w = 0;
    int dr = 0, dg = 0, db = 0, da = 0;
    int drdx = 0, dgdx = 0, dbdx = 0, dadx = 0, dzdx = 0, dsdx = 0, dtdx = 0, dwdx = 0;
    int drdy = 0, dgdy = 0, dbdy = 0, dady = 0, dzdy = 0, dsdy = 0, dtdy = 0, dwdy = 0;
    int drde = 0, dgde = 0, dbde = 0, dade = 0, dzde = 0, dsde = 0, dtde = 0, dwde = 0;
    int tilenum = 0, flip = 0;
    int32_t yl = 0, ym = 0, yh = 0;
    int32_t xl = 0, xm = 0, xh = 0;
    int32_t dxldy = 0, dxhdy = 0, dxmdy = 0;

    if (unlikely(other_modes.f.stalederivs))
    {
        deduce_derivatives();
        other_modes.f.stalederivs = 0;
    }

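    // word 0: bit 23 is the flip (left-major) flag, bits 21:19
    // hold max_level, and bits 18:16 hold the tile index
    flip = (ewdata[0] & 0x800000) ? 1 : 0;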
    max_level = (ewdata[0] >> 19) & 7;
    tilenum = (ewdata[0] >> 16) & 7;

    // Y coordinates are 14-bit signed values with 2 fractional bits
    yl = SIGN(ewdata[0], 14);
    ym = ewdata[1] >> 16;
    ym = SIGN(ym, 14);
    yh = SIGN(ewdata[1], 14);

    // X coordinates are sign-extended from 28 bits,
    // edge slopes from 30 bits
    xl = SIGN(ewdata[2], 28);
    xh = SIGN(ewdata[4], 28);
    xm = SIGN(ewdata[6], 28);

    dxldy = SIGN(ewdata[3], 30);

    dxhdy = SIGN(ewdata[5], 30);
    dxmdy = SIGN(ewdata[7], 30);

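    // shade and texture attributes are 16.16 fixed point:
    // the integer halves of two attributes share one word, and
    // their fraction halves share the word four entries later
    r    = (ewdata[8] & 0xffff0000) | ((ewdata[12] >> 16) & 0x0000ffff);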
    g    = ((ewdata[8] << 16) & 0xffff0000) | (ewdata[12] & 0x0000ffff);
    b    = (ewdata[9] & 0xffff0000) | ((ewdata[13] >> 16) & 0x0000ffff);
    a    = ((ewdata[9] << 16) & 0xffff0000) | (ewdata[13] & 0x0000ffff);
    drdx = (ewdata[10] & 0xffff0000) | ((ewdata[14] >> 16) & 0x0000ffff);
    dgdx = ((ewdata[10] << 16) & 0xffff0000) | (ewdata[14] & 0x0000ffff);
    dbdx = (ewdata[11] & 0xffff0000) | ((ewdata[15] >> 16) & 0x0000ffff);
    dadx = ((ewdata[11] << 16) & 0xffff0000) | (ewdata[15] & 0x0000ffff);
    drde = (ewdata[16] & 0xffff0000) | ((ewdata[20] >> 16) & 0x0000ffff);
    dgde = ((ewdata[16] << 16) & 0xffff0000) | (ewdata[20] & 0x0000ffff);
    dbde = (ewdata[17] & 0xffff0000) | ((ewdata[21] >> 16) & 0x0000ffff);
    dade = ((ewdata[17] << 16) & 0xffff0000) | (ewdata[21] & 0x0000ffff);
    drdy = (ewdata[18] & 0xffff0000) | ((ewdata[22] >> 16) & 0x0000ffff);
    dgdy = ((ewdata[18] << 16) & 0xffff0000) | (ewdata[22] & 0x0000ffff);
    dbdy = (ewdata[19] & 0xffff0000) | ((ewdata[23] >> 16) & 0x0000ffff);
    dady = ((ewdata[19] << 16) & 0xffff0000) | (ewdata[23] & 0x0000ffff);

    s    = (ewdata[24] & 0xffff0000) | ((ewdata[28] >> 16) & 0x0000ffff);
    t    = ((ewdata[24] << 16) & 0xffff0000)    | (ewdata[28] & 0x0000ffff);
    w    = (ewdata[25] & 0xffff0000) | ((ewdata[29] >> 16) & 0x0000ffff);
    dsdx = (ewdata[26] & 0xffff0000) | ((ewdata[30] >> 16) & 0x0000ffff);
    dtdx = ((ewdata[26] << 16) & 0xffff0000)    | (ewdata[30] & 0x0000ffff);
    dwdx = (ewdata[27] & 0xffff0000) | ((ewdata[31] >> 16) & 0x0000ffff);
    dsde = (ewdata[32] & 0xffff0000) | ((ewdata[36] >> 16) & 0x0000ffff);
    dtde = ((ewdata[32] << 16) & 0xffff0000)    | (ewdata[36] & 0x0000ffff);
    dwde = (ewdata[33] & 0xffff0000) | ((ewdata[37] >> 16) & 0x0000ffff);
    dsdy = (ewdata[34] & 0xffff0000) | ((ewdata[38] >> 16) & 0x0000ffff);
    dtdy = ((ewdata[34] << 16) & 0xffff0000)    | (ewdata[38] & 0x0000ffff);
    dwdy = (ewdata[35] & 0xffff0000) | ((ewdata[39] >> 16) & 0x0000ffff);

    // Z and its derivatives each occupy a full 32-bit word
    z    = ewdata[40];
    dzdx = ewdata[41];
    dzde = ewdata[42];
    dzdy = ewdata[43];

    // low 5 bits of DsDx..DaDx are ignored
    // for per-pixel increments; DzDx is used in full
    spans_ds = dsdx & ~0x1f;
    spans_dt = dtdx & ~0x1f;
    spans_dw = dwdx & ~0x1f;
    spans_dr = drdx & ~0x1f;
    spans_dg = dgdx & ~0x1f;
    spans_db = dbdx & ~0x1f;
    spans_da = dadx & ~0x1f;
    spans_dz = dzdx;

    // DrDy is only used for subpixel adjustments,
    // due to partial coverage or sign(DxhDy) == flip
    spans_drdy = drdy >> 14;
    spans_dgdy = dgdy >> 14;
    spans_dbdy = dbdy >> 14;
    spans_dady = dady >> 14;
    spans_dzdy = dzdy >> 10;
    spans_drdy = SIGN(spans_drdy, 13);
    spans_dgdy = SIGN(spans_dgdy, 13);
    spans_dbdy = SIGN(spans_dbdy, 13);
    spans_dady = SIGN(spans_dady, 13);
    spans_dzdy = SIGN(spans_dzdy, 22);
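    // per-pixel increments reduced to the same precision
    // as the Dy terms above (13 bits, or 22 for Z)
    spans_cdr = spans_dr >> 14;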
    spans_cdr = SIGN(spans_cdr, 13);
    spans_cdg = spans_dg >> 14;
    spans_cdg = SIGN(spans_cdg, 13);
    spans_cdb = spans_db >> 14;
    spans_cdb = SIGN(spans_cdb, 13);
    spans_cda = spans_da >> 14;
    spans_cda = SIGN(spans_cda, 13);
    spans_cdz = spans_dz >> 10;
    spans_cdz = SIGN(spans_cdz, 22);

    // low 15 bits of DsDy, DtDy, and DwDy are ignored
    spans_dsdy = dsdy & ~0x7fff;
    spans_dtdy = dtdy & ~0x7fff;
    spans_dwdy = dwdy & ~0x7fff;

    int dzdy_dz = (dzdy >> 16) & 0xffff;
    int dzdx_dz = (dzdx >> 16) & 0xffff;

    // set dz = abs(dzdy_dz) + abs(dzdx_dz), approximating
    // negation with ~x (ones' complement) for negative inputs
    spans_dzpix = ((dzdy_dz & 0x8000) ? ((~dzdy_dz) & 0x7fff) : dzdy_dz) + ((dzdx_dz & 0x8000) ? ((~dzdx_dz) & 0x7fff) : dzdx_dz);
    // round up to power of 2, clamping at 0x8000
    spans_dzpix = normalize_dzpix(spans_dzpix & 0xffff) & 0xffff;

    // X step per Y subpixel: slope / 4, with bit 0 cleared
    xleft_inc = (dxmdy >> 2) & ~0x1;
    xright_inc = (dxhdy >> 2) & ~0x1;

    xright = xh & ~0x1;
    xleft = xm & ~0x1;

    int k = 0;

    int dsdiff, dtdiff, dwdiff, drdiff, dgdiff, dbdiff, dadiff, dzdiff;
    int sign_dxhdy = (ewdata[5] & 0x80000000) ? 1 : 0;

    int dsdeh, dtdeh, dwdeh, drdeh, dgdeh, dbdeh, dadeh, dzdeh, dsdyh, dtdyh, dwdyh, drdyh, dgdyh, dbdyh, dadyh, dzdyh;
    int do_offset = !(sign_dxhdy ^ flip);

    // add 3/4 DrDe and subtract 3/4 DrDy from
    // attribute start value if sign(DxhDy) == flip
    if (do_offset)
    {
        dsdeh = dsde & ~0x1ff;
        dtdeh = dtde & ~0x1ff;