Optimization, inside the mind of a mad man

In this article I talk about optimizing the level rendering routine for my game Bouncy Ball for the Color Computer. With the help of the mad man Simon Jonassen, I was able to take the render routine from 7 fps to 50 fps! I was quite surprised at the results. I made a video at the end of this article, and explain things in detail here. I also list a couple resources you might want to have in your back pocket.

 

It takes several iterations to optimize code. First you get your prototype or first attempt working, in an unoptimized form. It might be ugly, it might be slow, but it should do the job it is suppose to do. The next step is try to analyze it and figure out how you can make it a little faster. Then analyze and figure out how to make it a little faster. Rinse and repeat until you are happy with the result. Happy with the result is the important thing. You might be able to make it faster, but you could also decide to get back to working on other parts of the code.

That’s what I have done. I got Bouncy Ball’s render code fast enough, so I can continue with the rest of Bouncy Ball.

Results

I pulled the render routine out of Bouncy Ball and made a stand alone program called render. From there I did all my optimizations, and will roll that back into Bouncy Ball.

Let’s first take a look at the final results in the table below. From here on, I will just refer to render1’s speed as 7 fps, which is what it runs at in slow speed, to keep things more consistent.

Frame rate (FPS) Program Language CPU Speed Notes
15 (7) render1 C high prototype blitter written entirely in C.
26 plot2 asm low My first kick at the can to convert C glitter to asm
37 plot4 asm low Simon’s first optimization pass
40 plot6 asm low Simon’s final optimization
51 plot7 asm low plot6 with color lookup removed and my optimized page flip
36 render2 C low plot7 moved to inline asm
50 render3 C low plot7 moved to inline asm and updated page flip

I have broken this article up into a few sections. I show the code we want to optimize. Converting to assembly. Simon’s optimizations. My further optimizations. Then rolling it back into C via inline assembly.

The Render Code in C

FPS: 7, render1.c

I was lucky to be developing on a Coco 3, and was able to make use of it’s high speed ‘poke’. As such, I was running at 15fps. But let’s stick with the 7 fps at the slow speed setting, since all the rest of the code examples were run on the low speed setting.

First the code I am trying to optimize. Nothing fancy here. It just looks at the tile data for the level, checks if the tile/wall is suppose to be invisible or not, lookups the tile in the color table, then stores the color in the display buffer. It does it’s job, albeit a little slow. But it got me to 80% code complete before I decided it was time to fix it up.

Color Lookup

Also worth noting is that I am double buffering. In other words, I am composing the image in an offscreen buffer, then blit the entire buffer to the video display. This avoids flickering caused by erasing and redrawing. All you ever see is the final image.

void showLevelSection(int offset)
{
    byte* pcolor = blockColors;
    byte* dispStart = doubleBuffer;  //point to second line
    byte *pdisp = dispStart;
    byte* poffsetStart = level->data+offset;
    byte* plevel = poffsetStart;

    for(int y=0; y<15; y++)
    {
        pdisp   = dispStart;
        plevel = poffsetStart;
        for(int x=0; x<DISP_WIDTH; x++)
        {
            if(BLOCK_WALL_INVISIBLE != *plevel)
                *pdisp = blockColors[*plevel];
            else
                *pdisp = blockColors[BLOCK_EMPTY];
            ++pdisp;
            ++plevel;
        }
        poffsetStart += level->width;
        dispStart    += DISP_WIDTH;
    }
}

Converting to assembly

FPS: 26, plot2.asm

Next step was to convert to straight assembly. The reason I didn’t just try to do it via inline right away was I didn’t want to deal with any possible inline C issues, which was a good choice, as just that happened when Simon send me his revisions. Going from 7 fps to 26 is pretty darn good. All I did was convert the C to assembly.

To give you an idea what the program structure is, we have have a couple loops that just scroll the level forward, then again in reverse. Nothing fancy. I use register A as the position in the level, and call drawlevel.

plot1           ldx     #offset         move forward
                lda     ,x              reg A contains offset
                jsr     drawlevel
                ldx     #offset
                lda     ,x
                inca
                sta     ,x
                cmpa    #MAX_OFFSET
                bne     plot1

plot2           ldx     #offset         reverse direction
                lda     ,x
                deca    
                sta     ,x
                beq     plot1
                jsr     drawlevel
                bra     plot2                 

In drawlevel I point to level data, point to the double buffer, then in two loops. The outer loop is for each line, the inner loop for each character in the line. Once both loops are done, the level has been drawn, and I page flip to show the user what the level looks like.

drawlevel       ldx     #level          reg A contains offset
                leax    a,x             point to correct offset in level
                ldy     #buffer

                ldb     #15             counter for outter loop
outterloop      lda     #32

loop            pshs    a               [5+1]   hold counter
                lda     ,x+             [2]     color index
                pshs    x               [5+1]   hold level pointer
                cmpa    #3              is this an invisible block?
                bne     loopcolor       no, we need to look up color
                lda     #0

loopcolor       ldx     #color          load correct color ...
                leax    a,x             ...offset
                lda     ,x              a = color[...]
                sta     ,y+             blit to buffer
                puls    x               point to level again
                puls    a               inner loop counter...
                deca                    ...
                bne     loop            ...keep looping?

                leax    49,x            point to next line in level
                decb                    outter loop counter...
                bne     outterloop      ...keep going

                jsr     flip            page flip from buffer to video memory

                rts

Finally, the page flip routine, which takes the double buffer and copies it to the video memory.

flip            ldx     #buffer
                ldy     #$400
fliploop        ldd     ,x++
                std     ,y++
                cmpy    #$600
                bne     fliploop
                rts                

First Revision of Optimizations

FPS: 37, plot4.asm

I thought Simon was going to take a look at the code and recommend changes I could make. Well, he don’t work that way. Must have been within the hour, he sent back a revision that used self modificating code, and made use of the U register, which I hadn’t considered using.

The control loops used the self mod code. This wasn’t actually useful for Bouncy Ball, as the control logic for the level position was handled elsewhere in the C code, it is interesting to see it in use. The ldd #0 is modified by inc plot1+1 to be lda #1 and so on. Very powerful technique. For larger programs, I wonder how readable it ends up being. As my programs get larger, it will be interesting to see.

Notice that he loads up X with the color table, so we can use it later and not load X inside an loop.

start           
                ldx     #color          load correct color ... 
plot1           lda     #0
                cmpa    #MAX_OFFSET
                beq     plot2
                jsr     drawlevel
                inc     plot1+1
                bra     plot1

It actually took a couple revision for Simon to get all the push and pulls out of the drawlevel code below. Registers Y and U need to be loaded every time at the start of the loop. Pointing U to the level differers considerably from my code where I use X to point to the level data, and a counter; dual purposing it, which causes me to have to push and pull A and X from inside the loop. Not very efficient. Also notice instead of loading A with 0 ldd #0, he used clra. I should have known that from coding my PC–1360 in assembly.

drawlevel       ldy     #buffer
                ldu     #level          reg A contains offset
                leau    a,u             point to correct offset in level
bcnt            ldb     #15             counter for outter loop

loop            lda     #32
                beq     nextline
                lda     ,u+             load A with color index from level
                cmpa    #3              is this an invisible block?
                bne     loopcolor       no, we need to look up color
                clra
loopcolor       lda     a,x              a = color[...]
                sta     ,y+             blit to buffer
                dec     loop+1
                bne     loop
nextline        leau    49,u            point to next line in level
                lda     #32
                sta     loop+1
                decb                    outter loop counter...
                bne     loop            ...keep going

He also made use of self modifying code to handle the outer loop, as you can see in lda loop+1.

Simon’s Final Revision

FPS: 40, plot6.asm

Ever try to squeeze that last little bit out of an orange? Well I know Simon does. A couple revisions later, and he got a tidy 3 fps and several cycles faster. The draw routine went from 15 instructions down to 13.

drawlevel       ldb     #15             counter for outer loop
                ldx     #color          load correct color ... 
                ldy     #buffer
                ldu     #level          reg A contains offset
                leau    a,u             point to correct offset in level
loop            lda     ,u+             load A with color index from level
                cmpa    #3              is this an invisible block?
                bne     loopcolor       no, we need to look up color
                clra
loopcolor       lda     a,x              a = color[...]
                sta     ,y+             blit to buffer
                dec     xcount
                bne     loop
nextline        leau    49,u            point to next line in level
                lda     #32             reset x counter
                sta     xcount
                decb                    outter loop counter...
                bne     loop            ...keep going

The changes in the above code are subtile, and really shows finesse. By introducing xcount to use as a counter instead of self modifying the lda instruction as seen in the old code:

loop            lda     #32
                beq     nextline
                lda     ,u+             load A with color index from level
                cmpa    #3              is this an invisible block?

He was able to just focus on loading A with the color index from the level. And just like that, we removed 2 instructions. From the new code:

loop            lda     ,u+             color index
                cmpa    #3              is this an invisible block?

We no longer have to load and branch. The rest of the code is identical, although instead of using dec loop+1 to control the inner loop, we use dec xcount.

My Optimizations to Simon’s Code

FPS: 51, plot7.asm

Simon had suggested a couple changes I could still make. One was loosing the color lookup table. At first I resisted removing the color table, since I wanted to be able to change level colors. Palette cycling if you will. But since this wasn’t going to happen during the game, where we are rendering the level, it started to make sense that we could just put the color data in the level, and just modify the level when we needed. So I dropped the color table and gained 6 fps.

The second item he pointed me to, was the page flip routine. Before you ask why I am not using hardware page flipping, I didn’t know I could. I’m still pretty green when it comes to low level programming on the Coco 3, and I wouldn’t have thought you could change the 32×16 text screens position in memory. The SAM rocks. In a later video post I will talk about all 3 ways we can optimize the page flip, but for now I will just focus on unrolling.

The 3 ways are, unrolling, which I ended up doing. Stack blasting, a sneaky but mad technique. And finally hardware page flipping, which I decided to switch to later in Bouncy Ball’s development cycle.

Unrolling

The reason for unrolling a loop is because you want to perform compares and branches as few times as possible, as they take a lot of cpu cycles. The process of unrolling a loop is to repeat the code, and loop fewer times.

This is the page flip routine before unrolling it:

flip            ldx     #buffer
                ldy     #$400
fliploop        ldd     ,x++
                std     ,y++
                cmpy    #$600
                bne     fliploop
                rts                

This is after. The number in brackets is the cycle count:

flip            ldu     #buffer         (3)
                ldx     #$400           (3)     (6)

fliploop
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)
                ldd     ,u++            (8)
                std     ,x++            (8)

                cmpx    #$600           (4)
                bne     fliploop        (3)     (8*32 + 4 + 3) *
                                                ;(23) * 256  = 5888 cycles       
                rts                     

One could argue that the first one is tidier and takes up less memory. Both points are true. But the performance of the unrolled loop gained 5 fps. Can’t argue with the numbers.

Integrating Back Into C

FPS: 36, render2.c

Now that I had a crazy fast render routine, I needed to get it back into my C test program. This is where I ran into a problem.

The code below is the assembly generated from render2.c for the showLevelSection(), the function to blit the level to the double buffer.

* FUNCTION showLevelSection(): defined at render2.c:266
_showLevelSection
    PSHS    U   
    LEAU    ,S  
    LEAS    -6,S    
* Line render2.c:299: init of variable xcount
    LDB #$20        32
    STB -6,U        variable xcount
* Line render2.c:300: init of variable ycount
    LDB #$0F        15
    STB -5,U        variable ycount
* Line render2.c:301: init of variable pbuffer
    LEAX    G00087+0,PCR    address of array doubleBuffer
    TFR X,D     as r-value
    STD -4,U        variable pbuffer
* Line render2.c:302: init of variable plevel
    LDD 4,U     variable offset, declared at render2.c:266
    PSHS    B,A 
    LDX G00086+0,PCR    variable level
    LDD 29,X        member data of BOUNCY_LEVEL
    ADDD    ,S++    
    STD -2,U        variable plevel
* Inline assembly:

        ldx     -2,U
        ldy     -4,U
        ...

The stuff before Inline assembly is all auto generated by the compiler, and I have no control over. The stuff after is what is inside the asm{} block. If we look at the C code:

void showLevelSection(int offset)
{
    byte xcount=32;
    byte ycount=BLIT_HEIGHT;
    byte* pbuffer = doubleBuffer;                 //needed for asm to see a byte* correctly
    byte* plevel = level->data+offset;
    asm
    {
        ldx     plevel
        ldy     pbuffer
        ...
    }
}

Notice that the instructions

        ldx     plevel
        ldy     pbuffer

Got changed to

        ldx     -2,U
        ldy     -4,U

This is where the problem is. If we continue to use Simon’s code

        ldu     #level          reg A contains offset
        leau    a,u             point to correct offset in level

We are changing the contents of U, and end we up crashing and burning. The reason is register U is being used to reference C variables.

This also influenced my decision to toss out the color lookup table. Since he used X, Y, and U, the code would have to change a lot.

        ldx     #color          load correct color ... 
        ldy     #buffer
        ldu     #level          reg A contains offset

So I tossed out color, and use X and Y. We end up with this:

void showLevelSection(int offset)
{
    byte xcount=32;
    byte ycount=BLIT_HEIGHT;
    byte* pbuffer = doubleBuffer; //needed for asm to see a byte* correctly
    byte* plevel = level->data+offset;
    asm
    {
        ldx     plevel
        ldy     pbuffer

loop:
        lda     ,x+             read from level
        cmpa    #INVISIBLE      should this be drawn?
        bne     drawfromlevel   not invisible, draw using level data
        lda     #BLACK          replace color with black tile
drawfromlevel:
        sta     ,y+             blit to buffer
        dec     xcount
        bne     loop

        leax    49,x            point to next line in level
        lda     #DISP_WIDTH     reset x counter
        sta     xcount
        dec     ycount          outter loop counter...
        bne     loop            ...keep going
    }
}

Updating C Version of Page Flip

FPS: 50 fps, render3.c

With the new inline assembly, we were humming along at 36 fps. Wait, what? This is the same assembly that was getting 50 fps in plot7.asm! I hadn’t realized I should have updated the page flip routine. This is what it use to be:

void flip()
{
    memcpy((byte*)DISP_ADDR,doubleBuffer,512);
}

Two words. Too Slow. That’s a general purpose memory copy, which works great for most cases. But let’s try some inline, and we end up with render3.c and 50 fps:

void flip()
{
    byte* pbuffer = doubleBuffer;   //asm needs byte* not byte[]
    asm
    {
        ldx     pbuffer
        ldy     #DISP_ADDR

        // unrolled loop, 16 bit operations, repeated 16 times = 32 bytes transfered
    fliploop:
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++
        ldd     ,x++
        std     ,y++

        cmpy    #$600
        bne     fliploop
    }
}

Conclusion

From 7 fps to 50 fps is totally mad. Thanks Simon for all the help. I learned a lot, and hope folks picked up a few ideas too.

The main thing I want to stress is to get your prototype working, even if it is slow, continue development, and circle back and optimize when needed. Don’t kill yourself right out of the gate.

Video

I seem to bumble along in my videos, but hopefully you will be able to follow along.

Related Links

SockMaster’s cycle timings,

Tech Ref for Coco,

Simon’s Coco Site


One thought on “Optimization, inside the mind of a mad man”

Comments are closed.