June 6, 2025

For the final quarter of my first year at UCSB, I took CCS Computing’s very own 130E, a project-based seminar course exploring esoterica at the hardware-software interface.

My classmates took on a variety of cool projects.

This post describes my project: designing a continued logarithm arithmetic hardware unit in UCSB’s HDL of choice (PyRTL) and running it on an FPGA.

This is my latest dive into continued logarithms. I’ve previously investigated the question of how continued logarithms spread their representational power (somewhat inconclusively). Continued logarithm hardware has been designed and run on an FPGA once before, in 2006, but that design used a nonredundant continued logarithm scheme; its author (Brabec) had not yet devised the redundant scheme that might make continued logarithms actually practical for high-precision real arithmetic. I don’t think they’ll have much use outside of very specific applications (as an “easy” way to do real arithmetic), but I figured I’d take a crack at implementing them in hardware, just for fun.

All continued logarithm-based algorithms are best framed as nodes in a DAG, ingesting and egesting continued logarithm terms, pushing them through the graph until we evaluate the expression to satisfactory precision.
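To make the streaming picture concrete, here’s a toy sketch in plain Python (the names and the placeholder “arithmetic” are purely illustrative, not my actual design):

```python
# Toy model of the evaluation DAG: each node lazily pulls
# continued-logarithm terms from upstream as it produces its own.

def leaf(terms):
    """A leaf node: replays a fixed sequence of terms."""
    yield from terms

def binary_node(x, y):
    """A two-input node. A real node would update its internal state
    from each ingested term and egest z terms once they're pinned down;
    this stub just alternates pulls to show the demand-driven flow."""
    for tx, ty in zip(x, y):
        yield tx  # placeholder: stand-in for a computed z term
        yield ty

# Wire up a tiny graph and pull terms from the root until we'd have
# enough precision (here, we just exhaust the toy streams).
z = binary_node(leaf([1, 0, 1]), leaf([0, 1, 1]))
print(list(z))
```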

Having given it some thought, I opted to follow Bill Gosper’s suggested 3-cycle architecture (described at the very bottom of his paper), in which a unit takes a term of x, takes a term of y, and emits a term of z, but using Brabec’s speculatively redundant scheme. The idea is that multiple such units can be connected together, feeding and being fed in turn, so that a complex arithmetic expression decomposes into a graph of these units.

(see this diagram reused from my 1L project)

There are a few continued logarithm algorithms I wanted to implement; I’ll describe the blft (bilinear fractional transformation), since it’s the unit I actually implemented.

Each blft unit has 10 internal registers: 8 signed integers representing the coefficients of a 2x4 matrix, a 1-bit “singularity” flag, and a “z” register, where output terms are written (and read by downstream units). On each major cycle it can ingest a single term from x, ingest a single term from y, or egest a single term to z; it can also reset itself.
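A rough PyRTL sketch of that state (the register names and the 2-bit opcode encoding are my illustration here, not necessarily what the real unit uses):

```python
import pyrtl

W = 16  # coefficient bitwidth (16 for the Fomu test; 256+ for real use)

# Eight signed coefficients: the 2x4 state matrix of the bilinear form
#   z = (a*x*y + b*x + c*y + d) / (e*x*y + f*x + g*y + h)
a, b, c, d, e, f, g, h = (
    pyrtl.Register(bitwidth=W, name=n) for n in 'abcdefgh')

singular = pyrtl.Register(bitwidth=1, name='singular')  # singularity flag
z = pyrtl.Register(bitwidth=W, name='z')  # output-term buffer, read downstream

# One operation per major cycle: ingest x, ingest y, egest z, or reset.
op = pyrtl.Input(bitwidth=2, name='op')
INGEST_X, INGEST_Y, EGEST_Z, RESET = range(4)

with pyrtl.conditional_assignment:
    with op == RESET:
        # Example reset to the transform z = x*y; the real initial
        # matrix depends on which expression node this unit implements.
        a.next |= 1
        h.next |= 1
        for r in (b, c, d, e, f, g, singular, z):
            r.next |= 0
    # ... the ingest/egest cases update the matrix per Brabec's rules ...
```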

The unit is shallow but wide: highly parallel, with a latency of not much more than two signed-integer compares, but occupying a large area on the die. That area scales linearly with the integer width chosen for the internal coefficient registers. I think these are unavoidable characteristics of any continued logarithm arithmetic unit (even if my implementation leaves lots of room for improvement).

I tested the design on a Fomu FPGA, transpiling PyRTL to Verilog and then synthesizing and uploading it (with the help of Yosys, YoWASP, and openFPGALoader). Unfortunately, the Fomu doesn’t have a big enough die to fit even a single unit unless I crank the integer width down from my initial 256 bits all the way to 16 bits (logic area is by far the limiting factor). Were one to actually build this hardware unit, it’d need at least 256-bit integers to be practical for high-precision computations. But this is just a toy test.
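For the curious, the flow looked something like this (the tool invocations in the comments are approximate reconstructions, so double-check them against the Fomu docs):

```python
import pyrtl

# ... after building the design ...

# Transpile the PyRTL design to Verilog for the open-source toolchain.
with open('clog_unit.v', 'w') as f:
    pyrtl.output_to_verilog(f)

# Then synthesize, place-and-route, pack, and upload. Roughly
# (the Fomu is an iCE40UP5K in a UWG30 package):
#   yowasp-yosys -p 'synth_ice40 -json clog_unit.json' clog_unit.v
#   yowasp-nextpnr-ice40 --up5k --package uwg30 \
#       --json clog_unit.json --pcf fomu.pcf --asc clog_unit.asc
#   yowasp-icepack clog_unit.asc clog_unit.bit
#   openFPGALoader -b fomu clog_unit.bit
```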

The Fomu’s max clock speed is 48 MHz; it has to underclock to 14 MHz to run this unit (the longest path through the logic takes 68.37 ns, which caps the clock at about 14.6 MHz). Not the worst for such a tiny FPGA.

Unfortunately, I can’t get the damn thing to work properly!! It works just fine in the PyRTL simulator, but when I upload the design, it behaves in weird nondeterministic ways. I suspected the PyRTL-to-Verilog transpilation step might be the problem, but testing the generated Verilog with Verilator yields the exact same results as in PyRTL. So I really don’t know what’s wrong. I do have access to another, more powerful FPGA I could test on, but I’m also running out of time and am stonewalled by my computer’s refusal to recognize the device when I plug it in. Just installing the USB drivers is headache enough that I’ve deemed it a task for another time.
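For what it’s worth, the PyRTL-side sanity check is dead simple (the stimulus values here are made up):

```python
import pyrtl

sim_trace = pyrtl.SimulationTrace()
sim = pyrtl.Simulation(tracer=sim_trace)

# Drive a few major cycles: reset, then ingest one x term and one y term
# (using the illustrative opcode encoding from the sketch above).
for op_val in [3, 0, 1]:
    sim.step({'op': op_val})

sim_trace.render_trace()  # agrees with Verilator on the same stimulus
```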

Should you build continued logarithm arithmetic units? I mean, I was just today (6/3/25) made aware of a UCSB archlab project to build a hardware unit unifying communication with all manner of acceleration units under a single driver and API. This would actually be a perfect candidate to integrate into such a system: a little streaming acceleration unit invoked only occasionally for high-precision arithmetic. My guess is it’s not worth it and GMP/MPFR are simply good enough. But without further investigation, who can say?

A very fun project to end my first year of college on :D