TinyGPU — Design and Implementation Using Verilog and Xilinx Vivado

A fully functional SIMT GPU core implementing parallel arithmetic, logic, and bitwise operations across 4 simultaneous threads.

Project Overview

TinyGPU is a minimal but architecturally complete GPU core written in synthesizable Verilog. It demonstrates the fundamental concept behind real GPUs: SIMT (Single Instruction, Multiple Threads) — one instruction broadcast to many execution units simultaneously, each operating on its own private register file.

All 4 ALUs fire in the same clock cycle, so the four threads execute truly in parallel.

Architecture

┌───────────────────────────────────────────────────────────────┐
│                            TinyGPU Top                        │
│                             ┌──────────────────────────────┐  │
│                             │         Warp Controller      │  │
│                             │  PC + FSM + Thread Dispatch  │  │
│                             │                              │  │
│                             │  ┌──────────┐  ┌──────────┐  │  │
│                             │  │ Thread 0 │  │ Thread 1 │  │  │
│  ┌──────────────────┐       │  │ ┌──────┐ │  │ ┌──────┐ │  │  │
│  │  Instruction Mem │ ────▶ │  │ │ ALU  │ │  │ │ ALU  │ │  │  │
│  │  (256 × 32-bit)  │       │  │ │ REG  │ │  │ │ REG  │ │  │  │
│  └──────────────────┘       │  │ └──────┘ │  │ └──────┘ │  │  │
│                             │  └──────────┘  └──────────┘  │  │
│                             │  ┌──────────┐  ┌──────────┐  │  │
│                             │  │ Thread 2 │  │ Thread 3 │  │  │
│                             │  │ ┌──────┐ │  │ ┌──────┐ │  │  │
│                             │  │ │ ALU  │ │  │ │ ALU  │ │  │  │
│                             │  │ │ REG  │ │  │ │ REG  │ │  │  │
│                             │  │ └──────┘ │  │ └──────┘ │  │  │
│                             │  └──────────┘  └──────────┘  │  │
│                             └──────────────────────────────┘  │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Module Hierarchy

tinygpu.v                    ← Top-level GPU
├── instruction_memory.v     ← 256×32-bit synchronous ROM
└── warp_controller.v        ← SIMT warp: PC + FSM + 4 threads
    └── thread_processor.v  ← Per-thread: decode + regfile + ALU
        ├── instruction_decoder.v  ← 32-bit instruction decode
        ├── register_file.v        ← 16×32-bit dual-read regfile
        └── alu.v                  ← Full arithmetic + logic unit

Supported Operations

Opcode	Mnemonic	Operation	Type
0000	ADD	rd = rs1 + rs2	R
0001	SUB	rd = rs1 - rs2	R
0010	MUL	rd = rs1 × rs2 (lo32)	R
0011	DIV	rd = rs1 ÷ rs2	R
0100	AND	rd = rs1 & rs2	R
0101	OR	rd = rs1 | rs2	R
0110	XOR	rd = rs1 ^ rs2	R
0111	NOT	rd = ~rs1	R
1000	SHL	rd = rs1 << rs2[4:0]	R
1001	SHR	rd = rs1 >> rs2[4:0]	R
1010	NAND	rd = ~(rs1 & rs2)	R
1011	NOR	rd = ~(rs1 | rs2)	R
1100	XNOR	rd = ~(rs1 ^ rs2)	R
1101	MOD	rd = rs1 % rs2	R
1110	ADDI	rd = rs1 + imm16	I
1111	MOV	rd = imm16	I

Special: 0x0000_FFFF = HALT (end of program)

Instruction Encoding (32-bit fixed-width)

R-Type (register operands)

 31      28 27    24 23    20 19    16 15         0
┌──────────┬────────┬────────┬────────┬─────────────┐
│  opcode  │   rd   │  rs1   │  rs2   │  (unused)   │
│  [4-bit] │[4-bit] │[4-bit] │[4-bit] │  [16-bit]   │
└──────────┴────────┴────────┴────────┴─────────────┘

I-Type (immediate operand)

 31      28 27    24 23    20 19              4 3    0
┌──────────┬────────┬────────┬────────────────┬──────┐
│  opcode  │   rd   │  rs1   │    imm16       │  0   │
│  [4-bit] │[4-bit] │[4-bit] │   [16-bit]     │[4-b] │
└──────────┴────────┴────────┴────────────────┴──────┘
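
As a rough illustration of the two formats above, the fields can be sliced straight out of the 32-bit word. This is a minimal sketch, not the project's instruction_decoder.v: the module name, port names, and the is_itype / is_halt outputs are assumptions made for this example. Only the bit positions, the two I-type opcodes, and the HALT word come from the tables above.

```verilog
// Illustrative field extraction for the R-type and I-type layouts.
// All names here are hypothetical; only the bit ranges follow the README.
module decode_sketch (
    input  wire [31:0] instr,
    output wire [3:0]  opcode,
    output wire [3:0]  rd,
    output wire [3:0]  rs1,
    output wire [3:0]  rs2,      // meaningful for R-type only
    output wire [15:0] imm16,    // meaningful for I-type only
    output wire        is_itype,
    output wire        is_halt
);
    assign opcode = instr[31:28];
    assign rd     = instr[27:24];
    assign rs1    = instr[23:20];
    assign rs2    = instr[19:16];  // overlaps the top nibble of imm16
    assign imm16  = instr[19:4];

    // ADDI (1110) and MOV (1111) are the only I-type opcodes listed.
    assign is_itype = (opcode == 4'b1110) || (opcode == 4'b1111);

    // HALT is matched on the whole word, since its opcode field alone
    // (0000) is indistinguishable from ADD.
    assign is_halt  = (instr == 32'h0000_FFFF);
endmodule
```

Because rs2 and imm16 share bits 19:16, a consumer of these outputs would pick one or the other based on is_itype.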

File Structure

tinygpu/
├── rtl/
│   ├── alu.v                  ALU: all arithmetic + logic ops + flags
│   ├── register_file.v        16×32-bit dual-read register file
│   ├── instruction_decoder.v  32-bit instruction field extraction
│   ├── instruction_memory.v   256-word synchronous program ROM
│   ├── thread_processor.v     Single SIMT thread (ALU + regfile)
│   ├── warp_controller.v      4-thread warp + PC + state machine
│   └── tinygpu.v              Top-level integration
├── tb/
│   └── tb_tinygpu.v           Self-checking testbench (20 tests, 96 checks)
├── sim/
│   ├── sim.sh                 Icarus Verilog simulation script
│   └── demo_all_ops.hex       Demo program (all operations)
├── tinygpu_basys3.xdc         Vivado constraints (Basys3 board)
├── create_project.tcl         Vivado project creation script
└── README.md                  This file

Simulation (Icarus Verilog)

cd sim/
chmod +x sim.sh
./sim.sh

Or manually, from the repository root:

iverilog -g2012 \
  rtl/alu.v rtl/register_file.v rtl/instruction_decoder.v \
  rtl/instruction_memory.v rtl/thread_processor.v \
  rtl/warp_controller.v rtl/tinygpu.v tb/tb_tinygpu.v \
  -o sim/tinygpu_sim

vvp sim/tinygpu_sim

# View waveforms:
gtkwave sim/tinygpu_sim.vcd

Expected output:

Total Checks : 96
PASSED       : 96
FAILED       : 0
*** ALL TESTS PASSED — TinyGPU VERIFIED! ***

Vivado Synthesis (Xilinx)

Method 1: TCL Script (Recommended)

vivado -mode batch -source create_project.tcl

Then open vivado_project/TinyGPU.xpr in the GUI.

Method 2: GUI

New Project → RTL Project
Add Sources: all files in rtl/
Add Constraints: tinygpu_basys3.xdc
Target Part: xc7a35tcpg236-1 (Basys3)
Run Synthesis → Implementation → Generate Bitstream

How to Write a GPU Program

Each instruction is a 32-bit word, written here in hex. Example: compute R5 = (10 + 20) * 5, with the intermediate sum held in R3:

// Fields: opcode rd rs1 imm16 0 (I-type) or opcode rd rs1 rs2 unused (R-type)
// MOV R1, 10     → F 1 0 000A 0 → F100_00A0
// MOV R2, 20     → F 2 0 0014 0 → F200_0140
// MOV R4, 5      → F 4 0 0005 0 → F400_0050
// ADD R3, R1, R2 → 0 3 1 2 0000 → 0312_0000
// MUL R5, R3, R4 → 2 5 3 4 0000 → 2534_0000
// HALT                          → 0000_FFFF

In Verilog testbench:

dut.u_imem.mem[0] = 32'hF100_00A0;  // MOV R1, 10
dut.u_imem.mem[1] = 32'hF200_0140;  // MOV R2, 20
dut.u_imem.mem[2] = 32'hF400_0050;  // MOV R4, 5
dut.u_imem.mem[3] = 32'h0312_0000;  // ADD R3, R1, R2
dut.u_imem.mem[4] = 32'h2534_0000;  // MUL R5, R3, R4
dut.u_imem.mem[5] = 32'h0000_FFFF;  // HALT
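
Hand-assembling hex words is error-prone, so a testbench could build them from fields instead. The following is a small sketch under that assumption. The function names enc_r and enc_i and the module name are hypothetical and not part of the project. It only packs fields according to the two encodings described earlier and compares the results with the hand-assembled words listed above.

```verilog
// Hypothetical encoder helpers; enc_r / enc_i are illustrative names.
module encode_sketch;
    // R-type: opcode | rd | rs1 | rs2 | 16 unused bits
    function [31:0] enc_r;
        input [3:0] op;
        input [3:0] rd;
        input [3:0] rs1;
        input [3:0] rs2;
        begin
            enc_r = {op, rd, rs1, rs2, 16'h0000};
        end
    endfunction

    // I-type: opcode | rd | rs1 | imm16 | 4 zero bits
    function [31:0] enc_i;
        input [3:0]  op;
        input [3:0]  rd;
        input [3:0]  rs1;
        input [15:0] imm;
        begin
            enc_i = {op, rd, rs1, imm, 4'h0};
        end
    endfunction

    initial begin
        // Each check prints only if the packed word differs from the
        // hand-assembled value in the program listing.
        if (enc_i(4'hF, 4'd1, 4'd0, 16'd10) !== 32'hF100_00A0) $display("MOV R1 mismatch");
        if (enc_i(4'hF, 4'd2, 4'd0, 16'd20) !== 32'hF200_0140) $display("MOV R2 mismatch");
        if (enc_i(4'hF, 4'd4, 4'd0, 16'd5)  !== 32'hF400_0050) $display("MOV R4 mismatch");
        if (enc_r(4'h0, 4'd3, 4'd1, 4'd2)   !== 32'h0312_0000) $display("ADD mismatch");
        if (enc_r(4'h2, 4'd5, 4'd3, 4'd4)   !== 32'h2534_0000) $display("MUL mismatch");
        $display("encoding check done");
        $finish;
    end
endmodule
```

If the functions were placed in the real testbench, a line such as `dut.u_imem.mem[3] = enc_r(4'h0, 4'd3, 4'd1, 4'd2);` would replace the literal hex word.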

Design Decisions & Key Concepts

SIMT Parallelism

All 4 threads share one instruction stream but have private register files. On every clock cycle in which a valid instruction is issued, all 4 ALUs compute simultaneously. This is the same principle used in NVIDIA and AMD GPUs.

Thread Mask

The thread_mask[3:0] input allows disabling specific threads (for handling divergence in future extensions). With 4'b1111, all 4 threads run.
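
One way to picture lockstep execution with a mask is a generate loop that stamps out four lanes, each with private state and each gated by its own mask bit. The sketch below assumes a much simpler lane than the real thread_processor.v (a single accumulator instead of a register file and ALU). Apart from thread_mask, the module, signal, and port names are invented for illustration.

```verilog
// Hypothetical lockstep lanes: one broadcast operand, private per-lane
// accumulators, and a per-lane enable taken from a 4-bit mask.
module simt_mask_sketch (
    input  wire         clk,
    input  wire         rst,
    input  wire         issue,        // a valid instruction is issued this cycle
    input  wire [3:0]   thread_mask,  // bit i enables lane i
    input  wire [31:0]  operand,      // broadcast to every lane
    output wire [127:0] acc_flat      // lane i occupies bits [32*i +: 32]
);
    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : lane
            reg [31:0] acc;  // private state for this lane

            always @(posedge clk) begin
                if (rst)
                    acc <= 32'd0;
                else if (issue && thread_mask[i])
                    acc <= acc + operand;  // same operation in every enabled lane
                // A masked-off lane simply holds its value.
            end

            assign acc_flat[32*i +: 32] = acc;
        end
    endgenerate
endmodule
```

With thread_mask = 4'b1111 every lane updates on each issued instruction. With 4'b0101 only lanes 0 and 2 update while lanes 1 and 3 keep their previous state.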

Single-Cycle Execution

TinyGPU executes each instruction in a single cycle: fetch, decode, execute, and write-back all happen within one clock cycle. This keeps the design simple while still demonstrating correct parallel behavior.

Register File (R0 = Zero)

R0 is hardwired to 0 (writes to R0 are ignored), consistent with RISC design conventions.
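
A hardwired-zero register is commonly built by dropping writes to index 0 and forcing reads of index 0 to zero. The sketch below shows that pattern for a 16×32-bit, dual-read file. It is not the project's register_file.v. The port names are assumptions, and the combinational read ports are assumed here only because they suit single-cycle execution.

```verilog
// Hypothetical 16x32 register file with R0 tied to zero.
module regfile_r0_sketch (
    input  wire        clk,
    input  wire        we,
    input  wire [3:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [3:0]  raddr1,
    input  wire [3:0]  raddr2,
    output wire [31:0] rdata1,
    output wire [31:0] rdata2
);
    reg [31:0] regs [0:15];  // entry 0 is never written or read

    // Writes to R0 are silently ignored.
    always @(posedge clk) begin
        if (we && (waddr != 4'd0))
            regs[waddr] <= wdata;
    end

    // Reads of R0 always return zero, regardless of array contents.
    assign rdata1 = (raddr1 == 4'd0) ? 32'd0 : regs[raddr1];
    assign rdata2 = (raddr2 == 4'd0) ? 32'd0 : regs[raddr2];
endmodule
```

Guarding both the write and the read means R0 reads as zero even before any reset or initialization of the array.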

Divide-by-Zero Handling

DIV and MOD by zero return 0xFFFFFFFF and flag_overflow is asserted — the GPU does not hang or produce undefined behavior.
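
The behavior described above amounts to checking the divisor before dividing. The following is a minimal simulation-oriented sketch, assuming unsigned operands and a one-bit is_mod select. Both are assumptions, as are the module and port names other than flag_overflow. The real alu.v may be structured differently.

```verilog
// Hypothetical guard around DIV / MOD; operands are treated as unsigned.
module divmod_guard_sketch (
    input  wire [31:0] a,
    input  wire [31:0] b,
    input  wire        is_mod,        // 0: DIV, 1: MOD
    output reg  [31:0] result,
    output reg         flag_overflow
);
    always @(*) begin
        if (b == 32'd0) begin
            // Divide-by-zero: return all ones and raise the flag.
            result        = 32'hFFFF_FFFF;
            flag_overflow = 1'b1;
        end else begin
            result        = is_mod ? (a % b) : (a / b);
            flag_overflow = 1'b0;
        end
    end
endmodule
```

Because every branch assigns both outputs, the block stays purely combinational and never produces an unknown result for a zero divisor.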

Resource Estimates (Artix-7 XC7A35T)

Resource	Estimated Usage	Available
LUT	~850	20,800
FF	~320	41,600
BRAM	0	50
DSP48	4 (for MUL)	90

The design easily fits even on the smallest Artix-7 devices.

Testbench Coverage

Test	Operation	Checks (across 4 threads)
1	MOV	4 × 2 checks
2	ADD	4 checks
3	ADDI	4 checks
4	SUB	4 checks
5	MUL	4 checks
6	DIV	4 checks
7	AND	4 checks
8	OR	4 checks
9	XOR	4 checks
10	NOT	4 checks
11	SHL	4 checks
12	SHR	4 checks
13	NAND	4 checks
14	NOR	4 checks
15	XNOR	4 checks
16	MOD	4 checks
17	Chain ops	4 checks
18	SIMT parallel	16 checks
19	Div-by-zero	4 checks
20	Large MUL	4 checks
Total		96 checks, 100% pass

Author Notes

This implementation faithfully captures the core GPU execution model:

SIMT: threads run in lockstep, each with private state
ISA: RISC-style fixed-width 32-bit encoding
ALU: complete set of arithmetic, logical, bitwise, and shift operations
Parallelism: up to 4× throughput on data-parallel workloads, compared with issuing the same operations to a single ALU one thread at a time

The design is intentionally lean — no cache, no shared memory, no VRAM — making it ideal for understanding the GPU execution model without complexity.
