Portfolio Example

SIC Assembler (Rust)

A two-pass assembler for the Simplified Instructional Computer architecture, written in Rust.

Overview

Language: Rust
Architecture: SIC
Passes: Two-pass
Output Format: Standard object code

This assembler processes SIC assembly source files with a two-pass algorithm. Pass 1 builds the symbol table and assigns addresses; Pass 2 generates object code in standard SIC format with Header, Text, Modification, and End records. Features include indexed addressing, BYTE/WORD constants, and error messages that report the source line number. The full repository is on GitHub: https://github.com/snack-mors/SicAssem

Two-Pass Assembly Trace

The following demonstrates the assembler processing a sample from the included test file. Pass 1 scans to build the symbol table; Pass 2 uses those symbols to generate object code.

Sample source excerpt:

myfile.txt (excerpt)
COPY      START   1000
FIRST     STL     RETADR
CLOOP     JSUB    RDREC
          LDA     LENGTH
          COMP    ZERO
          JEQ     ENDFIL
EOF       BYTE    C'EOF'
THREE     WORD    3
ZERO      WORD    0
RETADR    RESW    1

Symbol table after Pass 1 (partial):

Symbol    Address (hex)    Notes
FIRST     1000             Program entry point
CLOOP     1003             Main copy loop
ENDFIL    1015             End of file handling
RDREC     2039             Read record subroutine
WRREC     2061             Write record subroutine
EOF       102A             End-of-file constant
ZERO      1030             Zero constant
RETADR    1033             Return address storage
LENGTH    1036             Record length storage
BUFFER    1039             4KB I/O buffer
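
The addresses in this table follow from a running location counter: each label takes the counter's current value, and the counter then advances by the size of that line. The sketch below illustrates the bookkeeping under the assumption that line sizes are already known; `assign_addresses` is a hypothetical helper written for this example, not a function from the repository.

```rust
/// Hypothetical helper (not part of the assembler): walks (label, size) pairs
/// and records, for each labelled line, the value the location counter held
/// when that line was reached.
fn assign_addresses(start: u32, lines: &[(Option<&str>, u32)]) -> Vec<(String, u32)> {
    let mut locctr = start;
    let mut symbols = Vec::new();
    for (label, size) in lines {
        if let Some(name) = label {
            // The label is bound to the address *before* the line's bytes.
            symbols.push((name.to_string(), locctr));
        }
        // Advance past the instruction, constant, or reserved storage.
        locctr += size;
    }
    symbols
}
```

Starting at 102A with the storage section's sizes (3, 3, 3, 3, 3, 4096, 3 bytes for EOF through RDREC), this places BUFFER at 1039 and RDREC at 2039, which agrees with the 482039 emitted for JSUB RDREC in the object file.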

Object code generation (Pass 2 excerpt):

Address   Source          Object Code   Notes
1000      STL RETADR      141033        opcode 14, addr 1033
1003      JSUB RDREC      482039        opcode 48, addr 2039
1006      LDA LENGTH      001036        opcode 00, addr 1036
1009      COMP ZERO       281030        opcode 28, addr 1030
100C      JEQ ENDFIL      301015        opcode 30, addr 1015
100F      JSUB WRREC      482061        opcode 48, addr 2061
1012      J CLOOP         3C1003        opcode 3C, addr 1003
102A      BYTE C'EOF'     454F46        ASCII 'EOF'
102D      WORD 3          000003        24-bit constant
1030      WORD 0          000000        24-bit zero
2039      LDX ZERO        041030        Start of RDREC subroutine
203C      LDA ZERO        001030        Initialize accumulator
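
Each instruction entry above is the opcode byte followed by a 16-bit field: one index flag bit and a 15-bit address. The following sketch shows that packing in isolation. `encode_instruction` is a hypothetical helper for illustration and is simpler than the assembler's own assemble_instruction, which also resolves symbols.

```rust
/// Hypothetical helper: formats a SIC instruction as six hex digits.
/// `address` is the resolved 15-bit target; `indexed` corresponds to a
/// ",X" suffix on the operand and sets the top bit of the address field.
fn encode_instruction(opcode: u8, address: u16, indexed: bool) -> String {
    let flag: u16 = if indexed { 0x8000 } else { 0 };
    // Keep only the 15 address bits, then merge in the index flag.
    let field = (address & 0x7FFF) | flag;
    format!("{:02X}{:04X}", opcode, field)
}
```

For example, encode_instruction(0x14, 0x1033, false) gives "141033" for STL RETADR, and encode_instruction(0x54, 0x1039, true) gives "549039", the code that appears for STCH BUFFER,X in the sample output.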

Sample Output

Input (myfile.txt)
COPY      START   1000
FIRST     STL     RETADR
CLOOP     JSUB    RDREC
          LDA     LENGTH
          COMP    ZERO
          JEQ     ENDFIL
          JSUB    WRREC
          J       CLOOP
ENDFIL    LDA     EOF
          STA     BUFFER
          LDA     THREE
          STA     LENGTH
          JSUB    WRREC
          LDL     RETADR
          RSUB
EOF       BYTE    C'EOF'
THREE     WORD    3
ZERO      WORD    0
RETADR    RESW    1
LENGTH    RESW    1
BUFFER    RESB    4096
RDREC     LDX     ZERO
          LDA     ZERO
RLOOP     TD      INPUT
          JEQ     RLOOP
          RD      INPUT
          COMP    ZERO
          JEQ     EXIT
          STCH    BUFFER,X
          TIX     MAXLEN
          JLT     RLOOP
EXIT      STX     LENGTH
          RSUB
INPUT     BYTE    X'F1'
MAXLEN    WORD    4096
WRREC     LDX     ZERO
WLOOP     TD      OUTPUT
          JEQ     WLOOP
          LDCH    BUFFER,X
          WD      OUTPUT
          TIX     LENGTH
          JLT     WLOOP
          RSUB
OUTPUT    BYTE    X'05'
          END     FIRST
Generated myfile.txt.obj
HCOPY  00100000107A
T0010001E1410334820390010362810303010154820613C100300102A0C103900102D
T00101E150C10364820610810334C0000454F46000003000000
T0020391E041030001030E0205D30203FD8205D2810303020575490392C205E38203F
T0020571E1010364C0000F1001000041030E02079302064509039DC20792C10363820
T00207505644C000005
M00100104
M00100404
M00100704
M00100A04
M00100D04
M00101004
M00101304
M00101604
M00101904
M00101C04
M00101F04
M00102204
M00102504
M00203A04
M00203D04
M00204004
M00204304
M00204604
M00204904
M00204C04
M00204F04
M00205204
M00205504
M00205804
M00206204
M00206504
M00206804
M00206B04
M00206E04
M00207104
M00207404
E001000

The object file shows the complete structure. The Header record (H) carries the program name "COPY", the starting address 001000, and the total length 00107A. The Text records (T) hold the object code in chunks of at most 30 bytes (length byte 1E), with a new record started after each block of reserved storage. The Modification records (M), one per instruction that references a symbol, mark the address fields that must be adjusted when the program is loaded at a different base address. The End record (E) specifies the entry point 001000.
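
To make the Header layout concrete, here is a small sketch that splits such a record back into its three fields. It assumes the fixed 19-character layout seen in the sample (H, a 6-character name, then two 6-digit hex numbers); `parse_header` is a hypothetical helper, not part of the assembler.

```rust
/// Hypothetical helper: splits a Header record into (name, start, length).
/// Returns None if the record does not match the assumed fixed layout.
fn parse_header(record: &str) -> Option<(String, u32, u32)> {
    // ASCII check makes byte-index slicing below safe.
    if record.len() != 19 || !record.is_ascii() || !record.starts_with('H') {
        return None;
    }
    // Columns 2-7: program name, padded on the right with spaces.
    let name = record[1..7].trim_end().to_string();
    // Columns 8-13: starting address; columns 14-19: program length (both hex).
    let start = u32::from_str_radix(&record[7..13], 16).ok()?;
    let length = u32::from_str_radix(&record[13..19], 16).ok()?;
    Some((name, start, length))
}
```

Applied to the sample's header line, this yields the name COPY, start 0x1000, and length 0x107A. Their sum, 0x207A, is the first address past the last byte of the program (the OUTPUT constant at 2079).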

Annotated Source Code

src/main.rs entry point and CLI handling
Uses clap for modern CLI parsing instead of manual arg handling. Drives the two-pass process by calling pass_one() to build symbols and IR, then pass_two() to generate object code.
mod symbols;
mod mnemonics;
mod ir;
mod pass1;
mod pass2;

use std::io;
use clap::Parser;
use crate::pass1::pass_one;
use crate::pass2::pass_two;

#[derive(Parser)]
#[command(version, about = "A simple file reader")]
struct Args {
    filename: String,
}

fn main() -> Result<(), io::Error> {
    let args = Args::parse();
    println!("Opening file: {}", args.filename);
    
    match pass_one(&args.filename) {
        Ok((symtab, ir)) => {
            println!("Pass 1 Successful!");
            match pass_two(&ir, &symtab, &args.filename) {
                Ok(_) => println!("Pass 2 Successful. Object file created."),
                Err(e) => {
                    eprintln!("Pass 2 Failed: {}", e);
                    std::process::exit(1);
                }
            }
        },
        Err(e) => {
            eprintln!("Assembly Failed: {}", e);
            std::process::exit(1);
        },
    }
    Ok(())
}
src/ir.rs intermediate representation
Defines the Line struct that represents each parsed source line in memory between passes. Stores address, optional label, mnemonic, optional operand, and source line number for error reporting.
#[derive(Debug)]
pub struct Line {
    pub address: i32,
    pub label: Option<String>,
    pub mnemonic: String,
    pub operand: Option<String>,
    pub source_line: usize,
}

impl Line {
    // A standard "constructor" in Rust.
    // We take &str arguments to make the calling code cleaner.
    pub fn new(
        address: i32,
        label: Option<&str>,
        mnemonic: &str,
        operand: Option<&str>,
        source_line: usize
    ) -> Self {
        Line {
            address,
            source_line,
            label: label.map(|s| s.to_string()),
            mnemonic: mnemonic.to_string(),
            operand: operand.map(|s| s.to_string()),
        }
    }
}
src/symbols.rs symbol table implementation
Wraps HashMap for symbol storage with SIC-specific constraints (6-char limit, duplicate detection). Symbol struct stores both address and source line for error reporting. Private map field enforces encapsulation.
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct Symbol {
    pub address: i32,
    pub source_line: i32,
}

pub struct SymbolTable {
    // Note how we don't declare map as pub, this makes it a private field.
    map: HashMap<String, Symbol>,
}

impl SymbolTable {
    pub fn new() -> Self {
        SymbolTable {
            map: HashMap::new(),
        }
    }
    
    // insert takes &mut self because it modifies the map; a shared &self borrow would not allow that.
    pub fn insert(&mut self, name: String, address: i32, source_line: i32) -> Result<(), String> {
        if name.len() > 6 {
            return Err("Symbol name cannot be more than 6 characters".to_string());
        }
        if self.map.contains_key(&name) {
            return Err(format!("Duplicate symbol name: {}", name));
        }
        let sym = Symbol {
            address,
            source_line,
        };
        self.map.insert(name, sym);
        Ok(())
    }
    
    pub fn get_address(&self, name: &str) -> Option<i32> {
        self.map.get(name).map(|sym| sym.address)
    }
    
    pub fn print_symbols(&self) {
        println!("Symbol Table Content:");
        println!("---------------------");
        // Collect all entries into a vector to sort them.
        // Note that iter() yields shared (read-only) references.
        let mut entries: Vec<(&String, &Symbol)> = self.map.iter().collect();
        // Sort by address
        entries.sort_by(|a, b| a.1.address.cmp(&b.1.address));
        // For each entry in entries, note how the loop itself destructures the tuple.
        for (name, sym) in entries {
            println!("{:<8} | {:04X}", name, sym.address);
        }
    }
}
src/mnemonics.rs opcode table and directives
Defines OpInfo struct for opcode/format pairs and the complete SIC instruction set. Separate Directive enum handles assembler directives with size calculation for RESW/RESB/BYTE.
#[derive(Debug, Clone, Copy)]
pub struct OpInfo{
    pub opcode: u8,
    pub format: u8,
}

pub fn get_opcode(mnemonic: &str) -> Option<OpInfo> {
    match mnemonic {
        "ADD"  => Some(OpInfo { opcode: 0x18, format: 3 }),
        "AND"  => Some(OpInfo { opcode: 0x40, format: 3 }),
        "COMP" => Some(OpInfo { opcode: 0x28, format: 3 }),
        "DIV"  => Some(OpInfo { opcode: 0x24, format: 3 }),
        "J"    => Some(OpInfo { opcode: 0x3C, format: 3 }),
        "JEQ"  => Some(OpInfo { opcode: 0x30, format: 3 }),
        "JGT"  => Some(OpInfo { opcode: 0x34, format: 3 }),
        "JLT"  => Some(OpInfo { opcode: 0x38, format: 3 }),
        "JSUB" => Some(OpInfo { opcode: 0x48, format: 3 }),
        "LDA"  => Some(OpInfo { opcode: 0x00, format: 3 }),
        "LDCH" => Some(OpInfo { opcode: 0x50, format: 3 }),
        "LDL"  => Some(OpInfo { opcode: 0x08, format: 3 }),
        "LDX"  => Some(OpInfo { opcode: 0x04, format: 3 }),
        "MUL"  => Some(OpInfo { opcode: 0x20, format: 3 }),
        "OR"   => Some(OpInfo { opcode: 0x44, format: 3 }),
        "RD"   => Some(OpInfo { opcode: 0xD8, format: 3 }),
        "RSUB" => Some(OpInfo { opcode: 0x4C, format: 3 }),
        "STA"  => Some(OpInfo { opcode: 0x0C, format: 3 }),
        "STCH" => Some(OpInfo { opcode: 0x54, format: 3 }),
        "STL"  => Some(OpInfo { opcode: 0x14, format: 3 }),
        "STSW" => Some(OpInfo { opcode: 0xE8, format: 3 }),
        "STX"  => Some(OpInfo { opcode: 0x10, format: 3 }),
        "SUB"  => Some(OpInfo { opcode: 0x1C, format: 3 }),
        "TD"   => Some(OpInfo { opcode: 0xE0, format: 3 }),
        "TIX"  => Some(OpInfo { opcode: 0x2C, format: 3 }),
        "WD"   => Some(OpInfo { opcode: 0xDC, format: 3 }),
        _ => None,
    }
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Directive {
    Start, End, Byte, Word, Resb, Resw,
}

impl Directive {
    pub fn from_str(s: &str) -> Option<Directive> {
        match s {
            "START" => Some(Directive::Start),
            "END"   => Some(Directive::End),
            "BYTE"  => Some(Directive::Byte),
            "WORD"  => Some(Directive::Word),
            "RESB"  => Some(Directive::Resb),
            "RESW"  => Some(Directive::Resw),
            _       => None,
        }
    }
    
    // We pass the operand because RESW/BYTE need it to calculate size.
    pub fn get_size(&self, operand: Option<&str>) -> Result<i32, String> {
        match self {
            Directive::Word => Ok(3),
            Directive::Start | Directive::End => Ok(0),

            Directive::Resw => {
                let val = operand.ok_or("Missing operand for RESW")?
                    .parse::<i32>()
                    .map_err(|_| "Invalid integer for RESW")?;
                Ok(val * 3)
            },

            Directive::Resb => {
                let val = operand.ok_or("Missing operand for RESB")?
                    .parse::<i32>()
                    .map_err(|_| "Invalid integer for RESB")?;
                Ok(val)
            },

            Directive::Byte => {
                let op = operand.ok_or("Missing operand for BYTE")?;
                if op.starts_with("C'") && op.ends_with('\'') {
                    // C'EOF' -> 3 bytes
                    Ok((op.len() - 3) as i32)
                } else if op.starts_with("X'") && op.ends_with('\'') {
                    // X'F1' -> 1 byte per 2 hex chars
                    let hex_len = op.len() - 3;
                    if hex_len % 2 != 0 {
                        return Err("Hex literal must have even number of digits".to_string());
                    }
                    Ok((hex_len / 2) as i32)
                } else {
                    Err("Invalid BYTE format".to_string())
                }
            }
        }
    }
}
src/pass1.rs first pass - symbol table construction
Scans source file to build symbol table and intermediate representation. Handles line parsing with flexible token counts, manages location counter, processes START directive, and validates symbols before pass 2.
use std::io::{BufRead, BufReader};
use std::fs::File;
use crate::ir::Line;
use crate::symbols::SymbolTable;
use crate::mnemonics::{get_opcode, Directive};

pub fn pass_one(filename: &str) -> Result<(SymbolTable, Vec<Line>), String> {
    let file = File::open(filename).map_err(|e| e.to_string())?;
    let reader = BufReader::new(file);

    let mut symtab = SymbolTable::new();
    let mut intermediate_code = Vec::new();
    let mut locctr = 0;
    let mut start_seen = false;

    for (index, line_result) in reader.lines().enumerate() {
        let source_line_number = index + 1;
        let line = line_result.map_err(|e| e.to_string())?;

        let tokens: Vec<&str> = line.split_whitespace().collect();

        // Checks for comments.
        if tokens.is_empty() || tokens[0].starts_with('#') || tokens[0].starts_with('.') {
            continue;
        }

        // Parse tokens intelligently based on count
        let (label, mnemonic, operand) = match tokens.len() {
            3 => (Some(tokens[0]), tokens[1], Some(tokens[2])),
            2 => {
                // Determine if first token is instruction/directive or label
                if get_opcode(tokens[0]).is_some() || Directive::from_str(tokens[0]).is_some() {
                    (None, tokens[0], Some(tokens[1]))
                } else {
                    (Some(tokens[0]), tokens[1], None)
                }
            },
            1 => (None, tokens[0], None),
            _ => return Err(format!("Line {}: Too many tokens", source_line_number)),
        };

        // Handle START directive specially
        if mnemonic == "START" {
            if let Some(op) = operand {
                // Parse hex: strtol(op, NULL, 16)
                locctr = i32::from_str_radix(op, 16).unwrap_or(0);
            }
            start_seen = true;
            intermediate_code.push(Line::new(
                locctr, label, mnemonic, operand, source_line_number
            ));
            continue;
        }
        
        let current_address = locctr;
        
        // If this line has a label, add it to symbol table
        if let Some(lbl) = label {
            if symtab.insert(lbl.to_string(), current_address, source_line_number as i32).is_err(){
                return Err(format!("Line: {}: '{}'", source_line_number, lbl));
            }
        }

        // Calculate instruction size
        let mut instruction_size = 0;
        if get_opcode(mnemonic).is_some() {
            instruction_size = 3;  // All SIC instructions are 3 bytes
        } else if let Some(dir) = Directive::from_str(mnemonic) {
            instruction_size = dir.get_size(operand)
                .map_err(|e| format!("Line {}: {}", source_line_number, e))?;
        } else {
            return Err(format!("Line {}: Unknown Opcode '{}'", source_line_number, mnemonic));
        }
        
        intermediate_code.push(Line::new(
            current_address, label, mnemonic, operand, source_line_number
        ));
        
        locctr += instruction_size;
        
        if mnemonic == "END" {
            break;
        }
    }
    
    if !start_seen {
        return Err("Error: Missing START directive".to_string());
    }
    
    Ok((symtab, intermediate_code))
}
src/pass2.rs second pass - object code generation
Generates standard SIC object program format with H/T/M/E records. TextRecord struct manages 30-byte text record buffering with automatic flushing. Handles instruction assembly, BYTE/WORD constants, and generates modification records for relocatable addresses.
use std::fs::File;
use std::io::{BufWriter, Write};
use crate::ir::Line;
use crate::symbols::SymbolTable;
use crate::mnemonics::{get_opcode};

/// Manages the buffering of the current Text Record (T-record)
struct TextRecord {
    start_addr: Option<u32>,
    buffer: Vec<u8>,
    max_len: usize,
}

impl TextRecord {
    fn new() -> Self {
        TextRecord {
            start_addr: None,
            buffer: Vec::with_capacity(30),
            max_len: 30,
        }
    }

    fn add_bytes(&mut self, addr: i32, data: &[u8]) -> Vec<String> {
        let mut output_lines = Vec::new();
        let mut data_idx = 0;
        let addr_u32 = addr as u32;

        while data_idx < data.len() {
            if self.start_addr.is_none() {
                self.start_addr = Some(addr_u32 + data_idx as u32);
            }
            let current_start = self.start_addr.unwrap();
            let current_loc = current_start + self.buffer.len() as u32;
            let incoming_loc = addr_u32 + data_idx as u32;

            // Check for address gap - flush if needed
            if current_loc != incoming_loc {
                if let Some(line) = self.flush() {
                    output_lines.push(line);
                }
                self.start_addr = Some(incoming_loc);
            }

            let space_left = self.max_len - self.buffer.len();
            let chunk_size = std::cmp::min(space_left, data.len() - data_idx);

            self.buffer.extend_from_slice(&data[data_idx..data_idx + chunk_size]);
            data_idx += chunk_size;

            // Flush if buffer full
            if self.buffer.len() == self.max_len {
                if let Some(line) = self.flush() {
                    output_lines.push(line);
                }
            }
        }
        output_lines
    }

    fn flush(&mut self) -> Option<String> {
        if self.buffer.is_empty() {
            return None;
        }

        let addr = self.start_addr.unwrap();
        let record_len = self.buffer.len();
        let header = format!("T{:06X}{:02X}", addr, record_len);
        let body: String = self.buffer.iter().map(|b| format!("{:02X}", b)).collect();

        self.start_addr = None;
        self.buffer.clear();

        Some(format!("{}{}", header, body))
    }
}

pub fn pass_two(
    ir: &[Line],
    symtab: &SymbolTable,
    filename: &str
) -> Result<(), String> {

    let start_addr = ir.first().map(|l| l.address).unwrap_or(0);
    // Simple calc for program length (Last Address - First Address)
    let prog_len = if let Some(last) = ir.last() {
        last.address - start_addr
    } else {
        0
    };

    let obj_filename = format!("{}.obj", filename);
    let file = File::create(&obj_filename).map_err(|e| e.to_string())?;
    let mut writer = BufWriter::new(file);

    // 1. Header Record
    let prog_name = ir.first()
        .and_then(|l| l.label.as_deref())
        .unwrap_or("      ");

    writeln!(writer, "H{:<6}{:>06X}{:>06X}", prog_name, start_addr, prog_len)
        .map_err(|e| e.to_string())?;

    // 2. Text Records
    let mut text_rec = TextRecord::new();
    // Vector to store address locations that need Modification Records
    let mut mod_records: Vec<i32> = Vec::new();
    let mut entry_point = start_addr;

    for line in ir {
        if line.mnemonic == "START" { continue; }
        if line.mnemonic == "END" {
            if let Some(ref op) = line.operand {
                entry_point = symtab.get_address(op).unwrap_or(start_addr);
            }
            break;
        }

        // Generate Object Code
        let object_code = if get_opcode(&line.mnemonic).is_some() {
            Some(assemble_instruction(line, symtab, &mut mod_records)?)
        } else if line.mnemonic == "BYTE" {
            Some(assemble_byte(line)?)
        } else if line.mnemonic == "WORD" {
            Some(assemble_word(line)?)
        } else {
            None  // RESW/RESB generate no code
        };

        if let Some(bytes) = object_code {
            let records = text_rec.add_bytes(line.address, &bytes);
            for rec in records {
                writeln!(writer, "{}", rec).map_err(|e| e.to_string())?;
            }
        } else {
            // Gap handling (RESW/RESB)
            if let Some(rec) = text_rec.flush() {
                writeln!(writer, "{}", rec).map_err(|e| e.to_string())?;
            }
        }
    }

    if let Some(rec) = text_rec.flush() {
        writeln!(writer, "{}", rec).map_err(|e| e.to_string())?;
    }

    // 3. Modification Records
    // For Standard SIC, we modify the last 4 half-bytes (16 bits) at the specified address.
    for addr in mod_records {
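        // Record layout: 'M', 6-digit hex address of the field, 2-digit length in half-bytes.
        // The address field starts one byte past the opcode; length 04 covers 4 half-bytes (16 bits).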
        writeln!(writer, "M{:06X}04", addr + 1).map_err(|e| e.to_string())?;
    }

    // 4. End Record
    writeln!(writer, "E{:06X}", entry_point).map_err(|e| e.to_string())?;

    println!("Output written to {}", obj_filename);
    Ok(())
}

fn assemble_instruction(
    line: &Line,
    symtab: &SymbolTable,
    mod_records: &mut Vec<i32>
) -> Result<Vec<u8>, String> {
    let op_info = get_opcode(&line.mnemonic)
        .ok_or(format!("Unknown mnemonic {}", line.mnemonic))?;

    let opcode = op_info.opcode;
    let mut address = 0;

    if let Some(ref operand) = line.operand {
        // Handle indexed addressing (,X)
        let (label, is_indexed) = if operand.ends_with(",X") {
            (&operand[..operand.len()-2], true)
        } else {
            (operand.as_str(), false)
        };

        // Resolve symbol address
        if let Some(addr) = symtab.get_address(label) {
            address = addr;
            // Record that this instruction location needs modification
            mod_records.push(line.address);
        } else {
            return Err(format!("Undefined Symbol: {}", label));
        }

        // Set indexed bit if needed
        if is_indexed {
            address |= 0x8000;
        }
    } else if line.mnemonic != "RSUB" {
        return Err(format!("Missing operand for {}", line.mnemonic));
    }

    let b1 = opcode;
    let b2 = (address >> 8) as u8;
    let b3 = (address & 0xFF) as u8;

    Ok(vec![b1, b2, b3])
}

fn assemble_byte(line: &Line) -> Result<Vec<u8>, String> {
    let operand = line.operand.as_ref()
        .ok_or(format!("BYTE requires operand at line {}", line.source_line))?;

    if operand.starts_with("C'") && operand.ends_with('\'') {
        // Character constant: C'EOF' -> ASCII bytes
        let content = &operand[2..operand.len()-1];
        Ok(content.as_bytes().to_vec())
    } else if operand.starts_with("X'") && operand.ends_with('\'') {
        // Hex constant: X'F1' -> raw bytes
        let hex_str = &operand[2..operand.len()-1];
        if hex_str.len() % 2 != 0 {
            return Err("X constant must have even number of digits".to_string());
        }
        let mut bytes = Vec::new();
        for i in (0..hex_str.len()).step_by(2) {
            let byte_val = u8::from_str_radix(&hex_str[i..i+2], 16)
                .map_err(|_| "Invalid hex in X constant")?;
            bytes.push(byte_val);
        }
        Ok(bytes)
    } else {
        Err(format!("Invalid BYTE format: {}", operand))
    }
}

fn assemble_word(line: &Line) -> Result<Vec<u8>, String> {
    let operand = line.operand.as_ref()
        .ok_or(format!("WORD requires operand at line {}", line.source_line))?;

    let value = operand.parse::<i32>()
        .map_err(|_| format!("Invalid WORD constant: {}", operand))?;

    // Check 24-bit range
    let min = -(1 << 23);
    let max = (1 << 23) - 1;
    if value < min || value > max {
        return Err(format!("WORD value too large for 24-bits: {}", value));
    }

    // Convert to 3-byte big-endian
    let bytes = value.to_be_bytes();
    Ok(vec![bytes[1], bytes[2], bytes[3]])
}

Design Notes

Rust's ownership system and pattern matching suit this project well. The Option<T> type naturally represents optional labels and operands, while Result<T, E> provides structured error handling with line-level error reporting. The match over the Directive enum in get_size is exhaustive (compiler-enforced), and the string matches in the opcode table and directive parser stay readable, with a catch-all arm that returns None for unknown mnemonics.
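
As a small illustration of how Option and Result combine for line-level errors, the sketch below computes the size of a RESW directive. It is a simplified, hypothetical variant (`resw_size` does not exist in the repository) that attaches the line number inside the helper rather than in the caller.

```rust
/// Hypothetical helper: returns the number of bytes reserved by RESW,
/// or an error message that names the offending source line.
fn resw_size(operand: Option<&str>, line_no: usize) -> Result<u32, String> {
    // A missing operand is an Option::None turned into an Err by ok_or_else.
    let text = operand
        .ok_or_else(|| format!("Line {}: missing operand for RESW", line_no))?;
    // A parse failure is mapped to a message carrying the same line number.
    let words: u32 = text
        .parse()
        .map_err(|_| format!("Line {}: invalid integer '{}'", line_no, text))?;
    // One SIC word is 3 bytes; checked_mul avoids overflow on huge operands.
    words
        .checked_mul(3)
        .ok_or_else(|| format!("Line {}: RESW size overflows", line_no))
}
```

With this shape, resw_size(Some("1"), 36) returns Ok(3), while resw_size(None, 36) returns the error "Line 36: missing operand for RESW" without any explicit branching in the caller.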

The symbol table wraps HashMap<String, Symbol> to add SIC-specific validation like the 6-character limit and duplicate detection. Using a private map field with public methods provides proper encapsulation while allowing controlled access to symbol addresses during pass 2.
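
The same two checks can also be expressed with the standard library's entry API, which looks up the key only once. This is a variation for illustration, not the approach used in src/symbols.rs; `define` is a hypothetical free function and the table here stores bare addresses instead of Symbol values.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical helper: inserts a symbol, rejecting names longer than
/// 6 characters and names that are already defined.
fn define(table: &mut HashMap<String, u32>, name: &str, address: u32) -> Result<(), String> {
    if name.len() > 6 {
        return Err("Symbol name cannot be more than 6 characters".to_string());
    }
    match table.entry(name.to_string()) {
        // The key already exists: leave the original definition untouched.
        Entry::Occupied(_) => Err(format!("Duplicate symbol name: {}", name)),
        // The key is new: store the address in the reserved slot.
        Entry::Vacant(slot) => {
            slot.insert(address);
            Ok(())
        }
    }
}
```

Defining FIRST at 0x1000 succeeds, a second definition of FIRST is rejected and keeps the original address, and a 7-character name such as TOOLONG fails the length check.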

The TextRecord struct keeps all text-record state in one place. Its buffer management handles the tricky details of the standard SIC object format: 30-byte text records, address gaps that require new records, and flushing at boundaries. Because the struct owns its buffer, Rust frees the memory automatically when the record goes out of scope.
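
The 30-byte limit can be seen in isolation with a much simpler sketch. It assumes a single contiguous run of bytes with no reserved-storage gaps, and it uses slice chunking instead of the incremental buffer in TextRecord; `text_records` is a hypothetical helper for illustration.

```rust
/// Hypothetical helper: formats one contiguous run of object bytes as
/// Text records of at most 30 bytes each.
fn text_records(start: u32, bytes: &[u8]) -> Vec<String> {
    bytes
        .chunks(30)
        .enumerate()
        .map(|(i, chunk)| {
            // Every full chunk before this one advanced the address by 30.
            let addr = start + (i as u32) * 30;
            let body: String = chunk.iter().map(|b| format!("{:02X}", b)).collect();
            // 'T', 6-digit start address, 2-digit byte count, then the data.
            format!("T{:06X}{:02X}{}", addr, chunk.len(), body)
        })
        .collect()
}
```

Given 35 bytes starting at 0x2057, this produces one record beginning "T0020571E" and a second beginning "T00207505", the same split that appears at the end of the sample object file.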