A two-pass assembler for the Simplified Instructional Computer architecture, written in Rust.
This assembler translates SIC assembly source files in two passes. Pass 1 assigns addresses and builds the symbol table; Pass 2 generates object code in the standard SIC format with Header, Text, Modification, and End records. It supports indexed addressing and BYTE/WORD constants, and reports errors with source line numbers. The full repository is on GitHub: https://github.com/snack-mors/SicAssem
The following demonstrates the assembler processing a sample from the included test file. Pass 1 scans to build the symbol table; Pass 2 uses those symbols to generate object code.
Sample source excerpt:
COPY START 1000
FIRST STL RETADR
CLOOP JSUB RDREC
LDA LENGTH
COMP ZERO
JEQ ENDFIL
EOF BYTE C'EOF'
THREE WORD 3
ZERO WORD 0
RETADR RESW 1
Symbol table after Pass 1 (partial):
| Symbol | Address (hex) | Notes |
|---|---|---|
| FIRST | 1000 | Program entry point |
| CLOOP | 1003 | Main copy loop |
| ENDFIL | 1015 | End of file handling |
| RDREC | 2039 | Read record subroutine |
| WRREC | 2061 | Write record subroutine |
| EOF | 102A | End-of-file constant |
| ZERO | 1030 | Zero constant |
| RETADR | 1033 | Return address storage |
| LENGTH | 1036 | Record length storage |
| BUFFER | 1039 | 4 KB I/O buffer |
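Symbol addresses come from location-counter arithmetic: each statement advances the counter by its size. The sketch below replays the data area of the sample program. `size_of` is a hypothetical helper written for this example, not a function from the repository, and it assumes well-formed operands.

```rust
/// Size in bytes that one statement adds to the location counter.
/// `size_of` is an illustrative helper, not part of the assembler itself.
fn size_of(mnemonic: &str, operand: &str) -> u32 {
    match mnemonic {
        "RESW" => 3 * operand.parse::<u32>().unwrap_or(0),
        "RESB" => operand.parse::<u32>().unwrap_or(0),
        // C'EOF' -> one byte per character between the quotes
        "BYTE" if operand.starts_with("C'") => operand.len().saturating_sub(3) as u32,
        // X'F1' -> one byte per two hex digits
        "BYTE" => (operand.len().saturating_sub(3) / 2) as u32,
        // every SIC instruction and every WORD occupies 3 bytes
        _ => 3,
    }
}

fn main() {
    // Data area of the sample program, starting at EOF (hex 102A).
    let statements = [
        ("EOF", "BYTE", "C'EOF'"),    // 102A
        ("THREE", "WORD", "3"),       // 102D
        ("ZERO", "WORD", "0"),        // 1030
        ("RETADR", "RESW", "1"),      // 1033
        ("LENGTH", "RESW", "1"),      // 1036
        ("BUFFER", "RESB", "4096"),   // 1039
        ("RDREC", "LDX", "ZERO"),     // 2039 = 1039 + 1000 (hex)
    ];
    let mut locctr: u32 = 0x102A;
    for &(label, mnemonic, operand) in statements.iter() {
        println!("{:<6} {:04X}", label, locctr);
        locctr += size_of(mnemonic, operand);
    }
}
```

The 4096-byte buffer is hex 1000 bytes long, which is why RDREC lands at 2039 and the third Text record of the object file starts there.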
Object code generation (Pass 2 excerpt):
| Address | Source | Object Code | Notes |
|---|---|---|---|
| 1000 | STL RETADR | 141033 | opcode 14, addr 1033 |
| 1003 | JSUB RDREC | 482039 | opcode 48, addr 2039 |
| 1006 | LDA LENGTH | 001036 | opcode 00, addr 1036 |
| 1009 | COMP ZERO | 281030 | opcode 28, addr 1030 |
| 100C | JEQ ENDFIL | 301015 | opcode 30, addr 1015 |
| 100F | JSUB WRREC | 482061 | opcode 48, addr 2061 |
| 1012 | J CLOOP | 3C1003 | opcode 3C, addr 1003 |
| 102A | BYTE C'EOF' | 454F46 | ASCII 'EOF' |
| 102D | WORD 3 | 000003 | 24-bit constant |
| 1030 | WORD 0 | 000000 | 24-bit zero |
| 2039 | LDX ZERO | 041030 | Start of RDREC subroutine |
| 203C | LDA ZERO | 001030 | Initialize accumulator |
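Each instruction word is an 8-bit opcode followed by a 1-bit index flag and a 15-bit address. The following is a minimal sketch of that layout; `encode` and `to_hex` are names chosen for this example and are not the project's `assemble_instruction`.

```rust
/// Encode a standard SIC instruction: 8-bit opcode, 1-bit index flag, 15-bit address.
/// `encode` is an illustrative helper, not the project's own function.
fn encode(opcode: u8, address: u16, indexed: bool) -> [u8; 3] {
    let mut addr = address & 0x7FFF; // keep the 15 address bits
    if indexed {
        addr |= 0x8000; // set the x bit for ",X" operands
    }
    [opcode, (addr >> 8) as u8, (addr & 0xFF) as u8]
}

/// Render bytes as upper-case hex, the way object files store them.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02X}", b)).collect()
}

fn main() {
    // STL RETADR: opcode 14, RETADR at 1033
    println!("{}", to_hex(&encode(0x14, 0x1033, false))); // 141033
    // STCH BUFFER,X: opcode 54, BUFFER at 1039, index bit set
    println!("{}", to_hex(&encode(0x54, 0x1039, true))); // 549039
    // RSUB has no operand, so the address field stays zero
    println!("{}", to_hex(&encode(0x4C, 0, false))); // 4C0000
}
```

The indexed case shows why `STCH BUFFER,X` appears as 549039 in the object file: the address 1039 with the top bit set becomes 9039.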
Full source listing:
COPY START 1000
FIRST STL RETADR
CLOOP JSUB RDREC
LDA LENGTH
COMP ZERO
JEQ ENDFIL
JSUB WRREC
J CLOOP
ENDFIL LDA EOF
STA BUFFER
LDA THREE
STA LENGTH
JSUB WRREC
LDL RETADR
RSUB
EOF BYTE C'EOF'
THREE WORD 3
ZERO WORD 0
RETADR RESW 1
LENGTH RESW 1
BUFFER RESB 4096
RDREC LDX ZERO
LDA ZERO
RLOOP TD INPUT
JEQ RLOOP
RD INPUT
COMP ZERO
JEQ EXIT
STCH BUFFER,X
TIX MAXLEN
JLT RLOOP
EXIT STX LENGTH
RSUB
INPUT BYTE X'F1'
MAXLEN WORD 4096
WRREC LDX ZERO
WLOOP TD OUTPUT
JEQ WLOOP
LDCH BUFFER,X
WD OUTPUT
TIX LENGTH
JLT WLOOP
RSUB
OUTPUT BYTE X'05'
END FIRST
Generated object file:
HCOPY  00100000107A
T0010001E1410334820390010362810303010154820613C100300102A0C103900102D
T00101E150C10364820610810334C0000454F46000003000000
T0020391E041030001030E0205D30203FD8205D2810303020575490392C205E38203F
T0020571E1010364C0000F1001000041030E02079302064509039DC20792C10363820
T00207505644C000005
M00100104
M00100404
M00100704
M00100A04
M00100D04
M00101004
M00101304
M00101604
M00101904
M00101C04
M00101F04
M00102204
M00102504
M00203A04
M00203D04
M00204004
M00204304
M00204604
M00204904
M00204C04
M00204F04
M00205204
M00205504
M00205804
M00206204
M00206504
M00206804
M00206B04
M00206E04
M00207104
M00207404
E001000
The object file shows the complete structure. The Header record (H) carries the program name "COPY", the starting address 001000, and the total length 00107A. Each Text record (T) holds up to 30 bytes (length byte 1E) of object code; a new record also begins after the area reserved by RESW/RESB, which is why the third record starts at 002039. The Modification records (M), one per instruction with a symbolic operand, mark the 4-half-byte address fields that must be adjusted when the program is loaded at a different base address. The End record (E) gives the entry point 001000.
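To make the role of the Modification records concrete, here is a sketch of how a loader might apply one. This is an assumption about loader behavior (adding the load offset to the 16-bit field named by the record), not code from the repository, and `relocate` and its parameters are hypothetical names.

```rust
/// Apply one modification record to a loaded program image.
/// `field_addr` is the address from the M record (instruction address + 1),
/// `origin` is the address the program was assembled for, and `delta` is
/// how far the loader moved it. All names are illustrative.
fn relocate(image: &mut [u8], origin: u32, field_addr: u32, delta: u16) {
    let i = (field_addr - origin) as usize;
    // Length 04 means 4 half-bytes: the two bytes after the opcode.
    let field = u16::from_be_bytes([image[i], image[i + 1]]);
    let patched = field.wrapping_add(delta);
    image[i..i + 2].copy_from_slice(&patched.to_be_bytes());
}

fn main() {
    // First two instructions of the sample: STL RETADR, JSUB RDREC.
    let mut image: Vec<u8> = vec![0x14, 0x10, 0x33, 0x48, 0x20, 0x39];
    // Records M00100104 and M00100404, with the program moved up by hex 3000.
    relocate(&mut image, 0x1000, 0x1001, 0x3000);
    relocate(&mut image, 0x1000, 0x1004, 0x3000);
    let hex: String = image.iter().map(|b| format!("{:02X}", b)).collect();
    println!("{}", hex); // 144033485039
}
```

The opcode bytes (14 and 48) are untouched; only the address fields change, from 1033 to 4033 and from 2039 to 5039.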
Crate root with the command-line entry point:
mod symbols;
mod mnemonics;
mod ir;
mod pass1;
mod pass2;
use std::io;
use clap::Parser;
use crate::pass1::pass_one;
use crate::pass2::pass_two;
#[derive(Parser)]
#[command(version, about = "A simple file reader")]
struct Args {
filename: String,
}
fn main() -> Result<(), io::Error> {
let args = Args::parse();
println!("Opening file: {}", args.filename);
match pass_one(&args.filename) {
Ok((symtab, ir)) => {
println!("Pass 1 Successful!");
match pass_two(&ir, &symtab, &args.filename) {
Ok(_) => println!("Pass 2 Successful. Object file created."),
Err(e) => {
eprintln!("Pass 2 Failed: {}", e);
std::process::exit(1);
}
}
},
Err(e) => {
eprintln!("Assembly Failed: {}", e);
std::process::exit(1);
},
}
Ok(())
}
Module `ir` (intermediate representation of one source line):
#[derive(Debug)]
pub struct Line {
pub address: i32,
pub label: Option<String>,
pub mnemonic: String,
pub operand: Option<String>,
pub source_line: usize,
}
impl Line {
// A standard "constructor" in Rust.
// We take &str arguments to make the calling code cleaner.
pub fn new(
address: i32,
label: Option<&str>,
mnemonic: &str,
operand: Option<&str>,
source_line: usize
) -> Self {
Line {
address,
source_line,
label: label.map(|s| s.to_string()),
mnemonic: mnemonic.to_string(),
operand: operand.map(|s| s.to_string()),
}
}
}
Module `symbols` (symbol table):
use std::collections::HashMap;
#[derive(Debug, Clone)]
pub struct Symbol {
pub address: i32,
pub source_line: i32,
}
pub struct SymbolTable {
// Note how we don't declare map as pub, this makes it a private field.
map: HashMap<String, Symbol>,
}
impl SymbolTable {
pub fn new() -> Self {
SymbolTable {
map: HashMap::new(),
}
}
// Note that insert takes &mut self: it needs a mutable borrow because it modifies the map.
pub fn insert(&mut self, name: String, address: i32, source_line: i32) -> Result<(), String> {
if name.len() > 6 {
return Err("Symbol name cannot be more than 6 characters".to_string());
}
if self.map.contains_key(&name) {
return Err(format!("Duplicate symbol name: {}", name));
}
let sym = Symbol {
address,
source_line,
};
self.map.insert(name, sym);
Ok(())
}
pub fn get_address(&self, name: &str) -> Option<i32> {
self.map.get(name).map(|sym| sym.address)
}
pub fn print_symbols(&self) {
println!("Symbol Table Content:");
println!("---------------------");
// Collect all entries into a vector to sort them.
// Note that we borrow READABLE references.
let mut entries: Vec<(&String, &Symbol)> = self.map.iter().collect();
// Sort by address
entries.sort_by(|a, b| a.1.address.cmp(&b.1.address));
// For each entry in entries, note how the loop itself destructures the tuple.
for (name, sym) in entries {
println!("{:<8} | {:04X}", name, sym.address);
}
}
}
Module `mnemonics` (opcode table and assembler directives):
#[derive(Debug, Clone, Copy)]
pub struct OpInfo{
pub opcode: u8,
pub format: u8,
}
pub fn get_opcode(mnemonic: &str) -> Option<OpInfo> {
match mnemonic {
"ADD" => Some(OpInfo { opcode: 0x18, format: 3 }),
"AND" => Some(OpInfo { opcode: 0x40, format: 3 }),
"COMP" => Some(OpInfo { opcode: 0x28, format: 3 }),
"DIV" => Some(OpInfo { opcode: 0x24, format: 3 }),
"J" => Some(OpInfo { opcode: 0x3C, format: 3 }),
"JEQ" => Some(OpInfo { opcode: 0x30, format: 3 }),
"JGT" => Some(OpInfo { opcode: 0x34, format: 3 }),
"JLT" => Some(OpInfo { opcode: 0x38, format: 3 }),
"JSUB" => Some(OpInfo { opcode: 0x48, format: 3 }),
"LDA" => Some(OpInfo { opcode: 0x00, format: 3 }),
"LDCH" => Some(OpInfo { opcode: 0x50, format: 3 }),
"LDL" => Some(OpInfo { opcode: 0x08, format: 3 }),
"LDX" => Some(OpInfo { opcode: 0x04, format: 3 }),
"MUL" => Some(OpInfo { opcode: 0x20, format: 3 }),
"OR" => Some(OpInfo { opcode: 0x44, format: 3 }),
"RD" => Some(OpInfo { opcode: 0xD8, format: 3 }),
"RSUB" => Some(OpInfo { opcode: 0x4C, format: 3 }),
"STA" => Some(OpInfo { opcode: 0x0C, format: 3 }),
"STCH" => Some(OpInfo { opcode: 0x54, format: 3 }),
"STL" => Some(OpInfo { opcode: 0x14, format: 3 }),
"STSW" => Some(OpInfo { opcode: 0xE8, format: 3 }),
"STX" => Some(OpInfo { opcode: 0x10, format: 3 }),
"SUB" => Some(OpInfo { opcode: 0x1C, format: 3 }),
"TD" => Some(OpInfo { opcode: 0xE0, format: 3 }),
"TIX" => Some(OpInfo { opcode: 0x2C, format: 3 }),
"WD" => Some(OpInfo { opcode: 0xDC, format: 3 }),
_ => None,
}
}
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Directive {
Start, End, Byte, Word, Resb, Resw,
}
impl Directive {
pub fn from_str(s: &str) -> Option<Self> {
match s {
"START" => Some(Directive::Start),
"END" => Some(Directive::End),
"BYTE" => Some(Directive::Byte),
"WORD" => Some(Directive::Word),
"RESB" => Some(Directive::Resb),
"RESW" => Some(Directive::Resw),
_ => None,
}
}
// We pass the operand because RESW/BYTE need it to calculate size.
pub fn get_size(&self, operand: Option<&str>) -> Result<i32, String> {
match self {
Directive::Word => Ok(3),
Directive::Start | Directive::End => Ok(0),
Directive::Resw => {
let val = operand.ok_or("Missing operand for RESW")?
.parse::<i32>()
.map_err(|_| "Invalid integer for RESW")?;
Ok(val * 3)
},
Directive::Resb => {
let val = operand.ok_or("Missing operand for RESB")?
.parse::<i32>()
.map_err(|_| "Invalid integer for RESB")?;
Ok(val)
},
Directive::Byte => {
let op = operand.ok_or("Missing operand for BYTE")?;
if op.starts_with("C'") && op.ends_with('\'') {
// C'EOF' -> 3 bytes
Ok((op.len() - 3) as i32)
} else if op.starts_with("X'") && op.ends_with('\'') {
// X'F1' -> 1 byte per 2 hex chars
let hex_len = op.len() - 3;
if hex_len % 2 != 0 {
return Err("Hex literal must have even number of digits".to_string());
}
Ok((hex_len / 2) as i32)
} else {
Err("Invalid BYTE format".to_string())
}
}
}
}
}
Module `pass1` (address assignment and symbol table construction):
use std::io::{BufRead, BufReader};
use std::fs::File;
use crate::ir::Line;
use crate::symbols::SymbolTable;
use crate::mnemonics::{get_opcode, Directive};
pub fn pass_one(filename: &str) -> Result<(SymbolTable, Vec<Line>), String> {
let file = File::open(filename).map_err(|e| e.to_string())?;
let reader = BufReader::new(file);
let mut symtab = SymbolTable::new();
let mut intermediate_code = Vec::new();
let mut locctr = 0;
let mut start_seen = false;
for (index, line_result) in reader.lines().enumerate() {
let source_line_number = index + 1;
let line = line_result.map_err(|e| e.to_string())?;
let tokens: Vec<&str> = line.split_whitespace().collect();
// Checks for comments.
if tokens.is_empty() || tokens[0].starts_with('#') || tokens[0].starts_with('.') {
continue;
}
// Parse tokens intelligently based on count
let (label, mnemonic, operand) = match tokens.len() {
3 => (Some(tokens[0]), tokens[1], Some(tokens[2])),
2 => {
// Determine if first token is instruction/directive or label
if get_opcode(tokens[0]).is_some() || Directive::from_str(tokens[0]).is_some() {
(None, tokens[0], Some(tokens[1]))
} else {
(Some(tokens[0]), tokens[1], None)
}
},
1 => (None, tokens[0], None),
_ => return Err(format!("Line {}: Too many tokens", source_line_number)),
};
// Handle START directive specially
if mnemonic == "START" {
if let Some(op) = operand {
// Parse hex: strtol(op, NULL, 16)
locctr = i32::from_str_radix(op, 16).unwrap_or(0);
}
start_seen = true;
intermediate_code.push(Line::new(
locctr, label, mnemonic, operand, source_line_number
));
continue;
}
let current_address = locctr;
// If this line has a label, add it to symbol table
if let Some(lbl) = label {
// Propagate the symbol table's own message (too long, duplicate) with the line number.
symtab.insert(lbl.to_string(), current_address, source_line_number as i32)
.map_err(|e| format!("Line {}: {}", source_line_number, e))?;
}
// Calculate instruction size
let mut instruction_size = 0;
if get_opcode(mnemonic).is_some() {
instruction_size = 3; // All SIC instructions are 3 bytes
} else if let Some(dir) = Directive::from_str(mnemonic) {
instruction_size = dir.get_size(operand)
.map_err(|e| format!("Line {}: {}", source_line_number, e))?;
} else {
return Err(format!("Line {}: Unknown Opcode '{}'", source_line_number, mnemonic));
}
intermediate_code.push(Line::new(
current_address, label, mnemonic, operand, source_line_number
));
locctr += instruction_size;
if mnemonic == "END" {
break;
}
}
if !start_seen {
return Err("Error: Missing START directive".to_string());
}
Ok((symtab, intermediate_code))
}
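The token-count rule in Pass 1 is easiest to see in isolation. The sketch below reproduces only that classification step. `is_mnemonic` is a stand-in for the opcode and directive lookups and knows just a handful of names, and `split_line` is a hypothetical helper, so treat this as an illustration rather than the project's parser.

```rust
/// Stand-in for the opcode/directive lookups; only a few names are listed.
/// This short list is an assumption made for the example.
fn is_mnemonic(tok: &str) -> bool {
    matches!(tok, "LDA" | "STA" | "RSUB" | "WORD" | "BYTE" | "RESW")
}

/// Split one source line into (label, mnemonic, operand).
/// Returns None for an empty line or for more than three tokens.
fn split_line(line: &str) -> Option<(Option<&str>, &str, Option<&str>)> {
    let t: Vec<&str> = line.split_whitespace().collect();
    match t.len() {
        3 => Some((Some(t[0]), t[1], Some(t[2]))),
        // Two tokens are ambiguous: "LDA LENGTH" versus "EXIT RSUB".
        2 if is_mnemonic(t[0]) => Some((None, t[0], Some(t[1]))),
        2 => Some((Some(t[0]), t[1], None)),
        1 => Some((None, t[0], None)),
        _ => None,
    }
}

fn main() {
    for line in ["ZERO WORD 0", "LDA LENGTH", "EXIT RSUB", "RSUB"].iter() {
        println!("{:<12} -> {:?}", line, split_line(line));
    }
}
```

One consequence of splitting on whitespace is that a character constant containing a space, such as `C'A B'`, is broken into extra tokens and rejected.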
Module `pass2` (object code generation):
use std::fs::File;
use std::io::{BufWriter, Write};
use crate::ir::Line;
use crate::symbols::SymbolTable;
use crate::mnemonics::{get_opcode};
/// Manages the buffering of the current Text Record (T-record)
struct TextRecord {
start_addr: Option<u32>,
buffer: Vec<u8>,
max_len: usize,
}
impl TextRecord {
fn new() -> Self {
TextRecord {
start_addr: None,
buffer: Vec::with_capacity(30),
max_len: 30,
}
}
fn add_bytes(&mut self, addr: i32, data: &[u8]) -> Vec<String> {
let mut output_lines = Vec::new();
let mut data_idx = 0;
let addr_u32 = addr as u32;
while data_idx < data.len() {
if self.start_addr.is_none() {
self.start_addr = Some(addr_u32 + data_idx as u32);
}
let current_start = self.start_addr.unwrap();
let current_loc = current_start + self.buffer.len() as u32;
let incoming_loc = addr_u32 + data_idx as u32;
// Check for address gap - flush if needed
if current_loc != incoming_loc {
if let Some(line) = self.flush() {
output_lines.push(line);
}
self.start_addr = Some(incoming_loc);
}
let space_left = self.max_len - self.buffer.len();
let chunk_size = std::cmp::min(space_left, data.len() - data_idx);
self.buffer.extend_from_slice(&data[data_idx..data_idx + chunk_size]);
data_idx += chunk_size;
// Flush if buffer full
if self.buffer.len() == self.max_len {
if let Some(line) = self.flush() {
output_lines.push(line);
}
}
}
output_lines
}
fn flush(&mut self) -> Option<String> {
if self.buffer.is_empty() {
return None;
}
let addr = self.start_addr.unwrap();
let record_len = self.buffer.len();
let header = format!("T{:06X}{:02X}", addr, record_len);
let body: String = self.buffer.iter().map(|b| format!("{:02X}", b)).collect();
self.start_addr = None;
self.buffer.clear();
Some(format!("{}{}", header, body))
}
}
pub fn pass_two(
ir: &[Line],
symtab: &SymbolTable,
filename: &str
) -> Result<(), String> {
let start_addr = ir.first().map(|l| l.address).unwrap_or(0);
// Simple calc for program length (Last Address - First Address)
let prog_len = if let Some(last) = ir.last() {
last.address - start_addr
} else {
0
};
let obj_filename = format!("{}.obj", filename);
let file = File::create(&obj_filename).map_err(|e| e.to_string())?;
let mut writer = BufWriter::new(file);
// 1. Header Record
let prog_name = ir.first()
.and_then(|l| l.label.as_deref())
.unwrap_or(" ");
writeln!(writer, "H{:<6}{:>06X}{:>06X}", prog_name, start_addr, prog_len)
.map_err(|e| e.to_string())?;
// 2. Text Records
let mut text_rec = TextRecord::new();
// Vector to store address locations that need Modification Records
let mut mod_records: Vec<i32> = Vec::new();
let mut entry_point = start_addr;
for line in ir {
if line.mnemonic == "START" { continue; }
if line.mnemonic == "END" {
if let Some(ref op) = line.operand {
entry_point = symtab.get_address(op).unwrap_or(start_addr);
}
break;
}
// Generate Object Code
let object_code = if get_opcode(&line.mnemonic).is_some() {
Some(assemble_instruction(line, symtab, &mut mod_records)?)
} else if line.mnemonic == "BYTE" {
Some(assemble_byte(line)?)
} else if line.mnemonic == "WORD" {
Some(assemble_word(line)?)
} else {
None // RESW/RESB generate no code
};
if let Some(bytes) = object_code {
let records = text_rec.add_bytes(line.address, &bytes);
for rec in records {
writeln!(writer, "{}", rec).map_err(|e| e.to_string())?;
}
} else {
// Gap handling (RESW/RESB)
if let Some(rec) = text_rec.flush() {
writeln!(writer, "{}", rec).map_err(|e| e.to_string())?;
}
}
}
if let Some(rec) = text_rec.flush() {
writeln!(writer, "{}", rec).map_err(|e| e.to_string())?;
}
// 3. Modification Records
// For Standard SIC, we modify the last 4 half-bytes (16 bits) at the specified address.
for addr in mod_records {
// Record layout: 'M', 6-hex-digit field address, 2-hex-digit length in half-bytes
// Length 04 represents 4 half-bytes (16 bits)
writeln!(writer, "M{:06X}04", addr + 1).map_err(|e| e.to_string())?;
}
// 4. End Record
writeln!(writer, "E{:06X}", entry_point).map_err(|e| e.to_string())?;
println!("Output written to {}", obj_filename);
Ok(())
}
fn assemble_instruction(
line: &Line,
symtab: &SymbolTable,
mod_records: &mut Vec<i32>
) -> Result<Vec<u8>, String> {
let op_info = get_opcode(&line.mnemonic)
.ok_or(format!("Unknown mnemonic {}", line.mnemonic))?;
let opcode = op_info.opcode;
let mut address = 0;
if let Some(ref operand) = line.operand {
// Handle indexed addressing (,X)
let (label, is_indexed) = if operand.ends_with(",X") {
(&operand[..operand.len()-2], true)
} else {
(operand.as_str(), false)
};
// Resolve symbol address
if let Some(addr) = symtab.get_address(label) {
address = addr;
// Record that this instruction location needs modification
mod_records.push(line.address);
} else {
return Err(format!("Undefined Symbol: {}", label));
}
// Set indexed bit if needed
if is_indexed {
address |= 0x8000;
}
} else if line.mnemonic != "RSUB" {
return Err(format!("Missing operand for {}", line.mnemonic));
}
let b1 = opcode;
let b2 = (address >> 8) as u8;
let b3 = (address & 0xFF) as u8;
Ok(vec![b1, b2, b3])
}
fn assemble_byte(line: &Line) -> Result<Vec<u8>, String> {
let operand = line.operand.as_ref()
.ok_or(format!("BYTE requires operand at line {}", line.source_line))?;
if operand.starts_with("C'") && operand.ends_with('\'') {
// Character constant: C'EOF' -> ASCII bytes
let content = &operand[2..operand.len()-1];
Ok(content.as_bytes().to_vec())
} else if operand.starts_with("X'") && operand.ends_with('\'') {
// Hex constant: X'F1' -> raw bytes
let hex_str = &operand[2..operand.len()-1];
if hex_str.len() % 2 != 0 {
return Err("X constant must have even number of digits".to_string());
}
let mut bytes = Vec::new();
for i in (0..hex_str.len()).step_by(2) {
let byte_val = u8::from_str_radix(&hex_str[i..i+2], 16)
.map_err(|_| "Invalid hex in X constant")?;
bytes.push(byte_val);
}
Ok(bytes)
} else {
Err(format!("Invalid BYTE format: {}", operand))
}
}
fn assemble_word(line: &Line) -> Result<Vec<u8>, String> {
let operand = line.operand.as_ref()
.ok_or(format!("WORD requires operand at line {}", line.source_line))?;
let value = operand.parse::<i32>()
.map_err(|_| format!("Invalid WORD constant: {}", operand))?;
// Check 24-bit range
let min = -(1 << 23);
let max = (1 << 23) - 1;
if value < min || value > max {
return Err(format!("WORD value too large for 24-bits: {}", value));
}
// Convert to 3-byte big-endian
let bytes = value.to_be_bytes();
Ok(vec![bytes[1], bytes[2], bytes[3]])
}
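The WORD encoding relies on the fact that a 32-bit two's-complement value within the 24-bit range has a top byte that is pure sign extension, so dropping it loses nothing. A small sketch of the same idea follows; `encode_word` and `hex` are illustrative names, not functions from the repository.

```rust
/// Encode a signed value as a 24-bit big-endian word, or None if it does not fit.
/// `encode_word` is an illustrative helper that mirrors the range check in the listing.
fn encode_word(value: i32) -> Option<[u8; 3]> {
    let min = -(1 << 23);
    let max = (1 << 23) - 1;
    if value < min || value > max {
        return None;
    }
    let b = value.to_be_bytes(); // 4 bytes; b[0] is only sign extension here
    Some([b[1], b[2], b[3]])
}

/// Render bytes as upper-case hex.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02X}", b)).collect()
}

fn main() {
    for v in [3, 4096, -1, 1 << 23].iter() {
        match encode_word(*v) {
            Some(w) => println!("{:>8} -> {}", v, hex(&w)),
            None => println!("{:>8} -> out of 24-bit range", v),
        }
    }
}
```

Here 4096 encodes as 001000, matching the `MAXLEN WORD 4096` bytes in the object file, and -1 encodes as FFFFFF.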
Rust's ownership model and pattern matching suit this project well. The Option<T> type naturally
represents optional labels and operands, while Result<T, E> carries errors back to main with the
source line number attached. The match expressions in the opcode table and directive parsing are
checked for exhaustiveness by the compiler (the `_ => None` arm covers unknown names) and stay highly readable.
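As a small illustration of that Option/Result pattern, the sketch below sizes a RESW statement and attaches the line number to either kind of failure. `resw_size` is a hypothetical function written for this example; the error wording is likewise an assumption.

```rust
/// Size in bytes of a RESW statement, with the source line number in any error.
/// `resw_size` is an illustrative helper, not a function from the repository.
fn resw_size(operand: Option<&str>, line_no: usize) -> Result<i32, String> {
    // Option -> Result: a missing operand becomes an error message.
    let text = operand.ok_or_else(|| format!("Line {}: missing operand for RESW", line_no))?;
    // Parse failures are mapped to a message that names the offending text.
    let words: i32 = text
        .parse()
        .map_err(|_| format!("Line {}: invalid integer '{}'", line_no, text))?;
    Ok(words * 3)
}

fn main() {
    println!("{:?}", resw_size(Some("2"), 14));   // Ok(6)
    println!("{:?}", resw_size(None, 15));        // Err("Line 15: missing operand for RESW")
    println!("{:?}", resw_size(Some("two"), 16)); // Err("Line 16: invalid integer 'two'")
}
```

The `?` operator returns early on the first failure, so the happy path reads top to bottom without nested matches.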
The symbol table wraps HashMap<String, Symbol> to add SIC-specific validation like the 6-character
limit and duplicate detection. Using a private map field with public methods provides proper encapsulation
while allowing controlled access to symbol addresses during pass 2.
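The effect of that encapsulation can be shown with a trimmed-down version of the table. This is a compact restatement for illustration: it stores only the address (the real `Symbol` also records the source line), takes `&str` names, and uses its own error wording.

```rust
use std::collections::HashMap;

/// Trimmed-down symbol table: name -> address, with the two SIC checks.
struct SymbolTable {
    // No `pub`: code outside this module must go through insert/get_address.
    map: HashMap<String, i32>,
}

impl SymbolTable {
    fn new() -> Self {
        SymbolTable { map: HashMap::new() }
    }

    /// Reject names longer than 6 characters and names already defined.
    fn insert(&mut self, name: &str, address: i32) -> Result<(), String> {
        if name.len() > 6 {
            return Err(format!("Symbol '{}' is longer than 6 characters", name));
        }
        if self.map.contains_key(name) {
            return Err(format!("Duplicate symbol name: {}", name));
        }
        self.map.insert(name.to_string(), address);
        Ok(())
    }

    fn get_address(&self, name: &str) -> Option<i32> {
        self.map.get(name).copied()
    }
}

fn main() {
    let mut symtab = SymbolTable::new();
    println!("{:?}", symtab.insert("CLOOP", 0x1003));   // Ok(())
    println!("{:?}", symtab.insert("CLOOP", 0x2000));   // Err("Duplicate symbol name: CLOOP")
    println!("{:?}", symtab.insert("TOOLONG", 0x1006)); // rejected: 7 characters
    println!("{:?}", symtab.get_address("CLOOP"));      // Some(4099), i.e. hex 1003
}
```

Because every write goes through `insert`, the first definition of a symbol can never be silently overwritten by a later one.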
The TextRecord struct keeps the state needed for standard SIC text records in one place. Its buffer management handles the tricky details of the object format: a 30-byte limit per record, a new record whenever there is an address gap, and a final flush at the end of the program. The buffer is an ordinary Vec<u8>, so its memory is released automatically when the record goes out of scope.
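The record-splitting rules can be sketched without the struct. The version below is a simplified stand-in, not the author's `TextRecord`: it works byte by byte over a list of (address, bytes) pairs, and `text_records` and `flush` are names chosen for this example.

```rust
/// Emit the buffered bytes as one T record, if there are any.
fn flush(start: u32, buf: &mut Vec<u8>, out: &mut Vec<String>) {
    if !buf.is_empty() {
        let body: String = buf.iter().map(|b| format!("{:02X}", b)).collect();
        out.push(format!("T{:06X}{:02X}{}", start, buf.len(), body));
        buf.clear();
    }
}

/// Group (address, bytes) pairs into T records of at most 30 bytes,
/// starting a new record whenever the next address is not contiguous.
fn text_records(items: &[(u32, Vec<u8>)]) -> Vec<String> {
    const MAX: usize = 30;
    let mut out = Vec::new();
    let mut start = 0u32;
    let mut buf: Vec<u8> = Vec::new();
    for (addr, bytes) in items {
        for (i, &b) in bytes.iter().enumerate() {
            let here = *addr + i as u32;
            // A full buffer or an address gap (e.g. after RESB) ends the record.
            if buf.len() == MAX || (!buf.is_empty() && start + buf.len() as u32 != here) {
                flush(start, &mut buf, &mut out);
            }
            if buf.is_empty() {
                start = here;
            }
            buf.push(b);
        }
    }
    flush(start, &mut buf, &mut out);
    out
}

fn main() {
    let items = vec![
        (0x102A, vec![0x45, 0x4F, 0x46]), // EOF BYTE C'EOF'
        (0x102D, vec![0x00, 0x00, 0x03]), // THREE WORD 3
        (0x2039, vec![0x04, 0x10, 0x30]), // RDREC LDX ZERO, after the reserved area
    ];
    for rec in text_records(&items) {
        println!("{}", rec);
    }
    // Prints:
    // T00102A06454F46000003
    // T00203903041030
}
```

The first two items are contiguous and share a record; the jump from 1030 to 2039 forces a second one.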