build-your-own-x/llm/tokenizer.py
Claude 1d3ce8cff7
Add basic LLM implementation from scratch
Implements a character-level GPT-style Transformer:
- model.py: CausalSelfAttention, FeedForward, TransformerBlock, LLM
- tokenizer.py: CharTokenizer (char -> int mapping)
- train.py: training loop with AdamW, gradient clipping, checkpointing, sampling
- generate.py: load checkpoint and generate text from a prompt

Verified working on a built-in Shakespeare excerpt (805k param model).

https://claude.ai/code/session_01SWXLQb3nFTiygbp74dpjVa
2026-03-22 22:51:49 +00:00

20 lines
615 B
Python

"""
Character-level tokenizer.
Maps every unique character in the training corpus to an integer id.
Simple, requires no external libraries, and good enough for a tiny LLM.
"""


class CharTokenizer:
    def __init__(self, text: str):
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self._stoi = {ch: i for i, ch in enumerate(chars)}
        self._itos = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text: str) -> list[int]:
        return [self._stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self._itos[i] for i in ids)
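
A minimal usage sketch of the tokenizer above, showing an encode/decode round trip and what happens with a character that never appeared in the training corpus. The class is repeated only so the snippet runs standalone; the `corpus` string and the variable names `tok`, `ids`, `roundtrip`, and `unseen_ok` are illustrative assumptions, not part of the repository.

```python
# Standalone copy of CharTokenizer from the listing above, so this sketch runs on its own.
class CharTokenizer:
    def __init__(self, text: str):
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self._stoi = {ch: i for i, ch in enumerate(chars)}
        self._itos = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text: str) -> list[int]:
        return [self._stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self._itos[i] for i in ids)


# Hypothetical stand-in for the training text (the real corpus is not shown here).
corpus = "hello world"
tok = CharTokenizer(corpus)

# Ids follow the sorted order of the unique characters:
# [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w'] -> 0..7
ids = tok.encode("hello")
roundtrip = tok.decode(ids)

# A character missing from the corpus has no id, so encode() raises KeyError.
try:
    tok.encode("hex")
    unseen_ok = True
except KeyError:
    unseen_ok = False

print(tok.vocab_size, ids, roundtrip, unseen_ok)
```

Because the vocabulary is fixed at construction time, any prompt passed to `encode` should contain only characters present in the training text; otherwise the caller would need to filter or handle the `KeyError` itself.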