How to Train an LLM: Part 4

Welcome back to the continuation of Part 1, Part 2 and Part 3 of the saga.

In the last worklog, I settled on an architecture with a Llama-style base, QKNorm, MixAttention instead of standard GQA, and cross-document masking. If you want to learn more about that, check out the last section of that worklog.

Introduction

I have been quiet about the end use case I have in mind that justifies doing a pre-train. I have a specific architecture and strict criteria (blazing-fast, local inference with a decent context window) in mind, which limit my options for models I can simply post-train. Working across the whole pipeline has been fun; it takes a certain kind of persistence to debug through data, code, and architecture for days on end.

My goal is to build a punch-packing code fill-in-the-middle (FIM) model. I am surprised that we don't have a powerful, small, self-hostable FIM code model. Early versions of GitHub Copilot used OpenAI's 12B-param code-cushman-001, and we have come a long way since then; Copilot now relies on much larger GPT-5-class models.

It's going to be hard to compete with such capable models. Something like 4o was trained on multiple trillions of tokens; I have neither that scale of data nor a giant wallet for that much compute.

Here are a few ideas I am NOT subscribed to:

In this blog, I want to talk about data and my first FIM model pre-train, which I trained for 5B tokens (undertrained). I originally wanted this post to be about the “final” model, but I do not think I have the right recipe just yet. I need to spend some more time with my codebase.

Data for FIM

I spent a long time understanding what kind of data FIM models currently consume. I have read through papers on Deepseek-Coder, Structured FIM, SmolLM2, StarCoder, OpenAI's OG FIM, and many others.

From all this, I have a few key insights:

Less is More

Some data-focused papers emphasize aggressively filtering the dataset and training on the smaller, cleaner subset while still performing better on evals (I will speak more on evals later). This idea of a "Cognitive Core" is appealing to me: the model learns the essence of a language or a design pattern at train time and performs well at test time given additional information in-context.

Datamix

For code FIM, I train on 90% code and 10% English data. Let's talk about what both are composed of:

Data format

FIM-style cloze objectives are conceptually similar to what BERT-style masked language models excel at. However, OpenAI's FIM paper describes the FIM-for-free property of modern decoder-only LMs: they can learn to fill in the middle from rearranged training data without hurting their ordinary left-to-right performance. Let's say I have the following snippet of code in my training data:

def calculate_sum(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

result = calculate_sum([1, 2, 3, 4, 5])
print(f"The sum is: {result}")

If we want to feed these tokens to train our FIM model, we restructure the input to roughly look like the following:

<|fim_prefix|>def calculate_sum(numbers):
    total = 0
    for num in numbers:
        <|fim_suffix|>
    return total

result = calculate_sum([1, 2, 3, 4, 5])
print(f"The sum is: {result}")
<|fim_middle|>total += num

Simply put, the code before the middle (the prefix) comes first, then the code after the middle (the suffix), and finally we let the model see what should go in the middle. The sentinel or special tokens (e.g. <|fim_prefix|>) give the model structural signal during training. The format above is PSM (Prefix-Suffix-Middle); there is another format, SPM (Suffix-Prefix-Middle), which allows for better KV-caching.
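To make the restructuring concrete, here is a minimal character-level sketch. My actual pipeline works on token IDs and the tokenizer's real sentinel tokens, so treat these helpers as illustrative only (how mask_start and mask_length get chosen is covered next):

FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def to_psm(document: str, mask_start: int, mask_length: int) -> str:
    # Split the document into prefix / middle / suffix around the masked span.
    prefix = document[:mask_start]
    middle = document[mask_start:mask_start + mask_length]
    suffix = document[mask_start + mask_length:]
    # PSM: prefix, then suffix, then the middle the model learns to predict.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

def to_spm(document: str, mask_start: int, mask_length: int) -> str:
    # SPM reorders the first two pieces: suffix first, then prefix, then middle.
    # Sentinel placement for SPM varies slightly between papers; this is one arrangement.
    prefix = document[:mask_start]
    middle = document[mask_start:mask_start + mask_length]
    suffix = document[mask_start + mask_length:]
    return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}{middle}"

Either way, the whole string is then tokenized and trained on with the ordinary next-token loss, which is why FIM comes "for free": to the model it is just another autoregressive pattern.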

OpenAI's FIM paper as well as the Structured FIM paper linked above are where I draw my inspiration from. Let me break these ideas down and explain how I use them.

Determining how long the mask should be is a challenge by itself. My first idea, meant to support a range of mask sizes (similar to a real-life coding flow), was mask_length = clamp(normal_distribution(mean=50, stddev=40), 1, 100). The model I ended up pre-training calculates mask length with that rule, and it's... sub-optimal in hindsight. My thinking was that I'd want the model to complete full AST trees' worth of code, so 40-60 tokens should be good for the average generation, but this absolutely messed up shorter, more common generations, with the model not knowing when to stop. My updated formula is therefore mask_length = uniform_distribution(a=1, b=100), which should produce better generations across a wider range of use cases.
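Written out as code, the two rules are a one-liner each (a direct transcription of the formulas above using Python's random module):

import random

def mask_length_v1(rng: random.Random) -> int:
    # Old rule: normal(mean=50, stddev=40), clamped to [1, 100]; biased toward 40-60.
    return max(1, min(100, round(rng.gauss(50, 40))))

def mask_length_v2(rng: random.Random) -> int:
    # New rule: uniform over [1, 100], so short middles show up as often as long ones.
    return rng.randint(1, 100)

rng = random.Random(0)
print(sorted(mask_length_v1(rng) for _ in range(8)))
print(sorted(mask_length_v2(rng) for _ in range(8)))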

Eval setup

For simplicity's sake, I created my own eval. To make it realistic and simple, I take 10 in-production files from my own training codebase and, with minimal edits, manually select masks of different sizes and subtrees such as if statements.

This guarantees the model has seen neither these exact masked snippets nor the underlying code during training.

Edited example (... marks code removed for brevity):

<|fim_prefix|>
...
            self.weights.append(dataset_config.mix_ratio)
        
        self.buffer: <|fim_suffix|> = []
        self.doc_ids = []
        self.current_doc_id = 0
...
<|fim_middle|>List[int]

The example above tests whether the model understands types in Python.
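The harness itself is tiny. Here is a sketch of the kind of check I mean, with exact match as a stand-in metric (my actual scoring may differ) and model_generate as a hypothetical hook into my inference code:

def exact_match(prediction: str, gold: str) -> bool:
    # Whitespace-insensitive comparison; anything else counts as a miss.
    return prediction.strip() == gold.strip()

def run_eval(model_generate, cases):
    # cases: list of (psm_prompt, gold_middle) pairs built from the 10 files above.
    hits = sum(exact_match(model_generate(prompt), gold) for prompt, gold in cases)
    return hits / len(cases)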

My First FIM Model

I decided to set the theory aside and train a model. I relaxed some constraints to get a baseline: random FIM only (no AST-based masks), a Python-only code portion of the datamix (10% English still kept in the mix), and a 0.6B model with the Llama architecture + QK norm + cross-doc masking (no MixAttention yet). I spun up 2xH100s to make this happen and trained over 5B tokens instead of 20B (definitely undertrained).
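For my own bookkeeping, the baseline run boils down to this (field names are my shorthand, not from any particular training framework):

baseline_run = {
    "params": "0.6B",
    "arch": "llama-style + QK norm + cross-document masking",  # no MixAttention yet
    "fim": {
        "spans": "random",  # no AST-based masks yet
        "mask_length": "clamped normal (the v1 rule above)",
    },
    "data": {"code": 0.90, "english": 0.10, "code_languages": ["python"]},
    "hardware": "2x H100",
    "tokens": 5_000_000_000,  # target was 20B; definitely undertrained
}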

Results

The gen/sample generations are supposed to help me vibe-check the model's outputs. The generation is correct, but it doesn't emit the <|endoftext|> token until a few tokens later than it should. I think this is because it sees so few training examples whose middle is only a few tokens long, so the model assumes it has to keep generating. I could be wrong, and this behaviour could simply be an artifact of undertraining. Suffice it to say, this model does horribly on my evals.

Conclusion

This has been my shortest worklog in a while. It feels like not much got done, but in fact a lot was: I gained invaluable intuition on data massaging, gathered evidence that my model/dataset combination can work, and improved my training codebase, among other things!

Hopefully, I have good news in my next worklog :)

Omkaar Kamath