![[P] How I found & fixed 4 bugs in Microsoft's Phi-4 model](https://jfbmhhfxbbrxcmwilqxt.supabase.co/storage/v1/object/public/resource-images/MachineLearning_No-code_AI_20250328_190624_processed_image.jpg)
[P] How I found & fixed 4 bugs in Microsoft's Phi-4 model
Hey r/MachineLearning! Last week, Microsoft released Phi-4, a 14B open-source model that rivals OpenAI's GPT-4-o-mini. I managed to find & fix 4 bugs impacting its output quality. You might remember me previously from fixing 8 bugs in Google's Gemma model! :)
I'm going to walk you through how I found & fixed the bugs. Phi-4's benchmarks were amazing, however many users reported weird or just wrong outputs. Since I maintain the open-source project called 'Unsloth' (fine-tuning LLMs 2x faster with 70% less VRAM) with my brother, I firstly tested Phi-4 for inference and found many errors. Our GitHub repo: https://github.com/unslothai/unsloth
This time, the model had no implementation issues (unlike Gemma 2) but did have problems in the model card. For my first inference run, I randomly found an extra token which is obviously incorrect (2 eos tokens is never a good idea). Also during more runs, I found there was an extra assistant prompt which is once again incorrect. And, lastly, from past experience with Unsloth's bug fixes, I already knew fine-tuning was wrong when I read the code.
These bugs caused Phi-4 to have some drop in accuracy and also broke fine-tuning runs. Our fixes are now under review by Microsoft to be officially added to Hugging Face. We uploaded the fixed versions to https://huggingface.co/unsloth/phi-4-GGUF
Here’s a breakdown of the bugs and their fixes:
1. Tokenizer bug fixes
The Phi-4 tokenizer interestingly uses as the BOS (beginning of sentence), EOS (end of sentence) and PAD (padding) tokens. The main issue is the EOS token is wrong - it should be . Otherwise, you will get in generations.
2. Fine-tuning bug fixes
The padding token should be a designated pad token like in Llama () or we can use an untrained token - for example we use , fixing infinite generations and outputs.
3. Chat template issues
The Phi-4 tokenizer always adds an assistant prompt - it should only do this if prompted by add_generation_prompt. Most LLM serving libraries expect non auto assistant additions, and this might cause issues during serving.
We dive deeper into the bugs in our blog: https://unsloth.ai/blog/phi4
Do our Fixes Work?
Yes! Our fixed Phi-4 uploads show clear performance gains, with even better scores than Microsoft's original uploads on the Open LLM Leaderboard.
https://preview.redd.it/d8hew26e06ce1.png?width=2366&format=png&auto=webp&s=173c23feacc625566271470839fe7a5e25eb860e
Some redditors even tested our fixes to show greatly improved results in:
- Example 1: Multiple-choice tasks
https://preview.redd.it/qx50pkq706ce1.png?width=1579&format=png&auto=webp&s=437da2cabdbf98ef5a8b8cbdc5592907a20e2316
- Example 2: ASCII art generation
https://preview.redd.it/sw1o3a3yt4de1.png?width=2326&format=png&auto=webp&s=fc6bfc45d14134d45f332ba58bbd1de049f5776b
We also made a Colab notebook fine-tune Phi-4 completely for free using Google's free Tesla T4 (16GB) GPUs: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb
Thank you for reading this long post and hope you all found this insightful! If you have any questions, please feel free to ask! :)
How I found the bugs:
- I first downloaded the original Phi-4 from https://huggingface.co/microsoft/phi-4, and tested inference out. Weirdly I found
assistant
to be appended at the even withadd_generation_prompt = False
in Hugging Face, so I theorized there was a chat template problem. Adding assistant prompts by default can break serving libraries. - And yes, https://huggingface.co/microsoft/phi-4/blob/f957856cd926f9d681b14153374d755dd97e45ed/tokenizer_config.json#L774 had by default added the assistant prompt - I first fixed this!
- I then found
to be used for the BOS, EOS and PAD tokens, which is a common issue amongst models - I ignored the BOS, since Phi-4 did not have one anyways, but changed the PAD token to
. You can select any of the tokens since they're empty and not trained. This counteracts issues of infinite generations during finetuning. - For Llama-fication, I used torch.allclose to confirm all tensors are in fact equivalent. I also used some fake random data to check all activations are also mostly similar bitwise. I also uploaded the model to the HF Open LLM Leaderboard to confirm if the original Phi-4 arch and the new Llama-fied models are equivalent.
- Finally I verified all finetuning runs with Unsloth in a Colab Notebook to confirm all runs were correct.
Vibe Score

0
Sentiment

1
Rate this Resource
Join the VibeBuilders.ai Newsletter
The newsletter helps digital entrepreneurs how to harness AI to build your own assets for your funnel & ecosystem without bloating your subscription costs.
Start the free 5-day AI Captain's Command Line Bootcamp when you sign up: