Fine-Tuning a Local LLM to Categorize Questions
Project Overview
- The goal is to build a chatbot for household maintenance queries using RAG against a vector database.
- A preprocessing step uses a small local LLM to categorize incoming questions into known metadata tags (e.g., pool, car, hvac, cooking) to narrow the vector search space.
- The experiment tests the hypothesis that a very small local LLM can be reliably fine-tuned for this classification task.
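The preprocessing idea above can be sketched as a metadata filter applied before similarity ranking. This is a minimal illustration only: the in-memory `DOCS` list, the three-dimensional embeddings, and the `search` helper are all hypothetical stand-ins, not the project's actual vector database or schema.

```python
import math

# Hypothetical in-memory "vector store": (tag, text, embedding) rows.
# A real setup would use a vector database with metadata filtering.
DOCS = [
    ("pool", "Shock the pool after heavy rain.", [0.9, 0.1, 0.0]),
    ("hvac", "Replace the furnace filter every 3 months.", [0.1, 0.9, 0.0]),
    ("car", "Check tire pressure monthly.", [0.0, 0.1, 0.9]),
    ("hvac", "Clear debris from the outdoor condenser.", [0.2, 0.8, 0.1]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, tag=None, top_k=2):
    """Rank documents by similarity, optionally restricted to one tag."""
    # The classifier's tag narrows the candidate set before ranking.
    candidates = [d for d in DOCS if tag is None or d[0] == tag]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d[2]), reverse=True)
    return [text for _, text, _ in ranked[:top_k]]
```

Passing `tag=None` falls back to an unfiltered search, which is one plausible way to handle a question the classifier cannot place.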
LLM & Fine-Tuning Details
- Models used: Qwen 3:4B (general QA) and Qwen 3:0.6B (classifier).
- Fine-tuning framework: Unsloth, using QLoRA.
- Initial dataset size: ~850 entries, split 70/15/15 (Train/Eval/Test).
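A 70/15/15 split like the one described can be sketched as follows. The function name, the fixed seed, and the integer-percentage arithmetic are assumptions made for illustration; the original split procedure is not described. Because the dataset size is only approximate (~850), the exact counts here will not necessarily match the 131 test items reported in the results.

```python
import random


def split_dataset(rows, train_pct=70, eval_pct=15, seed=42):
    """Shuffle rows and split them into train/eval/test partitions.

    Percentages are integers so the cut points avoid float rounding;
    whatever remains after train and eval becomes the test set.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for a reproducible split
    n_train = len(rows) * train_pct // 100
    n_eval = len(rows) * eval_pct // 100
    train = rows[:n_train]
    eval_ = rows[n_train:n_train + n_eval]
    test = rows[n_train + n_eval:]
    return train, eval_, test
```

With exactly 850 rows this yields 595 training, 127 evaluation, and 128 test examples.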
Baseline Performance (Zero-Shot Prompting)
- Initial testing of Qwen 3:0.6B without fine-tuning showed poor performance.
- Only ~10% of the 131 test questions were categorized correctly.
- Failure modes included overuse of broad labels (e.g., ‘electric’) and inventing categories not in the allowed list.
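Both failure modes can be measured with a small scoring helper. The sketch below is an assumption about how such a check might look, not the evaluation code actually used. The `ALLOWED` set contains only the labels named in this write-up; the full category list is not given here, and treating "electric" as an allowed label is a guess.

```python
from collections import Counter

# Assumed label set: only the categories mentioned in this write-up.
ALLOWED = {"pool", "car", "hvac", "cooking", "electric"}


def score(predictions, gold):
    """Return (accuracy, invented-label counts, predicted-label counts)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = 0
    invented = Counter()  # labels outside the allowed list
    used = Counter()      # how often each allowed label was predicted
    for pred, truth in zip(predictions, gold):
        label = pred.strip().lower()
        if label not in ALLOWED:
            invented[label] += 1
            continue
        used[label] += 1
        if label == truth:
            correct += 1
    return correct / len(gold), invented, used
```

A skewed `used` counter points to overuse of a broad label, while a non-empty `invented` counter shows the model stepping outside the allowed list.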
Fine-Tuning Attempts & Results
- 1st Attempt: Fine-tuning improved accuracy to ~79% (104/131 correct). The model still fragmented categories, emitting near-synonyms such as ‘ac’ instead of the allowed label ‘hvac’.
- 2nd Attempt (Code Mapping): Instead of a free-form category string, the model was required to output a fixed, two-character opaque code for each category. This raised accuracy well above the first attempt.
- Final Result: Accuracy reached ~92% (120/131 correct) by enforcing a fixed-format code output, demonstrating the critical role of output constraints in small model reliability.
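The code-mapping approach can be sketched as a lookup table plus a strict decoder. The specific codes below (`Q7`, `K2`, and so on) are invented for illustration, since the actual codes are not listed in this write-up, and rejecting anything that is not an exact two-character match is one possible design choice rather than the documented behavior.

```python
# Hypothetical two-character opaque codes; the real mapping is not given.
CODE_TO_CATEGORY = {
    "Q7": "pool",
    "K2": "car",
    "M4": "hvac",
    "Z9": "cooking",
}
CATEGORY_TO_CODE = {cat: code for code, cat in CODE_TO_CATEGORY.items()}


def to_training_target(category):
    """Convert a dataset label into the code the model is trained to emit."""
    return CATEGORY_TO_CODE[category]


def decode(model_output):
    """Map raw model output back to a category, or None if it is invalid."""
    code = model_output.strip().upper()
    # Strict check: exactly one known code, nothing else in the output.
    return CODE_TO_CATEGORY.get(code)
```

Because the codes carry no meaning of their own, a near-synonym such as "ac" cannot pass as a valid answer; it simply fails to decode and can be handled as a miss.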
Key Takeaways
- Small, specialized LLMs can be effectively fine-tuned for specific classification tasks like question routing.
- The success of fine-tuning heavily depends on constraining the output format (e.g., using fixed codes) to minimize ambiguity and improve reliability.
- The jump from ~10% zero-shot accuracy to ~92% after fine-tuning shows how much targeted fine-tuning can outperform zero-shot prompting on constrained tasks.
Topics: ML Model Fine-Tuning for Question Categorization
Tags: LLM RAG FineTuning