Fine-Tuning a Local LLM to Categorize Questions

Project Overview

The goal is to build a chatbot for household maintenance queries using RAG against a vector database.
A preprocessing step uses a small local LLM to categorize incoming questions into known metadata tags (e.g., pool, car, hvac, cooking) to narrow the vector search space.
The experiment tests the hypothesis that a very small local LLM can be reliably fine-tuned for this classification task.
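The narrowing step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual retrieval code: the toy corpus, the three-dimensional embeddings, and the `search` helper are all hypothetical, and a real setup would use the metadata filter of whatever vector database is in place.

```python
import math

# Hypothetical toy corpus: (tag, text, embedding).
# Real embeddings would come from an embedding model, not hand-written vectors.
DOCS = [
    ("pool", "Shock the pool after heavy rain.", [0.9, 0.1, 0.0]),
    ("hvac", "Replace the furnace filter every 3 months.", [0.1, 0.9, 0.0]),
    ("car", "Check tire pressure monthly.", [0.0, 0.2, 0.9]),
    ("hvac", "Clear debris from the outdoor condenser.", [0.2, 0.8, 0.1]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, tag=None, top_k=2):
    """Rank documents by similarity, optionally restricted to one metadata tag."""
    candidates = [d for d in DOCS if tag is None or d[0] == tag]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d[2]), reverse=True)
    return [(d[0], d[1]) for d in ranked[:top_k]]

# With a tag from the classifier, only 'hvac' documents are scored.
print(search([0.1, 0.9, 0.05], tag="hvac"))
```

Passing `tag=None` falls back to searching the whole corpus, which is one reasonable choice when the classifier output cannot be trusted.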

Initial testing of Qwen 0.6B without fine-tuning showed poor performance: only about 10% of the 131 test questions were categorized correctly.
Failure modes included overuse of broad labels (e.g., ‘electric’) and inventing categories not in the allowed list.
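A small scoring harness makes these two failure modes visible separately. The sketch below is an assumption about how such a test run could be tallied, not the harness used in the experiment; the `ALLOWED` set is an illustrative subset and the `score` function is hypothetical.

```python
from collections import Counter

# Illustrative subset of the allowed tags (assumption; the full list is not given).
ALLOWED = {"pool", "car", "hvac", "cooking", "electric"}

def score(pairs):
    """Tally (expected, predicted) pairs into correct / wrong / invented."""
    outcome = Counter()
    predicted_counts = Counter()
    for expected, predicted in pairs:
        label = predicted.strip().lower()
        predicted_counts[label] += 1
        if label not in ALLOWED:
            outcome["invented"] += 1   # category not in the allowed list
        elif label == expected:
            outcome["correct"] += 1
        else:
            outcome["wrong"] += 1      # valid label, wrong category
    total = sum(outcome.values())
    accuracy = outcome["correct"] / total if total else 0.0
    return accuracy, outcome, predicted_counts

# Made-up example pairs, for illustration only.
pairs = [
    ("hvac", "hvac"),
    ("pool", "electric"),
    ("car", "automotive"),
    ("cooking", "electric"),
]
accuracy, outcome, predicted_counts = score(pairs)
print(accuracy, dict(outcome), predicted_counts.most_common(1))
```

A label that dominates `predicted_counts` points to overuse of a broad category, while a high `invented` count points to the model ignoring the allowed list.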

1st Attempt: Fine-tuning improved accuracy to ~79% (104/131 correct). Remaining errors included category fragmentation, where the model emitted a near-synonym such as ‘ac’ instead of the allowed label ‘hvac’.
2nd Attempt (Code Mapping): The required output was changed from a free-form category string to a fixed, two-character opaque code, which raised accuracy further.
Final Result: Accuracy reached ~92% (120/131 correct) with the fixed-format code output, which suggests that output constraints play a major role in small-model reliability.
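The code-mapping idea can be sketched as a lookup table plus a strict decoder. The specific codes below are hypothetical (the actual codes used in the experiment are not given), and `decode` is an illustrative helper rather than the project's implementation.

```python
# Hypothetical two-character codes; the real mapping is not specified.
CODE_TO_CATEGORY = {
    "A1": "pool",
    "B2": "car",
    "C3": "hvac",
    "D4": "cooking",
}

def decode(raw_output):
    """Map the model's raw output to a category, or None if it is not a known code."""
    code = raw_output.strip().upper()
    return CODE_TO_CATEGORY.get(code)

print(decode(" c3\n"))   # known code, surrounding whitespace tolerated
print(decode("ac"))      # not a known code, so no category is returned
```

Because the codes carry no meaning of their own, the model has no near-synonym to drift toward, and anything outside the table is rejected mechanically instead of silently becoming a new category.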

Small, specialized LLMs can be effectively fine-tuned for specific classification tasks like question routing.
The success of fine-tuning heavily depends on constraining the output format (e.g., using fixed codes) to minimize ambiguity and improve reliability.
The improvement from ~10% to ~92% accuracy highlights the advantage of targeted fine-tuning over zero-shot prompting for constrained tasks.

Topics: ML Model Fine-Tuning for Question Categorization
Tags: LLM RAG FineTuning