Orca-Math showcases the benefits of fine-tuning SLMs with multi-agent flows

General small language models (SLMs) struggle when asked to solve grade school math problems, a task often used as a benchmark to evaluate the capabilities of larger foundation models. It has become increasingly evident that small language models can efficiently achieve state-of-the-art performance in specific domains with the help of fine-tuning. Following this trend, a research team at Microsoft developed Orca-Math by fine-tuning Mistral-7B using a small 200,000-problem dataset and a multi-agent flow, achieving an 86.81% pass@1 score on GSM8K. Most models scoring over 80% on GSM8K have over 30 billion parameters.
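For context, pass@1 means the model gets a single attempt per problem and is scored on whether its final answer matches the reference. The following is a minimal sketch of how such an evaluation could be computed; `generate` is a hypothetical callable standing in for the model, and the answer-extraction heuristic is an assumption, not the paper's exact harness.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last number out of a completion (GSM8K answers are numeric)."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_pass_at_1(problems, generate) -> float:
    """pass@1: one sample per problem, exact match on the final number.

    `problems` is a list of {"question": ..., "answer": ...} dicts;
    `generate` maps a question string to a single model completion.
    """
    correct = 0
    for problem in problems:
        predicted = extract_final_answer(generate(problem["question"]))
        if predicted is not None and predicted == problem["answer"]:
            correct += 1
    return correct / len(problems)
```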

To put Orca-Math's performance into perspective, it scored higher than the general models LLaMA-2-70B, Gemini Pro, and GPT-3.5, and the math-specific models MetaMath-70B and WizardMath-70B. Moreover, the small size of the dataset and the fact that the model was trained to solve problems without external tools, verifiers, or ensembling mean that Orca-Math's training is faster and cheaper than many alternatives. A multi-agent flow generates high-quality synthetic training data from a small seed dataset: one agent examines a problem and suggests modifications to make it more complex, then passes those suggestions to a second agent, which reviews them and incorporates them into a revised, more challenging problem based on the original. This process can be iterated to increase the problems' complexity even further, and a third agent can be added to confirm that each generated problem is solvable by producing a solution, as in the sketch below.
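A minimal sketch of that suggester/editor/solver loop might look like the following. The `llm` stub and all prompts here are illustrative assumptions; the actual agents and prompts used by the Orca-Math team differ.

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., an API request); replace as needed."""
    raise NotImplementedError("wire this to your LLM of choice")

def make_harder(problem: str, rounds: int = 2) -> str | None:
    """Iteratively rewrite a seed problem into a more complex variant."""
    current = problem
    for _ in range(rounds):
        # Suggester agent: propose ways to increase the problem's complexity.
        suggestions = llm(
            "Suggest modifications that make this math problem more complex "
            f"without changing its topic:\n{current}"
        )
        # Editor agent: incorporate the suggestions into a revised problem.
        current = llm(
            "Rewrite the problem below, incorporating these suggestions.\n"
            f"Problem: {current}\nSuggestions: {suggestions}"
        )
    # Solver agent: keep the problem only if it can actually be solved.
    solution = llm(f"Solve step by step and give a final numeric answer:\n{current}")
    # A crude solvability check for illustration; a real filter would verify the answer.
    return current if "answer" in solution.lower() else None
```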

Once the multi-agent flow is set, the model is trained in a teacher-student paradigm: a larger model (the teacher) creates solution demonstrations, generated with AgentInstruct, for the smaller model to learn from. The SLM is then left to solve problems on its own, producing multiple candidate solutions for each math problem, and the teacher reviews those solutions and offers feedback. If the SLM cannot solve a problem correctly after several attempts, it is re-taught using one of the teacher's solutions. Finally, the feedback is used to create preference data so the model learns to discern a good solution from a bad one. This process can be repeated several times to continue refining the SLM's capabilities.
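One way to picture the preference-data step is the loop below. The functions `student_solve`, `teacher_grade`, and `teacher_solve` are hypothetical stand-ins for the student model, the teacher's feedback, and the teacher's own solutions; the paper's exact procedure differs in detail.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    problem: str
    chosen: str    # a solution the teacher judged correct
    rejected: str  # a solution the teacher judged incorrect

def collect_preference_data(problems, student_solve, teacher_grade,
                            teacher_solve, k: int = 4):
    """Build preference pairs from k student attempts per problem."""
    pairs = []
    for problem in problems:
        # The student produces several candidate solutions per problem.
        candidates = [student_solve(problem) for _ in range(k)]
        # The teacher grades each candidate as correct or incorrect.
        graded = [(sol, teacher_grade(problem, sol)) for sol in candidates]
        good = [sol for sol, ok in graded if ok]
        bad = [sol for sol, ok in graded if not ok]
        # Re-teach: if every attempt failed, use the teacher's own solution.
        if not good:
            good = [teacher_solve(problem)]
        # Pair positives with negatives for preference optimization.
        pairs.extend(PreferencePair(problem, g, b) for g in good for b in bad)
    return pairs
```

The resulting pairs can then drive a preference-learning stage so the model learns to prefer correct solutions over incorrect ones, and the whole loop can be run again on the refined model.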

To support continued research on the training and specialization of small language models in specific domains, the research team behind Orca-Math is releasing the dataset used for training, along with a report describing the procedure.