Train a QA model

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

The Hugging Face Model Hub has a wide range of models that can handle many tasks. While these models perform well, the best performance is often found when fine-tuning a model with task-specific data.

Hugging Face provides a number of full-featured examples to assist with training task-specific models. When building models from the command line, these scripts are a great way to get started.

txtai provides a training pipeline that can be used to train new models programmatically using the Transformers Trainer framework.

This example trains a small QA model and then further fine-tunes it with a couple new examples (few-shot learning).

Install dependencies

Install txtai and all dependencies.

pip install txtai datasets pandas

Train a SQuAD 2.0 Model

The first step is training a SQuAD 2.0 model. SQuAD is a question-answer dataset that poses a question with a context along with the identified answer. It's also possible to not have an answer. See the SQuAD dataset website for more information.

We'll use a tiny Bert model with a portion of SQuAD 2.0 for efficiency purposes.

from datasets import load_dataset
from txtai.pipeline import HFTrainer

ds = load_dataset("squad_v2")

trainer = HFTrainer()
trainer("google/bert_uncased_L-2_H-128_A-2", ds["train"].select(range(3000)), task="question-answering", output_dir="bert-tiny-squadv2")
print("Training complete")

Fine-tune with new data

Next we'll add a few additional examples. Fine-tuning a QA model will help with framing a certain type of question or improve performance for a specific use-case.

For smaller models with a narrow use case, this helps the model zero in on the types of questions that are to be asked. In this case, we want to tell the model exactly the types of information we're looking for when asking for ingredients. This will help improve confidence in the answers the model is generating.

# Training data
data = [
    {"question": "What ingredient?", "context": "Pour 1 can whole tomatoes", "answers": "tomatoes"},
    {"question": "What ingredient?", "context": "Dice 1 yellow onion", "answers": "onion"},
    {"question": "What ingredient?", "context": "Cut 1 red pepper", "answers": "pepper"},
    {"question": "What ingredient?", "context": "Peel and dice 1 clove garlic", "answers": "garlic"},
    {"question": "What ingredient?", "context": "Put 1/2 lb beef", "answers": "beef"},
]

model, tokenizer = trainer("bert-tiny-squadv2", data, task="question-answering", num_train_epochs=10)

Test the model

Now we're ready to test the results! The following sections run a question against the original model only trained with SQuAD 2.0 and the further fine-tuned model.

from transformers import pipeline

questions = pipeline("question-answering", model="bert-tiny-squadv2")
questions("What ingredient?", "Peel and dice 1 shallot")

{'answer': 'dice 1 shallot',
 'end': 23,
 'score': 0.05128436163067818,
 'start': 9}

from transformers import pipeline

questions = pipeline("question-answering", model=model.to("cpu"), tokenizer=tokenizer)
questions("What ingredient?", "Peel and dice 1 shallot")

{'answer': 'shallot', 'end': 23, 'score': 0.13187439739704132, 'start': 16}

See how the results are more confident and have a better answer. This method allows using a smaller model with a narrow set of functionality with the upside of increased speed. Give it a try with your own data!