Train without labels

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

Almost all data available is unlabeled. Labeled data takes effort to manually review and/or takes time to collect. Zero-shot classification takes existing large language models and runs a similarity comparison between candidate text and a list of labels. This has been shown to perform surprisingly well.

The problem with zero-shot classifiers is that they need to have a large number of parameters (400M+) to perform well against general tasks, which comes with sizable hardware requirements.

This article explores using zero-shot classifiers to build training data for smaller models. A simple form of knowledge distillation.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Apply zero-shot classifier to unlabeled text

The following section takes a small 1000 record random sample of the sst2 dataset and applies a zero-shot classifer to the text. The labels are ignored. This dataset was chosen only to be able to evaluate the accuracy at then end.

import random

from datasets import load_dataset

from txtai.pipeline import Labels

def batch(texts, size):
    return [texts[x : x + size] for x in range(0, len(texts), size)]

# Set random seed for repeatable sampling
random.seed(42)

ds = load_dataset("glue", "sst2")

sentences = random.sample(ds["train"]["sentence"], 1000)

# Load a zero shot classifier - txtai provides this through the Labels pipeline
labels = Labels("microsoft/deberta-large-mnli")

train = []

# Zero-shot prediction using ["negative", "positive"] labels
for chunk in batch(sentences, 32):
    train.extend([{"text": chunk[x], "label": label[0][0]} for x, label in enumerate(labels(chunk, ["negative", "positive"]))])

Next, we'll use the training set we just built to train a smaller Electra model.

from txtai.pipeline import HFTrainer

trainer = HFTrainer()
model, tokenizer = trainer("google/electra-base-discriminator", train, num_train_epochs=5)

Evaluating accuracy

Recall the training set is only 1000 records. To be clear, training an Electra model against the full sst2 dataset would perform better than below. But for this exercise, we're are not using the training labels and simulating labeled data not being available.

First, lets see what the baseline accuracy for the zero-shot model would be against the sst2 evaluation set. Reminder that this has not seen any of the sst2 training data.

labels = Labels("microsoft/deberta-large-mnli")

results = [row["label"] == labels(row["sentence"], ["negative", "positive"])[0][0] for row in ds["validation"]]
sum(results) / len(ds["validation"])

0.8818807339449541

88.19% accuracy, not bad for a model that has not been trained on the dataset at all! Shows the power of zero-shot classification.

Next, let's test our model trained on the 1000 zero-shot labeled records.

labels = Labels((model, tokenizer), dynamic=False)

results = [row["label"] == labels(row["sentence"])[0][0] for row in ds["validation"]]
sum(results) / len(ds["validation"])

0.8738532110091743

87.39% accuracy! Wouldn't get too carried away with the percentages but this at least nearly meets the accuracy of the zero-shot classifier.

Now this model will be highly tuned for a specific task but it had the opportunity to learn from the combined 1000 records whereas the zero-shot classifier views each record independently. It's also much more performant.

Conclusion

This notebook explored a method of building trained text classifiers without training data being available. Given the amount of resources needed to run large-scale zero-shot classifiers, this method is a simple way to build smaller models tuned for specific tasks. In this example, the zero-shot classifier has 400M parameters and the trained text classifier has 110M.