LabelTrain

Label Efficient Learning in the Era of Foundation Models

About Us

LabelTrain is an ongoing open-source project for experimentally exploring the strengths and weaknesses of label-efficient learning algorithms. Our results so far underscore the significant advantages of label-efficient learning in the era of foundation models, achieved by integrating ideas from self-supervised/contrastive pretraining, semi-supervised learning, active learning, and more.

Research Highlights

Best Label Efficiency =

Active Learning + Semi-SL + Foundation Model

The LabelBench framework uses both active learning and semi-supervised learning methods to finetune large pretrained models. We observe the best label efficiency, along with a synergistic effect, when all of these label-efficient methods are combined under a single framework.

To address the extra computational cost induced by active learning, we propose a selection-via-proxy method that reduces LabelBench's total computational cost to no more than 2x that of Semi-SL alone.
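As a rough illustration of the selection-via-proxy idea (not LabelBench's actual code), the sketch below assumes that some cheaper proxy model has already produced class probabilities for the unlabeled pool, and that the expensive model is finetuned only on the examples chosen from those probabilities. The function names, the toy probabilities, and the choice of margin sampling as the selection rule are all assumptions made for this example.

```python
def margin_scores(probs):
    """Gap between the two largest class probabilities; a small gap means high uncertainty."""
    scores = []
    for p in probs:
        top = sorted(p, reverse=True)
        scores.append(top[0] - top[1])
    return scores


def select_batch(proxy_probs, unlabeled_ids, batch_size):
    """Return the ids of the `batch_size` pool examples with the smallest margin."""
    margins = margin_scores(proxy_probs)
    ranked = sorted(zip(margins, unlabeled_ids))
    return [idx for _, idx in ranked[:batch_size]]


# Hypothetical proxy outputs for four unlabeled examples (three classes each).
proxy_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.44, 0.01],
    [0.34, 0.33, 0.33],
]
queried = select_batch(proxy_probs, unlabeled_ids=[10, 11, 12, 13], batch_size=2)
print(queried)  # ids to send for annotation
```

Only the proxy is re-evaluated in each query round here, which is where the computational saving would come from under these assumptions.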

LabelBench: [Paper] | [Repo]

Label Efficient Supervised Fine-Tuning Saves 50% Annotation Cost for LLMs

To enhance label efficiency in Supervised Fine-Tuning, we introduce an experimental design framework that addresses active learning's high computational overhead for LLMs. Our evaluation of eight uncertainty- and/or diversity-based algorithms demonstrates that the top-performing method matches the generalization performance of random sampling on generative tasks at just 50% of the annotation cost, while keeping computational overhead low.
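To make "diversity-based" selection concrete, here is a minimal sketch of k-center greedy, a generic diversity heuristic. It is offered as an illustration only: the text above does not name the eight algorithms evaluated, so this should not be read as one of them. The embeddings and the function name are hypothetical.

```python
import math


def k_center_greedy(embeddings, labeled_idx, budget):
    """Pick `budget` points, each time taking the one farthest from everything chosen so far.

    `labeled_idx` must be non-empty: it seeds the set of already-covered points.
    """
    # Distance from every point to its nearest already-labeled point.
    min_dist = [
        min(math.dist(e, embeddings[j]) for j in labeled_idx) for e in embeddings
    ]
    chosen = []
    for _ in range(budget):
        i = max(range(len(embeddings)), key=lambda k: min_dist[k])
        chosen.append(i)
        # The new point now covers its neighbours, so shrink their distances.
        for k, e in enumerate(embeddings):
            min_dist[k] = min(min_dist[k], math.dist(e, embeddings[i]))
    return chosen


# Hypothetical 2-D prompt embeddings; index 0 is already annotated.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
to_annotate = k_center_greedy(embeddings, labeled_idx=[0], budget=2)
print(to_annotate)
```

The near-duplicate at index 1 is never selected, which is the behaviour a diversity criterion is meant to provide.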

Label Efficient SFT: [Paper]

New Efficiency Records of Active Learning

Active learning saves significantly more annotation cost when finetuning large pretrained models than when training a ResNet from scratch, where the savings are marginal. This suggests that the benefit of active learning is increasing in the era of foundation models, and it challenges the conventional wisdom that active learning yields only marginal benefits for deep learning.

LabelBench: [Paper] | [Repo]

Active Learning for Imbalanced Data

Class imbalance significantly impedes learning performance, especially for rare and minority classes. Active learning excels in addressing this by efficiently labeling a balanced and informative subset of examples, showing up to 16x improvement in label efficiency. The key to this success is labeling examples closest to the optimal separation threshold, rather than the model decision boundary used in other active learning literature.
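The difference between querying near a separation threshold and querying near the model's decision boundary can be shown with a simplified one-dimensional, binary sketch. It assumes each example has a scalar score for the rare class, estimates the threshold that best separates the labeled examples, and queries the unlabeled examples closest to it. The function names and the toy scores are assumptions for illustration; the algorithms in the linked papers are not reproduced here.

```python
def best_threshold(scores, labels):
    """Threshold t that minimizes errors of the rule `score >= t -> positive` on labeled data."""
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)  # option: predict everything negative

    def errors(t):
        return sum((s >= t) != bool(y) for s, y in zip(scores, labels))

    return min(candidates, key=errors)


def query_near_threshold(unlabeled_scores, threshold, batch_size):
    """Indices of the `batch_size` unlabeled examples whose score is closest to `threshold`."""
    order = sorted(
        range(len(unlabeled_scores)),
        key=lambda i: abs(unlabeled_scores[i] - threshold),
    )
    return order[:batch_size]


# Hypothetical rare-class scores: under imbalance, positives can score well below 0.5.
labeled_scores = [0.05, 0.10, 0.15, 0.30, 0.40]
labeled_labels = [0, 0, 0, 1, 1]
threshold = best_threshold(labeled_scores, labeled_labels)

unlabeled_scores = [0.02, 0.28, 0.33, 0.50, 0.90]
queried = query_near_threshold(unlabeled_scores, threshold, batch_size=2)
print(threshold, queried)
```

In this toy data a rule that queries near a 0.5 decision boundary would pick index 3 first, whereas the threshold-based rule concentrates on the examples around 0.3, where the two classes actually separate.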

Imbalanced Active Learning: [Paper 1] | [Paper 2] | [Paper 3] | [Paper 4]

Active Multi-task Learning

We introduce an active multi-task learning framework that extends beyond the conventional single-task, sample-based querying approach by exploiting shared representations from multiple source tasks. This method addresses the challenge of scarce data in target tasks and saves budget by querying only the most relevant source tasks.

Active Multi-task Learning: [Paper 1] | [Paper 2]

Continual Active Learning

Active learning typically requires retraining the model from scratch for every query round. Using the proposed continual active learning (CAL) framework, we show that this major bottleneck can be circumvented without any deterioration in generalization performance.

Continual Active Learning: [Paper] | [Repo]