The choice of the loss function is critical in extreme multi- label learning where the objective is to annotate each data point with the most relevant subset of labels from an ex- tremely large label set. Unfortunately, existing loss func- tions, such as the Hamming loss, are unsuitable for learning, model selection, hyperparameter tuning and performance evaluation. This paper addresses the issue by developing propensity scored losses which: (a) prioritize predicting the few relevant labels over the large number of irrelevant ones; (b) do not erroneously treat missing labels as irrelevant but instead provide unbiased estimates of the true loss func- tion even when ground truth labels go missing under arbi- trary probabilistic label noise models; and (c) promote the accurate prediction of infrequently occurring, hard to pre- dict, but rewarding tail labels. Another contribution is the development of the PfastreXML algorithm (code available from [1]) which efficiently scales to large datasets with up to 9 million labels, 70 million points and 2 million dimensions and which gives significant improvements over the state-of- the-art.

This paper’s results also apply to tagging, recommenda- tion and ranking which are the motivating applications for extreme multi-label learning. They generalize previous at- tempts at deriving unbiased losses under the restrictive as- sumption that labels go missing uniformly at random from the ground truth. Furthermore, they provide a sound the- oretical justification for popular label weighting heuristics used to recommend rare items. Finally, they demonstrate that the proposed contributions align with real world ap- plications by achieving superior clickthrough rates on spon- sored search advertising in Bing.

Filed under: