Headlines are particularly important for online news out-lets where there are many similar news stories competing for users’ attention. Traditionally, journalists have followed rules-of-thumb and experience to master the art of crafting catchy headlines, but with the valuable resource of large-scale click-through data of online news articles, we can apply quantitative analysis and text mining techniques to acquire an in-depth understanding of headlines. In this paper, we conduct a large-scale analysis and modeling of 150K news articles published over a period of four months on the Yahoo home page. We define a simple method to measure click-value of individual words, and analyze how temporal trends and linguistic attributes affect click-through rate (CTR). We then propose a novel generative model, headline click-based topic model (HCTM), that extends latent Dirichlet allocation (LDA) to reveal the effect of topical context on the click-value of words in headlines. HCTM leverages clicks in aggregate on previously published headlines to identify words for headlines that will generate more clicks in the future. We show that by jointly taking topics and clicks into account we can detect changes in user interests within topics. We evaluate HCTM in two different experimental settings and compare its performance with ALDA (adapted LDA), LDA, and TextRank. The first task, full headline, is to retrieve full headline used for a news article given the body of news article. The second task, good headline, is to specifically identify words in the headline that have high click values for current news audience. For full headline task, our model performs on par with ALDA, a state-of-the art web-page summarization method that utilizes click-through information. For good headline task, which is of more practical importance to both individual journalists and online news outlets, our model significantly outperforms all other comparative methods.

Filed under: Graph Mining and Social Networks | Dimensionality Reduction