How to Compete Online for News Audience: Modeling Words that Attract Clicks
Joon Hee Kim*, KAIST; Amin Mantrach, Yahoo! Research; Alex Jaimes, Yahoo!; Alice Oh, Korea Advanced Institute of Science and Technology
Headlines are particularly important for online news outlets, where many similar news stories compete for users’ attention. Traditionally, journalists have relied on rules of thumb and experience to craft catchy headlines, but large-scale click-through data for online news articles now lets us apply quantitative analysis and text mining techniques to gain an in-depth understanding of headlines. In this paper, we conduct a large-scale analysis and modeling of 150K news articles published over a period of four months on the Yahoo home page. We define a simple method to measure the click-value of individual words, and analyze how temporal trends and linguistic attributes affect click-through rate (CTR). We then propose a novel generative model, the headline click-based topic model (HCTM), that extends latent Dirichlet allocation (LDA) to reveal the effect of topical context on the click-value of words in headlines. HCTM leverages aggregate clicks on previously published headlines to identify headline words that will generate more clicks in the future. We show that by jointly taking topics and clicks into account we can detect changes in user interests within topics. We evaluate HCTM in two different experimental settings and compare its performance with ALDA (adapted LDA), LDA, and TextRank. The first task, full headline, is to retrieve the full headline used for a news article given the body of the article. The second task, good headline, is to identify specifically those words in the headline that have high click-values for the current news audience. For the full headline task, our model performs on par with ALDA, a state-of-the-art web-page summarization method that utilizes click-through information. For the good headline task, which is of more practical importance to both individual journalists and online news outlets, our model significantly outperforms all other methods compared.
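To make the notion of a word-level click-value concrete, the following is a minimal sketch under stated assumptions: the abstract does not give the formula, so the code assumes a word's click-value is the mean CTR of the headlines that contain it. The function name `word_click_values`, the `(text, clicks, views)` input layout, the `min_count` parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def word_click_values(headlines, min_count=1):
    """Map each word to the mean CTR of the headlines containing it.

    `headlines` is an iterable of (text, clicks, views) tuples.
    ASSUMPTION: averaging headline CTR per word is an illustrative
    choice, not necessarily the definition used in the paper.
    """
    ctr_sum = defaultdict(float)
    count = defaultdict(int)
    for text, clicks, views in headlines:
        if views <= 0:
            continue  # CTR is undefined without impressions
        ctr = clicks / views
        # Count each word once per headline, case-insensitively.
        for word in set(text.lower().split()):
            ctr_sum[word] += ctr
            count[word] += 1
    # Keep only words seen in at least `min_count` headlines.
    return {w: ctr_sum[w] / count[w] for w in ctr_sum if count[w] >= min_count}


# Toy data, made up purely for illustration.
sample = [
    ("Storm hits coast", 30, 1000),
    ("Storm warning issued", 10, 1000),
    ("Markets rally", 50, 1000),
]
values = word_click_values(sample)
frequent = word_click_values(sample, min_count=2)
```

Raising `min_count` drops words that appear in too few headlines for their average to be meaningful; in this toy sample only one word survives a threshold of two.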