Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email
Ying Sheng (Google); Sandeep Tata (Google); James B. Wendt (Google); Jing Xie (Google); Qi Zhao (Google); Marc Najork (Google)
Extracting structured data from emails can enable several assistive experiences, such as reminding the user when a bill payment is due, answering queries about the departure time of a booked flight, or proactively surfacing an emailed discount coupon while the user is at that store. This paper presents Juicer, a system for extracting information from email that is serving over a billion Gmail users daily. We describe how the design of the system was informed by three key principles: scaling to a planet-wide email service, isolating the complexity to provide a simple experience for the developer, and safeguarding the privacy of users (our team and the developers we support are not allowed to view any single email). We describe the design tradeoffs made in building this system, the challenges faced and the approaches used to tackle them. We present case studies of three extraction tasks implemented on this platform—-bill reminders, commercial offers, and hotel reservations—-to illustrate the effectiveness of the platform despite challenges unique to each task. Finally, we outline several areas of ongoing research in large-scale machine-learned information extraction from email.