Earmarks are legislative provisions that direct federal funds to specific projects, circumventing the competitive grant-making process of federal agencies. Identifying and cataloging earmarks is a tedious, time-consuming process carried out by experts from public interest groups. In this paper, we present a machine learning system for automatically extracting earmarks from congressional bills and reports. We first describe a table-parsing algorithm for extracting budget allocations from appropriations tables in congressional bills. We then use machine learning classifiers to identify budget allocations as earmarked objects with an out of sample ROC AUC score of 0.89. Using this system, we construct the first publicly available database of earmarks dating back to 1995. Our machine learning approach adds transparency, accuracy and speed to the congressional appropriations process.

Filed under: Classification