A guide for using computational text analysis to learn about the social world
From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.
Text as Data is organized around the core tasks in research projects using text: representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research.
Bridging many divides (computer science and social science, the qualitative and the quantitative, and industry and academia), Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.
A clear, concise, deeply practical guide to applied text analysis for social science. Appropriate for a reader just getting into text data, whether a grad student or a researcher moving into a new area, it walks you through the complete set of steps for conducting a research project based on quantitative text analysis, from conception to execution, with extensive examples, mostly but not exclusively from political science, and intuitive explanations of major concepts.
This is not primarily a stats or technical natural language processing book: there are formulas and discussion of algorithms, but it's kept fairly light, and it covers the "what" and "why" more than the "how," albeit with sufficient references that a reader could pick up the details on their own. Given the speed of technical advances in this field, I view that largely as a positive. An explainer of the hottest current ML framework or software package is going to become obsolete fairly quickly, and indeed, a lot of the actual discussion, when concrete, is about various extensions of LDA-style topic models that seem to have peaked in usage several years ago and are rapidly being displaced by neural methods. But while implementation will change, the experienced discussion of how to put together a data set, pose your problem as a supervised or unsupervised learning problem, obtain and validate labels, and so on will continue to be useful. So while it definitely needs to be supplemented with tutorials and papers and the like, I can see this forming the core of classes on text data for social scientists for years to come.
The prose in this book is an extremely accessible description of text analysis, and it includes enough in-the-weeds mathematical notation to satisfy more advanced folks. The one thing I really found myself wishing it had was accompanying R code examples for each chapter. I think that would have made this book a clear 5 stars.
I thought most of the material was covered well, but some of the technical descriptions were not the clearest I’ve read (e.g., naive Bayes, which is covered as a simple introduction to supervised learning with text data in most natural language processing texts, felt particularly opaque here). In addition, the section on causal inference felt long; I feel that, with more abstraction, its chapters could be combined into one without much loss.
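For readers who have not met the naive Bayes classifier mentioned above, here is a minimal sketch of the multinomial variant with add-one smoothing, written in plain Python. It is not taken from the book: the function names `train_naive_bayes` and `predict`, the whitespace tokenizer, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with add-alpha smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    # log P(class): share of training documents carrying each label
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}

    # log P(word | class), smoothed so unseen words never get probability zero
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, text):
    """Return the class with the highest posterior log-score for `text`."""
    log_prior, log_likelihood, vocab = model
    # Words never seen in training are ignored
    tokens = [t for t in text.lower().split() if t in vocab]
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy, hand-made training data (hypothetical, for illustration only)
docs = [
    "tax cuts for families",
    "lower taxes and tax relief",
    "protect the forest and rivers",
    "clean rivers and clean air",
]
labels = ["economy", "economy", "environment", "environment"]

model = train_naive_bayes(docs, labels)
print(predict(model, "tax relief for families"))
print(predict(model, "clean the rivers"))
```

The "naive" part is the assumption that words are independent given the class, which is why the score is just the class prior plus a sum of per-word log probabilities.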
A bittersweet read, though. I started this book with a desperate enthusiasm in what would end up being my last term as a doctoral student. Shortly after leaving my program, I saw that the section on word representations covered a chapter of my planned dissertation well, which left me feeling both that I had been on the right track and that I may have had a place in the research world (one I hadn’t found quickly enough). I wrapped up the book today, a day on which I remain uncertain of what comes next for me. I am certainly not picking my research back up from here, reanimated by the infant nostalgia this book stirred for my academic attempt, but I do love the topic for what it is, separate from my decision to walk away from my research. Maybe I will work with text as data in the future, but at the moment, I am not sure.