Text as Data

Textual data have rapidly gained attention in the communication, social, and political sciences, and a companion to data management would be incomplete without discussing them. This chapter starts with basic operations on character strings, such as concatenation and search and replace. It then moves on to the management of text corpora and to routine tasks in handling textual data, such as stemming, stop-word deletion, and the creation of term-frequency matrices.
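
For a flavor of these operations, here is a minimal sketch in base R only. It does not use the tm or quanteda packages covered in the chapter, and it is not taken from the chapter's own material: the example strings, the one-word stop list, and all variable names are made up for illustration.

```r
# Illustrative strings (hypothetical, not from the chapter)
first  <- "text"
second <- "data"
docs   <- c("Text as data", "Data management", "Plain text")

# Concatenation: paste() joins strings with a separator (a space by default)
combined <- paste(first, "as", second)

# Search: grepl() reports which elements match a pattern
has_text <- grepl("text", docs, ignore.case = TRUE)

# Replace: gsub() substitutes every match of a pattern
replaced <- gsub("text", "string", docs, ignore.case = TRUE)

# Tokenize: lower-case, then split on anything that is not a letter a-z
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Stop-word deletion with a toy stop list (an assumption for this sketch)
stop_words <- c("as")
tokens <- lapply(tokens, function(x) x[nzchar(x) & !(x %in% stop_words)])

# Term-frequency matrix: one row per document, one column per term
vocab <- sort(unique(unlist(tokens)))
tf <- t(vapply(
  tokens,
  function(x) tabulate(match(x, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
dimnames(tf) <- list(paste0("doc", seq_along(docs)), vocab)

print(tf)
```

Stemming is left out on purpose, because base R has no stemmer; it needs an add-on package. Corpus packages wrap steps like these in dedicated functions and data structures, so the hand-rolled matrix above is only meant to show what such a structure contains.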

Below is the supporting material for the various sections of the chapter.

Character Strings

Text Corpora with the tm Package

Improvements Provided by the quanteda Package