We are all familiar with Internet searches. After we type a few well-chosen words, a search engine returns documents relevant to what we are looking for. At least we hope so, and often it works well.
Here we are concerned with a subtly different problem. Instead of finding relevant documents, assume we already have the documents and wish to score them with regard to a theme or concept. Consider, for example, the use of “we words,” which in standard English (with allowance for common typos) are:
lets, let’s, our, ours, ourselves, us, we, we’d, we’ll, we’re, weve, we’ve
Together this vocabulary constitutes a representation of a concept, “we words”. In accepting this, we adopt the stance of vector semantics (the representation of concepts as collections of terms), and we call the vocabulary a concept vector for “we words”.
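A concept vector of this kind is easy to put to work. As a minimal sketch (not the authors' actual code), one might represent the vocabulary as a set and count how many tokens of a text fall in it; the tokenizer here is a simple regular expression and is an assumption of ours:

```python
import re

# The "we words" concept vector, as listed above (typo variants
# such as "lets" and "weve" included, per the article).
WE_WORDS = {
    "lets", "let's", "our", "ours", "ourselves", "us", "we",
    "we'd", "we'll", "we're", "weve", "we've",
}

def we_word_hits(text):
    """Count tokens of `text` that belong to the concept vector."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for t in tokens if t in WE_WORDS)

print(we_word_hits("We believe our results speak for themselves."))  # 2
```

Any vocabulary could be swapped in for `WE_WORDS`; the mechanism is the same for any concept vector.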
This stance has proved productive and useful in extracting information from bodies of text (corpora). For example, Pennebaker in his book The Secret Life of Pronouns (2011, page 111) writes that
[W]e-words are used frequently when people are arrogant, emotionally distant and high in status. Males especially use we in a distancing or royal form: “We need to analyze that data” or “We aren’t going to put up with higher taxes.”
OK, what about, say, corporations? Well, we collected 23 years of 10-K (annual report) filings by IBM with the Securities and Exchange Commission (1994-2016) and applied the we-words concept vector to the documents.
Here’s what we got.
Normalizing by the length of the documents, we see explosive growth in we-words starting around 2010. Why? One can conjecture. Simple text mining with concept vectors can hardly settle the matter, but it can clearly serve to find patterns of interest and to raise issues well worth following up.
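The normalization step above can be sketched as a rate per token, so that longer filings do not score higher merely by being longer. This is our illustrative reading of the procedure, applied here to toy strings rather than the actual 10-K texts:

```python
import re

WE_WORDS = {
    "lets", "let's", "our", "ours", "ourselves", "us", "we",
    "we'd", "we'll", "we're", "weve", "we've",
}

def we_word_rate(text):
    """Fraction of tokens in `text` that are we-words
    (hits normalized by document length in tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in WE_WORDS)
    return hits / len(tokens)

# Two toy "filings" of different lengths: the rate, unlike the
# raw count, is comparable across them.
print(we_word_rate("We value our customers."))  # 0.5
```

Applied year by year to a corpus of filings, such rates yield the time series whose pattern the article describes.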
— Christine Chou and Steven Kimbrough