Vector semantics to capture information from texts

We are all familiar with Internet searches. After we type a few well-chosen words, a search engine returns a number of documents relevant to what we are looking for. At least we hope so, and often it all works well.

Here we are concerned with a subtly different problem. Instead of finding relevant documents, assume we already have the documents and wish to score them with regard to a theme or concept. Take, for example, the use of “we words”, which in standard English (with allowance for typos) are:

lets, let’s, our, ours, ourselves, us, we, we’d, we’ll, we’re, weve, we’ve

Together this vocabulary constitutes a representation of a concept, “we words”. In accepting this, we adopt the stance of vector semantics (the representation of concepts as collections of terms), and we call the vocabulary a concept vector for “we words”.
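To make the idea concrete, here is a minimal sketch in Python of how a concept vector might be represented and applied. It is an illustration only, not the code we used: the names WE_WORDS, tokenize and count_concept_hits are hypothetical, and the simple regular-expression tokenizer is an assumption about how words are split.

```python
import re

# The "we words" concept vector, as listed above (straight apostrophes).
WE_WORDS = {
    "lets", "let's", "our", "ours", "ourselves", "us",
    "we", "we'd", "we'll", "we're", "weve", "we've",
}

def tokenize(text):
    """Lowercase the text and split it into words, keeping internal apostrophes."""
    # Assumption: curly apostrophes are treated the same as straight ones.
    normalized = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)

def count_concept_hits(text, vocabulary=WE_WORDS):
    """Count how many tokens in the text belong to the concept vector."""
    return sum(1 for token in tokenize(text) if token in vocabulary)

sample = "We need to analyze that data. Let’s share our results with them."
print(count_concept_hits(sample))
```

Matching on whole tokens rather than substrings matters here: a word such as “focus” contains “us” but should not count as a we-word.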

This stance has proved productive and useful in extracting information from bodies of text (corpora). For example, Pennebaker in his book The Secret Life of Pronouns (2011, page 111) writes that

[W]e-words are used frequently when people are arrogant, emotionally distant and high in status. Males especially use we in a distancing or royal form: “We need to analyze that data” or “We aren’t going to put up with higher taxes.”

OK, what about, say, corporations? Well, we collected 23 years of 10-K (annual report) filings by IBM with the Securities and Exchange Commission (1994-2016) and applied the we-words concept vector to the documents.

Here’s what we got.


Normalizing by the length of the documents, we get an explosive growth in we-words starting about 2010. Why? One can conjecture. Simple text mining with concept vectors can hardly settle the matter, but it clearly can serve to find patterns of interest and to raise issues well worth following up.
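Normalizing by length can be sketched as follows. This is a hedged illustration rather than our actual procedure: the function name we_word_rate, the choice of “per 1,000 tokens” as the unit, and the tokenizer are all assumptions, and the two toy texts are invented stand-ins, not excerpts from any filing.

```python
import re

# The "we words" concept vector (straight apostrophes).
WE_WORDS = {
    "lets", "let's", "our", "ours", "ourselves", "us",
    "we", "we'd", "we'll", "we're", "weve", "we've",
}

def we_word_rate(text, per=1000):
    """Return we-word occurrences per `per` tokens (0.0 for an empty document)."""
    normalized = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in WE_WORDS)
    return per * hits / len(tokens)

# Hypothetical toy documents, invented for illustration only.
toy_filings = {
    "document A": "The company reported revenue growth. The company reduced costs.",
    "document B": "We reported revenue growth. We reduced our costs.",
}

for label, text in toy_filings.items():
    print(label, we_word_rate(text))
```

Dividing by the token count keeps a long document from scoring high simply because it is long, which is what makes scores comparable from one year's filing to the next.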

— Christine Chou and Steven Kimbrough


Posted in Concept vector, Text Analytics, Vector semantics

Coming soon in our Applications of Analytics series

How do socially responsible firms differ from firms that rate poorly on social responsibility? We look at what they say in their annual reports.


Text Analytics Clinic, Fall 2013.

In the fall of 2013, I (Steve Kimbrough) offered a “Text Analytics Clinic” at the University of Pennsylvania. I encouraged students to find interesting projects for which text analytics would be a helpful tool. I was delighted with what they came up with. For example, Robert Dearborn and Dasha Zmachynskaya looked at speeches in the Congressional Record and correlated them with Wikipedia posts on the individuals involved and their publicly known issues with honesty and integrity. They found quite fascinating positive associations. See their report, summarized and with comments, here.

Posted in Business Analytics Clinic, Text Analytics