LIWC is great, all hail LIWC. Except when it stops working.
The Linguistic Inquiry and Word Count (LIWC) program is a text analysis program that provides a numerical representation of a number of dimensions of a text (positive emotion, anger, pronouns, the list goes on). The way it works is pretty simple: we start with the dictionary, which maps each word to one or more numbered categories.
If the word “boy” is in your text, categories 121 and 124 each get incremented by one (those category numbers correspond to “social” and “human”). Then, for each category, the program reports the percentage of words in the sample that matched it, out of the total words in the sample.
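To make that concrete, here is a minimal sketch of the counting step in Python. The one-entry dictionary and the function name category_percentages are made up for illustration; only the “boy” entry and the two category names come from the description above. The word splitting is a simple stand-in, not LIWC's actual rule.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: word -> list of category numbers.
# Only the "boy" entry is taken from the text; a real dictionary is far larger.
DICTIONARY = {
    "boy": [121, 124],
}
CATEGORY_NAMES = {121: "social", 124: "human"}


def category_percentages(text):
    """Return {category name: percent of all words that matched it}."""
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        # Every category listed for the word gets incremented by one.
        for category in DICTIONARY.get(word, []):
            counts[category] += 1
    total = len(words)
    if total == 0:
        return {}
    return {CATEGORY_NAMES[c]: 100.0 * n / total for c, n in counts.items()}


print(category_percentages("The boy ran home"))
# {'social': 25.0, 'human': 25.0}
```

One of the four words matches, and it belongs to both categories, so each category comes out at 25 percent.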
The LIWC program is great at this kind of stuff, and can do a bunch of cool things, like process multiple files at a time. Except for the fact that sometimes it breaks. Specifically, when you’re dealing with really big text files.
The project I worked on last semester involved analyzing tweets through LIWC. And, boy, people sure do tweet a lot. 7.6 megabytes doesn’t sound like a lot. A high quality picture will probably be larger than that. But 7.6 megabytes of text is a lot of text. In this case, it was over a million words. LIWC, the program, crashes when you put in this file.
Thankfully, we had the dictionary from LIWC, so I began working on our own version of LIWC, coded in Python. It didn’t seem that hard: we knew how LIWC was doing the calculations, and while we wouldn’t have any of the fancy features the commercial software provides, for our purposes it seemed good enough.
What’s the progress? Well, it turns out that just counting the total number of words is a pretty complicated task. I have a version of the text analysis program working where I get numerical outputs, but they do not match what the LIWC program provides. This is probably a fine-tuning issue. Over the coming weeks, I’ll be giving LIWC and my program the same inputs, and adjusting my program until the results match.
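Here is a small sketch of why the word count alone is slippery. The three counting rules below are my own illustrations, not LIWC's actual rules (which I am still trying to pin down), and the sample tweet is invented. Since the total word count is the denominator of every percentage, each rule would shift every category score.

```python
import re


def count_whitespace(text):
    # Rule A: anything separated by whitespace counts as a word.
    return len(text.split())


def count_alpha_runs(text):
    # Rule B: every unbroken run of letters counts as a word.
    return len(re.findall(r"[A-Za-z]+", text))


def count_with_apostrophes(text):
    # Rule C: runs of letters/digits, keeping contractions like "don't" whole.
    return len(re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text))


tweet = "@sam don't miss it!! #win :) http://t.co/abc"

print(count_whitespace(tweet))        # 7
print(count_alpha_runs(tweet))        # 10
print(count_with_apostrophes(tweet))  # 9
```

Same tweet, three different totals. The emoticon, the contraction, and the URL are each counted differently depending on the rule, which is exactly the kind of thing I need to line up with LIWC's output.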
How long will this take? Hopefully not long. Famous last words.