Counting the Words
I’ve been looking recently at different ways that newspapers can add value to the news they produce, and one of them is using technology to better mine the information that’s available to bring out themes and nuances that might otherwise be lost. But does it always work?
The most popular page on the WSJ.com website at the moment is Barack Obama’s speech, which has dozens of comments added to it (not all of them illuminating, but that’s another story). What intrigued me was the text analysis box in the text:
Click on that link and you see a sort of tag cloud of words and how frequently they appear in the text of the piece itself. Mouse over a word and a popup tells you how many times Obama used the word. “Black,” for example, appears 38 times; “white” appears only 29. That’s nearly 25% fewer times.
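For the curious, a counter like this takes only a few lines. What follows is a minimal sketch, not the Journal’s actual code: the function names `word_counts` and `percent_fewer` and the sample sentence are my own inventions, and I’m assuming words are matched case-insensitively.

```python
import re
from collections import Counter


def word_counts(text):
    """Count case-insensitive occurrences of each word in text."""
    # Lowercase first so "Black" and "black" are tallied together.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def percent_fewer(larger, smaller):
    """Return how much smaller `smaller` is, as a percentage of `larger`."""
    return 100.0 * (larger - smaller) / larger


# Hypothetical sample text, not taken from the speech.
sample = "Black and white. White, black, BLACK!"
counts = word_counts(sample)
print(counts["black"], counts["white"])

# The gap between the two counts quoted above (38 and 29).
print(round(percent_fewer(38, 29), 1))
```

Run on the figures above, the second calculation gives 23.7, which is where “nearly 25%” comes from.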
Interesting, but useful? My gut reaction is that it cheapens a remarkable speech, remarkable not because of its views but because it’s a piece of oratory that could have been uttered 10, 20, 50, maybe even 100 years ago and still have been understood.
My point? Analyzing a speech using a simple counter is not only pretty pointless (does the fact he said ‘black’ more times than ‘white’ tell us anything? What about the words he didn’t use?) but also paves the way to speechwriters running their own text analysis over speeches before they’re spoken. “Hey, Bob! We need to put more ‘whites’ in there, otherwise people are going to freak out!” “OK, how about mentioning you were in White Plains a couple of times last year?”
Maybe this already happens. But oratory is an art form: it doesn’t succumb to analysis, just as efforts to subject Shakespeare to text analysis don’t really tell us very much about Shakespeare.
The Journal is just messing around, of course, experimenting with what it can to see what might work. We’re merely watching a small episode in newspapers trying to be relevant. And it should be applauded for doing so. But I really hope that something more substantial and smart will come along, because this kind of thing not only misses the mark, but is in danger of quickly becoming absurd.
Perhaps more important, it fails to really add value to the data. Without any analysis of the word frequencies, there’s not much one can say about the exercise except, maybe, “hmmm.” Compare that with a Canadian research project a couple of years back which developed algorithms to measure spin in the 2006 election there. The researchers looked at politicians’ use of particular words: “exception words” (however, unless), for example, and the decreased use of personal pronouns (I, we, me, us), which might imply the speaker was distancing him- or herself from what was being said.
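To give a flavour of the difference, here is a rough sketch of the kind of measurement described. It is not the researchers’ algorithm, which I haven’t seen: the two word lists contain only the examples mentioned above, and the function name `marker_rates` and the per-100-words scale are my own assumptions.

```python
import re

# Only the examples mentioned above; a real word list would
# presumably be much longer (assumption).
EXCEPTION_WORDS = {"however", "unless"}
FIRST_PERSON = {"i", "we", "me", "us"}


def marker_rates(text):
    """Return occurrences per 100 words for each marker category."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {"exception": 0.0, "first_person": 0.0}
    exception = sum(w in EXCEPTION_WORDS for w in words)
    first_person = sum(w in FIRST_PERSON for w in words)
    return {
        "exception": 100.0 * exception / total,
        "first_person": 100.0 * first_person / total,
    }


# Hypothetical sentence, not a quote from any politician.
rates = marker_rates("We will act unless it fails. However, mistakes were made.")
print(rates)
```

Even this crude version asks a question of the text (is the speaker hedging, or stepping back from what is being said?) rather than just tallying words. How such rates would be combined into a single spin score is not something I know, so the sketch stops short of that.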
That sounds smart, but was it revealing? The New Scientist, writing in January 2006, said the researchers concluded that the incumbent, Prime Minister Paul Martin, of the Liberal Party, spun “dramatically more than Conservative Party leader, Stephen Harper, and the New Democratic Party leader, Jack Layton.” Harper, needless to say, won the election.
Oh, and in case you’re interested, Shakespeare used the word “black” 174 times in his oeuvre, according to Open Source Shakespeare, and “white” only 148, 15% fewer occurrences. Clearly a story there.