Text-mining tips
These are some insights from the text-mining that I’ve been doing this week:
Stop and think about stop words
One of the first rules of text-mining should be: always make your own list of stop words. No word is objectively a stop word; which words are and aren’t meaningful depends on your research questions. For example, pronouns are often included in lists of stop words, but I’m very interested in gender, so I want to know the frequencies of gendered words like ‘he’ and ‘she’. If you use someone else’s list without thinking about it, you’ll probably inherit various biases and assumptions. The kind of text you’re working with also makes a difference. In the proceedings of parliament, words like ‘ordered’, ‘resolved’ and ‘committee’ occur too regularly to be much use to most people. If you don’t define your stop words until after you’ve calculated frequencies for every word, you can get a better idea of which words are getting in the way and which ones are interesting.
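One way to put this into practice is to count every word first and only then decide what to exclude. The following is a minimal sketch using collections.Counter from the standard library; the sample text and the choice of stop words are made up for illustration.

```python
from collections import Counter

# Hypothetical sample text standing in for a real document.
text = "ordered that the committee report be read and that he and she be heard"

# Count every word first, without filtering anything.
frequencies = Counter(text.lower().split())

# Inspect the most frequent words to see which ones get in the way.
for word, count in frequencies.most_common(5):
    print(word, count)

# Only now choose the stop words, keeping 'he' and 'she' because they
# matter for a question about gender.
stopwords = {"that", "the", "be", "and", "ordered", "committee"}
filtered = {w: c for w, c in frequencies.items() if w not in stopwords}
print(filtered)
```

Because the full frequency table is kept, the stop word list can be revised later without recounting anything.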
BeautifulSoup is not always the answer
The Python library BeautifulSoup is really useful for extracting data from HTML pages, but maybe I got into the habit of using it too much. This week I was trying to work out how to get some data from pages that didn’t have a very good semantic structure. Doing it with BeautifulSoup looked like it would be really complicated, but then I realised that in this case regular expressions would be much easier.
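As a rough illustration of the kind of case where a regular expression can be simpler, suppose the data sits in loosely structured markup with no helpful tags or classes. The HTML snippet and the pattern below are invented for this sketch and are not taken from the pages described above.

```python
import re

# Hypothetical markup: entries separated only by <br> tags.
html = "12 March 1690: John Smith<br>14 March 1690: Mary Jones<br>"

# Capture a date (day, month name, year) and the text after the colon,
# stopping at the next tag.
pattern = re.compile(r"(\d{1,2} [A-Z][a-z]+ \d{4}): ([^<]+)")

# findall returns one (date, name) tuple per match.
records = pattern.findall(html)
print(records)
```

A pattern like this only works when the text itself is regular; for well-structured pages a parser is still the safer choice.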
Have sets
Python includes a built-in type called a set: an unordered collection of unique items that behaves much like a mathematical set and is incredibly useful for text-mining scripts. Turning a list into a set automatically gets rid of duplicates (and discards the original order). For example, suppose you’ve split some text into a list of separate words:
>>> wordlist = 'it was the best of times it was the worst of times'.split()
>>> wordlist
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
>>> wordset = set(wordlist)
>>> wordset
set(['of', 'it', 'times', 'worst', 'the', 'was', 'best'])
Now we have a set of unique words which we can iterate through using a for loop, counting the occurrences of each word in the list:
for word in wordset:
    wordcount = wordlist.count(word)
Then we can do whatever we want with wordcount (print it to the screen, add it to a tuple or a dictionary, write it to a file).
You can also do mathematical operations on sets, which can be really useful for removing stop words.
Suppose we have a set of stopwords:
>>> stopwordset = set(['of', 'it', 'the'])
We can subtract that from the set of words before we iterate through it:
>>> wordset = wordset - stopwordset
>>> wordset
set(['was', 'worst', 'best', 'times'])
Now the stop words in wordlist are completely ignored, and we don’t even have to do an if test at every iteration.
A dictionary is a bit like a database
Python dictionaries can be thought of as very simple databases. Obviously they can’t do everything that a database can do, but you don’t have to worry about connections or cursors either. When counting words across multiple files it’s easy to keep a running total of each word by updating a dictionary at every iteration. If the word is already in the dictionary, add to the existing count; if it isn’t, add a new key/value pair.
This is how I do it:
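>>> wordcount = dict()
(Then iterate through each file, open and read it etc.)
for word in wordset:
    if word in wordcount:
        # Word seen in an earlier file: add this file's count to the total.
        wordcount[word] = wordcount[word] + wordlist.count(word)
    else:
        # First time we see this word: add a new key/value pair.
        newword = [(word, wordlist.count(word))]
        wordcount.update(newword)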
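The same running total can also be kept with collections.Counter from the standard library, which treats missing keys as zero. This is an alternative sketch rather than the approach above; the list of texts is a stand-in for the contents of real files.

```python
from collections import Counter

# Hypothetical stand-ins for the text read from each file.
texts = [
    "it was the best of times",
    "it was the worst of times",
]
stopwordset = {"of", "it", "the"}

wordcount = Counter()
for text in texts:
    # Drop the stop words, then add this text's counts to the running total.
    words = [w for w in text.split() if w not in stopwordset]
    wordcount.update(words)

print(wordcount)
```

Here Counter.update adds to existing counts instead of replacing them, so no if test is needed to check whether a word has been seen before.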
