Burt table from word frequency list

2 messages · Alan Zaslavsky, Joan-Josep Vallbé

#
Maybe not terribly hard, depending on exactly what you need.  Suppose you 
turn your text into a character vector 'mytext' of words.  Then for a 
table of words appearing delta words apart (ordered), you can table mytext 
against itself with a lag:

nwords <- length(mytext)
# drop the first delta words in one margin and the last delta in the other,
# so element i of the second vector sits delta positions before element i of the first
burttab <- table(mytext[-(1:delta)], mytext[-(nwords + 1 - (1:delta))])

Add it to its transpose and sum over delta up to your maximum distance apart. 
If you want only words appearing near each other within the same sentence 
(or some other unit), pad out each sentence break with at least delta 
instances of a dummy spacer:

     the cat chased the greedy rat SPACER SPACER SPACER the dog chased the
     clever cat
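
Putting the steps above together, here is a minimal sketch of the whole 
construction (the corpus and maxdelta below are made-up illustrations; 
tabulating over a common factor level set keeps the lag tables conformable 
so they can be summed):

```r
# Sketch: sum lagged tables over delta, symmetrize with the transpose,
# and drop the dummy spacer at the end.
mytext <- c("the", "cat", "chased", "the", "greedy", "rat",
            "SPACER", "SPACER", "SPACER",
            "the", "dog", "chased", "the", "clever", "cat")
maxdelta <- 3                      # maximum distance apart
lev <- sort(unique(mytext))        # common levels so the tables are conformable
nwords <- length(mytext)
burt <- matrix(0, length(lev), length(lev), dimnames = list(lev, lev))
for (delta in 1:maxdelta) {
  tab <- table(factor(mytext[-(1:delta)], levels = lev),
               factor(mytext[-(nwords + 1 - (1:delta))], levels = lev))
  burt <- burt + tab + t(tab)      # add the transpose to count both orders
}
burt <- burt[rownames(burt) != "SPACER", colnames(burt) != "SPACER"]
```

With this toy corpus, burt["the", "cat"] is 3: the pairs (the, cat), 
(cat, the) in the first sentence and (the, ... cat) in the second, all 
within distance 3, with no pairs crossing the spacer.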

This will count all pairings at distance delta; if you want to count only 
those for which this was the NEAREST co-occurrence (so

     the cat and the rat chased the dog

would count as two at delta=3 but not one at delta=6), it will be trickier 
and I'm not sure this approach can be modified to handle it.
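
For what it's worth, one way to sketch that trickier variant is to give up 
the vectorized table() and loop explicitly, pairing each position only with 
the nearest later occurrence of each distinct word within maxdelta 
(nearest_pairs is a made-up name for this illustration, and it is O(n * 
maxdelta) rather than vectorized):

```r
# Sketch of the nearest-co-occurrence variant: pair each word only with
# the NEAREST later occurrence of each distinct word within maxdelta.
nearest_pairs <- function(words, maxdelta) {
  lev <- sort(unique(words))
  burt <- matrix(0, length(lev), length(lev), dimnames = list(lev, lev))
  n <- length(words)
  for (i in seq_len(n - 1)) {
    seen <- character(0)                 # word types already paired with position i
    for (j in (i + 1):min(n, i + maxdelta)) {
      if (!(words[j] %in% seen)) {       # only the nearest occurrence counts
        burt[words[i], words[j]] <- burt[words[i], words[j]] + 1
        burt[words[j], words[i]] <- burt[words[j], words[i]] + 1
        seen <- c(seen, words[j])
      }
    }
  }
  burt
}

w <- c("the", "cat", "and", "the", "rat", "chased", "the", "dog")
b <- nearest_pairs(w, 6)
# b["the", "the"] counts the pairs at positions (1,4) and (4,7),
# but not (1,7), since a nearer "the" intervenes
```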
#
Thank you very much for all your comments, and sorry for the confusion  
in my messages. My corpus is a collection of responses to an open  
question from a questionnaire. My intention is not to create groups of  
respondents but to treat all the responses as a "whole discourse" on a  
particular issue, so that I can identify different "semantic contexts"  
within the text. I have all the responses in a single document, which I  
then want to split into strings of a specified length of n words. The  
resulting semantic contexts would be sets of (correlated) word-strings  
containing particularly relevant (correlated) words.

I guess I must dive deeper into the "ca" and "tm" packages. Any other  
ideas are most welcome.
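
In case it helps, a minimal sketch of the splitting step 
(split_into_strings is a made-up name; it assumes the document is already 
one whitespace-separated character string):

```r
# Sketch: split a document into consecutive strings of n words
# (the last string may be shorter).
split_into_strings <- function(doc, n) {
  words <- unlist(strsplit(doc, "[[:space:]]+"))
  words <- words[words != ""]            # drop any empty tokens
  groups <- ceiling(seq_along(words) / n)
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

split_into_strings("the cat chased the greedy rat the dog chased it", 4)
# -> "the cat chased the" "greedy rat the dog" "chased it"
```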

best,

Pep Vallbé
On Mar 30, 2009, at 2:05 PM, Alan Zaslavsky wrote: