
Analysing Character Strings for subsequent frequency analysis

On Dec 30, 2010, at 12:03 PM, bob stoner wrote:

            
There are likely to be some text analysis packages on CRAN, but here is a basic approach to generating a frequency table of the characters in a vector:

Vec <- "The lazy brown fox"


# See ?strsplit, which returns a list
strsplit(Vec, "")
[[1]]
 [1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"


# Get the first list element
strsplit(Vec, "")[[1]]
[1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"


# Where are the o's in the vector?
which(strsplit(Vec, "")[[1]] == "o")
[1] 12 17


# Generate the frequency table of characters (the first, unlabelled column is the space)
table(strsplit(Vec, "")[[1]])

  a b e f h l n o r T w x y z 
3 1 1 1 1 1 1 1 2 1 1 1 1 1 1 
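
If upper and lower case should be pooled and spaces left out of the counts, one possible variation is sketched below. The cleaning step and the names Clean and LetterFreq are my own assumptions for illustration, not part of the approach above:

```r
Vec <- "The lazy brown fox"

# Lower-case the string and remove everything that is not a letter
# (this cleaning step is an illustrative assumption)
Clean <- gsub("[^[:alpha:]]", "", tolower(Vec))

# Split into single characters and tabulate
LetterFreq <- table(strsplit(Clean, "")[[1]])
LetterFreq

# Sort by decreasing frequency
sort(LetterFreq, decreasing = TRUE)
```

Whether to drop punctuation and digits as well depends on what the later frequency analysis needs.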




Now, let's say that Vec has multiple elements, perhaps the result of using readLines() on a text file:

Vec <- c("The lazy brown fox", "jumped over the fence")
strsplit(Vec, "")
[[1]]
 [1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"

[[2]]
 [1] "j" "u" "m" "p" "e" "d" " " "o" "v" "e" "r" " " "t" "h" "e" " " "f"
[18] "e" "n" "c" "e"


# Use lapply() to loop over each list element returned by strsplit(),
# generating a frequency table for each
lapply(strsplit(Vec, ""), table)
[[1]]

  a b e f h l n o r T w x y z 
3 1 1 1 1 1 1 1 2 1 1 1 1 1 1 

[[2]]

  c d e f h j m n o p r t u v 
3 1 1 5 1 1 1 1 1 1 1 1 1 1 1 
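
If a single table over all elements is wanted instead of one table per element, a minimal sketch is to flatten the list returned by strsplit() with unlist() before tabulating. The names AllChars and Total are only illustrative:

```r
Vec <- c("The lazy brown fox", "jumped over the fence")

# Flatten the per-element character vectors into one vector
AllChars <- unlist(strsplit(Vec, ""))

# One frequency table for the whole input
Total <- table(AllChars)
Total
```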


# Get the first 4 characters in each element
# See ?substr
substr(Vec, 1, 4)
[1] "The " "jump"


HTH,

Marc Schwartz