data after write() is off by 1 ?

A follow-up to my own post: I believe I figured this out, but if I should be doing something different, please correct me:
prediction.out <- levels(prediction)[prediction]
write(prediction.out, file="prediction.csv")
This gives me the correct values (the level labels rather than the internal codes).
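
For anyone who wants to check this without the full model, here is a minimal sketch of the same fix. The toy factor `f`, the name `f.out` and the temporary file are made up for illustration and stand in for `prediction` and prediction.csv; only base R is assumed.

```r
# Hypothetical stand-in for `prediction`: a factor whose levels are "0".."9"
f <- factor(c(2, 0, 9, 9, 3), levels = 0:9)

# Index the level labels by the factor's integer codes to recover the labels
f.out <- levels(f)[f]
print(f.out)

# write() now receives a character vector, so the labels reach the file
tmp <- tempfile(fileext = ".csv")
write(f.out, file = tmp)

# Read the file back to confirm what was written
written <- scan(tmp, what = character(), quiet = TRUE)
print(written)
```

Note that write() puts a character vector one value per line by default, rather than five per line as it does for numbers.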

Brian

I am new to R, so I am sure I am making a simple mistake.  I am including complete information in hopes
someone can help me.

Basically my data in R looks good, I write it to a file, and every value is off by 1.

Here is my flow:

str(prediction)
Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
- attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
print(prediction)
   1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23 
   2     0     9     9     3     7     0     3     0     3     5     7     4     0     4     3     3     1     9     0     9     1     1 

OK, so it shows my values are 2, 0, 9, 9, 3, etc.

# I write my file out
write(prediction, file="prediction.csv")

# look at the first 10 values
$ head -10 prediction.csv 
3 1 10 10 4
8 1 4 1 4
6 8 5 1 5
4 4 2 10 1
10 2 2 6 8
5 3 8 5 8
8 6 5 3 7
3 6 6 2 7
8 8 5 10 9
8 9 3 7 8

The complete work of what I did was as follows:

# First I load in a dataset, label the first column as a factor
dataset <- read.csv('train.csv', header=TRUE)
dataset$label <- as.factor(dataset$label)
# it has 42000 obs. 785 variables
str(dataset)
'data.frame':	42000 obs. of  785 variables:
$ label   : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
$ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
$ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
$ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
 [list output truncated]

# I make a sampling testset and trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
testset <- dataset[testindex,]
trainset <- dataset[-testindex,]
# build model, predict, view (svm() comes from the e1071 package)
library(e1071)
model  <- svm(label~., data = trainset, type="C-classification", kernel="radial", gamma=0.0000001, cost=16)
prediction <- predict(model, testset)
tab <- table(pred = prediction, true = testset[,1])
   true
pred    0    1    2    3    4    5    6    7    8    9
  0 1210    0    3    1    0    5    7    2    5    8
  1    0 1415    2    0    2    1    0    7    5    0
  2    0    2 1127   12    3    0    2    7    2    0
  3    0    0    7 1296    0   10    0    2   15    6
  4    1    1    8    2 1201    2    4    3    5   16
  5    3    1    0   13    0 1100    3    1    2    3
  6    3    0    3    0    5    9 1263    0    1    0
  7    0    2    9    6    6    1    0 1296    1   13
  8    3    5    7   11    1    2    0    2 1190    4
  9    1    1    2    3   17    2    0    4    4 1190

OK, everything looks great up to this point, so I try to apply my model to a "real" test set, which has the same format as my previous
dataset except that it does not have the label/factor column, so it's 28000 obs. of 784 variables:

testset <- read.csv('test.csv', header=TRUE)
str(testset)
'data.frame':	28000 obs. of  784 variables:
$ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
$ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
$ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
 [list output truncated]

prediction <- predict(model, testset)
summary(prediction)
  0    1    2    3    4    5    6    7    8    9 
2780 3204 2824 2767 2771 2516 2744 2898 2736 2760 
print(prediction)
   1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23 
   2     0     9     9     3     7     0     3     0     3     5     7     4     0     4     3     3     1     9     0     9     1     1 
  24    25    26    27    28    29    30    31    32    33    34    35    36    37    38    39    40    41    42    43    44    45    46 
   5     7     4     2     7     4     7     7     5     4     2     6     2     5     5     1     6     7     7     4     9     8     7 
 [list output truncated]

write(prediction, file="prediction.csv")
$ head -10 prediction.csv 
3 1 10 10 4
8 1 4 1 4
6 8 5 1 5
4 4 2 10 1
10 2 2 6 8
5 3 8 5 8
8 6 5 3 7
3 6 6 2 7
8 8 5 10 9
8 9 3 7 8

I am obviously making a mistake.  Everything is off by a value of 1.

Can someone tell me what I am doing wrong?

Brian

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
I am new to R, so I am sure I am making a simple mistake.  I am including complete information in hopes
someone can help me.

Basically my data in R looks good, I write it to a file, and every value is off by 1.

Here is my flow:

str(prediction)
  Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
  - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
You have a factor, not numerical data.  Apparently write() is writing
out the factor values (indices into the levels) rather than their string
representation.  (I've never used write().  Normally I would use cat(),
write.csv() or something related to write data to a file for reading
outside of R.)  write.csv() will write out the strings, by default in
quotes, but there are lots of arguments to control the formatting.
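
A minimal sketch of the write.csv() route, for illustration only. The toy factor `f`, the column name `label` and the temporary file are assumptions, not part of the original session.

```r
# Hypothetical stand-in for `prediction`
f <- factor(c(2, 0, 9, 9, 3), levels = 0:9)
tmp <- tempfile(fileext = ".csv")

# write.csv() writes a factor column as its level labels;
# quote = FALSE drops the quotes and row.names = FALSE drops the row index
write.csv(data.frame(label = f), file = tmp, quote = FALSE, row.names = FALSE)

# First line is the header, followed by one label per line
out <- readLines(tmp)
print(out)
```

write.csv() always writes the header line; write.table() with col.names = FALSE can be used if the header is not wanted.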

Duncan Murdoch
Hello,

You are seeing the levels of a factor but saving its values. Internally,
factors are coded as consecutive integers starting at 1, and those codes are
what write() saves to the file. To get the levels "0", "1", etc. and
not the corresponding codes 1, 2, etc., try

levels(prediction)[prediction]

or

as.integer(levels(prediction)[prediction])
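
To make the difference between codes and levels concrete, here is a small sketch on a made-up factor `f` (an assumption standing in for `prediction`; the variable names are illustrative only).

```r
# Hypothetical stand-in for `prediction`
f <- factor(c(2, 0, 9, 9, 3), levels = 0:9)

codes  <- as.integer(f)                # internal codes, starting at 1
labels <- levels(f)[f]                 # level labels as character strings
values <- as.integer(levels(f)[f])     # labels converted back to integers
same   <- as.integer(as.character(f))  # alternative route to the same integers

print(codes)
print(labels)
print(values)
```

Because the levels here are "0" to "9" in order, every code is exactly one more than the digit it stands for, which matches the off-by-1 pattern in the file.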

Hope this helps,

Rui Barradas
On 20-11-2012 19:30, Brian Feeny wrote:
On 20/11/2012 2:30 PM, Brian Feeny wrote:
I am new to R, so I am sure I am making a simple mistake.  I am
including complete information in hopes
someone can help me.

Basically my data in R looks good, I write it to a file, and every
value is off by 1.

Here is my flow:

str(prediction)
  Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
  - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
You have a factor, not numerical data.  Apparently write() is writing
out the factor values (index into the levels) rather than their string
representation.  (I've never used write().  Normally would use cat() or
write.csv() or something related to write data
But as the help page says

      'write' is a wrapper for 'cat', which gives further details on the
      format used.

and cat() does treat a factor as an integer vector:

      Currently only atomic vectors and names are handled, together with
      'NULL' and other zero-length objects (which produce no output).
      Character strings are output 'as is' (unlike 'print.default' which
      escapes non-printable characters and backslash - use
      'encodeString' if you want to output encoded strings using 'cat').
      Other types of R object should be converted (e.g. by
      'as.character' or 'format') before being passed to 'cat'.
to a file for reading outside of R. )  write.csv() will write out the
strings, by default in quotes, but there are lots of arguments
to control the formatting.

Duncan Murdoch
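
The cat() behaviour described in the help page excerpt can be checked with a short sketch. The toy factor `f` and the two temporary files are assumptions for illustration, not part of the original session.

```r
# Hypothetical stand-in for `prediction`
f <- factor(c(2, 0, 9, 9, 3), levels = 0:9)
codes.file  <- tempfile()
labels.file <- tempfile()

# Factor passed as-is: cat() sees the underlying integer vector
cat(f, file = codes.file)

# Converted first, as the help page suggests: cat() sees the labels
cat(as.character(f), file = labels.file)

# Read both files back to compare
codes.read  <- scan(codes.file, quiet = TRUE)
labels.read <- scan(labels.file, what = character(), quiet = TRUE)
print(codes.read)
print(labels.read)
```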


Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
On 20/11/2012 19:46, Duncan Murdoch wrote:
On 20/11/2012 2:30 PM, Brian Feeny wrote:
I am new to R, so I am sure I am making a simple mistake.  I am
including complete information in hopes
someone can help me.

Basically my data in R looks good, I write it to a file, and every
value is off by 1.

Here is my flow:

str(prediction)
   Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
   - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
You have a factor, not numerical data.  Apparently write() is writing
out the factor values (index into the levels) rather than their string
representation.  (I've never used write().  Normally would use cat() or
write.csv() or something related to write data
But as the help page says

       'write' is a wrapper for 'cat', which gives further details on the
       format used.

and cat() does treat a factor as an integer vector:

       Currently only atomic vectors and names are handled, together with
       'NULL' and other zero-length objects (which produce no output).
       Character strings are output 'as is' (unlike 'print.default' which
       escapes non-printable characters and backslash - use
       'encodeString' if you want to output encoded strings using 'cat').
       Other types of R object should be converted (e.g. by
       'as.character' or 'format') before being passed to 'cat'.

to a file for reading outside of R. )  write.csv() will write out the
strings, by default in quotes, but there are lots of arguments
to control the formatting.
Yes, I didn't claim otherwise.

Duncan Murdoch