Why isn't R recognising integers as numbers?

8 messages · jim holtman, Ted, Marc Schwartz +2 more

Ted
#
I have a number of files containing anywhere from a few dozen to a few
thousand integers, one per record.

The statement
"refdata18 = read.csv("K:\\MerchantData\\RiskModel\\Capture.Week.18.csv",
header = TRUE, na.strings = "")" works fine, and if I type refdata18, I get
the integers displayed, one value per record (along with a record number).
However, when I try "fitdistr(refdata18, "negative binomial")" or
"hist.scott(refdata18, prob = TRUE)", I get an error:

Error in fitdistr(refdata18, "negative binomial") : 
  'x' must be a non-empty numeric vector
Or
Error in hist.default(x, nclass.scott(x), prob = prob, xlab = xlab, ...) : 
  'x' must be numeric

How can it not recognise integers as numbers?

Thanks

Ted
#
on 09/21/2008 08:01 PM Ted Byers wrote:
'refdata18' is a data frame and the two functions are expecting a
numeric vector.

If you use:

  fitdistr(refdata18[, 1], "negative binomial")

or

  hist(refdata18[, 1])

you should get a suitable result, presuming that the first column in the
data frame is a numeric vector.

Use:

  str(refdata18)

to get a sense for the structure of the data frame, including the column
names, which you could then use, instead of the above index based syntax.
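For anyone following along without the original file, here is a minimal sketch of the same situation. The data frame 'refdata' and its values are made up for illustration (only the column name X0 is borrowed from the thread), and hist() is called with plot = FALSE so that no graphics device is needed:

```r
# Hypothetical stand-in for a one-column CSV read with read.csv().
refdata <- data.frame(X0 = c(0L, 0L, 1L, 1L, 2L, 3L, 5L, 8L))

str(refdata)                # a data.frame with a single int column

is.numeric(refdata)         # FALSE: the data frame itself is a list
is.numeric(refdata[, 1])    # TRUE: the column, selected by index
is.numeric(refdata$X0)      # TRUE: the same column, selected by name

# hist() rejects the data frame but accepts the extracted column.
res_df  <- try(hist(refdata, plot = FALSE), silent = TRUE)
res_vec <- hist(refdata$X0, plot = FALSE)

inherits(res_df, "try-error")   # TRUE: 'x' must be numeric
class(res_vec)                  # "histogram"
```

The same distinction should apply to any function that insists on a numeric vector, which is what the fitdistr() error message above is saying.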

HTH,

Marc Schwartz
Ted
#
Thanks Jim,

Alas, it wasn't this.  Here is the output from both of your suggestions:
'data.frame':   341 obs. of  1 variable:
 $ X0: int  0 0 0 0 0 0 0 0 0 0 ...
Read 342 items
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [26]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [51]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [76]  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[101]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[126]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[151]  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
[176]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3
[201]  3  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4
[226]  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6
[251]  6  6  6  6  6  6  6  6  6  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  7
[276]  7  7  7  8  8  8  8  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10 10
[301] 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
[326] 12 12 12 18 18 18 18 18 18 18 18 18 18 18 18 18 18

Thanks anyway.

Ted

        
Ted
#
Thanks Marc,

That was it. 

For the last 30 years, I'd write my own code, in FORTRAN, C++, or even Java,
to do whatever statistical analysis I needed.  When at the office, sometimes
I could use SAS, but that hasn't been an option for me in years.

This is the first time I have had to load real data into R (instead of
generating random data to use while playing with some of the stats
functions, or manually typing dummy data).

I take it, then, that the result of loading data is a data frame, and not
just a matrix or array.  Using something like "refdata18[, 1]" feels rather
alien, but I'm sure I'll quickly get used to it.  I'd seen it before in the
R docs, but it didn't register that I had to use it to get the functions of
most interest to me to recognise my data as a vector of numbers, given I'd
provided only a vector of integers as input.

Thanks

Ted
#
on 09/21/2008 09:09 PM Ted Byers wrote:
<snip>

Ted,

If you read the 'Value' section of ?read.csv, it indicates that the
function returns a data frame. It is important to fully read the help
page for new functions so that you understand both how they are used and
the result(s) of their actions, including the 'Notes' section, which can
include further details, including gotchas and idiosyncrasies.

A data frame will be the result of read.csv() even if the data source is
a single column. Think of a data frame in the same way as a spreadsheet
or database table with one or more columns and one or more rows. The
unique aspect of a data frame is that each column can be a different
data type, though that need not be the case.

Thus, you still need to identify the column within the data frame that
you wish to manipulate/analyze further. There are various ways of doing
this, which are covered in Chapter 6 of "An Introduction to R" on Lists
and Data Frames. Some involve the use of indices, others using a column
name, as appropriate. There will be situations where they can be
interchangeable and others where one method will be superior to the
other. Time and experience will provide insight and intuition.
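As a rough illustration of the column-selection options mentioned above, the sketch below builds a small data frame whose columns have different types and pulls out one column in four ways. The names 'df', 'count' and 'label' are invented for the example:

```r
# Hypothetical data frame with columns of different types.
df <- data.frame(count = c(3L, 0L, 7L),
                 label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

sapply(df, class)               # one class per column

by_dollar <- df$count           # by name, with `$`
by_name   <- df[["count"]]      # by name, with `[[`
by_matrix <- df[, "count"]      # matrix-style, by name
by_index  <- df[, 1]            # matrix-style, by position

# All four give the same plain integer vector.
identical(by_dollar, by_name)
identical(by_matrix, by_index)
identical(by_dollar, by_index)
```

Selecting by name tends to survive a reordering of the columns, whereas selecting by position does not, which is one reason the two are not always interchangeable.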

There are a myriad of ways of reading data into R and these are covered
in the Data Import/Export manual. Not all result in a data frame, but in
general and perhaps most commonly, that will be the result.

HTH,

Marc
#
Ted Byers wrote:
Ummm, is there a header line or not? If there isn't, read.csv is going
to eat the first observation, thinking it is a column name (and, since it
is non-syntactic, put an X in front of it).

The scan command looks fine, you just should have assigned it somewhere, 
x <- scan(......) and then fitdistr(x, ....)
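The header effect is easy to reproduce on a toy input. The sketch below feeds a made-up four-line text (not the real file) to read.csv() through its 'text' argument, once with header = TRUE and once with header = FALSE, and then reads the same text with scan():

```r
# Made-up input: four observations, no header line.
csv_text <- "0\n0\n1\n2\n"

with_header    <- read.csv(text = csv_text, header = TRUE)
without_header <- read.csv(text = csv_text, header = FALSE)

names(with_header)      # "X0": the first observation became the name
nrow(with_header)       # 3
names(without_header)   # "V1": default name, nothing lost
nrow(without_header)    # 4

# scan() returns a plain numeric vector, so it can be passed
# directly to functions that expect one.
x <- scan(text = csv_text, quiet = TRUE)
length(x)               # 4
is.numeric(x)           # TRUE
```

This would also explain the output earlier in the thread, where str() reports 341 observations in a column called X0 while scan() reads 342 items.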

  
    
#
Hi Ted (from Ted),
Just to clarify Marc's comments about dataframes in more basic terms.

If you read in data with read.csv() the result returned by the function
is a dataframe. This is a specialised kind of list, which you can think
of as a list of "columns" all of the same length. You can think of each
"column" as a vector of elements, all of which must be of the same type
within the column, though the type can vary (e.g. numeric, factor,
character) between columns. When you display a dataframe, it looks like
a matrix, though in R terms it is not really a matrix; it is a list,
where each component of the list is a "column".

Of course a dataframe, like any list, might have only one component.
But it is still a list -- and the actual contents are only available
"one layer down", after you have extracted that component by some
means (e.g. by using the "$" extractor). Simple example:

  L <- c(1,2,3,4)         ## vector
  L
# [1] 1 2 3 4
  L.df <- data.frame(L=L) ## Dataframe with 1 component named "L"
  L.df
#   L
# 1 1
# 2 2
# 3 3
# 4 4
  L.df$L                  ## Extract the component named "L"
# [1] 1 2 3 4             ## Compare with the result of 'L' above

# Try a regression on L (this works):
  lm(L ~ 1)
# Call:
# lm(formula = L ~ 1)
# Coefficients:
# (Intercept)  
#         2.5  

# Try a regression on L.df (this doesn't work):
  lm(L.df ~ 1)
# Error in model.frame.default(formula = L.df ~ 1,
#   drop.unused.levels = TRUE) : 
#   invalid type (list) for variable 'L.df'

# But it does after you refer to the component L by name:
  lm(L.df$L ~ 1)
# Call:
# lm(formula = L.df$L ~ 1)
# Coefficients:
# (Intercept)  
#         2.5  

# or:
  lm(L ~ 1, data=L.df)
# Call:
# lm(formula = L ~ 1, data = L.df)
# Coefficients:
# (Intercept)  
#         2.5  

# But you can (for a dataframe, not a general list) use an "index"
# method of extraction *as if* it were a matrix (even though it isn't):

  L.df[,1]
# [1] 1 2 3 4
  L.df[3,1]
# [1] 3

# But compare with:
  L.df[1]
#   L
# 1 1
# 2 2
# 3 3
# 4 4

which is essentially the same as L.df itself (e.g. lm(L.df[1] ~ 1)
will fail in exactly the same way that lm(L.df ~ 1) failed).
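To make the difference between the single-bracket and the other forms of extraction concrete, here is a short sketch using the same L.df as above. The drop = FALSE line is an extra I have added to show how the matrix-style index can also be made to return a dataframe:

```r
L.df <- data.frame(L = c(1, 2, 3, 4))

class(L.df[1])                  # "data.frame": still a list of columns
class(L.df[[1]])                # "numeric": the column itself
class(L.df[, 1])                # "numeric": matrix-style index drops to a vector
class(L.df[, 1, drop = FALSE])  # "data.frame": dropping suppressed

# The vector forms can be used wherever a numeric vector is expected.
fit <- lm(L.df[[1]] ~ 1)
coef(fit)                       # intercept 2.5, as in the examples above
```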

The dataframe structure exists in R because so much data is typically
in the row by column (case by variables) layout such as you get in
spreadsheets and associated CSV files, and it is very useful to be
able to get into this layout directly (and refer to the variables
by name, as above).

The full generality of a 'list' can also be useful for encapsulating
data of a less strictly structured kind, but that is another (longer)
story!

Hoping this helps.
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 22-Sep-08                                       Time: 09:30:47
------------------------------ XFMail ------------------------------