Legend for curve fit plot - R-SIG-teaching

Tue, Dec 30, 2014 2:49 PM

Hello all,

I provide a function for my students to do two curve fits with a single set of data:

# Performs two curve fits, quadratic and n lg n, with a plot of the data and the two curves
# First parameter: A data frame
# Second parameter: Name of the independent (x) variable
# Third parameter: Name of the dependent (y) variable
# Fourth parameter: The label for the x-axis
# Fifth parameter: The label for the y-axis
dp4dsFit <- function(dataFrame, indepVarName, depVarName, xLabel, yLabel) {
  library(ggplot2)
  library(labeling)
  dp4dsQuadraticFit <- lm(dataFrame[,depVarName] ~ poly(dataFrame[,indepVarName],2))
  write("=============\r",file="")
  write("Quadratic fit\r",file="")
  write("=============\r",file="")
  print(summary(dp4dsQuadraticFit))
  dp4dsNlogNFit <- lm(dataFrame[,depVarName] ~ dataFrame[,indepVarName]:log(dataFrame[,indepVarName]) + dataFrame[,indepVarName])
  write("==========\r",file="")
  write("n lg n fit\r",file="")
  write("==========\r",file="")
  print(summary(dp4dsNlogNFit))
  ggplot() +
    geom_point(data = dataFrame, aes_string(x = indepVarName, y = depVarName), size = 3) +
    geom_smooth(data = dataFrame, aes_string(x = indepVarName, y = depVarName),
                method = "lm", se = FALSE, colour = "RED", formula = y ~ poly(x,2)) +
    geom_smooth(data = dataFrame, aes_string(x = indepVarName, y = depVarName),
                method = "lm", se = FALSE, colour = "BLUE", formula = y ~ x:log(x) + x) +
    xlab(label = xLabel) +
    ylab(label = yLabel)
}


I use ggplot to produce the plot, but I cannot figure out how to produce the legend. Every example I have seen assumes a separate entry in the legend for each set of data. The problem is I have a single set of data with two different curve fits. How do I make a legend with red for the quadratic curve fit and blue for the n lg n curve fit?

Another minor question. Are the above write statements the best way to echo a message to the console?

Thanks,
Stan

J. Stanley Warford
Professor of Computer Science
Pepperdine University
Malibu, CA 90263
Stan.Warford at pepperdine.edu
310-506-4332

Ista Zahn

Tue, Dec 30, 2014 5:58 PM

Hi Stan,

Here is one way to get the legend (I've also cleaned up the write statements):

dp4dsFit <- function(dataFrame,
                     indepVarName,
                     depVarName,
                     xLabel = indepVarName,
                     yLabel = depVarName) {
  library(ggplot2)
  library(labeling)
  dp4dsQuadraticFit <- lm(dataFrame[,depVarName] ~
                            poly(dataFrame[,indepVarName],2))
  # cat() echoes text to the console; "\n" ends each line
  cat("=============\n",
      "Quadratic fit\n",
      "=============\n", sep = "")
  print(summary(dp4dsQuadraticFit))
  dp4dsNlogNFit <- lm(dataFrame[,depVarName] ~
                        dataFrame[,indepVarName]:log(dataFrame[,indepVarName]) +
                        dataFrame[,indepVarName])
  cat("==========\n",
      "n lg n fit\n",
      "==========\n", sep = "")
  print(summary(dp4dsNlogNFit))
  dataFrame <- rbind(data.frame(dataFrame,
                                predicted = predict(dp4dsQuadraticFit),
                                model = "Quadratic"),
                     data.frame(dataFrame,
                                predicted = predict(dp4dsNlogNFit),
                                model = "n lg n"))
  ggplot() +
    # use == here: with a single =, subset() keeps every row and each point is drawn twice
    geom_point(data = subset(dataFrame, model == "Quadratic"),
               aes_string(x = indepVarName, y = depVarName),
               size = 3) +
    geom_line(data = dataFrame,
              aes_string(x = indepVarName, y = "predicted", color = "model")) +
    xlab(label = xLabel) +
    ylab(label = yLabel)
}
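
For comparison, here is a sketch of another common ggplot2 idiom, one that keeps the two geom_smooth() layers from the original function. Mapping colour to a constant string inside aes() gives each layer its own legend entry, and scale_colour_manual() then assigns the actual colours. The function name fitPlotWithLegend is hypothetical, mtcars only stands in for real data, and the .data pronoun assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)

# Hypothetical helper, not part of the original function
fitPlotWithLegend <- function(data, xName, yName) {
  ggplot(data, aes(x = .data[[xName]], y = .data[[yName]])) +
    geom_point(size = 3) +
    # a constant string inside aes() becomes a legend key
    geom_smooth(aes(colour = "Quadratic"), method = "lm", se = FALSE,
                formula = y ~ poly(x, 2)) +
    geom_smooth(aes(colour = "n lg n"), method = "lm", se = FALSE,
                formula = y ~ x:log(x) + x) +
    # map the legend keys to the colours asked for in the question
    scale_colour_manual(name = "model",
                        values = c("Quadratic" = "red", "n lg n" = "blue"))
}

p <- fitPlotWithLegend(mtcars, "hp", "mpg")
```

Printing p draws the plot. With this approach the curves are refit by geom_smooth(), separately from any lm() objects whose summaries are printed.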


But this is not the R way(tm). The R way is to give your user control
over the output by returning values from your functions, and writing
print or summary methods. Here is how I would go about it:

dp4dsFit <- function(dataFrame,
                     indepVarName,
                     depVarName) {
  dp4dsQuadraticFit <- lm(dataFrame[,depVarName] ~
                            poly(dataFrame[,indepVarName],2))
  dp4dsNlogNFit <- lm(dataFrame[,depVarName] ~
                        dataFrame[,indepVarName]:log(dataFrame[,indepVarName]) +
                        dataFrame[,indepVarName])
  dataFrame <- rbind(data.frame(dataFrame,
                                predicted = predict(dp4dsQuadraticFit),
                                model = "Quadratic"),
                     data.frame(dataFrame,
                                predicted = predict(dp4dsNlogNFit),
                                model = "n lg n"))
  R <- list(dp4dsQuadraticFit = dp4dsQuadraticFit,
            dp4dsNlogNFit = dp4dsNlogNFit,
            dataFrame = dataFrame,
            indepVarName = indepVarName,
            depVarName = depVarName)
  class(R) <- c("dp4dsFit", class(R))
  return(R)
}

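print.dp4dsFit <- function(x, ...) {
  cat("=============\n",
      "Quadratic fit\n",
      "=============\n", sep = "")
  print(x$dp4dsQuadraticFit)
  cat("==========\n",
      "n lg n fit\n",
      "==========\n", sep = "")
  print(x$dp4dsNlogNFit)
  # print methods conventionally return their argument invisibly
  invisible(x)
}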

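# The first argument is named object to match the summary() generic
summary.dp4dsFit <- function(object, plot = FALSE, ...) {
  # summarise the two lm fits, keeping their names
  R <- sapply(object[1:2],
              summary,
              simplify=FALSE)
  if(plot) print(plot(object))
  # reuse the class so the summaries print with the same banners
  class(R) <- c("dp4dsFit", class(R))
  return(R)
}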

plot.dp4dsFit <- function(x, xLabel = x$indepVarName, yLabel = x$depVarName) {
  library(ggplot2)
  ggplot() +
    geom_point(data = subset(x$dataFrame, model == "Quadratic"),
               aes_string(x = x$indepVarName, y = x$depVarName),
               size = 3) +
    geom_line(data = x$dataFrame,
              aes_string(x = x$indepVarName, y = "predicted", color = "model")) +
    xlab(label = xLabel) +
    ylab(label = yLabel)
}

## now you can do it all in one:
models <- dp4dsFit(mtcars, "mpg", "hp")
summary(models, plot=TRUE)
## or just plot it
plot(models)
## or just look at the model summaries
summary(models)
## or do something else entirely:
par( mfcol = c(2, 1))
plot(models[[1]], which = 1)
plot(models[[2]], which = 1)
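
One limitation of fitting with dataFrame[, name] terms is that predict() cannot easily be given new x values, so the fitted lines pass only through the observed points. A possible variation, sketched here on the assumption that the variable names are syntactically valid column names, builds the formulas with reformulate() and fits with data =, which allows prediction on a fine grid. fitByName and grid are hypothetical names, not part of the code above.

```r
# Hypothetical helper: build formulas from variable names, then fit with data =
fitByName <- function(data, indepVarName, depVarName) {
  quadFormula <- reformulate(sprintf("poly(%s, 2)", indepVarName),
                             response = depVarName)
  nlognFormula <- reformulate(c(indepVarName,
                                sprintf("%s:log(%s)", indepVarName, indepVarName)),
                              response = depVarName)
  list(quadratic = lm(quadFormula, data = data),
       nlogn     = lm(nlognFormula, data = data))
}

fits <- fitByName(mtcars, "mpg", "hp")

# Evaluate both fits on 200 evenly spaced x values for smooth curves
grid <- data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 200))
grid$quadratic <- predict(fits$quadratic, newdata = grid)
grid$nlogn     <- predict(fits$nlogn, newdata = grid)
head(grid)
```

The grid columns could then be drawn with geom_line() in place of the per-observation predictions.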

Best,
Ista

_______________________________________________
R-sig-teaching at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-teaching

Warford, Stan

Tue, Dec 30, 2014 9:49 PM

Thanks for the prompt responses. What a great list!

I am going with Ista's solution. I appreciate the R way, but the whole point of this script is to shield students from having to know R as much as possible. I don't want to give them any choices. In fact, last year I had them use Deducer, thinking that point and click would be easy and they could experiment to their hearts' content, but that was a complete disaster. This year, using these pre-written scripts with RStudio was much better. Even I have only learned enough R to show students how to do a curve fit. I am a complete novice.

Dennis questioned the model.

Q: Do you want x:log(x) or x * log(x) in the second geom_smooth() formula?

I hope I am doing this correctly. Computer science theory predicts n lg n behavior for some data sets and quadratic for others. I hope I am fitting to 

A * n * log(n) + B * n + C

where * in the above expression represents multiplication. I was under the impression that : in the model formula was multiplication. Can someone verify that?
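
One way to check this empirically is to fit simulated data three ways and compare. The sketch below uses made-up data (the coefficients 2, 3, and 5 and the noise level are arbitrary assumptions); it shows that n:log(n) + n yields the three-coefficient model above, that it matches the explicit I(n * log(n)) form, and that n * log(n) in a formula adds a separate log(n) term.

```r
set.seed(1)
n <- seq(10, 1000, by = 10)
# Simulated timings following A * n * log(n) + B * n + C plus noise
y <- 2 * n * log(n) + 3 * n + 5 + rnorm(length(n), sd = 20)

fitColon <- lm(y ~ n:log(n) + n)        # interaction operator
fitI     <- lm(y ~ I(n * log(n)) + n)   # explicit arithmetic inside I()
fitStar  <- lm(y ~ n * log(n))          # expands to n + log(n) + n:log(n)

names(coef(fitColon))   # intercept, n, and n:log(n)
names(coef(fitStar))    # same, plus a separate log(n) term

# The : form and the I() form describe the same model
all.equal(unname(fitted(fitColon)), unname(fitted(fitI)))
```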

Thanks,
Stan

J. Stanley Warford
Professor of Computer Science
Pepperdine University
Malibu, CA 90263
Stan.Warford at pepperdine.edu
310-506-4332

Randall Pruim

Wed, Dec 31, 2014 9:16 AM

A few more things for you to consider:

1) ?formula will tell you all about the formula syntax in R. a * b expands to a + b + a:b, and a:b is an "interaction term", which for numeric variables is simply the product of a and b. You can also use the I() function to inhibit the formula from treating things like + and * in formula-special ways.

2) I think you should modify your approach to R slightly. You are correct that beginners are not well served by having too much thrown at them. But I think they are also poorly served by being fed lots of specialized functions that each do one highly specific task in an idiosyncratic way. So...

2a) If you are providing them scripts, make sure that they "play well together" and represent a coherent system. Doing things "the R way" is part of this, but there are actually multiple R ways (and also many things that are definitely not an R way). So...

2b) Begin by determining which standard tools and packages you want to use and make the rest of what you give them fit into that system. This reduces the cognitive load for your students and allows them to do more with less.

2c) In fact, I recommend that you make a list of all the code you used in your most recent course. Then arrange the code into things that you find essential and nonessential. Also mark the things your students found easy and hard. Now try to get rid of as much as you can without losing things students can do. This may require replacing some old favorites with some new favorites. You will know you are succeeding if when you introduce something new, your students can pretty much guess how it works before you show them.

3) The mosaic package (I am the maintainer) provides one particular way of doing this. It begins by assuming you will want to show students the modeling language (to use things like lm()), so it emphasizes using formulas. For this reason, the primary graphics system chosen is lattice rather than ggplot2. (There are a few things that support ggplot2 users as well, and you might take a look at mplot(), in particular.) We have also added functions that provide formula interfaces to many numerical summaries, including our favorite:

  .group   min     Q1 median     Q3   max     mean       sd  n missing
1      4  71.1  78.85  108.0 120.65 146.7 105.1364 26.87159 11       0
2      6 145.0 160.00  167.6 196.30 258.0 183.3143 41.56246  7       0
3      8 275.8 301.75  350.5 390.00 472.0 353.1000 67.77132 14       0
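
The command that produced this output is missing from the archived message. The column layout matches mosaic's favstats(), and the numbers agree with the displacement (disp) of the cars in mtcars grouped by cylinder count, so the call was presumably along the lines of the guarded one below; that reconstruction is an assumption. The base R part recomputes a few of the columns without mosaic.

```r
# Base R check of a few columns: disp summarised within each level of cyl
byCyl <- aggregate(disp ~ cyl, data = mtcars,
                   FUN = function(v) c(min = min(v), median = median(v),
                                       max = max(v), mean = mean(v),
                                       n = length(v)))
print(byCyl)

# Presumed original call (an assumption); runs only if mosaic is installed
if (requireNamespace("mosaic", quietly = TRUE)) {
  print(mosaic::favstats(disp ~ cyl, data = mtcars))
}
```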

In the end, most of what beginners need to do can be done with a single template:

goal( formula, data = mydata )

where formula is one of

y ~ x
y ~ x | z
  ~ x
  ~ x | z
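
As an illustration of the template, here is a sketch using lattice and lm(), with mtcars standing in for mydata. The object names are arbitrary; lattice plots are drawn when the objects are printed.

```r
library(lattice)

# goal(formula, data = mydata), one line per formula shape
scatter <- xyplot(mpg ~ hp, data = mtcars)                 # y ~ x
panels  <- xyplot(mpg ~ hp | factor(cyl), data = mtcars)   # y ~ x | z
hist1   <- histogram(~ mpg, data = mtcars)                 #   ~ x
histBy  <- histogram(~ mpg | factor(cyl), data = mtcars)   #   ~ x | z

# Modeling functions follow the same shape
model <- lm(mpg ~ hp, data = mtcars)
```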

For data manipulation (if that is important to you), we are moving to using the dplyr and tidyr packages.

4) The internals of your functions are less important (for student use) than the API, so choose your API and variable names very carefully.  For example, nearly all R functions that receive data call the variable "data", not "dataFrame".  Why have your function be different from all the rest? That makes students need to remember when to use "data" and when to use "dataFrame".  Compare your functions to the functions in your list in 2c.

5) The mosaic package also provides functions called makeFun() and plotFun() that make it easy to extract from a model a functional representation of the fit and to plot that function on top of a scatter plot.  Try example(makeFun).
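
For readers without mosaic installed, the idea of turning a fitted model into a plottable function can be sketched in base R. This is not mosaic's implementation; fitFun is a hypothetical name and mtcars stands in for real data.

```r
fit <- lm(mpg ~ poly(hp, 2), data = mtcars)

# Wrap predict() so the fitted model can be called like an ordinary function of hp
fitFun <- function(hp) unname(predict(fit, newdata = data.frame(hp = hp)))

fitFun(c(100, 150, 200))   # predicted mpg at three horsepower values

# Draw the fitted curve on top of a scatter plot
plot(mpg ~ hp, data = mtcars)
curve(fitFun(x), add = TRUE, col = "red")
```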

6) I'm including below another version of your function that provides a formula interface:

models <- dp4dsFit(mpg ~ hp, data = mtcars)

It also rearranges the data a bit to be less wasteful of space (at the cost of some slightly trickier ggplot2 code).  These plots could be done in lattice as well, if you decided to go that route.  I've not done any serious debugging or tried to make other improvements; I just wanted to demonstrate how to write something with a formula interface, in case you decide to go more in that direction.

Enjoy!

--rjp

dp4dsFit <- function(formula, data = parent.frame())
{
  # The code below presumes the formula has y ~ x shape.  Fancier code
  # could check this and throw an error when badly shaped formulas
  # are attempted.
  # The code could be made more beautiful using mosaic::lhs() and
  # mosaic::rhs() to extract the left and right sides of the formula.
  # For a two-sided formula, formula[[2]] is the left side (y) and
  # formula[[3]] is the right side (x).
  yName <- paste(deparse(formula[[2]]), collapse="")
  xName <- paste(deparse(formula[[3]]), collapse="")
  quadraticFormula <-
    substitute(y ~ poly(x, 2),
               list(y = formula[[2]], x = formula[[3]]))
  nlognFormula <-
    substitute(y ~ x + x : log(x),
               list(y = formula[[2]], x = formula[[3]]))

  # substitute() splices the formula into the call so that the
  # printed model shows the actual formula rather than a variable name
  dp4dsQuadraticFit <- eval(
    substitute(lm(f, data=data), list(f=quadraticFormula))
  )
  dp4dsNlogNFit <- eval(
    substitute(lm(f, data=data), list(f=nlognFormula))
  )

  # could also use dplyr::mutate() to add in extra variables
  data <- transform(
    data,
    predicted_quad = predict(dp4dsQuadraticFit),
    predicted_nlogn = predict(dp4dsNlogNFit)
  )
  R <- list(dp4dsQuadraticFit = dp4dsQuadraticFit,
            dp4dsNlogNFit = dp4dsNlogNFit,
            data = data,
            yName = yName,
            xName = xName)
  class(R) <- c("dp4dsFit", class(R))
  return(R)
}

print.dp4dsFit <- function(x, ...) {
  cat("=============\n",
      "Quadratic fit\n",
      "=============\n", sep = "")
  print(x$dp4dsQuadraticFit)
  cat("==========\n",
      "n lg n fit\n",
      "==========\n", sep = "")
  print(x$dp4dsNlogNFit)
  invisible(x)
}

summary.dp4dsFit <- function(object, plot = FALSE, ...) {
  R <- sapply(object[1:2],
              summary,
              simplify=FALSE)
  if(plot) print(plot(object))
  class(R) <- c("dp4dsFit", class(R))
  return(R)
}

# Combine aesthetics built by aes_string() and aes() into one mapping
aes_c <- function( ... ) {
  res <- c( ... )
  class(res) <- "uneval"
  res
}

plot.dp4dsFit <- function(x, y, xLabel = x$xName, yLabel = x$yName, ...) {
  library(ggplot2)
  ggplot() +
    geom_point(data = x$data,
               aes_string(x = x$xName, y = x$yName),
               size = 3) +
    # a constant string mapped to colour becomes a legend entry
    geom_line(data = x$data,
              aes_c(
                aes_string(x = x$xName),
                aes(y = predicted_quad, colour = "quadratic"))) +
    geom_line(data = x$data,
              aes_c(
                aes_string(x = x$xName),
                aes(y = predicted_nlogn, colour = "n log n"))) +
    xlab(label = xLabel) +
    ylab(label = yLabel) +
    guides(colour = guide_legend(title="model"))
}

## now you can do it all in one:
models <- dp4dsFit(mpg ~ hp, data = mtcars)
summary(models, plot=TRUE)
## or just plot it
plot(models)
## or just look at the model summaries
summary(models)
## or do something else entirely:
plot(models[[1]], which = 1)
mosaic::mplot(models[[2]], which = 1, system="gg")
