database example

Douglas Bates · 2002-05-16T14:03:42Z

Paul Murrell writes: > > > I am preparing a new computing course for stage II stats students and I > > > will be doing some stuff on databases. > > > > > > I am on the lookout for example databases that I would be able to > > > provide for the students to practice querying. I have a machine set up > > > for the job with MySQL installed, but I'm a bit light on actual > > > databases(!) > > I showed examples of the use of relational databases in a short > > course

Douglas Bates

Thu, May 16, 2002 7:03 AM

Paul Murrell <p.murrell at auckland.ac.nz> writes:

Certainly.

It turns out that the data set as provided by Bryk and Raudenbush did
not have the mean socioeconomic score calculated correctly.  John Fox,
in his online appendix on Linear Mixed Models to his forthcoming book
"An R and S-PLUS Companion to Applied Regression" shows how you could
do that calculation in R.  In SQL the calculation would be

SELECT avg(ses) FROM student GROUP BY school ORDER BY school;

To put that into a new table I think it would be most effective to
create the table from both the student and school tables.  (SQL is not
very good at incorporating new values, calculated within SQL, into
existing tables.)

If you check John's appendix you will see that social scientists
typically use within-group centered values of covariates like ses,
which can be achieved with a SELECT statement joining the tables.
When I do this example in PostgreSQL I create a view then use
db.copy.table to copy the view.  With MySQL you would need to use an
explicit select statement.

CREATE VIEW df AS
   SELECT c.School, c.MEANSES, c.Sector, t.SES - c.MEANSES as cses, t.MathAch
      FROM school as c, student as t
      WHERE c.School = t.School

Another interesting multilevel data set, used by Rodriguez and
Goldman, is available at http://data.princeton.edu/multilevel/
That page has links both to the real data and to simulated data sets.  All
have a three-level structure of child within family within community.

This also looks very useful.  Thanks again!

Finally I was involved in a large project on the analysis of data from
the Texas Assessment of Academic Skills (TAAS) which consisted of 10.5
million test scores on 3.5 million students in thousands of schools
(campuses) within hundreds of school districts.

The enclosed file, installation.txt, describes the installation of the
database and some cleaning up of the records.  Cleaning the database
was a major part of the analysis.

Again superb!  Thank you so much Doug!  With this, and other responses
from the mailing list, I have a plenty of examples for my course to
justify the use of databases (by statisticians) and to demonstrate their
use (by statisticians).  This is such a huge help for me.