request for examples

5 messages · Paul Murrell, M. Edward (Ed) Borasky, Jason Turner +2 more

#
Hi

I hope you don't mind this "cold call", but this seems like a really
good place to contact people with interest/experience/expertise in stats
and databases ... 

I am busy producing a course on statistical computing for stage II
students (to be delivered in the second half of this year).

I will be teaching them about some database issues: the advantages of
databases as a way to store information, how to design databases
properly, and how to retrieve information using SQL.

What I am seriously lacking are some killer examples.  

Would anyone be able to help me with any of the following ...

(i)  killer examples where a database is clearly superior to, say,
plain text files, spreadsheets, or statistical-package-specific formats
as a way of storing information

(ii)  an actual real-life statistical database that could be copied to a
local server for the students to practise accessing

(iii)  killer examples where an important data source is stored in a
database, so that something like SQL knowledge is required to get access
to the information.

I would also obviously be interested in any general comments regarding
which database issues people think are the most crucial for statistics
students to learn.

Again, apologies if this approach is an imposition.  
Any help would be greatly appreciated.

Paul

Dr Paul Murrell
Department of Statistics
The University of Auckland
Auckland
New Zealand
#
Well ... I don't know if the company I work for will let me send out any
real data, but I can tell you what I do with databases and R. I collect
large quantities of Linux and Windows performance data. For example, I will
have perhaps 50 - 250 columns of high-frequency samples taken, say, every 15
seconds over a 12-hour period. A typical benchmarking project will take two
weeks (Saturdays and Sundays included!), giving about a dozen test cases to
be processed. Excel can deal with the columns all right, but after 65536
rows it rolls over and plays dead. And even if I just left the files in CSV
format, R's "read.csv" function runs out of memory on my 128 MB workstation
fairly quickly - somewhere in the vicinity of a 15 or 20 MB CSV file.

So, I load all the raw data into tables in a Microsoft Access database. I
then write queries to format the data, add case tags (factors) to the rows,
do some date/time calculations, etc. Then I assign a Data Set Name to the
".mdb" file and read from those queries using RODBC. I'm told all this magic
can be made to work on Linux with PostgreSQL, but we're mostly a Windows shop
so I have Access and SQL Server available. Given the quantity of data I
have, Access and SQL help me organize things as well, plus I can do much of
the inevitable data cleaning much more easily with Access queries than I can
in R. As far as I'm concerned it's a match made in heaven.
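
As a concrete (and heavily simplified) sketch of the load-tag-query
workflow described above, here is the same pattern using SQLite via
Python's sqlite3 in place of Access and RODBC, so the example is
self-contained and runnable. The table, columns, and values are
invented for illustration, not taken from the original post:

```python
# Sketch of the workflow above: load raw samples into a table, tag the
# rows with a case factor in SQL, then pull a summary query.
# SQLite stands in for Access; sqlite3 stands in for R/RODBC.
import csv
import io
import sqlite3

# A tiny stand-in for one of the high-frequency CSV files.
raw = io.StringIO(
    "timestamp,cpu_pct,disk_io\n"
    "2002-05-12 09:00:00,41.5,120\n"
    "2002-05-12 09:00:15,38.2,95\n"
    "2002-05-12 09:00:30,77.9,310\n"
)

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE samples (
                   timestamp TEXT,
                   cpu_pct   REAL,
                   disk_io   INTEGER,
                   test_case TEXT)""")

rows = list(csv.DictReader(raw))
con.executemany(
    "INSERT INTO samples (timestamp, cpu_pct, disk_io) VALUES (?, ?, ?)",
    [(r["timestamp"], float(r["cpu_pct"]), int(r["disk_io"])) for r in rows])

# Add the case tag (a 'factor') with a query, not in the stats package.
con.execute("UPDATE samples SET test_case = 'baseline'")

# A formatting/summary query of the kind described above.
cur = con.execute(
    "SELECT test_case, COUNT(*), AVG(cpu_pct) "
    "FROM samples GROUP BY test_case")
print(cur.fetchone())   # ('baseline', 3, 52.533...)
```

From R, the reading step would be something like RODBC's sqlQuery()
against the ".mdb" DSN; the SQL itself carries over unchanged.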

-----Original Message-----
From: r-sig-db-admin at stat.math.ethz.ch
[mailto:r-sig-db-admin at stat.math.ethz.ch]On Behalf Of Paul Murrell
Sent: Sunday, May 12, 2002 7:13 PM
To: r-sig-db at stat.math.ethz.ch
Subject: [R-sig-DB] request for examples

_______________________________________________
R-sig-DB mailing list -- R Special Interest Group
R-sig-DB at stat.math.ethz.ch
http://www.stat.math.ethz.ch/mailman/listinfo/r-sig-db
#
On Mon, May 13, 2002 at 02:13:06PM +1200, Paul Murrell wrote:
...
As an advanced something-to-think-about...
The product "PI" by OSI Software is a real-time data storage, retrieval,
graphing and calculation package.  While it has an SQL front-end,
the data storage is done in a very non-relational way.  Files are stored in
a binary format, optimised for quick reading of vast amounts of data.
Calling up a year's worth of data sampled every 10 seconds happens in a few
seconds.

The interesting thing here is:

1) SQL front-ends mean a *language* advantage - the advantage is
compatibility with other apps.

2) the file format means tight storage and fast bulk retrieval.
Good Things (tm).

The massive amounts of data here are what makes a database front-end
so useful.  There's just no way to store that amount of information
any other way and hope to retrieve it.

Cheers

Jason
#
On Mon, 13 May 2002, Paul Murrell wrote:

That's true of almost all data mining applications.  Think about
a supermarket chain collecting information on all transactions at tills.

Reasons include scale (as above), integrity of data coming from multiple
sources (also as above), and security (most organizations' financial data
is in databases).  Related to scale is efficiency: lots of preprocessing
(indices etc.) makes online queries possible.
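
The point above about preprocessing and indices shows up even at toy
scale: after CREATE INDEX, the planner answers an equality query with an
index search instead of a full table scan.  A small sketch using SQLite
(the table, column names, and data are invented):

```python
# Demonstrate that an index changes the query plan from a full scan
# to an index search.  SQLite is used purely for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE transactions (till INTEGER, item TEXT, amount REAL)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(i % 50, "item%d" % i, 1.0) for i in range(10000)])

# Query plan without an index: a full table scan.
plan_before = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM transactions WHERE till = 7").fetchall()

# Preprocessing step: build the index once, up front.
con.execute("CREATE INDEX idx_till ON transactions (till)")

# Same query afterwards: an index search.
plan_after = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM transactions WHERE till = 7").fetchall()

print(plan_before[0][-1])   # e.g. "SCAN transactions"
print(plan_after[0][-1])    # e.g. "SEARCH transactions USING INDEX idx_till (till=?)"
```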

Another good example is an online transaction system such as airlines'
booking systems and those behind banks' ATM networks.  Or, since I have
just been browsing one, large discussion forums, web search engines ....
MIT Press is a good source for statistics/databases interaction.
On what DBMS?
(Many) pharmaceutical companies have their gene chip data stored in
databases.  We had to set up Oracle Lite and get help (thanks Fei) to get
some results out last year.  Insurance companies have all their claims data
in databases, and MSc summer projects have been 25% taken up with
extracting the data.

Yet another one was work on university admissions data, both local and
national: that was on a database, and about 70,000 records were extracted
to a spreadsheet.  (And that was just one year's data.)

Brian
2 days later
#
Paul Murrell writes:
 > I would also obviously be interested in any general comments regarding
 > which database issues people think are the most crucial for statistics
 > students to learn.

I am not sure if this counts as a "killer application", but statistics students
should be aware that knowledge of SQL is very useful in the business world,
especially in finance. I am much more likely to interview/hire someone with SQL
experience. It is not so important that they know the database that I
(currently) use. The key is that they know what an RDBMS is, how different tables
are related (or not) to one another, how to get data into and out of a database,
and so on.

Regards,

Dave Kane
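
The basics Dave lists -- what an RDBMS is, how tables relate, and how to
get data in and out -- fit in one short, self-contained session.  A
sketch using SQLite; the schema and figures are invented for
illustration:

```python
# Two related tables, data going in, and a joined summary coming out.
import sqlite3

con = sqlite3.connect(":memory:")

# Getting data in: two tables related by a shared key (client_id).
con.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trades  (trade_id  INTEGER PRIMARY KEY,
                          client_id INTEGER REFERENCES clients(client_id),
                          amount    REAL);
    INSERT INTO clients VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO trades  VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
""")

# Getting data out: join on the key that relates the tables, then
# aggregate per client.
rows = con.execute("""
    SELECT c.name, SUM(t.amount)
    FROM clients c JOIN trades t ON t.client_id = c.client_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

print(rows)   # [('Acme', 350.0), ('Globex', 75.0)]
```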