
Parameterised queries

8 messages · Ian Gow, Hadley Wickham, Alex Mandel +2 more

#
Is there merit in using PostgreSQL's arrays to send data from R to PostgreSQL? (Caveat: I say this having no idea how PL/R gets vectors from R to PostgreSQL.)

PL/R is able to pull vectors in from R, though RPostgreSQL cannot pull arrays from PostgreSQL as vectors (though it would be great to have this feature).

The second and third approaches below do not differ in performance.

# Getting data one vector at a time

DROP FUNCTION IF EXISTS get_int_vector();

CREATE OR REPLACE FUNCTION get_int_vector()
  RETURNS integer[] AS
$BODY$
     sample(1:10)
$BODY$ LANGUAGE plr;

DROP FUNCTION IF EXISTS get_str_vector();

CREATE OR REPLACE FUNCTION get_str_vector()
  RETURNS text[] AS
$BODY$
     sample(letters[1:10])
$BODY$ LANGUAGE plr;

DROP FUNCTION IF EXISTS get_date_vector();

-- For some reason, dates get converted to integers, hence the as.character() below.
CREATE OR REPLACE FUNCTION get_date_vector()
  RETURNS date[] AS
$BODY$
     as.character(seq(from=as.Date("1971-01-01"), by=1, length.out = 10))
$BODY$ LANGUAGE plr;


WITH raw_data AS (
    SELECT get_int_vector() AS ints, 
        get_str_vector() AS strs,
        get_date_vector() AS dates)
SELECT 
    UNNEST(ints) AS int,
    UNNEST(strs) AS str,
    UNNEST(dates) AS date
FROM raw_data;
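On the R side, RPostgreSQL returns these arrays as a single text value (e.g. "{3,1,4}"), so a minimal sketch of pulling one vector through and parsing it by hand might look like this (assuming a local connection and that the functions above are installed):

```r
library("RPostgreSQL")

pg <- dbConnect(PostgreSQL())

# The array arrives as one text value such as "{3,1,4,...}";
# strip the braces and split on commas to recover an integer vector.
res <- dbGetQuery(pg, "SELECT get_int_vector() AS ints")
ints <- as.integer(strsplit(gsub("[{}]", "", res$ints), ",")[[1]])

dbDisconnect(pg)
```

This hand-rolled parsing would not survive text arrays with quoted elements containing embedded commas; native array support would need a proper parser.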

## Via RPostgreSQL

system.time({
    n <- 1e6
    df <- data.frame(int=sample(1:10, size=n, replace=TRUE),
                     str=sample(letters[1:10], size=n, replace=TRUE),
                     date=as.character(sample(seq(from=as.Date("1971-01-01"), 
                                                  by=1, length.out = 10), 
                                              size=n, replace=TRUE)),
                     stringsAsFactors=FALSE)
    
    library("RPostgreSQL")
    
    pg <- dbConnect(PostgreSQL())
    
    dbWriteTable(pg, "rpostgresql_test", df, row.names=FALSE, overwrite=TRUE)
})

## Using composite type created with table

DROP TABLE IF EXISTS plr_test CASCADE;

CREATE TABLE plr_test (int integer, str text, date date);

CREATE OR REPLACE FUNCTION get_df()
  RETURNS SETOF plr_test AS
$BODY$
    n <- 1e6
    df <- data.frame(int=sample(1:10, size=n, replace=TRUE),
                    str=sample(letters[1:10], size=n, replace=TRUE),
                    date=as.character(sample(seq(from=as.Date("1971-01-01"), 
                                                by=1, length.out = 10), 
                                             size=n, replace=TRUE)),
                    stringsAsFactors=FALSE)
    return(df)
$BODY$ LANGUAGE plr;

INSERT INTO plr_test
SELECT * FROM get_df();
#
On Thu, Feb 12, 2015 at 8:20 AM, Ian Gow <iandgow at gmail.com> wrote:

Not in terms of performance. Communication between the client and server is
one byte at a time. The reason inserts are slow is that the server does a
lot of validation, updates indices, triggers, etc. It is also a lot safer
than copy, which just writes whatever it sees to the fields.

If you want to see how things are copied byte-by-byte, you can trace the
communication between R and PostgreSQL in 'rpg'.
I've forgotten how DBI works, but an array should just come through in text
form and you can parse it however you want on the R side. You can also
send an array by converting it to a string of the appropriate format.
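As a rough illustration of the second point, a hypothetical helper (not rpg's actual format_for_send) that serialises an R vector into PostgreSQL's array-literal syntax could be as simple as:

```r
# Hypothetical helper: turn an R vector into a PostgreSQL
# array literal, e.g. 1:3 becomes "{1,2,3}".
to_pg_array <- function(x) {
  paste0("{", paste(x, collapse = ","), "}")
}

to_pg_array(1:3)
#> [1] "{1,2,3}"
```

The result can then be sent as, say, '{1,2,3}'::integer[] in an INSERT. Quoting and escaping for text arrays is more involved, which is what a dispatchable format_for_send is for.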

rpg has a 'format_for_send' method that you can dispatch on any type to
convert it to something PostgreSQL can read. I'm working on something for
formatting strings returned from the server.

THK

#
I've started thinking about a general interface for DBI to do this
too. It's particularly important for SQLite since it has such limited
type information.

Ian, if you'd like RPostgres (https://github.com/rstats-db/RPostgres)
to support arrays natively, please file an issue describing your use
case and providing a reproducible example that creates a table with
array data.

Hadley
#
On Thu, Feb 12, 2015 at 10:48 AM, Hadley Wickham <h.wickham at gmail.com>
wrote:
Perhaps it's time for DBI2. As long as it's not cAmEl. (I've become allergic.)

THK

1 day later
#
(Possibly naive question.)
I've now seen this stated a few times on this thread and I don't fully 
understand the context everyone seems to be assuming. With a Perl client 
I can insert multiple records with a single insert command, over a 
network, and it seems to be nearly as fast as loading with the batch 
utility locally on the server. As I recall, I had to block less than 
10,000 records or something choked, but multiple blocks loaded a fairly 
large database reasonably quickly. (I'm travelling and cannot verify 
this number at the moment.) It is important to load before specifying 
indexes; otherwise the index gets recalculated with each insert.

When people say inserts are slow:

- Is this an R-specific problem?
- Is it assumed indexes are already specified?
- Or am I missing something else once again?

Thanks,
Paul
#
On 02/13/2015 02:53 PM, Paul Gilbert wrote:
It's not exactly R specific, though we are discussing the implementation
in R.

If you bulk load, generally you want it to be an atomic commit (all
records or none), and you want to hold off on firing triggers and
updating indexes until after the data has made it to the db. Different
DBs have different performance levels.

In Postgres you can compare using the psql COPY command vs. sending an
SQL file of INSERT commands.

More background
http://en.wikipedia.org/wiki/Prepared_statement
http://rosettacode.org/wiki/Parametrized_SQL_statement

Not knowing how you wrote your Perl code, you may have been using
prepared statements, in which case both of your runs, remote or local,
should have had similar performance. What's being discussed is the
introduction of prepared statements, which have not existed for most DB
connectors in R previously.
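For concreteness, a sketch of what such a prepared-statement interface looks like using the DBI generics that later arrived (dbSendStatement()/dbBind() are assumed from modern DBI, not from the connectors discussed in this thread):

```r
library(DBI)

con <- dbConnect(RPostgres::Postgres())

# The server parses and plans the statement once...
stmt <- dbSendStatement(con,
  "INSERT INTO product (name, price) VALUES ($1, $2)")

# ...then only the bound parameter values travel over the wire,
# one execution per row of the bound vectors.
dbBind(stmt, list(c("bread", "butter"), c(1.0, 2.5)))

dbClearResult(stmt)
dbDisconnect(con)
```

Because validation and planning happen once at prepare time, repeated executions avoid most of the per-INSERT overhead discussed above.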

Having done bulk loads in the 10 GB+ range before I can tell you it
matters a lot.

Thanks,
Alex
#
On Fri, Feb 13, 2015 at 5:05 PM, Alex Mandel <tech_dev at wildintellect.com>
wrote:
The example for "copy_to" in rpg compares pushing the hflights db to
postgresql using inserts and copy. You'll notice the insert version
('write_table') is commented out because it takes 5 minutes or so and I got
tired of waiting. Using 'copy_to' takes less than 5 seconds. No indices or
constraints were in operation.
I've experimented with using parallel R to run many inserts in parallel.
Still have not gotten it worked out and I'm not sure it will help in any
event owing to write contention. Would be interesting to have a good test
case.

THK

#
On 15-02-13 08:49 PM, Tim Keitt wrote:
I don't have the tools to test anything at the moment, but the rpg note 
under write_table says "write_table uses SQL INSERT statements".

Just to be clear, I know

     INSERT INTO PRODUCT (name, price) VALUES ('bread', 1.0);
     ...
     INSERT INTO PRODUCT (name, price) VALUES ('butter', 2.5);

is slow, it is

     INSERT INTO PRODUCT (name, price) VALUES
       ('bread', 1.0),
        ...
       ('butter', 2.5);

that I think is fairly fast.

Paul