
Rserve, Rserve.cluster, and cluster of local and remote

Edi,
On Jan 17, 2013, at 12:32 PM, Edi Bice wrote:

            
This is actually a very good question. There are a few issues here, not just the one you see ;). But let's start with that one -- the easiest way to do that is using "control commands" - you can simply run RS.server.eval() [or the equivalent from your client language] with the command to initiate the cluster.

However, that's not your real problem ;). There is a bigger one: you cannot create a cluster in the server and let all clients use it. Because the FDs are shared after a fork(), you'll end up with a broken mess the moment you have more than one client. Every forked client process holds the same sockets to the cluster nodes, so they are all talking to the same cluster instance and their messages will cross. So the only way to do a pre-emptive cluster allocation is to make sure you close the cluster in the server once a client has taken off with it. You can do that now in Rserve 1.7-0 (as of today ;)) using the .Rserve.served hook - you define a function that stops the cluster and creates a new one. What this effectively does is defer the cluster initialization to the time between connections, so it will affect the latency between connections -- if you expect many connections in quick succession, you may be better off just starting the cluster on demand. As a side-effect this solves your other problem, too, because you just need to connect once after starting the server to bootstrap the cluster.
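
To make the shared-FD problem concrete, here is a minimal sketch in Python rather than R. It is not Rserve code: the function name and the message format are made up for illustration, and a socketpair stands in for the server's connection to one cluster node. It only shows that forked children write into the very same socket, so the node end sees one mixed stream. It needs a POSIX system, since it relies on os.fork().

```python
import os
import socket


def shared_socket_demo(n_clients=2):
    """Fork n_clients children that all write to one inherited socket.

    Returns the bytes seen on the "node" end of the connection.
    (Hypothetical helper, for illustration only.)
    """
    # server_end stands in for the server's single connection to a node.
    server_end, node_end = socket.socketpair()
    pids = []
    for i in range(1, n_clients + 1):
        pid = os.fork()
        if pid == 0:
            # Child ("client session"): it inherited the same socket.
            node_end.close()
            server_end.sendall(f"request-from-client-{i};".encode())
            os._exit(0)
        pids.append(pid)

    for pid in pids:
        os.waitpid(pid, 0)

    # Parent closes its copy so the node end eventually sees EOF.
    server_end.close()
    chunks = []
    while True:
        data = node_end.recv(4096)
        if not data:
            break
        chunks.append(data)
    node_end.close()
    return b"".join(chunks)


if __name__ == "__main__":
    # Both clients' requests arrive on one stream, in no guaranteed order.
    print(shared_socket_demo())
```

The node has no way to tell which client a request came from, or which client should receive the reply, which is why the server has to give up its copy of the cluster once a client has taken it.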

But there is more to that, too :). The slowness of makeRserveCluster() is not really necessary: it comes from the fact that the connections to the nodes are created sequentially. I'm adding support for asynchronous connect to RSclient right now, so another solution will be to use that in Rserve.cluster instead. There will still be a slight overhead, but it should be smaller.
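
As a rough illustration of why sequential connects hurt, the following Python sketch compares connecting to nodes one after another with starting all connects at once. It has nothing to do with the actual RSclient implementation: fake_connect() is a made-up stand-in that just sleeps for a fixed delay, and the node names are placeholders.

```python
import asyncio
import time


async def fake_connect(node, delay=0.05):
    # Stand-in for a network handshake; only waits, then returns the node.
    await asyncio.sleep(delay)
    return node


async def connect_sequential(nodes, delay=0.05):
    # Each connect starts only after the previous one finished,
    # so the total time is roughly len(nodes) * delay.
    return [await fake_connect(n, delay) for n in nodes]


async def connect_concurrent(nodes, delay=0.05):
    # All connects are in flight together, so the total time is
    # roughly one delay; gather() keeps the results in input order.
    return list(await asyncio.gather(*(fake_connect(n, delay) for n in nodes)))


def timed(coro_fn, nodes, delay=0.05):
    """Run coro_fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(nodes, delay))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    _, t_seq = timed(connect_sequential, nodes)
    _, t_con = timed(connect_concurrent, nodes)
    print(f"sequential: {t_seq:.3f}s  concurrent: {t_con:.3f}s")
```

With real connections the per-node latency varies, so the gain depends on the slowest node, not on the sum of all of them.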


Finally, I have to say that Rserve.cluster is somewhat limited by the fact that it needs to be "wedged" into the snow setup, which is not the typical way you'd use Rserve. I'm working on a more comprehensive solution that is more along the lines of scheduling - keeping workers around and re-spawning them as needed. I'll keep you posted on that - it would allow better balancing of workers across multiple connections. It would also allow us to take advantage of the asynchronous send features of Rserve and support data streaming - all this is a very active area on my ToDo list.

Cheers,
Simon