How important is set.seed

Tue, Mar 22, 2022 11:37 AM

I would also disagree with your rephrasing. What is the point of characterizing if there is no understanding? If the focus is on the random numbers, what one wants is to understand the variability in outcome caused by including a random element in the model. One may also want to understand the variability in outcome if an experiment were repeated. One approach is to split a dataset into training and testing sets and use the RNG to decide which observation goes into which set. However, every run will give a slightly different answer.
The random number generator is then used in place of an exhaustive permutation test when the number of permutations is too large to enumerate with available computing resources.
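
As a rough illustration of using random draws in place of exhaustive permutation, here is a minimal sketch of a Monte Carlo permutation test for a difference in means. The two samples are made-up values and the choice of 2000 relabelings is arbitrary; with only 12 observations full enumeration would still be feasible, so the random version matters mainly for larger data.

```r
# Two small hypothetical samples (values made up for illustration).
x <- c(5.1, 4.8, 6.0, 5.7, 5.5, 6.2)
y <- c(4.2, 4.9, 4.4, 5.0, 4.1, 4.6)

observed <- mean(x) - mean(y)
pooled   <- c(x, y)
n_x      <- length(x)

# Draw B random relabelings instead of enumerating every permutation.
B <- 2000
perm_diff <- replicate(B, {
  idx <- sample(length(pooled), n_x)      # random assignment to group "x"
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided Monte Carlo p-value; the +1 counts the observed labeling itself.
p_value <- (sum(abs(perm_diff) >= abs(observed)) + 1) / (B + 1)
p_value
```

Because the relabelings are random, the p-value will differ slightly from run to run unless a seed is set first.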

I assume what the OP was asking is whether the conclusion(s) of two (or more) models were the same given the range in outcomes produced by the random number generator(s). The only way to address this is to characterize the distribution of model outcomes from different runs with different random seeds. Examine that characterization and hope for understanding.
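
A minimal sketch of what such a characterization might look like, assuming a linear model on the built-in mtcars data as a stand-in for the real model and test-set RMSE as the outcome of interest. The helper name `split_rmse`, the 70/30 split, and the 200 seeds are illustrative choices only.

```r
# Test-set RMSE of a linear model on mtcars for one random 70/30 split.
split_rmse <- function(seed) {
  set.seed(seed)                                   # one seed per run
  train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
  fit  <- lm(mpg ~ wt + hp, data = mtcars[train_idx, ])
  test <- mtcars[-train_idx, ]
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
}

seeds <- 1:200                    # arbitrary seeds; no one of them is special
rmse  <- vapply(seeds, split_rmse, numeric(1))

# Summarize the run-to-run spread rather than reporting a single run.
summary(rmse)
quantile(rmse, c(0.025, 0.975))
```

If two models are being compared, the question becomes whether their difference is large relative to this spread.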

Tim

From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: Tuesday, March 22, 2022 2:03 PM
To: Ebert,Timothy Aaron <tebert at ufl.edu>
Cc: Neha gupta <neha.bologna90 at gmail.com>; r-help at r-project.org
Subject: Re: [R] How important is set.seed

"rather to understand how the choice of seed influences final model output."

No! Different seeds just produce different streams of (pseudo)-random numbers.  Hence there cannot be any "understanding" of how "choice of seed" influences results.  Presumably, what you meant is to characterize the variability in results from the procedure due to its incorporation of randomness in what it does. Re-read Jeff's last post.  This does *not* require set.seed() at all.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Tue, Mar 22, 2022 at 9:55 AM Ebert,Timothy Aaron <tebert at ufl.edu> wrote:

So step 1 is not to compare models, rather to understand how the choice of seed influences final model output. Once you have a handle on this issue, then work at comparing models.
Tim

From: Neha gupta <neha.bologna90 at gmail.com>
Sent: Tuesday, March 22, 2022 12:19 PM
To: Bert Gunter <bgunter.4567 at gmail.com>
Cc: Ebert,Timothy Aaron <tebert at ufl.edu>; r-help at r-project.org
Subject: Re: [R] How important is set.seed

I read a paper two days ago (and that's why I then posted here about set.seed) which used interpretable machine learning.

According to the authors, different explanations (of the black-box models) will be produced by the ML models if different seeds are used or never used.

On Tue, Mar 22, 2022 at 5:12 PM Bert Gunter <bgunter.4567 at gmail.com> wrote:

OK, I'm somewhat puzzled by this discussion. Maybe I'm just clueless. But...

1. set.seed() is used to make any procedure that uses R's
pseudo-random number generator -- including, for example, sampling
from a distribution, random data splitting, etc. -- "reproducible".
That is, if the procedure is repeated *exactly,* by invoking
set.seed() with its original argument values (once!) *before* the
procedure begins, exactly the same results should be produced by the
procedure. Full stop. It does not matter how many times random number
generation occurs within the procedure thereafter -- R preserves the
state of the rng between invocations (but see the notes in ?set.seed
for subtle qualifications of this claim).

2. Hence, if no (pseudo-) random number generation is used, set.seed()
is irrelevant. Full stop.

3. Hence, if you don't care about reproducibility (you should! -- if
for no other reason than debugging), you don't need set.seed()

4. The "randomness" of any sequence of results from any particular
set.seed() arguments (including further calls to the rng) is a complex
issue. ?set.seed has some discussion of this, but one needs
considerable expertise to make informed choices here. As usual, we
untutored users should be guided by the expert recommendations of the
Help file.
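
Point 1 can be checked directly. The sketch below assumes a made-up procedure, `run_procedure`, that calls the RNG several times (a data split, draws from a distribution, and a bootstrap resample); the seed value 123 is arbitrary.

```r
# A hypothetical procedure with several separate RNG calls.
run_procedure <- function() {
  train_idx <- sample(nrow(mtcars), 22)                 # random data split
  noise     <- rnorm(5)                                 # draws from a distribution
  boot_mean <- mean(sample(mtcars$mpg, replace = TRUE)) # one bootstrap resample
  list(train_idx = train_idx, noise = noise, boot_mean = boot_mean)
}

set.seed(123)            # set once, before the procedure begins
first <- run_procedure()

set.seed(123)            # same seed, same procedure, repeated exactly
second <- run_procedure()

third <- run_procedure() # no reset: the RNG state has simply moved on

identical(first, second)
identical(first, third)
```

The first comparison should be TRUE and the second FALSE: one call to set.seed() before the procedure is enough, however many times the RNG is used inside it.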

*** If anything I have said above is wrong, I would greatly appreciate
a public response here showing my error.***

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Tue, Mar 22, 2022 at 7:48 AM Neha gupta <neha.bologna90 at gmail.com> wrote:

Hello Tim

In some of the examples I see in tutorials, the random seed is set just
before model training, e.g., before the train function in the caret library.
Should I follow this?

Best regards
On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert at ufl.edu> wrote:

Ah, so maybe what you need is to think of "set.seed()" as a treatment in
an experiment. You could use a random number generator to select an
appropriate number of seeds, then use those seeds repeatedly in the
different models to see how seed selection influences outcomes. I am not
quite sure how many seeds would constitute a good sample. For me that would
depend on what I find and how long a run takes.
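
One way to read this suggestion is as a paired design: every model sees the same seeds, so differences between models are not confounded with differences between random splits. A minimal sketch, assuming two hypothetical linear models on the built-in mtcars data stand in for the models being compared; the 50 seeds and the 22-row training set are arbitrary choices.

```r
set.seed(42)                                   # only fixes which seeds get drawn
seeds <- sample.int(1e6, 50)                   # 50 is an arbitrary sample size

# Fit both candidate models on the same split, determined by the seed.
compare_models <- function(seed) {
  set.seed(seed)
  train_idx <- sample(nrow(mtcars), 22)
  train <- mtcars[train_idx, ]
  test  <- mtcars[-train_idx, ]
  rmse <- function(fit) sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  c(small = rmse(lm(mpg ~ wt, data = train)),
    large = rmse(lm(mpg ~ wt + hp + qsec, data = train)))
}

results <- t(vapply(seeds, compare_models, c(small = 0, large = 0)))

# Paired differences: how often, and by how much, does one model win?
diffs <- results[, "small"] - results[, "large"]
mean(diffs > 0)
summary(diffs)
```

If the sign of the difference flips often across seeds, a conclusion drawn from any single seed would not be trustworthy.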

In parallel processing you set the seed in the master process and then use
a random number generator to set seeds in each worker.

Tim



*From:* Neha gupta <neha.bologna90 at gmail.com>
*Sent:* Tuesday, March 22, 2022 6:33 AM
*To:* Ebert,Timothy Aaron <tebert at ufl.edu>
*Cc:* Jeff Newmiller <jdnewmil at dcn.davis.ca.us>; r-help at r-project.org
*Subject:* Re: How important is set.seed




Thank you all.



Actually I need set.seed because I have to evaluate the consistency of
the feature selection generated by different models, so I think it's
recommended to use the seed for this.



Warm regards

On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert at ufl.edu> wrote:

If you are using the program for data analysis then set.seed() is not
necessary unless you are developing a reproducible example. In a standard
analysis it is mostly counter-productive because one should then ask if
your presented results are an artifact of a specific seed that you selected
to get a particular result. However, when you need a reproducible example,
are debugging a program, or otherwise need the same result with every run
of the program, set.seed() is an essential tool.
Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Jeff Newmiller
Sent: Monday, March 21, 2022 8:41 PM
To: r-help at r-project.org; Neha gupta <neha.bologna90 at gmail.com>; r-help
mailing list <r-help at r-project.org>
Subject: Re: [R] How important is set.seed


First off, "ML models" do not all use random numbers (for prediction I
would guess very few of them do). Learn and pay attention to what the
functions you are using do.

Second, if you use random numbers properly and understand the precision
that your specific use case offers, then you don't need to use set.seed.
However, in practice, using set.seed can allow you to temporarily avoid
chasing precision gremlins, or set up specific test cases for testing code,
not results. It is your responsibility to not let this become a crutch... a
randomized simulation that is actually sensitive to the seed is unlikely to
offer an accurate result.
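
A small sketch of what seed sensitivity looks like in practice, assuming a toy Monte Carlo estimate of pi as the simulation; the function name `estimate_pi`, the 30 seeds, and the two sample sizes are illustrative only.

```r
# Monte Carlo estimate of pi from n uniform points in the unit square.
estimate_pi <- function(n, seed) {
  set.seed(seed)
  inside <- runif(n)^2 + runif(n)^2 <= 1
  4 * mean(inside)
}

seeds <- 1:30
small_n <- vapply(seeds, function(s) estimate_pi(100, s), numeric(1))
large_n <- vapply(seeds, function(s) estimate_pi(1e5, s), numeric(1))

# Spread across seeds: wide for the small simulation, narrow for the large one.
sd(small_n)
sd(large_n)
```

When the spread across seeds is large relative to the precision you need, the remedy is a bigger or better-designed simulation, not a carefully chosen seed.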

Where to put set.seed depends a lot on how you are performing your
simulations. In general each process should set it once uniquely at the
beginning, and if you use parallel processing then use the features of your
parallel processing framework to ensure that this happens. Beware of
setting all worker processes to use the same seed.
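
In base R's parallel package, the L'Ecuyer-CMRG generator provides independent streams for this purpose, and `parallel::clusterSetRNGStream()` distributes them to the workers of a cluster. The sketch below shows the underlying idea without starting any worker processes: the "workers" are simulated in a single session, and `draw_on_worker` and `n_workers` are hypothetical names used only for illustration.

```r
library(parallel)

# Switch to the L'Ecuyer-CMRG generator, which supports independent streams.
RNGkind("L'Ecuyer-CMRG")
set.seed(2022)                       # one master seed, set once

# Derive one stream per (hypothetical) worker from the master state.
n_workers <- 3
streams <- vector("list", n_workers)
streams[[1]] <- get(".Random.seed", envir = globalenv())
for (i in seq_len(n_workers)[-1]) {
  streams[[i]] <- nextRNGStream(streams[[i - 1]])
}

# Each worker would install its own stream before drawing numbers.
draw_on_worker <- function(stream, n = 3) {
  assign(".Random.seed", stream, envir = globalenv())
  runif(n)
}

draws <- lapply(streams, draw_on_worker)
draws
```

Each simulated worker produces different numbers, yet the whole set is reproducible from the single master seed. Giving every worker the same seed would instead make them all generate the same sequence.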

On March 21, 2022 5:03:30 PM PDT, Neha gupta <neha.bologna90 at gmail.com> wrote:

Hello everyone

I want to know

(1) In which cases do we need to use set.seed when building ML models?

(2) Where exactly should we put the set.seed call, i.e. when we split the
data into train/test sets, or just before we train a model?

Thank you


______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
