Jeff Newmiller makes an interesting point about distributed processing, but I don't know how to use the usual pseudo-random processes to obtain deterministic results when I don't know how the data will be sharded. You might have to replace pseudo-random sampling with deterministic sampling using a hash of something involving the unique key. Then the selection of a salt is the equivalent of a call to set.seed in non-parallel processing. The results should be the same as long as you fix the data set & the salt, and then you can test sensitivity to changes in the salt.
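
A minimal sketch of the hash-based idea, assuming the CRAN package digest is installed and that each row has a unique character key. The helper names hash_unit and hash_sample, the key format, and the salt strings are made up for illustration; MD5 is just one convenient choice of hash.

```r
# Assumption: the CRAN package 'digest' is available.
# hash_unit() and hash_sample() are hypothetical helper names.
library(digest)

hash_unit <- function(key, salt) {
  # Hash "salt:key" and map the first 7 hex digits of the MD5 to [0, 1).
  hex <- vapply(paste(salt, key, sep = ":"),
                function(s) digest(s, algo = "md5", serialize = FALSE),
                character(1), USE.NAMES = FALSE)
  strtoi(substr(hex, 1, 7), base = 16L) / 16^7
}

hash_sample <- function(keys, salt, fraction) {
  # Keep a key when its hashed value falls below the sampling fraction.
  keys[hash_unit(keys, salt) < fraction]
}

keys <- sprintf("id%05d", 1:1000)

# Sampling the whole data set at once ...
s1 <- hash_sample(keys, salt = "salt-A", fraction = 0.2)

# ... selects the same keys as sampling arbitrary shards separately.
shards <- split(sample(keys), rep(1:4, length.out = length(keys)))
s2 <- unlist(lapply(shards, hash_sample, salt = "salt-A", fraction = 0.2),
             use.names = FALSE)
stopifnot(setequal(s1, s2))

# Changing the salt plays the role of changing the seed.
s3 <- hash_sample(keys, salt = "salt-B", fraction = 0.2)
```

Because the decision for each row depends only on its key and the salt, the selection does not depend on row order or on how the rows are split across workers. Rerunning with a few different salts gives a rough check of sensitivity.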
Jorgen Harmse
From: Neha gupta <neha.bologna90 at gmail.com>
To: "Ebert,Timothy Aaron" <tebert at ufl.edu>
Cc: Jeff Newmiller <jdnewmil at dcn.davis.ca.us>, "r-help at r-project.org"
<r-help at r-project.org>
Subject: Re: [R] How important is set.seed
Thank you all.
Actually I need set.seed because I have to evaluate the consistency of
feature selection generated by different models, so I think it is
recommended to use a seed for this.
Warm regards
On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
If you are using the program for data analysis then set.seed() is not necessary unless you are developing a reproducible example. In a standard analysis it is mostly counter-productive because one should then ask if your presented results are an artifact of a specific seed that you selected to get a particular result. However, when you need a reproducible example, are debugging a program, or otherwise need the same result with every run of the program, set.seed() is an essential tool.

Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Jeff Newmiller
Sent: Monday, March 21, 2022 8:41 PM
To: r-help at r-project.org; Neha gupta <neha.bologna90 at gmail.com>; r-help mailing list <r-help at r-project.org>
Subject: Re: [R] How important is set.seed

First off, "ML models" do not all use random numbers (for prediction I would guess very few of them do). Learn and pay attention to what the functions you are using do.

Second, if you use random numbers properly and understand the precision that your specific use case offers, then you don't need to use set.seed. However, in practice, using set.seed can allow you to temporarily avoid chasing precision gremlins, or set up specific test cases for testing code, not results. It is your responsibility to not let this become a crutch... a randomized simulation that is actually sensitive to the seed is unlikely to offer an accurate result.

Where to put set.seed depends a lot on how you are performing your simulations. In general each process should set it once uniquely at the beginning, and if you use parallel processing then use the features of your parallel processing framework to ensure that this happens. Beware of setting all worker processes to use the same seed.

On March 21, 2022 5:03:30 PM PDT, Neha gupta <neha.bologna90 at gmail.com> wrote:
Hello everyone
I want to know
(1) In which cases do we need to use set.seed while building ML models?
(2) Where exactly should we put the set.seed call, i.e. when we split
the data into train/test sets, or just before we train a model?
Thank you
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
-- Sent from my phone. Please excuse my brevity.
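
Jeff Newmiller's advice on parallel workers can be illustrated with base R's parallel package. This is only a sketch under assumptions: a two-worker local cluster, an arbitrary seed value, and hypothetical helper names draw_with_streams and draw_same_seed. clusterSetRNGStream is one framework feature that gives each worker its own random-number stream from a single seed; other frameworks have their own mechanisms.

```r
library(parallel)

# Hypothetical helper: seed the cluster once through the framework, so that
# every worker receives its own L'Ecuyer-CMRG stream derived from 'seed'.
draw_with_streams <- function(seed, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = seed)
  clusterEvalQ(cl, runif(3))   # one vector of draws per worker
}

# Hypothetical helper showing the pitfall: the same set.seed() on every worker.
draw_same_seed <- function(seed, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl))
  clusterExport(cl, "seed", envir = environment())
  clusterEvalQ(cl, { set.seed(seed); runif(3) })
}

a <- draw_with_streams(2022)
b <- draw_with_streams(2022)
stopifnot(identical(a, b))              # reproducible from run to run
stopifnot(!identical(a[[1]], a[[2]]))   # but the workers do not share draws

dup <- draw_same_seed(2022)
stopifnot(identical(dup[[1]], dup[[2]]))  # every worker repeats the same numbers
```

The second helper shows why identical seeds on all workers are a problem: the workers produce the same "random" numbers, so the parallel replicates are not independent.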
------------------------------

End of R-help Digest, Vol 229, Issue 20
***************************************