Looking for package for data generation for classification and regression

9 messages · Tom Woolman, Sarah Goslee, Ranjan Maitra +1 more

#
Dear All,

I need to generate artificial data for machine learning
classification and regression analysis. What I am looking for is
something similar to Python's sklearn.datasets.make_classification and
sklearn.datasets.make_regression:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

I have searched CRAN for something similar, but found nothing. Could
someone please help me with this?

Thanks in advance,

Paul
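
For readers unfamiliar with the generators mentioned above, here is a minimal sketch of the kind of synthetic data in question. It is not scikit-learn's implementation: the function names `toy_regression` and `toy_classification`, their parameters, and the choice of Gaussian features are all assumptions made for illustration, and only the Python standard library is used.

```python
import random


def toy_regression(n_samples=100, n_features=3, noise=0.1, seed=0):
    """Hypothetical helper: y = X . coef + Gaussian noise.

    Returns (X, y, coef) so the true coefficients can be compared
    with whatever a fitted model recovers.
    """
    rng = random.Random(seed)
    # Assumed design: random true coefficients, standard-normal features.
    coef = [rng.uniform(-10, 10) for _ in range(n_features)]
    X = [[rng.gauss(0, 1) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [sum(c * x for c, x in zip(coef, row)) + rng.gauss(0, noise)
         for row in X]
    return X, y, coef


def toy_classification(n_samples=100, n_features=2, n_classes=2,
                       class_sep=2.0, seed=0):
    """Hypothetical helper: one unit-variance Gaussian blob per class.

    Class k is centred at (k * class_sep, ..., k * class_sep), so a
    larger class_sep gives an easier problem.
    """
    rng = random.Random(seed)
    X, labels = [], []
    for i in range(n_samples):
        k = i % n_classes  # cycle through classes for balanced labels
        X.append([rng.gauss(k * class_sep, 1) for _ in range(n_features)])
        labels.append(k)
    return X, labels


if __name__ == "__main__":
    X, y, coef = toy_regression(n_samples=5)
    print(coef)
    X, labels = toy_classification(n_samples=6, n_classes=3)
    print(labels)
```

For course assignments, varying `noise` or `class_sep` in a sketch like this is one simple way to control how hard each exercise is.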
#
Hi Paul. Have you considered just going onto Kaggle and GitHub and 
searching for some of the many freely available real datasets that are 
posted there? I'm seeing a lot of productivity these days with research 
focused on data generation, and not just on creating algorithms and 
predictive models. Which is a good thing for us ;)

One of the current research papers I'm working on now is based on mining 
a dataset I discovered on Kaggle a few months back and trying to create 
a novel solution for that. Proper credit will of course be provided in 
the citation references for the data provider.


Thanks,
Tom
On 2022-03-03 16:00, Paul Smith wrote:
#
Sounds interesting, Tom! Thanks!

I am trying to find datasets for creating assignments for students in
a machine learning course.

Paul
On Thu, Mar 3, 2022 at 9:04 PM Tom Woolman <twoolman at ontargettek.com> wrote:
#
Hi Paul,

If you aren't committed to creating your own, the cluster.datasets
package might be of interest. I've also used
http://cs.joensuu.fi/sipu/datasets/ quite often.

Sarah
On Thu, Mar 3, 2022 at 4:20 PM Paul Smith <phhs80 at gmail.com> wrote:
#
Thanks, Sarah! Your answer is quite helpful!

Paul
On Thu, Mar 3, 2022 at 10:43 PM Sarah Goslee <sarah.goslee at gmail.com> wrote:
#
On Thu Mar03'22 09:00:08PM, Paul Smith wrote:
Not sure if this helps, but at least for classification and clustering, there is the MixSim package on CRAN which provides classification datasets according to an overall overlap measure.


Hope this helps!

Best wishes,
Ranjan
#
On Fri, Mar 4, 2022 at 8:07 AM Ranjan Maitra <mlmaitra at gmx.com> wrote:
Thanks, Ranjan, that is also quite helpful, since clustering is also a
topic of the course!

Paul
#
On Fri Mar04'22 10:41:24AM, Paul Smith wrote:
The Clustering Algorithms Referee Package (CARP) uses the same codebase but is more general.

https://jmlr.org/papers/v12/melnykov11a.html

Unfortunately, it is written in C, so it may not help.

It is on www.mloss.org at:

https://mloss.org/software/view/248/

but perhaps should also be moved to github.

Best wishes,
Ranjan
#
On Fri, Mar 4, 2022 at 5:03 PM Ranjan Maitra <mlmaitra at gmx.com> wrote:
That is quite interesting, Ranjan! I hope you will have that on GitHub
as an R package ready for installation.

Best wishes, Paul