Problem with comparing multiple data sets
Thank you John. Yes, as you mentioned, this is not really what I am looking for. It's interesting, because I really thought it would be pretty easy. All I need to do is compare class1, class2 and class3 for each text and put the most frequent number next to it in each row, then repeat that for all the rows. Apparently it's not that simple. Sorry, I didn't notice that I sent it only to you! Thanks for letting me know. I would appreciate it if anybody can help with this. Thank you.
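The row-by-row task described above can be sketched in base R roughly as follows. This is only a minimal sketch: the helper name row_mode, the output column class.mode and the four-row sample are made up for illustration, and the tie rule (smallest value wins) is an assumption, since the thread does not say how ties should be handled.

```r
# Hypothetical helper: most frequent value in a vector.
# Ties go to the smallest value, because table() sorts its names
# and which.max() returns the first maximum.
row_mode <- function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
}

# Small made-up sample in the same shape as the posted data.
tweets <- data.frame(
  class.1 = c(0L, 2L, 1L, 2L),
  class.2 = c(2L, 2L, 1L, 0L),
  class.3 = c(0L, 2L, 0L, 1L),
  terms   = c("#dac", "#mac", "#token", "consent"),
  stringsAsFactors = FALSE
)

# Apply the helper to each row of the three class columns
# and store the result in a new column.
class_cols <- c("class.1", "class.2", "class.3")
tweets$class.mode <- apply(tweets[, class_cols], 1, row_mode)
tweets
```

The last sample row has three different values, so it shows the tie rule in action; a different rule (for example, returning NA on ties) may suit the real data better.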
On Tue, May 26, 2015 at 7:27 PM, John Kane <jrkrideau at inbox.com> wrote:
Hi Mohammad,
The data came through beautifully despite the fact that you posted in
HTML. Please, post in plain text.
Oh, just as I was ready to push Send, I noticed you only replied to me.
You really should reply to the R-help list since there are a lot more and
better people to help there. Besides it's a world-wide list. Others can
play with the problem while we sleep :) .
I will just reply to you but I really suggest sending all of this to the
list.
Now I am wondering what to do with the data. As a first swipe I just added
up all the values in each class by each text value. Results are below. Not
what you want by any means but perhaps a small step.
Then I started to think are we really interested in the sum or should we
be looking at incidence, that is should we be looking at the frequency
rather than the sum?
Is
class.1 class.2 class.3 terms
0       2       0       #dac
a value of 2 (sum) or a hit of 1 (count or freq)?
Anyway below is what I have tried so far -- it may not be anywhere near
what you want but if it makes any sense then I think we just need to pick
off the highest values for each combination of terms and class to give you
what you want.
I suspect our real data-munging gurus can do all this faster and better
than I can but hopefully it is a start.
Where your data set is dat1
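#=====================================
# Install reshape2 only if it is not already installed.
if (!requireNamespace("reshape2", quietly = TRUE)) install.packages("reshape2")
#=====================================
library(reshape2)
# Reshape to long format: one row per (terms, class column, value).
mdat <- melt(dat1, id.vars = c("terms"),
             variable.name = "class",
             value.name = "value",
             na.rm = FALSE)
# Sum the values for each combination of terms and class.
mdat1 <- aggregate(value ~ terms + class, data = mdat, sum)
mdat1[order(mdat1$terms, mdat1$class), ]
#=====================================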
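On the sum-versus-frequency question raised above, one possible way to count occurrences instead of summing is sketched below in base R. It is not the method used in this thread: the sample data frame is made up to mimic the posted columns, the names long, counts and most_frequent are illustrative, and pooling all three class columns per term is an assumption about what is wanted.

```r
# Made-up sample shaped like the posted data (dat1 in the code above).
dat1 <- data.frame(
  class.1 = c(0L, 0L, 1L, 2L, 2L),
  class.2 = c(2L, 2L, 0L, 1L, 1L),
  class.3 = c(0L, 0L, 1L, 2L, 2L),
  terms   = c("#dac", "#dac", "#dac", "#mac", "#mac"),
  stringsAsFactors = FALSE
)

# Stack the three class columns into one long vector of values,
# repeating the term labels so each value keeps its term.
class_cols <- c("class.1", "class.2", "class.3")
long <- data.frame(
  terms = rep(dat1$terms, times = length(class_cols)),
  value = unlist(dat1[class_cols], use.names = FALSE)
)

# Cross-tabulate: rows are terms, columns are class values, cells are counts.
counts <- table(long$terms, long$value)
counts

# Most frequent class value per term (ties go to the smallest value).
most_frequent <- apply(counts, 1, function(n) as.integer(names(n)[which.max(n)]))
most_frequent
```

The counts table answers the "hit of 1 or value of 2" question directly: a class value of 2 contributes one count to the column for 2, not two units to a sum.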
John Kane
Kingston ON Canada
-----Original Message-----
From: mxalimohamma at ualr.edu
Sent: Tue, 26 May 2015 09:50:43 -0500
To: jrkrideau at inbox.com
Subject: Re: [R] Problem with comparing multiple data sets
Thank you John for being patient with me.
My original post was about comparing 3 sets of data which differed in
their class value for the same text. However, I thought it might be easier
to combine those 3 data sets into one that shows the 3 different classes
and then find the most frequent class value for the text. So that's what I
did. Now I only want to add the most frequent class value in a new column.
I tried to create a dput version of the data set (Only a small part of it)
so you can see. I hope it works.
Tweet1 <- read.csv(file = "part1_complete.csv", header = TRUE, sep = ",")
dput(head(Tweet1, 100))
structure(list(class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 0L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 0L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), class.2 = c(2L,
2L, 2L, 2L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
2L, 0L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
1L, 0L, 0L, 0L, 0L, 2L, 1L, 2L, 0L, 2L, 2L, 0L, 2L, 1L, 1L, 1L,
1L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 2L, 2L, 2L, 2L, 2L,
0L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), terms = structure(c(9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 69L, 69L, 69L, 69L, 69L, 40L, 40L, 40L, 40L,
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 98L, 98L, 98L, 98L, 98L,
98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 23L, 87L, 87L, 87L,
87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
87L, 87L), .Label = c("#accountability",
"#accountability,#anonymity,anonymity",
"#accountability,recovery", "#anonymity,anonymity",
"#anonymous,anonymous",
"#attacker,security", "#authentication,access control", "#confidential",
"#dac", "#encryption,#privacy,#security", "#identifier",
"#identifier,identifier",
"#intrusion,#security,security", "#mac", "#mac,#security",
"#mac,password",
"#mac,security", "#password,privacy", "#password,security",
"#prevention,prevention",
"#privacy,#security,password", "#privacy,identifiable",
"#privacy,information privacy,privacy",
"#privacy,intrusion", "#privacy,location privacy,privacy",
"#privacy,password,security",
"#privacy,personal data", "#privacy,personal information,privacy",
"#privacy,security", "#pseudonym", "#pseudonymity",
"#security,authentication,identity management",
"#security,identity management,security", "#security,mac,security",
"#security,malicious,security", "#security,personal information",
"#security,retention", "#token", "#token,token",
"accountability,anonymous",
"accountability,audit trail", "accountability,confidential",
"accountability,security", "accountability,token", "adversary,pin",
"anonymity,authentication", "anonymity,security", "anonymous,disclosure",
"anonymous,password", "authentication,password,security",
"authorization,mac",
"authorization,permission", "confidential,disclosure",
"confidential,disclosure,security",
"confidential,mac", "confidential,personal information",
"confidential,pin",
"confidential,privilege", "confidentiality,security", "consent",
"dac", "dac,pcm", "data aggregation,privacy", "data controller",
"data protection,encryption", "data protection,recovery",
"data protection,security",
"data quality,security", "data security,encryption,security",
"data security,mac,security", "data security,personal data,security",
"data security,prevention,security", "detection", "detection,mac",
"detection,password", "deterrence,prevention", "digital signature",
"disclosure,password", "disclosure,private information",
"disclosure,security",
"encryption,password,recovery", "encryption,private data",
"id management,privacy",
"id management,security", "identifier", "identifier,token",
"location privacy,privacy",
"mac,password,security", "mac,permission", "mac,prevention",
"mac,privacy", "mac,pseudonym", "malicious,prevention", "non-repudiation",
"password,prevention,security", "password,private information",
"password,recovery", "password,user id", "permission,personal data",
"permission,privacy,privacy policy", "personal data",
"personal identification number,pin",
"personal information", "personal information,security", "prevention",
"prevention,privilege", "privacy,privacy policy",
"privacy,privacy preferences",
"private information,security", "recovery,retention", "recovery,token",
"retention,token", "sensitive data", "token"), class = "factor")),
.Names = c("class.1", "class.2", "class.3", "terms"),
row.names = c(NA, 100L), class = "data.frame")
On Mon, May 25, 2015 at 2:04 PM, John Kane <jrkrideau at inbox.com> wrote:
Hi Mohammad,
If you are just starting with R a sense of total confusion is often the
first feeling. Welcome :).
If you are a SAS or SPSS user this may help
https://science.nature.nps.gov/im/datamgmt/statistics/r/documents/r_for_sas_spss_users.pdf
If anything, I am even more lost than before.
Did Jim Lemon's approach help? Confuse ?
Perhaps one of the problems is that the data did not come through
cleanly. You posted in HTML and the R-help list strips out all HTML so the
result often is mangled beyond any real use.
I may have imagined that your data are more complicated than they really
are if all you really want is some kind of frequency count possibly by some
conditioning variable. Is this it?
It seems too simple but that is what I read that Excel is doing (as
incompetently as usual---I had not realised it was possible to be even less
impressed with Excel than I already was.)
Can you send us some more data in dput() format. See the links I provided
earlier or have a look at ?dput for more information.
If you have lot of data, a representative sample is fine. It is often
enough to do something like :
dput(head(mydata, 100))
which supplies 100 rows of data.
Just output the dput() data, copy and paste into your email, et voilà
we have the exact same data.
The reason for dput() is that it provides a snapshot of exactly how the
data exists on your machine. Given all sorts of differences between OS's,
personal settings, human languages and so on, what I or another R-help
reader see or read in may not correspond to what you have. Using dput()
avoids all of this.
Here is a simple example of what I mean. If you look at dat1 and dat2
they 'look' the same but ... I could read in data either way depending on
all sorts of variables and have no idea which, if either, is how you see the
data.
Data are supplied in dput() format, just copy and paste into R.
=====
dat1 <- structure(list(aa = structure(1:10, .Label = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"), class = "factor"), bb = c(10L,
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("aa", "bb"), row.names =
c(NA,
-10L), class = "data.frame")
dat2 <- structure(list(aa = 1:10, bb = c(10L, 9L, 8L, 7L, 6L, 5L, 4L,
3L, 2L, 1L)), .Names = c("aa", "bb"), row.names = c(NA, -10L), class =
"data.frame")
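dat1
dat2 # looks a lot like dat1
with(dat1, aa*bb) # aa is a factor: returns NA with a warning
with(dat2, aa*bb) # aa is an integer: multiplies as expected
str(dat1) # shows aa as a Factor with 10 levels
str(dat2) # shows aa as int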
=======
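If a column of numbers does arrive as a factor, as aa does in dat1 above, a common base R idiom for recovering the numbers is to go through as.character() first. The vector below is made up for illustration and is separate from the dat1 and dat2 objects in the example.

```r
# A factor whose labels look like numbers, as in dat1$aa above.
aa <- factor(c("10", "9", "8"))

as.integer(aa)                # internal level codes, not the labels
as.integer(as.character(aa))  # the numbers the labels spell out
```

Calling as.integer() directly on a factor returns the level codes, which is why the two lines can give different results.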
John Kane
Kingston ON Canada
-----Original Message-----
From: mxalimohamma at ualr.edu
Sent: Mon, 25 May 2015 12:14:46 -0500
To: jrkrideau at inbox.com
Subject: Re: [R] Problem with comparing multiple data sets
Hi John.
Thank you for your response.
Here is a small portion of my actual data set. What I am supposed to do
is use a function similar to the MODE function in Excel to find the most
frequent value (class) for each term.
V1 V2 V3 V4
1 class 1 class 2 class 3 terms
2 0 2 0 #dac
3 0 2 0 #dac
4 0 2 0 #dac
5 0 2 0 #dac
6 1 0 1 #dac
7 0 0 0 #dac
....
Since I just started using R, I don't know where I am going with this. I
would appreciate any help.
On Sat, May 23, 2015 at 8:23 AM, John Kane <jrkrideau at inbox.com> wrote:
Hi Mohammad
Welcome to the R-help list.
There probably is a fairly easy way to do what you want but I think we
probably need a bit more background information on what you are trying to
achieve. I know I'm not exactly clear on your decision rule(s).
It would also be very useful to see some actual sample data in usable R
format. Have a look at these links
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
and http://adv-r.had.co.nz/Reproducibility.html for some hints on what you
might want to include in your question.
In particular, read up about dput() in those links and/or see ?dput.
This is the generally preferred way to supply sample or illustrative data
to the R-help list. It basically creates a perfect copy of the data as it
exists on 'your' machine so that R-help readers see exactly what you do.
John Kane
Kingston ON Canada
> -----Original Message-----
> From: mxalimohamma at ualr.edu
> Sent: Fri, 22 May 2015 12:37:50 -0500
> To: r-help at r-project.org
> Subject: [R] Problem with comparing multiple data sets
>
> Hi everyone,
>
> I am very new to R and I have a task to do. I appreciate any help. I have
> 3 data sets. Each data set has 4 columns. For example:
>
> Class Comment Term Text
> 0     com1    aac  text1
> 2     com2    aax  text2
> 1     com3    vvx  text3
>
> Now I need to compare the class section between the 3 data sets and assign
> the most frequent class to that text. For example, if text1 is assigned to
> class 0 in data sets 1 and 2 but assigned to class 2 in data set 3, then it
> should be assigned to class 0. If they are all the same, the class stays
> the same. The ideal thing would be to keep the same format and just update
> the class. Is there any easy way to do this?
>
> Thanks a lot.
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Mohammad Alimohammadi | Graduate Assistant University of Arkansas at Little Rock | College of Science and Mathematics (CSAM) 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ [[alternative HTML version deleted]]
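For the original question quoted at the bottom of the thread (three data sets with Class, Comment, Term and Text columns), a majority vote across the data sets could be sketched in base R roughly as follows. The three data frames are made up from the example rows in the question, the names set1, set2, set3, votes and majority are illustrative, and the sketch assumes that every Text value appears exactly once in each data set.

```r
# Three made-up data sets in the layout from the original question.
set1 <- data.frame(Class = c(0L, 2L, 1L),
                   Comment = c("com1", "com2", "com3"),
                   Term = c("aac", "aax", "vvx"),
                   Text = c("text1", "text2", "text3"),
                   stringsAsFactors = FALSE)
set2 <- transform(set1, Class = c(0L, 2L, 0L))
set3 <- transform(set1, Class = c(2L, 2L, 0L))

# Line the three Class columns up by Text.
# match() is used so that row order in set2 and set3 does not matter.
votes <- cbind(
  set1$Class,
  set2$Class[match(set1$Text, set2$Text)],
  set3$Class[match(set1$Text, set3$Text)]
)

# Majority vote per row; ties go to the smallest class value.
majority <- apply(votes, 1, function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
})

# Keep the original format and only update Class.
result <- set1
result$Class <- majority
result
```

With three data sets and three possible class values, a three-way tie is possible, so the tie rule used here is worth replacing with whatever decision rule the task actually calls for.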