Finding strings in a dataset

Hi,
How can I find the location of string data in my 2D dataset? spec(Dataset)
will reveal the columns that contain the strings. But can I know where
exactly the string values are in the column?

Hello,

You should post a working example; we have no idea what your 2D data set
is. A matrix? A data.frame? Something else?

And the string you are looking for? Are you thinking of regular
expressions (grep), or is it a simple equality test with '=='?

Here is a reproducible example of the use of which() with the argument
arr.ind set to TRUE (see ?which).

# create a data set
set.seed(2021)
A <- matrix(sample(letters, 24, TRUE), ncol = 4)

# Test for equality, this returns
# a logical matrix and which() can
# be applied to it
found <- A == "g"
which(found, arr.ind = TRUE)
#     row col
#[1,]   1   1
#[2,]   5   1
#[3,]   2   3

# The same code can be used if the data is
# a data.frame
df1 <- as.data.frame(A)
df1 == "g"

But if you want to look for a regex, try sapply. In this example the 
pattern is a simple one, and I use grepl.

pattern <- "g"
found2 <- sapply(df1, function(x) grepl(pattern, x))
which(found2, arr.ind = TRUE)
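
As a further sketch of the regex route, the example below flags every cell that does not look like a plain number. The toy data frame, its column names, and the pattern itself are assumptions made for illustration; the pattern is deliberately simple and would need adjusting for exponents or thousands separators.

```r
# Hypothetical toy data; the column names are made up for illustration.
toy <- data.frame(a = c("1.5", "x2", "3"),
                  b = c("0.2/0.3", "7", ">0.2"),
                  stringsAsFactors = FALSE)

# Assumed definition of a "plain number":
# optional sign, digits, optional decimal part.
num_pattern <- "^[-+]?[0-9]+(\\.[0-9]+)?$"

# TRUE where the cell does NOT match the pattern.
not_num <- sapply(toy, function(x) !grepl(num_pattern, x))

# (row, col) positions of the flagged cells.
hits <- which(not_num, arr.ind = TRUE)
hits
```

The same two steps (build a logical matrix, then call which() with arr.ind = TRUE) apply whatever the pattern is.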

Hope this helps,

Rui Barradas

In reply to Tuhin Chakraborty's message of 15/05/21, 18:07.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Tuhin,

What do you mean by a 2-D dataset? You say some columns contain strings, so
it does not sound like you are using a matrix, as then ALL columns would be
of the same type.

So are you using a data.frame or tibble or something you made on your own?

Can you address one column at a time and would that be of type vector? Some
methods work fairly easily on those and some also on lists.

Once you have that vector, there are quite a few ways to find what you want.
Is it fixed text, as in looking for an exact full match, so that
"theta" must be matched in full? Or would you want to match
"the", so that both "theta" and "lathe" would match? Or are you matching a
more complex pattern, like looking for all text that has two vowels
in a row?

Once you figure out what you have and what you want, how do you want to
identify what you are looking for? Will there be one match or possibly many
or even all? Many methods will return a TRUE/FALSE vector of the same length
or the integer offset of a match such as telling you it is the fifth item.

R has collections of string functions including in packages like
stringr/stringi that deal well with many things you might need. For matching
patterns, there is a family of functions using "grep" and so on.
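
The three kinds of matching described above, and the two kinds of result (a TRUE/FALSE vector or integer positions), can be sketched in base R as follows. The vector x is made-up illustration data, not part of the original question.

```r
# Hypothetical character vector for illustration.
x <- c("theta", "lathe", "the", "moon", "sky")

# Exact, full-string match: logical vector of the same length as x.
is_theta <- x == "theta"

# Partial match on fixed text (no regex interpretation of the pattern).
has_the <- grepl("the", x, fixed = TRUE)

# Pattern match: two vowels in a row.
two_vowels <- grepl("[aeiou]{2}", x)

# Integer positions instead of TRUE/FALSE.
which(has_the)
grep("[aeiou]{2}", x)
```

grepl() returns the logical vector and grep() the integer positions, so which(grepl(...)) and grep(...) give the same answer.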

Good luck.

Thank you, everyone, for the very helpful suggestions. I understand that my
question is not altogether clear, so let me share an example.
Below is part of the dataset; there are around 40000 rows.
LI(PPM) SC(PPM) TI(PPM) V(PPM)
3.1/0.5 ? ? ?
? ? 0.2/0.3
?
? 2.8/0.75 ? >0.2
0.0389 108.6591 0.0214 85.18818
0.0688 146.1739 0.0117 108.0221
0.0265 121.3268 0.00749 85.34932
0.139901 125.3066 0.00984 97.23175

Now 0.2/0.3 and >0.2 are treated as strings. When I use the
spec(Dataset) function in R, it shows me which columns contain strings;
for example, it tells me that LI(PPM), SC(PPM), etc. contain strings. But I
would like to know if there is some way to learn exactly where the
string values are, such as the one in the top row of LI(PPM). As this is a huge
dataset, it is difficult to go through all the rows manually.
Thank you again, and in anticipation.
Tuhin

In reply to Avi Gross's message of Sun, May 16, 2021, 4:25 AM.

Do look at the mess we received (the table in your message above), and make an effort not to send HTML email to this list. What you saw when you sent it is not what we see when it gets to us.

Sent from my phone. Please excuse my brevity.

Hello,

The data makes it clearer.
Do you want to know where the values that cannot be coerced to numeric are?
The auxiliary function f outputs a logical vector, sapply applies it
column by column, and which(., arr.ind = TRUE) gives the TRUE values as (row,
col) pairs.

txt <- "
LI(PPM) SC(PPM) TI(PPM) V(PPM)
3.1/0.5 ? ? ?
? ? 0.2/0.3 ?
? 2.8/0.75 ? >0.2
0.0389 108.6591 0.0214 85.18818
0.0688 146.1739 0.0117 108.0221
0.0265 121.3268 0.00749 85.34932
0.139901 125.3066 0.00984 97.23175
"
df1 <- read.table(text = txt, header = TRUE)
df1

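# TRUE where a value cannot be coerced to numeric
# (genuine NA values are flagged too);
# as.character() guards against factor columns
f <- function(x){
   suppressWarnings(is.na(as.numeric(as.character(x))))
}
found <- sapply(df1, f)        # logical matrix, one column per variable
which(found, arr.ind = TRUE)   # (row, col) positions of the TRUE cells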
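
One caveat with the approach above is that cells that are genuinely missing (NA) are flagged along with the non-numeric strings. If the two need to be told apart, a variant along the following lines might help. The helper name is_bad_number and the toy data are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: TRUE only for cells that are present (not NA)
# but fail numeric coercion.
is_bad_number <- function(x) {
  x <- as.character(x)
  !is.na(x) & suppressWarnings(is.na(as.numeric(x)))
}

# Made-up data with one real NA and two non-numeric strings.
toy <- data.frame(li = c("3.1/0.5", NA, "0.0389"),
                  v  = c("85.2", ">0.2", "97.2"),
                  stringsAsFactors = FALSE)

bad <- sapply(toy, is_bad_number)
pos <- which(bad, arr.ind = TRUE)
pos
```

Here the NA in the second row of li is left alone, and only "3.1/0.5" and ">0.2" are reported.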

Hope this helps,

Rui Barradas

In reply to Tuhin Chakraborty's message of 16/05/21, 06:31.

Hello,

You can also create an extra column with the column names corresponding
to the column col. I believe this extra column is not needed, and with a
big data set it's even a waste of time and memory, but the code
below creates it.

res <- which(found, arr.ind = TRUE)
res <- as.data.frame(res)
res$col_name <- names(df1)[ res$col ]

With a big data set, keep in mind that the first res is a numeric matrix,
so its access and extraction are faster; matrix operations are generally
faster than data.frame operations.
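
If it is useful to see the offending values next to their positions, the (row, col) matrix can be used directly as an index. This is only a sketch on made-up data; the names toy, values and out are assumptions for illustration.

```r
# Made-up data with two non-numeric strings.
toy <- data.frame(li = c("3.1/0.5", "0.0389"),
                  v  = c("85.2", ">0.2"),
                  stringsAsFactors = FALSE)
found <- sapply(toy, function(x) suppressWarnings(is.na(as.numeric(x))))

# Numeric matrix of (row, col) pairs.
res <- which(found, arr.ind = TRUE)

# Indexing a matrix by a two-column matrix picks one cell per row of res.
values <- as.matrix(toy)[res]

# Positions, column names and the offending values side by side.
out <- data.frame(res,
                  col_name = names(toy)[res[, "col"]],
                  value = values,
                  stringsAsFactors = FALSE)
out
```

The as.matrix() call copies the data, so on a very large data set it may be preferable to look up values one column at a time.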

Hope this helps,

Rui Barradas

In reply to Rui Barradas's message of 16/05/21, 08:30.

Thank you. This possibly will work.
Tuhin Chakraborty
PhD
Geology & Geophysics
Indian Institute Of Technology, Kharagpur
Kharagpur-721302
