[Bioc-devel] MRD measurements in Leukemic patients using NGS data in r - Bioc-devel

Wed, Mar 4, 2020 6:47 AM #

Hi,

I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.

I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.

Thanks,

Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

Tim Triche, Jr.

Thu, Mar 5, 2020 1:36 PM #

a few thoughts:

1) look into Shearwater (
https://bioconductor.org/packages/release/bioc/html/deepSNV.html), then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @
Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that
most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR
(lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a
30x-100x genome (or even 1000x FM1) and are trying to compete with existing
standard approaches that can detect molecules at 1e-6, it'll be rough.  An
alternative approach (that has been used repeatedly) is to throw caution to
the wind, generate primers for numerous subject-specific somatic variants,
and use the ensemble to try and model MRD (speaking of ML). On the one
hand, that could give the clinic a "customer for life"; on the other hand,
it's not conducive to large-scale automation & deployment. As far as I
know, it never got much traction, in leukemia or anywhere else.  (Consider
that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million
cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads,
please talk to the people in #2 above; the reason most people won't go
below 5% VAF is that you get thwacked by error rates at that level, and the
reason most NGS-based MRD is based on UMIs is that existing PCR-based
methods have 6 logs sensitivity.)

--t

On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl> wrote:

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Tim Triche, Jr.

Fri, Mar 6, 2020 4:57 AM #

Oh hey, one last thing ? if all you want is to get nucleotide counts per region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. It sounds awkward but in practice it is super fast and will give you all the nucleotide and read level information you could want. One of my interns implemented this for mitochondrial variant calling in MTseeker when we got sick of using gmapR and being flagged for errors on not-Linux. (We gutted the entire package recently and have new, insanely deep examples from Oxford Nanopore direct RNA sequencing and from large single cell datasets; I need to add those and get the package back out of purgatory). 

That said, in the end you will want a LOT of validation material so this is very much just a starting point. But still, it?s your starting point, in R at least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the like. htsnim is nice but then you?ll be implementing the ML bits from scratch, so I think your instincts to try R first are sensible. 

Good luck! Even if you use this for something else besides MRD, I think it will become a useful exercise.  

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

?
a few thoughts: 

1) look into Shearwater (https://bioconductor.org/packages/release/bioc/html/deepSNV.html), then 

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x genome (or even 1000x FM1) and are trying to compete with existing standard approaches that can detect molecules at 1e-6, it'll be rough.  An alternative approach (that has been used repeatedly) is to throw caution to the wind, generate primers for numerous subject-specific somatic variants, and use the ensemble to try and model MRD (speaking of ML). On the one hand, that could give the clinic a "customer for life"; on the other hand, it's not conducive to large-scale automation & deployment. As far as I know, it never got much traction, in leukemia or anywhere else.  (Consider that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please talk to the people in #2 above; the reason most people won't go below 5% VAF is that you get thwacked by error rates at that level, and the reason most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs sensitivity.)

--t


On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl> wrote:

Hi,


I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can 
 be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.


I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.


Thanks,


Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Mulder, R

Sat, Mar 7, 2020 12:51 PM #

Dear Tim,


I installed Rsamtools and ran it on a bamfile using the following command to get nucleotide count for 2 hotspot regions.


library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, 5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Oh hey, one last thing ? if all you want is to get nucleotide counts per region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. It sounds awkward but in practice it is super fast and will give you all the nucleotide and read level information you could want. One of my interns implemented this for mitochondrial variant calling in MTseeker when we got sick of using gmapR and being flagged for errors on not-Linux. (We gutted the entire package recently and have new, insanely deep examples from Oxford Nanopore direct RNA sequencing and from large single cell datasets; I need to add those and get the package back out of purgatory).

That said, in the end you will want a LOT of validation material so this is very much just a starting point. But still, it?s your starting point, in R at least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the like. htsnim is nice but then you?ll be implementing the ML bits from scratch, so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it will become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

?
a few thoughts:

1) look into Shearwater (https://bioconductor.org/packages/release/bioc/html/deepSNV.html<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347080053&sdata=LpUeflml9u2XLENpmDAEuDBRANXZpjkFbIEiJAeP2pw%3D&reserved=0>), then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x genome (or even 1000x FM1) and are trying to compete with existing standard approaches that can detect molecules at 1e-6, it'll be rough.  An alternative approach (that has been used repeatedly) is to throw caution to the wind, generate primers for numerous subject-specific somatic variants, and use the ensemble to try and model MRD (speaking of ML). On the one hand, that could give the clinic a "customer for life"; on the other hand, it's not conducive to large-scale automation & deployment. As far as I know, it never got much traction, in leukemia or anywhere else.  (Consider that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please talk to the people in #2 above; the reason most people won't go below 5% VAF is that you get thwacked by error rates at that level, and the reason most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs sensitivity.)

--t

On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>> wrote:

Hi,

I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can
be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.

I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.

Thanks,

Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347090042&sdata=HXNdz6iiEUQ3l%2BVwwDIGX0RAdO7Tdd7r7f4FcgW1k9M%3D&reserved=0>
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.

Tim Triche, Jr.

Sat, Mar 7, 2020 1:28 PM #

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."),
`[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19
REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1


Or whatever you like (GAlignments would get you the reads, GAlignmentPairs
the read pairs, and so on)



--t

On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 at umcg.nl> wrote:

Dear Tim,


I installed Rsamtools and ran it on a bamfile using the following command
to get nucleotide count for 2 hotspot regions.


library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321,
5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more
region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

------------------------------
*Van:* Tim Triche, Jr. <tim.triche at gmail.com>
*Verzonden:* vrijdag 6 maart 2020 13:57:10
*Aan:* Mulder, R
*CC:* bioc-devel at r-project.org
*Onderwerp:* Re: [Bioc-devel] MRD measurements in Leukemic patients using
NGS data in r

Oh hey, one last thing ? if all you want is to get nucleotide counts per
region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges)
holding a GRanges (Genomic Ranges) of your regions added to scanBamParams
for each BAM. It sounds awkward but in practice it is super fast and will
give you all the nucleotide and read level information you could want. One
of my interns implemented this for mitochondrial variant calling in
MTseeker when we got sick of using gmapR and being flagged for errors on
not-Linux. (We gutted the entire package recently and have new, insanely
deep examples from Oxford Nanopore direct RNA sequencing and from large
single cell datasets; I need to add those and get the package back out of
purgatory).

That said, in the end you will want a LOT of validation material so this
is very much just a starting point. But still, it?s your starting point, in
R at least. And truthfully I much prefer R/Bioconductor idioms to (say)
pysam or the like. htsnim is nice but then you?ll be implementing the ML
bits from scratch, so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it
will become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

?
a few thoughts:

1) look into Shearwater (
https://bioconductor.org/packages/release/bioc/html/deepSNV.html
<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347080053&sdata=LpUeflml9u2XLENpmDAEuDBRANXZpjkFbIEiJAeP2pw%3D&reserved=0>),
then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @
Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that
most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR
(lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a
30x-100x genome (or even 1000x FM1) and are trying to compete with existing
standard approaches that can detect molecules at 1e-6, it'll be rough.  An
alternative approach (that has been used repeatedly) is to throw caution to
the wind, generate primers for numerous subject-specific somatic variants,
and use the ensemble to try and model MRD (speaking of ML). On the one
hand, that could give the clinic a "customer for life"; on the other hand,
it's not conducive to large-scale automation & deployment. As far as I
know, it never got much traction, in leukemia or anywhere else.  (Consider
that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million
cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads,
please talk to the people in #2 above; the reason most people won't go
below 5% VAF is that you get thwacked by error rates at that level, and the
reason most NGS-based MRD is based on UMIs is that existing PCR-based
methods have 6 logs sensitivity.)

--t


On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl> wrote:

Hi,


I was wondering if anyone could help me with a script and support for the
above mentioned goal.

For this I have several BAM files for which I want to determine de
nucleotide count per region of interest. The latter could be several
hotspot mutation sites. I would like to get an overall overview of all the
BAM files. Next I want to use these counts to determine for any new BAM
file if the count for a particular genomic position is higher than the
allowable range, hence could indicate if a mutation is present. For this I
would like to use the modified Thompson Tau test. I think machine learning
could be used for this. So, why do I want to do all this? Well, normal NGS
pipelines only deal with variants at a frequency of 5%. Mutatios below this
frequency are often missed. To know if a mutation is present below this
level, you showed dive into the alignment and most often manually
investigate the base calls. I know that this races some questions regarding
call qualities, but then again our conventional assays have actually
confirmed some of these low mutations. In addition, NGS can
 be used to determine the presence of low frequent mutation which is of
great importance for determining the measurable residual disease after
treatment.


I am new to r and bioconductor so I would be very thankful if someone
could help me in my mission to setting up a script for this purpose.


Thanks,


Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347090042&sdata=HXNdz6iiEUQ3l%2BVwwDIGX0RAdO7Tdd7r7f4FcgW1k9M%3D&reserved=0>

Mulder, R

Sun, Mar 8, 2020 3:21 AM #

Dear Tim,


Thanks again for your (quick) reply.


I just ran it on my computer and it gave me some results but also several errors.


The first part went perfect

BamFileList of length 2
names(2): here were my 2 files named without bam extension

However the second part did give results but also several errors

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

Error in genes(Homo.sapiens) : could not find function "genes"

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Error: object 'gxs' not found

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

+                        include_insertions=TRUE, min_minor_allele_depth=1)

+     positions <- subset(x, duplicated(pos))$pos
+     subset(x, pos %in% positions)
+ }

seqnames      pos strand nucleotide count
myfile1        1 36931671      +          A   251
myfile1         1 36931671      +          G     2
myfile2       1 36931673      +          -     1
myfile2               1 36931673      +          A   253


Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran it again, but the errors at the next operations remained.

Error in genes(Homo.sapiens) : could not find function "genes"

Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should I look up how this genome is called?

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Where can I find this function?

Error: object 'gxs' not found

Does this have to do with the fact that IGF2 is not included in my panel?

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

As a result of the previous error, right?

Interestingly enough, I did get results. Why?



________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com>
Verzonden: zaterdag 7 maart 2020 22:28
Aan: Mulder, R
CC: bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1


Or whatever you like (GAlignments would get you the reads, GAlignmentPairs the read pairs, and so on)



--t

On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>> wrote:

Dear Tim,


I installed Rsamtools and ran it on a bamfile using the following command to get nucleotide count for 2 hotspot regions.


library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, 5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Oh hey, one last thing ? if all you want is to get nucleotide counts per region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. It sounds awkward but in practice it is super fast and will give you all the nucleotide and read level information you could want. One of my interns implemented this for mitochondrial variant calling in MTseeker when we got sick of using gmapR and being flagged for errors on not-Linux. (We gutted the entire package recently and have new, insanely deep examples from Oxford Nanopore direct RNA sequencing and from large single cell datasets; I need to add those and get the package back out of purgatory).

That said, in the end you will want a LOT of validation material so this is very much just a starting point. But still, it?s your starting point, in R at least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the like. htsnim is nice but then you?ll be implementing the ML bits from scratch, so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it will become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>> wrote:

?
a few thoughts:

1) look into Shearwater (https://bioconductor.org/packages/release/bioc/html/deepSNV.html<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0>), then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x genome (or even 1000x FM1) and are trying to compete with existing standard approaches that can detect molecules at 1e-6, it'll be rough.  An alternative approach (that has been used repeatedly) is to throw caution to the wind, generate primers for numerous subject-specific somatic variants, and use the ensemble to try and model MRD (speaking of ML). On the one hand, that could give the clinic a "customer for life"; on the other hand, it's not conducive to large-scale automation & deployment. As far as I know, it never got much traction, in leukemia or anywhere else.  (Consider that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please talk to the people in #2 above; the reason most people won't go below 5% VAF is that you get thwacked by error rates at that level, and the reason most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs sensitivity.)

--t

On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>> wrote:

Hi,

I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can
be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.

I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.

Thanks,

Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0>
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.

Vincent Carey

Sun, Mar 8, 2020 4:00 AM #

This has been a very interesting and illuminating dialogue but it really
should be moved to
support.bioconductor.org so that a record is available to the general user
community.

The solution here

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

is BiocManager::install("Homo.sapiens")

On Sun, Mar 8, 2020 at 6:22 AM Mulder, R <r.mulder01 at umcg.nl> wrote:

Dear Tim,


Thanks again for your (quick) reply.


I just ran it on my computer and it gave me some results but also several
errors.


The first part went perfect

BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."),

`[`, 1)

show(BamFiles)

BamFileList of length 2
names(2): here were my 2 files named without bam extension

However the second part did give results but also several errors

library(Homo.sapiens)

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

gxs <- genes(Homo.sapiens)

Error in genes(Homo.sapiens) : could not find function "genes"

names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

Error: object 'gxs' not found

sbparam <- ScanBamParam(which=SomeGenes)

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,

+                        include_insertions=TRUE, min_minor_allele_depth=1)

piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam,

PileupParam=puparam)

filterInvariant <- function(x) {

+     positions <- subset(x, duplicated(pos))$pos
+     subset(x, pos %in% positions)
+ }

show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))

              seqnames      pos strand nucleotide count
myfile1        1 36931671      +          A   251
myfile1         1 36931671      +          G     2
myfile2       1 36931673      +          -     1
myfile2               1 36931673      +          A   253


Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran
it again, but the errors at the next operations remained.

gxs <- genes(Homo.sapiens)

Error in genes(Homo.sapiens) : could not find function "genes"

Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should
I look up how this genome is called?

names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Where can I find this function?

SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

Error: object 'gxs' not found

Does this have to do with the fact that IGF2 is not included in my panel?

sbparam <- ScanBamParam(which=SomeGenes)

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

As a result of the previous error, right?

Interestingly enough, I did get results. Why?

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com>
Verzonden: zaterdag 7 maart 2020 22:28
Aan: Mulder, R
CC: bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using
NGS data in r

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."),
`[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19
REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam,
PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1


Or whatever you like (GAlignments would get you the reads, GAlignmentPairs
the read pairs, and so on)



--t


On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 at umcg.nl<mailto:
r.mulder01 at umcg.nl>> wrote:

Dear Tim,


I installed Rsamtools and ran it on a bamfile using the following command
to get nucleotide count for 2 hotspot regions.


library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321,
5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more
region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using
NGS data in r

Oh hey, one last thing ? if all you want is to get nucleotide counts per
region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges)
holding a GRanges (Genomic Ranges) of your regions added to scanBamParams
for each BAM. It sounds awkward but in practice it is super fast and will
give you all the nucleotide and read level information you could want. One
of my interns implemented this for mitochondrial variant calling in
MTseeker when we got sick of using gmapR and being flagged for errors on
not-Linux. (We gutted the entire package recently and have new, insanely
deep examples from Oxford Nanopore direct RNA sequencing and from large
single cell datasets; I need to add those and get the package back out of
purgatory).

That said, in the end you will want a LOT of validation material so this
is very much just a starting point. But still, it?s your starting point, in
R at least. And truthfully I much prefer R/Bioconductor idioms to (say)
pysam or the like. htsnim is nice but then you?ll be implementing the ML
bits from scratch, so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it
will become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com<mailto:
tim.triche at gmail.com>> wrote:

?
a few thoughts:

1) look into Shearwater (
https://bioconductor.org/packages/release/bioc/html/deepSNV.html<
https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0>),
then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @
Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that
most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR
(lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a
30x-100x genome (or even 1000x FM1) and are trying to compete with existing
standard approaches that can detect molecules at 1e-6, it'll be rough.  An
alternative approach (that has been used repeatedly) is to throw caution to
the wind, generate primers for numerous subject-specific somatic variants,
and use the ensemble to try and model MRD (speaking of ML). On the one
hand, that could give the clinic a "customer for life"; on the other hand,
it's not conducive to large-scale automation & deployment. As far as I
know, it never got much traction, in leukemia or anywhere else.  (Consider
that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million
cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads,
please talk to the people in #2 above; the reason most people won't go
below 5% VAF is that you get thwacked by error rates at that level, and the
reason most NGS-based MRD is based on UMIs is that existing PCR-based
methods have 6 logs sensitivity.)

--t


On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl<mailto:
r.mulder01 at umcg.nl>> wrote:
Hi,


I was wondering if anyone could help me with a script and support for the
above mentioned goal.

For this I have several BAM files for which I want to determine de
nucleotide count per region of interest. The latter could be several
hotspot mutation sites. I would like to get an overall overview of all the
BAM files. Next I want to use these counts to determine for any new BAM
file if the count for a particular genomic position is higher than the
allowable range, hence could indicate if a mutation is present. For this I
would like to use the modified Thompson Tau test. I think machine learning
could be used for this. So, why do I want to do all this? Well, normal NGS
pipelines only deal with variants at a frequency of 5%. Mutatios below this
frequency are often missed. To know if a mutation is present below this
level, you showed dive into the alignment and most often manually
investigate the base calls. I know that this races some questions regarding
call qualities, but then again our conventional assays have actually
confirmed some of these low mutations. In addition, NGS can
 be used to determine the presence of low frequent mutation which is of
great importance for determining the measurable residual disease after
treatment.


I am new to r and bioconductor so I would be very thankful if someone
could help me in my mission to setting up a script for this purpose.


Thanks,


Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<
https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de
geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken
van dit bericht, het niet openbaar maken of op enige wijze verspreiden of
vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een
incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the
eyes of the addressee(s). Others than the addressee(s) are not allowed to
use this message, to make it public or to distribute or multiply this
message in any way. The UMCG cannot be held responsible for incomplete
reception or delay of this transferred message.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de
geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken
van dit bericht, het niet openbaar maken of op enige wijze verspreiden of
vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een
incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the
eyes of the addressee(s). Others than the addressee(s) are not allowed to
use this message, to make it public or to distribute or multiply this
message in any way. The UMCG cannot be held responsible for incomplete
reception or delay of this transferred message.

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

The information in this e-mail is intended only for the ...{{dropped:18}}

Mulder, R

Sun, Mar 8, 2020 5:52 AM #

Dear Vincent,


Maybe we can move the discussion retrospectively? At the moment it seems we are tweaking a few things of which the for last obstacle good be tackled thanks to you. The only remaining issue is that I cant get results for my genes of interest. In stead I keep getting the same results over and over again. Would it be possible to help me a bit further until it works?In addition, I think incoporating the solution and later on posting it would be very interesting too.


Kind regards,


Rene

________________________________
Van: Vincent Carey <stvjc at channing.harvard.edu>
Verzonden: zondag 8 maart 2020 12:00:41
Aan: Mulder, R
CC: Tim Triche, Jr.; bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

This has been a very interesting and illuminating dialogue but it really should be moved to
support.bioconductor.org<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsupport.bioconductor.org%2F&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564238716&sdata=uRk3R%2Bs38f52BZQoZWYQokcv%2BOfyksDOwJf9vXwRCDM%3D&reserved=0> so that a record is available to the general user community.

The solution here

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

is BiocManager::install("Homo.sapiens")

On Sun, Mar 8, 2020 at 6:22 AM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>> wrote:

Dear Tim,


Thanks again for your (quick) reply.


I just ran it on my computer and it gave me some results but also several errors.


The first part went perfect

BamFileList of length 2
names(2): here were my 2 files named without bam extension

However the second part did give results but also several errors

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

Error in genes(Homo.sapiens) : could not find function "genes"

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Error: object 'gxs' not found

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

+                        include_insertions=TRUE, min_minor_allele_depth=1)

+     positions <- subset(x, duplicated(pos))$pos
+     subset(x, pos %in% positions)
+ }

seqnames      pos strand nucleotide count
myfile1        1 36931671      +          A   251
myfile1         1 36931671      +          G     2
myfile2       1 36931673      +          -     1
myfile2               1 36931673      +          A   253


Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran it again, but the errors at the next operations remained.

Error in genes(Homo.sapiens) : could not find function "genes"

Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should I look up how this genome is called?

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Where can I find this function?

Error: object 'gxs' not found

Does this have to do with the fact that IGF2 is not included in my panel?

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

As a result of the previous error, right?

Interestingly enough, I did get results. Why?



________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>>
Verzonden: zaterdag 7 maart 2020 22:28
Aan: Mulder, R
CC: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1


Or whatever you like (GAlignments would get you the reads, GAlignmentPairs the read pairs, and so on)



--t

On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl><mailto:r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>>> wrote:

Dear Tim,


I installed Rsamtools and ran it on a bamfile using the following command to get nucleotide count for 2 hotspot regions.


library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, 5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com><mailto:tim.triche at gmail.com<mailto:tim.triche at gmail.com>>>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org><mailto:bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Oh hey, one last thing ? if all you want is to get nucleotide counts per region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. It sounds awkward but in practice it is super fast and will give you all the nucleotide and read level information you could want. One of my interns implemented this for mitochondrial variant calling in MTseeker when we got sick of using gmapR and being flagged for errors on not-Linux. (We gutted the entire package recently and have new, insanely deep examples from Oxford Nanopore direct RNA sequencing and from large single cell datasets; I need to add those and get the package back out of purgatory).

That said, in the end you will want a LOT of validation material so this is very much just a starting point. But still, it?s your starting point, in R at least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the like. htsnim is nice but then you?ll be implementing the ML bits from scratch, so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it will become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com><mailto:tim.triche at gmail.com<mailto:tim.triche at gmail.com>>> wrote:

?
a few thoughts:

1) look into Shearwater (https://bioconductor.org/packages/release/bioc/html/deepSNV.html<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564238716&sdata=c8b%2FREMotc72%2BZt9XENI7nxiP1Ww0xQymAqX524XJoY%3D&reserved=0><https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564248707&sdata=y%2BxXbPrxep4RNyrHK9IJ6bnxMNKS9HqPufIAu3bkl7I%3D&reserved=0>>), then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x genome (or even 1000x FM1) and are trying to compete with existing standard approaches that can detect molecules at 1e-6, it'll be rough.  An alternative approach (that has been used repeatedly) is to throw caution to the wind, generate primers for numerous subject-specific somatic variants, and use the ensemble to try and model MRD (speaking of ML). On the one hand, that could give the clinic a "customer for life"; on the other hand, it's not conducive to large-scale automation & deployment. As far as I know, it never got much traction, in leukemia or anywhere else.  (Consider that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please talk to the people in #2 above; the reason most people won't go below 5% VAF is that you get thwacked by error rates at that level, and the reason most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs sensitivity.)

--t

On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl><mailto:r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>>> wrote:

Hi,

I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can
be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.

I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.

Thanks,

Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org>> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564248707&sdata=QdAVedMFqBqJP%2FZPcDR6Vuicw%2BDvqHC87BKsswUeERA%3D&reserved=0><https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564258701&sdata=hHNQn2AMqEkr2LtrD2G3lEmEQKc%2BfHSlb5ydAa%2FdWuo%3D&reserved=0>>
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564258701&sdata=hHNQn2AMqEkr2LtrD2G3lEmEQKc%2BfHSlb5ydAa%2FdWuo%3D&reserved=0>

The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.partners.org%2Fcomplianceline&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564268692&sdata=IBbiS%2Bc2PiRWUHVRCbFgGgC41T729Pi2HOw5Sn3guPQ%3D&reserved=0> . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.

Tim Triche, Jr.

Sun, Mar 8, 2020 9:15 AM #

Absolutely true. Even better would be a workflow which is what I?ve started trying to ?groom? students/people appealing for help into ;-)

Is there a recommended way to segue this type of discussion to support.bioconductor.org without losing context?  I used to get emails from the site when I was mentioned, now it appears that I have drifted off into irrelevance ;-)

Ps. Thanks to you and Aaron for the pointers to OnClass for ontology aware cell identification, I should start a topic on that too ? I had forgotten how horrible python configuration issues can be. It may still be worth retro fitting that machinery onto SingleR! But for the moment in the kids? single cell data we are testing out OnClass against SingleR and both have strengths. A comparison with CITE-seq will really help ? I?m going to try and convince Marco?s group to do that for the acid test :-). 

Thanks as always!

--t

On Mar 8, 2020, at 7:00 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:

?
This has been a very interesting and illuminating dialogue but it really should be moved to 
support.bioconductor.org so that a record is available to the general user community.

The solution here

library(Homo.sapiens)

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

is BiocManager::install("Homo.sapiens")

On Sun, Mar 8, 2020 at 6:22 AM Mulder, R <r.mulder01 at umcg.nl> wrote:
Dear Tim,


Thanks again for your (quick) reply.


I just ran it on my computer and it gave me some results but also several errors.


The first part went perfect

BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)

BamFileList of length 2
names(2): here were my 2 files named without bam extension

However the second part did give results but also several errors

library(Homo.sapiens)

Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

gxs <- genes(Homo.sapiens)

Error in genes(Homo.sapiens) : could not find function "genes"

names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

Error: object 'gxs' not found

sbparam <- ScanBamParam(which=SomeGenes)

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,

+                        include_insertions=TRUE, min_minor_allele_depth=1)

piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {

+     positions <- subset(x, duplicated(pos))$pos
+     subset(x, pos %in% positions)
+ }

show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))

              seqnames      pos strand nucleotide count
myfile1        1 36931671      +          A   251
myfile1         1 36931671      +          G     2
myfile2       1 36931673      +          -     1
myfile2               1 36931673      +          A   253


Error in library(Homo.sapiens) :
  there is no package called ?Homo.sapiens?

For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran it again, but the errors at the next operations remained.

gxs <- genes(Homo.sapiens)

Error in genes(Homo.sapiens) : could not find function "genes"

Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should I look up how this genome is called?

names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Where can I find this function?

SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

Error: object 'gxs' not found

Does this have to do with the fact that IGF2 is not included in my panel?

sbparam <- ScanBamParam(which=SomeGenes)

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

As a result of the previous error, right?

Interestingly enough, I did get results. Why?

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com>
Verzonden: zaterdag 7 maart 2020 22:28
Aan: Mulder, R
CC: bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1


Or whatever you like (GAlignments would get you the reads, GAlignmentPairs the read pairs, and so on)



--t


On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>> wrote:

Dear Tim,


I installed Rsamtools and ran it on a bamfile using the following command to get nucleotide count for 2 hotspot regions.


library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, 5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

________________________________
Van: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Oh hey, one last thing ? if all you want is to get nucleotide counts per region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. It sounds awkward but in practice it is super fast and will give you all the nucleotide and read level information you could want. One of my interns implemented this for mitochondrial variant calling in MTseeker when we got sick of using gmapR and being flagged for errors on not-Linux. (We gutted the entire package recently and have new, insanely deep examples from Oxford Nanopore direct RNA sequencing and from large single cell datasets; I need to add those and get the package back out of purgatory).

That said, in the end you will want a LOT of validation material so this is very much just a starting point. But still, it?s your starting point, in R at least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the like. htsnim is nice but then you?ll be implementing the ML bits from scratch, so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it will become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>> wrote:

?
a few thoughts:

1) look into Shearwater (https://bioconductor.org/packages/release/bioc/html/deepSNV.html<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0>), then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x genome (or even 1000x FM1) and are trying to compete with existing standard approaches that can detect molecules at 1e-6, it'll be rough.  An alternative approach (that has been used repeatedly) is to throw caution to the wind, generate primers for numerous subject-specific somatic variants, and use the ensemble to try and model MRD (speaking of ML). On the one hand, that could give the clinic a "customer for life"; on the other hand, it's not conducive to large-scale automation & deployment. As far as I know, it never got much traction, in leukemia or anywhere else.  (Consider that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please talk to the people in #2 above; the reason most people won't go below 5% VAF is that you get thwacked by error rates at that level, and the reason most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs sensitivity.)

--t


On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl<mailto:r.mulder01 at umcg.nl>> wrote:
Hi,


I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can
 be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.


I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.


Thanks,


Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0>
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van dit bericht, het niet openbaar maken of op enige wijze verspreiden of vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of the addressee(s). Others than the addressee(s) are not allowed to use this message, to make it public or to distribute or multiply this message in any way. The UMCG cannot be held responsible for incomplete reception or delay of this transferred message.

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Tim Triche, Jr.

Sun, Mar 8, 2020 9:50 AM #

Hard to say why you got results if the ?Homo.sapiens? organismDb package was not installed!  And a tiny bit disturbing. You will want to use BiocManager::install(?Homo.saliens?) to ensure the bits are in place (and if you don?t have BiocManager installed, you?ll want to fix that using install.packages(?BiocManager?), which will likely be the last time you use install.packages). 

Anyways, Vince is right, this should be a support thread or perhaps a workflow ? Erin, Andy, Wing Wong, Todd &c just put out a new paper on NGS-based MRD in AML which could serve as a basis for such a workflow:

https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-020-0671-8

I highly recommend reading it. 

--t

On Mar 8, 2020, at 6:21 AM, Mulder, R <r.mulder01 at umcg.nl> wrote:

?
Dear Tim,


Thanks again for your (quick) reply.


I just ran it on my computer and it gave me some results but also several errors. 


The first part went perfect

BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)

BamFileList of length 2
names(2): here were my 2 files named without bam extension

However the second part did give results but also several errors

library(Homo.sapiens)

Error in library(Homo.sapiens) : 
  there is no package called ?Homo.sapiens?

gxs <- genes(Homo.sapiens)

Error in genes(Homo.sapiens) : could not find function "genes"

names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") : 
  could not find function "mapIds"

SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

Error: object 'gxs' not found

sbparam <- ScanBamParam(which=SomeGenes)

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,

+                        include_insertions=TRUE, min_minor_allele_depth=1)

piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {

+     positions <- subset(x, duplicated(pos))$pos
+     subset(x, pos %in% positions)
+ }

show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))

              seqnames      pos strand nucleotide count
myfile1        1 36931671      +          A   251
myfile1         1 36931671      +          G     2
myfile2       1 36931673      +          -     1
myfile2               1 36931673      +          A   253


Error in library(Homo.sapiens) : 
  there is no package called ?Homo.sapiens?

For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran it again, but the errors at the next operations remained.

gxs <- genes(Homo.sapiens)

Error in genes(Homo.sapiens) : could not find function "genes"

Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should I look up how this genome is called?

names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")

Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") : 
  could not find function "mapIds"

Where can I find this function?

SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

Error: object 'gxs' not found

Does this have to do with the fact that IGF2 is not included in my panel?

sbparam <- ScanBamParam(which=SomeGenes)

Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

As a result of the previous error, right?

Interestingly enough, I did get results. Why?




Van: Tim Triche, Jr. <tim.triche at gmail.com>
Verzonden: zaterdag 7 maart 2020 22:28
Aan: Mulder, R
CC: bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r
 
something like this perhaps? 

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1


Or whatever you like (GAlignments would get you the reads, GAlignmentPairs the read pairs, and so on) 



--t

On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 at umcg.nl> wrote:
Dear Tim,



I installed Rsamtools and ran it on a bamfile using the following command to get nucleotide count for 2 hotspot regions. 



library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, 5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more region.

Do you perhaps have ideas on how I could do this?

Best,

Rene
Van: Tim Triche, Jr. <tim.triche at gmail.com>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: bioc-devel at r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r
 
Oh hey, one last thing ? if all you want is to get nucleotide counts per region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. It sounds awkward but in practice it is super fast and will give you all the nucleotide and read level information you could want. One of my interns implemented this for mitochondrial variant calling in MTseeker when we got sick of using gmapR and being flagged for errors on not-Linux. (We gutted the entire package recently and have new, insanely deep examples from Oxford Nanopore direct RNA sequencing and from large single cell datasets; I need to add those and get the package back out of purgatory). 

That said, in the end you will want a LOT of validation material so this is very much just a starting point. But still, it?s your starting point, in R at least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the like. htsnim is nice but then you?ll be implementing the ML bits from scratch, so I think your instincts to try R first are sensible. 

Good luck! Even if you use this for something else besides MRD, I think it will become a useful exercise.  

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

?
a few thoughts: 

1) look into Shearwater (https://bioconductor.org/packages/release/bioc/html/deepSNV.html), then 

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x genome (or even 1000x FM1) and are trying to compete with existing standard approaches that can detect molecules at 1e-6, it'll be rough.  An alternative approach (that has been used repeatedly) is to throw caution to the wind, generate primers for numerous subject-specific somatic variants, and use the ensemble to try and model MRD (speaking of ML). On the one hand, that could give the clinic a "customer for life"; on the other hand, it's not conducive to large-scale automation & deployment. As far as I know, it never got much traction, in leukemia or anywhere else.  (Consider that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please talk to the people in #2 above; the reason most people won't go below 5% VAF is that you get thwacked by error rates at that level, and the reason most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs sensitivity.)

--t

On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 at umcg.nl> wrote:
Hi,

I was wondering if anyone could help me with a script and support for the above mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide count per region of interest. The latter could be several hotspot mutation sites. I would like to get an overall overview of all the BAM files. Next I want to use these counts to determine for any new BAM file if the count for a particular genomic position is higher than the allowable range, hence could indicate if a mutation is present. For this I would like to use the modified Thompson Tau test. I think machine learning could be used for this. So, why do I want to do all this? Well, normal NGS pipelines only deal with variants at a frequency of 5%. Mutatios below this frequency are often missed. To know if a mutation is present below this level, you showed dive into the alignment and most often manually investigate the base calls. I know that this races some questions regarding call qualities, but then again our conventional assays have actually confirmed some of these low mutations. In addition, NGS can
be used to determine the presence of low frequent mutation which is of great importance for determining the measurable residual disease after treatment.

I am new to r and bioconductor so I would be very thankful if someone could help me in my mission to setting up a script for this purpose.

Thanks,

Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel