Extract sample data from VCF files

时间秒杀一切 提交于 2019-12-05 00:41:20

In VariantAnnotation use a ScanVcfParam to specify the data that you'd like to extract. Using the sample VCF file included with the package

library(VariantAnnotation)
vcfFile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

discover information about the file

scanVcfHeader(vcfFile)
## class: VCFHeader 
## samples(5): HG00096 HG00097 HG00099 HG00100 HG00101
## meta(1): fileformat
## fixed(0):
## info(22): LDAF AVGPOST ... VT SNPSOURCE
## geno(3): GT DS GL

Formulate a request for the "LDAF", "AVGPOST" info fields, "GT" genotype field for samples "HG00097", "HG00101" for variants on chromosome 22 between coordinates 50300000, 50400000

param = ScanVcfParam(
    info=c("LDAF", "AVGPOST"),
    geno="GT",
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))

Read the requested data

vcf = readVcf(vcfFile, "hg19", param=param)

and extract from VCF the relevant data

head(geno(vcf)[["GT"]])
##             HG00097 HG00101
## rs7410291   "0|0"   "0|0"  
## rs147922003 "0|0"   "0|0"  
## rs114143073 "0|0"   "0|0"  
## rs141778433 "0|0"   "0|0"  
## rs182170314 "0|0"   "0|0"  
## rs115145310 "0|0"   "0|0"  
head(info(vcf)[["LDAF"]])
## [1] 0.3431 0.0091 0.0098 0.0062 0.0041 0.0117
ranges(vcf)
## IRanges of length 1169
##           start      end width             names
## [1]    50300078 50300078     1         rs7410291
## [2]    50300086 50300086     1       rs147922003
## [3]    50300101 50300101     1       rs114143073
## [4]    50300113 50300113     1       rs141778433
## [5]    50300166 50300166     1       rs182170314
## ...         ...      ...   ...               ...
## [1165] 50364310 50364312     3 22:50364310_GCA/G
## [1166] 50364311 50364313     3 22:50364311_CAT/C
## [1167] 50364464 50364464     1       rs150069372
## [1168] 50364465 50364465     1       rs146661152
## [1169] 50364609 50364609     1       rs184235324

Maybe you're only interested in genotype element "GS" as a simple R matrix, then just specify the samples and / or ranges you're interested in and use readGeno (or readGT or readInfo for similar specialized queries).

There is extensive documentation in the VariantAnnotation vignettes and reference manual; see also ?ScanVcfParam; example(ScanVcfParam).

I met the same situation. The following link may provide possible solution: http://biocluster.ucr.edu/~rsun/workshop/SNPINDEL/variantsAnalysis.pdf In the page describing "Read VCF file", they do not use the dataset in some package, but the actual path for a local gz file:

library(VariantAnnotation)

library(GenomicFeatures)

Please make sure exact data.zip under your current work directory

vcf <- readVcf("data/var.raw.vcf.gz","TAIR10")

vcf

TAIR10 is the reference genome

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!