Chip-seq peak annontation
Chip-seq peak annontation
PeRl
narrowPeak/boardPeak
narrowPeak/boardPeak 是ENCODE可提供下载的两种 Chip-seq 经过参考人类基因组mapping后的关于peak的数据.
其他类型的seq数据储存个数可以参看FAQformat
narrowPeak
数据按照以下规则储存:
1. string chrom: "Reference sequence chromosome or scaffold" 2. uint chromStart: "Start position in chromosome" 3. uint chromEnd: "End position in chromosome" 4. string name: "Name given to a region (preferably unique). Use . if no name is assigned" 5. uint score: "Indicates how dark the peak will be displayed in the browser (0-1000) " 6. char[1] strand: "+ or - or . for unknown" 7. float signalValue: "Measurement of average enrichment for the region" 8. float pValue: "Statistical significance of signal value (-log10). Set to -1 if not used." 9. float qValue: "Statistical significance with multiple-test correction applied (FDR -log10). Set to -1 if n-ot used." 10. int peak: "Point-source called for this peak; 0-based offset from chromStart. Set to -1 if no point-sour-ce called."
boardPeak
数据按照以下规则储存:
1. string chrom: "Reference sequence chromosome or scaffold" 2. uint chromStart: "Start position in chromosome" 3. uint chromEnd: "End position in chromosome" 4. string name: "Name given to a region (preferably unique). Use . if no name is assigned" 5. uint score: "Indicates how dark the peak will be displayed in the browser (0-1000) " 6. char[1] strand: "+ or - or . for unknown" 7. float signalValue: "Measurement of average enrichment for the region" 8. float pValue: "Statistical significance of signal value (-log10). Set to -1 if<BR> not used." 9. float qValue: "Statistical significance with multiple-test correction applied (FDR -log10). Set to -1 if n-ot used."
在接下去的peak annotation中,只演示narrowPeak.bed格式数据.
示例数据
演示数据来自ENCODE 的 H3K9me3 的Chip-seq,样本ID为ENCFF199BLM.
样本的基本信息:
- Homo sapiens liver male adult (32 years)
- Target: H3K9me3
- Lab: Bing Ren, UCSD
- Project: Roadmap
narrowPeak 数据基本信息:
## seqnames start end name score strand signalValue pValue ## 1 chr10 100134639 100134831 Peak_7974 178 . 4.41261 6.68330 ## 2 chr10 100446376 100446664 Peak_4189 244 . 5.13248 8.41215 ## 3 chr10 100779568 100779699 Peak_32369 101 . 3.34436 4.50936 ## 4 chr10 10088147 10088346 Peak_28509 112 . 4.05830 4.84890 ## 5 chr10 101149252 101149594 Peak_6146 211 . 4.89252 7.53305 ## 6 chr10 101173156 101173424 Peak_32985 94 . 3.45278 4.33257 ## qValue peak ## 1 2.96490 132 ## 2 4.37217 165 ## 3 1.47374 105 ## 4 1.69566 90 ## 5 3.67439 174 ## 6 1.30993 237
构建注释文件
在peak annotation中需要一个用于参考的注释文件,需要包括一下信息: 1. 染色体名 2. 转录本起始位点 3. 转录本终止位点 4. gene的名字 5. 正反链
注释文件可以直接从外部导入,也可以利用biomaRt 生成
library(ChIPpeakAnno) library(biomaRt) mart <- useMart("ensembl") datasets <- listDatasets(mart) mart <- useDataset("hsapiens_gene_ensembl",mart) # 需要筛选的特征 props <- c("ensembl_gene_id", "external_gene_name", "transcript_biotype", "chromosome_name", "start_position", "end_position", "strand") # 筛选染色体号 lincRNA <- subset( getBM(attributes=props, mart=mart, filters = "chromosome_name", values = c(1:22,"X","Y")), transcript_biotype == "lincRNA" ) # 在染色体前加"chr"保持和narrowPeak数据一致 lincRNA[,4] <- paste0("chr", lincRNA[,4])
得到的数据包含以下信息:
## ensembl_gene_id external_gene_name transcript_biotype chromosome_name ## 1 ENSG00000276255 RP5-881P19.7 lincRNA chr1 ## 2 ENSG00000234277 LINC01641 lincRNA chr1 ## 3 ENSG00000238107 RP11-495P10.5 lincRNA chr1 ## 4 ENSG00000274020 LINC01138 lincRNA chr1 ## 5 ENSG00000225620 RP11-569A11.2 lincRNA chr1 ## 6 ENSG00000237520 RP11-443B7.2 lincRNA chr1 ## start_position end_position strand ## 1 228073909 228076550 -1 ## 2 227393591 227431035 1 ## 3 148295180 148297556 1 ## 4 148290889 148519604 -1 ## 5 202632428 202632911 1 ## 6 234957231 234959989 1
进行peak annotation
有了这个注释文件以后,我们就可以根据我们想要的筛选规则对peak进行注释,主要用到的包是 ChIPpeakAnno.用到的函数为annotatePeakInBatch
.
# 将前面得到的注释文件转换为RangedData对象 library(ChIPpeakAnno) myCustomAnno <- RangedData( IRanges( start=lincRNA[,"start_position"], end=lincRNA[,"end_position"], names=lincRNA[,"ensembl_gene_id"]), space=lincRNA[,"chromosome_name"], strand=lincRNA[,"strand"]) # 读入需要注释的narrowPeak数据 bed_file <- read.table("ENCFF199BLM.bed", header = T, sep = "\t", stringsAsFactors = F) # 将peak数据转换为GRanges peaks <- toGRanges(bed_file, format="narrowPeak", colNames = colnames(bed_file)) # 根据需要进行筛选 anno <- annotatePeakInBatch(peaks, AnnotationData=myCustomAnno, output="overlapping", FeatureLocForDistance="TSS", bindingRegion=c(-2000, 2000))
从anno
中提取我们需要的lincRNA的ID
result_lincRNA <- anno@elementMetadata$feature head(result_lincRNA)
## [1] "ENSG00000237579" "ENSG00000235180" "ENSG00000232259" "ENSG00000204365" ## [5] "ENSG00000226578" "ENSG00000260137"
来源:http://www.cnblogs.com/wwdPeRl/p/8541362.html