graph的数据结构 nodes, which are labeled by sequences and ids edges, which connect two nodes via either of their respective ends paths, describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges 说中文就是构图,将linear genome转变为grpah genome的形式 vg也同时支持多种输入格式来构建variation graph vg直接操作gfa的前提是自动转变为PackedGraph来进行操作(暗箱) vg默认使用1.1版本的gfa,GFA v1.1 需要以提前joint calling得到的variation information作为index来输入,作为graph的基础(fa + vcf) vcf需要index 将任意格式的graph转为gfa格式 W-lines和P-lines代表了使用不同的字母表示paths,若需要使用P来代表paths则使用 P-lines对应gfa 1.0版本 基于kmer的算法将graph分chunk,构建index;且在mapping过程中,所使用的kmer size一定需要小于构建graph index的kmer size graph kmer size和mapping kmer size如何影响比对的准确率? 需要锻炼自己的测试数据的能力 When mapping, any kmer size shorter than that used in the index can be employed, and by default the mapper will decrease the kmer size to increase sensitivity when alignment at a particular k fails. 当前对SV得准确率更高 Operating on 1) 2) 只针对graph中的variants进行calling,Background
vg|Install
git clone --recursive https://github.com/vgteam/vg.git
cd vg
# MacOS系统下安装对应的dependencies——使用bundle
# Install all the dependencies in the Brewfile
brew bundle
. ./source_me.sh && make
vg|variation graph construction
PackedGraph (.vg)
:vg's
native format,支持各类操作(topology和paths内容的修改),但是在大样本时可能效率低下GFA (.gfa)
:graph sequence的通用结果(中枢结构),text形式,方便在各个pangenome软件中作为中转
vg
can also operate on (uncompressed) GFA files directly, by way of using a PackedGraph
representation in memory (and therefore shares that format's scaling concerns and edit-ability).GBZ (.gbz)
:highly-compressed graph format(不允许任何操作)
1)
vg construct
|from vcf
tabix -p vcf x.vcf.gz
vg construct -r small/x.fa -v small/x.vcf.gz >x.vg
-r
,--reference
-v
,--vcf
2)
vg construct
|from assembly
-)
vg autoindex
|build graph & index-)
vg convert -g
-)
vg convert -f
vg convert -fW
vg|mapping
1)
vg giraffe
vg autoindex
)# 0. simulate short reads
# simulate a bunch of 150bp reads from the graph, into a GAM file of reads aligned to a graph
vg sim -n 1000 -l 150 -x x.giraffe.gbz -a > x.sim.gam
# 1. construct the graph and indexes (paths below assume running from `vg/test` directory)
vg autoindex --workflow giraffe -r small/x.fa -v small/x.vcf.gz -p x
[vg autoindex] Executing command: vg autoindex --workflow giraffe -r small/x.fa -v small/x.vcf.gz -p x
[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Constructing VG graph from FASTA and VCF input.
[IndexRegistry]: Constructing GBWT from VG graph and phased VCF input.
[IndexRegistry]: Stripping allele paths from VG.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Downsampling full GBWT.
[IndexRegistry]: Not enough haplotypes; augmenting the full GBWT instead.
[IndexRegistry]: Constructing GBZ.
[IndexRegistry]: Constructing distance index for Giraffe.
[IndexRegistry]: Constructing minimizer index.
# 2. remap: using **gam** as the input for the giraffe
# Notes:
# - 这一部分的逻辑可以理解为基于linear得到的BAM分析得到了对应的VCF——完成graph的构建
# - BAM转为GAM,比对到graph上再重新得到BAM
# now re-map these reads against the graph, and get BAM output in linear space
# FASTQ input uses -f instead of -G.
vg giraffe -Z x.giraffe.gbz -G x.sim.gam -o BAM > aln.bam
# 需要得index文件:x.min,x.dist
Guessing that x.min is Minimizers
Guessing that x.dist is Giraffe Distance Index
warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.
vg sim
:基于graph来simulate reads
--num-reads
;-l --read-length
;-x --xg-name
,输入的graph;-a --align-out
,输出为gam format的文件vg giraffe
:graph-based short reads mapping
-o
,根据后缀名输出结果比对文件-f
而不是-G
2)
vg map
# 1. construct the graph (paths below assume running from `vg/test` directory)
# Notes: 注意此处生成的是x.vg
vg construct -r small/x.fa -v small/x.vcf.gz > x.vg
# 2. store the graph in the xg/gcsa index pair
# Notes:
# - 使用k=16来构建index
# - x.gcsa,即输出`.gcsa`类型的index
# - 最终得到的是 xg/gcsa格式的index pairs
vg index -x x.xg -g x.gcsa -k 16 x.vg
# 3.1 align a read to the indexed version of the graph
# note that the graph file is not opened, but x.vg.index is assumed
vg map -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -x x.xg -g x.gcsa > read.gam
# 3.2 simulate a bunch of 150bp reads from the graph, one per line
vg sim -n 1000 -l 150 -x x.xg > x.sim.txt
# now map these reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam
# 4.1 surject the alignments back into the reference space of sequence "x", yielding a BAM file
# Notes: 将vg map生成的gam文件重新比对到graph中的reference path上,生成得到BAM文件
vg surject -x x.xg -b aln.gam > aln.bam
# 4.2 surject them to BAM in the call to map
# Notes: 在使用vg map比对的时候就可以直接生成bam文件
vg sim -n 1000 -l 150 -x x.xg > x.sim.txt
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam
vg index
--xg-name
:use this file to store a succinct, queryable version of the graph(s), or read for GCSA or distance indexing--gcsa-out
:output a GCSA2 index to the given filevg sim
vg map
--reads
:take reads (one per line) from FILE, write alignments to stdoutvg|variant calling
1)calling variants with read support
.vg
or '.gfa' uses the most memory and is not recommended for large graphs.pack
:获得gam文件中的reads depth信息
-x
,代表输入的graph-Q 5
代表过滤mapping quality小于5的read mapping结果call
-k
,--pack
:supports created from vg pack for given input graph(reads support文件的输入flag)# Compute the read support from the gam
# -Q 5: ignore mapping and base qualitiy < 5
vg pack -x x.xg -g aln.gam -Q 5 -o aln.pack
# Generate a VCF from the support.
vg call x.xg -k aln.pack > graph_calls.vcf
vg call x.xg -k aln.pack -a > snarl_genotypes.vcf
Hi,这里是有朴的第二大脑。
很高兴与你相遇
很高兴与你相遇