「Pangenome tutorial」vg｜utilizing pangenome graph with variation graph

2023-11-13, 1499 words, 8 min read

Background

graph的数据结构
- nodes, which are labeled by sequences and ids
- edges, which connect two nodes via either of their respective ends
- paths, describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges
  - Paths provide coordinate systems relative to genomes encoded in the graph, allowing stable mappings to be produced even if the structure of the graph is changed.

vg｜Install

git clone --recursive https://github.com/vgteam/vg.git
cd vg

# MacOS系统下安装对应的dependencies——使用bundle
# Install all the dependencies in the Brewfile
brew bundle

. ./source_me.sh && make

vg｜variation graph construction

说中文就是构图，将linear genome转变为grpah genome的形式
vg也同时支持多种输入格式来构建variation graph
- PackedGraph (.vg)：vg's native format，支持各类操作（topology和paths内容的修改），但是在大样本时可能效率低下
- GFA (.gfa)：graph sequence的通用结果（中枢结构），text形式，方便在各个pangenome软件中作为中转
  - vg直接操作gfa的前提是自动转变为PackedGraph来进行操作（暗箱）
    
    vg can also operate on (uncompressed) GFA files directly, by way of using a PackedGraph representation in memory (and therefore shares that format's scaling concerns and edit-ability).
- GBZ (.gbz)：highly-compressed graph format（不允许任何操作）
  - GBZ格式将reference graph和haplotype graph给区分，比如HPRC中hg38和t2t是作为reference graph，而剩余assembly haplotype则是作为haplotype graph
vg默认使用1.1版本的gfa，GFA v1.1

1）`vg construct`｜from vcf

需要以提前joint calling得到的variation information作为index来输入，作为graph的基础（fa + vcf）
vcf需要index
```
tabix -p vcf x.vcf.gz
```

vg construct -r small/x.fa -v small/x.vcf.gz >x.vg

-r，--reference
-v，--vcf

2）`vg construct`｜from assembly

比如使用Minigraph-Cactus构建得到的gfa：github.com/ComparativeGe...

-）`vg autoindex`｜build graph & index

-）`vg convert -g`

将ODGI和PGGB的grpah形式转变为vg格式

-）`vg convert -f`

将任意格式的graph转为gfa格式
W-lines和P-lines代表了使用不同的字母表示paths，若需要使用P来代表paths则使用vg convert -fW

P-lines对应gfa 1.0版本

vg｜mapping

1）`vg giraffe`

fast for highly accurate short reads, against graphs with haplotype information.
在进行giraffe mapping之前，首先需要构建graph的index（使用vg autoindex）

# 0. simulate short reads
# simulate a bunch of 150bp reads from the graph, into a GAM file of reads aligned to a graph
vg sim -n 1000 -l 150 -x x.giraffe.gbz -a > x.sim.gam


# 1. construct the graph and indexes (paths below assume running from `vg/test` directory)
vg autoindex --workflow giraffe -r small/x.fa -v small/x.vcf.gz -p x

[vg autoindex] Executing command: vg autoindex --workflow giraffe -r small/x.fa -v small/x.vcf.gz -p x
[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Constructing VG graph from FASTA and VCF input.
[IndexRegistry]: Constructing GBWT from VG graph and phased VCF input.
[IndexRegistry]: Stripping allele paths from VG.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Downsampling full GBWT.
[IndexRegistry]: Not enough haplotypes; augmenting the full GBWT instead.
[IndexRegistry]: Constructing GBZ.
[IndexRegistry]: Constructing distance index for Giraffe.
[IndexRegistry]: Constructing minimizer index.


# 2. remap: using **gam** as the input for the giraffe
# Notes: 
# - 这一部分的逻辑可以理解为基于linear得到的BAM分析得到了对应的VCF——完成graph的构建
# - BAM转为GAM，比对到graph上再重新得到BAM
# now re-map these reads against the graph, and get BAM output in linear space
# FASTQ input uses -f instead of -G.
vg giraffe -Z x.giraffe.gbz -G x.sim.gam -o BAM > aln.bam

# 需要得index文件：x.min，x.dist
Guessing that x.min is Minimizers
Guessing that x.dist is Giraffe Distance Index
warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.

vg sim：基于graph来simulate reads
- -n --num-reads；-l --read-length；-x --xg-name，输入的graph；-a --align-out，输出为gam format的文件
vg giraffe：graph-based short reads mapping
- -o，根据后缀名输出结果比对文件
- 如果以fastq作为输入，则需要使用-f而不是-G

2）`vg map`

基于kmer的算法将graph分chunk，构建index；且在mapping过程中，所使用的kmer size一定需要小于构建graph index的kmer size

graph kmer size和mapping kmer size如何影响比对的准确率？

需要锻炼自己的测试数据的能力

When mapping, any kmer size shorter than that used in the index can be employed, and by default the mapper will decrease the kmer size to increase sensitivity when alignment at a particular k fails.

# 1. construct the graph (paths below assume running from `vg/test` directory)
# Notes: 注意此处生成的是x.vg
vg construct -r small/x.fa -v small/x.vcf.gz > x.vg

# 2. store the graph in the xg/gcsa index pair
# Notes: 
# - 使用k=16来构建index
# - x.gcsa，即输出`.gcsa`类型的index
# - 最终得到的是 xg/gcsa格式的index pairs
vg index -x x.xg -g x.gcsa -k 16 x.vg

# 3.1 align a read to the indexed version of the graph
# note that the graph file is not opened, but x.vg.index is assumed
vg map -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -x x.xg -g x.gcsa > read.gam

# 3.2 simulate a bunch of 150bp reads from the graph, one per line
vg sim -n 1000 -l 150 -x x.xg > x.sim.txt
# now map these reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam

# 4.1 surject the alignments back into the reference space of sequence "x", yielding a BAM file
# Notes: 将vg map生成的gam文件重新比对到graph中的reference path上，生成得到BAM文件
vg surject -x x.xg -b aln.gam > aln.bam

# 4.2 surject them to BAM in the call to map
# Notes: 在使用vg map比对的时候就可以直接生成bam文件
vg sim -n 1000 -l 150 -x x.xg > x.sim.txt
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam

vg index
- -x，--xg-name：use this file to store a succinct, queryable version of the graph(s), or read for GCSA or distance indexing
- -g，--gcsa-out：output a GCSA2 index to the given file
vg sim
- 默认情况下生成text格式的sequence，一行一条序列
vg map
- 默认生产的格式为gam
- -T，--reads：take reads (one per line) from FILE, write alignments to stdout
- -x + -g，输入index pairs

vg｜variant calling

1）calling variants with read support

当前对SV得准确率更高
Operating on .vg or '.gfa' uses the most memory and is not recommended for large graphs.
1）pack：获得gam文件中的reads depth信息
- -x，代表输入的graph
- -Q 5代表过滤mapping quality小于5的read mapping结果
2）call
- 默认情况下输出到STDOUT
- -k，--pack：supports created from vg pack for given input graph（reads support文件的输入flag）

只针对graph中的variants进行calling，

# Compute the read support from the gam
# -Q 5: ignore mapping and base qualitiy < 5
vg pack -x x.xg -g aln.gam -Q 5  -o aln.pack

# Generate a VCF from the support.  
vg call x.xg -k aln.pack > graph_calls.vcf

vg call x.xg -k aln.pack -a > snarl_genotypes.vcf

pangenome 基因组学

「Pangenome tutorial」vg｜utilizing pangenome graph with variation graph

Background

vg｜Install

vg｜variation graph construction

1）vg construct｜from vcf

2）vg construct｜from assembly

-）vg autoindex｜build graph & index

-）vg convert -g

-）vg convert -f

vg｜mapping

1）vg giraffe

2）vg map

vg｜variant calling

1）calling variants with read support

「有谱talk」2023年的就业寒冬与我有什么关系？

1）`vg construct`｜from vcf

2）`vg construct`｜from assembly

-）`vg autoindex`｜build graph & index

-）`vg convert -g`

-）`vg convert -f`

1）`vg giraffe`

2）`vg map`