2015-12-17Xiong Xu

samtools升级至1.3

(本文由GeneDock Senior Bioinformatics Engineer 许雄翻译撰写,转载请保留作者信息和原文链接)

usage


samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3 (using htslib 1.3)
Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
      faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

升级日志


  1. 移除了过时的samtools sort in.bam out.prefix使用说明。如果你仍在使用-f,-o,或者out.prefix,则可转用-T PREFIX,-o FILE替代。
  2. 将bamshuf重命名成collate。
  3. mpileup命令在VCF/BCF中现将不可见的等位基因输出为<*>,替换掉了之前的X或,–output-tags注释加上了AD, ADF, ADR, INFO/AD, INFO/ADF, INFO/ADR,取代了之前的DV, DP4, DPR。
  4. mpileup现在在每个碱基位置采用BAQ计算,忽略了之前的-l 和 -r选项。
  5. samtools现在有一个configure脚本,这个脚本用来检查安装环境和帮助筛选HTSlib的安装环境。
  6. Samtools的Makefile文件现在支持可根据需要重写的CC/CPPFLAGS/CFLAGS/LDFLAGS/LIBS标准变量。
  7. 新增addreplacerg子命令,可用于增加或更改@RG header和添加RG:Z 记录项。
  8. rmdup不再立即终止,但仍不推荐大量使用。
  9. 合并上百万个headers现在只需花费合理的时间即可完成。
  10. samtools index 支持可选的输出路径选项。
  11. 修复了将远超过参考序列结束位置坐标给予未成功比对的reads时,calmd、targetcut、和潜在的 mpileup 字段错误。
  12. 如果你在源代码中使用bam_md.c’s bam_fillmd1_core(), bam_cap_mapQ(), or bam_prob_realn_core()这几个函数时,需要注意额外的ref_len选项。
  13. tview命令颜色方案改成了更加适应色盲者的颜色。
  14. samtools depad命令现可操作CIGAR记录项的N,而且支持CRAM文件。
  15. samtools stats 按列输出ATCG在每个碱基位的含量的N、other统计。
  16. samtools depth 增加-a选项,支持输出深度为0的位点。
  17. 增加了samtools dict子命令,与Picard用法一致,用于对参考序列建立字典。
  18. 在sam.h中增加了sam_index_load()和samfetch() API,提供了类似bam_fetch()迭代BAM或CRAM文件的方式。
  19. 修复了当写sam文件的时候,samopen()只能以wh模式写header。
  20. samtools fixmate - -又能以数据流的模式工作。
  21. 恢复了当写压缩级别为0的bam文件时,之前的samtools calmd -u选项。Samtools 1.0 到 1.2 版本写non-BGZF BAM文件不正确,以致其他工具不能读此bam文件。
  22. 恢复了bam_nt16_nt4_table[]到bam.h头文件。

建立参考基因组字典

samtools dict

About:   Create a sequence dictionary file from a fasta file
Usage:   samtools dict [options] <file.fa|file.fa.gz>

Options: -a, --assembly STR    assembly
         -H, --no-header       do not print @HD line
         -o, --output STR      file to write out dict file [stdout]
         -s, --species STR     species
         -u, --uri STR         URI [file:///abs/path/to/file.fa]
time samtools dict ../testData/GRCh37.p13.genome.fa -o ../testData/GRCh37.p13.genome.fa.dict

real    0m53.134s
user    0m16.855s
sys 0m2.020s

建立参考基因组索引

time samtools faidx ../testData/GRCh37.p13.genome.fa

real    0m13.375s
user    0m12.553s
sys 0m0.771s

建好参考基因组索引之后,可检索出基因组上任意大小任意区域的序列

time samtools faidx ../testData/GRCh37.p13.genome.fa chr17:7564097-7565197
>chr17:7564097-7565197
CAGCCTGGGCAATGAGAGCGAAACTCTGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AGATAACAAAAAGAACAAAAGAACAAATATATCTAATGTGTTCCAATTAAATTGAAATGC
TAATTTTGGCCATGTAATAATTCTCCTGTATGAGAACATCTCCTGTATTTGTTACTTGCT
TTCCTTGGTTTGTCTACCACTCCATATGTCACATCCACCTCATTCCAGCAGTTATATAGT
AAGTTTTCTTTTTTTTTTTCTCGTTTCTTTTTTTTTTTTTTTGATGGAGTCTCACTCTTG
TTGCCCAGGCTGGAGTACAGTAGCGTGATCTCAGCTCGCTGCAGCTTCCACCTCTCGAGT
TCAAGTGATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCATGTGCCACCAT
GCCTGGCTAATTTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGTCAGGCTGGTC
TTGAACTCCTGACCTCATGTGATCCACTGCCTTGGACTCCCAAAGTGCTGGGATTACAGG
CGTGAGCCACTGCACCTGCCTAGCCTATACTAAGTTCTCTAAAATCTGGCAAGGCAAATA
GGACCCCCTCTCCATTCTGTTGCTTATTATTTACTTATTATTTACAATTTTATTATTTAT
GTTATTTTATATTATTTATTTTATTTATTGAGACAGTCTCCCGCTGTGTCGTCCAGGCTG
GAGTGCAGTGGCACGATTTCAGCTTACTGCAACCTCTGCCTCCCGGTTTCAAGGGATTCT
CCTGCCTCAGCCTCTCGAGTAGCTGGGATTATAGGTATGCACTACCACGCCTGGCTAAGT
TTTGTATTTTTAGTATAGACGGGGTTTTGCCATGTTGCCCAGACTGGTCCGGAACTCCTG
AGCTGAAAGCGATTCACCCGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGCGGGAGCCA
CCGTGCCCGGCCTCCAGTATTTTGTTTATTTATTTTTTTTGAGACAGAGTCTCACTCTGT
TGCACAGGCTGGAGTGCAGTGGCACAATCTCTGCTCACTGCAACCTCCTCCTCCCTGGTT
AAGAGATCCTCCTGCCTCAGC

real    0m0.003s
user    0m0.000s
sys 0m0.005s

为bam文件建立索引(需要已排序的bam文件)

samtools index
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]

Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]
time samtools index 5667a836534680001d0ce64c.sort.bam

real    2m5.484s
user    1m26.812s
sys 0m3.096s

对bam文件随机打乱顺序

samtools collate
Usage:   samtools collate [-Ou] [-n nFiles] [-c cLevel] <in.bam> <out.prefix>

Options:
      -O       output to stdout
      -u       uncompressed BAM output
      -l INT   compression level [1]
      -n INT   number of temporary files [64]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

连接多个bam文件(不做排序)

Usage: samtools cat [-h header.sam] [-o out.bam] <in1.bam> [...]

合并多个排序后的bam文件并排序输出一个bam文件

samtools merge
Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]

Options:
  -n         Sort by read names
  -r         Attach RG tag (inferred from file names)
  -u         Uncompressed BAM output
  -f         Overwrite the output BAM if exist
  -1         Compress level 1
  -l INT     Compression level, from 0 to 9 [-1]
  -R STR     Merge file in the specified region STR [all]
  -h FILE    Copy the header in FILE to <out.bam> [in1.bam]
  -c         Combine @RG headers with colliding IDs [alter IDs to be distinct]
  -p         Combine @PG headers with colliding IDs [alter IDs to be distinct]
  -s VALUE   Override random seed
  -b FILE    List of input BAM filenames, one per line [null]
  -@, --threads INT
             Number of BAM/CRAM compression threads [0]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

对参考基因组每个位点做碱基堆积,用于call SNP和INDEL

samtools mpileup

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
  -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding
  -A, --count-orphans     do not discard anomalous read pairs
  -b, --bam-list FILE     list of input BAM filenames, one per line
  -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)
  -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]
  -d, --max-depth INT     max per-BAM depth; avoids excessive memory usage [250]
  -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs
  -f, --fasta-ref FILE    faidx indexed reference sequence file
  -G, --exclude-RG FILE   exclude read groups listed in FILE
  -l, --positions FILE    skip unlisted positions (chr pos) or regions (BED)
  -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
  -r, --region REG        region in which pileup is generated
  -R, --ignore-RG         ignore RG tags (one BAM = one sample)
  --rf, --incl-flags STR|INT  required flags: skip reads with mask bits unset []
  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                            [UNMAP,SECONDARY,QCFAIL,DUP]
  -x, --ignore-overlaps   disable read-pair overlap detection

Output options:
  -o, --output FILE       write output to FILE [standard output]
  -g, --BCF               generate genotype likelihoods in BCF format
  -v, --VCF               generate genotype likelihoods in VCF format

Output options for mpileup format (without -g/-v):
  -O, --output-BP         output base positions on reads
  -s, --output-MQ         output mapping quality

Output options for genotype likelihoods (when -g/-v is used):
  -t, --output-tags LIST  optional tags to output:
               DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
  -u, --uncompressed      generate uncompressed VCF/BCF output

SNP/INDEL genotype likelihoods options (effective with -g/-v):
  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [20]
  -F, --gap-frac FLOAT    minimum fraction of gapped reads [0.002]
  -h, --tandem-qual INT   coefficient for homopolymer errors [100]
  -I, --skip-indels       do not perform indel calling
  -L, --max-idepth INT    maximum per-sample depth for INDEL calling [250]
  -m, --min-ireads INT    minimum number gapped reads for indel candidates [1]
  -o, --open-prob INT     Phred-scaled gap open seq error probability [40]
  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity
  -P, --platforms STR     comma separated list of platforms for indels [all]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Notes: Assuming diploid individuals.

对bam文件按基因组上位置排序或者按read名字排序,支持指定内存大小,内存越大,分块的字排序文件越大

samtools sort
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  -@, --threads INT
             Set number of sorting and compression threads [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

根据read group分割bam文件

samtools split
Usage: samtools split [-u <unaccounted.bam>[:<unaccounted_header.sam>]]
                      [-f <format_string>] [-v] <merged.bam>
Options:
  -f STRING       output filename format string ["%*_%#.%."]
  -u FILE1        put reads with no RG tag or an unrecognised RG tag in FILE1
  -u FILE1:FILE2  ...and override the header with FILE2
  -v              verbose output
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Format string expansions:
  %%     %
  %*     basename
  %#     @RG index
  %!     @RG ID
  %.     filename extension for output format

检查 SAM/BAM/CRAM文件的完整性

samtools quickcheck
Usage: samtools quickcheck [options] <input> [...]
Options:
  -v              verbose output (repeat for more verbosity)

将bam转为fastq格式

samtools fastq
Usage: samtools fastq [options...] <in.bam>
Options:
  -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
  -1 FILE   write paired reads flagged READ1 to FILE
  -2 FILE   write paired reads flagged READ2 to FILE
  -f INT    only include reads with all bits set in INT set in FLAG [0]
  -F INT    only include reads with none of the bits set in INT set in FLAG [0]
  -n        don't append /1 and /2 to the read name
  -O        output quality in the OQ tag if present
  -s FILE   write singleton reads to FILE [assume single-end]
  -t        copy RG, BC and QT tags to the FASTQ header line
  -v INT    default quality score if not given in file [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

将bam转为fasta格式

samtools fasta
Usage: samtools fasta [options...] <in.bam>
Options:
  -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
  -1 FILE   write paired reads flagged READ1 to FILE
  -2 FILE   write paired reads flagged READ2 to FILE
  -f INT    only include reads with all bits set in INT set in FLAG [0]
  -F INT    only include reads with none of the bits set in INT set in FLAG [0]
  -n        don't append /1 and /2 to the read name
  -O        output quality in the OQ tag if present
  -s FILE   write singleton reads to FILE [assume single-end]
  -t        copy RG, BC and QT tags to the FASTQ header line
  -v INT    default quality score if not given in file [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

统计bed文件中每个区域的测序深度

samtools bedcov
Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]

  -Q INT       Only count bases of at least INT quality [0]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

统计基因组每个位置的测序深度

samtools depth

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b <bed>            list of positions or regions
   -f <list>           list of input BAM filenames, one per line [null]
   -l <int>            read length threshold (ignore reads shorter than <int>)
   -d/-m <int>         maximum coverage depth [8000]
   -q <int>            base quality threshold
   -Q <int>            mapping quality threshold
   -r <chr:from-to>    region
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name, position, and coverage depth.  Note that positions with zero  coverage may be omitted by default; see the -a option.

对bam文件做简单统计(reads数,比对率,双端比对率等)

samtools flagstat
Usage: samtools flagstat [--input-fmt-option OPT=VAL] <in.bam>

对bam索引文件做统计,输出结果格式为reference sequence name, sequence length, # mapped reads and # unmapped reads

samtools idxstats
Usage: samtools idxstats <in.bam>

对bam文件做详细统计,其统计结果可用mics/plot-bamstats作图

samtools stats
About: The program collects statistics from BAM files. The output can be visualized using plot-bamstats.
Usage: samtools stats [OPTIONS] file.bam
       samtools stats [OPTIONS] file.bam chr:from-to
Options:
    -c, --coverage <int>,<int>,<int>    Coverage distribution min,max,step [1,1000,1]
    -d, --remove-dups                   Exclude from statistics reads marked as duplicates
    -f, --required-flag  <str|int>      Required flag, 0 for unset. See also `samtools flags` [0]
    -F, --filtering-flag <str|int>      Filtering flag, 0 for unset. See also `samtools flags` [0]
        --GC-depth <float>              the size of GC-depth bins (decreasing bin size increases memory requirement) [2e4]
    -h, --help                          This help message
    -i, --insert-size <int>             Maximum insert size [8000]
    -I, --id <string>                   Include only listed read group or sample name
    -l, --read-length <int>             Include in the statistics only reads with the given read length []
    -m, --most-inserts <float>          Report only the main part of inserts [0.99]
    -P, --split-prefix <str>            Path or string prefix for filepaths output by -S (default is input filename)
    -q, --trim-quality <int>            The BWA trimming parameter [0]
    -r, --ref-seq <file>                Reference sequence (required for GC-depth and mismatches-per-cycle calculation).
    -s, --sam                           Ignored (input format is auto-detected).
    -S, --split <tag>                   Also write statistics to separate files split by tagged field.
    -t, --target-regions <file>         Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.
    -x, --sparse                        Suppress outputting IS rows where there are no insertions.
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

bam文件QC,plot-bamstats依赖gnuplot

samtools stats -m -c 1,100,1 -F 4 -i 1000 -r hg19.fa outfile.sort.bam > outfile.bc ;\
perl misc/plot-bamstats -p outfile.bc outfile.bc

输出bam文件中第二列bitwise flag所代表的含义,如0x4表示没有比对上

samtools flags 0x4
0x4 4  UNMAP

命令行可视化显示区域的比对情况

samtools tview
Usage: samtools tview [options] <aln.bam> [ref.fasta]
Options:
   -d display      output as (H)tml or (C)urses or (T)ext
   -p chr:pos      go directly to this position
   -s STR          display only reads from this sample or group
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

文件格式转换SAM<->BAM<->CRAM

samtools view

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with all bits set in INT set in FLAG [0]
  -F INT   only include reads with none of the bits set in INT set in FLAG [0]
  -x STR   read tag to strip (repeatable) [null]
  -B       collapse the backward CIGAR operation
  -s FLOAT integer part sets seed of random number generator [0];
           rest sets fraction of templates to subsample [no subsampling]
  -@, --threads INT
           number of BAM/CRAM compression threads [0]
  -?       print long help, including note about region specification
  -S       ignored (input format is auto-detected)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]

快速检索区域bam文件,需要先运行samtools index input.bam生成input.bam.bai

samtools view 5667a836534680001d0ce64c.sort.bam chr17:7564097-7565197 -o region.bam

按分位数比例抽样reads,整数位为随机数种子,小数位为抽取reads数的分位数比例

samtools view -b -s 1.1 input.bam -o input.0.1.bam

添加或替换header

samtools reheader
Usage: samtools reheader [-P] in.header.sam in.bam > out.bam
   or  samtools reheader [-P] -i in.header.sam file.bam

Options:
    -P, --no-PG      Do not generate an @PG header line.
    -i, --in-place   Modify the bam/cram file directly.
                     (Defaults to outputting to stdout.)

去除bam文件中的重复

samtools rmdup

Usage:  samtools rmdup [-sS] <input.srt.bam> <output.bam>

Option: -s    rmdup for SE reads
        -S    treat PE reads as SE in rmdup (force -s)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

添加或替换read group

samtools addreplacerg
Usage: samtools addreplacerg [options] [-r <@RG line> | -R <existing id>] [-o <output.bam>] <input.bam>

Options:
  -m MODE   Set the mode of operation from one of overwrite_all, orphan_only [overwrite_all]
  -o FILE   Where to write output to [stdout]
  -r STRING @RG line text
  -R STRING ID of @RG line in existing header to use
      --input-fmt FORMAT[,OPT[=VAL]]...
               Specify input format (SAM, BAM, CRAM)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

如果输入bam是一个按reads name排序的文件,则填充上mate alignment(read1的mate是read2)的坐标,插入片段大小等相关的flag

samtools fixmate
Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>
Options:
  -r           Remove unmapped reads and secondary alignments
  -p           Disable FR proper pair check
  -c           Add template cigar ct tag
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input file must be grouped by read name (e.g. sorted by name). Coordinated sorted input is not accepted.

计算MD tag(a optional field,记录了mismatch信息)

samtools calmd
Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>
Options:
  -e       change identical bases to '='
  -u       uncompressed BAM output (for piping)
  -b       compressed BAM output
  -S       ignored (input format is auto-detected)
  -A       modify the quality string
  -r       compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)
  -E       extended BAQ for better sensitivity but lower specificity
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

call杂合SNP,确定相位

samtools phase

Usage:   samtools phase [options] <in.bam>

Options: -k INT    block length [13]
         -b STR    prefix of BAMs to output [null]
         -q INT    min het phred-LOD [37]
         -Q INT    min base quality in het calling [13]
         -D INT    max read depth [256]
         -F        do not attempt to fix chimeras
         -A        drop reads with ambiguous phase

      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

以数据流方式工作,比对工具(bowtie/bwa/soap)比对后直接输出sorted.bam

bwa mem -M -t 8 -R '@RG\tID:TS11N_L1_R1.fastq.gz.sam\tPL:ILLUMINA\tSM:TS11N_L1_R1.fastq.gz.sam\tDS:ref=/data/xux_data/testData/hg19.fasta,pfx=/data/xux_data/testData/hg19.fasta' /data/xux_data/testData/hg19.fasta /data/xux_data/testData/TS11N_L1_R1.fastq.gz /data/xux_data/testData/TS11N_L1_R2.fastq.gz|samtools view -u -S - | samtools sort -m 4G - TS11N_L1.sort

Reference