SMRT-SV

Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

Publication

Discovery and genotyping of structural variation from long-read haploid genome sequence data
Huddleston J, Chaisson MJP, Steinberg KM, Warren W, Hoekzema K, Gordon D, Graves-Lindsay TA, Munson KM, Kronenberg ZN, Vives L, Peluso P, Boitano M, Chin CS, Korlach J, Wilson RK, Eichler EE
https://doi.org/10.1101/gr.214007.116

Data

Whole-genome sequences, structural variants, and Sanger sequences from variant validations are available through BioProject PRJNA335618.

Small indels are available here for CHM1 and CHM13.

Installation

SMRT-SV requires git, Python (2.6.6 or later) and Perl (5.10.1 or later) for installation.
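
Before building, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch and not part of SMRT-SV; the function name check_tools is hypothetical.

```shell
# Hypothetical helper (not part of SMRT-SV): report required tools missing from PATH.
# Prints one line per missing tool and returns non-zero if any tool is absent.
check_tools() {
    tools_missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            tools_missing=1
        fi
    done
    return "$tools_missing"
}
```

Running check_tools git python perl prints nothing when all three are found. The version of each tool still needs to be compared by hand with the minimums listed above.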

SMRT-SV has been tested on CentOS 6.8 and should work with most Linux distributions.

Get the code

Clone the repository into your desired installation directory and build SMRT-SV dependencies.

mkdir -p /usr/local/smrtsv
cd /usr/local/smrtsv
git clone --recursive git@github.com:EichlerLab/pacbio_variant_caller.git .
make

Note that some dependencies (e.g., RepeatMasker) require hardcoded paths to this installation directory. If you need to move SMRT-SV to another directory, it is easier to change to that directory, clone the repository, and rebuild the dependencies there.

Test installation

Add the installation directory to your path.

export PATH=/usr/local/smrtsv/bin:$PATH

Print SMRT-SV help to confirm installation.

smrtsv.py --help

Alternatively, run smrtsv.py directly from the installation directory.

/usr/local/smrtsv/bin/smrtsv.py --help

Configure distributed environment

SMRT-SV uses DRMAA to submit jobs to a grid-engine-style cluster. To enable the --distribute option of SMRT-SV, add the following line to your .bash_profile with the correct path to the DRMAA library for your cluster.

export DRMAA_LIBRARY_PATH=/opt/uge/lib/lx-amd64/libdrmaa.so.1.0

Alternatively, provide the path to your DRMAA library with the SMRT-SV --drmaalib option.
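
A wrong library path is easy to miss until the first job submission. As a rough sketch, assuming only that the DRMAA library is a regular file readable by your user, the hypothetical helper below checks the path before you run SMRT-SV.

```shell
# Hypothetical helper (not part of SMRT-SV): check that a DRMAA library path is usable.
# Uses the first argument if given, otherwise the DRMAA_LIBRARY_PATH variable.
check_drmaa() {
    drmaa_lib=${1:-${DRMAA_LIBRARY_PATH:-}}
    if [ -z "$drmaa_lib" ]; then
        echo "DRMAA_LIBRARY_PATH is not set"
        return 1
    fi
    if [ ! -r "$drmaa_lib" ]; then
        echo "DRMAA library not readable: $drmaa_lib"
        return 1
    fi
    echo "DRMAA library found: $drmaa_lib"
}
```

Call check_drmaa with no arguments to test the exported variable, or pass the same path you intend to give to --drmaalib.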

Additionally, you may need to configure resource requirements depending on your cluster and PacBio data. Use the --cluster_config option when running SMRT-SV to pass a JSON file that specifies Snakemake-style cluster parameters. An example configuration used to run SMRT-SV with human genomes on the Eichler lab cluster is provided in this repository in the file cluster.eichler.json.
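
Snakemake-style cluster configuration files map rule names to job parameters, with a __default__ entry that applies to any rule without its own entry. The sketch below writes such a file for illustration only. The file name, the rule name, the "params" key, and the resource values are all hypothetical placeholders, so take the entries SMRT-SV actually expects from cluster.eichler.json.

```shell
# Write an illustrative Snakemake-style cluster configuration.
# "__default__" is the Snakemake convention for fallback settings.
# The rule name "example_rule", the "params" key, and the resource strings
# are hypothetical; copy the real entries from cluster.eichler.json.
cat > cluster.example.json <<'EOF'
{
    "__default__": {
        "params": "-l h_rt=24:00:00 -l h_vmem=4G"
    },
    "example_rule": {
        "params": "-l h_rt=48:00:00 -l h_vmem=8G"
    }
}
EOF
```

A file like this would then be passed to SMRT-SV with the --cluster_config option.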

Tutorial

The following tutorial shows how to call structural variants and indels in yeast.

Download PacBio reads

Note that this data set requires ~33 GB of disk space for the reads and another ~30 GB for the read alignments.
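
Since the reads and alignments together need roughly 63 GB, checking free space first can save a failed download. The helper below is a small sketch with a hypothetical name, not part of SMRT-SV; it relies only on the POSIX output format of df.

```shell
# Hypothetical helper (not part of SMRT-SV): print free space, in whole GB,
# on the filesystem that holds the given directory (default: current directory).
free_gb() {
    # df -P guarantees one line per filesystem; column 4 is available 1 KB blocks.
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}
```

Run free_gb . in the directory where you plan to work and compare the result with the requirement above.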

# List of AWS-hosted files from PacBio including raw reads and an HGAP assembly.
wget https://gist.githubusercontent.com/pb-jchin/6359919/raw/9c172c7ff7cbc0193ce89e715215ce912f3f30e6/gistfile1.txt

# Keep only .xml, .bas.h5, and .bax.h5 files.
sed '/fasta/d;/fastq/d;/celera/d;/HGAP/d' gistfile1.txt > gistfile1.keep.txt

# Download data into a raw reads directory.
mkdir -p raw_reads
cd raw_reads
while read -r f; do wget --force-directories "$f"; done < ../gistfile1.keep.txt

# Create a list of reads for analysis.
cd ..
find ./raw_reads -name "*.bax.h5" -exec readlink -f {} \; > reads.fofn
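
Interrupted downloads can leave missing or empty files behind, so it may be worth validating the list before alignment. The following is a sketch under the assumption that a file of file names holds one path per line; the function name check_fofn is hypothetical.

```shell
# Hypothetical helper (not part of SMRT-SV): verify every path in a file of file names.
# Reports entries that do not exist or are empty; returns non-zero if any are found.
check_fofn() {
    fofn_total=0
    fofn_bad=0
    while IFS= read -r fofn_entry; do
        [ -n "$fofn_entry" ] || continue    # skip blank lines
        fofn_total=$((fofn_total + 1))
        if [ ! -s "$fofn_entry" ]; then
            echo "missing or empty: $fofn_entry"
            fofn_bad=$((fofn_bad + 1))
        fi
    done < "$1"
    echo "$fofn_total files listed, $fofn_bad missing or empty"
    [ "$fofn_bad" -eq 0 ]
}
```

For this tutorial the call would be check_fofn reads.fofn.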

Prepare the reference assembly

Download the reference assembly (sacCer3) from UCSC.

mkdir -p reference
cd reference
wget ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz

Unpack the reference tarball and concatenate individual chromosome files into a single reference FASTA file.

tar zxvf chromFa.tar.gz
cat *.fa > sacCer3.fasta
rm -f *.fa *.gz
cd ..
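
To confirm that the concatenation picked up every chromosome, you can list the sequence names in the combined FASTA file. This is a minimal sketch with a hypothetical function name; it relies only on the FASTA convention that each record begins with a ">" header line.

```shell
# Hypothetical helper (not part of SMRT-SV): list the sequence names in a FASTA file.
# Every FASTA record starts with a header line beginning with ">".
fasta_names() {
    grep '^>' "$1" | sed 's/^>//'
}
```

Running fasta_names reference/sacCer3.fasta should print one name for each chromosome file that was in the tarball.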

Prepare the reference sequence for alignment with PacBio reads. This step produces suffix array and ctab files used by BLASR to speed up alignments.

smrtsv.py index reference/sacCer3.fasta

Align reads to the reference

Align reads to the reference with BLASR.

smrtsv.py align reference/sacCer3.fasta reads.fofn

Find signatures of variants in raw reads

Find candidate regions to search for SVs based on SV signatures.

smrtsv.py detect reference/sacCer3.fasta alignments.fofn candidates.bed
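
A quick summary of the candidate regions gives a sense of how much of the genome will be assembled. The sketch below assumes candidates.bed follows the standard BED layout (chromosome, 0-based start, end in the first three columns); the function name is hypothetical.

```shell
# Hypothetical helper (not part of SMRT-SV): summarize a BED file.
# Assumes standard BED columns: chromosome, 0-based start, end.
bed_summary() {
    awk '!/^#/ && NF >= 3 { n++; bp += $3 - $2 }
         END { printf "%d regions, %d bp\n", n, bp }' "$1"
}
```

For this tutorial the call would be bed_summary candidates.bed.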

Assemble regions

Assemble local regions of the genome, both those that have SV signatures and those that tile across the genome.

smrtsv.py assemble reference/sacCer3.fasta reads.fofn alignments.fofn candidates.bed local_assembly_alignments.bam

Call variants

Call variants by aligning tiled local assemblies back to the reference. Optionally, specify the sample name for annotation of the final VCF file and a species name (common or scientific as supported by RepeatMasker) for repeat masking of structural variants.

smrtsv.py call reference/sacCer3.fasta alignments.fofn local_assembly_alignments.bam variants.vcf --sample UCSF_Yeast9464 --species yeast
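
Once the call step finishes, counting the records in the VCF is a simple sanity check. This sketch uses a hypothetical function name and relies only on the VCF convention that header lines start with "#".

```shell
# Hypothetical helper (not part of SMRT-SV): count variant records in a VCF file.
# In VCF, header lines start with "#"; every other non-empty line is one record.
vcf_count() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}
```

For this tutorial the call would be vcf_count variants.vcf.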