SMRT-SV
Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.
Publication
Discovery and genotyping of structural variation from long-read haploid genome sequence data
Huddleston J, Chaisson MJP, Steinberg KM, Warren W, Hoekzema K, Gordon D, Graves-Lindsay TA, Munson KM, Kronenberg ZN, Vives L, Peluso P, Boitano M, Chin CS, Korlach J, Wilson RK, Eichler EE
https://doi.org/10.1101/gr.214007.116
Data
Whole genome sequences, structural variants, and Sanger sequences from variant validations are available through the BioProject PRJNA335618.
Small indels are available here for CHM1 and CHM13.
Installation
SMRT-SV requires git, Python (2.6.6 or later) and Perl (5.10.1 or later) for installation.
SMRT-SV has been tested on CentOS 6.8 and should work with most Linux distributions.
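Before cloning, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch: the helper name `check_prereqs` is hypothetical and not part of SMRT-SV, and it only checks that each tool is present, not that it meets the minimum versions listed above.

```shell
# Hypothetical helper (not part of SMRT-SV): report which tools are on the PATH.
# Returns 0 if all named tools are found and 1 if any are missing.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_prereqs git python perl` lists each tool and returns a non-zero status if any of them is missing.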
Get the code
Clone the repository into your desired installation directory and build SMRT-SV dependencies.
mkdir /usr/local/smrtsv
cd /usr/local/smrtsv
git clone --recursive git@github.com:EichlerLab/pacbio_variant_caller.git .
make
Note that some dependencies (e.g., RepeatMasker) require hardcoded paths to this installation directory. If you need to move SMRT-SV to another directory, it is easier to change to that directory, clone the repository, and rebuild the dependencies there.
Test installation
Add the installation directory to your path.
export PATH=/usr/local/smrtsv/bin:$PATH
Print SMRT-SV help to confirm installation.
smrtsv.py --help
Alternatively, run smrtsv.py directly from the installation directory.
/usr/local/smrtsv/bin/smrtsv.py --help
Configure distributed environment
SMRT-SV uses DRMAA to submit jobs to a grid-engine-style cluster. To enable the --distribute option of SMRT-SV, add the following line to your .bash_profile, replacing the path with the location of the DRMAA library on your cluster.
export DRMAA_LIBRARY_PATH=/opt/uge/lib/lx-amd64/libdrmaa.so.1.0
Alternatively, provide the path to your DRMAA library with the SMRT-SV --drmaalib option.
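A mistyped library path is an easy mistake to make here, so it may be worth checking the path before a distributed run. The following is a minimal sketch; the helper name `check_drmaa_lib` is hypothetical and not part of SMRT-SV. It only checks that the file is readable, not that it is a valid DRMAA library.

```shell
# Hypothetical helper (not part of SMRT-SV): verify that a DRMAA library path
# is set and points to a readable file. The path is taken from the first
# argument or, if no argument is given, from DRMAA_LIBRARY_PATH.
check_drmaa_lib() {
    lib_path="${1:-${DRMAA_LIBRARY_PATH:-}}"
    if [ -z "$lib_path" ]; then
        echo "DRMAA_LIBRARY_PATH is not set" >&2
        return 1
    fi
    if [ ! -r "$lib_path" ]; then
        echo "DRMAA library not readable: $lib_path" >&2
        return 1
    fi
    echo "DRMAA library found: $lib_path"
}
```

Call `check_drmaa_lib` with no arguments to test the environment variable, or pass the same path you intend to give to --drmaalib.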
Additionally, you may need to configure resource requirements depending on your cluster and PacBio data. Use the --cluster_config option when running SMRT-SV to pass a JSON file that specifies Snakemake-style cluster parameters. An example configuration used to run SMRT-SV with human genomes on the Eichler lab cluster is provided in this repository in the file cluster.eichler.json.
Tutorial
The following tutorial shows how to call structural variants and indels in yeast.
Download PacBio reads
Note that this data set requires ~33 GB of disk space for the reads and another ~30 GB for the read alignments.
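Given those sizes, you may want to check free disk space before starting the download. The following is a minimal sketch; the helper names `free_gb` and `has_tutorial_space` are hypothetical and not part of SMRT-SV, and the 63 GB threshold is simply the sum of the two estimates above.

```shell
# Hypothetical helper (not part of SMRT-SV): print free space, in whole GB,
# on the file system that holds the given directory (default: current directory).
free_gb() {
    # -P gives one POSIX-format line per file system; -k reports 1024-byte blocks.
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical check: succeed if at least 33 GB + 30 GB = 63 GB are free.
has_tutorial_space() {
    [ "$(free_gb "${1:-.}")" -ge 63 ]
}
```

For example, `has_tutorial_space . || echo "not enough free space"` warns before any data is downloaded.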
# List of AWS-hosted files from PacBio including raw reads and an HGAP assembly.
wget https://gist.githubusercontent.com/pb-jchin/6359919/raw/9c172c7ff7cbc0193ce89e715215ce912f3f30e6/gistfile1.txt
# Keep only .xml, .bas.h5, and .bax.h5 files.
sed '/fasta/d;/fastq/d;/celera/d;/HGAP/d' gistfile1.txt > gistfile1.keep.txt
# Download data into a raw reads directory.
mkdir -p raw_reads
cd raw_reads
while read -r f; do wget --force-directories "$f"; done < ../gistfile1.keep.txt
# Create a list of reads for analysis.
cd ..
find ./raw_reads -name "*.bax.h5" -exec readlink -f {} \; > reads.fofn
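Interrupted downloads can leave missing or empty files behind, so it may be useful to validate the list before alignment. The following is a minimal sketch; the helper name `check_fofn` is hypothetical and not part of SMRT-SV. It assumes the list contains one path per line, as produced by the find command above.

```shell
# Hypothetical helper (not part of SMRT-SV): confirm that every path listed in
# a file of file names (FOFN) exists and is non-empty. Prints the number of
# bad entries and returns non-zero if there are any.
check_fofn() {
    fofn="$1"
    bad=0
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -z "$entry" ] && continue   # skip blank lines
        if [ ! -s "$entry" ]; then
            echo "missing or empty: $entry" >&2
            bad=$((bad + 1))
        fi
    done < "$fofn"
    echo "$bad"
    [ "$bad" -eq 0 ]
}
```

Running `check_fofn reads.fofn` lists each problem file on standard error and prints the total number of problems on standard output.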
Prepare the reference assembly
Download the reference assembly (sacCer3) from UCSC.
mkdir -p reference
cd reference
wget ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz
Unpack the reference tarball and concatenate individual chromosome files into a single reference FASTA file.
tar zxvf chromFa.tar.gz
cat *.fa > sacCer3.fasta
rm -f *.fa *.gz
cd ..
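As a quick sanity check on the concatenated reference, you can count its FASTA header lines. The following is a minimal sketch; the helper name `count_fasta_records` is hypothetical and not part of SMRT-SV. The count should match the number of chromosome files that were in the tarball.

```shell
# Hypothetical helper (not part of SMRT-SV): count sequence records in a FASTA
# file by counting header lines, which begin with ">".
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c '^>' "$1" || true
}
```

For example, `count_fasta_records reference/sacCer3.fasta` prints the number of sequences in the reference.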
Prepare the reference sequence for alignment with PacBio reads. This step produces suffix array and ctab files used by BLASR to speed up alignments.
smrtsv.py index reference/sacCer3.fasta
Align reads to the reference
Align reads to the reference with BLASR.
smrtsv.py align reference/sacCer3.fasta reads.fofn
Find signatures of variants in raw reads
Find candidate regions to search for SVs based on SV signatures.
smrtsv.py detect reference/sacCer3.fasta alignments.fofn candidates.bed
Assemble regions
Assemble local regions of the genome that either carry SV signatures or tile across the genome.
smrtsv.py assemble reference/sacCer3.fasta reads.fofn alignments.fofn candidates.bed local_assembly_alignments.bam
Call variants
Call variants by aligning tiled local assemblies back to the reference. Optionally, specify the sample name for annotation of the final VCF file and a species name (common or scientific as supported by RepeatMasker) for repeat masking of structural variants.
smrtsv.py call reference/sacCer3.fasta alignments.fofn local_assembly_alignments.bam variants.vcf --sample UCSF_Yeast9464 --species yeast
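Because each step consumes files written by the previous one, the tutorial commands can be chained so that a failure stops the run. The following is a minimal sketch that repeats the commands above unchanged; the wrapper name `run_tutorial` and its optional argument for overriding the executable are hypothetical and not part of SMRT-SV.

```shell
# Hypothetical wrapper (not part of SMRT-SV): run the tutorial steps in order
# and stop at the first step that fails. The optional first argument overrides
# the executable, which defaults to smrtsv.py on the PATH.
run_tutorial() {
    smrtsv="${1:-smrtsv.py}"
    ref="reference/sacCer3.fasta"
    "$smrtsv" index "$ref" &&
    "$smrtsv" align "$ref" reads.fofn &&
    "$smrtsv" detect "$ref" alignments.fofn candidates.bed &&
    "$smrtsv" assemble "$ref" reads.fofn alignments.fofn candidates.bed local_assembly_alignments.bam &&
    "$smrtsv" call "$ref" alignments.fofn local_assembly_alignments.bam variants.vcf \
        --sample UCSF_Yeast9464 --species yeast
}
```

Passing `echo` as the executable, as in `run_tutorial echo`, prints the arguments of each step without running anything, which is a convenient way to review the commands first.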