A phylogenetic tree is a useful tool for inferring evolutionary relationships among organisms and has been used in many evolutionary studies. Consequently, phylogenetic trees based on SNP data are often constructed in resequencing projects. However, there was no simple way to build a phylogenetic tree from the huge number of variants identified in resequencing data. We therefore developed a new pipeline, SNPhylo, to construct phylogenetic trees from SNP data. With this pipeline, a user can construct a phylogenetic tree from a file containing a huge amount of SNP data.

1) Tree construction based on genome-wide SNPs. Conventional tree construction is based on a handful of genes with certain properties, such as single-copy genes, ribosomal RNA genes, or internal transcribed spacer (ITS) sequences. SNPhylo builds the tree from genome-wide information, so it is more accurate.
2) Reduced SNP redundancy by linkage disequilibrium (LD). SNPs in the same LD block provide redundant lineage information. SNPhylo keeps only one informative SNP per LD block, which greatly decreases running time without losing informative sites.
3) Highly automated tree construction. SNPhylo takes the most common SNP/genotype formats (VCF/HapMap) as input and produces a maximum likelihood tree with only one command.
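
The idea behind LD-based redundancy reduction can be illustrated with a minimal sketch. It is not SNPhylo's implementation: it assumes genotypes coded as 0/1/2 allele counts (None for a missing call), uses the squared Pearson correlation as the LD measure, and compares each SNP with every SNP already kept instead of using a sliding window. The function names are hypothetical, and the threshold of 0.1 simply mirrors the default of the -l option.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # assumed coding: 0, 1 or 2 allele copies; None = missing


def r_squared(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Squared Pearson correlation between two SNPs over samples called in both."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        # A monomorphic SNP carries no correlation information.
        return 0.0
    return (cov * cov) / (var_x * var_y)


def ld_prune(genotypes: Sequence[Sequence[Genotype]], threshold: float = 0.1) -> List[int]:
    """Greedy pruning: keep a SNP only if its r^2 with every kept SNP is <= threshold.

    Returns the indices of the SNPs that are kept, in input order.
    """
    kept: List[int] = []
    for i, snp in enumerate(genotypes):
        if all(r_squared(snp, genotypes[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    snps = [
        [0, 1, 2, 0, 1, 2],  # SNP 0
        [0, 1, 2, 0, 1, 2],  # identical to SNP 0, so redundant
        [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with SNP 0, also redundant
        [1, 0, 1, 1, 2, 1],  # uncorrelated with SNP 0
    ]
    print(ld_prune(snps))
```

In this toy input the second and third SNPs carry the same lineage signal as the first, so only the first and the last survive pruning.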

snphylo.tar.gz (ver. 20160204)

You can also download the latest files from GitHub.
Dependent software and packages
Rscript
It is included in the software R, which is freely available at http://www.r-project.org/.
Python
It is freely available at http://www.python.org/.
MUSCLE
It is freely available at http://www.drive5.com/muscle/.
dnaml
It is included in the software PHYLIP, which is freely available at http://evolution.genetics.washington.edu/phylip.html.
R packages (phangorn, gdsfmt, SNPRelate and getopt)
Users can install these R packages manually. For example, running the commands below in R (as root) installs the packages.
> install.packages("getopt", repos="http://cran.r-project.org")
> install.packages("phangorn", repos="http://cran.r-project.org")

> source("http://bioconductor.org/biocLite.R")
> biocLite("gdsfmt")
> biocLite("SNPRelate")

To install the pipeline
1) download the file Snphylo.zip
2) unzip Snphylo.zip
3) cd snphylo
4) bash setup.sh
The setup script finds the dependent programs and asks a few basic questions to set up SNPhylo.
In addition, with the user's permission, setup.sh can automatically install the R packages listed above.
If the setup process finishes successfully, you will see two files (snphylo.sh and snphylo.cfg).
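
Before running setup.sh, it can help to check whether the dependent programs are on the PATH. The following is a minimal sketch of such a check, not part of SNPhylo. The executable names in REQUIRED are assumptions (for example, your MUSCLE or PHYLIP binaries may be named differently or live outside the PATH), and the function names are hypothetical.

```python
import shutil
from typing import Dict, List, Optional

# Assumed executable names for the dependencies listed above; adjust as needed.
REQUIRED = ["Rscript", "python", "muscle", "dnaml"]


def find_programs(names: List[str]) -> Dict[str, Optional[str]]:
    """Map each program name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


def missing_programs(names: List[str]) -> List[str]:
    """Return the program names that could not be found on the PATH."""
    return [name for name, path in find_programs(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_programs(REQUIRED).items():
        print(f"{name}: {path if path else 'NOT FOUND'}")
```

A program reported as not found may still be usable if you give its location to setup.sh when asked.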

Examples of the installation
Install on Linux
Install on OS X

To run the program
Run /DIRECTORY_OF_SNPHYLO/snphylo.sh with the options below.

Usage:
snphylo.sh -v VCF_file [-p Maximum_PLCS (5)] [-c Minimum_depth_of_coverage (5)]|-H HapMap_file [-p Maximum_PNSS (5)]|-s Simple_SNP_file [-p Maximum_PNSS (5)]|-d GDS_file [-l LD_threshold (0.1)] [-m MAF_threshold (0.1)] [-M Missing_rate (0.1)] [-o Outgroup_sample_name] [-P Prefix_of_output_files (snphylo.output)] [-b [-B The_number_of_bootstrap_samples (100)]] [-a The_number_of_the_last_autosome (22)] [-r] [-A] [-h]

Options:
-A: Perform multiple alignment by MUSCLE
-b: Perform (non-parametric) bootstrap analysis and generate a tree
-h: Show help and exit
-r: Skip the step that removes low-quality data (the -p and -c options are ignored)

Acronyms:
PLCS: The percent of low-coverage samples
PNSS: The percent of samples that have no SNP information
LD: Linkage Disequilibrium
MAF: Minor Allele Frequency
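
To make the MAF and missing-rate thresholds concrete, here is a minimal sketch of how the two quantities could be computed for a single SNP. It is an illustration under stated assumptions, not SNPhylo's code: each sample is represented by one base call, the symbols in MISSING are assumed markers for missing data, and a SNP is assumed to pass when its MAF is at least the -m threshold and its missing rate is at most the -M threshold. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence

MISSING = {"N", "-", "."}  # assumed symbols for a missing call


def missing_rate(calls: Sequence[str]) -> float:
    """Fraction of samples without a usable call at this SNP."""
    if not calls:
        return 1.0
    return sum(c in MISSING for c in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[str]) -> float:
    """Frequency of the second most common allele among non-missing calls."""
    counts = Counter(c for c in calls if c not in MISSING)
    if len(counts) < 2:
        return 0.0  # monomorphic (or entirely missing) site
    total = sum(counts.values())
    second_most_common_count = counts.most_common()[1][1]
    return second_most_common_count / total


def passes_filters(calls: Sequence[str],
                   maf_threshold: float = 0.1,
                   max_missing_rate: float = 0.1) -> bool:
    """Assumed filter direction: keep SNPs with MAF >= threshold and few missing calls."""
    return (minor_allele_frequency(calls) >= maf_threshold
            and missing_rate(calls) <= max_missing_rate)


if __name__ == "__main__":
    print(passes_filters(["A", "A", "A", "T"]))  # polymorphic, no missing data
    print(passes_filters(["A", "T", "N", "N"]))  # half of the calls are missing
```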

Simple SNP File Format:
#Chrom Pos SampleID1 SampleID2 SampleID3 ...
1 1000 A A T ...
1 1002 G C G ...
...
2 2000 G C G ...
2 2002 A A T ...
...
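
As an illustration of the layout above, the following sketch reads a Simple SNP file into memory. It assumes that columns are separated by whitespace (spaces or tabs), that the first non-empty line is the header starting with "#", and that the "..." lines in the example are elisions rather than part of the format. The function name and return shape are hypothetical.

```python
import io
from typing import IO, List, Tuple

SnpRow = Tuple[str, int, List[str]]  # (chromosome, position, one call per sample)


def parse_simple_snp(handle: IO[str]) -> Tuple[List[str], List[SnpRow]]:
    """Parse a Simple SNP file and return (sample IDs, SNP rows)."""
    samples: List[str] = []
    rows: List[SnpRow] = []
    header_seen = False
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        fields = line.split()
        if not header_seen:
            if not line.startswith("#") or len(fields) < 3:
                raise ValueError("expected a header line such as '#Chrom Pos SampleID1 ...'")
            samples = fields[2:]
            header_seen = True
            continue
        if len(fields) != len(samples) + 2:
            raise ValueError(f"line {line_number}: expected {len(samples)} calls")
        rows.append((fields[0], int(fields[1]), fields[2:]))
    return samples, rows


if __name__ == "__main__":
    example = io.StringIO(
        "#Chrom Pos SampleID1 SampleID2 SampleID3\n"
        "1 1000 A A T\n"
        "1 1002 G C G\n"
        "2 2000 G C G\n"
    )
    sample_ids, snp_rows = parse_simple_snp(example)
    print(sample_ids)
    print(snp_rows[0])
```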

For example, you can build a phylogenetic tree from the SNP data of 31 soybean genomes [1] as follows.
$ ./snphylo.sh -H soybean.hapmap
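
When SNPhylo is called from another script, the command line can be assembled programmatically. The sketch below builds an argument list for a HapMap input using only flags shown in the usage line (-H, -l, -o, -P, -b, -B). The function names, the keyword arguments, and the default script path "./snphylo.sh" are illustrative assumptions.

```python
import subprocess
from typing import List, Optional


def build_command(hapmap_file: str,
                  script: str = "./snphylo.sh",
                  ld_threshold: Optional[float] = None,
                  outgroup: Optional[str] = None,
                  prefix: Optional[str] = None,
                  bootstrap_samples: Optional[int] = None) -> List[str]:
    """Assemble an snphylo.sh command line for a HapMap input file."""
    cmd = [script, "-H", hapmap_file]
    if ld_threshold is not None:
        cmd += ["-l", str(ld_threshold)]
    if outgroup:
        cmd += ["-o", outgroup]
    if prefix:
        cmd += ["-P", prefix]
    if bootstrap_samples is not None:
        # -B is only meaningful together with -b according to the usage line.
        cmd += ["-b", "-B", str(bootstrap_samples)]
    return cmd


def run_snphylo(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError if it exits with an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    print(" ".join(build_command("soybean.hapmap", prefix="soybean", bootstrap_samples=100)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.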

Files generated by SNPhylo
* PREFIX.hapmap - A HapMap file generated from the Simple SNP file (with -s option)
* PREFIX.filtered.hapmap - A HapMap file after filtration of low-quality data
* PREFIX.gds - A GDS file generated from a filtered file
* PREFIX.fasta - A FASTA file containing sequences generated from the selected SNP data
* PREFIX.id.txt - A file listing the selected IDs
* PREFIX.phylip.txt - A multiple alignment file of the FASTA sequences produced by MUSCLE
* PREFIX.ml.txt - An output file generated by DNAML
* PREFIX.ml.tree - A Newick file of the tree from maximum likelihood analysis
* PREFIX.ml.png - A PNG file of the tree from maximum likelihood analysis
* PREFIX.bs.tree - A Newick file of the tree from bootstrap analysis (with -b option)
* PREFIX.bs.png - A PNG file of the tree from bootstrap analysis (with -b option)
SNPhylo generates the multiple alignment file (*.phylip.txt) in PHYLIP format, so you can use it for additional analyses. For example, you can run a bootstrap analysis with PhyML on this file.
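
To show what "sequences generated from the selected SNP data" can look like, the sketch below concatenates the calls of each sample across the selected SNPs into one sequence per sample and writes them in FASTA format. This is an assumption about the general approach, not a description of SNPhylo's internals; the function name, the input layout (one list of calls per SNP, in sample order), and the line width of 60 are hypothetical.

```python
from typing import List, Sequence


def snps_to_fasta(samples: Sequence[str],
                  snp_calls: Sequence[Sequence[str]],
                  width: int = 60) -> str:
    """Build FASTA text with one record per sample.

    snp_calls[i][j] is the call of sample j at the i-th selected SNP.
    """
    lines: List[str] = []
    for sample_index, name in enumerate(samples):
        # Read down one sample's column to obtain its concatenated SNP sequence.
        sequence = "".join(calls[sample_index] for calls in snp_calls)
        lines.append(">" + name)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_ids = ["SampleID1", "SampleID2", "SampleID3"]
    calls = [
        ["A", "A", "T"],  # SNP at 1:1000
        ["G", "C", "G"],  # SNP at 1:1002
    ]
    print(snps_to_fasta(sample_ids, calls), end="")
```

Because every sample contributes exactly one character per SNP, all sequences have the same length.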

Soybean SNP data [1] in HapMap format (population size: 31; number of SNPs: 6,289,747)
soybean.hapmap.gz (115M)

  • Origin of Arabidopsis Zu-0 is Switzerland (Thank you, Dr. Kristian Ullrich, University of Marburg)
  • Dr. Kristian Ullrich, University of Marburg
  • Mr. En-Hua Xia
  • Dr. David Magee, University College Dublin
Lee, T. H., Guo, H., Wang, X., Kim, C., & Paterson, A. H. (2014). SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics, 15(1).
[1] Lam, H. M., et al. (2010). Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat. Genet., 42, 1053-1059.