FastOMA is a software tool for inferring homology information on your custom genomes, including generating Hierarchical Orthologous Groups (HOGs). It takes as input protein sequences in FASTA format together with a species tree.
For Modules 2-4, you will need to use our GitPod instance.
In this exercise, we will run FastOMA standalone to infer orthology information for five yeast species. The proteomes of the five species are already provided in the GitPod environment, in the folder /workspace/SIBBiodiversityBioinformatics2023/Module3_FastOMA/working_dir/in_folder/proteome.
Another input needed by FastOMA is the species tree. In our case, the species tree in Newick format is provided in the GitPod workspace at Module3_FastOMA/working_dir/in_folder/species_tree.nwk. It is as follows:
(((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;
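To inspect a Newick file like this one from the command line, the leaf names can be pulled out with a small helper. This is a minimal sketch, not part of FastOMA: the function name newick_leaves is hypothetical, and it assumes that names contain no quotes, whitespace, or any of the characters ( ) , : ;

```shell
# List the leaf (species) names of a Newick tree, one per line.
# Assumption: names are unquoted and contain none of ( ) , : ; or whitespace.
newick_leaves() {
    # A leaf name is the token that directly follows "(" or ",".
    # Internal node names follow ")" and are therefore not matched.
    grep -oE '[(,][^(),:;]+' "$1" | tr -d '(,'
}

# Example (run from working_dir):
#   newick_leaves in_folder/species_tree.nwk
```

Labels such as Saccharomycetales or Ascomycota are attached to internal nodes, so they do not appear in the output.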
The FastOMA software is already installed, and you should be able to use it after logging into your GitPod workspace.
Optional (If you are not using GitPod)
If you want to install FastOMA on your system, you can follow the installation instructions on the FastOMA GitHub page (https://github.com/DessimozLab/fastoma).
If you want to download the proteomes on your own system, check out the following hint:
Right-click on "Download one protein sequence per gene (FASTA)" and copy the link. Then use wget to download the file and gunzip to decompress it. For example, for Schizosaccharomyces pombe:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000002485/UP000002485_284812.fasta.gz
gunzip -k UP000002485_284812.fasta.gz
1. In what format are the proteome files?
2. How many proteins are there in the Schizosaccharomyces pombe proteome?
grep ">" in_folder/proteome/Schizosaccharomyces_pombe.fa | wc -l
3. How many leaves are in the species tree? For how many species does the species tree provide evolutionary information?
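The same header-counting idea can be applied to every proteome at once. The sketch below is only an illustration: count_proteins is a hypothetical helper name, and it assumes the proteomes are stored as .fa files in a single folder, as in_folder/proteome is here.

```shell
# Print "<file name><TAB><number of FASTA records>" for every .fa file in a folder.
# Each FASTA record starts with a header line beginning with ">".
count_proteins() {
    for cp_fasta in "$1"/*.fa; do
        [ -e "$cp_fasta" ] || continue   # the glob matched nothing
        printf '%s\t%s\n' "$(basename "$cp_fasta")" "$(grep -c '^>' "$cp_fasta")"
    done
}

# Example (run from working_dir):
#   count_proteins in_folder/proteome
```

Anchoring the pattern with ^ counts only header lines, so a ">" character inside a description is not counted twice.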
The FastOMA algorithm runs in three main steps:
Note that these steps are executed by our highly parallelized pipeline implemented in Nextflow. The output of FastOMA is reported in OrthoXML, the standard format for HOGs. For more information on HOGs, see Module 1 and page 4 of Zahn-Zabal et al., F1000, 2020.
First, change directory to Module3_FastOMA/working_dir/, which contains the folder in_folder.
cd /workspace/SIBBiodiversityBioinformatics2023/Module3_FastOMA/working_dir/
Then, check whether Nextflow is installed on your system by running nextflow -h. Now we can use the command line to run FastOMA on the five proteomes in in_folder/proteome, using the species tree from in_folder/species_tree.nwk.
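Before launching the pipeline, it can help to confirm that every leaf of the species tree has a proteome file. The following is a rough sketch under stated assumptions, not a FastOMA feature: check_inputs is a hypothetical helper, and it assumes that each proteome is named after its leaf with a .fa extension (as Schizosaccharomyces_pombe.fa is in this workspace) and that leaf names contain none of ( ) , : ; or whitespace.

```shell
# Print every species-tree leaf that lacks a non-empty <leaf>.fa proteome.
# Returns 0 if all leaves are covered, 1 otherwise.
# Assumed layout: <in_folder>/species_tree.nwk and <in_folder>/proteome/<leaf>.fa
check_inputs() {
    ci_status=0
    # Leaf names are the tokens that directly follow "(" or "," in the Newick text.
    for ci_leaf in $(grep -oE '[(,][^(),:;]+' "$1/species_tree.nwk" | tr -d '(,'); do
        if [ ! -s "$1/proteome/$ci_leaf.fa" ]; then
            echo "$ci_leaf"
            ci_status=1
        fi
    done
    return "$ci_status"
}

# Example (run from working_dir):
#   check_inputs in_folder && echo "all leaves have a proteome"
```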
1. What is the command line to run FastOMA?
nextflow FastOMA_light.nf --input_folder in_folder --output_folder out_folder
Execute the above command to run FastOMA.
2. Where is the output orthoXML file?
Recall that Orthologous Groups are groups of strict orthologs, with at most 1 representative per species. Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.
The output of FastOMA includes two folders (hogmap and OrthologousGroupsFasta) and three files (OrthologousGroups.tsv, rootHOGs.tsv and output_hog.orthoxml).
The hogmap folder includes the output of OMAmer (Module 2); each file corresponds to an input proteome. The folder OrthologousGroupsFasta includes FASTA files, and all proteins inside each FASTA file are orthologous to each other. These could be used as gene markers for species tree inference (Module 3).
1. How many Orthologous Groups are there?
2. How many genes in total are present in all Orthologous Groups?
grep geneRef output_hog.orthoxml | wc -l
Orthologous Groups which have a representative gene in every species could be considered as the core genome.
3. How many Orthologous Groups include one representative gene for each species?
sed 's/[^,]//g' OrthologousGroups.tsv | awk '{ print length }' | grep -cx "4"
4. How many Root HOGs are in the HOG file?
5. Consider the gene “60S ribosomal protein L15-A” in Schizosaccharomyces pombe with protein ID: RL15A_SCHPO. How many proteins are in the gene family (for these 5 species of interest)?
6. Which genes are orthologous to the gene A7EQW0_SCLS1?
Use grep on OrthologousGroups.tsv.
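The comma-counting and grep hints above can be wrapped into two small helpers. This is a sketch under an assumed file layout rather than a description of the actual FastOMA output: it supposes that OrthologousGroups.tsv is tab-separated, with a group identifier in the first column and the comma-separated member genes after it. Both function names are hypothetical.

```shell
# Count the groups that contain exactly N genes.
# A line with N genes has N-1 commas; a header line without commas counts as size 1.
count_groups_of_size() {
    awk -v n="$2" '{ if (gsub(/,/, ",") + 1 == n) c++ } END { print c + 0 }' "$1"
}

# Print all members of the group(s) that contain the given gene ID, one per line.
# The query gene itself is included in the output.
group_members() {
    grep -Fw -- "$2" "$1" | cut -f2- | tr ',' '\n' | sed 's/^ *//'
}

# Examples (run from the output folder):
#   count_groups_of_size OrthologousGroups.tsv 5
#   group_members OrthologousGroups.tsv A7EQW0_SCLS1
```

Because an Orthologous Group holds at most one gene per species, a group of size 5 has one representative in each of the five species.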