Protein prediction from genome
BRAKER3: Protein prediction from genome
Overview
Galaxy hosts many bioinformatics tools that are run from a graphical interface (GUI) rather than from the command line. You can register for a free online account with 250 GB of output storage.
BRAKER3 uses RNA-Seq data and/or predicted protein databases to guide protein prediction from a genome in .fna or .fasta format. RNA-Seq data are submitted as BAM files, which are very large; predicted protein files are submitted as much smaller .fa or .fasta files obtained from UniProt (preferred) or NCBI.
These instructions predict exons in your genome of interest and combine them into likely genes based on a guiding high-quality genome, convert those genes into protein sequences, and then assess the quality of those protein predictions. They draw heavily on Galaxy’s BRAKER3 tutorial, with modifications.
Upload files
- Log in to your Galaxy account
- Open the genome file to be converted to proteins and check its sequence headers. Use Find-Replace to convert all spaces, colons, commas, etc. to underscores. For instance, “ “, “: “, and “, “ should be replaced with “_” so that “apple banana”, “apple: banana”, and “apple, banana” become “apple_banana”.
- Check your genome file to be converted to proteins to see if it is soft-masked. Open the file and see if some of the nucleotides are lower-case; if so, your file is soft-masked.
- If your genome file to be converted to proteins is not soft-masked, follow these instructions for Red or RepeatMasker tools. If you had to soft-mask the genome, be sure to use this new file in subsequent steps.
- Click on Upload in the top left corner.
- Drag your genome file to be converted to proteins and the reference RNA-Seq and/or predicted protein files into the upload pop-up window. Multiple files can be uploaded at once. Click Start. Then click Close.
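The header clean-up and soft-masking check described above can also be scripted before uploading. The following is a minimal sketch, not part of the Galaxy workflow: the function names are hypothetical, and it assumes that letters, digits, periods, hyphens, and underscores are the only characters worth keeping in a header.

```python
import re


def sanitize_header(line):
    """Replace each run of spaces, colons, commas, etc. in a FASTA header with one underscore.

    Assumption: letters, digits, '.', '-' and '_' are kept; everything else is replaced.
    """
    name = line[1:].strip()
    return ">" + re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")


def sanitize_fasta(lines):
    """Yield FASTA lines with cleaned headers; sequence lines pass through unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield sanitize_header(line) if line.startswith(">") else line


def is_soft_masked(lines):
    """Return True if any sequence line contains a lower-case nucleotide."""
    return any(
        any(ch.islower() for ch in line)
        for line in lines
        if not line.startswith(">")
    )


# Made-up two-line record used only to illustrate the functions.
example = [">apple: banana, cherry", "ACGTacgtNN"]
cleaned = list(sanitize_fasta(example))
print(cleaned[0])
print(is_soft_masked(cleaned))
```

For a real genome, the same functions can be applied to the lines of an opened file and the cleaned lines written to a new file, which is then the file to upload.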
BRAKER3
- In the Tools column (left half of screen), type Braker3 into the search bar. Click on BRAKER3 genome annotation. This opens the BRAKER3 tool on the right half of the screen.
- Set the following:
  a. Assembly to annotate = your genome file to be converted to proteins (.fna or .fasta file type). Remember to use the soft-masked version to improve protein prediction quality.
  b. Genome sequence is soft-masked = Yes
  c. RNA-seq mapped to genome to train Augustus/GeneMark = reference RNA-Seq .bam file; this is optional
  d. Proteins to map to genome = reference predicted protein file from UniProt (preferred) or NCBI, as uniprot.fasta
  e. Fungal genome = ignore if not working with a fungal genome
  f. Augustus settings = default
  g. Advanced settings = default
  h. Output format = GFF3
- Click Run Tool at the bottom of the settings.
- BRAKER3 could take anywhere from a few hours to 2-3 days to run depending on genome size and reference database quality.
Get file and review errors
- To download the completed GFF file, click on the Save icon (floppy disk) in the lower left corner of the green box for that analysis (green = successfully completed) on the right side of the screen.
- Errors in the analysis are indicated by a red box for that analysis. To see the error details, click on the View symbol (an eye inside box corners) in the red box for that analysis. Then click on the Error icon, which looks like a person with an ‘i’ on their abdomen. Galaxy includes an error wizard that analyzes the error codes and gives a plain-language description of the error plus possible fixes.
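Before moving on, it can be useful to sanity-check the downloaded GFF3 file by counting how many features of each type it contains. The sketch below relies only on the standard GFF3 layout (nine tab-separated columns, with the feature type in column 3); the function name and the example lines are made up for illustration and are not real BRAKER3 output.

```python
from collections import Counter


def count_features(gff_lines):
    """Count GFF3 records per feature type (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:  # ignore anything that is not a 9-column feature line
            counts[cols[2]] += 1
    return counts


# Made-up records in GFF3 layout, used only to show the function.
example = [
    "##gff-version 3",
    "contig_1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1",
    "contig_1\tAUGUSTUS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "contig_1\tAUGUSTUS\tCDS\t100\t400\t.\t+\t0\tID=g1.t1.cds1;Parent=g1.t1",
    "contig_1\tAUGUSTUS\tCDS\t600\t900\t.\t+\t2\tID=g1.t1.cds2;Parent=g1.t1",
]
for feature_type, n in sorted(count_features(example).items()):
    print(feature_type, n)
```

A file with no gene or CDS records at all is worth investigating before running the conversion step.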
Convert nucleotide sequences to proteins
- Convert the GFF file to CDS and protein sequences using the GFFread tool. Search for this tool the same way you found BRAKER3.
- Set the following:
  a. Input BED, GTF, or GFF3 feature file = the output from BRAKER3. Choosing it from the dropdown directly in Galaxy is preferred, because uploaded .gff files seem to be read by Galaxy as .bed files, so exons are converted to CDSs incorrectly. Alternatively, upload the GFF file you downloaded from the BRAKER3 output.
  b. Reference genome = From your history
  c. Genome reference fasta = your genome to be converted to predicted proteins, which you already uploaded
  d. Select fasta outputs = select the following:
     - Fasta file with spliced exons for each GFF transcript (-w)
     - Protein fasta file with the translation of CDS for each record (-y)
     - For protein fasta: use (*) instead of (.) as stop codon translation (-S)
  e. Full GFF attribute preservation = Yes
  f. Decode URL encoded characters within attributes = Yes
  g. Warn about duplicate transcript IDs and other potential problems with the given GTF/GFF records = Yes
  h. Click Run Tool at the bottom of the settings.
- Download the nucleotide and protein files the same way you downloaded the GFF file output from BRAKER3.
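A quick check of the downloaded protein file can catch translation problems before the BUSCO step. The sketch below is an illustration rather than part of the Galaxy workflow: it assumes stop codons were written as `*` (the -S option above), and the function names and example records are hypothetical. It counts the proteins and lists any with a stop codon before the end of the sequence.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA lines; name is the first word of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def internal_stops(records):
    """Return names of proteins with '*' anywhere other than the end of the sequence."""
    return [name for name, seq in records if "*" in seq.rstrip("*")]


# Made-up protein records used only to show the functions.
example = [
    ">g1.t1",
    "MKTAYIAKQR",
    "QISFVKSHFS*",
    ">g2.t1",
    "MSTN*PQRL*",
]
records = list(read_fasta(example))
print(len(records), "proteins")
print("internal stops:", internal_stops(records))
```

Proteins with internal stop codons often point to a mismatch between the GFF file and the genome file used for the conversion.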
BUSCO evaluation
- Evaluate the quality of the gene/protein model prediction for your genome using BUSCO (Benchmarking Universal Single-Copy Orthologs). Search for the BUSCO tool the same way you found BRAKER3 and GFFread.
- Set the following:
  a. Sequences to analyze = the protein output from GFFread. This can be chosen from the dropdown directly in Galaxy. Or you can upload the pep.fa or fasta file you downloaded from the GFFread output.
  b. Cached database with lineage = Eukaryotes (DATE) from the dropdown
  c. Mode = annotated gene sets (protein)
  d. Auto-detect or select lineage = select lineage
  e. Auto-lineage group = Eukaryotes (--auto-lineage-euk)
  f. Which outputs should be generated = at a minimum, Short summary text (default). Any of the other files can also be chosen if desired. Summary image and List with missing IDs will be small files; gff, Protein sequences, and Nucleotide sequences will be megabytes in size.
  g. All other settings should be default
- Click Run Tool at the bottom of the settings.
High-quality predictions for eukaryotes find complete versions of >90% of the reference orthologs.
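The >90% check can be read off the BUSCO short summary text by hand, or extracted with a short script. The sketch below assumes the summary contains BUSCO's usual one-line result string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function names are hypothetical and the numbers in the example are made up, not results from a real run.

```python
import re

# Pattern for the one-line result string in a BUSCO short summary.
RESULT_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Return the percentages (and total ortholog count) from a BUSCO short summary."""
    match = RESULT_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO result line found")
    stats = {key: float(value) for key, value in match.groupdict().items()}
    stats["total"] = int(stats["total"])
    return stats


def is_high_quality(stats, threshold=90.0):
    """Apply the rule of thumb above: more than 90% complete orthologs."""
    return stats["complete"] > threshold


# Made-up summary line used only to show the functions.
example = "\tC:93.7%[S:91.0%,D:2.7%],F:2.4%,M:3.9%,n:255\n"
stats = parse_busco_summary(example)
print(stats["complete"], is_high_quality(stats))
```

The threshold is left as a parameter because the 90% figure is a rule of thumb for eukaryotes rather than a fixed cut-off.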