A web-app for characterizing proteins using molecular evolution and phylogeny

Example Analyses

Load an existing analysis to explore MolEvolvR results right away without inputting anything.

liasix

FASTA upload: Full analysis, for 6 bacterial proteins

5uNQ9l

FASTA upload: Full analysis, for 1 viral protein

7JtOTB

BLAST results upload, for 1 eukaryotic protein

v7omZ3

InterProScan results upload, for 1 viral protein


Overview and Features

MolEvolvR allows users to:
1. Analyze proteins of interest with full analysis (1, 3, & 4).
2. Perform only protein characterization (1 & 3).
3. Run only homology searches (1 & 4).
4. Start with external BLAST/InterProScan outputs for further analysis (2, 3, & 4).
MolEvolvR is interactive, searchable, and customizable.

Abstract

Studying proteins through the lens of evolution can reveal conserved features, lineage-specific variants, and their potential functions. MolEvolvR (https://jravilab.org/molevolvr) is a novel web-app enabling researchers to visualize the molecular evolution of their proteins of interest in a phylogenetic context across the tree of life, spanning all superkingdoms. The web-app accepts multiple input formats — protein/domain sequences, homologous proteins, or domain scans — and, using a general-purpose computational workflow, returns detailed homolog data and dynamic graphical summaries (e.g., phylogenetic trees, multiple sequence alignments, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns across lineages). Thus, MolEvolvR is a powerful, easy-to-use web interface for computational protein characterization.

How to Cite

MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny
Jacob D Krol*, Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*, Faisal S Alquaddoomi, Evan P Brenner, Ethan P Wolfe, Vincent P Rubinetti, Shaddai Amolitos, Kellen M Reason, John B Johnson, Janani Ravi. [*Co-primary]
bioRxiv 2022. doi: https://doi.org/10.1101/2022.02.18.461833 ; web-app: http://jravilab.org/molevolvr

Submit your sequence or preprocessed data here and use your retrieval code (custom URL) to view the results.


Analysis

Past Analyses

Jobs you have previously submitted on this device will appear here.

Results Summary

An overview of the protein analysis. To view the full results, explore the additional tabs (or click the buttons!). After exploring results and configuring graphics, you may compile a report of your results using the "Generate Report" button below. You will be notified when the report is available for download.

Download Report

Domain Architecture

Visualizations and summaries for protein domains.

Phylogeny

Visualizations for protein evolution

Data

Summary table of proteins including domain architectures, phylogeny, and homologs, when applicable.

Query Data

Input data, additional metadata, and preliminary analyses of query protein(s).

The data table provides a summary of the sequences submitted, or "queried", for analysis. The preview shown can be extended by using "Add/remove column(s)" to see info about other taxonomic classes as well as domain architecture codes from databases other than the default (Pfam).

Uploaded amino acid FASTA sequence(s).

A heatmap of submitted sequences and their respective taxonomic lineages.

Summary and visualizations of protein motifs/subunits (domains) and their configurations within the query protein(s) (domain architectures).

Homolog Data

Full set of homologs of query sequences, including their lineage and domain architecture info.

All proteins are shown by default. Use the input below to select proteins to filter the table.

Domain Architecture

Visualizations and analyses of all query and homologous protein domains, structural or functional subunits, and their architectures.

Select a row to see which lineages the domain architecture is present in

If the legend or figure is clipped, or if the legend is colliding with the bars, you can try the 'Right-align Legend' option or resize your browser window.
Messy graph? Try re-arranging the vertices by clicking and dragging the vertices! Also try zooming in and out using your scroll wheel!

Phylogeny

Visualizations of phyletic patterns, sequence similarity, and evolution of related proteins.

* Analyses will appear only if there is at least one visible row that has values for that analysis



Help and FAQ docs for MolEvolvR

This website is free and open to all users, with no login required.

This help page shows how to use MolEvolvR to its fullest potential.

Coming Soon: Videos demonstrating what you can do with MolEvolvR, and how to set up custom analyses and navigate the app after loading your results.

The UI: Workflow and Usage

Proteins are the functional units of cellular processes. The goal of MolEvolvR is to characterize proteins by their sequence, structure, function, and phylogeny by using sequence similarity, domain architecture, lineages/phyletic spread, and more.

Published use cases/Testing

You can explore a sample set of phage shock proteins (PSP) (e.g., lia operon from Bacillus subtilis here), and the full set of PSP proteins here. We generated homology, domain architecture, and phylogeny analyses of these proteins (and their genomic contexts) to show their prevalence in other organisms and to detail how variations of this phage shock stress response system are present across many lineages.

We have applied the approach underlying MolEvolvR to study diverse systems, including:

  • Nutrient acquisition systems in Staphylococcus aureus [tcyABCP, gis-gt]
  • A novel phage defense system in Vibrio cholerae [Vch1]
  • Surface layer proteins in Bacillus anthracis [SLPs]
  • Helicase operators in bacteria [DciA]
  • Internalins in Listeria [InlP]
  • Antimicrobial resistance (AMR) genes from the CARD database and uncharacterized proteins associated with AMR using machine learning approaches [AMR]

How to use MolEvolvR

You can provide a variety of protein inputs, including:

  1. Protein sequence(s) in FASTA format
  2. Protein accession number(s) in NCBI and/or UniProt format
  3. Protein multiple sequence alignment (MSA) in FASTA/Pearson format
  4. Protein BLAST output in .csv format
  5. InterProScan output in .tsv format

With any of these inputs, proteins of interest will be analyzed to identify homologs, determine domain architectures, and delineate phyletic spreads. These analyses provide insights into the biological role(s) of the protein(s) of interest within organisms, as well as trace their evolution.

MolEvolvR can perform 4 types of analyses:

  1. Domain architecture, which allows identification of protein domains and exploration of domain interactions and domain co-occurrences
  2. Identification of homologs, which reveals patterns within and across species
  3. Phylogenetic analysis, which shows the phyletic spread of proteins across the tree of life, a multiple sequence alignment, and a phylogenetic tree
  4. Visualization and analysis of results from BLAST suite, InterProScan, and multiple sequence alignments

Enter data

Accession Numbers and FASTA (full analysis)

To begin, enter the amino acid FASTA sequence(s) or accession numbers of your protein(s) of interest into the Start Analysis tab. You can also upload a file containing multiple FASTA sequences (.fa, .faa, .fasta) or accession numbers (.csv). Up to 100 protein sequences per job are accepted. For analyses with more than 100 proteins, please contact us.
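Before submitting, it can help to check a FASTA file locally against the 100-sequence limit described above. The following is a minimal sketch, not MolEvolvR's own validation: the helper names `parse_fasta` and `check_job_size` and the demo sequences are hypothetical, and the only value taken from this page is the limit of 100.

```python
from typing import Dict

MAX_SEQUENCES_PER_JOB = 100  # per-job limit stated on this page


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line  # join wrapped sequence lines
    return records


def check_job_size(records: Dict[str, str]) -> None:
    """Raise ValueError if the job is empty or exceeds the per-job limit."""
    if not records:
        raise ValueError("no sequences found")
    if len(records) > MAX_SEQUENCES_PER_JOB:
        raise ValueError(
            f"{len(records)} sequences found; at most "
            f"{MAX_SEQUENCES_PER_JOB} are accepted per job"
        )


# Made-up demo input, not real proteins.
example = ">protA demo\nMKTAYIAK\nQRQISF\n>protB demo\nMSEQ\n"
records = parse_fasta(example)
check_job_size(records)
print(len(records), "sequence(s) ready to submit")
```

The check only counts records and joins wrapped lines; it does not verify that the residues are valid amino acid codes.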

Multiple Accessions/FASTA of homologs

If you have a pre-existing set of homologous proteins, you can enter/upload the multiprotein FASTA or list of accession numbers. MolEvolvR can also use an MSA in FASTA/Pearson format generated through external programs such as Clustal Omega, ClustalW, Kalign, or MUSCLE.

Advanced Options

Advanced options allow you to customize your analysis.

Phylogenetic Analysis

Selecting Phylogenetic Analysis will analyze a set of known homologous proteins. Because this type of analysis already uses homologs, the homology search option will be disabled.

Homology Search

Selecting Homology Search will identify homologs (related proteins) for each input protein. This pairs well with domain architecture searches that can be obtained for all homologous hits for each query.

Domain Architecture

Selecting Domain Architecture will generate domain architecture (including sequence-structure motifs/domains such as Pfam, Hamap and SignalP, disorder predictions using MobiDBlite, and secondary structure/cellular localization using Phobius and Coils) for the query proteins provided. If selected alone, no other analysis will be performed.

Split Queries by Domain

This option allows you to search for proteins with homologous domains to those found within your query proteins. This allows for a broader search of remote homologs that would be missed by a standard whole-protein search. Phylogenetic searches, domain architecture, and characterization are then performed.

BLAST Parameters

For analyses that include a homology search, you can adjust parameters like database (default refseq), maximum hits (default 100), and E-value (default 0.00001).

Organism(s) to Include/Exclude

You can filter your homology search results to specific organisms via the Organism(s) to Include/Exclude dropdowns. Enter either organism names or taxon IDs and the list will be dynamically filtered to show only matching organisms, with your query highlighted in each list item. The full taxonomic classification for each item is shown and is searchable. Multiple selections are possible.

The first dropdown filters organisms to include; if empty, all organisms/taxa will be eligible for inclusion. The second dropdown filters organisms to exclude; if empty, no organisms/taxa will be excluded. If both dropdowns are used, the “include” filter will be applied first, followed by the “exclude” filter.
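The include-then-exclude order described above can be sketched as set operations. This is an illustration under simplifying assumptions, not the backend's logic: `filter_taxa` is a hypothetical helper, the integer IDs are placeholders rather than real taxon IDs, and taxa are treated as a flat set, so the taxonomic hierarchy (a taxon and its descendants) is ignored.

```python
from typing import Iterable, Set


def filter_taxa(candidates: Iterable[int],
                include: Set[int],
                exclude: Set[int]) -> Set[int]:
    """Apply the include filter first, then the exclude filter."""
    pool = set(candidates)
    # An empty include list leaves every taxon eligible.
    kept = pool & include if include else pool
    # An empty exclude list removes nothing.
    return kept - exclude


all_taxa = [101, 102, 103, 104]  # placeholder IDs

print(filter_taxa(all_taxa, set(), set()))            # nothing filtered
print(filter_taxa(all_taxa, {101, 102, 103}, {102}))  # include, then exclude
```

Because the exclude step runs last, a taxon that appears in both lists ends up excluded.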

The filters are applied to your currently selected BLAST database (e.g., refseq, nr).

In the backend, these options are passed to BLAST as -taxids <comma-delimited list of IDs> for the inclusion filter and -negative_taxids <comma-delimited list of taxon IDs> for exclusions. See Limiting a Search by taxonomy for NCBI’s documentation on this feature.
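As a rough illustration of how the two dropdown selections could map onto the flags named above, the sketch below turns two lists of taxon IDs into command-line arguments. The helper `taxonomy_flags` and the IDs are hypothetical. How the MolEvolvR backend combines the two filters in practice is not shown here, and your BLAST+ version may restrict using both flags in one invocation, so check its documentation before relying on this.

```python
from typing import List, Sequence


def taxonomy_flags(include_ids: Sequence[int],
                   exclude_ids: Sequence[int]) -> List[str]:
    """Build BLAST taxonomy arguments from include/exclude ID lists."""
    args: List[str] = []
    if include_ids:
        # Inclusion filter: comma-delimited list of taxon IDs.
        args += ["-taxids", ",".join(str(t) for t in include_ids)]
    if exclude_ids:
        # Exclusion filter: comma-delimited list of taxon IDs.
        args += ["-negative_taxids", ",".join(str(t) for t in exclude_ids)]
    return args


# Placeholder IDs for illustration only.
print(taxonomy_flags([101, 102], []))
print(taxonomy_flags([], [103]))
```

Leaving a list empty omits the corresponding flag, which matches the behavior described above where an empty dropdown applies no filter.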

Customizing your analysis

Fully characterize proteins of interest

You can start your analysis with a full list of accession numbers or FASTA files for protein(s) of interest. MolEvolvR gathers homologs of your input protein(s), and then performs domain architecture analysis on all homolog and query sequences. You have the option to perform only Phylogenetic Analysis or Domain Architecture if you don’t need both.

Analyze external data

You can start your analysis from uploaded NCBI BLAST or InterProScan results. Web-BLAST results allow you to determine homolog similarity, domain architecture, and/or phylogeny. InterProScan results are used to summarize and visualize domains and (if accession numbers are provided) phylogeny.

BLAST outputs from the NCBI BLAST web-interface

BLAST is available through NCBI’s website.

You can start your analysis with data from a previous BLAST run. These data are run through BLASTClust to cluster similar sequences among the retrieved homologs. The Phylogenetic Analysis and Domain Architecture options are then applied.

To ensure compatibility with the MolEvolvR Start Analysis tab, follow these guidelines.

Step 1: Enter Accession Numbers/FASTA sequences and choose parameters

First, enter your Accession Number(s) or FASTA sequence(s) into the “Enter Query Sequence” box.

For the database parameter, we support either the non-redundant database (nr) or the reference sequence collection (refseq_proteins). The refseq_proteins dataset is a high quality, non-redundant subset of protein records curated by NCBI staff. Meanwhile, nr is a larger, non-redundant set that includes many more sequences but is not necessarily vetted for quality and accuracy. If you would like to further filter your results based on lineages (e.g., species, genus, family, kingdom), enter the name/taxID in the Organism field and toggle the box to include/exclude those results in your search.

Next, select which algorithm to run. If you don’t know details of your protein, ‘BLASTP’ is a great place to start. If you are interested in identifying remote homologs, we suggest using ‘PSI-BLAST’. If your protein has domains of interest, ‘DELTA-BLAST’ works very well.

Creating a job title for the run is optional and for your personal convenience.

Under the expandable “Algorithm parameters” section, the defaults for max target sequences are typically sufficient. The expect threshold value, or E-value, represents the number of matches expected by chance alone; hits with E-values greater than the threshold are filtered out. We suggest 1e-5 (1x10^-5, or 0.00001) for general searches. Double check your parameters across the page, then click the BLAST button.

Summary of NCBI BLAST submission parameters

  1. Accession Number(s) or FASTA sequence(s)
  2. Database
     1. RefSeq. This database contains only NCBI-curated, high quality, non-redundant protein sequences.
     2. NR. This database contains a much larger pool of uncurated, variable quality, non-redundant protein sequences.

Step 2: Downloading BLAST results

Once your BLAST search is complete, at the end of the RID row towards the top of the page, there will be a Download All option with a dropdown menu to download results. Click on the Download All button and select the Hit Table (csv) option. You can directly upload these .csv result files to MolEvolvR. If the first column of the results .csv does not include accession numbers, you will also need to provide the query sequence(s) that you used to run BLAST as a second file (.fa, .faa, or .fasta format).

If you are performing a PSI-BLAST, you will have the option to run additional iterative searches upon each search’s completion. Further iterations will find more remote homologs, so it is recommended you run several iterations before downloading the Hit Table.

BLAST provides information in many formats for your protein homologs, which we encourage you to review. However, MolEvolvR requires the Hit Table (csv) for analysis.

Alternatively, you may upload command line BLAST results with these columns specified:

query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, % positives
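If you want to inspect a command-line hit table before uploading it, the 13 columns above can be read with Python's standard `csv` module. This is a minimal sketch under several assumptions. The file is assumed to be comma-separated with no header row. The short column names are our understanding of the BLAST+ field specifiers that correspond to the labels above, so verify them against your BLAST+ help output. The helper `read_hits` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, List

# Assumed short names for the 13 columns listed above, in the same order.
COLUMNS = [
    "qaccver", "saccver", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "ppos",
]


def read_hits(csv_text: str, max_evalue: float = 1e-5) -> List[Dict[str, str]]:
    """Read a headerless hit table and keep rows with evalue <= max_evalue."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    hits: List[Dict[str, str]] = []
    for row in reader:
        # DictReader marks extra fields with a None key and
        # missing fields with None values.
        if None in row or None in row.values():
            raise ValueError(f"expected {len(COLUMNS)} columns per row")
        if float(row["evalue"]) <= max_evalue:
            hits.append(row)
    return hits


# Made-up rows, not real BLAST output.
sample = (
    "Q1,S1,98.5,200,3,0,1,200,1,200,1e-50,400,99.0\n"
    "Q1,S2,30.0,80,50,2,5,85,10,90,0.5,30.2,45.0\n"
)
hits = read_hits(sample)
print([h["saccver"] for h in hits])
```

The default threshold of 1e-5 mirrors the E-value suggested earlier for general searches; pass a different `max_evalue` to loosen or tighten the filter.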

Check out the BLAST tutorials to learn more about BLAST.


InterProScan outputs from the Iprscan5 web-interface

InterProScan is available through EBI’s website.

If you have already identified your protein’s domains through InterProScan, you can upload the output to MolEvolvR for a customizable visual summary of the information.

To ensure compatibility with the MolEvolvR Start Analysis tab, follow these guidelines.

Step 1: Enter FASTA sequence

Input your protein’s FASTA sequence by copy/pasting into the box or uploading the FASTA file with the Choose file button. If the sequence is valid, InterProScan will display a green check mark in the bottom right corner of the input box. You can use the Advanced options dropdown to select specific databases. When finished, click the Search button to begin.