Select a row to see which lineages the domain architecture is present in
Load an existing analysis to explore MolEvolvR results right away without inputting anything.
FASTA upload: Full analysis, for 6 bacterial proteins
FASTA upload: Full analysis, for 1 viral protein
BLAST results upload, for 1 eukaryotic protein
InterProScan results upload, for 1 viral protein
Studying proteins through the lens of evolution can reveal conserved features, lineage-specific variants, and their potential functions. MolEvolvR (https://jravilab.org/molevolvr) is a novel web-app enabling researchers to visualize the molecular evolution of their proteins of interest in a phylogenetic context across the tree of life, spanning all superkingdoms. The web-app accepts multiple input formats — protein/domain sequences, homologous proteins, or domain scans — and, using a general-purpose computational workflow, returns detailed homolog data and dynamic graphical summaries (e.g., phylogenetic trees, multiple sequence alignments, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns across lineages). Thus, MolEvolvR is a powerful, easy-to-use web interface for computational protein characterization.
MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny
Jacob D Krol*, Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*,
Faisal S Alquaddoomi, Evan P Brenner, Ethan P Wolfe, Vincent P Rubinetti,
Shaddai Amolitos, Kellen M Reason, John B Johnston, Janani Ravi. [*Co-primary]
bioRxiv 2022. doi: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr
Submit your sequence or preprocessed data here and use your retrieval code (custom URL) to view the results.
An overview of the protein analysis. To view the full results, explore the additional tabs (or click the buttons!). After exploring results and configuring graphics, you may download a copy of your results using the "Download Report" button below.
Download Report
Visualizations and summaries for protein domains.
Visualizations for protein evolution
Summary table of proteins including domain architectures, phylogeny, and homologs, when applicable.
Input data, additional metadata, and preliminary analyses of query protein(s).
The data table provides a summary of the sequences submitted, or "queried", for analysis. The preview shown can be extended by using "Add/remove column(s)" to see info about other taxonomic classes as well as domain architecture codes from databases other than the default (Pfam).
Uploaded amino acid FASTA sequence(s).
A heatmap of submitted sequences and their respective taxonomic lineages.
Summary and visualizations of protein motifs/subunits (domains) and their configurations within the query protein(s) (domain architectures).
Full set of homologs of query sequences, including their lineage and domain architecture info.
Visualizations and analyses of all query and homologous protein domains, structural or functional subunits, and their architectures.
Visualizations of phyletic patterns, sequence similarity, and evolution of related proteins.
Help and FAQ docs for MolEvolvR
This website is free and open to all users, no login required.
This help page shows how to use MolEvolvR to its fullest potential.
Coming Soon: Videos demonstrating what you can do with MolEvolvR, and how to set up custom analyses and navigate the app after loading your results.
Proteins are the functional units of cellular processes. The goal of MolEvolvR is to characterize proteins by their sequence, structure, function, and phylogeny by using sequence similarity, domain architecture, lineages/phyletic spread, and more.
You can explore a sample set of phage shock proteins (PSP) (e.g., lia operon from Bacillus subtilis here), and the full set of PSP proteins here. We analyzed the homology, domain architecture, and phylogeny of these proteins (and their genomic contexts) to show their prevalence in other organisms and detail how variations of this phage shock stress response system are present across many lineages.
We have applied the approach underlying MolEvolvR to study diverse systems, including:
You can provide a variety of protein inputs, including:
.tsv format
.tsv format
With any of these inputs, proteins of interest will be analyzed to identify homologs, determine domain architectures, and delineate phyletic spreads. These analyses provide insights into the biological role(s) of the protein(s) of interest within organisms, as well as trace their evolution.
MolEvolvR can perform 4 types of analyses:
To begin, enter the amino acid FASTA sequence(s) or accession numbers of your protein(s) of interest into the Start Analysis tab. You can also upload a file containing multiple FASTA sequences (.fa, .faa, .fasta), or accession numbers (.csv). Up to 100 protein sequences per job are accepted. For analyses with more than 100 proteins, please contact us.
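Before uploading, it can help to confirm that a file stays within the 100-sequence limit. The following is a minimal Python sketch, assuming a plain-text FASTA file in which every record starts with a `>` header line. The function names and the `MAX_SEQUENCES` constant are illustrative and are not part of MolEvolvR.

```python
MAX_SEQUENCES = 100  # per-job limit stated above


def count_fasta_records(text: str) -> int:
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def within_limit(text: str, limit: int = MAX_SEQUENCES) -> bool:
    """Return True if the text holds at least 1 and at most `limit` records."""
    n = count_fasta_records(text)
    return 1 <= n <= limit


# Tiny made-up example with two records.
example = ">seq1\nMKT\n>seq2\nMPR\n"
print(count_fasta_records(example), within_limit(example))
```

A file with no `>` lines counts as zero records and fails the check, which also catches sequences pasted without headers.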
If you have a pre-existing set of homologous proteins, you can enter/upload the multiprotein FASTA or list of accession numbers. MolEvolvR can also use an MSA in FASTA/Pearson format generated through external programs such as Clustal Omega, ClustalW, Kalign, or MUSCLE.
Advanced options allow you to customize your analysis.
Selecting Phylogenetic Analysis will analyze a set of known homologous proteins. Because this type of analysis already uses homologs, the homology search option will be disabled.
Selecting Homology Search will identify homologs (related proteins) for each input protein. This pairs well with domain architecture searches that can be obtained for all homologous hits for each query.
Selecting Domain Architecture will generate domain architecture (including sequence-structure motifs/domains such as Pfam, Hamap and SignalP, disorder predictions using MobiDBlite, and secondary structure/cellular localization using Phobius and Coils) for the query proteins provided. If selected alone, no other analysis will be performed.
This option allows you to search for proteins with homologous domains to those found within your query proteins. This allows for a broader search of remote homologs that would be missed by a standard whole-protein search. Phylogenetic searches, domain architecture, and characterization are then performed.
For analyses that include a homology search, you can adjust parameters like database (default refseq), maximum hits (default 100), and E-value (default 0.00001).
A homology search requires you to enter/upload protein FASTA sequence(s) or accession number(s).
If given an accession number, MolEvolvR will find its corresponding FASTA sequence to run through DELTA-BLAST, a variation of BLASTP. DELTA-BLAST searches a database of pre-constructed position-specific scoring matrices (PSSMs) derived from the Conserved Domain Database (CDD) for accelerated lookups. Once the BLAST homology search completes, MolEvolvR clusters the resulting homolog sequences with BLASTClust and adds metadata on lineage and domain architecture (when the Domain Architecture option is selected).
You can start your analysis with a full list of accession numbers or FASTA files for protein(s) of interest. MolEvolvR gathers homologs of your input protein(s), and then performs domain architecture analysis on all homolog and query sequences. You have the option to perform only Phylogenetic Analysis or Domain Architecture if you don’t need both.
You can start your analysis from uploaded NCBI BLAST or InterProScan results. Web-BLAST results allow you to determine homolog similarity, domain architecture, and/or phylogeny. InterProScan results are used to summarize and visualize domains and (if accession numbers are provided) phylogeny.
BLAST is available through NCBI’s website.
You can start your analysis with data from a previous BLAST run. These data are run through BLASTClust to cluster similar sequences among the retrieved homologs. The Phylogenetic Analysis and Domain Architecture options are then applied.
To ensure compatibility with the MolEvolvR Start Analysis tab, follow these guidelines.
First, enter your Accession Number(s) or FASTA sequence(s) into the “Enter Query Sequence” box.
For the database parameter, we support either the non-redundant database (nr) or the reference sequence collection (refseq_proteins). The refseq_proteins dataset is a high quality, non-redundant subset of protein records curated by NCBI staff. Meanwhile, nr is a larger, non-redundant set that includes many more sequences but is not necessarily vetted for quality and accuracy. If you would like to further filter your results based on lineages (e.g., species, genus, family, kingdom), enter the name/taxID in the Organism field and toggle the box to include/exclude those results in your search.
Next, select which algorithm to run. If you don’t know details of your protein, ‘BLASTP’ is a great place to start. If you are interested in identifying remote homologs, we suggest using ‘PSI-BLAST’. If your protein has domains of interest, ‘DELTA-BLAST’ works very well.
Creating a job title for the run is optional and for your personal convenience.
Under the expandable “Algorithm parameters” section, the defaults for max target sequences are typically sufficient. The expect threshold, or E-value, is the number of matches expected by random chance; hits with values greater than the threshold are filtered out. We suggest 1e-5 (1 x 10^-5, or 0.00001) for general searches.
Double check your parameters across the page, then click the BLAST button.
Summary of NCBI BLAST submission parameters
Once your BLAST search is complete, a Download All option with a dropdown menu appears at the end of the RID row towards the top of the page. Click the Download All button and select the Hit Table (csv) option. You can directly upload these .csv result files to MolEvolvR. If the first column of the results .csv does not include accession numbers, you will also need to provide the query sequence(s) that you used to run BLAST as a second file (.fa, .faa, or .fasta format).
If you are performing a PSI-BLAST, you will have the option to run additional iterative searches upon each search’s completion. Further iterations will find more remote homologs, so it is recommended you run several iterations before downloading the Hit Table.
BLAST provides information in many formats for your protein homologs, which we encourage you to review. However, MolEvolvR requires the Hit Table (csv) for analysis.
Alternatively, you may upload command line BLAST results with these columns specified:
query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, % positives
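To check a command-line result file against this column layout before uploading, a short script can read the table and apply the suggested E-value cutoff. This is a minimal Python sketch, assuming tab-separated output with no header row and optional `#` comment lines. The snake_case column labels, the function name, and the sample rows are made up for illustration and are not MolEvolvR's own.

```python
import csv
import io

# Labels follow the column order listed above; the names themselves are our own.
COLUMNS = [
    "query_acc", "subject_acc", "pct_identity", "alignment_length",
    "mismatches", "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "evalue", "bit_score", "pct_positives",
]


def read_hits(handle, max_evalue=1e-5):
    """Yield hits as dicts, keeping rows whose E-value is <= max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Two made-up rows: one strong hit and one weak hit.
sample = (
    "Q1\tWP_000001.1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410\t99.0\n"
    "Q1\tWP_000002.1\t31.0\t80\t50\t2\t5\t84\t10\t89\t0.5\t30.1\t48.0\n"
)
hits = list(read_hits(io.StringIO(sample)))
print([h["subject_acc"] for h in hits])
```

Raising an error on a wrong column count surfaces files produced with a different output format early, rather than letting misaligned values through.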
Check out the BLAST tutorials to learn more about BLAST.
InterProScan is available through EBI’s website.
If you have already identified your protein’s domains through InterProScan, you can upload the output to MolEvolvR for a customizable visual summary of the information.
To ensure compatibility with the MolEvolvR Start Analysis tab, follow these guidelines.
Input your protein’s FASTA sequence by copy/pasting into the box or uploading the FASTA file with the Choose file button. If the sequence is valid, InterProScan will display a green check mark in the bottom right corner of the input box. You can use the Advanced options dropdown to select specific databases. When finished, click the Search button to begin.
When the search completes (another green check mark will appear under Status), click on your job submission to view the output.
Under the blue Export dropdown menu, download the results in the .tsv format. Upload the .tsv file to MolEvolvR for visualization and further analysis. If the first column of the results .tsv does not include accession numbers, you will also need to provide the query protein sequence that you used to run InterProScan as a second file (amino acid sequence in .fa, .faa, or .fasta format).
Check out the InterProScan tutorials to learn more about their algorithms and search parameters.
After submitting proteins to MolEvolvR, take a break! Runtime depends on server load and on the complexity of your submission (Full Analysis taking the most time), but you can expect this to take 10 minutes or more.
You will receive a six-character alphanumeric analysis code after submission. We recommend saving this code before you close the app. You will need to enter it later on the Retrieve Results tab to view your results.
Before submitting, you can provide an email to receive a link to your analysis. The app will also save any analyses you’ve submitted on your current device (laptop, phone, etc.), and will list them under the Retrieve Results tab.
The Result Summary tab provides a high-level overview/snapshot of your analysis results. You can explore your analysis fully with the detailed results and visualizations under the other tabs, as follows.
The Data Table tab shows the processed input data in tabular form. The default view includes query name, the species and lineage in which it is found, and Pfam domain architecture, but the table can be customized with the Add/remove column(s) button. Columns can be filtered for particular species, lineages, percentages, etc., and the entire table can be searched with plain text or regex. The full data table can be downloaded in .csv format with the Download as csv button.
All FASTA sequences for the query protein(s) are provided for ease of access.
A heatmap shows the occurrence of query protein(s) by taxonomic presence, which may be useful for multi-FASTA input of homologs.
A customizable domain architecture visualization shows the query protein(s) grouped by analysis or query. You can modify the domain plot by selecting the Analysis box and adding or removing results to display (e.g., Pfam, Phobius, Coils analyses).
The Homolog Data table lists the best hits from all superkingdoms of life (queried across all refseq or nr genomes). As in Query Data > Data Table described above, tabular details are provided across all homologs, including genome, species, lineage, and domain architecture information. Many homology-specific options, such as percent identity and cluster ID (BLAST parameters), are available in Add/remove column(s). The accession number for each homolog is linked to its corresponding NCBI protein page.
A protein’s domain architecture (DA) refers to the order of specific functional regions of a protein. Currently, MolEvolvR uses databases and prediction algorithms integrated with InterProScan to characterize the domain architecture of protein queries and their homologs. We summarize the data with a set of useful visualizations below. Results from Pfam, Phobius, Gene3D, SignalP_Gram_positive, SignalP_Gram_negative, MobiDBlite, Hamap, and Coils are available.
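To make the idea of "order of domains" concrete, the sketch below derives a domain architecture string from InterProScan-style rows by sorting one analysis's hits by start position. It is a minimal Python illustration, assuming the standard InterProScan TSV column order (protein ID first, analysis fourth, signature accession fifth, start position seventh). The `+` separator, the function name, and the sample rows are assumptions made for this example and may differ from how MolEvolvR encodes architectures internally.

```python
import csv
import io
from collections import defaultdict


def domain_architectures(handle, analysis="Pfam", sep="+"):
    """Map each protein ID to its domains from one analysis, ordered by start."""
    domains = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 8 or row[3] != analysis:
            continue  # skip short rows and other analyses
        protein, signature, start = row[0], row[4], int(row[6])
        domains[protein].append((start, signature))
    # Sorting the (start, signature) tuples orders domains along the protein.
    return {p: sep.join(sig for _, sig in sorted(hits)) for p, hits in domains.items()}


# Made-up rows: two Pfam hits listed out of order, plus one Coils hit.
sample = (
    "P1\tmd5\t300\tPfam\tPF00002\tdesc\t150\t280\t1e-10\tT\n"
    "P1\tmd5\t300\tPfam\tPF00001\tdesc\t10\t120\t1e-20\tT\n"
    "P1\tmd5\t300\tCoils\tCoil\tdesc\t130\t145\t-\tT\n"
)
architectures = domain_architectures(io.StringIO(sample))
print(architectures)
```

Passing a different `analysis` value (for example `"Coils"`) yields the architecture according to that database instead.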
The table provides summary statistics on the domain architecture data across all homologs, with the top (most frequent) domain architectures by query protein (or across all queries) and the frequencies of occurrence and lineages in which they occur. Click each row to view the domain architecture spread across lineages. A popup demonstrates the ‘LineageCount’ by showing the frequencies of occurrence by individual lineage for the selected domain architecture.
A color gradient heatmap across the query protein(s) indicates the number of homologs identified within each lineage per domain architecture.
Rows: Predominant domain architectures. Columns: Key lineages from across the superkingdoms of life.
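The counts behind such a table or heatmap amount to tallying homologs per (domain architecture, lineage) pair. The following is a minimal Python sketch with made-up architectures and lineages; the function name is hypothetical and the app's own computation may differ.

```python
from collections import Counter


def lineage_counts(homologs):
    """Count homologs for each (domain architecture, lineage) pair."""
    return Counter(homologs)


# Made-up (domain architecture, lineage) pairs, one per homolog.
homologs = [
    ("A+B", "Bacteria"),
    ("A+B", "Bacteria"),
    ("A+B", "Archaea"),
    ("A", "Bacteria"),
]
counts = lineage_counts(homologs)

# Total homologs per architecture, most frequent first.
totals = Counter()
for (da, _lineage), n in counts.items():
    totals[da] += n

print(counts[("A+B", "Bacteria")], totals.most_common(1))
```

Each heatmap cell corresponds to one entry of `counts`, while `totals` gives the ranking used to pick the predominant architectures.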
A network visualization summarizes domain architectures across query protein(s) and their homologs. Nodes represent domains, and edges connect domains that co-occur within a protein (domain architecture). Node size and edge thickness are proportional to their relative occurrences across homologs (or query proteins).
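The node and edge weights of such a network can be derived by counting domains and domain pairs across architectures. This is a minimal Python sketch, assuming architectures are written as `+`-separated domain names and that a domain repeated within one protein is counted once. Both are assumptions for illustration, not a description of MolEvolvR's implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(architectures, sep="+"):
    """Return (node_counts, edge_counts) for a list of architecture strings."""
    nodes, edges = Counter(), Counter()
    for da in architectures:
        # De-duplicate and sort so each pair has one canonical orientation.
        doms = sorted(set(da.split(sep)))
        nodes.update(doms)
        edges.update(combinations(doms, 2))
    return nodes, edges


# Made-up architectures for three homologs.
das = ["A+B", "A+B+C", "C"]
nodes, edges = cooccurrence(das)
print(nodes, edges)
```

Here `nodes` would set the node sizes and `edges` the edge thicknesses; a single-domain protein such as `"C"` adds to a node but contributes no edge.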
Each column of this visualization is organized by the database the domains were obtained from. The rows represent select query protein(s) and/or homologs (if a homology search was performed) with the lineage added to the front of the accession number. You can select specific proteins with the dropdown box, and choose to group rows by analysis or query. The visualization can also be updated by toggling available database options under Analysis, or by adjusting the Total Cutoff Count slider.
An UpSet plot is a helpful summary visualization that shows the frequencies of domains and domain architectures across all homologs. A histogram (to the left) shows the distribution of constituent domains across all homologs. The combination matrix displays the various combinations of domains present across the domain architectures. The adjoining second histogram (on top) shows the frequency of occurrence of each indicated domain architecture (combination).
Phylogenetic analysis of proteins provides key insights into their development and evolution. The conservation of certain portions through lineages or across domains of life could indicate the importance of the protein in certain biological processes.
An interactive sunburst plot shows the phyletic spread of the query protein (selected with the Protein dropdown) across life. Hovering over each section of the plot displays the lineage. The depth of displayed taxonomic levels can be adjusted with Number of Levels to add more detail to the sunburst plot.
This visualization is constructed from a multiple sequence alignment of representative homologs. Tree leaves are labeled by lineage, species (three-letter abbreviation), and accession numbers.
You can adjust the tree generation in two important ways: by choosing whether homologs are reduced to representative sequences (e.g., by lineage, species, or domain architecture), and by choosing the multiple sequence alignment (MSA) algorithm (including Clustal Omega, ClustalW, and MUSCLE). The size of the tree can be altered by selecting the desired number of sequences. To the right of the tree is a visualization of the multiple sequence alignment, colored by amino acid and showing overall conservation of sequence and structure of the homologs used in tree construction.
You can customize and download a multiple sequence alignment as a .pdf file, including a user-specified number of representative sequences among the homologs.
Data tables are filterable via global or column-specific search boxes and controls. Filters are applied across the Phylogeny and Domain Architecture tabs (indicated by a small, dismissible notification), allowing users to fine-tune their analysis.
Columns are searched appropriately based on the data they contain. For example, the AccNum column is text searchable, while PcPositive provides sliders to specify a range of values.
Table-wide search boxes support JavaScript-flavored regular expressions. These can be used to make advanced searches, e.g., Staphylococcus\saureus|Klebsiella\spneumoniae (search for Staphylococcus aureus OR Klebsiella pneumoniae).
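To see what this pattern matches, it can be tried outside the app. The sketch below uses Python's `re` module rather than JavaScript; for this particular pattern (`\s` for whitespace, `|` for alternation) the two regex flavors behave the same. The species list is made up, and whether the app's search box is case-sensitive is not assumed here.

```python
import re

# Same pattern as in the example above: either species name, with any
# single whitespace character between genus and species.
pattern = re.compile(r"Staphylococcus\saureus|Klebsiella\spneumoniae")

species = [
    "Staphylococcus aureus",
    "Klebsiella pneumoniae",
    "Escherichia coli",
    "Staphylococcus epidermidis",
]
matches = [s for s in species if pattern.search(s)]
print(matches)
```

Because each alternative includes the species epithet, other members of the same genus (such as the last entry) are not matched.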
This web-app is regularly tested on the following:
We only use standardized and widely-supported HTML, CSS, and JavaScript features, so any other modern, standard-compliant browser such as Opera or Microsoft Edge should also work, even if not explicitly tested.
The following are NOT supported, and may result in unexpected look or behavior:
If you encounter a bug, please let us know!
R, InterProScan, BLAST+, edirect, FastTree, MUSCLE, Phobius, TMHMM, HMMER
NCBI Taxonomy, NCBI GenBank/RefSeq; BLAST RefSeq, NR DB; InterPro
ape, biomartr, cowplot, d3r, DT, gganimate, gggenes, ggraph, ggsci, ggthemes, ggtree, ggvis, gh, gridExtra, heatmap3, heatmaply, htmlwidgets, httr, igraph, knitr, latexpdf, pdftools, phangorn, phylogram, phylotools, phytools, plotly, rentrez, reutils, rmarkdown, seqinr, seqRFLP, shiny, shinydashboard, sunburstR, tidytext, tidytree, tidyverse, tinytex, UpSetR, viridis, visNetwork, wordcloud, wordcloud2
Coming Soon.
We will provide video tutorials covering: how to load your data (accession numbers, FASTA file, web-BLAST results, web-InterProScan results), how to run your analyses, how to load your analysis after it has been processed, how to navigate the web-app, and how to download publication-ready figures and data!
If you have used our web-app to generate any results for your publication or presentations, please cite us as follows:
MolEvolvR: a web-app for characterizing proteins using molecular evolution and phylogeny. Jacob D Krol*, Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*, Faisal S Alquaddoomi, Evan P Brenner, Ethan P Wolfe, Vincent P Rubinetti, Shaddai Amolitos, Kellen M Reason, John B Johnston, Janani Ravi. [*Co-primary] bioRxiv 2022.02.18.461833; doi: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr
Questions? Email us at janani.ravi@cuanschutz.edu.
We would like to thank our funding sources: Endowed Research Funds from the College of Veterinary Medicine, Michigan State University, NSF-funded BEACON funding support, and the University of Colorado Anschutz start-up funds awarded to JR; NSF-funded REU-ACRES summer scholarship to SZC; NIH NIAID U01AI176414 to JR; NIH NLM T15LM009451 to EPB.
Yes, if you supplied an (optional) email on the submission page, then an email will be sent to confirm when a job is ready.
>OHS91782.1 16S rRNA pseudouridine(516) synthase [Staphylococcus aureus]
MRIDKFLANMGVGTRNEVKQLLKKGLVNVNEQVIKSPKTHIEPENDKITVRGELIEYIENVYIMLNKPKG
YISATEDHHSKTVIDLIPEYQHLNIFPVGRLDKDTEGLLLITNDGDFNHELMSPNKHVSKKYEVISANPI
TEDDIQAFKEGVTLTDGKVKPAILTYIDNQTSHVTIYEGKYHQVKRMFHSIQNEVLHLRRIKIADLELDS
NLDSGEYRLLTENDFDKLNYK
>sp|P01189|COLI_HUMAN Pro-opiomelanocortin OS=Homo sapiens OX=9606 GN=POMC PE=1 SV=2
MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM
FPGNGDEQPLTENPRKYVMGHFRWDRFGRRNSSSSGSSGAGQKREDVSAGEDCGPLPEGG
PEPRSDGAKPGPREGKRSYSMEHFRWGKPVGKKRRPVKVYPNGAEDESAEAFPLEFKREL
TGQRLREGDGPDGPADDGAGAQADLEHSLLVAAEKKDEGPYRMEHFRWGSPPKDKRYGGF
MTSEKSQTPLVTLFKNAIIKNAYKKGE
Custom FASTA header (not recommended)
>SEQUENCE154 UNKNOWN
MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM
FPG
The application uses NCBI or UniProt accessions to get taxonomy info from query proteins. Therefore, it is recommended to include valid protein accession numbers in the header when possible.
No header lines (missing > header delimiter)
MRIDKFLANMGVGTRNEVKQLLKKGLVNVNEQVIKSPKTHIEPENDKITVRGELIEYIENVYIMLNKPKG
MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM
Duplicate headers/accnums
>GCF_000013425.1
MVPEEKGSITLSKEAAIIFAIAKFKPFKNRIKNNPQKTNPFLKLHENKKS
>GCF_000013425.1
MKQKKSKNIFWVFSILAVVFLVLFSFAVGASNVPMMILTFILLVATFGIGFTTKKKYRENDWL
>protein
MKLTLMKFFVGGFAVLLSYIVSVTLPWKEFGGIFATFPAVFLVSMFITGMQYGDKVAVHVSRGAVFGMTGVLVCILVTWM
MLHMTHMWLISIVVGFLSWFISAVCIFEAVEFIAQKRLEKHSWKAGKSNSK
>protein
MVKRTYQPNKRKHSKVHGFRKRMSTKNGRKVLARRRRKGRKVLSA
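The two problems illustrated above (missing `>` headers and duplicate headers) can be caught with a quick check before submission. This is a minimal Python sketch, assuming the identifier is the first whitespace-delimited token after `>`. The function name and messages are illustrative and do not reflect MolEvolvR's own validation.

```python
from collections import Counter


def check_fasta(text):
    """Return a list of problems found in FASTA text (empty if none)."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        problems.append("first line is not a '>' header")
    # The identifier is taken as the first token after '>'.
    ids = [ln[1:].split()[0] for ln in lines if ln.startswith(">") and ln[1:].split()]
    for seq_id, n in Counter(ids).items():
        if n > 1:
            problems.append(f"duplicate identifier: {seq_id}")
    return problems


print(check_fasta(">protein\nMKLT\n>protein\nMVKR\n"))
print(check_fasta("MRIDKFLANMG\nMPRSCCSRSG\n"))
```

An empty list means neither problem was detected; it does not confirm that the identifiers are valid NCBI or UniProt accessions.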
Upon submission, a URL to retrieve the results will be displayed. The link provides job progress info and, once the job has finished, the results.
Recommendations:
Key factors of job duration:
Number of sequences submitted
Number of homologs to search for each sequence (Advanced Options > Maximum Hits)
Length & complexity of sequences
The computational methods underlying MolEvolvR have enabled understanding fundamental biological systems and protein evolution.
In this section, companion MolEvolvR jobs for proteins studied in these publications are provided for users to explore.