Select a row to see which lineages the domain architecture is present in
Load an existing analysis to explore MolEvolvR results right away without inputting anything.
FASTA upload: Full analysis, for 6 bacterial proteins
FASTA upload: Full analysis, for 1 viral protein
BLAST results upload, for 1 eukaryotic protein
InterProScan results upload, for 1 viral protein
Studying proteins through the lens of evolution can reveal conserved features, lineage-specific variants, and their potential functions. MolEvolvR (https://jravilab.org/molevolvr) is a novel web-app enabling researchers to visualize the molecular evolution of their proteins of interest in a phylogenetic context across the tree of life, spanning all superkingdoms. The web-app accepts multiple input formats — protein/domain sequences, homologous proteins, or domain scans — and, using a general-purpose computational workflow, returns detailed homolog data and dynamic graphical summaries (e.g., phylogenetic trees, multiple sequence alignments, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns across lineages). Thus, MolEvolvR is a powerful, easy-to-use web interface for computational protein characterization.
MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny
Jacob D Krol*, Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*,
Faisal S Alquaddoomi, Evan P Brenner, Ethan P Wolfe, Vincent P Rubinetti,
Shaddai Amolitos, Kellen M Reason, John B Johnson, Janani Ravi. [*Co-primary]
bioRxiv 2022. doi: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr
Submit your sequence or preprocessed data here and use your retrieval code (custom URL) to view the results.
An overview of the protein analysis. To view the full results, explore the additional tabs (or click the buttons!). After exploring results and configuring graphics, you may compile a report of your results using the "Generate Report" button below. You will be notified when the report is available for download.
Download Report
Visualizations and summaries for protein domains.
Visualizations for protein evolution
Summary table of proteins including domain architectures, phylogeny, and homologs, when applicable.
Input data, additional metadata, and preliminary analyses of query protein(s).
The data table provides a summary of the sequences submitted, or "queried", for analysis. The preview shown can be extended by using "Add/remove column(s)" to see info about other taxonomic classes as well as domain architecture codes from databases other than the default (Pfam).
Uploaded amino acid FASTA sequence(s).
A heatmap of submitted sequences and their respective taxonomic lineages.
Summary and visualizations of protein motifs/subunits (domains) and their configurations within the query protein(s) (domain architectures).
Full set of homologs of query sequences, including their lineage and domain architecture info.
All proteins are shown by default. Use the input below to select proteins to filter the table.
Visualizations and analyses of all query and homologous protein domains, structural or functional subunits, and their architectures.
Visualizations of phyletic patterns, sequence similarity, and evolution of related proteins.
* Analyses will appear only if there is at least one visible row that has values for that analysis
Help and FAQ docs for MolEvolvR
This website is free and open to all users; no login is required.
This help page shows how to use MolEvolvR to its fullest potential.
Coming Soon: Videos demonstrating what you can do with MolEvolvR, and how to set up custom analyses and navigate the app after loading your results.
Proteins are the functional units of cellular processes. The goal of MolEvolvR is to characterize proteins by their sequence, structure, function, and phylogeny by using sequence similarity, domain architecture, lineages/phyletic spread, and more.
You can explore a sample set of phage shock proteins (PSP) (e.g., lia operon from Bacillus subtilis here), and the full set of PSP proteins here. We analyzed the homology, domain architectures, and phylogeny of these proteins (and their genomic contexts) to show their prevalence in other organisms and to detail how variations of this phage shock stress response system are present across many lineages.
We have applied the approach underlying MolEvolvR to study diverse systems, including:
You can provide a variety of protein inputs, including:
.csv format
.tsv format
With any of these inputs, proteins of interest will be analyzed to identify homologs, determine domain architectures, and delineate phyletic spreads. These analyses provide insights into the biological role(s) of the protein(s) of interest within organisms, as well as trace their evolution.
MolEvolvR can perform 4 types of analyses:
To begin, enter the amino acid FASTA sequence(s) or
accession numbers of your protein(s) of interest into the
Start Analysis tab. You can also upload a file containing
multiple FASTA sequences (.fa, .faa,
.fasta), or accession numbers (.csv). Up to
100 protein sequences per job are accepted. For analyses with more than
100 proteins, please contact us.
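Before submitting, it can help to check locally how many sequences a FASTA file contains. The following is a minimal sketch, not part of MolEvolvR: the function names `read_fasta` and `within_job_limit` are hypothetical, and the only value taken from the text above is the limit of 100 proteins per job.

```python
MAX_PROTEINS_PER_JOB = 100  # limit stated in the help text above


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap; join them later
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def within_job_limit(records, limit=MAX_PROTEINS_PER_JOB):
    """Return True if there is at least one record and no more than `limit`."""
    return 0 < len(records) <= limit


# Made-up example input with two short sequences.
example = ">protA demo\nMKT\nLLV\n>protB demo\nMSE\n"
records = read_fasta(example)
print(len(records), within_job_limit(records))
```

Joining wrapped sequence lines means a multi-line record is counted once, which is what matters for the per-job limit.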
If you have a pre-existing set of homologous proteins, you can enter/upload the multiprotein FASTA or list of accession numbers. MolEvolvR can also use an MSA in FASTA/Pearson format generated through external programs such as Clustal Omega, ClustalW, Kalign, or MUSCLE.
Advanced options allow you to customize your analysis.
Selecting Phylogenetic Analysis will analyze a set of
known homologous proteins. Because this type of
analysis already uses homologs, the homology search option
will be disabled.
Selecting Homology Search will identify homologs
(related proteins) for each input protein. This pairs well with domain
architecture searches that can be obtained for all homologous hits for
each query.
Selecting Domain Architecture will generate domain
architectures (including sequence-structure motifs/domains such as Pfam,
HAMAP, and SignalP, disorder predictions using MobiDBlite, and secondary
structure/cellular localization using Phobius and Coils) for the query
proteins provided. If selected alone, no other analysis will be
performed.
This option allows you to search for proteins with homologous domains to those found within your query proteins. This allows for a broader search of remote homologs that would be missed by a standard whole-protein search. Phylogenetic searches, domain architecture, and characterization are then performed.
For analyses that include a homology search, you can adjust parameters like database (default refseq), maximum hits (default 100), and E-value (default 0.00001).
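For record-keeping, the defaults listed above can be captured in a small settings object. This is a sketch only: the class `HomologySearchParams` and its field names are hypothetical and do not correspond to any MolEvolvR API, and the `max_hits=500` override is an arbitrary illustrative value. Only the three default values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HomologySearchParams:
    """Hypothetical container for the homology search settings described above."""
    database: str = "refseq"  # default database
    max_hits: int = 100       # default maximum hits
    evalue: float = 1e-5      # default E-value (0.00001)


defaults = HomologySearchParams()
# Override only what differs from the defaults (values here are illustrative).
custom = HomologySearchParams(database="nr", max_hits=500)
print(defaults)
print(custom)
```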
You can filter your homology search results to specific organisms via
the Organism(s) to Include/Exclude dropdowns. Enter either
organism names or taxon IDs and the list will be dynamically filtered to
show only matching organisms, with your query highlighted in each list
item. The full taxonomic classification for each item is shown and is
searchable. Multiple selections are possible.
The first dropdown filters organisms to include; if empty, all organisms/taxa will be eligible for inclusion. The second dropdown filters organisms to exclude; if empty, no organisms/taxa will be excluded. If both dropdowns are used, the “include” filter will be applied first, followed by the “exclude” filter.
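The include-then-exclude ordering can be illustrated with a short sketch. It makes a simplifying assumption: each hit carries a single flat taxon ID, whereas real taxonomic filtering also covers descendants of a selected taxon. The function name `filter_by_taxa` and the hit records are made up for illustration.

```python
def filter_by_taxa(hits, include=(), exclude=()):
    """Apply the include filter first, then the exclude filter.

    An empty `include` keeps everything; an empty `exclude` removes nothing.
    Assumes flat taxon IDs (no parent/child lookups).
    """
    include, exclude = set(include), set(exclude)
    kept = [h for h in hits if not include or h["taxid"] in include]
    return [h for h in kept if h["taxid"] not in exclude]


# Illustrative hits; accessions are placeholders.
hits = [
    {"acc": "hit1", "taxid": 562},   # Escherichia coli
    {"acc": "hit2", "taxid": 1423},  # Bacillus subtilis
    {"acc": "hit3", "taxid": 9606},  # Homo sapiens
]

everything = filter_by_taxa(hits)
narrowed = filter_by_taxa(hits, include=[562, 1423], exclude=[1423])
print([h["acc"] for h in narrowed])
```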
The filters are applied to your currently selected BLAST database (e.g., refseq, nr).
In the backend, these options are passed to BLAST as
-taxids <comma-delimited list of IDs> for the
inclusion filter and
-negative_taxids <comma-delimited list of taxon IDs>
for exclusions. See Limiting a Search by taxonomy for NCBI’s
documentation on this feature.
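As a rough illustration of the comma-delimited format, the sketch below turns a list of taxon IDs into one flag and its value. The helper `taxid_flag` is hypothetical; how MolEvolvR assembles the full BLAST command, and how it combines the two filters in a single run, is not described here.

```python
def taxid_flag(flag, taxon_ids):
    """Return [flag, "id1,id2,..."] for a non-empty ID list, else an empty list."""
    if not taxon_ids:
        return []  # nothing selected: omit the flag entirely
    return [flag, ",".join(str(t) for t in taxon_ids)]


include_args = taxid_flag("-taxids", [562, 1423])
exclude_args = taxid_flag("-negative_taxids", [9606])
print(include_args)
print(exclude_args)
```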
A homology search requires you to enter/upload protein FASTA sequence(s) or accession number(s).
If given an accession number, MolEvolvR will find its
corresponding FASTA sequence to run through DELTA-BLAST, a variation of
BLASTP. DELTA-BLAST searches pre-constructed
position-specific scoring matrices (PSSMs) and the Conserved Domain
Database (CDD) for accelerated lookups. Once the BLAST homology search
completes, MolEvolvR clusters the resulting homolog sequences with BLASTClust and
annotates them with lineage metadata and, when the
Domain Architecture option is selected, domain architecture metadata.
You can start your analysis with a full list of accession numbers or
FASTA files for protein(s) of interest. MolEvolvR gathers
homologs of your input protein(s), and then performs domain architecture
analysis on all homolog and query sequences. You have the option to
perform only Phylogenetic Analysis or
Domain Architecture if you don’t need both.
You can start your analysis from uploaded NCBI BLAST or InterProScan results. Web-BLAST results allow you to determine homolog similarity, domain architecture, and/or phylogeny. InterProScan results are used to summarize and visualize domains and (if accession numbers are provided) phylogeny.
BLAST is available through NCBI’s website.
You can start your analysis with data from a previous BLAST run.
These data are run through BLASTClust to cluster similar sequences among
the retrieved homologs. The Phylogenetic Analysis and
Domain Architecture options are then applied.
To ensure compatibility with the MolEvolvR
Start Analysis tab, follow these guidelines.
First, enter your Accession Number(s) or FASTA sequence(s) into the “Enter Query Sequence” box.
For the database parameter, we support either the non-redundant
database (nr) or the reference sequence collection
(refseq_proteins). The refseq_proteins dataset
is a high quality, non-redundant subset of protein records curated by
NCBI staff. Meanwhile, nr is a larger, non-redundant set
that includes many more sequences but is not necessarily vetted for
quality and accuracy. If you would like to further filter your results
based on lineages (e.g., species, genus, family, kingdom), enter the
name/taxID in the Organism field and toggle the box to include/exclude
those results in your search.
Next, select which algorithm to run. If you don’t know details of your protein, ‘BLASTP’ is a great place to start. If you are interested in identifying remote homologs, we suggest using ‘PSI-BLAST’. If your protein has domains of interest, ‘DELTA-BLAST’ works very well.
Creating a job title for the run is optional and for your personal convenience.
Under the expandable “Algorithm parameters” section, the defaults for
max target sequences are typically sufficient. The expect threshold
value, or E-value, represents the number of matches expected by pure
random chance; hits with E-values greater than the threshold are
filtered out. We suggest 1e-5 (1 × 10^-5, or 0.00001) for general searches.
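The effect of the threshold can be shown with a few lines of code. This is a sketch with made-up hit names and E-values, not BLAST's own implementation: a hit is kept when its E-value is at or below the threshold and dropped when it is greater.

```python
E_VALUE_THRESHOLD = 1e-5  # the suggested value; identical to 0.00001


def passes_threshold(evalue, threshold=E_VALUE_THRESHOLD):
    """Keep a hit unless its E-value is greater than the threshold."""
    return evalue <= threshold


# Illustrative (name, E-value) pairs.
hits = [("hitA", 3e-40), ("hitB", 1e-5), ("hitC", 0.02)]
kept = [name for name, evalue in hits if passes_threshold(evalue)]
print(kept)
```

Lowering the threshold makes the search stricter, because fewer chance matches are tolerated.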
Double check your parameters across the page, then click the BLAST
button.
Summary of NCBI BLAST submission parameters
Once your BLAST search is complete, at the end of the RID row towards
the top of the page, there will be a Download All option
with a dropdown menu to download results. Click on the
Download All button and select the
Hit Table (csv) option. You can directly upload these
.csv result files to MolEvolvR. If the first
column of the results .csv does not include accession
numbers, you will also need to provide the query sequence(s) that you
used to run BLAST as a second file (.fa, .faa,
or .fasta format).
If you are performing a PSI-BLAST, you will have the option to run additional iterative searches upon each search’s completion. Further iterations will find more remote homologs, so it is recommended you run several iterations before downloading the Hit Table.
BLAST provides information in many formats for your protein homologs,
which we encourage you to review. However, MolEvolvR requires
the Hit Table (csv) for analysis.
Alternatively, you may upload command line BLAST results with these columns specified:
query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, % positives
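To check that a results file matches this layout before uploading, you could read it with the column names above. The sketch below assumes a comma-separated file with no header row and exactly these 13 columns in this order; the function name `read_hit_table` and the sample row are made up for illustration.

```python
import csv
import io

# Column names as listed above, in order.
BLAST_COLUMNS = [
    "query acc.ver", "subject acc.ver", "% identity", "alignment length",
    "mismatches", "gap opens", "q. start", "q. end", "s. start", "s. end",
    "evalue", "bit score", "% positives",
]


def read_hit_table(handle):
    """Read a headerless, comma-separated hit table into a list of dicts."""
    rows = []
    for fields in csv.reader(handle):
        # Skip blank lines and comment lines (assumed to start with '#').
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != len(BLAST_COLUMNS):
            raise ValueError(
                f"expected {len(BLAST_COLUMNS)} columns, got {len(fields)}"
            )
        rows.append(dict(zip(BLAST_COLUMNS, fields)))
    return rows


# One made-up row; accessions and numbers are placeholders.
sample = io.StringIO(
    "QUERY_1,SUBJECT_1.1,98.5,200,3,0,1,200,5,204,1e-120,400,99.0\n"
)
table = read_hit_table(sample)
print(table[0]["subject acc.ver"], table[0]["evalue"])
```

Values are kept as strings here; convert numeric columns (for example with `float`) only where you need them.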
Check out the BLAST tutorials to learn more about BLAST.
InterProScan is available through EBI’s website.
If you have already identified your protein’s domains through InterProScan, you can upload the output to MolEvolvR for a customizable visual summary of the information.
To ensure compatibility with the MolEvolvR
Start Analysis tab, follow these guidelines.
Input your protein’s FASTA sequence by copy/pasting into the box or
uploading the FASTA file with the Choose file button. If
the sequence is valid, InterProScan will display a green check mark in
the bottom right corner of the input box. You can use the
Advanced options dropdown to select specific databases.
When finished, click the Search button to begin.