Please note: this site is undergoing maintenance, and processing functionality has been temporarily disabled. For the SEEKR paper, see here.

Welcome to the webapp interface for the SEEKR (SEquence Evaluation through Kmer Representation) algorithm. SEEKR is a novel algorithm used to quantify similarities and differences in genomic sequences, particularly those of long non-coding RNAs (lncRNAs).

SEEKR can be used to:

  1. Count kmers in a set of transcripts
  2. Make all pairwise similarity comparisons between transcripts in a fasta file of interest
  3. Find new transcripts similar to current transcripts of interest
  4. Compare potentially functionally-homologous transcripts between species
  5. Build communities of similar transcripts
  6. Perform many other transcript similarity calculations
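As an illustration of the first item, kmer counting can be sketched in a few lines of Python. This is a minimal sketch, not SEEKR's own code: the function name `count_kmers`, the sliding window with a step of one, and the choice to skip windows containing characters other than A, C, G, and T are all assumptions made for the example.

```python
from itertools import product


def count_kmers(sequence, k):
    """Count overlapping kmers of length k in a nucleotide sequence.

    Hypothetical helper for illustration; windows containing characters
    other than A, C, G, or T (for example N) are skipped (an assumption).
    """
    sequence = sequence.upper().replace("U", "T")
    # Start every possible kmer at zero so absent kmers are still reported.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGT", 2)
print(counts["AC"], counts["TA"], counts["AA"])
```

Initializing every kmer to zero keeps the profiles of different transcripts aligned column by column, which is what later comparisons between transcripts rely on.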

If you use SEEKR, please cite the SEEKR paper:

Kirk, J. M., Kim, S. O., Inoue, K., Smola, M. J., Lee, D. M., Schertzer, M. D., … Calabrese, J. M. (2018). Functional classification of long non-coding RNAs by k-mer content. Nature Genetics, 50(10), 1474–1482. https://doi.org/10.1038/s41588-018-0207-8



To run larger fasta files or kmer sizes, download SEEKR from our GitHub repository.


Input

The SEEKR web portal requires four options to be set before it can run: a User set, a Comparison set, a normalization set, and a kmer length.


The User fasta file should contain the transcripts that the user wants to study with SEEKR. The file must be in fasta format. Two sample fasta files are provided:

sample2.fa
sample15.fa
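Before uploading, it can help to check that a file parses as fasta and how many sequences it holds. The following is a minimal sketch under stated assumptions: `read_fasta`, `can_visualize`, and `VISUALIZATION_LIMIT` are hypothetical names, and the cutoff of fewer than 200 sequences is taken from this page's visualization note, so the exact boundary may differ in the app.

```python
def read_fasta(lines):
    """Parse fasta-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Assumed cutoff: sets must hold fewer than 200 sequences to be visualized.
VISUALIZATION_LIMIT = 200


def can_visualize(records, limit=VISUALIZATION_LIMIT):
    """Return True if the set is small enough to visualize (assumed rule)."""
    return len(records) < limit


example_text = ">seq1 example\nACGT\nACGT\n>seq2\nGGCC\n"
records = read_fasta(example_text.splitlines())
print(len(records), can_visualize(records))
```

To check a real file, pass an open file handle instead of the example lines, for instance `read_fasta(open("sample2.fa"))`.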

A second fasta file, the Comparison set, must be selected to compare against the transcripts in the User fasta file. There are three options for this file:

  1. User Set – This will compare the User fasta file to itself.
  2. Preloaded GENCODE fasta files – The latest human and mouse lncRNA annotations from GENCODE.
  3. Upload Set – The user can upload a second fasta file for comparison. These lncRNAs will occupy the y-axis in the Pearson’s comparison matrix. For example, a user may want to compare a group of lncRNAs to a set of lncRNAs of known function, such as Xist, Kcnq1ot1, and Airn, or to compare lncRNAs in the User set to lncRNAs from another genome.

After calculating the abundance of each kmer in each lncRNA and normalizing for lncRNA length, SEEKR calculates a z-score for each kmer in each lncRNA by subtracting the mean length-normalized abundance of that kmer in the normalization set and dividing by the standard deviation. For small User sets, it can be useful to use a large normalization set, such as all GENCODE mouse or human lncRNAs, to determine how similar the lncRNAs in the User set are to each other relative to the kmer frequencies of lncRNAs from a complete genome annotation. In other cases, it may be useful to determine how similar the lncRNAs in the User set are to each other relative to their own distribution of kmer frequencies or those from the Comparison set. The options for the normalization set are:

  1. All Human lncRNAs (GENCODE) – The most recent human GENCODE annotations.
  2. All Mouse lncRNAs (GENCODE) – The most recent mouse GENCODE annotations.
  3. User Set – The transcripts provided in the User fasta file.
  4. Comparison Set – The transcripts provided in the Comparison fasta file.
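The z-score step can be sketched as follows. This is a minimal sketch under stated assumptions rather than SEEKR's implementation: profiles are plain lists of length-normalized kmer abundances in a shared kmer order, the population standard deviation is used, and a kmer with zero variance in the normalization set is given a z-score of 0.0. The names `kmer_stats` and `zscore` are hypothetical.

```python
from statistics import mean, pstdev


def kmer_stats(normalization_profiles):
    """Per-kmer mean and standard deviation across the normalization set."""
    columns = list(zip(*normalization_profiles))
    means = [mean(column) for column in columns]
    # Assumption: population standard deviation.
    stdevs = [pstdev(column) for column in columns]
    return means, stdevs


def zscore(profile, means, stdevs):
    """Convert one length-normalized profile into per-kmer z-scores."""
    # Assumption: kmers with zero variance get a z-score of 0.0.
    return [
        (value - m) / s if s > 0 else 0.0
        for value, m, s in zip(profile, means, stdevs)
    ]


# Three toy transcripts, two kmers each (length-normalized abundances).
normalization_set = [[0.10, 0.30], [0.20, 0.30], [0.30, 0.30]]
means, stdevs = kmer_stats(normalization_set)
z = zscore([0.30, 0.30], means, stdevs)
print(z)
```

Choosing a different normalization set changes only `means` and `stdevs`, which is why the same User set can look more or less self-similar depending on the option selected.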

The kmer length is the size of the kmer to count in the fasta files. We recommend k=6 for most applications, though smaller values of k will process much faster, because the number of possible kmers grows as 4^k.
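To see how quickly the kmer space grows with k, the possible kmers over a four-letter alphabet can be enumerated directly. This is an illustrative sketch; `all_kmers` is a hypothetical helper, not part of SEEKR.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """List every possible kmer of length k over the given alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


# Number of columns in a kmer profile for each k from 1 to 7.
sizes = {k: len(all_kmers(k)) for k in range(1, 8)}
for k, size in sizes.items():
    print(k, size)
```

Each step up in k multiplies the number of kmer columns by four, so k=6 gives 4096 columns and k=7 gives 16384.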



Output

Press the Submit button and the algorithm's output will be visualized in the ‘Results’ tab. Files containing over 200 RNA sequences and kmer lengths above 6 will not produce visuals.


The Sequence Kmer Profiles visual shows the kmer count profiles for lncRNAs in the User Set. Each lncRNA is a row on the y-axis, and each kmer is a column on the x-axis. The lncRNAs are arranged according to a hierarchical clustering of the kmer contents of each sequence. Clicking on a row or column label will re-order the graph based on the values in the row or column clicked. For User sets with >200 lncRNAs, the .csv is available for download (no visualization).

In the Pearson’s comparison matrix, the x-axis represents sequences from the User fasta file, and the y-axis represents sequences from the Comparison set. Each element in the matrix is the Pearson correlation (R value) between the two sequences in the given row and column. The sequences are automatically arranged according to the same hierarchical clustering as the Sequence Kmer Profiles visual. Clicking on a row or column label will re-order the graph based on the values in the row or column clicked.
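The comparison matrix can be sketched as a Pearson correlation between every pair of kmer profiles. This is a minimal sketch, not SEEKR's code: `pearson` and `comparison_matrix` are hypothetical names, profiles are plain lists of per-kmer values in a shared kmer order, and returning NaN for a profile with no variance is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation R between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: correlation is undefined for a constant profile.
        return float("nan")
    return covariance / sqrt(var_x * var_y)


def comparison_matrix(user_profiles, comparison_profiles):
    """Rows follow the Comparison set (y-axis), columns the User set (x-axis)."""
    return [
        [pearson(user, comp) for user in user_profiles]
        for comp in comparison_profiles
    ]


user_profiles = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
comparison_profiles = [[2.0, 4.0, 6.0]]
matrix = comparison_matrix(user_profiles, comparison_profiles)
print(matrix)
```

When the Comparison set is the User set itself, the matrix is square and symmetric, with values of 1 on the diagonal.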