Clustering the pairwise similarities

Graph-based clustering of the pairwise TSV file using the connected components algorithm. Pairs whose selected similarity metric is greater than the cutoff value are kept as edges, and the resulting clusters are written to a DBRetina clusters file.

Usage: DBRetina cluster [OPTIONS]

  Graph-based clustering of the pairwise TSV file.

Options:
  -p, --pairwise PATH       pairwise TSV file  [required]
  -m, --metric TEXT         select from ['containment', 'ochiai', 'jaccard',
                            'pvalue']  [required]
  --community               clusters as communities
  -c, --cutoff FLOAT RANGE  cluster the supergroups with (similarity > cutoff)
                            [0<=x<=100; required]
  -o, --output-prefix TEXT  output file prefix  [required]
  --help                    Show this message and exit.

Command arguments

-c, --cutoff FLOAT RANGE cluster the supergroups with (similarity > cutoff) [0<=x<=100; required]

The cutoff value for clustering the supergroups. Only pairs with a similarity greater than the cutoff are used. This option is required; a value of 0.0 includes all comparisons in the clustering.

-p, --pairwise PATH pairwise TSV file [required]

The pairwise TSV file generated by the DBRetina pairwise command or the DBRetina filter command.

-m, --metric TEXT select from ['containment', 'ochiai', 'jaccard', 'pvalue'] [required]

The similarity metric to apply the cutoff on.

--community clusters as communities

This flag clusters the supergroups with a community detection algorithm. Without it, the supergroups are clustered using the weakly connected components algorithm.

-o, --output-prefix TEXT output file prefix [required]

The user-defined prefix for the output files.
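The default behaviour described above (apply the cutoff, then take connected components) can be illustrated with a minimal sketch. This is not DBRetina's implementation: the function name `cluster_pairs`, the `(group_a, group_b, similarity)` tuple layout, and the use of a union-find structure are assumptions made for illustration, and the sketch leaves out supergroups that have no pair above the cutoff.

```python
from collections import defaultdict


def cluster_pairs(pairs, cutoff):
    """Group names into clusters via (weakly) connected components.

    pairs  : iterable of (group_a, group_b, similarity) tuples (hypothetical layout)
    cutoff : only pairs with similarity > cutoff become edges
    """
    parent = {}

    def find(x):
        # Locate the representative of x, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b, similarity in pairs:
        if similarity > cutoff:  # strict comparison, as in the help text
            union(a, b)  # edge direction is ignored (weak connectivity)

    clusters = defaultdict(list)
    for node in parent:
        clusters[find(node)].append(node)

    # Largest clusters first, members sorted for stable output.
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda m: (-len(m), m))


example_pairs = [
    ("A", "B", 90.0),
    ("B", "C", 85.0),
    ("C", "D", 10.0),
    ("E", "F", 80.0),
]
clusters_at_50 = cluster_pairs(example_pairs, 50)
clusters_at_80 = cluster_pairs(example_pairs, 80)
```

With these made-up pairs, raising the cutoff from 50 to 80 drops the `E`/`F` edge because the comparison is strict (80 is not greater than 80).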


Output files format

{output_prefix}_clusters.tsv

The DBRetina clusters TSV file. The first column is the cluster ID, the second column is the cluster size, and the third column lists the cluster members separated by the pipe character (|).
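As a rough sketch of how such a file could be read downstream, the snippet below parses the three columns described above. The helper name `read_clusters` is hypothetical, and whether the file starts with a header row is an assumption exposed through the `has_header` parameter; the sample header names are made up for the example.

```python
import csv
import io


def read_clusters(handle, has_header=True):
    """Parse a clusters TSV into {cluster_id: [members]}.

    Assumes three tab-separated columns: ID, size, pipe-separated members.
    has_header is an assumption; set it to False if the file has no header row.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if present
    clusters = {}
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        cluster_id, size, members = row[0], int(row[1]), row[2].split("|")
        if size != len(members):
            raise ValueError(f"cluster {cluster_id}: size column does not match member count")
        clusters[cluster_id] = members
    return clusters


# Made-up sample content; the header names here are illustrative only.
sample = "cluster_id\tcluster_size\tcluster_members\n1\t3\tA|B|C\n2\t2\tE|F\n"
parsed = read_clusters(io.StringIO(sample))
```

For a real file, pass an open file handle for `{output_prefix}_clusters.tsv` instead of the `io.StringIO` sample.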

{output_prefix}_clusters_histogram.png

A histogram of cluster sizes. Each bar corresponds to a size range, and the log-scale y-axis shows the number of clusters that fall within that range. This gives a quick view of how cluster sizes are distributed, highlighting common sizes and outliers.

Example histogram plot
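The numbers behind a histogram of this kind can be approximated with a short sketch that counts clusters per size range and reports the log10 of each count. The bin width and bin edges used by DBRetina are not documented here, so `bin_width=5` and the function name `size_histogram` are assumptions for illustration only.

```python
import math
from collections import Counter


def size_histogram(sizes, bin_width=5):
    """Count clusters per size range.

    Returns (low, high, count, log10_count) tuples for non-empty bins,
    where bins are 1..bin_width, bin_width+1..2*bin_width, and so on.
    """
    counts = Counter((size - 1) // bin_width for size in sizes)
    rows = []
    for bin_index in sorted(counts):
        low = bin_index * bin_width + 1
        high = low + bin_width - 1
        count = counts[bin_index]
        rows.append((low, high, count, math.log10(count)))
    return rows


# Made-up cluster sizes, e.g. taken from the second column of the clusters file.
example_sizes = [2, 3, 3, 7, 12, 12, 40]
histogram_rows = size_histogram(example_sizes)
```

A log scale keeps the rare, very large clusters visible next to the many small ones, which would otherwise dominate a linear axis.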

{output_prefix}_clusters_bubbles.png

A bubble plot that uses a grid layout to represent the distinct clusters. Both the bubble size and the color gradient encode the size of each cluster, which makes the plot useful for visualizing the distribution of cluster sizes and comparing clusters with each other.

Example Bubble plot

Last update: July 9, 2023
Created: July 5, 2023
Authors: mr-eyes