Pairwise

The pairwise command in DBRetina performs pairwise comparisons between supergroups based on their shared features. It requires the index prefix produced by the indexing step. The number of threads, the similarity metric and cutoff used for filtering, and the p-value calculation are optional.

Usage: DBRetina pairwise [OPTIONS]

  Calculate pairwise similarities.

Options:
  -i, --index-prefix TEXT   Index file prefix  [required]
  -t, --threads INTEGER     number of cores  [default: 1]
  -m, --metric TEXT         select from ['containment', 'jaccard', 'ochiai']
  -c, --cutoff FLOAT RANGE  filter out similarities < cutoff  [default: 0.0;
                            0<=x<=100]
  --pvalue                  calculate Hypergeometric p-value
  --help                    Show this message and exit.
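As an illustration of how the options above combine, the following minimal Python sketch assembles a `DBRetina pairwise` invocation from the documented flags only. The helper name `build_pairwise_command` and the prefix `idx_example` are hypothetical, and passing `-c` only together with `-m` is an assumption based on the argument descriptions on this page.

```python
import shlex


def build_pairwise_command(index_prefix, threads=1, metric=None, cutoff=0.0, pvalue=False):
    """Assemble the argument list for `DBRetina pairwise` (hypothetical helper)."""
    if not 0 <= cutoff <= 100:
        raise ValueError("cutoff must be within 0..100")
    cmd = ["DBRetina", "pairwise", "-i", index_prefix, "-t", str(threads)]
    if metric is not None:
        if metric not in ("containment", "jaccard", "ochiai"):
            raise ValueError(f"unsupported metric: {metric}")
        # Assumption: the cutoff is only meaningful when a metric is selected.
        cmd += ["-m", metric, "-c", str(cutoff)]
    if pvalue:
        cmd.append("--pvalue")
    return cmd


# "idx_example" is a placeholder for the prefix used in the indexing step.
cmd = build_pairwise_command("idx_example", threads=4, metric="ochiai", cutoff=20, pvalue=True)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the index prefix contains spaces.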

Command arguments

-i, --index-prefix TEXT Index file prefix [required]

This is the user-defined prefix that was used in the indexing step as an output prefix.

-t, --threads INTEGER number of cores [default: 1]

The number of processing cores to be used for parallel computation during the pairwise comparisons.

-m, --metric TEXT select from ['containment', 'jaccard', 'ochiai']

Optional. The similarity metric used, together with the cutoff (-c), to filter the exported results: pairwise comparisons whose value for this metric falls below the cutoff are not exported.

-c, --cutoff FLOAT RANGE filter out similarities < cutoff [default: 0.0; 0<=x<=100]

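The cutoff applied to the metric selected with -m. Similarities are expressed as percentages, so the value ranges from 0 to 100; pairs scoring below the cutoff are filtered out.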

--pvalue calculate Hypergeometric p-value

This flag calculates the hypergeometric p-value for each pairwise comparison, based on the features shared between the two supergroups and the total number of features in the database.
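To make the hypergeometric p-value concrete, here is a minimal Python sketch. The page states only that the p-value is derived from the hypergeometric distribution; the upper-tail form used here (the probability of observing at least the given number of shared features by chance) is a common choice for overlap tests and is an assumption, not a description of DBRetina's internal code. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(shared, size_1, size_2, universe):
    """Upper-tail hypergeometric probability P(X >= shared).

    Models drawing size_2 features from a universe in which size_1
    features belong to the first supergroup (assumed formulation).
    """
    upper = min(size_1, size_2)
    total = comb(universe, size_2)
    tail = sum(
        comb(size_1, k) * comb(universe - size_1, size_2 - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


# Two supergroups of 5 features each, sharing all 5, in a universe of 10 features.
print(hypergeom_pvalue(shared=5, size_1=5, size_2=5, universe=10))  # 1/252, about 0.00397
```

A smaller value indicates an overlap that is less likely to arise by chance given the two group sizes and the universe size.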

Output files

Primary output files

{index_prefix}_DBRetina_pairwise.tsv

A TSV file that provides information about shared features between each pair of supergroups. The TSV columns are defined as follows:

group_1_ID	ID of the first supergroup in a pair
group_2_ID	ID of the second supergroup in a pair
group_1_name	name of the first supergroup in a pair
group_2_name	name of the second supergroup in a pair
shared_features	number of features shared between the two supergroups
containment	containment similarity: the ratio of shared kmers to the smaller of the two kmer sets, calculated as 100 * (shared_kmers / minimum_source_kmers)
ochiai	Ochiai similarity, calculated as 100 * (shared_kmers / sqrt(source_1_kmers * source_2_kmers))
jaccard	Jaccard similarity percentage, calculated as 100 * (shared_kmers / (source_1_kmers + source_2_kmers - shared_kmers))
odds_ratio	odds ratio between the two supergroups, quantifying the strength of their association based on shared features; reported as -1 if the calculation encounters a division by zero
pvalue	hypergeometric p-value quantifying the statistical significance of the overlap between the two supergroups, given their sizes and the universe of all features
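The three similarity formulas in the table can be checked with a short Python sketch. It assumes that the `*_kmers` terms in the formulas are the feature counts of the two supergroups and of their intersection; the function name is hypothetical.

```python
from math import sqrt


def similarity_metrics(shared, size_1, size_2):
    """Return containment, ochiai, and jaccard as percentages."""
    return {
        # Shared features relative to the smaller supergroup.
        "containment": 100 * shared / min(size_1, size_2),
        # Shared features relative to the geometric mean of both sizes.
        "ochiai": 100 * shared / sqrt(size_1 * size_2),
        # Shared features relative to the size of the union.
        "jaccard": 100 * shared / (size_1 + size_2 - shared),
    }


m = similarity_metrics(shared=10, size_1=20, size_2=50)
print({name: round(value, 2) for name, value in m.items()})
# {'containment': 50.0, 'ochiai': 31.62, 'jaccard': 16.67}
```

For the same pair, containment is never smaller than ochiai, and ochiai is never smaller than jaccard, which is worth keeping in mind when choosing a cutoff for a given metric.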

{index_prefix}_DBRetina_similarity_metrics_plot_log.png

A clustered bar chart showing the frequency distribution of the three similarity metrics (containment, ochiai, and jaccard) across similarity ranges, with the y-axis on a logarithmic scale.

{index_prefix}_DBRetina_similarity_metrics_plot_linear.png

Same as above, but the y-axis is displayed on a linear scale.

Example plot (Log)

Example plot (Linear)
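The frequency distribution behind these plots can be approximated with a simple binning step, sketched below in Python. The bin width of 10 percentage points and the function name are assumptions for illustration; the ranges DBRetina actually uses are not stated on this page.

```python
def bin_counts(values, bin_width=10):
    """Count similarity percentages per range: [0,10), [10,20), ..., [90,100]."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for value in values:
        # A similarity of exactly 100 is placed in the last range.
        index = min(int(value // bin_width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical similarity percentages for one metric.
values = [5.0, 12.5, 15.0, 99.0, 100.0]
print(bin_counts(values))  # [1, 2, 0, 0, 0, 0, 0, 0, 0, 2]
```

Repeating this for each metric and drawing the counts side by side gives a clustered bar chart. A logarithmic y-axis makes sparsely populated ranges visible next to heavily populated ones.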

Advanced Output (For developers)

{index_prefix}_DBRetina_pairwise_stats.json: used to generate the similarity metrics plots.

{index_prefix}_DBRetina_pairwise_stats_odds_ratio.txt: odds-ratio metadata used by subsequent commands.

Last update: July 8, 2023
Created: July 5, 2023

Authors: mr-eyes