Apply set-cover and deduplication algorithms

Algorithm Overview

DBRetina presents a robust set-cover algorithm that functions within a complex system of parameters. This documentation provides a comprehensive overview of the underlying mechanisms at work, guiding you through the processes.

The set cover algorithm executes several complex computations to achieve its end goals. Here's an ordered breakdown of the process:

Community Detection: Using the value specified by --community (default: 30), the algorithm calculates Ochiai similarity and applies a community detection method on the primary pairwise similarities file, producing clusters of groups. This step serves to calculate the Cluster Specificity Index (CSI) for each group.
Group Pleitropy Index (GPI) Calculation: Next, the algorithm computes the GPI for each group.
Modularity Calculation: Using the containment cutoff specified by --modularity (default: 80), the algorithm processes the main pairwise similarities file to compute three key metrics: Fragmentation, Heterogeneity, and Modularity.
Fragmentation is represented as the negative of the number of outbound edges.
Heterogeneity is the positive of the number of inbound edges.
Modularity is the absolute value of the sum of Heterogeneity and Fragmentation. The group with the lowest modularity value has the highest modularity.
Deduplication: The --dedup parameter specifies the Ochiai similarity threshold (default: 100%) for deduplication, creating an undirected graph. The algorithm applies a weakly connected components method, selecting the group with the highest number of edges from each connected component. In case of a tie, the largest group is selected. A default value of 100% is utilized to avoid item loss. Note that the --dedup argumnt is similar to the one in DBRetina dedup.
Set Cover Application: Finally, the set cover algorithm sorts the groups by their modularity, CSI, and size, subsequently subtracting shared items from the universal set. This process continues until the --stop-cov percentage of items are covered. By default, the algorithm stops only when 100% of items are covered, thereby ensuring maximum coverage.

DBRetina's set cover algorithm offers granular control over the balance between complexity and precision in handling large data sets. By adjusting the various parameters, users can tailor the algorithm's performance to meet specific project requirements.

Usage

Usage: DBRetina setcov [OPTIONS]

  Apply set-cover algorithm.

Options:
  -i, --index-prefix TEXT   index file prefix  [required]
  --modularity FLOAT RANGE  containment cutoff for modularity calculation
                            [default: 80; 0<=x<=100]
  --dedup FLOAT RANGE       deduplication similarity cutoff  [default: 100;
                            0<=x<=100]
  --community FLOAT RANGE   community detection similarity cutoff  [default:
                            30; 0<=x<=100]
  --stop-cov FLOAT RANGE    stop when items covered by %  [0<=x<=100]
  -o, --output TEXT         output file prefix  [required]
  --help                    Show this message and exit.

Command arguments

-i, --index-prefix TEXT index file prefix [required]

This is the user-defined prefix that was used in the indexing step.

--modularity FLOAT RANGE containment cutoff for modularity calculation [default: 80; 0<=x<=100]

This parameter specifies the containment cutoff for modularity calculation. The default value is 80%.

--dedup FLOAT RANGE deduplication similarity cutoff [default: 100; 0<=x<=100]

This parameter specifies the Ochiai similarity threshold for deduplication. The default value is 100% which means no items loss and deduplicate only identical groups.

--community FLOAT RANGE community detection similarity cutoff [default: 30; 0<=x<=100]

This parameter specifies the Ochiai similarity threshold for community detection. The default value is 30%.

--stop-cov FLOAT RANGE stop when items covered by % [0<=x<=100]

This parameter specifies the percentage of items to be covered before stopping the set cover algorithm. The default value is 100% which means the algorithm will stop only when all items are covered.

-o, --output TEXT output file prefix [required]

This is the user-defined prefix for the output files.

Output files format

{output_prefix}_item_to_GPI_CSI.tsv

A TSV file with the following columns:

Item name	The item name (i.e. gene, protein)
GPI	Group Pleitropy Index
CSI	Cluster Specificity Index

{output_prefix}_groups_metadata.tsv

A TSV file with the following columns:

Column	Description
group	The group name (node name)
no_of_items	Number of items in the group
average_gpi	Group's average Pleitropy Index
average_CSI	Group's average Cluster Specificity Index
fragmentation	Number of outbound edges from the node (-ve value)
heterogeneity	Number of inbound edges to the node
modularity	absolute(fragmentation + heterogeneity)
status	The status of the group. `set-cov`: removed by the set-cov. `dedup`: removed by deduplication. `remained`: remained group.

{output_prefix}_remaining_groups_metadata.tsv

Same as {output_prefix}_groups_metadata.tsv but only for the remaining groups after set cover and without the status column.

{output_prefix}_removed_groups_metadata.tsv

Same as {output_prefix}_groups_metadata.tsv but only for the removed groups after set cover and without the status column.

{output_prefix}_associations.tsv

A new association files as described in DBretina index section. This file contains the final remaining groups after set cover and deduplication.

{output_prefix}_new.gmt

A new GMT file as described in DBretina index section. This file contains the final remaining groups after set cover and deduplication.

{output_prefix}_original.gmt

The original GMT file as described in DBretina index section. This file contains the original groups before set cover and deduplication.

Last update: July 9, 2023
Created: July 9, 2023

Authors: mr-eyes