Apply set-cover and deduplication algorithms
Algorithm Overview
DBRetina presents a robust set-cover algorithm that functions within a complex system of parameters. This documentation provides a comprehensive overview of the underlying mechanisms at work, guiding you through the processes.
The set cover algorithm executes several complex computations to achieve its end goals. Here's an ordered breakdown of the process:
-
Community Detection: Using the value specified by
--community
(default: 30), the algorithm calculates Ochiai similarity and applies a community detection method on the primary pairwise similarities file, producing clusters of groups. This step serves to calculate the Cluster Specificity Index (CSI) for each group. -
Group Pleitropy Index (GPI) Calculation: Next, the algorithm computes the GPI for each group.
-
Modularity Calculation: Using the containment cutoff specified by
--modularity
(default: 80), the algorithm processes the main pairwise similarities file to compute three key metrics: Fragmentation, Heterogeneity, and Modularity. - Fragmentation is represented as the negative of the number of outbound edges.
- Heterogeneity is the positive of the number of inbound edges.
-
Modularity is the absolute value of the sum of Heterogeneity and Fragmentation. The group with the lowest modularity value has the highest modularity.
-
Deduplication: The
--dedup
parameter specifies the Ochiai similarity threshold (default: 100%) for deduplication, creating an undirected graph. The algorithm applies a weakly connected components method, selecting the group with the highest number of edges from each connected component. In case of a tie, the largest group is selected. A default value of 100% is utilized to avoid item loss. Note that the--dedup
argumnt is similar to the one inDBRetina dedup
. -
Set Cover Application: Finally, the set cover algorithm sorts the groups by their modularity, CSI, and size, subsequently subtracting shared items from the universal set. This process continues until the
--stop-cov
percentage of items are covered. By default, the algorithm stops only when 100% of items are covered, thereby ensuring maximum coverage.
DBRetina's set cover algorithm offers granular control over the balance between complexity and precision in handling large data sets. By adjusting the various parameters, users can tailor the algorithm's performance to meet specific project requirements.
Usage
Usage: DBRetina setcov [OPTIONS]
Apply set-cover algorithm.
Options:
-i, --index-prefix TEXT index file prefix [required]
--modularity FLOAT RANGE containment cutoff for modularity calculation
[default: 80; 0<=x<=100]
--dedup FLOAT RANGE deduplication similarity cutoff [default: 100;
0<=x<=100]
--community FLOAT RANGE community detection similarity cutoff [default:
30; 0<=x<=100]
--stop-cov FLOAT RANGE stop when items covered by % [0<=x<=100]
-o, --output TEXT output file prefix [required]
--help Show this message and exit.
Command arguments
-i, --index-prefix TEXT index file prefix [required]
This is the user-defined prefix that was used in the indexing step.
--modularity FLOAT RANGE containment cutoff for modularity calculation [default: 80; 0<=x<=100]
This parameter specifies the containment cutoff for modularity calculation. The default value is 80%.
--dedup FLOAT RANGE deduplication similarity cutoff [default: 100; 0<=x<=100]
This parameter specifies the Ochiai similarity threshold for deduplication. The default value is 100% which means no items loss and deduplicate only identical groups.
--community FLOAT RANGE community detection similarity cutoff [default: 30; 0<=x<=100]
This parameter specifies the Ochiai similarity threshold for community detection. The default value is 30%.
--stop-cov FLOAT RANGE stop when items covered by % [0<=x<=100]
This parameter specifies the percentage of items to be covered before stopping the set cover algorithm. The default value is 100% which means the algorithm will stop only when all items are covered.
-o, --output TEXT output file prefix [required]
This is the user-defined prefix for the output files.
Output files format
{output_prefix}_item_to_GPI_CSI.tsv
A TSV file with the following columns:
Item name | The item name (i.e. gene, protein) |
GPI | Group Pleitropy Index |
CSI | Cluster Specificity Index |
{output_prefix}_groups_metadata.tsv
A TSV file with the following columns:
Column | Description |
---|---|
group | The group name (node name) |
no_of_items | Number of items in the group |
average_gpi | Group's average Pleitropy Index |
average_CSI | Group's average Cluster Specificity Index |
fragmentation | Number of outbound edges from the node (-ve value) |
heterogeneity | Number of inbound edges to the node |
modularity | absolute(fragmentation + heterogeneity) |
status | The status of the group. `set-cov`: removed by the set-cov. `dedup`: removed by deduplication. `remained`: remained group. |
{output_prefix}_remaining_groups_metadata.tsv
Same as {output_prefix}_groups_metadata.tsv but only for the remaining groups after set cover and without the status
column.
{output_prefix}_removed_groups_metadata.tsv
Same as {output_prefix}_groups_metadata.tsv but only for the removed groups after set cover and without the status
column.
{output_prefix}_associations.tsv
A new association files as described in DBretina index
section. This file contains the final remaining groups after set cover and deduplication.
{output_prefix}_new.gmt
A new GMT file as described in DBretina index
section. This file contains the final remaining groups after set cover and deduplication.
{output_prefix}_original.gmt
The original GMT file as described in DBretina index
section. This file contains the original groups before set cover and deduplication.
Created: July 9, 2023