Indexing (Index the input data files)

The indexing step in DBRetina builds an index structure for the input entities and their associated features. The "pairwise" command uses this index to calculate pairwise distances between input entities, and the "query" command uses it to look up the entities associated with one or more features.

Usage: DBRetina index [OPTIONS]

  Index the input data files.

Options:
  -a, --asc TEXT     associations file col1: supergroup, col2: single feature. 1st
                     line is header.
  -g, --gmt TEXT     GMT file(s)
  -o, --output TEXT  output file prefix  [required]
  --help             Show this message and exit.

Command arguments

Warning

The text of all input files is automatically converted to lowercase.
All double quotes are removed.

Danger

The pipe character (|) can't be used in the input data.
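
The normalization rules and the pipe restriction above can be approximated in a few lines of Python, for example to pre-check data before indexing. This is a minimal sketch, not DBRetina's own code: `normalize_field` is a hypothetical helper name, and raising an error when a pipe is found is an assumption about how a pre-check might behave.

```python
def normalize_field(text: str) -> str:
    """Apply the documented normalization: lowercase and drop double quotes.

    Hypothetical pre-check helper; it is not part of DBRetina.
    """
    if "|" in text:
        # The pipe character is not allowed in DBRetina input data.
        raise ValueError(f"pipe character found in input: {text!r}")
    return text.lower().replace('"', "")


print(normalize_field('"Breast Cancer"'))  # breast cancer
```

Running such a check before indexing makes it easier to predict how names will appear in the output files.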

-a, --asc TEXT associations file(s) col1: supergroup, col2: single feature. 1st line is header.

The "Association File" is a two-column TSV (tab-separated values) file with an included header. The first column denotes groups, while the second column indicates the features associated with each respective group. Each row signifies a single feature and its corresponding group.

Example of an Association File with genes as features:

Disease       feature
Breast Cancer BRCA1
Breast Cancer BRCA2
Lung Cancer   EGFR
Lung Cancer   KRAS
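
To make the layout concrete, the sketch below reads a two-column, tab-separated association file and collects the features of each group. It only illustrates the file format described above; `read_associations` is a hypothetical helper, and skipping short or blank rows is an assumption rather than documented DBRetina behavior.

```python
import io
from collections import defaultdict


def read_associations(handle):
    """Group the features (column 2) by group name (column 1)."""
    groups = defaultdict(set)
    next(handle, None)  # the first line is the header, so skip it
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # assumption: ignore blank or malformed rows
        group, feature = fields[0], fields[1]
        groups[group].add(feature)
    return groups


# The example file from this section, with real tab separators.
sample = (
    "Disease\tfeature\n"
    "Breast Cancer\tBRCA1\n"
    "Breast Cancer\tBRCA2\n"
    "Lung Cancer\tEGFR\n"
    "Lung Cancer\tKRAS\n"
)
groups = read_associations(io.StringIO(sample))
print(sorted(groups["Breast Cancer"]))  # ['BRCA1', 'BRCA2']
```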

-g, --gmt TEXT GMT file(s).

The "GMT File" is a tab-delimited, headerless file that contains feature sets. Each row denotes a single feature set: the first column is the name of the feature set, the second column is a description of it, and the remaining columns list the features that belong to it.
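
The row layout of a GMT file can be illustrated with a short parser. This is a sketch of the format as described above, not the parser DBRetina uses; `read_gmt` is a hypothetical helper, the set names and descriptions in the sample are made up for illustration, and skipping rows with fewer than three columns is an assumption.

```python
import io


def read_gmt(handle):
    """Map each feature-set name to the list of its features."""
    feature_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # assumption: a row needs a name, a description, and a feature
        name, _description, *features = fields
        # Drop empty cells left by trailing tabs.
        feature_sets[name] = [f for f in features if f]
    return feature_sets


# Illustrative rows only; these are not taken from a real GMT file.
sample = (
    "set_a\tfirst example set\tBRCA1\tBRCA2\n"
    "set_b\tsecond example set\tEGFR\tKRAS\tTP53\n"
)
feature_sets = read_gmt(io.StringIO(sample))
print(feature_sets["set_b"])  # ['EGFR', 'KRAS', 'TP53']
```

Unlike the association file, a GMT row carries a whole feature set, so no grouping step is needed.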

-o, --output TEXT output file prefix [required]

The output file prefix is used to name the output files. The output files are explained in detail in the next section.

Example

Multiple GMT files

DBRetina index -g gmt_file1.gmt -g gmt_file2.gmt -o idx_example

Multiple association files

DBRetina index -a asc_file1.tsv -a asc_file2.tsv -o idx_example

Warning

You can't combine GMT files and association files; the command accepts only one type of input.
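
When the index command is called from a script, the one-input-type rule can be enforced before the command is run. The sketch below builds the argument list using only the documented `-g`, `-a`, and `-o` options. `build_index_command` is a hypothetical helper, and the check is an assumption about a convenient wrapper, not a description of how DBRetina validates its arguments.

```python
def build_index_command(output_prefix, gmt_files=(), asc_files=()):
    """Return the argument list for `DBRetina index` with one input type."""
    if gmt_files and asc_files:
        # GMT files and association files cannot be mixed in one run.
        raise ValueError("use either GMT files or association files, not both")
    command = ["DBRetina", "index"]
    for path in gmt_files:
        command += ["-g", path]  # repeat -g once per GMT file
    for path in asc_files:
        command += ["-a", path]  # repeat -a once per association file
    command += ["-o", output_prefix]
    return command


print(" ".join(build_index_command("idx_example", gmt_files=["gmt_file1.gmt", "gmt_file2.gmt"])))
# DBRetina index -g gmt_file1.gmt -g gmt_file2.gmt -o idx_example
```

The returned list can be passed to `subprocess.run` as is.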

Output files format

The output generated by the Index command consists of two JSON files (private and public) and a set of index files. These files are explained in detail below:

Primary output files

{prefix}_raw.json

This JSON file contains supergroups and their related features in plain text. It lets the user inspect the final input data that is used for indexing.

{prefix}_hashes.json

This is another JSON file that contains supergroups and their related features, but the features are hashed for indexing. This JSON file is used internally by DBRetina for indexing.

Advanced Output (For developers)

{output_prefix}_groupID_to_featureCount.bin: binary file that contains the number of features for each supergroup.

{output_prefix}_groupID_to_featureCount.tsv: tab-delimited file that contains the number of features for each supergroup.

{output_prefix}_color_count.bin: binary file that contains the colors count.

{output_prefix}.phmap: parallel-hash-map binary index file that contains features and their associated colors.

{output_prefix}.namesMap: pipe-delimited file of supergroup IDs and their original names, with the total number of supergroups in the header.

{output_prefix}_color_to_sources.bin: binary file that contains the colors and their associated supergroup IDs.

{output_prefix}.extra: text file that contains metadata for inter-command communication.
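
Of the files listed above, the pipe-delimited .namesMap file is the easiest to inspect by hand, and it also suggests why the pipe character is reserved. The sketch below assumes a layout that this page does not spell out: one header line, followed by one `ID|name` pair per line. Both that layout and the sample content are assumptions for illustration, and `read_names_map` is a hypothetical helper.

```python
import io


def read_names_map(handle):
    """Map supergroup IDs to names, assuming `ID|name` lines after a header."""
    next(handle, None)  # assumption: the first line is the header
    names = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first pipe only, so the ID is separated from the name.
        group_id, _, name = line.partition("|")
        names[group_id] = name
    return names


# Made-up content in the assumed layout; not copied from a real index.
sample = "2\n0|breast cancer\n1|lung cancer\n"
print(read_names_map(io.StringIO(sample)))
```

Checking a real .namesMap file against this assumed layout is advisable before relying on the parser.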

Last update: July 8, 2023
Created: July 5, 2023

Authors: mr-eyes