Indexing (Index the input data files)
The indexing step in DBRetina builds an index structure over the input entities and their associated features. The "pairwise" command uses this index to calculate pairwise distances between input entities, and the "query" command uses it to look up the entities associated with one or more features.
Usage: DBRetina index [OPTIONS]
Index the input data files.
Options:
-a, --asc TEXT associations file col1: supergroup, col2: single feature. 1st
line is header.
-g, --gmt TEXT GMT file(s)
-o, --output TEXT output file prefix [required]
--help Show this message and exit.
Command arguments
Warning
- The text of all input files is automatically converted to lowercase.
- All double quotes are removed.
Danger
The pipe character | can't be used in the input data.
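The clean-up rules above can be sketched in Python. This is only an illustration of the documented behavior, not DBRetina's own code; the function name `normalize_field` is hypothetical, and raising an error on a pipe character is an assumption about how one might pre-check input before indexing.

```python
def normalize_field(text: str) -> str:
    """Apply the documented clean-up rules to one input field."""
    if "|" in text:
        # The pipe character is not allowed in the input data.
        raise ValueError(f"pipe character found in input: {text!r}")
    # Double quotes are removed and the text is converted to lowercase.
    return text.replace('"', "").lower()


print(normalize_field('"Pathway_A"'))  # pathway_a
```

Running such a check before indexing makes it easier to predict which names will appear in the output files.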
-a, --asc TEXT associations file(s) col1: supergroup, col2: single feature. 1st line is header.
The "Association File" is a two-column TSV (tab-separated values) file with an included header. The first column denotes groups, while the second column indicates the features associated with each respective group. Each row signifies a single feature and its corresponding group.
Example of an Association File:
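As a minimal illustration, the Python sketch below builds a tiny association file in memory and reads it back into a mapping from group to features. The group and feature names, the header text, and the helper `read_associations` are all made up for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical content: a header line, then one (group, feature) pair per row.
ASSOCIATION_TSV = (
    "group\tfeature\n"
    "pathway_a\tfeature1\n"
    "pathway_a\tfeature2\n"
    "pathway_b\tfeature2\n"
)


def read_associations(handle):
    """Collect the features of each group from a two-column TSV with a header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    groups = defaultdict(set)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete rows
        group, feature = row[0], row[1]
        groups[group].add(feature)
    return dict(groups)


groups = read_associations(io.StringIO(ASSOCIATION_TSV))
print(sorted(groups["pathway_a"]))  # ['feature1', 'feature2']
```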
-g, --gmt TEXT GMT file(s).
The "GMT File" is a tab-delimited, headerless file that contains feature sets, one per row. The first column holds the name of the feature set, the second column holds a description of the feature set, and the remaining columns list the features that belong to it. More details are available in the GMT format documentation.
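The row layout described above (name, description, then features) can be read with a few lines of Python. This is a sketch with made-up set names and features; `read_gmt` is a hypothetical helper, not part of DBRetina.

```python
import io

# Hypothetical content: name, description, then the features of the set.
GMT_TEXT = (
    "set_one\tfirst example set\tfeature1\tfeature2\tfeature3\n"
    "set_two\tsecond example set\tfeature2\tfeature4\n"
)


def read_gmt(handle):
    """Map each feature-set name to (description, list of features)."""
    feature_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, description = fields[0], fields[1]
        features = [f for f in fields[2:] if f]  # drop empty trailing cells
        feature_sets[name] = (description, features)
    return feature_sets


feature_sets = read_gmt(io.StringIO(GMT_TEXT))
print(feature_sets["set_two"])  # ('second example set', ['feature2', 'feature4'])
```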
-o, --output TEXT output file prefix [required]
The output file prefix is used to name the output files. The output files are explained in detail in the next section.
Example
Warning
You can't combine GMT files and association files in the same run; the command accepts only one type of input.
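For scripted use, the one-input-type rule can be enforced before the command is launched. The Python sketch below assembles the argument list from the flags documented above (`-a`, `-g`, `-o`); the helper `build_index_command` and the file names are hypothetical, and only a single input file is handled because the way multiple files are passed is not described here.

```python
import shlex


def build_index_command(output_prefix, asc_file=None, gmt_file=None):
    """Return the argument list for `DBRetina index` with a single input file."""
    if (asc_file is None) == (gmt_file is None):
        raise ValueError("pass exactly one of asc_file or gmt_file")
    command = ["DBRetina", "index"]
    if asc_file is not None:
        command += ["-a", asc_file]
    else:
        command += ["-g", gmt_file]
    command += ["-o", output_prefix]
    return command


command = build_index_command("my_index", asc_file="associations.tsv")
print(shlex.join(command))  # DBRetina index -a associations.tsv -o my_index
# To actually run it: subprocess.run(command, check=True)
```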
Output files format
The output generated by the Index command consists of two JSON files (private and public) and a set of index files. These files are explained in detail below:
Primary output files
{prefix}_raw.json
This JSON file contains supergroups and their related features in plain text. It lets the user inspect the final input data that is used for indexing.
{prefix}_hashes.json
This is another JSON file that contains supergroups and their related features, but the features are hashed for indexing. This JSON file is used internally by DBRetina for indexing.
Advanced Output (For developers)
{output_prefix}_groupID_to_featureCount.bin
: binary file that contains the number of features for each supergroup.
{output_prefix}_groupID_to_featureCount.tsv
: tab-delimited file that contains the number of features for each supergroup.
{output_prefix}_color_count.bin
: binary file that contains the colors count.
{output_prefix}.phmap
: parallel-hash-map binary index file that contains features and their associated colors.
{output_prefix}.namesMap
: pipe-delimited file of supergroup IDs and their original names, with the total number of supergroups in the header.
{output_prefix}_color_to_sources.bin
: binary file that contains the colors and their associated supergroup IDs.
{output_prefix}.extra
: text file that contains metadata for inter-command communication.
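To make the relationship between features, colors, and supergroup IDs more concrete, here is a toy in-memory model in Python. It is not DBRetina's implementation or file format: it assumes that each distinct set of supergroup IDs is given one integer color, and the function name, ID numbering, and example data are all hypothetical.

```python
def build_colored_index(groups):
    """groups: dict mapping supergroup name -> iterable of features."""
    # Assign an integer ID to every supergroup name.
    group_ids = {name: i for i, name in enumerate(sorted(groups))}

    # For each feature, collect the IDs of the supergroups that contain it.
    feature_sources = {}
    for name, features in groups.items():
        for feature in features:
            feature_sources.setdefault(feature, set()).add(group_ids[name])

    color_of_sources = {}  # frozenset of supergroup IDs -> color
    feature_to_color = {}  # feature -> color
    for feature in sorted(feature_sources):
        sources = frozenset(feature_sources[feature])
        # Reuse the color if this exact set of supergroups was seen before.
        color = color_of_sources.setdefault(sources, len(color_of_sources) + 1)
        feature_to_color[feature] = color

    color_to_sources = {c: sorted(s) for s, c in color_of_sources.items()}
    return group_ids, feature_to_color, color_to_sources


example = {
    "pathway_a": ["feature1", "feature2"],
    "pathway_b": ["feature2", "feature3", "feature4"],
}
group_ids, feature_to_color, color_to_sources = build_colored_index(example)
print(feature_to_color)   # {'feature1': 1, 'feature2': 2, 'feature3': 3, 'feature4': 3}
print(color_to_sources)   # {1: [0], 2: [0, 1], 3: [1]}
```

In this model, features that occur in exactly the same supergroups share a color, so the feature-to-color map stays compact even when many features have identical membership.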
Created: July 5, 2023