workflows/annotation/scgpt_annotation

Description

Cell type annotation workflow using scGPT. The workflow takes a pre-processed h5mu file as query input and performs the following steps:

subsetting for highly variable genes (HVG)
cross-checking of genes with the model vocabulary
binning of gene counts
padding and tokenizing of genes
transformer-based cell type prediction

Note that cell type prediction using scGPT is only possible with a fine-tuned scGPT model.
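The vocabulary cross-check step listed above can be pictured with a minimal sketch. It assumes the vocabulary is a plain mapping from gene name to token ID; the function name `cross_check_genes` and the example genes are hypothetical and not taken from the workflow itself.

```python
def cross_check_genes(gene_names, vocab):
    """Split gene names into those present in the vocabulary and those absent.

    `vocab` is assumed to be a dict mapping gene name -> integer token ID.
    """
    kept = [g for g in gene_names if g in vocab]
    dropped = [g for g in gene_names if g not in vocab]
    return kept, dropped


# Hypothetical toy vocabulary and query genes.
vocab = {"<pad>": 0, "CD3E": 1, "MS4A1": 2}
kept, dropped = cross_check_genes(["CD3E", "FAKE1", "MS4A1"], vocab)
print(kept, dropped)
```

Genes missing from the vocabulary cannot be tokenized, which is presumably why the check runs before binning and tokenization.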

Type

nextflow_script

License

MIT

Query input

Name	Type & Properties	Description
--id	string required	ID of the sample.
--input	file required	Path to the input file.
--modality	string
--input_layer	string	The layer of the input dataset to process if .X is not to be used. Should contain log normalized counts.
--input_var_gene_names	string	The .var field in the input (query) containing gene names; if not provided, the var index will be used.
--input_obs_batch_label	string required	The .obs field in the input (query) dataset containing the batch labels.

Model input

Name	Type & Properties	Description
--model	file required	The scGPT model file. Must be a fine-tuned model that contains keys for the checkpoints (--finetuned_checkpoints_key) and the cell type label mapper (--label_mapper_key).
--model_config	file required	The scGPT model configuration file.
--model_vocab	file required	The scGPT model vocabulary file.
--finetuned_checkpoints_key	string	Key in the model file containing the pre-trained checkpoints.
--label_mapper_key	string	Key in the model file containing the cell type class to label mapper dictionary.
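To illustrate how a class-to-label mapper dictionary might be used, here is a rough sketch that turns per-class model scores into a label and a probability. It assumes a softmax over the scores and a mapper keyed by integer class index; both are assumptions for illustration, and `predict_label` is a hypothetical helper rather than part of the workflow.

```python
import math


def predict_label(logits, label_mapper):
    """Return (label, probability) for the highest-scoring class.

    Assumes `label_mapper` maps integer class index -> cell type label.
    """
    top = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_mapper[best], probs[best]


# Hypothetical mapper and scores.
label_mapper = {0: "B cell", 1: "T cell", 2: "NK cell"}
label, prob = predict_label([0.0, 2.0, 0.0], label_mapper)
print(label, round(prob, 3))
```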

Outputs

Name	Type & Properties	Description
--output	file required output	Output file path.
--output_compression	string	The compression algorithm to use for the output h5mu file.
--output_obs_predictions	string	The name of the adata.obs column to write predicted cell type labels to.
--output_obs_probability	string	The name of the adata.obs column to write the probabilities of the predicted cell type labels to.
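As a sketch of how the arguments above could be collected for a run, the snippet below builds a parameter dictionary and checks that the arguments marked as required on this page are present. All file names and column names are hypothetical placeholders, and `build_params` is an illustrative helper, not part of the workflow.

```python
import json

# Arguments marked "required" in the tables on this page.
REQUIRED = ["id", "input", "input_obs_batch_label",
            "model", "model_config", "model_vocab", "output"]


def build_params(**kwargs):
    """Return a params dict, failing early if a required argument is missing."""
    missing = [name for name in REQUIRED if name not in kwargs]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return dict(kwargs)


# All values below are hypothetical placeholders.
params = build_params(
    id="sample_1",
    input="query.h5mu",
    input_obs_batch_label="batch",
    model="model/best_model.pt",
    model_config="model/args.json",
    model_vocab="model/vocab.json",
    output="annotated.h5mu",
    output_obs_predictions="scgpt_pred",
    output_obs_probability="scgpt_probability",
)
print(json.dumps(params, indent=2))
```

Saved as `params.json`, such a file could be handed to Nextflow through its `-params-file` option, for example `nextflow run <pipeline> -params-file params.json`, where `<pipeline>` stands for wherever the workflow script is hosted.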

Padding arguments

Name	Type & Properties	Description
--pad_token	string	Token used for padding.
--pad_value	integer	The value of the padding token.

HVG subset arguments

Name	Type & Properties	Description
--n_hvg	integer	Number of highly variable genes to subset for.
--hvg_flavor	string	Method used to identify highly variable genes. Note that the default for this workflow (`cell_ranger`) differs from the default method for scanpy HVG detection (`seurat`).

Tokenization arguments

Name	Type & Properties	Description
--max_seq_len	integer	The maximum sequence length of the tokenized data.
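The interplay of `--pad_token`, `--pad_value` and `--max_seq_len` can be sketched as follows. This is a simplified illustration: it truncates long sequences from the end, which may differ from what the workflow does, and the defaults `"<pad>"` and `-2` as well as the helper name `tokenize_and_pad` are assumptions chosen for the example.

```python
def tokenize_and_pad(genes, values, vocab, max_seq_len,
                     pad_token="<pad>", pad_value=-2):
    """Map genes to token IDs, then truncate or pad to exactly max_seq_len.

    Returns (gene_ids, padded_values), both lists of length max_seq_len.
    """
    gene_ids = [vocab[g] for g in genes][:max_seq_len]
    vals = list(values)[:max_seq_len]
    n_pad = max_seq_len - len(gene_ids)
    # Gene positions are padded with the pad token's ID,
    # expression positions with the pad value.
    gene_ids += [vocab[pad_token]] * n_pad
    vals += [pad_value] * n_pad
    return gene_ids, vals


# Hypothetical toy vocabulary and one cell with two expressed genes.
vocab = {"<pad>": 0, "CD3E": 1, "MS4A1": 2}
ids, vals = tokenize_and_pad(["CD3E", "MS4A1"], [3, 1], vocab, max_seq_len=4)
print(ids, vals)
```

Using a pad value outside the range of valid bin indices keeps padded positions distinguishable from genuinely unexpressed genes.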

Embedding arguments

Name	Type & Properties	Description
--dsbn	boolean	Apply domain-specific batch normalization.
--batch_size	integer	The batch size to be used for embedding inference.

Binning arguments

Name	Type & Properties	Description
--n_input_bins	integer	The number of bins to discretize the data into. When no value is provided, the data is not binned.
--seed	integer	Seed for random number generation used for binning. If not set, no seed is used.
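To give an idea of what discretizing expression values into `--n_input_bins` bins can look like, here is a simplified, deterministic sketch of per-cell quantile binning in which zeros keep bin 0 and non-zero values are spread over the remaining bins. It is an assumption-laden illustration, not the workflow's actual procedure: the real binning evidently involves randomness (hence `--seed`), which this sketch omits, and `bin_cell` is a hypothetical helper.

```python
from bisect import bisect_left


def bin_cell(values, n_bins):
    """Map one cell's expression values to integer bins in 0..n_bins-1.

    Zeros stay in bin 0; non-zero values are assigned to bins 1..n_bins-1
    using equal-frequency cut points computed from the non-zero values.
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        return [0] * len(values)
    n, m = len(nonzero), n_bins - 1
    # Interior cut points via the nearest-rank method (integer arithmetic only).
    edges = [nonzero[(k * n + m - 1) // m - 1] for k in range(1, m)]
    return [0 if v <= 0 else 1 + bisect_left(edges, v) for v in values]


binned = bin_cell([0, 1, 2, 3, 4, 0, 5, 6], n_bins=4)
print(binned)
```

Binning per cell makes the resulting values comparable across cells with different sequencing depth, since each cell's values are ranked relative to its own distribution.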
