Extending
In this section, we provide instructions on how to add new GRN inference methods, metrics, and datasets to the geneRNIB platform. Installing geneRNIB is a prerequisite for integrating new features.
Add a GRN inference method
Examples of GRN inference methods include GRNBoost2, CellOracle, and SCENIC. The integrated GRN inference methods are located in src/methods of the geneRNIB repository and serve as examples of how to integrate new methods in both R and Python.
Each method requires a config.vsh file together with a script.py (or script.R for R methods). A method can also include extra files to organize its code, such as a helper module; these are stored in the same folder and called from script.py.
An overview of config.vsh is shown below. Refer to the src/methods/ folder for the most up-to-date formatting.
__merge__: /src/api/comp_method.yaml # merge with the common method schema
name: method_name # unique id for your method
namespace: "grn_methods"
info:
  label: pretty method name # e.g. "GRNBoost2"
  summary: "summary of your method"
  description: |
    more about your method
  documentation_url: link to your method documentation # optional
resources:
  - type: python_script # or r_script
    path: script.py # your main script (don't change it), or script.R
engines:
  - type: docker
    image: ghcr.io/openproblems-bio/base_python:1.0.4 # a base docker image
    __merge__: /src/api/base_requirements.yaml # merge with the base requirements schema required for the pipeline
    setup:
      - type: python
        packages: [ grnboost2 ] # additional packages required for your method. See other methods for examples, as this can get complicated. Alternatively, use your own image and omit this.
runners: # this is for the nextflow pipeline
  - type: executable
  - type: nextflow
    directives:
      label: [midtime, midmem, midcpu] # expected resources. See scripts/labels_tw.config for their definition
Your script.py should have the following structure:
import sys
import anndata as ad
import numpy as np
import pandas as pd
... # import any other necessary libraries

# The VIASH START/END block is required for viash to work: viash replaces it
# with the parameters passed through the config.vsh file.
## VIASH START
par = {
    'rna': 'resources/grn_benchmark/rna_op.h5ad',  # example rna data
    'tf_all': 'resources/grn_benchmark/prior/tf_all.csv',  # TF list, needed to filter the network to TF-gene pairs only. Only the top 50k TF-gene edges are evaluated, so it is better to filter.
    'prediction': 'output/prediction.h5ad'  # where to save the prediction
}
## VIASH END

# Load some data
rna = ad.read_h5ad(par["rna"])
tf_all = np.loadtxt(par["tf_all"], dtype=str)

# Your method code here
net = pd.DataFrame({"source": ["TF1", "TF2"], "target": ["gene1", "gene2"], "weight": [0.1, 0.2]})  # example network

# Save the inferred network
net['weight'] = net['weight'].astype(str)  # ensure weight is stored as a string
output = ad.AnnData(
    X=None,
    uns={
        "method_id": "method_name",
        "dataset_id": "dataset_name",
        "prediction": net[["source", "target", "weight"]]
    }
)
output.write_h5ad(par['prediction'])
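The comment on tf_all in the script above notes that the network should be restricted to TF-gene pairs and that only the top 50k edges are evaluated. A minimal sketch of such a filtering step follows, assuming the network is a pandas DataFrame with source, target, and weight columns and that edges are ranked by absolute weight. The function name filter_network and its max_edges parameter are hypothetical and not part of geneRNIB.

```python
import pandas as pd


def filter_network(net, tf_all, max_edges=50_000):
    """Keep TF-gene edges only and retain the strongest `max_edges` of them.

    Hypothetical helper: ranking by absolute weight is an assumption.
    """
    tf_set = set(tf_all)
    # Keep only edges whose source is a known transcription factor
    net = net[net["source"].isin(tf_set)]
    # Rank edges by absolute weight, strongest first
    order = net["weight"].abs().sort_values(ascending=False).index
    return net.loc[order].head(max_edges).reset_index(drop=True)


# Toy example: "geneX" is not a TF, so its edge is dropped
net = pd.DataFrame({
    "source": ["TF1", "geneX", "TF2", "TF1"],
    "target": ["gene1", "gene2", "gene3", "gene4"],
    "weight": [0.1, 0.9, -0.5, 0.3],
})
filtered = filter_network(net, tf_all=["TF1", "TF2"], max_edges=2)
print(filtered)
```

In a method script, a step like this would run while the weights are still numeric, before they are converted to strings for saving.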
Once you have added your method, test it with the commands below. First download the test datasets and place them in resources_test/grn_benchmark, then run viash test on the config.vsh file of your method.
aws s3 sync s3://openproblems-data/resources_test/grn/grn_benchmark resources_test/grn_benchmark --no-sign-request
viash test src/methods/your_method/config.vsh # path to the config.vsh file of your method
Once the test is successful, you can submit a pull request to the geneRNIB repository to integrate your method. See additional Viash commands in the Viash documentation to run your method with different parameters.
Updating the leaderboard
After evaluation is complete, aggregate the raw scores into all_scores.csv:
python scripts/benchmark/aggregate_local_scores.py
Then regenerate the leaderboard figure, which re-runs normalization across all methods:
python scripts/benchmark/create_overview_figure.py
The raw benchmark scores are also available for download on the Leaderboard page.
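The two scripts are not reproduced here. As a purely illustrative sketch of what aggregating per-run scores and normalizing across methods can look like, the example below concatenates two small score tables and applies min-max scaling per metric. The scaling scheme, the column names, and the score values are all assumptions; the normalization actually used by create_overview_figure.py may differ.

```python
import pandas as pd

# Hypothetical raw scores from separate evaluation runs (values are made up)
run_a = pd.DataFrame({"method": ["m1", "m2"], "metric_1": [0.25, 0.75]})
run_b = pd.DataFrame({"method": ["m3"], "metric_1": [0.5]})

# Aggregate the per-run tables into a single table of raw scores
all_scores = pd.concat([run_a, run_b], ignore_index=True)

# Min-max normalize each metric across all methods (assumed scheme)
metrics = all_scores.set_index("method")
span = metrics.max() - metrics.min()
# Guard against division by zero when all methods share the same score
normalized = (metrics - metrics.min()) / span.where(span != 0, 1)
print(normalized)
```

Under a scheme like this, the scaled values depend on the minimum and maximum over all methods, so adding a method can change every normalized score. That would explain why normalization is re-run after aggregation.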
Add a GRN evaluation metric
Metrics follow the same file layout as methods; see the src/metrics/ folder for examples. While new metrics could use different evaluation datasets, the current evaluation files are located in resources/grn_benchmark/evaluation_data. The datasets come in three formats: single cell, (pseudo)bulk, and differential expression (de). The choice of dataset depends on the evaluation metric.
A few tips:
Use read_prediction from src/utils/util to read the inferred GRNs and check their format.
The metric should output its score as an h5ad file. In Python, this can be done as follows:
import pandas as pd
from util import format_save_score
results = pd.DataFrame({
    'metric_key_1': [metric_value_1],  # submetric 1
    'metric_key_2': [metric_value_2],  # submetric 2
})
method_id = 'name of GRN method'
dataset_id = 'name of dataset used for GRN inference'
score_file = 'output/score.h5ad'
format_save_score(results, method_id, dataset_id, score_file)
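The first tip above mentions checking the format of an inferred GRN. The sketch below illustrates the kind of check involved, assuming a prediction is a table with source, target, and weight columns as in the method script. The helper check_prediction_format is hypothetical; the real read_prediction may perform different or additional checks.

```python
import pandas as pd

REQUIRED_COLUMNS = ["source", "target", "weight"]


def check_prediction_format(net):
    """Validate a predicted network and return it with numeric weights.

    Hypothetical helper; the real read_prediction may check other things.
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in net.columns]
    if missing:
        raise ValueError(f"prediction is missing columns: {missing}")
    net = net[REQUIRED_COLUMNS].copy()
    # Weights are saved as strings by the method script, so convert them back
    net["weight"] = pd.to_numeric(net["weight"])
    return net


net = pd.DataFrame({"source": ["TF1"], "target": ["gene1"], "weight": ["0.1"]})
checked = check_prediction_format(net)
print(checked.dtypes)
```

Failing early with a clear error keeps a malformed prediction from silently producing a misleading score.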
Add a GRN inference and evaluation dataset
Here we explain how to integrate new datasets. All datasets are in h5ad format, and the example structure of an inference or evaluation dataset can be found in resources/grn_benchmark/. The inference datasets are in resources/grn_benchmark/inference_data/ and the evaluation datasets are in resources/grn_benchmark/evaluation_data/. Each dataset should have a unique dataset_id, stored in .uns['dataset_id'], that is used to identify it in the platform. In addition, .uns should store further information such as the description, reference, and normalization type. The normalized values should be stored in layers under the name of the normalization method, e.g. lognorm.
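Before submitting a dataset, it can help to verify these requirements programmatically. The sketch below is one way to do so. Only dataset_id and the lognorm layer name come from the text above; the other .uns key names (description, reference, normalization_id) and the helper missing_dataset_fields are assumptions made for illustration. A small stand-in object is used so the example runs without a data file; with a real dataset, the object returned by anndata.read_h5ad would be passed instead.

```python
from types import SimpleNamespace

# dataset_id comes from the text above; the other key names are assumptions
REQUIRED_UNS_KEYS = ["dataset_id", "description", "reference", "normalization_id"]


def missing_dataset_fields(adata, layer="lognorm"):
    """Return a list of problems found in the dataset metadata (hypothetical helper)."""
    problems = [
        f"uns['{key}'] is missing" for key in REQUIRED_UNS_KEYS if key not in adata.uns
    ]
    if layer not in adata.layers:
        problems.append(f"layers['{layer}'] is missing")
    return problems


# Stand-in object exposing .uns and .layers like an AnnData object does
toy = SimpleNamespace(
    uns={"dataset_id": "my_dataset", "description": "toy data"},
    layers={"lognorm": [[0.0, 1.2]]},
)
problems = missing_dataset_fields(toy)
print(problems)
```

Returning a list of problems rather than raising on the first one lets all missing fields be reported in a single pass.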