RidgeReg with expression features
Haiqing Hua
I share news from Chinese website (you can use google translate please also subscribe my YouTube Channel) | Ideologist | Poet | Futurist | Educator | Technologist | Business Analyst | Data Analyst | Realtor |
Background
The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function.
GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represent a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP) and where in the cell it is located (CC). See the Gene Ontology Overview for more details.
The protein's function is therefore represented by a subset of one or more of the subontologies. These annotations are supported by evidence codes, which can be broadly divided into experimental (e.g., as documented in a paper published by a research team of biologists) and non-experimental. Non-experimental terms are usually inferred by computational means. We recommend you read more about the different types of GO evidence codes.
We will use experimentally determined term-protein assignments as class labels for each protein. That is, if a protein is labeled with a term, it means that this protein has this function validated by experimental evidence. By processing these annotated terms, we can generate a dataset of proteins and their ground truth labels for each term. The absence of a term annotation does not necessarily mean a protein does not have this function, only that this annotation does not exist (yet) in the GO. A protein may be annotated by one or more terms from the same subontology, and by terms from more than one subontology.
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet (2000) 25(1):25-29.
Training Set
For the training set, we include all proteins with annotated terms that have been validated by experimental or high-throughput evidence, traceable author statement (evidence code TAS), or inferred by curator (IC). More information about evidence codes can be found here. We use annotations from the UniProtKB release of 2022-11-17. The participants are not required to use these data and are also welcome to use any other data available to them.
Test Superset
The test superset is a set of protein sequences on which the participants are asked to predict GO terms.
Test Set
The test set is unknown at the beginning of the competition. It will contain protein sequences (and their functions) from the test superset that gained experimental annotations between the submission deadline and the time of evaluation.
File Descriptions
Gene Ontology: The ontology data is in the file go-basic.obo. This structure is the 2023-01-01 release of the GO graph. This file is in OBO format, for which there exist many parsing libraries. For example, the obonet package is available for Python. The nodes in this graph are indexed by the term name, for example the roots of the three onotlogies are:
subontology_roots = {'BPO':'GO:0008150',
'CCO':'GO:0005575',
'MFO':'GO:0003674'}
Training sequences: train_sequences.fasta contains the protein sequences for the training dataset. This files are in FASTA format, a standard format for describing protein sequences. The proteins were all retrieved from the UniProt data set curated at the European Bioinformatics Institute. The header contains the protein's UniProt accession ID and additional information about the protein. Most protein sequences were extracted from the Swiss-Prot database, but a subset of proteins that are not represented in Swiss-Prot were extracted from the TrEMBL database. In both cases, the sequences come from the 2022_05 release from 14-Dec-2022. More information can be found here.
The train_sequences.fasta file will indicate from which database the sequence originate. For example, sp|P9WHI7|RECN_MYCT in the FASTA header indicates the protein with UniProt ID P9WHI7 and gene name RECN_MYCT was taken from Swiss-Prot (sp). Any sequences taken from TrEMBL will have tr in the header instead of sp. Swiss-Prot and TrEMBL are both parts of UniProtKB.
This file contains only sequences for proteins with annotations in the dataset (labeled proteins). To obtain the full set of protein sequences for unlabeled proteins, the Swiss-Prot and TrEMBL databases can be found here.
Labels: train_terms.tsv contains the list of annotated terms (ground truth) for the proteins in train_sequences.fasta. The first column indicates the protein's UniProt accession ID, the second is the GO term ID, and the third indicates in which ontology the term appears.
Taxonomy: train_taxonomy.tsv contains the list of proteins and the species to which they belong, represented by a "taxonomic identifier" (taxon ID) number. The first column is the protein UniProt accession ID and the second is the taxon ID. More information about taxonomies can he found here.
Information accretion: IA.txt contains the information accretion (weights) for each GO term. These weights are used to compute weighted precision and recall, as described in the Evaluation section. The values of this file were computed using the following code repo.
Test sequences: testsuperset.fasta contains protein sequences on which the participants are asked to submit predictions. The header for each sequence in testsuperset.fasta contains the protein's UniProt accession ID and the Taxon ID of the species this protein belongs to. Only a small subset of those sequences will accumulate functional annotations and will constitute the test set. The file testsuperset-taxon-list.tsv is a set of taxon IDs for the proteins in the test superset.
Files
- train_sequences.fasta - amino acid sequences for proteins in training set
- train_terms.tsv - the training set of proteins and corresponding annotated GO terms
- train_taxonomy.tsv - taxon ID for proteins in training set
- go-basic.obo - ontology graph structure
- testsuperset.fasta - amino acid sequences for proteins on which the predictions should be made
- testsuperset-taxon-list.tsv - taxon ID for proteins in test superset (Note: you may need to use encoding="ISO-8859-1" to read this file in pandas)
- IA.txt - Information Accretion for each term. This is used to weight precision and recall (see Evaluation)
- sample_submission.csv - a sample submission file in the correct format
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
!pip install mygene scanpy
from Bio import SeqIO
import pandas as pd
import mygene
import scanpy as sc
import anndata
import h5py
from sklearn.decomposition import PCA
from tqdm import tqdm
mg = mygene.MyGeneInfo()
tax_table = pd.read_table("/kaggle/input/cafa-5-protein-function-prediction/Train/train_taxonomy.tsv", sep="\t", index_col="EntryID")
prot_names = [record.id for record in SeqIO.parse("/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta", "fasta")]
领英推è
Read Human scRNA-Seq and normalize?
sc_hum_data = sc.read_h5ad("/kaggle/input/msci-raw-counts-anndata/train_cite_inputs_raw.ann.h5")
sc_hum_data.var_names = [x.split("_")[0] for x in sc_hum_data.var_names]
sc.pp.normalize_total(sc_hum_data, inplace=True)
sc.pp.log1p(sc_hum_data)
sc.pp.normalize_total(sc_hum_data, inplace=True)
Read Mouse scRNA-Seq?
sc_mus_data = sc.read_text("/kaggle/input/single-cell-rna-seq-nestorova2016-mouse-hspc/nestorowa_corrected_log2_transformed_counts.txt")
Select human and mouse proteins?
hum_prot = tax_table.loc[tax_table["taxonomyID"] == 9606].index.to_list()
mus_prot = tax_table.loc[tax_table["taxonomyID"] == 10090].index.to_list()
Query MyGene to get gene names?
hum_gene_prot_conv = mg.querymany(hum_prot, scopes='uniprot', fields='symbol,ensembl.gene', species='human', returnall=False, as_dataframe=True)
mus_gene_prot_conv = mg.querymany(mus_prot, scopes='uniprot', fields='symbol,ensembl.gene', species='mouse', returnall=False, as_dataframe=True)
Subset scRNA-Seq to CAFA5 genes?
hum_shared_genes = list(set(sc_hum_data.var_names).intersection(set(hum_gene_prot_conv["ensembl.gene"].unique())))
cafa_hum_gene_features = sc_hum_data[:, hum_shared_genes]
mus_shared_genes = list(set(sc_mus_data.var_names).intersection(set(mus_gene_prot_conv["symbol"].unique())))
cafa_mus_gene_features = sc_mus_data[:, mus_shared_genes]
PCA of scRNA-Seq to get new features?
hum_gene_pca = PCA(n_components=100).fit_transform(cafa_hum_gene_features.X.T.todense())
hum_gene_pca = pd.DataFrame(hum_gene_pca.T, columns = cafa_hum_gene_features.var_names.to_list())
new_feature_hum = []
hum_proteins = []
for prot in tqdm(hum_prot):
try:
genes = hum_gene_prot_conv.loc[prot].dropna()["ensembl.gene"]
if type(genes) == str:
genes = [genes]
for gene in genes:
if gene in hum_gene_pca.columns.to_list():
hum_proteins.append(prot)
new_feature_hum.append(hum_gene_pca[gene].to_list())
except:
continue
new_features_hum = pd.DataFrame(new_feature_hum).T
new_features_hum.columns = hum_proteins
new_features_hum.to_csv("/kaggle/working/new_features_hum_genes.csv")
mus_gene_pca = PCA(n_components=100).fit_transform(cafa_mus_gene_features.X.T)
mus_gene_pca = pd.DataFrame(mus_gene_pca.T, columns = cafa_mus_gene_features.var_names.to_list())
new_feature_mus = []
mus_proteins = []
for prot in mus_prot:
try:
genes = mus_gene_prot_conv.loc[prot].dropna()["symbol"]
if type(genes) == str:
genes = [genes]
for gene in genes:
if gene in mus_gene_pca.columns.to_list():
mus_proteins.append(prot)
new_feature_mus.append(mus_gene_pca[gene].to_list())
except:
continue
new_features_mus = pd.DataFrame(new_feature_mus).T
new_features_mus.columns = mus_proteins
new_features_mus.to_csv("/kaggle/working/new_features_mus_genes.csv")
new_feature = pd.concat([new_features_hum.T, new_features_mus.T])
Ridge regression?
Loading train labels
%%time
n_labels_to_consider = 1499
trainTerms = pd.read_csv("/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv",sep="\t")
print(trainTerms.shape)
display(trainTerms.head(2))
vec_freqCount = (trainTerms['term'].value_counts())
print(vec_freqCount )
print()
labels_to_consider = list(vec_freqCount.index[:n_labels_to_consider] )
print('n_labels_to_consider:', len(labels_to_consider), 'First 10:', labels_to_consider[:10] )
Load protein IDs
%%time
fn = '/kaggle/input/t5embeds/train_ids.npy'
vec_train_protein_ids = np.load(fn)
print(vec_train_protein_ids.shape)
vec_train_protein_ids
Prepare Y
train_size = 142246
Y = np.zeros( (train_size ,n_labels_to_consider) )
print(Y.shape)
series_train_protein_ids = pd.Series(vec_train_protein_ids ) #
trainTerms_smaller = trainTerms[ trainTerms['term'].isin( labels_to_consider ) ] # to speed-up the next step
print( trainTerms_smaller.shape)
for i in tqdm(range(Y.shape[1])):
m = trainTerms_smaller['term'] == labels_to_consider[i]
# m.sum()
Y[:,i] = series_train_protein_ids.isin( set(trainTerms_smaller[m]['EntryID'] ) ).astype(float )
Y
Loading precalculated embedings
%%time
# fn = '/kaggle/input/protein-embeddings-1/reduced_embeddings_file.npy'
# fn = '/kaggle/input/protein-embeddings-1/embed_protbert_train_clip_1200_first_70000_prot.csv'
fn = '/kaggle/input/t5embeds/train_embeds.npy'
# fn = '/kaggle/input/t5embeds/test_embeds.npy'
print(fn)
if '.csv' in fn:
df = pd.read_csv(fn, index_col = 0)
X = df.values
elif '.npy' in fn:
X = np.load(fn)
print(X.shape)
X
Load protein IDs
%%time
fn = '/kaggle/input/t5embeds/train_ids.npy'
vec_train_protein_ids = np.load(fn)
print(vec_train_protein_ids.shape)
vec_train_protein_ids
X_ext = []
for ind, prot_id in tqdm(enumerate(vec_train_protein_ids)):
if prot_id in new_feature.index:
X_ext.append(np.r_[X[ind,:], new_feature.loc[prot_id]])
else:
X_ext.append(np.r_[X[ind,:], [np.mean(X[ind,:])] * 100])
from sklearn.model_selection import train_test_split
IX = np.arange(len(X))
IX_train, IX_test, _,_ = train_test_split( IX, IX, train_size=0.1, random_state=42)
print(len(IX_train), len(IX_test), IX_train[:10], IX_test[:10] )
Modeling
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
str_model_id = 'Ridge1'
df_models_stat = pd.DataFrame()
model
%%time
import time
from sklearn.metrics import roc_auc_score
t0 = time.time()
model.fit(X[IX_train,:],Y[IX_train,:])
Y_pred_test = model.predict(X[IX_test,:])
tt = time.time() - t0
print(str_model_id, tt)
l = []
for i in range(Y.shape[1]):
if len(np.unique(Y[IX_test,i]) ) > 1:
s = roc_auc_score(Y[IX_test,i], Y_pred_test[:,i]);
else:
s = 0.5
l.append(s)
if i %10 == 0:
print(i, s)
df_models_stat.loc[str_model_id,'RocAuc Mean Test'] = np.mean(l)
df_models_stat.loc[str_model_id,'Time'] = np.round(tt,1)
df_models_stat.loc[str_model_id,'Test Size'] = len(IX_test)
df_models_stat
Scores statistics across targets
import matplotlib.pyplot as plt
plt.hist(l)
plt.show()
pd.Series(l).describe()
Full dataset model
%%time
model.fit(X,Y)