ç™»å½•æŸ¥çœ‹æ›´å¤šå†…å®¹

RidgeReg with expression features

Haiqing Hua

I share news from Chinese website (you can use google translate please also subscribe my YouTube Channel) | Ideologist | Poet | Futurist | Educator | Technologist | Business Analyst | Data Analyst | Realtor |

å‘å¸ƒæ—¥æœŸ: 2024å¹´4æœˆ14æ—¥

Background

The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function.

GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represent a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP) and where in the cell it is located (CC). See the Gene Ontology Overview for more details.

The protein's function is therefore represented by a subset of one or more of the subontologies. These annotations are supported by evidence codes, which can be broadly divided into experimental (e.g., as documented in a paper published by a research team of biologists) and non-experimental. Non-experimental terms are usually inferred by computational means. We recommend you read more about the different types of GO evidence codes.

We will use experimentally determined term-protein assignments as class labels for each protein. That is, if a protein is labeled with a term, it means that this protein has this function validated by experimental evidence. By processing these annotated terms, we can generate a dataset of proteins and their ground truth labels for each term. The absence of a term annotation does not necessarily mean a protein does not have this function, only that this annotation does not exist (yet) in the GO. A protein may be annotated by one or more terms from the same subontology, and by terms from more than one subontology.

Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet (2000) 25(1):25-29.

Training Set

For the training set, we include all proteins with annotated terms that have been validated by experimental or high-throughput evidence, traceable author statement (evidence code TAS), or inferred by curator (IC). More information about evidence codes can be found here. We use annotations from the UniProtKB release of 2022-11-17. The participants are not required to use these data and are also welcome to use any other data available to them.

Test Superset

The test superset is a set of protein sequences on which the participants are asked to predict GO terms.

Test Set

The test set is unknown at the beginning of the competition. It will contain protein sequences (and their functions) from the test superset that gained experimental annotations between the submission deadline and the time of evaluation.

File Descriptions

Gene Ontology: The ontology data is in the file go-basic.obo. This structure is the 2023-01-01 release of the GO graph. This file is in OBO format, for which there exist many parsing libraries. For example, the obonet package is available for Python. The nodes in this graph are indexed by the term name, for example the roots of the three onotlogies are:

subontology_roots = {'BPO':'GO:0008150',
                     'CCO':'GO:0005575',
                     'MFO':'GO:0003674'}

Training sequences: train_sequences.fasta contains the protein sequences for the training dataset. This files are in FASTA format, a standard format for describing protein sequences. The proteins were all retrieved from the UniProt data set curated at the European Bioinformatics Institute. The header contains the protein's UniProt accession ID and additional information about the protein. Most protein sequences were extracted from the Swiss-Prot database, but a subset of proteins that are not represented in Swiss-Prot were extracted from the TrEMBL database. In both cases, the sequences come from the 2022_05 release from 14-Dec-2022. More information can be found here.

The train_sequences.fasta file will indicate from which database the sequence originate. For example, sp|P9WHI7|RECN_MYCT in the FASTA header indicates the protein with UniProt ID P9WHI7 and gene name RECN_MYCT was taken from Swiss-Prot (sp). Any sequences taken from TrEMBL will have tr in the header instead of sp. Swiss-Prot and TrEMBL are both parts of UniProtKB.

This file contains only sequences for proteins with annotations in the dataset (labeled proteins). To obtain the full set of protein sequences for unlabeled proteins, the Swiss-Prot and TrEMBL databases can be found here.

Labels: train_terms.tsv contains the list of annotated terms (ground truth) for the proteins in train_sequences.fasta. The first column indicates the protein's UniProt accession ID, the second is the GO term ID, and the third indicates in which ontology the term appears.

Taxonomy: train_taxonomy.tsv contains the list of proteins and the species to which they belong, represented by a "taxonomic identifier" (taxon ID) number. The first column is the protein UniProt accession ID and the second is the taxon ID. More information about taxonomies can he found here.

Information accretion: IA.txt contains the information accretion (weights) for each GO term. These weights are used to compute weighted precision and recall, as described in the Evaluation section. The values of this file were computed using the following code repo.

Test sequences: testsuperset.fasta contains protein sequences on which the participants are asked to submit predictions. The header for each sequence in testsuperset.fasta contains the protein's UniProt accession ID and the Taxon ID of the species this protein belongs to. Only a small subset of those sequences will accumulate functional annotations and will constitute the test set. The file testsuperset-taxon-list.tsv is a set of taxon IDs for the proteins in the test superset.

Files

train_sequences.fasta - amino acid sequences for proteins in training set
train_terms.tsv - the training set of proteins and corresponding annotated GO terms
train_taxonomy.tsv - taxon ID for proteins in training set
go-basic.obo - ontology graph structure
testsuperset.fasta - amino acid sequences for proteins on which the predictions should be made
testsuperset-taxon-list.tsv - taxon ID for proteins in test superset (Note: you may need to use encoding="ISO-8859-1" to read this file in pandas)
IA.txt - Information Accretion for each term. This is used to weight precision and recall (see Evaluation)
sample_submission.csv - a sample submission file in the correct format

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

!pip install mygene scanpy

from Bio import SeqIO
import pandas as pd
import mygene
import scanpy as sc
import anndata
import h5py
from sklearn.decomposition import PCA
from tqdm import tqdm


mg = mygene.MyGeneInfo()

tax_table = pd.read_table("/kaggle/input/cafa-5-protein-function-prediction/Train/train_taxonomy.tsv", sep="\t", index_col="EntryID")
    
prot_names = [record.id for record in SeqIO.parse("/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta", "fasta")]

é¢†è‹±æŽ¨è

??SpotSweeper: Spatial Transcriptomics QC,??DNA Methylation Patterns in Human Pancreas , Gene for Sperm Storage in Hens??: RNA-seq and Ribo-seq??

??SpotSweeper: Spatial Transcriptomics QC,??DNAâ€¦

Zifo Bioinformatics 9 ä¸ªæœˆå‰

Latest stories from the Sanger Institute

Wellcome Sanger Institute 7 ä¸ªæœˆå‰

Biweekly Research News Digest

Novogene 1 å¹´å‰

Read Human scRNA-Seq and normalize?

sc_hum_data = sc.read_h5ad("/kaggle/input/msci-raw-counts-anndata/train_cite_inputs_raw.ann.h5")
sc_hum_data.var_names = [x.split("_")[0] for x in sc_hum_data.var_names]

sc.pp.normalize_total(sc_hum_data, inplace=True)
sc.pp.log1p(sc_hum_data)
sc.pp.normalize_total(sc_hum_data, inplace=True)

Read Mouse scRNA-Seq?

sc_mus_data = sc.read_text("/kaggle/input/single-cell-rna-seq-nestorova2016-mouse-hspc/nestorowa_corrected_log2_transformed_counts.txt")

Select human and mouse proteins?

hum_prot = tax_table.loc[tax_table["taxonomyID"] == 9606].index.to_list()
mus_prot = tax_table.loc[tax_table["taxonomyID"] == 10090].index.to_list()

Query MyGene to get gene names?

hum_gene_prot_conv = mg.querymany(hum_prot, scopes='uniprot', fields='symbol,ensembl.gene', species='human', returnall=False, as_dataframe=True)
mus_gene_prot_conv = mg.querymany(mus_prot, scopes='uniprot', fields='symbol,ensembl.gene', species='mouse', returnall=False, as_dataframe=True)

Subset scRNA-Seq to CAFA5 genes?

hum_shared_genes = list(set(sc_hum_data.var_names).intersection(set(hum_gene_prot_conv["ensembl.gene"].unique())))
cafa_hum_gene_features = sc_hum_data[:, hum_shared_genes]

mus_shared_genes = list(set(sc_mus_data.var_names).intersection(set(mus_gene_prot_conv["symbol"].unique())))
cafa_mus_gene_features = sc_mus_data[:, mus_shared_genes]

PCA of scRNA-Seq to get new features?

hum_gene_pca = PCA(n_components=100).fit_transform(cafa_hum_gene_features.X.T.todense())
hum_gene_pca = pd.DataFrame(hum_gene_pca.T, columns = cafa_hum_gene_features.var_names.to_list())
new_feature_hum = []
hum_proteins = []
for prot in tqdm(hum_prot):
    try:
        genes = hum_gene_prot_conv.loc[prot].dropna()["ensembl.gene"]
        if type(genes) == str:
            genes = [genes]
        for gene in genes:
            if gene in hum_gene_pca.columns.to_list():
                hum_proteins.append(prot)
                new_feature_hum.append(hum_gene_pca[gene].to_list())
    except:
        continue
        
new_features_hum = pd.DataFrame(new_feature_hum).T
new_features_hum.columns = hum_proteins
new_features_hum.to_csv("/kaggle/working/new_features_hum_genes.csv")
mus_gene_pca = PCA(n_components=100).fit_transform(cafa_mus_gene_features.X.T)
mus_gene_pca = pd.DataFrame(mus_gene_pca.T, columns = cafa_mus_gene_features.var_names.to_list())
new_feature_mus = []
mus_proteins = []
for prot in mus_prot:
    try:
        genes = mus_gene_prot_conv.loc[prot].dropna()["symbol"]
        if type(genes) == str:
            genes = [genes]
        for gene in genes:
            if gene in mus_gene_pca.columns.to_list():
                mus_proteins.append(prot)
                new_feature_mus.append(mus_gene_pca[gene].to_list())
    except:
        continue
        
new_features_mus = pd.DataFrame(new_feature_mus).T
new_features_mus.columns = mus_proteins
new_features_mus.to_csv("/kaggle/working/new_features_mus_genes.csv")
new_feature = pd.concat([new_features_hum.T, new_features_mus.T])

Ridge regression?

Loading train labels

%%time
n_labels_to_consider = 1499
trainTerms = pd.read_csv("/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv",sep="\t")
print(trainTerms.shape)
display(trainTerms.head(2))
vec_freqCount = (trainTerms['term'].value_counts())
print(vec_freqCount )

print()
labels_to_consider = list(vec_freqCount.index[:n_labels_to_consider] )
print('n_labels_to_consider:', len(labels_to_consider), 'First 10:', labels_to_consider[:10] )

Load protein IDs

%%time
fn = '/kaggle/input/t5embeds/train_ids.npy'
vec_train_protein_ids = np.load(fn)
print(vec_train_protein_ids.shape)
vec_train_protein_ids

Prepare Y

train_size = 142246
Y = np.zeros( (train_size ,n_labels_to_consider) )
print(Y.shape)

series_train_protein_ids = pd.Series(vec_train_protein_ids ) # 

trainTerms_smaller = trainTerms[ trainTerms['term'].isin( labels_to_consider ) ] # to speed-up the next step 
print( trainTerms_smaller.shape)

for i in tqdm(range(Y.shape[1])):
    m = trainTerms_smaller['term'] ==  labels_to_consider[i]
#     m.sum()
    Y[:,i] =  series_train_protein_ids.isin(  set(trainTerms_smaller[m]['EntryID'] ) ).astype(float )
Y

Loading precalculated embedings

%%time

# fn = '/kaggle/input/protein-embeddings-1/reduced_embeddings_file.npy'
# fn = '/kaggle/input/protein-embeddings-1/embed_protbert_train_clip_1200_first_70000_prot.csv'
fn = '/kaggle/input/t5embeds/train_embeds.npy'
# fn = '/kaggle/input/t5embeds/test_embeds.npy'

print(fn)
if '.csv' in fn:
    df = pd.read_csv(fn, index_col = 0)
    X = df.values
elif '.npy' in fn:
    X = np.load(fn)
print(X.shape)
X

Load protein IDs

%%time
fn = '/kaggle/input/t5embeds/train_ids.npy'
vec_train_protein_ids = np.load(fn)
print(vec_train_protein_ids.shape)
vec_train_protein_ids

X_ext = []
for ind, prot_id in tqdm(enumerate(vec_train_protein_ids)):
    if prot_id in new_feature.index:
        X_ext.append(np.r_[X[ind,:], new_feature.loc[prot_id]])
    else:
        X_ext.append(np.r_[X[ind,:], [np.mean(X[ind,:])] * 100])
from sklearn.model_selection import train_test_split
IX = np.arange(len(X))
IX_train, IX_test, _,_ = train_test_split( IX, IX, train_size=0.1, random_state=42)
print(len(IX_train), len(IX_test),  IX_train[:10], IX_test[:10] )

Modeling

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
str_model_id = 'Ridge1'

df_models_stat = pd.DataFrame()
model
%%time 
import time
from sklearn.metrics import roc_auc_score

t0 = time.time()
model.fit(X[IX_train,:],Y[IX_train,:])
Y_pred_test = model.predict(X[IX_test,:])
tt = time.time() - t0
print(str_model_id, tt)
l = []
for i in range(Y.shape[1]):
    if len(np.unique(Y[IX_test,i]) ) > 1:
        s = roc_auc_score(Y[IX_test,i], Y_pred_test[:,i]);
    else:
        s = 0.5
    l.append(s)        
    if i %10 == 0:
        print(i, s)
df_models_stat.loc[str_model_id,'RocAuc Mean Test'] = np.mean(l)
df_models_stat.loc[str_model_id,'Time'] = np.round(tt,1)
df_models_stat.loc[str_model_id,'Test Size'] = len(IX_test)
df_models_stat

Scores statistics across targets

import matplotlib.pyplot as plt
plt.hist(l)
plt.show()
pd.Series(l).describe()

Full dataset model

%%time
model.fit(X,Y)

è¦æŸ¥çœ‹æˆ–æ·»åŠ è¯„è®ºï¼Œè¯·ç™»å½•

Haiqing Huaçš„æ›´å¤šæ–‡ç«

A conversation about Bitcoin and AI

2025å¹´1æœˆ2æ—¥

A conversation about Bitcoin and AI

What is the significance of Bitcoin in the financial field? ChatGPT said: ChatGPT As a decentralized digital currencyâ€¦
A conversation about Bitcoin and AIå…³äºŽæ¯”ç‰¹å¸ä¸ŽAIçš„ä¸€æ¬¡å¯¹è¯

2025å¹´1æœˆ2æ—¥

A conversation about Bitcoin and AIå…³äºŽæ¯”ç‰¹å¸ä¸ŽAIçš„ä¸€æ¬¡å¯¹è¯

æ¯”ç‰¹å¸åœ¨è´¢ç»é¢†åŸŸçš„å˜åœ¨æ„ä¹‰æ˜¯ä»€ä¹ˆï¼Ÿ ChatGPT said: ChatGPT æ¯”ç‰¹å¸ä½œä¸ºä¸€ç§åŽ»ä¸å¿ƒåŒ–çš„æ•°å—è´§å¸ï¼Œåœ¨è´¢ç»é¢†åŸŸå…·æœ‰ä»¥ä¸‹å˜åœ¨æ„ä¹‰å’Œä»·å€¼ï¼š 1. åŽ»ä¸å¿ƒåŒ–çš„è´§å¸ä½“ç³»â€¦
æµ·åº†æœªæ¥æ€æƒ³ä¹¦ã€Šhow they become richã€‹ä½œè€…ï¼šåŽæµ·åº†

2024å¹´9æœˆ16æ—¥

æµ·åº†æœªæ¥æ€æƒ³ä¹¦ã€Šhow they become richã€‹ä½œè€…ï¼šåŽæµ·åº†

Email: haiqinghua@yahoo.com æ ¹æ®ä¸€ä½æ³¨å†Œç†è´¢è§„åˆ’å¸ˆï¼Œä»–ç™½æ‰‹èµ·å®¶çš„ç™¾ä¸‡å¯Œç¿å®¢æˆ·æ‹¥æœ‰10ä¸ªå¸¸è§çš„ç†è´¢ä¹ æƒ¯ï¼Œæ™®é€šäººä¹Ÿå¯ä»¥æ•ˆä»¿ã€‚â€¦
The War of Souls â€” A Stunning Work of Future Thought by Haiqing Hua

2024å¹´9æœˆ15æ—¥

The War of Souls â€” A Stunning Work of Future Thought by Haiqing Hua

The War of Souls is the latest science fiction masterpiece by Haiqing Hua, filled with profound philosophicalâ€¦
Colonizing Mars

2024å¹´9æœˆ10æ—¥

Colonizing Mars

The following is a synopsis of the 50-episode TV series "Colonizing Mars", which shows a story structure full ofâ€¦
Haiqing Hua

2024å¹´9æœˆ7æ—¥

Haiqing Hua

Haiqing Hua is a prominent Chinese-Canadian writer, poet and social commentator currently living in Toronto, Canadaâ€¦
Stories Inspired by Zhuangzi (Chuang-Tzu, 369â€”298 BCE) è§£è¯»åº„åæ·±åˆ»æ€æƒ³ï¼Œç»†åˆ†å¤ç”»ä¼ ç»Ÿæ„å¢ƒ

2024å¹´7æœˆ26æ—¥

Stories Inspired by Zhuangzi (Chuang-Tzu, 369â€”298 BCE) è§£è¯»åº„åæ·±åˆ»æ€æƒ³ï¼Œç»†åˆ†å¤ç”»ä¼ ç»Ÿæ„å¢ƒ

Stories Inspired by Zhuangzi (Chuang-Tzu, 369â€”298 BCE) Prologue In the vast expanse of ancient Chinese philosophy, theâ€¦
The Wise Old Turtle and the Helpful Frog

2024å¹´7æœˆ26æ—¥

The Wise Old Turtle and the Helpful Frog

https://www.amazon.
Haiqing Chinese-English Bilingual Amazon Bookstore

2024å¹´6æœˆ27æ—¥

Haiqing Chinese-English Bilingual Amazon Bookstore

Due to issues with Amazon's publishing software recognizing Chinese characters correctly, errors in recognition mayâ€¦
Summary Reviews of Haiqing Hua's Recent Novels

2024å¹´6æœˆ19æ—¥

Summary Reviews of Haiqing Hua's Recent Novels

AI Socialism --- A Dream Come True In this speculative fiction, Hua envisions a world where AI-powered socialism hasâ€¦

See all articles

RidgeReg with expression features

Haiqing Hua

I share news from Chinese website (you can use google translate please also subscribe my YouTube Channel) | Ideologist | Poet | Futurist | Educator | Technologist | Business Analyst | Data Analyst | Realtor |

Background

Training Set

Test Superset

Test Set

File Descriptions

Files

é¢†è‹±æŽ¨è

Read Human scRNA-Seq and normalize?

Read Mouse scRNA-Seq?

Select human and mouse proteins?

Query MyGene to get gene names?

Subset scRNA-Seq to CAFA5 genes?

PCA of scRNA-Seq to get new features?

Ridge regression?

Haiqing Huaçš„æ›´å¤šæ–‡ç«

ç¤¾åŒºæ´žå¯Ÿ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†

Celebrating the birthday of Gregor Mendel, the father of genetics and early innovator

Berlin hosted ESHG 2024: The intersection of history and genetics

Exploring the Mysteries of 'Junk' in DNA:

Epigenetics Newsletter â€“ March, 2023

CRISPR at ten

Unveiling the Blueprint of India: The Genome India Project

Innovate and Lead: Breakthrough Insights Revealed in Our Webinars

The Road to the $100 Genome

Dr. Stephen Knox Jones Jr - one of 10 EMBO grant winners

The fundament of life: how molecules assert themselves against each other

Background

Training Set

Test Superset

Test Set

File Descriptions

Files

é¢†è‹±æŽ¨è

Read Human scRNA-Seq and normalize?

Read Mouse scRNA-Seq?

Select human and mouse proteins?

Query MyGene to get gene names?

Subset scRNA-Seq to CAFA5 genes?

PCA of scRNA-Seq to get new features?

Ridge regression?

Haiqing Huaçš„æ›´å¤šæ–‡ç«

A conversation about Bitcoin and AI

A conversation about Bitcoin and AIå…³äºŽæ¯”ç‰¹å¸ä¸ŽAIçš„ä¸€æ¬¡å¯¹è¯

æµ·åº†æœªæ¥æ€æƒ³ä¹¦ã€Šhow they become richã€‹ä½œè€…ï¼šåŽæµ·åº†

The War of Souls â€” A Stunning Work of Future Thought by Haiqing Hua

Colonizing Mars

Haiqing Hua

Stories Inspired by Zhuangzi (Chuang-Tzu, 369â€”298 BCE) è§£è¯»åº„å­æ·±åˆ»æ€æƒ³ï¼Œç»†åˆ†å¤ç”»ä¼ ç»Ÿæ„å¢ƒ

The Wise Old Turtle and the Helpful Frog

Haiqing Chinese-English Bilingual Amazon Bookstore

Summary Reviews of Haiqing Hua's Recent Novels

ç¤¾åŒºæ´žå¯Ÿ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†

Celebrating the birthday of Gregor Mendel, the father of genetics and early innovator

Berlin hosted ESHG 2024: The intersection of history and genetics

Exploring the Mysteries of 'Junk' in DNA:

Epigenetics Newsletter â€“ March, 2023

CRISPR at ten

Unveiling the Blueprint of India: The Genome India Project

Innovate and Lead: Breakthrough Insights Revealed in Our Webinars

The Road to the $100 Genome

Dr. Stephen Knox Jones Jr - one of 10 EMBO grant winners

The fundament of life: how molecules assert themselves against each other

é¢†è‹±æŽ¨è

A conversation about Bitcoin and AIå…³äºŽæ¯”ç‰¹å¸ä¸ŽAIçš„ä¸€æ¬¡å¯¹è¯

æµ·åº†æœªæ¥æ€æƒ³ä¹¦ã€Šhow they become richã€‹ä½œè€…ï¼šåŽæµ·åº†

Stories Inspired by Zhuangzi (Chuang-Tzu, 369â€”298 BCE) è§£è¯»åº„åæ·±åˆ»æ€æƒ³ï¼Œç»†åˆ†å¤ç”»ä¼ ç»Ÿæ„å¢ƒ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†