Notes on the Molecular AI Prediction Competition

#AISummerCamp #Datawhale #SummerCamp

Task1: Run the baseline

Follow the Task1 guide to get the baseline running.

Register an account

Register a Baidu account, or log in with an existing one.

Fork the project

Zero-Basics Introduction to AI Data Mining Competitions: Speed-Run Baseline - PaddlePaddle AI Studio (飞桨AI Studio星河社区)

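Launch the project

Choose a runtime environment and click OK. Unless you have special requirements, the default basic tier is enough.

Wait a moment for the online project to start.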

Run the project code

Click "Run All Cells".

When the run finishes, it produces the file submit.csv.

This is the file to submit.
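
Before uploading, it can help to sanity-check the generated file. The sketch below is only an illustration: it assumes the submission has exactly the two columns `uuid` and `Label` (the names used in the save step later in these notes) and that `Label` is 0 or 1. The helper name `check_submission` and the demo file are made up for this example.

```python
import csv
import os
import tempfile


def check_submission(path):
    """Return the number of data rows if the file looks like a valid submission."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Assumed header, taken from the save step in these notes
        if reader.fieldnames != ["uuid", "Label"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        rows = list(reader)
    for row in rows:
        # Assumed: labels are written as the integers 0 and 1
        if row["Label"] not in {"0", "1"}:
            raise ValueError(f"Label must be 0 or 1, got {row['Label']!r}")
    return len(rows)


# Demo on a tiny made-up file written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "submit.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["uuid", "Label"], [1, 0], [2, 1]])

n_rows = check_submission(demo_path)
print(n_rows)
```

In practice you would call `check_submission("submit.csv")` on the real file instead of the demo file.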

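Task2: In-depth analysis of the competition problem

Understand the problem and learn the general workflow of a machine learning competition.

Understanding the data fields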

Study and analyze these fields: Smiles, Assay (DC50/Dmax), Assay (Protac to Target, IC50), Assay (Cellular activities, IC5, Article DOI, InChI.

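Prediction target

Participants must predict the degradation capability of PROTACs, that is, the value of the Label field.

Degradation capability is judged from the DC50 and Dmax values: if DC50 is greater than 100 nM and Dmax is less than 80%, Label is 0; if DC50 is less than or equal to 100 nM or Dmax is greater than or equal to 80%, Label is 1.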
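
The rule above can be written as a small function. This is a minimal sketch: the function name `protac_label` and the example values are made up, and missing DC50 or Dmax values are not handled.

```python
def protac_label(dc50_nm, dmax_pct):
    """Return 1 if DC50 <= 100 nM or Dmax >= 80 %, otherwise 0."""
    return 1 if (dc50_nm <= 100 or dmax_pct >= 80) else 0


# Made-up (DC50 in nM, Dmax in %) pairs covering each branch and the boundary
examples = [(50, 60), (500, 90), (500, 60), (100, 80)]
labels = [protac_label(dc50, dmax) for dc50, dmax in examples]
print(labels)  # [1, 1, 0, 1]
```

The two conditions in the rule are exact complements, so every (DC50, Dmax) pair gets exactly one label.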

Zero-Basics Introduction to AI (Machine Learning) Competitions - Feishu Docs (飞书云文档)
https://datawhaler.feishu.cn/wiki/Ue7swBbiJiBhsdk5SupcqfL7nLX

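Task3: Initial parameter tuning

Based on the original tutorial by 温酒相随, teaching assistant of study group 9, edited and adjusted by teaching assistant 九月, and first published on Bilibili.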

https://www.bilibili.com/read/cv35897986/?jump_opus=1

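Import libraries and load the training and test sets

# 1. Import the required libraries
# pandas: data processing and analysis
import pandas as pd
# numpy: scientific computing and multi-dimensional arrays
import numpy as np
# RDKit: parse and canonicalize SMILES strings
from rdkit import Chem
# scikit-learn: TF-IDF features, K-fold splitting and the F1 metric
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
# LGBMClassifier from lightgbm (imported by the baseline, not used in the steps below)
from lightgbm import LGBMClassifier
# CatBoostClassifier: the model trained in cv_model
from catboost import CatBoostClassifier


# 2. Load the training and test sets
# Read the training set with read_excel() (the baseline comment names this file 'traindata-new.xlsx'; adjust the path to your copy)
train = pd.read_excel('./data/train.xlsx')
# Read the test set with read_excel() (named 'testdata-new.xlsx' in the baseline comment)
test = pd.read_excel('./data/test.xlsx')
train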

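Inspect the data types

# info() prints the summary itself and returns None, so there is nothing to assign
train.info()

Some columns have very few non-null entries. They can be filtered out to reduce overfitting.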

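# Filter
# Keep all rows; keep the columns from index 1 onward (this drops the first column)
# Caution: if the first column is uuid, later steps that read test['uuid'] will no longer find it
train = train.iloc[:,1:]
test = test.iloc[:,1:]
# train['lan'].value_counts()# language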

List the object-dtype columns

# Names of the columns whose dtype is object (typically strings)
train.select_dtypes(include = 'object').columns

Check for missing values

# Count missing values per column and show only the columns that have any
temp = train.isnull().sum()
temp[temp > 0]

Count unique values

# Collect the column names as a list
# fea = train.columns
fea = train.columns.tolist()
fea

Print the unique-value counts

# nunique() counts the distinct non-null values in each column
for f in fea:
    print(f, train[f].nunique())

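Filter columns

# cols collects the names of the columns that have fewer than 10 non-null values in the test set
cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        cols.append(f)
cols

# Drop these columns from both the training and test sets so that mostly-empty columns are not used in later analysis or modeling
train = train.drop(cols, axis=1)
test = test.drop(cols, axis=1)
# Concatenate the cleaned training and test sets into one DataFrame named data, so feature engineering is applied to both consistently
data = pd.concat([train, test], axis=0, ignore_index=True)
# Names of the retained columns from index 2 onward
newData = data.columns[2:]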

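Parse each SMILES string into a molecule object, then write it back out as a canonical SMILES string

# Requires RDKit: from rdkit import Chem
# Chem.MolFromSmiles parses the string; Chem.MolToSmiles writes it back, keeping stereochemistry (isomericSmiles=True)
data['smiles_list'] = data['Smiles'].apply(lambda x:[Chem.MolToSmiles(mol, isomericSmiles=True) for mol in [Chem.MolFromSmiles(x)]])
# Join the one-element list into a plain string
data['smiles_list'] = data['smiles_list'].map(lambda x: ' '.join(x))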

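Compute TF-IDF features with TfidfVectorizer

# Requires: from sklearn.feature_extraction.text import TfidfVectorizer
# max_df=0.9 ignores terms found in more than 90% of the rows; sublinear_tf=True replaces tf with 1 + log(tf)
tfidf = TfidfVectorizer(max_df = 0.9, min_df = 1, sublinear_tf = True)
res = tfidf.fit_transform(data['smiles_list'])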
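
To see what the vectorizer treats as a "term" in a SMILES string, the sketch below runs scikit-learn's default word analyzer on a made-up example (the SMILES of aspirin and three tiny molecules). With default settings the text is lowercased and only runs of two or more word characters are kept, so brackets, `=` and single atoms are dropped. The example strings are illustrative and not taken from the competition data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as in the notes
tfidf = TfidfVectorizer(max_df=0.9, min_df=1, sublinear_tf=True)

# The analyzer is the preprocessing + tokenizing function the vectorizer applies to each row
analyzer = tfidf.build_analyzer()
tokens = analyzer("CC(=O)Oc1ccccc1C(=O)O")
print(tokens)  # ['cc', 'oc1ccccc1c']

# Fit on three tiny made-up SMILES strings: one row per string, one column per term
matrix = tfidf.fit_transform(["CCO", "CC(=O)O", "CCN"])
print(matrix.shape)  # (3, 3)
print(sorted(tfidf.vocabulary_))  # ['cc', 'cclo'] is NOT expected; see the real terms printed here
```

The last line prints the learned vocabulary, here the three terms `cc`, `ccn` and `cco`. If finer-grained features are wanted, a character-level analyzer (`analyzer='char'`) is one possible alternative, but that is a different setup from the one used in these notes.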

Convert to a DataFrame

# Convert the sparse TF-IDF result to a DataFrame
tfidf_df = pd.DataFrame(res.toarray())
tfidf_df.columns = [f'smiles_tfidf_{i}' for i in range(tfidf_df.shape[1])]
# Append the TF-IDF columns to data column-wise
data = pd.concat([data, tfidf_df], axis=1)

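Integer (label) encoding

# Integer encoding: map each distinct value to an integer in order of first appearance
def label_encode(series):
    unique = list(series.unique())
    return series.map(dict(zip(
        unique, range(series.nunique())
    )))
# Encode every object-dtype column among the retained columns.
# newData is used because cols holds the columns dropped earlier, which no longer exist in data.
for col in newData:
    if data[col].dtype == 'object':
        data[col]  = label_encode(data[col])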
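
One detail worth knowing about `label_encode`: `unique()` includes a missing value while `nunique()` does not count it, so in a column with missing values the last category may be left without a code. As a hedged alternative (not what the baseline uses), `pd.factorize` also assigns integers in order of first appearance and marks missing values with -1. The example values below are made up.

```python
import pandas as pd

# Made-up categorical column with one missing value
s = pd.Series(["VHL", "CRBN", None, "VHL", "IAP"])

# codes: one integer per row; uniques: the categories in order of first appearance
codes, uniques = pd.factorize(s)
print(codes.tolist())  # [0, 1, -1, 0, 2]
print(list(uniques))   # ['VHL', 'CRBN', 'IAP']
```

Whether -1 is an acceptable marker for missing values depends on the model; tree-based models such as CatBoost generally accept it as just another numeric value.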

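Build the training and test sets

# Rows of data whose Label is not null become the training set; reset the index
train = data[data.Label.notnull()].reset_index(drop=True)
# Rows of data whose Label is null become the test set; reset the index
test = data[data.Label.isnull()].reset_index(drop=True)
# Feature selection: use every column except the id, the label and the raw SMILES text
features = [f for f in train.columns if f not in ['uuid','Label','smiles_list']]
# Build the training and test feature matrices
x_train = train[features]
x_test = test[features]
# Training labels
y_train = train['Label'].astype(int)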

Use 5-fold cross-validation (KFold(n_splits=5))

def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2022):

    # Run 5-fold cross-validation
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions for the training rows
    test = np.zeros(test_x.shape[0])    # fold-averaged predictions for the test rows
    cv_scores = []
    # In each fold, split the training data into training and validation parts by index
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} {}************************************'.format(str(i+1), str(seed)))

        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        # Configure the CatBoost classifier parameters (random_seed follows the seed argument)
        params = {'learning_rate': 0.05, 'depth': 8, 'l2_leaf_reg': 10, 'bootstrap_type':'Bernoulli','random_seed':seed,
                  'od_type': 'Iter', 'od_wait': 100, 'allow_writing_files': False, 'task_type':'CPU'}
        # Train the model with the CatBoost classifier
        model = clf(iterations=20000, **params, eval_metric='AUC')

        model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                  metric_period=100,
                  cat_features=[],
                  use_best_model=True,
                  verbose=1)
        # Predicted probability of class 1 for the validation fold and for the test set
        val_pred  = model.predict_proba(val_x)[:,1]
        test_pred = model.predict_proba(test_x)[:,1]

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        # F1 score of this fold at a 0.5 threshold
        cv_scores.append(f1_score(val_y, np.where(val_pred>0.5, 1, 0)))

        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
    
cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")

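This code defines a cross-validation function for training and evaluating a classifier. Specifically, it runs 5-fold cross-validation with a CatBoost classifier on the given training data and returns the predictions for the training set and the test set.

The function takes these parameters:

clf: the classifier class, here CatBoostClassifier.
train_x, train_y: the features and labels of the training data.
test_x: the features of the test data.
clf_name: the classifier name, used when printing results.
seed: the random seed, 2022 by default.

The main steps of the function are:

Create a 5-fold cross-validator (KFold).
Initialize the prediction arrays for the training set and the test set.
In each fold, split the data into a training part and a validation part using the fold's indices.
Configure the CatBoost classifier parameters and train the model on the training part.
Predict on the validation part and on the test set, and store the predictions in the result arrays.
Compute and record the F1 score on the validation part of each fold.
Print the list of per-fold F1 scores, their mean and their standard deviation.
Return the predictions for the training set and the test set.

Calling this function gives the cross-validation results of the CatBoost classifier on the given data, which can be used to assess model performance, together with the predictions for the training and test sets.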
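
The out-of-fold mechanics described above can be tried without CatBoost or the competition data. The sketch below is an illustration under stated assumptions: it swaps in scikit-learn's `LogisticRegression` as a stand-in model and uses synthetic data, so the scores mean nothing for the competition. Only the bookkeeping (out-of-fold array, fold-averaged test predictions, per-fold F1) mirrors `cv_model`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Synthetic data standing in for x_train, y_train and x_test
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
X_test = rng.normal(size=(20, 4))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros(len(X))             # out-of-fold predictions for the training rows
test_pred = np.zeros(len(X_test))  # fold-averaged predictions for the test rows
scores = []

for trn_idx, val_idx in kf.split(X):
    # Stand-in model; cv_model trains a CatBoost classifier at this point
    model = LogisticRegression().fit(X[trn_idx], y[trn_idx])
    # Each training row is predicted exactly once, by the model that did not see it
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    # Each fold contributes 1/5 of the final test prediction
    test_pred += model.predict_proba(X_test)[:, 1] / kf.n_splits
    scores.append(f1_score(y[val_idx], (oof[val_idx] > 0.5).astype(int)))

print(len(scores), oof.shape, test_pred.shape)
print("mean F1:", np.mean(scores))
```

Because every training row gets a prediction from a model that never saw it, the out-of-fold array gives a less optimistic picture of performance than predicting on rows the model was trained on.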

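Write the results

from datetime import datetime

current_time = datetime.now()  # get the current time
formatted_time = current_time.strftime("%Y-%m-%d %H:%M:%S")  # format the time as a string

# print("Current time:", current_time)
# print("Formatted time:", formatted_time)
# Turn the averaged test probabilities into 0/1 labels, using the same 0.5 threshold as the F1 score in cv_model
pred = np.where(cat_test > 0.5, 1, 0)
# 5. Save the result file locally, named after the timestamp
# (colons are not allowed in Windows file names; change the format string when running there)
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': pred
    }
).to_csv(formatted_time+ '.csv', index=None)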

The local torch part was not used.