モデル学習 (CLI v2 + YAML)
コマンド
Azure ML CLI v2 を利用します。予め作成した YAML ファイルを引数にしてコマンドを実行します。
az ml job create --file train-model.yml
コード
YAML
実験の設定を YAML ファイルに記述します。
train-model.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ../src/model
command: >-
python train.py --reg_rate ${{inputs.reg_rate}} --training_data ${{inputs.training_data}}
inputs:
training_data:
type: uri_folder
path: azureml:diabetes-folder@latest
reg_rate: 0.01
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
experiment_name: train-model
description: train model using Logistic Regressions
Python
model
フォルダ配下に 2 つの Python スクリプト train.py
, helper.py
がある想定です。
train.py
import argparse
import mlflow
from helper import get_csvs_df
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# define functions
def main(args):
# TO DO: enable autologging
mlflow.autolog()
# read data
df = get_csvs_df(args.training_data)
# split data
X_train, X_test, y_train, y_test = split_data(df)
# train model
train_model(args.reg_rate, X_train, X_test, y_train, y_test)
# split data
def split_data(df):
X, y = (
df[
[
"Pregnancies",
"PlasmaGlucose",
"DiastolicBloodPressure",
"TricepsThickness",
"SerumInsulin",
"BMI",
"DiabetesPedigree",
"Age",
]
].values,
df["Diabetic"].values,
)
return train_test_split(X, y, test_size=0.30, random_state=0)
def train_model(reg_rate, X_train, X_test, y_train, y_test):
# train model
LogisticRegression(C=1 / reg_rate, solver="liblinear").fit(X_train, y_train)
def parse_args():
# setup arg parser
parser = argparse.ArgumentParser()
# add arguments
parser.add_argument("--training_data", dest="training_data", type=str)
parser.add_argument("--reg_rate", dest="reg_rate", type=float, default=0.01)
# parse args
args = parser.parse_args()
# return args
return args
# run script
if __name__ == "__main__":
# add space in logs
print("\n\n")
print("*" * 60)
# parse args
args = parse_args()
# run main function
main(args)
# add space in logs
print("*" * 60)
print("\n\n")
helper.py
import glob
import os
import pandas as pd
def get_csvs_df(path):
if not os.path.exists(path):
raise RuntimeError(f"Cannot use non-existent path provided: {path}")
csv_files = glob.glob(f"{path}/*.csv")
if not csv_files:
raise RuntimeError(f"No CSV files found in provided data path: {path}")
return pd.concat((pd.read_csv(f) for f in csv_files), sort=False)