(Kaggle) Intermediate Machine Learning

anfrhrl5555 2020. 12. 30. 00:12


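This post is a review of the Intermediate Machine Learning course from Kaggle Courses.

Kaggle also has a preceding course, Intro to Machine Learning, but it was simpler and easier than I expected, so I did not write a separate post for it.

The dataset used is the Melbourne Housing Snapshot (melb_data.csv).

The course is organized into 7 lessons in total, including the Introduction.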

2. Missing Values

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select target
y = data.Price

# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,random_state=0)

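pandas is imported, along with train_test_split from sklearn, which splits the data into training and validation sets.

There was one piece of syntax here that I did not know: select_dtypes(exclude=['object']).

It means "select the columns of melb_predictors, excluding those whose data type is object".

It is a very convenient way to keep only the column types I need.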
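
To see what select_dtypes does, here is a minimal sketch on a toy DataFrame. The column values are made up for illustration and are not taken from melb_data.csv.

```python
import pandas as pd

# Toy frame (made-up values): two numeric columns and one string column
df = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Distance": [2.5, 5.0, 7.1],
    "Suburb": pd.Series(["Abbotsford", "Richmond", "Kew"], dtype="object"),
})

# exclude=['object'] keeps everything except the string column
numeric_only = df.select_dtypes(exclude=["object"])

# include=['object'] does the opposite
strings_only = df.select_dtypes(include=["object"])

print(list(numeric_only.columns))
print(list(strings_only.columns))
```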

To make the ML step easy to reuse, it is wrapped in a function first.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Next, a list comprehension walks over X_train.columns and stores every column that contains at least one null value in cols_with_missing.

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

- List comprehensions and lambdas seem to show up in ML code more often than I expected. -
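
The same pattern can be tried on a small example. This is a sketch with made-up values; X_demo is a hypothetical stand-in for X_train.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train
X_demo = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Car": [1.0, np.nan, 2.0],
    "BuildingArea": [np.nan, np.nan, 120.0],
})

# Same list comprehension as in the post: keep columns that have any null
cols_with_missing_demo = [col for col in X_demo.columns if X_demo[col].isnull().any()]

# Equivalent form without a Python loop: boolean mask over the columns
same_result = X_demo.columns[X_demo.isnull().any()].tolist()

print(cols_with_missing_demo)
```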

With the following code, the columns that contain null values can be dropped easily.

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

Dropping nulls is fine in machine learning, but it should be done only after analyzing what the feature means.

Otherwise you may throw away a meaningful feature, or drop an entire column even though it has only a few null values.
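
One way to check this before dropping anything is to look at the share of missing values per column. The sketch below uses made-up data, and the 0.5 threshold is an arbitrary assumption, not a value from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Car is missing in 1 of 4 rows, BuildingArea in 3 of 4
X_demo = pd.DataFrame({
    "Rooms": [2, 3, 4, 5],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "BuildingArea": [np.nan, np.nan, np.nan, 120.0],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = X_demo.isnull().mean()

# Assumed threshold: only drop columns that are more than half empty
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(cols_to_drop)
```

With this rule, Car would be kept for imputation and only BuildingArea would be dropped.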

After dropping every column with nulls, print the MAE.

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

The MAE came out to 183550.

From here on, each lesson tries to bring this MAE down.

3. Using SimpleImputer from sklearn

from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

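※ What is SimpleImputer?

A. It handles missing values: wherever a value is missing, it automatically fills in the mean, the median, the most frequent value, or a specified constant.

default == mean

Import the imputer from sklearn.

Create the object with my_imputer = SimpleImputer(). --> To choose a strategy, pass strategy='...' inside the parentheses.

my_imputer.fit_transform(X_train) does the fit and the transform in one call.

X_valid is handled the same way, but since the imputer is already fitted, only transform is called.

fit_transform and transform return a NumPy array, so the result is converted back with pd.DataFrame. The conversion loses the column names, so they are assigned again.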
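
As an illustration of the strategy parameter, here is a small sketch that uses strategy='median' and passes the column names and index directly to pd.DataFrame. The data is made up, and this is just one possible way to write it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and validation frames with non-default indexes
train = pd.DataFrame(
    {"Car": [1.0, np.nan, 3.0], "Landsize": [100.0, 200.0, np.nan]},
    index=[10, 11, 12],
)
valid = pd.DataFrame({"Car": [np.nan], "Landsize": [50.0]}, index=[20])

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians from train; transform reuses them on valid
train_imp = pd.DataFrame(imputer.fit_transform(train),
                         columns=train.columns, index=train.index)
valid_imp = pd.DataFrame(imputer.transform(valid),
                         columns=valid.columns, index=valid.index)

print(train_imp)
print(valid_imp)
```

The missing Car value in valid is filled with the median learned from train, not from valid.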

The MAE after imputation is 178166.

4. Marking imputed values in separate columns

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

This creates a separate column for each of the 3 columns with missing values found above; each new column holds True where the value was null and False otherwise.
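
SimpleImputer also has an add_indicator option that appends such indicator columns by itself. The sketch below shows it on made-up data; it is an alternative to the manual loop, not what the course code uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: only Car has a missing value
X_demo = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Car": [1.0, np.nan, 3.0],
})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(add_indicator=True)
out = imputer.fit_transform(X_demo)

print(out.shape)  # two original columns plus one indicator column
print(out)
```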

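The MAE is 178927.

As a separate tip: after printing the shape of the dataset, the missing-value count of each column can be listed, filtered to counts at or above a chosen value (0 here, which shows every column).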

# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column >= 0])

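If you really want to write it without storing the result in a variable, it can be done as follows.

This way the counts are not shown, though, because the comparison returns only True/False for each column. There must be a way to show the counts too, but my skills are not there yet.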

X_train.isnull().sum() > 0
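
One way to keep the counts without an intermediate variable is to pass a callable to .loc. This is a sketch on made-up data; X_demo is a hypothetical stand-in for X_train.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train
X_demo = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Car": [1.0, np.nan, 2.0],
    "BuildingArea": [np.nan, np.nan, 120.0],
})

# .loc accepts a function; it receives the Series of counts and
# returns the boolean mask used to filter it
counts = X_demo.isnull().sum().loc[lambda s: s > 0]

print(counts)
```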

5. Categorical Variables (Lesson 3)

This is a new lesson, but the same setup code is used again, so I will simply copy and paste it.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

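The part I found most interesting here was 'nunique() < 10' in the list comprehension.

I wanted to see what nunique() means, so I tried it side by side with unique(): it counts the number of distinct values in the column. In other words, it is the number of different kinds of values the column takes.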
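
The difference between the two is easy to see on a small Series. The values below are made up for illustration.

```python
import pandas as pd

# Made-up categorical column
col = pd.Series(["h", "u", "h", "t", "u"], dtype="object")

# unique() returns the distinct values in order of first appearance
distinct_values = list(col.unique())

# nunique() returns how many distinct values there are
n_distinct = col.nunique()

print(distinct_values)
print(n_distinct)
```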

s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical Variables:")
print(object_cols)

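s records whether each column's type is object, and object_cols receives the matching index labels.

At first I could not make sense of list(s[s].index), so I took the code apart piece by piece.

s contains True/False values.

s.index contains the corresponding keys, which are the column names.

Indexing s with itself is boolean indexing: it keeps only the entries that are True, so object_cols ends up holding the names of the object columns.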

object_cols = list(s[s].index)
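
Taking the expression apart on a toy frame makes each step visible. The data here is made up for illustration.

```python
import pandas as pd

# Made-up frame: one numeric column and two string columns
df = pd.DataFrame({
    "Rooms": [2, 3],
    "Type": pd.Series(["h", "u"], dtype="object"),
    "Method": pd.Series(["S", "SP"], dtype="object"),
})

# Step 1: boolean Series indexed by column name
s = (df.dtypes == "object")
print(s)

# Step 2: boolean indexing keeps only the True entries
print(s[s])

# Step 3: the index of what is left is the list of column names
object_cols_demo = list(s[s].index)
print(object_cols_demo)
```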

Next, all of the categorical (object) columns are dropped and the model is trained.

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop Categorical Variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

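The MAE came out to 183350.

※ What are categorical variables?

To train an ML model, all of the data has to be converted to numbers.

Columns reported as object consist of string values.

Type takes values such as u, h, ...

Method takes values such as S, SA, SP, ...

Regionname consists of values such as Southern ---, Western ---, ...

Here, Type would be defined as u = 1, h = 2, ... and so on,

Method as S=1, SA=2, SP=3,

and Regionname as Southern=1, Western=2, so that everything becomes numeric.

Label encoding is used for this conversion; it is covered shortly.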
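
Done by hand, this conversion is just a dictionary lookup. The sketch below uses the u = 1, h = 2 example from the text on a made-up column that contains only those two values.

```python
import pandas as pd

# Made-up column containing only the two values named in the text
df = pd.DataFrame({"Type": pd.Series(["u", "h", "h", "u"], dtype="object")})

# Hand-written mapping from category to integer
type_codes = {"u": 1, "h": 2}

# map() replaces each string with its code
df["Type_code"] = df["Type"].map(type_codes)

print(df)
```

An encoder does the same thing, except that it builds the mapping from the data instead of having it written out by hand.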

6. Label Encoding

LabelEncoder is provided in sklearn.preprocessing. It is used as follows.

from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

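The data is first copied into two new variables with copy() (a deep copy by default), and then a LabelEncoder object is created.

The rest of the usage is not much different from SimpleImputer.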
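
One caveat with this loop is that transform fails if the validation data contains a category that was not in the training data. As a sketch of one alternative, OrdinalEncoder can encode several feature columns at once and map unseen categories to a fixed value. This assumes scikit-learn 0.24 or later; the data, including the unseen value "VB", is made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: "VB" appears only in the validation set
train = pd.DataFrame({"Method": pd.Series(["S", "SP", "S"], dtype="object")})
valid = pd.DataFrame({"Method": pd.Series(["SP", "VB"], dtype="object")})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = encoder.fit_transform(train)
valid_enc = encoder.transform(valid)

print(train_enc.ravel())
print(valid_enc.ravel())
```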

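The MAE came out to 175062.

7. One-hot Encoding

※ What is one-hot encoding?

One-hot encoding is a vector representation in which the size of the vocabulary (here, the number of categories in a column) is the dimension of the vector; the index of the item to represent is set to 1 and every other index is set to 0.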

from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

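handle_unknown='ignore'

→ Controls whether an unknown category encountered during transform raises an error or is ignored; default == 'error'.

When set to 'ignore', an unknown category is encoded as all zeros in that feature's one-hot columns.

sparse=False

→ Returns a dense NumPy array instead of a sparse matrix, which is what pd.DataFrame expects here. In scikit-learn 1.2 and later this parameter is named sparse_output.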
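
The effect of handle_unknown='ignore' can be checked on a tiny example. The sketch below uses made-up data in which "x" is a category the encoder never saw during fit, and it calls toarray() on the default sparse output so that it does not depend on the sparse/sparse_output parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "x" appears only in the validation set
train = pd.DataFrame({"Type": pd.Series(["h", "u", "t"], dtype="object")})
valid = pd.DataFrame({"Type": pd.Series(["u", "x"], dtype="object")})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Default output is a sparse matrix; toarray() makes it dense
encoded = encoder.transform(valid).toarray()

print(list(encoder.categories_[0]))  # column order of the one-hot output
print(encoded)                       # the row for "x" is all zeros
```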

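8. Pipeline

A pipeline simplifies the code by bundling the preprocessing and modeling steps together.

I have not fully understood this part yet, so I will write it down for now and come back to it later.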

from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
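
What a pipeline does on fit and predict can be spelled out by hand on a toy example. In the sketch below the data is made up and LinearRegression is only a small stand-in for the random forest; the point is that both versions give the same predictions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data with a few missing values
X_tr = np.array([[1.0, 2.0], [np.nan, 3.0], [3.0, np.nan], [4.0, 5.0]])
y_tr = np.array([3.0, 5.0, 6.5, 9.0])
X_va = np.array([[np.nan, 4.0], [2.0, 2.0]])

# Pipeline version: one fit call, one predict call
pipe = Pipeline(steps=[("imputer", SimpleImputer()),
                       ("model", LinearRegression())])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_va)

# Manual version: the same steps written out one by one
imputer = SimpleImputer()
model = LinearRegression()
model.fit(imputer.fit_transform(X_tr), y_tr)       # fit the imputer on training data only
manual_preds = model.predict(imputer.transform(X_va))  # reuse it on validation data

print(pipe_preds)
print(manual_preds)
```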

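9. Cross-validation

There are plenty of explanations of cross-validation on Google and YouTube.

I recommend looking some of them up.

It gives a more reliable measure of model performance, which makes it well suited to choosing between models.

Since this is a new lesson, the dataset is loaded again.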

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                      ('model', RandomForestRegressor(n_estimators=100,
                                                     random_state=0))
                      ])

The Pipeline from the previous section is used to build the preprocessing step and the model.

from sklearn.model_selection import cross_val_score

scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

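Import cross_val_score from sklearn and call it as shown above.

The 'neg_mean_absolute_error' scoring returns negative values, so the result is multiplied by -1.

cv=5 means 5-fold cross-validation: the data is split into 5 folds and each fold is used once for validation.

Print the mean.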

print(scores.mean())
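
To make the 5 folds concrete, the loop that cross_val_score runs can be written out with KFold. This is a sketch on synthetic data, with LinearRegression as a fast stand-in for the pipeline; for a regressor, cv=5 corresponds to KFold(n_splits=5) without shuffling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Manual 5-fold loop: train on 4 folds, score on the remaining one
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    manual_scores.append(mean_absolute_error(y_demo[test_idx], preds))

# The same thing through cross_val_score
library_scores = -1 * cross_val_score(LinearRegression(), X_demo, y_demo,
                                      cv=5, scoring="neg_mean_absolute_error")

print(manual_scores)
print(library_scores)
```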

10. XGBoost

I have not studied XGBoost enough yet either, so I will just record the code here.

from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

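Create the model and fit it right away.

Note that XGBoost needs numeric input: int and float columns can be used as they are, but object (string) columns have to be encoded first.

With the default settings: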

from sklearn.metrics import mean_absolute_error

pred = my_model.predict(X_valid)
print("MEAN Absolute Error: " + str(mean_absolute_error(pred, y_valid)))

With the number of trees set to 500:

my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

pred = my_model.predict(X_valid)
print("MEAN Absolute Error: " + str(mean_absolute_error(pred, y_valid)))

With several more parameters added:

my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)
            

pred = my_model.predict(X_valid)
print("MEAN Absolute Error: " + str(mean_absolute_error(pred, y_valid)))

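early_stopping_rounds stops training once the validation score has failed to improve for that many consecutive rounds (5 here), which helps prevent overfitting. Recent XGBoost releases expect this argument in the XGBRegressor constructor rather than in fit().

eval_set specifies the data used to compute the validation score for early stopping (here X_valid and y_valid).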
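
The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of the idea, not XGBoost's implementation; the function name and the list of validation errors are made up.

```python
def early_stopping_round(val_errors, patience):
    """Return the index of the best round seen before training stops.

    Training stops once `patience` rounds in a row fail to improve
    on the best validation error seen so far.
    """
    best_round = 0
    best_error = float("inf")
    for i, err in enumerate(val_errors):
        if err < best_error:
            best_error = err
            best_round = i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_round


# Made-up validation errors, one per boosting round
errors = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3, 7.4, 6.0]

best = early_stopping_round(errors, patience=5)
print(best)
```

In this example training stops before the last round is ever reached, so the later value of 6.0 is never seen. A larger patience costs more training time but is less likely to stop too early.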

With learning_rate set to 0.05 (default = 0.3 in current XGBoost releases; older versions of the scikit-learn wrapper used 0.1):

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)
            

pred = my_model.predict(X_valid)
print("MEAN Absolute Error: " + str(mean_absolute_error(pred, y_valid)))

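learning_rate is the step size of gradient boosting: the predictions of each new tree are multiplied by this value (0.05 here) before being added to the ensemble, so the model moves toward the minimum of the loss in small steps.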
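
The role of the step size can be seen in a hand-written boosting loop. The sketch below is plain gradient boosting with squared error built from scikit-learn decision trees on synthetic data; it is an assumption-laden illustration of the idea, not XGBoost's actual algorithm, and the boost function is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): learn sin(x) on [-3, 3]
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0])


def boost(X, y, n_trees, learning_rate):
    """Fit n_trees small trees, each one on the current residuals."""
    pred = np.full(len(y), y.mean())  # start from the mean prediction
    for _ in range(n_trees):
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
        # Each tree's contribution is scaled by the learning rate
        pred += learning_rate * tree.predict(X)
    return pred


mae_baseline = np.abs(y_demo - y_demo.mean()).mean()
mae_small_step = np.abs(y_demo - boost(X_demo, y_demo, 10, 0.05)).mean()
mae_large_step = np.abs(y_demo - boost(X_demo, y_demo, 10, 0.5)).mean()

print(mae_baseline, mae_small_step, mae_large_step)
```

With the same number of trees, the smaller learning rate gets less far on the training data, which is why a small learning_rate is usually paired with a larger n_estimators, as in the code above.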

The best approach is to tune all of the hyperparameters above until the best values are found.

That concludes the material on Intermediate Machine Learning.

Thank you!
