Iris Dataset

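The Iris dataset is one of the best-known datasets in machine learning and data analysis. It was introduced in 1936 by the British statistician and biologist Ronald A. Fisher.

The Iris dataset contains measurements for three species of iris. The species and their integer labels are as follows.

Iris species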

Setosa: 0
Versicolor: 1
Virginica: 2

Features measured for each iris

Sepal length
Sepal width
Petal length
Petal width

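# Once the dataset has been loaded (see load_iris below), the names of these four features are available as:
feature_names = iris["feature_names"]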

Scikit-learn is a Python toolkit that makes it easy to build and test machine learning models, and it ships with the Iris dataset. The dataset can be loaded with the load_iris function.

Below is a simple example that loads the Iris dataset.

from sklearn.datasets import load_iris
iris = load_iris()
iris

Example output

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        ...
        [5.8, 2.7, 5.1, 1.9],
        [6.8, 3.2, 5.9, 2.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.7, 3. , 5.2, 2.3],
        [6.3, 2.5, 5. , 1.9],
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 'frame': None,
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n                \n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. topic:: References\n\n   - Fisher, R.A. 
"The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'iris.csv',
 'data_module': 'sklearn.datasets.data'}
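
The object printed above can also be read field by field instead of all at once. The following is a minimal sketch, assuming only the standard return value of load_iris; the variable name first_species is introduced here purely for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object behaves like a dictionary and also allows attribute access.
X = iris.data      # feature matrix
y = iris.target    # integer class labels (0, 1, 2)

print(X.shape)                        # (150, 4)
print(iris.target_names.tolist())     # ['setosa', 'versicolor', 'virginica']
print(iris.feature_names)

# Map the first sample's integer label back to its species name.
first_species = iris.target_names[y[0]]
print(first_species)                  # setosa
```

Indexing target_names with a label is the usual way to turn a numeric class such as 0 back into a readable name.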

train_test_split

train_test_split is a function provided by the scikit-learn library that splits data into a training set and a test set. The split is needed because evaluating a trained model requires a dataset that is independent of the data it was trained on.

feature_names = iris["feature_names"]
feature_names

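Creating the Iris DataFrame

import pandas as pd

# Build a DataFrame from the feature matrix and add the labels as a "target" column
df_iris = pd.DataFrame(iris["data"], columns=feature_names)
df_iris["target"] = iris["target"]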
df_iris.head()
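
As an alternative to building the DataFrame by hand, load_iris accepts an as_frame argument in recent scikit-learn versions. The sketch below assumes such a version is installed; the names iris_df and df are chosen here for illustration only.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
iris_df = load_iris(as_frame=True)

# .frame holds the four feature columns plus a "target" column.
df = iris_df.frame
print(df.shape)              # (150, 5)
print(df.columns.tolist())
```

This produces a table with the same layout as a hand-built DataFrame that has the labels attached as a "target" column.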

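Splitting the data into training and test sets

from sklearn.model_selection import train_test_split

# Prepare the feature data and the target data
X = iris.data
y = iris.target

# train_data : data the model learns from
# valid_data : practice-exam data used for checks along the way
# test_data : final-exam data used for the last check (raw data the model has never seen)
# In practice

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train_test_split(independent variables, dependent variable, test size, seed, ...)
# Seed: the same independent variables, dependent variable, and seed always produce the same random shuffle. Results can vary with how the data is shuffled, so the randomness should be fixed.
# Rule of thumb, with roughly 100,000 samples as the threshold: with more data, use about 60% for training; with less data, use 80-90% for training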

# Split the DataFrame version: the features are every column except "target"
X_train, X_test, y_train, y_test = train_test_split(df_iris.drop(columns="target"), df_iris["target"], test_size=0.2, random_state=2023)
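
The effect of the seed can be checked directly. The sketch below is an illustration rather than part of the original walkthrough: it splits the same data twice with one seed and once with another, and compares the resulting test sets. The seed value 7 and the variable names with suffixes _a, _b, and _c are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same seed select exactly the same rows.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_seed_identical = np.array_equal(X_test_a, X_test_b)
print(same_seed_identical)   # True

# A different seed shuffles differently, so the test rows generally differ.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
print(np.array_equal(X_test_a, X_test_c))
```

Fixing random_state is what makes an accuracy figure reproducible from one run to the next.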


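# By convention among developers, multi-dimensional arrays are named with an uppercase first letter (X) and one-dimensional arrays with a lowercase letter (y)

X_train: the feature data used for training

X_test: the feature data used for testing

y_train: the correct labels for the training data

y_test: the answer key for the test data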
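
The comments in the split code also mention validation data, which train_test_split does not produce on its own. One common way to get all three sets is to call the function twice, as sketched below. The names X_valid, y_valid, X_rest, and y_rest and the 60/20/20 proportions are assumptions made for this illustration, not something the original code defines.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 120 samples into training and validation sets.
# 25% of 120 is 30, so the result is 90 / 30 / 30.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))   # 90 30 30
```

The validation set plays the role of the practice exam, while the test set is kept untouched until the final evaluation.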

The predict() method returns the model's predictions.

shape

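In machine learning, the dimensions of a matrix are expressed as its shape.

For example, a matrix with 3 rows and 2 columns has size 3 along the first dimension and 2 along the second, so its shape is written (3, 2).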
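
The (3, 2) case can be reproduced with a small NumPy array. This is a minimal sketch with made-up values; the name m is used here only for illustration.

```python
import numpy as np

# A matrix with 3 rows and 2 columns.
m = np.array([[1, 2],
              [3, 4],
              [5, 6]])

print(m.shape)   # (3, 2)
print(m.ndim)    # 2
```

The first entry of shape counts rows and the second counts columns, which is the same reading that applies to X_train.shape and X_test.shape.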

X_train.shape, X_test.shape

Result

((120, 4), (30, 4))

X_train

Printing X_train displays the training features: 120 rows and 4 columns.

y_train

Printing y_train displays the 120 labels that correspond to those rows.

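# Import the SVC classification model
from sklearn.svm import SVC
# Import the accuracy metric
from sklearn.metrics import accuracy_score

# Create an SVC object
svc = SVC()

# Fit the model on the features and labels
svc.fit(X_train, y_train)

# Store the predictions for the test data in y_pred
y_pred = svc.predict(X_test)

# accuracy_score() returns the fraction of correct predictions
# y_test holds the true test labels; y_pred holds the predictions made above
print("Accuracy:", accuracy_score(y_test, y_pred))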
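
To make clear what accuracy_score measures, the sketch below computes the same value by hand on a tiny example. The labels in y_true and y_hat are hypothetical and made up purely for illustration; they are not results from the Iris model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up purely for illustration.
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
manual = np.mean(y_true == y_hat)

print(manual)                          # 0.8
print(accuracy_score(y_true, y_hat))   # 0.8
```

Four of the five predictions match, so both calculations give 0.8.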

Predicting from new input values rather than test data

# 6.2 2.5 5.0 1.9
y_pred = svc.predict([[6.4, 2.6, 5.1, 1.9]])
y_pred

/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but SVC was fitted with feature names
  warnings.warn(
array([2])

The answer for this input is 2, which corresponds to Virginica.
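
The warning appears because the model was fitted on a DataFrame with column names but received a plain list at prediction time. One possible way to avoid it, and to print the species name instead of the bare number, is sketched below. This is a self-contained illustration that assumes a scikit-learn version supporting as_frame=True; the names sample and label are introduced here and are not part of the original code.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=2023
)

svc = SVC()
svc.fit(X_train, y_train)

# Wrap the new sample in a DataFrame with the same column names used for fitting.
sample = pd.DataFrame([[6.4, 2.6, 5.1, 1.9]], columns=iris.feature_names)
label = svc.predict(sample)[0]

# Translate the integer label into a species name.
print(label, iris.target_names[label])
```

Because the input now carries the same feature names as the training data, scikit-learn has no name mismatch to warn about.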