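This post records my first attempt at Kaggle's introductory competition, Titanic survival prediction. Through the competition I mainly became familiar with the pandas and sklearn packages and gained some understanding of common classification problems. The data analysis process is recorded below.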
Data analysis
(891, 12)
(418, 11)
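The two shapes above are the training set (891 rows, 12 columns) and the test set (418 rows, 11 columns, since it has no Survived column). The summary that follows has 1309 rows, which suggests the two were stacked into a single frame. A minimal sketch of that step, assuming `pd.concat` was used; `combine_train_test` is a hypothetical helper name, and the tiny frames stand in for what would normally come from `pd.read_csv` on the competition files:

```python
import pandas as pd


def combine_train_test(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Stack train and test rows; test rows get NaN in Survived."""
    return pd.concat([train, test], ignore_index=True, sort=False)


# Tiny stand-in frames (not the real data) to show the effect.
train = pd.DataFrame(
    {"PassengerId": [1, 2], "Survived": [0, 1], "Fare": [7.25, 71.28]}
)
test = pd.DataFrame({"PassengerId": [3], "Fare": [8.05]})

data_all = combine_train_test(train, test)
print(data_all.shape)
print(data_all["Survived"].dtype)  # float64, because of the NaN in the test row
```

Working on one combined frame keeps the feature engineering identical for both sets; the rows can be split again later by checking where Survived is missing.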
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
ALL: 1309
------------------------------
Cabin       1014
Age          263
Embarked       2
Fare           1
dtype: int64
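A report like the one above can be produced with a few pandas calls. This is only a sketch under assumptions: `missing_report` is a hypothetical helper, and Survived is excluded because its missing values simply mark the test rows.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame, target: str = "Survived") -> pd.Series:
    """Count NaNs per feature column, largest first, skipping complete columns."""
    counts = df.drop(columns=[target], errors="ignore").isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Small stand-in frame, not the real data.
demo = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Fare": [7.25, 71.28, 7.92, 8.05],
        "Survived": [0.0, 1.0, np.nan, np.nan],
    }
)
report = missing_report(demo)
print("ALL:", len(demo))
print("-" * 30)
print(report)
```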
Filling missing values
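The outputs in this post show that Embarked ends up with an extra 'None' category, that Age and Fare have no gaps afterwards, and that Cabin is left as it is. A minimal sketch consistent with those outputs is given here; the median-based rules for Fare and Age are assumptions rather than the rules actually used, and `fill_missing` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Treat an unknown port as its own category.
    out["Embarked"] = out["Embarked"].fillna("None")
    # Assumption: a missing Fare gets the overall median.
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    # Assumption: a missing Age gets the median of passengers with the same
    # Sex and Pclass, falling back to the overall median for empty groups.
    group_median = out.groupby(["Sex", "Pclass"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())
    return out


# Small stand-in frame, not the real data.
demo = pd.DataFrame(
    {
        "Sex": ["male", "male", "female", "female", "male"],
        "Pclass": [3, 3, 1, 1, 1],
        "Age": [20.0, np.nan, 30.0, 40.0, np.nan],
        "Fare": [7.0, 9.0, np.nan, 80.0, 50.0],
        "Embarked": ["S", None, "C", "S", "Q"],
    }
)
filled = fill_missing(demo)
print(filled)
```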
number describe:
          min       max       mean        std   count
Sex     0.00    1.0000   0.355997   0.478997  1309.0
Pclass  1.00    3.0000   2.294882   0.837836  1309.0
SibSp   0.00    8.0000   0.498854   1.041658  1309.0
Parch   0.00    9.0000   0.385027   0.865560  1309.0
Age     0.17   80.0000  29.876751  13.447012  1309.0
Fare    0.00  512.3292  33.281086  51.741500  1309.0
object describe:
Cabin          : 187 
 [nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33'
 'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49'
 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77'
 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106'
 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91'
 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34'
 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90'
 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6'
 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50'
 'B42' 'C148' 'B45' 'B36' 'A21' 'D34' 'A9' 'C31' 'B61' 'C53' 'D43' 'C130'
 'C132' 'C55 C57' 'C116' 'F' 'A29' 'C6' 'C28' 'C51' 'C97' 'D22' 'B10'
 'E45' 'E52' 'A11' 'B11' 'C80' 'C89' 'F E46' 'B26' 'F E57' 'A18' 'E60'
 'E39 E41' 'B52 B54 B56' 'C39' 'B24' 'D40' 'D38' 'C105']
Embarked       : 4 
 ['S' 'C' 'Q' 'None']
Name           : 1307 
 ['Braund, Mr. Owen Harris'
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'Heikkinen, Miss. Laina' ... 'Saether, Mr. Simon Sivertsen'
 'Ware, Mr. Frederick' 'Peter, Master. Michael J']
Ticket         : 929 
 ['A/5 21171' 'PC 17599' 'STON/O2. 3101282' '113803' '373450' '330877'
 '17463' '349909' '347742' '237736' 'PP 9549' '113783' 'A/5. 2151'
    ...
 'A.5. 3236' 'SOTON/O.Q. 3101262' '359309']
As the above shows, Fare is distributed quite unevenly, so it is processed with a log-scale transform.
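Since the minimum Fare is 0.00, a plain logarithm would produce negative infinity for those rows. A small sketch, assuming `np.log1p` (that is, log(1 + x)) is an acceptable stand-in for the transform; the sample values are illustrative only:

```python
import numpy as np
import pandas as pd

# Illustrative fares, including the zero and the maximum seen in the data.
fare = pd.Series([0.0, 7.25, 8.05, 26.0, 512.3292], name="Fare")

# log1p keeps zero fares at zero and compresses the long right tail.
fare_log = np.log1p(fare)

print("skew before:", fare.skew())
print("skew after: ", fare_log.skew())
```

Comparing the skewness before and after is a quick way to check that the transform actually evened out the distribution.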
Apply dummy (one-hot) encoding to the Embarked attribute.
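With pandas this is typically done through `pd.get_dummies`. A minimal sketch on a stand-in frame; the explicit `dtype="uint8"` is an assumption made so that the result matches the uint8 columns listed in the later summary, since recent pandas versions default to bool:

```python
import pandas as pd

# Stand-in frame with the four categories seen after filling missing values.
demo = pd.DataFrame(
    {"PassengerId": [1, 2, 3, 4], "Embarked": ["S", "C", "Q", "None"]}
)

# Replace Embarked with one indicator column per category.
encoded = pd.get_dummies(demo, columns=["Embarked"], prefix="Embarked", dtype="uint8")
print(encoded.columns.tolist())
```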
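The remaining Ticket attribute looks fairly complex, so nothing is extracted from it for now.
Next, take another look at the current info of data_all.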
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 23 columns):
Age                1309 non-null float64
Cabin              295 non-null object
Fare               1309 non-null float64
Name               1309 non-null object
Parch              1309 non-null int64
PassengerId        1309 non-null int64
Pclass             1309 non-null int64
Sex                1309 non-null int64
SibSp              1309 non-null int64
Survived           891 non-null float64
Ticket             1309 non-null object
Title              1309 non-null int64
Family_size        1309 non-null int64
Last_Name          1309 non-null object
Family_Survival    1309 non-null float64
FareBin            1309 non-null category
FareBin_Code       1309 non-null int64
AgeBin             1309 non-null category
AgeBin_Code        1309 non-null int64
Embarked_C         1309 non-null uint8
Embarked_None      1309 non-null uint8
Embarked_Q         1309 non-null uint8
Embarked_S         1309 non-null uint8
dtypes: category(2), float64(4), int64(9), object(4), uint8(4)
memory usage: 181.8+ KB
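The summary lists several derived columns (Title, Family_size, Last_Name, FareBin, FareBin_Code and others). The sketch below shows one plausible way to build a few of them and is only a guess at the approach: the regular expression, the "+ 1" in the family size, the number of bins, and the helper name `add_basic_features` are all assumptions. Family_Survival and AgeBin are left out.

```python
import pandas as pd


def add_basic_features(df: pd.DataFrame, n_bins: int = 4) -> pd.DataFrame:
    out = df.copy()
    # "Braund, Mr. Owen Harris" -> Last_Name "Braund", title "Mr"
    out["Last_Name"] = out["Name"].str.split(",").str[0].str.strip()
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Assumption: titles are stored as integer codes.
    out["Title"] = title.astype("category").cat.codes.astype("int64")
    # Assumption: siblings/spouses + parents/children + the passenger.
    out["Family_size"] = out["SibSp"] + out["Parch"] + 1
    # Quantile bins give roughly equal-sized fare groups.
    out["FareBin"] = pd.qcut(out["Fare"], n_bins)
    out["FareBin_Code"] = out["FareBin"].cat.codes.astype("int64")
    return out


# Stand-in rows using names that appear in the data.
demo = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
            "Peter, Master. Michael J",
        ],
        "SibSp": [1, 1, 0, 1],
        "Parch": [0, 0, 0, 1],
        "Fare": [7.25, 71.28, 7.925, 22.36],
    }
)
features = add_basic_features(demo)
print(features[["Last_Name", "Title", "Family_size", "FareBin_Code"]])
```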

Scaling and model selection
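The tuning logs in this section ("Fitting 5 folds for each of N candidates") have the shape of scikit-learn's `GridSearchCV` output with 5-fold cross-validation. A minimal sketch of scaling plus grid search on synthetic data follows. It swaps in a `Pipeline` so the scaler is refit inside each fold, which is why the parameter names carry a `clf__` prefix that the logged parameters do not have; the grid values are illustrative, not the ones used for the results shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=0)),
    ]
)
param_grid = {
    "clf__n_estimators": [20, 50],
    "clf__min_samples_leaf": [1, 5],
}

# cv=5 matches the five folds in the logs; n_jobs and verbose can be raised.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=1)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The same pattern applies to each model below: only the estimator and its parameter grid change.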
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    3.0s finished
AdaBoost:
{'algorithm': 'SAMME', 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'random', 'learning_rate': 0.0001, 'n_estimators': 100}
0.8294051627384961
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:    8.1s finished
Extra Tree:
{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 500}
0.8462401795735129
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.5s finished
xgboost:
{'booster': 'gbtree', 'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 500}
0.8518518518518519
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Logistic Regression:
{'C': 10, 'penalty': 'l1', 'tol': 0.0001}
0.8361391694725028
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.3s finished
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    3.1s finished
Random Forest:
{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 6, 'min_samples_split': 5, 'n_estimators': 150}
0.8552188552188552
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.5s finished
Gradient Boost:
{'learning_rate': 0.01, 'loss': 'deviance', 'max_depth': 4, 'max_features': 'auto', 'min_samples_leaf': 100, 'n_estimators': 200}
0.8484848484848485
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.5s finished
SVC
{'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
0.8305274971941639
voting: [0.86592179 0.84357542 0.85393258 0.81460674 0.84745763]
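The five numbers above look like per-fold accuracies from 5-fold cross-validation of a voting ensemble. A minimal sketch with scikit-learn's `VotingClassifier` on synthetic data; the choice of three base models, their settings, and soft voting are assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        # probability=True is required for soft voting with SVC.
        ("svc", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
)

scores = cross_val_score(voting, X, y, cv=5, scoring="accuracy")
print("voting:", scores)
```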
stack_score: [0.83240223 0.81564246 0.85393258 0.82022472 0.82485876]
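Stacking trains a final model on the out-of-fold predictions of the base models. One way to sketch it is with scikit-learn's `StackingClassifier` (available from version 0.22); whether the scores above came from this class, another library, or hand-written code is not stated, so treat this as an assumption, as are the base models chosen:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(random_state=0)),
    ],
    # The meta-learner sees only the base models' out-of-fold predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

stack_score = cross_val_score(stack, X, y, cv=5, scoring="accuracy")
print("stack_score:", stack_score)
```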
(418,)
(418,)
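The two shapes above match the 418 test passengers: one array of ids and one of predictions. A minimal sketch of turning such a pair into a submission table with the PassengerId and Survived columns that the competition expects; `make_submission` is a hypothetical helper and the values are placeholders:

```python
import numpy as np
import pandas as pd


def make_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Pair each test PassengerId with an integer 0/1 prediction."""
    ids = np.asarray(passenger_ids)
    preds = np.asarray(predictions)
    if ids.shape != preds.shape:
        raise ValueError("ids and predictions must have the same shape")
    return pd.DataFrame({"PassengerId": ids, "Survived": preds.astype(int)})


# Placeholder values; the real arrays would each have shape (418,).
submission = make_submission([892, 893, 894], [0.0, 1.0, 0.0])

# Passing a file path to to_csv instead would write the file to upload.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Casting to int matters because Survived became float64 once train and test were combined.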
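The final submission, using the voting method, scored 0.81339. There is still some room for improvement; further hyperparameter tuning in particular should bring some gain.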
References
1. yassineghouzam’s kernel
2. konstantinmasich’s kernel
3. pandas documentation
4. seaborn tutorial
5. notebook