This post records my first attempt at the Kaggle getting-started competition on Titanic survival prediction. Through the competition I mainly became familiar with the pandas and sklearn packages and gained a basic understanding of common classification problems. The data analysis process is documented below.
Data Analysis
(891, 12)
(418, 11)
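These shapes come from reading the competition's train and test files; a minimal sketch of the loading step (file paths are assumed):

import pandas as pd

# Assumed file locations; adjust to wherever the Kaggle CSVs live.
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
print(data_train.shape)   # (891, 12)
print(data_test.shape)    # (418, 11) -- the test set has no Survived column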
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
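The 1309-row summary above is the concatenation of the train and test sets, so that preprocessing is applied to both consistently. A sketch of how data_all was presumably built (sort=True explains the alphabetical column order):

# Stack train and test; test rows simply have Survived == NaN.
data_all = pd.concat([data_train, data_test], ignore_index=True, sort=True)
data_all.info()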
ALL: 1309
------------------------------
Cabin 1014
Age 263
Embarked 2
Fare 1
dtype: int64
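A short sketch of how the missing-value counts above can be reproduced:

missing = data_all.isnull().sum().drop('Survived')   # Survived is only missing for test rows
print('ALL:', len(data_all))
print('-' * 30)
print(missing[missing > 0].sort_values(ascending=False))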
Filling Missing Values
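The exact filling code is not shown, but judging from the summaries that follow (Age and Fare become fully populated, Embarked gains a 'None' level, Cabin stays untouched, and Sex shows up as a 0/1 numeric column), it was probably something along these lines:

# Age: the precise rule is unknown; a plain or grouped median is a common choice.
data_all['Age'] = data_all['Age'].fillna(data_all['Age'].median())

# Fare: a single missing value, filled with the median fare.
data_all['Fare'] = data_all['Fare'].fillna(data_all['Fare'].median())

# Embarked: the two missing values become their own 'None' level
# (which is why an Embarked_None dummy column appears later).
data_all['Embarked'] = data_all['Embarked'].fillna('None')

# Sex: map to 0/1 so it appears in the numeric summary below.
data_all['Sex'] = data_all['Sex'].map({'male': 0, 'female': 1})

# Cabin is left as-is for now (still 295 non-null afterwards).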
number describe:
min max mean std count
Sex 0.00 1.0000 0.355997 0.478997 1309.0
Pclass 1.00 3.0000 2.294882 0.837836 1309.0
SibSp 0.00 8.0000 0.498854 1.041658 1309.0
Parch 0.00 9.0000 0.385027 0.865560 1309.0
Age 0.17 80.0000 29.876751 13.447012 1309.0
Fare 0.00 512.3292 33.281086 51.741500 1309.0
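This numeric summary can be produced with describe() on the numeric columns, e.g.:

num_cols = ['Sex', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']
print('number describe:')
print(data_all[num_cols].describe().T[['min', 'max', 'mean', 'std', 'count']])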
object describe:
Cabin : 187
[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33'
'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49'
'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77'
'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106'
'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91'
'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34'
'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90'
'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6'
'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50'
'B42' 'C148' 'B45' 'B36' 'A21' 'D34' 'A9' 'C31' 'B61' 'C53' 'D43' 'C130'
'C132' 'C55 C57' 'C116' 'F' 'A29' 'C6' 'C28' 'C51' 'C97' 'D22' 'B10'
'E45' 'E52' 'A11' 'B11' 'C80' 'C89' 'F E46' 'B26' 'F E57' 'A18' 'E60'
'E39 E41' 'B52 B54 B56' 'C39' 'B24' 'D40' 'D38' 'C105']
Embarked : 4
['S' 'C' 'Q' 'None']
Name : 1307
['Braund, Mr. Owen Harris'
'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
'Heikkinen, Miss. Laina' ... 'Saether, Mr. Simon Sivertsen'
'Ware, Mr. Frederick' 'Peter, Master. Michael J']
Ticket : 929
['A/5 21171' 'PC 17599' 'STON/O2. 3101282' '113803' '373450' '330877'
'17463' '349909' '347742' '237736' 'PP 9549' '113783' 'A/5. 2151'
...
'A.5. 3236' 'SOTON/O.Q. 3101262' '359309']
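The object-column summary above (unique count plus the values themselves, NaN included) can be generated with a loop such as:

print('object describe:')
for col in data_all.select_dtypes(include='object').columns:
    values = data_all[col].unique()
    print(col, ':', len(values))
    print(values)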
Fare turns out to be quite unevenly distributed (heavily right-skewed), so it is transformed to a log scale.
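A minimal sketch of the transformation, assuming np.log1p (which also handles the zero fares):

import numpy as np

# Compress the long right tail of Fare; log1p keeps Fare == 0 well defined.
data_all['Fare'] = np.log1p(data_all['Fare'])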
Apply dummy (one-hot) encoding to the Embarked attribute.
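A sketch with pd.get_dummies; the filled 'None' level becomes its own indicator column:

data_all = pd.get_dummies(data_all, columns=['Embarked'], prefix='Embarked')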
The remaining Ticket attribute looks fairly complex, so no features are extracted from it for now.
Next, let's look at the current state of data_all.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 23 columns):
Age 1309 non-null float64
Cabin 295 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null int64
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
Title 1309 non-null int64
Family_size 1309 non-null int64
Last_Name 1309 non-null object
Family_Survival 1309 non-null float64
FareBin 1309 non-null category
FareBin_Code 1309 non-null int64
AgeBin 1309 non-null category
AgeBin_Code 1309 non-null int64
Embarked_C 1309 non-null uint8
Embarked_None 1309 non-null uint8
Embarked_Q 1309 non-null uint8
Embarked_S 1309 non-null uint8
dtypes: category(2), float64(4), int64(9), object(4), uint8(4)
memory usage: 181.8+ KB
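The code that created the new columns is not shown above. Below is a rough sketch of how features like Title, Family_size, FareBin_Code and AgeBin_Code are typically built; the concrete title mapping and bin counts are assumptions, and Family_Survival (the grouping idea from konstantinmasich's kernel in the references) is omitted for brevity:

# Title: extract the honorific from the name and encode it as an integer.
data_all['Title'] = data_all['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
title_counts = data_all['Title'].value_counts()
data_all['Title'] = data_all['Title'].replace(title_counts[title_counts < 10].index, 'Rare')
data_all['Title'] = data_all['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})

# Family size and last name (the latter feeds the Family_Survival feature).
data_all['Family_size'] = data_all['SibSp'] + data_all['Parch'] + 1
data_all['Last_Name'] = data_all['Name'].str.split(',').str[0]

# Quantile bins for Fare and Age, plus their integer codes.
data_all['FareBin'] = pd.qcut(data_all['Fare'], 5)
data_all['FareBin_Code'] = data_all['FareBin'].cat.codes
data_all['AgeBin'] = pd.qcut(data_all['Age'], 4)
data_all['AgeBin_Code'] = data_all['AgeBin'].cat.codes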
Scaling and Model Selection
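A sketch of the scaling step and of splitting data_all back into train and test; the feature list is an assumption:

from sklearn.preprocessing import StandardScaler

# Assumed subset of the engineered columns fed to the models.
features = ['Pclass', 'Sex', 'Title', 'Family_size', 'Family_Survival',
            'FareBin_Code', 'AgeBin_Code']

train_mask = data_all['Survived'].notnull()
X_train = data_all.loc[train_mask, features]
y_train = data_all.loc[train_mask, 'Survived']
X_test = data_all.loc[~train_mask, features]

# Standardize to zero mean / unit variance before model selection.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)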
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 3.0s finished
AdaBoost:
{'algorithm': 'SAMME', 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'random', 'learning_rate': 0.0001, 'n_estimators': 100}
0.8294051627384961
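Each result in this section comes from a 5-fold GridSearchCV run; the AdaBoost one above, for instance, looks roughly like this (the other models follow the same pattern; base_estimator assumes an older scikit-learn release, as in the original logs):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

ada_param_grid = {
    'base_estimator__criterion': ['entropy'],
    'base_estimator__splitter': ['random'],
    'algorithm': ['SAMME'],
    'n_estimators': [100],
    'learning_rate': [0.0001],
}
ada_grid = GridSearchCV(AdaBoostClassifier(base_estimator=DecisionTreeClassifier()),
                        param_grid=ada_param_grid, cv=5, n_jobs=4, verbose=1)
ada_grid.fit(X_train, y_train)
print('AdaBoost:')
print(ada_grid.best_params_)
print(ada_grid.best_score_)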
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 8.1s finished
Extra Tree:
{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 500}
0.8462401795735129
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 2.5s finished
xgboost:
{'booster': 'gbtree', 'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 500}
0.8518518518518519
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 2.3s finished
Logistic Regression:
{'C': 10, 'penalty': 'l1', 'tol': 0.0001}
0.8361391694725028
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 3.1s finished
Random Forest:
{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 6, 'min_samples_split': 5, 'n_estimators': 150}
0.8552188552188552
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 2.5s finished
Gradient Boost:
{'learning_rate': 0.01, 'loss': 'deviance', 'max_depth': 4, 'max_features': 'auto', 'min_samples_leaf': 100, 'n_estimators': 200}
0.8484848484848485
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 2.5s finished
SVC:
{'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
0.8305274971941639
voting: [0.86592179 0.84357542 0.85393258 0.81460674 0.84745763]
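The voting scores above are 5-fold cross-validation scores of a voting ensemble built from the tuned models; a sketch (the estimator list and the grid-object names such as rf_grid are assumptions):

from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

voting = VotingClassifier(
    estimators=[('rf', rf_grid.best_estimator_),
                ('xgb', xgb_grid.best_estimator_),
                ('svc', svc_grid.best_estimator_)],
    voting='hard')
print('voting:', cross_val_score(voting, X_train, y_train, cv=5))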
stack_score: [0.83240223 0.81564246 0.85393258 0.82022472 0.82485876]
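The stacking scores were obtained the same way. One possible setup uses scikit-learn's StackingClassifier (an assumption; the original may have used a hand-rolled stack or another library):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

stack = StackingClassifier(
    estimators=[('rf', rf_grid.best_estimator_),
                ('xgb', xgb_grid.best_estimator_),
                ('svc', svc_grid.best_estimator_)],
    final_estimator=LogisticRegression(),
    cv=5)
print('stack_score:', cross_val_score(stack, X_train, y_train, cv=5))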
(418,)
(418,)
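Finally the voting model is fit on the full training set and used to predict the test set, and the predictions are written out in Kaggle's submission format (the file name is assumed):

voting.fit(X_train, y_train)
predictions = voting.predict(X_test).astype(int)
print(predictions.shape)                # (418,)
print(data_test['PassengerId'].shape)   # (418,)

submission = pd.DataFrame({'PassengerId': data_test['PassengerId'],
                           'Survived': predictions})
submission.to_csv('submission.csv', index=False)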
The submitted predictions from the voting ensemble scored 0.81339 on the leaderboard. There is still room for improvement; in particular, further hyperparameter tuning should bring additional gains.
References
1. yassineghouzam's kernel
2. konstantinmasich's kernel
3. pandas documentation
4. seaborn tutorial
5. notebook