{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 数据编码的十种方式\n",
"\n",
"\n",
"```{admonition} 在线刷题\n",
":class: seealso\n",
"\n",
"检查 or 强化 `Pandas` 数据分析操作?👉在线体验「Pandas进阶修炼300题」\n",
"```\n",
"\n",
"```{note} \n",
"本页面代码可以[在线编辑、执行](../指引/在线执行.md)!\n",
"```\n",
"\n",
"在使用`Python`进行机器学习时,很多算法都需要我们对分类特征进行转换(编码),即根据某一列的值,新增(修改)一列。\n",
"\n",
"这个操作在`pandas`中也有多种解决方案,本文就将介绍十种方法,代码拿走就用,希望你在遇到不同类型的数据时,可以灵活使用。\n",
" \n",
"下面先创建用于示例的数据👇\n"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {
"tags": [
"hide-input",
"thebe-init"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score\n",
"0 Male Python 95\n",
"1 Female Java 85\n",
"2 Male C 75\n",
"3 Male Sql 65\n",
"4 Male Linux 55\n",
"5 Female Python 95\n",
"6 Male Python 75\n",
"7 Male Java 65\n",
"8 Female C 55\n",
"9 Female Php 85"
]
},
"execution_count": 152,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.DataFrame({\n",
" \"Sex\": pd.Series(['Male','Female','Male','Male','Male','Female','Male','Male','Female','Female']),\n",
" \"Course Name\": pd.Series(['Python','Java','C','Sql','Linux','Python','Python','Java','C','Php']),\n",
" \"Score\":[95,85,75,65,55,95,75,65,55,85]})\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数值型数据编码"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1 - 使用自定义函数 + 循环遍历\n",
"\n",
"首先然是最简单,最笨的方法,自己写一个函数来转换数据,并用循环遍历,肯定就是一个`def`加一个`for`"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Score_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" A | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" C | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" D | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" E | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" A | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" C | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" D | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" E | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" B | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Score_Label\n",
"0 Male Python 95 A\n",
"1 Female Java 85 B\n",
"2 Male C 75 C\n",
"3 Male Sql 65 D\n",
"4 Male Linux 55 E\n",
"5 Female Python 95 A\n",
"6 Male Python 75 C\n",
"7 Male Java 65 D\n",
"8 Female C 55 E\n",
"9 Female Php 85 B"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = df.copy()\n",
"\n",
"def myfun(x):\n",
" if x>90:\n",
" return 'A'\n",
" elif x>=80 and x<90:\n",
" return 'B'\n",
" elif x>=70 and x<80:\n",
" return 'C'\n",
" elif x>=60 and x<70:\n",
" return 'D'\n",
" else:\n",
" return 'E'\n",
" \n",
"df1['Score_Label'] = None\n",
"for i in range(len(df1)):\n",
" df1.iloc[i,3] = myfun(df1.iloc[i,2])\n",
" \n",
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这段代码,相信所有人都能看懂,简单好想但比较麻烦\n",
"\n",
"有没有更简单的办法呢?`pandas`当然提供了很多高效的操作的函数,继续往下看。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2 - 使用 map + 自定义函数\n",
"\n",
"现在,可以使用`map`来干掉循环(虽然本质上也是循环)"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Score_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" A | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" C | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" D | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" E | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" A | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" C | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" D | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" E | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" B | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Score_Label\n",
"0 Male Python 95 A\n",
"1 Female Java 85 B\n",
"2 Male C 75 C\n",
"3 Male Sql 65 D\n",
"4 Male Linux 55 E\n",
"5 Female Python 95 A\n",
"6 Male Python 75 C\n",
"7 Male Java 65 D\n",
"8 Female C 55 E\n",
"9 Female Php 85 B"
]
},
"execution_count": 154,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.copy()\n",
"\n",
"def mapfun(x):\n",
" if x>90:\n",
" return 'A'\n",
" elif x>=80 and x<90:\n",
" return 'B'\n",
" elif x>=70 and x<80:\n",
" return 'C'\n",
" elif x>=60 and x<70:\n",
" return 'D'\n",
" else:\n",
" return 'E'\n",
"\n",
"df2['Score_Label'] = df2['Score'].map(mapfun)\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3 - 使用 apply + 匿名函数\n",
"\n",
"如果还想简洁代码,可以使用`自定义函数 + apply`来干掉自定义函数(结果和上面是一致的,只不过这么写容易被打。)"
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Score_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" A | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" C | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" D | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" E | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" A | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" C | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" D | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" E | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" B | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Score_Label\n",
"0 Male Python 95 A\n",
"1 Female Java 85 B\n",
"2 Male C 75 C\n",
"3 Male Sql 65 D\n",
"4 Male Linux 55 E\n",
"5 Female Python 95 A\n",
"6 Male Python 75 C\n",
"7 Male Java 65 D\n",
"8 Female C 55 E\n",
"9 Female Php 85 B"
]
},
"execution_count": 184,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = df.copy()\n",
"df3['Score_Label'] = df3['Score'].apply(lambda x: 'A' if x > 90 else (\n",
" 'B' if 90 > x >= 80 else ('C' if 80 > x >= 70 else ('D' if 70 > x >= 60 else 'E'))))\n",
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4 - 使用cut\n",
"\n",
"现在,让我们继续了解更高级的`pandas`函数,依旧是对 `Score` 进行编码,使用`pd.cut`,并指定划分的区间后,可以直接帮你分好组"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Score_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" (80, 100] | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" (80, 100] | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" (70, 80] | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" (59, 70] | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" (0, 59] | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" (80, 100] | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" (70, 80] | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" (59, 70] | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" (0, 59] | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" (80, 100] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Score_Label\n",
"0 Male Python 95 (80, 100]\n",
"1 Female Java 85 (80, 100]\n",
"2 Male C 75 (70, 80]\n",
"3 Male Sql 65 (59, 70]\n",
"4 Male Linux 55 (0, 59]\n",
"5 Female Python 95 (80, 100]\n",
"6 Male Python 75 (70, 80]\n",
"7 Male Java 65 (59, 70]\n",
"8 Female C 55 (0, 59]\n",
"9 Female Php 85 (80, 100]"
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4 = df.copy()\n",
"bins = [0, 59, 70, 80, 100]\n",
"df4['Score_Label'] = pd.cut(df4['Score'], bins)\n",
"df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也可以直接使用`labels`参数来修改对应组的名称,是不是方便多了"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Score_Label | \n",
" Score_Label_new | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" (80, 100] | \n",
" perfect | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" (80, 100] | \n",
" perfect | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" (70, 80] | \n",
" good | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" (59, 70] | \n",
" middle | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" (0, 59] | \n",
" low | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" (80, 100] | \n",
" perfect | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" (70, 80] | \n",
" good | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" (59, 70] | \n",
" middle | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" (0, 59] | \n",
" low | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" (80, 100] | \n",
" perfect | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Score_Label Score_Label_new\n",
"0 Male Python 95 (80, 100] perfect\n",
"1 Female Java 85 (80, 100] perfect\n",
"2 Male C 75 (70, 80] good\n",
"3 Male Sql 65 (59, 70] middle\n",
"4 Male Linux 55 (0, 59] low\n",
"5 Female Python 95 (80, 100] perfect\n",
"6 Male Python 75 (70, 80] good\n",
"7 Male Java 65 (59, 70] middle\n",
"8 Female C 55 (0, 59] low\n",
"9 Female Php 85 (80, 100] perfect"
]
},
"execution_count": 157,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4['Score_Label_new'] = pd.cut(df4['Score'], bins, labels=[\n",
" 'low', 'middle', 'good', 'perfect'])\n",
"df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5 - 使用 sklearn 二值化\n",
"\n",
"既然是和机器学习相关,`sklearn`肯定跑不掉,如果需要新增一列并判定成绩是否及格,就可以使用`Binarizer`函数,代码也是简洁好懂"
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Score_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 1 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 1 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Score_Label\n",
"0 Male Python 95 1\n",
"1 Female Java 85 1\n",
"2 Male C 75 1\n",
"3 Male Sql 65 1\n",
"4 Male Linux 55 0\n",
"5 Female Python 95 1\n",
"6 Male Python 75 1\n",
"7 Male Java 65 1\n",
"8 Female C 55 0\n",
"9 Female Php 85 1"
]
},
"execution_count": 158,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import Binarizer\n",
"\n",
"df5 = df.copy()\n",
"binerize = Binarizer(threshold = 60)\n",
"trans = binerize.fit_transform(np.array(df1['Score']).reshape(-1,1))\n",
"df5['Score_Label'] = trans\n",
"\n",
"df5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 文本型数据编码\n",
"\n",
"下面介绍更常见的,对文本数据进行转换打标签。例如新增一列,将性别男、女分别标记为0、1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6 - 使用 replace\n",
"\n",
"首先介绍`replace`,但要注意的是,上面说过的自定义函数相关方法依旧是可行的"
]
},
{
"cell_type": "code",
"execution_count": 160,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Sex_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 1 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 0 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 1 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Sex_Label\n",
"0 Male Python 95 0\n",
"1 Female Java 85 1\n",
"2 Male C 75 0\n",
"3 Male Sql 65 0\n",
"4 Male Linux 55 0\n",
"5 Female Python 95 1\n",
"6 Male Python 75 0\n",
"7 Male Java 65 0\n",
"8 Female C 55 1\n",
"9 Female Php 85 1"
]
},
"execution_count": 160,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6 = df.copy()\n",
"df6['Sex_Label'] = df6['Sex'].replace(['Male','Female'],[0,1])\n",
"df6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上面是对性别操作,因为只有男女,所以可以手动指定0、1,但要是类别很多,也可以使用`pd.value_counts()`来自动指定标签,例如对`Course Name`列分组"
]
},
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 2 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Course Name_Label\n",
"0 Male Python 95 0\n",
"1 Female Java 85 1\n",
"2 Male C 75 2\n",
"3 Male Sql 65 5\n",
"4 Male Linux 55 3\n",
"5 Female Python 95 0\n",
"6 Male Python 75 0\n",
"7 Male Java 65 1\n",
"8 Female C 55 2\n",
"9 Female Php 85 4"
]
},
"execution_count": 161,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6 = df.copy()\n",
"value = df6['Course Name'].value_counts()\n",
"value_map = dict((v, i) for i,v in enumerate(value.index))\n",
"df6['Course Name_Label'] = df6.replace({'Course Name':value_map})['Course Name']\n",
"df6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7 - 使用map\n",
"\n",
"额外强调的是,新增一列,一定要能够想到`map`"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 3 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 1 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 5 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 5 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 2 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Course Name_Label\n",
"0 Male Python 95 5\n",
"1 Female Java 85 2\n",
"2 Male C 75 0\n",
"3 Male Sql 65 3\n",
"4 Male Linux 55 1\n",
"5 Female Python 95 5\n",
"6 Male Python 75 5\n",
"7 Male Java 65 2\n",
"8 Female C 55 0\n",
"9 Female Php 85 4"
]
},
"execution_count": 162,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7 = df.copy()\n",
"Map = {elem:index for index,elem in enumerate(set(df[\"Course Name\"]))}\n",
"df7['Course Name_Label'] = df7['Course Name'].map(Map)\n",
"df7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 8 - 使用astype\n",
"\n",
"这个方法应该很多人不知道,这就属于上面提到的知乎问题,能实现的方法太多了"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 2 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 4 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 4 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Course Name_Label\n",
"0 Male Python 95 4\n",
"1 Female Java 85 1\n",
"2 Male C 75 0\n",
"3 Male Sql 65 5\n",
"4 Male Linux 55 2\n",
"5 Female Python 95 4\n",
"6 Male Python 75 4\n",
"7 Male Java 65 1\n",
"8 Female C 55 0\n",
"9 Female Php 85 3"
]
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8 = df.copy()\n",
"value = df8['Course Name'].astype('category')\n",
"df8['Course Name_Label'] = value.cat.codes\n",
"df8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9 - 使用 sklearn\n",
"\n",
"同数值型一样,这种机器学习中的经典操作,`sklearn`一定有办法,使用`LabelEncoder`可以对分类数据进行编码"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Sex_Label | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
" 4 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Sex_Label Course Name_Label\n",
"0 Male Python 95 1 4\n",
"1 Female Java 85 0 1\n",
"2 Male C 75 1 0\n",
"3 Male Sql 65 1 5\n",
"4 Male Linux 55 1 2\n",
"5 Female Python 95 0 4\n",
"6 Male Python 75 1 4\n",
"7 Male Java 65 1 1\n",
"8 Female C 55 0 0\n",
"9 Female Php 85 0 3"
]
},
"execution_count": 164,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"df9 = df.copy()\n",
"le = LabelEncoder()\n",
"le.fit(df9['Sex'])\n",
"df9['Sex_Label'] = le.transform(df9['Sex'])\n",
"le.fit(df9['Course Name'])\n",
"df9['Course Name_Label'] = le.transform(df9['Course Name'])\n",
"df9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一次性转换两列也是可以的"
]
},
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Sex_Label | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 1.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 1.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 0.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 1.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 0.0 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Sex_Label Course Name_Label\n",
"0 Male Python 95 1.0 4.0\n",
"1 Female Java 85 0.0 1.0\n",
"2 Male C 75 1.0 0.0\n",
"3 Male Sql 65 1.0 5.0\n",
"4 Male Linux 55 1.0 2.0\n",
"5 Female Python 95 0.0 4.0\n",
"6 Male Python 75 1.0 4.0\n",
"7 Male Java 65 1.0 1.0\n",
"8 Female C 55 0.0 0.0\n",
"9 Female Php 85 0.0 3.0"
]
},
"execution_count": 165,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"df9 = df.copy()\n",
"le = OrdinalEncoder()\n",
"le.fit(df9[['Sex','Course Name']])\n",
"df9[['Sex_Label','Course Name_Label']] = le.transform(df9[['Sex','Course Name']])\n",
"\n",
"df9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 10 - 使用factorize\n",
"\n",
"最后,再介绍一个小众但好用的`pandas`方法,我们需要注意到,在上面的方法中,自动生成的`Course Name_Label`列,虽然一个数据对应一个语言,因为避免写自定义函数或者字典,这样可以自动生成,所以大多是无序的。\n",
"\n",
"如果我们希望它是有序的,也就是 `Python` 对应 `0`,`Java`对应`1`,除了自己指定,还有什么优雅的办法?这时可以使用`factorize`,它会根据出现顺序进行编码"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 3 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 4 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 2 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Course Name_Label\n",
"0 Male Python 95 0\n",
"1 Female Java 85 1\n",
"2 Male C 75 2\n",
"3 Male Sql 65 3\n",
"4 Male Linux 55 4\n",
"5 Female Python 95 0\n",
"6 Male Python 75 0\n",
"7 Male Java 65 1\n",
"8 Female C 55 2\n",
"9 Female Php 85 5"
]
},
"execution_count": 166,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df10 = df.copy()\n",
"df10['Course Name_Label'] = pd.factorize(df10['Course Name'])[0]\n",
"df10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"结合匿名函数,我们可以做到对多列进行有序编码转换"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Course Name | \n",
" Score | \n",
" Sex_Label | \n",
" Course Name_Label | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Male | \n",
" Python | \n",
" 95 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" Female | \n",
" Java | \n",
" 85 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Male | \n",
" C | \n",
" 75 | \n",
" 0 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" Male | \n",
" Sql | \n",
" 65 | \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
" 4 | \n",
" Male | \n",
" Linux | \n",
" 55 | \n",
" 0 | \n",
" 4 | \n",
"
\n",
" \n",
" 5 | \n",
" Female | \n",
" Python | \n",
" 95 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" Male | \n",
" Python | \n",
" 75 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" Male | \n",
" Java | \n",
" 65 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" Female | \n",
" C | \n",
" 55 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 9 | \n",
" Female | \n",
" Php | \n",
" 85 | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Course Name Score Sex_Label Course Name_Label\n",
"0 Male Python 95 0 0\n",
"1 Female Java 85 1 1\n",
"2 Male C 75 0 2\n",
"3 Male Sql 65 0 3\n",
"4 Male Linux 55 0 4\n",
"5 Female Python 95 1 0\n",
"6 Male Python 75 0 0\n",
"7 Male Java 65 0 1\n",
"8 Female C 55 1 2\n",
"9 Female Php 85 1 5"
]
},
"execution_count": 168,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df10 = df.copy()\n",
"cat_columns = df10.select_dtypes(['object']).columns\n",
"\n",
"df10[['Sex_Label', 'Course Name_Label']] = df10[cat_columns].apply(\n",
" lambda x: pd.factorize(x)[0])\n",
"df10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} 注意\n",
":class: attention\n",
"\n",
"你无需完全记住所有方法与细节,**只需知道有这么个函数能完成这样操作,需要用时能想到,想到再来查就行**。\n",
"```"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "384px"
},
"toc_section_display": true,
"toc_window_display": true
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}