{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# join - 横向连接\n", "\n", "`DataFrame.join`是一种将两个可能具有不同索引的 `DataFrame` 的列组合成单个 `DataFrame` 的便捷方法。\n", "\n", "```{admonition} 在线刷题\n", ":class: seealso\n", "\n", "检查 or 强化 `Pandas` 数据分析操作?👉在线体验「Pandas进阶修炼300题」\n", "```\n", "\n", "```{note} \n", "本页面代码可以[在线编辑、执行](../指引/在线执行.md)!\n", "```\n", "\n", "\n", "`join` 本质上和 [merge](merge.ipynb) 类似,或者说是 `merge` 的特殊情况,所以也是一种 `SQL` 风格的合并方法。\n", "\n", "但和 `merge` 不一样的地方在于,`join` **只能按照行索引去合并数据,因此我翻译成 「横向连接」**。\n", "\n", "有关 `join` 的基本用法可以通过下图大致了解,至于 `how` 参数中的 `inner、outer` 可以参考 [merge](merge.ipynb) 中的图解。\n", "\n", "\n", "```{figure} https://pic.liuzaoqi.com/picgo/202112242051282.png\n", ":width: 100%\n", ":align: center\n", "```\n", "\n", "需要注意的是 `df2` 与 `df3` 的连接,如果有重复的列名,需要指定合并后左右的列名后缀,否则会报错。\n", "\n", "下面是官方文档中的案例,你可以修改相关代码来验证自己的想法!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 默认方法\n", "\n", "组合 left 和 right,并按照 left 的索引进行对齐" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "hide-input", "thebe-init" ] }, "outputs": [], "source": [ "import pandas as pd\n", "left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],\n", " 'B': ['B0', 'B1', 'B2']},\n", " index=['K0', 'K1', 'K2'])\n", "\n", "\n", "right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],\n", " 'D': ['D0', 'D2', 'D3']},\n", " index=['K0', 'K2', 'K3'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://pandas.pydata.org/pandas-docs/version/0.20/_images/merging_join.png)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
\n", "
" ], "text/plain": [ " A B C D\n", "K0 A0 B0 C0 D0\n", "K1 A1 B1 NaN NaN\n", "K2 A2 B2 C2 D2" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 取并集\n", "\n", "按下图所示进行组合\n", "\n", "![](https://pandas.pydata.org/pandas-docs/version/0.20/_images/merging_join_outer.png)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
K3NaNNaNC3D3
\n", "
" ], "text/plain": [ " A B C D\n", "K0 A0 B0 C0 D0\n", "K1 A1 B1 NaN NaN\n", "K2 A2 B2 C2 D2\n", "K3 NaN NaN C3 D3" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left.join(right, how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 取交集\n", "\n", "按下图所示进行组合\n", "\n", "![](https://pandas.pydata.org/pandas-docs/version/0.20/_images/merging_join_inner.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "left.join(right, how='inner')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} 思考\n", ":class: hint\n", "\n", "用 `merge` 如何实现?\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "pd.merge(left, right, left_index=True, right_index=True, how='inner')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 按索引组合\n", "\n", "重新产生数据并按下图所示进行连接(根据 `key`)\n", "\n", "![](https://pandas.pydata.org/pandas-docs/version/0.20/_images/merging_join_key_columns.png)\n" ] }, { "cell_type": "code", "execution_count": 248, "metadata": {}, "outputs": [], "source": [ "left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3'],\n", " 'key': ['K0', 'K1', 'K0', 'K1']})\n", "\n", "\n", "right = pd.DataFrame({'C': ['C0', 'C1'],\n", " 'D': ['D0', 'D1']},\n", " index=['K0', 'K1'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "left.join(right, on='key')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 多索引组合\n", "\n", "按下图所示进行连接(根据 `key1` 和 `key2`)\n", "\n", "![](https://pandas.pydata.org/pandas-docs/version/0.20/_images/merging_join_multikeys.png)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3'],\n", " 'key1': ['K0', 'K0', 'K1', 'K2'],\n", " 'key2': ['K0', 'K1', 'K0', 'K1']})\n", "\n", "\n", "index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),\n", " ('K2', 'K0'), ('K2', 'K1')])\n", "\n", "\n", "right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']},\n", " index=index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "left.join(right, on=['key1', 'key2'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }