pandas的级联操作
阅读原文时间:2023年07月08日阅读:1
  • pd.concat, pd.append

    import pandas as pd
    from pandas import DataFrame
    import numpy as np

pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:

objs
axis=0
keys
join='outer' / 'inner':表示的是级联的方式,outer会将所有的项进行级联(忽略匹配和不匹配),而inner只会将匹配的项级联到一起,不匹配的不级联
ignore_index=False
  • 匹配级联

    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
    'group':['Accounting','Product','Marketing'],
    'hire_date':[1998,2017,2018]})
    df1

employee

group

hire_date

0

Bobs

Accounting

1998

1

Linda

Product

2017

2

Bill

Marketing

2018

pd.concat((df1,df1),axis=0)

employee

group

hire_date

0

Bobs

Accounting

1998

1

Linda

Product

2017

2

Bill

Marketing

2018

0

Bobs

Accounting

1998

1

Linda

Product

2017

2

Bill

Marketing

2018

  • 不匹配级联

    • 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致
    • 有2种连接方式:
      • 外连接:补NaN(默认模式)
      • 内连接:只连接匹配的项

    df2 = df1.copy()
    df2.columns = ['employee','groupps','hire_date']
    df2

employee

groupps

hire_date

0

Bobs

Accounting

1998

1

Linda

Product

2017

2

Bill

Marketing

2018

pd.concat((df1,df2),axis=0)

# 按列级联,发现不同列就加在表中,用NaN补全空的字段

employee

group

groupps

hire_date

0

Bobs

Accounting

NaN

1998

1

Linda

Product

NaN

2017

2

Bill

Marketing

NaN

2018

0

Bobs

NaN

Accounting

1998

1

Linda

NaN

Product

2017

2

Bill

NaN

Marketing

2018

  • join:

    • inner:只对可以匹配的项进行级联
    • outer:可以级联所有的项

    pd.concat((df1,df2),axis=0,join='inner')

employee

hire_date

0

Bobs

1998

1

Linda

2017

2

Bill

2018

0

Bobs

1998

1

Linda

2017

2

Bill

2018

  • append函数的使用: append只可以进行纵向的级联

    employee

    group

    groupps

    hire_date

    0

    Bobs

    Accounting

    NaN

    1998

    1

    Linda

    Product

    NaN

    2017

    2

    Bill

    Marketing

    NaN

    2018

    0

    Bobs

    NaN

    Accounting

    1998

    1

    Linda

    NaN

    Product

    2017

    2

    Bill

    NaN

    Marketing

    2018

合并操作

  • merge与concat的区别在于,merge需要依据某一共同列来进行合并

  • 使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。

  • 注意每一列元素的顺序不要求一致

一对一合并

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df1

employee

group

0

Bob

Accounting

1

Jake

Engineering

2

Lisa

Engineering

df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })
df2

employee

hire_date

0

Lisa

2004

1

Bob

2008

2

Jake

2012

pd.merge(df1,df2,on='employee')

employee

group

hire_date

0

Bob

Accounting

2008

1

Jake

Engineering

2012

2

Lisa

Engineering

2004

一对多合并

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
df

employee

group

hire_date

0

Lisa

Accounting

2004

1

Jake

Engineering

2016

df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })
df4

group

supervisor

0

Accounting

Carly

1

Engineering

Guido

2

Engineering

Steve

pd.merge(df3,df4)

# 会默认以两个表的共同字段 group 进行关联,将两张表中的全部数据进行合并

employee

group

hire_date

supervisor

0

Lisa

Accounting

2004

Carly

1

Jake

Engineering

2016

Guido

2

Jake

Engineering

2016

Steve

多对多合并

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
df1

employee

group

0

Bob

Accounting

1

Jake

Engineering

2

Lisa

Engineering

df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })
df5

group

supervisor

0

Engineering

Carly

1

Engineering

Guido

2

HR

Steve

pd.merge(df1,df5,how='outer')

employee

group

supervisor

0

Bob

Accounting

NaN

1

Jake

Engineering

Carly

2

Jake

Engineering

Guido

3

Lisa

Engineering

Carly

4

Lisa

Engineering

Guido

5

NaN

HR

Steve

key的规范化

  • 当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key,配合suffixes指定冲突列名

    df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
    'group':['Accounting','Finance','Marketing']})
    df1

employee

group

0

Jack

Accounting

1

Summer

Finance

2

Steve

Marketing

df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                 'hire_date':[2003,2009,2012],
                'group':['Accounting','sell','ceo']})
df2

employee

group

hire_date

0

Jack

Accounting

2003

1

Bob

sell

2009

2

Jake

ceo

2012

pd.merge(df1,df2,on='group')

# 指定按照 group 这一列来进行分组,会将两个表中相同的字段名都加在合并后的表中,两个字段会改变
# 也可以通过这个 suffixes=('_x', '_y') 参数进行一个修改

employee_x

group

employee_y

hire_date

0

Jack

Accounting

Jack

2003

  • 当两张表没有可进行连接的列时,可使用left_on和right_on手动指定merge中左右两边的哪一列作为连接的列

    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
    'group':['Accounting','Product','Marketing'],
    'hire_date':[1998,2017,2018]})
    df1

employee

group

hire_date

0

Bobs

Accounting

1998

1

Linda

Product

2017

2

Bill

Marketing

2018

df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})
df5

hire_dates

name

0

1998

Lisa

1

2016

Bobs

2

2007

Bill

pd.merge(df1,df5,left_on='employee',right_on='name',how='outer')

# 分别指定df1表中的employee字段和df5表中的name字段进行一个比对连接,

employee

group

hire_date

hire_dates

name

0

Bobs

Accounting

1998.0

2016.0

Bobs

1

Linda

Product

2017.0

NaN

NaN

2

Bill

Marketing

2018.0

2007.0

Bill

3

NaN

NaN

NaN

1998.0

Lisa

内合并与外合并:out取并集 inner取交集