apply函数
pandas提供了apply函数方便的处理Series与DataFrame;apply函数支持逐一处理数据集中的每个元素都会执行一次目标函数,把返回值存入结果集中。:
# series.apply()
ary = np.array(['80公斤','83公斤','78公斤','74公斤','84公斤'])
s = pd.Series(ary)
def func(x):
return x[:2]
s.apply(func)
# dataframe.apply()
def func(x):
x[pd.isna(x)] = x.mean()
return x
ratings.apply(func, axis=1)
Pandas有两种排序方式,它们分别是按标签与按实际值排序。
import numpy as np
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack','Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)
使用sort_index()
方法,通过传递axis
参数和排序顺序,可以对DataFrame
进行排序。 默认情况下,按照升序对行标签进行排序。
# 按照行标进行排序
sorted_df=unsorted_df.sort_index()
print (sorted_df)
# 控制排序顺序
sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)
# 按照列标签进行排序
sorted_df=unsorted_df.sort_index(axis=1)
print (sorted_df)
像索引排序一样,sort_values()
是按值排序的方法。它接受一个by
参数,它将使用要与其排序值的DataFrame
的列名称。
sorted_df = unsorted_df.sort_values(by='Age')
print (sorted_df)
# 先按Age进行升序排序,然后按Rating降序排序
sorted_df = unsorted_df.sort_values(by=['Age', 'Rating'], ascending=[True, False])
print (sorted_df)
concat函数是在pandas的方法,可以根据不同的轴合并数据集。
r = pd.concat(datas, axis=0, join='outer', ignore_index=False,
keys=['x', 'y', 'z'])
纵向合并:
横向合并:
panda具有全功能、高性能的内存连接操作,与SQL之类的关系数据库非常相似。与其他开源实现相比,这些方法的性能要好得多(在某些情况下要好一个数量级以上)
pandas提供了merge函数实现高效的内存链接操作:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False)
参数名称
说明
left
接收DataFrame或Series。表示要添加的新数据。无默认。
right
接收DataFrame或Series。表示要添加的新数据。无默认。。
how
接收inner,outer,left,right。表示数据的连接方式。默认为inner。
on
接收string或sequence。表示外键字段名。默认为None。
left_on
接收string或sequence。关联操作时左表中的关联字段名。
right_on
接收string或sequence。关联操作时右表中的关联字段名。
left_index
接收boolean。表示是否将left参数接收数据的index作为连接主键。默认为False。
right_index
接收boolean。表示是否将right参数接收数据的index作为连接主键。默认为False。
sort
接收boolean。表示是否根据连接键对合并后的数据进行排序。默认为False。
suffixes
接收接收tuple。表示用于追加到left和right参数接收数据重叠列名的尾缀默认为(’_x’, ‘_y’)。
合并两个DataFrame:
import pandas as pd
left = pd.DataFrame({
'student_id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean', 'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
'class_id':[1,1,1,2,2,2,3,3,3,4,1,1,1,2,2,2,3,3,3,2],
'gender':['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
'age':[20,21,22,20,21,22,23,20,21,22,20,21,22,23,20,21,22,20,21,22],
'score':[98,74,67,38,65,29,32,34,85,64,52,38,26,89,68,46,32,78,79,87]})
right = pd.DataFrame(
{'class_id':[1,2,3,5],
'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
# 合并两个DataFrame
data = pd.merge(left,right)
print(data)
其他合并方法同数据库相同:
合并方法
SQL等效
描述
left
LEFT OUTER JOIN
使用左侧对象的键
right
RIGHT OUTER JOIN
使用右侧对象的键
outer
FULL OUTER JOIN
使用键的联合
inner
INNER JOIN
使用键的交集
实验:
# 合并两个DataFrame (左连接)
rs = pd.merge(left,right,on='subject_id', how='right')
print(rs)
# 合并两个DataFrame (左连接)
rs = pd.merge(left,right,on='subject_id', how='outer')
print(rs)
# 合并两个DataFrame (左连接)
rs = pd.merge(left,right,on='subject_id', how='inner')
print(rs)
pandas提供了功能类似于数据库中group by语句的用于拆分数据组的方法pd.groupby();该方法提供的是分组聚合步骤中的拆分功能,能根据索引或字段对数据进行分组(Split) 进而针对得到的多组数据执行聚合操作(Apply),最终合并为最终结果(Combine)。
groupby方法的参数及其说明:
DataFrame.groupby(by=None, axis=0, as_index=True, sort=True)
参数名称
说明
by
接收list,string,mapping或generator。用于确定进行分组的依据。无默认。
axis
接收int。表示操作的轴向,默认对行进行操作。默认为0。
as_index
接收boolearn。表示聚合后的聚合标签是否以DataFrame索引形式输出。默认为True。
sort
接收boolearn。表示是否对分组依据分组标签进行排序。默认为True。
用groupby方法分组后的结果并不能直接查看,而是被存在内存中,输出的是内存地址。实际上分组后的数据对象(Groupby对象)类似Series与DataFrame,是pandas提供的一种对象。
Groupby对象的常用方法:
方法
说明
groupObject.get_group(‘A’)
返回A组的详细数据
groupObject.size()
返回每一组的频数
grouped = data.groupby(by=['class_id', 'gender'])
grouped.get_group((1, 'M'))
grouped = data.groupby(by=['class_id', 'gender'])
grouped.get_group((1, 'M'))
聚合函数为每个组返回聚合值。当创建了分组(groupby)对象,就可以对每个分组的其他字段数据执行求和、求标准差等操作。
使用聚合函数agg进行组内计算:
grouped = data.groupby(by='class_id')
grouped.agg({'score':np.mean})
对于某个字段希望只做求均值操作,而对另一个字段则希望只做求和操作,可以使用字典的方式,将两个字段名分别作为key:
grouped.agg({'age':np.max, 'score':np.mean})
还可以这样:
result = grouped.agg(
{'age':np.max, 'score':[np.mean, np.max]})
pandas支持的聚合函数有:
方法名称
说明
count
计算分组的数目,包括缺失值。
head
返回每组的前n个值。
max
返回每组最大值。
mean
返回每组的均值。
median
返回每组的中位数。
cumcount
对每个分组中组员的进行标记,0至n-1。
size
返回每组的大小。
min
返回每组最小值。
std
返回每组的标准差。
sum
返回每组的和。
透视表(pivot table)是各种电子表格程序和其他数据分析软件中一种常见的数据汇总工具。它根据一个或多个键对数据进行分组聚合,并根据每个分组进行数据汇总。
# 以class_id与gender做分组汇总数据,默认聚合统计所有列
print(data.pivot_table(index=['class_id', 'gender']))
# 以class_id与gender做分组汇总数据,聚合统计score列
print(data.pivot_table(index=['class_id', 'gender'], values=['score']))
# 以class_id与gender做分组汇总数据,聚合统计score列,针对age的每个值列级分组统计
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age']))
# 以class_id与gender做分组汇总数据,聚合统计score列,针对age的每个值列级分组统计,添加行、列小计
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age'], margins=True))
# 以class_id与gender做分组汇总数据,聚合统计score列,针对age的每个值列级分组统计,添加行、列小计
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age'], margins=True, aggfunc='max'))
交叉表(cross-tabulation, 简称crosstab)是一种用于计算分组频率的特殊透视表:
# 按照class_id分组,针对不同的gender,统计数量
print(pd.crosstab(data.class_id, data.gender, margins=True))
https://download.csdn.net/download/yegeli/12562286
在我的资源中下载:StudentsPerformance.csv
字段
说明
gender
性别
race/ethnicity
种族
parental level of education
父母教育水平
lunch
午餐
test preparation course
是否通过预科考试
math score
数学得分
reading score
阅读得分
writing score
写作得分
import numpy as np
import pandas as pd
data = pd.read_csv('StudentsPerformance.csv')
data['total score'] = data.sum(axis=1)
# 参数number,object意为:统计数字列与字符串列
data.describe(include=['number', 'object'])
gender
race/ethnicity
parental level of education
lunch
test preparation course
math score
reading score
writing score
total score
count
1000
1000
1000
1000
1000
1000.00000
1000.000000
1000.000000
1000.000000
unique
2
5
6
2
2
NaN
NaN
NaN
NaN
top
female
group C
some college
standard
none
NaN
NaN
NaN
NaN
freq
518
319
226
645
642
NaN
NaN
NaN
NaN
mean
NaN
NaN
NaN
NaN
NaN
66.08900
69.169000
68.054000
203.312000
std
NaN
NaN
NaN
NaN
NaN
15.16308
14.600192
15.195657
42.771978
min
NaN
NaN
NaN
NaN
NaN
0.00000
17.000000
10.000000
27.000000
25%
NaN
NaN
NaN
NaN
NaN
57.00000
59.000000
57.750000
175.000000
50%
NaN
NaN
NaN
NaN
NaN
66.00000
70.000000
69.000000
205.000000
75%
NaN
NaN
NaN
NaN
NaN
77.00000
79.000000
79.000000
233.000000
max
NaN
NaN
NaN
NaN
NaN
100.00000
100.000000
100.000000
300.000000
# 分析性别对学习成绩的影响(按性别分组)
data.pivot_table(index='gender')
math score
reading score
total score
writing score
gender
female
63.633205
72.608108
208.708494
72.467181
male
68.728216
65.473029
197.512448
63.311203
总体来说,女生的成绩普遍比较好,但是男生更善于数学。
# 分析种族对学习成绩的影响
data.pivot_table(index='race/ethnicity')
math score
reading score
total score
writing score
race/ethnicity
group A
61.629213
64.674157
188.977528
62.674157
group B
63.452632
67.352632
196.405263
65.600000
group C
64.463950
69.103448
201.394984
67.827586
group D
67.362595
70.030534
207.538168
70.145038
group E
73.821429
73.028571
218.257143
71.407143
种族划分(优秀-及格): E - D - C - B - A
# 分析父母教育水平对学习成绩的影响
r = data.pivot_table(index='parental level of education')
r.sort_values(by='total score', ascending=False)
math score
reading score
total score
writing score
parental level of education
master's degree
69.745763
75.372881
220.796610
75.677966
bachelor's degree
69.389831
73.000000
215.771186
73.381356
associate's degree
67.882883
70.927928
208.707207
69.896396
some college
67.128319
69.460177
205.429204
68.840708
some high school
63.497207
66.938547
195.324022
64.888268
high school
62.137755
64.704082
189.290816
62.448980
父母受教育水平越高,学习成绩越好。
# 分析中午饭学习成绩的影响
r = data.pivot_table(index='lunch')
r.sort_values(by='total score', ascending=False)
math score
reading score
total score
writing score
lunch
standard
70.034109
71.654264
212.511628
70.823256
free/reduced
58.921127
64.653521
186.597183
63.022535
# 分析中午饭学习成绩的影响
r = data.pivot_table(index='test preparation course')
r.sort_values(by='total score', ascending=False)
math score
reading score
total score
writing score
test preparation course
completed
69.695531
73.893855
218.008380
74.418994
none
64.077882
66.534268
195.116822
64.504673
r = data.pivot_table(index=['gender','test preparation course'])
r
math score
reading score
total score
writing score
gender
test preparation course
female
completed
67.195652
77.375000
223.364130
78.793478
none
61.670659
69.982036
200.634731
68.982036
male
completed
72.339080
70.212644
212.344828
69.793103
none
66.688312
62.795455
189.133117
59.649351
r = data.sort_values(by='total score', ascending=False)
top100 = r.head(100)
tail100 = r.tail(100)
r1 = pd.DataFrame({'top100':top100['gender'].value_counts(),
'tail100':tail100['gender'].value_counts()})
r1
tail100
top100
female
38
66
male
62
34
data['parental level of education'].value_counts()
some college 226
associate's degree 222
high school 196
some high school 179
bachelor's degree 118
master's degree 59
Name: parental level of education, dtype: int64
r2 = pd.DataFrame({'top100':top100['parental level of education'].value_counts(),
'tail100':tail100['parental level of education'].value_counts()})
r2
tail100
top100
associate's degree
17
29
bachelor's degree
8
20
high school
32
6
master's degree
1
15
some college
14
21
some high school
28
9
import numpy as np
import pandas as pd
# apply() 应用于Series
ary = np.array(['80公斤','83公斤','78公斤','74公斤','84公斤'])
s = pd.Series(ary)
s
def func(item):
return float(item[:-2])
s = s.apply(func)
s
0 80.0
1 83.0
2 78.0
3 74.0
4 84.0
dtype: float64
# apply() 应用于 DataFrame
ratings = pd.read_json('../../data/ratings.json')
def func(item):
item[item.isna()] = item.mean()
return item
ratings.apply(func, axis=0)
<ipython-input-7-ebdfbe0e051f>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
item[item.isna()] = item.mean()
John Carson
Michelle Peterson
William Reynolds
Jillian Hobart
Melissa Jones
Alex Roberts
Michael Henry
Inception
2.5
3.0
2.50
3.5
3
3.0
3.166667
Pulp Fiction
3.5
3.5
3.00
3.5
4
4.0
4.500000
Anger Management
3.0
1.5
3.25
3.0
2
3.7
3.166667
Fracture
3.5
5.0
3.50
4.0
3
5.0
4.000000
Serendipity
2.5
3.5
3.25
2.5
2
3.5
1.000000
Jerry Maguire
3.0
3.0
4.00
4.5
3
3.0
3.166667
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack','Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)
unsorted_df
Name
Age
Rating
0
Tom
25
4.23
1
James
26
3.24
2
Ricky
25
3.98
3
Vin
23
2.56
4
Steve
30
3.20
5
Minsu
29
4.60
6
Jack
23
3.80
7
Lee
34
3.78
8
David
40
2.98
9
Gasper
30
4.80
10
Betina
51
4.10
11
Andres
46
3.65
unsorted_df.sort_index(axis=1, ascending=True)
Age
Name
Rating
0
25
Tom
4.23
1
26
James
3.24
2
25
Ricky
3.98
3
23
Vin
2.56
4
30
Steve
3.20
5
29
Minsu
4.60
6
23
Jack
3.80
7
34
Lee
3.78
8
40
David
2.98
9
30
Gasper
4.80
10
51
Betina
4.10
11
46
Andres
3.65
unsorted_df.sort_index(ascending=False)
Name
Age
Rating
11
Andres
46
3.65
10
Betina
51
4.10
9
Gasper
30
4.80
8
David
40
2.98
7
Lee
34
3.78
6
Jack
23
3.80
5
Minsu
29
4.60
4
Steve
30
3.20
3
Vin
23
2.56
2
Ricky
25
3.98
1
James
26
3.24
0
Tom
25
4.23
unsorted_df.sort_values(
by=['Age', 'Rating'], ascending=[True, False])
Name
Age
Rating
6
Jack
23
3.80
3
Vin
23
2.56
0
Tom
25
4.23
2
Ricky
25
3.98
1
James
26
3.24
5
Minsu
29
4.60
9
Gasper
30
4.80
4
Steve
30
3.20
7
Lee
34
3.78
8
David
40
2.98
11
Andres
46
3.65
10
Betina
51
4.10
left = pd.DataFrame({
'student_id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean', 'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
'class_id':[1,1,1,2,2,2,3,3,3,4,1,1,1,2,2,2,3,3,3,2],
'gender':['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
'age':[20,21,22,20,21,22,23,20,21,22,20,21,22,23,20,21,22,20,21,22],
'score':[98,74,67,38,65,29,32,34,85,64,52,38,26,89,68,46,32,78,79,87]})
right = pd.DataFrame(
{'class_id':[1,2,3,5],
'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
left
student_id
student_name
class_id
gender
age
score
0
1
Alex
1
M
20
98
1
2
Amy
1
M
21
74
2
3
Allen
1
F
22
67
3
4
Alice
2
F
20
38
4
5
Ayoung
2
M
21
65
5
6
Billy
2
M
22
29
6
7
Brian
3
F
23
32
7
8
Bran
3
F
20
34
8
9
Bryce
3
M
21
85
9
10
Betty
4
M
22
64
10
11
Emma
1
F
20
52
11
12
Marry
1
F
21
38
12
13
Allen
1
M
22
26
13
14
Jean
2
M
23
89
14
15
Rose
2
F
20
68
15
16
David
2
F
21
46
16
17
Tom
3
M
22
32
17
18
Jack
3
M
20
78
18
19
Daniel
3
F
21
79
19
20
Andrew
2
F
22
87
right
class_id
class_name
0
1
ClassA
1
2
ClassB
2
3
ClassC
3
5
ClassE
r = pd.merge(left, right, how='inner')
r
student_id
student_name
class_id
gender
age
score
class_name
0
1
Alex
1
M
20
98
ClassA
1
2
Amy
1
M
21
74
ClassA
2
3
Allen
1
F
22
67
ClassA
3
11
Emma
1
F
20
52
ClassA
4
12
Marry
1
F
21
38
ClassA
5
13
Allen
1
M
22
26
ClassA
6
4
Alice
2
F
20
38
ClassB
7
5
Ayoung
2
M
21
65
ClassB
8
6
Billy
2
M
22
29
ClassB
9
14
Jean
2
M
23
89
ClassB
10
15
Rose
2
F
20
68
ClassB
11
16
David
2
F
21
46
ClassB
12
20
Andrew
2
F
22
87
ClassB
13
7
Brian
3
F
23
32
ClassC
14
8
Bran
3
F
20
34
ClassC
15
9
Bryce
3
M
21
85
ClassC
16
17
Tom
3
M
22
32
ClassC
17
18
Jack
3
M
20
78
ClassC
18
19
Daniel
3
F
21
79
ClassC
grouped = r.groupby(by=['class_id', 'gender'])
# 返回每个分组的频数
grouped.size()
class_id gender
1 F 3
M 3
2 F 4
M 3
3 F 3
M 3
dtype: int64
# 获取某一个分组细节
grouped.get_group((2, 'F'))
student_id
student_name
class_id
gender
age
score
class_name
6
4
Alice
2
F
20
38
ClassB
10
15
Rose
2
F
20
68
ClassB
11
16
David
2
F
21
46
ClassB
12
20
Andrew
2
F
22
87
ClassB
# 针对每小组执行聚合操作 agg()
grouped.agg({'score':np.mean})
r = grouped.agg({'score':np.mean, 'age':[np.max, np.min]})
r
score
age
mean
amax
amin
class_id
gender
1
F
52.333333
22
20
M
66.000000
22
20
2
F
59.750000
22
20
M
61.000000
23
21
3
F
48.333333
23
20
M
65.000000
22
20
data = pd.merge(left, right)
# 透视表
data.pivot_table(index='class_id')
age
score
student_id
class_id
1
21.000000
59.166667
7.000000
2
21.285714
60.285714
11.428571
3
21.166667
56.666667
13.000000
data
student_id
student_name
class_id
gender
age
score
class_name
0
1
Alex
1
M
20
98
ClassA
1
2
Amy
1
M
21
74
ClassA
2
3
Allen
1
F
22
67
ClassA
3
11
Emma
1
F
20
52
ClassA
4
12
Marry
1
F
21
38
ClassA
5
13
Allen
1
M
22
26
ClassA
6
4
Alice
2
F
20
38
ClassB
7
5
Ayoung
2
M
21
65
ClassB
8
6
Billy
2
M
22
29
ClassB
9
14
Jean
2
M
23
89
ClassB
10
15
Rose
2
F
20
68
ClassB
11
16
David
2
F
21
46
ClassB
12
20
Andrew
2
F
22
87
ClassB
13
7
Brian
3
F
23
32
ClassC
14
8
Bran
3
F
20
34
ClassC
15
9
Bryce
3
M
21
85
ClassC
16
17
Tom
3
M
22
32
ClassC
17
18
Jack
3
M
20
78
ClassC
18
19
Daniel
3
F
21
79
ClassC
# 结果集中只希望看到score列的均值
data.pivot_table(index='class_id', values='score')
score
class_id
1
59.166667
2
60.285714
3
56.666667
# 依据class_id 与 gender 同时做分组
data.pivot_table(index=['class_id','gender'], values='score')
score
class_id
gender
1
F
52.333333
M
66.000000
2
F
59.750000
M
61.000000
3
F
48.333333
M
65.000000
data.pivot_table(index=['class_id','gender'],
columns='age', values='score')
age
20
21
22
23
class_id
gender
1
F
52.0
38.0
67.0
NaN
M
98.0
74.0
26.0
NaN
2
F
53.0
46.0
87.0
NaN
M
NaN
65.0
29.0
89.0
3
F
34.0
79.0
NaN
32.0
M
78.0
85.0
32.0
NaN
data.pivot_table(index=['class_id','gender'],
columns='age', values='score', aggfunc=np.max)
age
20
21
22
23
class_id
gender
1
F
52.0
38.0
67.0
NaN
M
98.0
74.0
26.0
NaN
2
F
68.0
46.0
87.0
NaN
M
NaN
65.0
29.0
89.0
3
F
34.0
79.0
NaN
32.0
M
78.0
85.0
32.0
NaN
r = data.pivot_table(index=['class_id','gender'],
columns='age', values='score', aggfunc=np.max,
margins=True)
r
age
20
21
22
23
All
class_id
gender
1
F
52.0
38.0
67.0
NaN
67
M
98.0
74.0
26.0
NaN
98
2
F
68.0
46.0
87.0
NaN
87
M
NaN
65.0
29.0
89.0
89
3
F
34.0
79.0
NaN
32.0
79
M
78.0
85.0
32.0
NaN
85
All
98.0
85.0
87.0
89.0
98
手机扫一扫
移动阅读更方便
你可能感兴趣的文章