pandas has two main data structures: Series and DataFrame. With these two structures you can load data, visualize data, and analyze data.
Installing pandas: pip install pandas
import numpy as np
import pandas as pd
a = np.array([1,5,3,4,10,0,9])
b = pd.Series([1,5,3,4,10,0,9])
print(a)
print(b)
[ 1 5 3 4 10 0 9]
0 1
1 5
2 3
3 4
4 10
5 0
6 9
dtype: int64
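A Series, like a list, holds a sequence of values; it is a one-dimensional, array-like object in which every value has a corresponding index label. Take the list [9, 3, 8]: written together with its index values, 0 pairs with 9, 1 with 3, and 2 with 8.
A Series has two attributes: values and index. Sometimes the data needs to be shown vertically, and a Series is essentially an array "stood on end".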
import pandas as pd
b = pd.Series([1,5,3,4,10,0,9])
print (b.values)
print (b.index)
print (type(b.values))
[ 1 5 3 4 10 0 9]
RangeIndex(start=0, stop=7, step=1)
<class 'numpy.ndarray'>
import pandas as pd
s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])
print (s)
张三 21
李四 19
王五 20
赵六 50
dtype: int64
s['赵六']
50
Build a Series from a list
A Series consists of data and an index
Get the data and the index
ser_obj.index, ser_obj.values
Preview the data
ser_obj.head(n)
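The points above can be tried out in a few lines. This is a minimal sketch: the name ser_obj comes from the list above, while the sample values are made up for illustration.

```python
import pandas as pd

# Build a Series from a plain Python list (sample values are illustrative).
ser_obj = pd.Series([10, 20, 30, 40, 50, 60])

# A Series is made of data plus an index.
print(ser_obj.values)   # the underlying NumPy array
print(ser_obj.index)    # a RangeIndex by default

# Preview the first n entries.
first_three = ser_obj.head(3)
print(first_three)
```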
import pandas as pd
countries = ['中国','美国','日本','德国']
countries_s = pd.Series(countries)
print (countries_s)
0 中国
1 美国
2 日本
3 德国
dtype: object
import pandas as pd
country_dicts = {'CH': '中国', 'US': '美国', 'AU': '澳大利亚'}
country_dict_s = pd.Series(country_dicts)
country_dict_s.index.name = 'Code'
country_dict_s.name = 'Country'
print(country_dict_s)
print(country_dict_s.values)
print(country_dict_s.index)
Code
CH 中国
US 美国
AU 澳大利亚
Name: Country, dtype: object
['中国' '美国' '澳大利亚']
Index(['CH', 'US', 'AU'], dtype='object', name='Code')
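Note: the dict keys are used as the index labels.
A list can only be indexed by integers starting from 0, and by default the index of a Series works the same way. Unlike a list, however, a Series can be given a custom index.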
import pandas as pd
data = [1,2,3,4,5]
ind = ['a','b','c','d','e']
s = pd.Series (data, index = ind )
print (s)
a 1
b 2
c 3
d 4
e 5
dtype: int64
import pandas as pd
s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])
s1 = s.to_dict ()
print (s1)
{'张三': 21, '李四': 19, '王五': 20, '赵六': 50}
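Vectorized operations on a Series (and vectorized thinking in general) are very important in data analysis and artificial intelligence: instead of handling scalars one at a time, work on whole vectors (arrays).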
import numpy as np
import pandas as pd
s = range(11)
s1 = pd.Series(s)
total = np.sum(s1)
print('total = ',total)
total = 55
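To make the contrast between scalar and vector thinking concrete, here is a small sketch that computes the same squares twice, once with an element-by-element loop and once with a single vectorized expression. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(range(11))

# Scalar thinking: visit each element in a Python loop.
squares_loop = pd.Series([x ** 2 for x in s])

# Vector thinking: one expression applied to the whole Series at once.
squares_vec = s ** 2

# Both approaches produce the same values.
print((squares_vec == squares_loop).all())
print(s.sum())
```

The vectorized form is shorter and, on large data, usually much faster because the loop runs inside NumPy rather than in Python.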
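A Series resembles a one-dimensional array, whereas a DataFrame is a two-dimensional data structure similar to a spreadsheet. It has both a row index (index) and a column index (the column labels, columns), and it can be thought of as a dict of Series.
Each column is a Series, and all columns share the same row index. A DataFrame is column-oriented: each column can hold a different data type, and because rows and columns are labeled, values are easy to extract.
Build a DataFrame from Series
Build a DataFrame from a dict
Get a column by its column label (the result is a Series)
Add a column, much like adding a key-value pair to a dict
Delete a column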
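The operations listed above can be sketched in one short example. The name df_demo and the two-row data set are illustrative and reuse values that appear elsewhere in this article.

```python
import pandas as pd

# Build a DataFrame from a dict of columns (illustrative data).
df_demo = pd.DataFrame(
    {'Name': ['中国', '美国'], 'Language': ['Chinese', 'English (US)']},
    index=['CH', 'US'],
)

# Selecting one column by its label returns a Series.
names = df_demo['Name']
print(type(names))

# Adding a column works like adding a key-value pair to a dict.
df_demo['Location'] = '地球'

# drop returns a new DataFrame without the column; df_demo itself is unchanged.
df_without_language = df_demo.drop('Language', axis=1)
print(df_without_language)
```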
import pandas as pd
country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})
country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})
country3 = pd.Series({'Name': '澳大利亚','Language': 'English (AU)', 'Area':'7.692M km2','Happiness Rank': 9})
df = pd.DataFrame([country1, country2, country3], index=['CH', 'US', 'AU'])
print(df)
Name Language Area Happiness Rank
CH 中国 Chinese 9.597M km2 79
US 美国 English (US) 9.834M km2 14
AU 澳大利亚 English (AU) 7.692M km2 9
import pandas as pd
country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})
country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})
df = pd.DataFrame([country1, country2], index=['CH', 'US'])
df['Location'] = '地球'
print(df)
Name Language Area Happiness Rank Location
CH 中国 Chinese 9.597M km2 79 地球
US 美国 English (US) 9.834M km2 14 地球
import pandas as pd
dt = {0: [9, 8, 7, 6], 1: [3, 2, 1, 0]}
a = pd.DataFrame(dt)
print (a)
0 1
0 9 3
1 8 2
2 7 1
3 6 0
import pandas as pd
df1 =pd.DataFrame ([[1,2,3],[4,5,6]],index = ['A','B'],columns = ['C1','C2','C3'])
print (df1)
C1 C2 C3
A 1 2 3
B 4 5 6
df1.T
    A  B
C1  1  4
C2  2  5
C3  3  6
df1.shape
(2, 3)
df1.size
6
df1.head(1)
   C1  C2  C3
A   1   2   3
df1.tail(1)
   C1  C2  C3
B   4   5   6
df1.describe()
            C1       C2       C3
count  2.00000  2.00000  2.00000
mean   2.50000  3.50000  4.50000
std    2.12132  2.12132  2.12132
min    1.00000  2.00000  3.00000
25%    1.75000  2.75000  3.75000
50%    2.50000  3.50000  4.50000
75%    3.25000  4.25000  5.25000
max    4.00000  5.00000  6.00000
df1.loc['B']
C1 4
C2 5
C3 6
Name: B, dtype: int64
df1.loc['B'].loc['C2']
5
df1.loc['B', 'C1']
4
df1.iloc[1, 2]
6
import pandas as pd
data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2014,2015,2016,2017,2018],'Points':[4,25,6,2,3]}
# Specify the row index
df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])
print (df)
name year Points
Day1 Joe 2014 4
Day2 Cat 2015 25
Day3 Mike 2016 6
Day4 Kim 2017 2
Day5 Amy 2018 3
# Select a single column
print(df['Points'])
Day1 4
Day2 25
Day3 6
Day4 2
Day5 3
Name: Points, dtype: int64
unique is a method that lists the distinct values in a pandas column.
import pandas as pd
data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2012,2012,2013,2018,2018],'Points':[4,25,6,2,3]}
df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])
print (df)
name year Points
Day1 Joe 2012 4
Day2 Cat 2012 25
Day3 Mike 2013 6
Day4 Kim 2018 2
Day5 Amy 2018 3
First, retrieve the column by indexing the DataFrame with the column label.
Then call the unique method on that column to get its distinct values.
df['year']
Day1 2012
Day2 2012
Day3 2013
Day4 2018
Day5 2018
Name: year, dtype: int64
df['year'].unique()
array([2012, 2013, 2018], dtype=int64)
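One detail worth knowing is that unique returns values in order of first appearance rather than sorted order. The sketch below uses a made-up Series named years to show this.

```python
import pandas as pd

# Illustrative data in which the largest year appears first.
years = pd.Series([2018, 2012, 2013, 2012, 2018])

distinct = years.unique()
print(distinct)         # values in order of first appearance, not sorted
print(len(distinct))    # number of distinct values
```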
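groupby is the most commonly used and most effective grouping function in pandas; the grouped result supports aggregation functions such as sum(), count(), and mean().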
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
'key2':['one', 'two', 'one', 'two', 'one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
print(df)
key1 key2 data1 data2
0 a one 1.600927 -0.876908
1 a two 0.159591 0.288545
2 b one 0.919900 -0.982536
3 b two 1.158895 1.787031
4 a one 0.116526 0.795206
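grouped = df.groupby('key1')
# Select the numeric columns explicitly: pandas 2.0 and later raise an error
# if mean() is applied to the string column key2
print(grouped[['data1', 'data2']].mean())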
data1 data2
key1
a 0.625681 0.068948
b 1.039398 0.402248
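Because the example above uses random numbers, its results differ on every run. The following sketch uses fixed values so the aggregations can be checked by hand, and it also groups by two keys at once. The name df_demo and the numbers are illustrative.

```python
import pandas as pd

# Fixed numbers instead of random ones, so the results are reproducible.
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

by_key1 = df_demo.groupby('key1')['data1']
print(by_key1.sum())     # a: 1 + 2 + 5, b: 3 + 4
print(by_key1.count())   # number of rows in each group
print(by_key1.mean())

# Grouping by two keys yields a result with a two-level index.
by_both = df_demo.groupby(['key1', 'key2'])['data1'].mean()
print(by_both)
```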
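Merging combines columns from different DataFrames based on a shared column.
Example: suppose there are two DataFrames:
(1) The first contains the student ID and name
(2) The second contains the student ID and the grades for three courses: math, the Python language, and computational thinking
Goal: create a new DataFrame that contains the student ID, the name, and the grades for the three courses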
df2 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],
'Math':[91, 88, 75, 68],
'Python':[81, 82, 87, 76],
'Computational thinking':[94, 81, 85, 86]})
print(df2)
Key Math Python Computational thinking
0 2015308 91 81 94
1 2016312 88 82 81
2 2017301 75 87 85
3 2017303 68 76 86
df3 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],
'Name':['张三', '李四', '王五', '赵六']})
print(df3)
Key Name
0 2015308 张三
1 2016312 李四
2 2017301 王五
3 2017303 赵六
# Merge the names (df3) with the grades (df2) on the shared 'Key' column
dfnew = pd.merge(df3, df2, on='Key')
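In the example above every student ID appears in both DataFrames. As a hedged sketch of what happens when the keys do not match completely, the code below uses two small made-up frames (students and grades are illustrative names) and compares the default inner join with an outer join.

```python
import pandas as pd

students = pd.DataFrame({'Key': ['2015308', '2016312', '2017301'],
                         'Name': ['张三', '李四', '王五']})
grades = pd.DataFrame({'Key': ['2015308', '2016312', '2017303'],
                       'Math': [91, 88, 68]})

# Default how='inner': keep only the IDs found in both frames.
inner = pd.merge(students, grades, on='Key')

# how='outer': keep every ID; values missing on one side become NaN.
outer = pd.merge(students, grades, on='Key', how='outer')

print(inner)
print(outer)
```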
Handling missing data
df2
       Key  Math  Python  Computational thinking
0  2015308    91      81                      94
1  2016312    88      82                      81
2  2017301    75      87                      85
3  2017303    68      76                      86
df2.drop([0, 3])
       Key  Math  Python  Computational thinking
1  2016312    88      82                      81
2  2017301    75      87                      85
# axis selects the axis: 0 means rows, 1 means columns; the default is 0
df2.drop('Math', axis=1)
       Key  Python  Computational thinking
0  2015308      81                      94
1  2016312      82                      81
2  2017301      87                      85
3  2017303      76                      86
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj1 = pd.Series(sdata)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states)
obj3 = pd.isnull(obj2)
import math
math.isnan(obj2['California'])
True
obj2
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
obj2['California'] == None
False
x = obj2['California']
obj2['California'] != x
True
obj3['California']
True
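The checks above show that NaN is not equal to None and not even equal to itself, so ordinary comparisons cannot find it. A minimal sketch of the usual alternatives follows, reusing the sdata and states values from above; the names obj, dropped, and filled are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj = pd.Series(sdata, index=states)   # 'California' has no value, so it becomes NaN

# NaN never compares equal to anything, so use isna/notna to detect it.
print(obj.isna())

# Either drop the missing entries or replace them with a chosen value.
dropped = obj.dropna()
filled = obj.fillna(0)
print(dropped)
print(filled)
```

Whether to drop or fill depends on the analysis; filling with 0 here is only an example value.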
import pandas as pd
d = {
'1': 'Alice',
'2': 'Bob',
'3': 'Rita',
'4': 'Molly',
'5': 'Ryan'
}
S = pd.Series(d)
S.iloc[0:3]
1 Alice
2 Bob
3 Rita
dtype: object
from pandas import DataFrame
score = {'gre_score':[337, 324, 316, 322, 314], 'toefl_score':[118, 107, 104, 110, 103]}
score_df = DataFrame(score, index = [1, 2, 3, 4, 5])
print(score_df)
gre_score toefl_score
1 337 118
2 324 107
3 316 104
4 322 110
5 314 103
score_df.where(score_df['toefl_score'] > 105).dropna()
   gre_score  toefl_score
1      337.0        118.0
2      324.0        107.0
4      322.0        110.0
score_df[score_df['toefl_score'] > 105]
   gre_score  toefl_score
1        337          118
2        324          107
4        322          110
score_df.where(score_df['toefl_score'] > 105)
   gre_score  toefl_score
1      337.0        118.0
2      324.0        107.0
3        NaN          NaN
4      322.0        110.0
5        NaN          NaN
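The three outputs above differ in a subtle way: where keeps the original shape and fills the failing rows with NaN, which turns the integer columns into floats, while boolean indexing simply drops those rows. The sketch below checks this directly; mask, masked, and filtered are illustrative names.

```python
from pandas import DataFrame

score = {'gre_score': [337, 324, 316, 322, 314],
         'toefl_score': [118, 107, 104, 110, 103]}
score_df = DataFrame(score, index=[1, 2, 3, 4, 5])
mask = score_df['toefl_score'] > 105

# where keeps the shape: rows failing the condition are filled with NaN,
# which is why the integer columns are displayed as floats.
masked = score_df.where(mask)

# Boolean indexing drops the failing rows, so the integer dtype survives.
filtered = score_df[mask]

print(masked.dtypes)
print(filtered.dtypes)
```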
Python dict
Pandas Series object
2D ndarray
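Assuming the three items above list kinds of input that a DataFrame can be built from, the sketch below constructs a small frame from each. All names and values are illustrative.

```python
import numpy as np
import pandas as pd

# From a Python dict: the keys become the column labels.
from_dict = pd.DataFrame({'one': [0, 4], 'two': [1, 5]})

# From a Series: the result has a single column named after the Series.
from_series = pd.DataFrame(pd.Series([0, 4], name='one'))

# From a 2D ndarray: pass the column labels explicitly,
# otherwise they default to 0, 1, ...
from_array = pd.DataFrame(np.array([[0, 1], [4, 5]]), columns=['one', 'two'])

print(from_dict)
print(from_series)
print(from_array)
```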
city_dict = {'one':[0, 4, 8, 12], 'two':[1, 5, 9, 13], 'three':[2, 6, 10, 14], 'four':[3, 7, 11, 15]}
city_df = DataFrame(city_dict, index=['Ohio', 'Colorado', 'Utah', 'New York'])
print(city_df)
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
print(city_df.drop('two', axis=1))
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
print(city_df.drop(['Utah', 'Colorado']))
one two three four
Ohio 0 1 2 3
New York 12 13 14 15
import pandas as pd
s1 = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})
s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})
print(s1)
print(s2)
1 Alice
2 Jack
3 Molly
dtype: object
Alice 1
Jack 2
Molly 3
dtype: int64
s2.iloc[1]
2
s1.loc[1]
'Alice'
s2[1]
2
s2.loc[1]  # raises KeyError: the labels of s2 are 'Alice', 'Jack', 'Molly', so there is no label 1
We can use s.iteritems() on a pd.Series object s to iterate over it (iteritems() was removed in pandas 2.0; use s.items() there instead).
If s and s1 are two pd.Series objects, we can't use s.append(s1) to directly append s1 to the existing Series s.
If s is a pd.Series object, then we can use s.loc[label] to get all data where the index is equal to label.
loc and iloc are two useful and commonly used pandas methods.
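The statements about iteration and label lookup can be tried on a small Series with a repeated label. This is a sketch with made-up data; s_demo is an illustrative name, and items() is used because it is available in both older and current pandas versions.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# Iterate over (label, value) pairs.
for label, value in s_demo.items():
    print(label, value)

# loc selects by label; a repeated label returns every matching entry.
matches = s_demo.loc['a']
print(matches)

# iloc selects by integer position, regardless of the labels.
second = s_demo.iloc[1]
print(second)
```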
s = pd.Series([1, 2, 3])
s
0 1
1 2
2 3
dtype: int64
s1 = pd.Series([4, 5, 6])
s1
0 4
1 5
2 6
dtype: int64
s.append(s1)
s
0 1
1 2
2 3
dtype: int64
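The output above shows that s is unchanged: append returns a new Series instead of modifying s in place. Note also that Series.append was removed in pandas 2.0. A minimal sketch of the replacement, pd.concat, follows; the ignore_index=True variant is an optional choice shown here for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s1 = pd.Series([4, 5, 6])

# concat returns a new Series; neither s nor s1 is modified.
combined = pd.concat([s, s1])
print(combined)          # the index labels repeat: 0, 1, 2, 0, 1, 2

# ignore_index=True renumbers the result from 0 to 5.
renumbered = pd.concat([s, s1], ignore_index=True)
print(renumbered)
```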
print(score_df)
gre_score toefl_score
1 337 118
2 324 107
3 316 104
4 322 110
5 314 103
score_df[(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)]
   gre_score  toefl_score
2        324          107
4        322          110
score_df[(score_df['toefl_score'].isin(range(106, 115)))]
   gre_score  toefl_score
2        324          107
4        322          110
(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)
1 False
2 True
3 False
4 True
5 False
Name: toefl_score, dtype: bool
score_df[score_df['toefl_score'].gt(105) & score_df['toefl_score'].lt(115)]
   gre_score  toefl_score
2        324          107
4        322          110
stu_dict = {'Name':['Alice', 'Jack'], 'Age':[20, 22], 'Gender':['F', 'M']}
stu_df = DataFrame(stu_dict, index=['Mathematics', 'Sociology'])
print(stu_df)
Name Age Gender
Mathematics Alice 20 F
Sociology Jack 22 M
stu_df.loc['Mathematics']
Name Alice
Age 20
Gender F
Name: Mathematics, dtype: object