Data aggregation and grouping mainly covers:
Splitting a pandas object by one or more keys (functions, arrays, or DataFrame column names)
Computing group statistics such as count, mean, standard deviation, or a user-defined function
Applying a variety of functions to the columns of a DataFrame
Applying within-group transformations or other operations: normalization, linear regression, ranking, subset selection
Pivot tables and cross-tabulations
Grouped analysis
groupby
groupby puts a pandas data object through a split-apply-combine workflow.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'two', 'one', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
  key1 key2     data1     data2
0    a  one  0.127349  1.477832
1    a  two -0.430755  0.397269
2    b  two -0.911272  0.669960
3    b  one  0.370609  0.469459
4    a  one  0.968631  0.885551

To group by key1 and compute the mean of the data1 column, use groupby to split the data:

grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    0.221742
b   -0.270332
Here the data has been aggregated by the group key, producing a new Series indexed by the unique values of key1.
When we pass in several arrays at once:

mean = df['data1'].groupby([df['key1'], df['key2']]).mean()
mean
key1  key2
a     one     0.547990
      two    -0.430755
b     one     0.370609
      two    -0.911272

The resulting Series has a hierarchical index, which can be unstacked:

mean.unstack()
key2       one       two
key1
a     0.547990 -0.430755
b     0.370609 -0.911272
The group keys can be any arrays of the right length, or simply column names (strings or numbers).
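For instance, with the df defined above, the column names themselves can be passed as keys (a minimal sketch; the selected columns are just one possible choice):

df.groupby('key1')['data1'].mean()                        # a single column name as the key
df.groupby(['key1', 'key2'])[['data1', 'data2']].mean()   # a list of column names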
In addition, the GroupBy size method returns a Series containing the size of each group.
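A quick sketch on the same df:

df.groupby(['key1', 'key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64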
Iterating over groups
A GroupBy object supports iteration, either over a single key or over multiple keys:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
4    a  one  0.192612 -0.233259
b
  key1 key2     data1     data2
2    b  two -0.516440 -0.030872
3    b  one  1.204337 -1.718465

for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

a one
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
4    a  one  0.192612 -0.233259
a two
  key1 key2     data1     data2
1    a  two  0.400731  0.521359
b one
  key1 key2     data1     data2
3    b  one  1.204337 -1.718465
b two
  key1 key2    data1     data2
2    b  two -0.51644 -0.030872
We can then work with the pieces produced by this iteration, for example by collecting them into a dict:
pieces = dict(list(df.groupby('key1')))
pieces['a']
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
4    a  one  0.192612 -0.233259
groupby can also group along the columns (axis=1), here by dtype:

group = df.groupby(df.dtypes, axis=1)
dict(list(group))
{dtype('float64'):       data1     data2
 0 -0.047866  0.323667
 1  0.400731  0.521359
 2 -0.516440 -0.030872
 3  1.204337 -1.718465
 4  0.192612 -0.233259,
 dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  two
 3    b  one
 4    a  one}
Selecting a column or subset of columns
A GroupBy object created from a DataFrame can be indexed with a single column name (a string) or a list of column names, which selects those columns for aggregation.
df.groupby(['key1', 'key2'])[['data2']].mean()   # compute the mean of data2 only; a list of names returns a grouped DataFrame
              data2
key1 key2
a    one   0.045204
     two   0.521359
b    one  -1.718465
     two  -0.030872
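Indexing with a single column name instead yields a grouped Series (same values, different shape):

df.groupby(['key1', 'key2'])['data2'].mean()
key1  key2
a     one     0.045204
      two     0.521359
b     one    -1.718465
      two    -0.030872
Name: data2, dtype: float64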
Grouping with dicts and Series
With a dict:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['A', 'B', 'C', 'D', 'E'])
people.iloc[2:4, 1:3] = np.nan   # add some NA values to columns 'b' and 'c'
people
          a         b         c         d         e
A  1.136587 -1.014446  0.516930 -1.032332 -0.638470
B  0.971402 -2.668267  0.741624  1.000780  0.207016
C -0.816918       NaN       NaN  0.693395  0.432147
D  0.056326       NaN       NaN  0.143980  0.103061
E -0.986889  0.881119 -0.034649  1.266522  0.240111

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}   # a dict mapping columns to groups
by_columns = people.groupby(mapping, axis=1)          # pass the dict to groupby
by_columns.sum()
       blue       red
A -0.515401 -0.516329
B  1.742404 -1.489849
C  0.693395 -0.384771
D  0.143980  0.159386
E  1.231873  0.134341
With a Series:
map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

people.groupby(map_series, axis=1).count()
   blue  red
A     2    3
B     2    3
C     1    2
D     1    2
E     2    3
Grouping with functions
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Tony', 'Steve', 'Jim', 'Travis'])
people.iloc[2:4, 1:3] = np.nan   # add some NA values
people.groupby(len).sum()        # group by the length of each name
          a         b         c         d         e
3  2.551209 -0.095830 -0.388745 -1.382552 -0.678547
4 -0.644126  1.371943 -0.316472  0.121421 -1.208201
5 -0.557207  0.000000  0.000000 -0.102257  0.438465
6  1.228693  0.056033 -0.472999  1.077108 -0.772639
Mixing arrays, lists, dicts, and functions
key_list = ['one', 'one', 'two', 'three', 'two']
people.groupby([len, key_list]).max()
                a         b         c         d         e
3 one    2.189451 -0.095830 -0.388745 -1.208221 -0.826994
  three  0.361758       NaN       NaN -0.174331  0.148447
4 one   -0.644126  1.371943 -0.316472  0.121421 -1.208201
5 two   -0.557207       NaN       NaN -0.102257  0.438465
6 two    1.228693  0.056033 -0.472999  1.077108 -0.772639
Grouping by hierarchical index levels
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
cty          US                            JP
tenor         1         3         5         1         3
0     -0.248862 -1.474661 -0.483425  0.217972  0.957090
1      0.707428 -1.150710  0.101633  1.821193  2.027130
2      0.752401  0.850991  1.068320  0.327417 -0.331883
3     -0.032896  0.819291 -0.270272  1.750449  0.295412

hier_df.groupby(level='cty', axis=1).count()   # aggregate by a level of the hierarchical index, passing the level number or name
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
Data aggregation
Aggregation refers to any data transformation that produces scalar values from arrays. Common aggregations include:
count --- number of non-NA values
sum --- sum of non-NA values
mean --- mean of non-NA values
median --- arithmetic median of non-NA values
std, var --- sample standard deviation and variance (denominator n - 1)
min, max --- minimum and maximum of non-NA values
prod --- product of non-NA values
first, last --- first and last non-NA values
quantile(0.9) --- the 90% sample quantile (a Series method); see the sketch below
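Because quantile is a Series method, it also works on grouped data even though it is not implemented specifically for GroupBy; a minimal sketch using the df from above:

df.groupby('key1')['data1'].quantile(0.9)   # 90% quantile of data1 within each group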
Of course, we can also define our own aggregation functions, for example:
df
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
2    b  two -0.516440 -0.030872
3    b  one  1.204337 -1.718465
4    a  one  0.192612 -0.233259

def peak_to_peak(arr):             # custom aggregation function
    return arr.max() - arr.min()

grouped = df.groupby('key1')
grouped.agg(peak_to_peak)          # pass it to the GroupBy agg method
         data1     data2
key1
a     0.448598  0.754618
b     1.720777  1.687592
Column-wise and multiple function application
tips = pd.read_csv('C:/Users/Administrator/Desktop/IPYTHON/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:5]
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808

grouped = tips.groupby(['sex', 'smoker'])   # group by sex and smoker
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')                     # a function can be passed by name as a string
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771

grouped_pct.agg(['mean', 'std', peak_to_peak])   # a list of functions applies each of them
                    mean       std  peak_to_peak
sex     smoker
Female  No      0.156921  0.036421      0.195876
        Yes     0.182150  0.071595      0.360233
Male    No      0.160669  0.041849      0.220186
        Yes     0.152771  0.090588      0.674707

grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])   # (name, function) tuples give each result column a custom name
                     foo       bar
sex     smoker
Female  No      0.156921  0.036421
        Yes     0.182150  0.071595
Male    No      0.160669  0.041849
        Yes     0.152771  0.090588

func = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(func)
result
               tip_pct                     total_bill
                 count      mean       max      count       mean    max
sex     smoker
Female  No          54  0.156921  0.252672         54  18.105185  35.83
        Yes         33  0.182150  0.416667         33  17.977879  44.30
Male    No          97  0.160669  0.291990         97  19.791237  48.33
        Yes         60  0.152771  0.710345         60  22.284500  50.81

grouped.agg({'tip': np.max, 'size': 'sum'})   # apply different functions to different columns by passing agg a dict mapping column names to functions
                 tip  size
sex     smoker
Female  No       5.2   140
        Yes      6.5    74
Male    No       9.0   263
        Yes     10.0   150

grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})   # a column can also be mapped to a list of functions
                 tip_pct                               size
                     min       max      mean       std  sum
sex     smoker
Female  No      0.056797  0.252672  0.156921  0.036421  140
        Yes     0.056433  0.416667  0.182150  0.071595   74
Male    No      0.071804  0.291990  0.160669  0.041849  263
        Yes     0.035638  0.710345  0.152771  0.090588  150
Group-wise operations and transformations
transform applies a function to each group and places the results in the appropriate locations:
Using the people data from earlier:

key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()
            a         b         c         d         e
one -0.142316 -0.065082  2.255068  1.211343  0.817666
two  1.493163 -0.169286  0.671765  1.290332  0.093891

people.groupby(key).transform(np.mean)   # transform broadcasts the group means back to the original rows
               a         b         c         d         e
Joe    -0.142316 -0.065082  2.255068  1.211343  0.817666
Tony    1.493163 -0.169286  0.671765  1.290332  0.093891
Steve  -0.142316 -0.065082  2.255068  1.211343  0.817666
Jim     1.493163 -0.169286  0.671765  1.290332  0.093891
Travis -0.142316 -0.065082  2.255068  1.211343  0.817666
To subtract the group mean from each value, we can write a demeaning function and pass it to transform:
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned
               a         b         c         d         e
Joe     2.102734 -0.111083  0.242585 -0.334759 -0.008227
Tony    0.339766  0.000000  0.000000  0.630333 -0.157031
Steve  -1.583725       NaN       NaN  0.503897 -0.087889
Jim    -0.339766       NaN       NaN -0.630333  0.157031
Travis -0.519009  0.111083 -0.242585 -0.169138  0.096116
apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces back together.
Using the tips data as an example:
def top(df, n=5, column='tip_pct'):
    # return the rows with the largest values in the given column
    return df.sort_values(by=column)[-n:]

top(tips, n=6)   # the six largest tip_pct values
     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345

tips.groupby('smoker').apply(top)   # group by smoker and call apply
            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')   # extra arguments are passed through to the function
                 total_bill    tip     sex smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25  Female     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00    Male     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00    Male     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00    Male     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Male    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Male    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Male    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00  Female    Yes  Thur   Lunch     4  0.115982
Passing group_keys=False to groupby suppresses the group keys in the result's index.
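A short sketch on the tips data, reusing the top function defined above:

tips.groupby('smoker', group_keys=False).apply(top)   # the result keeps only the original row index, without a 'smoker' level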
Quantile and bucket analysis: combining pandas's binning functions (cut and qcut) with groupby makes it easy to perform bucket or quantile analysis on a dataset:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)   # cut data1 into 4 equal-length buckets

def get_stats(group):             # statistics to compute on data2
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)   # grouping by factor makes the data2 statistics follow the data1 buckets
grouped.apply(get_stats).unstack()
                   count       max      mean       min
data1
(-3.196, -1.594]    55.0  2.567283 -0.322128 -2.235570
(-1.594, 0.00267]  418.0  2.505320 -0.052012 -3.109679
(0.00267, 1.599]   467.0  3.187789  0.069884 -3.000375
(1.599, 3.195]      60.0  1.873327  0.022410 -1.934497

grouping = pd.qcut(frame.data1, 10, labels=False)   # qcut uses sample quantiles to form equal-size buckets; labels=False returns bucket numbers
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
       count       max      mean       min
data1
0      100.0  2.567283 -0.184858 -2.235570
1      100.0  2.235197 -0.025099 -2.251106
2      100.0  2.011793 -0.096504 -2.184717
3      100.0  1.953811 -0.120498 -2.911569
4      100.0  3.124611  0.156681 -3.109679
5      100.0  3.067191  0.075880 -2.844021
6      100.0  2.328596  0.082201 -2.068833
7      100.0  3.187789  0.105087 -2.310712
8      100.0  2.624968 -0.047236 -3.000375
9      100.0  2.735516 -0.000431 -2.182246
Example: filling missing values with group-specific values
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
data.groupby(group_key).mean()                     # the mean of each group

fill_mean = lambda g: g.fillna(g.mean())           # fill NA values with the group mean
data.groupby(group_key).apply(fill_mean)

fill_values = {'East': 0.5, 'West': -1}            # predefined fill values, looked up via each group's name attribute
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Pivot tables and cross-tabulation