博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
《利用python进行数据分析》学习笔记--数据聚合与分组(groupby)
阅读量:5862 次
发布时间:2019-06-19

本文共 14618 字,大约阅读时间需要 48 分钟。

在数据聚合与分组中,主要包括:

根据一个或多个键(函数、数组、或dataframe的列名)拆分pandas对象

计算分组后数据的统计值,包括:计数,平均值,标准差,自定义函数

对dataframe的列应用各种各样的函数

实现组内转换或其他运算,规整化,线性回归,排名,选取子集

透视表,交叉表

分组分析

 

groupby 

就是将pandas的数据对象进行,拆分---应用---合并   的数据处理过程

如下面的:

df = pd.DataFrame({
'key1':['a','a','b','b','a'], 'key2':['one','two','two','one','one'], 'data1':np.random.randn(5), 'data2':np.random.randn(5)})df返回 key1 key2 data1 data20 a one 0.127349 1.4778321 a two -0.430755 0.3972692 b two -0.911272 0.6699603 b one 0.370609 0.4694594 a one 0.968631 0.885551当我们想要根据key1进行分组,并且计算data1列的平均值则用到groupby进行拆分grouped = df['data1'].groupby(df['key1'])grouped.mean()key1a 0.221742b -0.270332

在这里,数据根据分组键进行了聚合,产生了新的Series,而且key1是唯一的索引值

当我们一次传入多个数组时:mean = df['data1'].groupby([df['key1'],df['key2']]).mean()meankey1  key2a     one     0.547990      two    -0.430755b     one     0.370609      two    -0.911272此时得到的Series具有一个层次化索引mean.unstack()key2    one    twokey1        a    0.547990    -0.430755b    0.370609    -0.911272

分组键可以是任何长度适当的数组,也可以直接是列名(字符串,数字,)

另外,groupby的size用法,能够返回一个含有分组大小的series

对分组进行迭代

group by 对象的迭代可以有单个键进行迭代,也可以有多重键迭代:

for year , group in df.groupby('key1'):    print (year)     print (group)a  key1 key2     data1     data20    a  one -0.047866  0.3236671    a  two  0.400731  0.5213594    a  one  0.192612 -0.233259b  key1 key2     data1     data22    b  two -0.516440 -0.0308723    b  one  1.204337 -1.718465
for (a1,a2,), group in df.groupby(['key1','key2']):    print(a1,a2)    print(group)a one  key1 key2     data1     data20    a  one -0.047866  0.3236674    a  one  0.192612 -0.233259a two  key1 key2     data1     data21    a  two  0.400731  0.521359b one  key1 key2     data1     data23    b  one  1.204337 -1.718465b two  key1 key2    data1     data22    b  two -0.51644 -0.030872

此时,我们可以对这些迭代产生的数据片段进行操作,如整合成字典:

pieces = dict(list(df.groupby('key1')))pieces['a']    key1    key2    data1    data20    a    one    -0.047866    0.3236671    a    two    0.400731    0.5213594    a    one    0.192612    -0.233259
group = df.groupby(df.dtypes,axis = 1)  #对列进行操作dict(list(group)){dtype('float64'):       data1     data2 0 -0.047866  0.323667 1  0.400731  0.521359 2 -0.516440 -0.030872 3  1.204337 -1.718465 4  0.192612 -0.233259, dtype('O'):   key1 key2 0    a  one 1    a  two 2    b  two 3    b  one 4    a  one}

选取列

通过dataframe产生的groupby对象,用单个字符串或字符串数组为列名进行索引,就可以实现选取列的目的

df.groupby(['key1','key2'])[['data2']].mean()   #只计算data2列的mean,并以dataframe形式显示        data2    key1    key2                                     #返回的是已分组的dataframe,或seriesa       one    0.045204        two    0.521359b       one    -1.718465        two    -0.030872

通过字典,series 进行分组

通过字典

people = pd.DataFrame(np.random.randn(5,5),                     columns = ['a','b','c','d','e'],                     index = ['A','B','C','D','E'])people.ix[2:4,['b','c']] = np.nan  #添加NA值people        a            b            c           d           eA    1.136587    -1.014446    0.516930    -1.032332    -0.638470B    0.971402    -2.668267    0.741624    1.000780    0.207016C    -0.816918    NaN           NaN      0.693395    0.432147D    0.056326    NaN             NaN    0.143980    0.103061E    -0.986889    0.881119    -0.034649    1.266522    0.240111mapping = {
'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}#建立字典联系by_columns = people.groupby(mapping, axis = 1) #将字典传给groupbyby_columns.sum() blue redA -0.515401 -0.516329B 1.742404 -1.489849C 0.693395 -0.384771D 0.143980 0.159386E 1.231873 0.134341

通过Series:

map_series = pd.Series(mapping) map_seriesa       redb       redc      blued      bluee       redf    orangedtype: objectpeople.groupby(map_series,axis = 1).count()    blue  redA    2    3B    2    3C    1    2D    1    2E    2    3

通过函数进行分组

people = pd.DataFrame(np.random.randn(5,5),                     columns = ['a','b','c','d','e'],                     index = ['Joe','Tony','Steve','Jim','Travis'])people.ix[2:4,['b','c']] = np.nan  #添加NA值people.groupby(len).sum()  #根据人名长度来对数据分组a    b    c    d    e3    2.551209    -0.095830    -0.388745    -1.382552    -0.6785474    -0.644126    1.371943    -0.316472    0.121421    -1.2082015    -0.557207    0.000000    0.000000    -0.102257    0.4384656    1.228693    0.056033    -0.472999    1.077108    -0.772639

将数组,列表,字典,函数  综合运用

key_list = ['one','one','two','three','two']people.groupby([len,key_list]).max()                a              b           c              d     e3    one    2.189451    -0.095830    -0.388745    -1.208221    -0.826994     three    0.361758    NaN    NaN    -0.174331    0.1484474    one    -0.644126    1.371943    -0.316472    0.121421    -1.2082015    two    -0.557207    NaN    NaN    -0.102257    0.4384656    two    1.228693    0.056033    -0.472999    1.077108    -0.772639

层次化索引分组

 

columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],                                    [1,3,5,1,3]], names = ['cty','tenor'])hier_df = pd.DataFrame(np.random.randn(4,5),columns = columns)hier_dfcty    US                                     JPtenor     1           3           5            1          30    -0.248862    -1.474661    -0.483425    0.217972    0.9570901    0.707428    -1.150710    0.101633    1.821193    2.0271302    0.752401    0.850991    1.068320    0.327417    -0.3318833    -0.032896    0.819291    -0.270272    1.750449    0.295412hier_df.groupby(level = 'cty',axis = 1).count()      #层次化索引聚合数据通过level传入级别编号或名称cty  JP    US0    2    31    2    32    2    33    2    3

数据聚合

聚合是指从数组产生标量值的数据转换过程,常见聚合运算包括:

count---计数,非NA值

sum---求和

mean--非NA值平均数

median---非NA值的算术中位数

std.var---分母为n-1的标准差和方差

min,max---非NA值最小,最大值

prod---非NA值的积

first,last---第一个和最后一个非NA值

quantile(0.9)---样本90%分位数,series方法

      当然我们 也可以自定义聚合函数,如:

df    key1    key2    data1    data20    a    one    -0.047866    0.3236671    a    two    0.400731    0.5213592    b    two    -0.516440    -0.0308723    b    one    1.204337    -1.7184654    a    one    0.192612    -0.233259def peak_to_peak(arr):            #自定义聚合函数    return arr.max()-arr.min()grouped.agg(peak_to_peak)         #传入groupby的agg即可        data1    data2key1        a    0.448598    0.754618b    1.720777    1.687592

面向列的多函数应用

tips = pd.read_csv('C:/Users/Administrator/Desktop/IPYTHON/tips.csv')tips['tip_pct'] = tips['tip']/tips['total_bill']tips[:5]#返回    total_bill    tip    sex    smoker    day    time    size    tip_pct0    16.99    1.01    Female    No    Sun    Dinner    2    0.0594471    10.34    1.66    Male    No    Sun    Dinner    3    0.1605422    21.01    3.50    Male    No    Sun    Dinner    3    0.1665873    23.68    3.31    Male    No    Sun    Dinner    2    0.1397804    24.59    3.61    Female    No    Sun    Dinner    4    0.146808grouped = tips.groupby(['sex','smoker'])  #根据sex,smoker 进行分组grouped_pct = grouped['tip_pct']        grouped_pct.agg('mean')                #可以将函数名以字符串形式传入#返回sex     smokerFemale  No        0.156921        Yes       0.182150Male    No        0.160669        Yes       0.152771grouped_pct.agg(['mean','std',peak_to_peak])   #传入一组函数时,直接以列表形式传入#返回                 mean    std    peak_to_peaksex    smoker            Female    No    0.156921    0.036421    0.195876         Yes    0.182150    0.071595    0.360233Male      No    0.160669    0.041849    0.220186         Yes    0.152771    0.090588    0.674707grouped_pct.agg([('foo','mean'),('bar',np.std)])   #自定义每一列的name,可以通过二元元祖形式完成#返回                       foo    barsex    smoker        Female    No    0.156921    0.036421         Yes    0.182150    0.071595Male     No    0.160669    0.041849        Yes    0.152771    0.090588func = ['count','mean','max']result = grouped['tip_pct','total_bill'].agg(func)#返回    tip_pct    total_bill                count    mean    max    count    mean    maxsex    smoker                        Female    No    54    0.156921    0.252672    54    18.105185    35.83         Yes    33    0.182150    0.416667    33    17.977879    44.30Male      No    97    0.160669    0.291990    97    19.791237    48.33         Yes    60    0.152771    0.710345    60    22.284500    50.81grouped.agg({
'tip':np.max,'size':'sum'}) #对不同列应用不同的函数,想agg传入一个从列名映射到函数的字典#返回 tip sizesex smoker Female No 5.2 140 Yes 6.5 74Male No 9.0 263 Yes 10.0 150grouped.agg({
'tip_pct':['min','max','mean','std'],'size':'sum'}) #对不同列应用不同的函数,想agg传入一个从列名映射到函数的字典#返回 tip_pct size min max mean std sumsex smoker Female No 0.056797 0.252672 0.156921 0.036421 140 Yes 0.056433 0.416667 0.182150 0.071595 74Male No 0.071804 0.291990 0.160669 0.041849 263 Yes 0.035638 0.710345 0.152771 0.090588 150

分组级运算和转换

transform 会将一个函数应用到各个分组,并且将结果放在合适的位置:

以之前的people数据为例:key = ['one','two','one','two','one']people.groupby(key).mean()a    b    c    d    eone    -0.142316    -0.065082    2.255068    1.211343    0.817666two    1.493163    -0.169286    0.671765    1.290332    0.093891当使用transform以后people.groupby(key).transform(np.mean)a    b    c    d    eJoe    -0.142316    -0.065082    2.255068    1.211343    0.817666Tony    1.493163    -0.169286    0.671765    1.290332    0.093891Steve    -0.142316    -0.065082    2.255068    1.211343    0.817666Jim    1.493163    -0.169286    0.671765    1.290332    0.093891Travis    -0.142316    -0.065082    2.255068    1.211343    0.817666

当我们想要从各组中减去平均值,可以创建一个距平化函数,然后传给transform:

def demean(arr):    return(arr)- arr.mean()demeaned = people.groupby(key).transform(demean)demeaned           a            b           c           d           eJoe    2.102734    -0.111083    0.242585    -0.334759    -0.008227Tony    0.339766    0.000000    0.000000    0.630333    -0.157031Steve    -1.583725    NaN        NaN        0.503897    -0.087889Jim    -0.339766      NaN        NaN        -0.630333    0.157031Travis  -0.519009    0.111083    -0.242585    -0.169138    0.096116

apply:会将待处理的对象拆分为多个片段,然后对各片段调用传入的函数,最后尝试将各片段组合到一起

以小费数据为例:

def top(df,n = 5,column = 'tip_pct'):  #定义  在指定列找出最大值,然后把这个值所在的行选出来   的函数    return df.sort_index(by = column)[-n:]   top(tips,n = 6)    #选出最高的6个tip_pct    total_bill    tip    sex    smoker    day    time    size    tip_pct109    14.31    4.00    Female    Yes    Sat    Dinner    2    0.279525183    23.17    6.50    Male    Yes    Sun    Dinner    4    0.280535232    11.61    3.39    Male    No    Sat    Dinner    2    0.29199067    3.07    1.00    Female    Yes    Sat    Dinner    1    0.325733178    9.60    4.00    Female    Yes    Sun    Dinner    2    0.416667172    7.25    5.15    Male    Yes    Sun    Dinner    2    0.710345tips.groupby('smoker').apply(top) #对smoker分组并调用apply    total_bill    tip    sex    smoker    day    time    size    tip_pctsmoker                                    No    88    24.71    5.85    Male    No    Thur    Lunch    2    0.236746185    20.69    5.00    Male    No    Sun    Dinner    5    0.24166351    10.29    2.60    Female    No    Sun    Dinner    2    0.252672149    7.51    2.00    Male    No    Thur    Lunch    2    0.266312232    11.61    3.39    Male    No    Sat    Dinner    2    0.291990Yes    109    14.31    4.00    Female    Yes    Sat    Dinner    2    0.279525183    23.17    6.50    Male    Yes    Sun    Dinner    4    0.28053567    3.07    1.00    Female    Yes    Sat    Dinner    1    0.325733178    9.60    4.00    Female    Yes    Sun    Dinner    2    0.416667172    7.25    5.15    Male    Yes    Sun    Dinner    2    0.710345tips.groupby(['smoker','day']).apply(top,n = 1,column = 'total_bill')    total_bill    tip    sex    smoker    day    time    size    tip_pctsmoker    day                                    No    Fri    94    22.75    3.25    Female    No    Fri    Dinner    2    0.142857Sat    212    48.33    9.00    Male    No    Sat    Dinner    4    0.186220Sun    156    48.17    5.00    Male    No    Sun    Dinner    6    0.103799Thur    142    41.19    5.00    Male    No    Thur    Lunch    5    0.121389Yes    Fri    95    40.17    4.73    Male    Yes    Fri    Dinner    4    0.117750Sat    170    50.81    10.00    Male    Yes    Sat    Dinner    3    0.196812Sun    182    45.35    3.50    Male    Yes    Sun    Dinner    3    0.077178Thur    197    43.11    5.00    Female    Yes    Thur    Lunch    4    0.115982

在传入groupby时 指定 group_keys = False,可以选择关闭分组键的效果。

分位数和桶分析:将pandas中的分位数和groupby结合起来,可以实现对数据集的桶分析,分位数分析:

frame = pd.DataFrame({
'data1':np.random.randn(1000), 'data2':np.random.randn(1000)})factor = pd.cut(frame.data1,4) #4是指将数据分成4个区间def get_stats(group): #定义函数对data2进行统计计算 return{
'min':group.min(),'max':group.max(),'count':group.count(), 'mean':group.mean()}grouped = frame.data2.groupby(factor) #这里用factor,可以保证data2的相关统计计算,是基于data1的数据划分区间grouped.apply(get_stats).unstack()返回 count max mean mindata1 (-3.196, -1.594] 55.0 2.567283 -0.322128 -2.235570(-1.594, 0.00267] 418.0 2.505320 -0.052012 -3.109679(0.00267, 1.599] 467.0 3.187789 0.069884 -3.000375(1.599, 3.195] 60.0 1.873327 0.022410 -1.934497grouping = pd.qcut(frame.data1,10,labels = False) #qcut能够利用样本分位数得到大小相同的区间,labels = False 即可grouped = frame.data2.groupby(grouping)grouped.apply(get_stats).unstack()返回 count max mean mindata1 0 100.0 2.567283 -0.184858 -2.2355701 100.0 2.235197 -0.025099 -2.2511062 100.0 2.011793 -0.096504 -2.1847173 100.0 1.953811 -0.120498 -2.9115694 100.0 3.124611 0.156681 -3.1096795 100.0 3.067191 0.075880 -2.8440216 100.0 2.328596 0.082201 -2.0688337 100.0 3.187789 0.105087 -2.3107128 100.0 2.624968 -0.047236 -3.0003759 100.0 2.735516 -0.000431 -2.182246

示例:用特定于分组的值填充缺失值

states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']group_key = ['East'] * 4 + ['West'] *4data = pd.Series(np.random.randn(8),index = states)data[['Vermont','Nevada','Idaho']] = np.nandatadata.groupby(group_key).mean() #求出平均值fill_mean = lambda g: g.fillna(g.mean()) #用分组平均值填充NA值data.groupby(group_key).apply(fill_mean)fill_values = {
'East':0.5,'West':-1} #预定义填充值,利用name属性填充fill_func = lambda g: g.fillna(fill_values[g.name])data.groupby(group_key).apply(fill_func)

透视表和交叉表

 

转载于:https://www.cnblogs.com/jx1111jx/p/9623391.html

你可能感兴趣的文章
hdu 4547(LCA)
查看>>
编写你自己的单点登录(SSO)服务
查看>>
Atitit.软件控件and仪表盘(23)--多媒体子系统--视频输出切换控制cvbs av s-video Ypbpr pal ntsc...
查看>>
《SQL Server企业级平台管理实践》读书笔记——SQL Server中收缩数据库不好用的原因...
查看>>
cs-JsonHelper
查看>>
仿糯米弹框效果demo
查看>>
char与byte的差别
查看>>
[ES6] 08. Destructuring Assignment -- 1
查看>>
[再寄小读者之数学篇](2014-11-26 正定矩阵乘积的特征值)
查看>>
Permutation Sequence
查看>>
C盘空间不足
查看>>
【LeetCode】101. Symmetric Tree (2 solutions)
查看>>
java属性文件读取,属性修改
查看>>
vim 显示tab键
查看>>
【转】蓝牙协议分析(1)_基本概念
查看>>
基于Linux的视频传输系统(上大学时參加的一个大赛的论文)
查看>>
检测到有潜在危险的Request.Form值
查看>>
jQuery Mobile方向感应事件
查看>>
linux 进程(一)---基本概念
查看>>
jsp 相关特性
查看>>