Using grouping functions in Python: Python grouping functions (Part 2)


With multiple keys, the first element of each tuple is itself a tuple of key values:
>>> for (k1, k2), group in df.groupby(['key1', 'key2']):
...     print(k1, k2)
...     print(group)
...
a one
      data1     data2 key1 key2
0 -0.410673  0.519378    a  one
4 -1.017495 -0.530459    a  one
a two
      data1     data2 key1 key2
1 -2.120793  0.199074    a  two
b one
      data1     data2 key1 key2
2  0.642216 -0.143671    b  one
b two
      data1     data2 key1 key2
3  0.975133 -0.592994    b  two
Of course, you can do whatever you like with these pieces of data. One operation you may find useful is collecting the pieces into a dict:
>>> pieces = dict(list(df.groupby('key1')))
>>> pieces['b']
      data1     data2 key1 key2
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two
>>> df.groupby('key1')
<pandas.core.groupby.DataFrameGroupBy object at 0x0413AE30>
>>> list(df.groupby('key1'))
[('a',       data1     data2 key1 key2
0 -0.410673  0.519378    a  one
1 -2.120793  0.199074    a  two
4 -1.017495 -0.530459    a  one), ('b',       data1     data2 key1 key2
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two)]
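The dict trick works because iterating a GroupBy object yields (key, sub-DataFrame) pairs. The following is a minimal, self-contained sketch of that idea; it assumes df holds the values printed in the session above (the original data was random), and the name b_rows is illustrative only.

```python
import pandas as pd

# Assumed data: the values printed in the session above.
df = pd.DataFrame({
    'data1': [-0.410673, -2.120793, 0.642216, 0.975133, -1.017495],
    'data2': [0.519378, 0.199074, -0.143671, -0.592994, -0.530459],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

# Iterating a GroupBy yields (key, sub-DataFrame) pairs,
# so the pairs can be fed straight into dict().
pieces = dict(list(df.groupby('key1')))
print(pieces['b'])

# If only one group is needed, get_group avoids building every piece.
b_rows = df.groupby('key1').get_group('b')
print(b_rows)
```

Building the full dict copies every group, so for a large frame get_group is probably the lighter choice when a single key is all you need.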
By default, groupby groups on axis=0, but it can be told to group on any other axis. Taking the df from the example above, we can group its columns by dtype:
>>> df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
>>> grouped = df.groupby(df.dtypes, axis=1)
>>> dict(list(grouped))
{dtype('O'):   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one, dtype('float64'):       data1     data2
0 -0.410673  0.519378
1 -2.120793  0.199074
2  0.642216 -0.143671
3  0.975133 -0.592994
4 -1.017495 -0.530459}
>>> grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x041288F0>
>>> list(grouped)
[(dtype('float64'),       data1     data2
0 -0.410673  0.519378
1 -2.120793  0.199074
2  0.642216 -0.143671
3  0.975133 -0.592994
4 -1.017495 -0.530459), (dtype('O'),   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one)]
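Recent pandas releases deprecate the axis argument of groupby, so the axis=1 call above may warn or fail depending on the installed version. Below is a minimal sketch of one way to build the same dtype-keyed pieces without it. It assumes df holds the values printed earlier; the names cols_by_dtype and pieces_by_dtype are illustrative and do not come from the original.

```python
import pandas as pd

# Assumed data: the values printed in the earlier session.
df = pd.DataFrame({
    'data1': [-0.410673, -2.120793, 0.642216, 0.975133, -1.017495],
    'data2': [0.519378, 0.199074, -0.143671, -0.592994, -0.530459],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

# Collect the column names that share each dtype.
cols_by_dtype = {}
for col, dtype in df.dtypes.items():
    cols_by_dtype.setdefault(dtype, []).append(col)

# Slice the frame once per dtype to get one sub-DataFrame per group.
pieces_by_dtype = {dtype: df[cols] for dtype, cols in cols_by_dtype.items()}

for dtype, piece in pieces_by_dtype.items():
    print(dtype, list(piece.columns))
```

The dtype reported for the string columns depends on the pandas version, so the sketch looks keys up through df.dtypes rather than hard-coding dtype('O').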
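5. Selecting a column or a subset of columns
Indexing a GroupBy object created from a DataFrame with a column name (a single string) or a set of column names (an array of strings) selects just those columns for aggregation. In other words, this: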
>>> df.groupby('key1')['data1']
<pandas.core.groupby.SeriesGroupBy object at 0x06615FD0>
>>> df.groupby('key1')['data2']
<pandas.core.groupby.SeriesGroupBy object at 0x06615CB0>
>>> df.groupby('key1')[['data2']]
<pandas.core.groupby.DataFrameGroupBy object at 0x06615F10>
is equivalent to the following code:
>>> df['data1'].groupby([df['key1']])
<pandas.core.groupby.SeriesGroupBy object at 0x06615FD0>
>>> df[['data2']].groupby([df['key1']])
<pandas.core.groupby.DataFrameGroupBy object at 0x06615F10>
>>> df['data2'].groupby([df['key1']])
<pandas.core.groupby.SeriesGroupBy object at 0x06615E30>
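The object reprs alone do not show that the two spellings agree, so here is a small sketch that aggregates both ways and compares the results. It assumes df holds the values printed earlier; the variable names via_index and via_column are illustrative.

```python
import pandas as pd

# Assumed data: the values printed in the earlier session.
df = pd.DataFrame({
    'data1': [-0.410673, -2.120793, 0.642216, 0.975133, -1.017495],
    'data2': [0.519378, 0.199074, -0.143671, -0.592994, -0.530459],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

# Spelling 1: group the frame, then pick the column from the GroupBy.
via_index = df.groupby('key1')['data1'].mean()

# Spelling 2: pick the column first, then group it by the key Series.
via_column = df['data1'].groupby(df['key1']).mean()

# Both spellings should produce the same per-group means.
print(via_index.equals(via_column))

# The same holds for the list form, which keeps a DataFrame shape.
frame_via_index = df.groupby('key1')[['data2']].mean()
frame_via_column = df[['data2']].groupby(df['key1']).mean()
print(frame_via_index.equals(frame_via_column))
```

The first spelling is usually easier to read, since the grouping key and the selected column both appear as plain names.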
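Especially with large datasets, you will often want to aggregate only some of the columns. For example, to compute the mean of just the data2 column in the dataset above and get the result back as a DataFrame, write: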
>>> df.groupby(['key1', 'key2'])[['data2']].mean()
              data2
key1 key2
a    one  -0.005540
     two   0.199074
b    one  -0.143671
     two  -0.592994
>>> df.groupby(['key1', 'key2'])['data2'].mean()
key1  key2
a     one    -0.005540
      two     0.199074
b     one    -0.143671
      two    -0.592994
Name: data2, dtype: float64
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if a single column name is passed as a scalar:
>>> s_grouped = df.groupby(['key1', 'key2'])['data2']
>>> s_grouped
<pandas.core.groupby.SeriesGroupBy object at 0x06615B10>
>>> s_grouped.mean()
key1  key2
a     one    -0.005540
      two     0.199074
b     one    -0.143671
      two    -0.592994
Name: data2, dtype: float64
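The scalar-versus-list distinction carries through to the aggregated result. The sketch below checks the result types and confirms that both forms hold the same numbers. It assumes df holds the values printed earlier; the names series_result and frame_result are illustrative.

```python
import pandas as pd

# Assumed data: the values printed in the earlier session.
df = pd.DataFrame({
    'data1': [-0.410673, -2.120793, 0.642216, 0.975133, -1.017495],
    'data2': [0.519378, 0.199074, -0.143671, -0.592994, -0.530459],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

grouped = df.groupby(['key1', 'key2'])

# A scalar column name gives a grouped Series, so mean() returns a Series.
series_result = grouped['data2'].mean()

# A list of column names gives a grouped DataFrame, so mean() returns a DataFrame.
frame_result = grouped[['data2']].mean()

print(type(series_result).__name__)
print(type(frame_result).__name__)

# The single column of the DataFrame result matches the Series result.
print(frame_result['data2'].equals(series_result))
```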
6. Grouping with dicts and Series
Grouping information can come in forms other than arrays. Consider this example DataFrame:
>>> people = pd.DataFrame(np.random.randn(5, 5),
...                       columns=['a', 'b', 'c', 'd', 'e'],
...                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
... )
>>> people
               a         b         c         d         e
Joe     0.306336 -0.139431  0.210028 -1.489001 -0.172998
Steve   0.998335  0.494229  0.337624 -1.222726 -0.402655
Wes     1.415329  0.450839 -1.052199  0.731721  0.317225
Jim     0.550551  3.201369  0.669713  0.725751  0.577687
Travis -2.013278 -2.010304  0.117713 -0.545000 -1.228323
>>> people.iloc[2:3, [1, 2]] = np.nan
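As a rough sketch of where this section is heading, the example below groups the columns of people with a dict and then with a Series. The mapping from column names to color labels is entirely hypothetical and is not part of the original text, and the frame is rebuilt from the printed values because the original data was random. The sketch transposes before grouping so that it does not depend on the axis argument of groupby.

```python
import numpy as np
import pandas as pd

# Assumed data: the values printed in the session above.
people = pd.DataFrame(
    [[0.306336, -0.139431, 0.210028, -1.489001, -0.172998],
     [0.998335, 0.494229, 0.337624, -1.222726, -0.402655],
     [1.415329, 0.450839, -1.052199, 0.731721, 0.317225],
     [0.550551, 3.201369, 0.669713, 0.725751, 0.577687],
     [-2.013278, -2.010304, 0.117713, -0.545000, -1.228323]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)
people.iloc[2:3, [1, 2]] = np.nan  # blank out Wes's 'b' and 'c' values

# Hypothetical mapping from column name to group label (an assumption).
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}

# Transpose so the columns become rows, group those rows through the dict,
# sum within each label, then transpose back.
by_color = people.T.groupby(mapping).sum().T
print(by_color)

# A Series indexed by column name can stand in for the dict.
by_color_series = people.T.groupby(pd.Series(mapping)).sum().T
print(by_color_series)
```

Missing values are skipped by sum here, so Wes's 'red' total comes only from columns 'a' and 'e'.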
