December 1, 2011

The aggregate function in Python

One of the things I love most about R is the aggregate function, which (in a nutshell) groups a data frame by one or more columns and applies some operation to the values in each group. If, for example, you have a data frame with a treatment and a response, and want to calculate the mean and standard deviation of the response for each level of treatment, then aggregate is usually the way to go.

I wanted to do a quick plot with error bars using PyX, and I thought that having a Python version of aggregate would be really nice. Here is the complete code to do so:

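import numpy as np
import scipy as sp

def MSD(vec):
    # Mean and standard deviation of a vector
    return [np.mean(vec), np.std(vec)]

def aggregate(df, by=0, to=1, func=np.sum):
    # df: 2-D array; by: index of the grouping column;
    # to: index of the column to summarize
    Dat = []
    ColBy = df.T[by]
    ColTo = df.T[to]
    UniqueBy = np.unique(ColBy)  # np.unique already returns sorted values
    for ub in UniqueBy:
        uTo = ColTo[ColBy == ub]
        # atleast_1d lets func return a scalar (np.sum) or a list (MSD)
        Out = np.atleast_1d(func(uTo))
        Dat.append(np.concatenate(([ub], Out)))
    return Dat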
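To see what aggregate returns, here is a small usage sketch. The data are made up for illustration (two treatment levels in column 0, a response in column 1), and the two functions are repeated in compact form only so that the snippet runs on its own; the np.atleast_1d call is there so that a func returning a single number, such as the default np.sum, also works.

```python
import numpy as np

def MSD(vec):
    # Mean and standard deviation of a vector
    return [np.mean(vec), np.std(vec)]

def aggregate(df, by=0, to=1, func=np.sum):
    # Compact restatement of the function from the post
    ColBy = df.T[by]
    ColTo = df.T[to]
    Dat = []
    for ub in np.unique(ColBy):
        Out = np.atleast_1d(func(ColTo[ColBy == ub]))
        Dat.append(np.concatenate(([ub], Out)))
    return Dat

# Hypothetical data: column 0 is the treatment level, column 1 the response
data = np.array([
    [1.0, 2.0],
    [1.0, 4.0],
    [2.0, 10.0],
    [2.0, 14.0],
    [2.0, 12.0],
])

# One row per treatment level: [level, mean, standard deviation]
summary = np.array(aggregate(data, by=0, to=1, func=MSD))
print(summary)

# With the default func, each row is [level, sum of responses]
totals = np.array(aggregate(data))
print(totals)
```

Note that np.std computes the population standard deviation by default (ddof=0), whereas R's sd divides by n - 1. To match R, one option is to use np.std(vec, ddof=1) inside MSD.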

g.writePDFfile("errorbar")