Python aggregation on time series

Nofal Daud

I have a dataframe df like this

project_ID country   prj_start  prj_end  revenue   profit
 2131      USA       201603     201703   100000     30000
 5124      UK        201502     201606   1500       1000 
 1245      UK        201010     201710   1800       1000

I want to find the number of active projects per month and country, and to sum their revenue and profit. The output would look like this:

Month   country  active_projects  revenue  profit
201603  USA                   15   500000  100000
201603  UK                    20   150000  100000
201604  Germany               30  1000000  500000
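
In pandas terms I believe this amounts to one grouped aggregation over a long table with one row per active (project, month) pair. A minimal sketch of just that final step, assuming such a table (here called df_long, a hypothetical name) already existed; the named aggregation needs pandas 0.25 or later:

out = (df_long.groupby(['Month', 'country'])
              .agg(active_projects=('project_ID', 'nunique'),
                   revenue=('revenue', 'sum'),
                   profit=('profit', 'sum'))
              .reset_index())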

My first programming language is C++, so I tend to do things with loops. I almost succeeded with a solution where I created the month slots like this:

import datetime
import pandas as pd

# month slots from the earliest project start up to today (YYYYMM integers),
# with a count column to hold the number of active projects
monthlist = pd.DataFrame(columns=["months", "count"])
monthlist['months'] = (pd.date_range(start=pd.to_datetime(str(df['prj_start'].min()), format="%Y%m"),
                                     end=datetime.date.today(), freq='M')
                         .map(lambda x: 100 * x.year + x.month))
monthlist['count'] = 0

# a new dataframe to insert the results into
newdf = pd.DataFrame(columns=["month", "country", "active_prj_count", "rev", "gp"])

# traverse the original dataframe and monthlist, inserting a row into newdf
# every time a project's start is <= the month slot and its end is >= it
# (parentheses matter here: a bare & binds tighter than the comparisons)
i = 0
for y in range(len(df)):
    for x in range(len(monthlist)):
        if (df.loc[y, 'prj_start'] <= monthlist.loc[x, 'months']) and \
           (df.loc[y, 'prj_end'] >= monthlist.loc[x, 'months']):
            monthlist.loc[x, 'count'] = monthlist.loc[x, 'count'] + 1
            newdf.loc[i] = [monthlist.loc[x, 'months'], df.loc[y, 'country'],
                            monthlist.loc[x, 'count'], df.loc[y, 'revenue'],
                            df.loc[y, 'profit']]
            i = i + 1
This solution works, but I have to admit it is not very smart and it is computationally inefficient; it takes a while to process. Does anyone have ideas for improving the code, probably using pandas or numpy functions?
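
My guess is the nested loops are the main cost. Something like the broadcast comparison below should at least vectorise the per-month membership count; it is only a sketch, reusing df and monthlist from above, and it is not yet split by country:

import numpy as np

starts = df['prj_start'].values
ends = df['prj_end'].values
months = monthlist['months'].values

# active[y, x] is True when project y is running in month slot x
active = (starts[:, None] <= months) & (ends[:, None] >= months)
monthlist['count'] = active.sum(axis=0)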

Tags: python, python-3.x, python-2.7, pandas, numpy

Answers

Answer #1: Stev (answered 3 months ago)

OK, what about something like this? (How you calculate monthly profit is up to you; this is just an example.)

import pandas as pd

d = {'projectid': [2131, 5124, 1245], 'country': ['USA', 'UK', 'UK'],
     'pr_start': ['2016-03', '2015-02', '2010-10'], 'pr_end': ['2017-03', '2016-06', '2017-10'],
     'total_revenue': [100000, 1500, 1800], 'total_profit': [30000, 1000, 1000]}
df = pd.DataFrame(data=d)

df['pr_start'] = pd.to_datetime(df['pr_start'])
df['pr_end'] = pd.to_datetime(df['pr_end'])
# project length in whole months, as a plain integer so it can divide the totals
df['project_length'] = ((df['pr_end'].dt.year - df['pr_start'].dt.year) * 12
                        + (df['pr_end'].dt.month - df['pr_start'].dt.month))
df['monthly_revenue'] = df['total_revenue'] / df['project_length']
df['monthly_profit'] = df['total_profit'] / df['project_length']

# split each multi-month project into one row per active month
for (idx, row) in df.iterrows():
    if row.project_length > 1:
        df.loc[idx, 'pr_end'] = df.loc[idx, 'pr_start'] + pd.DateOffset(months=1)
        for i in range(1, row.project_length):
            df2 = pd.DataFrame([row])
            df2['pr_start'] = row.pr_start + pd.DateOffset(months=i)
            df2['pr_end'] = row.pr_start + pd.DateOffset(months=i + 1)
            df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0

df = df.sort_values(by='pr_start').sort_index(kind='mergesort')

print(df.groupby(['pr_start', 'country'])
        .agg({'projectid': 'count', 'monthly_revenue': 'sum', 'monthly_profit': 'sum'})
        .rename(columns={'projectid': 'Active Projects'}))
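
If the frame is large, growing df inside the loop will be slow. The same monthly split can be sketched without the loop using period_range and explode; explode and the named aggregation below need pandas 0.25 or later, and month and n_months are helper names I made up:

df = pd.DataFrame(data=d)  # rebuild from the same raw data as above
df['pr_start'] = pd.to_datetime(df['pr_start'])
df['pr_end'] = pd.to_datetime(df['pr_end'])

# one period per active month, excluding the end month (same convention as the loop)
df['month'] = [list(pd.period_range(s, e, freq='M')[:-1])
               for s, e in zip(df['pr_start'], df['pr_end'])]
df['n_months'] = df['month'].str.len()
df['monthly_revenue'] = df['total_revenue'] / df['n_months']
df['monthly_profit'] = df['total_profit'] / df['n_months']

print(df.explode('month')
        .groupby(['month', 'country'])
        .agg(active_projects=('projectid', 'size'),
             monthly_revenue=('monthly_revenue', 'sum'),
             monthly_profit=('monthly_profit', 'sum')))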

Answer #2: Mabel Villalba (answered 3 months ago)

You could apply a function to each row to expand it into the dates where the project is active, and then aggregate by month and country.

>>> df 

   project_ID country  prj_start  prj_end  revenue  profit
0        2131     USA     201603   201703   100000   30000
1        5124      UK     201502   201606     1500    1000
2        1245      UK     201010   201710     1800    1000  

Let's add some more samples to have different countries per month:

>>> df_new = pd.DataFrame([
                [1111, 'Germany',201603, 201703,1000, 4000],
                [4111, 'Germany',201603, 201703,4000, 6000],
                [3112, 'Germany',201010, 201703,4000, 6000],
                [2112, 'Germany',201603, 201703,4000, 6000],
                [2116, 'Germany',201502, 201710,4000, 6000]],
                columns=df.columns)

>>> df_new

   project_ID  country  prj_start  prj_end  revenue  profit
0        1111  Germany     201603   201703     1000    4000
1        4111  Germany     201603   201703     4000    6000
2        3112  Germany     201010   201703     4000    6000
3        2112  Germany     201603   201703     4000    6000
4        2116  Germany     201502   201710     4000    6000

>>> df_ = pd.concat([df,df_new],axis=0,ignore_index=True)

   project_ID  country  prj_start  prj_end  revenue  profit
0        2131      USA     201603   201703   100000   30000
1        5124       UK     201502   201606     1500    1000
2        1245       UK     201010   201710     1800    1000
3        1111  Germany     201603   201703     1000    4000
4        4111  Germany     201603   201703     4000    6000
5        3112  Germany     201010   201703     4000    6000
6        2112  Germany     201603   201703     4000    6000
7        2116  Germany     201502   201710     4000    6000

Transform prj_start and prj_end to datetime, passing format="%Y%m" so they parse correctly:

>>> df_[['prj_start','prj_end']] =  df_[['prj_start','prj_end']].apply(pd.to_datetime, format="%Y%m")

>>> df_ 

   project_ID  country  prj_start    prj_end  revenue  profit
0        2131      USA 2016-03-01 2017-03-01   100000   30000
1        5124       UK 2015-02-01 2016-06-01     1500    1000
2        1245       UK 2010-10-01 2017-10-01     1800    1000
3        1111  Germany 2016-03-01 2017-03-01     1000    4000
4        4111  Germany 2016-03-01 2017-03-01     4000    6000
5        3112  Germany 2010-10-01 2017-03-01     4000    6000
6        2112  Germany 2016-03-01 2017-03-01     4000    6000
7        2116  Germany 2015-02-01 2017-10-01     4000    6000

Now let's define a function to transform the rows and apply it:

import numpy as np  # np is first used here

def transform_row(row):
    # one timestamp per month the project is active (month starts, inclusive)
    date_index = pd.date_range(row['prj_start'].min(),
                               row['prj_end'].max(), freq='MS')

    # repeat the single row once per active month
    row_out = pd.DataFrame(np.repeat(row.values, len(date_index), axis=0),
                           index=date_index, columns=row.columns)
    row_out.index.name = 'date'
    return row_out.reset_index()

df_transformed = pd.concat([transform_row(row.to_frame().T)
                            for i, row in df_.iterrows()], axis=0)

Then apply pivot_table to aggregate the values by date and country:

df1 = pd.pivot_table(df_transformed, 
                     index=['date','country'],
                     values=['revenue','profit'],
                     aggfunc=np.sum,fill_value=0)

df2 = pd.pivot_table(df_transformed,
                     index=['date','country'],
                     values=['project_ID'],
                     aggfunc=len,fill_value=0)

Finally, concatenate the two dataframes to obtain the data by month:

pd.concat([df1,df2],axis=1)

                    profit  revenue  project_ID
date       country                             
2010-10-01 Germany    6000     4000           1
           UK         1000     1800           1
2010-11-01 Germany    6000     4000           1
           UK         1000     1800           1
2010-12-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-01-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-02-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-03-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-04-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-05-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-06-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-07-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-08-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-09-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-10-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-11-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-12-01 Germany    6000     4000           1
           UK         1000     1800           1
...                    ...      ...         ...
2016-10-01 USA       30000   100000           1
2016-11-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2016-12-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-01-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-02-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-03-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-04-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-05-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-06-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-07-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-08-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-09-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-10-01 Germany    6000     4000           1
           UK         1000     1800           1
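
As an aside, the two pivot_table calls and the concat can also be collapsed into a single pass with groupby. This is a sketch assuming pandas 0.25 or later for the named aggregation; infer_objects() restores the numeric dtypes that np.repeat turned into object:

out = (df_transformed.infer_objects()
         .groupby(['date', 'country'])
         .agg(profit=('profit', 'sum'),
              revenue=('revenue', 'sum'),
              project_ID=('project_ID', 'size')))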
