In our last post we walked through the lag function so that for our defined group we have a column that contains a ‘lagged’ value. We can think about this operating much like LAG in SQL. But now what? We have an extra column that we wouldn’t have needed if we were using Excel. Well, we aren’t trying to mirror Excel, we are trying to mimic the behavior. So, the next step would be to apply a function to the dataframe using pandas .apply(). But, we can’t do that without a helper function defining what we need to do.
I’m assuming if you are reading my blog, you understand how to define a function in python, so our walkthrough there will be very, very brief.
Let’s say that in the dataframe we used before, we want to see a month-over-month change in subscription amount. This would let us track revenue growth as well as customer level growth.
To do that we can use a function like the below. One thing to note here is that you MUST `return` the result of the function or the apply will not work the way we want it to.
def growth( last_month, this_month):
diff = this_month - last_month
return diff
Now we can get pretty fancy in our function and the apply will still work. But, for the purpose of illustrating we will keep it simple.
Well if we have this great new metric we want to track (and maybe pivot on it later!) we should store it in the dataframe. So what does that mean? You guessed it: a new series! We’ll name our new column ‘monthly_growth’.
df[‘monthly_growth’] = pd.Series()
And now we have this:
Index | Activity Date | Customer Name | Subscription Amount | subscription_lag | monthly_growth |
2 | 2020-02 | Risk & Data | 75 | N/A | N/A |
3 | 2020-03 | Risk & Data | 125 | 75 | N/A |
1 | 2020-02 | XYZ, Co. | 100 | N/A | N/A |
The next part utilizes a lambda function in order to apply the helper and populate the new series all in one line.
df[‘monthly_growth’] = df.apply(lambda x:
growth(x[‘subscription_lag’],x[‘monthly_growth’]))
Resulting in this!
Index | Activity Date | Customer Name | Subscription Amount | subscription_lag | monthly_growth |
2 | 2020-02 | Risk & Data | 75 | N/A | N/A |
3 | 2020-03 | Risk & Data | 125 | 75 | 50 |
1 | 2020-02 | XYZ, Co. | 100 | N/A | N/A |
Great! We have the change in subscription amount month over month in that column. Setting things up this way also allows us to see people who decreased their subscription amount, ie negative growth. So there you have it. The offset functionality in Excel can be mimicked in pandas making it much easier to work within a dataframe, especially for large values.
Happy Modelling!