I am working with a dataset of taxi rides in the city of Chicago. Each trip record includes fields such as Taxi ID, timestamp, and fare. A sample from df.head(), with the timestamps as pandas datetimes, is shown below:
                                                                Taxi ID  Pickup Community Area  Trip Total
Trip End Timestamp
2016-04-25 18:00:00  bfec2c659fcbed3b508dc6caaf6f7ef39e17801e8ca0cc...                    8.0       16.00
2016-11-16 15:45:00  f45c4bfa9d6c445eb03ab69093f7ec1e9cdecef83cb2ce...                   32.0        9.75
2016-01-15 17:45:00  ecfb6f2cdce5d4c4e80218f58070ae719060ee47e648f4...                   32.0       14.75
2016-01-20 21:30:00  7d8179131ea9952793af4cda8635e94b56c2b92d3c376c...                   32.0        7.25
2016-02-11 09:15:00  d5c4fbae1c0c510364404a90fd477b19f7f03408ce40ff...                   35.0       18.75
Assuming I have data for one week, I want to achieve this:
This can be extended to hours in a week, weeks in a year etc.
As of now, I am able to use pivot_table to obtain, say, the median revenue per hour of the day over the whole timeframe (but this does not cover every hour in the week), or resample to obtain the median fare over time (but this is the median fare per trip, not the 'hourly revenue' of the taxis). Neither achieves exactly what I want. How should I implement this?
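For reference, the two attempts described above might look roughly like the sketch below. The exact calls are not shown in the question, so the toy data, the helper column "hour", and the variable names are assumptions made for illustration, as is keeping the timestamp in a regular column.

```python
import pandas as pd

# Made-up trips for illustration: two taxis, two days.
df = pd.DataFrame({
    "Taxi ID": ["a", "a", "b", "a"],
    "Trip End Timestamp": pd.to_datetime([
        "2016-01-15 17:05", "2016-01-15 17:40",
        "2016-01-15 17:20", "2016-01-16 17:10",
    ]),
    "Trip Total": [10.0, 5.0, 8.0, 20.0],
})

# Attempt 1: pivot_table on the hour of day. All days are pooled together,
# so there is one value per hour of day, not one per hour of the week.
df["hour"] = df["Trip End Timestamp"].dt.hour  # hypothetical helper column
per_hour_of_day = df.pivot_table(index="hour", values="Trip Total", aggfunc="median")

# Attempt 2: resample by hour. This is the median fare of individual trips
# in each hour, not the median of each taxi's summed hourly revenue.
median_fare = (
    df.set_index("Trip End Timestamp")["Trip Total"]
    .resample("h")
    .median()
)
```

In this toy data, taxi "a" earns 15.0 and taxi "b" earns 8.0 in the first hour, so the median hourly revenue per taxi would be 11.5, while the resampled median fare for that hour is 8.0.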
I was able to achieve this for my data by combining df.groupby with the Grouper feature in pandas. The data is first grouped by taxi ID; within each ID, the trips are binned per timeframe (hourly, for example), and the fares in each bin are summed, all in a single command:
    dfGrouped = (
        df.groupby(['Taxi ID', pd.Grouper(freq='H', key='Trip End Timestamp')])['Trip Total']
          .sum()
          .unstack(fill_value=0)
    )
Because of the final unstack operation, dfGrouped is now a 2D table of Taxi ID (rows) vs. hour (columns), with 0 for any hour in which a taxi recorded no trips. The median over all taxi IDs (rows) was simply obtained by: