== Background ==
Below is the input file format (*.txt):
| userID | month | date | hour | totalTW | totalQs | result |
|---|---|---|---|---|---|---|
| 21535110 | 05 | 01 | 02 | 3 | 2 | 1 |
| 21535110 | 05 | 01 | 03 | 3 | 2 | 1 |
| 21535110 | 05 | 01 | 06 | 1 | 0 | 0 |
| 21535110 | 05 | 02 | 02 | 1 | 0 | 0 |
| 21535110 | 05 | 03 | 05 | 3 | 2 | 0 |
| 21535112 | 05 | 01 | 05 | 1 | 1 | 1 |
Each file has about 28,000,000 lines, and I have 6 files of this kind.
== Objective ==
Write a script that processes the input data so that, for each user, the values (totalTW, totalQs, result) are summed over all rows that share the same month, the same day of the week, and the same hour.
For example, given these lines:
| userID | month | date | hour | totalTW | totalQs | result |
|---|---|---|---|---|---|---|
| 21535110 | 05 | 01 | 02 | 3 | 2 | 1 |
| 21535110 | 05 | 08 | 02 | 2 | 1 | 0 |
these two data points fall on the same day of the week, so they should be summed to:
| userID | month | day | hour | totalTW | totalQs | result |
|---|---|---|---|---|---|---|
| 21535110 | 05 | Tue | 02 | 5 | 3 | 1 |
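The aggregation described above could be sketched as a single streaming pass that keeps one running total per (userID, month, weekday, hour) key. This is only an illustration under stated assumptions, not the week.py script from the gist: the function name `aggregate` is hypothetical, the rows are assumed to be pipe-delimited exactly as shown in the tables, and the year is an assumption because the input has no year column (2012 is used only because it makes 05/01 a Tuesday, as in the example).

```python
import datetime
from collections import defaultdict

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def aggregate(lines, year=2012):
    """Sum totalTW, totalQs, result per (userID, month, weekday, hour).

    `year` is an assumption: the input carries no year column.
    """
    weekday_cache = {}                       # (month, date) -> "Tue", computed once
    totals = defaultdict(lambda: [0, 0, 0])  # key -> [totalTW, totalQs, result]

    for line in lines:
        # Assumed layout: | userID | month | date | hour | totalTW | totalQs | result |
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) != 7 or not fields[0].isdigit():
            continue  # skip header, separator, and malformed rows

        user, month, date, hour = fields[:4]
        day = weekday_cache.get((month, date))
        if day is None:
            day = DAY_NAMES[datetime.date(year, int(month), int(date)).weekday()]
            weekday_cache[(month, date)] = day

        acc = totals[(user, month, day, hour)]
        acc[0] += int(fields[4])
        acc[1] += int(fields[5])
        acc[2] += int(fields[6])

    return dict(totals)


sample = [
    "| userID | month | date | hour | totalTW | totalQs | result |",
    "|---|---|---|---|---|---|---|",
    "| 21535110 | 05 | 01 | 02 | 3 | 2 | 1 |",
    "| 21535110 | 05 | 08 | 02 | 2 | 1 | 0 |",
]
summed = aggregate(sample)
print(summed[("21535110", "05", "Tue", "02")])  # [5, 3, 1]
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the file is read once and never loaded into memory as a whole. Only the per-key totals are held in memory.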
== Problem ==
The week.py script I added to this gist works, but it seems far too slow.
I have been running it on a lab server for ~20 hours, and it has only reached about line 2,300,000 of the 28,000,000 in one file (roughly 115,000 lines per hour).
Is there any way to optimize this script?