Much like David Lynch's surreal narratives, machine learning is deep, complex, and not easily understood at first glance. And just as his characters follow labyrinthine paths through his stories, so did I through the world of ML.
Our task was ‘simple’: predict the global sales of a video game given its EU sales, console, release date and so on. Diving straight in, I immediately began working away at the data in Kaggle. However, problems soon arose. Our dataset contained over 1,500 individual developers across 15,000 different games. Although only a beginner, I recognised that encoding a categorical variable with that many levels would only add noise when it came to generalising our model, so I dropped it, along with most of the other qualitative variables.
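A minimal sketch of that clean-up step, using a toy frame in place of the real Kaggle data (the column names here are assumptions, not the dataset's actual schema):

```python
import pandas as pd

# Hypothetical stand-in for the video game sales data.
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Developer": ["Dev1", "Dev2", "Dev3"],
    "EU_Sales": [1.2, 0.4, 2.5],
    "Global_Sales": [3.1, 0.9, 6.0],
})

# With ~1,500 unique developers over ~15,000 games, one-hot encoding
# Developer would explode the feature space, so drop the free-text
# columns and keep only the numeric predictors.
high_cardinality = ["Name", "Developer"]
features = games.drop(columns=high_cardinality)
```

After this, `features` holds only the quantitative columns that the regression models below can use directly.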
Although I had not tested whether linear regression would be a suitable model for our data, I ploughed on ahead, because my hours of googling yielded no better alternative that I actually understood. I would have liked to check whether the assumptions of linear regression held in our data, but my time was consumed by cleaning it, a process filled with roadblocks and seemingly meaningless tangents. While our predictors showed little multicollinearity, a linear relationship really only existed between EU sales and global sales. I believe sticking with linear regression handicapped me and prevented me from improving on the benchmark by more than a couple of per cent.
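Checking those assumptions is less work than I feared at the time. Here is a sketch on synthetic data (the variables only mimic the shape of ours): a quick correlation check between predictors, then a plain sklearn fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: global sales roughly linear in EU sales, plus noise.
rng = np.random.default_rng(0)
eu_sales = rng.uniform(0, 5, size=200)
year = rng.integers(1990, 2017, size=200).astype(float)
global_sales = 2.5 * eu_sales + rng.normal(0, 0.3, size=200)

X = np.column_stack([eu_sales, year])

# Crude multicollinearity check: pairwise correlation between predictors.
predictor_corr = np.corrcoef(X, rowvar=False)[0, 1]

model = LinearRegression().fit(X, global_sales)
r2 = model.score(X, global_sales)
```

A low `predictor_corr` suggests little multicollinearity; a high `r2` only tells you the fit works on this data, which is why the cross-validation discussed later matters.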
I attempted to use Ridge regression, a method designed for highly correlated independent variables; had I known that before writing this Medium article, I would have saved myself 30 minutes of flailing in Kaggle's notebook. I also tried Lasso regression, but since that is just another regularised form of linear regression, my hopeless attempts at throwing whatever regression models sklearn had to offer at the problem were again unfruitful.
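For the record, here is the situation Ridge is actually built for, sketched on synthetic data rather than my real notebook: two nearly identical predictors, which Ridge handles by shrinking their coefficients, while Lasso tends to zero some coefficients out entirely.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# Make the first two predictors almost perfectly correlated.
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=300)
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Ridge spreads weight across the correlated pair instead of blowing up.
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso's L1 penalty drives some coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
```

On our actual dataset, where the predictors were not strongly correlated, there was little for either penalty to fix, which is presumably why they performed no better than plain linear regression.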
My last-ditch attempt at improving my score was a neural network from sklearn. I have barely a surface-level understanding of NNs and did not know how they could be applied to a regression problem. I copied some code from the documentation, and it performed worse than the benchmark. At that point, I had spent so much time debugging and researching alternative models that I could not possibly improve my score before the deadline.
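With hindsight, sklearn's `MLPRegressor` can do regression out of the box, but it is sensitive to feature scaling; skipping the scaler is one plausible reason a copied example underperforms a linear benchmark. A minimal sketch on a synthetic nonlinear target (not my original code, which I no longer have):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(X[:, 0])  # a mildly nonlinear target

# Scaling the inputs first makes a big difference for MLPs.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=0),
)
mlp.fit(X, y)
```

The pipeline keeps the scaler and the network together, so the same scaling is applied at prediction time.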
Elaborate on that
Unlike David Lynch, I will elaborate on my ‘story’ and the lessons to be learnt. Firstly, I am lucky to be in a cohort of intelligent people who have more experience than me in this field. They kindly shared their reports, in which they used random forest regression. That model, along with some insightful optimisations, gave scores orders of magnitude better than mine. Hyperparameter tuning and K-fold cross-validation are ideas I have now begun to learn, and I will be better prepared to tackle similar problems in the future.
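To make those lessons concrete, here is how random forest regression, hyperparameter tuning, and K-fold validation fit together in sklearn. The grid values and fold count below are illustrative assumptions, not what my classmates actually used, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(400, 2))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=400)

# Try each parameter combination, scoring it by 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=cv)
search.fit(X, y)
```

`search.best_params_` then holds the winning combination, and `search.best_estimator_` is a forest refit on all the data with those settings.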
Rome wasn’t built in a day, but my bodged solution was. Problem statements like ours require real thought and preparation, and promising a solution in this time frame would be disingenuous. I look forward to tackling challenges like this in the future, but I now know that good solutions take time and diligence.
Also, don’t clap this article if you’re reading this on your telephone. Get real.