Putting an addiction to good use
People who know me can tell you about my addictive behaviour when it comes to watching British panel shows. I can go on and on about those. Whether it is the content, cast or production process, I can talk about this for ages. I’ve just recently discovered Shooting Stars with Bob Mortimer, Vic Reeves and ULRIKA-KA-KA-KA! Warning: Shooting Stars is a very “special” show, so prepare for some pythonesque stuff!
Let’s have a look on how I got into panel shows and why I can’t stop watching them. It all started with QI. After binge watching this show, I stumbled upon a beautiful subreddit called /r/panelshow. There, I found a lot of good recommendations, based on the people I knew from QI. I moved on to 8 Out Of 10 Cats, Never Mind The Buzzcocks and Would I Lie to You?. Which got me thinking: When I can select panel shows, based on their cast, why not build a recommendation system, which does exactly the same?
The dove from above or: The main idea
The main goal is to predict similar panel shows to a given one, based on the people appearing in them. For this we need a measurement of what is “similar” (or “dissimilar”). I used a vector space approach, where each show gets the cast as features. The features are the names of the people appearing in the show. Each show then has the number of appearances of every particular comedian/actor/pornstar in that show as a feature characteristic.
|Show||Jimmy Carr||Stephen Fry||Jeremy Clarkson||…|
|8 out of 10 cats||127||0||0||…|
|Have I got news for you||0||2||3||…|
If we now read each line from the table, we get a well known representation of the show as a point in a very high dimensional vector space. Imaging the shows as points in a coordinate system, only this time you don’t just have two dimensions, but 2646. And where there are points, you will be able to calculate the distances between them. Get the points with the lowest distance to your favourite show and ERANU: You’ve just received your watch list for the next week.
Of course you could measure the distance between the points/shows in the full 2646-dimensional space, but I will apply a technique called Principal Component Analysis (PCA), which is capable of transforming a high dimensional space into a 2-dimensional one. This makes it easier to access the points in a 2-dimensional coordinate system. PCA uses eigenvalues and covariances in the data to determine the positioning of the points in the low dimensional space.
Data to whom data is due.
Most good ideas fail due to missing data. But good news everyone! There is a website called British Comedy Guide, which stores cast information for many British panel shows. Now the bad news: You have to crawl HTML and parse it. With different styles of data representation for different types of panellists. And sometimes missing information. ARGH! Nevertheless BGG is consistent for all major panel shows. Let’s have a look at my crawler, shall we?
You need to set up some libraries and set a provider (here: BCG).
First I downloaded the complete list of panel shows and stored them with their corresponding URL for further processing.
Next we need to define how one row in the DataFrame should look like and how we want to fill it.
This function returns a DataFrame with one row (equal to one show) each time it is called. We just invoke the function on each row and URL from the list of shows and let pandas build a sparse DataFrame from the results.
Finally, save everything.
We asked our studio audience…
With the data at hand, we finally can look for some recommendations. We are going to use the PCA implementation from skicit-learn.
First, we need to normalise the data. I have chosen to divide each row by its maximum value. Other
brands normalisations are available.
The data has now 69 rows and 2646 columns. Let the PCA reduce the data to two dimensions. I have used a randomized PCA. After a quick survey of other possibilities this gave the best results. Finally, build a Nearest Neighbour tree to quickly query the dataset.
The moment of truth: Let’s take an example. I have chosen 8 Out Of 10 Cats as my starting point. In my dataset the show has index
The result is quite good:
|67||8 Out Of 10 Cats|
|68||8 Out Of 10 Cats Does Countdown|
|9||The Big Fat Quiz Of The Year|
|64||Would I Lie To You?|
|58||Was It Something I Said?|
Beneath the closest neighbours, you will find the spin-off 8 Out Of 10 Cats Does Countdown, which has a quite similar cast, and another show hosted by Jimmy Carr (The Big Fat Quiz Of The Year). I know, some of you guys are thinking: Isn’t it counterproductive to count the appearance of the host for every show? First of all, no, because the host is part of the panel in my model. I find it very intuitive to assume that, because if you want to watch a show, you have to more or less like the host. He will be there every time you watch the program. So it’s reasonable to give him/her a high weight. Secondly, here is another example for QI (index 44):
|20||Duck Quacks Don't Echo|
|37||The Marriage Ref|
|57||Wall Of Fame|
|43||Play To The Whistle|
|65||You Have Been Watching|
You can see Duck Quacks Don’t Echo, which has similar concept to QI. The other panel shows might be further off, but as I said before, it is just about the cast. It’s not about the content of the shows.
One last look on the complete data.
If you you want to try it yourself, you can find the data and the notebooks on github. Caution: I cannot guarantee for the data’s completeness or correctness, nor does it cure cancer.
And the final scores are…
You have just read about the possibility of making a recommendation system for panel shows just by looking at the cast and the cast similarities. This solution is far from perfect, but it gives a hint for further development. A lot more can be done.
- One could complete the data set from other sources.
- Change the weights for several panellists to represent personal favourites
- Lower the impact of the host
- Try other methods like clustering
The least thing I can hope for is somebody to actually pick up watching (new) panel shows. And therefore: Back to Sandi Toksvig for everything Scandinavian.