by Milan Janosov, PhD candidate at the CEU Center for Network Science.
The new season of Game of Thrones is almost upon us and fans are excited about what it may bring. I am probably not alone in wondering which of my favourite characters are going to meet their ends, and which will live on to the next season. So I decided to rank the characters by how likely they are to die. Game of Thrones is a complex world in which social position and true friends seem to be quite important, so I quantified each character’s social interaction patterns using the tools of network science. I then predicted their fate using machine learning methods.
Creating the network of Westeros
As a data source I used the show’s subtitles, collected in dialogue format on a fan website. Unfortunately, most of the episodes from seasons two and three are missing, but the remaining four seasons, including almost 600 scenes, are available in a consistent format.
First I constructed the aggregated network of the realm’s social system. In this network each node represents a character of the story, and the weight of the link between each pair of characters measures the strength of their social interaction. I considered scenes to be the elementary units of social interaction (an average episode contains about twenty of them). Within a scene everyone is connected with everyone else, so two characters who appeared together in one scene have a tie of strength one, and two characters who shared two scenes have a tie of strength two. In other words, scenes are complete graphs, or cliques, each increasing the tie strength between all pairs of people present by one. By calculating these scene-level complete networks and then aggregating them, we arrive at the global social network of Westeros, which has almost 400 nodes and more than 3000 edges.
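The scene-by-scene aggregation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the analysis. The scene lists are hypothetical toy data, and build_network is a name invented here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: each scene is the list of characters appearing in it.
scenes = [
    ["Jon", "Sam", "Gilly"],
    ["Jon", "Sam"],
    ["Tyrion", "Cersei", "Jaime"],
    ["Tyrion", "Jon"],
]

def build_network(scenes):
    """Return {(a, b): weight}, adding one unit of weight per shared scene."""
    weights = Counter()
    for scene in scenes:
        # Remove repeated names, then link every pair: the scene is a clique.
        for a, b in combinations(sorted(set(scene)), 2):
            weights[(a, b)] += 1
    return weights

edges = build_network(scenes)
nodes = {name for pair in edges for name in pair}

print(len(nodes), "nodes,", len(edges), "edges")
print("Jon-Sam tie strength:", edges[("Jon", "Sam")])
```

Sorting the names inside each scene keeps every pair in a single canonical order, so (Jon, Sam) and (Sam, Jon) are counted as the same tie.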
In the network visualization all the members of the great houses are marked with different colors (e.g. blue – Starks, red – Lannisters, yellow – Martells), while the rest of the people are in gray. The size of the nodes is proportional to the number of contacts each person has and the names of the most popular characters are added as labels. The less interesting nodes with very low degree are filtered out. We can see a separated community around Jon Snow, indicating that the folks around the Wall have only a few contacts with the rest of the realm. Tyrion has a separate role: he connects Daenerys Targaryen to the center of the network, including King’s Landing, where we can see two large communities. These are the Starks and the Lannisters and their zones of influence and interaction, like the bonds between the Stark and Tully families and the conflict between the Lannisters and the Martells, forming a dense web at the heart of the story.
Let’s turn to the math. We can calculate various measures of how important nodes are, and associate them with the characters to illustrate their importance in this social ecosystem. Some of these measures are: i) the node degree, the number of contacts a person has; ii) the weighted degree, the sum of edge weights at a given node; iii) the clustering, how often pairs of contacts of the node are in contact themselves; and iv) the betweenness centrality, which says how much of a bridge a node is in terms of information flow by measuring how often it lies on the shortest path between other pairs of nodes. Besides getting a better idea of who is important and who is not, we can also learn from the data which characters died in the first six seasons. Our goal then is to relate network position to survival: does one predict the other? In other words, we want to train an algorithm to figure out which network measures predict whether a character has died.
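To make the four measures concrete, here is a small standard-library sketch that computes them on a hypothetical toy network. The edge list and function names are assumptions for illustration. The betweenness routine follows Brandes' algorithm on the unweighted graph and is left unnormalised, which may differ from the variant used in the analysis.

```python
from collections import defaultdict, deque

# Hypothetical toy edge list: (character, character, number of shared scenes).
weighted_edges = [
    ("Jon", "Sam", 2), ("Jon", "Gilly", 1), ("Sam", "Gilly", 1),
    ("Jon", "Tyrion", 1), ("Tyrion", "Cersei", 1),
    ("Tyrion", "Jaime", 1), ("Cersei", "Jaime", 1),
]

adj = defaultdict(dict)
for a, b, w in weighted_edges:
    adj[a][b] = w
    adj[b][a] = w

def degree(node):
    """Number of distinct contacts."""
    return len(adj[node])

def weighted_degree(node):
    """Sum of edge weights at the node."""
    return sum(adj[node].values())

def clustering(node):
    """Fraction of pairs of contacts that are themselves in contact."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def betweenness():
    """Unnormalised shortest-path betweenness (Brandes, unweighted graph)."""
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        order, preds = [], defaultdict(list)
        n_paths = dict.fromkeys(adj, 0)
        n_paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}

bc = betweenness()
for name in sorted(adj):
    print(name, degree(name), weighted_degree(name),
          round(clustering(name), 2), bc[name])
```

In this toy graph Jon and Tyrion are the only bridges between the two triangles, so they are the only nodes with non-zero betweenness, while their clustering is lower than that of their neighbours.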
We have a set of 94 characters interesting enough to care about. All of them are described by seven different network-based features which proxy for different dimensions of their social importance. We also know which of the characters have already died (61 of them). Based on this knowledge we can form an educated guess of who is going to die in the near future by asking: which of those people still alive have features similar to those who have already passed away? This problem resembles the well-known churn problem, which can be solved with various classification algorithms. In this analysis we use a Support Vector Machine (SVM), which happened to be the most accurate. It has an easy-to-use implementation in Python in case you’d like to try this at home.
The machine learning algorithm takes all the features into account and makes predictions on the possible value of the target variable. For this the sample data is split into test and training sets randomly and multiple times, the prediction is made on all the random splits, and the final result is evaluated. With this cross-validation strategy the SVM classifier predicted the correct class (dead or alive) in 72.3% of the cases, which given the size and nature of the data is a fair result. To illustrate the errors, the model says that eight characters shouldn’t have died but in the story they did: the model couldn’t foresee their deaths. Examples include Margaery Tyrell (the death of queens seems to be less likely than that of kings) and Janos Slynt, who was exiled from King’s Landing to the Wall, where his powerful friends couldn't save him even though the model suggested they would.
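For readers who would like to try the repeated random splits themselves, the following is a minimal sketch with scikit-learn. The data here is synthetic and only mimics the shape of the problem (94 characters, seven features, 61 deaths), so the printed accuracy says nothing about the real characters. The feature scaling step, the RBF kernel, the five folds and the ten repeats are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the article's dataset): 94 characters,
# 7 network features, 61 labelled dead (1) and 33 alive (0).
rng = np.random.default_rng(0)
y = np.array([1] * 61 + [0] * 33)
X = rng.normal(size=(94, 7)) + 0.8 * y[:, None]

n_splits, n_repeats = 5, 10  # assumed values for this sketch
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                             random_state=0)

# Out-of-fold predictions: one column per repeat of the random split.
pred = np.zeros((len(y), n_repeats), dtype=int)
for i, (train, test) in enumerate(cv.split(X, y)):
    model.fit(X[train], y[train])
    pred[test, i // n_splits] = model.predict(X[test])

accuracy_per_repeat = (pred == y[:, None]).mean(axis=0)
print("mean accuracy:", accuracy_per_repeat.mean())
print("spread across repeats:", accuracy_per_repeat.std())

# Dead characters that the model labelled "alive" in most repeats.
missed = np.where((y == 1) & (pred.mean(axis=1) < 0.5))[0]
print("deaths the model did not foresee:", len(missed))
```

Because every character lands in exactly one test fold per repeat, each prediction is made by a model that never saw that character during training.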
It should be mentioned that including other types of features (e.g. gender, being a member of a noble house, sentiment analysis of the speeches, etc.), having a more complete dataset, comparing the TV show to the book, etc., could increase the accuracy of the predictions. This model also neglects discrepancies like Jon Snow dying and then being reborn, and Benjen Stark being somewhere in between.
Results – Spoiler alert
Using the SVM model we get to the answers: the probabilities of each living well-known character passing away. As network measures are often highly correlated, we can’t pick out one or two that are predictive on their own, but it seems that characters with high betweenness, low clustering and high degree are less likely to be killed. In any case, the strength of the machine learning approach lies precisely in finding hidden relationships among a large number of features. I used five-fold cross-validation during the prediction, and repeated this a hundred times to estimate the probabilities and their statistical error. Finally, here is the list of characters ranked in increasing order of survival according to the final prediction model:
The list tells us many interesting things. First, Daenerys seems to be quite likely to die, in line with a number of fan speculations, while Tyrion and Jon Snow seem to be relatively safe. Second, the ever-popular Arya Stark and the less friendly Hound, already so close to death many times before, are both in dangerous positions. Surprisingly, Cersei, currently sitting on the Iron Throne, and Baelish, who is doing his best to get there, seem to be in a much better position. It seems Jorah Mormont will find the cure for his greyscale, and despite all he has been through, Theon Greyjoy will probably survive. Sadly, the same cannot be said about the Arryn family.
Read about Milán's research in other media: