How To Beat Kaggle (the Easy Way)

6 minute read

A few nights ago, I found myself tinkering with the Titanic data set on Kaggle and couldn’t help but notice the number of people with a perfect score – many of whom have a single entry.

So, I thought to myself: “Clearly, they must be cheating. But how do you cheat efficiently?”

A Mathematician’s Perspective

The Kaggle competition ‘Titanic: Machine Learning from Disaster’ (and in fact any classification competition on Kaggle) can be modelled as follows:

Kaggle asks you to find a secret -dimensional vector (with for the Titanic data set) where each number is either (the passenger with ID didn’t survive) or (the passenger with ID did survive). So all we really have to do is to guess – no need for fancy machine learning techniques!

There are many possible values for and guessing all of them would be practically impossible. Fortunately, there’s one more ingredient to Kaggle that we can exploit: If we submit a guess to Kaggle, it will return a score – the number of correct guesses. I.e. the number of s such that .

A Naive Approach

This allows for a naive solution in (at most) many guesses: First guess resulting in a score of correct entries and then, for each guess – the binary vector with only one in position . Let be the resulting score. If , then the th entry of must be a . Otherwise it is a .

While this is certainly possible to pull off in practice 1, it does not satisfy my urge for efficiency.

Can we do better?

Indeed! But making significant progress will require some work.

Insert Graph Theory

Definition

Let be an undirected graph. Let be vertices. The distance of and in (in symbols ) is the length of the minimal path of edges in that connects and . If no such path exists, we let . 2

distance of two nodes in a graph
A shortest path of length 2 between u and v

Definition

Let be an undirected graph and let be a set of vertices. resolves if every vertex is uniquely determined by its vector of distances to members of . To put it in mathematical terms: resolves if the function

\[ d_R \colon V \to [0, \infty]^k, v \mapsto (d_G(v,r_1), d_G(v,r_2), \ldots, d_G(v,r_k)) \]

is injective.

Note that trivially resolves as every is uniquely determined by with since in this case we have .

Consider the following example:

a resolving set of size 4 for H(3)
A resolving set of size 3 for H(3)

Here, the vertex is uniquely identified by having distance to , distance to and distance to . In other words, is the unique vertex with distance vector to the highlighted resolving set.

I encourage the reader to check that is indeed a resolving set for to get a better feel for this rather abstract concept before moving on.

Definition

Let be an undirected graph. The metric dimension of is the minimal size of some that resolves .

Returning to our example : We’ve already found a resolving set of size and a bit of computation confirms that there is no resolving set of size . Therefore the metric dimension of is .

It is now time to meet the hero of our advanced guessing strategy.

Definition

The -dimensional Hamming graph is the undirected graph whose vertices are all -dimensional binary vectors such that iff they differ in exactly one position, i.e.

\[ E = \{ \{ \vec{v}, \vec{u} \} \mid \{i \mid v_i \neq u_i \} \text{ has size 1 } \}. \]

If you look back to the diagram of , you will see that this is indeed the -dimensional Hamming graph. Its vertices are binary vectors of length and they are connected via a single edge if and only if they differ in exactly one coordinate.

A Graph Theoretic Guessing Strategy

Here is our graph theoretic guessing strategy: Let be a small resolving set for . Submit each as a guess to Kaggle which sends back its score . If for some , we’ve achieved a perfect score and are done with this competition. Otherwise, since is a resolving set, there is a unique vertex in such that

\[ d_{H(n)}(\vec{g},\vec{r}) = n - s(\vec{r}) \]

for all .

However, by the construction of , we have (where is Kaggle’s secret solution vector) for all .

Since is the unique vector with this property, we must have that is the desired solution. This strategy therefore guarantees a perfect score in at most many submissions!

Now, because has a metric dimension of , we will be able to crack the Titanic dataset in 3 submissions and furthermore, if we commit to submitting all but the final guess at once, this is known to be an optimal guessing strategy.

Confessions

While this graph theoretic approach to beat Kaggle, from a purely mathematical point of view, is very pleasing, it doesn’t seem that practical after all. For instance, while the metric dimension of is known asymptotically, it’s not clear to me how to find small resolving sets for in practice. Furthermore, even if you were given a small resolving set , calculating from requires some serious computational power. Granted, it doesn’t increase the number of guesses required, the metric I’ve chosen to optimize for, but it still results in so much computational overhead that the naive approach will win out.

What’s worse: We are not taking full advantage of Kaggle’s feedback to our guesses. When guessing the entries of our resolving set one by one, we don’t use the information gained to guide our future guesses. Instead we could just as well have submitted them all in one go, collect the scores and then compute the correct solution. If we add this further restriction, the solution presented here is optimal. However, without this artificial restriction, I suspect that a better performing, general solution should be possible and I’d very much like to find one.

In practice, if you want to do better than this graph theoretic guessing strategy, I’d suggest cooking up a reasonably well-performing model. You then adapt the naive approach by prioritizing those bits that your model is least certain about. This, if I had to guess, is what people actually do to obtain perfect scores. Finally, to get a perfect score with a single entry, just calculate the solution on a different account and, once you’ve obtained it, submit it on your main account.

From Kaggle’s point of view, at least in the case of the Titanic dataset, this attack vector is pretty much impossible to defend against. For larger datasets, on the other hand, there certainly is a lot they can do to prevent successful guessing strategies. That, however, is a story for another day…

Sources

  1. Kaggle’s daily submission limit poses only the tiniest of hurdles to any programmer… 

  2. is the usual graph metric 

  3. The exact number depends on the precise metric dimension of which I do not yet know at the time of writing this post. All I know for certain is that it is