Jekyll · 2021-06-19T15:02:26+00:00 · https://www.stefanmesken.info/feed.xml · Stefan Mesken · Data Science
Machine Learning Golf · Stefan Mesken · 2019-07-01 · https://www.stefanmesken.info/data-science/machine-learning-golf
<p>For the last 5 years or so, I’ve been lurking on the <a href="https://codegolf.stackexchange.com/">Code Golf Stackexchange</a>. Code Golf is the art of solving a given problem, under specified rules, with as little written code as possible. If you have never encountered ‘golfed code’, I highly recommend taking a look. I can’t speak for other people, but to my nerdy brain, reading the ingenious approaches these incredibly talented and passionate coders come up with is pure entertainment gold!</p>
<p>For various reasons, I’ve never gotten into the hobby myself – until last week. While walking my dog, I thought about <a href="https://www.newscientist.com/article/2205779-creating-an-ai-can-be-five-times-worse-for-the-planet-than-a-car/">this article on the carbon footprint of machine learning</a> and it occurred to me that the neural networks I’ve designed and trained are super wasteful in terms of their parameter efficiency – roughly, the number of free variables that are adjusted during training to adapt them to a given task. So, I thought to myself, what would happen if I added parameter efficiency as an additional metric?</p>
<h1 id="machine-learning-golf">Machine Learning Golf</h1>
<p>Here’s the elevator pitch for ML Golf: The challenger provides</p>
<ul>
<li>a public training set,</li>
<li>a performance goal that needs to be met on the (unmodified!) training set,</li>
<li>[additional, optional performance goals for bragging rights],</li>
<li>a set of rules that all entries must satisfy and</li>
<li>a scoring system to judge the entries by.</li>
</ul>
<p>For logistics and visibility reasons, I suggest posting any such challenge on https://codegolf.stackexchange.com/ under the tag ‘machine-learning’, once that tag exists. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>Participants then design and train their models according to the stated rules. Once their model meets the mandatory performance goal, they may submit it (in a suitable form) together with all code necessary to verify their results. As per usual in Code Golf, peers will then vote and comment on the submissions and suggest improvements. Participants may update their entries at any point and are free to enter multiple models.</p>
<h1 id="whats-the-point">What’s the point?</h1>
<p>The initial motivation of ML Golf was to find out how to design machine learning models that are much more parameter efficient. However, I don’t think that this is the most convincing or even relevant argument for you to consider participating.</p>
<p>Instead, the motivation for ML Golf is pretty much the same as for any other programming competition: My initial trial runs of ML Golf proved to be highly entertaining, got me thinking about aspects of machine learning I previously hadn’t considered and encouraged out-of-the-box thinking. And, should the community decide to participate, they may spark a spirited exchange with like-minded people in a casual and safe environment.</p>
<h1 id="the-first-challenge">The First Challenge</h1>
<p>To kick things off, I’ll propose the first, very simple challenge.</p>
<p>In the language and framework of your choice, design and train a neural network that, given \((x_1, x_2)\) calculates their product \(x_1 \cdot x_2\) for all integers \(x_1, x_2\) between (and including) \(-10\) and \(10\).</p>
<h2 id="performance-goal">Performance Goal</h2>
<p>To qualify, your model may not deviate by more than \(0.5\) from the correct result on any of those entries.</p>
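<p>To make the performance goal concrete, here is a minimal sketch of how an entry could be checked. The names <code>multiplication_dataset</code>, <code>max_deviation</code> and <code>toy_model</code> are hypothetical and not part of the challenge; <code>toy_model</code> merely stands in for a trained network.</p>

```python
import itertools


def multiplication_dataset(low=-10, high=10):
    """All integer pairs (x1, x2) with low <= x1, x2 <= high and their products."""
    values = range(low, high + 1)
    return [((x1, x2), x1 * x2) for x1, x2 in itertools.product(values, values)]


def max_deviation(model, dataset):
    """Largest absolute error of `model` over all entries of the dataset."""
    return max(abs(model(x) - y) for x, y in dataset)


# Hypothetical stand-in for a trained network: the exact product plus a fixed error.
def toy_model(x):
    return x[0] * x[1] + 0.25


dataset = multiplication_dataset()
deviation = max_deviation(toy_model, dataset)
qualifies = deviation <= 0.5

print(len(dataset), deviation, qualifies)
```

<p>A real entry would replace <code>toy_model</code> with the prediction function of its network. The check itself only depends on the 441 input pairs and the threshold of \(0.5\).</p>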
<h2 id="bonus-goal">Bonus Goal</h2>
<p>Achieve a maximal deviation (measured by the Euclidean norm) of \(0.5\) for the multiplication of all complex numbers with integer components \(c_1 = (a_1, b_1) = a_1 + b_1 \cdot i, c_2 = (a_2, b_2) = a_2 + b_2 \cdot i\) of Euclidean norm \(\le 10\).</p>
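<p>One possible reading of the bonus goal, sketched in code: enumerate all complex numbers with integer components and Euclidean norm at most \(10\), then measure the largest Euclidean distance between prediction and true product. The helper names and <code>toy_complex_model</code> are assumptions made for illustration.</p>

```python
import itertools


def gaussian_integers(max_norm=10):
    """Complex numbers a + b*i with integer a, b and Euclidean norm <= max_norm."""
    components = range(-max_norm, max_norm + 1)
    return [
        complex(a, b)
        for a, b in itertools.product(components, components)
        if a * a + b * b <= max_norm * max_norm
    ]


def max_complex_deviation(model, numbers):
    """Largest Euclidean distance between model(c1, c2) and c1 * c2."""
    return max(abs(model(c1, c2) - c1 * c2) for c1 in numbers for c2 in numbers)


# Hypothetical stand-in for a trained network: exact product plus a fixed error.
def toy_complex_model(c1, c2):
    return c1 * c2 + 0.25


numbers = gaussian_integers()
complex_deviation = max_complex_deviation(toy_complex_model, numbers)

print(len(numbers), complex_deviation, complex_deviation <= 0.5)
```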
<h2 id="rules">Rules</h2>
<p>Your model</p>
<ul>
<li>must be a ‘traditional’ neural network (a node’s value is calculated as a weighted linear combination of some of the nodes in a previous layer followed by an activation function),</li>
<li>may only use activation functions listed as such <a href="https://keras.io/activations/">in Keras</a>,</li>
<li>must take \((x_1, x_2)\) as its only input, given as a tuple/vector/list/… of integers or floats,</li>
<li>must return the answer \(\hat{y}\) as an integer or float (or as a suitable container, e.g. a vector or list, that contains this answer).</li>
</ul>
<h2 id="scoring">Scoring</h2>
<p>The neural network with the <em>smallest number of weights</em> (including bias weights) wins.</p>
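<p>As a rough illustration of the scoring, the following sketch counts the weights of a network under the assumption that it is fully connected and that every non-input node has exactly one bias weight. The rules also allow sparser networks, which would score lower. The function name and the example architecture are hypothetical.</p>

```python
def count_weights(layer_sizes):
    """Weights (including one bias per non-input node) of a fully connected
    network whose layers have the given numbers of nodes."""
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical architecture: 2 inputs, one hidden layer with 3 nodes, 1 output.
# (2 + 1) * 3 weights into the hidden layer plus (3 + 1) * 1 into the output.
print(count_weights([2, 3, 1]))
```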
<p>My best attempt so far uses 43 weights and achieves a maximal deviation of \(0.47858\), as witnessed by \(8 \cdot (-8) \mapsto -63.521423\). I’m sure there’s plenty of room for improvement and I’m looking forward to your entries!</p>
<p>If you want to participate or check out my submission, hop over to <a href="https://codegolf.stackexchange.com/questions/187562/machine-learning-golf-multiplication">this challenge on Code Golf Stackexchange</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>This, of course, doesn’t mean that ML Golf is limited to this platform – feel free to post your challenges and submissions wherever you want. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
The Dangers of Performance Metrics · Stefan Mesken · 2019-06-21 · https://www.stefanmesken.info/data-science/metrics
<p>Full disclosure: I’m not a big fan of one-dimensional metrics when evaluating <a href="https://en.wikipedia.org/wiki/Binary_classification">(binary) classifiers</a>
(e.g. a neural network that is supposed to distinguish pictures of cats and dogs). In my opinion, a <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a> is, in almost all instances, preferable for data scientists. However, that doesn’t mean there aren’t good reasons to use one-dimensional performance metrics:</p>
<ul>
<li>They are (deceptively) simple,</li>
<li>Easy to integrate in cross validation and similar automated processes,</li>
<li>Can be an excellent communication tool,</li>
<li>…</li>
</ul>
<p>My gripe with performance metrics is founded in the observation that they tend to be <a href="https://en.wiktionary.org/wiki/footgun">footguns</a> in practice – outside of the academic environment they were born in:</p>
<ul>
<li>Their actual expressive power often isn’t clear,</li>
<li>It’s remarkably common for people to chase metrics that don’t align with their goals,</li>
<li>The terminology can be misleading <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and</li>
<li>Choosing a metric is often less a result of careful consideration than a symptom of <a href="https://en.wikipedia.org/wiki/Information_bias_(psychology)">information bias</a> or high <a href="https://en.wikipedia.org/wiki/Uncertainty_avoidance">uncertainty avoidance</a>.</li>
</ul>
<p>Note that none of these points has to do with performance metrics as mathematical concepts – they are instances of human misconduct. Therefore, instead of continuing my crusade against the status quo, I’d like to take a few minutes to discuss some common metrics and how/when you should pay and, more importantly, when you should not pay attention to them.</p>
<h1 id="black-or-true-black">Black or True Black?</h1>
<p>Before we can hope to understand common performance metrics we need to talk about True Positives and True Negatives.</p>
<figure style="align: center">
<img src="/images/CS8k.png" style="max-width: 400px;" alt="CatSniffer8000" />
<figcaption>CS8k at work</figcaption>
</figure>
<p>Suppose we have a binary classifier, call it ‘CatSniffer8000’ (CS8k), that is supposed to tell us whether a given picture shows a cat or not. We have a test dataset of 1000 pictures, exactly 400 of which show cats, that we ask CS8k to classify. And we end up with the following confusion matrix:</p>
<figure style="align: center">
<img src="/images/diagrams/CS8k_confusion_matrix.png" style="max-width: 400px;" alt="Confusion Matrix" />
<figcaption>CS8k's confusion matrix</figcaption>
</figure>
<p>There are 400 cat pictures in our test set. These are the <em>Positives</em> (P). And, likewise, there are 600 pictures that don’t show cats – the <em>Negatives</em> (N).</p>
<p>Out of the 400 pictures that showed cats (our Positives), CS8k correctly labelled 283 as cats. These are the <em>True Positives</em> (TP). In other words: True Positives are those instances that are labelled as positive both in our test set and by our classifier.</p>
<p>Similarly, out of 600 pictures in our test set that didn’t show cats, CS8k correctly labelled 527 as ‘non cats’. These are the <em>True Negatives</em> (TN).</p>
<p>Furthermore, 73 non-cat pictures got incorrectly labelled as cat pictures. This amounts to 73 <em>False Positives</em> (FP).</p>
<p>Finally, 117 cat pictures in our dataset received the label ‘no cat’ from CS8k – resulting in 117 <em>False Negatives</em> (FN).</p>
<p>Note that we always have</p>
<p>#Negatives = #True Negatives + #False Positives and</p>
<p>#Positives = #True Positives + #False Negatives.</p>
<p>This is a good sanity check when filling out a confusion matrix (or interpreting an unlabelled one).</p>
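<p>The bookkeeping above can be sketched in a few lines. The label lists below are synthetic and only reproduce CS8k’s counts; <code>confusion_counts</code> is a hypothetical helper, not a library function.</p>

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for boolean labels, where True means 'cat'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, fp, tn, fn


# Synthetic data matching the example: 400 cats first, then 600 non-cats.
y_true = [True] * 400 + [False] * 600
y_pred = [True] * 283 + [False] * 117 + [True] * 73 + [False] * 527

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

# Sanity checks: every Positive is either a TP or an FN,
# every Negative is either a TN or an FP.
assert tp + fn == sum(y_true)
assert tn + fp == len(y_true) - sum(y_true)

print(tp, fp, tn, fn)
```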
<h1 id="accuracy-recall-precision-and-f1-score">Accuracy, Recall, Precision and F1-Score</h1>
<p>Now that we know about True/False Positives/Negatives, we can use them to define common performance metrics.</p>
<h2 id="accuracy">Accuracy</h2>
<p>The one pretty much everyone will have seen is <em>Accuracy</em> (ACC). It’s defined as</p>
<p>\[
\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{P} + \mathrm{N}}
\]</p>
<p>but what does it mean? Well, it’s the fraction of all instances that have been labelled correctly by our classifier. As such, it serves as a good default performance metric for many tasks – especially if the test set is balanced (i.e. \(\mathrm{P}\) and \(\mathrm{N}\) are of similar size) and consists of groups of roughly equal importance to your task.</p>
<p>The accuracy of CS8k is \(\frac{283+527}{400 + 600} = 0.81 = 81 \%\).</p>
<p>Let’s consider an example in which accuracy is a terrible metric: Suppose you decide to build a model that tries to detect breast cancer in women from their X-ray images. Your dataset consists of 100,000 randomly selected images and, as such, only contains 85 Positives. By always guessing ‘not breast cancer’, your classifier will achieve an accuracy of \(\frac{0 + 99915}{100000} = 99.915\%\). And despite this impressive performance metric, you hopefully agree with me that this is an absolutely terrible model.</p>
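<p>Both accuracy computations can be reproduced with a short sketch. With 85 Positives among 100,000 images there are 99,915 Negatives, all of which the always-negative classifier labels correctly. The function and variable names are chosen for illustration only.</p>

```python
def accuracy(tp, tn, p, n):
    """ACC = (TP + TN) / (P + N)."""
    return (tp + tn) / (p + n)


# CS8k: 283 True Positives and 527 True Negatives on 400 + 600 pictures.
cs8k_accuracy = accuracy(tp=283, tn=527, p=400, n=600)

# Always guessing 'not breast cancer': no True Positives, every Negative correct.
always_negative_accuracy = accuracy(tp=0, tn=99_915, p=85, n=99_915)

print(f"{cs8k_accuracy:.2%}", f"{always_negative_accuracy:.3%}")
```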
<h2 id="recall">Recall</h2>
<p>To mathematically capture the awfulness of our breast cancer classifier, we should look at its <em>Recall</em> or <em>True Positive Rate</em> (TPR). It’s defined as</p>
<p>\[
\mathrm{TPR} = \frac{\mathrm{TP}}{P}
\]</p>
<p>and hence turns out to be a whopping \(\frac{0}{85} = 0\%\) – truly awful!</p>
<p>Recall measures how many of the Positives got detected as such <em>and only that</em>. While it can be a highly relevant piece of information, it is not suitable as your single performance metric. By blindly guessing ‘Positive’, your model will always have a Recall of \(100\%\).</p>
<h2 id="precision">Precision</h2>
<p>Contrary to Recall, <em>Precision</em> or <em>Positive Predictive Value</em> (PPV) measures the ratio of Positives among all those instances that our classifier labelled as positive.</p>
<p>\[
\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
\]</p>
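<p>Our breast cancer classifier never labels anything as positive, so \(\mathrm{TP} = \mathrm{FP} = 0\) and its precision \(\frac{0}{0}\) is, strictly speaking, undefined. Since \(\mathrm{FP} = 0\) means it never raises a false alarm, we may, by convention, credit it with perfect precision – demonstrating that precision alone also isn’t suitable as a performance metric.</p>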
<h2 id="f1-score">F1 Score</h2>
<p>This is where the F1-Score (F1) comes into play. It’s the harmonic mean of Recall and Precision, which makes it much more useful as a performance metric than either of its building blocks. The precise mathematical definition is</p>
<p>\[
\mathrm{F1} = \frac{2}{\frac{1}{\mathrm{TPR}} + \frac{1}{\mathrm{PPV}}} = \frac{2 \cdot \mathrm{TP}}{2 \cdot \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.
\]</p>
<p>In order to achieve a high F1-Score, your model must achieve both high Recall and Precision <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Not only that, because of the usage of the harmonic mean, a terrible score in either metric will destroy the F1-Score. In particular, our breast cancer classifier, despite having perfect precision, has an F1-Score of \(\frac{2 \cdot 0}{ 2 \cdot 0 + 0 + 85} = 0\).</p>
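<p>A small sketch of the three metrics, evaluated for CS8k and for the always-negative breast cancer classifier. Returning <code>None</code> whenever a denominator is zero is a design choice of this sketch; other conventions exist.</p>

```python
def recall(tp, fn):
    """TPR = TP / P with P = TP + FN; None if there are no Positives."""
    positives = tp + fn
    return tp / positives if positives else None


def precision(tp, fp):
    """PPV = TP / (TP + FP); None if nothing was labelled as positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else None


def f1_score(tp, fp, fn):
    """F1 = 2 TP / (2 TP + FP + FN); None if the denominator vanishes."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else None


examples = {
    "CS8k": (283, 73, 117),           # (TP, FP, FN)
    "always negative": (0, 0, 85),
}

for name, (tp, fp, fn) in examples.items():
    print(name, recall(tp, fn), precision(tp, fp), f1_score(tp, fp, fn))
```

<p>Using the second form of the F1 formula avoids dividing by a Recall of zero, which the harmonic-mean form would require for the always-negative classifier.</p>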
<p>If your task demands a balance of Recall and Precision (which typically require trade-offs), the F1-Score is a decent choice. However, it’s not a silver bullet for your performance metric needs either:</p>
<p>Consider the following scenario: We keep everything as before but flip the labels in our breast cancer example. So the breast cancer instances are now our Negatives and the non-breast cancer instances are now our Positives. Having changed nothing about the quality of our model, this now results in an F1-Score of \(\frac{2 \cdot 99915}{2 \cdot 99915 + 85 + 0} \approx 0.9996\) – a nearly perfect score.</p>
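<p>The effect of flipping the labels can be checked directly on the confusion matrix entries: True Positives and True Negatives swap roles, as do False Positives and False Negatives. The helper <code>flip_labels</code> is hypothetical.</p>

```python
def flip_labels(tp, fp, tn, fn):
    """Swap the roles of Positives and Negatives; returns new (TP, FP, TN, FN)."""
    return tn, fn, tp, fp


def f1_score(tp, fp, fn):
    """F1 = 2 TP / (2 TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Always guessing 'not breast cancer' on 85 Positives and 99,915 Negatives.
original = (0, 0, 99_915, 85)  # (TP, FP, TN, FN)
flipped = flip_labels(*original)

f1_original = f1_score(original[0], original[1], original[3])
f1_flipped = f1_score(flipped[0], flipped[1], flipped[3])

print(flipped, f1_original, round(f1_flipped, 4))
```

<p>Accuracy, by contrast, is unaffected by the flip, since \(\mathrm{TP} + \mathrm{TN}\) stays the same.</p>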
<p>There are other issues with these and every other one-dimensional performance metric but I think I’ve already demanded too much of your attention for now. So let’s reserve further criticism for another time…</p>
<h1 id="recommended-default">Recommended Default</h1>
<p>Let me stress again that metrics are neither good nor bad. If handled responsibly, they’re powerful tools for data scientists and disregarding their value would be a foolish move. However, as I’ve tried to argue in this post, I do think that metrics should guide your decision making only after careful evaluation of their appropriateness for the task at hand. I’m not kidding when I say that, in practice, metrics tend to be footguns. It’s incredibly easy to misinterpret what a given score/value means in your particular case and how it should (or shouldn’t) affect your decisions.</p>
<p>In fact, I’d go as far as to say that the default for data science practitioners should be to <em>not consider any one-dimensional metric</em> and instead analyse confusion matrices. They are, without a doubt, less convenient but that actually turns out to be their strength: Being forced to carefully think through a confusion matrix will save you from making many mistakes that looking at a single, poorly chosen performance metric alone would have nudged you to commit.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Metrics tend to have names that we attach some intuitive meaning to (like ‘accuracy’) that doesn’t always align with the technical definition. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Note that, although the F1-Score takes only values between 0 and 1, people don’t often view the F1-Score as a percentage. And there are good reasons for that, which I won’t get into here. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
An Hommage to Feynman’s Technique · Stefan Mesken · 2019-06-17 · https://www.stefanmesken.info/machine%20learning/Feynman-technique
<h1 id="fragile-knowledge">Fragile Knowledge</h1>
<blockquote>
<p>I don’t know what’s the matter with people: they don’t learn by
understanding; they learn by some other way - by rote, or
something. Their knowledge is so fragile!</p>
<p>– Richard P. Feynman</p>
</blockquote>
<p>One key to successfully mastering a new subject, I believe, is to
connect it to something that you are already deeply familiar
with. This idea is as old as human learning and a common quality of
great lecturers is that they fully embrace its logical conclusion:
When introducing an audience to any novel concept, start out by
conveying the main ideas in analogies and metaphors rather than
drowning them in complexities and details.</p>
<p>And while people tend to agree that this is a good quality in a
teacher, they often deprive themselves of this luxury when learning on
their own. Admittedly, explaining a new idea well to yourself requires
a substantial amount of additional effort. However, I am convinced
that this overhead is not only economical but of fundamental
importance if you ever plan to master a new set of skills (as opposed to
gaining the mostly worthless knowledge of sparsely connected facts).</p>
<h1 id="case-study-understanding-neural-networks">Case Study: Understanding Neural Networks</h1>
<p>So, if you are still with me, I’d like to demonstrate what this might
look like when first learning about neural networks. Basically I’ll
try to reconstruct the inner monologue that went through my head when
I first learned about them.</p>
<p>“Neural networks are basically <a href="https://en.wikipedia.org/wiki/Directed_graph">directed
graphs</a> with
metadata. Nodes represent neurons, edges represent the data-flow
between those neurons and the metadata comes in two flavours:</p>
<ol>
<li>
<p>The weight attached to every edge determines how much of an
influence its source neuron has in the overall network and</p>
</li>
<li>
<p>An activation function, attached to each neuron, that determines the
‘shape’ of this neuron’s discharge pattern. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</li>
</ol>
<p>So far, so good. But what does that mean? How can a neural network
process information?”</p>
<p>Stepping out of character for a second: It’s here that you can make
life very difficult for yourself. Both leaving this question
unanswered and jumping into some complex example
(e.g. <a href="https://arxiv.org/abs/1512.03385">ResNet</a>) are, in my opinion,
terrible mistakes. You want to build a simple example, ideally relying
only on yourself, for two reasons:</p>
<ol>
<li>First and most importantly: This is a litmus test. If you are not
able to build a basic example of the concept you’ve just learned
about, you haven’t understood the basics yet and need to revisit
them. By forcing yourself to sit down and actually spelling out a very
simple example in detail, you’re making sure that your previously gained
confidence is justified and that you’re not just fooling yourself.
<blockquote>
<p>The first principle is that you must not fool yourself – and you are
the easiest person to fool.</p>
<p>– Richard P. Feynman</p>
</blockquote>
</li>
<li>
<p>If you want to master any subject, it is my firm opinion that you
must carry around a large set of examples in your head. These
examples will not only guide your intuition but they will also
serve as a more and more elaborate testing ground for new ideas. In
the future, if someone tells you about some new fancy concept (that
someone might be you), you can refer to an appropriate example in
your head and see how it would influence that. If the results don’t
make sense, something’s wrong! Either you’ve misunderstood what the
other party is telling you or there is a mistake. Assuming you want this
conversation to remain meaningful, you need to fix that! This is
how I, and I believe most of my colleagues, are able to build
an appropriate intuition about complex concepts that often run
counter to any experience you might have had in your daily life.</p>
<p>Building a simple example from scratch is your first step
toward assembling your internal database.</p>
</li>
</ol>
<p>Okay, let’s return to the inner monologue:</p>
<p>“Is there anything I already know about that could easily be realized
as a neural network? Actually, yes: Linear functions are of the form
\(f(x) = a \cdot x + b\) and they are represented by the following
neural network: <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<figure style="align: center">
<img src="/images/diagrams/nn_linear_function.png" style="max-width: 400px;" alt="a neural network that represents a linear function" />
</figure>
<p>Great, but how about logical connections? If artificial neural
networks are supposed to model actual neural networks, they certainly
must be able to capture logic gates, right? Right!</p>
<figure style="align: center">
<img src="/images/diagrams/nn_not.png" style="max-width: 400px;" alt="a neural network that represents logical negation" />
</figure>
<p>Here \(\chi = \chi_{[0, \infty)}\) is the activation function of
\(a_1\) and all you need to know is that \(\chi(z) = 0\) for all \(z
< 0\) and \(\chi(z) = 1\) for all \(z \ge 0\). So, if we set \(x_1 =
0\), we get that \(a_1 = \chi(w_0 \cdot x_0 + w_1 \cdot x_1) =
\chi(0 \cdot 1 + (-1) \cdot 0) = \chi(0) = 1\). On the other hand, if we let
\(x_1 = 1\), we get that \(a_1 = \chi(0 \cdot 1 + (-1) \cdot 1) = \chi(-1) =
0\). So this neural network does indeed represent \(\mathrm{NOT}(x_1)\)!”</p>
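<p>Both toy networks from the monologue can be written down in a few lines of plain Python. This is only a sketch: <code>neuron</code>, <code>linear</code> and <code>not_gate</code> are hypothetical names, and the assignment \(w_0 = b\), \(w_1 = a\) for the linear function is an assumption based on the bias convention mentioned in the footnote.</p>

```python
def chi(z):
    """Indicator function of [0, infinity): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def identity(z):
    return z


def neuron(weights, inputs, activation=identity):
    """Weighted linear combination of the inputs followed by an activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)))


def linear(x, a=2.0, b=1.0):
    # Bias input x_0 = 1 with weight w_0 = b, input x_1 = x with weight w_1 = a.
    return neuron([b, a], [1, x])


def not_gate(x1):
    # Bias input x_0 = 1 with weight w_0 = 0, input x_1 with weight w_1 = -1.
    return neuron([0, -1], [1, x1], activation=chi)


print(linear(3.0))               # a * x + b with a = 2, b = 1
print(not_gate(0), not_gate(1))  # negation of 0 and of 1
```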
<p>At this point, you might want to cook up a few more examples and in an
earlier tweet I demonstrated how to represent other basic logic gates
(AND, OR, NOR, XOR, XNOR) as neural networks as well. I’d encourage
you to try it yourself first and then <a href="https://twitter.com/mesken_stefan/status/1138694458470539264">hop over to
Twitter</a>
and compare your results with mine. <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>Once you’ve done that, you may notice what I’ve noticed: Not only can
neural networks represent logic gates. <em>Neural networks are nothing
more than logic gates</em>, with a few minor adjustments:</p>
<ol>
<li>
<p>Inputs are allowed to take on any real number as value, not just \(0\) and \(1\),</p>
</li>
<li>
<p>We add a restriction that a node’s value is computed by a weighted,
linear combination of the values to its left followed by an
activation function and</p>
</li>
<li>
<p>We add an activation function to each of our nodes. <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>”</p>
</li>
</ol>
<p>It may not seem that way, but this little detour is incredibly
beneficial to your overall learning strategy. <strong>Ideas and memories die
in a vacuum</strong>. If you want them to become lasting and meaningful, you
need to connect them to as many previously established ideas as
possible. Don’t believe me? Then tell me: What did you have for dinner
last Wednesday? Unless you are able to connect last Wednesday’s dinner
to something more meaningful (maybe you happened to be on a first
date), most of us will struggle mightily with this very simple
question. And if I ask you about your lunch a year ago, there’s
basically no chance you’ll be able to remember. On the other hand, if
I ask you what you did last Christmas (or even better: On your wedding
day if you happen to be married), the task becomes much easier. You
either connect ideas and memories or you will lose them –
quickly. Not only that: Sparsely connected ideas (what Feynman calls
‘fragile knowledge’) are basically worthless as they’re not readily
available to be used in similar, but slightly different, settings.</p>
<h1 id="what-you-should-take-away-from-this">What you should take away from this</h1>
<p>If you make a habit of <em>always</em> explaining new ideas to yourself until
you’ve completely broken them down to related concepts, that you’ve
already mastered, you’ll never have to learn or memorize anything
truly novel again. And, over time, you’ll gain firm, deeply connected,
foundational knowledge about seemingly unrelated concepts that will
not only help you to learn other concepts faster and more efficiently
but, ultimately, also allow you to solve problems in surprising,
efficient and truly ingenious ways. Being able to do that is, in a
nutshell, the mark of a true expert.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I think it’s helpful to point out that this is one of the ways
that artificial neural networks differ from biological
ones. Biological neurons either fire or they don’t – so, in a
way, the only activation function they’re allowed to have is the
<a href="https://en.wikipedia.org/wiki/Indicator_function">characteristic
function</a>
\(\chi_{[a, \infty)}\) for some number \(a\). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This follows Andrew Ng’s handy convention that any neuron/weight with
a \(0\) as subscript represents a bias component – independent of the input. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In case you don’t want to use Twitter, you can find my solution <a href="/images/diagrams/nn_logic_gates.png">here</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>This point, in theory, is incredibly powerful. In fact, it’s too
powerful. Since neural networks are intended to approximate some
function \(f\) to begin with, we could always achieve this by
taking \(f\) as our activation function. In practice, however,
\(f\) is unknown, we only allow activation functions from a very
restrictive, simple set of functions and their specific choice, in
many cases, turns out to be surprisingly irrelevant. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
TensorFlow Without Tears · Stefan Mesken · 2019-06-11 · https://www.stefanmesken.info/machine%20learning/tensorflow-without-tears
<p>Life is pretty crazy right now. Finishing a PhD, changing from set theory to applied data science, forming connections to other data scientists, learning as much about machine learning (both theory and practice) as I possibly can, finding a new apartment and preparing to move, dealing with German bureaucracy… This doesn’t leave a lot of spare time. Truth be told, it doesn’t leave any.</p>
<p>I’m not complaining – the last couple of months have been immensely enjoyable (well, except for the bureaucracy part). But being pressed for time also made me decide to postpone my planned blog series on the importance of probability distributions in machine learning as I’m currently lacking the spare mental capacity to do this rather broad, open-ended topic justice.</p>
<h1 id="tensorflow-2">TensorFlow 2</h1>
<p>Instead, to finally return the ball of machine learning conversation to my readers, I’d like to share my excitement about <a href="https://www.tensorflow.org/beta/">TensorFlow 2</a>. The beta of TF2 was announced 4 days ago and while I didn’t take part in the alpha at all, I set aside a bit of time over the weekend to play around with the newly released beta. And it’s been incredibly exciting!</p>
<p>See, what often sets me apart in a group of people is that I really like to get down to the nuts and bolts of anything I lay my hands on. As a young child I disassembled pretty much every electronic device in our household, I had worked on <a href="https://en.wikipedia.org/wiki/Trabant">charmingly simple cars</a> by the time I graduated from kindergarten and I spent most of my childhood in my grandfather’s workshop. This hands-on, detail-focused, borderline obsessive personality trait never left me.</p>
<p>And it’s precisely what made me fall in love with TF1 when I first discovered it. TF1, at least to me, isn’t really a machine learning framework at heart. Admittedly, it has a bunch of utilities built into it that are useful in machine learning applications, but what TF1 does at its core is to provide a framework for the manipulation of dataflow graphs. This philosophy makes TF1 very versatile, performant and transparent. But it also sets up a learning curve that, especially to someone without a background in mathematics or computer science, looks more like a learning wall. And like so many other learning walls (vi anyone?), it naturally turns away a lot of people. Especially those people that I’ve benefited the most from during my entire life: People who, unlike me, don’t care much about bolts and nuts but who naturally focus much more on the greater picture, who provide a sense of style, a desire for polish that benefits the entire community.</p>
<p><a href="https://keras.io/">Keras</a> has done a lot to attract a more diverse group of users to TensorFlow 1. To quote their <a href="https://keras.io/why-use-keras/">FAQ</a>:</p>
<blockquote>
<p><strong>Keras prioritizes developer experience</strong></p>
<ul>
<li>Keras is an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear and actionable feedback upon user error.</li>
<li>This makes Keras easy to learn and easy to use. As a Keras user, you are more productive, allowing you to try more ideas than your competition, faster – which in turn helps you win machine learning competitions.</li>
<li>This ease of use does not come at the cost of reduced flexibility: because Keras integrates with lower-level deep learning languages (in particular TensorFlow), it enables you to implement anything you could have built in the base language. In particular, as tf.keras, the Keras API integrates seamlessly with your TensorFlow workflows.</li>
</ul>
</blockquote>
<p>This amazing body of work has enabled TF2 to pull off what so rarely seems attainable: By embracing Keras’ philosophy, TF2 became a piece of software that is both user-friendly and still provides all the functionality, all the performance and all the transparency any user could ask for. This almost never happens but when it does, it makes me really, really happy.</p>
<h1 id="hello-im-tensorflow">Hello, I’m TensorFlow.</h1>
<p>In an effort to invite as many people as possible to share my excitement about TF2 and join the discussion, I’d like to end this post with a very simple regression task completed in TensorFlow 2. We aim to approximate the function \(\sin \colon [0, 2 \pi] \to [-1, 1], x \mapsto \sin (x)\) with a neural network. You can find the <a href="https://colab.research.google.com/drive/1b1Lil2bfpT--axFKgd2z-2LeIkCv0z5Y">interactive Jupyter notebook over at Google Colaboratory</a> or just continue reading for a slightly more verbose version.</p>
<p>The first part of any machine learning project is to load our dependencies.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">tf</span><span class="o">-</span><span class="n">nightly</span><span class="o">-</span><span class="mf">2.0</span><span class="o">-</span><span class="n">preview</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</code></pre></div></div>
<p>Next, we load our data. In this case, to keep things as simple as possible, we simply sample the sine function at 10,000 evenly spaced points in the interval \([0, 2 \pi]\) and use these samples as our training set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">pi</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">pi</span> <span class="o">/</span> <span class="mi">10000</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<p>When possible, it’s always a good idea to look at a visual representation of your data. This serves as a baseline sanity check to ensure there is no immediately obvious problem.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<figure style="align: center">
<img src="/images/diagrams/sine.png" style="max-width: 400px;" alt="graph of the sine function" />
</figure>
<p>That looks okay, so let’s continue by building our neural network. The integration of Keras makes this very simple:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Build a neural network with 2 hidden layers, each of size 128
</span><span class="n">model</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">Sequential</span><span class="p">([</span>
    <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span>
    <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
    <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)</span>
<span class="p">])</span>
</code></pre></div></div>
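<p>To get a feeling for how large this network is, here is a small sketch that counts its trainable parameters by hand. The helper <code>dense_param_count</code> is a hypothetical name introduced for illustration; it only assumes that every fully connected layer has one weight per input/unit pair plus one bias per unit.</p>

```python
# Count the trainable parameters of a fully connected network.
# Assumption: each dense layer has (inputs * units) weights plus one bias per unit.
def dense_param_count(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 1 input -> 128 hidden -> 128 hidden -> 1 output, as in the model above
total_params = dense_param_count([1, 128, 128, 1])
print(total_params)  # 256 + 16512 + 129 = 16897
```

<p>The same number should appear in the output of <code>model.summary()</code>.</p>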
<p>Compiling the model is similarly trivial.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">,</span>
              <span class="n">loss</span><span class="o">=</span><span class="s">'mean_squared_error'</span><span class="p">)</span>
</code></pre></div></div>
<p>Finally, let’s fit the neural network to our training data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<blockquote>
<p>Train on 10000 samples</p>
<p>Epoch 1/5
10000/10000 [==============================] - 1s 72us/sample - loss: 0.1698</p>
<p>Epoch 2/5
10000/10000 [==============================] - 1s 61us/sample - loss: 0.0904</p>
<p>Epoch 3/5
10000/10000 [==============================] - 1s 58us/sample - loss: 0.0415</p>
<p>Epoch 4/5
10000/10000 [==============================] - 1s 53us/sample - loss: 0.0109</p>
<p>Epoch 5/5
10000/10000 [==============================] - 1s 52us/sample - loss: 0.0023</p>
<p>&lt;tensorflow.python.keras.callbacks.History at 0x7fa30305c5f8&gt;</p>
</blockquote>
<p>That’s it! We’ve created and successfully trained a neural network to approximate the sine function on \([0, 2 \pi]\) and the training output seems encouraging. All that’s left to do is to visualize how well we’ve done:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pred</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">pred</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'model prediction'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<figure style="align: center">
<img src="/images/diagrams/since_prediction.png" style="max-width: 400px;" alt="prediction of the sine function" />
</figure>
<p>Not bad, not bad at all! The interval \([5,2 \pi]\) needs some work (and I encourage you to think about why our model performed so much worse on that section) but other than that, I’d say that our model does know how to create a sine wave.</p>
<p>I’ve decided to give a few hints as to why our model performs worse on the tail end of our data set: Consider the following animation of the learning process</p>
<figure style="align: center">
<img src="/images/diagrams/adam_animation.gif" style="max-width: 400px;" alt="animation of the learning process" />
</figure>
<p>and keep in mind that we are using the ‘Adam’ optimizer.</p>
Stefan Mesken
Life is pretty crazy right now. Finishing a PhD, changing from set theory to applied data science, forming connections to other data scientists, learning as much about machine learning (both theory and practice) as I possibly can, finding a new apartment and preparing to move, dealing with German bureaucracy… This doesn’t leave a lot of spare time. Truth be told, it doesn’t leave any.
How To Beat Kaggle (the Easy Way)
2019-05-25T00:00:00+00:00
https://www.stefanmesken.info/machine%20learning/how-to-beat-kaggle-(the-easy-way)
<p>A few nights ago, I found myself tinkering with the <a href="https://www.kaggle.com/c/titanic">Titanic data set on Kaggle</a> and couldn’t help but notice the number of people with a <a href="https://www.kaggle.com/c/titanic/leaderboard">perfect score</a> – many of whom have a single entry.</p>
<p>So, I thought to myself: “Clearly, they must be cheating. But how do you cheat efficiently?”</p>
<h2 id="a-mathematicians-perspective">A Mathematician’s Perspective</h2>
<p>The Kaggle competition ‘Titanic: Machine Learning from Disaster’ (and in fact any classification competition on Kaggle) can be modelled as follows:</p>
<p>Kaggle asks you to find a secret \(n\)-dimensional vector \(\vec{k} = (k_1, k_2, \ldots, k_n)\) (with \(n=418\) for the Titanic data set) where each number \(k_i\) is either \(0\) (the passenger with ID \(i\) didn’t survive) or \(1\) (the passenger with ID \(i\) did survive). So all we really have to do is to guess \(\vec{k}\) – no need for fancy machine learning techniques!</p>
<p>There are \(2^n\) many possible values for \(\vec{k}\) and guessing all of them would be practically impossible. Fortunately, there’s one more ingredient to Kaggle that we can exploit: If we submit a guess \(\vec{g} = (g_1, \ldots, g_n)\) to Kaggle, it will return a score: the number of correct guesses, i.e. the number of indices \(i\) such that \(g_i = k_i\).</p>
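<p>As a minimal sketch of this model, the scoring step can be written as a function that counts agreeing positions. The name <code>score</code> and the short <code>secret</code> vector are made up for illustration; the real \(\vec{k}\) is of course hidden.</p>

```python
def score(guess, secret):
    """Number of positions in which the guess agrees with the secret vector."""
    return sum(g == k for g, k in zip(guess, secret))

# Hypothetical stand-in for the hidden solution vector
secret = (1, 0, 1, 1, 0)

# The all-zeros guess is correct exactly where the secret has a 0
print(score((0, 0, 0, 0, 0), secret))  # 2
```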
<h2 id="a-naive-approach">A Naive Approach</h2>
<p>This allows for a naive solution in (at most) \(n+1\) many guesses: First guess \(\vec{g}_0 = (0,0, \ldots, 0)\) resulting in a score of correct entries \(s_0\) and then, for each \(i = 1, \ldots, n\) guess \(\vec{g}_i\) – the binary vector with only one \(1\) in position \(i\). Let \(s_i\) be the resulting score. If \(s_i > s_0\), then the \(i\)th entry of \(\vec{k}\) must be a \(1\). Otherwise it is a \(0\).</p>
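<p>Here is a rough sketch of the naive strategy, run against a simulated scoreboard rather than Kaggle itself. The functions <code>make_oracle</code> and <code>naive_recover</code> are hypothetical helpers; the oracle simply counts correct entries and keeps track of how often it was queried.</p>

```python
def make_oracle(secret):
    """Simulated scoreboard: returns the number of correct entries, counts queries."""
    calls = {"n": 0}

    def oracle(guess):
        calls["n"] += 1
        return sum(g == k for g, k in zip(guess, secret))

    return oracle, calls


def naive_recover(oracle, n):
    s0 = oracle([0] * n)          # baseline score of the all-zeros guess
    recovered = []
    for i in range(n):
        guess = [0] * n
        guess[i] = 1              # a single 1 in position i
        # the score goes up iff the secret has a 1 in position i
        recovered.append(1 if oracle(guess) > s0 else 0)
    return recovered


secret = [1, 0, 1, 1, 0, 0, 1]    # made-up secret for the simulation
oracle, calls = make_oracle(secret)
assert naive_recover(oracle, len(secret)) == secret
print(calls["n"])                 # 8 queries, i.e. n + 1
```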
<p>While this is certainly possible to pull off in practice <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, it <em>does not</em> satisfy my urge for efficiency.</p>
<p>Can we do better?</p>
<p>Indeed! But making significant progress will require some work.</p>
<h2 id="insert-graph-theory">Insert Graph Theory</h2>
<h3 id="definition">Definition</h3>
<p>Let \(G = (V;E)\) be an undirected <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">graph</a>. Let \(u,v \in V\) be vertices. The <em>distance of \(v\) and \(u\)</em> in \(G\) (in symbols \(d_G(v,u)\)) is the length of a shortest path of edges in \(G\) that connects \(v\) and \(u\). If no such path exists, we let \(d_G(v,u) := \infty\). <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<figure>
<img src="/images/diagrams/distance.png" alt="distance of two nodes in a graph" />
<figcaption>A shortest path of length 2 between u and v</figcaption>
</figure>
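<p>For readers who prefer code to definitions, the distance can be computed with a breadth-first search. The following is only a sketch: the adjacency dictionary <code>adj</code> and its vertex names are made up and do not correspond to the diagram above.</p>

```python
from collections import deque
import math


def distance(adj, u, v):
    """Length of a shortest path from u to v; math.inf if no path exists."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return dist[w]
        for x in adj[w]:
            if x not in dist:          # first visit is via a shortest path
                dist[x] = dist[w] + 1
                queue.append(x)
    return math.inf


# Hypothetical graph: a path u - a - v and an isolated vertex z
adj = {"u": ["a"], "a": ["u", "v"], "v": ["a"], "z": []}
print(distance(adj, "u", "v"), distance(adj, "u", "z"))  # 2 inf
```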
<h3 id="definition-1">Definition</h3>
<p>Let \(G = (V;E)\) be an undirected graph and let \(R \subseteq V\) be a set of vertices. <em>\(R = \{r_1, \ldots, r_k \}\) resolves \(G\)</em> if every vertex \(v \in V\) is uniquely determined by its vector of distances to members of \(R\). To put it in mathematical terms: \(R\) resolves \(G\) if the function</p>
<p>\[
d_R \colon V \to [0, \infty]^k, v \mapsto (d_G(v,r_1), d_G(v,r_2), \ldots, d_G(v,r_k))
\]</p>
<p>is <a href="https://en.wikipedia.org/wiki/Injective_function">injective</a>.</p>
<p>Note that \(R = V = \{ v_1, \ldots, v_n \}\) trivially resolves \(G\) as every \(v \in V\) is uniquely determined by \(i \in \{1, \ldots, n \}\) with \(d_G(v,v_i) = 0\) since in this case we have \(v = v_i\).</p>
<p>Consider the following example:</p>
<figure>
<img src="/images/diagrams/resolving_set.png" alt="a resolving set of size 3 for H(3)" />
<figcaption>A resolving set of size 3 for H(3)</figcaption>
</figure>
<p>Here, the vertex \((0,1,1)\) is uniquely identified by having distance \(2\) to \(r_1 = (0,0,0)\), distance \(3\) to \(r_2 = (1,0,0)\) and distance \(2\) to \(r_3 = (1,1,0)\). In other words, \((0,1,1)\) is the unique vertex with distance vector \(d_R(0,1,1) = (2,3,2)\) to the highlighted resolving set.</p>
<p>I encourage the reader to check that \(R = \{ (0,0,0), (1,0,0), (1,1,0) \}\) is indeed a resolving set for \(H(3)\) to get a better feel for this rather abstract concept before moving on.</p>
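<p>If you would rather let a computer do the checking, the following sketch might help. It relies on the fact that the distance between two vertices of \(H(3)\) is the number of coordinates in which they differ; the function names <code>hamming</code> and <code>resolves</code> are my own choice for this illustration.</p>

```python
from itertools import product


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def resolves(R, n):
    """True if the distance vectors to R are pairwise distinct on all binary n-vectors."""
    vectors = [tuple(hamming(v, r) for r in R)
               for v in product((0, 1), repeat=n)]
    return len(set(vectors)) == len(vectors)


R = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
print(resolves(R, 3))                            # True
print(tuple(hamming((0, 1, 1), r) for r in R))   # (2, 3, 2)
```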
<h3 id="definition-2">Definition</h3>
<p>Let \(G = (V;E)\) be an undirected graph. The <em><a href="https://en.wikipedia.org/wiki/Metric_dimension_(graph_theory)">metric dimension</a> of \(G\)</em> is the minimum size of a set \(R \subseteq V\) that resolves \(G\).</p>
<p>Returning to our example \(H(3)\): We’ve already found a resolving set of size \(3\) and a bit of computation confirms that there is no resolving set of size \(2\). Therefore the metric dimension of \(H(3)\) is \(3\).</p>
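<p>The “bit of computation” can be sketched as a brute-force search over all candidate sets, smallest first. This is only feasible for tiny \(n\), and the helper names are hypothetical; distances are again computed as the number of differing coordinates.</p>

```python
from itertools import combinations, product


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def resolves(R, vertices):
    """True if every vertex has a distinct vector of distances to R."""
    vectors = {tuple(hamming(v, r) for r in R) for v in vertices}
    return len(vectors) == len(vertices)


def metric_dimension(n):
    """Smallest size of a resolving set among the binary n-vectors (brute force)."""
    vertices = list(product((0, 1), repeat=n))
    for size in range(1, len(vertices) + 1):
        if any(resolves(R, vertices) for R in combinations(vertices, size)):
            return size


print(metric_dimension(3))  # 3
```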
<p>It is now time to meet the hero of our advanced guessing strategy.</p>
<h3 id="definition-3">Definition</h3>
<p>The <em>\(n\)-dimensional <a href="https://en.wikipedia.org/wiki/Hamming_graph">Hamming graph</a></em> is the undirected graph \(H(n) = (V;E)\) whose vertices are all \(n\)-dimensional binary vectors \(\vec{v} = (v_1, \ldots, v_n)\) and in which two vertices \(\vec{v}, \vec{u}\) are joined by an edge, \(\{\vec{v},\vec{u}\} \in E\), iff they differ in exactly one position, i.e.</p>
<p>\[
E = \{ \{ \vec{v}, \vec{u} \} \mid \{i \mid v_i \neq u_i \} \text{ has size 1 } \}.
\]</p>
<p>If you look back to the diagram of \(H(3)\), you will see that this is indeed the \(3\)-dimensional Hamming graph. Its vertices are binary vectors of length \(3\) and they are connected via a single edge if and only if they differ in exactly one coordinate.</p>
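<p>To make the definition concrete, here is a small sketch that builds \(H(n)\) explicitly. The function name <code>hamming_graph</code> is an assumption of this example, and the construction is deliberately naive (it compares all pairs of vertices).</p>

```python
from itertools import product


def hamming_graph(n):
    """Vertices: all binary n-vectors. Edges: pairs differing in exactly one position."""
    vertices = list(product((0, 1), repeat=n))
    edges = {frozenset((v, u))
             for v in vertices for u in vertices
             if sum(a != b for a, b in zip(v, u)) == 1}
    return vertices, edges


V, E = hamming_graph(3)
print(len(V), len(E))  # 8 12
```

<p>For \(n = 3\) this yields the \(8\) corners and \(12\) edges of a cube.</p>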
<h2 id="a-graph-theoretic-guessing-strategy">A Graph Theoretic Guessing Strategy</h2>
<p>Here is our graph theoretic guessing strategy: Let \(R\) be a small resolving set for \(H(n)\). Submit each \(\vec{r} \in R\) as a guess to Kaggle which sends back its score \(s(\vec{r})\). If \(s(\vec{r}) = n\) for some \(\vec{r}\), we’ve achieved a perfect score and are done with this competition. Otherwise, since \(R\) is a resolving set, there is a unique vertex \(\vec{g}\) in \(H(n)\) such that</p>
<p>\[
d_{H(n)}(\vec{g},\vec{r}) = n - s(\vec{r})
\]</p>
<p>for all \(\vec{r} \in R\).</p>
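<p>On the other hand, by the construction of \(H(n)\), the distance between two vertices is the number of positions in which they differ, while the score \(s(\vec{r})\) counts the positions in which \(\vec{r}\) agrees with \(\vec{k}\) (Kaggle’s secret solution vector). Hence \(d_{H(n)} (\vec{k},\vec{r}) = n - s(\vec{r})\) for all \(\vec{r} \in R\).</p>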
<p>Since \(\vec{g}\) is the <em>unique</em> vector with this property, we must have that \(\vec{g} = \vec{k}\) is the desired solution. This strategy therefore guarantees a perfect score in at most \(\mathrm{size}(R) + 1\) many submissions!</p>
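<p>The whole strategy can be simulated for a toy case. The sketch below uses the resolving set for \(H(3)\) from the example above and a made-up secret; the decoding step <code>recover</code> is a plain brute-force search over all vertices, which is an assumption of this illustration and clearly not how one would proceed for \(n = 418\).</p>

```python
from itertools import product


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def recover(R, scores, n):
    """All vertices g with d(g, r) = n - s(r) for every r in R (brute force)."""
    return [g for g in product((0, 1), repeat=n)
            if all(hamming(g, r) == n - s for r, s in zip(R, scores))]


n = 3
R = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]           # resolving set for H(3)
secret = (0, 1, 1)                              # made-up hidden vector
scores = [n - hamming(secret, r) for r in R]    # what the scoreboard would report
print(scores)                                   # [1, 0, 1]
print(recover(R, scores, n))                    # [(0, 1, 1)]
```

<p>Because \(R\) is a resolving set, the returned list always contains exactly one vector.</p>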
<p>Now, because \(H(n)\) has a metric dimension of \(\frac{(2+ o(1))n}{\log_2(n)}\), we will be able to crack the Titanic dataset in \(\sim 97\) <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> submissions and furthermore, if we commit to submitting all but the final guess at once, this is known to be an optimal guessing strategy.</p>
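<p>For the record, this is how the rough estimate comes about. The sketch only evaluates the leading term \(2n / \log_2(n)\) and ignores the \(o(1)\) correction, so it should be read as a ballpark figure rather than the true metric dimension.</p>

```python
import math

n = 418
# Leading term of the asymptotic formula; the o(1) correction is ignored here
estimate = 2 * n / math.log2(n)
print(round(estimate))      # 96 guesses from the resolving set
print(round(estimate) + 1)  # 97 including the final, correct submission
```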
<h2 id="confessions">Confessions</h2>
<p>While this graph theoretic approach to beat Kaggle, from a purely mathematical point of view, is very pleasing, it doesn’t seem that practical after all. For instance, while the metric dimension of \(H(n)\) is known asymptotically, <a href="https://mathoverflow.net/questions/332434/explicit-small-resolving-sets-for-hamming-graphs">it’s not clear to me how to find small resolving sets for \(H(n)\) in practice</a>. Furthermore, even if you were given a small resolving set \(R\), calculating \(\vec{k}\) from \(\{ d_{H(n)}(\vec{k}, \vec{r}) \mid \vec{r} \in R \}\) requires some serious computational power. Granted, it doesn’t increase the number of guesses required, the metric I’ve chosen to optimize for, but it still results in so much computational overhead that the naive approach will win out.</p>
<p>What’s worse: We are not taking full advantage of Kaggle’s feedback on our guesses. When guessing the entries of our resolving set one by one, we don’t use the information gained to guide our future guesses. Instead we could just as well have submitted them all in one go, collected the scores and then computed the correct solution. If we add this further restriction, the solution presented here is optimal. However, without this artificial restriction, I suspect that a better performing, general solution should be possible and I’d very much like to find one.</p>
<p>In practice, if you want to do better than this graph theoretic guessing strategy, I’d suggest cooking up a reasonably well-performing model. You then adapt the naive approach by prioritizing those bits that your model is least certain about. This, if I had to guess, is what people actually do to obtain perfect scores. Finally, to get a perfect score with a single entry, just calculate the solution on a different account and, once you’ve obtained it, submit it on your main account.</p>
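<p>One way this adapted naive approach might look is sketched below. Everything here is hypothetical: <code>probs</code> stands for the survival probabilities predicted by some model, the scoreboard is simulated, and flipping the least certain bits first is just one plausible reading of “prioritizing”.</p>

```python
def guided_recover(oracle, probs):
    """Start from the model's prediction and flip the least certain bits first."""
    n = len(probs)
    guess = [1 if p >= 0.5 else 0 for p in probs]
    best = oracle(guess)
    queries = 1
    for i in sorted(range(n), key=lambda j: abs(probs[j] - 0.5)):
        if best == n:               # perfect score reached, stop early
            break
        guess[i] ^= 1               # flip bit i
        s = oracle(guess)
        queries += 1
        if s > best:
            best = s                # the flip fixed a mistake, keep it
        else:
            guess[i] ^= 1           # the bit was already correct, revert
    return guess, queries


secret = [1, 0, 1, 1, 0, 0]                     # made-up hidden vector
probs = [0.9, 0.1, 0.45, 0.8, 0.2, 0.55]        # made-up model output


def oracle(guess):
    return sum(g == k for g, k in zip(guess, secret))


print(guided_recover(oracle, probs))  # ([1, 0, 1, 1, 0, 0], 3)
```

<p>In this toy run the model is wrong only on its two least certain bits, so three submissions suffice instead of seven.</p>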
<p>From Kaggle’s point of view, at least in the case of the Titanic dataset, this attack vector is pretty much impossible to defend against. For larger datasets, on the other hand, there certainly is a lot they can do to prevent successful guessing strategies. That, however, is a story for another day…</p>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://mathoverflow.net/questions/58600">Math Overflow. Guessing a subset of \(\{1, \ldots, N \}\)</a></li>
<li><a href="https://epubs.siam.org/doi/pdf/10.1137/1.9781611975482.74">Jiang, Polyanskii. How to guess an \(n\)-digit number</a></li>
<li><a href="https://arxiv.org/pdf/1712.02723.pdf">Jiang, Polyanskii. On the metric dimension of Cartesian powers of a graph</a></li>
<li><a href="https://arxiv.org/pdf/math/0507527.pdf">Caceres, Hernando, Mora, Pelayo, Puertas, Seara and Wood. ON THE METRIC DIMENSION OF CARTESIAN PRODUCTS OF GRAPHS</a></li>
<li>Bernt Lindström. On a combinatory detection problem. I (Magyar Tud. Akad. Mat. Kutato Int. Közl., 9:195–207, 1964)</li>
<li><a href="https://mathoverflow.net/questions/332434/explicit-small-resolving-sets-for-hamming-graphs">Math Overflow. Explicit, small resolving sets for Hamming graphs</a></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Kaggle’s daily submission limit poses only the tiniest of hurdles to any programmer… <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>\(d_G\) is the usual <a href="https://en.wikipedia.org/wiki/Distance_(graph_theory)">graph metric</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>The exact number depends on the precise metric dimension of \(H(418)\) which I do not yet know at the time of writing this post. All I know for certain is that it is \(\le 418\). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Stefan Mesken