Pyzzle Time: Most Likely Sum of Dice

Introduction

Recently, my significant other approached me with a probability puzzle: Roll a die repeatedly until the sum over all rolls exceeds twelve. What sum is the most probable to occur? Below, I’ll answer the puzzle by intuition first and then verify my answer via simulation in Python.

The Intuitive Answer

Classical probability theory defines the probability of an event as the total number of its favorable outcomes divided by the number of all possible results. The most likely sum to occur should hence be the sum that can arise in most ways. If the sum over all dice rolls has to exceed twelve, then there are only six possible outcomes of this “experiment”: Anything between thirteen and eighteen is possible. Now, how should we count the number of ways each sum of dice can arise?

Since the number of dice rolled is not known beforehand, we cannot take comfort in exact equations. Regardless, we can reason our way through. To that end, notice that there is only one way to end the experiment at 18: You roll three sixes in a row. In contrast, there are at least two ways to complete the dice rolls at 17. First, the dice sum to twelve, and you get a five on your last roll. Alternatively, your dice sum to eleven and the next roll results in a six. Already, a pattern emerges: Sums closer to twelve seem to occur in more ways.

  • s18 = {(12, 6)}
  • s17 = {(12, 5); (11, 6)}
  • s16 = {(12, 4); (11, 5); (10, 6)}
  • s15 = {(12, 3); (11, 4); (10, 5); (9, 6)}
  • s14 = {(12, 2); (11, 3); (10, 4); (9, 5); (8, 6)}
  • s13 = {(12, 1); (11, 2); (10, 3); (9, 4); (8, 5); (7, 6)}

The list above shows the set of all possible outcomes and the ways to achieve them on the last roll. Thirteen should be the most probable outcome. However, we cannot yet put a probability on it. Each prior sum is itself a compound outcome: Twelve, eleven, etc., may arise in numerous ways. Hence, let’s simulate the entire process in Python and verify my intuition.

Simulation in Python

The listing below shows a simple Python program that simulates the entire process. The function throw_die() returns a random, uniformly distributed integer between 1 and 6. The function run_experiment() does the actual dice rolling and returns the required sum. Finally, main() runs the experiment 10,0000 times and prints the result to screen. When done, the program output supports my intuition.

from random import randint
from collections import Counter 

def throw_die(sides=6):
    return randint(1, sides)

def run_experiment(cut=12):
    sum_dice = 0 
    while sum_dice <= cut: 
        sum_dice += throw_die() 
    return sum_dice 

def main(): 
    NTRIALS = 100000
    results = [run_experiment() 
               for _ in range(NTRIALS)] 
    print(Counter(results))

Three variations of this program are immediately apparent. First, you could alter the number of sides of the die (six by default). Second, you could change the cut-off point (twelve by default). Finally, you could increase the number of trials. The latter increases the precision of the result. The former two options are more interesting. Playing around with them, first, proves the result independent of the size of the die, and, second, shows that the most probable outcome of this experiment is always the cut-off value + 1.

Conclusion

This post has shown how to combine intuition with simulation in Python to answer a probability puzzle. Let’s see what puzzle makes it into my inbox next.

Refactor a FizzBuzz Implementation

Introduction

Today I read an interesting post on R-Bloggers. In the post, Method Matters outlines a solution to a typical data scientist interview problem: FizzBuzz.

In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

The problem probes your knowledge of basic programming concepts. More specifically, it asks: “Are you able to identify and deal with multiple cases?” Method Matters introduces two simple solutions in R and Python. The first solution uses a for-loop. It iterates over a set of if-else-statements. I take no issue with this approach even though more elegant implementations are possible. The second solution uses a function to do the heavy lifting, which strikes me as more interesting, more powerful, and more questionable given its implementation by Method Matters.

99 problems but for() ain’t one

Method Matters defines a function which takes a numeric input to pass through the set of if-else-statements defined in the first solution. It exemplifies how working code can be recycled in more complex settings. Take a moment to study the original code, I’ll detail my reservations below.

# define the function
fizz_buzz <- function(number_sequence_f){
  if(number_sequence_f%%3 == 0 & number_sequence_f%%5 == 0) {
    print('FizzBuzz')
  }
  else if(number_sequence_f%%3 == 0) {
    print('Fizz')
  }
  else if (number_sequence_f%%5 == 0){
    print('Buzz')
  }
  else {
    print(number_sequence_f)
  }

}

# apply it to the numbers 1 to 100
sapply(seq(from = 1, to = 100, by = 1), fizz_buzz)

So what’s not to like about it? Two things bug me in particular.

  1. The function does not abstract from the particular instance of the problem. Note how values 3, 5, Fizz, Buzz, and FizzBuzz have been hard-coded into the function’s body. Given that the function recycles a for-loop this is not surprising. However, functions (should) encapsulate algorithms which solve entire classes of problems.
  2. More importantly, the function is not vectorized. In R, control statements such as if() or while() expect logical vectors of length one. Whatever condition you pass to them must evaluate to a single Boolean statement.1 Since fizz_buzz() consists entirely of such statements, it expects you to pass each sequence element individually. The concluding call to sapply() does just that. I don’t mean to offend, but sapply(seq(from = 1, to = 100, by = 1), fizz_buzz) is for() in disguise.

fizz_buzz() refactored

The refactored solution shown below generalizes from FizzBuzz, and it is vectorized. The function expects two arguments: (1) a numeric vector such as an integer sequence from 1 to 100, and (2) a “dictionary” of named values to evaluate, e.g., c("Fizz" = 3, "Buzz" = 5). The first argument is checked for multiples of each entry in the dictionary. Those TRUE/FALSE evaluations are saved in a logical matrix with rows equal to length(x) and columns equal to length(dictionary). Next, each column of that matrix is used to logically index the argument <x> and its selected elements are replaced by the corresponding name from the dictionary. Finally, the function returns a vector of strings.

replace_sequence_elements <- function(x, dictionary){
    # The function searches <x> for multiples of each element
    # in <dictionary>. Multiples are replaced by the name of
    # the corresponding dictionary element. If some element
    # of <x> is a multiple of all elements in dictionary,
    # then it will be replaced by concatenated dictionary
    # names.
    # x ... numeric vector
    # dictionary ... named, numeric vector
    # The function returns a vector of type character.
    stopifnot("names" %in% names(attributes(dictionary)))

    out <- as.character(x)
    K <- length(dictionary)
    tests <- matrix(
        FALSE, nrow = length(x), ncol = length(dictionary)
    )

    for(k in seq(K)) {  
        tests[, k] <- x %% dictionary[k] == 0
        out[tests[, k]] <- names(dictionary[k])
    }
    out[rowSums(tests) == K] <- paste(
        names(dictionary), collapse = ""
    )

    return(out)
}

Let’s see if this works:

replace_sequence_elements(seq(15), c("Fizz" = 3, "Buzz" = 5))
##  [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
##  [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
## [13] "13"       "14"       "FizzBuzz"

Looks about right.

Conclusion

This post responds to and refactors a publicly available solution of a common data scientist interview problem: FizzBuzz. The refactored solution generalizes from the original and can now take any number of cases to evaluate. Moreover, it is vectorized. That said, several pain points remain. For instance, a for-loop is still doing the heavy lifting. As the number of test cases increases this may cause performance issues, at least in R. Also, the refactored solution makes specific assumptions about the structure of its arguments. There must be no missing elements, and test cases must be passed in the form of a named vector. A different approach would be needed to avoid or at least reduce the number of such implicit assumptions. Two instructive examples may be found in the comments to Method Matters’ original post.