Pyzzle Time: Most Likely Sum of Dice

Introduction

Recently, my significant other approached me with a probability puzzle: Roll a die repeatedly until the sum over all rolls exceeds twelve. Which sum is most likely to occur? Below, I’ll answer the puzzle by intuition first and then verify my answer via simulation in Python.

The Intuitive Answer

Classical probability theory defines the probability of an event as the number of its favorable outcomes divided by the number of all possible outcomes. The most likely sum should hence be the one that can arise in the most ways. If the sum over all dice rolls has to exceed twelve, then there are only six possible outcomes of this “experiment”: anything between thirteen and eighteen is possible. Now, how should we count the number of ways each sum can arise?

Since the number of dice rolled is not known beforehand, we cannot take comfort in exact equations. Still, we can reason our way through. To that end, notice that there is only one way to end the experiment at 18: you roll three sixes in a row. In contrast, there are at least two ways to end at 17: either your dice sum to twelve and the last roll is a five, or they sum to eleven and the last roll is a six. Already, a pattern emerges: sums closer to twelve seem to arise in more ways.

  • s18 = {(12, 6)}
  • s17 = {(12, 5); (11, 6)}
  • s16 = {(12, 4); (11, 5); (10, 6)}
  • s15 = {(12, 3); (11, 4); (10, 5); (9, 6)}
  • s14 = {(12, 2); (11, 3); (10, 4); (9, 5); (8, 6)}
  • s13 = {(12, 1); (11, 2); (10, 3); (9, 4); (8, 5); (7, 6)}

The list above shows, for each admissible final sum, the pairs of (sum before the last roll, last roll) that produce it. Thirteen should be the most probable outcome. However, we cannot yet put a probability on it: each prior sum is itself a compound outcome, since twelve, eleven, and so on may themselves arise in numerous ways. Hence, let’s simulate the entire process in Python and verify my intuition.

Simulation in Python

The listing below shows a simple Python program that simulates the entire process. The function throw_die() returns a random, uniformly distributed integer between 1 and 6. The function run_experiment() does the actual dice rolling and returns the required sum. Finally, main() runs the experiment 100,000 times and prints the resulting counts to the screen. When run, the program output supports my intuition.

from random import randint
from collections import Counter

def throw_die(sides=6):
    """Return a uniformly distributed integer between 1 and sides."""
    return randint(1, sides)

def run_experiment(cut=12):
    """Roll the die until the running sum exceeds cut; return that sum."""
    sum_dice = 0
    while sum_dice <= cut:
        sum_dice += throw_die()
    return sum_dice

def main():
    NTRIALS = 100000
    results = [run_experiment() for _ in range(NTRIALS)]
    print(Counter(results))

if __name__ == "__main__":
    main()

Three variations of this program are immediately apparent. First, you could alter the number of sides of the die (six by default). Second, you could change the cut-off point (twelve by default). Finally, you could increase the number of trials. The last option merely increases the precision of the result; the first two are more interesting. Playing around with them shows, first, that the result does not depend on the number of sides of the die and, second, that the most probable outcome of this experiment is always the cut-off value plus one.

Conclusion

This post has shown how to combine intuition with simulation in Python to answer a probability puzzle. Let’s see what puzzle makes it into my inbox next.

Refactor a FizzBuzz Implementation

Introduction

Today I read an interesting post on R-Bloggers. In the post, Method Matters outlines a solution to a typical data scientist interview problem: FizzBuzz.

In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

The problem probes your knowledge of basic programming concepts. More specifically, it asks: “Are you able to identify and deal with multiple cases?” Method Matters introduces two simple solutions in R and Python. The first solution uses a for-loop that passes each number through a set of if-else statements. I take no issue with this approach, even though more elegant implementations are possible. The second solution uses a function to do the heavy lifting, which strikes me as more interesting, more powerful, and, given its implementation by Method Matters, more questionable.

99 problems but for() ain’t one

Method Matters defines a function which takes a single number and passes it through the set of if-else statements defined in the first solution. It exemplifies how working code can be recycled in more complex settings. Take a moment to study the original code; I’ll detail my reservations below.

# define the function
fizz_buzz <- function(number_sequence_f){
  if(number_sequence_f%%3 == 0 & number_sequence_f%%5 == 0) {
    print('FizzBuzz')
  }
  else if(number_sequence_f%%3 == 0) {
    print('Fizz')
  }
  else if (number_sequence_f%%5 == 0){
    print('Buzz')
  }
  else {
    print(number_sequence_f)
  }

}

# apply it to the numbers 1 to 100
sapply(seq(from = 1, to = 100, by = 1), fizz_buzz)

So what’s not to like about it? Two things bug me in particular.

  1. The function does not abstract from the particular instance of the problem. Note how the values 3, 5, Fizz, Buzz, and FizzBuzz have been hard-coded into the function’s body. Given that the function recycles the body of the original for-loop, this is not surprising. However, functions (should) encapsulate algorithms which solve entire classes of problems.
  2. More importantly, the function is not vectorized. In R, control statements such as if() or while() expect logical vectors of length one: whatever condition you pass to them must evaluate to a single Boolean value.1 Since fizz_buzz() consists entirely of such statements, it expects you to pass each sequence element individually, and the concluding call to sapply() does just that (see the short illustration after this list). I don’t mean to offend, but sapply(seq(from = 1, to = 100, by = 1), fizz_buzz) is for() in disguise.

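To make the point concrete, here is a minimal illustration (the numbers are arbitrary). In recent versions of R, handing if() a condition of length greater than one is an error; older versions only warned and used the first element. A vectorized construct such as ifelse() handles the whole vector at once.

x <- c(3, 5)
# if (x %% 3 == 0) print("Fizz")  # error/warning: the condition has length > 1
ifelse(x %% 3 == 0, "Fizz", x)    # vectorized: every element is checked
## [1] "Fizz" "5"
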
fizz_buzz() refactored

The refactored solution shown below generalizes from FizzBuzz, and it is vectorized. The function expects two arguments: (1) a numeric vector, such as the integer sequence from 1 to 100, and (2) a “dictionary” of named values to test against, e.g., c("Fizz" = 3, "Buzz" = 5). The first argument is checked for multiples of each entry in the dictionary. These TRUE/FALSE evaluations are stored in a logical matrix with as many rows as length(x) and as many columns as length(dictionary). Next, each column of that matrix is used to logically index a character copy of <x>, and the selected elements are replaced by the corresponding dictionary name; elements that are multiples of every dictionary entry receive the concatenated names. Finally, the function returns a vector of strings.

replace_sequence_elements <- function(x, dictionary){
    # The function searches <x> for multiples of each element
    # in <dictionary>. Multiples are replaced by the name of
    # the corresponding dictionary element. If some element
    # of <x> is a multiple of all elements in dictionary,
    # then it will be replaced by concatenated dictionary
    # names.
    # x ... numeric vector
    # dictionary ... named, numeric vector
    # The function returns a vector of type character.
    stopifnot("names" %in% names(attributes(dictionary)))

    out <- as.character(x)
    K <- length(dictionary)
    tests <- matrix(
        FALSE, nrow = length(x), ncol = length(dictionary)
    )

    for(k in seq(K)) {  
        tests[, k] <- x %% dictionary[k] == 0
        out[tests[, k]] <- names(dictionary[k])
    }
    out[rowSums(tests) == K] <- paste(
        names(dictionary), collapse = ""
    )

    return(out)
}

Let’s see if this works:

replace_sequence_elements(seq(15), c("Fizz" = 3, "Buzz" = 5))
##  [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
##  [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
## [13] "13"       "14"       "FizzBuzz"

Looks about right.
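
As a further, purely illustrative check that the function is not tied to Fizz, Buzz, 3, and 5, we can pass it a made-up dictionary; the names and values below are arbitrary.

# Multiples of 2 become "Foo", multiples of 7 become "Bar",
# and 14, a multiple of both, becomes "FooBar".
replace_sequence_elements(seq(21), c("Foo" = 2, "Bar" = 7))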

Conclusion

This post responds to and refactors a publicly available solution to a common data science interview problem: FizzBuzz. The refactored solution generalizes from the original and can take any number of cases to evaluate. Moreover, it is vectorized. That said, several pain points remain. For instance, a for-loop still does the heavy lifting; as the number of test cases grows, this may cause performance issues, at least in R. Also, the refactored solution makes specific assumptions about the structure of its arguments: there must be no missing elements, and test cases must be passed as a named vector. A different approach would be needed to avoid, or at least reduce, such implicit assumptions. Two instructive examples may be found in the comments to Method Matters’ original post.

How do you make fair admissions decisions?

Introduction

It’s that time of the academic year again: the semester has just started and every class is immediately overwhelmed by student demand. On top of that, coordination among the teaching staff could have been better. Classes cluster on Mondays and Wednesdays, limiting our students’ flexibility when building their schedules. Add some strategic behavior on the part of the students and you easily end up with fifty or more applications for a class of twenty-five to thirty students. It’s a mess. So, how do you decide who gets to participate without stepping on anyone’s toes? Put differently, how do you democratize access to scarce resources?

Hard criteria are few and far between

Students who endure hardships1 are usually admitted, but their numbers are small. Also, at the risk of stating the obvious, what counts as a hardship? Some students must take my class because their study regulations demand two seminars from the same area to conclude their electives module. They have no alternative as long as my class is the only one offered in that module. Does TINA (“there is no alternative”) count as a hardship?

What about good old “first come, first served”? In fact, several students argued for their admission based on the fact that they registered at five past midnight, right when registration opened. Fun fact: our campus management software does not tell me who signed up first. More importantly, technology can be tricky. An occasional glitch in your internet connection may easily move you from first to last on the list. In my opinion, “first come, first served” is neither fair (so, you stayed up late?) nor robust.

What if students were to rank their preferences? At Potsdam University, students may express priorities when they sign up for a class. In principle, lecturers should admit priority students first and distribute any remaining seats among everyone else. Again, there is a serious drawback: how do you select within each group? For instance, 49 students registered for my class, and every single one of them prioritized it. So, I am back to square one.

Are fair procedures the answer?

Apparently, you can’t just look at the students to make the call. Maybe there are procedures which lead to fair results? Queues, auctions, and lotteries are on the table.

Some colleagues rely on queueing. They first admit students who are late in their studies; any remaining seats are then filled with students from earlier semesters. Justifications of this method boil down to urgency: older students must transition into the labor market as soon as possible. But why should an eighth-semester student feel more pressure than a seventh-semester student? If you are unable to justify this distinction, then you won’t be able to justify any other. Moreover, I believe queueing perpetuates the problem: if all students have to wait their turn, then all of them will reach their eighth semester at some point.

Auctions would be preferable. After all, the students who are most motivated to participate in my class should place the highest bids. In an ideal world, I would have the honor of teaching a group of self-selected geeks. That is, if, and unfortunately only if, I could elicit sincere bids. I could require students to bid from their own resources and, rightly, face the wrath of the university’s ethics committee. Alternatively, I could provide them with benefits to burn and, rightly, face my wife’s wrath. In short, without institutional backing, auctions may seem ideal, but they are not a realistic option.

That leaves casting lots. Lotteries have been around for ages. In ancient Greece, for instance, many public officials were chosen by lot based on a firm belief in the equality of all citizens. Lots make decisions without consideration of person or intent. In that regard, they are uniquely fair. On the downside, students are not all equal. After all, hardships are all about significant inequalities. Lotteries discard such circumstantial information.

Moving forward

This is the upshot so far: Fair admissions cannot be based on a single criterion or procedure. Rather, some combination of criteria and procedures is required to avoid discrimination against any group of students. Here is how I tried to solve the problem.

  1. I decided to accept 35 students. That is 10 more than originally planned and puts me at 140 percent course load.
  2. To fill those 35 seats I set up a lottery which assigns a higher admission probability to older students (a sketch of such a weighted draw follows this list). After all, I do subscribe to the argument that older students experience enormous social and economic pressures. However, given my ignorance about their course of studies and given the mounting pressure from younger cohorts, age does not justify deterministic admission.2
  3. Students who claimed exceptional circumstances, e.g., small children, were additionally admitted.3
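
For the curious, the weighted draw boils down to a single call to sample() in R. The snippet below is only a sketch: the applicant data and the seniority weights are invented for illustration and are not the figures or weights I actually used.

# Illustrative only: hypothetical applicants with a seniority score
# (e.g., semesters enrolled); higher scores get a higher draw probability.
set.seed(42)
applicants <- data.frame(
    id = sprintf("student_%02d", 1:49),
    seniority = sample(1:10, size = 49, replace = TRUE)
)
# Draw 35 winners without replacement, weighted by seniority.
winners <- applicants[
    sample(nrow(applicants), size = 35, prob = applicants$seniority),
]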

In the end, 39 students were admitted to class. That’s quite a lot, but still manageable based on my past experience. Attrition will do its work and eventually reduce class size to about 30.

Conclusion

Every faculty member fights the admissions battle on her own, and no solution is ideal. By combining different admissions criteria and procedures, I strove to democratize admissions as much as reasonably possible. The result is unavoidably imperfect. However, in light of my own limited resources, I believe it constitutes a compromise everyone can live with.

Street Fighting R: Combine Small Multiples with Highlighting

Introduction

This post demonstrates how small multiples can be used to highlight different parts of a distribution. In principle, ggplot2 offers many easy-to-use options for partialling out different groups in your data; the aesthetics color, size, and shape come to mind. Likewise, the purpose of small multiples is to show the same (bivariate) association across different groups in your data. Yet either approach has drawbacks. When mapping categories to an aesthetic like color, all groups remain on the same canvas; the result may be wanting, especially when you work with big datasets. Small multiples, in contrast, bring out each group but de-emphasize the big picture. Wouldn’t it be nice to find some middle ground?

Motivation

We are going to work with the diamonds data, which come with the ggplot2 package. The goal is to highlight each cut in a scatter plot of price against carat without falling into either of the extremes mentioned above. Here is what the data look like:

rm(list = ls())
library("tidyverse")
data(diamonds)
select(diamonds, price, carat, cut)
## # A tibble: 53,940 x 3
##    price carat cut      
##    <int> <dbl> <ord>    
##  1   326 0.23  Ideal    
##  2   326 0.21  Premium  
##  3   327 0.23  Good     
##  4   334 0.290 Premium  
##  5   335 0.31  Good     
##  6   336 0.24  Very Good
##  7   336 0.24  Very Good
##  8   337 0.26  Very Good
##  9   337 0.22  Fair     
## 10   338 0.23  Very Good
## # … with 53,930 more rows

Recipe

As always, smart layering is the answer. We are going to plot the diamonds data twice: once, in a muted blue, for all diamonds in the data, and once, in red, for each facet’s cut. The code below also includes minor finishing touches (opacity and a log scale for price).

alpha_lvl <- .4
ggplot(data = diamonds, aes(x = carat, y = price)) +
    geom_point(
        data = select(diamonds, -cut),
        # Dropping <cut> prevents faceting of this layer,
        # so all diamonds appear in every panel.
        colour = "#3288bd", alpha = alpha_lvl
    ) +
    geom_point(colour = "#d53e4f", alpha = alpha_lvl) +
    scale_y_log10() +
    facet_wrap(vars(cut))

[Figure: price against carat on a log scale, faceted by cut; all diamonds shown in blue behind each panel’s cut in red.]

Conclusion

This post has demonstrated how small multiples can highlight different segments of a distribution without losing sight of its overall shape. The key is smart layering: plot the data twice, once ignoring your facets and once highlighting them.
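
The same recipe carries over to any other facet variable in the data. As a quick sketch, here is the identical layering with clarity instead of cut:

ggplot(data = diamonds, aes(x = carat, y = price)) +
    geom_point(
        data = select(diamonds, -clarity),
        # Not faceted: the full data appear in every panel.
        colour = "grey70", alpha = alpha_lvl
    ) +
    geom_point(colour = "#d53e4f", alpha = alpha_lvl) +
    scale_y_log10() +
    facet_wrap(vars(clarity))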

Street Fighting R: (Re-)Order Facets in ggplot2

Introduction

This post will show you how to order small multiples in ggplot2 by arbitrary criteria. Generating small multiples in ggplot2 is straightforward: add either facet_grid() or facet_wrap() to your code and hand it (preferably) a factor which defines the faceting groups. The order of the resulting small multiples follows the order of that factor’s levels. In other words, if you want to sort small multiples by arbitrary criteria, then you will have to reorder the underlying factor levels.

Motivation

Assume you have data measuring the impact of socioeconomic status on student success, with students nested in school districts. You want to (1) plot the relationship for each school district and (2) order the districts by the magnitude of the correlation, so that you can spot possible similarities between districts. Here is what your toy data look like:

            ses    success district_id
1196 -1.6179474  0.0188662           l
572   0.0252540 -0.3111760           f
138  -0.4647769 -0.5836298           b
883   1.5759108  1.7807955           i
1168  0.2507581 -0.4603069           l
610   1.9244564  0.8464252           g
1141 -0.8316925  1.4239423           l
1074  0.8881493  0.1626624           k
988  -1.4506733 -1.2485325           j
1148 -0.1606711  0.2567397           l

Recipe

The variable district_id identifies each school district and, consequently, the small multiples you are after. The order of its levels should match the magnitude of the within-district correlation between variables ses and success. We need to (1) compute said correlation, (2) reorder district levels by its value, and (3) plot the data.

# Step (1): Calculate the within district correlation
r_district <- vector("numeric", length(unique(ses_data[["district_id"]])))
names(r_district) <-  unique(ses_data[["district_id"]])
for(i in names(r_district)){
    rows <- which(ses_data$district_id == i)  # rows belonging to district i
    r_district[i] <- cor(ses_data[rows, "ses"], ses_data[rows, "success"])
}

# Step (2): Reorder the factor levels
levels(ses_data$district_id) # before
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
ses_data <- within(ses_data,
    district_id <- factor(district_id, names(r_district)[order(r_district)])
)
levels(ses_data$district_id) # after
##  [1] "h" "a" "f" "k" "d" "e" "i" "b" "l" "g" "j" "c"
# Step (3): Plot the data
library("ggplot2")
ggplot(data = ses_data, aes(x = ses, y = success)) +
    geom_point() + geom_smooth(method = "lm") +
    facet_wrap(vars(district_id))

[Figure: success against ses with linear fits, one panel per district, ordered by within-district correlation.]

Conclusion

In this post you have seen how small multiples in ggplot2 can be sorted by arbitrary criteria. Although the toy example is limited to facet_wrap() and a single grouping factor, the approach generalizes: no matter which faceting function you use or how many facets you define, the order of the small multiples follows the order of the underlying factor levels.
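
As a quick check of that claim with the toy data, the same reordered factor can be handed to facet_grid() instead of facet_wrap(); the panels again appear in correlation order. This is just a sketch:

# Same reordered factor, different faceting function
ggplot(data = ses_data, aes(x = ses, y = success)) +
    geom_point() + geom_smooth(method = "lm") +
    facet_grid(cols = vars(district_id))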

Toy Data Creation

# Generate toy data with multilevel structure
rm(list = ls())

# Set population level parameters
n_districts <- 12 # number of school districts
n_students <- 100 # number of students per district
rho <- .4 # average within-district correlation

# Draw district level correlation r
r <- rnorm(n_districts, psych::fisherz(rho), sd = 1.5)
r <- psych::fisherz2r(r)

# Draw district level data
ses_data <- data.frame()
for(i in r){
    Sigma <- matrix(c(1, i, i, 1), 2, 2)
    ses_data <- rbind.data.frame(ses_data, MASS::mvrnorm(n_students, c(0, 0), Sigma))
}
rm(i, Sigma)
names(ses_data) <- c("ses", "success")
ses_data[, "district_id"] <- factor(
    rep(letters[seq(n_districts)], each = n_students)
)
## END

Office Hours during Summer 2018

I will be out of office from July 30 to August 10, and from September 10 to 14. Therefore, there will be no office hours on July 30, August 6, and September 10.