Skip to content

Instantly share code, notes, and snippets.

@codeslinger
Forked from O-I/weighted_random_sampling.md
Created September 8, 2017 18:58
Show Gist options
  • Select an option

  • Save codeslinger/72cfc9e0136de8519509936a6d568089 to your computer and use it in GitHub Desktop.

Select an option

Save codeslinger/72cfc9e0136de8519509936a6d568089 to your computer and use it in GitHub Desktop.

Revisions

  1. @O-I O-I revised this gist May 22, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion weighted_random_sampling.md
    Original file line number Diff line number Diff line change
    @@ -22,7 +22,7 @@ There isn't any built-in method for rolling our die several times (i.e., samplin
    5.times.map { die.sample }
    # => ["⚄", "⚂", "⚃", "⚁", "⚂"]

    10.times.map { |_, arr| arr << die.sample }
    10.times.map { die.sample }
    # => ["⚂", "⚀", "⚄", "⚁", "⚃", "⚁", "⚁", "⚁", "⚄", "⚀"]
    ```

  2. @O-I O-I revised this gist May 22, 2015. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions weighted_random_sampling.md
    Original file line number Diff line number Diff line change
    @@ -16,17 +16,17 @@ die.sample
    # => "⚁"
    ```

    There isn't any built-in method for rolling our die several times (i.e., sampling with replacement), but one way we could do it is to use `Enumerable#each_with_object`:
    There isn't any built-in method for rolling our die several times (i.e., sampling with replacement), but we can use `Enumerable#map` for this:

    ```ruby
    5.times.each_with_object([]) { |_, arr| arr << die.sample }
    5.times.map { die.sample }
    # => ["⚄", "⚂", "⚃", "⚁", "⚂"]

    10.times.each_with_object([]) { |_, arr| arr << die.sample }
    10.times.map { |_, arr| arr << die.sample }
    # => ["⚂", "⚀", "⚄", "⚁", "⚃", "⚁", "⚁", "⚁", "⚄", "⚀"]
    ```

    We can also use `each_with_object` to observe the frequency distribution of rolling our die a large number of times:
    We can also use `Enumerable#each_with_object` to observe the frequency distribution of rolling our die a large number of times:

    ```ruby
    360.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
    @@ -85,10 +85,10 @@ Weighted random sampling with replacement can be achieved like this:
    ```ruby
    wrs = -> (freq) { freq.max_by { |_, weight| rand ** (1.0 / weight) }.first }

    5.times.each_with_object([]) { |_, arr| arr << wrs[loaded_die] }
    5.times.map { wrs[loaded_die] }
    # => ["⚅", "⚅", "⚁", "⚁", "⚁"]

    10.times.each_with_object([]) { |_, arr| arr << wrs[loaded_die] }
    10.times.map { wrs[loaded_die] }
    # => ["⚀", "⚄", "⚀", "⚄", "⚅", "⚅", "⚁", "⚅", "⚁", "⚅"]
    ```

  3. @O-I O-I created this gist Apr 3, 2015.
    111 changes: 111 additions & 0 deletions weighted_random_sampling.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,111 @@
    One of the many reasons I love working with Ruby is it has a rich vocabulary that allows you to accomplish your goals with a minimal amount of code. If there isn't a method that does exactly what you want, it's usually possible to build an elegant solution yourself.

    Let's take the example of simulating the rolling of a die.

    We can represent a die as an array of its faces.

    ```ruby
    die = [*?⚀..?⚅]
    # => ["⚀", "⚁", "⚂", "⚃", "⚄", "⚅"]
    ```

    We can simulate a single roll of the die using `Array#sample`.

    ```ruby
    die.sample
    # => "⚁"
    ```

    There isn't any built-in method for rolling our die several times (i.e., sampling with replacement), but one way we could do it is to use `Enumerable#each_with_object`:

    ```ruby
    5.times.each_with_object([]) { |_, arr| arr << die.sample }
    # => ["⚄", "⚂", "⚃", "⚁", "⚂"]

    10.times.each_with_object([]) { |_, arr| arr << die.sample }
    # => ["⚂", "⚀", "⚄", "⚁", "⚃", "⚁", "⚁", "⚁", "⚄", "⚀"]
    ```

    We can also use `each_with_object` to observe the frequency distribution of rolling our die a large number of times:

    ```ruby
    360.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
    # => {"⚃"=>55, "⚄"=>56, "⚂"=>67, "⚁"=>64, "⚅"=>65, "⚀"=>53}

    3600.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
    # => {"⚂"=>612, "⚃"=>600, "⚅"=>571, "⚄"=>593, "⚁"=>652, "⚀"=>572}

    freq = 36_000.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
    # => {"⚂"=>6061, "⚀"=>6018, "⚄"=>6087, "⚃"=>5940, "⚅"=>5891, "⚁"=>6003}

    freq.values.map { |roll_count| roll_count / 36_000.0 }
    # => [0.1683611111111111, 0.16716666666666666, 0.16908333333333334, 0.165, 0.1636388888888889, 0.16675]
    ```

    Here we can see that as the number of rolls tends towards infinity, the likelihood of rolling any die face is equal. This is an example of uniform random sampling, and Ruby's `Array#sample` makes it very easy to simulate.

    But what if we want to simulate rolling a loaded die? Given an array of associated probabilities for each die face, is there an elegant way to calculate weighted random samples in Ruby?

    My inclination would be to construct the cumulative distribution function from the probabilities and work from there. But there is an even simpler way using `Enumerable#max_by` and a remarkable result due to Efraimidis and Spirakis.

    Suppose we want to simulate our loaded die with the probabilities below:

    ```ruby
    ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]

    ps.reduce(:+)
    # => 1.0
    ```

    Note that our weights here sum to 1. In the event we have raw weights, we can normalize them like so:

    ```ruby
    weights = [5, 5, 10, 10, 20, 50]

    ps = weights.map { |w| (Float w) / weights.reduce(:+) }
    # => [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]
    ```

    Now we can represent our loaded die as a hash where each face is a key whose value is the probability of rolling it.

    ```ruby
    loaded_die = die.zip(ps).to_h
    # => {"⚀"=>0.05, "⚁"=>0.05, "⚂"=>0.1, "⚃"=>0.1, "⚄"=>0.2, "⚅"=>0.5}
    ```

    A single roll of the loaded die can be simulated like this:

    ```ruby
    loaded_die.max_by { |_, weight| rand ** (1.0 / weight) }.first
    # => "⚅"
    ```

    Weighted random sampling with replacement can be achieved like this:

    ```ruby
    wrs = -> (freq) { freq.max_by { |_, weight| rand ** (1.0 / weight) }.first }

    5.times.each_with_object([]) { |_, arr| arr << wrs[loaded_die] }
    # => ["⚅", "⚅", "⚁", "⚁", "⚁"]

    10.times.each_with_object([]) { |_, arr| arr << wrs[loaded_die] }
    # => ["⚀", "⚄", "⚀", "⚄", "⚅", "⚅", "⚁", "⚅", "⚁", "⚅"]
    ```

    And to confirm that our loaded die behaves like we expect, we can inspect the frequency distribution like we did above:

    ```ruby
    freq = 100_000.times.each_with_object(Hash.new(0)) { |_, freq| freq[wrs[loaded_die]] += 1 }
    # => {"⚅"=>49980, "⚁"=>5045, "⚄"=>20171, "⚂"=>9840, "⚀"=>5031, "⚃"=>9933}

    freq.map { |face, roll_count| [face, roll_count / 100_000.0] }.to_h
    # => {"⚅"=>0.4998, "⚁"=>0.05045, "⚄"=>0.20171, "⚂"=>0.0984, "⚀"=>0.05031, "⚃"=>0.09933}
    ```

    If you found this anywhere near as intriguing as I did, I'd encourage you to peruse the links below. The paper that introduces the algorithm used is a svelte 4 pages and worth a read.

    Further reading:

    1. [Ruby-Doc](http://ruby-doc.org/core-2.2.1/Enumerable.html#method-i-max_by) for `Enumerable#max_by` — specifically the `wsample` example
    2. [Weighted Random Sampling](http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf) by Efraimidis and Spirakis (2005) which introduces the algorithm
    3. [New features for Array#sample, Array#choice](https://bugs.ruby-lang.org/issues/4247#change-25166) which mentions the intention of adding weighted random sampling to `Array#sample` and reintroducing `Array#choice` for sampling with replacement.