@dosentmatter
Last active August 22, 2020 09:12

Revisions

  1. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 39 additions and 25 deletions.
    64 changes: 39 additions & 25 deletions analysis.txt
    @@ -143,8 +143,8 @@ E[k_i] = p * (1 - p**i)
    E[k_i] = p - p**(i + 1)

    E[k] = E[k_0] + E[k_1] + ... E[k_n-1]
    - E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1))
    - E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))
    + E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1 + 1))
    + E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)

    ---
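    As a quick sanity check on the corrected sum, here is a small Monte Carlo sketch
    (my code, not part of the gist; names are illustrative):

    import random

    def simulate(n, p, trials=20_000):
        # Empirically estimate E[k] and E[C] for length-n arrays where each
        # element is independently non-zero with probability p.
        total_k = total_C = 0
        for _ in range(trials):
            arr = [1 if random.random() < p else 0 for _ in range(n)]
            total_C += arr.count(0)
            seen_zero = False
            for x in arr:
                if x == 0:
                    seen_zero = True
                elif seen_zero:
                    total_k += 1   # non-zero with at least one zero before it
        return total_k / trials, total_C / trials

    n, p = 20, 0.3
    closed_k = sum(p - p**(i + 1) for i in range(n))   # (p - p**1) + ... + (p - p**n)
    closed_C = (1 - p) * n
    print(simulate(n, p), (closed_k, closed_C))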

    @@ -168,7 +168,7 @@ So for n=0, E[k] <= E[C]
    Let's start induction from n=1,
    When n=1,
    E[C] = (1 - p) * 1 = 1 - p
    - E[k] = (p - p**0) = 0
    + E[k] = (p - p**1) = 0
    So for n=1, E[k] <= E[C]

    Inductive step:
    @@ -177,15 +177,15 @@ We have to prove P(x + 1) is true.

    P(x + 1) means
    E[k_x+1 elements] <= E[C_x+1 elements]
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * (x + 1)
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * x + (1 - p)
    + (p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * (x + 1)
    + (p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * x + (1 - p)

    The first chunk of each side is just P(x), I will put them in square brackets:

    - [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) <= [ (1 - p) * x ] + (1 - p)
    + [ (p - p**1) + (p - p**2) + ... + (p - p**x) ] + (p - p**(x + 1)) <= [ (1 - p) * x ] + (1 - p)
    The part in the square brackets
    - [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] <= [ (1 - p) * x ]
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) <= (1 - p) * x
    + [ (p - p**1) + (p - p**2) + ... + (p - p**x) ] <= [ (1 - p) * x ]
    + (p - p**1) + (p - p**2) + ... + (p - p**x) <= (1 - p) * x
    E[k_x elements] <= E[C_x elements]
    is just P(x), which we have assumed to be true.

    @@ -195,25 +195,25 @@ If a <= b is true,
    a + x <= b + y is also true if x <= y

    In our case,
    - a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) = left term inside square brackets
    - b = E[C_x elements] = (1 - p) * x = right term inside square brackets
    - x = (p - p**x) = left term outside square brackets
    + a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**x) = left terms inside square brackets
    + b = E[C_x elements] = (1 - p) * x = right terms inside square brackets
    + x = (p - p**(x + 1)) = left term outside square brackets
    y = (1 - p) = right term outside square brackets

    This leaves us with having to prove the following in order to prove P(x + 1).
    Note that this is just the expected value of the next term added when adding
    one more element to the array.
    - (p - p**x) <= (1 - p)
    - p - p**x <= 1 - p
    + (p - p**(x + 1)) <= (1 - p)
    + p - p**(x + 1) <= 1 - p

    - 2*p <= 1 + p**x
    + 2*p <= 1 + p**(x + 1)

    - Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**x is always greater than 1.
    + Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**(x + 1) is always at least 1.
    This means that the original x <= y inequality is true and P(x + 1) is true.

    Thus, P(n) is true for n being any integer >= 0.
    In other words:
    "If 0 <= p <= 0.5 an n is an integer >= 0, then E[k] <= E[C]"
    "If 0 <= p <= 0.5 and n is an integer >= 0, then E[k] <= E[C]"
    This means that for 0 <= p <= 0.5, Solution 3 is always better.

    ---
    @@ -226,26 +226,40 @@ Assumptions:

    Proposition: "If 0.5 < p <= 1, then E[k] >= E[C]"

    - We will special case p=1,
    + We will special case p=1.

    + E[C] = (1 - p) * n
    + = (1 - 1) * n
    + = (0) * n
    + = 0

    + E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)
    + = (1 - 1**1) + (1 - 1**2) + ... (1 - 1**n)
    + = (1 - 1) + (1 - 1) + ... (1 - 1)
    + = (0) + (0) + ... (0)
    + = 0

    When p=1, the array is filled with non-zeros. No work is done and E[k] = E[C],
    but E[k] >= E[C] still holds true.


    + Now let's exclude p=1 and prove the rest where 0.5 < p < 1.
    For this proposition, I will not do a formal proof. I will only analyze E[k] and E[C]
    and show that E[k] grows faster than E[C] for large n.

    Let's start with E[C]
    E[C] = (1 - p) * n
    Since 0.5 < p < 1, we know that 0 < (1 - p) < 0.5
    - As n grows large, we keep adding more (1 - p) < 0.5 terms.
    + As n grows large, we keep adding more 0 < (1 - p) < 0.5 terms.

    How about E[k]?
    - E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))
    + E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)

    - For 0.5 < p < 1
    + Keep in mind 0.5 < p < 1.
    For earlier terms such as (p - p**1) or (p - p**2), the terms will be pretty small, because
    the exponent is small, so p**i is still large and a lot gets subtracted.
    However for later terms, and when n is large, (p - p**i) can be approximated as just (p),
    - since p**i ~= 0, where i is large.
    + since p**i ~= 0, where i is large. This is because 0.5 < p < 1, so each extra factor of p decreases p**i.

    When n is large, there will be many terms that will be approximated as just p.
    As n grows large, we keep adding more approximate 0.5 < p < 1 terms.
    @@ -260,7 +274,7 @@ This means the proposition holds true for large n, since E[k] grows faster than
    Note that for low n, it's possible for the opposite to be true. E[C] might be
    greater. You can graph it out on a calculator to see visually. But for low n, optimization is unnecessary.

    - This means for 0.5 < p <= 1, Solution 2 is always better.
    + This means for large n and 0.5 < p <= 1, Solution 2 is always better.

    ---

    @@ -337,7 +351,7 @@ n = 250 elements to sample

    For large n, 250 elements isn't much. You can adjust E and d depending on how accurate you need to be.
    Keep in mind, the cutoff for determining whether you should use Solution 2 or 3 is at p = 0.5.
    - So your E should be small enough to distinguish the cutoff, but large enough so n isn't too large.
    - Set d depending on how accurate your estimate should be.
    + So your E should be small enough to distinguish the cutoff, but large enough that n isn't too large to sample.
    + Set d depending on how accurate you want your estimate to be.

    If n gets too large, you can also set a 10% array size limit to the sample size n.
  2. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion analysis.txt
    @@ -132,7 +132,9 @@ P(having at least one zero) = (1 - P(having all non-zeros))
    Since the following two are independent events, they can be factored:
    = P(non-zero at index i AND having at least one zero before index i)
    = P(non-zero at index i) * P(having at least one zero before index i)
    - Since i is indexed starting at 0, we don't need to do (i - 1):
    + Since i is indexed starting at 0, there are i elements before index i, so we can plug into the formulas above:
    + = P(non-zero) * P(having at least one zero in i elements)
    + = P(non-zero) * (1 - P(having i non-zeros))
    = p * (1 - p**i)

    Since each element can contribute at most 1 incorrect non-zero,
  3. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions analysis.txt
    @@ -130,6 +130,7 @@ P(having at least one zero) = (1 - P(having all non-zeros))
    P(incorrect non-zero at index i)
    = P(non-zero at index i with at least one zero before it)
    Since the following two are independent events, they can be factored:
    + = P(non-zero at index i AND having at least one zero before index i)
    = P(non-zero at index i) * P(having at least one zero before index i)
    Since i is indexed starting at 0, we don't need to do (i - 1):
    = p * (1 - p**i)
  4. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -41,7 +41,7 @@ knowledge will come in handy later.

    Cost of Solution 2 for the general case:
    For the first loop, we have:
    - Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves)
    + Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves ie. writes)
    since k of the elements need to be moved:
    Cost(first loop) = Cost(n iterations) + Cost(k moves)
    Cost(first loop) = n + k
  5. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion analysis.txt
    @@ -7,7 +7,7 @@ For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
    For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`

    For Solution 2, even if you do a memset, that is an O(n) because it has to write each element to 0.
    - For Solution 3, we can say that swap is a 3 cost operation including the temporary variable.
    + For Solution 3, we can say that swap is a 3 cost operation since it does 3 writes including the temporary variable.

    ---

    @@ -17,6 +17,8 @@ Assumptions:
    2. Reading is 0 cost
    3. Comparisons are 0 cost
    4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it by the same amount (ie. the number of non-zeros) anyway
    + 5. Writing is 1 cost
    + 6. Swapping costs 3 because it consists of 3 writes.

    Let n be the number of elements in the array
    Let C be the number of zeros.
  6. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -2,7 +2,7 @@ I don't think Solution 2 is always better than Solution 3

    First of all, it doesn't matter unless n is large, since both are O(n).

    - For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array:
    + For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array by doing the following:
    For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
    For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`

  7. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -16,7 +16,7 @@ Assumptions:
    1. Each loop iteration costs 1
    2. Reading is 0 cost
    3. Comparisons are 0 cost
    - 4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it anyway
    + 4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it by the same amount (ie. the number of non-zeros) anyway

    Let n be the number of elements in the array
    Let C be the number of zeros.
  8. dosentmatter revised this gist Dec 2, 2019. 1 changed file with 58 additions and 3 deletions.
    61 changes: 58 additions & 3 deletions analysis.txt
    @@ -278,6 +278,61 @@ It's not worth determining C or k because that would take O(n).
    You can determine the approximation of p by sampling a few values and averaging
    how many non-zeros you see.

    - How many values should you sample? That is a question for another day,
    - but I think there is a theorem on how many samples you need for a desired margin of error.
    - It might be the Central Limit Theorem. Not sure yet.
    + How many values should you sample? We can use Chebyshev's inequality:
    + See Chapter 8.1 of
    + https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
    + and 5a of
    + https://inst.eecs.berkeley.edu/~ee126/fa19/hw/hw5-sol.pdf

    + P(|X - u| >= E) <= Var(X) / E**2

    + Let X be a Bernoulli trial with X_i as the random variable for each coin flip.
    + There is a p chance for the coin to flip heads (non-zero element)
    + and a (1 - p) chance for tails (zero element)

    + S_n = X_0 + X_1 + ... X_n-1

    + P(|S_n / n - p| >= E) <= Var(S_n / n) / E**2
    + P(|S_n / n - p| >= E) <= Var(S_n) / (n**2 * E**2)
    + P(|S_n / n - p| >= E) <= n * Var(X_i) / (n**2 * E**2)
    + P(|S_n / n - p| >= E) <= Var(X_i) / (n * E**2)

    + Var(X_i) = E[X_i**2] - E[X_i]**2
    + = [ (1 - p) * 0**2 + p * 1**2 ] - [ (1 - p) * 0 + p * 1 ] ** 2
    + = [ p * 1**2 ] - [ p * 1 ] ** 2
    + = [ p ] - [ p ] ** 2
    + = p - p**2
    + = p * (1 - p)

    + P(|S_n / n - p| >= E) <= Var(X_i) / (n * E**2)
    + P(|S_n / n - p| >= E) <= p * (1 - p) / (n * E**2)

    + We set d = p * (1 - p) / (n * E**2)

    + n * E**2 = p * (1 - p) / d
    + n = p * (1 - p) / (d * E**2)

    + We can use this formula to determine a good value for n, the sample size.

    + E is the error threshold of the estimated probability vs the actual p:
    + |S_n / n - p| >= E

    + d is the probability of the estimated probability breaching the error threshold, E:
    + P(|S_n / n - p| >= E) <= d

    + Formula:
    + n = p * (1 - p) / (d * E**2)
    + To be more concrete, we can bound p * (1 - p) by its maximum, which occurs at p = 0.5:
    + n = 0.5 * (0.5) / (d * E**2)
    + n = 1 / (4 * d * E**2)

    + We can set E = 0.1 and d = 0.1:
    + n = 1 / (4 * 0.1 * 0.1**2)
    + n = 250 elements to sample

    + For large n, 250 elements isn't much. You can adjust E and d depending on how accurate you need to be.
    + Keep in mind, the cutoff for determining whether you should use Solution 2 or 3 is at p = 0.5.
    + So your E should be small enough to distinguish the cutoff, but large enough so n isn't too large.
    + Set d depending on how accurate your estimate should be.

    + If n gets too large, you can also set a 10% array size limit to the sample size n.
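    To make this concrete, here is a minimal Python sketch of the sampling estimate
    described above (my code, not part of the gist; the function name and the 10% cap
    are illustrative):

    import random

    def estimate_p(arr, d=0.1, error=0.1):
        # Sketch only; assumes a non-empty array.
        # Chebyshev: 1 / (4 * d * error**2) samples give
        # P(|estimate - p| >= error) <= d in the worst case p = 0.5.
        n = int(1 / (4 * d * error**2))               # 250 for d = 0.1, error = 0.1
        n = max(1, min(n, len(arr) // 10, len(arr)))  # 10% array size cap
        sample = random.sample(arr, n)
        return sum(1 for x in sample if x != 0) / n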
  9. dosentmatter revised this gist Dec 2, 2019. 1 changed file with 23 additions and 9 deletions.
    32 changes: 23 additions & 9 deletions analysis.txt
    @@ -172,25 +172,39 @@ We have to prove P(x + 1) is true.

    P(x + 1) means
    E[k_x+1 elements] <= E[C_x+1 elements]
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * (x + 1)
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * x + (1 - p)
    + (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * (x + 1)
    + (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * x + (1 - p)

    The first chunk of each side is just P(x), I will put them in square brackets:

    - [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) < [ (1 - p) * x ] + (1 - p)
    - Since we are assuming P(x) is true, the terms in the brackets can be removed
    - from both sides without affecting the inequality, since the term on the left
    - side of the inequality is less than the term on the right side according to P(x)
    + [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) <= [ (1 - p) * x ] + (1 - p)
    + The part in the square brackets
    + [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] <= [ (1 - p) * x ]
    + (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) <= (1 - p) * x
    + E[k_x elements] <= E[C_x elements]
    + is just P(x), which we have assumed to be true.

    + How can we use P(x) to prove P(x + 1)?
    + We can use this fact:
    + If a <= b is true,
    + a + x <= b + y is also true if x <= y

    + In our case,
    + a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) = left term inside square brackets
    + b = E[C_x elements] = (1 - p) * x = right term inside square brackets
    + x = (p - p**x) = left term outside square brackets
    + y = (1 - p) = right term outside square brackets

    This leaves us with having to prove the following in order to prove P(x + 1).
    Note that this is just the expected value of the next term added when adding
    one more element to the array.
    - (p - p**x) < (1 - p)
    - p - p**x < 1 - p
    + (p - p**x) <= (1 - p)
    + p - p**x <= 1 - p

    - 2*p < 1 + p**x
    + 2*p <= 1 + p**x

    Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**x is always greater than 1.
    + This means that the original x <= y inequality is true and P(x + 1) is true.

    Thus, P(n) is true for n being any integer >= 0.
    In other words:
  10. dosentmatter revised this gist Nov 27, 2019. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion analysis.txt
    @@ -264,4 +264,6 @@ It's not worth determining C or k because that would take O(n).
    You can determine the approximation of p by sampling a few values and averaging
    how many non-zeros you see.

    - How many values should you sample? That is a question for another day.
    + How many values should you sample? That is a question for another day,
    + but I think there is a theorem on how many samples you need for a desired margin of error.
    + It might be the Central Limit Theorem. Not sure yet.
  11. dosentmatter revised this gist Nov 27, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -85,7 +85,7 @@ be worth it to do an O(n) count, so let's do some probability.
    So now let's do some probability and calculate what we expect k or C to be
    given the following input:
    1. We generate an array of length n.
    - 2. Each element is generated independently with a probability P to be a non-zero, else zero.
    + 2. Each element is generated independently with a probability p to be a non-zero, else zero.

    What is the expected value of C and k? That is E[C] and E[k]?
    If E[k] <= E[C], probability would suggest Solution 3 would be better
  12. dosentmatter created this gist Nov 27, 2019.
    267 changes: 267 additions & 0 deletions analysis.txt
    @@ -0,0 +1,267 @@
    I don't think Solution 2 is always better than Solution 3

    First of all, it doesn't matter unless n is large, since both are O(n).

    For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array:
    For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
    For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`

    For Solution 2, even if you do a memset, that is an O(n) because it has to write each element to 0.
    For Solution 3, we can say that swap is a 3 cost operation including the temporary variable.
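    Since the gist doesn't include the code itself, here is my Python reconstruction of
    the two solutions with the skip checks described above (variable names follow the
    `lastNonZeroFoundAt` / `cur` convention referenced in this analysis):

    def move_zeroes_solution2(nums):
        # First loop: copy each non-zero forward to its final position.
        lastNonZeroFoundAt = 0
        for i in range(len(nums)):
            if nums[i] != 0:
                if lastNonZeroFoundAt != i:       # skip when already in place
                    nums[lastNonZeroFoundAt] = nums[i]
                lastNonZeroFoundAt += 1
        # Second loop (the "memset"): zero-fill the tail.
        for i in range(lastNonZeroFoundAt, len(nums)):
            nums[i] = 0

    def move_zeroes_solution3(nums):
        # Single loop: swap each non-zero into place.
        lastNonZeroFoundAt = 0
        for cur in range(len(nums)):
            if nums[cur] != 0:
                if lastNonZeroFoundAt != cur:     # skip when already in place
                    nums[lastNonZeroFoundAt], nums[cur] = nums[cur], nums[lastNonZeroFoundAt]
                lastNonZeroFoundAt += 1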

    ---

    Let's calculate the cost for the general case:
    Assumptions:
    1. Each loop iteration costs 1
    2. Reading is 0 cost
    3. Comparisons are 0 cost
    4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it anyway

    Let n be the number of elements in the array
    Let C be the number of zeros.

    Let k be the number of non-zeros in the incorrect location.

    To explain k:
    [1, 2, 3] - k = 0 because no zeros
    [0, 1, 2, 3] - k = 3 because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
    [0, 0, 1, 2, 3] - k = 3, because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
    [0, 1, 2, 0, 3] - k = 3, because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively

    Notice that another way to think about k is "the number of non-zeros with at
    least one zero before it". This is because if there is at least one zero
    before a non-zero, it is out of place. More than one zero before a non-zero
    does not change anything. That will only put more zeros at the end. This
    knowledge will come in handy later.
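    To make k and C concrete, here is a small helper (mine, not the gist's) that
    computes both and reproduces the examples above:

    def count_k_and_C(arr):
        C = sum(1 for x in arr if x == 0)   # number of zeros
        k = 0
        seen_zero = False
        for x in arr:
            if x == 0:
                seen_zero = True
            elif seen_zero:
                k += 1                      # non-zero with at least one zero before it
        return k, C

    assert count_k_and_C([1, 2, 3]) == (0, 0)
    assert count_k_and_C([0, 1, 2, 3]) == (3, 1)
    assert count_k_and_C([0, 0, 1, 2, 3]) == (3, 2)
    assert count_k_and_C([0, 1, 2, 0, 3]) == (3, 2)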

    ---

    Cost of Solution 2 for the general case:
    For the first loop, we have:
    Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves)
    since k of the elements need to be moved:
    Cost(first loop) = Cost(n iterations) + Cost(k moves)
    Cost(first loop) = n + k

    For the second loop (memset), we have:
    Cost(second loop) = Cost(second loop iterations) + Cost(second loop writes)
    Cost(second loop) = Cost(C iterations) + Cost(C writes)
    Cost(second loop) = C + C
    Cost(second loop) = 2*C

    So in total,
    Cost(Solution 2) = Cost(first loop) + Cost(second loop)
    Cost(Solution 2) = n + k + 2*C


    Cost of Solution 3 for the general case:
    Cost(Solution 3) = Cost(loop iterations) + Cost(loop swaps)
    Cost(Solution 3) = Cost(n iterations) + Cost(k swaps)
    Since a swap costs 3,
    Cost(Solution 3) = n + 3*k
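    The two cost formulas are easy to encode directly (a sketch of the cost model above,
    not a benchmark):

    def cost_solution_2(n, k, C):
        return (n + k) + (C + C)   # first loop: n iterations + k moves; memset: C iterations + C writes

    def cost_solution_3(n, k):
        return n + 3 * k           # n iterations + k swaps at 3 writes each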

    ---

    So let's see when Solution 3 is better than or equal to Solution 2:
    Cost(Solution 2) >= Cost(Solution 3)
    n + k + 2*C >= n + 3*k
    2*C >= 2*k
    C >= k
    k <= C

    So this means that Cost(Solution 3) <= Cost(Solution 2) when k <= C.
    k <= C means the number of non-zeros in the incorrect location is <= the number of zeros.

    It would take O(n) to determine k or C.

    For C, you do an O(n) loop to count the number of zeros.
    For k, you do an O(n) loop to find the first zero and then count the number of non-zeros found afterwards.
    Even though you know the optimal Solution to use after counting, it might not
    be worth it to do an O(n) count, so let's do some probability.

    ---

    So now let's do some probability and calculate what we expect k or C to be
    given the following input:
    1. We generate an array of length n.
    2. Each element is generated independently with a probability P to be a non-zero, else zero.

    What is the expected value of C and k? That is E[C] and E[k]?
    If E[k] <= E[C], probability would suggest Solution 3 would be better

    ---

    E[C]:
    Let C_i be a random variable on whether the element at index i is a zero.
    Because of Linearity of Expectation,
    E[C] = E[C_0] + E[C_1] + ... E[C_n-1]

    There is a (1 - p) chance to generate a zero.
    Each element can contribute at most 1 zero, with probability (1 - p), so:
    E[C_i] = (1 - p) * 1

    E[C] = (1 - p) * 1 + (1 - p) * 1 + ... } n times
    E[C] = (1 - p) * n

    ---

    E[k]:
    This one is a little trickier

    Again, we can use Linearity of Expectation,
    Let k_i be a random variable on whether the element at index i is a non-zero in the incorrect location.
    As I mentioned above, another way to think about this is,
    "a non-zero with at least one zero before it"

    E[k] = E[k_0] + E[k_1] + ... E[k_n-1]

    There is a p chance to generate a non-zero
    Each element can contribute at most 1 incorrect non-zero, but what is the
    probability of an incorrect non-zero?
    The probability of an incorrect non-zero, is the probability of generating a
    non-zero with at least one zero before it.

    Let P mean "the probability of"
    P(having at least one zero) = (1 - P(having all non-zeros))

    P(incorrect non-zero at index i)
    = P(non-zero at index i with at least one zero before it)
    Since the following two are independent events, they can be factored:
    = P(non-zero at index i) * P(having at least one zero before index i)
    Since i is indexed starting at 0, we don't need to do (i - 1):
    = p * (1 - p**i)

    Since each element can contribute at most 1 incorrect non-zero,
    E[k_i] = (p * (1 - p**i)) * 1
    E[k_i] = p * (1 - p**i)
    E[k_i] = p - p**(i + 1)

    E[k] = E[k_0] + E[k_1] + ... E[k_n-1]
    E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1))
    E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))

    ---

    I took some time to analyze this with graphs and stuff, so I will just prove some assumptions:

    p is a probability so it is in the range [0, 1]

    Proposition: "If 0 <= p <= 0.5, then E[k] <= E[C]"

    Let's prove by induction on n:
    Proof:
    Let P(n), where n is the number of elements in the array, be the statement that E[k] <= E[C]

    Base Case:
    Let's special case when n=0 since the formula for E[k] above doesn't work for n=0
    When n=0,
    E[C] = (1 - p) * 0 = 0, or you think of it as 0 because there are no zeros in an empty array
    E[k] = 0, because there are no incorrect non-zeros in an empty array
    So for n=0, E[k] <= E[C]

    Let's start induction from n=1,
    When n=1,
    E[C] = (1 - p) * 1 = 1 - p
    E[k] = (p - p**0) = 0
    So for n=1, E[k] <= E[C]

    Inductive step:
    Assume that P(x) is true for some arbitrary x >= 1
    We have to prove P(x + 1) is true.

    P(x + 1) means
    E[k_x+1 elements] < E[C_x+1 elements]
    (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * (x + 1)
    (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * x + (1 - p)

    The first chunk of each side is just P(x), I will put them in square brackets:

    [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) < [ (1 - p) * x ] + (1 - p)
    Since we are assuming P(x) is true, the terms in the brackets can be removed
    from both sides without affecting the inequality, since the term on the left
    side of the inequality is less than the term on the right side according to P(x)

    This leaves us with having to prove the following in order to prove P(x + 1).
    Note that this is just the expected value of the next term added when adding
    one more element to the array.
    (p - p**x) < (1 - p)
    p - p**x < 1 - p

    2*p < 1 + p**x

    Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**x is always greater than 1.

    Thus, P(n) is true for n being any integer >= 0.
    In other words:
    "If 0 <= p <= 0.5 an n is an integer >= 0, then E[k] <= E[C]"
    This means that for 0 <= p <= 0.5, Solution 3 is always better.
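    A quick numeric spot check of the proposition, using the corrected E[k] sum from the
    Aug 22, 2020 revision (a sanity check, not a substitute for the proof):

    for p in (0.0, 0.1, 0.25, 0.5):
        for n in range(51):
            E_k = sum(p - p**(i + 1) for i in range(n))
            E_C = (1 - p) * n
            assert E_k <= E_C + 1e-12, (p, n, E_k, E_C)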

    ---

    How about 0.5 < p <= 1?

    Assumptions:
    1. n is very large. This is an okay assumption since we only care about
    optimizing Solution 2/3 when n is large.

    Proposition: "If 0.5 < p <= 1, then E[k] >= E[C]"

    We will special case p=1,
    When p=1, the array is filled with non-zeros. No work is done and E[k] = E[C],
    but E[k] >= E[C] still holds true.

    For this proposition, I will not do a formal proof. I will only analyze E[k] and E[C]
    and show that E[k] grows faster than E[C] for large n.

    Let's start with E[C]
    E[C] = (1 - p) * n
    Since 0.5 < p < 1, we know that 0 < (1 - p) < 0.5
    As n grows large, we keep adding more (1 - p) < 0.5 terms.

    How about E[k]?
    E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))

    For 0.5 < p < 1
    For earlier terms such as (p - p**1) or (p - p**2), the terms will be pretty small, because
    the exponent is small, so p**i is still large and a lot gets subtracted.
    However for later terms, and when n is large, (p - p**i) can be approximated as just (p),
    since p**i ~= 0, where i is large.

    When n is large, there will be many terms that will be approximated as just p.
    As n grows large, we keep adding more approximate 0.5 < p < 1 terms.

    Compare the statements above about E[C] and E[k].
    For large n, E[C] keeps adding 0 < (1 - p) < 0.5 terms
    while E[k] keeps adding 0.5 < p < 1 terms

    The added E[k] terms are larger than the E[C] terms.

    This means the proposition holds true for large n, since E[k] grows faster than E[C] as you increase n.
    Note that for low n, it's possible for the opposite to be true. E[C] might be
    greater. You can graph it out on a calculator to see visually. But for low n, optimization is unnecessary.

    This means for 0.5 < p <= 1, Solution 2 is always better.
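    The same kind of spot check supports the growth argument (again using the corrected
    E[k] sum; this illustrates the claim for one large n, it doesn't prove it):

    for p in (0.6, 0.75, 0.9, 0.99):
        n = 10_000
        E_k = sum(p - p**(i + 1) for i in range(n))
        E_C = (1 - p) * n
        assert E_k > E_C, (p, E_k, E_C)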

    ---

    So what have we learned?

    Use Solution 3 for 0 <= p <= 0.5
    Use Solution 2 for 0.5 < p <= 1 (for large n)

    ---

    How do I apply this to the general case?

    In the general case, there are a lot of unknowns. If there are some knowns, you
    can use the above findings to determine the optimal algorithm.

    For the general case, you are given a large n-size array. You don't know the p
    value used to generate the array. You don't know the value of C or k.

    It's not worth determining C or k because that would take O(n).
    You can determine the approximation of p by sampling a few values and averaging
    how many non-zeros you see.

    How many values should you sample? That is a question for another day.
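    Putting the whole analysis together, a hypothetical end-to-end sketch (estimate_p,
    move_zeroes_solution2, and move_zeroes_solution3 are the sketches shown earlier;
    the 0.5 cutoff comes from the two propositions above):

    def move_zeroes(nums):
        p_hat = estimate_p(nums)          # cheap sampled estimate, not a full O(n) count
        if p_hat <= 0.5:
            move_zeroes_solution3(nums)   # few non-zeros: swaps are cheaper overall
        else:
            move_zeroes_solution2(nums)   # mostly non-zeros: compact + zero-fill wins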