@dosentmatter
Last active August 22, 2020 09:12

Revisions

  1. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 39 additions and 25 deletions.
    64 changes: 39 additions & 25 deletions analysis.txt
    @@ -143,8 +143,8 @@ E[k_i] = p * (1 - p**i)
    E[k_i] = p - p**(i + 1)

    E[k] = E[k_0] + E[k_1] + ... E[k_n-1]
    - E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1))
    - E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))
    + E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1 + 1))
    + E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)

    ---
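    As a quick sanity check on the corrected sum, here is a small Monte Carlo sketch
    (my code, not part of the gist; names are illustrative):

    import random

    def simulate(n, p, trials=20_000):
        # Empirically estimate E[k] and E[C] for length-n arrays where each
        # element is independently non-zero with probability p.
        total_k = total_C = 0
        for _ in range(trials):
            arr = [1 if random.random() < p else 0 for _ in range(n)]
            total_C += arr.count(0)
            seen_zero = False
            for x in arr:
                if x == 0:
                    seen_zero = True
                elif seen_zero:
                    total_k += 1   # non-zero with at least one zero before it
        return total_k / trials, total_C / trials

    n, p = 20, 0.3
    closed_k = sum(p - p**(i + 1) for i in range(n))   # (p - p**1) + ... + (p - p**n)
    closed_C = (1 - p) * n
    print(simulate(n, p), (closed_k, closed_C))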

    @@ -168,7 +168,7 @@ So for n=0, E[k] <= E[C]
    Let's start induction from n=1,
    When n=1,
    E[C] = (1 - p) * 1 = 1 - p
    - E[k] = (p - p**0) = 0
    + E[k] = (p - p**1) = 0
    So for n=1, E[k] <= E[C]

    Inductive step:
    @@ -177,15 +177,15 @@ We have to prove P(x + 1) is true.

    P(x + 1) means
    E[k_x+1 elements] <= E[C_x+1 elements]
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * (x + 1)
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * x + (1 - p)
    + (p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * (x + 1)
    + (p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * x + (1 - p)

    The first chunk of each side is just P(x), I will put them in square brackets:

    - [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) <= [ (1 - p) * x ] + (1 - p)
    + [ (p - p**1) + (p - p**2) + ... + (p - p**x) ] + (p - p**(x + 1)) <= [ (1 - p) * x ] + (1 - p)
    The part in the square brackets
    - [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] <= [ (1 - p) * x ]
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) <= (1 - p) * x
    + [ (p - p**1) + (p - p**2) + ... + (p - p**x) ] <= [ (1 - p) * x ]
    + (p - p**1) + (p - p**2) + ... + (p - p**x) <= (1 - p) * x
    E[k_x elements] <= E[C_x elements]
    is just P(x), which we have assumed to be true.

    @@ -195,25 +195,25 @@ If a <= b is true,
    a + x <= b + y is also true if x <= y

    In our case,
    - a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) = left term inside square brackets
    - b = E[C_x elements] = (1 - p) * x = right term inside square brackets
    - x = (p - p**x) = left term outside square brackets
    + a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**x) = left terms inside square brackets
    + b = E[C_x elements] = (1 - p) * x = right terms inside square brackets
    + x = (p - p**(x + 1)) = left term outside square brackets
    y = (1 - p) = right term outside square brackets

    This leaves us with having to prove the following in order to prove P(x + 1).
    Note that this is just the expected value of the next term added when adding
    one more element to the array.
    - (p - p**x) <= (1 - p)
    - p - p**x <= 1 - p
    + (p - p**(x + 1)) <= (1 - p)
    + p - p**(x + 1) <= 1 - p

    - 2*p <= 1 + p**x
    + 2*p <= 1 + p**(x + 1)

    - Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**x is always greater than 1.
    + Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**(x + 1) is always at least 1.
    This means that the original x <= y inequality is true and P(x + 1) is true.

    Thus, P(n) is true for n being any integer >= 0.
    In other words:
    "If 0 <= p <= 0.5 an n is an integer >= 0, then E[k] <= E[C]"
    "If 0 <= p <= 0.5 and n is an integer >= 0, then E[k] <= E[C]"
    This means that for 0 <= p <= 0.5, Solution 3 is always better.

    ---
    @@ -226,26 +226,40 @@ Assumptions:

    Proposition: "If 0.5 < p <= 1, then E[k] >= E[C]"

    - We will special case p=1,
    + We will special case p=1.

    + E[C] = (1 - p) * n
    + = (1 - 1) * n
    + = (0) * n
    + = 0

    + E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)
    + = (1 - 1**1) + (1 - 1**2) + ... (1 - 1**n)
    + = (1 - 1) + (1 - 1) + ... (1 - 1)
    + = (0) + (0) + ... (0)
    + = 0

    When p=1, the array is filled with non-zeros. No work is done and E[k] = E[C],
    but E[k] >= E[C] still holds true.


    + Now let's exclude p=1 and prove the rest where 0.5 < p < 1.
    For this proposition, I will not do a formal proof. I will only analyze E[k] and E[C]
    and show that E[k] grows faster than E[C] for large n.

    Let's start with E[C]
    E[C] = (1 - p) * n
    Since 0.5 < p < 1, we know that 0 < (1 - p) < 0.5
    - As n grows large, we keep adding more (1 - p) < 0.5 terms.
    + As n grows large, we keep adding more 0 < (1 - p) < 0.5 terms.

    How about E[k]?
    - E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))
    + E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)

    - For 0.5 < p < 1
    + Keep in mind 0.5 < p < 1.
    For earlier terms such as (p - p**1) or (p - p**2), the terms will be pretty small, because
    the exponent is small, so p**i is still large and a lot gets subtracted.
    However for later terms, and when n is large, (p - p**i) can be approximated as just (p),
    - since p**i ~= 0, where i is large.
    + since p**i ~= 0, where i is large. This is because 0.5 < p < 1, so each extra factor of p decreases p**i.

    When n is large, there will be many terms that will be approximated as just p.
    As n grows large, we keep adding more approximate 0.5 < p < 1 terms.
    @@ -260,7 +274,7 @@ This means the proposition holds true for large n, since E[k] grows faster than
    Note that for low n, it's possible for the opposite to be true. E[C] might be
    greater. You can graph it out on a calculator to see visually. But for low n, optimization is unnecessary.

    - This means for 0.5 < p <= 1, Solution 2 is always better.
    + This means for large n and 0.5 < p <= 1, Solution 2 is always better.

    ---

    @@ -337,7 +351,7 @@ n = 250 elements to sample

    For large n, 250 elements isn't much. You can adjust E and d depending on how accurate you need to be.
    Keep in mind, the cutoff for determining whether you should use Solution 2 or 3 is at p = 0.5.
    - So your E should be small enough to distinguish the cutoff, but large enough so n isn't too large.
    - Set d depending on how accurate your estimate should be.
    + So your E should be small enough to distinguish the cutoff, but large enough that n isn't too large to sample.
    + Set d depending on how accurate you want your estimate to be.

    If n gets too large, you can also set a 10% array size limit to the sample size n.
  2. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion analysis.txt
    @@ -132,7 +132,9 @@ P(having at least one zero) = (1 - P(having all non-zeros))
    Since the following two are independent events, they can be factored:
    = P(non-zero at index i AND having at least one zero before index i)
    = P(non-zero at index i) * P(having at least one zero before index i)
    - Since i is indexed starting at 0, we don't need to do (i - 1):
    + Since i is indexed starting at 0, there are i elements before index i, so we can plug into the formulas above:
    + = P(non-zero) * P(having at least one zero in i elements)
    + = P(non-zero) * (1 - P(having i non-zeros))
    = p * (1 - p**i)

    Since each element can contribute at most 1 incorrect non-zero,
  3. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions analysis.txt
    @@ -130,6 +130,7 @@ P(having at least one zero) = (1 - P(having all non-zeros))
    P(incorrect non-zero at index i)
    = P(non-zero at index i with at least one zero before it)
    Since the following two are independent events, they can be factored:
    + = P(non-zero at index i AND having at least one zero before index i)
    = P(non-zero at index i) * P(having at least one zero before index i)
    Since i is indexed starting at 0, we don't need to do (i - 1):
    = p * (1 - p**i)
  4. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -41,7 +41,7 @@ knowledge will come in handy later.

    Cost of Solution 2 for the general case:
    For the first loop, we have:
    - Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves)
    + Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves ie. writes)
    since k of the elements need to be moved:
    Cost(first loop) = Cost(n iterations) + Cost(k moves)
    Cost(first loop) = n + k
  5. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion analysis.txt
    @@ -7,7 +7,7 @@ For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
    For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`

    For Solution 2, even if you do a memset, that is an O(n) because it has to write each element to 0.
    - For Solution 3, we can say that swap is a 3 cost operation including the temporary variable.
    + For Solution 3, we can say that swap is a 3 cost operation since it does 3 writes including the temporary variable.

    ---

    @@ -17,6 +17,8 @@ Assumptions:
    2. Reading is 0 cost
    3. Comparisons are 0 cost
    4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it by the same amount (ie. the number of non-zeros) anyway
    + 5. Writing is 1 cost
    + 6. Swapping costs 3 because it consists of 3 writes.

    Let n be the number of elements in the array
    Let C be the number of zeros.
  6. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -2,7 +2,7 @@ I don't think Solution 2 is always better than Solution 3

    First of all, it doesn't matter unless n is large, since both are O(n).

    - For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array:
    + For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array by doing the following:
    For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
    For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`

  7. dosentmatter revised this gist Aug 22, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -16,7 +16,7 @@ Assumptions:
    1. Each loop iteration costs 1
    2. Reading is 0 cost
    3. Comparisons are 0 cost
    - 4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it anyway
    + 4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it by the same amount (ie. the number of non-zeros) anyway

    Let n be the number of elements in the array
    Let C be the number of zeros.
  8. dosentmatter revised this gist Dec 2, 2019. 1 changed file with 58 additions and 3 deletions.
    61 changes: 58 additions & 3 deletions analysis.txt
    @@ -278,6 +278,61 @@ It's not worth determining C or k because that would take O(n).
    You can determine the approximation of p by sampling a few values and averaging
    how many non-zeros you see.

    - How many values should you sample? That is a question for another day,
    - but I think there is a theorem on how many samples you need for a desired margin of error.
    - It might be the Central Limit Theorem. Not sure yet.
    + How many values should you sample? We can use Chebyshev's inequality:
    + See Chapter 8.1 of
    + https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
    + and 5a of
    + https://inst.eecs.berkeley.edu/~ee126/fa19/hw/hw5-sol.pdf

    + P(|X - u| >= E) <= Var(X) / E**2

    + Let X be a Bernoulli trial with X_i as the random variable for each coin flip.
    + There is a p chance for the coin to flip heads (non-zero element)
    + and a (1 - p) chance for tails (zero element)

    + S_n = X_0 + X_1 + ... X_n-1

    + P(|S_n / n - p| >= E) <= Var(S_n / n) / E**2
    + P(|S_n / n - p| >= E) <= Var(S_n) / (n**2 * E**2)
    + P(|S_n / n - p| >= E) <= n * Var(X_i) / (n**2 * E**2)
    + P(|S_n / n - p| >= E) <= Var(X_i) / (n * E**2)

    + Var(X_i) = E[X_i**2] - E[X_i]**2
    + = [ (1 - p) * 0**2 + p * 1**2 ] - [ (1 - p) * 0 + p * 1 ] ** 2
    + = [ p * 1**2 ] - [ p * 1 ] ** 2
    + = [ p ] - [ p ] ** 2
    + = p - p**2
    + = p * (1 - p)

    + P(|S_n / n - p| >= E) <= Var(X_i) / (n * E**2)
    + P(|S_n / n - p| >= E) <= p * (1 - p) / (n * E**2)

    + We set d = p * (1 - p) / (n * E**2)

    + n * E**2 = p * (1 - p) / d
    + n = p * (1 - p) / (d * E**2)

    + We can use this formula to determine a good value for n, the sample size.

    + E is the error threshold of the estimated probability vs the actual p:
    + |S_n / n - p| >= E

    + d is the probability of the estimated probability breaching the error threshold, E:
    + P(|S_n / n - p| >= E) <= d

    + Formula:
    + n = p * (1 - p) / (d * E**2)
    + To be more concrete, we can bound p * (1 - p) by its maximum, which occurs at p = 0.5:
    + n = 0.5 * (0.5) / (d * E**2)
    + n = 1 / (4 * d * E**2)

    + We can set E = 0.1 and d = 0.1:
    + n = 1 / (4 * 0.1 * 0.1**2)
    + n = 250 elements to sample

    + For large n, 250 elements isn't much. You can adjust E and d depending on how accurate you need to be.
    + Keep in mind, the cutoff for determining whether you should use Solution 2 or 3 is at p = 0.5.
    + So your E should be small enough to distinguish the cutoff, but large enough so n isn't too large.
    + Set d depending on how accurate your estimate should be.

    + If n gets too large, you can also set a 10% array size limit to the sample size n.
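    To make this concrete, here is a minimal Python sketch of the sampling estimate
    described above (my code, not part of the gist; the function name and the 10% cap
    are illustrative):

    import random

    def estimate_p(arr, d=0.1, error=0.1):
        # Sketch only; assumes a non-empty array.
        # Chebyshev: 1 / (4 * d * error**2) samples give
        # P(|estimate - p| >= error) <= d in the worst case p = 0.5.
        n = int(1 / (4 * d * error**2))               # 250 for d = 0.1, error = 0.1
        n = max(1, min(n, len(arr) // 10, len(arr)))  # 10% array size cap
        sample = random.sample(arr, n)
        return sum(1 for x in sample if x != 0) / n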
  9. dosentmatter revised this gist Dec 2, 2019. 1 changed file with 23 additions and 9 deletions.
    32 changes: 23 additions & 9 deletions analysis.txt
    @@ -172,25 +172,39 @@ We have to prove P(x + 1) is true.

    P(x + 1) means
    E[k_x+1 elements] <= E[C_x+1 elements]
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * (x + 1)
    - (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * x + (1 - p)
    + (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * (x + 1)
    + (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) <= (1 - p) * x + (1 - p)

    The first chunk of each side is just P(x), I will put them in square brackets:

    - [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) < [ (1 - p) * x ] + (1 - p)
    - Since we are assuming P(x) is true, the terms in the brackets can be removed
    - from both sides without affecting the inequality, since the term on the left
    - side of the inequality is less than the term on the right side according to P(x)
    + [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) <= [ (1 - p) * x ] + (1 - p)
    + The part in the square brackets
    + [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] <= [ (1 - p) * x ]
    + (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) <= (1 - p) * x
    + E[k_x elements] <= E[C_x elements]
    + is just P(x), which we have assumed to be true.

    + How can we use P(x) to prove P(x + 1)?
    + We can use this fact:
    + If a <= b is true,
    + a + x <= b + y is also true if x <= y

    + In our case,
    + a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) = left term inside square brackets
    + b = E[C_x elements] = (1 - p) * x = right term inside square brackets
    + x = (p - p**x) = left term outside square brackets
    + y = (1 - p) = right term outside square brackets

    This leaves us with having to prove the following in order to prove P(x + 1).
    Note that this is just the expected value of the next term added when adding
    one more element to the array.
    - (p - p**x) < (1 - p)
    - p - p**x < 1 - p
    + (p - p**x) <= (1 - p)
    + p - p**x <= 1 - p

    - 2*p < 1 + p**x
    + 2*p <= 1 + p**x

    Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**x is always greater than 1.
    + This means that the original x <= y inequality is true and P(x + 1) is true.

    Thus, P(n) is true for n being any integer >= 0.
    In other words:
  10. dosentmatter revised this gist Nov 27, 2019. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion analysis.txt
    @@ -264,4 +264,6 @@ It's not worth determining C or k because that would take O(n).
    You can determine the approximation of p by sampling a few values and averaging
    how many non-zeros you see.

    - How many values should you sample? That is a question for another day.
    + How many values should you sample? That is a question for another day,
    + but I think there is a theorem on how many samples you need for a desired margin of error.
    + It might be the Central Limit Theorem. Not sure yet.
  11. dosentmatter revised this gist Nov 27, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion analysis.txt
    @@ -85,7 +85,7 @@ be worth it to do an O(n) count, so let's do some probability.
    So now let's do some probability and calculate what we expect k or C to be
    given the following input:
    1. We generate an array of length n.
    - 2. Each element is generated independently with a probability P to be a non-zero, else zero.
    + 2. Each element is generated independently with a probability p to be a non-zero, else zero.

    What is the expected value of C and k? That is E[C] and E[k]?
    If E[k] <= E[C], probability would suggest Solution 3 would be better
  12. dosentmatter created this gist Nov 27, 2019.
    267 changes: 267 additions & 0 deletions analysis.txt
    @@ -0,0 +1,267 @@
    I don't think Solution 2 is always better than Solution 3

    First of all, it doesn't matter unless n is large, since both are O(n).

    For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array:
    For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
    For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`

    For Solution 2, even if you do a memset, that is an O(n) because it has to write each element to 0.
    For Solution 3, we can say that swap is a 3 cost operation including the temporary variable.
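    Since the gist doesn't include the code itself, here is my Python reconstruction of
    the two solutions with the skip checks described above (variable names follow the
    `lastNonZeroFoundAt` / `cur` convention referenced in this analysis):

    def move_zeroes_solution2(nums):
        # First loop: copy each non-zero forward to its final position.
        lastNonZeroFoundAt = 0
        for i in range(len(nums)):
            if nums[i] != 0:
                if lastNonZeroFoundAt != i:       # skip when already in place
                    nums[lastNonZeroFoundAt] = nums[i]
                lastNonZeroFoundAt += 1
        # Second loop (the "memset"): zero-fill the tail.
        for i in range(lastNonZeroFoundAt, len(nums)):
            nums[i] = 0

    def move_zeroes_solution3(nums):
        # Single loop: swap each non-zero into place.
        lastNonZeroFoundAt = 0
        for cur in range(len(nums)):
            if nums[cur] != 0:
                if lastNonZeroFoundAt != cur:     # skip when already in place
                    nums[lastNonZeroFoundAt], nums[cur] = nums[cur], nums[lastNonZeroFoundAt]
                lastNonZeroFoundAt += 1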

    ---

    Let's calculate the cost for the general case:
    Assumptions:
    1. Each loop iteration costs 1
    2. Reading is 0 cost
    3. Comparisons are 0 cost
    4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it anyway

    Let n be the number of elements in the array
    Let C be the number of zeros.

    Let k be the number of non-zeros in the incorrect location.

    To explain k:
    [1, 2, 3] - k = 0 because no zeros
    [0, 1, 2, 3] - k = 3 because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
    [0, 0, 1, 2, 3] - k = 3, because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
    [0, 1, 2, 0, 3] - k = 3, because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively

    Notice that another way to think about k is "the number of non-zeros with at
    least one zero before it". This is because if there is at least one zero
    before a non-zero, it is out of place. More than one zero before a non-zero
    does not change anything. That will only put more zeros at the end. This
    knowledge will come in handy later.
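    To make k and C concrete, here is a small helper (mine, not the gist's) that
    computes both and reproduces the examples above:

    def count_k_and_C(arr):
        C = sum(1 for x in arr if x == 0)   # number of zeros
        k = 0
        seen_zero = False
        for x in arr:
            if x == 0:
                seen_zero = True
            elif seen_zero:
                k += 1                      # non-zero with at least one zero before it
        return k, C

    assert count_k_and_C([1, 2, 3]) == (0, 0)
    assert count_k_and_C([0, 1, 2, 3]) == (3, 1)
    assert count_k_and_C([0, 0, 1, 2, 3]) == (3, 2)
    assert count_k_and_C([0, 1, 2, 0, 3]) == (3, 2)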

    ---

    Cost of Solution 2 for the general case:
    For the first loop, we have:
    Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves)
    since k of the elements need to be moved:
    Cost(first loop) = Cost(n iterations) + Cost(k moves)
    Cost(first loop) = n + k

    For the second loop (memset), we have:
    Cost(second loop) = Cost(second loop iterations) + Cost(second loop writes)
    Cost(second loop) = Cost(C iterations) + Cost(C writes)
    Cost(second loop) = C + C
    Cost(second loop) = 2*C

    So in total,
    Cost(Solution 2) = Cost(first loop) + Cost(second loop)
    Cost(Solution 2) = n + k + 2*C


    Cost of Solution 3 for the general case:
    Cost(Solution 3) = Cost(loop iterations) + Cost(loop swaps)
    Cost(Solution 3) = Cost(n iterations) + Cost(k swaps)
    Since a swap costs 3,
    Cost(Solution 3) = n + 3*k
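    The two cost formulas are easy to encode directly (a sketch of the cost model above,
    not a benchmark):

    def cost_solution_2(n, k, C):
        return (n + k) + (C + C)   # first loop: n iterations + k moves; memset: C iterations + C writes

    def cost_solution_3(n, k):
        return n + 3 * k           # n iterations + k swaps at 3 writes each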

    ---

    So let's see when Solution 3 is better than or equal to Solution 2:
    Cost(Solution 2) >= Cost(Solution 3)
    n + k + 2*C >= n + 3*k
    2*C >= 2*k
    C >= k
    k <= C

    So this means that Cost(Solution 3) <= Cost(Solution 2) when k <= C.
    k <= C means the number of non-zeros in the incorrect location is <= the number of zeros.

    It would take O(n) to determine k or C.

    For C, you do an O(n) loop to count the number of zeros.
    For k, you do an O(n) loop to find the first zero and then count the number of non-zeros found afterwards.
    Even though you know the optimal Solution to use after counting, it might not
    be worth it to do an O(n) count, so let's do some probability.

    ---

    So now let's do some probability and calculate what we expect k or C to be
    given the following input:
    1. We generate an array of length n.
    2. Each element is generated independently with a probability P to be a non-zero, else zero.

    What is the expected value of C and k? That is E[C] and E[k]?
    If E[k] <= E[C], probability would suggest Solution 3 would be better

    ---

    E[C]:
    Let C_i be a random variable on whether the element at index i is a zero.
    Because of Linearity of Expectation,
    E[C] = E[C_0] + E[C_1] + ... E[C_n-1]

    There is a (1 - p) chance to generate a zero.
    Each element can contribute at most 1 zero, with probability (1 - p), so:
    E[C_i] = (1 - p) * 1

    E[C] = (1 - p) * 1 + (1 - p) * 1 + ... } n times
    E[C] = (1 - p) * n

    ---

    E[k]:
    This one is a little trickier

    Again, we can use Linearity of Expectation,
    Let k_i be a random variable on whether the element at index i is a non-zero in the incorrect location.
    As I mentioned above, another way to think about this is,
    "a non-zero with at least one zero before it"

    E[k] = E[k_0] + E[k_1] + ... E[k_n-1]

    There is a p chance to generate a non-zero
    Each element can contribute at most 1 incorrect non-zero, but what is the
    probability of an incorrect non-zero?
    The probability of an incorrect non-zero, is the probability of generating a
    non-zero with at least one zero before it.

    Let P mean "the probability of"
    P(having at least one zero) = (1 - P(having all non-zeros))

    P(incorrect non-zero at index i)
    = P(non-zero at index i with at least one zero before it)
    Since the following two are independent events, they can be factored:
    = P(non-zero at index i) * P(having at least one zero before index i)
    Since i is indexed starting at 0, we don't need to do (i - 1):
    = p * (1 - p**i)

    Since each element can contribute at most 1 incorrect non-zero,
    E[k_i] = (p * (1 - p**i)) * 1
    E[k_i] = p * (1 - p**i)
    E[k_i] = p - p**(i + 1)

    E[k] = E[k_0] + E[k_1] + ... E[k_n-1]
    E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1))
    E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))

    ---

    I took some time to analyze this with graphs and stuff, so I will just prove some assumptions:

    p is a probability so it is in the range [0, 1]

    Proposition: "If 0 <= p <= 0.5, then E[k] <= E[C]"

    Let's prove by induction on n:
    Proof:
    Let P(n), where n is the number of elements in the array, be the statement that E[k] <= E[C]

    Base Case:
    Let's special case when n=0 since the formula for E[k] above doesn't work for n=0
    When n=0,
    E[C] = (1 - p) * 0 = 0, or you think of it as 0 because there are no zeros in an empty array
    E[k] = 0, because there are no incorrect non-zeros in an empty array
    So for n=0, E[k] <= E[C]

    Let's start induction from n=1,
    When n=1,
    E[C] = (1 - p) * 1 = 1 - p
    E[k] = (p - p**0) = 0
    So for n=1, E[k] <= E[C]

    Inductive step:
    Assume that P(x) is true for some arbitrary x >= 1
    We have to prove P(x + 1) is true.

    P(x + 1) means
    E[k_x+1 elements] < E[C_x+1 elements]
    (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * (x + 1)
    (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) + (p - p**x) < (1 - p) * x + (1 - p)

    The first chunk of each side is just P(x), I will put them in square brackets:

    [ (p - p**1) + (p - p**2) + ... + (p - p**(x - 1)) ] + (p - p**x) < [ (1 - p) * x ] + (1 - p)
    Since we are assuming P(x) is true, the terms in the brackets can be removed
    from both sides without affecting the inequality, since the term on the left
    side of the inequality is less than the term on the right side according to P(x)

    This leaves us with having to prove the following in order to prove P(x + 1).
    Note that this is just the expected value of the next term added when adding
    one more element to the array.
    (p - p**x) < (1 - p)
    p - p**x < 1 - p

    2*p < 1 + p**x

    Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**x is always greater than 1.

    Thus, P(n) is true for n being any integer >= 0.
    In other words:
    "If 0 <= p <= 0.5 an n is an integer >= 0, then E[k] <= E[C]"
    This means that for 0 <= p <= 0.5, Solution 3 is always better.
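    A quick numeric spot check of the proposition, using the corrected E[k] sum from the
    Aug 22, 2020 revision (a sanity check, not a substitute for the proof):

    for p in (0.0, 0.1, 0.25, 0.5):
        for n in range(51):
            E_k = sum(p - p**(i + 1) for i in range(n))
            E_C = (1 - p) * n
            assert E_k <= E_C + 1e-12, (p, n, E_k, E_C)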

    ---

    How about 0.5 < p <= 1?

    Assumptions:
    1. n is very large. This is an okay assumption since we only care about
    optimizing Solution 2/3 when n is large.

    Proposition: "If 0.5 < p <= 1, then E[k] >= E[C]"

    We will special case p=1,
    When p=1, the array is filled with non-zeros. No work is done and E[k] = E[C],
    but E[k] >= E[C] still holds true.

    For this proposition, I will not do a formal proof. I will only analyze E[k] and E[C]
    and show that E[k] grows faster than E[C] for large n.

    Let's start with E[C]
    E[C] = (1 - p) * n
    Since 0.5 < p < 1, we know that 0 < (1 - p) < 0.5
    As n grows large, we keep adding more (1 - p) < 0.5 terms.

    How about E[k]?
    E[k] = (p - p**1) + (p - p**2) + ... (p - p**(n - 1))

    For 0.5 < p < 1
    For earlier terms such as (p - p**1) or (p - p**2), the terms will be pretty small, because
    the exponent is small, so p**i is still large and a lot gets subtracted.
    However for later terms, and when n is large, (p - p**i) can be approximated as just (p),
    since p**i ~= 0, where i is large.

    When n is large, there will be many terms that will be approximated as just p.
    As n grows large, we keep adding more approximate 0.5 < p < 1 terms.

    Compare the statements above about E[C] and E[k].
    For large n, E[C] keeps adding 0 < (1 - p) < 0.5 terms
    while E[k] keeps adding 0.5 < p < 1 terms

    The added E[k] terms are larger than the E[C] terms.

    This means the proposition holds true for large n, since E[k] grows faster than E[C] as you increase n.
    Note that for low n, it's possible for the opposite to be true. E[C] might be
    greater. You can graph it out on a calculator to see visually. But for low n, optimization is unnecessary.

    This means for 0.5 < p <= 1, Solution 2 is always better.
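    The same kind of spot check supports the growth argument (again using the corrected
    E[k] sum; this illustrates the claim for one large n, it doesn't prove it):

    for p in (0.6, 0.75, 0.9, 0.99):
        n = 10_000
        E_k = sum(p - p**(i + 1) for i in range(n))
        E_C = (1 - p) * n
        assert E_k > E_C, (p, E_k, E_C)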

    ---

    So what have we learned?

    Use Solution 3 for 0 <= p <= 0.5
    Use Solution 2 for 0.5 < p <= 1 (for large n)

    ---

    How do I apply this to the general case?

    In the general case, there are a lot of unknowns. If there are some knowns, you
    can use the above findings to determine the optimal algorithm.

    For the general case, you are given a large n-size array. You don't know the p
    value used to generate the array. You don't know the value of C or k.

    It's not worth determining C or k because that would take O(n).
    You can determine the approximation of p by sampling a few values and averaging
    how many non-zeros you see.

    How many values should you sample? That is a question for another day.
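    Putting the whole analysis together, a hypothetical end-to-end sketch (estimate_p,
    move_zeroes_solution2, and move_zeroes_solution3 are the sketches shown earlier;
    the 0.5 cutoff comes from the two propositions above):

    def move_zeroes(nums):
        p_hat = estimate_p(nums)          # cheap sampled estimate, not a full O(n) count
        if p_hat <= 0.5:
            move_zeroes_solution3(nums)   # few non-zeros: swaps are cheaper overall
        else:
            move_zeroes_solution2(nums)   # mostly non-zeros: compact + zero-fill wins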