The
"Algorithms and Data Structure Problems" Pittsburgh group decided to study algorithms using the
book Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein (CLRS) [Edition 3].
Even though CLRS is a standard text book at many universities, it may not be appropriate given our audience. Below is my recommendation for an approach that may be appropriate for the group.
Problem Statement
We begin by explicitly stating what appears to be a simple problem: Find two integers from a list of integers that sum to a specific
value. For example, given [1, 2, 3, 4, 5, 6], which two integers sum to
11? Mathematically, this can be stated as follows
-
Given a set N = {x1, x2, ..., xn } and a constant c, find xi and xj such that
Additional requirements are as follows
-
Allow for duplication of integers. In other words, [1, 2, 2, 4, 5, 6, 7] is valid.
-
Only have to find 1 permutation of 2 integers that sum to c. Don't have to find all the permutations.
-
Can't assume that the integers are sorted.
- System does not have any memory of
what has been previously processed. For example, it cannot remember whether the last numbers processed were positive or negative or zero.
The implications of the additional requirements will be addressed
throughout this blog post. Also, to keep things simple, we are going to
state that no concurrent processing is allowed.
Note the writing style used in the problem statement. It's formal enough
that people can learn something concrete but approachable enough that
people won't skip it for fear of feeling dumb. That is why, initially, no
Big O notation is used.
Obviously, the goal is not to teach a trivial addition algorithm. The goal is
to help people start thinking about how to find and evaluate algorithms.
The addition algorithm is a nice example for learning how to approach
algorithms in general. In the particular case, the progression is as
follows
-
Brute Force: Start with nested for loops.
-
Look for Patterns: Notice that "2 + 4" as well as "4 + 2" equal 6. Therefore can reduce the computations by half.
-
Sort and Perform Binary Search
-
Loosen Requirements: Re-use of integers allows the use of hashing.
Concepts are demonstrated via code snippets. The code snippets are executed using Python 3.6.1.
Simplest Case
The simplest case assumes that the integers are sorted and the quantity
of integers is small enough that brute force is acceptable. Code
Snippet 1 satisfies the simplest case. Please note that there is nothing
wrong with brute force as long as it satisfies the requirements.
Sometimes people forget this.
If the integer list is unsorted, a simple solution is to just sort it. The consequences of sorting are addressed later in this blog. Perhaps
sorted data is unnecessary? This too will be addressed later.
Code Snippet 1
def sum_in(int_list, desired_sum):
for i, int_a in enumerate(int_list):
for j, int_b in enumerate(int_list):
if i != j:
sum_of_ints = int_a + int_b
print("int_a = {}, int_b = {}, sum_of_ints = {}".format(int_a, int_b, sum_of_ints))
if desired_sum == sum_of_ints:
return True
# test 1: Desired sum exists
print("-- Start test 1")
SOME_INTEGER_LIST = [1, 2, 3, 4, 5, 6]
DESIRED_SUM_OF_INTS = 11
if sum_in(SOME_INTEGER_LIST, DESIRED_SUM_OF_INTS) is True:
print("DESIRED_SUM_OF_INTS = {} was found".format(DESIRED_SUM_OF_INTS))
else:
print("DESIRED_SUM_OF_INTS = {} was not found".format(DESIRED_SUM_OF_INTS))
# test 2: Desired sum does not exist
print("-- Start test 2")
SOME_INTEGER_LIST = [2, 4, 8]
DESIRED_SUM_OF_INTS = 7
if sum_in(SOME_INTEGER_LIST, DESIRED_SUM_OF_INTS) is True:
print("DESIRED_SUM_OF_INTS = {} was found".format(DESIRED_SUM_OF_INTS))
else:
print("DESIRED_SUM_OF_INTS = {} was not found".format(DESIRED_SUM_OF_INTS))
Output of Code Snippet
-- Start test 1
int_a = 1, int_b = 2, sum_of_ints = 3
int_a = 1, int_b = 3, sum_of_ints = 4
int_a = 1, int_b = 4, sum_of_ints = 5
int_a = 1, int_b = 5, sum_of_ints = 6
int_a = 1, int_b = 6, sum_of_ints = 7
int_a = 2, int_b = 1, sum_of_ints = 3
int_a = 2, int_b = 3, sum_of_ints = 5
int_a = 2, int_b = 4, sum_of_ints = 6
int_a = 2, int_b = 5, sum_of_ints = 7
int_a = 2, int_b = 6, sum_of_ints = 8
...
int_a = 5, int_b = 1, sum_of_ints = 6
int_a = 5, int_b = 2, sum_of_ints = 7
int_a = 5, int_b = 3, sum_of_ints = 8
int_a = 5, int_b = 4, sum_of_ints = 9
int_a = 5, int_b = 6, sum_of_ints = 11
DESIRED_SUM_OF_INTS = 11 was found
-- Start test 2
int_a = 2, int_b = 4, sum_of_ints = 6
int_a = 2, int_b = 8, sum_of_ints = 10
int_a = 4, int_b = 2, sum_of_ints = 6
int_a = 4, int_b = 8, sum_of_ints = 12
int_a = 8, int_b = 2, sum_of_ints = 10
int_a = 8, int_b = 4, sum_of_ints = 12
DESIRED_SUM_OF_INTS = 7 was not found
You will notice that "2 + 4" as well as "4 + 2" equal 6.
int_a = 2, int_b = 4, sum_of_ints = 6
...
int_a = 4, int_b = 2, sum_of_ints = 6
...
The above corresponds to the
commutative property of addition. If we create a matrix of the possible permutations of additions for [1, 2, 3, 4, 5, 6], a symmetric matrix is created.
Notice that the diagonal of the matrix has N/A. This corresponds to the fact that i != j. Another way of stating this is that we cannot re-use integers.
Before looking at the Big O notation, let's just count all the possible
permutations of addition. Here we use the term
permutation as strictly defined in mathematics. For a discussion of the difference
between
permutations and
combinations, refer to
Combinations and Permutations at MathIsFun.Com. The example above corresponds to permutations without repetition, which means that we should use the equation below
Here n is the number of integers in the list, and r = 2 because we are
taking 2 numbers at a time. Below is a table that evaluates the equation above. It shows how quickly the number of required additions increases
as the number of integers in the list increases.
Number of Integers Number of
in the List Additions
------------------ ---------
2 2
3 6
4 ` 12
5 20
6 30
10 90
100 9,900
150 22,350
500 249,500
1,000 999,000
Now, let's look at the Big O notation. The nested for loops result in O(n
2). Clearly, this approach is not scalable.
Reduce Computations by Taking Advantage of Symmetry Arising from Commutative Property of Addition
Code Snippet 2: Reduce computations by taking advantage of symmetry
def sum_in(int_list, desired_sum):
for i in range(1, len(int_list)):
int_a = int_list[i]
desired_int = desired_sum - int_a
for j in range(0, i): # Cutting computations in half
int_b = int_list[j]
print("i = {}, j = {}".format(i, j))
print("int_a = {}, int_b = {}, Int Sum = {}".format(int_a, int_b, int_a+int_b))
if (int_b == desired_int):
return True
return False
We could try to short circuit the computations being performed in this code snippet. For example, if the desired sum is positive, both
integers are positive, and the integer sum is greater than the desired
sum, there is no reason to continue to perform computations. Below is a
counter example. The key is that the integers can be
negative, zero or positive and the system does not have any memory of
what has been processed previously. Notice that here we are using one of
the earlier-mentioned additional requirements.
Counter Example
def sum_in( [-6, -2, 0, 5, 6, 8], 2 )
i = 1, j = 0
int_a = -2, int_b = -6, Int Sum = -8
i = 2, j = 0
int_a = 0, int_b = -6, Int Sum = -6
i = 2, j = 1
int_a = 0, int_b = -2, Int Sum = -2
i = 3, j = 0
int_a = 5, int_b = -6, Int Sum = -1
i = 3, j = 1
int_a = 5, int_b = -2, Int Sum = 3
i = 3, j = 2
int_a = 5, int_b = 0, Int Sum = 5
i = 4, j = 0
int_a = 6, int_b = -6, Int Sum = 0
i = 4, j = 1
int_a = 6, int_b = -2, Int Sum = 4
i = 4, j = 2
int_a = 6, int_b = 0, Int Sum = 6
i = 4, j = 3
int_a = 6, int_b = 5, Int Sum = 11
For the above, notice the following
I) Both integers are positive
II) The integer sum of 11 is greater than the desired sum of 2
i = 5, j = 0
int_a = 8, int_b = -6, Int Sum = 2
For the above, notice that the integer sum of 2 is equal to the desired sum of 2
Even though the number of computations is cut in half by iterating
over half the matrix, the approach is still not scalable. This is
just a repetition that O(cn
2) == O(n
2) for any constant c.
Alternative Approach: Binary Search for the Desired Integer
Because the above approaches are not scalable, we need to step back and
try another approach. In this other approach, we will use the a priori
knowledge that the integers are sorted. Since we know what the desired
integer is, we can do a binary search for it over the ordered integers.
For background information about binary search, refer to the
book "Problem Solving with Algorithms and Data Structures using Python" by Brad Miller and David Ranum.
Luckily, Python has the
bisect module which can be used to search sorted lists. Code Snippet 3 uses the bisect_left method of the bisect_module.
Code Snippet 3 (Source:
Docs.Python.Org - bisect module - Searching Sorted Lists)
def index(a, x):
'Locate the leftmost value exactly equal to x'
i = bisect_left(a, x)
if i != len(a) and a[i] == x:
return i
raise ValueError
Since we are dealing with open source, we can go look at the
code for bisect_left and see that it is your standard binary search.
Code Snippet 4: bisect_left
def bisect_left(a, x, lo=0, hi=None):
"""Return the index where to insert item x in list a, assuming a is sorted.
The return value i is such that all e in a[:i] have e < x, and all e in
a[i:] have e >= x. So if x already appears in the list, a.insert(x) will
insert just before the leftmost x already there.
Optional args lo (default 0) and hi (default len(a)) bound the
slice of a to be searched.
"""
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)//2
if a[mid] < x: lo = mid+1
else: hi = mid
return lo
Putting it all together, results in the code snippet below.
Code Snippet 5
from bisect import bisect_left
def index(some_list, some_value):
'Locate the leftmost value exactly equal to some_value'
i = bisect_left(some_list, some_value)
if i != len(some_list) and some_list[i] == some_value:
return i
raise ValueError
def sum_in(int_list, desired_sum):
int_list.sort()
for i, int_a in enumerate(int_list):
desired_int = desired_sum - int_a
try:
j = index(int_list, desired_int)
if (i == j):
continue
else:
return True
except ValueError:
continue
return False
By the way, some consider the use of exceptions for flow control a bad practice. For a typical article on this topic, refer to
"Using Exception Handling for Control Flow (in Python)" by Scott Lobdell.
Now, let's talk about the number of computations involved in Code
Snippet 5. It is commonly known that a binary search is O(log n). Because we have to do the binary search once for each integer, we have to multiply by
n. This results in a performance of O(n [log n] ). For the rationale of
why the base of the log does not matter, refer to
Wikipedia: Sorting algorithm.
Significant Performance Improvement: Computations Reduced from O(n2) to O(n [log n])
Notice that in going from Code Snippet 2 to Code Snippet 5, performance went from O(n
2)
to O(n [log n]). This is a significant performance improvement.
To
visually confirm this, let's examine the graph below. Notice how the
purple line {O(n [log n]} is relatively linearly but the dark blue line
{O(n
2} just shoots straight up.
Why Not Just Use Python Built-in Functionality?
Those familiar with Python will be tempted to simplify the above code by using the
index() method of list objects, which results in the code snippet below.
Code Snippet 6: Use list.index
def sum_in(int_list, desired_sum):
for i, int_a in enumerate(int_list):
desired_int = desired_sum - int_a
try:
j = int_list.index(desired_int)
if (i == j):
continue
else:
return True
except ValueError:
continue
return False
The use of
list.index is equivalent to invoking the
in operator. Note that
in calls the
__contains__ method. This is an O(n) operation (
StackOverFlow.Com,
Complexiity at Wiki.Python.Org). Since we have to execute O(n) for each iteration of the
for statement, the overall performance is now back to O(n
2). Yucky.
One would be tempted to simplify the above code snippet by directly using the
in
operator. Unfortunately, we cannot do this, because we not only have to
know that the integer is present, but that its index differs from the current integer being processed (i != j).
Since the Python
set collection
has a O(1) performance, why not use it? Unfortunately, in Python the set
collection quietly ignores repeated integers. Recall that in the
problem statement, [1, 2, 2, 4, 5, 6, 7] was allowable. Later in the
blog, we will address this sublety.
Complicate Problem by Having Integer List Not Sorted
The complication of an unsorted integer list is easily addressed
simply by just sorting. It is commonly known that no comparison sorts can perform better than O(n [log n]) in the average or worst case. For
background information on this, refer to
Wikipedia: Sorting algorithm. Consequently, the computations required to determine if two integers in an unsorted integer list sum to a particular value is
-
n [log n] cost for sorting the unordered integer list
-
n [log n] cost for a binary search for the other integer
The total computation complexity is O(2 * n [log n] ) == O(n [log n] ).
Remove Restriction: Integers can Now be Re-Used (i = j)
We can now finally use the
in.
Code Snippet 7: Use in operator
def sum_in(int_list, desired_sum):
for i, int_a in enumerate(int_list):
desired_int = desired_sum - int_a
if desired_int in int_list:
return True
else:
continue
return False
Remember, we are now back to the O(n
2) performance. Yucky.
Re-Use of Integers Permits the Application of Set
Because the re-use of integers is allowed, we can use Python's built in
set and get a O(1) look-up performance. However, don't forget that you initially have to create the
set which involves a cost of O(n). Luckily this is just a one-time setup fee. Please note that creating a
set involves
hashing, which, depending on the nature of the data, can be problematic.
Also, by using
set, we are
implicitly removing the additional requirement that the system has no memory
because we are storing hash values. This is an example of why people
must be aware of all ramifications associated with data structures
and their associated algorithms.
Code Snippet 8: Use set
def sum_in(int_list, desired_sum):
int_set = set(int_list)
for i, int_a in enumerate(int_list):
desired_int = desired_sum - int_a
if desired_int in int_set:
return True
else:
continue
return False
The above code snippet is the verbose version of the code snippet found at the
StackOverFlow
question "How to write an algorithm to check if the sum of any two
numbers in an array/list matches a given number?"
Code Snippet 9: From StackOverFlow
def sum_in(int_list, desired_sum):
set_of_input = set(int_list) # O(n)
return any( ( desired_sum - n) in set_of_input for n in set_of_input ) # O(n)
At this point, it is important to remember that we removed a restriction
and integers can now be re-used (i=j). This means that when
"sum_in([-2, 1, 5, 6, 8], 2)" returns True, it is a correct answer
because "1 + 1 == 2." This may or may not be acceptable depending on the
situation.
Also, at this point, we need to address the distinction between
"repeated integers not allowed" and "integers can be re-used." Repeated
integers are simply not allowed because Python's
set
quietly ignores them. "Integers can be re-used" refers to
the fact that "sum_in([-2, 1, 5, 6, 8], 2)" returning True is a correct
answer because "1 + 1 == 2."
Code snippets 8 and 9 require that the entire set (hash values) is computed ahead of time. Why not create the hash value as we process the
data? This way, we might find the desired integers before we have created all
the hash values. The other advantage of creating the hash values as
we process the data is that it allows for repeated integers ([1, 2, 2, 4, 5, 6, 7]) but not the re-use of integers.
.
Code Snippet 10: Generate hash values as process data
def sum_in(int_list, desired_sum):
indexMap = {}
for i in range( len( int_list ) ):
if ( desired_sum - int_list[i] ) in indexMap:
return True
else:
indexMap[ int_list[i] ] = i
return False
Note that code snippet 10 returns False for "sum_in([-2, 1, 5, 6, 8], 2)". In other words, integers cannot be re-used (i != j).
Above material similar to exercise 2.3-7 of CLRS
Some people will recognize that the material above is similar to exercise 2.3-7 of CLRS. Below is the actual exercise. We have deviated from it for pedagogical reasons.
Alternative Solution
Zhao HG on GitBook
solves the problem in O(n log n) time using the code snippet below. Notice
its simple elegance of just walking two pointers. However, it is complex
because have 3 tests, and for each test, there is a different action.
Code Snippet 11
def two_sum(a, x):
l, r = 0, len(a)-1
while l < r:
if a[l] + a[r] == x:
return True
elif a[l] + a[r] < x:
l += 1
else:
r -= 1
return False
Notice how we have examined the algorithm from 3 different dimensions.
- The first dimension was time. How long does it take to perform the
computations?
- The second dimension was required storage. Please note
that storage is used in the generic sense. It does not just refer to
memory. For further details, please refer to the Wikipedia article "Memory Hierarchy."
- The third dimension is algorithm/software complexity. You can use the radon package
to quantify the complexity. Please note that complexity is a critical
consideration when deploying production code. For an example, refer to "Why Netflix Never Implemented The Algorithm That Won The Netflix $1 Million Challenge by Mike Masnick".
Unit Tests
Some people will have an opinion that such trivial code snippets do not
require unit tests. I strongly disagree. Unit tests are a good way to
make sure that you haven't done anything silly. For the purpose of this
blog, I have created rather extensive unit tests to prevent a lot of
back and forth when people propose adding short circuit logic to reduce
the number of computations performed. The conversations will probably
start and end with, "Did you run the proposed short circuit logic against
the unit tests?"
As for the unit tests themselves, we begin by realizing that each
integer can be negative, zero or positive and that the desired sum can
also be negative, zero or positive. This leads to 27 possible test
cases. For an explicit list of all 27 tests, click
here.
Luckly, most of them are not applicable because they can't happen
mathematically or they can't happen because the data is sorted.
In turn, each test case has sub-tests. One set of tests finds the desired
sum while another set of tests does not. Within, those sub-tests,
position matters: left edge, somewhere in the middle and right edge. To
see the tests themselves, click
here.
When all is said and done, we end up with 104 tests. This may seem
excessive, but it beats the alternative of having countless conversations
about why the computations for the nested for loops was not short
circuited.
Also, if you want to create tests to estimate run time, one could use the
timeit module. Another option is to use the
pytest-benchmark.
Subtlety: Repeated Integers Versus Re-Use of Integers
If repeated integers are allowed, you could have a list of integers like [3, 5, 1, 5, 12]. So, if repeated integers are allowed and wanted the sum to be 10, you could use the two 5's to come up with 10.
However, if repeated integers are not allowed, the previous list would become [3, 5, 1, 12]. In this case, if wanted the sum to be 10, there would be no two integers that sum to 10.
Now comes the caveat. If have [3, 5, 1, 12] and integers can be re-used, could re-use 5 a second time to have two integers that sum to 10.
Summary
This blog demonstrates two things.
- First it is critical to know the runtimes of built-in algorithms/tools of a language.
- Secondly, the material demonstrates a common process. You start out with what appears a simple problem. You try a brute force approach but quickly realize that it won't scale. You then look for patterns like symmetry to reduce the number of required computations. If you are lucky, this will be good enough. If not, an alternative approach will have to be tried.
This blog also demonstrates the key traits and skills people need
to choose an algorithm. You need some idea of the consequences of doing
things like adding a loop.
Next you have to be creative. What if
we subtract the integer that we're currently examining from the desired sum and
then use the resulting integer to do a search? Understanding
the Big O notation is a good skill and helps you understand consequences,
but you also need to understand the principles behind it.
Lastly, know
your data. Questions like "Is it sorted or unsorted?" or "Is it symmetric?"
will help us in the algorithm selection process.
Presentations
T
o download the presentation for the PGH Data Science Lightning Talks on 7/5/2017, click here.
To download the presentation for the PGHPython: Study Algorithms by Playing Pick 2 on 7/26/2017, click here. For a video recording of the presentation, click here (YouTube).