Tuesday, January 16, 2018

How do I get started in data science?

A recurring question that I have run into is: How do I get started in data science?

Unfortunately, "data science" has devolved into a marketing phrase. So, I will provide a definition which will be applicable for this blog post.

Definition: Data science is the application of statistics and mathematical optimization (operations research) to real world data to make probabilistic predictions and / or minimize/maximize some business attribute.

Notice how the definition combines the following items
  1. Statistics
    1. Mathematical Optimization / Operations Research
      1. Real world data
        1. Making probabilistic predictions
          1. Minimize / maximize some business attribute
          Also, notice how the above list emphasizes the fact that knowledge of mathematics is required. The good news is that people can elect the depth to which they delve into the associated mathematics.

          One option which won't work is to take the position that mathematics is irrelevant and that all that is required is to just learn the API calls. I have personally seen several people fail who have adopted this position. The classic example is that someone tries to use linear regression and they have a lot of outliers and then wonder why things aren't working.

          The above material provides the context for "data science." Next, let's talk about how to execute on getting started in data science.

          If you mathematical background is weak, recommend the following
          1. Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith
          2. Cartoon Guide to Calculus by Larry Gonick
          3. Linear Algebra For Dummies by Mary Jane Sterling
          4. Manga Guide to Linear Algebra by Shin Takahashi, ...
          5. If the above books are of interest to you, a full list can be found in my blog post titled "Gentle Introduction to Various Math Stuff."
          Now, we can finally get to the first thing that a data scientist must know: linear regression. To systematically study linear regression, recommend the book "Regression Analysis with Python" by Luca Massaron, Alberto Boschetti [PacktPub.ComCode Download / ErrataSafariBooksOnline.ComAmazonO'Reilly]. Personally, I think that it is a good book because it processes data sets using Python to create a linear regression model. It is not just a bunch of math (NJBM). Below is paragraph taken from the book.
          1. We provide some practical examples in Python throughout the book and do not leave explanations about the various regression models at a purely theoretical level. Instead, we will explore together some example datasets, and systematically illustrate to you the commands necessary to achieve a working regression model, interpret its structure, and deploy a predicting application.

          Monday, December 4, 2017

          Article: Has Deep Learning Made Traditional Machine Learning Irrelevant? by William Vorhies

          To go to the article, click here.

          The article makes statements like
          1. ... CNNs and RNNs are very difficult to train and sometimes fail to train at all ...
            1. ... If you are building a CNN or RNN from scratch you are talking weeks or even months of development time ...
              1. ... CNNs and RNNs require extremely large amounts of labeled data on which to train which many companies find difficult or too costly to acquire ...
                1. ... two distinct data science markets that have evolved ... ‘Big Web User’ ... But upwards of 80% of the application of data science today is still in the prediction of consumer behavior ...
                  1. ... need to see these competitions [Kaggle] like Formula One racing ...
                  Long winded way of saying approach neural networks with caution.

                  Friday, September 29, 2017

                  Biking and Machine Learning

                  1. Articles
                    1.  20 Most Bike-Friendly Cities in the World, From Malmö to Montreal by Mikael Colville-Andersen - 06.14.17
                    2. Citibike Business Opportunity: Advertising by Josh Yoon - October 15, 2017
                    3. CMU student creates cool maps of Pittsburgh bike-share stats by Ryan Deto - Dec 9, 2015
                    4. Data from Public Bicycle Hire Systems by Mark Padgham - October 17, 2017
                    5. Here’s Proof That Commuter Bikes Don’t Have to Suck by Michael Calore - 08.08.17 
                    6. How bike-sharing conquered the world at Economist - 9/5/2017
                    7. Is it faster to take a bike or taxi in NYC? by David Smith - October 18, 2017
                    8. New Yorkers, municipal bikes, and the weather by David Smith - January 29, 2016
                    9. NYC Citi Bike Visualization - A Hint of Future Transportation by Summer Sun - August 26, 2017
                    10. Photo of the Week: A Dizzying View of a Bicycle Graveyard in China by Laura Mallonee - 6.30.17
                    11. Reservoir Balancing Constraint with Applications to Bike-Sharing by Joris Kinable
                      1. Integration of AI and OR Techniques in Constraint Programming - 13th International Conference, CPAIOR 2016
                    12. Tale of Twenty-Two Million Citi Bikes: Analyzing the NYC Bike Share System by Todd Schneider - January 13, 2016
                    13. Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate by Claire Vignon Keser - April 26, 2017
                    14. When are Citi Bikes Faster than Taxis in New York City? by Todd Schneider - September 26, 2017
                    15. Why Investors Are Betting That Bike Sharing Is the Next Uber by Erin Griffith - 10.16.17
                    16. With Hundreds Of Millions Of Dollars Burned, The Dockless Bike Sharing Market Is Imploding by Evgeny Tchebotarev - 12/16/2017
                  2. GitHub
                    1. DeFusco, Albert (AlbertDeFusco)
                      1. healthyride
                    2. DeMarco, Steven (stevedem)
                      1. hands-on-python
                    3. Whittaker, Tim (timsetsfire)
                      1. pgh-bike-share
                  3. Kaggle: Bike Sharing Demand
                  4. Slack Channels
                    1. PGH Data Science
                      1. #general
                      2. #pghbikedata
                  5. Web Pages
                    1. bikepgh.org
                    2. healthyridepgh.com
                    3. meetup.com - PGH Data Science
                      1. Hands-On with Python
                    4. pghbikeshare.org

                  Tuesday, July 11, 2017

                  Mapping the Machine Learning Time Feature to a Circle to Compute Distance

                  One of the standard questions in machine learning is to compute the distance between two time intervals. If did a standard time interval difference between 1:00 am and 12:59 PM, there would be a lot of minutes between the two in terms of elapsed time. However, from a human activity perspective, there isn't much of a difference between the two. They are done really early in the morning.

                  One approach to solving the problem is to map the times 00:00 to 23:59 to a circle and then compute the chord length between the two mappings. For a discussion on how to compute chord length, please refer to the Wikipedia article "Circular Segment."

                  Below is a picture of a circle with the cord length denoted by c


                  The general equation to compute the length of the chord is given by the equation below

                  For simplicity will pick R to be 1. The above equation now simplifies to the following

                  The above concepts and math was implemented in the following code using Python 3.6.

                  import math
                  def create_hours_minutes_in_a_day():
                      number_of_hours_in_day = 24
                      number_of_minutes_in_an_hour = 60
                      number_of_minutes_in_a_day = 0
                      hours_minutes_in_day = []
                      for hour in range(0, number_of_hours_in_day):
                          if hour < 10:
                              hour_to_print = '0' + str(hour)
                              hour_to_print = str(hour)
                          for minute in range(0, number_of_minutes_in_an_hour):
                              if minute < 10:
                                  minute_to_print = '0' + str(minute)
                                  minute_to_print = str(minute)
                              number_of_minutes_in_a_day += 1
                      return hours_minutes_in_day, number_of_minutes_in_a_day
                  def map_hour_minute_in_day_to_circle(hour_colon_minute_in_day, number_of_minutes):
                      number_of_degrees_in_a_circle = 360
                      increment_size_in_degrees = number_of_degrees_in_a_circle / number_of_minutes
                      current_angle_in_degrees = 0
                      map_hour_minute_to_circle = dict()
                      for x in hour_colon_minute_in_day:
                          map_hour_minute_to_circle[x] = current_angle_in_degrees
                          current_angle_in_degrees += increment_size_in_degrees
                      return map_hour_minute_to_circle
                  def compute_arc_length_between_time_interval(mapping_of_hour_minute_to_angle, time_a, time_b):
                      # Since we want the relative distance betwee two time intervals, 
                      # can set the circle to a radius = 1
                      angle_a = mapping_of_hour_minute_to_angle[time_a]
                      angle_b = mapping_of_hour_minute_to_angle[time_b]
                      chord_length = 2 * math.sin ( math.radians(angle_b - angle_a) / 2 ) 
                      print("Time A: ", time_a, "Angle A: ", angle_a)
                      print("Time b: ", time_b, "Angle A: ", angle_b)
                      print("Chord: ", chord_length)
                  hour_colon_minute_in_day, number_of_minutes = create_hours_minutes_in_a_day()
                  map_hour_minute_to_angle = map_hour_minute_in_day_to_circle(hour_colon_minute_in_day, number_of_minutes)
                  compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '06:00')
                  compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '12:00')
                  compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '18:00')
                  compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '23:59')

                  The output from the above code is below

                  Time A:  00:00 Angle A:  0
                  Time b:  06:00 Angle A:  90.0
                  Chord:  1.4142135623730951
                  Time A:  00:00 Angle A:  0
                  Time b:  12:00 Angle A:  180.0
                  Chord:  2.0
                  Time A:  00:00 Angle A:  0
                  Time b:  18:00 Angle A:  270.0
                  Chord:  1.4142135623730951
                  Time A:  00:00 Angle A:  0
                  Time b:  23:59 Angle A:  359.75
                  Chord:  0.0043633196686735124

                  Wednesday, June 14, 2017

                  Friday, March 24, 2017

                  Study Algorithms by Playing Pick 2: Find the Sum of Any 2 Integers from a List that are Equal to a Specific Value

                  The "Algorithms and Data Structure Problems" Pittsburgh group decided to study algorithms using the book Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein (CLRS) [Edition 3].

                  Even though CLRS is a standard text book at many universities, it may not be appropriate given our audience. Below is my recommendation for an approach that may be appropriate for the group.

                  Problem Statement

                  We begin by explicitly stating what appears to be a simple problem: Find two integers from a list of integers that sum to a specific value. For example, given [1, 2, 3, 4, 5, 6], which two integers sum to 11? Mathematically, this can be stated as follows
                  1. Given a set N = {x1, x2, ..., xn } and a constant c, find xi and xj such that
                    1. xi + xj = c
                    2. i != j
                  Additional requirements are as follows
                  1. Allow for duplication of integers. In other words, [1, 2, 2, 4, 5, 6, 7] is valid.
                  2. Only have to find 1 permutation of 2 integers that sum to c. Don't have to find all the permutations.
                  3. Can't assume that the integers are sorted.
                  4. System does not have any memory of what has been previously processed. For example, it cannot remember whether the last numbers processed were positive or negative or zero.
                  The implications of the additional requirements will be addressed throughout this blog post. Also, to keep things simple, we are going to state that no concurrent processing is allowed.

                  Note the writing style used in the problem statement. It's formal enough that people can learn something concrete but approachable enough that people won't skip it for fear of feeling dumb. That is why, initially, no Big O notation is used.

                  Obviously, the goal is not to teach a trivial addition algorithm. The goal is to help people start thinking about how to find and evaluate algorithms. The addition algorithm is a nice example for learning how to approach algorithms in general. In the particular case, the progression is as follows
                  1. Brute Force: Start with nested for loops.
                  2. Look for Patterns: Notice that "2 + 4" as well as "4 + 2" equal 6. Therefore can reduce the computations by half.
                  3. Sort and Perform Binary Search
                  4. Loosen Requirements: Re-use of integers allows the use of hashing.
                  Concepts are demonstrated via code snippets. The code snippets are executed using Python 3.6.1.

                  Simplest Case

                  The simplest case assumes that the integers are sorted and the quantity of integers is small enough that brute force is acceptable. Code Snippet 1 satisfies the simplest case. Please note that there is nothing wrong with brute force as long as it satisfies the requirements. Sometimes people forget this.

                  If the integer list is unsorted, a simple solution is to just sort it. The consequences of sorting are addressed later in this blog. Perhaps sorted data is unnecessary? This too will be addressed later.

                  Code Snippet 1
                  def sum_in(int_list, desired_sum):
                      for i, int_a in enumerate(int_list):
                          for j, int_b in enumerate(int_list):
                              if i != j:
                                  sum_of_ints = int_a + int_b
                                  print("int_a = {}, int_b = {}, sum_of_ints = {}".format(int_a, int_b, sum_of_ints))
                                  if desired_sum == sum_of_ints:
                                      return True
                  # test 1: Desired sum exists
                  print("-- Start test 1")
                  SOME_INTEGER_LIST = [1, 2, 3, 4, 5, 6]
                  DESIRED_SUM_OF_INTS = 11
                  if sum_in(SOME_INTEGER_LIST, DESIRED_SUM_OF_INTS) is True:
                      print("DESIRED_SUM_OF_INTS = {} was found".format(DESIRED_SUM_OF_INTS))
                      print("DESIRED_SUM_OF_INTS = {} was not found".format(DESIRED_SUM_OF_INTS))
                  # test 2: Desired sum does not exist
                  print("-- Start test 2")
                  SOME_INTEGER_LIST = [2, 4, 8]
                  DESIRED_SUM_OF_INTS = 7
                  if sum_in(SOME_INTEGER_LIST, DESIRED_SUM_OF_INTS) is True:
                      print("DESIRED_SUM_OF_INTS = {} was found".format(DESIRED_SUM_OF_INTS))
                      print("DESIRED_SUM_OF_INTS = {} was not found".format(DESIRED_SUM_OF_INTS))
                  Output of Code Snippet
                  -- Start test 1
                  int_a = 1, int_b = 2, sum_of_ints = 3
                  int_a = 1, int_b = 3, sum_of_ints = 4
                  int_a = 1, int_b = 4, sum_of_ints = 5
                  int_a = 1, int_b = 5, sum_of_ints = 6
                  int_a = 1, int_b = 6, sum_of_ints = 7
                  int_a = 2, int_b = 1, sum_of_ints = 3
                  int_a = 2, int_b = 3, sum_of_ints = 5
                  int_a = 2, int_b = 4, sum_of_ints = 6
                  int_a = 2, int_b = 5, sum_of_ints = 7
                  int_a = 2, int_b = 6, sum_of_ints = 8
                  int_a = 5, int_b = 1, sum_of_ints = 6
                  int_a = 5, int_b = 2, sum_of_ints = 7
                  int_a = 5, int_b = 3, sum_of_ints = 8
                  int_a = 5, int_b = 4, sum_of_ints = 9
                  int_a = 5, int_b = 6, sum_of_ints = 11
                  DESIRED_SUM_OF_INTS = 11 was found
                  -- Start test 2
                  int_a = 2, int_b = 4, sum_of_ints = 6
                  int_a = 2, int_b = 8, sum_of_ints = 10
                  int_a = 4, int_b = 2, sum_of_ints = 6
                  int_a = 4, int_b = 8, sum_of_ints = 12
                  int_a = 8, int_b = 2, sum_of_ints = 10
                  int_a = 8, int_b = 4, sum_of_ints = 12
                  DESIRED_SUM_OF_INTS = 7 was not found
                  You will notice that "2 + 4" as well as "4 + 2" equal 6.
                  int_a = 2, int_b = 4, sum_of_ints = 6
                  int_a = 4, int_b = 2, sum_of_ints = 6
                  The above corresponds to the commutative property of addition. If we create a matrix of the possible permutations of additions for [1, 2, 3, 4, 5, 6], a symmetric matrix is created.

                  Notice that the diagonal of the matrix has N/A. This corresponds to the fact that i != j. Another way of stating this is that we cannot re-use integers.

                  Before looking at the Big O notation, let's just count all the possible permutations of addition. Here we use the term permutation as strictly defined in mathematics. For a discussion of the difference between permutations and combinations, refer to Combinations and Permutations at MathIsFun.Com. The example above corresponds to permutations without repetition, which means that we should use the equation below
                  Here n is the number of integers in the list, and r = 2 because we are taking 2 numbers at a time. Below is a table that evaluates the equation above. It shows how quickly the number of required additions increases as the number of integers in the list increases.
                  Number of Integers  Number of
                  in the List         Additions
                  ------------------  ---------
                      2                     2
                      3                     6               
                      4   `                12
                      5                    20
                      6                    30
                     10                    90
                    100                 9,900
                    150                22,350
                    500               249,500
                  1,000               999,000
                  Now, let's look at the Big O notation. The nested for loops result in O(n2). Clearly, this approach is not scalable.

                  Reduce Computations by Taking Advantage of Symmetry Arising from Commutative Property of Addition

                  Code Snippet 2: Reduce computations by taking advantage of symmetry
                  def sum_in(int_list, desired_sum):
                      for i in range(1, len(int_list)):
                          int_a = int_list[i]
                          desired_int = desired_sum - int_a
                          for j in range(0, i):    # Cutting computations in half
                              int_b = int_list[j]
                              print("i = {}, j = {}".format(i, j))
                              print("int_a = {}, int_b = {}, Int Sum = {}".format(int_a, int_b, int_a+int_b))
                              if (int_b == desired_int):
                                  return True
                      return False
                  We could try to short circuit the computations being performed in this code snippet. For example, if the desired sum is positive, both integers are positive, and the integer sum is greater than the desired sum, there is no reason to continue to perform computations. Below is a counter example. The key is that the integers can be negative, zero or positive and the system does not have any memory of what has been processed previously. Notice that here we are using one of the earlier-mentioned additional requirements.

                  Counter Example
                  def sum_in( [-6, -2, 0, 5, 6, 8], 2 )
                  i = 1, j = 0
                  int_a = -2, int_b = -6, Int Sum = -8
                  i = 2, j = 0
                  int_a = 0, int_b = -6, Int Sum = -6
                  i = 2, j = 1
                  int_a = 0, int_b = -2, Int Sum = -2
                  i = 3, j = 0
                  int_a = 5, int_b = -6, Int Sum = -1
                  i = 3, j = 1
                  int_a = 5, int_b = -2, Int Sum = 3
                  i = 3, j = 2
                  int_a = 5, int_b = 0, Int Sum = 5
                  i = 4, j = 0
                  int_a = 6, int_b = -6, Int Sum = 0
                  i = 4, j = 1
                  int_a = 6, int_b = -2, Int Sum = 4
                  i = 4, j = 2
                  int_a = 6, int_b = 0, Int Sum = 6
                  i = 4, j = 3
                  int_a = 6, int_b = 5, Int Sum = 11
                  For the above, notice the following
                     I) Both integers are positive
                     II) The integer sum of 11 is greater than the desired sum of 2
                  i = 5, j = 0
                  int_a = 8, int_b = -6, Int Sum = 2
                  For the above, notice that the integer sum of 2 is equal to the desired sum of 2
                  Even though the number of computations is cut in half by iterating over half the matrix, the approach is still not scalable. This is just a repetition that O(cn2) == O(n2) for any constant c.

                  Alternative Approach: Binary Search for the Desired Integer

                  Because the above approaches are not scalable, we need to step back and try another approach. In this other approach, we will use the a priori knowledge that the integers are sorted. Since we know what the desired integer is, we can do a binary search for it over the ordered integers. For background information about binary search, refer to the book "Problem Solving with Algorithms and Data Structures using Python" by Brad Miller and David Ranum.

                  Luckily, Python has the bisect module which can be used to search sorted lists. Code Snippet 3 uses the bisect_left method of the bisect_module.

                  Code Snippet 3 (Source: Docs.Python.Org - bisect module - Searching Sorted Lists)
                  def index(a, x):
                      'Locate the leftmost value exactly equal to x'
                      i = bisect_left(a, x)
                      if i != len(a) and a[i] == x:
                         return i
                      raise ValueError
                  Since we are dealing with open source, we can go look at the code for bisect_left and see that it is your standard binary search.

                  Code Snippet 4: bisect_left
                  def bisect_left(a, x, lo=0, hi=None):
                      """Return the index where to insert item x in list a, assuming a is sorted.
                      The return value i is such that all e in a[:i] have e < x, and all e in
                      a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
                      insert just before the leftmost x already there.
                      Optional args lo (default 0) and hi (default len(a)) bound the
                      slice of a to be searched.
                      if lo < 0:
                          raise ValueError('lo must be non-negative')
                      if hi is None:
                          hi = len(a)
                      while lo < hi:
                          mid = (lo+hi)//2
                          if a[mid] < x: lo = mid+1
                          else: hi = mid
                      return lo
                  Putting it all together, results in the code snippet below.

                  Code Snippet 5
                  from bisect import bisect_left
                  def index(some_list, some_value):
                      'Locate the leftmost value exactly equal to some_value'
                      i = bisect_left(some_list, some_value)
                      if i != len(some_list) and some_list[i] == some_value:
                          return i
                      raise ValueError
                  def sum_in(int_list, desired_sum):
                      for i, int_a in enumerate(int_list):
                          desired_int = desired_sum - int_a
                              j = index(int_list, desired_int)
                              if (i == j):
                                  return True
                          except ValueError:
                      return False
                  By the way, some consider the use of exceptions for flow control a bad practice. For a typical article on this topic, refer to "Using Exception Handling for Control Flow (in Python)" by Scott Lobdell.

                  Now, let's talk about the number of computations involved in Code Snippet 5. It is commonly known that a binary search is O(log n). Because we have to do the binary search once for each integer, we have to multiply by n. This results in a performance of O(n [log n] ). For the rationale of why the base of the log does not matter, refer to Wikipedia: Sorting algorithm.

                  Significant Performance Improvement: Computations Reduced from O(n2) to O(n [log n])

                  Notice that in going from Code Snippet 2 to Code Snippet 5, performance went from O(n2) to O(n [log n]). This is a significant performance improvement.

                  To visually confirm this, let's examine the graph below. Notice how the purple line {O(n [log n]} is relatively linearly but the dark blue line {O(n2} just shoots straight up.

                  Why Not Just Use Python Built-in Functionality?

                  Those familiar with Python will be tempted to simplify the above code by using the index() method of list objects, which results in the code snippet below.

                  Code Snippet 6: Use list.index
                  def sum_in(int_list, desired_sum):
                      for i, int_a in enumerate(int_list):
                          desired_int = desired_sum - int_a
                              j = int_list.index(desired_int)
                              if (i == j):
                                  return True
                          except ValueError:
                      return False
                  The use of list.index is equivalent to invoking the in operator. Note that in calls the __contains__ method. This is an O(n) operation (StackOverFlow.Com, Complexiity at Wiki.Python.Org). Since we have to execute O(n) for each iteration of the for statement, the overall performance is now back to O(n2). Yucky.

                  One would be tempted to simplify the above code snippet by directly using the in operator. Unfortunately, we cannot do this, because we not only have to know that the integer is present, but that its index differs from the current integer being processed (i != j).

                  Since the Python set collection has a O(1) performance, why not use it? Unfortunately, in Python the set collection quietly ignores repeated integers. Recall that in the problem statement, [1, 2, 2, 4, 5, 6, 7] was allowable. Later in the blog, we will address this sublety.

                  Complicate Problem by Having Integer List Not Sorted

                  The complication of an unsorted integer list is easily addressed simply by just sorting. It is commonly known that no comparison sorts can perform better than O(n [log n]) in the average or worst case. For background information on this, refer to Wikipedia: Sorting algorithm. Consequently, the computations required to determine if two integers in an unsorted integer list sum to a particular value is
                  1. n [log n] cost for sorting the unordered integer list
                  2. n [log n] cost for a binary search for the other integer
                  The total computation complexity is O(2 * n [log n] ) == O(n [log n] ).

                  Remove Restriction: Integers can Now be Re-Used (i = j)

                  We can now finally use the in.

                  Code Snippet 7: Use in operator
                  def sum_in(int_list, desired_sum):
                      for i, int_a in enumerate(int_list):
                          desired_int = desired_sum - int_a
                          if desired_int in int_list:
                              return True
                      return False
                  Remember, we are now back to the O(n2) performance. Yucky.

                  Re-Use of Integers Permits the Application of Set

                  Because the re-use of integers is allowed, we can use Python's built in set and get a O(1) look-up performance. However, don't forget that you initially have to create the set which involves a cost of O(n). Luckily this is just a one-time setup fee. Please note that creating a set involves hashing, which, depending on the nature of the data, can be problematic.

                  Also, by using set, we are implicitly removing the additional requirement that the system has no memory because we are storing hash values. This is an example of why people must be aware of all ramifications associated with data structures and their associated algorithms.

                  Code Snippet 8: Use set
                  def sum_in(int_list, desired_sum):
                      int_set = set(int_list)
                      for i, int_a in enumerate(int_list):
                          desired_int = desired_sum - int_a
                          if desired_int in int_set:
                              return True
                      return False
                  The above code snippet is the verbose version of the code snippet found at the StackOverFlow question "How to write an algorithm to check if the sum of any two numbers in an array/list matches a given number?"

                  Code Snippet 9: From StackOverFlow
                  def sum_in(int_list, desired_sum):
                      set_of_input = set(int_list) # O(n)
                      return any( ( desired_sum - n) in set_of_input for n in set_of_input ) # O(n)

                  At this point, it is important to remember that we removed a restriction and integers can now be re-used (i=j). This means that when "sum_in([-2, 1, 5, 6, 8], 2)" returns True, it is a correct answer because "1 + 1 == 2." This may or may not be acceptable depending on the situation.

                  Also, at this point, we need to address the distinction between "repeated integers not allowed" and "integers can be re-used." Repeated integers are simply not allowed because Python's set quietly ignores them. "Integers can be re-used" refers to the fact that "sum_in([-2, 1, 5, 6, 8], 2)" returning True is a correct answer because "1 + 1 == 2."

                  Code snippets 8 and 9 require that the entire set (hash values) is computed ahead of time. Why not create the hash value as we process the data? This way, we might find the desired integers before we have created all the hash values. The other advantage of creating the hash values as we process the data is that it allows for repeated integers ([1, 2, 2, 4, 5, 6, 7]) but not the re-use of integers.
                  Code Snippet 10: Generate hash values as process data
                  def sum_in(int_list, desired_sum):
                      indexMap = {}
                      for i in range( len( int_list ) ):
                          if ( desired_sum - int_list[i] ) in indexMap:
                              return True
                              indexMap[ int_list[i] ] = i
                      return False
                  Note that code snippet 10 returns False for "sum_in([-2, 1, 5, 6, 8], 2)". In other words, integers cannot be re-used (i != j).

                  Above material similar to exercise 2.3-7 of CLRS

                  Some people will recognize that the material above is similar to exercise 2.3-7 of CLRS. Below is the actual exercise. We have deviated from it for pedagogical reasons.

                  Alternative Solution

                  Zhao HG on GitBook solves the problem in O(n log n) time using the code snippet below. Notice its simple elegance of just walking two pointers. However, it is complex because have 3 tests, and for each test, there is a different action.

                  Code Snippet 11
                  def two_sum(a, x):
                      l, r = 0, len(a)-1
                      while l < r:
                          if a[l] + a[r] == x:
                              return True
                          elif a[l] + a[r] < x:
                              l += 1
                              r -= 1
                      return False
                  Notice how we have examined the algorithm from 3 different dimensions.
                  1. The first dimension was time. How long does it take to perform the computations? 
                  2. The second dimension was required storage. Please note that storage is used in the generic sense. It does not just refer to memory. For further details, please refer to the Wikipedia article "Memory Hierarchy." 
                  3. The third dimension is algorithm/software complexity. You can use the radon package to quantify the complexity. Please note that complexity is a critical consideration when deploying production code. For an example, refer to "Why Netflix Never Implemented The Algorithm That Won The Netflix $1 Million Challenge by Mike Masnick".
                  Unit Tests

                  Some people will have an opinion that such trivial code snippets do not require unit tests. I strongly disagree. Unit tests are a good way to make sure that you haven't done anything silly. For the purpose of this blog, I have created rather extensive unit tests to prevent a lot of back and forth when people propose adding short circuit logic to reduce the number of computations performed. The conversations will probably start and end with, "Did you run the proposed short circuit logic against the unit tests?"

                  As for the unit tests themselves, we begin by realizing that each integer can be negative, zero or positive and that the desired sum can also be negative, zero or positive. This leads to 27 possible test cases. For an explicit list of all 27 tests, click here. Luckly, most of them are not applicable because they can't happen mathematically or they can't happen because the data is sorted.

                  In turn, each test case has sub-tests. One set of tests finds the desired sum while another set of tests does not. Within, those sub-tests, position matters: left edge, somewhere in the middle and right edge. To see the tests themselves, click here. When all is said and done, we end up with 104 tests. This may seem excessive, but it beats the alternative of having countless conversations about why the computations for the nested for loops was not short circuited.

                  Also, if you want to create tests to estimate run time, one could use the timeit module. Another option is to use the pytest-benchmark.

                  Subtlety: Repeated Integers Versus Re-Use of Integers

                  If repeated integers are allowed, you could have a list of integers like [3, 5, 1, 5, 12]. So, if repeated integers are allowed and wanted the sum to be 10, you could use the two 5's to come up with 10.

                  However, if repeated integers are not allowed, the previous list would become [3, 5, 1, 12]. In this case, if wanted the sum to be 10, there would be no two integers that sum to 10.

                  Now comes the caveat. If have [3, 5, 1, 12] and integers can be re-used, could re-use 5 a second time to have two integers that sum to 10.


                  This blog demonstrates two things.
                  1. First it is critical to know the runtimes of built-in algorithms/tools of a language. 
                  2. Secondly, the material demonstrates a common process. You start out with what appears a simple problem. You try a brute force approach but quickly realize that it won't scale. You then look for patterns like symmetry to reduce the number of required computations. If you are lucky, this will be good enough. If not, an alternative approach will have to be tried.
                  This blog also demonstrates the key traits and skills people need to choose an algorithm. You need some idea of the consequences of doing things like adding a loop.

                  Next you have to be creative. What if we subtract the integer that we're currently examining from the desired sum and then use the resulting integer to do a search? Understanding the Big O notation is a good skill and helps you understand consequences, but you also need to understand the principles behind it.

                  Lastly, know your data. Questions like "Is it sorted or unsorted?" or "Is it symmetric?" will help us in the algorithm selection process.


                  To download the presentation for the PGH Data Science Lightning Talks on 7/5/2017, click here.

                  To download the presentation for the PGHPython: Study Algorithms by Playing Pick 2 on 7/26/2017, click here. For a video recording of the presentation, click here (YouTube).

                  Wednesday, February 8, 2017

                  Work Around Transaction Support in an RDBMS for Bulk Loads - Just Say No

                  Recently, some of my contacts have been struggling with bulk loading into conventional RDBMS products like Oracle, SQL Server and MySQL. They were surprised by the issue.

                  My reply to them is that this is a standard issue. RDBMS products by their nature guarantee transactions. This is very expensive.

                  They do a lot of logging to guarantee transactions. For example, a standard technique in Oracle to add a column to a large table is to create a copy of the table with the additional column using NOLOGGING. Afterwards you add back in constraints, indexes, … .

                  The standard joke in Oracle is that it first does redo and then an undo and then actually does something. For those familiar with Oracle knobs, undo has significance.

                  Also, there is a lot of context switching going on between the RDBMS and the external interface. Heck, even within Oracle, PL/SQL implemented BULK COLLECT to reduce all this context switching between PL/SQL and SQL. (Bulk Processing with BULK COLLECT and FORALL by Steven Feuerstein)

                  So, the trick is to identify the knobs that minimize logging and context switching.

                  However, if you want to be true heretic, consider asking the question of why use an RDBMS if you don’t need transaction support?

                  For further details on this type of stuff, check out the following blog posts

                  1. SQL Can Be Slow -- Why Do People Doubt This?
                  2. Data Warehousing and SQL -- Tread Carefully by S. Lott
                  3. NoSQL Database Doesn’t Mean No Schema by Steven F. Lott