Friday, March 24, 2017

Algorithmic study of finding the sum of any 2 integers from a list being equal to a specific value

I attended the "CLRS textbook study" meeting on 3/20/2017 of the "Algorithms and data structure problems" Pittsburgh group. This was the second meeting of the group. In the first meeting, it was decided that algorithms would be studied using the standard book CLRS. Unfortunately, CLRS does not seem appropriate given the audience. Below is my guess of what may be appropriate for the group.
    We begin by explicitly stating what appears to be a simple problem: given a list of integers, find 2 distinct integers that sum to a specific value. For example, given the data set [1, 2, 3, 4, 5, 6], which 2 distinct integers sum to 11? Mathematically, this can be stated as follows:
    1. x + y = x0, where
    2. x ∈ N and y ∈ N for some finite set of integer values N
    3. x0 is the integer to which x and y sum
    4. x ≠ y

    Concepts are demonstrated via code snippets. The code snippets were executed using Python 2.7.6. They were written in such a way that anyone with basic coding skills can read them. In other words, the goal was not to be Pythonic or to use Python's standard library to solve the problem. Let me re-iterate: the goal of the code snippets is to demonstrate concepts.

          Simplest Case

          The integer list is sorted and the quantity of integers is small enough such that brute force is acceptable.

          Code Snippet 1
          
          
          SomeIntegerList = [1, 2, 3, 4, 5, 6]
          DesiredSumOfIntegers = 11
          for SomeIntegerA in SomeIntegerList:
              for SomeIntegerB in SomeIntegerList:
                  if SomeIntegerA == SomeIntegerB:
                      continue
                  SumOfIntegers = SomeIntegerA + SomeIntegerB
                  print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
                  if DesiredSumOfIntegers == SumOfIntegers:
                      print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"
          
          
          Output of Code Snippet

          SomeInteger A =  1 , SomeInteger B =  2 , Sum of Integers =  3
          SomeInteger A =  1 , SomeInteger B =  3 , Sum of Integers =  4
          SomeInteger A =  1 , SomeInteger B =  4 , Sum of Integers =  5
          SomeInteger A =  1 , SomeInteger B =  5 , Sum of Integers =  6
          SomeInteger A =  1 , SomeInteger B =  6 , Sum of Integers =  7
          SomeInteger A =  2 , SomeInteger B =  1 , Sum of Integers =  3
          SomeInteger A =  2 , SomeInteger B =  3 , Sum of Integers =  5
          SomeInteger A =  2 , SomeInteger B =  4 , Sum of Integers =  6
          SomeInteger A =  2 , SomeInteger B =  5 , Sum of Integers =  7
          SomeInteger A =  2 , SomeInteger B =  6 , Sum of Integers =  8
          SomeInteger A =  3 , SomeInteger B =  1 , Sum of Integers =  4
          SomeInteger A =  3 , SomeInteger B =  2 , Sum of Integers =  5
          SomeInteger A =  3 , SomeInteger B =  4 , Sum of Integers =  7
          SomeInteger A =  3 , SomeInteger B =  5 , Sum of Integers =  8
          SomeInteger A =  3 , SomeInteger B =  6 , Sum of Integers =  9
          SomeInteger A =  4 , SomeInteger B =  1 , Sum of Integers =  5
          SomeInteger A =  4 , SomeInteger B =  2 , Sum of Integers =  6
          SomeInteger A =  4 , SomeInteger B =  3 , Sum of Integers =  7
          SomeInteger A =  4 , SomeInteger B =  5 , Sum of Integers =  9
          SomeInteger A =  4 , SomeInteger B =  6 , Sum of Integers =  10
          SomeInteger A =  5 , SomeInteger B =  1 , Sum of Integers =  6
          SomeInteger A =  5 , SomeInteger B =  2 , Sum of Integers =  7
          SomeInteger A =  5 , SomeInteger B =  3 , Sum of Integers =  8
          SomeInteger A =  5 , SomeInteger B =  4 , Sum of Integers =  9
          
          SomeInteger A =  5 , SomeInteger B =  6 , Sum of Integers =  11
          DesiredSumOfIntegers =  11  was found
          
          SomeInteger A =  6 , SomeInteger B =  1 , Sum of Integers =  7
          SomeInteger A =  6 , SomeInteger B =  2 , Sum of Integers =  8
          SomeInteger A =  6 , SomeInteger B =  3 , Sum of Integers =  9
          SomeInteger A =  6 , SomeInteger B =  4 , Sum of Integers =  10
          
          SomeInteger A =  6 , SomeInteger B =  5 , Sum of Integers =  11
          DesiredSumOfIntegers =  11  was found
          
          You will notice that both "1 + 2" and "2 + 1" equal 3. This corresponds to the commutative property of addition. If we create a matrix of all the possible addition permutations, we get a symmetric matrix.


          Now, let's talk about all the possible addition permutations. Here I am using the term permutation as it is strictly defined in mathematics. For a discussion of the difference between permutations and combinations, refer to Combinations and Permutations at MathIsFun.Com. The above example corresponds to permutations without repetition, which means that we should use the equation below

          n! / (n - r)!

          Here n is the number of integers in the list and r = 2 because we are taking 2 numbers at a time, so the count reduces to n(n - 1). Below is a table which evaluates the above equation. It shows how quickly the number of required additions increases as the number of integers in the list increases.
            Number of Integers  Number of
            in the List         Additions
            ------------------  ---------
                2                     2
                3                     6
                4                    12
                5                    20
                6                    30
               10                    90
              100                 9,900
              150                22,350
              500               249,500
            1,000               999,000
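            The table can be reproduced with a short function (the function name is ours; note this sketch uses Python 3 print syntax, unlike the Python 2 snippets in this post):

```python
def additions_required(n, r=2):
    """Permutations without repetition: n! / (n - r)!.
    For r = 2 this reduces to n * (n - 1)."""
    result = 1
    for k in range(n - r + 1, n + 1):
        result *= k
    return result

for n in [2, 3, 4, 5, 6, 10, 100, 150, 500, 1000]:
    print(n, additions_required(n))
```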
            
            
            Another perspective is to note that the nested for loops result in O(n²) behavior. Clearly, the above approach is not scalable.

            Reduce Computations by Taking Advantage of Symmetry

            Code Snippet 2
            
            
            SomeIntegerList = [1, 2, 3, 4, 5, 6]
            DesiredSumOfIntegers = 11
            i = 0
            for SomeIntegerA in SomeIntegerList:
                i = i + 1
                j = 0
                for SomeIntegerB in SomeIntegerList:
                    j = j + 1
                    if j > i:
                        print "i = ", i, ", j = ", j
                        SumOfIntegers = SomeIntegerA + SomeIntegerB
                        print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
                        if DesiredSumOfIntegers == SumOfIntegers:
                            print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"
            
            
            Output of Code Snippet
            
            
            i =  1 , j =  2
            SomeInteger A =  1 , SomeInteger B =  2 , Sum of Integers =  3
            i =  1 , j =  3
            SomeInteger A =  1 , SomeInteger B =  3 , Sum of Integers =  4
            i =  1 , j =  4
            SomeInteger A =  1 , SomeInteger B =  4 , Sum of Integers =  5
            i =  1 , j =  5
            SomeInteger A =  1 , SomeInteger B =  5 , Sum of Integers =  6
            i =  1 , j =  6
            SomeInteger A =  1 , SomeInteger B =  6 , Sum of Integers =  7
            i =  2 , j =  3
            SomeInteger A =  2 , SomeInteger B =  3 , Sum of Integers =  5
            i =  2 , j =  4
            SomeInteger A =  2 , SomeInteger B =  4 , Sum of Integers =  6
            i =  2 , j =  5
            SomeInteger A =  2 , SomeInteger B =  5 , Sum of Integers =  7
            i =  2 , j =  6
            SomeInteger A =  2 , SomeInteger B =  6 , Sum of Integers =  8
            i =  3 , j =  4
            SomeInteger A =  3 , SomeInteger B =  4 , Sum of Integers =  7
            i =  3 , j =  5
            SomeInteger A =  3 , SomeInteger B =  5 , Sum of Integers =  8
            i =  3 , j =  6
            SomeInteger A =  3 , SomeInteger B =  6 , Sum of Integers =  9
            i =  4 , j =  5
            SomeInteger A =  4 , SomeInteger B =  5 , Sum of Integers =  9
            i =  4 , j =  6
            SomeInteger A =  4 , SomeInteger B =  6 , Sum of Integers =  10
            i =  5 , j =  6
            SomeInteger A =  5 , SomeInteger B =  6 , Sum of Integers =  11
            DesiredSumOfIntegers =  11  was found
            
            Now, let's talk about the number of computations involved in using nested for loops while taking advantage of the symmetry. The number of additions drops from n(n - 1) to n(n - 1)/2. Even though the number of computations is cut in half, the approach is still O(n²) and therefore still not scalable.
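            To see the halving concretely, here is a minimal sketch (Python 3; the helper names are ours) comparing ordered pairs against unordered pairs:

```python
def ordered_pairs(n):
    """All addition permutations without repetition: n * (n - 1)."""
    return n * (n - 1)

def unique_pairs(n):
    """Combinations without repetition: C(n, 2) = n * (n - 1) / 2."""
    return n * (n - 1) // 2

print(ordered_pairs(1000))  # 999000
print(unique_pairs(1000))   # 499500 -- half the work, but still quadratic in n
```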

            Alternative Approach: Search for the other integer

            Since the above approaches are not scalable, we need to step back and try something different. In this case, we can take advantage of the fact that the list of integers is sorted. Since we have an integer value and we know what the sum must be, we can compute the integer value that we need to find. We can then do a search over an ordered list for the integer that we need to find. For the search, we will use the binary search from Problem Solving with Algorithms and Data Structures Using Python by Brad Miller and David Ranum.

            Code Snippet 3
            
            
            
            
            def binarySearch(alist, item):
                first = 0
                last = len(alist) - 1
                found = False
                while first <= last and not found:
                    midpoint = (first + last) // 2
                    if alist[midpoint] == item:
                        found = True
                    else:
                        if item < alist[midpoint]:
                            last = midpoint - 1
                        else:
                            first = midpoint + 1
                return found
            SomeIntegerList = [1, 2, 3, 4, 5, 6]
            DesiredSumOfIntegers = 11
            for SomeInteger in SomeIntegerList:
                IntegerOfInterest = DesiredSumOfIntegers - SomeInteger
                if IntegerOfInterest == SomeInteger:
                    print "IntegerOfInterest was not found"
                    break
                print "SomeInteger = ", SomeInteger, ", IntegerOfInterest = ", IntegerOfInterest, ", DesiredSumOfIntegers = ", DesiredSumOfIntegers
                if binarySearch(SomeIntegerList,IntegerOfInterest) == True:
                    print "IntegerOfInterest was found"
                    break
                else:
                    print "IntegerOfInterest was not found"
                print "---"
            
            
            Output of Code Snippet

            SomeInteger =  1 , IntegerOfInterest =  10 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  2 , IntegerOfInterest =  9 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  3 , IntegerOfInterest =  8 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  4 , IntegerOfInterest =  7 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  5 , IntegerOfInterest =  6 , DesiredSumOfIntegers =  11
            IntegerOfInterest was found
            
            
            Let's analyze the code snippet.
            1. Since the integers that sum must be distinct, the diagonal of the matrix has values of N/A. This resulted in the following if statement

               if IntegerOfInterest == SomeInteger:
                   print "IntegerOfInterest was not found"
                   break

            2. Secondly, we should remove the integer that we are on from the binary search. However, removing an item from a list is expensive. For the details, refer to How to remove an element from a list by index in Python? at StackOverFlow.Com. Consequently, we elected not to remove the integer that we are on from the list. Also, having 1 extra element in the search list has little impact when there are a large number of integers.

            Now, let's talk about the number of computations involved. It is commonly known that a binary search is O(log n). Since we have to do the binary search once for each integer, we have to multiply by n. This results in a performance of O(n log n). For the rationale of why the base of the log does not matter, refer to Wikipedia: Sorting algorithm.

              Complicate Problem by Having Integer List Not Sorted

              This complication is easily addressed by just sorting the integer list. It is commonly known that comparison sorts cannot perform better than O(n log n) in the average or worst case. For background information on this, refer to Wikipedia: Sorting algorithm. Consequently, the computations required to determine if 2 integers in an unsorted integer list sum to a particular value are
              1. O(n log n) for sorting the unordered integer list
              2. O(n log n) for searching for the other integer
              The total computational complexity is O(2 n log n) = O(n log n).
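              The sort-then-search approach can be sketched as follows (Python 3; this uses the standard library's bisect module in place of the hand-written binarySearch, and the function name is ours):

```python
from bisect import bisect_left

def has_pair_with_sum(integers, target):
    """Sort the list (O(n log n)), then for each element binary-search
    for its complement (n searches at O(log n) each)."""
    ordered = sorted(integers)
    for i, value in enumerate(ordered):
        complement = target - value
        j = bisect_left(ordered, complement)
        # the complement must exist and be a distinct element of the list
        if j < len(ordered) and ordered[j] == complement and j != i:
            return True
    return False

print(has_pair_with_sum([6, 1, 4, 3, 2, 5], 11))  # True: 5 + 6
print(has_pair_with_sum([6, 1, 4, 3, 2, 5], 12))  # False: would need two 6's
```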

                  Above material corresponds to exercise 2.3-7 of the book Introduction to Algorithms by Cormen, ...

                  Some people will recognize that the above material corresponds to exercise 2.3-7 of the book Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein (CLRS) [Edition 3]. Below is the actual exercise


                  Note that Zhao HG on GitBook demonstrates that the problem of finding two elements that sum to a given value in a sorted list of n integers can be solved in O(n) time as follows:
                  def two_sum(a, x):
                      l, r = 0, len(a)-1
                      while l < r:
                          if a[l] + a[r] == x:
                              return True
                          elif a[l] + a[r] < x:
                              l += 1
                          else:
                              r -= 1
                      return False
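                  Assuming the definition above, a quick sanity check of the two-pointer routine against our running example (Python 3 print syntax):

```python
def two_sum(a, x):
    # Zhao HG's two-pointer scan, repeated so this example is self-contained
    l, r = 0, len(a) - 1
    while l < r:
        if a[l] + a[r] == x:
            return True
        elif a[l] + a[r] < x:
            l += 1
        else:
            r -= 1
    return False

print(two_sum([1, 2, 3, 4, 5, 6], 11))  # True: 5 + 6
print(two_sum([1, 2, 3, 4, 5, 6], 12))  # False
```

                  Because one pointer only ever moves right and the other only ever moves left, each element is examined at most once, which is why the scan is O(n).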
                  
                  
                  Summary

                  The above material demonstrates a common process. You start out with what appears to be a simple problem. You try a brute force approach but quickly realize that it won't scale. You then look for patterns like symmetry to reduce the number of required computations. If you are lucky, this will be good enough. If not, an alternative approach will have to be tried.

                  Wednesday, February 8, 2017

                  Work Around Transaction Support in an RDBMS for Bulk Loads - Just Say No

                  Recently, some of my contacts have been struggling with bulk loading into conventional RDBMS products like Oracle, SQL Server and MySQL. They were surprised by the issue.

                  My reply to them is that this is a standard issue. RDBMS products by their nature guarantee transactions. This is very expensive.

                  They do a lot of logging to guarantee transactions. For example, a standard technique in Oracle to add a column to a large table is to create a copy of the table with the additional column using NOLOGGING. Afterwards you add back in constraints, indexes, … .

                  The standard joke in Oracle is that it first does redo and then an undo and then actually does something. For those familiar with Oracle knobs, undo has significance.

                  Also, there is a lot of context switching going on between the RDBMS and the external interface. Heck, even within Oracle, PL/SQL implemented BULK COLLECT to reduce all this context switching between PL/SQL and SQL. (Bulk Processing with BULK COLLECT and FORALL by Steven Feuerstein)

                  So, the trick is to identify the knobs that minimize logging and context switching.

                  However, if you want to be a true heretic, consider asking the question of why use an RDBMS if you don’t need transaction support?

                  For further details on this type of stuff, check out the following blog posts

                  1. SQL Can Be Slow -- Why Do People Doubt This?
                  2. Data Warehousing and SQL -- Tread Carefully by S. Lott
                  3. NoSQL Database Doesn’t Mean No Schema by Steven F. Lott

                  Thursday, December 22, 2016

                  Jumping into Spark (JIS): Python / Spark / Logistic Regression (Update 3)

                  In this blog we will use the Python interface to Spark to determine whether or not someone makes more or less than $50,000. The logistic regression model will be used to make this determination.

                  Please note that no prior knowledge of Python, Spark or logistic regression is required. However, it is assumed that you are a seasoned software developer. 

                  There are various tutorials on Spark. There is the official documentation on Spark. However, if you are an experienced software professional and want to just jump in and kick the tires, there doesn't seem to be much available. Well, at least I couldn't find any.

                  Yes, there is Spark's Quick Start. Also, there are artifacts like databricks User Guide. Unfortunately, they have a smattering of stuff. You really don't get a chance to jump in.

                  Let me now explicitly define jumping in. Jumping in involves solving an almost trivial problem that demonstrates a good deal of the concepts and software. Also, it involves skipping steps. Links will be provided so that if someone is interested in the missing steps, they can look up the details. If skipping steps bothers you, immediately stop and go read a tutorial.

                  As mentioned earlier, we are going to use Python on Spark to address logistic regression for our jumping in. This is our almost trivial problem that demonstrates a good deal of the concepts and software.

                  The first thing that you need is a working environment. You could install and configure your own environment. However, that would not be in line with jumping in. Instead I recommend using databricks Spark community edition (DSCE). If you are using DSCE, refer to "Welcome to Databricks" on how to create a cluster, create a notebook, attach the notebook to a cluster and actually use the notebook.

                  Next, you need to connect to the Spark environment. In the database world, you create a database connection. In the Spark world, you create a context (SparkContext/SparkSession/SQLContext). If you are using DSCE, the following three variables will be predefined for you
                  1. SparkContext: sc
                  2. SparkSession: spark
                  3. SQLContext: sqlContext
                      If you would like the IPython notebook associated with this blog, click here. If for some reason you don't have software to read the IPython notebook, you can download a pdf version of it by clicking here.


                      Row / Column


                      We are almost ready to code. First we have to talk about Row and DataFrame. A Row is just like a row in a spreadsheet or a row in a table. A DataFrame is just a collection of Rows. For the most part, you can think of a DataFrame as a table.

                      Now for the first code snippet. Please note that I took the code snippet from Encode and assemble multiple features in PySpark at StackOverFlow.Com.

                      Code Snippet 1
                      
                      
                      from pyspark.sql import Row
                      from pyspark.mllib.linalg import DenseVector
                      
                      row = Row("gender", "foo", "bar")
                      
                      dataFrame = sc.parallelize([
                        row("0",  3.0, DenseVector([0, 2.1, 1.0])),
                        row("1",  1.0, DenseVector([0, 1.1, 1.0])),
                        row("1", -1.0, DenseVector([0, 3.4, 0.0])),
                        row("0", -3.0, DenseVector([0, 4.1, 0.0]))
                      ]).toDF()
                      
                      dataFrame.collect()

                      Nothing fancy. You create a Row template which has the column names gender, foo and bar. You then create a bunch of Rows with actual data. Lastly, you group the Rows into a DataFrame. DenseVector was used to demonstrate that a cell in a Row can have a complex data structure. If you are curious about parallelize and toDF, check the references at the end of the blog. This will be true for the rest of the blog. If you are not sure what some magic word means, go to the reference section at the end of the blog.

                      If things are working, you should get an output like that shown below.

                      Output of Code Snippet 1

                      [Row(gender=u'0', foo=3.0, bar=DenseVector([0.0, 2.1, 1.0])),
                       Row(gender=u'1', foo=1.0, bar=DenseVector([0.0, 1.1, 1.0])),
                       Row(gender=u'1', foo=-1.0, bar=DenseVector([0.0, 3.4, 0.0])),
                      Row(gender=u'0', foo=-3.0, bar=DenseVector([0.0, 4.1, 0.0]))]

                      Before going further, we need to take a little detour. Algorithms like numeric data. So we need to do things like convert a column for gender that has Male / Female / Unknown to binary values. We do this by creating two columns.
                      1. The first column will be used to indicate whether the person is or is not male. A one means that the person is male and a zero indicates that the person is not male. StringIndexer will be used to convert categories to numerical values.
                      2. The second column will be used to indicate whether the person is or is not female. A one means that the person is female and a zero indicates that the person is not female.
                      Notice that we don't need a third column for Unknown. If there is a zero in both the Male and Female columns, we know that the gender is Unknown. This process of taking a single category column and decomposing it to many columns of binary values will be accomplished by the OneHotEncoder.
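                      The gender example can be sketched in plain Python (Python 3; the column names and the dictionary representation are our own illustration, not Spark's API):

```python
def one_hot_gender(gender):
    """Decompose a single gender category into two binary columns.
    Unknown is the dropped category: both columns are zero."""
    return {"is_male": 1 if gender == "Male" else 0,
            "is_female": 1 if gender == "Female" else 0}

for g in ["Male", "Female", "Unknown"]:
    print(g, one_hot_gender(g))
```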

                       

                      StringIndexer


                      A StringIndexer converts categories to numbers. The numbers have a range from 0 to number of categories minus one. The most frequent category gets a number of zero, the second most frequent category gets a number of 1 and so on. We are going to use the code snippets from Preserve index-string correspondence spark string indexer from StackOverFlow.Com to demonstrate what the preceding English means.

                      Let's actually create a StringIndexer and use it to map/fit/transform categories to numbers.

                      Code Snippet 2

                      dataFrame = sqlContext.createDataFrame(
                          [(0, "a"), (1, "b"), (2, "b"), (3, "c"), (4, "c"), (5, "c"), (6,'d'), (7,'d'), (8,'d'), (9,'d')],
                          ["id", "category"])
                      
                      dataFrame.collect()
                      
                      from pyspark.ml.feature import StringIndexer
                      stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
                      modelStringIndexer = stringIndexer.fit(dataFrame)
                      transformedDataFrame = modelStringIndexer.transform(dataFrame)
                      transformedDataFrame.collect()
                      

                      Output of Code Snippet 2

                      [Row(id=0, category=u'a', categoryIndex=3.0),
                       Row(id=1, category=u'b', categoryIndex=2.0),
                       Row(id=2, category=u'b', categoryIndex=2.0),
                       Row(id=3, category=u'c', categoryIndex=1.0),
                       Row(id=4, category=u'c', categoryIndex=1.0),
                       Row(id=5, category=u'c', categoryIndex=1.0),
                       Row(id=6, category=u'd', categoryIndex=0.0),
                       Row(id=7, category=u'd', categoryIndex=0.0),
                       Row(id=8, category=u'd', categoryIndex=0.0),
                       Row(id=9, category=u'd', categoryIndex=0.0)]
                      

                      Notice how d's got 0.0 because they are the most numerous. The letter c's got 1.0 because they are the second most numerous. And so on. The code snippet below will make this more clear.

                      Code Snippet 3

                      transformedDataFrame.select('category','categoryIndex').distinct().orderBy('categoryIndex').show()
                      

                      Output of Code Snippet 3

                      +--------+-------------+
                      |category|categoryIndex|
                      +--------+-------------+
                      |       d|          0.0|
                      |       c|          1.0|
                      |       b|          2.0|
                      |       a|          3.0|
                      +--------+-------------+

                      OneHotEncoder


                      A OneHotEncoder converts category numbers to binary vectors with at most a single one-value per row. For a true understanding of one-hot encoding, refer to the associated Wikipedia page.

                      Next, let's use a OneHotEncoder to transform the category index that we created earlier into a binary vector.

                      Code Snippet 4

                      from pyspark.ml.feature import OneHotEncoder
                      oneHotEncoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVector")
                      oneHotEncodedDataFrame = oneHotEncoder.transform(transformedDataFrame)
                      oneHotEncodedDataFrame.show()
                      

                      Output of Code Snippet 4

                      +---+--------+-------------+--------------+
                      | id|category|categoryIndex|categoryVector|
                      +---+--------+-------------+--------------+
                      |  0|       a|          3.0|     (3,[],[])|
                      |  1|       b|          2.0| (3,[2],[1.0])|
                      |  2|       b|          2.0| (3,[2],[1.0])|
                      |  3|       c|          1.0| (3,[1],[1.0])|
                      |  4|       c|          1.0| (3,[1],[1.0])|
                      |  5|       c|          1.0| (3,[1],[1.0])|
                      |  6|       d|          0.0| (3,[0],[1.0])|
                      |  7|       d|          0.0| (3,[0],[1.0])|
                      |  8|       d|          0.0| (3,[0],[1.0])|
                      |  9|       d|          0.0| (3,[0],[1.0])|
                      +---+--------+-------------+--------------+

                      SparseVector

                      The column categoryVector is a SparseVector. It has 3 parts. The first part is the length of the vector. The second part is the indices which contain values. The third part is the actual values. Below is a code snippet demonstrating this.

                      Code Snippet 5

                      from pyspark.mllib.linalg import SparseVector
                      v1 = SparseVector(5, [0,3], [10,9])
                      for x in v1:
                        print(x)
                      

                      Output of Code Snippet 5

                      10.0
                      0.0
                      0.0
                      9.0
                      0.0
                      

                      Notice how category a (categoryVector = (3,[],[])) is encoded as the all-zero vector. The last category is dropped because including it would make the vector entries always sum to one and hence be linearly dependent. The code snippet below will provide a better visual for this.

                      Code Snippet 6

                      oneHotEncodedDataFrame.select('category','categoryIndex', 'categoryVector').distinct().orderBy('categoryIndex').show()
                      

                      Output of Code Snippet 6

                      +--------+-------------+--------------+
                      |category|categoryIndex|categoryVector|
                      +--------+-------------+--------------+
                      |       d|          0.0| (3,[0],[1.0])|
                      |       c|          1.0| (3,[1],[1.0])|
                      |       b|          2.0| (3,[2],[1.0])|
                      |       a|          3.0|     (3,[],[])|
                      +--------+-------------+--------------+
                      

                      VectorAssembler

                      By this point you are probably getting impatient. Luckily, we have just one more item to cover before we get to logistic regression. That one item is the VectorAssembler. A VectorAssembler just concatenates columns together. As usual, we will demonstrate what the words mean via a code snippet.

                      Code Snippet 7

                      from pyspark.ml.feature import VectorAssembler
                      dataFrame_1 = spark.createDataFrame([(1, 2, 3), (4,5,6)], ["a", "b", "c"])
                      vectorAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
                      dataFrame_2 = vectorAssembler.transform(dataFrame_1)
                      dataFrame_2.show()

                      Output of Code Snippet 7

                      +---+---+---+-------------+
                      |  a|  b|  c|     features|
                      +---+---+---+-------------+
                      |  1|  2|  3|[1.0,2.0,3.0]|
                      |  4|  5|  6|[4.0,5.0,6.0]|
                      +---+---+---+-------------+

                      Logistic Regression


                      We have now learned enough Spark to look at a specific problem involving logistic regression. We are going to work through the example provided in the databricks documentation.

                      If you are interested in learning about logistic regression, I recommend reading "Chapter 6: The Granddaddy of Supervised Artificial Intelligence - Regression (spreadsheet)" of the book Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman (2013). Personally, I like the book because it provides minimal math and an implementation of logistic regression in a spreadsheet.

                      Drop / Create Table
                      1. Drop Table
                        %sql 
                        DROP TABLE IF EXISTS adult
                        
                        2. Create Table
                          %sql
                          
                          CREATE TABLE adult (
                            age               DOUBLE,
                            workclass         STRING,
                            fnlwgt            DOUBLE,
                            education         STRING,
                            education_num     DOUBLE,
                            marital_status    STRING,
                            occupation        STRING,
                            relationship      STRING,
                            race              STRING,
                            sex               STRING,
                            capital_gain      DOUBLE,
                            capital_loss      DOUBLE,
                            hours_per_week    DOUBLE,
                            native_country    STRING,
                            income            STRING)
                          USING com.databricks.spark.csv
                          OPTIONS (path "/databricks-datasets/adult/adult.data", header "true")
                          
                          
                        Convert table to a DataFrame
                        dataset = spark.table("adult")
                        Get a list of columns in original dataset
                        cols = dataset.columns
                        1. Capturing the column list has to be done here and not later: the databricks example reuses the variable dataset, so the original columns would otherwise be lost.
                        Perform One Hot Encoding on columns of interest
                        from pyspark.ml import Pipeline
                        from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
                        
                        categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
                        stages = [] # stages in our Pipeline
                        for categoricalCol in categoricalColumns:
                          stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
                          encoder = OneHotEncoder(inputCol=categoricalCol+"Index", outputCol=categoricalCol+"classVec")
                          # Add stages.  These are not run here, but will run all at once later on.
                          stages += [stringIndexer, encoder]
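
                        To see what these two stages accomplish, here is a rough pure-Python sketch (not Spark's actual implementation): StringIndexer maps each distinct category to an integer index, most frequent first, and OneHotEncoder turns that index into a 0/1 indicator vector. Note that Spark's OneHotEncoder drops the last category by default; this sketch keeps all categories for simplicity, and the workclass values are just sample strings.

```python
from collections import Counter

def string_index(values):
    # Map each distinct category to an integer index, most frequent
    # category first (Spark's StringIndexer default ordering).
    ordered = [category for category, _ in Counter(values).most_common()]
    return {category: index for index, category in enumerate(ordered)}

def one_hot(index, size):
    # Turn an integer index into a 0/1 indicator vector of length size.
    vector = [0] * size
    vector[index] = 1
    return vector

workclass = ["Private", "Private", "State-gov", "Private", "Self-emp"]
mapping = string_index(workclass)
print(mapping["Private"])  # 0, because it is the most frequent category
print(one_hot(mapping["State-gov"], len(mapping)))
```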
                        
                        
                        Create a StringIndexer on income
                        
                        
                        label_stringIdx = StringIndexer(inputCol = "income", outputCol = "label")
                        stages += [label_stringIdx]
                        
                        
                        Combine feature columns into a single vector column using VectorAssembler
                        
                        
                        numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
                        assemblerInputs = map(lambda c: c + "classVec", categoricalColumns) + numericCols
                        assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
                        stages += [assembler]
                        Put data through all of the feature transformations using the stages in the pipeline
                        pipeline = Pipeline(stages=stages)
                        pipelineModel = pipeline.fit(dataset)
                        dataset = pipelineModel.transform(dataset)
                        
                        
                        Keep relevant columns
                        
                        selectedcols = ["label", "features"] + cols
                        dataset = dataset.select(selectedcols)
                        display(dataset)
                        
                        
                        Split data into training and test sets
                        
                        
                        (trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)
                        print trainingData.count()
                        print testData.count()
                        1. Set seed for reproducibility
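                        randomSplit assigns each row by an independent random draw, so the 70/30 proportions are approximate rather than exact; the seed is what makes the split repeatable. A rough pure-Python analogue of that behavior (an illustration, not Spark's implementation):

```python
import random

def random_split(rows, weight, seed):
    # Each row is assigned independently: roughly `weight` of the rows
    # land in the first split. A fixed seed makes the split repeatable.
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < weight else second).append(row)
    return first, second

train, test = random_split(list(range(1000)), 0.7, seed=100)
print(len(train), len(test))  # roughly 700 and 300; identical on every run with seed=100
```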
                        Train logistic regression model and then make predictions
                        from pyspark.ml.classification import LogisticRegression
                        lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
                        lrModel = lr.fit(trainingData)
                        predictions = lrModel.transform(testData)
                        selected = predictions.select("prediction", "age", "occupation")
                        display(selected)
                        
                        1. Output
                          prediction  age occupation
                          ----------  --- --------------
                          ...
                          0           20  Prof-specialty
                          1           35  Prof-specialty
                          ...
                          
                          
                          1. A prediction of 0 means that the person earns <=50K, while a prediction of 1 means that the person earns >50K.
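
                          To quantify how well the model does, you would compare predictions against the true labels. Spark ships evaluators for this (for example BinaryClassificationEvaluator), but the simplest metric, accuracy, is easy to sketch in pure Python. The (prediction, label) pairs below are hypothetical, for illustration only:

```python
def accuracy(pairs):
    # Fraction of (prediction, label) pairs that agree.
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return float(correct) / len(pairs)

# Hypothetical (prediction, label) pairs; 1 means >50K, 0 means <=50K.
pairs = [(0, 0), (1, 1), (0, 1), (1, 1), (0, 0)]
print(accuracy(pairs))  # 0.8
```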

                          Summary

                          In summary, we jumped in by using Python on Spark to address logistic regression. We did not read any tutorials. We did skip steps. However, links are provided in the reference section to fill in the steps if the reader so desires.

                          The advantage of jumping in is that you learn by solving a problem. Also, you don't spend weeks or months working through material like that listed in the references below before you can do anything. The disadvantage is that steps are skipped and you won't fully understand even the steps that are provided. Admittedly, jumping in is not for everybody; for some people, standard tutorials are the way to begin.

                           

                          References

                          1. Book: Data Smart by John W. Foreman
                            1. john-foreman.com
                            2. O'Reilly
                            3. Chapter 6: The Granddaddy of Supervised Artificial Intelligence - Regression
                              1. Text
                              2. Spreadsheet
                          2. Cartoon from New Yorker Magazine: Does your car have any idea why my car pulled it over?
                          3. databricks
                          4. sklearn.linear_model.LogisticRegression
                          5. Spark.Apache.Org
                          6. StackOverFlow.Com
                          7. Wikipedia

                          Thursday, September 29, 2016

                          Can a Purely State Based Database Schema Migration Be Trusted? by Robert Lucente

                          A really good article on the tradeoffs between state based vs migration based schema transformation is titled Critiquing two different approaches to delivering databases: Migrations vs state by Alex Yates on 6/18/2015. The blog Database Schema Migration by S. Lott on 9/27/2016 ends with "It's all procedural migration. I'm not [sure] declarative ("state") tools can be trusted beyond discerning the changes and suggesting a possible migration."

                          Let me start by examining the statement of the problem: state based vs migration based schema transformation. The problem statement implies that the solution is an either-or proposition. Why not both?

                          As opposed to speaking in generalities, let me pick a specific problem that is often encountered in real systems and that demonstrates the core issue. Also, let me pick a particular tool set to execute on this particular problem.

                          The specific problem involves
                          1. Adding some domain table and its associated data.
                          2. Adding a not null / mandatory column to some table that already has data.
                          3. Creating a foreign key between the domain table and the new mandatory column.
                          Below is a picture for the above words. The "stuff" in color are the tables and columns being added.


                                In a migration based schema approach the following steps would have to be performed
                                1. SomeTable exists with data.
                                2. A new domain table (SomeDomain) gets created.
                                3. The new domain table (SomeDomain) gets populated with data.
                                4. A new nullable / not mandatory column SomeDomainUuid is added to SomeTable.
                                5. Column SomeDomainUuid in SomeTable gets populated with data.
                                6. Column SomeDomainUuid in SomeTable is made not null / mandatory.
                                7. A foreign key is created between the two tables.
                                Notice that the above is very labor intensive and involves 7 steps. Software should be able to figure out all the steps and their sequences except for the following
                                1. The new domain table (SomeDomain) gets populated with data.
                                2. Column SomeDomainUuid in SomeTable gets populated with data.
                                The key thing to notice is that the steps that can't be automated involve data. There is no way for the software to know about the specifics of the data.
                                                    Now that we have defined a specific problem, let's execute on solving the problem using a specific tool set. I am going to use the Visual Studio SQL Server Database Project concept with SQL Server as the target database. Via a series of clicks, the end state of the database is specified in the "Visual Studio SQL Server Database Project".
                                                      Next we write a script (Populate_dbo_SomeDomain.sql) to populate SomeDomain table with data.
                                                      
                                                      
                                                      INSERT INTO dbo.SomeDomain
                                                          VALUES 
                                                          (newid(), 'Fred'), (newid(), 'Barney');
                                                      
                                                      
                                                      
                                                      The second step is to write a script (Update_dbo_Deal_AdvertiserTypeUuid.sql) to populate SomeDomainUuid column in SomeTable.
                                                      
                                                      
                                                      UPDATE [dbo].[SomeTable] 
                                                      SET    [SomeDomainUuid] = (SELECT SomeDomainUuid
                                                                                FROM   [dbo].[SomeDomain] 
                                                                                WHERE  Name = 'Fred');
                                                      
                                                      
                                                      The last preparation step is to write a script (Script.PostDeployment.sql) to run the above 2 scripts after the state based change has happened.
                                                      
                                                      
                                                      :r ".\Populate_dbo_SomeDomain.sql"
                                                      go
                                                      
                                                      :r ".\Update_dbo_Deal_AdvertiserTypeUuid.sql"
                                                      go
                                                      
                                                      
                                                      
                                                      Now that the desired end state has been specified as well as the data manipulation scripts written, it is time to modify a database. The Microsoft terminology for this is "publishing the database project". There will be an issue because the state changes will be made and then the Script.PostDeployment.sql script will be run. In between the state changes and the script being run, there will need to be data in the SomeDomainUuid column in the table SomeTable. This issue is addressed by using the GenerateSmartDefaults option when publishing the database project.

                                                      Let's summarize what this combination of state based and migration based schema transformation has allowed us to do. We were able to reduce 7 steps down to 2, and those 2 steps couldn't be automated anyway because they involve data. These are the pros. The con is that you have to be familiar with the framework and select the GenerateSmartDefaults option out of the 60 plus available options.

                                                      In conclusion, a purely state based approach can't work because of data. There is no way for the software to know how to do data migrations. Only humans know the correct way to do data migrations. In our example, there is no way for the software to know whether the new column SomeDomainUuid is to be initially populated with "Fred" or "Barney". This is a long-winded and polite way of saying that a purely state based database schema migration can't be trusted. However, the combination of state based and migration based can truly improve productivity.

                                                      Saturday, February 27, 2016

                                                      Gentle Introduction to Various Math Stuff

                                                      I recently attended a meetup in Pittsburgh titled Analytics of Social Progress: When Machine Learning meets Human Problems given by Amy Hepner. She did an outstanding job of introducing some math concepts very simply and intuitively. If you are interested in the slides showing how she did this, you can go to her web site by clicking here.

                                                      This got me thinking about how I could help others with simple and intuitive ways to explain math stuff. I get annoyed when people say "doubly differentiable" as opposed to no gaps and no kinks. What makes this difficult is that everyone is at a different level and so you won't be able to please most people. However, I figure, any help is better than no help at all.

                                                      As expected, there is already plenty of material on the internet. For statistics, a good place to start is "The Most Comprehensive Review of Comic Books Teaching Statistics" by Rasmus Baath. A second good place to look is "A Brief Review of All Comic Books Teaching Statistics" by Rasmus Baath and Christian Robert. For links to the books themselves, see the list below.
                                                      1. The Cartoon ...
                                                        1. The Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith
                                                        2. The Cartoon Introduction to Statistics by Grady Klein, Alan Dabney
                                                      2. Manga Guide to Statistics by Shin Takahashi
                                                      3. ... for Dummies
                                                        1. Biostatistics For Dummies by John Pezzullo
                                                        2. Business Statistics For Dummies by Alan Anderson
                                                        3. Predictive Analytics For Dummies by Anasse Bari
                                                        4. Probability For Dummies by Deborah J. Rumsey
                                                        5. Psychology Statistics For Dummies by Donncha Hanna
                                                        6. Statistical Analysis with Excel For Dummies by Joseph Schmuller
                                                        7. Statistics Essentials For Dummies by Deborah J. Rumsey
                                                        8. Statistics for Big Data For Dummies by Alan Anderson
                                                        9. Statistics For Dummies by Deborah J. Rumsey
                                                        10. Statistics II For Dummies by Deborah J. Rumsey
                                                        11. Statistics Workbook For Dummies by Deborah J. Rumsey
                                                        12. Statistics: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Consumer Dummies
                                                      For gentle introductions to other math stuff, you can check out the list below. People have complained that the list below is too long. My response is if you are not willing to spend 10 minutes to skim through the list, you are not ready to make the commitment to upgrade your math skills.
                                                      1. Algebra ...
                                                        1. Algebra I For Dummies by Mary Jane Sterling
                                                        2. Algebra I Essentials For Dummies by Mary Jane Sterling
                                                        3. Algebra I Workbook For Dummies by Mary Jane Sterling
                                                        4. Algebra II For Dummies by Mary Jane Sterling
                                                        5. Algebra II Workbook For Dummies by Mary Jane Sterling
                                                        6. Algebra II: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Mary Jane Sterling
                                                      2. Basic Math and Pre-Algebra ...
                                                        1. Basic Math and Pre-Algebra For Dummies by Mark Zegarelli
                                                        2. Basic Math and Pre-Algebra: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Mark Zegarelli
                                                      3. Calculus ...
                                                        1. Calculus For Dummies by Mark Ryan
                                                        2. Calculus II For Dummies by Mark Zegarelli
                                                        3. Calculus Essentials For Dummies by Mark Ryan
                                                        4. Calculus Workbook For Dummies by Mark Ryan
                                                        5. Calculus: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Patrick Jones
                                                      4. Complete Idiot's Guide to Algebra Word Problems by Izolda Fotiyeva
                                                      5. Cartoon Guide ...
                                                        1. Cartoon Guide to Calculus by Larry Gonick
                                                        2. Cartoon Guide to Physics by Larry Gonick
                                                      6. Data ...
                                                        1. Data Mining For Dummies by Meta S. Brown
                                                        2. Data Science For Dummies by Lillian Pierson
                                                        3. Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman
                                                      7. Differential Equations ...
                                                        1. Differential Equations For Dummies by Steven Holzner
                                                        2. Differential Equations Workbook For Dummies by Steven Holzner
                                                      8. Excel Data Analysis For Dummies by Stephen L. Nelson
                                                      9. Geometry ...
                                                        1. Geometry Essentials For Dummies by Mark Ryan
                                                        2. Geometry For Dummies by Mark Ryan
                                                        3. Geometry Workbook For Dummies by Mark Ryan
                                                        4. Geometry: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Allen Ma
                                                      10. How to Solve Word Problems in Algebra by Mildred Johnson
                                                      11. Linear Algebra For Dummies by Mary Jane Sterling
                                                      12. Manga Guide to ...
                                                        1. Manga Guide to Calculus by Hiroyuki Kojima
                                                        2. Manga Guide to Linear Algebra by Shin Takahashi
                                                        3. Manga Guide to Physics by Hideo Nitta
                                                        4. Manga Guide to Regression Analysis by Shin Takahashi
                                                        5. Manga Guide to Relativity by Hideo Nitta
                                                      13. Math Word Problems ...
                                                        1. Math Word Problems Demystified by Allan Bluman
                                                        2. Math Word Problems For Dummies by Mary Jane Sterling
                                                      14. Optimization Modeling with Spreadsheets by Kenneth R. Baker
                                                      15. Physics ...
                                                        1. Physics I For Dummies by Steven Holzner
                                                        2. Physics I Workbook For Dummies by Steven Holzner
                                                        3. Physics II For Dummies by Steven Holzner
                                                      16. Pre-Calculus ...
                                                        1. Pre-Calculus For Dummies by Yang Kuang
                                                        2. Pre-Calculus Workbook For Dummies by Yang Kuang
                                                        3. Pre-Calculus: 1,001 Practice Problems For Dummies (+ Free Online… by Mary Jane Sterling
                                                      17. Predictive Analytics For Dummies by Anasse Bari
                                                      18. R For Dummies by Andrie de Vries
                                                      19. Schaum's ...
                                                        1. Schaum's Outline of Introduction to Probability and Statistics by Seymour Lipschutz
                                                        2. Schaum's Outline of Probability by Seymour Lipschutz
                                                        3. Schaum's Outline of Probability and Statistics: 760 Solved Problems + 20 Videos by John Schiller
                                                        4. Schaum's Outline of Statistics by Murray Spiegel
                                                      20. Technical Math For Dummies by Barry Schoenborn
                                                      21. Trigonometry ...
                                                        1. Trigonometry For Dummies by Mary Jane Sterling
                                                        2. Trigonometry Workbook For Dummies by Mary Jane Sterling

                                                      Friday, August 7, 2015

                                                      Sunday, June 7, 2015

                                                      Quantifying the Efficacy of Machine Learning Algorithms

                                                      There are many articles, books, talks, and so on, all making claims about machine learning algorithms. Out of those, how many actually QUANTIFY THE EFFICACY OF THE ALGORITHM?

                                                      Below are some references which will hopefully save people some leg work and help others quantify the performance of their algorithms.
                                                      1. Articles
                                                      2. Books
                                                        1. Assessing and Improving Prediction and Classification by Timothy Masters
                                                        2. Evaluating and Comparing the Performance of Machine Learning Algorithms by Melanie Mitchell
                                                        3. Evaluating Learning Algorithms: A Classification Perspective by Japkowicz, Shah
                                                          1. Amazon
                                                            1. Customer Reviews
                                                              1. Howard B. Bandy on July 7, 2014
                                                                1. ... examples presented are all of stationary data ...
                                                          2. Cambridge.Org
                                                            1. Looking for an examination copy?
                                                              1. If you are interested in the title for your course we can consider offering an examination copy. To register your interest please contact collegesales@cambridge.org providing details of the course you are teaching.
                                                          3. Google
                                                            1. Has table of contents and you can look at some randomly selected pages
                                                          4. MohakShah.Com
                                                            1. Email Address: eval@mohakshah.com
                                                            2. For electronic editions, ...
                                                            3. Computing Resources
                                                        4. Evaluation and Analysis of Supervised Learning Algorithms and Classifiers by Niklas Lavesson
                                                        5. Machine Learning and Data Mining: 14 Evaluation and Credibility by Pier Luca Lanzi
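
                                                      As a concrete starting point for quantifying efficacy, two of the most common classifier metrics, precision and recall, take only a few lines of Python to compute. The predictions and labels below are hypothetical, for illustration only:

```python
def confusion_counts(predictions, labels):
    # Count true positives, false positives, and false negatives
    # for a binary classifier (positive class = 1).
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return tp, fp, fn

def precision_recall(predictions, labels):
    # precision = tp / (tp + fp); recall = tp / (tp + fn)
    tp, fp, fn = confusion_counts(predictions, labels)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    return precision, recall

predictions = [1, 0, 1, 1, 0]
labels      = [1, 0, 0, 1, 1]
print(precision_recall(predictions, labels))  # (0.666..., 0.666...)
```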