 Articles
 20 Most BikeFriendly Cities in the World, From MalmÃ¶ to Montreal by Mikael ColvilleAndersen  06.14.17
 Citibike Business Opportunity: Advertising by Josh Yoon  October 15, 2017
 CMU student creates cool maps of Pittsburgh bikeshare stats by Ryan Deto  Dec 9, 2015
 Data from Public Bicycle Hire Systems by Mark Padgham  October 17, 2017
 Here’s Proof That Commuter Bikes Don’t Have to Suck by Michael Calore  08.08.17
 Is it faster to take a bike or taxi in NYC? by David Smith  October 18, 2017
 New Yorkers, municipal bikes, and the weather by David Smith  January 29, 2016
 NYC Citi Bike Visualization  A Hint of Future Transportation by Summer Sun  August 26, 2017
 Photo of the Week: A Dizzying View of a Bicycle Graveyard in China by Laura Mallonee  6.30.17
 Reservoir Balancing Constraint with Applications to BikeSharing by Joris Kinable
 Tale of TwentyTwo Million Citi Bikes: Analyzing the NYC Bike Share System by Todd Schneider  January 13, 2016
 Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate by Claire Vignon Keser  April 26, 2017
 When are Citi Bikes Faster than Taxis in New York City? by Todd Schneider  September 26, 2017
 Why Investors Are Betting That Bike Sharing Is the Next Uber by Erin Griffith  10.16.17
 GitHub
 DeFusco, Albert (AlbertDeFusco)
 DeMarco, Steven (stevedem)
 Whittaker, Tim (timsetsfire)
 Kaggle: Bike Sharing Demand
 Slack Channels
 PGH Data Science
 Web Pages
 bikepgh.org
 healthyridepgh.com
 meetup.com  PGH Data Science
 pghbikeshare.org
The Bit Plumber
Friday, September 29, 2017
Biking and Machine Learning
Tuesday, July 11, 2017
Mapping the Machine Learning Time Feature to a Circle to Compute Distance
One of the standard questions in machine learning is to compute the
distance between two time intervals. If did a standard time interval
difference between 1:00 am and 12:59 PM, there would be a lot of minutes
between the two in terms of elapsed time. However, from a human
activity perspective, there isn't much of a difference between the two.
They are done really early in the morning.
One approach to solving the problem is to map the times 00:00 to 23:59 to a circle and then compute the chord length between the two mappings. For a discussion on how to compute chord length, please refer to the Wikipedia article "Circular Segment."
One approach to solving the problem is to map the times 00:00 to 23:59 to a circle and then compute the chord length between the two mappings. For a discussion on how to compute chord length, please refer to the Wikipedia article "Circular Segment."
Below is a picture of a circle with the cord length denoted by c
The general equation to compute the length of the chord is given by the equation below
For simplicity will pick R to be 1. The above equation now simplifies to the following
The above concepts and math was implemented in the following code using Python 3.6.
import math def create_hours_minutes_in_a_day(): number_of_hours_in_day = 24 number_of_minutes_in_an_hour = 60 number_of_minutes_in_a_day = 0 hours_minutes_in_day = [] for hour in range(0, number_of_hours_in_day): if hour < 10: hour_to_print = '0' + str(hour) else: hour_to_print = str(hour) for minute in range(0, number_of_minutes_in_an_hour): if minute < 10: minute_to_print = '0' + str(minute) else: minute_to_print = str(minute) hours_minutes_in_day.append(hour_to_print+':'+minute_to_print) number_of_minutes_in_a_day += 1 return hours_minutes_in_day, number_of_minutes_in_a_day def map_hour_minute_in_day_to_circle(hour_colon_minute_in_day, number_of_minutes): number_of_degrees_in_a_circle = 360 increment_size_in_degrees = number_of_degrees_in_a_circle / number_of_minutes current_angle_in_degrees = 0 map_hour_minute_to_circle = dict() for x in hour_colon_minute_in_day: map_hour_minute_to_circle[x] = current_angle_in_degrees current_angle_in_degrees += increment_size_in_degrees return map_hour_minute_to_circle def compute_arc_length_between_time_interval(mapping_of_hour_minute_to_angle, time_a, time_b): # Since we want the relative distance betwee two time intervals, # can set the circle to a radius = 1 angle_a = mapping_of_hour_minute_to_angle[time_a] angle_b = mapping_of_hour_minute_to_angle[time_b] chord_length = 2 * math.sin ( math.radians(angle_b  angle_a) / 2 ) print("Time A: ", time_a, "Angle A: ", angle_a) print("Time b: ", time_b, "Angle A: ", angle_b) print("Chord: ", chord_length) print("") hour_colon_minute_in_day, number_of_minutes = create_hours_minutes_in_a_day() map_hour_minute_to_angle = map_hour_minute_in_day_to_circle(hour_colon_minute_in_day, number_of_minutes) compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '06:00') compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '12:00') compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '18:00') compute_arc_length_between_time_interval(map_hour_minute_to_angle, '00:00', '23:59')
The output from the above code is below
Time A: 00:00 Angle A: 0 Time b: 06:00 Angle A: 90.0 Chord: 1.4142135623730951  Time A: 00:00 Angle A: 0 Time b: 12:00 Angle A: 180.0 Chord: 2.0  Time A: 00:00 Angle A: 0 Time b: 18:00 Angle A: 270.0 Chord: 1.4142135623730951  Time A: 00:00 Angle A: 0 Time b: 23:59 Angle A: 359.75 Chord: 0.0043633196686735124 
Wednesday, June 14, 2017
Don’t use [deep learning  Hadoop] your data isn’t that big
 Don't use deep learning your data isn't that big by Jeff Leek
 Don't use Hadoop  your data isn't that big by Chris Stucchio
Friday, March 24, 2017
Study Algorithms by Playing Pick 2: Find the Sum of Any 2 Integers from a List that are Equal to a Specific Value
The "Algorithms and Data Structure Problems" Pittsburgh group decided to study algorithms using the book Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein (CLRS) [Edition 3].
Even though CLRS is a standard text book at many universities, it may not be appropriate given our audience. Below is my recommendation for an approach that may be appropriate for the group.
Problem Statement
We begin by explicitly stating what appears to be a simple problem: Find two integers from a list of integers that sum to a specific value. For example, given [1, 2, 3, 4, 5, 6], which two integers sum to 11? Mathematically, this can be stated as follows
Additional requirements are as follows
The implications of the additional requirements will be addressed
throughout this blog post. Also, to keep things simple, we are going to
state that no concurrent processing is allowed.
Note the writing style used in the problem statement. It's formal enough that people can learn something concrete but approachable enough that people won't skip it for fear of feeling dumb. That is why, initially, no Big O notation is used.
Obviously, the goal is not to teach a trivial addition algorithm. The goal is to help people start thinking about how to find and evaluate algorithms. The addition algorithm is a nice example for learning how to approach algorithms in general. In the particular case, the progression is as follows
Concepts are demonstrated via code snippets. The code snippets are executed using Python 3.6.1.
Simplest Case
The simplest case assumes that the integers are sorted and the quantity of integers is small enough that brute force is acceptable. Code Snippet 1 satisfies the simplest case. Please note that there is nothing wrong with brute force as long as it satisfies the requirements. Sometimes people forget this.
If the integer list is unsorted, a simple solution is to just sort it. The consequences of sorting are addressed later in this blog. Perhaps sorted data is unnecessary? This too will be addressed later.
Code Snippet 1
Notice that the diagonal of the matrix has N/A. This corresponds to the fact that i != j. Another way of stating this is that we cannot reuse integers.
Before looking at the Big O notation, let's just count all the possible permutations of addition. Here we use the term permutation as strictly defined in mathematics. For a discussion of the difference between permutations and combinations, refer to Combinations and Permutations at MathIsFun.Com. The example above corresponds to permutations without repetition, which means that we should use the equation below
Here n is the number of integers in the list, and r = 2 because we are taking 2 numbers at a time. Below is a table that evaluates the equation above. It shows how quickly the number of required additions increases as the number of integers in the list increases.
Reduce Computations by Taking Advantage of Symmetry Arising from Commutative Property of Addition
Code Snippet 2: Reduce computations by taking advantage of symmetry
Counter Example
Alternative Approach: Binary Search for the Desired Integer
Because the above approaches are not scalable, we need to step back and try another approach. In this other approach, we will use the a priori knowledge that the integers are sorted. Since we know what the desired integer is, we can do a binary search for it over the ordered integers. For background information about binary search, refer to the book "Problem Solving with Algorithms and Data Structures using Python" by Brad Miller and David Ranum.
Luckily, Python has the bisect module which can be used to search sorted lists. Code Snippet 3 uses the bisect_left method of the bisect_module.
Code Snippet 3 (Source: Docs.Python.Org  bisect module  Searching Sorted Lists)
Code Snippet 4: bisect_left
Code Snippet 5
Now, let's talk about the number of computations involved in Code Snippet 5. It is commonly known that a binary search is O(log n). Because we have to do the binary search once for each integer, we have to multiply by n. This results in a performance of O(n [log n] ). For the rationale of why the base of the log does not matter, refer to Wikipedia: Sorting algorithm.
Significant Performance Improvement: Computations Reduced from O(n^{2}) to O(n [log n])
Notice that in going from Code Snippet 2 to Code Snippet 5, performance went from O(n^{2}) to O(n [log n]). This is a significant performance improvement.
To visually confirm this, let's examine the graph below. Notice how the purple line {O(n [log n]} is relatively linearly but the dark blue line {O(n^{2}} just shoots straight up.
Why Not Just Use Python Builtin Functionality?
Those familiar with Python will be tempted to simplify the above code by using the index() method of list objects, which results in the code snippet below.
Code Snippet 6: Use list.index
One would be tempted to simplify the above code snippet by directly using the in operator. Unfortunately, we cannot do this, because we not only have to know that the integer is present, but that its index differs from the current integer being processed (i != j).
Since the Python set collection has a O(1) performance, why not use it? Unfortunately, in Python the set collection quietly ignores repeated integers. Recall that in the problem statement, [1, 2, 2, 4, 5, 6, 7] was allowable. Later in the blog, we will address this sublety.
Complicate Problem by Having Integer List Not Sorted
The complication of an unsorted integer list is easily addressed simply by just sorting. It is commonly known that no comparison sorts can perform better than O(n [log n]) in the average or worst case. For background information on this, refer to Wikipedia: Sorting algorithm. Consequently, the computations required to determine if two integers in an unsorted integer list sum to a particular value is
The total computation complexity is O(2 * n [log n] ) == O(n [log n] ).
Remove Restriction: Integers can Now be ReUsed (i = j)
We can now finally use the in.
Code Snippet 7: Use in operator
ReUse of Integers Permits the Application of Set
Because the reuse of integers is allowed, we can use Python's built in set and get a O(1) lookup performance. However, don't forget that you initially have to create the set which involves a cost of O(n). Luckily this is just a onetime setup fee. Please note that creating a set involves hashing, which, depending on the nature of the data, can be problematic.
Also, by using set, we are implicitly removing the additional requirement that the system has no memory because we are storing hash values. This is an example of why people must be aware of all ramifications associated with data structures and their associated algorithms.
Code Snippet 8: Use set
Code Snippet 9: From StackOverFlow
At this point, it is important to remember that we removed a restriction and integers can now be reused (i=j). This means that when "sum_in([2, 1, 5, 6, 8], 2)" returns True, it is a correct answer because "1 + 1 == 2." This may or may not be acceptable depending on the situation.
Also, at this point, we need to address the distinction between "repeated integers not allowed" and "integers can be reused." Repeated integers are simply not allowed because Python's set quietly ignores them. "Integers can be reused" refers to the fact that "sum_in([2, 1, 5, 6, 8], 2)" returning True is a correct answer because "1 + 1 == 2."
Code snippets 8 and 9 require that the entire set (hash values) is computed ahead of time. Why not create the hash value as we process the data? This way, we might find the desired integers before we have created all the hash values. The other advantage of creating the hash values as we process the data is that it allows for repeated integers ([1, 2, 2, 4, 5, 6, 7]) but not the reuse of integers.
.
Code Snippet 10: Generate hash values as process data
Above material similar to exercise 2.37 of CLRS
Some people will recognize that the material above is similar to exercise 2.37 of CLRS. Below is the actual exercise. We have deviated from it for pedagogical reasons.
Alternative Solution
Zhao HG on GitBook solves the problem in O(n log n) time using the code snippet below. Notice its simple elegance of just walking two pointers. However, it is complex because have 3 tests, and for each test, there is a different action.
Code Snippet 11
Some people will have an opinion that such trivial code snippets do not require unit tests. I strongly disagree. Unit tests are a good way to make sure that you haven't done anything silly. For the purpose of this blog, I have created rather extensive unit tests to prevent a lot of back and forth when people propose adding short circuit logic to reduce the number of computations performed. The conversations will probably start and end with, "Did you run the proposed short circuit logic against the unit tests?"
As for the unit tests themselves, we begin by realizing that each integer can be negative, zero or positive and that the desired sum can also be negative, zero or positive. This leads to 27 possible test cases. For an explicit list of all 27 tests, click here. Luckly, most of them are not applicable because they can't happen mathematically or they can't happen because the data is sorted.
In turn, each test case has subtests. One set of tests finds the desired sum while another set of tests does not. Within, those subtests, position matters: left edge, somewhere in the middle and right edge. To see the tests themselves, click here. When all is said and done, we end up with 104 tests. This may seem excessive, but it beats the alternative of having countless conversations about why the computations for the nested for loops was not short circuited.
Also, if you want to create tests to estimate run time, one could use the timeit module. Another option is to use the pytestbenchmark.
Subtlety: Repeated Integers Versus ReUse of Integers
If repeated integers are allowed, you could have a list of integers like [3, 5, 1, 5, 12]. So, if repeated integers are allowed and wanted the sum to be 10, you could use the two 5's to come up with 10.
However, if repeated integers are not allowed, the previous list would become [3, 5, 1, 12]. In this case, if wanted the sum to be 10, there would be no two integers that sum to 10.
Now comes the caveat. If have [3, 5, 1, 12] and integers can be reused, could reuse 5 a second time to have two integers that sum to 10.
Summary
This blog demonstrates two things.
Next you have to be creative. What if we subtract the integer that we're currently examining from the desired sum and then use the resulting integer to do a search? Understanding the Big O notation is a good skill and helps you understand consequences, but you also need to understand the principles behind it.
Lastly, know your data. Questions like "Is it sorted or unsorted?" or "Is it symmetric?" will help us in the algorithm selection process.
Presentations
To download the presentation for the PGH Data Science Lightning Talks on 7/5/2017, click here.
To download the presentation for the PGHPython: Study Algorithms by Playing Pick 2 on 7/26/2017, click here. For a video recording of the presentation, click here (YouTube).
Even though CLRS is a standard text book at many universities, it may not be appropriate given our audience. Below is my recommendation for an approach that may be appropriate for the group.
Problem Statement
We begin by explicitly stating what appears to be a simple problem: Find two integers from a list of integers that sum to a specific value. For example, given [1, 2, 3, 4, 5, 6], which two integers sum to 11? Mathematically, this can be stated as follows

Given a set N = {x_{1}, x_{2}, ..., x_{n} } and a constant c, find x_{i} and x_{j} such that
 x_{i} + x_{j} = c
 i != j
 Allow for duplication of integers. In other words, [1, 2, 2, 4, 5, 6, 7] is valid.
 Only have to find 1 permutation of 2 integers that sum to c. Don't have to find all the permutations.
 Can't assume that the integers are sorted.
 System does not have any memory of what has been previously processed. For example, it cannot remember whether the last numbers processed were positive or negative or zero.
Note the writing style used in the problem statement. It's formal enough that people can learn something concrete but approachable enough that people won't skip it for fear of feeling dumb. That is why, initially, no Big O notation is used.
Obviously, the goal is not to teach a trivial addition algorithm. The goal is to help people start thinking about how to find and evaluate algorithms. The addition algorithm is a nice example for learning how to approach algorithms in general. In the particular case, the progression is as follows
 Brute Force: Start with nested for loops.
 Look for Patterns: Notice that "2 + 4" as well as "4 + 2" equal 6. Therefore can reduce the computations by half.
 Sort and Perform Binary Search
 Loosen Requirements: Reuse of integers allows the use of hashing.
Simplest Case
The simplest case assumes that the integers are sorted and the quantity of integers is small enough that brute force is acceptable. Code Snippet 1 satisfies the simplest case. Please note that there is nothing wrong with brute force as long as it satisfies the requirements. Sometimes people forget this.
If the integer list is unsorted, a simple solution is to just sort it. The consequences of sorting are addressed later in this blog. Perhaps sorted data is unnecessary? This too will be addressed later.
Code Snippet 1
def sum_in(int_list, desired_sum): for i, int_a in enumerate(int_list): for j, int_b in enumerate(int_list): if i != j: sum_of_ints = int_a + int_b print("int_a = {}, int_b = {}, sum_of_ints = {}".format(int_a, int_b, sum_of_ints)) if desired_sum == sum_of_ints: return True # test 1: Desired sum exists print(" Start test 1") SOME_INTEGER_LIST = [1, 2, 3, 4, 5, 6] DESIRED_SUM_OF_INTS = 11 if sum_in(SOME_INTEGER_LIST, DESIRED_SUM_OF_INTS) is True: print("DESIRED_SUM_OF_INTS = {} was found".format(DESIRED_SUM_OF_INTS)) else: print("DESIRED_SUM_OF_INTS = {} was not found".format(DESIRED_SUM_OF_INTS)) # test 2: Desired sum does not exist print(" Start test 2") SOME_INTEGER_LIST = [2, 4, 8] DESIRED_SUM_OF_INTS = 7 if sum_in(SOME_INTEGER_LIST, DESIRED_SUM_OF_INTS) is True: print("DESIRED_SUM_OF_INTS = {} was found".format(DESIRED_SUM_OF_INTS)) else: print("DESIRED_SUM_OF_INTS = {} was not found".format(DESIRED_SUM_OF_INTS))Output of Code Snippet
 Start test 1 int_a = 1, int_b = 2, sum_of_ints = 3 int_a = 1, int_b = 3, sum_of_ints = 4 int_a = 1, int_b = 4, sum_of_ints = 5 int_a = 1, int_b = 5, sum_of_ints = 6 int_a = 1, int_b = 6, sum_of_ints = 7 int_a = 2, int_b = 1, sum_of_ints = 3 int_a = 2, int_b = 3, sum_of_ints = 5 int_a = 2, int_b = 4, sum_of_ints = 6 int_a = 2, int_b = 5, sum_of_ints = 7 int_a = 2, int_b = 6, sum_of_ints = 8 ... int_a = 5, int_b = 1, sum_of_ints = 6 int_a = 5, int_b = 2, sum_of_ints = 7 int_a = 5, int_b = 3, sum_of_ints = 8 int_a = 5, int_b = 4, sum_of_ints = 9 int_a = 5, int_b = 6, sum_of_ints = 11 DESIRED_SUM_OF_INTS = 11 was found  Start test 2 int_a = 2, int_b = 4, sum_of_ints = 6 int_a = 2, int_b = 8, sum_of_ints = 10 int_a = 4, int_b = 2, sum_of_ints = 6 int_a = 4, int_b = 8, sum_of_ints = 12 int_a = 8, int_b = 2, sum_of_ints = 10 int_a = 8, int_b = 4, sum_of_ints = 12 DESIRED_SUM_OF_INTS = 7 was not foundYou will notice that "2 + 4" as well as "4 + 2" equal 6.
int_a = 2, int_b = 4, sum_of_ints = 6 ... int_a = 4, int_b = 2, sum_of_ints = 6 ...The above corresponds to the commutative property of addition. If we create a matrix of the possible permutations of additions for [1, 2, 3, 4, 5, 6], a symmetric matrix is created.
Notice that the diagonal of the matrix has N/A. This corresponds to the fact that i != j. Another way of stating this is that we cannot reuse integers.
Before looking at the Big O notation, let's just count all the possible permutations of addition. Here we use the term permutation as strictly defined in mathematics. For a discussion of the difference between permutations and combinations, refer to Combinations and Permutations at MathIsFun.Com. The example above corresponds to permutations without repetition, which means that we should use the equation below
Here n is the number of integers in the list, and r = 2 because we are taking 2 numbers at a time. Below is a table that evaluates the equation above. It shows how quickly the number of required additions increases as the number of integers in the list increases.
Number of Integers Number of in the List Additions   2 2 3 6 4 ` 12 5 20 6 30 10 90 100 9,900 150 22,350 500 249,500 1,000 999,000Now, let's look at the Big O notation. The nested for loops result in O(n^{2}). Clearly, this approach is not scalable.
Reduce Computations by Taking Advantage of Symmetry Arising from Commutative Property of Addition
Code Snippet 2: Reduce computations by taking advantage of symmetry
def sum_in(int_list, desired_sum): for i in range(1, len(int_list)): int_a = int_list[i] desired_int = desired_sum  int_a for j in range(0, i): # Cutting computations in half int_b = int_list[j] print("i = {}, j = {}".format(i, j)) print("int_a = {}, int_b = {}, Int Sum = {}".format(int_a, int_b, int_a+int_b)) if (int_b == desired_int): return True return FalseWe could try to short circuit the computations being performed in this code snippet. For example, if the desired sum is positive, both integers are positive, and the integer sum is greater than the desired sum, there is no reason to continue to perform computations. Below is a counter example. The key is that the integers can be negative, zero or positive and the system does not have any memory of what has been processed previously. Notice that here we are using one of the earliermentioned additional requirements.
Counter Example
def sum_in( [6, 2, 0, 5, 6, 8], 2 ) i = 1, j = 0 int_a = 2, int_b = 6, Int Sum = 8 i = 2, j = 0 int_a = 0, int_b = 6, Int Sum = 6 i = 2, j = 1 int_a = 0, int_b = 2, Int Sum = 2 i = 3, j = 0 int_a = 5, int_b = 6, Int Sum = 1 i = 3, j = 1 int_a = 5, int_b = 2, Int Sum = 3 i = 3, j = 2 int_a = 5, int_b = 0, Int Sum = 5 i = 4, j = 0 int_a = 6, int_b = 6, Int Sum = 0 i = 4, j = 1 int_a = 6, int_b = 2, Int Sum = 4 i = 4, j = 2 int_a = 6, int_b = 0, Int Sum = 6 i = 4, j = 3 int_a = 6, int_b = 5, Int Sum = 11 For the above, notice the following I) Both integers are positive II) The integer sum of 11 is greater than the desired sum of 2 i = 5, j = 0 int_a = 8, int_b = 6, Int Sum = 2 For the above, notice that the integer sum of 2 is equal to the desired sum of 2Even though the number of computations is cut in half by iterating over half the matrix, the approach is still not scalable. This is just a repetition that O(cn^{2}) == O(n^{2}) for any constant c.
Alternative Approach: Binary Search for the Desired Integer
Because the above approaches are not scalable, we need to step back and try another approach. In this other approach, we will use the a priori knowledge that the integers are sorted. Since we know what the desired integer is, we can do a binary search for it over the ordered integers. For background information about binary search, refer to the book "Problem Solving with Algorithms and Data Structures using Python" by Brad Miller and David Ranum.
Luckily, Python has the bisect module which can be used to search sorted lists. Code Snippet 3 uses the bisect_left method of the bisect_module.
Code Snippet 3 (Source: Docs.Python.Org  bisect module  Searching Sorted Lists)
def index(a, x): 'Locate the leftmost value exactly equal to x' i = bisect_left(a, x) if i != len(a) and a[i] == x: return i raise ValueErrorSince we are dealing with open source, we can go look at the code for bisect_left and see that it is your standard binary search.
Code Snippet 4: bisect_left
def bisect_left(a, x, lo=0, hi=None): """Return the index where to insert item x in list a, assuming a is sorted. The return value i is such that all e in a[:i] have e < x, and all e in a[i:] have e >= x. So if x already appears in the list, a.insert(x) will insert just before the leftmost x already there. Optional args lo (default 0) and hi (default len(a)) bound the slice of a to be searched. """ if lo < 0: raise ValueError('lo must be nonnegative') if hi is None: hi = len(a) while lo < hi: mid = (lo+hi)//2 if a[mid] < x: lo = mid+1 else: hi = mid return loPutting it all together, results in the code snippet below.
Code Snippet 5
from bisect import bisect_left def index(some_list, some_value): 'Locate the leftmost value exactly equal to some_value' i = bisect_left(some_list, some_value) if i != len(some_list) and some_list[i] == some_value: return i raise ValueError def sum_in(int_list, desired_sum): int_list.sort() for i, int_a in enumerate(int_list): desired_int = desired_sum  int_a try: j = index(int_list, desired_int) if (i == j): continue else: return True except ValueError: continue return FalseBy the way, some consider the use of exceptions for flow control a bad practice. For a typical article on this topic, refer to "Using Exception Handling for Control Flow (in Python)" by Scott Lobdell.
Now, let's talk about the number of computations involved in Code Snippet 5. It is commonly known that a binary search is O(log n). Because we have to do the binary search once for each integer, we have to multiply by n. This results in a performance of O(n [log n] ). For the rationale of why the base of the log does not matter, refer to Wikipedia: Sorting algorithm.
Significant Performance Improvement: Computations Reduced from O(n^{2}) to O(n [log n])
Notice that in going from Code Snippet 2 to Code Snippet 5, performance went from O(n^{2}) to O(n [log n]). This is a significant performance improvement.
To visually confirm this, let's examine the graph below. Notice how the purple line {O(n [log n]} is relatively linearly but the dark blue line {O(n^{2}} just shoots straight up.
Why Not Just Use Python Builtin Functionality?
Those familiar with Python will be tempted to simplify the above code by using the index() method of list objects, which results in the code snippet below.
Code Snippet 6: Use list.index
def sum_in(int_list, desired_sum): for i, int_a in enumerate(int_list): desired_int = desired_sum  int_a try: j = int_list.index(desired_int) if (i == j): continue else: return True except ValueError: continue return FalseThe use of list.index is equivalent to invoking the in operator. Note that in calls the __contains__ method. This is an O(n) operation (StackOverFlow.Com, Complexiity at Wiki.Python.Org). Since we have to execute O(n) for each iteration of the for statement, the overall performance is now back to O(n^{2}). Yucky.
One would be tempted to simplify the above code snippet by directly using the in operator. Unfortunately, we cannot do this, because we not only have to know that the integer is present, but that its index differs from the current integer being processed (i != j).
Since the Python set collection has a O(1) performance, why not use it? Unfortunately, in Python the set collection quietly ignores repeated integers. Recall that in the problem statement, [1, 2, 2, 4, 5, 6, 7] was allowable. Later in the blog, we will address this sublety.
Complicate Problem by Having Integer List Not Sorted
The complication of an unsorted integer list is easily addressed simply by just sorting. It is commonly known that no comparison sorts can perform better than O(n [log n]) in the average or worst case. For background information on this, refer to Wikipedia: Sorting algorithm. Consequently, the computations required to determine if two integers in an unsorted integer list sum to a particular value is
 n [log n] cost for sorting the unordered integer list
 n [log n] cost for a binary search for the other integer
Remove Restriction: Integers can Now be ReUsed (i = j)
We can now finally use the in.
Code Snippet 7: Use in operator
def sum_in(int_list, desired_sum): for i, int_a in enumerate(int_list): desired_int = desired_sum  int_a if desired_int in int_list: return True else: continue return FalseRemember, we are now back to the O(n^{2}) performance. Yucky.
ReUse of Integers Permits the Application of Set
Because the reuse of integers is allowed, we can use Python's built in set and get a O(1) lookup performance. However, don't forget that you initially have to create the set which involves a cost of O(n). Luckily this is just a onetime setup fee. Please note that creating a set involves hashing, which, depending on the nature of the data, can be problematic.
Also, by using set, we are implicitly removing the additional requirement that the system has no memory because we are storing hash values. This is an example of why people must be aware of all ramifications associated with data structures and their associated algorithms.
Code Snippet 8: Use set
def sum_in(int_list, desired_sum): int_set = set(int_list) for i, int_a in enumerate(int_list): desired_int = desired_sum  int_a if desired_int in int_set: return True else: continue return FalseThe above code snippet is the verbose version of the code snippet found at the StackOverFlow question "How to write an algorithm to check if the sum of any two numbers in an array/list matches a given number?"
Code Snippet 9: From StackOverFlow
def sum_in(int_list, desired_sum): set_of_input = set(int_list) # O(n) return any( ( desired_sum  n) in set_of_input for n in set_of_input ) # O(n)
At this point, it is important to remember that we removed a restriction and integers can now be reused (i=j). This means that when "sum_in([2, 1, 5, 6, 8], 2)" returns True, it is a correct answer because "1 + 1 == 2." This may or may not be acceptable depending on the situation.
Also, at this point, we need to address the distinction between "repeated integers not allowed" and "integers can be reused." Repeated integers are simply not allowed because Python's set quietly ignores them. "Integers can be reused" refers to the fact that "sum_in([2, 1, 5, 6, 8], 2)" returning True is a correct answer because "1 + 1 == 2."
Code snippets 8 and 9 require that the entire set (hash values) is computed ahead of time. Why not create the hash value as we process the data? This way, we might find the desired integers before we have created all the hash values. The other advantage of creating the hash values as we process the data is that it allows for repeated integers ([1, 2, 2, 4, 5, 6, 7]) but not the reuse of integers.
.
Code Snippet 10: Generate hash values as process data
def sum_in(int_list, desired_sum): indexMap = {} for i in range( len( int_list ) ): if ( desired_sum  int_list[i] ) in indexMap: return True else: indexMap[ int_list[i] ] = i return FalseNote that code snippet 10 returns False for "sum_in([2, 1, 5, 6, 8], 2)". In other words, integers cannot be reused (i != j).
Above material similar to exercise 2.37 of CLRS
Some people will recognize that the material above is similar to exercise 2.37 of CLRS. Below is the actual exercise. We have deviated from it for pedagogical reasons.
Alternative Solution
Zhao HG on GitBook solves the problem in O(n log n) time using the code snippet below. Notice its simple elegance of just walking two pointers. However, it is complex because have 3 tests, and for each test, there is a different action.
Code Snippet 11
def two_sum(a, x): l, r = 0, len(a)1 while l < r: if a[l] + a[r] == x: return True elif a[l] + a[r] < x: l += 1 else: r = 1 return FalseNotice how we have examined the algorithm from 3 different dimensions.
 The first dimension was time. How long does it take to perform the computations?
 The second dimension was required storage. Please note that storage is used in the generic sense. It does not just refer to memory. For further details, please refer to the Wikipedia article "Memory Hierarchy."
 The third dimension is algorithm/software complexity. You can use the radon package to quantify the complexity. Please note that complexity is a critical consideration when deploying production code. For an example, refer to "Why Netflix Never Implemented The Algorithm That Won The Netflix $1 Million Challenge by Mike Masnick".
Some people will have an opinion that such trivial code snippets do not require unit tests. I strongly disagree. Unit tests are a good way to make sure that you haven't done anything silly. For the purpose of this blog, I have created rather extensive unit tests to prevent a lot of back and forth when people propose adding short circuit logic to reduce the number of computations performed. The conversations will probably start and end with, "Did you run the proposed short circuit logic against the unit tests?"
As for the unit tests themselves, we begin by realizing that each integer can be negative, zero or positive and that the desired sum can also be negative, zero or positive. This leads to 27 possible test cases. For an explicit list of all 27 tests, click here. Luckly, most of them are not applicable because they can't happen mathematically or they can't happen because the data is sorted.
In turn, each test case has subtests. One set of tests finds the desired sum while another set of tests does not. Within, those subtests, position matters: left edge, somewhere in the middle and right edge. To see the tests themselves, click here. When all is said and done, we end up with 104 tests. This may seem excessive, but it beats the alternative of having countless conversations about why the computations for the nested for loops was not short circuited.
Also, if you want to create tests to estimate run time, one could use the timeit module. Another option is to use the pytestbenchmark.
Subtlety: Repeated Integers Versus ReUse of Integers
If repeated integers are allowed, you could have a list of integers like [3, 5, 1, 5, 12]. So, if repeated integers are allowed and wanted the sum to be 10, you could use the two 5's to come up with 10.
However, if repeated integers are not allowed, the previous list would become [3, 5, 1, 12]. In this case, if wanted the sum to be 10, there would be no two integers that sum to 10.
Now comes the caveat. If have [3, 5, 1, 12] and integers can be reused, could reuse 5 a second time to have two integers that sum to 10.
Summary
This blog demonstrates two things.
 First it is critical to know the runtimes of builtin algorithms/tools of a language.
 Secondly, the material demonstrates a common process. You start out with what appears a simple problem. You try a brute force approach but quickly realize that it won't scale. You then look for patterns like symmetry to reduce the number of required computations. If you are lucky, this will be good enough. If not, an alternative approach will have to be tried.
Next you have to be creative. What if we subtract the integer that we're currently examining from the desired sum and then use the resulting integer to do a search? Understanding the Big O notation is a good skill and helps you understand consequences, but you also need to understand the principles behind it.
Lastly, know your data. Questions like "Is it sorted or unsorted?" or "Is it symmetric?" will help us in the algorithm selection process.
Presentations
To download the presentation for the PGH Data Science Lightning Talks on 7/5/2017, click here.
To download the presentation for the PGHPython: Study Algorithms by Playing Pick 2 on 7/26/2017, click here. For a video recording of the presentation, click here (YouTube).
Wednesday, February 8, 2017
Work Around Transaction Support in an RDBMS for Bulk Loads  Just Say No
Recently, some of my contacts have been struggling with bulk loading into conventional RDBMS products like Oracle, SQL Server and MySQL. They were surprised by the issue.
However, if you want to be true
heretic, consider asking the question of why use an RDBMS if you don’t need
transaction support?
For further details on this type of stuff, check out the following blog posts
My reply to them is that this is a standard issue. RDBMS products by their nature
guarantee transactions. This is very expensive.
They do a lot of logging to guarantee transactions. For example, a
standard technique in Oracle to add a column to a large table is to create a copy of the
table with the additional column using NOLOGGING. Afterwards you add back in
constraints, indexes, … .
The standard joke in Oracle is that it first does redo and then an
undo and then actually does something. For those familiar with Oracle knobs, undo
has significance.
Also, there is a lot of context switching going on between the
RDBMS and the external interface. Heck, even within Oracle, PL/SQL implemented
BULK COLLECT to reduce all this context switching between PL/SQL and SQL. (Bulk Processing with BULK COLLECT and FORALL by Steven
Feuerstein)
So, the trick is to identify the knobs that minimize logging and
context switching.
For further details on this type of stuff, check out the following blog posts
Thursday, December 22, 2016
Jumping into Spark (JIS): Python / Spark / Logistic Regression (Update 3)
In this blog we will use the Python interface to Spark to determine whether or not someone makes more or less than $50,000. The logistic regression model will be used to make this determination.
Please note that no prior knowledge of Python, Spark or logistic regression is required. However, it is assumed that you are a seasoned software developer.
Please note that no prior knowledge of Python, Spark or logistic regression is required. However, it is assumed that you are a seasoned software developer.
There are various tutorials on Spark. There is the official documentation on Spark. However, if you are an experienced software professional and want to just jump in and kick the tires, there doesn't seem to be much available. Well, at least I couldn't find any.
Yes, there is Spark's Quick Start. Also, there are artifacts like databricks User Guide. Unfortunately, they have a smattering of stuff. You really don't get a chance to jump in.
Let me now explicitly define jumping in. Jumping in involves solving an almost trivial problem that demonstrates a good deal of the concepts and software. Also, it involves skipping steps. Links will be provided so that if someone is interested in the missing steps, they can look up the details. If skipping steps bothers you, immediately stop and go read a tutorial.
As mentioned earlier, we are going to use Python on Spark to address logistic regression for our jumping in. This is our almost trivial problem that demonstrates a good deal of the concepts and software.
The first thing that you need is a working environment. You could install and configure your own environment. However, that would not be in line with jumping in. Instead I recommend using databricks Spark community edition (DSCE). If you are using DSCE, refer to "Welcome to Databricks" on how to create a cluster, create a notebook, attach the notebook to a cluster and actually use the notebook.
Next, you need to connect to the Spark environment. In the database world, you create a database connection. In the Spark world, you create a context (SQLContext, SparkContext/SparkSession). If you are using DSCE, the following three variables will be predefined for you
 SparkContext: sc
 SparkSession: spark
 SQLContext: sqlContext
If you would like the IPython notebook associated with this blog, click here. If for some reason you don't have software to read the IPython notebook, you can download a pdf version of it by clicking here.
We are almost ready to code. First we have to talk about Row and DataFrame. A Row is just like a row in a spreadsheet or a row in a table. A Dataframe is just a collection of Row's. For the most part, you can think of a DataFrame as a table.
Row / Column
Now for the first code snippet. Please note that I took the code snippet from Encode and assemble multiple features in PySpark at StackOverFlow.Com.
Code Snippet 1
from pyspark.sql import Row from pyspark.mllib.linalg import DenseVector row = Row("gender", "foo", "bar") dataFrame = sc.parallelize([ row("0", 3.0, DenseVector([0, 2.1, 1.0])), row("1", 1.0, DenseVector([0, 1.1, 1.0])), row("1", 1.0, DenseVector([0, 3.4, 0.0])), row("0", 3.0, DenseVector([0, 4.1, 0.0])) ]).toDF() dataFrame.collect()
Nothing fancy. You create a row which has column names gender, foo and bar. You then create a bunch of row's with actual data. Lastly, you group the row's into a DataFrame. DenseVector was used to demonstrate that a cell in a Row can have a complex data structure. If you are curious about parallelize and toDF, check the references at the end of the blog. This will be true for the rest of the blog. If you are not sure what some magic word means, go to the reference section at the end of the blog.
If things are working, you should get an output like that shown below.
Output of Code Snippet 1
[Row(gender=u'0', foo=3.0, bar=DenseVector([0.0, 2.1, 1.0])),
Row(gender=u'1', foo=1.0, bar=DenseVector([0.0, 1.1, 1.0])),
Row(gender=u'1', foo=1.0, bar=DenseVector([0.0, 3.4, 0.0])),
Row(gender=u'0', foo=3.0, bar=DenseVector([0.0, 4.1, 0.0]))]
Before going further, we need to take a little detour. Algorithms like numeric data. So we need to do things like convert a column for gender that has Male / Female / Unknown to binary values. We do this by creating two columns.
 The first column will be used to indicate whether the person is or is not male. A one means that the person is male and a zero indicates that the person is not male. StringIndexer will be used to convert categories to numerical values.
 The second column will be used to indicate whether the person is or is not female. A one means that the person is female and a zero indicates that the person is not female.
StringIndexer
Let's actually create a StringIndexer and use it to map/fit/transform categories to numbers.
Code Snippet 2
dataFrame = sqlContext.createDataFrame( [(0, "a"), (1, "b"), (2, "b"), (3, "c"), (4, "c"), (5, "c"), (6,'d'), (7,'d'), (8,'d'), (9,'d')], ["id", "category"]) dataFrame.collect() from pyspark.ml.feature import StringIndexer stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex") modelStringIndexer = stringIndexer.fit(dataFrame) transformedDataFrame = modelStringIndexer.transform(dataFrame) transformedDataFrame.collect()
Output of Code Snippet 2
[Row(id=0, category=u'a', categoryIndex=3.0), Row(id=1, category=u'b', categoryIndex=2.0), Row(id=2, category=u'b', categoryIndex=2.0), Row(id=3, category=u'c', categoryIndex=1.0), Row(id=4, category=u'c', categoryIndex=1.0), Row(id=5, category=u'c', categoryIndex=1.0), Row(id=6, category=u'd', categoryIndex=0.0), Row(id=7, category=u'd', categoryIndex=0.0), Row(id=8, category=u'd', categoryIndex=0.0), Row(id=9, category=u'd', categoryIndex=0.0)]
Notice how d's got 0.0 because they are the most numerous. The letter c's got 1.0 because they are the second most numerous. And so on. The code snippet below will make this more clear.
Code Snippet 3
transformedDataFrame.select('category','categoryIndex').distinct().orderBy('categoryIndex').show()
Output of Code Snippet 3
+++ categorycategoryIndex +++  d 0.0  c 1.0  b 2.0  a 3.0 +++
OneHotEncoder
Next, let's use a OneHotEncoder it to transform the category index that we created earlier to a binary vector.
Code Snippet 4
from pyspark.ml.feature import OneHotEncoder oneHotEncoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVector") oneHotEncodedDataFrame = oneHotEncoder.transform(transformedDataFrame) oneHotEncodedDataFrame.show()
Output of Code Snippet 4
+++++  idcategorycategoryIndexcategoryVector +++++  0 a 3.0 (3,[],[])  1 b 2.0 (3,[2],[1.0])  2 b 2.0 (3,[2],[1.0])  3 c 1.0 (3,[1],[1.0])  4 c 1.0 (3,[1],[1.0])  5 c 1.0 (3,[1],[1.0])  6 d 0.0 (3,[0],[1.0])  7 d 0.0 (3,[0],[1.0])  8 d 0.0 (3,[0],[1.0])  9 d 0.0 (3,[0],[1.0]) +++++
SparseVector
The column categoryVector is a SparseVector. It has 3 parts. The first part is the length of the vector. The second part are the indicies which contain values. The third part are the actual values. Below is a code snippet demonstrating this.
Code Snippet 5
from pyspark.mllib.linalg import SparseVector v1 = SparseVector(5, [0,3], [10,9]) for x in v1: print(x)
Output of Code Snippet 5
10.0 0.0 0.0 9.0 0.0
Notice how category a (categoryVector = (3,[],[])) is not included because it makes the vector entries sum to one and hence linearly dependent. The code snippet below will provide a better visual for this.
Code Snippet 6
Output of Code Snippet 6
By this point you are probably getting impatient. Luckly, we have just one more item to cover before we get to logistic regression. That one item is the VectorAssembler. A VectorAssembler just concatenates columns together. As usual, we will demonstrate what the words mean via a code snippet.
Code Snippet 7
from pyspark.ml.feature import VectorAssembler
dataFrame_1 = spark.createDataFrame([(1, 2, 3), (4,5,6)], ["a", "b", "c"])
vectorAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
dataFrame_2 = vectorAssembler.transform(dataFrame_1)
dataFrame_2.show()
Code Snippet 6
oneHotEncodedDataFrame.select('category','categoryIndex', 'categoryVector').distinct().orderBy('categoryIndex').show()
++++ categorycategoryIndexcategoryVector ++++  d 0.0 (3,[0],[1.0])  c 1.0 (3,[1],[1.0])  b 2.0 (3,[2],[1.0])  a 3.0 (3,[],[]) ++++
VectorAssembler
Code Snippet 7
from pyspark.ml.feature import VectorAssembler
dataFrame_1 = spark.createDataFrame([(1, 2, 3), (4,5,6)], ["a", "b", "c"])
vectorAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
dataFrame_2 = vectorAssembler.transform(dataFrame_1)
dataFrame_2.show()
Output of Code Snippet 7
+++++  a b c features +++++  1 2 3[1.0,2.0,3.0]  4 5 6[4.0,5.0,6.0] +++++
Logistic Regression
We have now learned enough Spark to look at a specific problem involving logistic regression. We are going to work through the example provided in the databricks documentation.
If you are interested in learning about logistic regression, recommend reading "Chapter 6: The Grand daddy of Supervised Artificial Intelligence  Regression (spreadsheet)" of the book Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman (2013). Personally, I like the book because it provides minimal math and an implementation of logistic regression in a spreadsheet.
Drop / Create Table
If you are interested in learning about logistic regression, recommend reading "Chapter 6: The Grand daddy of Supervised Artificial Intelligence  Regression (spreadsheet)" of the book Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman (2013). Personally, I like the book because it provides minimal math and an implementation of logistic regression in a spreadsheet.
Drop / Create Table
 Drop Table
%sql DROP TABLE IF EXISTS adult
 Create Table
%sql CREATE TABLE adult ( age DOUBLE, workclass STRING, fnlwgt DOUBLE, education STRING, education_num DOUBLE, marital_status STRING, occupation STRING, relationship STRING, race STRING, sex STRING, capital_gain DOUBLE, capital_loss DOUBLE, hours_per_week DOUBLE, native_country STRING, income STRING) USING com.databricks.spark.csv OPTIONS (path "/databricksdatasets/adult/adult.data", header "true")
Convert table to a DataFrame
dataset = spark.table("adult")Get a list of columns in original dataset
cols = dataset.columns
 This step has to be done here and not later. Unfortunately, the databricks examples reuses the variable dataset.
from pyspark.ml import Pipeline from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"] stages = [] # stages in our Pipeline for categoricalCol in categoricalColumns: stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index") encoder = OneHotEncoder(inputCol=categoricalCol+"Index", outputCol=categoricalCol+"classVec") # Add stages. These are not run here, but will run all at once later on. stages += [stringIndexer, encoder]Create a StringIndexer on income
label_stringIdx = StringIndexer(inputCol = "income", outputCol = "label") stages += [label_stringIdx]Combine feature columns into a single vector column using VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"] assemblerInputs = map(lambda c: c + "classVec", categoricalColumns) + numericCols assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features") stages += [assembler]Put data through all of the feature transformations using the stages in the pipeline
pipeline = Pipeline(stages=stages) pipelineModel = pipeline.fit(dataset) dataset = pipelineModel.transform(dataset)Keep relevant columns
selectedcols = ["label", "features"] + cols dataset = dataset.select(selectedcols) display(dataset)Split data into training and test sets
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100) print trainingData.count() print testData.count()
 Set seed for reproducibility
Train logistic regression model and then make predictions
from pyspark.ml.classification import LogisticRegression lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10) lrModel = lr.fit(trainingData) predictions = lrModel.transform(testData) selected = predictions.select("prediction", "age", "occupation") display(selected)
 Output
prediction age occupation    ... 0 20 Profspecialty 1 35 Profspecialty ...
 So, a prediction of 0 means that the person earns <=50K. While a prediction of 1 means that the person earns >50K
Summary
The advantage of jumping in is that you learn by solving a problem. Also, you don't spend weeks or months learning material like that listed in the references below before you can do anything. The disadvantage is that steps are skipped and full understanding of even the provided steps won't be present. It is realized that jumping in is not for everybody. For some people, standard tutorials are the way to begin.
References
 Book: Data Smart by John W. Foreman
 johnforeman.com
 O'Reilly
 Chapter 6: The Granddaddy of Supervised Artificial Intelligence  Regression
 Cartoon from New Yorker Magazine: Does your car have any ide why my car pulled it over?
 databricks
 sklearn.linear_model.LogisticRegression
 Spark.Apache.Org
 Spark Python API Docs [pyspark]
 pyspark.ml package [ML: Machine Learning]
 pyspark.sql module
 SparkContext
 pyspark.ml package [ML: Machine Learning]
 MLlib (Machine Learning)
 Spark Python API Docs [pyspark]
 StackOverFlow.Com
 Wikipedia
Thursday, September 29, 2016
Can a Purely State Based Database Schema Migration Be Trusted? by Robert Lucente
A really good article on the tradeoffs between state based vs migration based schema transformation is titled Critiquing two different approaches to delivering databases: Migrations vs state by Alex Yates on 6/18/2015. The blog Database Schema Migration by S. Lott on 9/27/2016 ends with "It's all procedural migration. I'm not [sure] declarative ("state") tools can be trusted beyond discerning the changes and suggesting a possible migration."
Let me start by examining the statement of the problem: state based vs migration based schema transformation. The problem statement implies that the solution is an either or type of thing. Why not both?
As opposed to speaking in generalities, let me pick a specific problem which is often encountered in real systems which demonstrates the core issue. Also, let me pick a particular tool set to execute on this particular problem.
The specific problem involves
 Adding some domain table and its associated data.
 Adding a not null / mandatory column to some table that already has data.
 Creating a foreign key between the domain table and the new mandatory column.
 SomeTable exists with data.
 A new domain table (SomeDomain) gets created.
 The new domain table (SomeDomain) gets populated with data.
 A new nullable / not mandatory column SomeDomainUuid is added to SomeTable.
 Column SomeDomainUuid in SomeTable gets populated with data.
 Column SomeDomainUuid in SomeTable is made not null / mandatory.
 A foreign key is created between the two tables.
 The new domain table (SomeDomain) gets populated with data.
 Column SomeDomainUuid in SomeTable gets populated with data.
INSERT INTO dbo.SomeDomain VALUES (newid(), 'Fred'), (newid(), 'Barney');The second step is to write a script (Update_dbo_Deal_AdvertiserTypeUuid.sql) to populate SomeDomainUuid column in SomeTable.
UPDATE [dbo].[SomeTabe] SET [SomeDomainUuid] = (SELECT SomeDomainUuid FROM [dbo].[SomeDomain] WHERE Name = 'Fred');The last preparation step is to write a script (Script.PostDeployment.sql) to run the above 2 scripts after the state based change has happened.
:r ".\Populate_dbo_SomeDomain.sql" go :r ".\Update_dbo_Deal_AdvertiserTypeUuid.sql" goNow that the desired end state has been specified as well as the data manipulation scripts written, it is time to modify a database. The Microsoft terminology for this is "publishing the database project". There will be an issue because the state changes will be made and then the Script.PostDeployment.sql script will be run. In between the state changes and the script being run, there will need to be data in the SomeDomainUuid column in the table SomeTable. This issue is addressed by using the GenerateSmartDefaults option when publishing the database project.
Let's summarize what this combination of state based and migration based schema transformation has allowed us to do. We were able to take 7 steps and reduce it down to 2 steps. These 2 steps couldn't be automated anyways because they involved data. These are the pros. The con is that have to be familiar with the framework and select the GenerateSmartDefaults option out of the 60 plus available options.
In conclusion, a purely state based approach can't work because of data. There is no way for the software to know how to do data migrations. Only humans know the correct way to do data migrations. In our example, there is no way for the software to know whether or not the new column SomeDomainUuid is to be initially populated with "Fred" or "Barney". This is a long winded and nice way of saying that a purely state based database schema migration can't be trusted. However, the combination of state based and migration based can truly improve productivity.
Subscribe to:
Posts (Atom)