Tuesday, January 16, 2018

How do I get started in data science?

A recurring question that I have run into is: How do I get started in data science?

Unfortunately, "data science" has devolved into a marketing phrase. So, I will provide a definition which will be applicable for this blog post.

Definition: Data science is the application of statistics and mathematical optimization (operations research) to real-world data in order to make probabilistic predictions and/or minimize or maximize some business attribute.

Notice how the definition combines the following items:
  1. Statistics
  2. Mathematical Optimization / Operations Research
  3. Real world data
  4. Making probabilistic predictions
  5. Minimize / maximize some business attribute

Also, notice how the above list emphasizes the fact that knowledge of mathematics is required. The good news is that people can elect the depth to which they delve into the associated mathematics.

One option that won't work is to take the position that mathematics is irrelevant and that all you need to do is learn the API calls. I have personally seen several people who adopted this position fail. The classic example is someone who fits a linear regression to data with a lot of outliers and then wonders why things aren't working. Ordinary least squares minimizes the sum of squared residuals, so a handful of extreme points can dominate the fit, and that is hard to diagnose without knowing the math behind the API call.
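To make the outlier problem concrete, here is a minimal sketch using NumPy. The data is synthetic and made up for illustration (a true slope of 2 and intercept of 1, with five corrupted points), and the helper name fit_line is hypothetical, so treat it as a demonstration of the effect rather than a recipe.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    # Design matrix: one column for x, one column of ones for the intercept.
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus mild noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

clean_slope, clean_intercept = fit_line(x, y)

# Corrupt the last five observations to simulate outliers.
y_outliers = y.copy()
y_outliers[-5:] -= 40.0

bad_slope, bad_intercept = fit_line(x, y_outliers)

print(f"clean data:    slope={clean_slope:.2f}, intercept={clean_intercept:.2f}")
print(f"with outliers: slope={bad_slope:.2f}, intercept={bad_intercept:.2f}")
```

Only 5 of the 50 points were changed, yet the fitted slope moves far away from 2. Someone who only knows the API call sees a model that "isn't working"; someone who knows that least squares penalizes squared errors knows to look for outliers first.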

The above material provides the context for "data science." Next, let's talk about how to execute on getting started in data science.

If your mathematical background is weak, I recommend the following books:
  1. Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith
  2. Cartoon Guide to Calculus by Larry Gonick
  3. Linear Algebra For Dummies by Mary Jane Sterling
  4. Manga Guide to Linear Algebra by Shin Takahashi, ...

If the above books are of interest to you, a full list can be found in my blog post titled "Gentle Introduction to Various Math Stuff."
Now, we can finally get to the first thing that a data scientist must know: linear regression. To study linear regression systematically, I recommend the book "Regression Analysis with Python" by Luca Massaron and Alberto Boschetti [PacktPub.Com, Code Download / Errata, SafariBooksOnline.Com, Amazon, O'Reilly]. Personally, I think that it is a good book because it uses Python to process data sets and build linear regression models. It is not just a bunch of math (NJBM). Below is a paragraph taken from the book:

  "We provide some practical examples in Python throughout the book and do not leave explanations about the various regression models at a purely theoretical level. Instead, we will explore together some example datasets, and systematically illustrate to you the commands necessary to achieve a working regression model, interpret its structure, and deploy a predicting application."
