causes of error

(More notes from my Udacity Machine Learning Nanodegree course)

Once you have measured model performance, it is important to understand the reasons why models exhibit errors in the first place.

In model prediction there are two main sources of errors that a model can suffer from.

Bias

Bias occurs when a model has enough data but is not complex enough to capture the underlying relationships. As a result, the model consistently and systematically misrepresents the data, leading to low accuracy in prediction. This is known as underfitting. Simply put, bias occurs when we have an inadequate model.

Variance

When training a model, we typically use a limited number of samples from a larger population. If we repeatedly train a model with randomly selected subsets of data, we would expect its predictions to be different based on the specific examples given to it. Here variance is a measure of how much the predictions vary for any given test sample.

Some variance is normal, but too much variance indicates that the model is unable to generalize its predictions to the larger population. High sensitivity to the training set is also known as overfitting, and generally occurs when either the model is too complex or we do not have enough data to support it.

We can typically reduce the variability of a model’s predictions and increase precision by training on more data. If more data is unavailable, we can also control variance by limiting our model’s complexity.
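To make the effect of more data concrete, here is a minimal sketch that measures variance directly: it repeatedly draws a fresh training set, fits a straight line, and records the prediction at one fixed test point. The population (y = 2x + 1 plus Gaussian noise), the test point and the sample sizes are all assumptions made up for this illustration, not something from the course.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def prediction_variance(n_train, x0=0.5, n_repeats=500, seed=0):
    """Variance of the prediction at x0 across models trained on fresh random samples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        xs = [rng.random() for _ in range(n_train)]
        # Hypothetical population: y = 2x + 1 plus Gaussian noise.
        ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
        a, b = fit_line(xs, ys)
        preds.append(a * x0 + b)
    return statistics.pvariance(preds)

var_small = prediction_variance(n_train=10)
var_large = prediction_variance(n_train=100)
print(f"variance with 10 training samples:  {var_small:.4f}")
print(f"variance with 100 training samples: {var_large:.4f}")
```

In this setup the predictions from the models trained on 100 samples should be spread much more tightly around their average than those trained on 10 samples.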

Improving the Validity of a Model

For a fixed set of data, there is a trade-off between the simplicity and the complexity of a model. If the model is too simple, it cannot learn the underlying relationships and misrepresents the data. However, if the model is too complex, it needs more data to learn the underlying relationship. Without that data, it is very common for the model to infer relationships that might not actually exist.

The key is to find the level of model complexity that minimizes the combined error from bias and variance. Of course, with more data any model can improve, and a different model may then become optimal.
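The trade-off can be sketched with a k-nearest-neighbours regressor, where k acts as the complexity knob: k = 1 is the most flexible model, and k equal to the training set size simply predicts the mean of the training targets. This is only an illustration under made-up assumptions (a sine curve as the "true" relationship, Gaussian noise, 30 training points, one test point); the function names are hypothetical.

```python
import math
import random
import statistics

def true_f(x):
    # Hypothetical "underlying relationship" used only for this illustration.
    return math.sin(2 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return statistics.fmean(y for _, y in nearest)

def bias_variance(k, n_train=30, x0=0.25, n_repeats=500, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, 0.3) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)

# k=1: most complex, k=30: simplest (predicts the training mean).
results = {k: bias_variance(k) for k in (1, 5, 30)}
for k, (bias_sq, var) in results.items():
    print(f"k={k:2d}  bias^2={bias_sq:.3f}  variance={var:.3f}  sum={bias_sq + var:.3f}")
```

With these assumptions, k = 1 should show low bias but high variance, k = 30 the reverse, and the intermediate k the smallest sum of the two, which is the sweet spot described above.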

precision and recall

(Notes from my Udacity Machine Learning Nanodegree course)

Recall:

True Positive / (True Positive + False Negative)

Out of all the items that are truly positive, what fraction were correctly classified as positive? Or simply, how many of the positive items were ‘recalled’ from the dataset?

Precision:

True Positive / (True Positive + False Positive)

Out of all the items labeled as positive, what fraction truly belong to the positive class?
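As a small worked example of both formulas, the sketch below counts true positives, false positives and false negatives from two label lists. The labels are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not part of the definitions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: report 0.0 when there is nothing to divide by.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 4 truly positive items, 3 items predicted positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints: precision = 0.67, recall = 0.50
```

Here 2 of the 3 items labeled positive are correct (precision 2/3), while only 2 of the 4 truly positive items were found (recall 1/2).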

F1 Score

Now that you’ve seen precision and recall, another metric you might consider using is the F1 score. F1 score combines precision and recall relative to a specific positive class.

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0:

F1 = 2 * (precision * recall) / (precision + recall)
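The formula translates directly into code. The sketch below also compares F1 with a plain arithmetic average, to show that F1 stays low when either precision or recall is low. The input values are invented, and returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    return 2 * (precision * recall) / (precision + recall)

# A classifier that is always right when it says "positive"
# but finds only 10% of the positive items.
p, r = 1.0, 0.1
print(f"F1 = {f1_score(p, r):.3f}")            # prints: F1 = 0.182
print(f"arithmetic mean = {(p + r) / 2:.3f}")  # prints: arithmetic mean = 0.550
```

When precision and recall are equal, F1 equals that shared value, for example f1_score(0.5, 0.5) gives 0.5.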

If you enroll in the Udacity Nanodegree Plus program, Udacity promise that “you’ll get hired within 6 months of graduating, or we’ll refund 100% of your tuition”. That is a bold promise, but of course there are terms and conditions attached to it. And this is where it gets interesting to me.

The terms and conditions are so comprehensive that it seems to me that they should be required actions for anyone who is looking for a job!

This is what Udacity require you to do when looking for a job:

Graduate is active in their job search and demonstrates this activity by submitting a minimum of 5 [job] applications per week.

Graduate tracks all applications submitted and clearly organizes status of each such that next steps are clear.

• Graduate is able to furnish details and communications relating to their job search
• The application materials must be tailored to the role and company

Graduate continues to build their portfolio of work by regularly working on personal development projects.

• [for developer roles] complete a minimum average of 6 GitHub commits per week resulting in at least 1 published application or website
• [for analyst roles] complete a minimum average of 1 project or competition every 2 months resulting in a publicly published report/result

Graduate establishes meaningful connections with an average of 3 relevant industry professionals each week via email, LinkedIn or Twitter resulting in conversation about an open role.

Graduate schedules a 1:1 appointment with [a careers guidance person] if after 2 months they’re not having success in finding work.

You understand that your next job is one step closer to your dream job. To achieve your goal you should ensure that:

• Applied positions meet your skill level
• You don’t reject any offers that match your ability and expectations

I think that everyone looking for a job should be doing these things – not just those in the Udacity Nanodegree program!