causes of error

(More notes from my Udacity Machine Learning Nanodegree course)

Once you have measured model performance, it is important to understand the reasons why models exhibit errors in the first place.

In model prediction, there are two main sources of error that a model can suffer from: bias and variance.

Bias

Bias occurs when a model has enough data but is not complex enough to capture the underlying relationships. As a result, the model consistently and systematically misrepresents the data, leading to low accuracy in prediction. This is known as underfitting. Simply put, bias occurs when we have an inadequate model.

Variance

When training a model, we typically use a limited number of samples from a larger population. If we repeatedly train a model with randomly selected subsets of data, we would expect its predictions to be different based on the specific examples given to it. Here variance is a measure of how much the predictions vary for any given test sample.

Some variance is normal, but too much variance indicates that the model is unable to generalize its predictions to the larger population. High sensitivity to the training set is also known as overfitting, and generally occurs when either the model is too complex or when we do not have enough data to support it.

We can typically reduce the variability of a model’s predictions and increase precision by training on more data. If more data is unavailable, we can also control variance by limiting our model’s complexity.
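This variance effect is easy to see directly. Below is a minimal sketch of my own (not from the course notes), assuming scikit-learn and NumPy are available and using made-up synthetic data: it repeatedly fits a decision tree on random subsets of the data and measures how much its predictions at fixed test points change from subset to subset.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(1000, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=1000)
    X_test = np.linspace(-3, 3, 50).reshape(-1, 1)

    def prediction_spread(max_depth, n_samples, n_repeats=50):
        # Train on many random subsets and return the average standard
        # deviation of the predictions at the fixed test points.
        preds = []
        for _ in range(n_repeats):
            idx = rng.choice(len(X), size=n_samples, replace=False)
            model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
            model.fit(X[idx], y[idx])
            preds.append(model.predict(X_test))
        return np.std(preds, axis=0).mean()

    # A deep (complex) tree varies far more between subsets than a shallow one,
    # and both become more stable as the training subset grows.
    for depth in (2, None):
        for n in (50, 500):
            print(f"max_depth={depth}, n={n}: prediction spread = "
                  f"{prediction_spread(depth, n):.3f}")

On a typical run, the deep tree trained on the small subset shows the largest spread, which is exactly the over-complex, data-starved situation described above.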

Improving the Validity of a Model

There is a trade-off between the simplicity and complexity of a model given a fixed set of data. If our model is too simple, it cannot capture the underlying relationships and misrepresents the data. However, if our model is too complex, we need more data to learn the underlying relationship; otherwise it is very common for the model to infer relationships that do not actually exist in the data.

The key is to find the sweet spot that minimizes bias and variance by finding the right level of model complexity. Of course, with more data any model can improve, and the optimal model may change as more data becomes available.
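As a rough illustration of that sweet spot, here is another small sketch of my own (not from the course, with made-up synthetic data and scikit-learn assumed) that fits polynomial models of increasing degree and compares training error with validation error; the validation error falls and then rises again once the model starts to overfit.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(1)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

    # Low degrees underfit (high bias); very high degrees overfit (high variance).
    # The sweet spot is where the validation error is lowest.
    for degree in (1, 3, 10, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(X_train))
        val_err = mean_squared_error(y_val, model.predict(X_val))
        print(f"degree={degree:2d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")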

precision and recall

(Notes from my Udacity Machine Learning Nanodegree course)

Recall:

True Positive / (True Positive + False Negative)

Out of all the items that are truly positive, how many were correctly classified as positive. Or simply, how many positive items were ‘recalled’ from the dataset.

Precision:

True Positive / (True Positive + False Positive)

Out of all the items labeled as positive, how many truly belong to the positive class.

F1 Score

Now that you’ve seen precision and recall, another metric you might consider using is the F1 score. The F1 score combines precision and recall relative to a specific positive class.

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0:

F1 = 2 * (precision * recall) / (precision + recall)
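For a concrete check of these formulas, here is a small sketch (the labels are made up, and scikit-learn is assumed) that computes precision, recall and F1 for a binary positive class and confirms the F1 formula above.

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]   # actual labels (1 = positive class)
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]   # model's predicted labels

    precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
    recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
    f1 = f1_score(y_true, y_pred)

    print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
    # The formula above gives the same value as f1_score:
    print("2 * (p * r) / (p + r) =", 2 * (precision * recall) / (precision + recall))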

 

land your dream job

If you enroll in the Udacity Nanodegree Plus program, Udacity promise that “you’ll get hired within 6 months of graduating, or we’ll refund 100% of your tuition”. That is a bold promise, and of course there are terms and conditions attached. But this is where it gets interesting to me.

The terms and conditions are so comprehensive that it seems to me that they should be required actions for anyone who is looking for a job!


This is what Udacity require you to do when looking for a job:

Graduate is active in their job search and demonstrates this activity by submitting a minimum of 5 [job] applications per week.

Graduate tracks all applications submitted and clearly organizes status of each such that next steps are clear.

  • Graduate is able to furnish details and communications relating to their job search
  • The application materials must be tailored to the role and company

Graduate continues to build their portfolio of work by regularly working on personal development projects.

  • [for developer roles] complete a minimum average of 6 GitHub commits per week resulting in at least 1 published application or website
  • [for analyst roles] complete a minimum average of 1 project or competition every 2 months resulting in a publicly published report/result

Graduate establishes meaningful connections with an average of 3 relevant industry professionals each week via email, LinkedIn or Twitter resulting in conversation about an open role.

Graduate schedules a 1:1 appointment with [a careers guidance person] if after 2 months they’re not having success in finding work.

You understand that your next job is one step closer to your dream job. To achieve your goal you should ensure that:

  • Applied positions meet your skill level
  • You don’t reject any offers that match your ability and expectations

I think that everyone looking for a job should be doing these things – not just those in the Udacity Nanodegree program!

mendeley software update

I use Mendeley software a lot, and today they released an update to the desktop version of their software. The popup box described what changed, and it was very clear and informative. My initial response was that sometimes a simple and clear message is better than a fancy one or one with no information at all. iPhone app updates often just say “bug fixes”, which is mildly irritating to say the least.

Good design needs to have a clear and simple message that leads to the required action (in this case to update the software).

Here is what the update said:

Mendeley - New version available

Mendeley Desktop 1.16.1
crash fixes
  • While displaying the user profile in notes.
  • When modifying the list of document authors.
general bug fixes
  • Fixed excessively long startup times for users with large libraries.
  • Fixed an issue where deleting a note would delete a wrong one.
  • Fixed a sync error caused by Mendeley Desktop erroneously submitting an empty websites field entry.
  • Fixed an issue where citation data (e.g. status of “Suppress authors” checkbox) was prevented from being saved.
visual improvements
  • Fixed an issue where the user name would overlap the date in notes.
  • (Linux) Fixed an issue where editing an author would not show the right tooltips.
  • Fixed an issue where filtering documents would result in an “Empty search query…” notification being displayed.
  • Fixed a variety of visual issues related to displaying the startup splash screen.
feedback and support
  • If you have suggestions for improvements please let us know by visiting the feedback forum.
  • If you encounter any problems using Mendeley or have questions to ask please contact our support team.
  • For news and updates about Mendeley see our blog.

Clear, concise and to the point!

version naming convention

Rather than name development releases with a number, I’ve always liked the idea of giving releases names. One option is to use a convention where the release is named by taking the major version number and using the corresponding word from the phonetic number alphabet.

So I initially wanted to use:

0 : Zero, 1 : Won, 2 : Too, 3 : Tree, etc.

But I realized today that version 3 (i.e. “Tree”) sounds too much like “Free” and it didn’t seem to be good practice to call a version “free” if it wasn’t actually free.

So going forward my development release version naming convention will be as follows:

0 : Ze-ro

1 : Wun

2 : Too

3 : Thuh-ree

4 : Fo-wer

5 : Fi-yiv

6 : Six

7 : SEVen

8 : Ate

9 : NINer

So, for example, if the project is called Photopia, then the release for version 3 will be named:

Photopia Thuh-ree

I’ll see how this goes, but it does seem better to me than using a number, e.g. “Photopia v3”.
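If I wanted to automate this, a tiny helper along these lines would do it (a hypothetical sketch of my own; the function name is just for illustration, and Photopia is the example project from above):

    # The list of spellings above, indexed by major version digit.
    PHONETIC = ["Ze-ro", "Wun", "Too", "Thuh-ree", "Fo-wer",
                "Fi-yiv", "Six", "SEVen", "Ate", "NINer"]

    def release_name(project, major_version):
        # Combine the project name with the phonetic spelling of the major version.
        return f"{project} {PHONETIC[major_version]}"

    print(release_name("Photopia", 3))   # -> "Photopia Thuh-ree"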