I created a pseudo-SVD++(3) model (mf_time.cpp in my framework) - one without the |N(u)|^(-.5)*Yi feedback term, and also without the alpha_u*dev_u_hat term (which I found made the RMSE worse). However, the version I initially wrote had a bug in the training and use of the alpha_u_k*dev_u_hat term (the term added to the user features - not to be confused with the alpha_u*dev_u_hat term).
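(For reference: dev_u_hat stands for a signed, powered deviation of the rating's date from the user's mean rating date - the usual timeSVD++ dev_u(t) - and alpha_u_k is meant to give each user a per-feature time slope on top of the static user factor. A rough sketch of that, with BETA just an illustrative constant:)

#include <cmath>

// Rough sketch only: the signed, powered deviation of the rating's date from
// the user's mean rating date.  BETA is just an illustrative constant
// (Koren uses roughly 0.4).
double dev_u(double rating_day, double user_mean_day, double BETA) {
    double d = rating_day - user_mean_day;
    return (d >= 0.0 ? 1.0 : -1.0) * pow(fabs(d), BETA);
}

// The alpha_u_k term adds a per-user, per-feature time slope to the user
// factor, so the per-feature contribution to the prediction is meant to be
//   movie_features[m][f] * (user_features[u][f] + alpha_u_k[u][f] * dev_u_hat)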
Surprisingly, and confusingly, this bug actually improved the probe RMSE significantly over the correct code.
The code within the training loop was:
movie_features[movieid-1][f] += LRATE2 * (err*(uf_old+alpha_uk_old*dev_u_hat) - LAMBDA2*mf_old);
alpha_u_k[i][f] = LRATE2 * (err*dev_u_hat*mf_old - LAMBDA2*alpha_uk_old);
And within the prediction function:
sum += movie_features[movieid-1][f] * (user_features[u][f] + alpha_u_k[u][f]);
In the training code, a "+" was missing - the second statement should have read "alpha_u_k[i][f] +=" rather than just "alpha_u_k[i][f] =". And in the prediction function, the "*dev_u_hat" factor was missing; it should have been:
sum += movie_features[movieid-1][f] * (user_features[u][f] + alpha_u_k[u][f]*dev_u_hat);
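Putting both fixes together, the corrected lines would read roughly as follows (a sketch; only the lines involving alpha_u_k are shown, and the user_features update itself is unchanged and omitted):

// Corrected training step (per feature f, for a rating of movie movieid by user i):
movie_features[movieid-1][f] += LRATE2 * (err*(uf_old + alpha_uk_old*dev_u_hat) - LAMBDA2*mf_old);
alpha_u_k[i][f] += LRATE2 * (err*dev_u_hat*mf_old - LAMBDA2*alpha_uk_old);

// Corrected prediction:
sum += movie_features[movieid-1][f] * (user_features[u][f] + alpha_u_k[u][f]*dev_u_hat);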
Yet despite these errors, I achieved a probe RMSE of 0.901871 with 100 features. When I actually corrected the code, the RMSE surprisingly increased to 0.907681 (also with 100 features).
I also tried a model where the training step matched the incorrect prediction function. That is, I changed the code to:
movie_features[movieid-1][f] += LRATE2 * (err*(uf_old+alpha_uk_old) - LAMBDA2*mf_old);
alpha_u_k[i][f] += LRATE2 * (err*mf_old - LAMBDA2*alpha_uk_old);
...
sum += movie_features[movieid-1][f] * (user_features[u][f] + alpha_u_k[u][f]);
I tested with 10 features. The RMSEs I got:
Correct model: 0.921176
Bad model, original (mismatched) training: 0.913509
Bad model, training matched to it: 0.917931
So the bad model with a training step that mis-matched its prediction formula actually outperformed the version where the training step was made consistent with the incorrect prediction (dev_u_hat absent from both).
I find this very befuddling, as I can't think of any reason why the wrong formula, paired with a training step that doesn't even match that formula, should outperform not only the correct model but also the variant where the training step was made consistent with it.
The incorrect training step makes the contributions of the term seem pretty random to me - yet if they really were random, they shouldn't improve the RMSE. If there is some pattern there, I can't see it.
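One mechanical observation, for what it's worth (it only describes what the buggy line computes, not why it helps): with "=" instead of "+=", alpha_u_k[u][f] is overwritten on every rating, so after a pass over the data it holds only the scaled gradient from the last rating of user u that the loop touched, rather than an accumulated parameter. A toy snippet with made-up numbers, just to show the difference in what the two versions leave behind:

#include <cstdio>

int main() {
    const double LRATE2 = 0.01, LAMBDA2 = 0.02;
    // made-up per-rating quantities (err, dev_u_hat, mf_old) for one user and one feature
    double err[] = { 0.8, -0.3, 0.5 };
    double dev[] = { -1.2, 0.4, 2.1 };
    double mf[]  = { 0.05, 0.07, 0.06 };

    double alpha_assign = 0.0;   // the buggy "=" version
    double alpha_accum  = 0.0;   // the intended "+=" version
    for (int r = 0; r < 3; ++r) {
        alpha_assign  = LRATE2 * (err[r]*dev[r]*mf[r] - LAMBDA2*alpha_assign);
        alpha_accum  += LRATE2 * (err[r]*dev[r]*mf[r] - LAMBDA2*alpha_accum);
    }
    // alpha_assign ends up reflecting almost only the last rating seen;
    // alpha_accum is a (regularized) sum over all of the user's ratings.
    printf("'=' leaves %.6f, '+=' leaves %.6f\n", alpha_assign, alpha_accum);
    return 0;
}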