
The current article seems to cover the various choices for constructing 'contexts' (which include skip-gram and CBOW) pretty well.

Note that negative-sampling and hierarchical-softmax are actually alternative ways of interpreting the hidden-layer activation to arrive at error-values to back-propagate. Either one can be used completely independently of the other.
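
For what it's worth, here's a minimal sketch of using each one on its own, assuming gensim 4.x (the `hs` and `negative` parameters toggle the two codepaths; the toy `sentences` list is just for illustration):

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    # Skip-gram with negative sampling only: hs=0 leaves hierarchical softmax off.
    m_ns = Word2Vec(sentences, vector_size=50, sg=1, hs=0, negative=5, min_count=1)

    # Skip-gram with hierarchical softmax only: negative=0 leaves negative sampling off.
    m_hs = Word2Vec(sentences, vector_size=50, sg=1, hs=1, negative=0, min_count=1)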

If you enable them both, you're training two independent sets of output weights (one per codepath), which in an interleaved fashion update the same shared input-vectors. (Essentially, each training example is used jointly: first the hierarchical-softmax codepath nudges the vectors, then the separate negative-sampling codepath nudges them again.) So the combination doesn't reduce the complexity – it's additive to model state size and training time – and I think most projects with large amounts of data just use one or the other (usually just negative-sampling).
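
To make the "two sets of output weights, one shared set of input vectors" point concrete, here's a rough numpy sketch of the skip-gram updates – not the actual word2vec.c code; the matrices, learning rate, and Huffman-path arguments are simplified stand-ins:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def ns_update(center, context, negatives, W_in, W_out_ns, lr):
        """Negative-sampling codepath: nudges W_in[center] plus its own output weights."""
        v = W_in[center]
        grad_v = np.zeros_like(v)
        for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = W_out_ns[word]
            g = lr * (label - sigmoid(v @ u))   # error for this (center, word) pair
            grad_v += g * u
            W_out_ns[word] += g * v
        W_in[center] += grad_v

    def hs_update(center, path_nodes, path_codes, W_in, W_out_hs, lr):
        """Hierarchical-softmax codepath: walks the context word's Huffman path,
        nudging W_in[center] plus its own, separate output weights."""
        v = W_in[center]
        grad_v = np.zeros_like(v)
        for node, code in zip(path_nodes, path_codes):
            u = W_out_hs[node]
            g = lr * (1.0 - code - sigmoid(v @ u))   # word2vec.c's label convention
            grad_v += g * u
            W_out_hs[node] += g * v
        W_in[center] += grad_v

    # Enabling both just means each (center, context) pair runs both updates,
    # against separate output matrices but the same shared W_in:
    #   hs_update(center, nodes, codes, W_in, W_out_hs, lr)
    #   ns_update(center, context, negs, W_in, W_out_ns, lr)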



Ah, thank you for pointing that out. I guess I got confused by all the papers I've read on the topic recently. It's hard to get into.

However, I still wouldn't say that the comment-linked article on negative sampling explains how word2vec works well enough – or maybe I just didn't understand it.

Either way, if anyone wants to understand word2vec, I recommend looking at this article as well: http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf



