This article seems to cover the various choices for constructing 'contexts' (which include skip-gram and CBOW) pretty well.
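For concreteness, here's a minimal sketch of that distinction (my own toy illustration, not code from the article; the function names are made up): skip-gram turns each (center word, one neighbor) pair into its own training example, while CBOW groups the whole window of neighbors into a single example that predicts the center word.

    def skipgram_pairs(tokens, window=2):
        # Skip-gram: each (center word, one neighboring word) is its own training example.
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, tokens[j]

    def cbow_examples(tokens, window=2):
        # CBOW: the whole window of neighbors (averaged later) predicts the center word.
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = [tokens[j] for j in range(lo, hi) if j != i]
            if context:
                yield context, center

    # e.g. list(skipgram_pairs("the quick brown fox".split()))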
Note that negative-sampling and hierarchical-softmax are actually alternative choices to interpret the hidden-layer and to arrive at error-values to back-propagate. Each can be used completely independently.
If you enable them both, you're training two independent output layers, which then update the same shared input-vectors in an interleaved fashion. (Essentially, each example is trained jointly: the hierarchical-softmax codepath nudges the vectors, then the separate negative-sampling codepath nudges them again.) So combining them doesn't reduce the complexity – it's additive to model state size and training time – and I think most projects with large amounts of data just use one or the other (usually just negative-sampling).
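To make that concrete, here is a toy numpy sketch (my own illustration, not word2vec's actual source; the Huffman path for hierarchical softmax is assumed to be precomputed elsewhere) of one skip-gram training step with both codepaths enabled. Each codepath keeps its own output weight matrix, so enabling both adds state and work, while both error signals flow back into the same shared input vector.

    import numpy as np

    rng = np.random.default_rng(0)

    V, D = 1000, 50                        # toy vocab size and embedding dimension
    W_in = rng.normal(0, 0.01, (V, D))     # shared input vectors (the word embeddings)
    W_ns = np.zeros((V, D))                # output weights for the negative-sampling codepath
    W_hs = np.zeros((V - 1, D))            # output weights for the hierarchical-softmax codepath (inner tree nodes)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_pair(center, context, hs_path, lr=0.025, k=5):
        """One skip-gram step with BOTH codepaths enabled.
        hs_path: list of (inner_node_index, binary_code) for `context`,
                 assumed to come from a precomputed Huffman tree."""
        h = W_in[center]                   # projection = the center word's input vector
        grad_h = np.zeros(D)

        # --- negative-sampling codepath: true context gets label 1, k noise words get label 0 ---
        targets = [(context, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(k)]
        for w, label in targets:
            err = sigmoid(W_ns[w] @ h) - label
            grad_h += err * W_ns[w]
            W_ns[w] -= lr * err * h

        # --- hierarchical-softmax codepath: one logistic decision per inner node on the path ---
        for node, code in hs_path:
            label = 1.0 - code             # label convention used in the original C implementation
            err = sigmoid(W_hs[node] @ h) - label
            grad_h += err * W_hs[node]
            W_hs[node] -= lr * err * h

        # both codepaths nudge the SAME shared input vector
        W_in[center] -= lr * grad_h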
Ah, thank you for pointing that out. I guess I got confused by all the papers I've read on the topic recently. It's hard to get into.
However, I still wouldn't agree that the comment-linked article explaining negative sampling really explains how word2vec works well enough, or maybe I just didn't understand it.