Is Cosine-Similarity of Embeddings Really About Similarity? (arxiv.org)
210 points by Jimmc414 7 months ago | 115 comments



Distance measures are only as good as the pseudo-Riemannian metric they (implicitly) implement. If the manifold hypothesis is believed, then these metrics should be local because the manifold curvature is a local property. You would be mistaken to use an ordinary dot product to compare straight lines on a map of the globe, because those lines aren’t actually straight - they do not account for the rich information in the curvature tensor. Using the wrong inner product is akin to the flat-Earth fallacy.


I'm not sure I understand the underlying maths well enough to opine on your point, but I can say for certain that no embedding space I've ever seen used for any kind of ML is uniform in the sense that a Euclidean distance around one point means the same thing as the same Euclidean distance around another point. I'm not even sure it would be possible to make an embedding that was uniform in that way, because it would mean we had a universal measure of similarity between concepts (which can obviously be enormously different).

The other potential issue is that, for all the embeddings I have seen, the resulting space once you have embedded some documents is sort of "clumpy" and very sparse overall. So you have very large areas with basically nothing at all, I think because semantically there are many dimensions which only make sense for subsets of concepts, so you end up with big voids where the embedding space is effectively unreachable and distance doesn't have any meaning at all.

In spite of all that there are a few similarity measures which work well enough to be useful for many practical purposes and cosine similarity is one of them. I don't think anyone thinks it's perfect.


That sounds like a good point. Any research you know of about extracting the curvature tensor from a transformer-like model?


This is exactly right and is one (among many) reasons that reliance on cosine similarities in my field (computational social science) is so problematic. The curvature of the manifold must be accounted for in measuring distances. Other measures based on optimal transport are more theoretically sound, but are computationally expensive.


We implicitly train for minimizing the distance in that style of metric - by using functions continuous and differentiable on classic manifolds (where continuity and differentiability are defined using the classic local maps into Euclidean space). I think if we were training using functions continuous and differentiable in, say, a p-adic metric space (which looks extremely jagged/fractal/non-continuous when embedded into Euclidean space), then we'd have something like a p-adic version of cosine (or some other L-something metric) for similarity.


> In the following, we show that [taking cosine similarity between two features in a learned embedding] can lead to arbitrary results, and they may not even be unique.

Was uniqueness ever a guarantee? It's a distance metric. It's reasonable to assume that two features can be equidistant to the ideal solution to a linear system of equations. Maybe I'm missing something.


It's not even a distance metric, since it doesn't obey the triangle inequality (hence the not-technically-meaningful name "similarity", like "collection" as opposed to "set").


1-cos_sim(x,y) is a valid distance metric for L2 normalized vectors, though.


It is a distance metric on a projection.


Not obeying the triangle inequality simply means that it’s fast to compute.


One does not normally consider trigonometric functions fast to compute. Would you mind elaborating?


Consider the geometric definition of the dot product of two vectors,

a · b = |a||b| cos(θ).

This means you get the cosine of the angle between the two vectors just by dividing the dot product by the product of their magnitudes. You don't actually take cos of the angle to get cosine similarity (for one thing, because you don't know the angle); you just use "cos θ" (calculated as above) as a proxy for how narrow the angle is, and therefore how close the two embeddings are.

The paper in TFA shows that if you construct an embedding space such that the angle isn't meaningfully measuring similarity, then a low angle doesn't mean two things are very similar. I have a similar paper measuring bears and woods, but I haven't got around to typesetting it for publication yet.


You don't need to call cosine to compute cosine similarity - most of the time, a normalized dot product is sufficient.

Needless to say, dot products are directly supported in hardware via the FMA unit.
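A minimal numpy sketch of that point (my own illustration, nothing from TFA): cosine similarity computed purely as a normalized dot product, with no trigonometric call anywhere.

    import numpy as np

    def cosine_similarity(a, b):
        # dot product divided by the product of magnitudes, i.e. |a||b|cos(theta)
        # rearranged for cos(theta); no call to cos/arccos is needed
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 3.0, 4.0])
    print(cosine_similarity(a, b))  # ~0.9926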


Yes, this would be fast assuming the vectors are pre-normalized, obviating the need to repeatedly calculate two square roots.


Even if the vectors are NOT normalized, you don't always need to normalize them.

Add in nd indices and the costs tend to be very small.


I sure hope no one claimed that. You're doing potentially huge dimensionality reduction; uniqueness would be like saying you cannot have MD5 collisions.



If you have 1000 points and want to preserve their squared distances to within an error of 1%, the Johnson-Lindenstrauss construction suggests an embedding dimension of 8(ln 1000)/(0.01²) > 552620. If your points start out in a lower-dimensional space than that to begin with, it's obviously pointless.

The crossover point where the number of dimensions falls below the number of points is at 1113868. If you're willing to tolerate 10% error, it's at 7094.
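If anyone wants to reproduce those numbers, here is a quick sketch using the same 8 ln(n)/eps^2 rule of thumb (the constant varies between statements of the lemma, so treat it as a ballpark):

    import math

    def jl_dim(n_points, eps):
        # rough Johnson-Lindenstrauss bound on the embedding dimension
        return 8 * math.log(n_points) / eps ** 2

    print(jl_dim(1000, 0.01))       # ~552620 dimensions for 1% error
    print(jl_dim(1113868, 0.01))    # ~1.11e6, i.e. roughly equal to n (the 1% crossover)
    print(jl_dim(7094, 0.1))        # ~7094, roughly equal to n (the 10% crossover)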


I think maybe it's poorly phrased. As far as I can tell, their linear regression example for eq. 2 has a unique solution, but I think they state that when optimizing for cosine similarity you can find non-unique solutions. But I haven't read it in detail.

Then again, you could argue whether that is a problem when considering very high-dimensional embeddings. Their conclusions seem to point in that direction, but I would not agree with that.


Embeddings result from computing which words can appear in a given context, so words that would appear in the same spot will have a higher cosine score between themselves.

But it doesn't differentiate further, so you can have "beautiful" and "ugly" embed very close to each other even though they are opposites - they tend to appear in similar places.

Another limitation of embeddings and cosine-similarity is that they can't tell you "how similar" - is it equivalence or just relatedness? They make a mess of equivalent, antonymous and related things.


For word2vec-esque word embeddings, yes.

For modern embedding models which effectively mean-pool the last hidden state of LLMs (and therefore make use of its optimizations such as attention tricks), embeddings can be much more robust to different contexts both local and global.


Would you have some links in mind for models that were pulled from LLMs?

The last ones I have in mind are BERT and its variants.


BERT embeddings were what proved that taking embeddings using the last hidden state works (in that case, using the [CLS] token representation which is IMO silly). Most of the top embedding models on the MTEB leaderboard are mean-pooled LLMs: https://huggingface.co/spaces/mteb/leaderboard

The not-quite-large embedding model I like to use now is nomic-embed-text-v1.5 (based on a BERT architecture), which supports an 8192-token context window and MRL for reducing the dimensionality if needed: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5


Works for what? The leaderboards? BPE tiktokens, BPE GPT-2 tokens, SentencePiece, GloVe, word2vec, ..., take your pick, they all end up in a latent space of arbitrary dimensionality and arbitrary vocab size where they can be mapped onto images. This is never going to work for language. The only thing the leaderboards are good for is enabling you to charge more for your model than everyone else for a month or two. The only meaning hyperparameters like dimensionality and vocab size have is in their message that more is always better and scaling up is what matters.


Works for:

Bitext mining: Given a sentence in one language, find its translation in a collection of sentences in another language using the cosine similarity of embeddings.

Classification: identify the kind of text you're dealing with using logistic regression on the embeddings.

Clustering: group similar texts together using k-means clustering on the embeddings.

Pair Classification: determine whether two texts are paraphrases of each other by using a binary threshold on the cosine similarity of the embeddings.

Reranking: given a query and a list of potential results, sort relevant results ahead of irrelevant ones by sorting according to the cosine similarity of embeddings.

Etc etc.

These are MTEB benchmark tasks https://arxiv.org/pdf/2210.07316.pdf . If you have no need for something like that, good for you, you don't need to care how well embeddings work for these tasks.


Easy there, Firthmiester. I'm familiar with the canon. If getting some desirable behavior in your application is good enough for you then feel free to ignore what I'm saying.


Embeddings and vector stores wouldn't have taken off in the way that they did if they didn't actually work.


They've taken off because they have utility in information retrieval systems. They work for getting info into Google (Stanford) Knowledge Panels. I don't think it really goes any further than that. They are most useful to the few orgs that went from dominating NLP research to controlling it outright by convincing everyone scale is the only way forward and owning scale. Alternatives to word embeddings aren't even considered or discussed. They are assumed as a starting point for pretty much all work in NLP today even though they are as uninteresting today as they were when word2vec was published in 2013. They do not and will not work for language.


I'm not sure if it's the most modern setup there is, but https://www.youtube.com/watch?v=UPtG_38Oq8o gives exceptionally friendly explanation.


While indeed modern embeddings are more robust:

All embeddings are the first layer of a DNN. In the case of word2vec this is a shallow 2-layer network. Selecting an embedding is a multiplication of the embedding matrix by a one-hot vector, which is usually optimized as an array lookup.
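A toy illustration of that last sentence (my own sketch, not tied to any particular framework): multiplying the embedding matrix by a one-hot vector just selects a row, which is why it is implemented as an array lookup.

    import numpy as np

    vocab_size, dim = 5, 3
    E = np.random.randn(vocab_size, dim)   # embedding matrix, one row per token

    token_id = 2
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0

    # matrix-vector product vs. direct row lookup: identical result
    print(np.allclose(one_hot @ E, E[token_id]))   # True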


That's not "all embeddings", that's just implementations like word2vec/fastText. And even though they are fast, they both don't get context as well and require significant preprocessing (e.g. stemming/stop word removal).

Implementations that use an LLM require a full forward pass, but the many optimizations in inference speed make it not noticeable for small applications.


That is why computational linguists prefer the term related over similar here. Similarity is notoriously hard to define, for starters in terms of grammatical vs. semantic similarity.


Only if those two words appear in the same contexts with the same frequency. In natural language this is probably not the case. There are things typically considered beautiful and others typically considered ugly.


One key insight is that opposites do have a lot in common; they are often opposites in exactly one of n feature dimensions. For example, black and white are both (arguably) colors, are related to (lack of) light, have representations in various formats (RGB, …), and appear in the same grammatical position of ordered adjectives …


I think your comment is very interesting. I have reflected many times on how to differentiate things that appear in the same context from things that are similar. Any big idea here could be the spark to initiate a great startup.


They make a mess of language. They are not a suitable representation. They are suitable for their efficiency in information retrieval systems and for sometimes crudely capturing semantic attributes in a way that is unreliable and uninterpretable. It ends there. Here's to ten more years of word2vec.


Isn't that kind of the point though? That beautiful and ugly are encoded closely as an "idea" when viewed from the appropriate angle?


Apart from an ESL class explaining antonyms, I can’t think of any use case where an API which could equally return “ugly” or “beautiful” can be used.

We need embeddings to give relatedness across axes like synonymity etc.


Distance metrics are an interesting topic. The field of ecology has a ton of them. For example see vegdist the Dissimilarity Indices for Community Ecologists function in the Vegan package in R: https://rdrr.io/cran/vegan/man/vegdist.html which includes, among others the "canberra", "clark", "bray", "kulczynski", "gower", "altGower", "morisita", "horn", "mountford", "raup", "chao", "cao", "mahalanobis", "chord", "hellinger", "aitchison", or "robust.aitchison".

Generic distance metrics can often be replaced with context-specific ones for better utility; it makes me wonder whether that insight could be useful in deep learning.


What are good distance metrics applied to latent embeddings as part of a diversity loss function to prevent model collapse?


hell if I know!! Sorry. I've used the vegan package for some analyses, but I've mostly used Manhattan and cosine metrics. I just wanted to bring up the idea that there are a lot of metrics out there that may not be generally appreciated.

Claude Opus says "There are a few good distance metrics commonly used with latent embeddings to promote diversity and prevent model collapse:

1. Euclidean Distance (L2 Distance)

2. Cosine distance

3. Kullback-Leibler (KL) Divergence: KL divergence quantifies how much one probability distribution differs from another. It can be used to measure the difference between the distributions of latent embeddings. Minimizing KL divergence as a diversity loss would encourage the embedding distribution to be more uniform.

4. Maximum Mean Discrepancy (MMD): MMD measures the difference between two distributions by comparing their moments (mean, variance, etc.) in a reproducing kernel Hilbert space. It's useful for comparing high-dimensional distributions like those of embeddings. MMD loss promotes diversity by penalizing embeddings that are too clustered together.

5. Gaussian Annulus Loss: This loss function encourages embeddings to lie within an annulus (ring) in the latent space defined by two Gaussian distributions. It promotes uniformity in the embedding norms while allowing angular diversity. This can be effective at preventing collapse to a single point."

But I haven't checked for hallucinations.


To add further: https://cran.r-project.org/web/packages/vegan/vignettes/dive... The vegan package is very much into methods for assessing diversity in ecologies.

Beta diversity is one metric for examining diversity, defined as the ratio between regional and local species diversity. https://en.wikipedia.org/wiki/Beta_diversity


I had no idea that diversity loss function was a topic in deep learning. I admit, I'm a bit fascinated, as a neuroimaging scientist.


Have a look at section 3.2 of the Wav2vec2 paper:

https://arxiv.org/pdf/2006.11477.pdf


I quickly read through the paper. One thing to note is that they use the Frobenius norm (at least I assume this from the subscript F) for the matrix factorization. That is for their learning algorithm. Then they use cosine similarity to evaluate - a metric that wasn't used in the algorithm.

This is a long-standing question for me. Theoretically, I should use the CS in my optimization and then also in the evaluation. But I haven't tested this empirically.

For example, there is spherical K-means, which clusters the data on the unit sphere.


I think that's kind of the point of the paper. The model is based on un-normalized dot products, and wasn't deliberately designed to produce meaningful cosine similarities. They are showing that, in that case, cosine similarities might be arbitrary and not as useful as people might assume or hope.


Why would anyone expect cosine-similarity to be a useful metric? In the real world, the arbitrary absolute position of an object in the universe (if it could be measured) isn't that important; it's the directions and distances to nearby objects that matter most.

It's my understanding that the delta between two word embeddings gives a direction, and the magic is from using those directions to get to new words. The oft-cited example is King-Man+Woman = Queen [1]

When did this view fall from favor?

[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...


The scale of word embeddings (e.g. distance from 0) is mainly measuring how common the word is in the training corpus. This is a feature of almost all training objectives since word2vec (though some normalize the vectors).

Uncommon words have more information content than common words. So, common words having larger embedding scale is an issue here.

If you want to measure similarity you need a scale free measure. Cosine similarity (angle distance) does it without normalizing.

If you normalize your vectors, cosine similarity is the same as Euclidean distance. Normalizing your vectors also leads to information destruction, which we'd rather avoid.

There's no real hard theory why the angle between embeddings is meaningful beyond this practical knowledge to my understanding.


> If you normalize your vectors, cosine similarity is the same as Euclidean distance.

If you normalize your vectors, cosine similarity is the same as dot product. Euclidean distance is still different.


Oh, thanks for the correction.

If all the vectors are on the unit sphere, then cosine = dot product. But then the dot product is a linear transformation away from the Euclidean distance:

https://math.stackexchange.com/questions/1236465/euclidean-d...

If you're using it in a machine learning model, things that are one linear transform away are more or less the same (might need more parameters/layers/etc.)

If you're using it for classical statistics uses (analytics), right, they're not equivalent and it would be good to remember this distinction.


To be very explicit: if |x| = |y| = 1, we have |x - y|^2 = |x|^2 - 2 x·y + |y|^2 = 2 - 2 x·y = 2 - 2 cos(θ). So they are not identical, but minimizing the Euclidean distance between two unit vectors is the same as maximizing their cosine similarity.
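A quick numeric check of that identity, just as a sketch with random unit vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

    cos_sim = np.dot(x, y)                        # |x| = |y| = 1, so this is cos(theta)
    sq_dist = np.linalg.norm(x - y) ** 2

    print(np.isclose(sq_dist, 2 - 2 * cos_sim))   # True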


Cosine-similarity is a useful metric. The cases where it is useful are models that have been trained specifically to produce a meaningful cosine distance (e.g. OpenAI's CLIP [1], Sentence Transformers [2]) - but these are the types of models that the majority of people are using when they use cosine distances.

> It's my understanding that the delta between two word embeddings, gives a direction, and the magic is from using those directions to get to new words... it's the directions and distances to nearby objects that matters most

Cosine similarity is a kind of "delta" / inverse distance between the representations of two entities, in the case of these models.

[1] https://arxiv.org/abs/2103.00020

[2] https://www.sbert.net/docs/training/overview.html


cosine similarity is (isomorphic to) "distances to nearby objects". and not all embeddings are word embeddings.


It's isomorphic when vectors are normalized, otherwise it's angle distance, not position distance


It’s a mistake to think of vectors as coordinates of objects in space, though. You can visualize them like that, but that’s not what they are. The vectors are the objects.

A vector is just a list of n numbers. Embedded into an n-dimensional space, a vector is a distance in a direction. It isn’t ’the point you get to by going that distance in that direction from the origin of that space’. You don’t need a space to have an origin for the embedding to make sense - for ‘cosine similarity’ to make sense.

Cosine similarity is just ‘how similar is the direction these vectors point in’.

The geometric intuition of ‘angle between’ actually does a disservice here when we are talking about high dimensional vectors. We’re talking about things that are much more similar to functions than spatial vectors, and while you can readily talk about the ‘normalized dot product’ of two functions it’s much less reasonable to talk about the ‘cosine similarity’ between them - it just turns out that mathematically those are equivalent.


Fair enough.

I think people skip over that the vectors are the result of the minimization of the objective.

That objective is roughly the same since word2vec. GloVe is mathematically equivalent. LLMs are also equivalent.

For an LM, the objective function is still roughly the same: maximizing the probability of the next token conditional on the previous tokens.

This means the embedding vector of a token minimizes distance to tokens that come before it often, and maximizes distance to those that don't.


From my experience trying to train embeddings from transformers, using cosine similarity is less restrictive for the model than Euclidean distance. Both work, but cosine similarity seems to have slightly better performance.

Another thing you have to keep in mind is that these embeddings live in an n-dimensional space. Intuitions about the real world do not apply there.


The word2vec inspired tricks like king-man+woman only work if the embedding is trained with synonym/antonym triplets to give them the semantic locality that allows that kind of vector math. This isn't always done, even some word2vec re-implementations skip this step completely. Also, not all embeddings are word embeddings.


My understanding was that Word2Vec[1] was trained on Wikipedia and other such texts, not artificially constructed things like the triplets you suggest. There's an inherent structure present in human languages that enable the "magic" of embeddings to work, as far as I can tell.

[1] https://code.google.com/archive/p/word2vec/


Has there been any rigorous evaluation of word2vec calculating 'king-man+woman=queen' associations? I only recall the author providing some cherry-picked examples from their results, which I suppose makes it a seminal AI paper.


The original paper included source, and that has their test data and results -- it gets ~77% accuracy on about 20k example word analogies (with 99.7% coverage), and 78% accuracy with phrases with 77% coverage (and a much smaller data set, 3,218 examples). You can see the test sets here:

https://github.com/tmikolov/word2vec/blob/master/questions-w...

https://github.com/tmikolov/word2vec/blob/master/questions-p...

and you could see how much better LLMs do on the same 20k examples.
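If you want to rerun that evaluation yourself, gensim can do it against the same test file. A sketch below, assuming you have a word2vec-format vector file locally (the path is a placeholder, and the exact API may differ between gensim versions):

    from gensim.models import KeyedVectors

    # placeholder path; any word2vec-format vector file works here
    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # the classic king - man + woman query
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

    # overall accuracy on the questions-words.txt analogy set linked above
    score, sections = kv.evaluate_word_analogies("questions-words.txt")
    print(score)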



If a paper attacks plus and minus it ought to cite this one.


Ah, yes, I may have misremembered or misunderstood when looking at the training data and model definition; it would have been about a decade ago now. Or perhaps I was thinking of an unrelated experiment that used chosen analogies during training, but yeah, that isn't part of the original paper.

The training process from the original Mikolov et al. paper only uses the analogy examples (questions-words.txt and questions-phrases.txt) to measure accuracy after training: https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50...


There is an inherent structure in language. Embeddings do not and will not capture it. It's why they do not work. Their ability to form grammatical sentences with high accuracy is part of the illusion that you have been understood.


>> the delta between two word embeddings, gives a direction, and the magic is from using those directions

A direction can be given in terms of an angle measure, such as cosine.


The paper kinda leaves you hanging on the "alternatives" front, even though they have a section dedicated to it.

In addition to the _quality_ of any proposed alternative(s), computational speed also has to be a consideration. I've run into multiple situations where you want to measure similarities on the order of millions/billions of times. Especially for realtime applications (like RAG?) speed may even outweigh quality.


> While cosine similarity is invariant under such rotations R, one of the key insights in this paper is that the first (but not the second) objective is also invariant to rescalings of the columns of A and B

Ha, interesting - I wrote a blog post pointing this out a few years ago [1], and how we got around it for item-item similarity at an old job (essentially an implicit re-projection to the original space, as noted in section 3).

https://swarbrickjones.wordpress.com/2016/11/24/note-on-an-i...


It's definitely not about semantics or language. As far as language is concerned similarity metrics are semantically vacuous and quantifying semantic similarity is a bogus enterprise.


Can you elaborate?


Modeling language in a latent space is useful for certain kinds of analyses and certain aspects of language. It has its place as an empirical tool. That place is not the nuts and bolts of language itself. There are more suitable formalisms for this than directional magnitudes and BPE tiktokens.


Intuitively I agree with you, despite the unexpected ‘success’ of current approaches. What formalisms do you suggest?


The Lambek calculus. Categorial grammars. Meanings are proofs. Not clusters of directional magnitudes in space.


Curiously, the upcoming third edition of Jurafsky and Martin [1], one of the two standard text books for NLP, places Context-Free Grammars, Combinatory Categorial Grammars, and logical meaning representations in its appendices on the companion Web site, no longer in the text book itself. Unthinkable only a few years ago.

[1] https://web.stanford.edu/~jurafsky/slp3/


That's a really interesting thing to point out. NLP doesn't even work on language anymore. If it was adjacent to information retrieval before it is now a subfield of information retrieval. As long as it's grounded in Firth Mode natural language understanding, as it's called, can't really be a semantics.

I tried to create a Kaggle (TensorFlow Hub, TensorFlow Quantum) competition for motivating alternative formalisms but was unable to publish it because all Kaggle competitions must be evaluated with information retrieval metrics. Talk about a one-track mindset!

Today work in NLP advances by "leaderboards" and dubious, language-specific evaluation datasets that the same authors stand to benefit from when their proprietary model is praised for doing well on the evaluation criteria they invented a few months back. It validates the price hike for access to their proprietary models.

These formalisms that do work are at odds with Firth Mode, the preferred representation for Google (Stanford, OpenAI), so I guess we should be thankful they're still in the book. If you're interested in language, though, I'd suggest picking up a different book.


What are those formalisms?


Cosine Similarity is very much about similarity, but it's quite fickle and indirect.

Given a function f(l, r) that measures, say, the log-probability of observing both l and r, and given that the function takes the form f(l, r) = <L(l), R(r)>, i.e. the dot product between embeddings of l and r, then the cosine similarity of x and y, i.e. the normalized dot product of L(x) and L(y), is very closely related to the correlation of f(x, Z) and f(y, Z) when we let Z vary.
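A small simulation of that claim (my own sketch; it assumes the embeddings R(z) are roughly isotropic, which is the idealized case where the relationship is exact):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_z = 16, 100_000

    Lx, Ly = rng.standard_normal(d), rng.standard_normal(d)
    R = rng.standard_normal((n_z, d))     # stand-in for the embeddings R(z) of many z

    fx, fy = R @ Lx, R @ Ly               # f(x, z) and f(y, z) for every z
    corr = np.corrcoef(fx, fy)[0, 1]

    cos_sim = Lx @ Ly / (np.linalg.norm(Lx) * np.linalg.norm(Ly))
    print(corr, cos_sim)                  # very close for isotropic R(z)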


A well kept secret of linear algebra is that having an inner product at all isn't as self-evident as it might seem. Euclidean distance might seem like some canonical notion of distance, but it doesn't have to be meaningful, especially if the choice of coordinates has no geometrical meaning.


It seems easiest to blame the embedding and not the cosine similarity.


Hmm, typically the models where people use cosine similarity on embeddings have been deliberately trained such that the cosine similarity is meaningful. It looks like this paper is looking at examples where the models have not been deliberately trained for the cosine similarities, and hence in these situations it would indeed be unreasonable to assume cosine similarities to be a good idea.. (but that's kind of a given?)

For example, here's the loss from the CLIP paper [1], which ensures cosine similarities are meaningful:

    # joint multimodal embedding [n, d_e]
    I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
    T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
    # scaled pairwise cosine similarities [n, n]
    logits = np.dot(I_e, T_e.T) * np.exp(t)
    # symmetric loss function
    labels = np.arange(n)
    loss_i = cross_entropy_loss(logits, labels, axis=0)
    loss_t = cross_entropy_loss(logits, labels, axis=1)
    loss = (loss_i + loss_t)/2
And Sentence Transformers [2] using CosineSimilarityLoss:

    train_loss = losses.CosineSimilarityLoss(model)
    # Tune the model
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
[1] https://arxiv.org/pdf/2103.00020.pdf

[2] https://www.sbert.net/docs/training/overview.html


Right, the embedding is contrived to give Euclidean distance and cosine similarity the usual meaning.


Thanks for these examples.


It's about keeping stuff on the unit hypersphere


We found similar results when working on our paper for creating LLM agents with metacognition and explicitly called this out in the paper: https://replicantlife.com


What are better alternatives to cosine similarity?


There aren't any alternatives: cosine similarity is effectively an extension of Euclidean distance, which is the mathematically correct way of finding the distance between vectors.

You may not want to use cosine similarity as your only metric for rankings, however, and you may want to experiment with how you construct the embeddings.


Euclidean distance is only 'mathematically correct' if you care about rotations in your space being distance preserving. I don't think that is really the case in these spaces.

There is a whole branch of mathematics dedicated to other ways to measure distance.


How wouldn't distances be preserved under rotation? I'm asking for a friend, of course.


With the Manhattan distance, for example, if you rotate your coordinate grid, distances will change.
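A concrete check, taking distance from the origin: rotate the point (1, 1) by 45 degrees and the L1 (Manhattan) norm changes while the L2 norm doesn't.

    import numpy as np

    p = np.array([1.0, 1.0])
    theta = np.pi / 4                                # rotate the plane by 45 degrees
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    q = R @ p                                        # ~(0, 1.414)

    print(np.abs(p).sum(), np.abs(q).sum())          # L1: 2.0 vs ~1.414 (changed)
    print(np.linalg.norm(p), np.linalg.norm(q))      # L2: ~1.414 vs ~1.414 (preserved)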


What happens when you mix cosine similarity with Euclidean distance? At least give a small penalty if the Euclidean distance is too far off?


If your embeddings are normalized this won't change much, since Euclidean distance and cosine similarity produce the same ranking on normalized vectors.

If your embeddings aren't normalized it's worth trying. In our use cases it never made a substantial difference, but I imagine there are cases where it does.
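A tiny sketch of the first point (same ranking on normalized vectors):

    import numpy as np

    rng = np.random.default_rng(0)
    q = rng.standard_normal(8)
    X = rng.standard_normal((50, 8))

    # normalize the query and the candidates to unit length
    q /= np.linalg.norm(q)
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    rank_by_cos = np.argsort(-(X @ q))                       # descending cosine similarity
    rank_by_l2 = np.argsort(np.linalg.norm(X - q, axis=1))   # ascending Euclidean distance

    print(np.array_equal(rank_by_cos, rank_by_l2))           # True: same ordering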


For describing semantics in natural language? Pretty much anything else.


like?


Having a mid-century theory of natural language semantics isn't necessarily a bad thing. You just have to pick the right one.


...like?



All models are wrong, some are useful.


Not having read the paper: cosine similarity has little to no semantic understanding of sentences.

E.g. the following triple

1: "Yes, this is a demonstration"

2: "Yes, this isn't a demonstration"

3: "Here is an example"

<1, 2> has "higher" cosine similarity than <1, 3>. The pair is structurally equivalent except for one token/word, yet <1, 2> semantically mean the opposite of each other (depending on what you're targeting in that sentence), while <1, 3> mean effectively the same thing.

If this paper is about persuading people about efficacy with regard to semantic understanding, OK, but that was always known. If it's about something relating to vectors and the underlying operations, then I'll be interested.
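For what it's worth, this is easy to check against a modern sentence-embedding model. A sketch below, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint; the actual numbers (and which pair wins) will depend on the model you pick:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = [
        "Yes, this is a demonstration",
        "Yes, this isn't a demonstration",
        "Here is an example",
    ]
    emb = model.encode(sentences, convert_to_tensor=True)

    print(util.cos_sim(emb[0], emb[1]))  # pair <1, 2>
    print(util.cos_sim(emb[0], emb[2]))  # pair <1, 3>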


Whether or not the cosine similarity of either pair is higher depends on the mapping you create from the strings to the embedding vectors. That mapping can be whichever function you choose, and your result will be entirely dependent on that.

If you choose a straight linear mapping of tokens to a number, then you'd be right.

Extending that, if you choose any mapping which does not do a more extensive remapping from raw syntactic structure to some sort of semantic representation, you'd be right.

But that is why we increasingly use models to create embeddings, instead of simpler approaches, before applying a similarity metric, whether cosine similarity or something else.

Put another way, there is no inherent reason why you couldn't have a model where the embeddings for 1 and 3 are even identical, and so it is meaningless to talk about the cosine similarity of your sentences without setting out your assumptions about how you will create embeddings from them.


> meaningless to talk about the cosine similarity of your sentences without setting out your assumptions about how you will create embeddings from them.

I agree, but from a generic POV, you have to settle on a few things to compare between models. If you can't, then benchmarks are useless too, outside of extremely narrow measures.

I only address structure in the parent, and sure, it can be too generic a statement by only touching on structure. But I would almost assert that structure is still an important feature, and that it is required, or otherwise a dominant feature, when you want to deliver a product for general use.

Given this, I don't think I get much more incorrect by going beyond a few dimensions.


From the introduction to the paper:

> Discrete entities are often embedded via a learned mapping to dense real-valued vectors in a variety of domains.

Already from that point, it is clear that a comparison based on the similarity of the textual version of the sentences is irrelevant to the evaluation in the paper. The paper consistently talks in terms of "learned embeddings" rather than simplistic direct mappings of words.


It is meaningless to talk about cosine similarity of sentences, or words, at all. Choose whatever mapping you want. You'll still be in Firth Mode.


It's meaningful to talk about cosine similarity for anything that you can quantify in ways such that the cosine similarity reflects a measure you care about. Same applies for any function. If it works, it's meaningful to talk about it whether or not it has a reasonable interpretation beyond that.


Uh oh. LOL. Got some angry Firthers out there.


That is entirely dependent on the model used for the embeddings. You can fine-tune for pretty much any outcome you want.


You can't fine-tune for understanding or reasoning. You can't "get better performance" on understanding. You're either equipped for it or you're not.


That might be true for one-hot vectors, but it's not true for learned embeddings viewed through the lens of attention. That said, I only made it to page 3/9 of the paper before the mark-up for the math went over my head.


If we're talking about adding dimensionality and relying on the kernel, sure, but I only get so much more incorrect by going from 1 to M dims.

I can't, of course, be certain in all cases, but dimensions are typically (from past experience, and using knowledge from word2vec experiments years ago) derivative of higher dimensions. The kernel still operates on the same concept, applying a norm along with whatever weightings to each dim.

Semantic understanding is still not there, in my opinion; we might feign it by increasing specificity, but only so much. The largest contributor will likely still be the determining factor, rather than the series of smaller, more specific dimensions.

I tested this using sentences similar to those in my original comment, failing in more scenarios than passing. I am of course biased, since it may be that I did not select the right dimensions or measures.


It is true. And if you want to say anything about meaning this isn't even the right math.


What is the cheapest way to capture similarity if not via dot product then?


Instead of a sum of multiplications you could, for example, use a sum of squares of differences.

Mean squared error instead of dot product; it's not cheaper, but it's close.

If you want to go cheaper you could use a sum of absolute differences.
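Spelled out as a sketch (all three are a single pass over the vectors; which one is actually cheapest depends on the hardware):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 3.0, 4.0])

    dot = np.dot(a, b)                 # sum of elementwise products
    ssd = np.sum((a - b) ** 2)         # sum of squared differences (squared L2 distance)
    sad = np.sum(np.abs(a - b))        # sum of absolute differences (L1 distance)

    print(dot, ssd, sad)               # 20.0 3.0 3.0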


This is effectively "the same" as dot product.

For a lot of embeddings we have today, norm of any embedding vector is roughly of same size, so the angle between two vectors is roughly same size as length of difference that you are saying, and can be expressed in terms of 1 - dot product after scaling


I don't have an answer for this really, outside of silly ones like "strict equality check", but I assert that no one else does either, at least today and right now, and it's an inherent limitation due to the nature of embeddings and the space they aim to occupy (cheap, fast, good-enough similarity for your use case).

You're probably best off using the commercial suggestion, and if it's dot product, go for it. I am no expert in this area and my interest wanes every day.


Interested to know as well


No understanding. Embeddings are a semantically vacuous representation and similarity is a semantically vacuous interpretation.



