SentenceTransformers: Python framework for sentence, text and image embeddings (sbert.net)
205 points by tosh on April 7, 2024 | 70 comments


These are extremely useful embedding models, and some are small enough to use in the frontend (with transformers.js) for on-device semantic search.

One issue I’ve run into is that they produce a very good ranking, but the actual cosine similarity scores appear to be meaningless. (E.g., “chocolate chip cookies” and “PLS6;YJBXSRF&/” could have a similarity of 0.8.) Consequently, I’ve had a hard time selecting sensible cutoff values to balance recall and precision. Has anyone found a good approach to this?


Even if you choose a more sophisticated similarity measure as suggested by another commenter, you’ll still need to set a threshold on that metric to perform binary classification.

In my experience there are two paths forward. The one I recommend is to train an MLP classifier on the embeddings to produce a binary classification (i.e., similar or not). The advantage is that you no longer need to set a numeric threshold on a distance metric; however, you will need labeled training data to define what "similar" means in the context of your use case.

The other path is to calculate the statistics of pair-wise distances for every record in some unlabeled dataset you have, and use the resulting distribution to inform choice of threshold. This will at least give you an estimate of what % of records will be classified as similar for a given threshold value.
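For the second path, here is a minimal sketch (the model name and sample texts are stand-ins for your own data) that estimates the pairwise-similarity distribution on an unlabeled sample and reads candidate thresholds off its percentiles:

  from sentence_transformers import SentenceTransformer
  import numpy as np

  # Hypothetical unlabeled sample drawn from your corpus
  texts = ["chocolate chip cookies", "oatmeal raisin recipe",
           "quarterly earnings report", "bus schedule for route 7"]

  model = SentenceTransformer("all-mpnet-base-v2")
  emb = model.encode(texts, normalize_embeddings=True)  # unit-length vectors

  # Cosine similarities for all distinct pairs (upper triangle, no self-pairs)
  sims = emb @ emb.T
  pairwise = sims[np.triu_indices(len(texts), k=1)]

  # e.g. "call the top 1% of pairs similar" -> use the 99th percentile as the cutoff
  print(np.percentile(pairwise, [50, 90, 99]))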


I’ve had a lot of success training a classifier on top of these embeddings. An MLP works well, as does an SVM. If you don’t have labeled data, I’ve also found that various active learning techniques can produce extremely strong results while only requiring a small amount of human labeling.
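As a rough illustration of that setup (the labeled pairs and the pair features are hypothetical; an SVC from scikit-learn can be dropped in the same way):

  import numpy as np
  from sentence_transformers import SentenceTransformer
  from sklearn.neural_network import MLPClassifier

  # Hypothetical labeled pairs: (text_a, text_b, 1 if "similar" for your use case else 0)
  pairs = [("chocolate chip cookies", "oatmeal cookie recipe", 1),
           ("chocolate chip cookies", "quarterly earnings report", 0),
           ("bus schedule for route 7", "route 7 timetable", 1),
           ("bus schedule for route 7", "PLS6;YJBXSRF&/", 0)]

  model = SentenceTransformer("all-mpnet-base-v2")
  a = model.encode([p[0] for p in pairs], normalize_embeddings=True)
  b = model.encode([p[1] for p in pairs], normalize_embeddings=True)
  y = np.array([p[2] for p in pairs])

  # Simple pair features: the two embeddings plus their elementwise product
  X = np.hstack([a, b, a * b])

  clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500).fit(X, y)
  # clf.predict_proba(...)[:, 1] then acts as a "similar" score, with no cosine cutoff to tune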


Yep, we do precisely this: st + SVM => pre-filtering / routing => RAG pipelines for some of louie.ai.

In our cases, < 100 labels on modern models (per task) goes far. We could do better, but there are more useful problems to solve. Impressive how far these have come!


I second the point about learning a classifier over them - it is practically quite useful. But I'd caution against believing Active Learning (AL) would be unequivocally useful; most positive results seem to arise in very specific circumstances [1]. Esp. see Table 1: a Linear SVM with MPNet (one of the embeddings sbert supports) indeed has strong performance, but, in general, random sampling (over an AL strategy) performs quite well!

[1] https://arxiv.org/pdf/2403.15744v1.pdf


Extremely interesting paper that runs counter to my experience. In my previous role I ran tens of thousands of classification pipelines similar to those described in the paper, using bert-based models and then running SVMs on top as classifiers. In almost all cases AL techniques such as Least Certainty and Query by Committee dramatically outperformed random sampling. Reading the paper a bit more closely, the two differences that stand out are the use of multiclass classifiers (we transformed multiclass classification problems into a number of binary classification problems in order to improve performance), and the relative simplicity of the datasets used in the paper. That being said, anyone who is doing AL should of course first consider model performance with a randomly-selected subsample of the data!


Appreciate the reply!

We have 2 binary datasets in there [1], but of course, those results are not separately presented - you see the averages. The choice of datasets here is based on what other papers report, to have comparable results. In our experience with real data, AL doesn't fare any better on average - our pipelines use different representations (not just BERT based, includes some legacy ones like n-grams+tf-idf) and classifiers.

We do see some strong performance with RoBERTa (Fig 6) and the margin strategy, which is an uncertainty-based strategy like least confidence (Fig 7) - these probably correspond to your observations (to an extent).

It is a little hard to comment on differences without looking closely (I'd be happy to go through any document/tech. report you might be able to share) but in our experience, we have seen inconsistencies arise from:

* Not performing model selection at each iteration.

* Not calibrating the selected model at each iteration.

In these cases, a non-random query strategy sometimes just makes up for the lack of proper model selection, making it seem like the query strategy is stronger than it really is. Another point of difference is the seed data size - that can have a significant effect on "priming" the query strategy.

On a related note, specifically wrt uncertainty-based strategies, sampling bias is a known problem [2].

[1] On the Fragility of Active Learners, https://arxiv.org/abs/2403.15744

[2] Section 1.2 in Two faces of active learning, https://cseweb.ucsd.edu/~dasgupta/papers/twoface.pdf


Unfortunately I don't have anything to share, as much of the data now rests with my former employer. That being said, a few comments:

* Understood that dataset selection exists within the context of what others are doing and trying to provide comparable results. That being said, binary sentiment classifiers are (and should be) a pretty low bar for any reasonable technique in 2024, so it necessarily makes the reader wonder whether the conclusions are actually applicable to problems they might be facing. The point is taken that you're also looking at some older methods like TF-IDF which are prevalent in the (historical) AL literature.

* Performing model selection and calibration (cross-validation) at each iteration makes intuitive sense, but doesn't really make practical sense. If the goal of AL in the real world is getting frequent updates from a human oracle, new models need to be trained in seconds (at absolute maximum). Even if performance improves by spending time doing model selection/CV, humans simply won't tolerate waiting around for new responses to label.

* In your first link, the sample sizes seem extremely...small? If I was consistently faced with datasets only containing 1000 or 2000 data points, I'd probably just do supervised learning because the effort to set up a good AL pipeline (not to mention dealing with stopping criteria etc.) probably wouldn't be worth it. My experience has mostly been with datasets containing tens of thousands of data points (if not more).

* 100% agree that sampling bias is a huge issue with a number of AL strategies and one we worked on extensively. A component of that is trying to design the interaction such that the human labeler has some insight into parts of the data distribution that the model is making assumptions about; your second link is very nice and we tried a number of clustering-type techniques alongside our active learner to try and offset sample bias. More generally, I think this points to the importance of considering any human-in-the-loop ML problem in a way that takes into account more than just algorithmic efficiency.


Thanks for the good discussion!

* Reg. model selection/calibration: I acknowledge that in interactive setups cross-validation (CV) is not practical, but that is a particular kind of model selection. Faster alternatives exist for specific model families, such as approx. gradients [1,2], out-of-bag error (Random Forests), and the Infinitesimal Jackknife [3]. The bigger problem to me (with no model selection) is that there is a philosophical gap here: why would we fit the model with a default parameter blessed by a specific library? For ex., for LinearSVC in scikit, what is so great about C=1? And then, papers should mention that their query strategies are evaluated against C=1, but that of course doesn't sound as cool :-)

Problem #2 is what I had mentioned earlier: the AL query strategy seems to be doing two things now - making up for the lack of good hyperparameters in the current model, and picking up "good" points - so how do I know which it does more of? Or what is it really good at? If its performance is influenced by a lot of the former, it is hardly surprising that when you actually perform model selection, the gap between a query strategy and random decreases.

I guess my point is it doesn't have to be CV - but there needs to be model selection, because there is no well-defined notion of a "default model". Calibration is less of an issue, because you can do this just once on the best model, and it's not as expensive.

* In that paper [4], the effect of AL is shown for up to 5k points, but the unlabelled pool size is 20k points ("Other Settings" on Page 5). So we do require good instance selection.

On a different note, my hunch is that as language models get better at holding a broad variety of concept information, and it keeps getting easier to adapt them to specific tasks with just few examples/shots, we'll see whatever benefit AL has over random diminish. First (the bigger reason), the number of examples required to get good performance would reduce to the point that it might not make practical sense to set up an AL pipeline. And second, it would get easier to adapt the model per se, so that the difference (in terms of final model accuracy) between k good points and k randomly selected points might not be large. I guess we already see this in some form with SetFit [5] where examples are randomly picked to reach a good accuracy. I wonder if you have a view on that.

[1] Hyperparameter optimization with approximate gradient, https://proceedings.mlr.press/v48/pedregosa16.html

[2] Optimizing Millions of Hyperparameters by Implicit Differentiation, https://proceedings.mlr.press/v108/lorraine20a.html

[3] A Swiss Army Infinitesimal Jackknife, https://proceedings.mlr.press/v89/giordano19a.html

[4] On the Fragility of Active Learners, https://arxiv.org/pdf/2403.15744.pdf

[5] Efficient Few-Shot Learning Without Prompts, https://neurips2022-enlsp.github.io/papers/paper_17.pdf


When I was looking at AL query strategies (including random sampling, which I agree should always be included as a baseline), hyperparameter tuning (and model selection more generally) was definitely performed and optimized on a per-strategy basis. (This is a lot of combinations but it's reasonable to prune things that are clearly dead-ends, and that's how we ended up with a non-linear SVC.) We just didn't do it between iterations due to the practical concerns.

I understand the view that from a scientific perspective it is difficult to quantify the utility of a technique if it is providing multiple benefits at the same time. I'm not necessarily saying that's what's happening here, but it's worth pointing out that you can also look at this from a practical angle. For any given dataset your chosen model and set of hyperparameters probably aren't optimal, so if indeed AL is compensating for suboptimal choices in model selection, that may still be valuable.

Another potentially interesting angle to consider is label imbalance; because we were constructing binary classification problems out of multiclass problems (as discussed earlier), some evidence shows AL strategies outperform random sampling to a significantly greater degree in imbalanced scenarios [1].

I agree with you that we're moving towards a world in which transformer models are so good that zero-shot or few-shot approaches to classification are going to be the way to go. Stuff I did 3 years ago I probably would not do today.

[1] https://aclanthology.org/2020.emnlp-main.638.pdf


It becomes valuable if it either compensates for the lack of hyperparam tuning in all or a broad class of models, or the result is precisely presented to say something like "better than random if the model is a linear svm with C=1".

Right now we are somewhere in between where the results are presented to seem somewhat generally useful, but when you tweak the hyperparams you realize that the utility is potentially quite narrow.

We either need broad utility or increased rigor in reporting - we cannot abandon both.

Yeah that's an interesting paper [1], and label imbalance is a case we have not investigated (we were able to reproduce some results from it - the DAL experiments in Section A.3 "Reproducibility Experiments" in [2] come from there).

[1] Active Learning for BERT: An Empirical Study, https://aclanthology.org/2020.emnlp-main.638.pdf

[2] On the Fragility of Active Learners, https://arxiv.org/abs/2403.15744


Just wanted to say thanks for the interesting discussion - I learned a few things and I hope you did too!


Yes I did, thank you!


The simplest way of building an MLP on top of the embeddings is to concatenate the embeddings and put some dense layers on top.

However, if you use a "two towers" approach and have several additional MLP layers on top of each embedding separately, and then a dense MLP on the concatenation of the two towers' outputs, the individual tower MLP layers act as an embedding transformation and will improve retrieval.

Like this:

        Final MLP
      /           \
   QP MLP       DP MLP
     |            |
    QE            DE
Now, you can apply the document preprocessor ("DP MLP") to your document embeddings ("DE") before storing them in the vector database, and apply the query preprocessor ("QP MLP") to your query embeddings ("QE") before querying your vector database.

This should improve the precision and recall of your vector retrieval step beyond using e.g. raw LLM embeddings. Even better is for the final layer to just be cosine similarity, maximum inner product, or L2 distance, rather than an MLP, so you can use a raw threshold (it's at least worth trying).
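A rough PyTorch sketch of this layout (the dimensions, depths, and the choice of cosine similarity as the final layer are assumptions, not a prescription):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TwoTower(nn.Module):
      def __init__(self, dim=768, hidden=256):
          super().__init__()
          # Query preprocessor ("QP MLP") and document preprocessor ("DP MLP")
          self.qp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
          self.dp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

      def forward(self, qe, de):
          q = F.normalize(self.qp(qe), dim=-1)   # transformed query embedding
          d = F.normalize(self.dp(de), dim=-1)   # transformed document embedding
          return (q * d).sum(dim=-1)             # cosine similarity as the "final layer"

At index time you run document embeddings through `dp` before storing them; at query time you run the query embedding through `qp` before searching, exactly as described above.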


Interesting. This sounds like it would be similar to fine-tuning the embeddings, but with added benefit of learning different representations for the query and document. If you keep a distance/similarity measure as the final layer, then I'm assuming this isn't going to work with binary labels?


If you have e.g. cosine similarity as the final layer: if the label is 1 you reward the similarity being close to 1, and if the label is 0 you reward it being close to 0.

The fine-tuning here is specific to optimizing for retrieval, which may be different from just matching documents, and that can be an advantage.

You may want to force the query and document finetunings to be the same, which makes a lot of sense, but the advantage in them differing can be that query strings are often rather short, and in a different sort of language structure than documents, so the differing query and document tunings can in some sense "normalize" queries and documents to be in the same space, when it works well.


This has got me curious. I don't really understand how binary labels could give us embeddings that are well-ordered by distance. For a training pair where (Q, D) are highly similar and a pair where (Q, D) are just barely related, the model is being trained that they're the same distance apart. Is there something I'm not seeing here?


See contrastive losses and siamese networks, like here (where they use L2 distance):

https://lilianweng.github.io/posts/2021-05-31-contrastive/

If documents are similar, you want the two embeddings to be close to each other, if they are dissimilar, you want them to be far apart.

Binary relevance judgements of course don't necessarily produce an ordering, but usually over a large enough set of training examples some will be better matches than others.

"Learning to Rank" gets you into all kinds of labels and losses if you want to go down that rabbit hoel.


These are excellent suggestions, and much appreciated. Thanks!

If you train a binary classifier on the embeddings, have you found that the resulting probabilities are also good for ranking? Or do you stick with a distance measure for that?


Could you maybe tell me more about this approach? How do I have to build my training dataset?

Any paper or other source about this approach?


Er, is it possible that you are using `scipy.spatial.distance.cosine` to compute the similarity? If so, note that this computes the cosine distance, and the cosine similarity is defined as 1-cosine distance.

I tried out your example using the following code:

  from sentence_transformers import SentenceTransformer
  import scipy.spatial as ssp
  
  model = SentenceTransformer("all-mpnet-base-v2")
  A = model.encode(['chocolate chip cookies','PLS6;YJBXSRF&/'])

  CosineDistance = ssp.distance.cosine(A[0], A[1])  # note: this is the cosine *distance*, not similarity
Where `CosineDistance == 0.953`

This means the model is actually working quite well; were these similar to each other, we'd expect CosineDistance to be much closer to 0.

The other comments about such distances being useful for relative comparisons also apply: I've used SentenceTransformers quite successfully for nearest-neighbor searches.


Yes, check out my library for vector similarity that has various other measures which are more discriminative:

https://github.com/Dicklesworthstone/fast_vector_similarity

pip install fast_vector_similarity


That’s cool. Thanks!


great resource


I haven't noticed this behavior in such extreme ways as you. I've worked with many different embeddings using this library and the OpenAI API for data normalization, extraction and structuring tasks.

On one hand it's super impressive what you can get out of these embeddings. On the other I'm with you that there seems to be a missing piece to make them work great.

An additional MLP or Siamese Network works really well but it seems like there should be something easier and unsupervised given your set of vectorized samples.

Thinking out loud, could PCA help with this by centering the data and removing dimensions that mostly model noise?
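If you want to experiment with that, a quick sketch with scikit-learn (the 95% explained-variance cutoff and the random stand-in matrix are arbitrary assumptions):

  import numpy as np
  from sklearn.decomposition import PCA
  from sklearn.preprocessing import normalize

  emb = np.random.randn(1000, 768)              # stand-in for your (n_samples, dim) embedding matrix
  pca = PCA(n_components=0.95, whiten=True)     # drop components explaining the last ~5% of variance
  reduced = normalize(pca.fit_transform(emb))   # PCA centers the data; re-normalize for cosine use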


> I haven't noticed this behavior in such extreme ways as you.

I should have clarified that it really depends on the model. I have had this problem to a greater extent with the GTE and BGE embeddings. But I use them anyway, because they’re so strong overall.

PCA is an interesting idea, and worth looking at.


I've tried PCA (and perhaps more usefully, UMAP), and while they can be great for stuff like unsupervised learning, I think the trick is to choose a classifier (as per my comment above) that can do a good job with high-dimensional data.


Bootstrap a p-value: create a set of thousands of random words, calculate the metric for those, and either keep those numbers explicitly to compute the rank, or fit a normal distribution and use its mean and std to estimate the probability that a pair is similar.
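A hedged sketch of that idea (random junk strings stand in for the "random words"; the model name is a placeholder):

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-mpnet-base-v2")

  # Null distribution: similarities between the query and random junk strings
  rng = np.random.default_rng(0)
  alphabet = list("abcdefghijklmnopqrstuvwxyz ;&/")
  junk = ["".join(rng.choice(alphabet, size=12)) for _ in range(1000)]

  query = model.encode("chocolate chip cookies", normalize_embeddings=True)
  null_sims = model.encode(junk, normalize_embeddings=True) @ query

  # Either keep null_sims and rank a candidate score against it, or fit a normal distribution:
  mu, sigma = null_sims.mean(), null_sims.std()
  cutoff = np.percentile(null_sims, 99)   # e.g. require a score the null exceeds <1% of the time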


Something is off if you're seeing behavior like that; 0.1-0.15 with MiniLM L6 V3 is good for "has any relevancy whatsoever".


I should have added that it depends on the model. MiniLM didn’t exhibit this behavior, but it unfortunately didn’t perform as well on recall or ranking as other models that did.

GTE comes to mind. You can try the demo widget on HuggingFace and see this: https://huggingface.co/thenlper/gte-large

As an example, against “Chocolate chip cookies”, “Oreos” has a cosine similarity of .808, while “Bubonic plague” is at .709.
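For anyone who wants to check this locally rather than in the demo widget, something like this should work (assuming gte-large loads through SentenceTransformers, which its model card suggests):

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("thenlper/gte-large")
  emb = model.encode(["Chocolate chip cookies", "Oreos", "Bubonic plague"],
                     normalize_embeddings=True)
  print(emb[0] @ emb[1], emb[0] @ emb[2])   # cosine similarity vs. Oreos and vs. Bubonic plague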


Amazing lol: for what it's worth, it's hugely, hugely shocking to me how little anyone outside sentence-transformers bothers noting whether their embedding model is for asymmetric vs. symmetric search. I.e., in all seriousness, afaik, 90% of people are using the "wrong" embeddings, and either handwave about it, don't care, or think it's being picky.

So of course I want to reach to blame that here.

I'm doing a cross-platform app pipeline that is basically search API => get pages => embed => run against prompt, so unfortunately I don't have much more insight.

MiniLM V3 is good enough; the nearest constraint at this point is more "will I download all 10 search results in time?" than "am I grabbing the absolute best passages?" -- good on you for legitimately testing those things.

edit: actually, I do have one more thought: sometimes I end up with embeddings for the last part of a web page that match ~anything. I.e., imagine I have a 1005-word web page and split it into 250-word chunks to embed. Then I'll have 4 "full" vectors, and one that represents just 5 words. That 5-word chunk tends to match almost ~anything, i.e. I'll see ones from previous search queries match the current search query. Maybe you're seeing noise because it's just 3-5 words? But then again, you were probably just illustrating without splatting in huge chunks of text.


That’s a good point! Actually I’ve been using these in the asymmetric search setting, and the queries are very short phrases (not full questions like you might see in a RAG application).


maybe the normalize_embeddings flag on encode?


Not sure about the specific implementation here but the very definition of cosine similarity includes normalization. [0]

[0] https://en.wikipedia.org/wiki/Cosine_similarity


I use these all the time. For many of the classification tasks I do, it works very well to pass an image or text through an embedding model and then apply some kind of classical machine learning like an SVM. Model training is super reliable and takes maybe 3 minutes to train multiple models and cross-validate. In maybe 45 minutes I can train a single fine-tuned model, but the results are really hit and miss.
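A minimal text version of that workflow (the texts and labels are made up; the same pattern works with image embedding models):

  from sentence_transformers import SentenceTransformer
  from sklearn.svm import SVC
  from sklearn.model_selection import cross_val_score

  texts = ["free money, click now", "meeting moved to 3pm",
           "win a prize today!!!", "lunch tomorrow?"]
  labels = [1, 0, 1, 0]   # hypothetical spam / not-spam labels

  model = SentenceTransformer("all-MiniLM-L6-v2")
  X = model.encode(texts, normalize_embeddings=True)

  svm = SVC(kernel="rbf", probability=True)
  print(cross_val_score(svm, X, labels, cv=2))   # quick cross-validation; trains in seconds
  svm.fit(X, labels)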


would love to try this out myself. seems like there are lighter solutions than relying on a giant LLM


Should probably have "(2019)" appended to the title as per HN conventions, given the date in the citation paper links (1x 2019 and 2x 2020) and the fact that this site has been around for quite a few years...


It should be mentioned that the original author Nils Reimers has moved on and the repo had been stale since 2021/2022. It recently (around the end of 2023) got a new team and has since seen updates.

This is obviously quite significant given how important sentence embedding models are.


How are you guys deciding what parts of a document to turn into embeddings? I've heard paragraph embeddings aren't that reliable, so I'm planning on using tf-idf first to extract keywords from a document, and then just create embeddings from those keywords.


Take your document, cut it (cleanly!) into pieces small enough to fit into your sentence embedder's context window, and generate several embeddings that all point to the same document.

I would recommend against merging (averaging, etc.) the embeddings (unless you want a blurry idea of what your document contains), as well as feeding very large pieces of text to the embedder (some models have massive context lengths, but the result is similarly vague).
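A rough sketch of that chunk-and-embed approach (the naive word-based splitter and chunk size are simplifications; in practice you'd split cleanly on sentence or paragraph boundaries):

  from sentence_transformers import SentenceTransformer

  def chunk_words(text, size=200):
      words = text.split()
      return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

  model = SentenceTransformer("all-MiniLM-L6-v2")

  doc_id = "doc-42"
  document = "the quick brown fox jumps over the lazy dog " * 100   # stand-in document text
  chunks = chunk_words(document)
  vectors = model.encode(chunks, normalize_embeddings=True)

  # One (vector, doc_id, chunk_index) record per chunk, so every hit points back to the document
  records = [(vec, doc_id, i) for i, vec in enumerate(vectors)]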


> [recommend against] feeding very large pieces of text to the embedder

Sounds right, I've heard this from multiple sources. That's why I'm leaning towards just embedding the keywords.


When trying to find similarity between whole docs one would feed the entire doc though, right?


Keywords would leave you with the semantics of word definitions but lose sentence-level meaning/context, right?


I'm sure it would lose a ton of meaning, but for me it's easier to fit into a traditional search pipeline.


I'm curious how people are handling multi-lingual embeddings.

I've found LASER[1] which originally had the idea to embed all languages in the same vector space, though it's a bit harder to use than models available through SentenceTransformers. LASER2 stuck with this approach, but LASER3 switched to language-specific models. However, I haven't found benchmarks for these models, and they were released about 2 years ago.

Another alternative would be to translate everything before embedding, which would introduce some amount of error, though maybe it wouldn't be significant.

1. https://github.com/facebookresearch/LASER


The transformer models handle multilingual directly.

For good old embedding models (eg. GLoVe), you have a few choices:

1. LASER as you mentioned. The performance tends to suck though.

2. Language prediction + one embedding model per supported language. Libraries like whichlang make this nice, and MUSE has embedding models aligned per language for 100-ish languages.

Fastembed is a good library for this.

Note that for most people, 32-dimensional GloVe is all they need if you benchmark it.

As the length of the text you're embedding goes up, or as the specificity goes up (e.g. you have only medical documents and want to tell the difference between them), you'll need richer embeddings (more dimensions, or a transformer model, or both).

People never benchmark their embeddings and I find it incredible how they end up with needlessly overengineered systems.


Any idea which model has the best performance across languages? I'm checking out model performance on the Huggingface leaderboard and the top models for English aren't even in the top 20 for Polish, French and Chinese


Depends on what your usecase is.

For the normal user that just wants something across languages, the minilm-paraphrase-multilingual in the OP library is great.

If you want better than that (either a bigger model, or something specific to a subset of languages, etc.), then you need to think about your task, priorities, target languages, etc.


What is everyone using embeddings for, and which models? I built a RAG pipeline last year (for searching long documents) but found it a bit disappointing. I tested it with OpenAI and with SentenceTransformers (instructor-xl or a smaller variant). Apparently they've come a long way since then though.

Currently I'm working on an old-fashioned search engine (tf-idf + keyword expansion). Apparently that can work better than vector databases in some cases:

https://hackernews.hn/item?id=38703943


I have used this library for a few years and it is reliable. As someone mentioned, two things that are not related can sometimes have as high a cosine similarity as related ones. Easy to use and get started with.


I don't have much experience with embeddings...

Could someone more knowledgeable suggest when it would make sense to use the SentenceTransformers library vs for instance relying on the OpenAI API to get embeddings for a sentence?


It's fairly easy to use, not that compute intensive (e.g. can run on even a small-ish CPU VM), the embeddings tend to perform well and you can avoid sending your data to a third party. Also, there are models fine tuned for particular domains on HF-hub, that can potentially give better embeddings for content in that domain.


Just to add to this, a great resource is the Massive Text Embedding Benchmark (MTEB) leaderboard, which you can use to find good models to evaluate. There are many open models that outperform e.g. OpenAI's text-embedding-ada-002 (currently ranked #46 for retrieval), and you can use them with SentenceTransformers.

https://huggingface.co/spaces/mteb/leaderboard


I see - thanks for the clarifications

I presume if your customers are enterprise companies then you may opt to use this library vs sending their data to OpenAI etc.

And you can get more customisation/fine-tuning from this library too.


Embeddings are one of those things where using OpenAI (or any other provider) isn't really necessary. There are many small open source embedding models that perform very well. Plus, you can fine-tune them on your task. You can also run them locally and not worry about all the constraints (latency, rate limits etc.) of using an external provider endpoint. If performance is important for you, then you'll need a GPU.

The main reason to use one of those providers is if you want something that performs well out of the box without doing any work and you don't mind paying for it. Companies like OpenAI, Cohere and others have already done the work to make those models perform well across various domains. They may also use larger models that are not as easy to deal with yourself (although, as I mentioned previously, a small embedding model fine-tuned on your task is likely to perform as well as a much bigger general model).


You should basically never use the openAI embeddings.

There isn't a single use case where they're better than the free models, and they're slower, needlessly large, and outrageously expensive for what they are.


Up until a month ago, the OpenAI embeddings were very poor. But they recently released a new model which is much better than their previous one.

Now it depends on the specific use case (domain, language, length of texts).


I recall paragraph2Vec. It was the earliest way to experiment with position embeddings that I recall reading. At first when I read it, it felt kind of crazy/weird that it worked:

You feed in the paragraph ID (yes, just an integer identifying the paragraph the word came from) via its own layer, along with the word, into a CBOW/SkipGram setup. Then you throw that part away after training. During inference, you attach a new randomly initialized layer where the old layer was and "re-train" just that part before generating the embedding for words.



Used this library quite a bit. I still have no idea if there is a good reason this API is not packaged within huggingface/transformers.

Probably historic reasons; anyway, solid 9/10 API.


Major kudos to this library for supporting Matryoshka/nested/adaptive embeddings (https://huggingface.co/blog/matryoshka), which I needed to train a model recently


Is there an open source Matryoshka embedding model?


These are the models on HF which have "Matryoshka" in the name, such as the PubMed one; I worked on a protein sequence one: https://huggingface.co/models?search=matryoshka


Thanks! I’ll test truncating and see how they do.
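One hedged way to test that is to slice and re-normalize (the model name below is just a placeholder; quality after truncation only holds up if the model was actually trained with a Matryoshka-style loss):

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-mpnet-base-v2")   # substitute a Matryoshka-trained model here
  emb = model.encode(["chocolate chip cookies", "oatmeal cookie recipe"])

  def truncate(e, dim=256):
      e = e[..., :dim]                               # keep only the first `dim` dimensions
      return e / np.linalg.norm(e, axis=-1, keepdims=True)

  a, b = truncate(emb)
  print(float(a @ b))   # cosine similarity at the reduced dimensionality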


Has anyone used FlagEmbedding? I'm testing a model that comes with examples for both SentenceTransformers and FlagEmbedding, but it's hard to find any information about it.


>tfw it's 2024 and people STILL aren't using span compression to implement a "medium term" memory (RAG is long term and the context length is short term) for LLMs.

>tfw it's 2024 and we just accept that the context "falls out" of the model if we push it beyond its regular context length

So everyone forgot that we can pack a large number of tokens into a much smaller number of embeddings because???


What do you mean by span compression? We have experimented with various embedding context lengths, and we have found that embedding bigger chunks isn't what provides the best recall. We have hit the best results with something between 65% and 75% of the maximum embedding context length. We have been using OpenAI embedding models though.


I love this library - I'm surprised cohere doesn't jump in and have its logo in the front!


It doesn't look like they fine-tune these models from any modern foundation models, which likely leaves a huge performance gap compared to, for example, OpenAI embeddings.


How are people using this with/without an LLM?

Is the appeal more "reliable and accurate without hallucination" compared to an LLM?

Where are you using it? A text interface?



