
I'm curious how people are handling multi-lingual embeddings.

I've found LASER[1], which originally had the idea of embedding all languages in the same vector space, though it's a bit harder to use than the models available through SentenceTransformers. LASER2 stuck with this approach, but LASER3 switched to language-specific models. However, I haven't found benchmarks for these models, and they were released about 2 years ago.

Another alternative would be to translate everything before embedding, which would introduce some amount of error, though maybe it wouldn't be significant.

1. https://github.com/facebookresearch/LASER



Transformer models handle multilingual text directly.

For good old embedding models (e.g. GloVe), you have a few choices:

1. LASER as you mentioned. The performance tends to suck though.

2. Language prediction + one embedding model per supported language. Libraries like whichlang make this nice, and MUSE has embedding models aligned per language for 100ish languages.

Fastembed is a good library for this.
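To make option 2 concrete, here is a rough sketch of the routing idea: guess the language, then hand the text to that language's embedding model, with a fallback for anything unrecognized. Everything in it is a stand-in I made up for illustration: `detect_language` is a toy heuristic (not whichlang's API), and `make_toy_embedder` is a hashed bag-of-words placeholder, not MUSE or Fastembed. In a real setup you would swap both for the actual detector and per-language models.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]

# Hypothetical marker words; a real system would use a proper language detector.
MARKERS: Dict[str, set] = {
    "en": {"the", "and", "is"},
    "fr": {"le", "la", "et", "est"},
    "pl": {"jest", "nie", "się"},
}


def detect_language(text: str) -> str:
    """Toy detector: pick the language whose marker words overlap the text most."""
    words = set(text.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def make_toy_embedder(lang: str, dim: int = 8) -> Embedder:
    """Placeholder for a per-language model: hashed bag-of-words counts."""

    def embed(text: str) -> Vector:
        vec = [0.0] * dim
        for word in text.lower().split():
            # md5 keeps the bucket assignment stable across runs.
            digest = hashlib.md5(f"{lang}:{word}".encode("utf-8")).digest()
            vec[digest[0] % dim] += 1.0
        return vec

    return embed


class LanguageRouter:
    """Routes each text to the embedder registered for its detected language."""

    def __init__(self, embedders: Dict[str, Embedder], fallback: str) -> None:
        if fallback not in embedders:
            raise ValueError(f"no embedder registered for fallback {fallback!r}")
        self.embedders = embedders
        self.fallback = fallback

    def embed(self, text: str) -> Tuple[str, Vector]:
        lang = detect_language(text)
        if lang not in self.embedders:
            lang = self.fallback  # unsupported or undetected language
        return lang, self.embedders[lang](text)


router = LanguageRouter(
    {lang: make_toy_embedder(lang) for lang in ("en", "fr", "pl")},
    fallback="en",
)
```

Calling `router.embed("le chat est sur la table")` returns the language that was used along with the vector. One thing to keep in mind with this design: vectors from different per-language models are only comparable if the models were aligned to a shared space.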

Note that for most people, 32-dimensional GloVe is all they need, as a benchmark will usually show.

As the length of the text you're embedding goes up, or as the specificity goes up (e.g. you have only medical documents and want to tell them apart), you'll need richer embeddings (more dimensions, or a transformer model, or both).

People never benchmark their embeddings, and I find it incredible how they end up with needlessly overengineered systems.
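For what it's worth, a benchmark doesn't have to be elaborate. Below is a minimal sketch of one way to do it, assuming you have a handful of queries with a known relevant document each: compute recall@k for every candidate embedder and compare. The function names and the two toy embedders (`bow_embed`, `constant_embed`) are made up for illustration; you would plug in your real models as plain `text -> vector` callables.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    queries: Sequence[str],
    docs: Sequence[str],
    relevant: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose relevant doc index appears in the top-k results."""
    if not queries:
        return 0.0
    doc_vecs = [embed(d) for d in docs]
    hits = 0
    for query, target in zip(queries, relevant):
        q_vec = embed(query)
        ranked = sorted(
            range(len(docs)),
            key=lambda i: cosine(q_vec, doc_vecs[i]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Tiny hand-labelled evaluation set (illustrative data only).
DOCS = ["cat sits on mat", "stock prices fall", "rain expected tomorrow"]
QUERIES = ["cat on mat", "prices fall sharply", "rain tomorrow"]
RELEVANT = [0, 1, 2]  # index of the relevant doc for each query

VOCAB = sorted({word for doc in DOCS for word in doc.split()})


def bow_embed(text: str) -> Vector:
    """Toy candidate A: bag-of-words counts over the document vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def constant_embed(text: str) -> Vector:
    """Toy candidate B: a useless embedder that ignores its input."""
    return [1.0]


scores = {
    "bow": recall_at_k(bow_embed, QUERIES, DOCS, RELEVANT, k=1),
    "constant": recall_at_k(constant_embed, QUERIES, DOCS, RELEVANT, k=1),
}
```

If the small, cheap model scores about the same as the big one on your own labelled data, that is a decent argument for keeping the small one.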


Any idea which model has the best performance across languages? I'm checking out model performance on the Hugging Face leaderboard, and the top models for English aren't even in the top 20 for Polish, French and Chinese.


Depends on what your use case is.

For the normal user who just wants something that works across languages, the minilm-paraphrase-multilingual model in the OP library is great.

If you want better than that (a bigger model, one tuned for a specific subset of languages, etc.), then you need to think about your task, priorities, target languages, and so on.



