You can see an example comparing semantic (i.e., embeddings-based) search vs bm25 vs hybrid here:
http://search-sensei.s3-website-us-east-1.amazonaws.com
(warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)
This mini app illustrates the advantage of semantic search over BM25. For instance, embedding models "know" that "j lo" refers to Jennifer Lopez.
We have also published the recipe to train this type of model, in case you are interested in doing so; we show that it can be done on relatively modest hardware and that training data is very easy to obtain: https://arxiv.org/abs/2509.12539
Thank you for publishing this! I absolutely love small embedding models, and have used them on a number of projects (both commercial and hobbyist). I look forward to checking this one out!
I don't know if this is too much to ask, but something that would really help me adopt your model is a fine-tuning setup. The BGE series of embedding models has been my go-to for a couple of years now -- not because it's the best-performing on the leaderboards, but because they make it so incredibly easy to fine-tune the model [0]. Give it a JSONL file of training triplets, and you can fine-tune the base models on your own dataset. I appreciate you linking to the paper on the recipe for training this type of model -- how close to turnkey is your model for transfer learning with my own dataset? I looked around for a fine-tuning example for this model and didn't see anything, but I would be very interested in trying this one out.
Does support for fine-tuning already exist? If so, then I would be able to switch to this model away from BGE immediately.
As far as I can tell it should be possible to reuse this fine tuning code entirely and just replace `--embedder_name_or_path BAAI/bge-base-en-v1.5` with `--embedder_name_or_path MongoDB/mdbr-leaf-ir`
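For reference, here is a minimal sketch of building the JSONL triplet file mentioned above. It assumes the `query` / `pos` / `neg` field names used in the FlagEmbedding fine-tuning examples (please check that repository's README for the exact schema); the example texts and the `train.jsonl` file name are made up for illustration.

```python
import json

# Hypothetical triplets: a query, passages that answer it, and passages that do not.
triplets = [
    {
        "query": "j lo new album",
        "pos": ["Jennifer Lopez announces her next studio album."],
        "neg": ["Local council approves new parking rules."],
    },
    {
        "query": "small embedding model for cpu",
        "pos": ["A 23M parameter text embedding model that runs on CPU."],
        "neg": ["Recipe for a quick weeknight pasta."],
    },
]

def to_jsonl(rows):
    """Serialize one JSON object per line, which is what 'JSONL' means."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)

def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl_text = to_jsonl(triplets)
# To save it for the fine-tuning script (file name is an assumption):
# open("train.jsonl", "w", encoding="utf-8").write(jsonl_text)
```

The round trip through `from_jsonl` is a cheap sanity check that every line is valid JSON before handing the file to a training run.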
Note that bge-base-en-v1.5 is a 110M params model - ours is 23M.
* BEIR performance is bge=53.23 vs ours=53.55
* RTEB performance is bge=43.75 vs ours=44.82
Overall they should perform very similarly, except ours is roughly 5x smaller (110M / 23M ≈ 4.8) and hence correspondingly faster.
I interacted with the authors of these models quite a bit!
These are very interesting models.
The tradeoff here is that you get even faster inference, but lose on retrieval accuracy [0].
Specifically, inference will be faster because essentially you are only doing tokenization + a lookup table + an average. So despite the fact that their largest model has 32M params, you can expect its inference speed to be higher than ours, which has 23M params but is transformer-based.
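To make "tokenization + a lookup table + an average" concrete, here is a toy sketch. The whitespace tokenizer, the 3-dimensional vectors, and the `TABLE` contents are all made up; real static embedding models use a subword tokenizer and a learned table, so this only illustrates the shape of the computation.

```python
# Toy lookup table: token -> vector (values are invented for illustration).
TABLE = {
    "fast": [1.0, 0.0, 0.0],
    "search": [0.0, 1.0, 0.0],
    "engine": [0.0, 0.0, 1.0],
}
DIM = 3

def embed(text):
    """Embed text as the mean of the vectors of its known tokens."""
    tokens = text.lower().split()                       # 1. tokenization
    vectors = [TABLE[t] for t in tokens if t in TABLE]  # 2. lookup table
    if not vectors:
        return [0.0] * DIM                              # no known tokens
    # 3. average, computed dimension by dimension
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(embed("fast search"))
```

There is no attention and no matrix multiplication per token, which is why this kind of model is cheaper at inference time than a transformer of similar parameter count.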
I am not sure about typical inference speeds on a CPU for their models, but with ours you can expect to do ~22 docs per second, and ~120 queries per second on a standard 2vCPU server.
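As a back-of-the-envelope sketch of what those throughput figures imply (the function names are mine, and the latency estimate assumes queries are processed one at a time with no batching):

```python
DOCS_PER_SECOND = 22      # figure quoted above for a 2 vCPU server
QUERIES_PER_SECOND = 120  # figure quoted above for the same server

def indexing_time_seconds(n_docs, docs_per_second=DOCS_PER_SECOND):
    """Rough time to embed a corpus of n_docs documents."""
    return n_docs / docs_per_second

def query_latency_ms(queries_per_second=QUERIES_PER_SECOND):
    """Rough per-query embedding latency, assuming sequential processing."""
    return 1000.0 / queries_per_second

print(f"{indexing_time_seconds(100_000) / 60:.0f} minutes to embed 100k docs")
print(f"{query_latency_ms():.1f} ms per query")
```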
As far as retrieval accuracy goes, on BEIR we score 53.55, all-MiniLM-L12-v2 (a widely adopted compact text embedding model) scores 42.69, while potion-8M scores 30.43.
If you want to run them on a CPU it may make sense to filter for smaller models (e.g., <100M params).
On the other hand, our models achieve higher retrieval scores.
[0] "accuracy" in layman's terms, not in accuracy vs recall terms. The correct word here would be "effectiveness".
Check under the "Retrieval" section, either RTEB Multilingual or RTEB German (under language specific).
You may also want to filter by model size (under "Advanced Model Filters"). For instance, if you are self-hosting and running on a CPU, it may make sense to limit the list to models with <=100M parameters.
(warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)
It runs a small embedding model in the browser and returns search results in "real time".
It has a few illustrative examples where semantic search returns the intended results. For example, BM25 does not understand that "j lo" or "jlo" refers to Jennifer Lopez. Similarly, embedding-based methods deal better with things like typos.
EDIT: search is performed over 1000 news articles randomly sampled from 2016 to 2024
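To show why BM25 misses the "j lo" case, here is a minimal Okapi BM25 sketch in plain Python. It is not the code the demo uses; the whitespace tokenizer, the `k1` and `b` defaults, and the two toy documents are my assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # BM25 only rewards exact token matches
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "jennifer lopez announces a new world tour",
    "lo and behold the stock market rallied",
]
print(bm25_scores("j lo", docs))
print(bm25_scores("jennifer lopez", docs))
```

For the query "j lo", the Jennifer Lopez article scores zero because it shares no token with the query, while the unrelated article gets a positive score for the literal token "lo". An embedding model compares meanings rather than tokens, which is how it can still retrieve the first article.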
I believe it's because of the way you measure things in RL: each episode only tells you whether it was good (say reward +1) or bad (say 0 or negative reward); it does not tell you anything about the trace that was produced to get the outcome. This reward is the only thing measured to produce your gradients, which is why the amount of information in it is O(1).
This is in contrast to more "supervised" forms of learning, where you could get a loss for each token produced (e.g. cross entropy loss) and where, as a consequence, you'd get O(number of tokens) information into your gradients.
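A toy sketch of that contrast, assuming a REINFORCE-style surrogate loss for the RL side (the comment above does not name a specific algorithm) and made-up per-token probabilities:

```python
import math

def cross_entropy_losses(target_probs):
    """Supervised: one loss per token, from the probability the model
    assigned to each known target token."""
    return [-math.log(p) for p in target_probs]

def reinforce_loss(sampled_probs, reward):
    """REINFORCE-style surrogate: the single scalar reward of the episode
    scales the log-probability of the whole sampled trace."""
    return -reward * sum(math.log(p) for p in sampled_probs)

probs = [0.9, 0.5, 0.7, 0.2]  # made-up per-token probabilities

supervised_signals = cross_entropy_losses(probs)  # one measured value per token
rl_signals = [1.0]                                # one measured value: the reward

print(len(supervised_signals), len(rl_signals))
print(reinforce_loss(probs, reward=1.0))
print(reinforce_loss(probs, reward=0.0))
```

With a reward of 1 the two losses add up to the same number, but the measured feedback differs: the supervised case checks every token against a known target, while the RL case only observes one scalar that is applied uniformly to all tokens in the trace. With a reward of 0, every token's contribution vanishes at once.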
You can process a single word through a transformer and get the corresponding intermediate representations.
Though it sounds odd, there is no problem with it, and it would indeed return the model's representation of that single word, as seen by the model without any additional context.
Embeddings as a tool have been around for longer than LLMs. They were (and are) ubiquitous in, e.g., recommender systems. It sounds like this may be more in line with what you are looking for. In that case, check out https://github.com/benfred/implicit - I have used it in the past with great success.
https://huggingface.co/MongoDB/mdbr-leaf-ir
It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).