This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
Using LLMs to evaluate other LLMs sounds like it would be dumb, but LLMs work in mysterious ways, and I’ve found this approach useful. In the context of RAG, using an LLM to evaluate whether a context chunk is relevant to answering a question is a nice complement to vector-embedding semantic similarity search. Sometimes prompting the LLM gives better results than vector similarity.
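To make that concrete, here is a minimal sketch of what I mean by asking an LLM whether a chunk is relevant. The prompt wording, `is_relevant`, `filter_chunks`, and the `ask_llm` callable are all names I made up for illustration; `ask_llm` stands in for whatever client you use to send a prompt to GPT-4 (or similar) and get text back, and `fake_llm` is just a stub so the snippet runs without an API key.

```python
from typing import Callable, List

# Hypothetical prompt; the exact wording is an assumption, not a recommended template.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Context chunk: {chunk}\n"
    "Is this chunk relevant for answering the question? Answer 'yes' or 'no'."
)


def is_relevant(question: str, chunk: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask the judge model whether a chunk helps answer the question."""
    prompt = PROMPT_TEMPLATE.format(question=question, chunk=chunk)
    reply = ask_llm(prompt)
    # Treat any reply that starts with "yes" (case-insensitive) as relevant.
    return reply.strip().lower().startswith("yes")


def filter_chunks(
    question: str, chunks: List[str], ask_llm: Callable[[str], str]
) -> List[str]:
    """Keep only the chunks the judge model considers relevant."""
    return [c for c in chunks if is_relevant(question, c, ask_llm)]


def fake_llm(prompt: str) -> str:
    """Stub judge for local testing; a real setup would call an LLM API here."""
    return "Yes" if "Paris" in prompt else "No"


if __name__ == "__main__":
    retrieved = ["Paris is the capital of France.", "Bananas are yellow."]
    print(filter_chunks("What is the capital of France?", retrieved, fake_llm))
```

In practice you would run this over the top-k chunks returned by the vector search, so the LLM check acts as a second filter rather than a replacement.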
Joe here. It's difficult to evaluate natural language responses that come from LLM applications - there are no hard metrics to measure performance like there are in, say, supervised machine learning tasks. For RAG, you have the response to evaluate as well as the retrieved context chunks. We found that using gpt-4 as an evaluator to measure the quality of RAG responses and the relevance of the context chunks gave similar results to having human evaluators at Tonic do the same task. Some research also finds that using LLMs as evaluators for natural language tasks gives similar results to using human evaluators: https://arxiv.org/abs/2306.05685.
As for whether using gpt-4 is a safe approach, the best you could ask for is that gpt-4's evaluations match those of human evaluators, and that is what both we and this research have found.
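If you want to check this on your own data, one simple way is to label a small sample by hand and measure how often the LLM judge agrees. This is only a sketch of that idea, not how we ran our comparison; `agreement_rate` is a hypothetical helper and the labels in the example are made-up toy values.

```python
from typing import Sequence


def agreement_rate(llm_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items where the LLM judge and the human gave the same label."""
    if len(llm_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not llm_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(llm_labels, human_labels))
    return matches / len(llm_labels)


if __name__ == "__main__":
    # Toy labels for illustration only: True means "judged relevant/correct".
    llm = [True, False, True, True]
    human = [True, False, False, True]
    print(agreement_rate(llm, human))
```

Raw agreement ignores chance agreement, so on heavily imbalanced labels a chance-corrected statistic would be a better choice.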
Something not mentioned much is that you can respond to these messages that come in through a Masked Email, and your identity is hidden on the outbound messages as well.
They seamlessly integrate with the sender identity feature in Fastmail making it very clear that you are replying from the Masked Email.
From a quick analysis of the headers, I don't see anything that leaks your real identity, but of course Fastmail knows it and could reveal it if legally required to.
Overall, a smooth feature, and it can be used with a custom domain for portability (to a less sophisticated wildcard setup, or to another provider).
Kabbage, Inc | Full-Stack, Backend, QA/SDET, iOS, DBA, Data Engineers | Atlanta, GA and New York City (NYC) | Full-time ONSITE | kabbage.com
Kabbage is a leading FinTech company changing the way small businesses solve cash-flow challenges. Fully automated and deeply connected with its 160,000+ customers, Kabbage provides access to funding in minutes, extends more than $10 million every day to small businesses, and powers borrowing experiences for some of the largest companies in the world. While we've received numerous awards and recognition—such as Entrepreneur's Top Company Cultures, Inc Magazine's Top Private Companies, GlassDoor’s Best Places to Work, and Forbes FinTech 50 — it is our people, our culture, and our leaders that make Kabbage such a great place to work.
Our Technology teams are growing fast and we're hiring for the following roles:
Anyone find a secret command line 'default' setting to change how Mission Control / Spaces works so it shows the previews without having to go to the top of the screen?
I hate this change too, but I believe it was done to improve frame rates. 10.11 finally has smooth Mission Control animations on a Retina display, even when a lot of apps are up and running, which is great.
I don't see any other reason why they would hide this from showing up by default.
If you use your display in one of the Scaled modes, that could be the cause of the low frame rate, since it then renders at something like 3k by 2k resolution, which is then downscaled to your selection.
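Rough arithmetic for that, as a sketch: in a scaled HiDPI mode the desktop is rendered at twice the "looks like" size in each dimension and then downsampled to the panel. The numbers below assume a 15" Retina MacBook Pro (2880x1800 panel) set to "looks like 1680x1050"; the function names are mine, not any Apple API.

```python
from typing import Tuple

Size = Tuple[int, int]


def scaled_render_size(looks_like: Size, backing_scale: int = 2) -> Size:
    """Size of the offscreen buffer for a scaled HiDPI mode (assumes a 2x backing scale)."""
    width, height = looks_like
    return width * backing_scale, height * backing_scale


def pixel_ratio(render: Size, panel: Size) -> float:
    """How many pixels are rendered per physical panel pixel."""
    return (render[0] * render[1]) / (panel[0] * panel[1])


if __name__ == "__main__":
    panel = (2880, 1800)       # assumed native panel resolution
    looks_like = (1680, 1050)  # assumed scaled setting
    render = scaled_render_size(looks_like)
    print(render)                      # buffer size before downscaling
    print(pixel_ratio(render, panel))  # extra rendering work versus native
```

So in that mode the GPU draws roughly a third more pixels than the panel has, plus the downscaling pass on every frame.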
I think the improvements are more around Metal and such. I see improvements everywhere, not just in Mission Control, especially when hooked up to my 4K monitor.