For fun, I recently planned a holiday entirely with LLMs, and followed through.
I used good models, and they did a lot of searching, including searches in other languages.
They got nothing right: the plan was riddled with fake places and times. They also found some weird and unique places I never would have considered.
I had a blast. It brought me back to traveling pre-internet, requiring a level of spontaneity I had forgotten we used to depend on. 100% recommend it.
The juxtaposition between this being perhaps the best single use case I have seen for AI and how bad an ad for it this is, is killing me. I love it.
"I told my bumbling assistant to plan a trip for me and he got nothing right but I enjoyed it because the chaos introduced a certain spontaneity and whimsy missing from my life"
I found that Deep Research mode in Gemini was able to give me a well-planned 4-day trip to a major city.
I told it my preferences and those of the group members, where we arrived and departed, and at what times. I gave it my itinerary and then asked it to plan two new itineraries and also suggest a location to book a hotel that was convenient for the early flight on the last day.
I went away for 20 mins and it gave me a 20-page document with a good summary and decent options. I did choose some of the activities it suggested.
I did this 10 months ago. It’s probably better now.
But Gemini has access to Google Maps, so it can estimate travel times, and know which lunch places are near which sites and which hotels have good reviews. So if you want AI to work for travel planning you need to ground it in good data.
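To make "ground it in good data" a bit more concrete, here is a minimal sketch of checking a generated itinerary against trusted travel times. Everything in it is an assumption for illustration: the `Stop` type, the `check_itinerary` function, and the hard-coded `TRAVEL_MINUTES` table are hypothetical stand-ins for whatever a real maps lookup would return. It is not a description of how Gemini works internally.

```python
from dataclasses import dataclass

# Hypothetical grounded data: travel time in minutes between two places,
# as if it had been fetched from a trusted maps service.
TRAVEL_MINUTES = {
    ("Hotel", "Museum"): 25,
    ("Museum", "Lunch spot"): 10,
}


@dataclass
class Stop:
    name: str
    start: int  # minutes since midnight
    end: int    # minutes since midnight


def check_itinerary(stops, travel_minutes):
    """Return a list of human-readable problems found in the plan."""
    problems = []
    # Walk over consecutive pairs of stops.
    for prev, nxt in zip(stops, stops[1:]):
        key = (prev.name, nxt.name)
        if key not in travel_minutes:
            # No grounded data for this leg: flag it instead of trusting the plan.
            problems.append(f"no travel data for {prev.name} -> {nxt.name}")
            continue
        gap = nxt.start - prev.end
        if gap < travel_minutes[key]:
            problems.append(
                f"{prev.name} -> {nxt.name}: needs {travel_minutes[key]} min, "
                f"plan allows {gap}"
            )
    return problems


# Example plan as an LLM might produce it (made-up times).
plan = [
    Stop("Hotel", 480, 540),
    Stop("Museum", 550, 660),
    Stop("Lunch spot", 675, 720),
]
for problem in check_itinerary(plan, TRAVEL_MINUTES):
    print(problem)
```

The point of the design is that the model proposes and the grounded data disposes: any leg the trusted source cannot confirm gets flagged rather than silently accepted.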
I used LLMs last year to plan a multi-week itinerary through Japan with the family. I wasn't super happy with the result, so I tweaked it, but they provided a useful template and some surprising ideas.
As you guessed, there's a ton of info in the training data on this topic, but there's some value in being able to see it in one place with different options.
I think your experience with that trip echoes mine in a lot of areas. It’s a decent start. It takes care of some of the initial blue sky thinking to lay the groundwork. The problem is I think that’s the funnest part of a problem and I hate working on the details… it takes most of the creativity out of most problems as if it was drudgery, while leaving me to do the nitty gritty, which I consider the actual drudgery. I just don’t see LLMs’ contribution to tasks like this being anywhere close to being worth what they’ll cost after the VC subsidies run dry.
It’s really one of the most flabbergasting things about discussing LLMs with the naysayers.
There are a lot of extremely legitimate concerns, like the environmental impact and so on.
But I just laugh when they point out that LLMs are merely clever regurgitators of their previous inputs… as if this isn’t how we as humans operate nearly all of the time. People realllllllllly want to think they’re special snowflakes.
They do research
Pick destinations led by their own experience/likes/dislikes
Compare to other guides
Plan itineraries so they can get there
Check and share
Ask an LLM to plan a trip:
It takes the prompt and continues it based on weights in the training data. If there is no data it picks the most likely thing (maybe made up). If there is it’ll mostly add things from that data. Maybe it’ll make tool calls and pull in data that way too but you can’t actually trust all the details.
These two processes are very different. It’s important to understand how LLMs work, which is nothing like how a human works.
I was able to bully an LLM into giving me a 2wk travel itinerary to Somalia. My stipulations were that I wasn't interested in spending any money, so I'd walk everywhere and sleep outside. Getting there and back from Boston took some arguing--I initially suggested stowing away in a shipping container, which the LLM claimed was too unsafe. We eventually compromised on sailing as a reasonable alternative. It planned out a whole route with marina stops, calculated fuel burn, etc. I told it I didn't need any of that: I have an anchor and sails, and won't use the engine or marinas (I claimed I'd forage for fresh water ashore). It seemed fine with that idea, but raised some safety concerns about piracy. It was eventually satisfied with my answer that I'd bring a lot of guns to fend off pirates. Total trip cost, including some 200+ cans of Dinty Moore and 50lb bags of rice, came to something like $700.
You presented an LLM with an obviously bonkers goal, the LLM told you it was a bad idea at multiple steps, and this is somehow... a shortcoming of the LLM?!?
You said it yourself: you needed to "bully" the LLM into even producing this plan.
Please, tell me what it should have done instead. Be very specific!
It should have flatly refused. If you gave a product like that to customers you'd be exposing yourself to unbounded downside liability risk. It's a completely nonviable technology for that kind of application, unless you can somehow make it have judgment. But you can't, because it doesn't reason.
A reasonable travel agent would have fired me as a customer. The LLM failed to do so.
I think the LLM should advise you of risk and lack of feasibility but should otherwise answer the question, unless you're trying to do something plainly destructive to others, e.g. weaponizing anthrax or something.
> A reasonable travel agent would have fired me as a customer.
Unless the LLM was actually acting as a travel agent -- booking the trip for you -- as opposed to merely advising you, this expectation feels off.
> unless you can somehow make it have judgment
It did have judgement. It told you what a bad idea it was.
I think this is a great example of the unrealistic expectations people have for LLMs. No sane and sensible person would treat any single source of knowledge as infallible, for any consequential decision.
(Certainly, of course, you don't have to look very far for examples of idiots being overly trustful of LLMs, or Google, or GPS, or Wikipedia, or whatever. It certainly does happen and yes, I've heard all these arguments before about other technologies besides LLMs. Replace "LLM" in your post with any of those other terms, and I promise you somebody made literally the exact same argument in 2003 or 2009 or 2014 or whatever.)
Any reasonable person would consult a second doctor, or at least other sources of knowledge, after the doctor advises them of some irreversible course of action. Because we don't even expect highly trained and intelligent medical professionals to be perfect.
And yet, we get angry at LLMs for not having perfect judgement, even though their creators are extremely explicit about how they can make mistakes.
All I'm really saying is that if you want to try to automate a travel agency, LLMs ain't gonna get it done. They'll happily book you a really unsafe trip. So the technology doesn't work in this domain. The whole, empty promise is that this thing is supposed to automate jobs like travel agent away. But it can't. This isn't a "pro" or "anti" position, it's simply that there's no market for the technology here. Or anywhere else (like radiology) where actual responsibility and judgement are important. In fact, I can't think of a single job where it's optional.
I think even if what you say is true, it doesn't address the parent's point that both humans and machines regurgitate what they've consumed.
But I'd also want to point out that the way you're characterizing an LLM planning a trip doesn't have any structure to it, which indicates that in your scenario you're not using any kind of harness. I've been amazed at how capable even 30 billion parameter models are when I put them inside of a harness that provides structure and task management. If you consider that scenario, especially with the ability to search the web and use skills, suddenly the LLM looks a lot more like what the human process looks like.
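A rough sketch of what such a harness could look like is below. It is only an illustration under stated assumptions: `run_harness`, `model`, and `verify` are hypothetical names, and the model here is a stub function rather than any real LLM API. The idea is just that the structure (a task list, a check after each step, and bounded retries) lives outside the model.

```python
def run_harness(tasks, model, verify, max_attempts=3):
    """Run each task through the model, keeping only verified answers.

    `model` takes a prompt string and returns an answer string.
    `verify` takes (task, answer) and returns True if the answer checks out.
    A result of None means the task should be escalated to a human.
    """
    results = {}
    for task in tasks:
        results[task] = None
        feedback = ""
        for _ in range(max_attempts):
            answer = model(task + feedback)
            if verify(task, answer):
                results[task] = answer
                break
            # Feed the rejection back so the next attempt can differ.
            feedback = f"\nPrevious answer was rejected: {answer}"
    return results


# Stub model (an assumption for the demo): guesses first, and only
# produces a checked answer once it has been told a guess was rejected.
def stub_model(prompt):
    if "rejected" in prompt:
        return "checked answer"
    return "unverified guess"


# Stub verifier: a real one might search the web or call a maps service.
def stub_verify(task, answer):
    return answer == "checked answer"


results = run_harness(["find train routes", "find a hotel"], stub_model, stub_verify)
print(results)
```

Keeping the retry limit and the verifier in the harness rather than in the prompt means a failing task ends up as an explicit `None` instead of a confident-sounding guess.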
At a high level, the processes are extremely similar in many (not all) ways.
They're obviously achieved in drastically different ways at a low enough level; LLMs obviously do not simulate neurons or any biological construct. (For the record, I'm absolutely not one of those people who thinks LLMs are "alive" or should be treated like they are.)
Reminds me of the olllllld days of Pentium II's when people got N64 emulation working shockingly quickly using HLE techniques. If you weren't around for this, it was quite the shocker at the time. I think the analogy is doubly apt, because HLE emulation has some serious limitations... it gets you maybe 80% of the way there really fast, and for the remaining 20% you need to roll up your sleeves and do serious LLE.
> It takes the prompt and continues it based on weights in the training data. If there is no data it picks the most likely thing (maybe made up). If there is it’ll mostly add things from that data. Maybe it’ll make tool calls and pull in data that way too but you can’t actually trust all the details.
I'd like you to point out which bits of this are different from talking to humans. If you replace "training data" with "memories", this is pretty much exactly how things might go if you asked a friend (or perhaps a flaky travel agent) for travel advice.
Note that I'm not arguing that LLMs are particularly talented at this particular use case. I'm pointing out that humans are also pretty unreliable.
You're also doing that thing where you point out that LLMs can be unreliable (yes, they are) without acknowledging how flawed nearly every other source of information is: people, websites, etc. I'm not defending LLMs in that regard... I'm just saying it's not a differentiator.
There are plenty of humans who plan trips by concatenating destinations that appear the most frequently in their Instagram feed. Not that different from how an LLM does things.
Where humans and (current) LLMs differ the most is their failure mode. A human friend could be bad at planning trips, but that's kinda predictable, we're used to it, we know how to catch that Exception. LLMs on the other hand still have failure modes that come across as really wacky, like, what are they smoking in Mountain View?
Which might actually serve as better evidence of different internal workings at a deeper level, than just parroting well-known superficial features of stochastic whatevertheysay.
To counter, we did the same with a trip to Copenhagen for 7 days and it got most paths correct. Train routes, places to visit with kids, restaurants, reservations, weather, most of it was great. There were a couple mistakes of course here and there, thankfully we did our due diligence, but by and large, we plan to do this for future trips.
I feel like one city is enough focus to actually get good results. If you plan a trip where you move between different (smaller) places, problems start to arise.
I've been planning vacations with ChatGPT's Deep Research since it became available. Absolutely brilliant!
From finding areas with favorite activities for each of us (parents, teens and kids) to discovering the do-not-miss attractions and scheduling our vacation between them, it is invaluable. I've seen places I never knew existed, in countries I've never been to before where they speak languages I don't.
Very few mistakes and lots more flexibility and understanding than the travel agents I used before. I do write long prompts though with lots and lots of info about our family and what we like to do.
It's not yet good at finding accommodation, filtering it by our criteria, comparing options and booking what's available, but it's getting there.
It's not pre-internet travel, rather backpacking. I do it to this very day; it's by far the best and most rewarding way to travel, the further and more exotic the better.
It has a downside: I'll never do these pre-arranged trips where one is in a complete luxury bubble. Interactions with locals are the best part of the experience. What a waste of potential.
And yes, it's mostly compatible with kids. It depends more on the specific location than the mode of travel (i.e. avoiding malaria/dengue/etc. regions).
I am now reminded of a short trip with less tech-savvy folks, where I noticed along the way that the plan was a bit... not working. The person organizing it was complaining to the bus driver about why they were not going where the internet had told him they were going. The internet being ChatGPT.
What was a “good model” and harness? I would expect decent results using say Codex with 5.5 xhigh to research and verify an itinerary. 5.5 Pro with search would also be promising.
I suspect it would perform admirably well with 'Paris' or 'Copenhagen' (see sibling comment), but if you want to have some real fun try 'Southern Spain' or 'Rural Malaysia'.
I don't think you should think of it as a paycheck.
Delaying a normal career to compete in the Olympics will set your career and earning potential back by a few years. This money tries to balance it out a bit.
The stated goal is that the money will help people do better in the Olympics. I don't see how it will do that. It might be good to do, but it won't help people perform better.
> "The Olympic and Paralympic Games are the ultimate symbol of human excellence. I do not believe that financial insecurity should stop our nation's elite athletes from breaking through to new frontiers of excellence,” said Stevens.
And furthermore:
> By providing financial support for athletes so they can continue competing and by increasing that support for each Games in which they compete, the Stevens Awards will dramatically increase the likelihood that athletes will continue competing, and winning, for America.