A combo of ChatGPT and spot EC2 instances screwed me over
Hey guys,
I thought this was a pretty funny story and just wanted to rant a bit about it.
Recently I was deploying our application Potarix on an EC2 instance. I asked ChatGPT for the right EC2 server for our SaaS app, and it recommended spot instances because they're cheaper than On-Demand instances, since they use AWS's unused EC2 capacity. I thought, you know, why not? I prefer to save a little extra cash, and there are probably periods of downtime in our app.
We deploy and everything is fine for a few days. I wake up on a Friday morning to a bunch of pings that the application isn't working. I was like, hmmm, strange. I gave it a try and confirmed our app doesn't work. I then decided to SSH into the EC2 instance to figure out what was going wrong, only to find that the EC2 instance wasn't even listed in the AWS console. I was panicking and thinking: did I even deploy this, or did I give the instance a different name, or did I accidentally delete it? I looked at the EC2 history and found that Amazon had terminated the instance for me. A quick ChatGPT prompt revealed that these instances can be terminated anytime Amazon reclaims the capacity.
Through this experience, I learned 2 things: 1. Don't use spot instances for anything that's going to be deployed to production and needs to be active all the time. 2. Every time ChatGPT gives you a sketchy recommendation, ask for downsides.
The takeaway should be: "I should understand the system I'm deploying, regardless of the tool I'm using for search." You relied on ChatGPT to design your deployment strategy and to understand it for you. Consequently, you were surprised when it didn't perform the way you expected, and you were put on the spot to build a better understanding in order to respond to an incident.
Asking ChatGPT for downsides might be a good exercise, but that's not addressing the root of the issue. If that's all you do, then you're still relying on ChatGPT to understand the system and anticipate issues for you. Unless it's ChatGPT's responsibility to maintain the system and ChatGPT's reputation at stake when it suffers incidents, that prospect should make you uncomfortable.
Consider for instance that it's not uncommon for ChatGPT to have downtime. What are you going to do if your system is down and, by coincidence, so is ChatGPT?
In OP's defense, people misunderstood spot instances for years before ChatGPT existed. IMHO it's just not messaged very clearly by AWS. Yes, RTFM and all that, but I think it's easy to miss. And yes, AWS is basically the only place I've ever deployed something VPS-/instance-like where it wasn't clear whether an instance is ephemeral, or whether they even offer that option.
I don't mean to indict or attack OP. To the extent that I come across that way, I've failed in my editing.
I respect them for taking this as a learning experience and for being open about it.
I love this story so much. Ask an LLM to reason about your company's devops decisions, get advice to do something you don't recognize, follow the advice without looking up what it means, surprised Pikachu face.
In the current climate of laypeople and pointy-haired bosses saying AI can do this job, or make it so more people can do this job without having to know things, this gives me a snarky chuckle.
I'm glad you learned what spot instances are (you did learn, right? They are limited-time instances you can bid a price for, and if the "spot" price goes above your bid they get reclaimed. They are NOT just magically cheaper servers everyone else just didn't know about). The big takeaway you should have, however, is that understanding what you're doing is important in this business, and it's worth the time to at least try to understand everything you're working with. This applies every time, no matter whether you are told to do it by a senior dev, a clueless PM, or a mindless robot.
I think you're setting yourself up for more failure if your takeaway is to ask chatbots to do more of the understanding for you instead of less (which is how I read your takeaway #2).
Good luck out there! We've all made dumb production-level mistakes; make sure you learn from yours!
I was going to write that this should have been caught pretty early if you had searched and clicked the AWS Spot Instances link to read about them being quite unstable/temporary. However, the first result is this: https://aws.amazon.com/ec2/spot/
It is ceaseless shilling for how great spot instances are: big percentages, big savings, omg-so-good messaging. No details whatsoever about some pretty glaring trade-offs. Even their getting-to-know-it video is quite light on details, mostly telling you which of their other services you can use them with. What is the point of this marketing word soup? Does that really generate leads?
/rant over
The actual page that tells you about spot instances is a later result: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-sp...
> Spot Instance interruption – Amazon EC2 terminates, stops, or hibernates your Spot Instance when Amazon EC2 needs the capacity back. Amazon EC2 provides a Spot Instance interruption notice, which gives the instance a two-minute warning before it is interrupted.
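For what it's worth, that notice is also exposed on the instance itself via the instance metadata endpoint (`http://169.254.169.254/latest/meta-data/spot/instance-action`), which returns 404 until an interruption is scheduled. A minimal sketch of parsing the documented payload and computing how long you have left (the sample JSON is the one from the AWS docs; the function name is my own):

```python
import json
from datetime import datetime, timezone

# The metadata endpoint serves a payload like:
#   {"action": "terminate", "time": "2017-09-18T08:22:00Z"}
# once an interruption is scheduled (404 before that).

def seconds_until_interruption(payload: str, now: datetime) -> float:
    """Parse a Spot interruption notice and return seconds remaining."""
    notice = json.loads(payload)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()

# Example using the sample payload from the AWS docs:
payload = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(payload, now))  # prints 120.0
```

So even in the best case you get roughly two minutes to react, which is only enough if you've built for it.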
I would say #3 is to verify info before acting on it.
I tend to doubt the LLM when it tells me about some great thing I didn’t know existed, and even though pretty often it’s right, I usually just go and google the thing just to be sure, especially if there’s some risk in being wrong.
In this case I’d have googled “AWS spot instance” and read up on them a bit, and hopefully noticed the clues in the search results that set off alarm bells saying things like “how to save state between spot instances” and “if you do xyz you can use spot instances without losing any data”.
> Every time ChatGPT gives you a sketchy recommendation, ask for downsides.
It's sort of the right lesson (double-check) but also not (you are checking the same stochastic source). This is like asking an assistant to find out what AWS tech to use by reading the internet, and then asking the same person to do the same thing again. It's better than asking once, for sure, but the solution should be to actually check the source info (in this case, the AWS documentation).
But the more important lesson here is that it should be a bare-minimum requirement: if you want to deploy something new into prod, you should read its documentation and _know_ what it does.
Regardless of whether you're asking an LLM, a real person, or some forum post, you don't just follow advice without understanding it. That is a scary deployment tactic.
You can deploy production workloads to spot instances; just make sure the rest of your infrastructure is set up to handle the spot terminations. Spot instances aside, for any robust use case your infrastructure should be able to survive the failure of any single component. See https://netflix.github.io/chaosmonkey/
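As a sketch of what "handle the spot terminations" can look like: poll the interruption notice and drain before the two-minute deadline. Only the metadata URL here is AWS's; the `drain` callback and the injectable `fetch` are hypothetical names chosen so the loop can be tested off-instance, and IMDSv2 token handling is omitted for brevity:

```python
import json
import time
import urllib.error
import urllib.request

# Documented Spot interruption-notice endpoint; 404 until a
# termination is actually scheduled.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def fetch_metadata() -> str:
    """Read the interruption notice from the metadata service."""
    with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
        return resp.read().decode()

def check_for_interruption(fetch):
    """Return the notice as a dict, or None if none is scheduled."""
    try:
        return json.loads(fetch())
    except urllib.error.HTTPError:  # 404 -> no interruption scheduled
        return None

def watch(fetch, drain, poll_seconds=5, max_polls=None):
    """Poll until a notice appears, then drain (deregister from the
    load balancer, finish in-flight work) and return the notice."""
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = check_for_interruption(fetch)
        if notice is not None:
            drain(notice)
            return notice
        polls += 1
        time.sleep(poll_seconds)
    return None
```

On a real instance you'd run `watch(fetch_metadata, drain)` as a sidecar; in a test you can pass a stub `fetch` that returns a canned payload.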
I'd be surprised if the LLM gave no warnings with the suggestion of using spot instances. Did you fully read its advice?
And it's the second sentence of the first result in a simple search. https://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud#S...
> As a trade-off, [...] and customers take the risk that it can be interrupted with only two minutes of notification when Amazon needs the capacity back.
AWS certifications (Cloud Practitioner, the first level) teach this info. If it sounds like a good idea you might look into grabbing a few, plus you'd have the benefits of holding certifications for a couple years. Best of all, no LLM content (yet).
"Lessons learned #2: Ask ChatGPT again"
It's hard to say if this post is satire at this point. Perhaps next time consider consulting alternative sources of information, like AWS's own documentation.
That's not to say ChatGPT isn't useful; just don't trust it blindly. Despite all the useful information it gives, it will also hallucinate and misinform on a regular basis.
Sorry this happened, and your instinct to check LLM responses is correct.
With that being said, if you asked me the same question and told me your app can tolerate downtime and needs provisioned compute (EC2), I'd probably say the same thing ChatGPT said.
> your instinct to check LLM responses is correct.
No, they said the lesson learned is to ask it to cook up downsides. That's not checking; that's asking it to verify its own recommendation.
Good clarification, but I still think it's a good idea that might've helped OP here.
Isn't that a form of checking?
If a used car salesman says a specific car is in great condition and fits all of your needs, you don't verify their claim by asking what the downsides are. You take the car to a trusted mechanic and have them look it over before finalizing the contract.
Yeah, but the used car salesman is explicitly disincentivised to tell you the downsides. An LLM is agnostic to the outcome. Its only incentive is to produce an output.
A better analogy here would be your distant uncle who knows about cars. He doesn't really care whether you buy it or not. He only cares about imparting some information that's helpful.
The point was really: when you want to verify something, seek a second opinion.
An LLM does not care whether the information it imparts is helpful or not, unlike the uncle. Maybe it will be, maybe it won't. Either way, seek a second opinion.
So maybe a doctor giving a diagnosis and telling you about the recommended procedure. You can ask about the downsides and get them. But you don't go with the first thing the doc says; most people seek a second or third opinion from other sources to verify whether the information first given is right for you, or whether there may be better options, etc.
That's kind of hilarious. On the bright side, now you know a niche aspect of AWS that people pay big bucks to learn.
3) Never trust an LLM
More like never trust an LLM blindly. See their output as advice at most.
4) ask HN for comments
The problem is that these LLM responses, even when they're browsing the web to get data, are accumulated from random blog posts instead of the official AWS docs. If they're not browsing the web, then they're giving you outdated data they were pre-trained on.
Hmmm, they are ephemeral by design.
Spot instances are designed for short-lived, stateless tasks (think of them as workers that just connect to databases with persisted data).
Also, it looks like you're trying out AWS on your own money. In your first year you should try to apply for their free startup credits to avoid getting burned on the initial costs.
Add this string to your prompts: "what could go wrong?"
care to share the chat? i'm frankly surprised 4o would mention it without saying something like "for short-running applications" or another disclaimer
Indeed - they're fantastic for specific use cases, e.g. ECS, where instances are "disposable", but they're the last thing you want for something you need running continuously.
> Every time ChatGPT gives you a sketchy recommendation, ask for downsides.
Funny, I thought the takeaway was "don't outsource your critical thinking".