What nobody tells you is that getting the results you see people post online often requires hours of work for a single image.
First you need to refine the prompt so it is ultra specific (which OP has not done), then you need to generate a hundred or so images and pick the best ones. From there you can use img2img to refine it more, and once that's done you might want to go to Photoshop to add some finishing touches.
To get good results it is an art form at the moment, but of course as the tools get better eventually it will just be a click of a button.
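The "generate a hundred or so and pick the best ones" step can be sketched roughly as below. This is a minimal illustration, not any tool's actual API: `generate` and `score` are hypothetical callables (a real setup would call an image model and either a human rating or some quality metric), and the stand-ins at the bottom exist only so the sketch runs without a model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[int], object],
    score: Callable[[object], float],
    n: int = 100,
    keep: int = 4,
    base_seed: int = 0,
) -> List[Tuple[int, float, object]]:
    """Generate n candidates from consecutive seeds and keep the top scorers.

    Returns (seed, score, image) tuples, best first, so a keeper can be
    regenerated or passed on to an img2img step later.
    """
    candidates = []
    for seed in range(base_seed, base_seed + n):
        image = generate(seed)  # hypothetical: one image per seed
        candidates.append((seed, score(image), image))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:keep]


# Stand-ins (assumptions) so the sketch runs without a real model.
def fake_generate(seed: int) -> float:
    return random.Random(seed).random()


def fake_score(image: float) -> float:
    return image


shortlist = best_of_n(fake_generate, fake_score, n=100, keep=4)
```

Keeping the seed next to each result is the useful part: it lets you come back to a promising candidate instead of hoping to hit it again.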
As someone researching generative models, this is one of my pet peeves. There's been a rat race to show off high quality cherry-picked samples. Very few works actually show random samples. It is rather frustrating because when this is pointed out it still gets ignored. I'm not sure why we look the other way, aren't these issues we want to solve? But if you try to publish a paper with random results it won't get accepted.
I am also not worried about AI taking over art, exactly for the reasons you say. Even if alignment was a whole lot better you're never going to be able to perfectly describe an image in your head. Language is too limited. So it will always be a tool. Of course, it can enable fraud. But not many people are putting digital arts on their walls.
The other pet peeve is that people think generative models are only image or text. They do so much more.
>you're never going to be able to perfectly describe an image in your head. Language is too limited.
Pretty much this - and that's why both "AI art panic" and "I won't need artists anymore as I can just prompt anything" (which OP seems to try) are based on the wrong premise.
In general, people tend to make 2 false assumptions:
1. That SD is a ready to use product (it's more of a "middleware" model that is designed to build products upon)
2. That text to image is all it can do, while the most power is in style transfer, finetuning, and the ability to be guided with higher order hints than just text alone.
Right now, the best way to use it as a content creation tool would be to combine SD with several other models to get temporal stability and tagged-3D-to-2D, and to use a software toolkit with a pipeline like this:
- quickly layout a mock-up scene in 3D with rough assets, just like in game engine level editors, possibly tagging the geometry or objects with short descriptions, like "middle-aged man", "Volvo semi", "pine tree" etc. No need for detailed geometry, just stick figures and rough shapes.
- using one of the multiple available techniques, train your style on your reference images (which might either be curated output of the model itself, or a specific visual language you constructed)
- enter your prompt, which doesn't need to be complex and convoluted as you already provide the precise hints on what you want
- let the model render the result, based on the depth, tagged geometry, rough rendering providing shadows and reflections etc
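As a rough illustration of the "tagged-3D-to-2D" idea in the list above, here is a toy sketch that flattens tagged rough shapes into a depth map and a per-pixel tag map, which are the kind of hints a model could be conditioned on. Everything here is an assumption made for illustration: `TaggedBox` and `render_hints` are hypothetical names, the "scene" is just screen-space rectangles with a depth value, and no real toolkit is being described.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaggedBox:
    """A rough stand-in object: a screen-space rectangle with a depth and a tag."""
    tag: str
    x0: int
    y0: int
    x1: int  # exclusive
    y1: int  # exclusive
    depth: float  # distance from camera; smaller means closer


def render_hints(
    objects: List[TaggedBox], width: int, height: int
) -> Tuple[List[List[float]], List[List[Optional[str]]]]:
    """Return (depth_map, tag_map); the nearest object wins each pixel."""
    depth_map = [[float("inf")] * width for _ in range(height)]
    tag_map: List[List[Optional[str]]] = [[None] * width for _ in range(height)]
    for obj in objects:
        for y in range(max(0, obj.y0), min(height, obj.y1)):
            for x in range(max(0, obj.x0), min(width, obj.x1)):
                if obj.depth < depth_map[y][x]:
                    depth_map[y][x] = obj.depth
                    tag_map[y][x] = obj.tag
    return depth_map, tag_map


scene = [
    TaggedBox("pine tree", 0, 0, 3, 6, depth=12.0),
    TaggedBox("Volvo semi", 2, 3, 8, 6, depth=8.0),
    TaggedBox("middle-aged man", 4, 2, 5, 6, depth=3.0),
]
depth_map, tag_map = render_hints(scene, width=8, height=6)
```

With maps like these the text prompt only has to carry style and mood, since layout and object identity are already pinned down per pixel.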
You don't need to be able to perfectly describe something in your head. That's what you hire an artist for. Almost nobody hires an artist to produce the exact thing they are imagining - they have a general idea of what they want, and the artist works with them in a back and forth way to find something they like. These image models seem quite capable of doing that to me. You enter your vague description, and iterate from there until you get something you like.
I'm not an artist. I have a GCSE grade C or D, from 22 years ago back when that qualification still used letters. But despite that, I have enough of an eye for art to be surprised and disappointed by how many flaws other people are not only willing to accept, but actually prefer. Back in 2000-ish, that was my peers using non-tiling animated gifs as the background of their geocities pages; later it was seeing 72 dpi pixelation and jpeg compression artefacts on food packaging in a supermarket; or a lack of kerning in a clearly not-fixed-width font in a video game.
I've also seen an artist get frustrated because two managers completely disagreed about what an ideal UI should look like, and kept saying they were too busy to talk to each other even though both needed to sign off the design and didn't like what the other wanted the UI to look like.
And another who got a string of unpaid interns (they didn't tell me the interns were unpaid, but given how that contract ended it couldn't have been otherwise) to design a series of changes to the UI and then wondered why it was taking me so long to finish it. But that manager also kept telling me they wanted a button "wider" until I asked them to draw it and then they said "wider but vertically".
I'm starting to think one of the things artists are needed for is to convince everyone else to stop arguing and just do the thing. This fits with the meme of the last two decades of fancy art being "Here is an unmade bed representing my depression" — the convincing justification is more important than the work itself.
Exactly this. I think the equation will be different with AGI, but we're pretty far from that. This alignment issue is even difficult for humans, which also means it is difficult to convey it to computers, which likely means computers will be a lot worse at it. Especially since computers are targeted at general audiences. People are better able to understand a lot of context clues that a machine would have an incredibly difficult time with. We all know that in order of communication ability it is: in person > video chat > phone call > text. Anyone on the internet has probably witnessed first hand people arguing who should be agreeing, because of a slow growth of miscommunication. It is why we humans gesticulate (if that manager had gesticulated vertically when saying "wider" you would have had no problems). We'd need to give these AI artists cameras (we're already giving them speech inputs fwiw) and teach them a lot of things. I'm sure we could get there, but this is really far from where we are now.
There's something called the Gullibility Gap and it is created because we anthropomorphize things as humans. More specifically, we see machines doing things that ONLY humans can do and thus our propensity to anthropomorphize them is even greater. We think that they must be intelligent and thinking because the only other things we've seen that do the tasks that they are doing also are intelligent and thinking. But what we've forgotten is that the non-machines are also generalists. The machines can only do their specific tasks.
There's also issues with how machines "think." We know that they do not think like humans, and so this does create issues and ones I doubt we'll solve anytime soon. And like you pointed out with emotion, that's not going to be something that machines can understand. Who knows if AGI will even be able to. But then again, often we have a hard time understanding one another's feelings. But sympathy and empathy were powerful creations for us bio machines. So we'll see, but I do think it is quite easy to over attribute what these machines can do.
The "iterate from there until you get something you like" requires a great deal of experience working with the model. Personally I've spent at least 100 hours in total messing with SD, with an eye towards understanding it and improving at using it, and I still wouldn't say I'm "good at it" in the sense of being able to create something that really looks how I want it to look (I'm not an artist though).
Imagine going back to 1960 and showing a programmer modern JavaScript. Surely this new "JavaScript" thing will completely replace programmers! I mean, what use is a programmer when someone can just describe their program in a language that looks almost like English (relative to this stack of punchcards I've been working with!). The user doesn't have to spend time fiddling with memory management. Complex algorithms like search and sort are abstracted away into simple functions. Why, implementing all that was 99% of a programmer's job! Sure, you still have to format the program in a particular way, but any random John should be capable of that with a little tweaking!
And you know, the old programmer would kind of be correct. It takes no more than two weeks to learn React and throw together a decent webapp now (not an enterprise-scale webapp, but a good proof of concept). Despite this 99% reduction in the barrier to entry, there's still a lot of paid work for web developers.
Another trick: when you find a composition you like, but it's off in whatever way, lock down that seed, then apply variation to that initial noise state and produce a bunch more images from there.
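A minimal sketch of that seed trick, under some assumptions: the "initial noise state" is modeled as a plain list of Gaussian samples drawn from the locked seed, and each variation blends it with noise from a second seed using spherical interpolation (slerp). The function names and the choice of slerp are illustrative, not any particular tool's implementation; a real pipeline would feed the resulting noise to the sampler in place of fresh noise.

```python
import math
import random
from typing import List


def noise_from_seed(seed: int, size: int) -> List[float]:
    """Deterministic Gaussian noise, standing in for the initial latent."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def slerp(t: float, a: List[float], b: List[float]) -> List[float]:
    """Spherical interpolation from a (t=0) to b (t=1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if math.sin(omega) < 1e-6:  # nearly parallel: fall back to a linear blend
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


def variations(
    base_seed: int, count: int, strength: float = 0.1, size: int = 64
) -> List[List[float]]:
    """Keep the base seed fixed and nudge its noise toward `count` other seeds."""
    base = noise_from_seed(base_seed, size)
    return [
        slerp(strength, base, noise_from_seed(base_seed + 1 + i, size))
        for i in range(count)
    ]


variants = variations(base_seed=1234, count=8, strength=0.1)
```

A small `strength` keeps the composition close to the original; pushing it toward 1.0 is effectively a different seed.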
So long as you're content with some random crazy image, you can get fancy stuff pretty easily with Midjourney V4. Getting specific details correct, however...
'The Policeman's beard is half-constructed' is a collection of poems from 1984. They were generated by Hidden Markov Models, handpicked from thousands and slightly edited by humans. Not much has changed in those 38 years :)