Comparing AI Image Generators When You Actually Need Results

1. DALL-E 3 is good but not where the weird lives

When I first tried to generate a custom header image for a Substack post with DALL-E 3 in ChatGPT, I just typed: “create an image of a robot using Photoshop like a 90s hacker.” The image was fine. Slick lighting, metallic reflections, a blue-violet gradient haze that felt very prompt-engineer-chic. But it missed that twitchy oddball energy I wanted — and it also completely ignored my request for 90s-era UI.

DALL-E 3 inside ChatGPT reacts a lot like a helpful intern: confident, logical, but not particularly imaginative. It handles object placement and style blending pretty well, but it doesn’t lean into abstraction unless you really bait it. And even then, the outputs start converging after four or five tries.
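For comparison, you can bypass the ChatGPT layer entirely and send the same kind of prompt to the Images API. Here’s a minimal sketch using the openai Python SDK; the prompt wording is just my paraphrase, and the one genuinely useful detail is that DALL-E 3 returns the rewritten prompt it actually used, which is the closest you’ll get to seeing what was quietly changed.

```python
# Minimal sketch: the same kind of prompt, but through the Images API instead
# of the ChatGPT UI. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt=(
        "a robot using Photoshop like a 90s hacker, CRT monitor, "
        "beige tower case, cluttered desk, grainy VHS lighting"
    ),
    size="1024x1024",
    n=1,
)

# DALL-E 3 rewrites prompts before generating; the revised version comes back
# alongside the image URL, so you can at least see how your wording was changed.
print(result.data[0].revised_prompt)
print(result.data[0].url)
```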

Here’s the weird part: DALL-E seems to soft-filter anything that can’t be clearly mapped to known visual content. I asked for a 3D model of a cascading stack of browser tabs, each showing a mid-90s website with pixel-art popups, with the cursor frozen over a spinning phone icon. It generated a polite image of a single tab with flat HTML — no pixel art, no phone icon, no implied motion. When pressed, it literally said: “Some image requests may be simplified depending on policies and system limitations.”

That’s not in any OpenAI documentation I’ve found, but it’s consistent with other redirection behaviors (like refusing to show realistic usernames even in fictional mockups). The interface won’t warn you that it’s sanitizing your prompt. You just get a watered-down image and a fake sense of success.

Still useful in a pinch. But if your idea hinges on visual contradiction, you’ll probably hit a quiet no.

2. Midjourney excels at style but fails at editing memory

Midjourney feels like working with a very stylish friend who can’t remember anything you said beyond 15 seconds ago. I love it for visual exploration — that part of the process where I don’t want answers, I just want options. You type something like “cyberpunk librarian floating through digital archive clouds --v 5.2 --style raw” and it gives you four hauntingly bold compositions in seconds.

But here’s the recurring problem: Midjourney doesn’t actually know what you mean 90% of the time. It guesses. And that’s fine — chaotic creative iteration is part of the appeal — but doing revisions feels like trying to tune a detuned radio using only keywords. There is no internal state, no edit history. If one of the generated quadrants is almost good (say, top-right), but the face is wrong and you just want to tweak the mouth size or jacket color… tough. You either gamble on another variation of that image or write an even longer prompt that may or may not get you close again.

Prompt: fluid newspaper UI overlaying a person writing in a bunker --v 6 --ar 2:3
Result: Looks amazing. But all buttons say POWER or CONTROL in weird fonts. No way to specify actual UI text.

Also, it still can’t render readable text inside interface elements. You get gibberish. It doesn’t matter how many --no gibberish instructions you add; it’ll just politely ignore them.

I needed a storyboard image for a SaaS launch — we wanted a stylized representation of our dashboard running on a projector against a conference wall. No dice. Every time we added real text, the image deformed or ended up looking like a Russian news broadcast. The only way around it (temporary, but consistent) was to generate an empty screen and Photoshop our actual UI in afterward — which defeats the point entirely.

3. Stability AI is fast and open but almost too DIY

I used Stability’s SDXL model through the ComfyUI wrapper last week for a client who wanted “illustrated but not cartoony” avatars of their support team. What I loved was how brutally transparent it is. You see every part of the generation graph. Model weights, prompts, samplers — it feels like building a Rube Goldberg image engine from cables and switches. Fine-tuning is extremely powerful. Also extremely addictive, in a time-leaking way.

The biggest issue? You can absolutely break the system without warning, especially using poorly matched LoRAs (low-rank adaptation modules). I loaded in a custom facial expression LoRA trained for exaggerated anime mouths… and forgot two nodes were still linked to photo-real base models. The result was wild. Think: ultra-photoreal mouths with pink gums and sharp teeth on top of otherwise friendly, hand-drawn Disney faces. It felt illegal.

This never throws a system error. You just get disturbing outputs and no idea where the conflict came from unless you trace back through the graph node by node.
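For what it’s worth, the same pairing problem exists outside ComfyUI. Here’s a rough sketch of loading SDXL plus a LoRA with Hugging Face’s diffusers library; the LoRA path and prompt are placeholders I made up, and nothing in the library warns you that the styles clash. You only find out when you look at the output.

```python
# Rough sketch of the same setup in diffusers rather than a ComfyUI graph.
# The LoRA filename is hypothetical; point it at whatever you actually trained.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# A LoRA tuned for a completely different aesthetic loads the same way as any
# other; there is no style-compatibility check, only your own eyes later.
pipe.load_lora_weights("./loras/anime-mouths-sdxl.safetensors")

image = pipe(
    prompt="friendly hand-drawn support-team avatar, flat illustration, soft colors",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("avatar.png")
```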

If you want specific control over composition — exact hairstyle matching, color themes by hex code, reference poses — Stability’s lineup is as precise as it gets (assuming you know how to wrangle HuggingFace). But the ecosystem’s reliance on forum-scattered knowledge is still a barrier. More than once, I had to extract a style embedding from a .ckpt and inject it into a YAML-config patch just to get the look I wanted.

4. Leonardo AI looks clean but somehow always off target

I really thought Leonardo might be a midway solution — like Midjourney but with knobs I could tweak. And visually, the interface is gorgeous. Fast render speed, nice upscales, clean prompt tracking. But the model tends to misread indirect phrasing. You say “an underwater workspace with CRT displays” and you’ll often get something that technically matches — glowing monitors, a vague wet shimmer — but also has fish floating through keyboards or glowing coral sticking out of a command-line window.

There’s something in how Leonardo parses compositional logic that makes it untrustworthy for commercial visuals. I once had to submit a Black Friday banner mockup for an app. I used Leonardo to riff visually on “a phone screen melting into coffee with app screenshots.” Nearly every image nailed the coffee, totally botched the app interface, and half of them added dollar bills or ice cubes (?) floating beside the phone — even though I never mentioned either.

Aha moment (not documented anywhere): Leonardo’s composition skew seems to come down to word order in the prompt. When “dashboard interface made of light” comes after the environment and lighting phrases, the interface gets hallucinated into something else. If you lead with the subject — “a dashboard interface made of light in a bedroom” — boom, now you reliably get an interface.

Wasn’t in their Discord FAQ. Wasn’t in my favorite mid-level cheat sheet thread either. Just found that by brute force.

5. Canva AI images are safe if you need zero vibe

I’m including Canva’s Magic Media because someone’s gonna ask. Yes, it’s technically an AI image generator. Yes, it outputs something that roughly fits your prompt. But in practice, it feels like asking your printer to draw you a picture. Everything is fine. Nothing surprises you. Nothing exceeds expectations.

Great for making generic thumbnails. I used it to create a banner that said “Spring Budget Planning” with smiling people at laptops. Took six minutes. But when I tried to get it to generate a stylized version — same scene, but in muted pink and teal with fake 3D icon overlays — it collapsed. Returned distorted faces and backgrounds that looked like melted LMS dashboards. I assume they’re using a trimmed or sanitized version of a standard diffusion model, maybe limited by their in-browser API stack.

Undocumented edge case: if you include any reference to file types or image formats, Canva’s AI adds random UI mockups. Ask for “a stylized .csv import flow” and you get Excel-ish windows with blocky green headers hanging in space. No other tool hallucinated file UIs like this.

6. Most of these platforms still have terrible memory

Here’s the biggest collective flaw I keep running into with all of these: no image generator, no matter how expensive or good-looking, retains actual visual memory across sessions. Zero continuity by default. If you want variations of the same character in different poses — or same room, different time of day — you essentially have to re-bake the scenario each time, using disgustingly verbose prompts or manually refeeding references.

The fix is ridiculous

You have to create a semantic token embedding tied to a fake name like "[scene1-room]" or "[keisha-variant-a]" and inject it at the start of your prompt. Then you store that reference image and forcibly compare future outputs against it, using either CLIP feature extraction or a third-party visual similarity model. That’s not sustainable for quick creative iteration. It’s barely tolerable at the prototype phase.
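To make the “forcibly compare” part concrete, here’s roughly what that check looks like with CLIP image embeddings via the transformers library. The filenames and the 0.8 threshold are placeholders; in practice you calibrate the threshold against a few pairs you already know are the same character.

```python
# Rough sketch of the comparison step: score a new generation against a stored
# reference image using CLIP embeddings. Filenames and threshold are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

reference = Image.open("keisha-variant-a.png")  # the image you treat as canon
candidate = Image.open("new-generation.png")    # the latest output to check

inputs = processor(images=[reference, candidate], return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)

feats = feats / feats.norm(dim=-1, keepdim=True)
similarity = (feats[0] @ feats[1]).item()  # cosine similarity

print(f"similarity: {similarity:.3f}")
print("keep" if similarity > 0.8 else "regenerate")
```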

Even Midjourney’s new Style Reference isn’t truly programmatic. You feed it a style image, and it approximately copies it — textures, mood, color, composition rules. But it doesn’t bind anything to identity.

I made a mini-series of onboarding images for a product walkthrough: same figure, casual clothes, gesturing at different UI elements. Took me three hours and multiple tools. In the end, I screenshotted the first few Midjourney results and then manually composited consistent parts over new backgrounds using Affinity Photo — which I had to reinstall because Photoshop crashed after a plugin update from last January started throwing error dialogs about activation keys not matching the OS lockfile.