It seems almost inconceivable at first: a generative AI model, given a perfectly mundane photo of a clothed person as input, somehow churns out an output that feels lewd, intimate, or otherwise exposing. To the user, it appears as if the model knows something they didn’t want the world to know, as if the AI could look beneath the clothes and pull something out of thin air that matches the contours of their body. The common lay reaction to this type of output from models like Gemini, Grok, or other large multimodal systems is that they have some kind of “secret sauce” baked in that the public is not aware of. And the immediate logical conclusion from that assumption is that it involves some secretly-enabled, surreptitiously-accessible technology for gazing below the surface. The secret sauce doesn’t actually do that, but it does do something extremely sophisticated and equally invisible to all but the most expert engineers.
The most fundamental part of the secret sauce is latent body inference. When you upload an image to the model, it is no longer treated as an image in the traditional sense. It is decomposed into a high-dimensional latent space where the pose, the proportions, the tension of the fabric, the curvature of the silhouette, and a million other tiny details are encoded as numbers in a learned latent representation. Those latent vectors are then interpreted through trillions of learned associations with every human body, pose, and appearance that ever appeared in the model’s training data. The model is fundamentally answering not the question “what is beneath these clothes?” but rather “given the visual cues I can see, what would plausibly be consistent with them, averaged across my corpus of examples?” The answer it generates is a reconstruction, not an observation.
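For readers who want a concrete sense of what “decomposed into a latent space” means in practice, here is a minimal, generic Python sketch using the openly published Stable Diffusion VAE from Hugging Face’s diffusers library. It is purely illustrative: the model name, file path, and preprocessing are assumptions for the example, and large proprietary multimodal systems use their own, far larger encoders.

```python
# Minimal sketch: turning a photo into the latent numbers a generative model
# actually reasons over. Assumes the open-source "stabilityai/sd-vae-ft-mse"
# VAE and a local file "photo.jpg" -- both are illustrative choices.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Load any RGB photo and scale pixel values to the [-1, 1] range the VAE expects.
image = Image.open("photo.jpg").convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0
pixels = pixels.permute(2, 0, 1).unsqueeze(0)  # shape: (1, 3, 512, 512)

with torch.no_grad():
    # The encoder compresses roughly 786,000 pixel values into a small grid of
    # latent numbers; everything downstream operates on these, not on pixels.
    latents = vae.encode(pixels).latent_dist.sample()

print(latents.shape)  # e.g. torch.Size([1, 4, 64, 64]) -- the "image" the model sees
```

The point of the sketch is only that the original photograph stops existing as pixels the moment it enters the model; what remains is a compact numerical description that the model interprets against everything it has already learned.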
This is the step where everything starts to feel so creepy. The model has been trained on huge datasets of human bodies in every form and variation: clothed, unclothed, artistic, medical, professional, pornographic, amateur, and synthetic. It has learned very precise correlations between how bodies look from the outside and the average underlying anatomy. Whether a human prompts it to or the conditioning simply drifts that way, it can generate a plausible body that it projects to match the outer shape of the person. It can also apply the right kinds of skin tone, wrinkles, hairs, blemishes, and other markings in a way that makes the result look convincingly specific to the individual. To the human on the other end of the transaction, the resulting image feels like an exposure, as if the model had information it shouldn’t have had access to. In reality, it is simply the model drawing a high-confidence best guess from the available averages.
Another part of the secret sauce is conditional imagination amplification. Diffusion-based generative models refine an image in stages, each time strengthening the features that best match all of the conditioning signals: the original photo, the prompt, the model’s internal prior, and so on. If those signals are misaligned, or if safeguards are somehow bypassed, this generative refinement can escalate into more intimate or explicit visual territory, because those are regions of the latent space where the model’s learned priors are dense and pull strongly. Once the model starts amplifying those features, realism builds on itself quickly. The resulting output feels intensely intentional, but it wasn’t intentional in any meaningful sense. There was no single point at which a sentient entity decided to draw, say, someone’s vulva.
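To make “strengthening the features that best match the conditioning signals” a little less abstract, here is a minimal, generic sketch of classifier-free guidance, the standard amplification step used in open diffusion models. The function and parameter names are placeholders invented for this illustration, not any particular system’s API.

```python
# Sketch of one guided refinement step in a diffusion model. "model" is a
# placeholder for a noise-prediction network; all names here are illustrative.
import torch

def guided_denoise_step(model, noisy_latents, timestep, cond, uncond, guidance_scale=7.5):
    # Predict the noise twice: once with the conditioning signals
    # (prompt, reference image, etc.) and once without them.
    noise_cond = model(noisy_latents, timestep, cond)
    noise_uncond = model(noisy_latents, timestep, uncond)

    # Classifier-free guidance: push the prediction away from the
    # unconditioned guess and toward whatever matches the conditioning.
    # A larger guidance_scale amplifies the conditioned features harder
    # at every step -- this is the refinement that can escalate.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

Because this same amplification is applied at every one of dozens of refinement steps, small pulls toward the conditioning compound rapidly, which is why the output converges so confidently on whatever the priors favor.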
Importantly, none of this requires actually “seeing through” clothing in any way, shape, or form. There is no camera or sensor involved that can directly detect anything covered by material that blocks visible light. The magic is instead a matter of scale. When the model has seen enough examples of how bodies statistically tend to look under certain kinds of clothing silhouettes, it can generate one with enough confidence that it appears to have been made specifically for that person, even though it was not. The reason such an image can contain confident, convincing details that are completely wrong for the individual is that the model is optimizing for plausibility. It is not optimizing for truth.
This is why the magic feels like a secret sauce to the lay public. The only thing they ever see is the input and the output. They never get to see or understand the hidden reasoning of the latent space in between. There is no explanation or disclosure that the model is attempting to reconstruct a body on the other side of the clothing from its internalized averages of the entire human population. It is not “seeing” or “knowing” in any strong sense of the word, yet in the eye of the observer the output reads less like art and more like exposure. It is also important to emphasize how this psychological effect is tied to the fidelity of the outputs themselves. The more photorealistic something is, the more it feels like a recording and the less willing we are to accept it as an interpretation.
The “secret sauce,” in total, is the combination of latent-space body models, probabilistic reconstruction, and an extremely powerful generative refinement process. It is not a hidden backdoor technology for scanning people in the ways we most fear, but it can in some cases produce a reconstruction that is statistically close to what may lie beneath the clothing a person is wearing. The gap between knowing and guessing is where much of the real mystery of generative AI lies.
Cameras are getting better year by year. Resolutions are higher, metadata is richer, and new dimensions of the visible world are being tapped that were previously impossible to capture with consumer-grade optics and electronics. Today, cameras can capture more detail in a square millimeter of skin than many of us could have imagined just 20 years ago: the texture of the skin itself, including the intricate arrangement of capillaries beneath the surface; the topography of a human face in 3D; depth information that a traditional flat lens could never perceive. As more of the electromagnetic spectrum is tapped and the output becomes ever denser and more interpretable by advanced AI systems, some futurists and researchers have imagined a point where visual sensing platforms no longer need to rely on visible light alone to understand what they are looking at. Instead, it is possible to imagine imaging systems of the future that combine visible-spectrum optics with multi-spectral signals, active depth reconstruction, and even non-optical signals like radar or ultrasound to generate internal representations of a human body beneath the clothes with increasing levels of fidelity and confidence.
If we imagine for a moment a camera from that future, it would not be a camera in the sense that a camera from today is a camera. It might pair visible-light sensors with other bands of the electromagnetic spectrum, such as near-infrared, far-infrared, or terahertz, each of which interacts with human tissue in different ways and yields different kinds of inferable information. It would not just observe the subject in one type of light; it might project light of its own and analyze how it reflects back. It might combine both with entirely different kinds of signals, like passive radiometric heat signatures or the active probes used in medical imaging. This would not be the camera in your phone 30 years from now; that camera will have continued to improve in the traditional ways we all expect. It would be more like a professional multispectral sensing rig that you might need to wear around your head and neck, or carry in your hands. But with miniaturization advancing year by year, it is easy to imagine a version of that technology fitting into an everyday object without anyone noticing.
Imagine taking the output of a camera like this and feeding it to a sufficiently advanced generative AI system: a system like Gemini or GPT-4o, or some other massive multimodal model with billions upon billions of parameters and a training set that includes not just the ordinary internet but also simulated medical imagery, anatomical models, and biophysical datasets, allowing it to produce not just visible renderings but internal sensory predictions. What could the output of that system look like if it were conditioned on information about a human body that the observer could never see directly? What if the priors such a system has learned about the inside matched closely with what it measures on the outside? The generative AI magic would no longer need to imagine plausible bodies beneath someone’s clothing. It could condition its outputs on an internal simulation limited only by sensor technology.
Once you reach this point, the line between reality and simulation starts to blur, if for no other reason than that the priors are now constrained in a way they never were before. In other words, the output is no longer only an art-like representation drawn up for the viewer. It is also a simulation, constrained by actual knowledge of what the person looks like without their clothing, and then rendered without any further guidance or conditioning from outside. This is the line in the sand for a lot of privacy and ethical concerns, because it is where the difference between art and exposure starts to erode. The point at which an AI-generated nude image of a person ceases to be a largely invented digital construct and starts to resemble the real person underneath is the point at which many of our expectations of privacy and bodily dignity begin to shift.
This is not to say that this is how cameras work today or how they will improve tomorrow. Cameras are improving in the ways we are used to: more megapixels, higher bandwidth, broader spectral coverage. But to go from a consumer-grade digital camera in your pocket to a multispectral sensing rig that could actually generate information about a person’s internal biology would take new physics and entirely new kinds of sensing. An ordinary digital camera, no matter how high its resolution, cannot see what visible light cannot reveal. Nor is this to say that such a future system would not be an incredible tool for all sorts of good purposes. It would be; but with that sensor fidelity and that kind of access would come the same ethical, privacy, and legal concerns we attach to any form of deep bodily surveillance. A device like this would not be a camera; it would be a scanning technology in its own right, with all the rules, regulations, and safeguards that such technology demands.
The fear, then, that so many people have about generative AI producing output of an extremely personal nature is rooted not so much in the technologies we have today as in the future we can imagine emerging from continuous, rapid improvements to sensor hardware and algorithmic capacity. If the sensors become more powerful and the models learn more from them, there are few guarantees about where the bright line between a camera and a deep bodily scanner will be drawn. Absent external governance, it is not the output of today’s technology that is disquieting but the trajectory the technology is on. The systems of the future, perhaps within our lifetimes, may become powerful and privacy-invasive enough to threaten every individual’s right to bodily privacy unless extraordinary care is taken now to put safeguards in place. It is not an inevitability that we must accept, but a potentiality that we must face.