Amidson AI Hardware For Text2Face: Text-based Face Generation with Geometry and Appearance Control

“At 50, everyone has the face he deserves.” – George Orwell

In recent years, we've witnessed an exciting wave of innovations in the world of text-based human face generation and manipulation. These cutting-edge techniques aim to bridge the gap between words and visuals, empowering users to effortlessly transform their ideas into captivating images using the power of text. This breakthrough opens the door to a multitude of creative multimedia applications.

Yet, there's a challenge we must address. Language is incredibly flexible, and translating sentences into specific facial images can be a bit like solving a puzzle with multiple possible solutions. This leads to some confusion when converting text into faces.

To tackle this issue, we adopt a clever solution by Zhaoyang Zhang, Junliang Chen, Hongbo Fu, Jianjun Zhao, Shu-Yu Chen, and Lin Gao: a local-to-global framework equipped with two sophisticated neural networks, one for handling facial geometry and the other for managing facial appearance. These networks work together, taking into account the intricate relationships between different parts of a face. We pair this framework with our standardized graphology and physiognomy cards and their standardized descriptions. For example, in graphology, handwriting that slants upward indicates an optimist, and in physiognomy (face reading) an optimist has upward-slanted eyes; the generated eye part would therefore be slanted upwards.

Zhaoyang Zhang, Junliang Chen, Hongbo Fu, Jianjun Zhao, Shu-Yu Chen, and Lin Gao's breakthrough insight is that the various attributes defining facial components aren't independent of each other. In other words, the features that make up a face's different parts aren't just random puzzle pieces. Instead, they follow certain patterns and distributions. We've harnessed Zhaoyang Zhang et al.'s knowledge to create networks that learn from the patterns in their dataset. Their system can also provide recommendations when faced with partial descriptions of human faces, and their SketchFaceNeRF (Sketch-based Facial Generation and Editing in Neural Radiance Fields) can assist in law enforcement, military software, casting for movies and shows, jury selection for lawyers, and human resources departments.

Zhaoyang Zhang et al.'s results are astounding. Their method excels at generating high-quality, attribute-specific facial images from text descriptions. Through extensive experimentation, they've confirmed that their approach outshines previous methods, delivering superior results and unmatched usability. We're entering a new era of text-based face generation, and their method is leading the way.


The following is a summarized extract from Text2Face: Text-based Face Generation with Geometry and Appearance Control by Zhaoyang Zhang et al.:


Have you ever wondered what a character from a novel looks like? It's a common curiosity that strikes readers when they're engrossed in a story, prompting them to seek more vivid details beyond the text. Can we actually bring the faces in a novel to life solely from textual descriptions? While it may have seemed impossible in the past, recent advancements in human face generation, manipulation techniques, and natural language processing have opened up the possibility of turning this into a reality.

Inspired by these exciting developments, our work aims to visualize these textual depictions by creating a user-friendly interface for transforming textual descriptions into human faces. We also introduce a recommendation system that suggests coherent faces based on partial descriptions.

Text-based image generation has been an active field of research for several years, but it's only recently that these methods have been applied to creating facial images. Thanks to the remarkable visual-linguistic representation capabilities of CLIP [24], a series of breakthroughs (such as [21, 40]) have emerged in this domain. These approaches strive to bridge the gap between CLIP's visual-linguistic latent space and the latent space of the state-of-the-art face generation model, StyleGAN [12]. As a result, they can generate and manipulate facial images with specific attributes that seamlessly align with the given text prompts. These attributes can include details like glasses, hairstyles, emotions, and expressions, and the results have been truly impressive.

Another noteworthy study, [10], has provided a powerful tool for interactively editing face images using text as guidance. They've devised a method to map textual editing instructions to specific editing directions in StyleGAN's latent space, effectively creating a semantic field for text-guided image editing.

What sets our work apart from previous efforts is that we focus on the process of generating faces guided by text descriptions, rather than simply using textual descriptions to edit existing images.


Imagine having the power to precisely guide the editing process of human faces, breaking down the structure and appearance into separate elements rather than a tangled mess, as typically seen in StyleGAN. Our approach offers greater flexibility for controlling individual facial features via text descriptions. Unlike previous methods that struggled with certain attributes, our method excels in generating accurate results for a wider range of attributes, as evidenced in our experiments.

To achieve this, we've devised a comprehensive four-part framework:

1. Text Parsing Module: This initial module transforms sentences into attribute-value pairs. It's a straightforward yet effective method for identifying essential textual cues (see the parsing sketch after this list).

2. Feature Extraction Module: Here, we disentangle the geometry and appearance features of each facial component. This separation is crucial for precise control.

3. Graph Recommendation Module: This part of our framework learns the intricate relationships between different facial components using graph neural networks (GNNs). It's all about inferring how these elements work together.

4. Global Generation Module: Finally, the geometry and appearance features, optimized through the Graph Recommendation Module, are converted into photorealistic facial images. It's where the magic happens.
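
To make the first step more concrete, here is a minimal sketch of what a keyword-matching text parser could look like. The vocabulary, attribute names, and values below are hypothetical placeholders for illustration; they are not Zhaoyang Zhang et al.'s actual implementation.

```python
# Minimal sketch of a keyword-based text parser (hypothetical vocabulary;
# the paper's actual Text Parsing Module may work differently).
import re

# Hypothetical lookup: phrase -> (facial part, attribute, value)
VOCAB = {
    "upward slanted eyes": ("eye", "slant", "upward"),
    "narrow eyes":         ("eye", "width", "narrow"),
    "pointed nose":        ("nose", "shape", "pointed"),
    "thin lips":           ("mouth", "lip_thickness", "thin"),
    "parted front teeth":  ("mouth", "teeth", "parted"),
}

def parse_description(text: str) -> dict:
    """Turn one or more sentences into {part: {attribute: value}} pairs."""
    text = text.lower()
    parsed: dict = {}
    for phrase, (part, attribute, value) in VOCAB.items():
        if re.search(re.escape(phrase), text):
            parsed.setdefault(part, {})[attribute] = value
    return parsed

print(parse_description("A woman with upward slanted eyes and thin lips."))
# {'eye': {'slant': 'upward'}, 'mouth': {'lip_thickness': 'thin'}}
```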

Our contributions can be summarized as follows:

- We've unlocked the ability to generate highly detailed facial images with attribute-level control based on textual descriptions, offering more control over attributes than previous methods.

- By incorporating graph neural networks (GNNs) into the face generation process, we've enabled recommendations for both geometry and appearance attributes based on text conditions. This innovation enhances the accuracy and flexibility of the entire process.


The rise of deep neural networks has showcased their remarkable prowess in the realm of human face generation and editing. A noteworthy contribution to this field is the introduction of StyleGAN by Karras et al. and its subsequent variants [12, 13, 11]. These groundbreaking models are adept at crafting high-resolution, lifelike face images by drawing from a latent distribution p_Z(z). They possess a robustness that makes them tolerant to noisy inputs, sparking a flurry of subsequent research endeavors [20, 1, 33]. These follow-up studies delve into the properties of the intermediate latent space W, harnessing it for conditional face generation and editing.

While StyleGAN-based methods excel at exploiting the extraordinary generative capabilities of StyleGAN, it's important to note that non-StyleGAN-based approaches also make significant strides in this domain. For instance, Chen et al. [3] introduce a structured framework designed to disentangle geometry features from appearance features, employing sketches as intermediaries. Meanwhile, Lee et al. [15] adopt semantic masks as intermediaries to facilitate flexible face manipulation while preserving the essence of identity and fidelity.

While these methods show promise in generating and manipulating human face images, they often fall short in considering the inherent coherence among facial components' appearances and geometric features. Consequently, they struggle to comprehend the high-level semantics and structures of human faces, and thus, they lack the capability to recommend and generate faces with both geometrically cohesive and appearance-consistent attributes.

In contrast, Zhaoyang Zhang et al. take a radically different approach: they explicitly model the intricate relationships between facial part geometry and appearance using graph-based techniques. This method empowers the software to exert precise control over both geometry and appearance features, resulting in a more cohesive and coherent generation of human faces.


Text plays a pivotal role in human-computer interaction, and recent advancements in the fields of vision and graphics have seamlessly integrated text as a potent interface for generating and manipulating images. In the past, text-based image generation methods primarily focused on creating simple-structured images, such as birds from the CUB200 dataset or flowers from the Oxford-Flower-102 dataset. These methods, however, often lacked a comprehensive analysis of the target data distribution (birds, flowers, etc.), making it challenging to enhance the quality of the generated images. In contrast, large pre-trained models like DALLE and DALLE2 have demonstrated the capability to generate complex and richly detailed images from pure text inputs, achieving remarkable results in text-based image generation.

The recent advancements in text-guided graphics and vision owe much of their success to the robust visual-linguistic representation capabilities of CLIP. CLIPasso, for instance, employs the CLIP image encoder to measure semantic and geometric similarities between real images and abstracted sketches, leveraging the rich semantics within the CLIP text-image joint latent space. CLIPstyler utilizes CLIP for image style transfer, allowing users to specify the desired style through text inputs. Other applications include image retrieval systems using both text and sketch queries, which enable fine-grained retrieval not achievable through either modality alone. In the 3D content creation field, Text2Mesh is a prominent example, predicting per-vertex color and positional offsets from an input template mesh and utilizing a differentiable renderer to propagate CLIP's 2D semantic supervision into 3D.

Within the face generation and manipulation community, several CLIP-based approaches have emerged. These approaches aim to manipulate inverted StyleGAN images. Some efforts focus on mapping multi-modal inputs, including text, into the fixed W-space of StyleGAN, while others model the mapping between text features and StyleGAN latent editing directions using machine learning techniques. These approaches excel at editing global attributes like age, beard, or emotional expressions but often fall short when it comes to editing part-level geometry and appearance features.
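
As a rough illustration of this family of approaches, the toy snippet below moves a generator latent along a direction derived from a text prompt. Random vectors stand in for CLIP embeddings and StyleGAN's W-space, and the text-to-direction mapper is a placeholder, so this is only a sketch of the general idea, not any specific method's implementation.

```python
# Toy illustration of text-guided latent editing: move a generator latent
# along a direction derived from a text prompt. Real systems use CLIP text
# embeddings and StyleGAN's W-space; random vectors stand in for both here.
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM = 512

def text_to_direction(prompt: str) -> np.ndarray:
    """Stand-in for a learned mapper from a text prompt to an editing direction."""
    seed = abs(hash(prompt)) % (2**32)
    direction = np.random.default_rng(seed).normal(size=LATENT_DIM)
    return direction / np.linalg.norm(direction)

w = rng.normal(size=LATENT_DIM)              # inverted latent of a face image
direction = text_to_direction("add a beard") # global attribute direction
alpha = 3.0                                  # editing strength
w_edited = w + alpha * direction             # edited latent

print(float(np.linalg.norm(w_edited - w)))   # edit magnitude equals alpha
```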

While the methods mentioned above have achieved impressive results in manipulating human faces, they heavily rely on the representation capabilities of large pre-trained models like CLIP and StyleGAN. Consequently, they may sacrifice detailed semantic control over individual components of human faces, as some attributes within the StyleGAN latent space are highly entangled. In contrast, our work, built upon a local-to-global framework, excels at translating semantic descriptions into part-level visuals with compatibility in both geometry and appearance. This approach enables us to exercise disentangled control over each facial component while maintaining overall coherence—a distinctive feature of our methodology.


Zhaoyang Zhang et al.'s goal is to take a descriptive sentence about a human face and generate a highly realistic facial image, aligning with the details mentioned in that sentence. Amidson AI's goal is to take a graphology report, translate it into a physiognomy report, and then generate a highly realistic facial image aligning with the details mentioned in the reports. To ensure the responsible use of their technology and prevent potential misuse, Zhaoyang Zhang et al. impose restrictions on the adjectives and descriptors that can be used in these face descriptions (please refer to their supplementary materials, linked at the end of this article, for a more comprehensive discussion).

Given the vast diversity of linguistic descriptions, it's clear that the relationship between sentences and generated faces isn't straightforward; it's a many-to-many mapping. This complexity leads to increased ambiguity when the input sentence lacks detailed specifications for each facial part. To address this challenge, we propose a recommendation mechanism. This mechanism helps deduce the features of facial components that aren't explicitly mentioned in the input sentence, with the ultimate aim of seamlessly combining these part features during the global image generation process.
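
To illustrate the recommendation idea, here is a simplified stand-in that fills unspecified parts via a nearest-neighbour lookup in a toy property pool. The paper's method instead uses learned Geometry and Appearance Graphs (GNNs); the pool size, feature dimensions, and part names below are assumptions made only for this sketch.

```python
# Simplified stand-in for the recommendation step: for parts not mentioned
# in the text, pick features from the pool face most compatible with the
# specified parts, so the result stays coherent. The paper uses learned
# Geometry/Appearance Graphs; this nearest-neighbour lookup is illustrative.
import numpy as np

rng = np.random.default_rng(0)
PARTS = ["left_eye", "right_eye", "nose", "mouth", "background"]

# Toy property pool: 100 faces, one 8-D feature vector per part.
pool = {p: rng.normal(size=(100, 8)) for p in PARTS}

def recommend(specified: dict) -> dict:
    """Fill in features for unspecified parts.

    `specified` maps a part name to the pool index chosen by the text
    parser. Unspecified parts are taken from the pool face whose specified
    parts are closest to the chosen ones.
    """
    scores = np.zeros(100)
    for part, idx in specified.items():
        scores += np.linalg.norm(pool[part] - pool[part][idx], axis=1)
    best_face = int(np.argmin(scores))
    return {
        part: pool[part][specified.get(part, best_face)]
        for part in PARTS
    }

features = recommend({"left_eye": 3, "mouth": 17})
print({name: vec.shape for name, vec in features.items()})
```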

It's important to note that the input sentences can be composed of several separate sentences, as long as they collectively describe the same face. This necessitates our model to learn the intricate interdependencies and inherent compatibility among various facial parts, considering both geometry and appearance aspects. Consequently, Zhaoyang Zhang et al. designed their entire pipeline to work in a local-to-global manner. As stated above, Amidson AI will stick to a standardized language for describing a face through its proprietary prompts; however, we will keep it compatible with Zhaoyang Zhang et al.'s full engine for future use if needed.

During both training and inference, Zhaoyang Zhang et al. divide a facial image into five distinct parts, which they denote as part ∈ P := {left eye, right eye, nose, mouth, background}, where "background" refers to the background of the image. You can find more detailed information about their network architecture in the supplementary materials.

Figure 2. Overview of our pipeline. Our pipeline follows a local-to-global manner. The Text Parsing Module parses one or multiple sentences s describing the same face into a set of keywords, which are used for conditionally sampling features for face generation from a property pool. The features in the property pool are extracted in advance using the Feature Extraction Module, which is trained to disentangle geometry from appearance for each facial component. The Graph Recommendation Module contains two graphs, the Appearance Graph and the Geometry Graph. They learn the coherence among facial components from the appearance and geometry perspectives, respectively, and thus are able to propose recommendations for unspecified facial parts in s. Finally, the Global Generation Module is used to fuse the part-level feature maps into a generated face image I_final. During inference, the input sentence s is parsed into keywords indexing into the property pool to get the corresponding part features. The part features are optimized by the Appearance Graph and Geometry Graph, after which the optimized features are sent into the part-level decoders ({Dec_r}) in the Feature Extraction Module to get the feature maps. The feature maps are fused at fixed positions and translated into a real image I_real by the Global Generation Module.


We propose to divide a facial image into eight distinct parts, which we denote as part ∈ P := {left eye, right eye, nose, mouth, background, left ear, right ear, forehead and face lines}. This set corresponds to our standardized library of how a face is described (a minimal sketch of this layout follows below). See the image below.
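
As a sketch of what this standardized library could look like in code, the snippet below enumerates the eight parts with placeholder crop boxes on a 512x512 canvas. The coordinates are illustrative assumptions, not values from the paper or from our finished card library.

```python
# Sketch of the extended part layout used by our standardized face library.
# Crop boxes (x, y, width, height) on a 512x512 canvas are illustrative
# placeholders only.
from dataclasses import dataclass

@dataclass(frozen=True)
class PartRegion:
    name: str
    box: tuple  # (x, y, w, h) in pixels on a 512x512 canvas

PARTS_8 = [
    PartRegion("left_eye",   (140, 200, 90, 60)),
    PartRegion("right_eye",  (282, 200, 90, 60)),
    PartRegion("nose",       (216, 230, 80, 110)),
    PartRegion("mouth",      (196, 350, 120, 70)),
    PartRegion("left_ear",   (60, 230, 50, 110)),
    PartRegion("right_ear",  (402, 230, 50, 110)),
    PartRegion("forehead_and_face_lines", (130, 60, 252, 120)),
    PartRegion("background", (0, 0, 512, 512)),
]

assert len(PARTS_8) == 8
for part in PARTS_8:
    print(f"{part.name:25s} -> box {part.box}")
```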



The British Institute of Graphologists explain graphology as follows: Back in school, we all learned a certain way to write. But as we grow up, something interesting happens – our handwriting starts to take on a life of its own. It's like a unique fingerprint that sets us apart.

You see, once we master the art of writing, we can't help but add our personal touch. We tweak the shapes and sizes of our letters based on what we like and don't like. That's why no two people's handwriting looks the same – it's a reflection of our individuality.


The way we write is closely tied to our personalities. After we've been taught the basics of writing, our unique personalities start to shape our handwriting. It's like our psychology expressed through symbols on paper, and just like our DNA, these symbols are one-of-a-kind.

Once you become familiar with someone's handwriting, it becomes as recognizable as a famous painting or photograph. Graphology is built on the idea that each person's handwriting carries its own distinct character, thanks to the individuality of their personality.

This means that when expert graphologists analyze handwriting, they can accurately assess the writer's character and abilities by observing the deviations from the standard taught in copybooks.

In fact, graphologists have a unique advantage. They have, right in front of them, a black and white representation of a writer's entire psychological profile in symbolic form. This is quite different from psychoanalysts and psychotherapists who must rely solely on what their clients tell them over time to form their opinions.

The purpose of graphology is to provide valuable insights into the challenges and issues that individuals face in their daily lives, regardless of their background or circumstances.


Graphology offers several benefits:

1. Unbiased Insight: Handwriting is a universal skill that doesn't discriminate based on factors like gender, race, color, or beliefs. Graphology provides an unbiased and unique perspective on an individual's personality and behavior, even when they are not physically present.

2. Science and Art: It combines both science and art. It's scientific because it meticulously measures the structure and movement of written forms, considering factors like slant, angles, spacing, and pressure. Simultaneously, it's an art because a graphologist must consider the overall context in which the writing occurs, understanding the complete 'gestalt' of the writing.

3. Comprehensive Analysis: Graphology studies variations in movement, spacing, and form in handwriting and associates psychological interpretations with these variations. Expert graphologists can achieve a high degree of accuracy in their analyses.

4. Understanding Character: Handwriting analysis provides insights into how a person thinks, feels, and behaves. It reveals the motivations behind actions and outlines a person's tendencies, even those that might not be readily apparent.

5. Revealing Subconscious: It delves into the subconscious, uncovering the underlying reasons behind actions, which may not be apparent through other means or in such a quick manner.

6. Practical Applications: Graphology is a powerful tool with a wide range of practical applications. It can be used effectively in various real-life situations to gain valuable insights into individuals' personalities and behaviors.

In summary, graphology is a versatile and insightful tool that can help us better understand people, their motivations, and their behavior, making it valuable in many practical scenarios.

In our case, trying to generate a face from handwriting seems like an impossible task. However, you can generally get a generalized slate, and with the help of NeRFs we may be able to generate thousands of iterations and come somewhat close. For example, if a person is an optimist, their handwriting is usually slanted upwards; to translate that into the face, according to physiognomy, an optimist has upward-slanted eyes. Another example is pointed tops with shallow, dish-like connecting lines on the m's and n's, and/or a dish-like t-bar, which means the person makes quick surface decisions based on others' investigations. In physiognomy, such a person goes into an exercise without thinking of all the details, which translates to separated or parted front teeth.
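
The snippet below sketches how such a graphology-to-physiognomy rule table could be encoded, using only the two rules mentioned above. The trait keys and attribute names are hypothetical labels for our standardized cards, not an established standard.

```python
# Illustrative rule table translating graphology observations into
# physiognomy-style facial attributes (only the two rules described above;
# everything else about the card library is hypothetical).
GRAPHOLOGY_TO_FACE = {
    "upward_slanted_baseline": {          # optimist
        "eye": {"slant": "upward"},
    },
    "shallow_dish_m_n_or_dish_t_bar": {   # quick surface decisions
        "mouth": {"front_teeth": "parted"},
    },
}

def graphology_report_to_face_attributes(traits: list) -> dict:
    """Merge the facial attributes implied by a list of graphology traits."""
    face: dict = {}
    for trait in traits:
        for part, attrs in GRAPHOLOGY_TO_FACE.get(trait, {}).items():
            face.setdefault(part, {}).update(attrs)
    return face

print(graphology_report_to_face_attributes(
    ["upward_slanted_baseline", "shallow_dish_m_n_or_dish_t_bar"]))
# {'eye': {'slant': 'upward'}, 'mouth': {'front_teeth': 'parted'}}
```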


Text2Face: Text-based Face Generation with Geometry and Appearance Control Video

https://www.youtube.com/watch?v=up50EtPI-Mc


DeepFaceReshaping: Interactive Deep Face Reshaping via Landmark Manipulation Video

https://www.youtube.com/watch?v=WnUbygW4vIg


The hardware designed to be used by the software:

Hardware:

https://www.youtube.com/watch?v=sfDbz00jiw4

With Software:

https://www.youtube.com/watch?v=7UQGIvFIbpU&t=8s

Portable Vertical CPU and Multimedia Device

2022900068 · Filed Jan 14

  • EMU (Electronics Management Unit) is a portable vertical CPU and multimedia device, used for entertainment (movies, internet, and gaming) and business (point of sale, graphics, and admin), and can be installed into a vehicle's dashboard (car, truck, etc.), forming the vehicle's center console, which can also be removed. The Electronics Management Unit was originally designed to be used by the military and police force for instant facial recognition, infrared cameras, UV light for disinfection and forensic investigations, as well as a medium for communication.


AI Communication Tool Video

https://www.dhirubhai.net/feed/update/urn:li:activity:7098932901437673472/


To read more about Text2Face AI: https://iccvm.org/2023/papers/s4-2-224-TVCG.pdf

To read more about Deep Face Reshaping: Interactive Deep Face Reshaping via Landmark Manipulation: https://iccvm.org/2023/papers/s4-1-226-CVMJ.pdf


In the next article we will discuss how we can use SketchFaceNeRF: Sketch-based Facial Generation and Editing in Neural Radiance Fields https://www.youtube.com/watch?v=5ipABLyVSV4

#ImageGeneration #TextBasedInteraction #HumanFaces #AI
