Subject-driven image generation aims to synthesize new images that preserve a given subject's identity while following textual instructions. Existing approaches often encode text and reference images separately, which limits multimodal reasoning and causes copy-paste artifacts; recent multimodal diffusion frameworks improve instruction following but largely overlook identity preservation. We propose a conditioning framework based on Multimodal Large Language Models (MLLMs) that jointly encodes text and reference images, augmented with VAE-based identity conditioning. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features into an effective conditioning signal, and a multi-stage denoising strategy progressively balances MLLM-driven semantics against VAE-driven fine-detail refinement. Experiments demonstrate that our approach harmonizes multimodal features, mitigates copy-paste issues, and achieves superior semantic understanding and identity preservation in subject-driven image generation.
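The abstract names a Dual Layer Aggregation module but not its internals; below is a minimal sketch of one plausible way to fuse multi-level MLLM features into a single conditioning signal. The class name, the learned softmax weighting over layers, and the tapped layer indices are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class MultiLevelAggregation(nn.Module):
    """Sketch: fuse hidden states tapped from several MLLM layers into one
    conditioning sequence via learned per-layer weights (assumed design)."""

    def __init__(self, hidden_dim: int, num_layers: int, cond_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one logit per tapped layer
        self.proj = nn.Linear(hidden_dim, cond_dim)                # map to the denoiser's cond space

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: num_layers tensors, each (batch, seq, hidden_dim)
        stacked = torch.stack(hidden_states, dim=0)                # (L, B, S, H)
        weights = torch.softmax(self.layer_logits, dim=0)          # convex combination over layers
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, S, H)
        return self.proj(fused)                                    # (B, S, cond_dim)

# hypothetical usage: feats = [mllm_out.hidden_states[i] for i in (8, 16, 24)]
```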
@article{zheng2025dual,title={DLA: Dual Layer Aggregation for Squeezing Capacity of Multimodal Large Language Models for Subject-driven Generation},author={Zheng, Shuhong and Misraa, Aashish Kumar and Li, Yu-Teng and Li, Yu-Jhe and Gilitschenski, Igor},year={2025},note={Under Review},}
Paper
Visual Persona: Foundation Model for Full-Body Human Customization
Jisu Nam, Soowon Son, Zhan Xu, Jing Shi, Difan Liu, Feng Liu, Aashish Misraa, Seungryong Kim, and Yang Zhou
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which partitions the input image into distinct body regions, encodes these regions as local appearance features, and projects them independently into dense identity embeddings to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
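As a rough illustration of the region-wise identity encoding described above, here is a minimal sketch. The convolutional stand-in backbone, the per-region linear heads, and all dimensions are assumptions for brevity; the paper itself uses a transformer encoder-decoder:

```python
import torch
import torch.nn as nn

class RegionIdentityEncoder(nn.Module):
    """Sketch: encode body-region crops independently and stack the results
    into a dense identity-embedding sequence for diffusion conditioning."""

    def __init__(self, feat_dim: int, id_dim: int, num_regions: int):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for the paper's encoder
            nn.Conv2d(3, feat_dim, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # one projection head per region, so each region is embedded independently
        self.heads = nn.ModuleList(nn.Linear(feat_dim, id_dim) for _ in range(num_regions))

    def forward(self, regions: list[torch.Tensor]) -> torch.Tensor:
        # regions: num_regions crops, each (batch, 3, H, W)
        tokens = [head(self.backbone(r)) for r, head in zip(regions, self.heads)]
        return torch.stack(tokens, dim=1)       # (batch, num_regions, id_dim)
```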
@misc{nam2025visualpersonafoundationmodel,title={Visual Persona: Foundation Model for Full-Body Human Customization},author={Nam, Jisu and Son, Soowon and Xu, Zhan and Shi, Jing and Liu, Difan and Liu, Feng and Misraa, Aashish and Kim, Seungryong and Zhou, Yang},year={2025},eprint={2503.15406},archiveprefix={arXiv},primaryclass={cs.CV},url={https://arxiv.org/abs/2503.15406},note={a version published at CVPR 2025},}
Most real-world applications of image retrieval, such as Adobe Stock (a marketplace for stock photography and illustrations), need a way for users to find images that are both visually (i.e. aesthetically) and conceptually (i.e. containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well-studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual-embedding-based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures rich information through node neighborhoods. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference-time control, based on selective neighborhood connectivity, that gives the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset, and qualitatively on MS-COCO and an Adobe Stock dataset.
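The abstract does not spell out how "selective neighborhood connectivity" works at inference; one plausible reading is that the user chooses which neighbor modalities may contribute to message passing. A minimal sketch under that assumption (the function, its arguments, and the modality labels are all hypothetical):

```python
import torch

def aggregate(node_feats, adj, node_modality, keep_modality):
    """Sketch: one mean-aggregation message-passing step where only neighbors
    of user-selected modalities ('image' vs 'concept') contribute."""
    # adj: (N, N) 0/1 adjacency; node_modality: length-N list of modality strings
    mask = torch.tensor([m in keep_modality for m in node_modality],
                        dtype=node_feats.dtype)
    adj = adj * mask.unsqueeze(0)                      # drop edges into unselected modalities
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)  # guard isolated nodes
    return adj @ node_feats / deg                      # mean over the selected neighborhood

# hypothetical usage: aggregate(feats, adj, mods, keep_modality={"concept"})
```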
@misc{misraa2020multimodalretrievalusinggraph,title={Multi-Modal Retrieval using Graph Neural Networks},author={Misraa, Aashish Kumar and Kale, Ajinkya and Aggarwal, Pranav and Aminian, Ali},year={2020},eprint={2010.01666},archiveprefix={arXiv},primaryclass={cs.IR},url={https://arxiv.org/abs/2010.01666},}
Paper
Waymo Driverless Car Data Analysis and Driving Modeling using CNN and LSTM
Self-driving cars have been the biggest innovation in the automotive industry, but achieving human-level or near-human-level accuracy is the biggest challenge research scientists face today. Unlike humans, autonomous vehicles do not act on instinct; they make decisions based on the training data fed to their machine learning models, which then guide them through the conditions they face in the real world. With advancements in machine learning, especially deep learning, self-driving car research has skyrocketed. In this project we present multiple ways to predict the acceleration of an autonomous vehicle using Waymo's open dataset. Our main approach uses a CNN to mimic human actions and an LSTM to treat this as a time-series problem.
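A minimal sketch of the kind of CNN-plus-LSTM architecture the abstract describes: per-frame CNN features feed an LSTM that regresses acceleration. The layer sizes and the single-scalar output are assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class DrivingModel(nn.Module):
    """Sketch: per-frame CNN features -> LSTM -> acceleration regression."""

    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # scalar acceleration

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) camera sequence
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)             # (batch, time, hidden)
        return self.head(out[:, -1])          # predict acceleration at the last frame
```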
@article{2025arXiv250501446M,author={Misraa, Aashish Kumar and Jain, Naman and Dhakad, Saurav Singh},title={Waymo Driverless Car Data Analysis and Driving Modeling using CNN and LSTM},year={2020},eprint={2505.01446},note={This work contributed to research acknowledged in the MDPI Journal of Applied Sciences, https://www.mdpi.com/2076-3417/10/6/2046},}
2017
Paper
An automatic detection of helmeted and non-helmeted motorcyclist with license plate extraction using convolutional neural network
Detection of helmeted and non-helmeted motorcyclists is mandatory nowadays in order to ensure the safety of riders on the road. However, due to constraints such as poor video quality, occlusion, illumination, and other varying factors, it is very difficult to detect them accurately. In this paper, we introduce an approach for automatic detection of helmeted and non-helmeted motorcyclists using a convolutional neural network (CNN). Over the past several years, advancements in deep learning models have drastically improved the performance of object detection. One such model is YOLOv2 [1], which combines classification and object detection in a single architecture. Here, we use YOLOv2 in two successive stages to improve helmet detection accuracy. In the first stage, a YOLOv2 model detects objects in the test image; since this model is trained on the COCO dataset, it can detect all COCO classes. In the proposed approach, we detect the person class instead of the motorcycle class to increase the accuracy of helmet detection in the input image. The cropped images of detected persons are fed to the second YOLOv2 stage, which was trained on our dataset of helmeted images. Non-helmeted images are processed further to extract the license plate using OpenALPR. The proposed approach thus uses two different datasets, i.e., the COCO dataset and our helmet dataset. We tested the potential of our approach on various helmeted and non-helmeted images. Experimental results show that the proposed method outperforms other existing approaches, achieving 94.70% helmet detection accuracy.
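The two-stage pipeline can be summarized in a short control-flow sketch. The callables detect_objects, detect_helmet, and read_plate are hypothetical stand-ins for the COCO-trained YOLOv2, the helmet-trained YOLOv2, and OpenALPR respectively; they are passed in so the sketch stays self-contained:

```python
def process_frame(image, detect_objects, detect_helmet, read_plate):
    """Sketch of the two-stage helmet-detection pipeline (helpers are hypothetical)."""
    results = []
    for box in detect_objects(image, classes=["person"]):  # stage 1: person detection
        crop = image[box.y1:box.y2, box.x1:box.x2]          # crop each detected rider
        if detect_helmet(crop):                             # stage 2: helmet-trained detector
            results.append((box, "helmeted", None))
        else:                                               # no helmet: run plate extraction
            results.append((box, "non-helmeted", read_plate(image)))
    return results
```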
@inproceedings{misraa2017helmet,author={Mistry, Jimit and Misraa, Aashish K. and Agarwal, Meenu and Vyas, Ayushi and Chudasama, Vishal M. and Upla, Kishor P.},booktitle={2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA)},title={An automatic detection of helmeted and non-helmeted motorcyclist with license plate extraction using convolutional neural network},year={2017},volume={},number={},pages={1-6},keywords={Licenses;Motorcycles;Feature extraction;Training;Detectors;Testing;Cameras;helmet detection;YOLOv2;license plate extraction;COCO},doi={10.1109/IPTA.2017.8310092},issn={2154-512X},month=nov,}
Paper
An analysis of non-immigrant work visas in the USA using Machine Learning
Dhanasekar Sundararaman, Nabarun Pal, and Aashish Kumar Misraa
High-skilled immigrants are a very important factor in US innovation and entrepreneurship, accounting for roughly a quarter of US workers in fields such as computer science and delivering in terms of patents and firm starts. Their contributions to the US have increased rapidly over the past three decades, and they are found to be, on average, better trained and more skilled than their native counterparts. While the impact of these high-skilled workers is significant, the way in which they compete to enter a tech hub like the US is rather unfair. The H-1B, the work visa used to import high-skilled workers, is in many cases no longer used for high-skilled labor but rather to import cheap labor that displaces native workers. Many billionaires, experts, pundits, and even the government are seeking amendments to the H-1B program to curb this, for example by introducing a merit system or raising the minimum wage required to award these visas. We analyze the petitions filed from 2011 to 2016 and classify each petition as positive or negative, indicating whether it is highly skilled or not. After classifying, we build a Random Forest model to predict whether any visa petition in any US state is positive or negative. Experimental results show that the companies classified as abusing these visas (negative) are consistent with those named in reports and news articles.
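A minimal sketch of the Random Forest classification step described above; the CSV filename, column names, and label encoding are assumptions about the petition data, not the paper's actual schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical petition table with categorical employer/job/state columns,
# a numeric wage column, and a 0/1 skill label (1 = positive, 0 = negative).
df = pd.read_csv("h1b_petitions_2011_2016.csv")
X = pd.get_dummies(df[["employer", "job_title", "state"]]).join(df[["wage"]])
y = df["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```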
@article{sundararaman2017analysis,author={Sundararaman, Dhanasekar and Pal, Nabarun and Misraa, Aashish Kumar},title={An analysis of non-immigrant work visas in the {USA} using Machine Learning},journal={Int. J. Comput. Sci. Secur. (IJCSS)},volume={6},year={2017},}