Subject-driven image generation aims to synthesize new images that preserve a given subject's identity while following textual instructions. Existing approaches often encode text and reference images separately, which limits multimodal reasoning and causes copy-paste artifacts; recent multimodal diffusion frameworks improve instruction following but largely overlook identity preservation. We propose a conditioning framework based on Multimodal Large Language Models (MLLMs) that jointly encodes text and reference images, augmented with VAE-based identity conditioning. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features into an effective conditioning signal, and a multi-stage denoising strategy progressively balances MLLM-driven semantics against VAE-driven fine-detail refinement. Experiments demonstrate that our approach harmonizes multimodal features, mitigates copy-paste issues, and achieves superior semantic understanding and identity preservation in subject-driven image generation.
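The abstract names a Dual Layer Aggregation module but not its internals; below is a minimal sketch of one plausible way to fuse multi-level MLLM features into a single conditioning signal. The class name, the learned softmax weighting over layers, and the tapped layer indices are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class MultiLevelAggregation(nn.Module):
    """Sketch: fuse hidden states tapped from several MLLM layers into one
    conditioning sequence via learned per-layer weights (assumed design)."""

    def __init__(self, hidden_dim: int, num_layers: int, cond_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one logit per tapped layer
        self.proj = nn.Linear(hidden_dim, cond_dim)                # map to the denoiser's cond space

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: num_layers tensors, each (batch, seq, hidden_dim)
        stacked = torch.stack(hidden_states, dim=0)                # (L, B, S, H)
        weights = torch.softmax(self.layer_logits, dim=0)          # convex combination over layers
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, S, H)
        return self.proj(fused)                                    # (B, S, cond_dim)

# hypothetical usage: feats = [mllm_out.hidden_states[i] for i in (8, 16, 24)]
```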
@article{zheng2025dual,title={DLA: Dual Layer Aggregation for Squeezing Capacity of Multimodal Large Language Models for Subject-driven Generation},author={Zheng, Shuhong and Misraa, Aashish Kumar and Li, Yu-Teng and Li, Yu-Jhe and Gilitschenski, Igor},year={2025},note={Under Review},}
Paper
Visual Persona: Foundation Model for Full-Body Human Customization
Jisu Nam, Soowon Son, Zhan Xu, Jing Shi, Difan Liu, Feng Liu, Aashish Misraa, Seungryong Kim, and Yang Zhou
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which partitions the input image into distinct body regions, encodes these regions as local appearance features, and projects them independently into dense identity embeddings to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
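As a rough illustration of the region-wise identity encoding described above, here is a minimal sketch. The convolutional stand-in backbone, the per-region linear heads, and all dimensions are assumptions for brevity; the paper itself uses a transformer encoder-decoder:

```python
import torch
import torch.nn as nn

class RegionIdentityEncoder(nn.Module):
    """Sketch: encode body-region crops independently and stack the results
    into a dense identity-embedding sequence for diffusion conditioning."""

    def __init__(self, feat_dim: int, id_dim: int, num_regions: int):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for the paper's encoder
            nn.Conv2d(3, feat_dim, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # one projection head per region, so each region is embedded independently
        self.heads = nn.ModuleList(nn.Linear(feat_dim, id_dim) for _ in range(num_regions))

    def forward(self, regions: list[torch.Tensor]) -> torch.Tensor:
        # regions: num_regions crops, each (batch, 3, H, W)
        tokens = [head(self.backbone(r)) for r, head in zip(regions, self.heads)]
        return torch.stack(tokens, dim=1)       # (batch, num_regions, id_dim)
```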
@misc{nam2025visualpersonafoundationmodel,title={Visual Persona: Foundation Model for Full-Body Human Customization},author={Nam, Jisu and Son, Soowon and Xu, Zhan and Shi, Jing and Liu, Difan and Liu, Feng and Misraa, Aashish and Kim, Seungryong and Zhou, Yang},year={2025},eprint={2503.15406},archiveprefix={arXiv},primaryclass={cs.CV},url={https://arxiv.org/abs/2503.15406},note={a version published at CVPR 2025},}
Most real-world applications of image retrieval, such as Adobe Stock (a marketplace for stock photography and illustrations), need a way for users to find images that are both visually (i.e. aesthetically) and conceptually (i.e. containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well-studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual-embedding-based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures rich information through node neighborhoods. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference-time control, based on selective neighborhood connectivity, that gives the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset, and qualitatively on MS-COCO and an Adobe Stock dataset.
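The abstract does not spell out how "selective neighborhood connectivity" works at inference; one plausible reading is that the user chooses which neighbor modalities may contribute to message passing. A minimal sketch under that assumption (the function, its arguments, and the modality labels are all hypothetical):

```python
import torch

def aggregate(node_feats, adj, node_modality, keep_modality):
    """Sketch: one mean-aggregation message-passing step where only neighbors
    of user-selected modalities ('image' vs 'concept') contribute."""
    # adj: (N, N) 0/1 adjacency; node_modality: length-N list of modality strings
    mask = torch.tensor([m in keep_modality for m in node_modality],
                        dtype=node_feats.dtype)
    adj = adj * mask.unsqueeze(0)                      # drop edges into unselected modalities
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)  # guard isolated nodes
    return adj @ node_feats / deg                      # mean over the selected neighborhood

# hypothetical usage: aggregate(feats, adj, mods, keep_modality={"concept"})
```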
@misc{misraa2020multimodalretrievalusinggraph,title={Multi-Modal Retrieval using Graph Neural Networks},author={Misraa, Aashish Kumar and Kale, Ajinkya and Aggarwal, Pranav and Aminian, Ali},year={2020},eprint={2010.01666},archiveprefix={arXiv},primaryclass={cs.IR},url={https://arxiv.org/abs/2010.01666},}
Paper
Waymo Driverless Car Data Analysis and Driving Modeling using CNN and LSTM
Self-driving cars have been the biggest innovation in the automotive industry, but achieving human-level or near-human-level accuracy is the biggest challenge research scientists face today. Unlike humans, autonomous vehicles do not act on instinct; they make decisions based on the training data fed to their machine learning models, which then guide them through the conditions they face in the real world. With advancements in machine learning, especially deep learning, self-driving car research has skyrocketed. In this project we present multiple ways to predict the acceleration of an autonomous vehicle using Waymo's open dataset. Our main approach uses a CNN to mimic human actions and an LSTM to treat this as a time-series problem.
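A minimal sketch of the kind of CNN-plus-LSTM architecture the abstract describes: per-frame CNN features feed an LSTM that regresses acceleration. The layer sizes and the single-scalar output are assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class DrivingModel(nn.Module):
    """Sketch: per-frame CNN features -> LSTM -> acceleration regression."""

    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # scalar acceleration

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) camera sequence
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)             # (batch, time, hidden)
        return self.head(out[:, -1])          # predict acceleration at the last frame
```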
@article{2025arXiv250501446M,author={Misraa, Aashish Kumar and Jain, Naman and Dhakad, Saurav Singh},title={Waymo Driverless Car Data Analysis and Driving Modeling using CNN and LSTM},year={2020},eprint={2505.01446},note={This work contributed to research acknowledged in the MDPI Journal of Applied Sciences, https://www.mdpi.com/2076-3417/10/6/2046},}
2017
Paper
An automatic detection of helmeted and non-helmeted motorcyclist with license plate extraction using convolutional neural network
Detection of helmeted and non-helmeted motorcyclists is mandatory nowadays in order to ensure the safety of riders on the road. However, due to constraints such as poor video quality, occlusion, illumination, and other varying factors, it is very difficult to detect them accurately. In this paper, we introduce an approach for automatic detection of helmeted and non-helmeted motorcyclists using a convolutional neural network (CNN). Over the past several years, advancements in deep learning models have drastically improved the performance of object detection. One such model is YOLOv2 [1], which combines classification and object detection in a single architecture. Here, we use YOLOv2 in two successive stages to improve helmet detection accuracy. In the first stage, a YOLOv2 model detects objects in the test image; since this model is trained on the COCO dataset, it can detect all COCO classes. In the proposed approach, we detect the person class instead of the motorcycle class to increase the accuracy of helmet detection in the input image. The cropped images of detected persons are fed to the second YOLOv2 stage, which was trained on our dataset of helmeted images. Non-helmeted images are processed further to extract the license plate using OpenALPR. The proposed approach thus uses two different datasets, i.e., the COCO dataset and our helmet dataset. We tested the potential of our approach on various helmeted and non-helmeted images. Experimental results show that the proposed method outperforms other existing approaches, achieving 94.70% helmet detection accuracy.
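The two-stage pipeline can be summarized in a short control-flow sketch. The callables detect_objects, detect_helmet, and read_plate are hypothetical stand-ins for the COCO-trained YOLOv2, the helmet-trained YOLOv2, and OpenALPR respectively; they are passed in so the sketch stays self-contained:

```python
def process_frame(image, detect_objects, detect_helmet, read_plate):
    """Sketch of the two-stage helmet-detection pipeline (helpers are hypothetical)."""
    results = []
    for box in detect_objects(image, classes=["person"]):  # stage 1: person detection
        crop = image[box.y1:box.y2, box.x1:box.x2]          # crop each detected rider
        if detect_helmet(crop):                             # stage 2: helmet-trained detector
            results.append((box, "helmeted", None))
        else:                                               # no helmet: run plate extraction
            results.append((box, "non-helmeted", read_plate(image)))
    return results
```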
@inproceedings{misraa2017helmet,author={Mistry, Jimit and Misraa, Aashish K. and Agarwal, Meenu and Vyas, Ayushi and Chudasama, Vishal M. and Upla, Kishor P.},booktitle={2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA)},title={An automatic detection of helmeted and non-helmeted motorcyclist with license plate extraction using convolutional neural network},year={2017},volume={},number={},pages={1-6},keywords={Licenses;Motorcycles;Feature extraction;Training;Detectors;Testing;Cameras;helmet detection;YOLOv2;license plate extraction;COCO},doi={10.1109/IPTA.2017.8310092},issn={2154-512X},month=nov,}
Paper
An analysis of non-immigrant work visas in the USA using Machine Learning
Dhanasekar Sundararaman, Nabarun Pal, and Aashish Kumar Misraa
High-skilled immigrants are a very important factor in US innovation and entrepreneurship, accounting for roughly a quarter of US workers in fields such as computer science and delivering in terms of patents and firm starts. Their contributions to the US have increased rapidly over the past three decades, and they are found to be, on average, better trained and more skilled than their native counterparts. While the impact of these high-skilled workers is significant, the way in which they compete to enter a tech hub like the US is rather unfair. The H-1B, the work visa used to import high-skilled workers, is in many cases no longer used for high-skilled labor but rather to import cheap labor that displaces native workers. Many billionaires, experts, pundits, and even the government are seeking amendments to the H-1B program to curb this, for example by introducing a merit system or raising the minimum wage required to award these visas. We analyze the petitions filed from 2011 to 2016 and classify each petition as positive or negative, indicating whether it is highly skilled or not. After classifying, we build a Random Forest model to predict whether any visa petition in any US state is positive or negative. Experimental results show that the companies classified as abusing these visas (negative) are consistent with those named in reports and news articles.
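A minimal sketch of the Random Forest classification step described above; the CSV filename, column names, and label encoding are assumptions about the petition data, not the paper's actual schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical petition table with categorical employer/job/state columns,
# a numeric wage column, and a 0/1 skill label (1 = positive, 0 = negative).
df = pd.read_csv("h1b_petitions_2011_2016.csv")
X = pd.get_dummies(df[["employer", "job_title", "state"]]).join(df[["wage"]])
y = df["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```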
@article{sundararaman2017analysis,author={Sundararaman, Dhanasekar and Pal, Nabarun and Misraa, Aashish Kumar},title={An analysis of non-immigrant work visas in the {USA} using Machine Learning},journal={Int. J. Comput. Sci. Secur. (IJCSS)},volume={6},year={2017},}