Bachelor Thesis · Data Science · HCMIU – VNU

Segmented Objects, Guided by Prompts

A Computer Vision framework that uses natural language prompts to detect, segment, and apply neural style transfer to individual objects — not the whole image. Combines GroundingDINO · SAM2 · S2WAT into a unified pipeline.

Author: Nguyen Hong Son
Supervisor: Dr. Le Thi Ngoc Hanh
Institution: HCMIU · VNU-HCM
Year: 2025
Pipeline stages: 3
Art styles: 5
Per object: ~10s
Objects / image
Core CV Models
01 🤖 Google Gemini · Prompt parsing · LLM
02 🔗 CLIP · Cross-modal style retrieval · CV
03 🔍 GroundingDINO · Open-set object detection · CV
04 ✂️ SAM 2 · Instance segmentation · CV
05 🎨 S2WAT · Neural style transfer · ViT · CV
06 🧩 Alpha Compositing · Mask-guided blending · CV
For CV Engineers

Key Computer Vision Components

👁 CV Pipeline · Open-vocab detection · Prompt-conditioned segmentation · ViT-based style transfer · CLIP cross-modal retrieval · Alpha-composite blending
🔍
GroundingDINO — Open-Set Detection
Transformer-based detector that fuses vision + language. Takes arbitrary text phrases and outputs bounding boxes. Localizes objects mentioned in the user prompt with high accuracy, including occluded or partially visible objects.
Architecture: DINO detector + text-guided grounding · Output: bounding boxes; their centers (cx, cy) are passed downstream
✂️
SAM2 — Segment Anything Model v2
Segmentation model that takes the box center coordinates from GroundingDINO as spatial prompts. Produces high-quality binary masks and salient object images. Handles irregular shapes and partial occlusion without retraining.
Input: box center coords · Output: binary mask (Seg-O) + salient object image (SLO)
🎨
S2WAT — Strips Window Attention Transformer
Hierarchical Vision Transformer for neural style transfer. Uses horizontal/vertical strip windows + square patches to capture both long-range context and fine-grained textures. Adaptive merging avoids grid artifacts.
Architecture: ViT + strip attention · Loss: content + Gram-matrix style + identity
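A minimal sketch of the Gram-matrix style term named above, using NumPy on a single feature map. The normalization by C·H·W, the single-layer setup, and the function names are assumptions for illustration; the feature layers and loss weights used in the thesis are not stated here.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W (assumed)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)          # one row per channel
    return flat @ flat.T / (c * h * w)         # (C, C) channel correlations

def gram_style_loss(stylized: np.ndarray, style: np.ndarray) -> float:
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(stylized) - gram_matrix(style)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
style_feat = rng.normal(size=(8, 16, 16))
flipped = style_feat[:, ::-1, ::-1]            # same channel statistics, different layout
other = rng.normal(size=(8, 16, 16)) * 3.0     # different channel statistics

loss_same_stats = gram_style_loss(flipped, style_feat)
loss_other = gram_style_loss(other, style_feat)
```

Because the Gram matrix sums over spatial positions, the flipped map scores a near-zero loss while the rescaled map does not. This is why the term captures texture rather than layout.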
🔗
CLIP — Cross-Modal Style Retrieval
Contrastive Language-Image Pre-training maps text and images into a shared embedding space. Used offline to retrieve the closest style reference via cosine similarity. Open-vocabulary style queries without task-specific retraining.
Similarity(s,i) = (s·i) / (‖s‖·‖i‖) · Open-vocabulary, no retraining
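The similarity formula above can be sketched with NumPy as follows. The 3-dimensional vectors stand in for real CLIP embeddings, and `retrieve_style` and `style_bank` are hypothetical names, not part of the thesis code.

```python
import numpy as np

def cosine_similarity(s: np.ndarray, i: np.ndarray) -> float:
    """Similarity(s, i) = (s . i) / (||s|| * ||i||)."""
    return float(np.dot(s, i) / (np.linalg.norm(s) * np.linalg.norm(i)))

def retrieve_style(text_emb: np.ndarray, image_embs: dict[str, np.ndarray]) -> str:
    """Return the name of the style image most similar to the text embedding."""
    return max(image_embs, key=lambda name: cosine_similarity(text_emb, image_embs[name]))

# Toy embeddings standing in for CLIP image features of the style folder.
style_bank = {
    "sketch": np.array([1.0, 0.1, 0.0]),
    "cubist": np.array([0.0, 1.0, 0.2]),
    "cyberpunk": np.array([0.1, 0.0, 1.0]),
}
query = np.array([0.2, 2.0, 0.3])   # toy text embedding for a "cubist" query
best = retrieve_style(query, style_bank)
```

Cosine similarity ignores vector length, so the unnormalized query still matches the "cubist" entry.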
🧩
Mask-Guided Alpha Compositing
Pixel-level blending using SAM2's salient mask. For each pixel (x,y): if mask=255 → use stylized pixel; else → keep original background. Applied recursively per object in sequence.
I_blend(x,y) = I_styled(x,y) if mask=255 else I_base(x,y)
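The per-pixel rule above can be written as a single vectorized selection. This is a sketch assuming (H, W, 3) uint8 images and an (H, W) mask with values 0 or 255; `blend` is an illustrative name.

```python
import numpy as np

def blend(base: np.ndarray, styled: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """I_blend = I_styled where mask == 255, otherwise I_base."""
    selected = (mask == 255)[..., None]     # (H, W, 1), broadcasts over channels
    return np.where(selected, styled, base)

base = np.zeros((2, 2, 3), dtype=np.uint8)          # original image (all black)
styled = np.full((2, 2, 3), 200, dtype=np.uint8)    # stylized image (all gray)
mask = np.array([[255, 0], [0, 255]], dtype=np.uint8)
out = blend(base, styled, mask)
```

With a binary mask this is hard selection (alpha is effectively 0 or 1), so no pixel values are mixed at the object boundary.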
End-to-End Modular Pipeline
LLM → CLIP retrieval → GroundingDINO → SAM2 → S2WAT → alpha-composite blending. Each stage independently swappable. Scales linearly with object count: ~10s/object. Fully local except prompt parsing.
Runtime: ~10.44s (1 obj) → ~42.30s (4 objs) · Linear O(n) scaling
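One way to picture "independently swappable" stages is a driver that receives each stage as a callable. The sketch below is an assumption about structure, not the thesis implementation: `run_pipeline` and all stage signatures are hypothetical, and the stubs operate on a toy dictionary instead of real images.

```python
from typing import Any, Callable

def run_pipeline(prompt: str, image: Any, *, parse: Callable, retrieve: Callable,
                 detect: Callable, segment: Callable, stylize: Callable,
                 blend: Callable) -> Any:
    """Run every (object, style) pair in sequence; each result becomes the next base."""
    result = image
    for obj, style in parse(prompt):
        style_ref = retrieve(style)                 # CLIP retrieval
        location = detect(result, obj)              # GroundingDINO
        mask, salient = segment(result, location)   # SAM2
        styled = stylize(salient, style_ref)        # S2WAT
        result = blend(result, styled, mask)        # mask-guided compositing
    return result

# Toy stubs: the "image" maps region names to a description of their look.
scene = {"dog": "original", "cat": "original", "sky": "original"}
output = run_pipeline(
    "a dog in sketch style, a cat in cubist style",
    scene,
    parse=lambda p: [("dog", "sketch"), ("cat", "cubist")],
    retrieve=lambda style: f"{style}.jpg",
    detect=lambda img, obj: obj,
    segment=lambda img, loc: (loc, img[loc]),
    stylize=lambda salient, ref: f"{salient}+{ref}",
    blend=lambda base, styled, mask: {**base, mask: styled},
)
```

Swapping a stage then means passing a different callable with the same inputs and outputs, for example a local LLM in place of the Gemini-backed `parse`.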
System Design

3-Stage Pipeline

pipeline("I want a cat in sketch style, a dog in cubist style", content_image) → stylized_output
INPUT · 🖼️ Content Image · + text prompt
LLM · 🤖 Gemini · Prompt parsing
CV · STAGE 1 · 🔗 CLIP Retrieval · Style image match
CV · STAGE 1 · 🔍 GroundingDINO · Object detection
CV · STAGE 2 · ✂️ SAM2 · Segmentation
CV · STAGE 2 · 🎨 S2WAT · Style transfer
CV · STAGE 3 · 🧩 Blending · Alpha composite
OUTPUT · Stylized Image · Per-object styles
Stage 01
🤖
Prompt Processing
Google Gemini interprets the user's natural language prompt and extracts structured object–style pairs. Handles compound instructions with multiple objects, attributes, and styles.
Google Gemini · NLP · Multimodal LLM
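To make "structured object–style pairs" concrete, the sketch below parses a reply into typed pairs. It assumes the LLM is instructed to answer with a JSON array of `{"object": ..., "style": ...}` items; that schema, `ObjectStyle`, and `parse_pairs` are hypothetical, and the Gemini call itself is omitted.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectStyle:
    obj: str
    style: str

def parse_pairs(llm_reply: str) -> list[ObjectStyle]:
    """Parse a JSON array of {"object", "style"} items into normalized pairs."""
    items = json.loads(llm_reply)
    return [
        ObjectStyle(item["object"].strip().lower(), item["style"].strip().lower())
        for item in items
    ]

# Hypothetical reply for: "I want a dog by sketch, a cat by cyberpunk,
# and a human by cubist style"
reply = (
    '[{"object": "dog", "style": "sketch"},'
    ' {"object": "cat", "style": "Cyberpunk "},'
    ' {"object": "human", "style": "cubist"}]'
)
pairs = parse_pairs(reply)
```

The list order is preserved, which matters later because blending is applied object by object in sequence.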
Stage 01
🔗
Style Image Retrieval
CLIP encodes both the style text and all images in the style folder into a shared embedding space. Cosine similarity selects the best-matching style reference.
CLIP · Cosine Similarity · Cross-modal Retrieval
Stage 01
🔍
Object Detection
GroundingDINO outputs bounding box center coordinates (cxcy). Open-set detection handles arbitrary objects. Robust to occlusion, blur, and partial visibility.
GroundingDINO · Vision Transformer · Open-vocab
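Detector outputs usually have to be converted to pixel coordinates before they can serve as spatial prompts. The sketch below assumes boxes arrive as normalized (cx, cy, w, h), which is the format the reference GroundingDINO inference code returns; `cxcywh_to_pixels` is an illustrative helper, and whether the thesis passes centers, corners, or both is not specified beyond the description above.

```python
import numpy as np

def cxcywh_to_pixels(boxes, width: int, height: int):
    """Convert normalized (cx, cy, w, h) boxes to pixel centers and xyxy corners."""
    boxes = np.asarray(boxes, dtype=float)
    cx, cy, w, h = boxes.T
    centers = np.stack([cx * width, cy * height], axis=1)
    corners = np.stack(
        [(cx - w / 2) * width, (cy - h / 2) * height,
         (cx + w / 2) * width, (cy + h / 2) * height],
        axis=1,
    )
    return centers, corners

# One box centered in a 640x480 image, covering 20% of width and 40% of height.
centers, corners = cxcywh_to_pixels([[0.5, 0.5, 0.2, 0.4]], width=640, height=480)
```

A center point alone can be ambiguous when the object is hollow or occluded at its middle, which is one reason a full box is often used as the prompt instead.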
Stage 02
✂️
Object Segmentation
SAM2 receives cxcy and produces a binary segmentation mask (Seg-O) plus salient object image (SLO). Handles fine boundaries and partial occlusions without retraining.
SAM2 · Instance Segmentation · Binary Masking
Stage 02
🎨
Neural Style Transfer
S2WAT applies the retrieved style image to the segmented object. Multi-shape attention windows capture both local textures and global structure.
S2WAT · Vision Transformer · Gram Matrix Loss
Stage 03
🧩
Blending & Composition
Mask-guided alpha compositing places each stylized object back into the scene. Applied recursively: each blend result becomes the base for the next object.
Alpha Compositing · Mask-guided Blending · Recursive
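The "each blend result becomes the base for the next object" rule is a left fold over the objects. The sketch below uses toy 1×3 images and hypothetical names (`blend_step`, `composite`) to show that, where masks overlap, the object processed last wins.

```python
from functools import reduce
import numpy as np

def blend_step(base: np.ndarray, layer) -> np.ndarray:
    """Paste one stylized object onto the current base using its 0/255 mask."""
    styled, mask = layer
    return np.where((mask == 255)[..., None], styled, base)

def composite(base: np.ndarray, layers) -> np.ndarray:
    """Fold the layers onto the base in order; each result is the next base."""
    return reduce(blend_step, layers, base)

base = np.zeros((1, 3, 3), dtype=np.uint8)
dog = (np.full((1, 3, 3), 50, np.uint8), np.array([[255, 255, 0]], np.uint8))
cat = (np.full((1, 3, 3), 90, np.uint8), np.array([[0, 255, 255]], np.uint8))

dog_then_cat = composite(base, [dog, cat])
cat_then_dog = composite(base, [cat, dog])
```

The two orders differ only at the middle pixel, where both masks are set, so processing order matters only for overlapping objects.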
Evaluation

Results & Demo Outputs

🐻 Input Bear → Segmented → Newspaper
Prompt: "draw new bear with newspaper style"
High-contrast monochrome; halftone texture. Single-object single-style. Clean segmentation with background preserved.
Style: newspaper
🦊 Input Fox → Segmented → Cubist
Prompt: "a new fox with cubist style"
Geometric fragmentation, angular planes, vibrant color contrasts from Picasso's La Muse. Background remains natural.
Style: cubist / picasso
👧 Input Girl → Segmented → Sketch
Prompt: "style transfer a girl by sketch style"
Fine linework, cross-hatching, minimalist monochrome. Retains facial structure and expression during transfer.
Style: sketch
🐕 Sketch · 🐱 Cyberpunk · 🧑 Cubist
Prompt: "I want a dog by sketch, a cat by cyberpunk, and a human by cubist style"
3-object composition. Each subject gets a fully independent style. Blending is cumulative: dog → cat → human.
Styles: sketch · cyberpunk · cubist
🐕 Dog Input → Sketch · 🐱 Cat Input → Cubist
Prompt: "I want a dog drawn by sketch style, and a cat drawn by cubist style"
Sequential multi-object processing. Dog stylized first → becomes base → cat detected & stylized independently.
Styles: sketch · cubist
🐯 Tiger (occluded) → Partial Seg. → Cubist
Prompt: "I want to style transfer tiger on cubist style"
Tiger partially hidden behind grass. GroundingDINO + SAM2 still correctly isolate the visible portions.
Style: cubist · occlusion handled
Performance

Runtime Benchmarks

~10s
per object · avg single-style
O(n)
linear scaling with object count
Runtime scales linearly with object count. Each object adds ~10.5s overhead (segmentation + style transfer + blending).

Gemini API is the only online dependency. All CV stages run fully local.
Scenario | # Styles | # Objects | Runtime | Complexity
Bear + Newspaper | 1 | 1 | 10.44s | Single object
Fox + Cubist | 1 | 1 | 10.55s | Single object
Girl + Sketch | 1 | 1 | 11.18s | Human subject
Girl + Cubist | 1 | 1 | 12.17s | Human subject
Tiger + Cubist (occluded) | 1 | 1 | 11.18s | Partial visibility
Panther + Newspaper (blurred) | 1 | 1 | 11.14s | Partial visibility
Dog + Cat (sketch + cubist) | 2 | 2 | 20.65s | Multi-object ×2
Dog + Human (sketch + cubist) | 2 | 2 | 21.19s | Multi-object ×2
Dog + Human v2 | 2 | 2 | 20.67s | Multi-object ×2
Dog + Cat + Human (3 styles) | 3 | 3 | 32.84s | Multi-object ×3
4-object multi-style composition | 4 | 4 | 42.30s | Multi-object ×4
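The linear-scaling claim can be checked against the benchmark rows with an ordinary least-squares line. The sketch below copies the object counts and runtimes from the table; the choice of `np.polyfit` is ours and is not described in the thesis.

```python
import numpy as np

# (objects, runtime in seconds) for the eleven benchmark scenarios.
objects = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4])
runtime = np.array([10.44, 10.55, 11.18, 12.17, 11.18, 11.14,
                    20.65, 21.19, 20.67, 32.84, 42.30])

# Fit runtime = slope * objects + intercept.
slope, intercept = np.polyfit(objects, runtime, deg=1)
residuals = runtime - (slope * objects + intercept)

print(f"{slope:.2f} s per object, {intercept:.2f} s fixed cost")
print(f"largest deviation from the line: {np.abs(residuals).max():.2f} s")
```

The fitted slope comes out near 10.5 s per object with a small intercept, consistent with the per-object cost quoted above. With only one 3-object and one 4-object run, the fit is indicative rather than conclusive.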
Known Issues

Limitations & Failure Cases

⚠ Incomplete Human Segmentation
SAM2 sometimes segments only skin regions of human subjects, excluding hair and clothing. Stylization is applied to fragmented areas only, breaking visual coherence.
⚠ Partial Object Capture
When an object spans multiple segmentation passes, different sub-regions may receive different style assignments, causing conflicting artistic representations.
⚠ Wrong Region Targeting
GroundingDINO occasionally mislocalizes an object, so that, for example, the sea or ground is segmented instead of a lion. Ambiguous center coordinates then cause the style to be applied to irrelevant regions.
⚠ Fine Boundary Loss
Small or low-contrast regions (e.g., a cat's dark tail against dark background) may be clipped during SAM2 mask generation. Fine-grained boundary details are hard to preserve.
⚠ Online Dependency
The Gemini API call for prompt parsing requires internet connectivity. All downstream CV stages are local, but the LLM step prevents fully offline deployment.
⚠ Limited Style Library
Style retrieval is bounded by the pre-curated folder (5 styles). Styles not represented will fall back to the nearest match, potentially misrepresenting user intent.
Roadmap

Future Work

01
Offline LLM
Replace Gemini with a local fine-tuned small LLM or rule-based NLP. Enables fully offline, edge-deployable pipeline.
02
Better Segmentation
Integrate whole-body person segmentation (DensePose or HRNet) to fix the skin-only SAM2 limitation. Add multi-pass refinement steps.
03
Diffusion-based Transfer
Explore diffusion models (ControlNet, IP-Adapter) for style transfer to improve texture fidelity at object boundaries.
04
Gradient-Domain Blending
Replace alpha compositing with Poisson image editing or GAN-based inpainting at boundaries to reduce visible seam artifacts.
05
Spatial Prompts
Support spatial positioning ("object on the left"), style composition (mixing two styles on one object), and multi-image style prompts.
06
Real-time Optimization
Model distillation and quantization to bring per-object runtime from ~10s toward near-real-time for interactive / video applications.
Style Library

5 Reference Styles

🖼️
La Muse
Cubist / Picasso
Geometric abstraction, angular planes, multiple viewpoints, vibrant earth tones
🌌
Starry Night
Van Gogh
Swirling brushstrokes, deep blues & yellows, rhythmic movement, emotional intensity
📰
Newspaper
Print Media
High-contrast monochrome, halftone dots, texture over color, vintage aesthetic
🌆
Cyberpunk
Neon / Futuristic
Electric blues, magentas, neon highlights, high saturation, digital textures
✏️
Sketch
Pencil / Linework
Fine lines, cross-hatching, monochrome, minimal — emphasizes contour over color
Comparison

vs Related Methods

Method | Object-Level | Prompt-Guided | Multi-Style | Open-Vocab | Speed
This Work (Ours) | ✅ Per-instance | ✅ Natural language | ✅ Unlimited | ✅ Any object | ~10s/obj
CBS (Class-Based Styling) | ✅ Per-class | ❌ Fixed classes | ⚠ Fixed 3 styles | ❌ Fixed labels | Real-time
Regional Style Transfer | ⚠ Foreground only | ❌ Manual mask | ❌ Single style | ⚠ Limited | Batch only
MOSAIC (CLIP-guided) | ✅ Per-object | ✅ Text prompts | ✅ Multiple | ✅ Open-vocab | Slow (ViT)
FreeStyle (Diffusion) | ❌ Global only | ✅ Text guided | ❌ Single pass | ✅ Open-vocab | Slow (diffusion)
About

The Author

Nguyen Hong Son
Bachelor of Data Science · HCMIU – VNU-HCM · 2025

Computer Vision researcher focused on prompt-guided image manipulation, object segmentation, and neural style transfer. This thesis integrates state-of-the-art CV models (GroundingDINO, SAM2, S2WAT, CLIP) into a unified, natural-language-driven pipeline for per-object artistic stylization. Supervised by Dr. Le Thi Ngoc Hanh, School of Computer Science and Engineering.
