Bachelor Thesis · Data Science · HCMIU – VNU
Segmented Objects, Guided by Prompts
A computer vision framework that uses natural language prompts to detect, segment, and apply neural style transfer to individual objects rather than the whole image. It combines GroundingDINO, SAM2, and S2WAT into a unified pipeline.
3 pipeline stages · 5 art styles · ~10s per object · ∞ objects / image
Core CV Models
01 🤖 Google Gemini (LLM): prompt parsing
02 🔗 CLIP (CV): cross-modal style retrieval
03 🔍 GroundingDINO (CV): open-set object detection
04 ✂️ SAM2 (CV): instance segmentation
05 🎨 S2WAT (CV): neural style transfer · ViT
06 🧩 Alpha Compositing (CV): mask-guided blending
For CV Engineers
Key Computer Vision Components
👁 CV Pipeline: open-vocab detection · prompt-conditioned segmentation · ViT-based style transfer · CLIP cross-modal retrieval · alpha-composite blending
GroundingDINO — Open-Set Detection
Transformer-based detector that fuses vision and language. Takes arbitrary text phrases and outputs bounding boxes. Localizes the objects named in the user prompt, including occluded or partially visible ones.
Architecture: DINO backbone + text-guided grounding · Output: bounding-box center coordinates (cx, cy)
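To make the detector output concrete, here is a minimal sketch of turning one box into pixel coordinates for the segmentation stage. It assumes the detector returns boxes as normalized (cx, cy, w, h) tuples, which is our understanding of the public GroundingDINO inference code and is not confirmed by this page. The function name `cxcywh_to_pixels` is hypothetical.

```python
def cxcywh_to_pixels(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to a pixel center and pixel corners.

    Assumption: all four values are fractions of the image size in [0, 1].
    """
    cx, cy, w, h = box
    # Scale the normalized values to pixel units.
    cx_px, cy_px = cx * image_width, cy * image_height
    w_px, h_px = w * image_width, h * image_height
    # Corner form (x_min, y_min, x_max, y_max), clamped to the image bounds.
    x_min = max(0.0, cx_px - w_px / 2)
    y_min = max(0.0, cy_px - h_px / 2)
    x_max = min(float(image_width), cx_px + w_px / 2)
    y_max = min(float(image_height), cy_px + h_px / 2)
    return (cx_px, cy_px), (x_min, y_min, x_max, y_max)


# A box centered in a 640x480 image, a quarter as wide and half as tall.
center, corners = cxcywh_to_pixels((0.5, 0.5, 0.25, 0.5), 640, 480)
```

The center can serve as a point prompt and the corners as a box prompt, depending on which kind of spatial prompt the segmentation stage is given.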
SAM2 — Segment Anything Model v2
State-of-the-art segmentation model taking bounding box coordinates from GroundingDINO as spatial prompts. Produces high-quality binary masks and salient object images. Handles irregular shapes and partial occlusion without retraining.
Input: box center coords · Output: binary mask (Seg-O) + salient object image (SLO)
S2WAT — Strips Window Attention Transformer
Hierarchical Vision Transformer for neural style transfer. Uses horizontal/vertical strip windows + square patches to capture both long-range context and fine-grained textures. Adaptive merging avoids grid artifacts.
Architecture: ViT + strip attention · Loss: content + Gram-matrix style + identity
CLIP — Cross-Modal Style Retrieval
Contrastive Language-Image Pre-training maps text and images into a shared embedding space. Used offline to retrieve the closest style reference via cosine similarity. Open-vocabulary style queries without task-specific retraining.
Similarity(s,i) = (s·i) / (‖s‖·‖i‖) · Open-vocabulary, no retraining
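The retrieval step can be sketched as follows. This is a pure-Python illustration of the cosine-similarity formula above, not the project's code: the three-dimensional vectors are toy stand-ins for real CLIP embeddings, and `retrieve_style` is a hypothetical helper name.

```python
import math


def cosine_similarity(s, i):
    """Similarity(s, i) = (s . i) / (||s|| * ||i||)."""
    dot = sum(a * b for a, b in zip(s, i))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_i = math.sqrt(sum(b * b for b in i))
    if norm_s == 0.0 or norm_i == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_s * norm_i)


def retrieve_style(text_embedding, style_embeddings):
    """Return (name, score) for the style whose embedding is closest to the text."""
    scored = (
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in style_embeddings.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings; a real system would use CLIP image features computed once
# for every image in the style folder.
style_embeddings = {
    "sketch": [0.9, 0.1, 0.0],
    "cubist": [0.1, 0.9, 0.2],
    "cyberpunk": [0.0, 0.2, 0.9],
}
best_name, best_score = retrieve_style([0.2, 0.8, 0.1], style_embeddings)
```

Because `max` always returns some entry, a query for a style outside the library still resolves to its nearest neighbour. A score threshold would be one way to detect such cases.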
Mask-Guided Alpha Compositing
Pixel-level blending using SAM2's salient mask. For each pixel (x,y): if mask=255 → use stylized pixel; else → keep original background. Applied recursively per object in sequence.
I_blend(x,y) = I_styled(x,y) if mask=255 else I_base(x,y)
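A minimal sketch of the blending rule above, using nested Python lists in place of image arrays. The function name `blend_with_mask` and the 2x2 toy images are illustrative assumptions. A real implementation would more likely operate on arrays.

```python
def blend_with_mask(base, styled, mask):
    """Per-pixel selection: the styled pixel where mask == 255, else the base pixel."""
    height, width = len(base), len(base[0])
    for name, img in (("styled", styled), ("mask", mask)):
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError(f"{name} must have the same size as base")
    return [
        [styled[y][x] if mask[y][x] == 255 else base[y][x] for x in range(width)]
        for y in range(height)
    ]


# 2x2 RGB toy images: the mask selects the top-left and bottom-right pixels.
base = [[(10, 10, 10), (20, 20, 20)], [(30, 30, 30), (40, 40, 40)]]
styled = [[(200, 0, 0), (0, 200, 0)], [(0, 0, 200), (200, 200, 0)]]
mask = [[255, 0], [0, 255]]
blended = blend_with_mask(base, styled, mask)
```

For several objects, the result of one call becomes `base` for the next, matching the sequential blending described above.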
End-to-End Modular Pipeline
LLM → CLIP retrieval → GroundingDINO → SAM2 → S2WAT → alpha-composite blending. Each stage independently swappable. Scales linearly with object count: ~10s/object. Fully local except prompt parsing.
Runtime: ~10.44s (1 obj) → ~42.30s (4 objs) · Linear O(n) scaling
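One way to picture "each stage independently swappable" is a pipeline object whose stages are plain callables. The sketch below is an assumption about structure, not the thesis code: the class `StylePipeline`, its field names, and the stage signatures are hypothetical, and string stubs stand in for the real models. It also assumes detection runs on the current blended result rather than on the original image.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class StylePipeline:
    # Each field is one swappable stage; all signatures are illustrative.
    parse_prompt: Callable[[str], List[Tuple[str, str]]]  # prompt -> [(object, style)]
    retrieve_style: Callable[[str], Any]                   # style text -> style reference
    detect: Callable[[Any, str], Any]                      # (image, object phrase) -> box
    segment: Callable[[Any, Any], Any]                     # (image, box) -> mask
    stylize: Callable[[Any, Any, Any], Any]                # (image, mask, style) -> styled
    blend: Callable[[Any, Any, Any], Any]                  # (base, styled, mask) -> new base

    def run(self, image, prompt):
        result = image
        # One pass per (object, style) pair, so the work grows linearly with object count.
        for obj, style in self.parse_prompt(prompt):
            style_ref = self.retrieve_style(style)
            box = self.detect(result, obj)
            mask = self.segment(result, box)
            styled = self.stylize(result, mask, style_ref)
            result = self.blend(result, styled, mask)  # Becomes the base for the next object.
        return result


# String stubs make the data flow visible without loading any model.
pipeline = StylePipeline(
    parse_prompt=lambda p: [("dog", "sketch"), ("cat", "cubist")],
    retrieve_style=lambda s: f"{s}.jpg",
    detect=lambda img, obj: (obj, "box"),
    segment=lambda img, box: f"mask[{box[0]}]",
    stylize=lambda img, mask, ref: f"{ref}@{mask}",
    blend=lambda base, styled, mask: f"{base}+{styled}",
)
output = pipeline.run("photo", "a dog by sketch and a cat by cubist")
```

Swapping a stage, for example replacing the prompt parser with a local model, then means passing a different callable with the same signature.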
System Design
3-Stage Pipeline
Input: content image + text prompt
→ LLM · Gemini: prompt parsing
→ Stage 1 · CLIP Retrieval: style image match
→ Stage 1 · GroundingDINO: object detection
→ Stage 2 · SAM2: segmentation
→ Stage 2 · S2WAT: style transfer
→ Stage 3 · Blending: alpha composite
→ Output: stylized image with per-object styles
Stage 01
🤖
Prompt Processing
Google Gemini interprets the user's natural language prompt and extracts structured object–style pairs. Handles compound instructions with multiple objects, attributes, and styles.
Google Gemini · NLP · Multimodal LLM
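The structured object–style pairs could be represented as in the sketch below. The JSON shape (an array of items with `object` and `style` keys) is an assumption for illustration; this page does not specify the format the LLM is asked to return. `StyleRequest` and `parse_llm_response` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleRequest:
    object_phrase: str  # Text phrase later passed to the detector.
    style: str          # Style text later passed to CLIP retrieval.


def parse_llm_response(raw_json):
    """Parse a JSON array of {"object": ..., "style": ...} items (assumed format)."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of object-style pairs")
    requests = []
    for item in items:
        if not isinstance(item, dict):
            raise ValueError(f"expected a JSON object, got: {item!r}")
        obj = str(item.get("object", "")).strip().lower()
        style = str(item.get("style", "")).strip().lower()
        if not obj or not style:
            raise ValueError(f"incomplete pair: {item!r}")
        requests.append(StyleRequest(obj, style))
    return requests


# Hypothetical LLM output for the three-object demo prompt.
raw = (
    '[{"object": "dog", "style": "sketch"},'
    ' {"object": "cat", "style": "cyberpunk"},'
    ' {"object": "human", "style": "cubist"}]'
)
requests = parse_llm_response(raw)
```

Validating the response before it reaches the vision stages keeps a malformed LLM reply from surfacing later as a detection or retrieval error.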
Stage 01
🔗
Style Image Retrieval
CLIP encodes both the style text and all images in the style folder into a shared embedding space. Cosine similarity selects the best-matching style reference.
CLIP · Cosine Similarity · Cross-modal Retrieval
Stage 01
🔍
Object Detection
GroundingDINO outputs bounding-box center coordinates (cx, cy). Open-set detection handles arbitrary objects and is robust to occlusion, blur, and partial visibility.
GroundingDINO · Vision Transformer · Open-vocab
Stage 02
✂️
Object Segmentation
SAM2 receives the box center (cx, cy) and produces a binary segmentation mask (Seg-O) plus a salient object image (SLO). Handles fine boundaries and partial occlusions without retraining.
SAM2 · Instance Segmentation · Binary Masking
Stage 02
🎨
Neural Style Transfer
S2WAT applies the retrieved style image to the segmented object. Multi-shape attention windows capture both local textures and global structure.
S2WAT · Vision Transformer · Gram Matrix Loss
Stage 03
🧩
Blending & Composition
Mask-guided alpha compositing places each stylized object back into the scene. Applied recursively: each blend result becomes the base for the next object.
Alpha Compositing · Mask-guided Blending · Recursive
Evaluation
Results & Demo Outputs
🐻 Input Bear → Segmented → Newspaper
Prompt: "draw new bear with newspaper style"
High-contrast monochrome; halftone texture. Single-object single-style. Clean segmentation with background preserved.
newspaper
🦊 Input Fox → Segmented → Cubist
Prompt: "a new fox with cubist style"
Geometric fragmentation, angular planes, vibrant color contrasts from Picasso's La Muse. Background remains natural.
cubist / picasso
👧 Input Girl → Segmented → Sketch
Prompt: "style transfer a girl by sketch style"
Fine linework, cross-hatching, minimalist monochrome. Retains facial structure and expression during transfer.
sketch
🐕 Dog: Sketch · 🐱 Cat: Cyberpunk · 🧑 Human: Cubist
Prompt: "I want a dog by sketch, a cat by cyberpunk, and a human by cubist style"
3-object composition. Each subject gets a fully independent style. Blending is cumulative: dog → cat → human.
sketch · cyberpunk · cubist
🐕 Dog Input → Sketch · 🐱 Cat Input → Cubist
Prompt: "I want a dog drawn by sketch style, and a cat drawn by cubist style"
Sequential multi-object processing. Dog stylized first → becomes base → cat detected & stylized independently.
sketch · cubist
🐯 Tiger (occluded) → Partial Seg. → Cubist
Prompt: "I want to style transfer tiger on cubist style"
Tiger partially hidden behind grass. GroundingDINO + SAM2 still correctly isolate the visible portions.
cubist · occlusion handled
Performance
Runtime Benchmarks
~10s per object · avg single-style
O(n) linear scaling with object count
Runtime scales linearly with object count. Each object adds ~10.5s overhead (segmentation + style transfer + blending).
Gemini API is the only online dependency. All CV stages run fully local.
| Scenario | # Styles | # Objects | Runtime | Complexity |
|---|---|---|---|---|
| Bear + Newspaper | 1 | 1 | 10.44s | Single object |
| Fox + Cubist | 1 | 1 | 10.55s | Single object |
| Girl + Sketch | 1 | 1 | 11.18s | Human subject |
| Girl + Cubist | 1 | 1 | 12.17s | Human subject |
| Tiger + Cubist (occluded) | 1 | 1 | 11.18s | Partial visibility |
| Panther + Newspaper (blurred) | 1 | 1 | 11.14s | Partial visibility |
| Dog + Cat (sketch + cubist) | 2 | 2 | 20.65s | Multi-object ×2 |
| Dog + Human (sketch + cubist) | 2 | 2 | 21.19s | Multi-object ×2 |
| Dog + Human v2 | 2 | 2 | 20.67s | Multi-object ×2 |
| Dog + Cat + Human (3 styles) | 3 | 3 | 32.84s | Multi-object ×3 |
| 4-object multi-style composition | 4 | 4 | 42.30s | Multi-object ×4 |
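The linear-scaling claim can be checked against the table with an ordinary least-squares fit of runtime against object count. The sketch below uses only the eleven (objects, runtime) pairs listed above. The helper `fit_line` is a hypothetical name, and the fit is an illustration rather than part of the original evaluation.

```python
def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept for (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# (number of objects, runtime in seconds), copied from the benchmark table.
runs = [
    (1, 10.44), (1, 10.55), (1, 11.18), (1, 12.17), (1, 11.18), (1, 11.14),
    (2, 20.65), (2, 21.19), (2, 20.67),
    (3, 32.84),
    (4, 42.30),
]
slope, intercept = fit_line(runs)
print(f"{slope:.2f} s per object, {intercept:.2f} s fixed cost")
```

On these figures the fitted slope lands close to the ~10.5 s per object quoted above, with a fixed cost of well under one second.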
Known Issues
Limitations & Failure Cases
⚠ Incomplete Human Segmentation
SAM2 sometimes segments only skin regions of human subjects, excluding hair and clothing. Stylization is applied to fragmented areas only, breaking visual coherence.
⚠ Partial Object Capture
When an object spans multiple segmentation passes, different sub-regions may receive different style assignments, causing conflicting artistic representations.
⚠ Wrong Region Targeting
GroundingDINO occasionally mislocalizes the target, e.g., the sea or ground is segmented instead of a lion. Ambiguous center coordinates lead to style being applied to irrelevant regions.
⚠ Fine Boundary Loss
Small or low-contrast regions (e.g., a cat's dark tail against dark background) may be clipped during SAM2 mask generation. Fine-grained boundary details are hard to preserve.
⚠ Online Dependency
The Gemini API call for prompt parsing requires internet connectivity. All downstream CV stages are local, but the LLM step prevents fully offline deployment.
⚠ Limited Style Library
Style retrieval is bounded by the pre-curated folder (5 styles). Styles not represented will fall back to the nearest match, potentially misrepresenting user intent.
Roadmap
Future Work
01
Offline LLM
Replace Gemini with a local fine-tuned small LLM or rule-based NLP. Enables fully offline, edge-deployable pipeline.
02
Better Segmentation
Integrate whole-body person segmentation (DensePose or HRNet) to fix the skin-only SAM2 limitation. Add multi-pass refinement steps.
03
Diffusion-based Transfer
Explore diffusion models (ControlNet, IP-Adapter) for style transfer to improve texture fidelity at object boundaries.
04
Gradient-Domain Blending
Replace alpha compositing with Poisson image editing or GAN-based inpainting at boundaries to reduce visible seam artifacts.
05
Spatial Prompts
Support spatial positioning ("object on the left"), style composition (mixing two styles on one object), and multi-image style prompts.
06
Real-time Optimization
Model distillation and quantization to bring per-object runtime from ~10s toward near-real-time for interactive / video applications.
Style Library
5 Reference Styles
🖼️
La Muse
Cubist / Picasso
Geometric abstraction, angular planes, multiple viewpoints, vibrant earth tones
🌌
Starry Night
Van Gogh
Swirling brushstrokes, deep blues & yellows, rhythmic movement, emotional intensity
📰
Newspaper
Print Media
High-contrast monochrome, halftone dots, texture over color, vintage aesthetic
🌆
Cyberpunk
Neon / Futuristic
Electric blues, magentas, neon highlights, high saturation, digital textures
✏️
Sketch
Pencil / Linework
Fine lines, cross-hatching, monochrome, minimal — emphasizes contour over color
Comparison
vs Related Methods
| Method | Object-Level | Prompt-Guided | Multi-Style | Open-Vocab | Speed |
|---|---|---|---|---|---|
| This Work (Ours) | ✅ Per-instance | ✅ Natural language | ✅ Unlimited | ✅ Any object | ~10s/obj |
| CBS (Class-Based Styling) | ✅ Per-class | ❌ Fixed classes | ⚠ Fixed 3 styles | ❌ Fixed labels | Real-time |
| Regional Style Transfer | ⚠ Foreground only | ❌ Manual mask | ❌ Single style | ⚠ Limited | Batch only |
| MOSAIC (CLIP-guided) | ✅ Per-object | ✅ Text prompts | ✅ Multiple | ✅ Open-vocab | Slow (ViT) |
| FreeStyle (Diffusion) | ❌ Global only | ✅ Text guided | ❌ Single pass | ✅ Open-vocab | Slow (diffusion) |
About