Bachelor Thesis · Data Science · HCMIU – VNU

Segmented Objects, Guided by Prompts

A Computer Vision framework that uses natural language prompts to detect, segment, and apply neural style transfer to individual objects — not the whole image. Combines GroundingDINO · SAM2 · S2WAT into a unified pipeline.

Author: Nguyen Hong Son
Supervisor: Dr. Le Thi Ngoc Hanh
Institution: HCMIU · VNU-HCM
Year: 2025


3 pipeline stages · 5 art styles · ~10s per object · ∞ objects / image

Core CV Models

01 🤖 Google Gemini: prompt parsing (LLM)
02 🔗 CLIP: cross-modal style retrieval
03 🔍 GroundingDINO: open-set object detection
04 ✂️ SAM 2: instance segmentation
05 🎨 S2WAT: neural style transfer (ViT)
06 🧩 Alpha Compositing: mask-guided blending

For CV Engineers

Key Computer Vision Components

👁 CV Pipeline: Open-vocab detection · Prompt-conditioned segmentation · ViT-based style transfer · CLIP cross-modal retrieval · Alpha-composite blending

🔍

GroundingDINO — Open-Set Detection

Transformer-based detector that fuses vision and language features. It takes arbitrary text phrases and outputs a bounding box for each object named in the user prompt, including occluded or partially visible objects.

Architecture: DINO backbone + text-guided grounding · Output: cxcy bounding box centers
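As a rough illustration of how a detection can be turned into a spatial prompt, the sketch below converts one box to pixel coordinates. It assumes the detector returns boxes normalized to [0, 1] in (cx, cy, w, h) order, which is how the public GroundingDINO inference utilities report them; the `PixelBox` type and `to_pixel_box` helper are hypothetical names, not part of the thesis code.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    cx: float  # box center, x (pixels)
    cy: float  # box center, y (pixels)
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def to_pixel_box(box_cxcywh, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to pixel coordinates."""
    cx, cy, w, h = box_cxcywh
    cx, w = cx * image_width, w * image_width
    cy, h = cy * image_height, h * image_height
    return PixelBox(cx, cy, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Hypothetical detection covering the middle of a 640x480 image.
box = to_pixel_box((0.5, 0.5, 0.25, 0.5), image_width=640, image_height=480)
print(box.cx, box.cy)                  # center, usable as a point prompt
print(box.x0, box.y0, box.x1, box.y1)  # corners, usable as a box prompt
```

The center alone is a weaker prompt than the full box, which may be one reason center ambiguity shows up among the failure cases.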

✂️

SAM2 — Segment Anything Model v2

State-of-the-art segmentation model taking bounding box coordinates from GroundingDINO as spatial prompts. Produces high-quality binary masks and salient object images. Handles irregular shapes and partial occlusion without retraining.

Input: box center coords · Output: binary mask (Seg-O) + salient object image (SLO)

🎨

S2WAT — Strips Window Attention Transformer

Hierarchical Vision Transformer for neural style transfer. Uses horizontal/vertical strip windows + square patches to capture both long-range context and fine-grained textures. Adaptive merging avoids grid artifacts.

Architecture: ViT + strip attention · Loss: content + Gram-matrix style + identity
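To make the Gram-matrix style term concrete, here is a minimal NumPy sketch. It is not S2WAT's training code: the feature maps are random stand-ins for activations that would normally come from a pretrained encoder, and the normalization by C·H·W is one common convention assumed here.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


def gram_style_loss(stylized_feats, style_feats):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(stylized_feats) - gram_matrix(style_feats)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
style = rng.standard_normal((8, 16, 16))  # stand-in style features
other = rng.standard_normal((8, 16, 16))  # stand-in stylized features
print(gram_style_loss(style, style))  # identical features give zero loss
print(gram_style_loss(other, style))  # differing features give a positive loss
```

Because the Gram matrix sums over spatial positions, it captures which channels co-activate (texture) while discarding where they occur (layout).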

🔗

CLIP — Cross-Modal Style Retrieval

Contrastive Language-Image Pre-training maps text and images into a shared embedding space. Used offline to retrieve the closest style reference via cosine similarity. Open-vocabulary style queries without task-specific retraining.

Similarity(s,i) = (s·i) / (‖s‖·‖i‖) · Open-vocabulary, no retraining
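The retrieval step can be sketched with plain NumPy once embeddings are available. The example below uses toy 3-D vectors in place of real CLIP embeddings, and the `best_style` helper is a hypothetical name; it only illustrates the cosine-similarity ranking given by the formula above.

```python
import numpy as np


def best_style(text_emb, image_embs, names):
    """Return the style name whose embedding is most similar to the text."""
    s = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ s  # cosine similarity per style image
    return names[int(np.argmax(scores))], scores


# Toy vectors standing in for CLIP embeddings (hypothetical values).
names = ["cubist", "sketch", "newspaper"]
image_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
text_emb = np.array([0.1, 0.9, 0.2])

name, scores = best_style(text_emb, image_embs, names)
print(name)  # sketch
```

Since `argmax` always returns something, a query for a style outside the library still resolves to the nearest available reference.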

🧩

Mask-Guided Alpha Compositing

Pixel-level blending using SAM2's salient mask. For each pixel (x,y): if mask=255 → use stylized pixel; else → keep original background. Applied recursively per object in sequence.

I_blend(x,y) = I_styled(x,y) if mask=255 else I_base(x,y)
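The blending rule above maps directly onto a single `np.where` call. This is a minimal sketch assuming H×W×3 `uint8` images and an H×W mask with values 0 or 255; the tiny arrays are illustrative stand-ins, not thesis data.

```python
import numpy as np


def blend(base, styled, mask):
    """Take styled pixels where mask == 255, base pixels elsewhere."""
    keep = (mask == 255)[..., None]  # (H, W, 1), broadcasts over channels
    return np.where(keep, styled, base)


base = np.zeros((2, 2, 3), dtype=np.uint8)        # black background
styled = np.full((2, 2, 3), 200, dtype=np.uint8)  # stand-in stylized image
mask = np.array([[255, 0],
                 [0, 255]], dtype=np.uint8)

out = blend(base, styled, mask)
print(out[..., 0])  # first channel: 200 on the masked diagonal, 0 elsewhere
```

With a strictly binary mask this is a hard cut-and-paste, the special case of alpha compositing where alpha is either 0 or 1.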

⛓

End-to-End Modular Pipeline

LLM → CLIP retrieval → GroundingDINO → SAM2 → S2WAT → alpha-composite blending. Each stage independently swappable. Scales linearly with object count: ~10s/object. Fully local except prompt parsing.

Runtime: ~10.44s (1 obj) → ~42.30s (4 objs) · Linear O(n) scaling
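One way to picture "each stage independently swappable" is a pipeline object whose stages are plain callables. The sketch below is an assumption about structure, not the thesis implementation: every field name is hypothetical, the stub stages only pass strings around, and stylizing the whole image rather than the salient object is a simplification.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Pipeline:
    # Each field is one swappable stage (all names are hypothetical).
    parse: Callable[[str], List[Tuple[str, str]]]  # prompt -> (object, style) pairs
    retrieve: Callable[[str], Any]                 # style text -> style image
    detect: Callable[[Any, str], Any]              # image, object -> box
    segment: Callable[[Any, Any], Any]             # image, box -> mask
    stylize: Callable[[Any, Any], Any]             # image, style image -> stylized
    blend: Callable[[Any, Any, Any], Any]          # base, stylized, mask -> image

    def run(self, prompt: str, image: Any) -> Any:
        result = image
        for obj, style in self.parse(prompt):  # one pass per object: O(n)
            style_img = self.retrieve(style)
            box = self.detect(result, obj)
            mask = self.segment(result, box)
            styled = self.stylize(result, style_img)
            result = self.blend(result, styled, mask)  # becomes the next base
        return result


trace = []


def record_blend(base, styled, mask):
    """Stub blend stage that logs what it was asked to paste."""
    trace.append((mask, styled))
    return base


demo = Pipeline(
    parse=lambda p: [("dog", "sketch"), ("cat", "cubist")],
    retrieve=lambda style: f"{style}.jpg",
    detect=lambda img, obj: obj,
    segment=lambda img, box: f"mask({box})",
    stylize=lambda img, style_img: style_img,
    blend=record_blend,
)
result = demo.run("a dog in sketch style, a cat in cubist style", "photo")
print(trace)
```

Swapping a stage, for example replacing the online prompt parser with a local one, then only means passing a different callable for `parse`.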

System Design

3-Stage Pipeline

pipeline("I want a cat in sketch style, a dog in cubist style", content_image) → stylized_output

INPUT 🖼️ Content Image (+ text prompt)
→ LLM 🤖 Gemini (prompt parsing)
→ STAGE 1 🔗 CLIP Retrieval (style image match)
→ STAGE 1 🔍 GroundingDINO (object detection)
→ STAGE 2 ✂️ SAM2 (segmentation)
→ STAGE 2 🎨 S2WAT (style transfer)
→ STAGE 3 🧩 Blending (alpha composite)
→ OUTPUT ✨ Stylized Image (per-object styles)

Stage 01

🤖

Prompt Processing

Google Gemini interprets the user's natural language prompt and extracts structured object–style pairs. Handles compound instructions with multiple objects, attributes, and styles.

Google Gemini · NLP · Multimodal LLM
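The structured object–style pairs could be exchanged as JSON. The sketch below assumes the LLM is instructed to reply with a JSON list of `{"object": ..., "style": ...}` entries; that schema, the `parse_pairs` helper, and the sample reply are all hypothetical, and no Gemini API call is shown.

```python
import json


def parse_pairs(llm_reply: str):
    """Validate a JSON reply and return a list of (object, style) tuples."""
    items = json.loads(llm_reply)
    pairs = []
    for item in items:
        obj = item["object"].strip().lower()
        style = item["style"].strip().lower()
        if not obj or not style:
            raise ValueError(f"incomplete pair: {item!r}")
        pairs.append((obj, style))
    return pairs


# Hypothetical reply for the prompt:
# "I want a dog by sketch, a cat by cyberpunk, and a human by cubist style"
reply = """[
  {"object": "dog", "style": "sketch"},
  {"object": "cat", "style": "cyberpunk"},
  {"object": "human", "style": "cubist"}
]"""
print(parse_pairs(reply))
```

Validating the reply before the CV stages run keeps a malformed LLM answer from surfacing later as a confusing detection or retrieval failure.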

Stage 01

🔗

Style Image Retrieval

CLIP encodes both the style text and all images in the style folder into a shared embedding space. Cosine similarity selects the best-matching style reference.

CLIP · Cosine Similarity · Cross-modal Retrieval

Stage 01

🔍

Object Detection

GroundingDINO outputs bounding box center coordinates (cxcy). Open-set detection handles arbitrary objects. Robust to occlusion, blur, and partial visibility.

GroundingDINO · Vision Transformer · Open-vocab

Stage 02

✂️

Object Segmentation

SAM2 receives cxcy and produces a binary segmentation mask (Seg-O) plus salient object image (SLO). Handles fine boundaries and partial occlusions without retraining.

SAM2 · Instance Segmentation · Binary Masking

Stage 02

🎨

Neural Style Transfer

S2WAT applies the retrieved style image to the segmented object. Multi-shape attention windows capture both local textures and global structure.

S2WAT · Vision Transformer · Gram Matrix Loss

Stage 03

🧩

Blending & Composition

Mask-guided alpha compositing places each stylized object back into the scene. Applied recursively: each blend result becomes the base for the next object.

Alpha Compositing · Mask-guided Blending · Recursive
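The "each result becomes the next base" behaviour amounts to a left fold over the objects. The sketch below assumes precomputed (stylized image, mask) layers on a 1×3 toy image; in the described pipeline detection and segmentation would be rerun on each new base, which is omitted here. The `compose` helper is a hypothetical name.

```python
from functools import reduce

import numpy as np


def blend(base, layer):
    """Paste one (styled, mask) layer onto the running base image."""
    styled, mask = layer
    return np.where((mask == 255)[..., None], styled, base)


def compose(original, layers):
    """Fold the layers in order; each result is the base for the next."""
    return reduce(blend, layers, original)


h, w = 1, 3
original = np.zeros((h, w, 3), dtype=np.uint8)
# Two stand-in objects whose masks overlap on the middle pixel.
dog = (np.full((h, w, 3), 50, np.uint8), np.array([[255, 255, 0]], np.uint8))
cat = (np.full((h, w, 3), 90, np.uint8), np.array([[0, 255, 255]], np.uint8))

print(compose(original, [dog, cat])[0, :, 0])  # middle pixel ends up as cat
print(compose(original, [cat, dog])[0, :, 0])  # middle pixel ends up as dog
```

Where masks overlap, the object blended last wins, so the processing order matters for touching or overlapping objects.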

Evaluation

Results & Demo Outputs

🐻 Input Bear → Segmented → Newspaper
Prompt: "draw new bear with newspaper style"
High-contrast monochrome; halftone texture. Single-object single-style. Clean segmentation with background preserved.
Style: newspaper

🦊 Input Fox → Segmented → Cubist
Prompt: "a new fox with cubist style"
Geometric fragmentation, angular planes, vibrant color contrasts from Picasso's La Muse. Background remains natural.
Style: cubist / picasso

👧 Input Girl → Segmented → Sketch
Prompt: "style transfer a girl by sketch style"
Fine linework, cross-hatching, minimalist monochrome. Retains facial structure and expression during transfer.
Style: sketch

🐕 Sketch · 🐱 Cyberpunk · 🧑 Cubist
Prompt: "I want a dog by sketch, a cat by cyberpunk, and a human by cubist style"
3-object composition. Each subject gets a fully independent style. Blending is cumulative: dog → cat → human.
Styles: sketch · cyberpunk · cubist

🐕 Dog Input → Sketch · 🐱 Cat Input → Cubist
Prompt: "I want a dog drawn by sketch style, and a cat drawn by cubist style"
Sequential multi-object processing. Dog stylized first → becomes base → cat detected & stylized independently.
Styles: sketch · cubist

🐯 Tiger (occluded) → Partial Seg. → Cubist
Prompt: "I want to style transfer tiger on cubist style"
Tiger partially hidden behind grass. GroundingDINO + SAM2 still correctly isolate the visible portions.
Style: cubist · occlusion handled

Performance

Runtime Benchmarks

~10s per object (avg single-style) · O(n) linear scaling with object count

Runtime scales linearly with object count: each additional object adds roughly 10.5s (segmentation + style transfer + blending).

Gemini API is the only online dependency. All CV stages run fully local.

Scenario	# Styles	# Objects	Runtime	Complexity
Bear + Newspaper	1	1	10.44s	Single object
Fox + Cubist	1	1	10.55s	Single object
Girl + Sketch	1	1	11.18s	Human subject
Girl + Cubist	1	1	12.17s	Human subject
Tiger + Cubist (occluded)	1	1	11.18s	Partial visibility
Panther + Newspaper (blurred)	1	1	11.14s	Partial visibility
Dog + Cat (sketch + cubist)	2	2	20.65s	Multi-object ×2
Dog + Human (sketch + cubist)	2	2	21.19s	Multi-object ×2
Dog + Human v2	2	2	20.67s	Multi-object ×2
Dog + Cat + Human (3 styles)	3	3	32.84s	Multi-object ×3
4-object multi-style composition	4	4	42.30s	Multi-object ×4
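As a quick check of the linear-scaling claim, the runtimes in the table above can be fitted with an ordinary least-squares line. The sketch uses only the eleven (objects, runtime) pairs from the table; the `fit_line` helper is a hypothetical name and the fit is a back-of-the-envelope estimate, not an analysis from the thesis.

```python
# (object count, runtime in seconds), copied from the table above
runs = [
    (1, 10.44), (1, 10.55), (1, 11.18), (1, 12.17), (1, 11.18), (1, 11.14),
    (2, 20.65), (2, 21.19), (2, 20.67),
    (3, 32.84),
    (4, 42.30),
]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


slope, intercept = fit_line(runs)
print(f"{slope:.2f} s per object, {intercept:.2f} s fixed cost")
```

On these numbers the slope comes out at roughly 10.5 s per object with a fixed cost of about half a second, in line with the stated per-object figure.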

Known Issues

Limitations & Failure Cases

⚠ Incomplete Human Segmentation

SAM2 sometimes segments only skin regions of human subjects, excluding hair and clothing. Stylization is applied to fragmented areas only, breaking visual coherence.

⚠ Partial Object Capture

When an object spans multiple segmentation passes, different sub-regions may receive different style assignments, causing conflicting artistic representations.

⚠ Wrong Region Targeting

GroundingDINO occasionally mislocalizes the target, e.g., the sea or ground is segmented instead of a lion. Ambiguous center coordinates lead to style being applied to irrelevant regions.

⚠ Fine Boundary Loss

Small or low-contrast regions (e.g., a cat's dark tail against dark background) may be clipped during SAM2 mask generation. Fine-grained boundary details are hard to preserve.

⚠ Online Dependency

The Gemini API call for prompt parsing requires internet connectivity. All downstream CV stages are local, but the LLM step prevents fully offline deployment.

⚠ Limited Style Library

Style retrieval is bounded by the pre-curated folder (5 styles). Styles not represented will fall back to the nearest match, potentially misrepresenting user intent.

Roadmap

Future Work

Offline LLM

Replace Gemini with a local fine-tuned small LLM or rule-based NLP. This would enable a fully offline, edge-deployable pipeline.

Better Segmentation

Integrate whole-body person segmentation (DensePose or HRNet) to fix the skin-only SAM2 limitation. Add multi-pass refinement steps.

Diffusion-based Transfer

Explore diffusion models (ControlNet, IP-Adapter) for style transfer to improve texture fidelity at object boundaries.

Gradient-Domain Blending

Replace alpha compositing with Poisson image editing or GAN-based inpainting at boundaries to reduce visible seam artifacts.

Spatial Prompts

Support spatial positioning ("object on the left"), style composition (mixing two styles on one object), and multi-image style prompts.

Real-time Optimization

Model distillation and quantization to bring per-object runtime from ~10s toward near-real-time for interactive / video applications.

Style Library

5 Reference Styles

🖼️ La Muse (Cubist / Picasso): geometric abstraction, angular planes, multiple viewpoints, vibrant earth tones
🌌 Starry Night (Van Gogh): swirling brushstrokes, deep blues & yellows, rhythmic movement, emotional intensity
📰 Newspaper (Print Media): high-contrast monochrome, halftone dots, texture over color, vintage aesthetic
🌆 Cyberpunk (Neon / Futuristic): electric blues, magentas, neon highlights, high saturation, digital textures
✏️ Sketch (Pencil / Linework): fine lines, cross-hatching, monochrome, minimal; emphasizes contour over color

Comparison

vs Related Methods

Method	Object-Level	Prompt-Guided	Multi-Style	Open-Vocab	Speed
This Work (Ours)	✅ Per-instance	✅ Natural language	✅ Unlimited	✅ Any object	~10s/obj
CBS (Class-Based Styling)	✅ Per-class	❌ Fixed classes	⚠ Fixed 3 styles	❌ Fixed labels	Real-time
Regional Style Transfer	⚠ Foreground only	❌ Manual mask	❌ Single style	⚠ Limited	Batch only
MOSAIC (CLIP-guided)	✅ Per-object	✅ Text prompts	✅ Multiple	✅ Open-vocab	Slow (ViT)
FreeStyle (Diffusion)	❌ Global only	✅ Text guided	❌ Single pass	✅ Open-vocab	Slow (diffusion)

About

The Author

Nguyen Hong Son

Bachelor of Data Science · HCMIU – VNU-HCM · 2025

Computer Vision researcher focused on prompt-guided image manipulation, object segmentation, and neural style transfer. This thesis integrates state-of-the-art CV models (GroundingDINO, SAM2, S2WAT, CLIP) into a unified, natural-language-driven pipeline for per-object artistic stylization. Supervised by Dr. Le Thi Ngoc Hanh, School of Computer Science and Engineering.

