##########################################################################
Domain Understanding
##########################################################################

.. contents:: Table of Contents
   :depth: 2
   :local:
   :backlinks: none

**************************************************************************
Topics
**************************************************************************

Key Concepts
==========================================================================

- Supervised Learning Paradigms

  - Multi-class classification
  - Multi-label classification
  - Hierarchical classification
  - Multi-task learning

- Representation Learning & Retrieval

  - Metric learning / Contrastive learning (e.g., InfoNCE, SimCLR)
  - Embedding learning + Nearest Neighbor Search
  - Vector quantization / Product quantization
  - Dense vs Sparse retrieval models

- Labeling & Supervision Strategies

  - Weak supervision
  - Self-supervised learning
  - Positive-Unlabeled (PU) learning
  - Active learning
  - Hard negative mining / Semi-hard negative mining

- Model Architectures & Adaptation

  - Fine-tuning pre-trained encoders (CNNs, ViT, BERT, CLIP)
  - Freezing/unfreezing strategies
  - Lightweight fine-tuning (LoRA, adapters, prompt tuning)
  - Multimodal fusion (early vs late fusion)

- Search & Ranking Infrastructure

  - Similarity-based search (e.g., ANN search with FAISS)
  - Hybrid retrieval (text + image)
  - Cross-encoder vs dual-encoder ranking models

- Data Handling & Preprocessing

  - Data cleaning and normalization (e.g., noisy title correction)
  - Image augmentations (crop, flip, blur, resize)
  - Text normalization and deduplication
  - Taxonomy mapping and vocabulary standardization

Models
==========================================================================

#. ResNet
#. ViT/DeiT
#. SimCLR
#. Faster R-CNN
#. YOLO
#. DETR
#. CLIP/BLIP
#. BERT/XLM-R, DistilBERT
#. T5 / BART
#. LLaMA
#. ViViT
#. TimeSformer

**************************************************************************
Application
**************************************************************************

Priority 1 – Very High Value (Common + Foundational)
==========================================================================

#. Product Categorization (Classification)

   - Fixed taxonomy classification from image or metadata
   - Covers: image encoders (ResNet/ViT), fine-tuning, label scarcity, domain shift
   - Must know: supervised classification, hierarchical taxonomies, class imbalance, data augmentation, few-shot strategies

#. Dynamic Tag Suggestion (Multi-label Prediction)

   - Open-ended multi-label prediction using metadata/image
   - Must know: BCEWithLogitsLoss, multi-label thresholds, label imbalance, tag vocab creation, weak supervision

#. Product Taxonomy Mapping

   - Mapping noisy/seller-provided categories to structured taxonomy
   - Must know: text classification, category disambiguation, noisy inputs, hierarchical mappings

Priority 2 – High Value (Used Across Systems)
==========================================================================

#. Attribute Extraction (NER or Slot-filling)

   - Extract structured attributes like brand, color, size from title/description
   - Must know: sequence labeling (BIO format), spaCy or BERT-based token classifiers, weak labeling, schema constraints

#. Duplicate Listing Detection

   - Detect duplicate or near-duplicate listings posted by users
   - Must know: pairwise embedding similarity, clustering, contrastive learning, efficient retrieval, deduplication heuristics

#. Image-Based Visual Search

   - Match query images to catalog using visual similarity
   - Must know: contrastive loss (InfoNCE), SimCLR, in-domain pretraining, feature indexing (FAISS), query augmentation

#. Text-Based Search (Query → Product Metadata)

   - Users search with queries matched to product text fields
   - Must know: BM25, dense retrieval (dual encoder), cross-encoder reranking, FAISS, negative sampling

Priority 3 – Medium Value (Niche but insightful)
==========================================================================

#. Multimodal Entity Matching / Linking

   - Link a product to a known item in a catalog (e.g., brand DB) using both image and text
   - Must know: multimodal encoders (e.g., CLIP), late fusion vs early fusion, product resolution, text normalization

#. Item Quality / Integrity Detection

   - Detect suspicious, poor quality, or policy-violating listings
   - Must know: content moderation, adversarial examples, cross-modal rules, abuse signals, self-supervised pretraining

Priority 4 – Lower Priority but Great for Bonus Points
==========================================================================

#. Product Title Generation

   - Rewrite or generate SEO-friendly titles from user-written titles/descriptions
   - Must know: text generation (seq2seq), BART/T5 models, summarization, input pre-processing

#. Title/Description Normalization

   - Normalize noisy seller-written text for search/ads relevance
   - Must know: grammar correction, paraphrasing, rule-based + neural hybrid methods

#. Visual Grounding / Region Tagging

   - Localize object regions corresponding to attributes or tags
   - Must know: object detection + vision-language grounding, attention maps, weak supervision

**************************************************************************
Problems
**************************************************************************

Text-Based Product Search (Metadata Only)
==========================================================================

- Problem

  - Allow users to search for products using a free-form text query. The system retrieves and ranks relevant products by matching the query against product metadata (title, description).

- Use Cases

  - Search bar experience in marketplace
  - Assistive auto-complete or suggestions
  - Indexing new products with better retrieval capabilities

- Input / Output

  - Input: User text query (e.g., "red running shoes")
  - Output: Ranked list of product IDs with titles and images

- Problem Type

  - Semantic text-to-text retrieval (information retrieval / ranking)

- Model Choices

  - Sparse retrieval (baseline):

    - BM25 over title and description fields

  - Dense retrieval (modern):

    - Dual-encoder architecture:

      - Query encoder (e.g., BERT, DistilBERT)
      - Product encoder (e.g., same as query encoder)
      - Similarity via dot product or cosine similarity

    - Optional: Cross-encoder reranker (e.g., BERT) for top-k reranking

- Labeling Scenarios

  - Supervised: Click logs or labeller-curated query-product matches
  - Weak supervision: Synthetic query generation from product text
  - Noisy signals: Search sessions or co-view logs

- Training Setup

  - Contrastive learning using positive query-product pairs and in-batch negatives
  - Loss: InfoNCE or triplet loss
  - Optional hard negative mining using BM25
  - Pretraining on large query-product corpora or Wikipedia Q-A pairs

- Evaluation Metrics

  - Recall@k, NDCG@k, Mean Reciprocal Rank (MRR)
  - Offline: manual relevance judgments or simulated clicks
  - Online: click-through rate (CTR), dwell time

- Scaling Considerations

  - Precompute product embeddings and store them in an ANN index (e.g., FAISS, ScaNN)
  - Real-time encoding of user query at search time
  - Efficient reranking within top-N retrieved candidates

- Alternative Methods

  - Hybrid retrieval: combine BM25 and dense scores
  - Use knowledge distillation to compress dual encoder
  - Use entity linking to match structured taxonomy (optional)

Product Taxonomy Mapping
==========================================================================

Task: Design a product categorisation tool for Facebook Marketplace

Problem
--------------------------------------------------------------------------

#. Use-case

   #. System - multiple possible use-cases

      #. >> Real time assist to the sellers during listing creation time
      #. Post upload clean-up/taxonomy mapping (invisible to the seller)
      #. Creation of category keyword index (invisible to the seller)
      #. Reroute to the quality/compliance/integrity team

   #. Actors - sellers, buyers, platform
   #. Entities - listings, user profiles, history
   #. Interests -

      #. Seller - reduce manual work (selecting from suggested category list)
      #. Buyers - find more relevant listings (search/recommendation)
      #. Platform - increase transactions made on the platform, increase quality/compliance/integrity

#. Scale

   #. 1M sellers, 50M listings live, 1M/day new listings, listings lifespan - days-months
   #. Listings are diverse, sellers are global - needs to generalise well on unseen data

#. Signals

   #. Product database

      #. Majority of the listings don't have taxonomy - 40M
      #. 10M listings have noisy taxonomy assigned by users (may/may not be correct)
      #. 20k listings with correct taxonomy assigned by human experts

   #. Seller profile, reputation

#. Business KPIs

   #. Successful session rate (# successful sessions / # sessions)
   #. MRR
   #. CTR on search/recommendation
   #. Taxonomy coverage

#. Misc

   #. Fixed set of categories - flat, 5k categories
   #. Each listing belongs to 1 single leaf category

Solution
--------------------------------------------------------------------------

#. Problem type

   #. Learning to rank - the listing is the query, the category list is the document, pointwise learning to rank
   #. Multi-class classification with fixed leaf labels from a predefined taxonomy list as target categories
   #. Learning to rank is better for

#. Data

   #. Listings

      #. Content - title, description, images (multiple), metadata (product age, dimensions, colour)
      #. Context - upload time, upload location

   #. Seller

      #. User profile - demographics - age group, gender, geolocation, account age
      #. Activity in communities/groups
      #. Stats - past listings, current listings, reputation (might be useful to determine if user-assigned label is noisy)

#. Feature
#. Learning strategy
#. Dataset curation
#. Model
#. Training
#. Eval
#. Deployment
#. Monitoring
#. Improvements

Dynamic Tag Suggestion System (Image-Only)
==========================================================================

- Problem

  - Suggest relevant tags (attributes, descriptors) for product listings to improve discovery, search, and categorization.

- Use Cases

  - Improves product discoverability.
  - Drives tag-based browsing and filtering.
  - Feeds into downstream categorization or moderation systems.

- Input:

  - One or more images of a product listing (no text input in the basic setup)
  - Tags are from a predefined vocabulary (e.g., 2,000 tags)

- Output:

  - A ranked list or binary vector over the tag vocabulary (multi-label)

- Problem Type

  - Fixed tag vocabulary -> Multi-label classification -> Vector of 0/1 labels or scores per tag
  - Open tag vocabulary -> Retrieval or generative -> Top-k retrieved tags using tag embeddings

- Model Architecture Choices

  - CNNs (e.g., ResNet): Strong baseline, efficient, works with BCE loss
  - Vision Transformers (e.g., ViT): Better generalization, more data-hungry
  - CLIP-style dual encoders: Enables retrieval/zero-shot tagging with tag embeddings
  - Multi-modal models (future): Use image + title/description if available

- Labeling Scenarios

  - Case A: 100k labeled images with tags

    - Finetune a CNN/ViT with BCEWithLogitsLoss

  - Case B: 10k labeled + 1M unlabeled

    - Use semi-supervised learning, self-training, pseudo-labeling
    - Optional: Contrastive pretraining with SimCLR or BYOL

  - Case C: Only curated positive tags, no known negatives

    - Use positive-unlabeled (PU) learning or ranking loss

- Training Setup

  - Preprocessing:

    - Resize, normalize (use dataset-specific mean/std), augmentations

  - Pretraining (optional):

    - Contrastive learning (SimCLR, BYOL) on unlabeled product image corpus

  - Finetuning:

    - Use BCEWithLogitsLoss (independent sigmoid heads)
    - Do not use softmax (it makes tags compete for probability mass, but several tags can be correct at once)
    - Optional: Freeze base layers initially, then unfreeze gradually

  - Thresholding:

    - Use global threshold (e.g., 0.5) or tune per-tag thresholds

- Evaluation Metrics

  - Precision@K: How many of top-K predicted tags are correct
  - Recall@K: How many true tags appear in the top-K predictions
  - F1 score (macro and micro)
  - AUC per tag (threshold-independent ranking quality; tune thresholds on per-tag precision-recall curves)

- Scaling Considerations

  - Multi-GPU training for ViT or large datasets
  - Factorized/tag-bottleneck heads for large vocabularies
  - Index tag embeddings for fast retrieval or zero-shot inference

- Alternative Methods

  - CLIP zero-shot tagging: Embed image and tag descriptions in same space
  - Image-to-tag retrieval: Learn tag embeddings, retrieve nearest
  - Vision-to-text (captioning): Generate pseudo-descriptions, extract tags

Visual Search System (Image-Only)
==========================================================================

- Problem

  - Enable users to search for products using only an image (e.g., phone-captured photos), matching to semantically similar catalog images.

- Use Cases

  - Image search via phone camera (e.g., “find similar items”).
  - Visual discovery experience (Pinterest-style browse).
  - Helps cold-start users with no typed query.

- Input / Output

  - Input: Query image (optionally cropped).
  - Output: Ranked list of product images (or product IDs) from a fixed catalog.

- Problem Type

  - Image retrieval based on visual similarity (semantic embedding space).
  - No class prediction, no metadata, no personalization.

- Model Choices

  - Backbone:

    - CNN-based: ResNet, EfficientNet, MobileNet (fast inference).
    - Transformer-based: ViT, DINOv2, DeiT, SAM (better semantics, requires more data).

  - Training Strategy:

    - Contrastive learning (SimCLR, MoCo, InfoNCE).
    - Triplet loss or ArcFace (optional).
    - Supervised fine-tuning with positive pairs (query ↔ matching catalog images).

- Labeling Scenarios

  - Case A: 10k manually labeled query ↔ product pairs (positive matches).
  - Case B: 200M unlabeled mobile photos.

    - Use clustering, pseudo-labels, weak supervision, or pretraining.
    - Leverage augmentations on catalog images to synthesize training pairs.

- Training Setup

  - Pretraining: Contrastive pretraining on product catalog (SimCLR-style) to adapt to product domain.
  - Finetuning:

    - On 10k labeled query-product pairs with InfoNCE loss.
    - Use product embedding = mean pooled embeddings of its multiple images.

  - Data Augmentations: Blur, crop, resize, grayscale, decolorization to simulate noisy inputs.
  - Embedding Head: Add projection head (e.g., 2-layer MLP) before retrieval embedding.

- Evaluation Metrics

  - Recall@k, Precision@k, mAP@k (mean Average Precision).
  - Retrieval latency and embedding size (efficiency).
  - Offline: Mean cosine similarity with true match.
  - Online: Click-through rate (CTR), conversion rate (if measurable).

- Scaling Considerations

  - Indexing: Use FAISS or ScaNN for approximate nearest neighbor (ANN) search.
  - Update index incrementally as new products are added.
  - Use product quantization (e.g., IVF-PQ indexes) to compress stored embeddings, or knowledge distillation to train a smaller encoder.
  - Optional: Use hierarchical retrieval (coarse-to-fine) for speed.

- Alternative Methods

  - CLIP-style image encoders + product ID supervision (e.g., MIL-NCE).
  - Self-supervised ViT models (DINOv2) for generalizable embeddings.
  - Ensemble of CNN + transformer models.
  - Use DETR/SAM-based region embeddings if user crops objects in the query.

Localized Object Search System (Object-Centric Visual Search)
==========================================================================

- Problem

  - Users capture an image containing multiple objects and want to search for just one object in the image.
  - The system detects the region of interest (e.g., via cropping or object detection) and retrieves semantically similar products.

- Use Cases

  - Tap-to-search on objects (like Google Lens)
  - Search specific item within a lifestyle image
  - Visual filters or product detection on seller-uploaded images

- Input / Output

  - Input: Full image or cropped region from user
  - Output: Products visually similar to the detected/cropped object

- Problem Type

  - Two-stage system:

    - Stage 1: Object detection/localization
    - Stage 2: Embedding-based retrieval

- Model Choices

  - Stage 1:

    - DETR, Faster R-CNN, YOLOv8 (object localization)
    - SAM for user-assisted segmentation/cropping

  - Stage 2:

    - ResNet/ViT/DINOv2 embedding extractor
    - Projected to common embedding space
    - Product embedding: mean of region embeddings per product

- Labeling Scenarios

  - Supervised: object bounding boxes + product match labels
  - Weakly supervised: click-through logs, cropped images
  - Self-supervised: augment product images as object crops

- Training Setup

  - Stage 1: Pretrain detector on product dataset with boxes
  - Stage 2: Train image embedding model on matched object ↔ product pairs
  - Optionally fuse detection + embedding (jointly fine-tune)

- Evaluation Metrics

  - Object localization accuracy (IoU, mAP)
  - Retrieval metrics: Recall@k, Precision@k for cropped objects
  - Overall latency (detection + search)

- Scaling Considerations

  - Cache intermediate crops if common
  - Use lightweight detectors (YOLO-Nano, MobileSAM)
  - Optional: Joint detector-embedder model (faster inference)

- Alternative Methods

  - SAM + embedding on segmented mask
  - One-stage detector with retrieval head (DELG-style)
  - Saliency-guided attention cropping without bounding boxes

Product Taxonomy Mapping (Image + Metadata)
==========================================================================

- Problem

  - Assign a product to a taxonomy node using both the image and product metadata (title and description).

- Input / Output

  - Input: Product image, title, and description
  - Output: Category ID (taxonomy node)

- Problem Type

  - Multimodal hierarchical classification

- Model Choices

  - Multimodal fusion models:

    - Early fusion: Concatenate image and text embeddings
    - Late fusion: Separate image and text towers with fusion at classifier level

  - Base encoders:

    - Image: ResNet, ViT
    - Text: BERT, DistilBERT, Sentence-BERT

  - Fusion techniques: MLP fusion, attention-based fusion, cross-modal transformer

- Labeling Scenarios

  - Same as image-only, but optionally apply text-based weak supervision
  - Use keyword extraction to create noisy labels from metadata
  - Train with human-labeled examples, validate robustness to noisy text

- Training Setup

  - Pretrain encoders separately or jointly
  - Finetune with labeled taxonomy classes
  - Text preprocessing: lowercasing, tokenization, stopword removal
  - Use dropout and regularization to avoid text overfitting

- Evaluation Metrics

  - Same as image-only, plus ablations on image-only vs text-only vs multimodal
  - Optional: evaluate on tail classes separately

- Use Cases

  - Improved classification performance in ambiguous or visually similar categories
  - Better coverage for long-tail or rare categories with descriptive text

- Scaling Considerations

  - Long and noisy text: requires cleaning and truncation
  - Tradeoff between complexity and latency
  - Multilingual metadata (requires multilingual text encoder)

- Alternative Methods

  - Use text-only or image-only when one modality is missing
  - Use CLIP-like models pretrained on image-text pairs
  - Train multitask models with auxiliary objectives (e.g., tag prediction)

Dynamic Tag Suggestion (Image + Metadata)
==========================================================================

- Problem

  - Suggest relevant tags (attributes, descriptors) for product listings to improve discovery, search, and categorization.

- Use Cases

  - Improves product discoverability.
  - Drives tag-based browsing and filtering.
  - Feeds into downstream categorization or moderation systems.

- Input / Output

  - Input: Product title, description, and optionally image.
  - Output: Set of 3–10 relevant tags from a fixed tag vocabulary.

- Problem Type

  - Multi-label classification (multiple tags can be correct).
  - Optional: Sequence generation (if tags are open-vocabulary).

- Model Choices

  - Text-only: BERT, DistilBERT, RoBERTa with sigmoid output.
  - Image-text: CLIP-style dual encoders for grounding.
  - Multimodal fusion: Late fusion or cross-attention models.
  - Lightweight: TextCNN or BiGRU + attention for mobile deployment.

- Label Collection

  - No explicit tags -> weak supervision from seller text

    - Rule-based keyword matching (exact, fuzzy).
    - TF-IDF / RAKE / YAKE for unsupervised keyword extraction.
    - Embedding similarity (BERT/CLIP).
    - Phrase mining (NER, noun phrase chunking).
    - LLM prompting for zero-/few-shot tag extraction.

  - Human-in-the-loop to clean and validate extracted labels.

- Training Setup

  - Loss: Binary cross-entropy with logits.
  - Data imbalance: Weighted sampling or focal loss.
  - Data augmentation: Synonym replacement, dropout, back-translation.
  - Initialization: Pretrained language/image models → fine-tune.

- Evaluation Metrics

  - Precision@k, Recall@k, F1@k.
  - Coverage and diversity of tag suggestions.
  - Manual quality assessment on a small sample.

- Scaling Considerations

  - Efficient inference via pre-computed embeddings.
  - Use tag clustering to reduce vocabulary explosion.
  - Incrementally refresh model with trending tag signals.

- Alternative Methods

  - Tag generation via seq2seq (T5, BART).
  - Retrieval-based tagging (match to nearest products with known tags).
  - Tag co-occurrence graph models.
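
The rule-based keyword matching (exact, fuzzy) and the Precision@k / Recall@k metrics from the outline above can be sketched with the Python standard library alone. This is a minimal illustration, not a production labeler: the tag vocabulary, the helper names ``weak_tags`` and ``precision_recall_at_k``, and the 0.85 fuzzy-match threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical fixed tag vocabulary (illustrative only).
TAG_VOCAB = ["leather", "vintage", "waterproof", "running shoes", "wireless"]


def weak_tags(text, vocab=TAG_VOCAB, fuzzy_threshold=0.85):
    """Return vocabulary tags found in seller text by exact or fuzzy match.

    The fuzzy threshold is an assumed value; it should be tuned on a
    small human-validated sample.
    """
    tokens = text.lower().replace(",", " ").split()
    found = []
    for tag in vocab:
        n = len(tag.split())
        # Candidate n-grams with the same word count as the tag.
        ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if tag in ngrams:
            found.append(tag)  # exact match
        elif any(SequenceMatcher(None, tag, g).ratio() >= fuzzy_threshold
                 for g in ngrams):
            found.append(tag)  # fuzzy match, tolerates small typos
    return found


def precision_recall_at_k(ranked_tags, true_tags, k):
    """Precision@k and Recall@k for one listing's ranked tag suggestions."""
    top_k = ranked_tags[:k]
    hits = sum(1 for t in top_k if t in true_tags)
    precision = hits / k if k else 0.0
    recall = hits / len(true_tags) if true_tags else 0.0
    return precision, recall


if __name__ == "__main__":
    # Misspelled seller text still yields noisy (weak) labels.
    print(weak_tags("Vintage lether jacket, waterprof"))
    # Two of the top three suggestions are among the four true tags.
    print(precision_recall_at_k(
        ["vintage", "leather", "wireless"],
        {"vintage", "leather", "waterproof", "retro"},
        k=3,
    ))
```

Labels produced this way are noisy by construction, which is why the outline pairs them with a human-in-the-loop validation step; the same metric helper can be averaged over listings to report Precision@k and Recall@k for a trained model.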

Multimodal Visual Search System (Image + Text)
==========================================================================

- Problem

  - Enhance search relevance by combining user-provided images with optional free-text (e.g., “red sneakers”) to retrieve matching product entries from the catalog.

- Use Cases

  - “Search this + add description”
  - More accurate queries (“dress like this but in blue”)
  - Shopping assistants, style filters

- Input / Output

  - Input:

    - Query image (phone-captured, optionally cropped)
    - Optional text query (user-entered keywords)

  - Output: Ranked product list (by semantic similarity)

- Problem Type

  - Multimodal retrieval (image + text to image)

- Model Choices

  - Encoders:

    - Image: ViT, DINOv2, ResNet (contrastive pretrained)
    - Text: BERT, DistilBERT, CLIP-Text

  - Fusion Strategy:

    - Late fusion: Weighted sum of image/text embeddings
    - Cross-modal attention (e.g., ALBEF, BLIP)

- Labeling Scenarios

  - Paired (image, text) examples from product catalog
  - Manually curated positive query ↔ product matches
  - Use weak supervision (e.g., co-occurring tags, titles)

- Training Setup

  - Pretraining: Contrastive alignment of image and text (CLIP-style)
  - Fine-tuning: Triplet or InfoNCE loss using curated query ↔ product pairs
  - Fusion tuning: Train a cross-attention head if needed
  - Embed catalog products with both modalities (combine features)

- Evaluation Metrics

  - Recall@k, NDCG@k
  - Multimodal retrieval accuracy
  - Ablation: image-only, text-only, fused vs. oracle relevance

- Scaling Considerations

  - Pre-compute and index catalog embeddings
  - At query time, combine the image and text query embeddings and run ANN search
  - Modality dropout during training to handle missing inputs

- Alternative Methods

  - CLIP or FLAVA for joint image-text space
  - Late fusion heuristics (weighted linear combination)
  - Multimodal transformers (e.g., ViLT) for deeper cross-modal reasoning

**************************************************************************
Resources
**************************************************************************

- Multi Modal models

  - [encord.com] `Top 10 Multimodal Models `_
  - Vision-text encoder:

    - [medium.com] `Understanding OpenAI’s CLIP model `_
    - [amazon.science] `KG-FLIP: Knowledge-guided Fashion-domain Language-Image Pre-training for E-commerce `_
    - [amazon.science] `Unsupervised multi-modal representation learning for high quality retrieval of similar products at e-commerce scale `_

  - Vision-encoder text-decoder:

    - [amazon.science] `MMT4: Multi modality to text transfer transformer `_
    - [research.google] `MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks `_
    - [medium.com] `Understanding DeepMind’s Flamingo Visual Language Models `_

- E-commerce publications

  - [amazon.science] `Amazon Science e-Commerce `_

Product Categorisation
==========================================================================

- Resources:

  - [arxiv.org] `Semantic Enrichment of E-commerce Taxonomies `_
  - [arxiv.org] `TaxoEmbed: Product Categorization with Taxonomy-Aware Label Embedding `_

Multimodal Product Representation
==========================================================================

- Papers:

  - [ieee.org] `Deep Multimodal Representation Learning: A Survey `_
  - [openaccess.thecvf.com] `Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce `_
  - [amazon.science] `Unsupervised Multi-Modal Representation Learning for High Quality Retrieval of Similar Products at E-commerce Scale `_

Product Title Normalization & Rewriting
==========================================================================

- Papers:

  - https://paperswithcode.com/task/attribute-value-extraction

Product Deduplication and Matching
==========================================================================

- Goal: Identify duplicate listings across users or platforms (e.g., same product uploaded multiple times).
- Papers:

  - [arxiv.org] `Deep Product Matching for E-commerce Search `_
  - [arxiv.org] `Multi-modal Product Retrieval in Large-scale E-commerce `_