Vision–Language Models (VLMs) have emerged as a powerful paradigm that bridges computer vision and natural language processing, enabling machines to jointly understand images and text. This seminar explores three key application areas of VLMs: image retrieval, where shared multimodal embeddings enable efficient cross-modal search; fine-grained classification, where VLMs capture subtle distinctions between visually similar categories by leveraging both visual and textual cues; and parameter-efficient fine-tuning (PEFT), which offers practical strategies for adapting large-scale models to specific tasks without retraining all of their parameters. By examining recent advances and case studies, we will highlight the strengths, limitations, and future directions of VLMs in real-world applications.
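
To make the cross-modal retrieval setting concrete, the sketch below embeds a text query and a set of candidate images into a shared space and ranks the images by cosine similarity. It is only an illustrative example, assuming the Hugging Face transformers CLIP API and the openai/clip-vit-base-patch32 checkpoint; the placeholder images stand in for a real image collection and are not part of the seminar material.

```python
# Minimal sketch of CLIP-style cross-modal retrieval (illustrative assumptions:
# Hugging Face transformers, openai/clip-vit-base-patch32, dummy solid-color images).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder images; in practice these would be the image collection to search.
images = [Image.new("RGB", (224, 224), color=c) for c in ("red", "green", "blue")]
query = "a photo of a red object"

# Encode the query and the candidate images into the shared embedding space.
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalize embeddings and compute cosine similarity between the query and each image.
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarities = (text_embeds @ image_embeds.T).squeeze(0)

# Rank candidate images from most to least similar to the text query.
ranking = similarities.argsort(descending=True)
print("Images ranked by similarity to the query:", ranking.tolist())
```

In a realistic deployment the image embeddings would be precomputed and stored in an approximate nearest-neighbor index, so that only the text query needs to be embedded at search time.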