The quality of gastrointestinal endoscopy is verified by documenting specific images, but identifying them among a large number of photographs is tedious. Conventional deep learning approaches to automating this process are limited by subjective evaluations and poor interpretability. The authors present a new content-based image retrieval framework with a dual-backbone architecture that integrates the general-purpose DINOv2 vision foundation model with the domain-specific GastroNet endoscopic model. The system was trained with parameter-efficient metric learning to generate discriminative embeddings for similarity search. The model was evaluated on 3,500 public endoscopic images from the Kvasir and HyperKvasir datasets and validated on unseen real and synthetic data. It achieved state-of-the-art performance with 97.71% Recall@1, 99.14% Recall@5, and a 96.4% average score, substantially better than the 76.7% average of the baseline models. Ablation studies confirmed that the improvement stems from the two backbones capturing complementary features. The system offers a precise tool for automated endoscopy quality control in clinical practice.
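The retrieval pipeline described above can be sketched in a few lines: embeddings from the two backbones are combined into a single descriptor, and queries are matched against a gallery by cosine similarity, with quality measured by Recall@K. The concatenation-based fusion and the function names below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def fuse_embeddings(e_general: np.ndarray, e_domain: np.ndarray) -> np.ndarray:
    """Hypothetical fusion: L2-normalize each backbone's embedding
    (e.g., from DINOv2 and GastroNet) and concatenate them."""
    e_general = e_general / np.linalg.norm(e_general, axis=1, keepdims=True)
    e_domain = e_domain / np.linalg.norm(e_domain, axis=1, keepdims=True)
    return np.concatenate([e_general, e_domain], axis=1)

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k):
    """Fraction of queries whose k nearest gallery images (by cosine
    similarity) include at least one image of the same class."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                              # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of k best matches
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return hits.mean()
```

For example, with a toy gallery of two classes, `recall_at_k` returns 1.0 when every query's nearest neighbor shares its class.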