Comparison of CNN vs. ViT (images) · RNN vs. Transformer (text) · CLIP Zero-shot Retrieval on Flickr30k

| Model | Type | Test Accuracy | F1-Macro | Params | Epochs |
|---|---|---|---|---|---|
| ResNet-50 | CNN | 44.11% | 0.4340 | 25.6M | 5 |
| ViT-B/16 | ViT | 89.60% | 0.8959 | 86M | 5 |

| Model | Type | Test Accuracy | F1-Macro | Epochs |
|---|---|---|---|---|
| GRU | RNN | 37.85% | 0.3608 | 5 |
| DistilBERT | Transformer | 69.04% | 0.6682 | 3 |

| Method | Training images | Accuracy | F1-Macro |
|---|---|---|---|
| Zero-shot | 0 | 54.60% | 0.5173 |
| 1-shot | 10 | 32.80% | 0.3383 |
| 5-shot | 50 | 61.20% | 0.6221 |
| 10-shot | 100 | 76.40% | 0.7655 |
| 20-shot | 200 | 93.00% | 0.9322 |

Dataset: Flickr30k test split · 1,000 images · 10 classes (keyword labeling) · CLIP ViT-B/32
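
The training-image counts in the CLIP table are consistent with drawing k labeled images for each of the 10 classes (10 × k images for k-shot). The sketch below shows one way such a subset could be sampled and how the accuracy and macro-F1 columns are defined. The function names, the seed, and the toy labels are illustrative assumptions, not the code that produced the numbers above.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed=0):
    """Return indices of k randomly chosen examples per class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < k:
            raise ValueError(f"class {label!r} has only {len(indices)} examples")
        chosen.extend(rng.sample(indices, k))
    return chosen


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy stand-in for a 1,000-image, 10-class label list (assumed balanced here).
toy_labels = [i % 10 for i in range(1000)]
for k in (1, 5, 10, 20):
    print(f"{k}-shot -> {len(sample_k_shot(toy_labels, k))} training images")

# Toy predictions to exercise the two metrics.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because macro-F1 averages per-class scores with equal weight, it can sit above or below accuracy when errors are spread unevenly across classes, as in the 1-shot and 5-shot rows.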