## Introduction
This project investigates the application of computer vision and machine learning methodologies for automated waste detection and segregation. Four models—GroundingDINO, DETR, YOLOv8, and ResNet—are trained and evaluated using the TACO dataset to identify the most effective approach for robust trash classification in real-world environments. The paper that accompanies this project is available here.
## Dataset
The TACO dataset (Trash Annotations in Context) is a comprehensive collection of images depicting trash and recycling objects in real-world settings. This dataset, used for training and evaluating our models, includes images containing between 0 and 40 objects spanning 19 classes, such as bottles, cans, and plastic bags. Additionally, it features a "catch-all" class, labeled as unlabeled litter, for unidentified items. The dataset comprises 4,000 images, which were augmented to expand the total to 6,000 images, all resized to 416x416 pixels. The annotations were created using Roboflow and are available here.
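Because every image is resized to 416x416, the original bounding-box annotations must be rescaled to match the new resolution. A minimal sketch of that coordinate transform (the function name and the `(x_min, y_min, x_max, y_max)` box format are illustrative assumptions, not taken from the project's code):

```python
def rescale_box(box, orig_w, orig_h, target=416):
    """Rescale an (x_min, y_min, x_max, y_max) box from an image of
    size orig_w x orig_h to a target x target resolution."""
    sx = target / orig_w  # horizontal scale factor
    sy = target / orig_h  # vertical scale factor
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)

# A 100x100 box in an 832x1248 photo shrinks and changes aspect
# ratio when the image is squashed to 416x416:
print(rescale_box((100, 100, 200, 200), 832, 1248))  # (50.0, ~33.3, 100.0, ~66.7)
```

Note that non-square source images are distorted by this resize, which is one reason small objects such as bottle caps become harder to detect.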
## Models Used
- GroundingDINO is a state-of-the-art transformer-based object detection model designed for open-set detection tasks, integrating grounding capabilities to align visual features with textual prompts for precise and flexible object identification.
- DETR (DEtection TRansformer) is a transformer-based object detection model that eliminates the need for traditional hand-crafted components like anchor generation and NMS, using an end-to-end approach with a bipartite matching loss to predict objects directly from input images.
- YOLO (You Only Look Once) is a real-time object detection framework that predicts bounding boxes and class probabilities directly from full images in a single forward pass, offering high-speed and accurate detection performance.
- ResNet (Residual Network) is a deep convolutional neural network architecture designed to address the vanishing gradient problem by introducing residual connections, enabling the training of very deep networks while maintaining high performance in image classification and feature extraction tasks.
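The residual connection that gives ResNet its name is simple to state: instead of learning a target mapping H(x) directly, each block learns a residual F(x) and outputs F(x) + x, so gradients always have an identity path to flow through. A toy numeric sketch of this idea (pure Python, not the actual network):

```python
def residual_block(x, f):
    """Apply a learned transform f and add the identity shortcut.
    In a real ResNet, f is a stack of conv layers; here it is any
    function mapping a list of floats to a list of the same length."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# Even if the learned transform collapses to zero, the block still
# passes its input through unchanged, which is what makes very deep
# stacks of such blocks trainable.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
print(out)  # [1.0, 2.0, 3.0]
```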
## Key Findings
The YOLOv8 model achieved the highest precision and recall while also being fast enough for real-time applications, outperforming the other models. Smaller objects, such as bottle caps, remain challenging due to the resolution limitations of the dataset and the inherent difficulty of the task. Our final results are shown below.
| Evaluation Metric | YOLO | DETR | ResNet | GroundingDINO |
|---|---|---|---|---|
| Precision | 0.777 | 0.612 | 0.503 | N/A |
| Recall | 0.398 | 0.285 | 0.379 | N/A |
| mAP50 | 0.491 | 0.337 | 0.352 | N/A |
| mAP50-95 | 0.403 | 0.260 | 0.297 | N/A |
| FPS | 200 | 13 | 10 | 2 |
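For reference, mAP50 is mean average precision at an intersection-over-union (IoU) threshold of 0.50, and mAP50-95 averages over thresholds from 0.50 to 0.95: a predicted box counts as a true positive only when its IoU with a ground-truth box clears the threshold. A minimal IoU implementation (the `(x_min, y_min, x_max, y_max)` box format is an assumption for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping by half their width share 1/3 of their
# union -- below the 0.5 cutoff, so this match would not count
# toward mAP50:
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # ~0.333
```

This threshold sensitivity is also why small objects hurt the scores: a few pixels of localization error on a tiny box can drop its IoU below 0.5.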