Airport semantic segmentation is fundamental to intelligent airport applications, yet there currently lacks a specialized dataset and algorithms for this task. In this paper, we introduce the first large-scale video semantic segmentation benchmark, AVSS, designed for airport ground video surveillance. AVSS comprises 18 common semantic categories in airport ground, 250 videos, totaling over 140,000 frames with accurate manual annotations. AVSS covers a wide range of challenges for airport ground video surveillance, such as extreme multi-scale, intra-class diversity, inter-class similarity, irregular targets, object occlusion, class imbalance, lighting changes, and weather variations. We evaluate 14 state-of-the-art (SOTA) semantic segmentation algorithms on AVSS. The significant performance declines indicate that current models are far from being ready for practical application. Furthermore, we discuss video semantic segmentation algorithms for airport ground surveillance. AVSS serves both as a research resource for airport semantic segmentation and a robustness evaluation tool for segmentation algorithms in practical applications.
Since airport grounds are semi-militarized areas, we obtained data collection permissions through cooperation with the Civil Aviation Administration of China. The data collection site is a large international airport in Southwest China. To ensure data diversity, we recorded over 5,000 short videos within a three-month collection period, with an average duration of 15 seconds and a frame rate of 30 FPS. The data collection equipment included four fixed cameras and one PTZ camera, with resolutions vary in 1280×960, 1440×1080, and 1920×1080. During data collection, we increased data diversity by varying collection times and adjusting camera angles and focal lengths. We then strictly selected data based on diversity principles, picking 250 videos totaling about 140,000 frames to form the AVSS database.
AVSS consists of 250 long videos (S1~S25, 140,000 + frames) and corresponding semantic segmentation ground truth. AVSS contains multiple challenges as follows.
Haze: Haze is a kind of disastrous weather. Haze is common in China, especially in winter. Air visibility is significantly reduced in haze days. This presents as imaging ambiguity and is easy to cause large-scale detection defects.
Camouflage: Camouflage is widespread in other change detection datasets, but it has a special manifestation in ground surveillance. The airport ground made of cement is gray-white, and the main color of the aircraft is also white.
Simultaneous Multi-scale Detection: In ground surveillance, both aircraft near dozens of meters and visible in details, and far beyond a kilometer and even blurred in outline, can be seen in the same camera field view at the same time. There may be conflicts in the treatment of different scale targets.
Shadow and Non-uniform Illumination Change: Because the airport ground is an outdoor scenario, shadow and illumination must be considered. Since the airport ground is very wide, the illumination changes are almost non-uniform.
Shape and Color Variation: Head-looking, side-looking, tail-looking and both close-range and long-range aircraft can be seen in AVSS. The color of aircraft also varies greatly in AVSS, e.g. all-white, all-red, all-blue and white-based aircraft mixed with various strips or patterns of other colors.
Stripe Shape: The fuselage and wing of the aircraft have strip shape, which is difficult to be completely detected. The detection integrity of fuselage and wing is important for some applications, e.g. visual conflict alert and docking guidance.
PTZ Camera: PTZ video is a special challenge for unsupervised methods, which generally assume that the background is stationary to facilitate pixel-wise modeling.
Other Challenges: There are some interesting challenges in AVSS, such as the water mist stirring by engines, and separated shadows and objects when planes are just flying up after gliding. Some extreme cases like midnight and strong reflection are not included in AVSS, but six such videos without ground truth (V1~V6) are also attached.
AVSS provides very precise pixel-wise ground truth, which is completely generated by manual annotation. A total of twenty-one students participated in the work of ground truth generation, which took several months.