EVLM: Intent-Driven Edge Vision Language Model for UAV-Based Power Line Inspection
2026 IEEE International Conference on Edge Computing and Communications (IEEE EDGE 2026)
Reza Farahani (TU Wien, Austria), Zoha Azimi (AAU, Austria), Ilir Murturi (University of Prishtina, Kosovo), Arda Goknil (SINTEF, Norway), Sagar Sen (SINTEF, Norway), Christian Timmerer (AAU, Austria), Schahram Dustdar (TU Wien, Austria)
Abstract: Inspection of critical infrastructure, such as power lines, is increasingly conducted using unmanned aerial vehicles (UAVs) that capture aerial video for subsequent human review. Although recent edge-based approaches deploy onboard object detectors to identify predefined defect classes, these pipelines remain closed-set, task-specific, and largely decoupled from operator intent and edge resource constraints. This paper introduces EVLM, an intent-driven vision-language framework for onboard UAV-based power line inspection. Given a high-level operator intent, EVLM (i) applies lightweight histogram-based frame filtering to extract salient key frames under bounded compute budgets, (ii) executes a domain-adapted vision-language model (VLM) directly on the UAV for intent-conditioned multimodal reasoning, and (iii) synthesizes structured inspection reports together with a minimal set of evidence frames, replacing continuous raw-video transmission with compact semantic outputs. To align the VLM with infrastructure inspection semantics while preserving edge efficiency, we perform parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA), enabling domain specialization without updating the full set of model parameters. We implement and fully deploy EVLM on an NVIDIA Jetson device representative of UAV-class onboard hardware and evaluate it on 20 publicly released power line inspection video sequences spanning 8 heterogeneous environments and 5 operational intent categories. Experimental results show a 94.8% data reduction, with transmitted data decreasing from 485 kB to 25 kB per 4 s segment, corresponding to 72.75 MB versus 3.75 MB over a 10 min inspection mission. EVLM operates feasibly on embedded hardware, maintaining moderate CPU/GPU utilization and bounded power consumption (5.6 W), while producing interpretable, intent-aligned inspection outputs with richer semantic insights than detection-centric baselines.
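The histogram-based frame filtering step described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the bin count, L1-distance threshold, and frame budget are assumed parameters, and a frame is kept whenever its grayscale histogram differs sufficiently from the last retained key frame, with the budget bounding per-segment compute.

```python
import random

def histogram(pixels, bins=32):
    """Normalized grayscale histogram of a flat pixel list (values 0-255)."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def select_key_frames(frames, budget=5, threshold=0.25):
    """Hypothetical sketch of histogram-based key-frame filtering.

    A frame becomes a key frame when the L1 distance between its histogram
    and the last selected frame's histogram exceeds `threshold`; at most
    `budget` frames are kept, bounding per-segment onboard compute.
    """
    selected, last = [], None
    for idx, frame in enumerate(frames):
        hist = histogram(frame)
        if last is None or sum(abs(a - b) for a, b in zip(hist, last)) > threshold:
            selected.append(idx)
            last = hist
            if len(selected) == budget:
                break
    return selected

# Synthetic segment: mostly dark frames with two abrupt scene changes.
rng = random.Random(0)
frames = [[rng.randrange(0, 60) for _ in range(4096)] for _ in range(8)]
frames[3] = [rng.randrange(180, 255) for _ in range(4096)]
frames[6] = [rng.randrange(90, 150) for _ in range(4096)]
print(select_key_frames(frames))  # indices of retained evidence frames
```

Under this sketch, near-duplicate frames of the same scene are dropped while abrupt appearance changes (e.g., a new pylon or vegetation encroachment entering view) are retained as evidence frames for the VLM stage.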