



Scene understanding is a major challenge in computer vision, where recent approaches are dominated by transformer-based models (ViT, LLMs, MLLMs) that achieve high performance at significant computational cost. This thesis proposes an alternative that combines lightweight convolutional neural networks (lightweight CNNs) with causal graph neural networks (causal GNNs) for efficient spatio-temporal scene analysis under constrained computational budgets. The lightweight CNNs extract visual features efficiently, while the causal GNNs model dynamic relationships between objects in a scene graph, addressing object detection and relationship prediction in complex environments. Unlike current transformer-based models, this approach aims to reduce computational complexity while maintaining competitive accuracy, with potential applications in embedded vision and real-time systems.
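
The pipeline described above can be illustrated with a minimal sketch, assuming a PyTorch implementation: a small depthwise-separable CNN produces per-object feature vectors, and a simple message-passing layer refines them over a scene graph whose adjacency matrix encodes pairwise relations. All module names, dimensions, and the hand-written adjacency are hypothetical placeholders, and the causal component of the GNN is not represented here; this is not the thesis's actual architecture.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise + pointwise convolution, the building block of many lightweight CNNs."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class LightweightBackbone(nn.Module):
    """Small CNN mapping an object crop to a feature vector (illustrative placeholder)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            DepthwiseSeparableConv(32, 64, stride=2),
            DepthwiseSeparableConv(64, feat_dim, stride=2),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, crops):                       # crops: (N, 3, H, W)
        return self.features(crops).flatten(1)      # (N, feat_dim)


class SceneGraphLayer(nn.Module):
    """One round of message passing over the scene graph's adjacency matrix."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.msg = nn.Linear(feat_dim, feat_dim)
        self.update = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, node_feats, adj):             # adj: (N, N) relation mask
        messages = adj @ self.msg(node_feats)       # aggregate neighbour messages
        fused = torch.cat([node_feats, messages], dim=-1)
        return torch.relu(self.update(fused))


if __name__ == "__main__":
    # Five detected objects, each a 64x64 crop, with a hand-written chain-like adjacency.
    crops = torch.randn(5, 3, 64, 64)
    adj = torch.eye(5) + torch.diag(torch.ones(4), 1)
    feats = LightweightBackbone()(crops)            # per-object CNN features
    refined = SceneGraphLayer()(feats, adj)         # relation-aware node features
    print(refined.shape)                            # torch.Size([5, 128])
```

In a full system, the adjacency would come from a relationship-prediction head rather than being fixed by hand, and several graph layers could be stacked to propagate information across the whole scene.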

