



In many complex environments, such as industrial sites, disaster-stricken buildings, or public spaces, it is necessary to automatically detect and localize sound events (falls, alarms, voices, mechanical failures). Mobile platforms equipped with cameras and microphones are a promising solution, but a single platform remains limited: its microphone array provides an approximate direction towards the source but not a precise position in space, and its camera may be obstructed. This thesis proposes to study how a network of mobile platforms, each carrying a calibrated audio-visual unit, can collaborate to localize and classify such events in 3D. Each platform analyses its own audio-visual observations and shares an estimate of the source direction with its neighbours; the network then combines these estimates to reconstruct the position of the event and to identify it. The expected outcome is a cooperative localization system that is robust to occlusions and to partial platform failures.
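
To make the fusion step concrete, the sketch below shows one standard way the shared direction estimates could be combined: a least-squares triangulation that finds the 3D point closest to the bearing rays reported by the platforms. The numpy-based formulation, the function name, and the example geometry are illustrative assumptions, not the method developed in the thesis.

```python
import numpy as np

def triangulate_event(positions, directions):
    """Least-squares 3D point closest to a set of bearing lines.

    positions  : (N, 3) array of platform positions p_i
    directions : (N, 3) array of direction-of-arrival vectors d_i
    Returns the point x minimizing sum_i ||(I - d_i d_i^T)(x - p_i)||^2,
    i.e. the sum of squared distances from x to each platform's ray.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d in zip(positions, directions):
        d = d / np.linalg.norm(d)           # guard against non-unit bearings
        P = np.eye(3) - np.outer(d, d)      # projector orthogonal to the ray
        A += P
        b += P @ p
    # Singular A (e.g. all rays parallel) means the geometry cannot
    # resolve a unique point; np.linalg.solve then raises LinAlgError.
    return np.linalg.solve(A, b)

# Hypothetical example: three platforms observing a source at (2, 3, 1).
src = np.array([2.0, 3.0, 1.0])
pos = np.array([[0.0, 0.0, 0.0],
                [5.0, 0.0, 0.0],
                [0.0, 6.0, 2.0]])
dirs = src - pos                            # noiseless bearings for the demo
print(triangulate_event(pos, dirs))         # ~ [2. 3. 1.]
```

With exact bearings the recovered point coincides with the source; with noisy or missing estimates the same linear system still yields a best-fit position from whatever directions remain, which is the property that would make such a fusion step tolerant of occlusions and partial platform failures.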

