According to the framework of ecological psychology, attendees (both teachers and learners) are seen as coupled to a specific classroom place (the CAC), approaching a variety of resources, human or material, that enable different actions and interactions with multiple participants [1]. Materials and humans are resources to which people orient their attention in order to co-construct the meaning to be taught, in multiple perception–action loops: attention directed toward resources shapes perception and action which, in turn, can transform the resources, making goals and knowledge emerge [2]. Since social learning episodes have a multimodal dimension, they are essentially triggered and maintained by eye and postural signals (shared attention processes, deictics [3]), by other human modalities (speech, emotions, etc.), and by further contextual factors. This motivates the use of multiple types of data collected as a learning episode unfolds (gaze scan paths, classroom noise level, facial expressions, etc.).
Over the past two decades, multimodal signal processing research has seldom been conducted in real-world settings such as classrooms, while in education research interest in CACs (a.k.a. smart classrooms) has grown. Yet little theoretically grounded research has been carried out through the use of these classrooms: they have mostly been treated as purely technology-driven, solutionist “show-rooms” supporting actions such as scanning student attendance. As a result, their attractiveness declined, further worsened by privacy concerns.
Recent research on teachers’ cognition and practice using mobile eye-trackers validates earlier results on the expert–novice paradigm and on cultural differences [4]. To date, this work neither accounts for multimodal data nor considers the classroom as a whole. MULETA aims at embedding research in a classroom enriched with sensing and effector devices capable of storing, processing, and analyzing contextual data, in order to (semi-)automatically support the actions and decisions of its attendees, such as supervision and control.
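To make this aim concrete, the following minimal sketch (in Python) illustrates one possible form of such a sensing-and-support loop; the event types, device names, and the noise threshold are assumptions for illustration, not MULETA’s actual design: sensing devices push timestamped contextual events into a store, and a simple rule turns them into a (semi-)automatic suggestion for the teacher.

```python
"""Hypothetical sketch of a sensing -> context store -> suggestion loop.
All names and thresholds are illustrative assumptions."""

from dataclasses import dataclass, field


@dataclass
class ContextEvent:
    t: float          # seconds since the session started
    source: str       # e.g. "ceiling_mic", "eye_tracker" (hypothetical device ids)
    kind: str         # e.g. "noise_level_db", "attention_target"
    value: object


@dataclass
class ClassroomContextStore:
    events: list[ContextEvent] = field(default_factory=list)

    def record(self, event: ContextEvent) -> None:
        """Store one contextual observation pushed by a sensing device."""
        self.events.append(event)

    def suggest_actions(self, noise_threshold_db: float = 70.0) -> list[str]:
        """Very simple supervision rule: flag repeated high noise levels."""
        noisy = [e for e in self.events
                 if e.kind == "noise_level_db" and e.value > noise_threshold_db]
        if len(noisy) >= 3:
            return ["Suggest a short pause or activity change (noise rising)."]
        return []


if __name__ == "__main__":
    store = ClassroomContextStore()
    for t, db in [(10, 62), (20, 73), (30, 75), (40, 78)]:
        store.record(ContextEvent(t, "ceiling_mic", "noise_level_db", db))
    print(store.suggest_actions())
```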
Computational perception of people is a mature field. In many perception tasks, recent advances in deep neural networks go far beyond what was possible a few years ago. Multimodal perception of people and groups in a CAC encompasses a set of different tasks dealing with different signals. Acoustic signals make it possible to analyze what the teacher or students say and how it is said. Visual signals support detecting people and their location, body posture, activities, and facial expression changes. Mobile eye-trackers worn by the teacher can be used to infer cognitive phenomena and attention focus, using respectively pupillary response and scan paths. However, these advances mask challenges that remain before reaching the performance and generalization levels of human perception [5], especially in real-world contexts.
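As a concrete illustration of the eye-tracking part, the sketch below derives two simple indicators from mobile eye-tracker output: a baseline-corrected pupillary response as a proxy for cognitive load, and a fixation share over areas of interest (AOIs) as a proxy for attention focus. The sample structure, AOI labels, and values are assumptions for the example, not the project’s processing chain.

```python
"""Illustrative indicators computed from (hypothetical) eye-tracker samples."""

from dataclasses import dataclass
from collections import Counter
from statistics import mean
from typing import Optional


@dataclass
class GazeSample:
    t: float                 # timestamp in seconds
    pupil_diameter: float    # millimetres, as reported by the eye-tracker
    aoi: Optional[str]       # assumed label, e.g. "whiteboard", "student_rows"


def pupil_dilation_index(samples: list[GazeSample], baseline_end: float) -> float:
    """Relative pupil dilation with respect to a baseline window [0, baseline_end)."""
    baseline = mean(s.pupil_diameter for s in samples if s.t < baseline_end)
    task = mean(s.pupil_diameter for s in samples if s.t >= baseline_end)
    return (task - baseline) / baseline


def attention_distribution(samples: list[GazeSample]) -> dict[str, float]:
    """Share of gaze samples falling on each area of interest."""
    counts = Counter(s.aoi for s in samples if s.aoi is not None)
    total = sum(counts.values())
    return {aoi: n / total for aoi, n in counts.items()}


if __name__ == "__main__":
    stream = [
        GazeSample(0.0, 3.1, "whiteboard"),
        GazeSample(0.5, 3.2, "whiteboard"),
        GazeSample(1.0, 3.6, "student_rows"),
        GazeSample(1.5, 3.8, "student_rows"),
    ]
    print(pupil_dilation_index(stream, baseline_end=1.0))   # ~0.175
    print(attention_distribution(stream))                   # 50% / 50% here
```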
Research on perceptive spaces is not new [5], but CAC data have been little studied by computer scientists. Recent research on perception in similar contexts focuses on people as individuals, i.e., each individual is processed as one unit. To detect a group’s mood, current perception systems detect individual faces and facial emotions as accurately as possible and then average the individual information. As far as we know, no computer science research attempts to evaluate the overall atmosphere of a perceptive space (i.e., a CAC) from global views rather than from such averages.
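The per-individual baseline criticized above can be stated in a few lines; the sketch below shows group “mood” obtained by averaging per-face emotion probabilities (the label set and the probability values are illustrative assumptions, in practice produced by a facial-expression classifier). The holistic alternative targeted here would instead score the whole scene without this per-face decomposition.

```python
"""Per-individual averaging baseline for group mood (illustrative only)."""

import numpy as np

EMOTIONS = ["neutral", "happy", "surprised", "bored"]  # assumed label set


def group_mood_by_averaging(per_face_probs: np.ndarray) -> dict[str, float]:
    """per_face_probs: (n_faces, n_emotions) array of per-face softmax outputs."""
    mean_probs = per_face_probs.mean(axis=0)
    return {e: float(round(p, 3)) for e, p in zip(EMOTIONS, mean_probs)}


if __name__ == "__main__":
    # Three detected faces, four emotion scores each (made-up values).
    faces = np.array([
        [0.6, 0.2, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.3, 0.1, 0.1, 0.5],
    ])
    print(group_mood_by_averaging(faces))
    # {'neutral': 0.333, 'happy': 0.333, 'surprised': 0.1, 'bored': 0.233}
```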