Spatially Grounded AI for Smart Living Spaces.
Research
Accepted at PeerJ Computer Science.
Background. The swift advancement of Internet of Things (IoT) technology has revolutionized smart home settings, yet prevalent automation systems remain limited by their reliance on specific device identification and rigid rule-based configurations. These constraints impede natural human-device interaction, especially in dynamic or communal environments where spatial context is more instinctive than predetermined naming conventions. Current solutions frequently neglect spatial reasoning and multimodal inputs, resulting in heightened cognitive demands and diminished accessibility. To overcome these problems, the proposed work develops a spatial context-aware control system aimed at facilitating intuitive, vision-driven, and language-based interaction with smart devices.
Methods. The proposed model is a modular, multimodal framework that integrates computer vision, natural language processing, and spatial inference for context-aware smart device control. The system comprises six core components: (i) an Onboarding Inference Engine for extracting device information via natural language input, (ii) Zero-Shot Device Detection using OWL-ViT for object identification without prior training, (iii) Metadata Refinement and Filtering for structured annotation and disambiguation, (iv) a Geospatial Device Visualizer for annotated visual feedback, (v) Spatial Topology Inference using GPT-4o for reasoning about physical layouts, and (vi) Intent-Based Command Synthesis with Gemini Flash to generate precise, executable control commands. The final Agentic Execution Module interfaces with the Tuya Smart Device API, ensuring vendor-agnostic actuation. The system supports multilingual input and adapts to various environmental contexts, including smart homes and assisted living facilities.
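The system's source is not reproduced here, but as a rough illustration of the zero-shot detection stage (ii), a minimal sketch using the publicly released OWL-ViT checkpoint from Hugging Face Transformers might look as follows. The text queries, score threshold, and image path are illustrative assumptions rather than the authors' configuration.

# Minimal sketch of zero-shot device detection with OWL-ViT via Hugging Face Transformers.
# The queries, threshold, and image path below are illustrative, not the authors' settings.
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("room.jpg")  # hypothetical snapshot of the living space
queries = [["a ceiling light", "a desk lamp", "a fan", "an air conditioner", "a television"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into per-label detections above a confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    name = queries[0][int(label)]
    print(f"{name}: score={score.item():.2f}, box={[round(v, 1) for v in box.tolist()]}")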
Results. A user study involving 15 participants (aged 18–80, with diverse educational backgrounds) was conducted to evaluate the effectiveness of the proposed method against the Google Home Assistant. Quantitative findings demonstrate a statistically significant reduction in cognitive workload, with NASA Task Load Index (TLX) scores decreasing by an average of 13.17 points (p = 0.0013, Cohen's d = 1.0381). Participants rated the proposed method higher for ease of use (mean = 4.67) than Google Home (mean = 3.8) on a 5-point Likert scale. Qualitative feedback highlighted the intuitive nature of spatial context commands, the reduced cognitive burden from eliminating device-name memorization, and enhanced accessibility through support for regional languages. Overall, 93.3% of users preferred the proposed method over the baseline system. These results affirm the feasibility and user-centric benefits of integrating vision-language models for context-aware smart device control.
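For readers who want to run the same workload comparison on their own ratings, the paired statistics reported above (mean TLX reduction, p-value, Cohen's d) can be computed roughly as below. The two score arrays are placeholders, not the study's data, and the paper may use a different effect-size convention.

# Hedged sketch: paired comparison of NASA-TLX scores (lower is better).
# The two arrays are placeholder values, NOT the study's data.
import numpy as np
from scipy import stats

tlx_google_home = np.array([62, 55, 70, 48, 66, 59, 73, 51, 64, 58, 69, 54, 61, 67, 56], dtype=float)
tlx_inot = np.array([49, 40, 58, 37, 52, 45, 60, 39, 51, 44, 57, 41, 47, 54, 42], dtype=float)

diff = tlx_google_home - tlx_inot  # positive values mean workload dropped with INOT
t_stat, p_value = stats.ttest_rel(tlx_google_home, tlx_inot)
cohens_d = diff.mean() / diff.std(ddof=1)  # one common paired-samples convention

print(f"mean reduction = {diff.mean():.2f} TLX points")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, d = {cohens_d:.3f}")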
Demonstration
Start from 2:26 to see it execute complex commands.
Intent-aware environment, not just a command executor.
From 2:26:
“It’s too bright, dim the lights.” at 2:26
INOT interprets this as reducing overall brightness rather than turning everything off, leaving all lights on except the one near the AC and resulting in comfortable ambient lighting.
“It’s getting warm.” at 2:40
INOT understands this as a thermal comfort request and switches on the fan.
“I want to sleep.” at 2:49
INOT infers the intent for darkness and rest, and switches off all the lights, in contrast to the earlier request to merely dim them.
“I am working on the computer.” at 3:02
INOT spatially understands where the computer workspace is located and turns on the two lights closest to that workspace.
“I need focused lighting for reading.” at 3:12
INOT reasons that only task-specific lighting is required. It avoids turning on nearby ambient lights and specifically switches on the desk lamp.
“Turn on the lights near the bed.” at 3:20
INOT identifies the bed’s location and activates only the closest light, ignoring all other lights in the room.
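Commands like "near the bed" ultimately come down to associating a referenced anchor object with the closest detected device. A minimal distance-based sketch of that resolution is shown below; the coordinates are hypothetical bounding-box centroids, and the actual system performs richer spatial-topology reasoning with GPT-4o rather than raw pixel distances.

# Hedged sketch: resolve "the light near the bed" to the closest detected light.
# Coordinates are hypothetical bounding-box centroids (x, y) in image pixels;
# INOT itself reasons over spatial topology with GPT-4o rather than raw distances.
import math

centroids = {
    "bed": (410, 320),
    "light_1": (120, 60),
    "light_2": (380, 55),   # closest to the bed in this made-up layout
    "light_3": (640, 70),
    "desk_lamp": (700, 260),
}

def nearest_device(anchor, candidates):
    """Return the candidate device whose centroid is closest to the anchor object."""
    return min(candidates, key=lambda c: math.dist(centroids[anchor], centroids[c]))

lights = ["light_1", "light_2", "light_3", "desk_lamp"]
target = nearest_device("bed", lights)
print(f'"Turn on the lights near the bed" -> actuate {target}')  # light_2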
Notable Commands Executed Correctly, Where Apparent Errors Stemmed from Human Perception
“Turn on the light near the photo.” at 2:00
INOT correctly identified the light closest to the photo and issued the command to turn it on. However, no visible change was observed because the light was already on.
“Turn on the light near the AC.” at 2:12
INOT correctly identified the light closest to the AC and issued the command to turn it on. Again, no visible change was observed because the light was already on.
Instead of remembering names like "Switch on light no. 3", say:
"Turn on the light on the TV's right."
Without any predetermined setup, simply say:
"Turn on the light near the AC."
The figure visualizes the mean overall TLX scores across the six dimensions, with standard deviations (lower is better).
Dimension-wise comparison of NASA TLX scores across the six dimensions (lower is better).
Condition 1: commercially available Google Home Assistant; Condition 2: proposed method, the INOT Assistant.
A total of fifteen participants were recruited for the study, with ages ranging from 18 to 80 years (Mean: 45.8 years, Median: 49 years, Standard Deviation: 19.08).
The gender distribution was 8 females (53.3%) and 7 males (46.7%).
Model room where the user study was conducted under both conditions. The room contains four lights and one fan, all with smart home connectivity enabled.
Reference tasks provided to participants included the following (an illustrative actuation sketch follows the list):
"Switch on the light near the AC."
"Switch on the light above the photo frame."
"Turn on the light on the desk."
"Switch on the leftmost light."
"Turn on the fan."
"Turn on lighting for studying or working."
@article{kalivarathan2025intelligence,
title={Intelligence of Things: A Spatial Context-Aware Control System for Smart Devices},
author={Kalivarathan, Sukanth and Mohamed, Muhmmad Abrar Raja and Ravikumar, Aswathy and Harini, S},
journal={arXiv preprint arXiv:2504.13942},
year={2025}
}