Roam Bounty Program #019 - $8,000+

Overview

Roam is a research lab building its first social product: a mobile game builder that lets anyone create multiplayer games in minutes. We run targeted bounties to solve complex technical challenges that unlock scalability and extensibility in our stack.

This bounty focuses on developing a Spatial Reasoning Engine, a critical component of our AI pipeline that analyzes gameplay screenshots and videos. The primary goal is to accurately convert 2D screen-space object positions into 3D world-space coordinates, bridging the gap between visual analysis and spatial understanding. The budget for this bounty is $8,000.


Problem Statement

Our AI pipeline excels at visually analyzing gameplay footage, identifying objects, and understanding on-screen events. However, a significant gap remains in translating this 2D understanding into a coherent 3D representation of the game world. The core challenges to be addressed are:

  1. Spatial Reasoning for LLMs: Large Language Models (LLMs) are proficient at visual analysis but lack inherent spatial reasoning, making it difficult for them to infer accurate 3D positions from 2D imagery.
  2. Screen-to-World Complexity: Converting 2D screen-space coordinates into 3D world-space positions requires accurate estimation of both camera parameters and scene depth (see the sketch after this list).
  3. Semantic Understanding: Traditional computer vision techniques often lack the semantic understanding needed to correctly interpret objects and their interactions within the game world.
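
For intuition, here is a minimal Python sketch of the pinhole back-projection at the heart of challenge 2. It assumes you already have estimates for the camera intrinsics K, a camera-to-world pose, and a per-pixel depth value; producing those estimates from raw gameplay footage is precisely the hard part this bounty targets.

```python
import numpy as np

def screen_to_world(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) at an estimated depth into world space.

    K            -- 3x3 camera intrinsics matrix (estimated)
    cam_to_world -- 4x4 camera-to-world transform (estimated pose)
    depth        -- z-depth at (u, v), e.g. from a monocular depth model
    """
    # Pixel -> viewing ray in camera space (pinhole model), scaled by depth.
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Camera space -> world space via the homogeneous transform.
    p_world = cam_to_world @ np.append(p_cam, 1.0)
    return p_world[:3]
```

In practice the depth input might come from a monocular depth-estimation model and the pose from a camera-parameter estimation step; the choice of techniques for both is left to the applicant.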

Context and References

Existing Pipeline Components

Our current ecosystem consists of several key components that the Spatial Reasoning Engine will need to integrate with:

The Core Challenge: Screen-to-World Conversion

The central task of this bounty is to build the Spatial Reasoning Engine, which acts as the missing link in our data-flow pipeline. This engine will take the 2D screen-space positions and object data produced by roam-game-analysis and convert them into accurate 3D world-space positions, which are then used to generate a dynamic scene graph in JSON format.
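
For concreteness, below is a minimal sketch of what one frame of that JSON scene graph might look like. The schema, field names, and values are illustrative assumptions, not the actual output format of roam-game-analysis.

```python
import json

# Illustrative only: the real schema and field names produced by
# roam-game-analysis are not specified in this brief.
detection = {"id": "player_1", "label": "player", "screen_xy": [412, 305]}
world_xyz = [3.2, 0.0, -7.5]  # e.g. output of the screen-to-world step above

scene_graph = {
    "frame": 1042,
    "nodes": [
        {
            "id": detection["id"],
            "label": detection["label"],
            "position_world": world_xyz,
        }
    ],
    # Spatial relations, e.g. {"from": "player_1", "rel": "near", "to": "goal"}
    "edges": [],
}

print(json.dumps(scene_graph, indent=2))
```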

Proposed Approaches