Roam Bounty Program #019 - $8,000+

Overview

Roam is a research lab building its first social product: a mobile game builder that lets anyone create multiplayer games in minutes. We run targeted bounties to solve complex technical challenges that unlock scalability and extensibility in our stack.

This bounty focuses on developing a Spatial Reasoning Engine, a critical component of our AI pipeline that analyzes gameplay screenshots and videos. The primary goal is to accurately convert 2D screen-space object positions into 3D world-space coordinates, bridging the gap between visual analysis and spatial understanding. The budget for this bounty is $8,000.


Problem Statement

Our AI pipeline excels at visually analyzing gameplay footage, identifying objects, and understanding on-screen events. However, a significant gap remains in translating this 2D understanding into a coherent 3D representation of the game world. The core challenges to be addressed are:

  1. Spatial Reasoning for LLMs: Large Language Models (LLMs) are proficient at visual analysis but lack inherent spatial reasoning, making it difficult for them to infer accurate 3D positions from 2D imagery.
  2. Screen-to-World Complexity: Converting 2D screen-space coordinates into 3D world-space positions requires accurate estimation of both camera parameters and scene depth (see the sketch after this list).
  3. Semantic Understanding: Traditional computer vision techniques often lack the semantic understanding needed to correctly interpret objects and their interactions within the game world.
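
For intuition, here is a minimal Python sketch of the pinhole back-projection at the heart of challenge 2. It assumes you already have estimates for the camera intrinsics K, a camera-to-world pose, and a per-pixel depth value; producing those estimates from raw gameplay footage is precisely the hard part this bounty targets.

```python
import numpy as np

def screen_to_world(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) at an estimated depth into world space.

    K            -- 3x3 camera intrinsics matrix (estimated)
    cam_to_world -- 4x4 camera-to-world transform (estimated pose)
    depth        -- z-depth at (u, v), e.g. from a monocular depth model
    """
    # Pixel -> viewing ray in camera space (pinhole model), scaled by depth.
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Camera space -> world space via the homogeneous transform.
    p_world = cam_to_world @ np.append(p_cam, 1.0)
    return p_world[:3]
```

In practice the depth input might come from a monocular depth-estimation model and the pose from a camera-parameter estimation step; the choice of techniques for both is left to the applicant.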

Context and References

Existing Pipeline Components

Our current ecosystem consists of several key components that the Spatial Reasoning Engine will need to integrate with:

The Core Challenge: Screen-to-World Conversion

The central task of this bounty is to build the Spatial Reasoning Engine, which acts as the missing link in our data-flow pipeline. This engine will take the 2D screen-space positions and object data produced by roam-game-analysis and convert them into accurate 3D world-space positions, which are then used to generate a dynamic scene graph in JSON format.
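
For concreteness, below is a minimal sketch of what one frame of that JSON scene graph might look like. The schema, field names, and values are illustrative assumptions, not the actual output format of roam-game-analysis.

```python
import json

# Illustrative only: the real schema and field names produced by
# roam-game-analysis are not specified in this brief.
detection = {"id": "player_1", "label": "player", "screen_xy": [412, 305]}
world_xyz = [3.2, 0.0, -7.5]  # e.g. output of the screen-to-world step above

scene_graph = {
    "frame": 1042,
    "nodes": [
        {
            "id": detection["id"],
            "label": detection["label"],
            "position_world": world_xyz,
        }
    ],
    # Spatial relations, e.g. {"from": "player_1", "rel": "near", "to": "goal"}
    "edges": [],
}

print(json.dumps(scene_graph, indent=2))
```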

Proposed Approaches