2025-11-17T18:07:13.560068

A Matter of Representation: Towards Graph-Based Abstract Code Generation

Iskandar, Bedri, Tsen
Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets. In this work, we propose and evaluate JSON representations for graphs to enable high accuracy graph-based abstract code generation. We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations. We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task. All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.
academic

A Matter of Representation: Towards Graph-Based Abstract Code Generation

Basic Information

  • Paper ID: 2510.13163
  • Title: A Matter of Representation: Towards Graph-Based Abstract Code Generation
  • Authors: Nyx Iskandar (UC Berkeley), Hisham Bedri (Ramen VR), Andy Tsen (Ramen VR)
  • Classification: cs.CL (Computational Linguistics)
  • Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Deep Learning for Code
  • Paper Link: https://arxiv.org/abs/2510.13163v1

Abstract

While current large language models (LLMs) excel at generating raw, sequential code, there has been minimal research on graph-based abstract code generation. Graph-based abstract code encapsulates important logic in predefined nodes, with edges determining execution flow. This code form is common in visual programming languages and is important in scenarios where raw source code is inaccessible to users and LLM training sets. This paper proposes and evaluates JSON representation methods for graphs to enable high-precision graph-based abstract code generation. The authors evaluate these representations on ScratchTest, a small benchmark based on a Python reimplementation of Scratch. The research finds that under correct graph representations, LLMs can indeed complete the aforementioned task in a single generation without relying on specialized or complex pipelines. Different representation methods yield significantly different accuracy rates, highlighting the critical role of representation in this generation task.

Research Background and Motivation

Problem Definition

Current LLMs in code generation primarily focus on raw, sequential code generation, where code is arranged linearly in terms of lines. However, many practical application scenarios require graph-based abstract code generation, such as:

  • Visual Programming Languages: Scratch, Unreal Engine Blueprints, n8n, etc.
  • Highly Abstract Libraries and Frameworks: Implementation details are encapsulated, and users can only operate through predefined interfaces

Importance Analysis

  1. Widespread Application: Graph-based programming languages are widely used by beginners, game developers, and software engineers
  2. Scarcity of Training Data: Emerging libraries and frameworks lack sufficient training data, facing similar abstraction challenges as graph-based code
  3. Non-linear Relationships: Graph-based languages introduce complex non-linear relationships between nodes, which traditional in-context learning struggles to address

Limitations of Existing Methods

  • Graph Generation Methods: GraphRNN, GraphGAN, etc., focus on general graph generation and are unsuitable for functional code graphs
  • Graph Foundation Models (GFMs): GNN-based methods have poor scalability, while LLM-based methods are overly dependent on fragile natural language
  • Code Generation Models: Primarily designed for sequential code, with varying support capabilities across different languages/frameworks

Core Contributions

  1. Proposed JSON representation methods for nodes: Enabling current LLMs to generate syntactically and logically accurate code graphs
  2. Proposed JSON representation methods for code graphs: Further improving the accuracy of LLM-generated graph representations
  3. Constructed the ScratchTest benchmark: Based on a Python reimplementation of Scratch, specifically designed to evaluate graph-based abstract code generation capabilities
  4. Validated the importance of representation: Demonstrated that under a single-agent LLM framework, correct representation can significantly improve generation accuracy

Methodology Details

Task Definition

  • Input: Natural language descriptions of functional requirements
  • Output: A connected graph satisfying requirements, containing predefined nodes and edge connections
  • Constraints: The graph must be a directed acyclic graph (DAG) to ensure valid execution sequences

ScratchTest Benchmark Design

Benchmark Characteristics

  • Node Count: 53 built-in Scratch blocks (from 107 CLI-implementable parts)
  • Node Types: Eight categories including motion, appearance, sound, events, control, sensing, operators, and variables
  • Simplified Implementation: Does not directly manipulate sprites; evaluates functionality through behavior logs
  • State Persistence: Maintains sprite attribute dictionaries (position, direction, etc.)

Evaluation Method

  • Test Set: 20 unique functional description prompts
  • Evaluation Runs: Each prompt run independently 5 times
  • Evaluation Criteria: Manual assessment of behavior logs and logical correctness of Python files

Representation Method Design

Reference Node Representation

[NODENAME]: {
    inPorts: [{id: string, type: string}],
    fields: [{id: string, type: string}],
    outPorts: [{id: string, type: string}]
}

Key Components:

  • NODENAME: Corresponds to Scratch block names
  • inPorts: Input ports, including parameters and EXEC ports (execution flow)
  • fields: Parameters with predefined options
  • outPorts: Output ports, including return values, THEN ports (subsequent execution), SUBSTACK ports (loops/control)
  • type: Port type, preventing incompatible connections

Output Graph Representation

{
    nodes: {
        [key: string]: {
            name: string,
            value: any | null
        }
    },
    edges: [{
        outNodeID: string,
        outPortID: string,
        inNodeID: string,
        inPortID: string
    }]
}

Design Advantages:

  • Separation of Concerns: Nodes and edges defined separately, reducing errors
  • Linear Generation: Nodes defined first, then connection relationships
  • Avoid Redundancy: Each edge defined only once

Post-processing Pipeline

  1. Topological Sorting: Ensures the directed acyclic property of the graph
  2. Python Conversion: Converts graph representation to Pythonic Scratch implementation
  3. Object Instantiation: Creates Scratch block objects and binds variables
  4. Connection Establishment: Establishes execution flow based on THEN and EXEC ports

Experimental Setup

Dataset

  • ScratchTest: 20 functional description prompts
  • Example Prompts:
    • "When the green flag is clicked, continuously move in a square pattern until the user presses the space key"
    • "When the 's' key is pressed, say a secret password made of two random letters and three random numbers"

Evaluation Metrics

  • Accuracy: Proportion of correctly implemented functionalities
  • Evaluation Criteria:
    • Reasonable behavior log output
    • Logically correct Python files
    • No post-processing or execution errors

Comparison Methods

Reference Node Representation Ablation

  1. No Types: Minimal baseline removing port type information
  2. Extra Description: Adding natural language descriptions of nodes and ports
  3. Proposed: Complete proposed representation

Output Graph Representation Ablation

  1. Alternative: Alternative representation with edge information embedded in nodes
  2. Proposed: Proposed representation separating nodes and edges

Implementation Details

  • Primary Model: OpenAI gpt-oss-120b (Groq platform)
  • Other Models: qwen3-32b, deepseek-r1-distill-llama-70b, llama-3.3-70b-versatile
  • Configuration: Temperature=1, Max Tokens=8192, Top P=1
  • Framework: Single-agent, no fine-tuning, no in-context learning, no multi-turn interaction

Experimental Results

Main Results

Reference Node Representation Ablation Results

MethodAverage AccuracyStandard Deviation
No Types0.650.071
Extra Description0.740.082
Proposed0.750.050

Statistical Significance:

  • Proposed vs No Types: p=0.036 ≤ 0.05 (significant)
  • Proposed vs Extra Description: p=0.82 > 0.05 (not significant)

Output Graph Representation Ablation Results

MethodAverage AccuracyStandard Deviation
Alternative0.490.042
Proposed0.750.050

Statistical Significance: p=0.000024 ≤ 0.05, Proposed improves over Alternative by 23-29%

Key Findings

  1. Importance of Type Information: Adding port types significantly improves accuracy, preventing incompatible connections
  2. Limited Value of Descriptive Information: Natural language descriptions cannot significantly improve performance and instead increase token consumption
  3. Critical Role of Representation Structure: Separated graph representation significantly outperforms embedded representation
  4. Improved Consistency: The proposed method demonstrates more stable performance across multiple runs

Common Error Types

  1. Invalid DAG: Generating graphs containing cycles
  2. Undeclared Variable References: Referencing undefined variables
  3. Port Connection Errors: Connecting ports in the same direction or forgetting to define ports

Performance of Other Models

The other three models (qwen3-32b, deepseek-r1-distill-llama-70b, llama-3.3-70b-versatile) perform significantly worse than gpt-oss-120b, with generally lower accuracy rates and higher error rates in most cases.

Graph Understanding and Generation

  • General Graph Generation: GraphRNN, GraphGAN focus on learning graph distributions and are unsuitable for functional code graphs
  • Graph Foundation Models: GNN-based methods have poor scalability, while LLM-based methods rely on fragile natural language
  • LLM Graph Understanding: Existing work primarily evaluates mathematical graph understanding rather than logical coding graphs

LLM Code Agents

  • Code Generation Capabilities: Current LLMs excel at traditional code generation
  • Enhancement Methods: Reinforcement learning, chain-of-thought, infilling training, constrained decoding, etc.
  • Performance Differences: Significant performance variations exist across different languages/frameworks, primarily determined by training data availability

Conclusions and Discussion

Main Conclusions

  1. Feasibility Verification: LLMs can complete graph-based abstract code generation in a single generation
  2. Critical Role of Representation: Correct JSON representation is crucial for generation accuracy
  3. Design Principles: Type information, separation of concerns, and linear generation flow are key elements of effective representation
  4. Benchmark Establishment: ScratchTest provides an evaluation benchmark for graph-based code generation

Limitations

  1. Accuracy Limitations: Even with optimal representation, 100% accuracy is not achieved
  2. Model Dependency: Performance is highly dependent on specific LLM capabilities
  3. Scale Limitations: Current validation only in the relatively simple Scratch domain
  4. Evaluation Method: Relies on manual evaluation, lacking automated evaluation standards

Future Directions

  1. Representation Optimization: Exploring more optimal non-JSON representation methods
  2. Specialized Models: Developing models specifically designed for code graph generation
  3. Multi-agent Framework: Combining planning and error correction through multi-turn approaches
  4. Large-scale Validation: Validating in more complex graph-based programming environments

In-Depth Evaluation

Strengths

  1. Problem Novelty: First systematic study of graph-based abstract code generation
  2. Method Simplicity: The proposed JSON representation method is simple, effective, and easy to implement and extend
  3. Experimental Rigor: Results ensured through multiple runs and statistical testing
  4. Practical Value: Provides foundation for AI-assisted development in visual programming languages

Weaknesses

  1. Evaluation Limitations: Manual evaluation is insufficiently objective and difficult to scale
  2. Benchmark Scale: 20 test cases are relatively few and may not be comprehensive
  3. Model Coverage: Results primarily based on one model; other models perform poorly
  4. Theoretical Analysis: Lacks in-depth theoretical explanation for why certain representations are more effective

Impact

  1. Pioneering Contribution: Establishes research baseline for graph-based code generation
  2. Practical Application: Can be directly applied to visual programming tools like Scratch
  3. Scalability: Method is extensible to other graph-based programming environments
  4. Inspirational Significance: Emphasizes the importance of representation learning in code generation

Applicable Scenarios

  1. Education: Assisting code generation in teaching tools like Scratch
  2. Rapid Prototyping: Automating blueprint systems in game development
  3. Workflow Automation: Intelligent configuration of tools like n8n
  4. New Library Adaptation: Code generation for new frameworks lacking training data

References

The paper cites 54 relevant references covering important works in LLM code generation, graph neural networks, visual programming languages, and other domains, providing a solid theoretical foundation for the research.


Overall Assessment: This is a pioneering work that systematically addresses graph-based abstract code generation for the first time. While there is room for improvement in evaluation methods and theoretical analysis, the proposed representation method is simple and effective, laying an important foundation for this emerging research direction. This work has strong practical value and inspirational significance, and is expected to promote further development in related fields.