2025-11-17T18:07:13.560068

A Matter of Representation: Towards Graph-Based Abstract Code Generation

Iskandar, Bedri, Tsen

Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets. In this work, we propose and evaluate JSON representations for graphs to enable high accuracy graph-based abstract code generation. We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations. We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task. All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.

academic

A Matter of Representation: Towards Graph-Based Abstract Code Generation

Basic Information

Paper ID: 2510.13163
Title: A Matter of Representation: Towards Graph-Based Abstract Code Generation
Authors: Nyx Iskandar (UC Berkeley), Hisham Bedri (Ramen VR), Andy Tsen (Ramen VR)
Classification: cs.CL (Computational Linguistics)
Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Deep Learning for Code
Paper Link: https://arxiv.org/abs/2510.13163v1

Abstract

While current large language models (LLMs) excel at generating raw, sequential code, there has been minimal research on graph-based abstract code generation. Graph-based abstract code encapsulates important logic in predefined nodes, with edges determining execution flow. This code form is common in visual programming languages and is important in scenarios where raw source code is inaccessible to users and LLM training sets. This paper proposes and evaluates JSON representation methods for graphs to enable high-precision graph-based abstract code generation. The authors evaluate these representations on ScratchTest, a small benchmark based on a Python reimplementation of Scratch. The research finds that under correct graph representations, LLMs can indeed complete the aforementioned task in a single generation without relying on specialized or complex pipelines. Different representation methods yield significantly different accuracy rates, highlighting the critical role of representation in this generation task.

Research Background and Motivation

Problem Definition

Current LLMs in code generation primarily focus on raw, sequential code generation, where code is arranged linearly in terms of lines. However, many practical application scenarios require graph-based abstract code generation, such as:

Visual Programming Languages: Scratch, Unreal Engine Blueprints, n8n, etc.
Highly Abstract Libraries and Frameworks: Implementation details are encapsulated, and users can only operate through predefined interfaces

Importance Analysis

Widespread Application: Graph-based programming languages are widely used by beginners, game developers, and software engineers
Scarcity of Training Data: Emerging libraries and frameworks lack sufficient training data, facing similar abstraction challenges as graph-based code
Non-linear Relationships: Graph-based languages introduce complex non-linear relationships between nodes, which traditional in-context learning struggles to address

Limitations of Existing Methods

Graph Generation Methods: GraphRNN, GraphGAN, etc., focus on general graph generation and are unsuitable for functional code graphs
Graph Foundation Models (GFMs): GNN-based methods have poor scalability, while LLM-based methods are overly dependent on fragile natural language
Code Generation Models: Primarily designed for sequential code, with varying support capabilities across different languages/frameworks

Core Contributions

Proposed JSON representation methods for nodes: Enabling current LLMs to generate syntactically and logically accurate code graphs
Proposed JSON representation methods for code graphs: Further improving the accuracy of LLM-generated graph representations
Constructed the ScratchTest benchmark: Based on a Python reimplementation of Scratch, specifically designed to evaluate graph-based abstract code generation capabilities
Validated the importance of representation: Demonstrated that under a single-agent LLM framework, correct representation can significantly improve generation accuracy

Methodology Details

Task Definition

Input: Natural language descriptions of functional requirements
Output: A connected graph satisfying requirements, containing predefined nodes and edge connections
Constraints: The graph must be a directed acyclic graph (DAG) to ensure valid execution sequences

ScratchTest Benchmark Design

Benchmark Characteristics

Node Count: 53 built-in Scratch blocks (from 107 CLI-implementable parts)
Node Types: Eight categories including motion, appearance, sound, events, control, sensing, operators, and variables
Simplified Implementation: Does not directly manipulate sprites; evaluates functionality through behavior logs
State Persistence: Maintains sprite attribute dictionaries (position, direction, etc.)

Evaluation Method

Test Set: 20 unique functional description prompts
Evaluation Runs: Each prompt run independently 5 times
Evaluation Criteria: Manual assessment of behavior logs and logical correctness of Python files

Representation Method Design

Reference Node Representation

[NODENAME]: {
    inPorts: [{id: string, type: string}],
    fields: [{id: string, type: string}],
    outPorts: [{id: string, type: string}]
}

Key Components:

NODENAME: Corresponds to Scratch block names
inPorts: Input ports, including parameters and EXEC ports (execution flow)
fields: Parameters with predefined options
outPorts: Output ports, including return values, THEN ports (subsequent execution), SUBSTACK ports (loops/control)
type: Port type, preventing incompatible connections

Output Graph Representation

{
    nodes: {
        [key: string]: {
            name: string,
            value: any | null
        }
    },
    edges: [{
        outNodeID: string,
        outPortID: string,
        inNodeID: string,
        inPortID: string
    }]
}

Design Advantages:

Separation of Concerns: Nodes and edges defined separately, reducing errors
Linear Generation: Nodes defined first, then connection relationships
Avoid Redundancy: Each edge defined only once

Post-processing Pipeline

Topological Sorting: Ensures the directed acyclic property of the graph
Python Conversion: Converts graph representation to Pythonic Scratch implementation
Object Instantiation: Creates Scratch block objects and binds variables
Connection Establishment: Establishes execution flow based on THEN and EXEC ports

Experimental Setup

Dataset

ScratchTest: 20 functional description prompts
Example Prompts:
- "When the green flag is clicked, continuously move in a square pattern until the user presses the space key"
- "When the 's' key is pressed, say a secret password made of two random letters and three random numbers"

Evaluation Metrics

Accuracy: Proportion of correctly implemented functionalities
Evaluation Criteria:
- Reasonable behavior log output
- Logically correct Python files
- No post-processing or execution errors

Comparison Methods

Reference Node Representation Ablation

No Types: Minimal baseline removing port type information
Extra Description: Adding natural language descriptions of nodes and ports
Proposed: Complete proposed representation

Output Graph Representation Ablation

Alternative: Alternative representation with edge information embedded in nodes
Proposed: Proposed representation separating nodes and edges

Implementation Details

Primary Model: OpenAI gpt-oss-120b (Groq platform)
Other Models: qwen3-32b, deepseek-r1-distill-llama-70b, llama-3.3-70b-versatile
Configuration: Temperature=1, Max Tokens=8192, Top P=1
Framework: Single-agent, no fine-tuning, no in-context learning, no multi-turn interaction

Experimental Results

Main Results

Reference Node Representation Ablation Results

Method	Average Accuracy	Standard Deviation
No Types	0.65	0.071
Extra Description	0.74	0.082
Proposed	0.75	0.050

Statistical Significance:

Proposed vs No Types: p=0.036 ≤ 0.05 (significant)
Proposed vs Extra Description: p=0.82 > 0.05 (not significant)

Output Graph Representation Ablation Results

Method	Average Accuracy	Standard Deviation
Alternative	0.49	0.042
Proposed	0.75	0.050

Statistical Significance: p=0.000024 ≤ 0.05, Proposed improves over Alternative by 23-29%

Key Findings

Importance of Type Information: Adding port types significantly improves accuracy, preventing incompatible connections
Limited Value of Descriptive Information: Natural language descriptions cannot significantly improve performance and instead increase token consumption
Critical Role of Representation Structure: Separated graph representation significantly outperforms embedded representation
Improved Consistency: The proposed method demonstrates more stable performance across multiple runs

Common Error Types

Invalid DAG: Generating graphs containing cycles
Undeclared Variable References: Referencing undefined variables
Port Connection Errors: Connecting ports in the same direction or forgetting to define ports

Performance of Other Models

The other three models (qwen3-32b, deepseek-r1-distill-llama-70b, llama-3.3-70b-versatile) perform significantly worse than gpt-oss-120b, with generally lower accuracy rates and higher error rates in most cases.

Graph Understanding and Generation

General Graph Generation: GraphRNN, GraphGAN focus on learning graph distributions and are unsuitable for functional code graphs
Graph Foundation Models: GNN-based methods have poor scalability, while LLM-based methods rely on fragile natural language
LLM Graph Understanding: Existing work primarily evaluates mathematical graph understanding rather than logical coding graphs

LLM Code Agents

Code Generation Capabilities: Current LLMs excel at traditional code generation
Enhancement Methods: Reinforcement learning, chain-of-thought, infilling training, constrained decoding, etc.
Performance Differences: Significant performance variations exist across different languages/frameworks, primarily determined by training data availability

Conclusions and Discussion

Main Conclusions

Feasibility Verification: LLMs can complete graph-based abstract code generation in a single generation
Critical Role of Representation: Correct JSON representation is crucial for generation accuracy
Design Principles: Type information, separation of concerns, and linear generation flow are key elements of effective representation
Benchmark Establishment: ScratchTest provides an evaluation benchmark for graph-based code generation

Limitations

Accuracy Limitations: Even with optimal representation, 100% accuracy is not achieved
Model Dependency: Performance is highly dependent on specific LLM capabilities
Scale Limitations: Current validation only in the relatively simple Scratch domain
Evaluation Method: Relies on manual evaluation, lacking automated evaluation standards

Future Directions

Representation Optimization: Exploring more optimal non-JSON representation methods
Specialized Models: Developing models specifically designed for code graph generation
Multi-agent Framework: Combining planning and error correction through multi-turn approaches
Large-scale Validation: Validating in more complex graph-based programming environments

In-Depth Evaluation

Strengths

Problem Novelty: First systematic study of graph-based abstract code generation
Method Simplicity: The proposed JSON representation method is simple, effective, and easy to implement and extend
Experimental Rigor: Results ensured through multiple runs and statistical testing
Practical Value: Provides foundation for AI-assisted development in visual programming languages

Weaknesses

Evaluation Limitations: Manual evaluation is insufficiently objective and difficult to scale
Benchmark Scale: 20 test cases are relatively few and may not be comprehensive
Model Coverage: Results primarily based on one model; other models perform poorly
Theoretical Analysis: Lacks in-depth theoretical explanation for why certain representations are more effective

Impact

Pioneering Contribution: Establishes research baseline for graph-based code generation
Practical Application: Can be directly applied to visual programming tools like Scratch
Scalability: Method is extensible to other graph-based programming environments
Inspirational Significance: Emphasizes the importance of representation learning in code generation

Applicable Scenarios

Education: Assisting code generation in teaching tools like Scratch
Rapid Prototyping: Automating blueprint systems in game development
Workflow Automation: Intelligent configuration of tools like n8n
New Library Adaptation: Code generation for new frameworks lacking training data

References

The paper cites 54 relevant references covering important works in LLM code generation, graph neural networks, visual programming languages, and other domains, providing a solid theoretical foundation for the research.

Overall Assessment: This is a pioneering work that systematically addresses graph-based abstract code generation for the first time. While there is room for improvement in evaluation methods and theoretical analysis, the proposed representation method is simple and effective, laying an important foundation for this emerging research direction. This work has strong practical value and inspirational significance, and is expected to promote further development in related fields.