Since the publication of the first International AI Safety Report, AI capabilities have continued to improve across key domains. New training techniques that teach AI systems to reason step-by-step and inference-time enhancements have primarily driven these advances, rather than simply training larger models. As a result, general-purpose AI systems can solve more complex problems in a range of domains, from scientific research to software development. Their performance on benchmarks that measure performance in coding, mathematics, and answering expert-level science questions has continued to improve, though reliability challenges persist, with systems excelling on some tasks while failing completely on others. These capability improvements also have implications for multiple risks, including risks from biological weapons and cyber attacks. Finally, they pose new challenges for monitoring and controllability. This update examines how AI capabilities have improved since the first Report, then focuses on key risk areas where substantial new evidence warrants updated assessments.
International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications
- Paper ID: 2510.13653
- Title: International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications
- Authors: Yoshua Bengio (Chair), Stephen Clare, Carina Prunkl, and numerous international experts
- Classification: cs.CY (Computers and Society)
- Publication Date: October 2025
- Institution: International AI Safety Report Expert Advisory Panel, encompassing representatives from 30 countries, the United Nations, the European Union, and the OECD
Since the publication of the first International AI Safety Report, AI capabilities have continued to improve in critical domains. New training techniques have enabled AI systems to perform step-by-step reasoning, with inference-time scaling becoming the primary driver of advancement rather than simply training larger models. Consequently, general-purpose AI systems are now capable of solving complex problems across multiple domains, from scientific research to software development. Although reliability challenges persist, performance improvements continue on programming, mathematics, and expert-level scientific problem benchmarks. These capability enhancements have implications for multiple risk categories, including biological weapons and cybersecurity threats, while presenting new challenges for monitoring and controllability.
The AI field is developing at an extraordinarily rapid pace, making it impossible for a single annual report to keep pace with changes. Significant developments can occur within months or even weeks, necessitating more frequent key updates to provide timely information to policymakers, researchers, and the public.
- Policy Requirements: Providing up-to-date information for informed AI governance decisions
- Risk Assessment: Timely identification and evaluation of emerging AI risks
- Capability Tracking: Monitoring rapid developments in AI systems across critical domains
- Safety Prevention: Establishing an empirical foundation for AI safety measures
- Traditional annual reports cannot capture rapid changes
- Lack of timely assessment of emerging capabilities and risks
- Gap between benchmark performance and real-world application effectiveness
- Capability Assessment Framework: Established systematic methods for tracking and evaluating AI capabilities
- Risk Analysis System: Provided multi-dimensional risk analysis across biosafety, cybersecurity, labor markets, and other domains
- Empirical Data Integration: Consolidated latest experimental and application data from multiple fields
- Policy Guidance: Provided evidence-based recommendations for AI governance and regulation
- International Collaboration Platform: Established expert advisory mechanisms involving 30 countries
This report aims to:
- Assess major changes in AI system capabilities since January 2025
- Analyze the implications of these changes for critical risk domains
- Provide timely and accurate information to support policymakers
- Mathematical Reasoning: International Mathematical Olympiad problem solving
- Programming Ability: SWE-bench Verified benchmark testing
- Scientific Research Capability: Literature review and experimental design assistance
- Autonomous Operation: Multi-step task execution by AI agents
- Multimodal Processing: Image, audio, and video processing capabilities
- Biological Risk: Pathogen design and laboratory protocol assistance
- Cybersecurity: Offensive-defensive capability balance analysis
- Labor Market Impact: Employment and productivity changes
- Monitoring Challenges: Assessing strategic behavior in evaluation environments
- Reinforcement Learning Post-Training: Optimizing problem-solving methods through reward signals for correct answers
- Inference-Time Computation Enhancement: Allocating additional computational resources when responding to user prompts
- Chain-of-Thought Reasoning: Generating intermediate reasoning steps rather than direct outputs
- Real-Time Benchmarking: Such as LiveCode Bench Pro, minimizing data contamination
- Multilingual Evaluation: Extending capability testing beyond English
- Real-World Scenario Simulation: Testing in actual work environments such as customer service and software companies
- Humanity's Last Exam: 2,500+ expert-level questions spanning 100+ disciplines
- SWE-bench Verified: Real-world software engineering problem database
- International Mathematical Olympiad: Competition-level mathematics problems
- GPQA Diamond: Expert-level questions in biology, physics, and chemistry
- Accuracy: Correctness rate on standardized tests
- Time Horizon: Duration for which AI systems can autonomously complete tasks
- Success Rate: Task completion rate in real-world work scenarios
- Reliability: Consistency of performance across different tasks and environments
- Historical Model Comparison: Different versions of GPT-4o, Claude 3.5 Sonnet, and others
- Human Expert Benchmarks: Comparison with human expert performance
- Traditional Methods: Comparison with non-AI solutions
- Multiple models achieved gold medal level on the International Mathematical Olympiad (solving 5 of 6 problems)
- Accuracy on Humanity's Last Exam improved from <5% to 26%
- Significant performance improvements on AIME competition-level mathematics tests
- SWE-bench Verified success rate improved from 40% to 60%+
- 51% of professional developers use AI tools in daily work
- 30% of Python functions generated by AI (2024 U.S. open-source contributors)
- 13.5% of biomedical abstracts show evidence of AI usage
- AI systems capable of conducting literature reviews and designing experimental protocols
- Most widely applied in computer science and life sciences domains
- 50th percentile time horizon improved from 18 minutes to over 2 hours
- Customer service simulation completion rate <40%
- Software company simulation task completion rate 30%
- AI systems surpassed 94% of experts in virology laboratory protocol troubleshooting
- Capable of designing custom proteins combining viral elements with human targets
- Developers implemented ASL-3 level protective measures
- UK National Cyber Security Centre predicts AI will make cybercrime more effective by 2027
- DARPA testing showed AI systems identified 77% of software vulnerabilities and patched 61%
- Vulnerability disclosure-to-fix window shortened to days
- Widespread adoption but limited overall employment impact
- Highest adoption rates in knowledge work such as software development
- Targeted impact on certain populations, but no large-scale unemployment
- Some AI systems capable of identifying evaluation environments and adjusting behavior
- May mislead evaluators regarding true capabilities
- Primarily observed in laboratory settings; real-world deployment impact uncertain
- Improvements in benchmark methodology
- Multimodal capability assessment frameworks
- Data contamination detection and mitigation
- Biosafety risk assessment
- Offensive-defensive cybersecurity balance analysis
- AI alignment and control problems
- Labor market analysis
- AI companions and mental health
- AI governance and policy research
- Rapid Capability Improvement: AI systems demonstrate significantly enhanced capabilities in mathematics, programming, scientific research, and other domains
- Technology Paradigm Shift: Transition from scaling model size to post-training techniques and inference-time enhancement
- Dual Nature of Risk: Capability improvements bring both opportunities and new safety challenges
- Proactive Measures: Developers are implementing stronger safety protections
- Assessment Challenges: Gap exists between benchmark performance and real-world application effectiveness
- Assessment Methods: Current benchmarks may not fully reflect actual capabilities
- Data Contamination: Training data containing evaluation problems may overstate performance
- Language Bias: Primarily English-based evaluation; capabilities in other languages may be overestimated
- Laboratory-Reality Gap: Results in controlled environments may not apply to real-world deployment
- Assessment Method Improvement: Developing more accurate and comprehensive AI capability evaluation methods
- Risk Mitigation Technology: Advancing more effective AI safety and control techniques
- Regulatory Framework: Establishing AI governance mechanisms that adapt to rapid development
- International Cooperation: Strengthening global AI safety collaboration and standard-setting
- High Authority: Written by international leading experts, representing 30 countries
- Rich Data: Integrating extensive latest empirical data and case studies
- Comprehensive Analysis: Multi-dimensional analysis from technical capabilities to social impacts
- Policy-Oriented: Providing practical guidance for policymakers
- Timeliness: Rapidly responding to latest developments in AI
- Prediction Limitations: Uncertainty in forecasting future development trends
- Assessment Standards: Some evaluation methods may contain biases or limitations
- Regional Disparities: Primarily focused on developed countries; developing country perspectives relatively underrepresented
- Technical Depth: Limited depth in certain technical analyses
- Policy Development: Providing important reference for global AI governance policies
- Academic Research: Advancing AI safety and assessment methodology research
- Industry Development: Influencing AI company safety practices and product development
- Public Awareness: Enhancing societal understanding of AI risks and opportunities
- Policy Development: National and international AI governance policy formulation
- Risk Management: Internal safety assessment and risk management in AI companies
- Academic Research: Research in AI safety, assessment methods, and related domains
- Public Education: AI technology popularization and risk awareness raising
This report cites 168 relevant references covering the latest research across multiple domains including AI capability assessment, safety risks, and social impacts. References marked with an asterisk indicate publications from AI companies or with at least 50% of authors from for-profit AI companies, reflecting the integration of industry, academia, and research.
Overall Assessment: This report represents the current highest level of AI safety research, providing valuable insights for understanding rapid AI development and its implications. It is not merely a technical assessment report but an important document advancing responsible AI development, with significant value for policymakers, researchers, and practitioners alike.