Industrial Automation
Setting the Standard for AI-Generated Control Code

AI is rewriting the rules of software development, but can it safely take the wheel in manufacturing? The partners of the MBO-KISS research project are on a three-year mission to find out. Launched in January 2025, the project is developing benchmarks for reliable, AI-generated control code in mechanical engineering. Halfway through their journey, the team shares an exclusive interim update.

Successfully deploying Large Language Models (LLMs) for industrial control software offers a dual breakthrough. First, it promises to significantly streamline the engineering pipeline, cutting down effort from the design phase right through to prototyping. Second, it addresses a critical vulnerability in industrial automation: the acute shortage of skilled labor. As experts trained in the industry-standard IEC 61131-3 languages become harder to find, automated code generation could bridge a widening skills gap.

Fueled by this potential, the partners behind MBO-KISS (short for Methods for Evaluating and Optimizing AI-Generated Control Applications Based on the Physical Simulation of Machines and Their Target Behavior) have designed a system architecture with a dual focus: evaluation and optimization. The result is an end-to-end pipeline that spans the development lifecycle, from defining the initial automation requirements and drafting specifications to generating Structured Text (ST) code and running validation tests.

To achieve this, the project relies on multi-stage, automated test workflows that seamlessly merge static and dynamic analysis into a three-step optimization loop:

  • First, an analyzer scrutinizes the code for key metrics, evaluating complexity indicators such as loops and branches. These insights are channeled straight back into the AI loop to refine the next generation run.
  • Next, the software components face unit testing. The AI assistant creates these test cases itself, working directly from the original specifications. Both failures and successes are fed back into the system to steadily improve the code.
  • In the final step, the code is deployed onto a simulated machine setup consisting of a software PLC, a digital twin of the machine, and control testing tools. This full-system simulation logs both positive and negative results and feeds them back into the loop for a further round of optimization.
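The three-step loop described above can be pictured roughly as follows. This is a minimal sketch and not the project's implementation: the names `StageResult`, `optimize`, `fake_generate`, and `fake_static` are hypothetical, the LLM call is replaced by a stub, and the sketch assumes that a failing stage stops the current round so that its feedback reaches the next generation run first.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageResult:
    stage: str      # e.g. "static", "unit", "simulation" (labels are assumptions)
    passed: bool
    feedback: str   # text handed back to the next generation run


# A stage takes generated ST source code and returns a StageResult.
Stage = Callable[[str], StageResult]


def optimize(spec: str,
             generate: Callable[[str, List[str]], str],
             stages: List[Stage],
             max_rounds: int = 5) -> Tuple[str, int, bool]:
    """Regenerate code until every stage passes or the round budget is spent."""
    feedback: List[str] = []
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(spec, feedback)
        feedback = []
        all_passed = True
        for stage in stages:
            result = stage(code)
            status = "PASS" if result.passed else "FAIL"
            # Both positive and negative results are recorded as feedback.
            feedback.append(f"[{result.stage}] {status}: {result.feedback}")
            if not result.passed:
                all_passed = False
                break  # assumption: later stages wait until this one passes
        if all_passed:
            return code, round_no, True
    return code, max_rounds, False


def fake_generate(spec: str, feedback: List[str]) -> str:
    # Stand-in for an LLM call: "repairs" the code once feedback exists.
    return "x := 1;" if feedback else "x := ;"


def fake_static(code: str) -> StageResult:
    # Stand-in for the static analyzer: flags an empty assignment.
    ok = ":= ;" not in code
    return StageResult("static", ok, "ok" if ok else "empty assignment")
```

Calling `optimize("set x to 1", fake_generate, [fake_static])` runs the stub generator twice: the first draft fails the static stage, and the second draft, produced with that feedback, passes.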

While these automated feedback loops are vital for refining code once an LLM is in place, choosing the right model in the first place remains a critical hurdle. The root of the problem lies in the training data: standard LLMs are trained predominantly on high-level languages such as Python, Java, or C++. In contrast, IEC 61131-3 languages such as Structured Text (ST) are a niche known mainly to automation specialists and make up only a small fraction of the openly available code on the internet. Compounding the issue, much industrial code is corporate intellectual property, locked away behind factory walls rather than shared freely online.

Benchmarking of LLMs

When selecting an LLM, relevant code generation benchmarks need to be considered. Model vendors typically publish these for widely used programming languages such as Python or Java; for IEC 61131-3 languages, however, such benchmarks are lacking because of their niche status, as mentioned above. This gap can be closed with a dedicated benchmarking tool chain: the tools developed at Fraunhofer IKS for the general evaluation of LLMs can be adapted to evaluate generated ST code. Datasets consisting of requirement specifications and the corresponding reference implementations are crucial for making the benchmark as meaningful as possible. For example, code from libraries such as OSCAT [1] can be used for an initial evaluation. Ideally, however, previously unpublished code should be used to build the benchmark, since a model may already have seen published code during training.
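To make the idea of a specification-plus-reference dataset concrete, here is a small sketch of how such a benchmark could be scored. It does not reflect the Fraunhofer IKS tool chain: `Task`, `pass_rate`, `toy_generate`, and `same_text` are hypothetical names, and the text comparison is only a crude placeholder for a real equivalence check, which would more plausibly run candidate and reference code against the same test inputs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    task_id: str
    spec: str        # requirement specification given to the model
    reference: str   # reference ST implementation


def pass_rate(tasks: List[Task],
              generate: Callable[[str], str],
              is_equivalent: Callable[[str, str], bool]) -> float:
    """Fraction of tasks for which the generated code matches the reference."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if is_equivalent(generate(t.spec), t.reference))
    return passed / len(tasks)


# Toy dataset and stand-ins, for illustration only.
TASKS = [
    Task("add", "Add a and b", "result := a + b;"),
    Task("neg", "Negate a", "result := -a;"),
]

CANNED = {
    "Add a and b": "result  :=  a + b;",   # correct, different spacing
    "Negate a": "result := a;",            # wrong: sign is missing
}


def toy_generate(spec: str) -> str:
    # Stand-in for a model under test.
    return CANNED[spec]


def same_text(candidate: str, reference: str) -> bool:
    # Placeholder check: compares whitespace-normalized source text.
    return " ".join(candidate.split()) == " ".join(reference.split())
```

With these toy stand-ins, `pass_rate(TASKS, toy_generate, same_text)` evaluates to 0.5, because one of the two generated snippets matches its reference. Running the same task list against several models would give a simple basis for comparison.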

The combination of the described system architecture for evaluating and optimizing generated ST code and the LLMs selected through benchmarking offers an optimal approach to AI-assisted engineering of automation applications. Even though the MBO-KISS project primarily uses cloud-based LLMs from dominant providers such as OpenAI or Anthropic, the system architecture and the benchmarking tool chain also form an ideal foundation for local models (edge AI solutions) and for evaluating local engineering workflows.

Project MBO-KISS: Six partners are on board

A total of six partners are involved in the MBO-KISS research project. While the Fraunhofer Institute for Cognitive Systems IKS focuses on LLM-based, high-quality generation of control code and on feeding test results back into the prompting process, the participating industry partners are examining automated testing options for the generated software components and for the application as a whole. Together, the partners' activities form a loop in which test results are fed back into the prompting process, iteratively improving the application:

ITQ will focus on the automated analysis of the quality of the generated code, which represents one aspect of the verification process. In this context, various metrics for static code analysis are defined, implemented, and tested within the overall architecture.
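As an illustration of what a simple static metric for ST code might look like, the following sketch counts loop and branch keywords. It is an assumption-laden example and not ITQ's analyzer: the function name `st_metrics` and the choice of metrics are hypothetical, and the keyword matching ignores string literals for brevity.

```python
import re

# ST comments: (* block comments *) and // line comments.
COMMENT = re.compile(r"\(\*.*?\*\)|//[^\n]*", re.DOTALL)
# ST keywords are case-insensitive. The underscore counts as a word
# character, so END_FOR or END_IF do not match \bFOR\b or \bIF\b.
LOOPS = re.compile(r"\b(FOR|WHILE|REPEAT)\b", re.IGNORECASE)
BRANCHES = re.compile(r"\b(IF|ELSIF|CASE)\b", re.IGNORECASE)


def st_metrics(source: str) -> dict:
    """Count loop and branch keywords in ST source, ignoring comments."""
    code = COMMENT.sub(" ", source)
    loops = len(LOOPS.findall(code))
    branches = len(BRANCHES.findall(code))
    return {
        "loops": loops,
        "branches": branches,
        "decision_points": loops + branches,
    }


SAMPLE = """
(* IF in a comment must not count *)
FOR i := 1 TO 10 DO
    IF values[i] > limit THEN
        count := count + 1;
    ELSIF values[i] < 0 THEN
        errors := errors + 1;
    END_IF;
END_FOR;
"""
```

For `SAMPLE`, `st_metrics` reports one loop and two branches. Numbers like these could be turned into short feedback text for the next prompt, for example when a generated block exceeds an agreed complexity threshold.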

Sepp.med will focus on test analysis, test design, test implementation, test automation, and test coverage, which represent another aspect of the verification process.

The company MAX STREICHER will act as the end user and provide the use cases, among other things. They will also align the progress of development and interim results with their own needs and test them under real-world conditions wherever possible.

In addition, the associated partners CODESYS Group and Bosch Rexroth will support and advise the project.

LLM technology continues to evolve rapidly, so improved models can be expected on a regular basis. This makes it all the more important not only to consider methods for integrating LLMs into the creation of control applications, but also to design appropriate methods for evaluating LLMs for control tasks and to make these methods available to manufacturing companies.

Reference:

[1] OSCAT library (Oscat.LIB), http://www.oscat.de/en/, accessed 19 May 2026


The project is funded by the Bavarian Ministry of Economic Affairs, Regional Development and Energy under the BayVFP program’s Digitalization funding line.
