
Registered user since Wed 6 Feb 2019
Contributions
SATE - Software Engineering at the Era of LLMs
Thu 14 Sep 2023 14:00 - 14:40 at Room FR - SATE - Software Engineering at the Era of LLMs Chair(s): Xin Xia
Abstract: LLMs are increasingly used not just for autocompletion but also for code generation from natural language and APIs, among other tasks. Their output, however, is based on input data that is nominally permissively licensed but is not curated for quality, security, performance, or other factors, such as whether the code’s license is authentic. This leads to buggy, insecure, poorly performing, or inappropriately licensed output that is already poisoning the rapidly growing OSS codebase. Problematic inputs will result in problematic outputs even if all LLM hallucinations were removed; hence, stronger provenance tracking and quality assurance for LLM training and fine-tuning inputs are essential to improving the quality of the generated code. We suggest approaches that use the World of Code research infrastructure to curate LLM training data by de-duplicating and automatically curating source code based on OSS-wide software supply chain properties derived from a nearly complete collection of OSS source code.
Audris Mockus is the Ericsson-Harlan D. Mills Chair Professor of Digital Archeology and Evidence Engineering in the Department of Electrical Engineering and Computer Science of the University of Tennessee, Knoxville. He studies software developers’ culture and behavior through the recovery, documentation, and analysis of digital remains, in other words, Digital Archaeology. These digital traces reflect projections of collective and individual activity. He reconstructs the reality from these projections by designing data mining methods to summarize and augment these digital traces, interactive visualization techniques to inspect, present, and control the behavior of teams and individuals, and statistical models and optimization techniques to understand the nature of individual and collective behavior.