While GPT-3, the general-purpose language transformer that powers Codex, has recently been opened to the public, Codex itself remains a technical preview open to a limited selection of users. Codex powers the functionality of GitHub's Copilot, a programming assistant available as a plug-in for Visual Studio Code that is able to offer AI-powered autocomplete and code translation on the fly.

Its capabilities are still rough around the edges, but they give a good idea of what the future has in store for programmers and computer scientists - and, of course, malicious users.

If such a system is bound to become a disruptive element in the daily work of computer engineers, it is natural to wonder how it could affect the activities of cybercriminals. With this in mind, we tested the extent of Codex's capabilities, focusing on the most typical aspects of a cybercriminal's work: reconnaissance, social engineering, and exploitation.

In a series of blog posts, we explore how Codex's current capabilities affect a malicious user's everyday activities, what precautions developers and regular users can take, and how these capabilities might evolve. This is the first part of the series.

Scavenging for sensitive data

We know that language transformers are trained on massive corpora of text and source code taken from public repositories. We are unlikely to be the first to ask what happens to all the information contained in those repositories once it is sifted through the fine mesh of GPT-3's neural network. Issues with Copilot proposing snippets of copyrighted code had already emerged, so we wanted to see whether sensitive information was also present in GPT-3's knowledge base and whether it was possible to exfiltrate it by exploiting Codex's code generation.

Personal and sensitive information leaks through code

Public repositories can be a treasure trove of sensitive data just waiting to be discovered by malicious actors. In our tests, we found that it is possible to trick Codex into exposing sensitive data left in repositories by having it generate code that would eventually require access to that data.
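From the developer's side, the practical consequence is that anything hardcoded in a public repository may end up in a model's training data. As a rough illustration only, the sketch below shows how a file's contents could be checked for obvious hardcoded secrets before being pushed. The pattern set and the `find_secrets` helper are hypothetical and written for this example; they are not part of our tests and are far less thorough than a dedicated secret scanner.

```python
import re

# Hypothetical, deliberately small pattern set for illustration only.
SECRET_PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM-style private key headers.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    # Assignments such as password = "..." or api_key: '...'.
    "hardcoded_credential": re.compile(
        r"(?:password|passwd|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
}


def find_secrets(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

Calling `find_secrets` on a file's contents returns the line numbers that deserve a manual look before the file is published. A match is only a hint, and the absence of matches does not mean a file is clean.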

Disclaimer

Trend Micro Inc. published this content on 07 January 2022 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 07 January 2022 17:57:05 UTC.