AI Summary: Details the training and evaluation of Codex, the code-generation model powering GitHub Copilot, and introduces the industry-standard HumanEval benchmark for measuring programming AI.
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. To measure the model's capabilities, we release HumanEval, a new evaluation set consisting of 164 hand-written programming problems with unit tests. We show that while a standard GPT-3 model solves 0% of the problems, our 12B parameter Codex solves 28.8% zero-shot. Furthermore, we find that repeatedly sampling from the model provides a highly effective strategy for solving difficult programming tasks.
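The repeated-sampling strategy mentioned above is typically scored with the unbiased pass@k estimator this paper introduced: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples is correct. A minimal sketch (the function name and test values here are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: number of samples the metric imagines drawing
    """
    # If fewer than k samples are incorrect, every size-k draw
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 20 correct -> pass@1 = 0.1
print(pass_at_k(200, 20, 1))
```

Averaging this estimate over all 164 HumanEval problems gives the reported pass@k score; sampling more candidates per problem (larger n) tightens the estimate without biasing it.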
Evaluating Large Language Models Trained on Code