
Quick answer

AI Summary: This paper details the training and evaluation of Codex, the code-generation model behind GitHub Copilot, and introduces HumanEval, a hand-written benchmark that has become an industry standard for measuring code-generation models.


Evaluating Large Language Models Trained on Code

Mark Chen · Jerry Tworek · Heewoo Jun · Qiming Yuan · Henrique Ponde de Oliveira Pinto · Jared Kaplan · Harri Edwards · Yuri Burda · Nicholas Joseph · Greg Brockman · Alex Ray · Wojciech Zaremba · Ilya Sutskever · et al.

ABSTRACT

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. To measure the model's capabilities, we release HumanEval, a new evaluation set consisting of 164 hand-written programming problems with unit tests. We show that while a standard GPT-3 model solves 0% of the problems, our 12B parameter Codex solves 28.8% zero-shot. Furthermore, we find that repeatedly sampling from the model provides a highly effective strategy for solving difficult programming tasks.
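
The headline numbers are pass@k-style figures: a problem counts as solved if any of k sampled completions passes its unit tests. The paper describes an unbiased, numerically stable estimator for pass@k computed from n samples per problem, of which c pass; the sketch below follows that definition (the function name and the example numbers are illustrative, not taken from the paper).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn without replacement from n samples is correct,
    given that c of the n samples pass the unit tests.

    Equals 1 - C(n-c, k) / C(n, k), computed as a running product
    rather than with raw binomial coefficients to avoid overflow.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative only: 200 samples per problem, 50 passing, estimate pass@10.
print(f"{pass_at_k(n=200, c=50, k=10):.4f}")
```

This metric also explains why repeated sampling is such an effective strategy: whenever the per-sample success rate c/n is nonzero, pass@k grows quickly with k, so drawing many candidates and keeping any that pass the tests solves problems a single greedy decode would miss.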

Review Snapshot

4.6 out of 5 (5 ratings)

5 star: 60%
4 star: 40%
3 star: 0%
2 star: 0%
1 star: 0%

Recommendation

100% recommend this content.

