MIT 6.7220/15.084 – Nonlinear Optimization (Spring '25) Thu, Feb 6th 2025
Lecture 2
First-order optimality conditions
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)*
First-order optimality conditions are conditions that every optimal point must satisfy. For
this lecture, we will make the blanket assumption that we work with differentiable functions.
L2.1 Unconstrained optimization
I'm pretty sure you have already encountered first-order optimality conditions for unconstrained optimization problems before. For example, consider the following optimization problem.
Example L2.1. Find a solution to the problem

min_x    f(x)
s.t.     x ∈ ℝ,

where the differentiable function f : ℝ → ℝ, plotted below, is defined as

f(x) ≔ −2x + e^x − 5.

[Figure: plot of f(x) = −2x + e^x − 5 for x ∈ [−3, 3].]
Solution. I expect that most students would have the same thought: take the gradient of
the function, set it to 0, and solve for x! In this case, this leads to −2 + e^x = 0, which
implies that the optimal point is x⋆ = log 2 ≈ 0.693. □
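For readers who like to double-check such computations numerically, here is a minimal sketch (assuming NumPy and SciPy are available; it is not part of the formal development) that confirms the critical point found above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# f(x) = -2x + e^x - 5 from Example L2.1
f = lambda x: -2.0 * x + np.exp(x) - 5.0

# Minimize over a bracket that contains the critical point.
res = minimize_scalar(f, bounds=(-3.0, 3.0), method="bounded")

print(res.x, np.log(2))        # both ≈ 0.6931
print(-2.0 + np.exp(res.x))    # the derivative is ≈ 0 at the minimizer
```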
Now, in the above process we have been pretty informal. It is good to remember that when
facing an optimization problem of the form min_{x∈ℝ^n} f(x), with f(x) differentiable, solving
∇f(x) = 0 has some limitations:
• It is only a necessary condition that all optimal points need to satisfy; but not all
points that satisfy it are automatically optimal.
[▷ For example, think about what happens with f(x) = −x^2? With f(x) = x^3? With
f(x) = x^3 + 3x^2 − 6x − 8? A short numerical sketch of this last example follows the list.]
• In other words, the solutions to ∇f(x) = 0 form a list of possible minimizing points:
solving ∇f(x) = 0 allows us to focus our attention on a few promising candidate points
(some people call these "critical points"). It might give false positives but never false
negatives: if a point fails the ∇f(x) = 0 test, it cannot be optimal.
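To make the false-positive phenomenon concrete, the following minimal sketch (assuming NumPy) finds the critical points of the last function in the aside above and shows that neither of them is a global minimizer.

```python
import numpy as np

# f(x) = x^3 + 3x^2 - 6x - 8, the last example in the aside above
f  = lambda x: x**3 + 3 * x**2 - 6 * x - 8
df = lambda x: 3 * x**2 + 6 * x - 6        # its derivative

# Critical points: roots of 3x^2 + 6x - 6 = 0, i.e. x = -1 ± sqrt(3).
critical = np.roots([3.0, 6.0, -6.0])
print(critical)        # ≈ 0.732 and -2.732 (in some order)
print(df(critical))    # ≈ [0, 0]: both pass the gradient test

# Yet x ≈ -2.732 is a local maximum, and f is unbounded below
# (f(x) -> -inf as x -> -inf), so neither point is a global minimizer.
print(f(critical))
print(f(-10.0))        # far smaller than f at either critical point
```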
In practice, as you know from experience, solving ∇f(x) = 0 is a practical way of analytically
solving unconstrained problems. Today and next time, we will focus on the following two
big questions:
• What is the correct generalization of the necessary condition ∇f(x) = 0 when we are
faced with a constrained optimization problem?
• Under what circumstances does ∇f(x) = 0 also become sufficient for optimality?
L2.2 Constrained optimization
In order to generalize the "∇f(x) = 0" condition to constrained optimization problems, it
is important to make sure we are all on the same page as to why such a condition arises in
the first place in unconstrained problems. From there, generalizing will be straightforward.
L2.2.1 Why the zero gradient condition in unconstrained optimization?
The idea is very simple: if x is a minimizer of the function, then look at the values of the
function f : ℝ^n → ℝ along a generic direction d ∈ ℝ^n. Clearly, f(x + t·d) ≥ f(x) for all
t ≥ 0 (or x would not be a minimizer). Hence, the directional derivative f′(x; d) of f at x
along direction d satisfies

f′(x; d) = lim_{t→0⁺} [f(x + t·d) − f(x)] / t ≥ 0,

since the limit of a nonnegative sequence must be nonnegative.
By definition of gradient, we have f′(x; d) = ⟨∇f(x), d⟩, and so the previous inequality can
be rewritten as

⟨∇f(x), d⟩ ≥ 0   ∀d ∈ ℝ^n.

Because the above inequality must hold for all directions d ∈ ℝ^n, in particular it must hold
for d = −∇f(x), leading to

−‖∇f(x)‖^2 ≥ 0   ⟺   ∇f(x) = 0.
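The identity f′(x; d) = ⟨∇f(x), d⟩ is easy to verify numerically with a one-sided finite difference; the snippet below is a small sketch of that check (assuming NumPy), using an arbitrary smooth function of my choosing.

```python
import numpy as np

# An arbitrary smooth function, f(x) = ||x||^2, whose gradient is 2x.
f     = lambda x: float(x @ x)
gradf = lambda x: 2.0 * x

rng = np.random.default_rng(0)
x = rng.normal(size=3)
d = rng.normal(size=3)

# One-sided finite-difference estimate of the directional derivative f'(x; d).
t = 1e-6
fd_estimate = (f(x + t * d) - f(x)) / t

print(fd_estimate, gradf(x) @ d)   # the two values agree up to O(t)
```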
L2.2.2 The constrained case
Now that we have a clearer picture of why the "∇f(x) = 0" condition arises in unconstrained
problems, the extension to the constrained case is rather natural.
The main difference with the unconstrained case is that, in a constrained set, we might
be limited in the choices of available directions d along which we can approach x while
remaining in the set. Nonetheless, for any direction d such that x + t·d ∈ Ω for all t ≥ 0
sufficiently small, the above argument applies without changes, and we can still conclude
that necessarily ⟨∇f(x), d⟩ ≥ 0.
So, the natural generalization of the "∇f(x) = 0" condition to constrained problems can
be informally stated as follows: for the optimality of x it is necessary that

⟨∇f(x), d⟩ ≥ 0 for all d ∈ ℝ^n that remain in Ω from x.   (1)
In order to instantiate the above condition, two steps are required:
1. first, we need to determine what the set of "directions d that remain in Ω from x" is.
2. then, based on the directions above, see in what way they constrain ∇f(x). For
example, we have seen before that when every direction d ∈ ℝ^n remains in Ω from x,
then ∇f(x) = 0.
Out of the two, the first point is usually the easier one. In all the cases of interest to us,
we can determine the set of directions that remain in Ω from x by simply taking any other
point y ∈ Ω and considering the direction from x to y. This holds trivially if all line segments
between x and any point in Ω are entirely contained in Ω, a condition known as
star-convexity at x.
Definition L2.1 (Star-convexity at x). A set Ω ⊆ ℝ^n is said to be star-convex at a point
x ∈ Ω if, for all y ∈ Ω, the entire segment from x to y is contained in Ω. In symbols, if

x + t·(y − x) ∈ Ω   ∀t ∈ [0, 1].

(Note that the condition is equivalent to "t·y + (1 − t)·x ∈ Ω for all y ∈ Ω and t ∈ [0, 1]",
or also "t·x + (1 − t)·y ∈ Ω for all y ∈ Ω and t ∈ [0, 1]".)
In fact, for all our purposes today, we will only consider sets that are star-convex at all of
their points. Such sets are simply called convex.
Definition L2.2 (Convex set). A set Ω is convex if it is star-convex at all of its points
x ∈ Ω. In other words, Ω is convex if all segments formed between any two points x, y ∈ Ω
are entirely contained in Ω. In symbols, if

t·x + (1 − t)·y ∈ Ω   ∀x, y ∈ Ω and t ∈ [0, 1].
Under the assumption of convexity, condition (1) can be equivalently rewritten as follows.
Theorem L2.1 (First-order necessary optimality condition for a convex feasible set). Let
Ω ⊆ ℝ^n be convex and f : ℝ^n → ℝ be a differentiable function. For a point x ∈ Ω to be
a minimizer of f over Ω it is necessary that

⟨∇f(x), y − x⟩ ≥ 0   ∀y ∈ Ω.
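As a quick numerical illustration of Theorem L2.1 (a sketch of my own choosing, assuming NumPy), take Ω to be the unit box [0, 1]^n and f(x) = (1/2)·‖x − z‖^2, whose minimizer over the box is the componentwise clipping of z; the snippet checks the inequality ⟨∇f(x⋆), y − x⋆⟩ ≥ 0 on random feasible points y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
z = 3.0 * rng.normal(size=n)

# Minimize f(x) = 0.5 * ||x - z||^2 over the box Omega = [0, 1]^n.
# The minimizer is the componentwise clipping of z onto the box.
x_star = np.clip(z, 0.0, 1.0)
grad = x_star - z                 # gradient of f at x_star

# Theorem L2.1: <grad f(x_star), y - x_star> >= 0 for every y in Omega.
ys = rng.uniform(0.0, 1.0, size=(10_000, n))   # random feasible points
print(np.min((ys - x_star) @ grad))             # nonnegative
```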
L2.2.3 Geometric intuition: normal cones
The condition established in Theorem L2.1 has the following geometric interpretation: the
gradient of f at a solution x ∈ Ω must form a non-obtuse angle (at most 90°) with every
direction y − x, y ∈ Ω. While this makes perfect sense, it is actually more customary, for
mental visualization purposes, to flip signs and use the following mental picture instead: at
any solution x ∈ Ω, the opposite of the gradient, −∇f(x), must form a non-acute angle (at
least 90°) with every direction y − x, y ∈ Ω. In other words, −∇f(x) can only "look" in
directions whose 90° cone of vision contains no point of the set.
Of course, depending on the shape of the set Ω and the particular point x ∈ Ω, the set
of directions that point away from the set might be extremely limited; for example, we
have seen earlier that when Ω = ℝ^n, then no directions "point away" from Ω, and the only
possible value for −∇f(x) is therefore 0. This mental picture of "directions pointing away"
from Ω is generally pretty useful, and we give it a name.
Definition L2.3 (Normal cone). Let Ω ⊆ ℝ^n be convex, and let x ∈ Ω. The normal cone
to Ω at x, denoted 𝒩_Ω(x), is defined as the set

𝒩_Ω(x) ≔ {d ∈ ℝ^n : ⟨d, y − x⟩ ≤ 0 ∀y ∈ Ω}.
With this definition, the first-order necessary optimality condition for x, given in
Theorem L2.1, can be equivalently written as

−∇f(x) ∈ 𝒩_Ω(x).
Example L2.2. As an example, here are a few normal cones computed for a convex set.

[Figure: a convex set Ω, with the normal cones 𝒩_Ω(x_1) and 𝒩_Ω(x_2) drawn at two points x_1 and x_2 on its boundary.]
L2.3 Normal cones at a point in the interior
Let's build our intuition regarding normal cones by considering examples that are progressively
harder. Along the way, we will see that first-order optimality conditions, in all their
simplicity, imply some of the deepest results in optimization theory.
Let's start from an easy example: the normal cone at a point in the interior of the feasible
set, that is, a point around which we can find an entire ball (of some suitably small radius
r > 0) fully contained in the set. This is always the case when the feasible set is
unconstrained: every point is in the interior in that case!

[Figure: a point x with a small ball around it, entirely contained in the set Ω.]
Example L2.3 (Normal cone at an interior point). The normal cone 𝒩_Ω(x) at a point x
in the interior of the feasible set Ω is 𝒩_Ω(x) = {0}.

Solution. In this case, the normal cone contains only the zero vector, that is,

𝒩_Ω(x) = {0}.

This is easy to prove: if any d ≠ 0 were to belong to 𝒩_Ω(x), then we could consider the
point x + δd, which belongs to Ω for sufficiently small δ > 0 since x is in the interior, and
we would have

⟨d, (x + δd) − x⟩ = δ‖d‖^2 > 0,

contradicting the defining inequality of the normal cone. Hence, for a point x in the interior
of Ω to be optimal, it is necessary that ∇f(x) = 0. □
L2.4 Normal cone to a point on a hyperplane / subspace
Next up, we consider the normal cone to a point on a hyperplane.
Theorem L2.2 (Normal cone to a hyperplane). Consider a hyperplane

Ω ≔ {y ∈ ℝ^n : ⟨a, y⟩ = 0},   where a ∈ ℝ^n, a ≠ 0,

and a point x ∈ Ω. The normal cone at x is given by

𝒩_Ω(x) = span{a} = {λ·a : λ ∈ ℝ}.

(See also the picture; this should look pretty intuitive!)

[Figure: the hyperplane Ω passing through x, with normal vector a; the normal cone 𝒩_Ω(x) is the line through x spanned by a.]
Proof. In order to convert our geometric intuition into a formal proof, [▷ before continuing,
try to think how you would go about proving this yourself!] it is enough to show two things:
• all points in span{a} do indeed belong to 𝒩_Ω(x); by the definition of normal cone, this
means that we need to show that all points z ∈ span{a} satisfy

⟨z, y − x⟩ ≤ 0   ∀y ∈ Ω;

• none of the points outside of span{a} belong to 𝒩_Ω(x); that is, for any point z ∉
span{a}, there exists y ∈ Ω such that ⟨z, y − x⟩ > 0.
The first point is straightforward: by definition of span, all points in span{a} are of the
form λ·a for some λ ∈ ℝ. But then, for all y ∈ Ω,

⟨z, y − x⟩ = ⟨λ·a, y − x⟩ = λ·⟨a, y⟩ − λ·⟨a, x⟩ = 0 − 0 ≤ 0,

where the last equality follows from the definition of Ω and the fact that both x and
y belong to it. To prove the second point, we can let the geometric intuition guide us.
Draw a vector z ∉ span{a} applied at x, and look at the picture:

[Figure: the hyperplane Ω, the line x + span{a}, and a vector z drawn at x, together with its projections y onto Ω and x + λ·a onto x + span{a}.]
We can project the point x + z onto Ω, finding some y ∈ Ω, and onto x + span{a}, finding
some point x + λ·a:

z = (y − x) + λ·a.

We now show that z cannot be in 𝒩_Ω(x), because it has a positive inner product
with y − x:

⟨z, y − x⟩ = ⟨(y − x) + λ·a, y − x⟩ = ‖y − x‖^2 + λ·⟨a, y − x⟩ = ‖y − x‖^2,

where λ·⟨a, y − x⟩ = 0 since both x and y lie on the hyperplane. Since z was not aligned
with span{a} by hypothesis, we must have y ≠ x (otherwise z = λ·a would belong to
span{a}), and therefore ⟨z, y − x⟩ > 0, as we wanted to show. □
Remark L2.1. Because normal cones are insensitive to shifts of the set, the result above
applies without changes to any affine hyperplane

Ω ≔ {y ∈ ℝ^n : ⟨a, y⟩ = b},

with a ∈ ℝ^n, a ≠ 0, and b ∈ ℝ. Again,

𝒩_Ω(x) = span{a} = {λ·a : λ ∈ ℝ}

at any x ∈ Ω.
Remark L2.2. The same argument above, based on decomposing x + z onto Ω and onto its
orthogonal complement span{a}, applies to lower-dimensional affine subspaces

Ω ≔ {y ∈ ℝ^n : Ay = b}.

In this case, we obtain that

𝒩_Ω(x) = colspan(A^⊤).

(This immediately recovers Theorem L2.2 by considering A = a^⊤.)

For the setting of Remark L2.2, the projection argument above goes through verbatim; one
now projects x + z onto colspan(A^⊤) and onto Ω.¹
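Here is a small numerical sanity check of Remark L2.2 (a minimal sketch, assuming NumPy): any vector of the form z = A^⊤μ has zero, hence nonpositive, inner product with every difference y − x of feasible points, so it belongs to the normal cone.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2, 5
A = rng.normal(size=(m, n))   # full row rank with probability 1
b = rng.normal(size=m)

# Two feasible points of Omega = {x : A x = b}: a particular solution
# plus different moves along the nullspace of A.
x_part = np.linalg.lstsq(A, b, rcond=None)[0]
_, _, Vt = np.linalg.svd(A)
N = Vt[m:].T                                  # columns span the nullspace of A
x = x_part + N @ rng.normal(size=n - m)
y = x_part + N @ rng.normal(size=n - m)

# Any z = A^T mu is orthogonal to y - x, hence <z, y - x> = 0 <= 0.
z = A.T @ rng.normal(size=m)
print(np.allclose(A @ x, b), np.allclose(A @ y, b))   # both True: x, y feasible
print(z @ (y - x))                                     # ≈ 0
```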
Remark L2.3 (Lagrange multipliers). The discussion we just had shows that whenever
we have a problem of the form

min_x    f(x)
s.t.     Ax = b
         x ∈ ℝ^n,

at optimality it needs to hold that

−∇f(x) = A^⊤μ,   for some μ ∈ ℝ^m,

where m is the number of rows of A. This necessity of being able to express the gradient
of the objective, at optimality, as a linear combination of the constraint gradients (the
rows of A) is very general. The entries of μ are an example of Lagrange multipliers.
In the next two subsections, we will see how the characterization of the normal cone to
affine subspaces enables us to solve a couple of problems that arise in practice.
L2.4.1 Application #1: Projection onto an affine subspace
Example L2.4. Consider the nonempty set Ω ≔ {x ∈ ℝ^n : Ax = b}, where A ∈ ℝ^{m×n} is
such that AA^⊤ is invertible. Prove that the Euclidean projection x of a point z onto Ω,
that is, the solution to²

min_x    (1/2)·‖x − z‖_2^2
s.t.     x ∈ Ω,

is given by

x = z − A^⊤(AA^⊤)^{−1}(Az − b).
Solution. Since the gradient of the objective at any point x is (x − z), from the first-order
optimality conditions any solution x must satisfy

−(x − z) ∈ 𝒩_Ω(x).

From Remark L2.2, we know that at any x ∈ Ω, 𝒩_Ω(x) = colspan(A^⊤) = {A^⊤μ : μ ∈ ℝ^m}.
So, at optimality there must exist μ ∈ ℝ^m such that

−(x − z) = A^⊤μ   ⟹   x = z − A^⊤μ.

Furthermore, since x ∈ Ω, we have Ax = b. Plugging in the above expression for x, we thus
have

A(z − A^⊤μ) = b   ⟹   (AA^⊤)μ = Az − b.

Solving for μ and plugging back into x = z − A^⊤μ yields the result. □
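The closed-form projection is easy to implement. The snippet below is a minimal NumPy sketch (the helper name project_affine is just for illustration) that computes x = z − A^⊤(AA^⊤)^{−1}(Az − b) via a linear solve and then verifies feasibility and the optimality condition −(x − z) ∈ colspan(A^⊤).

```python
import numpy as np

def project_affine(z, A, b):
    """Euclidean projection of z onto {x : A x = b}, assuming A A^T is invertible.

    Implements x = z - A^T (A A^T)^{-1} (A z - b) from Example L2.4, using a
    linear solve rather than forming the inverse explicitly.
    """
    mu = np.linalg.solve(A @ A.T, A @ z - b)
    return z - A.T @ mu

# Small sanity check on random data.
rng = np.random.default_rng(3)
A = rng.normal(size=(2, 5))
b = rng.normal(size=2)
z = rng.normal(size=5)

x = project_affine(z, A, b)
print(np.allclose(A @ x, b))     # feasibility: A x = b

# Optimality: -(x - z) must lie in colspan(A^T); equivalently, z - x is
# exactly reproduced by a least-squares fit onto the columns of A^T.
coeffs, *_ = np.linalg.lstsq(A.T, z - x, rcond=None)
print(np.allclose(A.T @ coeffs, z - x))
```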
L2.4.2 Application #2: Entropy-regularized linear optimization (softmax)
As a second example application, we will consider a real problem that comes up naturally
in online learning and reinforcement learning: entropy-regularized best responses.
Example L2.5. Consider the set of probability distributions over n actions {1, ..., n}
that have full support, that is, the set Δ̊^n ≔ {(x_1, ..., x_n) ∈ ℝ^n_{>0} : x_1 + ⋯ + x_n = 1}.
Given an assignment of values v_i for each action i = 1, ..., n, the entropy-regularized best
response given the values is the distribution that solves the following problem:

min_x    f(x) ≔ −∑_{i=1}^n v_i·x_i + ∑_{i=1}^n x_i·log x_i
s.t.     x ∈ Δ̊^n.

Show that the solution to this problem is the distribution that picks action i with
probability proportional to the exponential of the value v_i of that action:

x_i = e^{v_i} / ∑_{j=1}^n e^{v_j}.
Solution. We'll leave showing that the nonlinear optimization problem has a solution as an
exercise. Here, we show that the first-order optimality conditions imply that the solution
necessarily has components proportional to e^{v_i}.

Pick any point x ∈ Δ̊^n. The set of directions that remain inside Δ̊^n from x is the entire
hyperplane of zero-sum directions {d ∈ ℝ^n : d_1 + ⋯ + d_n = 0}: the constraint x_i > 0 is
completely inconsequential for the purposes of first-order optimality conditions, since it
holds strictly at x. In other words, we are exactly in the same setting as Theorem L2.2 (in
its affine version, Remark L2.1), where in this case a = 1 ∈ ℝ^n is the all-ones vector.
Hence, whatever the solution x to the problem might be, it is necessary that −∇f(x) be in
the normal cone 𝒩_{Δ̊^n}(x) = span{1} ⊆ ℝ^n. So, there must exist μ ∈ ℝ such that

−∇f(x) = (v_1 − 1 − log x_1, ..., v_n − 1 − log x_n)^⊤ = μ·(1, ..., 1)^⊤ ∈ 𝒩_{Δ̊^n}(x),

that is, log x_i = v_i − 1 − μ for all i = 1, ..., n.
Exponentiating on both sides, we have

x_i = exp(v_i − 1 − μ) = α·exp(v_i),   where α ≔ exp(−1 − μ) ∈ ℝ.

This shows that at optimality there exists a proportionality constant α such that x_i =
α·e^{v_i} for all i = 1, ..., n. Since ∑_{i=1}^n x_i = 1, we find that

α·∑_{i=1}^n e^{v_i} = 1   ⟹   α = 1 / ∑_{i=1}^n e^{v_i},

and the result follows. □
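In code, the entropy-regularized best response is just a (numerically stabilized) softmax. The sketch below (assuming NumPy; the helper name is hypothetical) computes it and checks the first-order condition derived above, namely that v − 1 − log x has all entries equal.

```python
import numpy as np

def entropy_regularized_best_response(v):
    """Solve min_x -<v, x> + sum_i x_i log x_i over the full-support simplex.

    By Example L2.5 the solution is the softmax of v; shifting v by its
    maximum only improves numerical stability and leaves the result unchanged.
    """
    w = np.exp(v - np.max(v))
    return w / np.sum(w)

v = np.array([1.0, 2.0, 0.5])
x = entropy_regularized_best_response(v)
print(x, x.sum())            # a full-support probability distribution

# First-order condition: -grad f(x) = v - 1 - log(x) must be a multiple of
# the all-ones vector, i.e. all of its entries must coincide.
print(v - 1.0 - np.log(x))
```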
Changelog
• Feb 11, 2025: Remarked that d ∈ ℝ^n in L2.2.1.
• Feb 13, 2025: fixed typo: "whenver" -> "whenever" (thanks Brandon Eickert!)

*These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.

¹The orthogonality of colspan(A^⊤) and Ω is a reflection of the well-known linear algebra result that the
orthogonal complement of the nullspace of a matrix is the span of the columns of the transpose matrix.

²We already know from Lecture 1 that the projection must exist since Ω is nonempty and closed.