MIT 6.7220/15.084 – Nonlinear Optimization (Spring '25) Thu, Feb 6th 2025
Lecture 2
First-order optimality conditions
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)*
First-order optimality conditions are conditions that every optimal point must satisfy. For
this lecture, we will make the blanket assumption that we work with differentiable functions.
L2.1 Unconstrained optimization
I'm pretty sure you have already encountered first-order optimality conditions for unconstrained optimization problems before. For example, consider the following optimization problem.
Example L2.1. Find a solution to the problem

min_x    f(x)
s.t.     x ∈ ℝ,

where the differentiable function f : ℝ → ℝ, plotted below, is defined as

f(x) ≔ −2x + e^x − 5.

[Figure: plot of f(x) = −2x + e^x − 5 for x ∈ [−3, 3].]
Solution. I expect that most students would have the same thought: take the gradient of
the function, set it to 0, and solve for x! In this case, this leads to −2 + e^x = 0, which
implies that the optimal point is x⋆ = log 2 ≈ 0.693. □
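For readers who like to double-check such computations numerically, here is a minimal sketch (assuming NumPy and SciPy are available; it is not part of the formal development) that confirms the critical point found above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# f(x) = -2x + e^x - 5 from Example L2.1
f = lambda x: -2.0 * x + np.exp(x) - 5.0

# Minimize over a bracket that contains the critical point.
res = minimize_scalar(f, bounds=(-3.0, 3.0), method="bounded")

print(res.x, np.log(2))        # both ≈ 0.6931
print(-2.0 + np.exp(res.x))    # the derivative is ≈ 0 at the minimizer
```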
Now, in the above process we have been pretty informal. It is good to remember that when
facing an optimization problem of the form min_{x∈ℝ^n} f(x), with f(x) differentiable, solving
∇f(x) = 0 has some limitations:
• It is only a necessary condition that all optimal points need to satisfy; but not all
points that satisfy it are automatically optimal.
[▷ For example, think about what happens with f(x) = −x^2? With f(x) = x^3? With
f(x) = x^3 + 3x^2 − 6x − 8? A short numerical sketch of this last example follows the list.]
• In other words, the solutions to ∇f(x) = 0 form a list of possible minimizing points:
solving ∇f(x) = 0 allows us to focus our attention on a few promising candidate points
(some people call these "critical points"). It might give false positives but never false
negatives: if a point fails the ∇f(x) = 0 test, it cannot be optimal.
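To make the false-positive phenomenon concrete, the following minimal sketch (assuming NumPy) finds the critical points of the last function in the aside above and shows that neither of them is a global minimizer.

```python
import numpy as np

# f(x) = x^3 + 3x^2 - 6x - 8, the last example in the aside above
f  = lambda x: x**3 + 3 * x**2 - 6 * x - 8
df = lambda x: 3 * x**2 + 6 * x - 6        # its derivative

# Critical points: roots of 3x^2 + 6x - 6 = 0, i.e. x = -1 ± sqrt(3).
critical = np.roots([3.0, 6.0, -6.0])
print(critical)        # ≈ 0.732 and -2.732 (in some order)
print(df(critical))    # ≈ [0, 0]: both pass the gradient test

# Yet x ≈ -2.732 is a local maximum, and f is unbounded below
# (f(x) -> -inf as x -> -inf), so neither point is a global minimizer.
print(f(critical))
print(f(-10.0))        # far smaller than f at either critical point
```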
In practice, as you know from experience, solving ∇f(x) = 0 is a practical way of analytically
solving unconstrained problems. Today and next time, we will focus on the following two
big questions:
• What is the correct generalization of the necessary condition ∇f(x) = 0 when we are
faced with a constrained optimization problem?
• Under what circumstances does ∇f(x) = 0 also become sufficient for optimality?
L2.2 Constrained optimization
In order to generalize the "∇f(x) = 0" condition to constrained optimization problems, it
is important to make sure we are all on the same page as to why such a condition arises in
the first place in unconstrained problems. From there, generalizing will be straightforward.
L2.2.1 Why the zero gradient condition in unconstrained optimization?
The idea is very simple: if x is a minimizer of the function, then look at the values of the
function f : ℝ^n → ℝ along a generic direction d ∈ ℝ^n. Clearly, f(x + t·d) ≥ f(x) for all
t ≥ 0 (or x would not be a minimizer). Hence, the directional derivative f′(x; d) of f at x
along direction d satisfies

f′(x; d) = lim_{t→0⁺} [f(x + t·d) − f(x)] / t ≥ 0,

since the limit of a nonnegative sequence must be nonnegative.
By definition of gradient, we have f′(x; d) = ⟨∇f(x), d⟩, and so the previous inequality can
be rewritten as

⟨∇f(x), d⟩ ≥ 0   ∀d ∈ ℝ^n.

Because the above inequality must hold for all directions d ∈ ℝ^n, in particular it must hold
for d = −∇f(x), leading to

−‖∇f(x)‖^2 ≥ 0   ⟺   ∇f(x) = 0.
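The identity f′(x; d) = ⟨∇f(x), d⟩ is easy to verify numerically with a one-sided finite difference; the snippet below is a small sketch of that check (assuming NumPy), using an arbitrary smooth function of my choosing.

```python
import numpy as np

# An arbitrary smooth function, f(x) = ||x||^2, whose gradient is 2x.
f     = lambda x: float(x @ x)
gradf = lambda x: 2.0 * x

rng = np.random.default_rng(0)
x = rng.normal(size=3)
d = rng.normal(size=3)

# One-sided finite-difference estimate of the directional derivative f'(x; d).
t = 1e-6
fd_estimate = (f(x + t * d) - f(x)) / t

print(fd_estimate, gradf(x) @ d)   # the two values agree up to O(t)
```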
L2.2.2 The constrained case
Now that we have a clearer picture of why the "∇f(x) = 0" condition arises in unconstrained
problems, the extension to the constrained case is rather natural.
The main difference with the unconstrained case is that, in a constrained set, we might
be limited in the choices of available directions d along which we can approach x while
remaining in the set. Nonetheless, for any direction d such that x + t·d ∈ Ω for all t ≥ 0
sufficiently small, the above argument applies without changes, and we can still conclude
that necessarily ⟨∇f(x), d⟩ ≥ 0.
So, the natural generalization of the "∇f(x) = 0" condition to constrained problems can
be informally stated as follows: for the optimality of x it is necessary that

⟨∇f(x), d⟩ ≥ 0 for all d ∈ ℝ^n that remain in Ω from x.   (1)
In order to instantiate the above condition, two steps are required:
1. first, we need to determine what the set of "directions d that remain in Ω from x" is.
2. then, based on the directions above, see in what way they constrain ∇f(x). For
example, we have seen before that when every direction d ∈ ℝ^n remains in Ω from x,
then ∇f(x) = 0.
Out of the two, the first point is usually the easier one. In all the cases of interest to us,
we can determine the set of directions that remain in Ω from x by simply taking any other
point y ∈ Ω and considering the direction from x to y. This holds trivially if all line segments
between x and any point in Ω are entirely contained in Ω, a condition known as
star-convexity at x.
Definition L2.1 (Star-convexity at x). A set Ω ⊆ ℝ^n is said to be star-convex at a point
x ∈ Ω if, for all y ∈ Ω, the entire segment from x to y is contained in Ω. In symbols, if

x + t·(y − x) ∈ Ω   ∀t ∈ [0, 1].

(Note that the condition is equivalent to "t·y + (1 − t)·x ∈ Ω for all y ∈ Ω and t ∈ [0, 1]",
or also "t·x + (1 − t)·y ∈ Ω for all y ∈ Ω and t ∈ [0, 1]".)
In fact, for all our purposes today, we will only consider sets that are star-convex at all of
their points. Such sets are simply called convex.
Definition L2.2 (Convex set). A set Ω is convex if it is star-convex at all of its points
x ∈ Ω. In other words, Ω is convex if all segments formed between any two points x, y ∈ Ω
are entirely contained in Ω. In symbols, if

t·x + (1 − t)·y ∈ Ω   ∀x, y ∈ Ω and t ∈ [0, 1].
Under the assumption of convexity, condition (1) can be equivalently rewritten as follows.
Theorem L2.1 (First-order necessary optimality condition for a convex feasible set). Let
Ω ⊆ ℝ^n be convex and f : ℝ^n → ℝ be a differentiable function. For a point x ∈ Ω to be
a minimizer of f over Ω it is necessary that

⟨∇f(x), y − x⟩ ≥ 0   ∀y ∈ Ω.
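As a quick numerical illustration of Theorem L2.1 (a sketch of my own choosing, assuming NumPy), take Ω to be the unit box [0, 1]^n and f(x) = (1/2)·‖x − z‖^2, whose minimizer over the box is the componentwise clipping of z; the snippet checks the inequality ⟨∇f(x⋆), y − x⋆⟩ ≥ 0 on random feasible points y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
z = 3.0 * rng.normal(size=n)

# Minimize f(x) = 0.5 * ||x - z||^2 over the box Omega = [0, 1]^n.
# The minimizer is the componentwise clipping of z onto the box.
x_star = np.clip(z, 0.0, 1.0)
grad = x_star - z                 # gradient of f at x_star

# Theorem L2.1: <grad f(x_star), y - x_star> >= 0 for every y in Omega.
ys = rng.uniform(0.0, 1.0, size=(10_000, n))   # random feasible points
print(np.min((ys - x_star) @ grad))             # nonnegative
```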
L2.2.3 Geometric intuition: normal cones
The condition established in Theorem L2.1 has the following geometric interpretation: the
gradient of f at a solution x ∈ Ω must form a non-obtuse angle (at most 90°) with every
direction y − x, y ∈ Ω. While this makes perfect sense, it is actually more customary, for
mental visualization purposes, to flip signs and use the following mental picture instead: at
any solution x ∈ Ω, the opposite of the gradient, −∇f(x), must form a non-acute angle (at
least 90°) with every direction y − x, y ∈ Ω. In other words, −∇f(x) can only "look" in
directions whose 90° cone of vision contains no point of the set.
Of course, depending on the shape of the set Ω and the particular point x ∈ Ω, the set
of directions that point away from the set might be extremely limited; for example, we
have seen earlier that when Ω = ℝ^n, then no directions "point away" from Ω, and the only
possible value for −∇f(x) is therefore 0. This mental picture of "directions pointing away"
from Ω is generally pretty useful, and we give it a name.
Definition L2.3 (Normal cone). Let Ω ⊆ ℝ^n be convex, and let x ∈ Ω. The normal cone
to Ω at x, denoted 𝒩_Ω(x), is defined as the set

𝒩_Ω(x) ≔ {d ∈ ℝ^n : ⟨d, y − x⟩ ≤ 0 ∀y ∈ Ω}.
With this definition, the first-order necessary optimality condition for x, given in
Theorem L2.1, can be equivalently written as

−∇f(x) ∈ 𝒩_Ω(x).
Example L2.2. As an example, here are a few normal cones computed for a convex set.

[Figure: a convex set Ω, with the normal cones 𝒩_Ω(x_1) and 𝒩_Ω(x_2) drawn at two points x_1 and x_2 on its boundary.]
L2.3 Normal cones at a point in the interior
Let's build our intuition regarding normal cones by considering examples that are progressively
harder. Along the way, we will see that first-order optimality conditions, in all their
simplicity, imply some of the deepest results in optimization theory.
Let's start from an easy example: the normal cone at a point in the interior of the feasible
set, that is, a point around which we can find an entire ball (of some suitably small radius
r > 0) fully contained in the set. This is always the case when the feasible set is
unconstrained: every point is in the interior in that case!

[Figure: a point x with a small ball around it, entirely contained in the set Ω.]
Example L2.3 (Normal cone at an interior point). The normal cone 𝒩_Ω(x) at a point x
in the interior of the feasible set Ω is 𝒩_Ω(x) = {0}.

Solution. In this case, the normal cone contains only the zero vector, that is,

𝒩_Ω(x) = {0}.

This is easy to prove: if any d ≠ 0 were to belong to 𝒩_Ω(x), then we could consider the
point x + δd, which belongs to Ω for sufficiently small δ > 0 since x is in the interior, and
we would have

⟨d, (x + δd) − x⟩ = δ‖d‖^2 > 0,

contradicting the defining inequality of the normal cone. Hence, for a point x in the interior
of Ω to be optimal, it is necessary that ∇f(x) = 0. □
L2.4 Normal cone to a point on a hyperplane / subspace
Next up, we consider the normal cone to a point on a hyperplane.
Theorem L2.2 (Normal cone to a hyperplane). Consider a hyperplane

Ω ≔ {y ∈ ℝ^n : ⟨a, y⟩ = 0},   where a ∈ ℝ^n, a ≠ 0,

and a point x ∈ Ω. The normal cone at x is given by

𝒩_Ω(x) = span{a} = {λ·a : λ ∈ ℝ}.

(See also the picture; this should look pretty intuitive!)

[Figure: the hyperplane Ω passing through x, with normal vector a; the normal cone 𝒩_Ω(x) is the line through x spanned by a.]
Proof. In order to convert our geometric intuition into a formal proof, [▷ before continuing,
try to think how you would go about proving this yourself!] it is enough to show two things:
• all points in span{a} do indeed belong to 𝒩_Ω(x); by the definition of normal cone, this
means that we need to show that all points z ∈ span{a} satisfy

⟨z, y − x⟩ ≤ 0   ∀y ∈ Ω;

• none of the points outside of span{a} belong to 𝒩_Ω(x); that is, for any point z ∉
span{a}, there exists y ∈ Ω such that ⟨z, y − x⟩ > 0.
The first point is straightforward: by definition of span, all points in span{a} are of the
form λ·a for some λ ∈ ℝ. But then, for all y ∈ Ω,

⟨z, y − x⟩ = ⟨λ·a, y − x⟩ = λ·⟨a, y⟩ − λ·⟨a, x⟩ = 0 − 0 ≤ 0,

where the last equality follows from the definition of Ω and the fact that both x and
y belong to it. To prove the second point, we can let the geometric intuition guide us.
Draw a vector z ∉ span{a} applied at x, and look at the picture:

[Figure: the hyperplane Ω, the line x + span{a}, and a vector z drawn at x, together with its projections y onto Ω and x + λ·a onto x + span{a}.]
We can project the point x + z onto Ω, finding some y ∈ Ω, and onto x + span{a}, finding
some point x + λ·a:

z = (y − x) + λ·a.

We now show that z cannot be in 𝒩_Ω(x), because it has a positive inner product
with y − x:

⟨z, y − x⟩ = ⟨(y − x) + λ·a, y − x⟩ = ‖y − x‖^2 + λ·⟨a, y − x⟩ = ‖y − x‖^2,

where λ·⟨a, y − x⟩ = 0 since both x and y lie on the hyperplane. Since z was not aligned
with span{a} by hypothesis, we must have y ≠ x (otherwise z = λ·a would belong to
span{a}), and therefore ⟨z, y − x⟩ > 0, as we wanted to show. □
Remark L2.1. Because normal cones are insensitive to shifts of the set, the result above
applies without changes to any affine hyperplane

Ω ≔ {y ∈ ℝ^n : ⟨a, y⟩ = b},

with a ∈ ℝ^n, a ≠ 0, and b ∈ ℝ. Again,

𝒩_Ω(x) = span{a} = {λ·a : λ ∈ ℝ}

at any x ∈ Ω.
Remark L2.2. The same argument above, based on decomposing x + z onto Ω and onto its
orthogonal complement span{a}, applies to lower-dimensional affine subspaces

Ω ≔ {y ∈ ℝ^n : Ay = b}.

In this case, we obtain that

𝒩_Ω(x) = colspan(A^⊤).

(This immediately recovers Theorem L2.2 by considering A = a^⊤.)

For the setting of Remark L2.2, the projection argument above goes through verbatim; one
now projects x + z onto colspan(A^⊤) and onto Ω.¹
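Here is a small numerical sanity check of Remark L2.2 (a minimal sketch, assuming NumPy): any vector of the form z = A^⊤μ has zero, hence nonpositive, inner product with every difference y − x of feasible points, so it belongs to the normal cone.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2, 5
A = rng.normal(size=(m, n))   # full row rank with probability 1
b = rng.normal(size=m)

# Two feasible points of Omega = {x : A x = b}: a particular solution
# plus different moves along the nullspace of A.
x_part = np.linalg.lstsq(A, b, rcond=None)[0]
_, _, Vt = np.linalg.svd(A)
N = Vt[m:].T                                  # columns span the nullspace of A
x = x_part + N @ rng.normal(size=n - m)
y = x_part + N @ rng.normal(size=n - m)

# Any z = A^T mu is orthogonal to y - x, hence <z, y - x> = 0 <= 0.
z = A.T @ rng.normal(size=m)
print(np.allclose(A @ x, b), np.allclose(A @ y, b))   # both True: x, y feasible
print(z @ (y - x))                                     # ≈ 0
```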
Remark L2.3 (Lagrange multipliers). The discussion we just had shows that whenever
we have a problem of the form

min_x    f(x)
s.t.     Ax = b
         x ∈ ℝ^n,

at optimality it needs to hold that

−∇f(x) = A^⊤μ,   for some μ ∈ ℝ^m,

where m is the number of rows of A. This necessity of being able to express the gradient
of the objective, at optimality, as a linear combination of the constraint gradients (the
rows of A) is very general. The entries of μ are an example of Lagrange multipliers.
In the next two subsections, we will see how the characterization of the normal cone to
affine subspaces enables us to solve a couple of problems that arise in practice.
L2.4.1 Application #1: Projection onto an affine subspace
Example L2.4. Consider the nonempty set Ω ≔ {x ∈ ℝ^n : Ax = b}, where A ∈ ℝ^{m×n} is
such that AA^⊤ is invertible. Prove that the Euclidean projection x of a point z onto Ω,
that is, the solution to²

min_x    (1/2)·‖x − z‖_2^2
s.t.     x ∈ Ω,

is given by

x = z − A^⊤(AA^⊤)^{−1}(Az − b).
Solution. Since the gradient of the objective at any point x is (x − z), from the first-order
optimality conditions any solution x must satisfy

−(x − z) ∈ 𝒩_Ω(x).

From Remark L2.2, we know that at any x ∈ Ω, 𝒩_Ω(x) = colspan(A^⊤) = {A^⊤μ : μ ∈ ℝ^m}.
So, at optimality there must exist μ ∈ ℝ^m such that

−(x − z) = A^⊤μ   ⟹   x = z − A^⊤μ.

Furthermore, since x ∈ Ω, we have Ax = b. Plugging in the above expression for x, we thus
have

A(z − A^⊤μ) = b   ⟹   (AA^⊤)μ = Az − b.

Solving for μ and plugging back into x = z − A^⊤μ yields the result. □
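The closed-form projection is easy to implement. The snippet below is a minimal NumPy sketch (the helper name project_affine is just for illustration) that computes x = z − A^⊤(AA^⊤)^{−1}(Az − b) via a linear solve and then verifies feasibility and the optimality condition −(x − z) ∈ colspan(A^⊤).

```python
import numpy as np

def project_affine(z, A, b):
    """Euclidean projection of z onto {x : A x = b}, assuming A A^T is invertible.

    Implements x = z - A^T (A A^T)^{-1} (A z - b) from Example L2.4, using a
    linear solve rather than forming the inverse explicitly.
    """
    mu = np.linalg.solve(A @ A.T, A @ z - b)
    return z - A.T @ mu

# Small sanity check on random data.
rng = np.random.default_rng(3)
A = rng.normal(size=(2, 5))
b = rng.normal(size=2)
z = rng.normal(size=5)

x = project_affine(z, A, b)
print(np.allclose(A @ x, b))     # feasibility: A x = b

# Optimality: -(x - z) must lie in colspan(A^T); equivalently, z - x is
# exactly reproduced by a least-squares fit onto the columns of A^T.
coeffs, *_ = np.linalg.lstsq(A.T, z - x, rcond=None)
print(np.allclose(A.T @ coeffs, z - x))
```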
L2.4.2 Application #2: Entropy-regularized linear optimization (softmax)
As a second example application, we will consider a real problem that comes up naturally
in online learning and reinforcement learning: entropy-regularized best responses.
Example L2.5. Consider the set of probability distributions over n actions {1, ..., n}
that have full support, that is, the set Δ̊^n ≔ {(x_1, ..., x_n) ∈ ℝ^n_{>0} : x_1 + ⋯ + x_n = 1}.
Given an assignment of values v_i for each action i = 1, ..., n, the entropy-regularized best
response given the values is the distribution that solves the following problem:

min_x    f(x) ≔ −∑_{i=1}^n v_i·x_i + ∑_{i=1}^n x_i·log x_i
s.t.     x ∈ Δ̊^n.

Show that the solution to this problem is the distribution that picks action i with
probability proportional to the exponential of the value v_i of that action:

x_i = e^{v_i} / ∑_{j=1}^n e^{v_j}.
Solution. We'll leave showing that the nonlinear optimization problem has a solution as an
exercise. Here, we show that the first-order optimality conditions imply that the solution
necessarily has components proportional to e^{v_i}.

Pick any point x ∈ Δ̊^n. The set of directions that remain inside Δ̊^n from x is the entire
hyperplane of zero-sum directions {d ∈ ℝ^n : d_1 + ⋯ + d_n = 0}: the constraint x_i > 0 is
completely inconsequential for the purposes of first-order optimality conditions, since it
holds strictly at x. In other words, we are exactly in the same setting as Theorem L2.2 (in
its affine version, Remark L2.1), where in this case a = 1 ∈ ℝ^n is the all-ones vector.
Hence, whatever the solution x to the problem might be, it is necessary that −∇f(x) be in
the normal cone 𝒩_{Δ̊^n}(x) = span{1} ⊆ ℝ^n. So, there must exist μ ∈ ℝ such that

−∇f(x) = (v_1 − 1 − log x_1, ..., v_n − 1 − log x_n)^⊤ = μ·(1, ..., 1)^⊤ ∈ 𝒩_{Δ̊^n}(x),

that is, log x_i = v_i − 1 − μ for all i = 1, ..., n.
Exponentiating on both sides, we have

x_i = exp(v_i − 1 − μ) = α·exp(v_i),   where α ≔ exp(−1 − μ) ∈ ℝ.

This shows that at optimality there exists a proportionality constant α such that x_i =
α·e^{v_i} for all i = 1, ..., n. Since ∑_{i=1}^n x_i = 1, we find that

α·∑_{i=1}^n e^{v_i} = 1   ⟹   α = 1 / ∑_{i=1}^n e^{v_i},

and the result follows. □
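In code, the entropy-regularized best response is just a (numerically stabilized) softmax. The sketch below (assuming NumPy; the helper name is hypothetical) computes it and checks the first-order condition derived above, namely that v − 1 − log x has all entries equal.

```python
import numpy as np

def entropy_regularized_best_response(v):
    """Solve min_x -<v, x> + sum_i x_i log x_i over the full-support simplex.

    By Example L2.5 the solution is the softmax of v; shifting v by its
    maximum only improves numerical stability and leaves the result unchanged.
    """
    w = np.exp(v - np.max(v))
    return w / np.sum(w)

v = np.array([1.0, 2.0, 0.5])
x = entropy_regularized_best_response(v)
print(x, x.sum())            # a full-support probability distribution

# First-order condition: -grad f(x) = v - 1 - log(x) must be a multiple of
# the all-ones vector, i.e. all of its entries must coincide.
print(v - 1.0 - np.log(x))
```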
Changelog
• Feb 11, 2025: Remarked that d ∈ ℝ^n in L2.2.1.
• Feb 13, 2025: fixed typo: "whenver" -> "whenever" (thanks Brandon Eickert!)

*These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.

¹The orthogonality of colspan(A^⊤) and Ω is a reflection of the well-known linear algebra result that the
orthogonal complement of the nullspace of a matrix is the span of the columns of the transpose matrix.

²We already know from Lecture 1 that the projection must exist since Ω is nonempty and closed.