




MIT 6.7220/15.084 — Nonlinear Optimization (Spring ‘25) Tue, Feb 11th 2025
Lecture 3
More on normal cones, and a first taste of duality
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
In this lecture, we continue our investigation of normal cones to points in feasible sets that
occur frequently. In Lecture 2, we have considered the normal cone to a hyperplane, and
we have seen that the normal cone at a point 𝑥 in a hyperplane is the set of all vectors
that are orthogonal to the hyperplane. We now consider the normal cone at a point in a
halfspace.
An interesting consequence of the characterization of normal cones at the intersection of
halfspaces will be to allow us to derive linear programming duality—one of the “crown
jewels” of linear optimization—as a direct corollary.
L3.1 Normal cone at a point in a halfspace
Let’s start from considering a point in a single halfspace.
Example L3.1 (Normal cone to a halfspace). Consider a halfspace
Ω ≔ {𝑦 ∈ ℝ𝑛 : ⟨𝑎, 𝑦⟩ ≤ 𝑏}, where 𝑎 ∈ ℝ𝑛, 𝑎 ≠ 0,
and a point 𝑥 ∈ Ω.
We already know that if 𝑥 is in the interior of Ω (i.e., ⟨𝑎, 𝑥⟩ < 𝑏),
then 𝒩Ω(𝑥) = {0}. On the other hand, if 𝑥 is on the boundary of
Ω, that is, ⟨𝑎, 𝑥⟩ = 𝑏, then it is pretty intuitive from the picture
that
𝒩Ω(𝑥) = {𝜆 ⋅ 𝑎 : 𝜆 ∈ ℝ≥0}.
[Figure: the halfspace Ω, with the normal cone 𝒩Ω(𝑥) at a boundary point 𝑥 pointing along 𝑎.]
The crucial difference between the above result and that of Theorem L2.2 (Normal cone to
a hyperplane) is that the normal cone to a halfspace only points in one direction (𝜆 ≥ 0),
as opposed to the case of the hyperplane, where 𝜆 ∈ ℝ. The formal proof can be obtained
by adapting the arguments we used in Theorem L2.2. [▷ You should try to work out the
details!]
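While working out the formal proof, it can help to spot-check the claim numerically: every direction 𝜆𝑎 with 𝜆 ≥ 0 must satisfy the defining inequality ⟨𝜆𝑎, 𝑦 − 𝑥⟩ ≤ 0 for all 𝑦 ∈ Ω. The sketch below (an illustration of mine, not part of the lecture) samples points of the halfspace and verifies the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.normal(size=n)
b = 1.0
x = a * b / a.dot(a)   # boundary point: <a, x> = b

lam = 2.0              # any lambda >= 0
d = lam * a            # candidate element of the normal cone

for _ in range(1000):
    w = rng.normal(size=n)
    # shift w so that <a, w> <= 0; then y = x + w stays in the halfspace
    w -= max(a.dot(w), 0.0) * a / a.dot(a)
    y = x + w
    assert a.dot(y) <= b + 1e-9   # y is feasible
    assert d.dot(y - x) <= 1e-9   # normal-cone inequality holds
```

Sampling of course proves nothing, but it is a useful sanity check before attempting the adaptation of the Theorem L2.2 argument.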
L3.2 Normal cone at the intersection of halfspaces
We now consider the case of the intersection of two halfspaces.
Example L3.2 (Normal cone to the intersection of two halfspaces). Consider the intersection Ω of two halfspaces:
Ω ≔ {𝑦 ∈ ℝ𝑛 : ⟨𝑎1, 𝑦⟩ ≤ 𝑏1, ⟨𝑎2, 𝑦⟩ ≤ 𝑏2}, where 𝑎1, 𝑎2 ∈ ℝ𝑛, 𝑎1, 𝑎2 ≠ 0,
and a point 𝑥 ∈ Ω.
If 𝑥 is in the interior of Ω, or if 𝑥 is on the boundary of
one of the halfspaces but not both, then the normal cone
follows from our prior results. So, consider 𝑥 at the intersection of the boundaries of the halfspaces, that is, ⟨𝑎1, 𝑥⟩ = 𝑏1, ⟨𝑎2, 𝑥⟩ = 𝑏2.
It is pretty intuitive from the picture that the normal cone
at 𝑥 is the conic hull of 𝑎1 and 𝑎2, that is, the set obtained
from summing the directions in all possible ways:
𝒩Ω(𝑥) = {𝜆1 ⋅ 𝑎1 + 𝜆2 ⋅ 𝑎2 : 𝜆1, 𝜆2 ∈ ℝ≥0}.
[Figure: the intersection Ω of two halfspaces, with the normal cone 𝒩Ω(𝑥) at the corner point 𝑥 spanned conically by 𝑎1 and 𝑎2.]
Formally proving the above result requires a bit of simple machinery that we have not seen
yet; it will be the topic of Lecture 5. For now, we take the result as a given. An important
takeaway to remember is that, at least in the case of halfspaces, the normal cone is obtained
by summing the normal cones given by all the constraints that are active at the point.
The generalization of the result to the case of 𝑚 halfspaces is equally intuitive, and we
present it next.
Theorem L3.1. Let Ω ⊆ ℝ𝑛 be given as the intersection of 𝑚 halfspaces ⟨𝑎𝑗, 𝑥⟩ ≤ 𝑏𝑗.
Then, the normal cone at any point 𝑥 ∈ Ω is obtained by taking nonnegative combinations of all those 𝑎𝑗’s for which ⟨𝑎𝑗, 𝑥⟩ = 𝑏𝑗. In symbols:
𝒩Ω(𝑥) = {∑_{𝑗∈𝐼(𝑥)} 𝜆𝑗 ⋅ 𝑎𝑗 : 𝜆𝑗 ≥ 0}, where 𝐼(𝑥) = {𝑗 ∈ {1, ..., 𝑚} : ⟨𝑎𝑗, 𝑥⟩ = 𝑏𝑗}.
The constraints 𝑗 in 𝐼(𝑥) are often called the “active constraints” at 𝑥 ∈ Ω.
L3.2.1 A remark on equality constraints
Note that the above characterization of normal cones immediately recovers what we saw in
Lecture 2 for hyperplanes and affine subspaces. Indeed, the constraint
⟨𝑎, 𝑥⟩ = 0
is equivalent to the intersection of the halfspaces
⟨𝑎, 𝑥⟩ ≤ 0 ⟨−𝑎, 𝑥⟩ ≤ 0.
No matter the point 𝑥 considered, the two inequality constraints must be active at the same
time. So, by Theorem L3.1, the normal cone is given by the set
𝒩Ω(𝑥) = {𝜆1𝑎 + 𝜆2(−𝑎) : 𝜆1, 𝜆2 ≥ 0}
= {(𝜆1 − 𝜆2)𝑎 : 𝜆1, 𝜆2 ≥ 0}
= {𝜆𝑎 : 𝜆 ∈ ℝ}
= span(𝑎).
Example L3.3. As an example application, let’s write the normal cone to a point in the
probability simplex
Δ𝑛 ≔ {(𝑥1, ..., 𝑥𝑛) : 𝑥1 + ⋯ + 𝑥𝑛 = 1, 𝑥1 ≥ 0, ..., 𝑥𝑛 ≥ 0}.
The probability simplex is the intersection of the hyperplane 1⊤𝑥 = 1 and the halfspaces −𝑒𝑖⊤𝑥 ≤ 0 for all 𝑖 = 1, ..., 𝑛, where 𝑒𝑖 is the 𝑖-th unit vector. Hence, by Theorem L3.1,
the normal cone at any point 𝑥 ∈ Δ𝑛 is given by
𝒩Δ𝑛(𝑥) = {𝛼1 − ∑_{𝑖:𝑥𝑖=0} 𝜆𝑖𝑒𝑖 : 𝛼 ∈ ℝ, 𝜆𝑖 ≥ 0}.
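This characterization yields a quick membership test (a small sketch of mine, assuming the formula above; the function name is not from the notes): a vector 𝑔 lies in 𝒩Δ𝑛(𝑥) exactly when 𝑔𝑖 equals a common value 𝛼 on the support of 𝑥, and 𝑔𝑖 ≤ 𝛼 off the support.

```python
import numpy as np

def in_simplex_normal_cone(g, x, tol=1e-9):
    """Check g = alpha*1 - sum_{i: x_i = 0} lam_i e_i with lam_i >= 0."""
    g, x = np.asarray(g, float), np.asarray(x, float)
    support = x > tol
    alpha = g[support][0]  # alpha is forced on the support (nonempty, since sum(x) = 1)
    return (bool(np.all(np.abs(g[support] - alpha) <= tol))
            and bool(np.all(g[~support] <= alpha + tol)))

# At the vertex e1 of the simplex, (1, 0, -3) is in the cone (alpha = 1),
# while (1, 2, 0) is not (its second entry exceeds alpha = 1).
print(in_simplex_normal_cone([1.0, 0.0, -3.0], [1.0, 0.0, 0.0]))  # True
print(in_simplex_normal_cone([1.0, 2.0, 0.0], [1.0, 0.0, 0.0]))   # False
```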
L3.2.2 Complementary slackness reformulation
The previous result can be reformulated in a more compact way by using the concept of
complementary slackness, as we detail in the next theorem. This reformulation is a simple
algebraic trick, and it does not really bring anything fundamentally new to the table.
However, the final form after the reformulation is very pleasantly concise, and often leads
to nice manipulations.
The key idea is that, in the definition of 𝒩Ω(𝑥) in Theorem L3.1, instead of summing over
𝑗 ∈ 𝐼(𝑥), one could equivalently sum over all 𝑗 = 1, ..., 𝑚, and then impose the constraint
that 𝜆𝑗 = 0 for all 𝑗 ∉ 𝐼(𝑥):
𝒩Ω(𝑥) = {∑_{𝑗=1}^{𝑚} 𝜆𝑗 ⋅ 𝑎𝑗 : 𝜆𝑗 ≥ 0, 𝜆𝑗 = 0 if ⟨𝑎𝑗, 𝑥⟩ < 𝑏𝑗}.
Because 𝜆𝑗 ≥ 0, the condition that 𝜆𝑗 = 0 whenever ⟨𝑎𝑗, 𝑥⟩ < 𝑏𝑗 can be succinctly rewritten as
∑_{𝑗=1}^{𝑚} 𝜆𝑗(𝑏𝑗 − ⟨𝑎𝑗, 𝑥⟩) = 0.
The above condition is usually called “complementary slackness”.
The reason why the above reformulation is often preferred is that it lends itself especially well to being written in matrix form. In particular, we have the following.
Remark L3.1. Given the intersection of halfspaces expressed in vectorial form as
𝐴𝑥 ≤ 𝑏,
the normal cone at any feasible point 𝑥 given in Theorem L3.1 can be expressed
equivalently as
𝒩Ω(𝑥) = {𝐴⊤𝜆 : 𝜆 ≥ 0, 𝜆⊤(𝑏 − 𝐴𝑥) = 0}.
There is nothing “deep” regarding this rewriting. The important point is that only the constraints that are active at 𝑥 participate in defining the normal cone. Using the complementary slackness formulation or the form in Theorem L3.1 is just a matter of convenience in how to formalize what it means for a constraint to be active.
L3.3 Application #1: Derivation of linear programming duality
The previous characterization of normal cones at the intersection of halfspaces is natural,
but hides quite the punch. In particular, we show how linear programming duality follows
immediately as a corollary.
Consider the linear program
max
𝑥
s.t.
𝑓(𝑥) ≔ 𝑐𝑥
𝐴𝑥 ≤ 𝑏
𝑥 ∈ 𝑛,
(P)
where the matrix 𝐴 is assumed to have 𝑚 rows. We will show that the first-order necessary optimality conditions imply that if 𝑥⋆ is an optimal solution to (P), then the following “dual” problem also has an optimal solution, whose value matches that of (P):
min_𝜆    𝑔(𝜆) ≔ 𝑏⊤𝜆
s.t.     𝐴⊤𝜆 = 𝑐
         𝜆 ≥ 0.                (D)
From the first-order necessary optimality conditions for (P), we know that any optimal 𝑥⋆ must satisfy the condition (note that the problem is a maximization rather than a minimization)
∇𝑓(𝑥⋆) ∈ 𝒩Ω(𝑥⋆), where Ω ≔ {𝑥 ∈ ℝ𝑛 : 𝐴𝑥 ≤ 𝑏}. (3)
From Theorem L3.1 and Remark L3.1, the normal cone 𝒩Ω(𝑥⋆) consists of nonnegative combinations of the rows of 𝐴, that is,
𝒩Ω(𝑥⋆) = {𝐴⊤𝜆 : 𝜆⊤(𝑏 − 𝐴𝑥⋆) = 0, 𝜆 ≥ 0}, (4)
where the condition 𝜆⊤(𝑏 − 𝐴𝑥⋆) = 0 is the complementary slackness condition mentioned in Remark L3.1. Plugging (4) into (3), together with the fact that ∇𝑓(𝑥⋆) = 𝑐, we obtain that at optimality there exists 𝜆⋆ ∈ ℝ𝑚 such that
𝑐 = 𝐴⊤𝜆⋆, (𝜆⋆)⊤(𝑏 − 𝐴𝑥⋆) = 0, 𝜆⋆ ≥ 0.
Plugging the fact that 𝐴⊤𝜆⋆ = 𝑐 into the middle equality yields
(𝜆⋆)⊤𝑏 = 𝑐⊤𝑥⋆.
Hence, from the mere existence of an optimal point 𝑥⋆ ∈ Ω for (P), we deduced that there exists a point 𝜆⋆ in the feasible set of (D) that achieves objective value 𝑔(𝜆⋆) = 𝑏⊤𝜆⋆ = 𝑐⊤𝑥⋆ = 𝑓(𝑥⋆). To conclude that 𝜆⋆ is optimal for (D), it suffices to show that all other feasible points for (D) must achieve value ≥ 𝑐⊤𝑥⋆. This is pretty straightforward. All feasible vectors 𝜆 for (D) satisfy 𝜆 ≥ 0 and 𝐴⊤𝜆 = 𝑐. Furthermore, 𝑏 ≥ 𝐴𝑥⋆ since 𝑥⋆ ∈ Ω. Hence, for any feasible 𝜆 of (D),
𝑔(𝜆) = 𝑏⊤𝜆
     ≥ (𝐴𝑥⋆)⊤𝜆
     = (𝑥⋆)⊤𝐴⊤𝜆
     = (𝑥⋆)⊤𝑐 = 𝑓(𝑥⋆).
Since 𝑔(𝜆) ≥ 𝑓(𝑥⋆) for all feasible 𝜆 in (D), and we know that a feasible 𝜆⋆ that achieves 𝑔(𝜆⋆) = 𝑓(𝑥⋆) exists, it follows that 𝜆⋆ must be optimal for (D).
Thus, as a simple corollary of a result that felt natural, we have established one of the crown jewels of linear optimization.
Theorem L3.2 (Strong linear programming duality). If (P) admits an optimal solution 𝑥⋆, then (D) admits an optimal solution 𝜆⋆ such that:
• the values of the two problems coincide: 𝑐⊤𝑥⋆ = 𝑏⊤𝜆⋆; and
• 𝜆⋆ satisfies the complementary slackness condition (𝜆⋆)⊤(𝑏 − 𝐴𝑥⋆) = 0.
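To make the theorem concrete, here is a small numerical sanity check (a toy instance of mine, not from the lecture): for a candidate pair (𝑥⋆, 𝜆⋆), we verify exactly the conditions the derivation above produced, namely primal and dual feasibility, complementary slackness, and matching objective values.

```python
import numpy as np

# Toy instance of (P): maximize x1 + x2  s.t.  x1 <= 1, x2 <= 1, x1 + x2 <= 1.5
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.5])
c = np.array([1.0, 1.0])

x_star = np.array([0.5, 1.0])         # candidate primal solution
lam_star = np.array([0.0, 0.0, 1.0])  # candidate dual solution

assert np.all(A @ x_star <= b + 1e-9)           # primal feasibility
assert np.all(lam_star >= 0)                    # dual sign constraint
assert np.allclose(A.T @ lam_star, c)           # dual feasibility: A^T lam = c
assert abs(lam_star @ (b - A @ x_star)) < 1e-9  # complementary slackness
assert abs(c @ x_star - b @ lam_star) < 1e-9    # objectives coincide (both 1.5)
```

By the argument above, such a certificate simultaneously proves optimality of 𝑥⋆ for (P) and of 𝜆⋆ for (D).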
L3.4 Application #2: Projection onto a probability simplex
As a second application, we consider the problem of projecting a vector onto the probability simplex Δ𝑛, that is, the set of distributions over 𝑛 actions
Δ𝑛 ≔ {(𝑥1, ..., 𝑥𝑛) ∈ ℝ𝑛≥0 : 𝑥1 + ⋯ + 𝑥𝑛 = 1}.
This problem arises in machine learning, where one often wants to figure out an optimal
distribution over actions by repeatedly adjusting the distribution and projecting back onto
the simplex. More formally, given a point 𝑦 ∈ 𝑛 to project, we are interested in computing
the solution to
min_𝑥    ½ ‖𝑥 − 𝑦‖₂²
s.t.     𝑥 ∈ Δ𝑛.
Incidentally, since the set is closed, we already know from Lecture 1 that the problem has a minimizer. As it turns out, unlike the applications we considered in Lecture 2, this problem does not admit a closed-form solution. However, the first-order optimality conditions will nonetheless point the way to a fast algorithm for computing the projection up to any desired approximation. In particular, we will see that an 𝜖-approximate projection can be computed in 𝑂(𝑛 log(1/𝜖)) time, and, with a little extra work, an exact solution can be computed in 𝑂(𝑛 log 𝑛) time.
From Example L3.3, we know that the normal cone at any point 𝑥 ∈ Δ𝑛 is given by
𝒩Δ𝑛(𝑥) = {𝛼1 − ∑_{𝑖:𝑥𝑖=0} 𝜆𝑖𝑒𝑖 : 𝛼 ∈ ℝ, 𝜆𝑖 ≥ 0}.
Hence, the first-order optimality conditions for the projection problem, −(𝑥 − 𝑦) ∈ 𝒩Δ𝑛(𝑥), imply that at optimality there must exist multipliers 𝛼 ∈ ℝ, 𝜆𝑖 ≥ 0 such that
𝑥 = 𝑦 − 𝛼1 + ∑_{𝑖:𝑥𝑖=0} 𝜆𝑖𝑒𝑖.
Rewritten into complementary slackness form, this is the same as
𝑥 = 𝑦 − 𝛼1 + 𝜆, where 𝜆𝑖𝑥𝑖 = 0 for all 𝑖 = 1, ..., 𝑛.
Looking at the vector equality along the generic row 𝑖, we have that
𝑥𝑖 = 𝑦𝑖 − 𝛼 + 𝜆𝑖, 𝜆𝑖𝑥𝑖 = 0.
Removing the multipliers 𝜆𝑖. As it turns out, we can solve for the multipliers 𝜆𝑖 in the
previous equality and substitute them directly:
• Case I: 𝑦𝑖 − 𝛼 < 0. Since 𝑥𝑖 ≥ 0, the only possibility is that 𝜆𝑖 > 0. But then, 𝑥𝑖 = 0,
which leads to 𝜆𝑖 = 𝛼 − 𝑦𝑖 ≥ 0.
• Case II: 𝑦𝑖 − 𝛼 ≥ 0. If we were to pick any 𝜆𝑖 > 0, this would result in 𝑥𝑖 > 0, which
would then imply 𝜆𝑖 = 0 by complementary slackness. So, the only possibility is that
𝜆𝑖 = 0. But then, 𝑥𝑖 = 𝑦𝑖 − 𝛼.
We see that no matter the value of 𝛼, the only solution for 𝜆𝑖 results in the relationship
𝑥𝑖 = [𝑦𝑖 − 𝛼]+,
where [𝑧]+ ≔ max{𝑧, 0}. This is not exactly obvious: basically, for a point 𝑥 to aspire to be the projection of 𝑦 onto the simplex, it better be the case that a value 𝛼 ∈ ℝ exists such that 𝑥𝑖 = [𝑦𝑖 − 𝛼]+ for all 𝑖 = 1, ..., 𝑛.
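Indeed, once 𝛼 is fixed, setting 𝜆𝑖 = [𝛼 − 𝑦𝑖]+ recovers both the stationarity equation and complementary slackness. A quick numerical check of this bookkeeping (a sketch of mine, with arbitrary 𝑦 and 𝛼):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=8)
alpha = 0.3  # any value of alpha; the identities below hold regardless

x = np.maximum(y - alpha, 0.0)    # x_i = [y_i - alpha]+
lam = np.maximum(alpha - y, 0.0)  # lam_i = [alpha - y_i]+

assert np.allclose(x, y - alpha + lam)  # stationarity: x = y - alpha*1 + lambda
assert np.allclose(lam * x, 0.0)        # complementary slackness: lam_i x_i = 0
```

Both identities follow from max{𝑡, 0} = 𝑡 + max{−𝑡, 0} and the fact that at most one of [𝑡]+ and [−𝑡]+ is nonzero.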
Solving for 𝛼. How do we solve for 𝛼? The value of 𝛼 is such that the sum of the 𝑥𝑖’s is 1. Hence, we must have
∑_{𝑖=1}^{𝑛} [𝑦𝑖 − 𝛼]+ = 1.
For 𝛼 = max𝑖{𝑦𝑖}, the left-hand side is 0, and reducing 𝛼 makes the left-hand side increase monotonically. Furthermore, for 𝛼 = min𝑖{𝑦𝑖} − 1, the left-hand side is ≥ 1. Hence, there exists a unique value of 𝛼 that satisfies the equation. However, we do not have a closed-form expression for this value. Nonetheless, we can use a binary search in the range
[min𝑖{𝑦𝑖} − 1, max𝑖{𝑦𝑖}]
to compute the value of 𝛼 to any desired precision 𝜖 in 𝑂(𝑛 log(1/𝜖)) time.
Another approach would be to sort the 𝑦𝑖, and then compute the value of 𝛼 by scanning
the sorted list. This would take 𝑂(𝑛 log 𝑛) time.
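The sorting-based approach can be sketched as follows (my own implementation of the idea, not code from the lecture): after sorting 𝑦 in decreasing order, the support of the projection is a prefix of the sorted vector, and 𝛼 is determined by forcing the entries of that prefix to sum to 1.

```python
import numpy as np

def project_simplex_sort(y):
    """Exact Euclidean projection of y onto the probability simplex.

    Sort y in decreasing order, then find the largest prefix whose entries
    remain positive after subtracting the alpha computed from that prefix.
    """
    y = np.asarray(y, float)
    u = np.sort(y)[::-1]                      # u1 >= u2 >= ... >= un
    css = np.cumsum(u)                        # prefix sums of the sorted entries
    j = np.arange(1, len(y) + 1)
    rho = np.max(j[u - (css - 1.0) / j > 0])  # size of the support
    alpha = (css[rho - 1] - 1.0) / rho        # alpha makes the support sum to 1
    return np.maximum(y - alpha, 0.0)

print(project_simplex_sort([0.5, 0.2]))  # [0.65 0.35]
print(project_simplex_sort([2.0, 0.0]))  # [1. 0.]
```

The sort dominates the running time, giving the claimed 𝑂(𝑛 log 𝑛); the scan for 𝜌 and the final thresholding are linear.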
Changelog
• Feb 11, 2025: Cosmetic changes.
• Feb 11, 2025: Typo fixes (thanks Jonathan Huang!)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.
