MIT 6.7220/15.084 - Nonlinear Optimization (Spring '25) Thu, Feb 13th 2025
Lecture 4
The special case of convex functions
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)†
In the previous lecture, we saw how any solution $x$ to a nonlinear optimization problem defined on a convex feasible set $\Omega \subseteq \mathbb{R}^n$ must necessarily satisfy the first-order optimality condition
$$\langle \nabla f(x), y - x \rangle \geq 0 \qquad \forall y \in \Omega.$$
In general, this optimality condition is only necessary but not sufficient. However, there exists a notable class of functions for which such a condition is also sufficient. These are called convex functions, and they are the topic of today's lecture.
L4.1 Convex functions
Intuitively, a good mental picture for convex functions is as functions that "curve upward" (think of a bowl, for example). All the following functions are convex:
[Figure: plots of three convex functions: $f(x) = x \log x$ on $(0, 1]$; $f(x) = -x$; and $f(x) = \log(1 + e^x)$.]
In particular, due to their curvature, local optima of these functions are also global optima,
and the first-order optimality condition completely characterizes optimal points. To capture
the condition on the curvature in the most general terms (that is, without even assuming
differentiability of the function), the following definition is used.
Definition L4.1 (Convex function). Let $\Omega \subseteq \mathbb{R}^n$ be convex. A function $f : \Omega \to \mathbb{R}$ is convex if, for any two points $x, y \in \Omega$ and $t \in [0, 1]$,
$$f((1 - t) \cdot x + t \cdot y) \leq (1 - t) \cdot f(x) + t \cdot f(y).$$
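Definition L4.1 can also be read operationally. The following sketch (not part of the notes; the helper name and tolerance are our own choices) samples random chords and tests the defining inequality numerically: a single violated chord disproves convexity, while the absence of violations is merely consistent with it.

```python
# A numerical sanity check of Definition L4.1 (a sketch, not a proof):
# sample random pairs (x, y) and t in [0, 1] and test the chord inequality
#     f((1 - t) x + t y) <= (1 - t) f(x) + t f(y)
# up to a small floating-point tolerance.
import numpy as np

def violates_chord_inequality(f, dim, trials=10_000, seed=0, tol=1e-9):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(dim), rng.standard_normal(dim)
        t = rng.uniform()
        if f((1 - t) * x + t * y) > (1 - t) * f(x) + t * f(y) + tol:
            return True   # found a violated chord: f is certainly not convex
    return False          # no violation found: consistent with convexity

print(violates_chord_inequality(lambda v: float(np.sum(v ** 2)), dim=3))     # False: ||v||^2 is convex
print(violates_chord_inequality(lambda v: float(np.sum(np.sin(v))), dim=3))  # True: sin is not convex
```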
L4.1.1 Convexity implies bounding by linearization
Assuming that $f$ is not only convex but also differentiable, a very important property of convex functions is that they lie above their linearization at any point.
[Figure: the graph of a convex function $f$ lies above its linearization $f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle$ at the point $x_0$.]
This follows directly from the definition, as we show next.
Theorem L4.1. Let $f : \Omega \to \mathbb{R}$ be a convex and differentiable function defined on a convex domain $\Omega$. Then, at all $x \in \Omega$,
$$f(y) \geq \underbrace{f(x) + \langle \nabla f(x), y - x \rangle}_{\text{linearization of } f \text{ around } x} \qquad \forall y \in \Omega.$$
Proof. Pick any $x, y \in \Omega$. By definition of convexity (rewriting $(1 - t) \cdot x + t \cdot y$ as $x + t \cdot (y - x)$), we have
$$f(x + t \cdot (y - x)) \leq f(x) + t \cdot (f(y) - f(x)) \qquad \forall t \in [0, 1].$$
Moving the $f(x)$ from the right-hand side to the left-hand side, and dividing by $t$, we therefore get
$$\frac{f(x + t \cdot (y - x)) - f(x)}{t} \leq f(y) - f(x) \qquad \forall t \in (0, 1].$$
Taking a limit as $t \to 0^+$ and recognizing a directional derivative of $f$ at $x$ along the direction $y - x$ on the left-hand side, we conclude that
$$\langle \nabla f(x), y - x \rangle \leq f(y) - f(x).$$
Rearranging yields the result. □
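To make Theorem L4.1 concrete, here is a quick numerical illustration (a sketch of ours, not from the notes) using the convex function $f(x) = x \log x$ from the figure above: the linearization gap $f(y) - f(x_0) - f'(x_0)(y - x_0)$ is nonnegative at randomly sampled points.

```python
# Numerical illustration of Theorem L4.1 for f(x) = x log x on (0, 1]:
# the linearization at any x0 never exceeds f itself.
import numpy as np

f = lambda x: x * np.log(x)
df = lambda x: np.log(x) + 1.0   # f'(x); linearization: f(x0) + f'(x0) * (y - x0)

rng = np.random.default_rng(1)
x0 = rng.uniform(0.01, 1.0, size=10_000)
y = rng.uniform(0.01, 1.0, size=10_000)
gap = f(y) - (f(x0) + df(x0) * (y - x0))  # Theorem L4.1 asserts gap >= 0
print(gap.min() >= -1e-12)  # True
```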
L4.1.2 Sufficiency of first-order optimality conditions
The above result also immediately shows the sufficiency of first-order optimality conditions.
Theorem L4.2. Let $\Omega \subseteq \mathbb{R}^n$ be convex and $f : \Omega \to \mathbb{R}$ be a convex differentiable function. Then,
$$-\nabla f(x) \in \mathcal{N}_\Omega(x) \iff x \text{ is a minimizer of } f \text{ on } \Omega.$$
Proof. We already know from Lecture 2 that $-\nabla f(x) \in \mathcal{N}_\Omega(x)$ is necessary for optimality. So, we just need to show sufficiency. Specifically, we need to show that if $\langle \nabla f(x), y - x \rangle \geq 0$ for all $y \in \Omega$, then surely $f(y) \geq f(x)$ for all $y \in \Omega$. This follows immediately from Theorem L4.1. □
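As an illustration of Theorem L4.2 (a sketch under our own choice of instance, not from the notes), take $f(x) = \|x - c\|_2^2$ over the box $\Omega = [0, 1]^3$. The candidate minimizer is the coordinatewise clipping of $c$ onto the box, and the first-order condition $\langle \nabla f(x^\star), y - x^\star \rangle \geq 0$ certifies that it is a global minimizer.

```python
# Sketch: certify optimality of x* = clip(c) for f(x) = ||x - c||^2 on [0, 1]^3
# via the first-order condition <grad f(x*), y - x*> >= 0 (Theorem L4.2).
import numpy as np

c = np.array([1.7, -0.3, 0.4])
x_star = np.clip(c, 0.0, 1.0)   # candidate minimizer (projection of c onto the box)
grad = 2.0 * (x_star - c)       # gradient of f at x*

rng = np.random.default_rng(2)
ys = rng.uniform(0.0, 1.0, size=(10_000, 3))              # random feasible points y
print(np.all((ys - x_star) @ grad >= -1e-12))             # first-order condition holds
vals = np.sum((ys - c) ** 2, axis=1)
print(np.all(vals >= np.sum((x_star - c) ** 2) - 1e-12))  # ...so x* is a global minimizer
```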
L4.2 Equivalent definitions of convexity
Theorem L4.3. Let $\Omega \subseteq \mathbb{R}^n$ be a convex set, and $f : \Omega \to \mathbb{R}$ be a function. The following are equivalent definitions of convexity for $f$:
(1) $f((1 - t)x + ty) \leq (1 - t)f(x) + tf(y)$ for all $x, y \in \Omega$, $t \in [0, 1]$. (Most general.)
(2) [If $f$ is differentiable] $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle$ for all $x, y \in \Omega$. (Most often used.)
(3) [If $f$ is differentiable] $\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq 0$ for all $x, y \in \Omega$. (Often easiest to check.)
(4) [If $f$ is twice differentiable and $\Omega$ is open] $\nabla^2 f(x) \succeq 0$ for all $x \in \Omega$.
The third criterion of Theorem L4.3 is usually the easiest to check in practice.
Example L4.1. For example, from that criterion it follows immediately that these functions are convex:
• $f(x) = a^\top x + b$ for any $a \in \mathbb{R}^n$, $b \in \mathbb{R}$;
• $f(x) = x^\top A x$ for any $A \succeq 0$, including $f(x) = \|x\|_2^2$;
• the negative entropy function $f(x) = \sum_{i=1}^n x_i \log x_i$, defined for $x_i > 0$;
• the function $f(x) = -\sum_{i=1}^n \log x_i$, defined for $x_i > 0$;
• the function $f(x) = \log(1 + e^x)$.
Remark L4.1. Condition (3) is also known as monotonicity of the gradient $\nabla f$. In dimension $n = 1$, the condition is equivalent to the statement that the derivative $f'$ is nondecreasing.
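The following sketch (our own, not from the notes) checks criterion (3) numerically for the negative entropy function of Example L4.1, whose gradient has coordinates $1 + \log x_i$; each term $(\log y_i - \log x_i)(y_i - x_i)$ is nonnegative because $\log$ is increasing.

```python
# Sketch: criterion (3), monotonicity of the gradient, for the negative entropy
# f(x) = sum_i x_i log x_i on x_i > 0, whose gradient is grad f(x)_i = 1 + log x_i.
import numpy as np

grad = lambda v: 1.0 + np.log(v)

rng = np.random.default_rng(3)
x = rng.uniform(0.01, 5.0, size=(10_000, 4))
y = rng.uniform(0.01, 5.0, size=(10_000, 4))
inner = np.sum((grad(y) - grad(x)) * (y - x), axis=1)  # <grad f(y) - grad f(x), y - x>
print(inner.min() >= -1e-12)  # True: the gradient is monotone, so f is convex
```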
Proof of Theorem L4.3. We have already seen how (1) ⟹ (2) in Theorem L4.1. To conclude the proof, we will show that under differentiability (3) ⟺ (2) ⟹ (1), and that under twice differentiability and openness of $\Omega$, (3) ⟺ (4). We break the proof into separate steps.
▶ Proof that (2) ⟹ (1).
Intuition: We sum the linear lower bounds centered at the point $z := t \cdot x + (1 - t) \cdot y$, looking in the directions $x - z$ and $y - z$.
Pick any $x, y \in \Omega$ and $t \in (0, 1)$, and consider the point
$$\Omega \ni z := t \cdot x + (1 - t) \cdot y.$$
From the linearization bound (2) for the choices $(x, y) = (z, x)$ and $(x, y) = (z, y)$, we know that
$$f(x) \geq f(z) + \langle \nabla f(z), x - z \rangle, \qquad f(y) \geq f(z) + \langle \nabla f(z), y - z \rangle.$$
Multiplying the first inequality by $t$ and the second by $1 - t$, and summing, we obtain
$$t \cdot f(x) + (1 - t) \cdot f(y) \geq f(z) + \langle \nabla f(z), t \cdot x + (1 - t) \cdot y - z \rangle = f(z),$$
where the equality follows since by definition $z = t \cdot x + (1 - t) \cdot y$. Rearranging, we have (1).
▶ Proof that (2) ⟹ (3).
Intuition: The idea here is to write condition (2) for the pair $(x, y)$ and for the symmetric pair $(y, x)$. Summing the inequalities leads to the statement.
Pick any two $x, y \in \Omega$. From (2), we can write
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle, \qquad f(x) \geq f(y) + \langle \nabla f(y), x - y \rangle.$$
Summing the inequalities, we therefore conclude that
$$0 \geq \langle \nabla f(x) - \nabla f(y), y - x \rangle = -\langle \nabla f(y) - \nabla f(x), y - x \rangle,$$
which is the statement.
▶ Proof that (3) ⟹ (4).
Intuition: Condition (4) uses the Hessian matrix (i.e., second derivatives), but (3) only contains a difference of gradients. Unsurprisingly, the idea is to consider (3) for two close-by points and take a limit to extract an additional derivative.
Pick any $x, y \in \Omega$, and define the point $x_t := x + t \cdot (y - x)$ for $t \in (0, 1]$. Using (3) we have
$$0 \leq \langle \nabla f(x_t) - \nabla f(x), x_t - x \rangle = t \cdot \langle \nabla f(x_t) - \nabla f(x), y - x \rangle.$$
Rearranging and dividing by $t^2$, we have
$$\frac{\langle \nabla f(x + t \cdot (y - x)) - \nabla f(x), y - x \rangle}{t} \geq 0.$$
Taking the limit as $t \to 0^+$, we therefore have
$$\langle y - x, \nabla^2 f(x) \cdot (y - x) \rangle \geq 0.$$
Since $\Omega$ is open by hypothesis, the direction of $y - x$ is arbitrary, and therefore we must have $\nabla^2 f(x) \succeq 0$, as we wanted to show.
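Criterion (4) also lends itself to a direct numerical check. As a sketch (ours; the finite-difference step size and tolerance are arbitrary choices), we can estimate the Hessian of $f(x) = \log(1 + e^{a^\top x})$ by central finite differences and verify that its eigenvalues are nonnegative up to numerical error.

```python
# Sketch: check criterion (4) numerically for f(x) = log(1 + exp(a^T x)) by
# estimating the Hessian with central finite differences and testing that its
# smallest eigenvalue is nonnegative (up to discretization error).
import numpy as np

a = np.array([1.0, -2.0, 0.5])
f = lambda v: np.log1p(np.exp(a @ v))

def hessian_fd(f, x, h=1e-4):
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = h * np.eye(n)[i], h * np.eye(n)[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

x = np.array([0.3, 0.1, -0.7])
print(np.linalg.eigvalsh(hessian_fd(f, x)).min() >= -1e-6)  # True: Hessian is PSD
```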
▶ Proof that (4) ⟹ (3).
Intuition: To go from (3) to (4) we took a derivative in the direction $y - x$. To go back, we take an integral on the line from $x$ to $y$ instead.
By hypothesis, for any $x, y \in \Omega$ and $\tau \in [0, 1]$,
$$0 \leq \langle y - x, \nabla^2 f(x + \tau \cdot (y - x)) \cdot (y - x) \rangle.$$
Hence, taking the integral,
$$0 \leq \int_0^1 \langle y - x, \nabla^2 f(x + t \cdot (y - x)) \cdot (y - x) \rangle \,\mathrm{d}t = \Bigl\langle y - x, \int_0^1 \underbrace{\nabla^2 f(x + t \cdot (y - x)) \cdot (y - x)}_{= \frac{\mathrm{d}}{\mathrm{d}t} \nabla f(x + t \cdot (y - x))} \,\mathrm{d}t \Bigr\rangle = \langle y - x, \nabla f(y) - \nabla f(x) \rangle.$$
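The key identity underneath this step, $\nabla f(y) - \nabla f(x) = \int_0^1 \nabla^2 f(x + t \cdot (y - x)) \cdot (y - x) \,\mathrm{d}t$, can itself be sanity-checked numerically. Here is a sketch (ours) for $f(v) = -\sum_i \log v_i$, whose gradient is $-1/v_i$ and whose Hessian is $\mathrm{diag}(1/v_i^2)$, using a simple trapezoidal quadrature.

```python
# Sketch: numerically verify grad f(y) - grad f(x) = \int_0^1 Hess f(x_t) (y - x) dt
# for f(v) = -sum_i log v_i, where Hess f(v) = diag(1 / v^2) and x_t = x + t (y - x).
import numpy as np

grad = lambda v: -1.0 / v
x = np.array([0.5, 1.0, 2.0])
y = np.array([1.5, 0.8, 1.2])

ts = np.linspace(0.0, 1.0, 20_001)[:, None]
xt = x + ts * (y - x)                  # points along the segment from x to y
integrand = (y - x) / xt ** 2          # Hess f(x_t) @ (y - x), coordinate by coordinate
dt = 1.0 / (ts.shape[0] - 1)
integral = (integrand[:-1] + integrand[1:]).sum(axis=0) * dt / 2.0  # trapezoid rule
print(np.allclose(integral, grad(y) - grad(x), atol=1e-6))  # True
```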
▶ Proof that (3) ⟹ (2).
Intuition: The idea here is to treat $x$ as fixed, and integrate condition (3) on the line from $x$ to $y$.
Pick any $x, y \in \Omega$, and define the point $x_t := x + t \cdot (y - x)$ for $t \in [0, 1]$. Using condition (3) we have
$$0 \leq \langle \nabla f(x_t) - \nabla f(x), x_t - x \rangle = t \cdot \langle \nabla f(x_t) - \nabla f(x), y - x \rangle,$$
which implies that $\langle \nabla f(x_t) - \nabla f(x), y - x \rangle \geq 0$ for all $t \in (0, 1]$ (and trivially also at $t = 0$).
Letting $t$ range from 0 to 1 and integrating,
$$0 \leq \int_0^1 \langle y - x, \nabla f(x_t) - \nabla f(x) \rangle \,\mathrm{d}t = -\langle y - x, \nabla f(x) \rangle + \int_0^1 \langle y - x, \nabla f(x + t \cdot (y - x)) \rangle \,\mathrm{d}t = -\langle y - x, \nabla f(x) \rangle + f(y) - f(x),$$
where the last equality uses the fundamental theorem of calculus: $\langle y - x, \nabla f(x + t \cdot (y - x)) \rangle$ is precisely the derivative of $t \mapsto f(x + t \cdot (y - x))$.
Rearranging yields $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle$, which is (2). □
† These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any reports of typos.