### Unformatted Attachment Preview

Q1. (a) We take n independent samples x1, x2, . . . , xn from the distribution:

$$f(x) = \sqrt{\frac{\lambda}{2\pi x^{3}}}\,\exp\left(-\frac{\lambda}{2x}\left(\frac{x-\mu}{\mu}\right)^{2}\right).$$

The goal is estimating the unknown parameters λ and µ of this distribution. Derive the maximum likelihood
estimates of λ and µ in terms of the samples x1, x2, . . . , xn.
Hint. The expression for λ̂ should ultimately become

$$\hat{\lambda} = \left(\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{x_i} - \frac{1}{\bar{x}}\right)\right)^{-1}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Also, to simplify your expressions you will need to use the assumptions that
µ̂ ≠ 0 and λ̂ ≠ 0.
(b) Suppose that n = 3, and x1 = 1.7538, x2 = 3.0649, x3 = 2.4183. Calculate the values of λ̂ and µ̂.
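The closed-form estimates can be evaluated numerically; below is a minimal Python sketch, assuming the derivation in (a) yields µ̂ = x̄ (the standard result for this family) together with the λ̂ expression from the hint:

```python
# Sketch: evaluate the closed-form ML estimates for part (b).
# Assumes (a) yields mu_hat = x_bar and the lambda_hat formula in the hint.
x = [1.7538, 3.0649, 2.4183]
n = len(x)

mu_hat = sum(x) / n                    # sample mean x_bar
lam_hat = 1.0 / (sum(1.0 / xi - 1.0 / mu_hat for xi in x) / n)

print(f"mu_hat     = {mu_hat:.4f}")
print(f"lambda_hat = {lam_hat:.4f}")
```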
Q2. We take n independent samples x1, x2, . . . , xn from the distribution

$$f(x) = \frac{\alpha x^{\alpha-1}}{\beta^{\alpha}}\,\exp\left(-\left(\frac{x}{\beta}\right)^{\alpha}\right).$$

The goal is estimating the parameters β and α based on the observed samples x1, x2, . . . , xn. Unlike the
previous question, where we were able to derive closed-form expressions for the ML estimates, for this
problem we will not be able to do so, and need to consider a numerical minimization technique.
Let's go through this task step by step.
(a) Formulate the likelihood function

$$f(x_1, \dots, x_n \mid \alpha, \beta) = \dots$$
(b) Formulate the negative log-likelihood function

$$L(\beta, \alpha) = -\log\left(f(x_1, \dots, x_n \mid \alpha, \beta)\right).$$

To find the ML estimates β̂ and α̂ we can either maximize the log-likelihood function, or equivalently
minimize the negative log-likelihood function L(β, α). For this purpose we decide to use a gradient descent
scheme. Derive the expressions for the gradient components below:

$$\frac{\partial L}{\partial \beta} = \dots, \qquad \frac{\partial L}{\partial \alpha} = \dots \tag{1}$$

Hint. If a is a positive constant, the derivative of the function $g(x) = a^x$ is $g'(x) = \log(a)\,a^x$.
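As a starting point for this derivation (a sketch, assuming the pdf stated above), expanding the negative log-likelihood of the n samples gives

```latex
% Expanded negative log-likelihood (sketch); differentiate term by term
% with respect to beta and alpha to obtain the components in (1).
L(\beta, \alpha)
  = -\sum_{i=1}^{n} \log f(x_i \mid \alpha, \beta)
  = -n\log\alpha + n\alpha\log\beta
    - (\alpha - 1)\sum_{i=1}^{n}\log x_i
    + \sum_{i=1}^{n}\left(\frac{x_i}{\beta}\right)^{\alpha}.
```

The hint applies when differentiating the last sum with respect to α, since each term has the form $a^{\alpha}$ with $a = x_i/\beta$ held constant (and positive, as the samples are positive).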
(c) Suppose that n = 5, and x1 = 0.20433, x2 = 0.35718, x3 = 0.35852, x4 = 0.05023, x5 = 0.08290. Use
the Matlab or Python 3D surface tools to plot L(β, α) as a function of β and α, in the following region:
0.1 ≤ β ≤ 2.5, 0.1 ≤ α ≤ 2.5.
(d) Again, suppose that n = 5, and x1 = 0.20433, x2 = 0.35718, x3 = 0.35852, x4 = 0.05023, x5 =
0.08290. Pick a programming language of your choice and write up a gradient descent (GD) scheme using
the gradient components you derived in (1). For your scheme set the learning rate to η = 0.001,
and use the initial values β₀ = 2 and α₀ = 2. Run the GD scheme for 2500 iterates, and report the final
estimates of β and α. Also provide a plot showing the iterative values of β for the iterates from 1 to 2500.
Provide a similar plot showing the iterative values of α for the iterates from 1 to 2500. Also attach your
code.
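A minimal GD sketch in Python for part (d); the gradient expressions in `grad` are one possible outcome of the derivation in (b) and should be checked against your own (the plotting calls are left out):

```python
import math

# Data and settings from part (d).
x = [0.20433, 0.35718, 0.35852, 0.05023, 0.08290]
n = len(x)
eta = 0.001              # learning rate
beta, alpha = 2.0, 2.0   # initial values

# One possible form of the gradients of
#   L(beta, alpha) = -sum_i log f(x_i | alpha, beta):
#   dL/dbeta  = n*alpha/beta - (alpha/beta) * sum_i (x_i/beta)**alpha
#   dL/dalpha = -n/alpha + n*log(beta) - sum_i log(x_i)
#               + sum_i (x_i/beta)**alpha * log(x_i/beta)
def grad(beta, alpha):
    s = sum((xi / beta) ** alpha for xi in x)
    s_log = sum((xi / beta) ** alpha * math.log(xi / beta) for xi in x)
    d_beta = n * alpha / beta - (alpha / beta) * s
    d_alpha = (-n / alpha + n * math.log(beta)
               - sum(math.log(xi) for xi in x) + s_log)
    return d_beta, d_alpha

beta_hist, alpha_hist = [], []
for _ in range(2500):
    d_beta, d_alpha = grad(beta, alpha)
    beta -= eta * d_beta
    alpha -= eta * d_alpha
    beta_hist.append(beta)
    alpha_hist.append(alpha)

print(beta, alpha)   # final estimates; plot beta_hist/alpha_hist vs. iterate
```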
(e) Use the exact same setup as part (d), but this time in your gradient descent scheme use a momentum
term of γ = 0.9. Report the final estimates of β and α. Also provide a plot showing the iterative values of
β for the iterates from 1 to 2500. Provide a similar plot showing the iterative values of α for the iterates
from 1 to 2500. Comparing these plots with those in part (d), which scheme seems to have converged
faster? Attach your code.
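For part (e), only the update rule changes; one common form of the momentum update is sketched below (γ is the momentum coefficient, and the example gradient value is a hypothetical placeholder):

```python
# Momentum GD sketch for part (e): instead of theta -= eta * g, keep a
# velocity v per parameter and update
#   v     <- gamma * v - eta * g
#   theta <- theta + v
gamma, eta = 0.9, 0.001

def momentum_step(theta, v, g):
    """One momentum update for a scalar parameter theta with gradient g."""
    v = gamma * v - eta * g
    return theta + v, v

# Usage: one velocity per parameter, initialized to zero.
beta, v_beta = 2.0, 0.0
beta, v_beta = momentum_step(beta, v_beta, 4.9)  # 4.9 is a placeholder gradient
```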
Q3. As you will see in the next lecture, for classification problems, a very popular function is

$$f(z) = \alpha \log\left(1 + e^{-z}\right) + (1 - \alpha)\log\left(1 + e^{z}\right), \qquad 0 \le \alpha \le 1,$$

where α is a known coefficient between 0 and 1.
(a) Show that $z = \log\left(\frac{\alpha}{1-\alpha}\right)$ is a stationary point (point of zero derivative) for f(z).
(b) Use one of the three methods in the lecture (you choose the one that is easier) to show that f(z)
is convex (hint: e^z is always a positive function).
(c) Now that you know f(z) is convex, is $z = \log\left(\frac{\alpha}{1-\alpha}\right)$ a minimizer or a maximizer? Why?
(d) Plot f(z) for α = 0.3, and the values of z between −3 and 3.
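A quick numerical sanity check for part (a) (a sketch; it verifies the claimed stationary point by central difference for the α = 0.3 case used in part (d)):

```python
import math

alpha = 0.3

def f(z):
    # f(z) = alpha*log(1 + e^{-z}) + (1 - alpha)*log(1 + e^{z})
    return alpha * math.log(1 + math.exp(-z)) + (1 - alpha) * math.log(1 + math.exp(z))

z_star = math.log(alpha / (1 - alpha))   # claimed stationary point
h = 1e-5
deriv = (f(z_star + h) - f(z_star - h)) / (2 * h)   # central difference, ~0

print(f"z* = {z_star:.4f}, f'(z*) ~ {deriv:.2e}")
```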
...