Description
Q1: LDA Parameter Tuning
Define a function tune_lda() as follows:
Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json
Fits LDA models (from the gensim package) using documents from train_file with different parameter values:
Number of clusters (K) from 2 to 6
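Topic distribution prior (i.e. α), as gensim defines the options: 'symmetric' (a fixed prior α = [1/K, 1/K, . . . , 1/K]), 'asymmetric' (a fixed, normalized prior whose k-th entry is proportional to 1/(k + √K), with k counted from 0), and 'auto' (an asymmetric prior learned from the corpus during training)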
Across all parameter combinations (5 values of K × 3 values of α), you will train 15 LDA models in total. When fitting each model, set the maximum number of iterations to 40 to help the model converge. Note that it may take a few minutes to train all models.
For each model, calculate topic coherence using the 'u_mass' formula. The details of coherence can be found at https://radimrehurek.com/gensim/models/coherencemo... Read the paper referenced at that link to make sure you understand the meaning of topic coherence. Note that 'c_v' rather than 'u_mass' is recommended for evaluating topic coherence; for simplicity, use 'u_mass' here. However, if you can figure out how to use 'c_v', that's even better.
Create a plot to show how topic coherence changes as K increases under different α values (i.e., one line for each α value).
Based on the plot, determine the best K and α values in terms of topic coherence.
This function does not return a value. Write a document that reports:
the best parameter combination in terms of topic coherence
whether you think topic coherence is a good metric for choosing K
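Both the plot and the best combination can be derived from the coherence scores once they are collected. The sketch below assumes the scores are held in a dict keyed by (alpha, K); that structure, the function names and the output file name are hypothetical. For 'u_mass', values are typically negative and a higher value (closer to zero) indicates better coherence, so the best combination is taken as the maximum.

```python
def best_combination(scores):
    """Return the (alpha, K) key with the highest coherence score."""
    return max(scores, key=scores.get)


def plot_coherence(scores, out_path="coherence.png"):
    """Draw one coherence-vs-K line per alpha value and save the figure."""
    # Imported here so that best_combination works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for alpha in sorted({a for a, _ in scores}):
        ks = sorted(k for a, k in scores if a == alpha)
        ax.plot(ks, [scores[(alpha, k)] for k in ks],
                marker="o", label=f"alpha={alpha}")
    ax.set_xlabel("Number of topics (K)")
    ax.set_ylabel("u_mass coherence")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```

Picking the maximum is only a starting point for the write-up: whether the highest-coherence K also yields topics that are interpretable is a separate judgment.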