Using Python to do LDA parameter tuning

User Generated

Nxhabybxvln

Programming

Description

Q1: LDA Parameter Tuning

Define a function tune_lda() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json

Fits LDA models (from the gensim package) using documents from train_file with different parameter values:

Number of clusters (K) from 2 to 6

Topic distribution prior (i.e., α): 'symmetric' (a fixed prior of 1/K for every topic, i.e., α = [1/K, 1/K, 1/K, ...]), 'asymmetric' (a fixed, normalized prior proportional to 1/(i + √K) for topic index i), and 'auto' (an asymmetric prior learned from the corpus)

Across all parameter combinations (5 values of K × 3 values of α), you'll train 15 LDA models in total. When fitting each model, set the maximum number of iterations to 40 to make sure your model converges. Note that it may take a few minutes to train all models.
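The fitting step above can be sketched roughly as follows. This is a minimal sketch under several assumptions, not the assignment's reference solution: it assumes text_train.json holds a JSON list of plain-text strings, uses naive whitespace tokenization, and maps "maximum number of iterations" to gensim's `iterations` argument. The helper names `param_grid` and `fit_models` are hypothetical.

```python
import json
from itertools import product

K_VALUES = range(2, 7)                          # K = 2, 3, 4, 5, 6
ALPHAS = ("symmetric", "asymmetric", "auto")    # values accepted by LdaModel's alpha


def param_grid():
    """Return every (K, alpha) combination: 5 x 3 = 15 pairs."""
    return list(product(K_VALUES, ALPHAS))


def fit_models(train_file, random_state=0):
    """Fit one LdaModel per (K, alpha) pair; returns (models, corpus, dictionary)."""
    # Imported lazily so the grid helper can be used without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    with open(train_file) as f:
        docs = json.load(f)  # assumption: a JSON list of document strings

    # Assumption: simple lowercase/whitespace tokenization; a real solution
    # would likely also remove stop words and rare tokens.
    tokens = [doc.lower().split() for doc in docs]
    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(t) for t in tokens]

    models = {}
    for k, alpha in param_grid():
        models[(k, alpha)] = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            alpha=alpha,
            iterations=40,              # assumption: the "maximum iterations" setting
            random_state=random_state,  # fixed seed so runs are comparable
        )
    return models, corpus, dictionary
```

`LdaModel` is used rather than `LdaMulticore` because the multicore variant does not support alpha='auto'.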

For each model, calculate topic coherence using the 'u_mass' formula. The details of coherence can be found at https://radimrehurek.com/gensim/models/coherencemodel.html. Read the paper referenced in the link to make sure you understand the meaning of topic coherence. Note that 'c_v' rather than 'u_mass' is recommended for evaluating topic coherence. For simplicity, let's use 'u_mass' here. However, if you can figure out how to use 'c_v', that's even better.
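To make the 'u_mass' idea concrete, here is a small sketch. The first function is a hand-rolled version of the UMass score for a single topic, summing log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words, where D counts the documents containing the given words. It is meant for intuition only; gensim's own implementation differs in smoothing and aggregation details, so the numbers will not match exactly. The second function shows how the score could be obtained from gensim's `CoherenceModel`. Both function names are hypothetical.

```python
import math


def u_mass_topic(top_words, doc_sets):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, most probable first.
    doc_sets:  one set of tokens per document.
    Assumes every top word occurs in at least one document.
    """
    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / doc_freq(top_words[l]))
    return score


def gensim_u_mass(model, corpus, dictionary):
    """Overall u_mass coherence of a fitted LDA model, computed by gensim."""
    from gensim.models import CoherenceModel  # lazy import

    cm = CoherenceModel(model=model, corpus=corpus,
                        dictionary=dictionary, coherence="u_mass")
    return cm.get_coherence()
```

Scores are at most zero when the top words rarely co-occur, and values closer to zero indicate more coherent topics.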

Create a plot to show how topic coherence changes as K increases under different α values (i.e., one line for each α value).

Based on the plot, determine the best K and α values in terms of topic coherence.
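The plotting and selection steps could look something like the sketch below. It assumes the coherence scores have already been collected into a dictionary keyed by (K, α), and it treats the highest u_mass value (the one closest to zero) as best. The names `best_params`, `plot_coherence`, and `out_path` are hypothetical.

```python
def best_params(scores):
    """Return the (K, alpha) key with the highest coherence score."""
    return max(scores, key=scores.get)


def plot_coherence(scores, out_path=None):
    """Draw one line per alpha value showing coherence as a function of K.

    scores: dict mapping (K, alpha) -> coherence value.
    """
    import matplotlib.pyplot as plt  # lazy import

    fig, ax = plt.subplots()
    for alpha in sorted({a for _, a in scores}):
        ks = sorted(k for k, a in scores if a == alpha)
        ax.plot(ks, [scores[(k, alpha)] for k in ks],
                marker="o", label=f"alpha = {alpha}")
    ax.set_xlabel("Number of topics K")
    ax.set_ylabel("Topic coherence (u_mass)")
    ax.legend()

    if out_path:
        fig.savefig(out_path)  # write to file when a path is given
    else:
        plt.show()             # otherwise display interactively
```

Picking the maximum is only a starting point: coherence curves are often noisy across random seeds, so it is worth checking whether the gap between the best and second-best setting is meaningful before committing to a value of K.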

This function does not return anything. Write a document that covers:

The best parameter combination in terms of topic coherence

Do you think topic coherence is a good metric for choosing K?

Unformatted Attachment Preview

Assignment: Clustering and Topic Modeling

In this assignment, you'll need to use the following dataset:

text_train.json: This file contains a list of documents. It's used for training models.

text_test.json: This file contains a list of documents and the labels of each document. It's used for testing performance. This file is in the format shown below. Note that each document has a list of labels.

Text                                        Labels
faa issues fire warning for lithium ...     [T1, T3]
rescuers pull from flooded coal mine ...    [T1]
...                                         ...

Q1: LDA Parameter Tuning

Define a function tune_lda() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json

Fits LDA models (from the gensim package) using documents from train_file with different parameter values:

Number of clusters (K) from 2 to 6

Topic distribution prior (i.e., α): 'symmetric' (a fixed prior of 1/K for every topic, i.e., α = [1/K, 1/K, 1/K, ...]), 'asymmetric' (a fixed, normalized prior proportional to 1/(i + √K) for topic index i), and 'auto' (an asymmetric prior learned from the corpus)

With all parameter combinations, in total, you'll train 15 LDA models. When fitting each model, set the maximum number of iterations to 40 to make sure your model converges. Note that it may take a few minutes to train all models.

For each model, calculate topic coherence using the 'u_mass' formula. The details of coherence can be found at https://radimrehurek.com/gensim/models/coherencemodel.html. Read the paper referenced in the link to make sure you understand the meaning of topic coherence. Note that 'c_v' rather than 'u_mass' is recommended for evaluating topic coherence. For simplicity, let's use 'u_mass' here. However, if you can figure out how to use 'c_v', that's even better.

Create a plot to show how topic coherence changes as K increases under different α values (i.e., one line for each α value).

Based on the plot, determine the best K and α values in terms of topic coherence.

This function does not return anything. Write a document that covers:

The best parameter combination in terms of topic coherence

Do you think topic coherence is a good metric for choosing K?