Project Description
Automatically Retrieve Valuable Information from Feature Requests
(15 points)
Due Date: 11:59pm on 11/28, 2016
1. Introduction
In open-source software repositories (e.g., SourceForge), forums are provided for users to propose
requests for features that they would like to see developed in the next release of the system. However, the
descriptions of user-submitted requests are not all relevant to the features that concern the
development team. Moreover, after each new release of a system, a large number of new feature
requests are proposed. In order to relieve project members of the pressure of manually reviewing a
large number of user requests and filtering out the irrelevant information, we propose an approach to
automatically detecting valuable information in each user feature request. After the user requests are
pre-processed, a large amount of irrelevant information is removed, which significantly reduces
the manual effort of project managers in reviewing and making decisions on these shortened feature
requests containing only the relevant and useful information.
The objective of this project is to find the patterns of the most valuable content in user-submitted
feature requests. We provide you with a set of users' feature requests from open-source repositories.
We also provide you with a list of questions indicating the developers' interests in those feature requests.
First, please manually locate and retrieve the answer to each question from each feature request. The
answers you identify will become the valuable information extracted from the feature request. Next,
you need to discover and derive patterns from the answers that you previously identified for each
question. Finally, you should be able to automatically retrieve the answers to each question from
newly proposed feature requests with your derived patterns.
2. Feature Requests and Questions
A feature request contains two fields: an ID “R{number}” and a textual “description”. Here are five
examples of the feature requests:
R1. I have been looking for an app that does not store passwords and just generates them using a
hash of a master password phrase and a sort of the hostname of the site.
R2. I think Linux port is a good idea. I use both Windows and Linux but cannot use KeePass on my
Linux OS. Please port KeePass to Linux.
R3. Users should only have one instance of KeePass running at anytime, or it will cost more system
resources. If a user started the program while the program was already running, the program should
bring up the already running instance instead of starting a new one.
R4. At the moment, anyone can change the master key of the database, even if he does not know the
original one. It’s a big threaten to password safety.
R5. It would be convenient for users to show numbers of entries near the group.
We also provide you with the following list of questions, which ask about the relevant information
that may interest the developers. Note that the given list of questions may not be complete.
Q1. Which word(s) or phrase(s) implies that the request proposes a new feature?
Q2. What is the subject of the feature request?
Q3. What is the object of the feature request?
Q4. Which verb or verb phrase is used to describe the feature request?
Q5. Which word or phrase implies the purpose of the feature request?
Q6. What benefit will the requested feature bring to users?
Q7. Which word or phrase tells something that the current system cannot fulfil?
Q8. Which word or phrase mentions or implies an existing feature?
Q9. When will the requested feature be needed?
Q10. Where will the requested feature be applied?
3. Project Tasks
Implementation Language and Platform Requirements:
Developing Language: Java
JDK version: 1.8 or higher
IDE: Eclipse
The project consists of 3 tasks to be elaborated in the following subsections.
Task 1: Label Answers to Questions
3.1. Choose Questions and Prepare Feature Requests
First, please choose two questions from the 10 questions listed in Section 2. Then prepare a set of
feature requests for each of your two selected questions; each set must contain at least 60 feature
requests, and each selected feature request must contain the answer to its corresponding question.
Feature requests may be duplicated between the two sets. Feature requests of various software projects can
be accessed at https://sourceforge.net/. For example, https://sourceforge.net/projects/pnotes/ is the
homepage of the project "PNotes", and you may access the page of its feature requests by clicking the
"Tickets" menu (see the screenshot below).
For each question, you may retrieve the 60 feature requests from one single project or from multiple
projects on SourceForge. Then, for each question, split the set of feature requests into two subsets:
one containing 50 requests and the other containing 10. The subset with 50 requests will be used as
the Training Set and the subset with the remaining 10 requests will be used as the Testing Set. The
Training Set will be used to derive the patterns of answers for automatically retrieving answers to
each question later. The Testing Set will be used to evaluate the performance of the patterns you
derive.
3.2. Structured Representation of Answers
In order to derive the patterns of answers to the questions, you first need to convert the extracted
answers from natural language into a structured representation. The extracted answer to each question
from one request must consist of consecutive words/phrases. Eventually, the patterns of answers will be
defined with this structured representation. The ultimate goal of this project is to automatically
retrieve the answers to the questions from the feature requests in the Testing Set by leveraging the
structured representation. Below is an example structured representation of extracted answers with 7
attributes:
S1. Index of the sentence containing the answer
S2. Index of the first word of the answer in the sentence
S3. The number of words in the answer
S4. POS tag of the first word in the answer
S5. POS tag of the last word in the answer
S6. POS tag of the word immediately before the answer in the request
S7. POS tag of the word immediately following the answer in the request
What is POS tagging and how can we tag an English sentence? Part-of-Speech (POS) tagging identifies
the grammatical role of each word in natural language text. An automated POS tagging tool, CLAWS, can be
accessed at http://ucrel.lancs.ac.uk/claws/trial.html. You can also automatically label the POS tag of each
word in a sentence by writing your own code. If you write your own code, please import the Java
package developed by the Stanford NLP group, which can be downloaded from
http://nlp.stanford.edu/software/tagger.html. You should first add this package to your Build Path.
Then your code can call the API in this package to label POS tags for a feature request as follows:
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
// Only "tokenize, ssplit, pos" are strictly needed for POS tagging
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String featureRequest = ""; // Your feature request
Annotation document = new Annotation(featureRequest);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    List<CoreLabel> tokens = sentence.get(TokensAnnotation.class);
    for (CoreLabel token : tokens) {
        String pos = token.get(PartOfSpeechAnnotation.class); // Get the POS tag of each word
    }
}
S1-S7 are the required attributes that structurally represent the answers. You are also encouraged to
add additional attributes to your representation if they can improve the accuracy of your results in
Task 3. If the additional attributes you add to the provided structured representation template are
tested and proved to be effective in Task 3, your project grade will be improved. Keep in mind that all
the attributes (including any you added) in the structured representation should be computable for
each word in the feature request. S1-S3 are required to be computed automatically. For S4-S7, you are
highly encouraged to implement the automatic computation as well.
3.3. Label Answers to Questions with the Given Template
For each feature request in the Training Sets, you should translate your answers to each question into
the structured representation based on the following template.
Request  Question  Answer  Structured Representation of Answer
ID       ID        Text    S1   S2   S3   S4   S5   S6   S7
The following example demonstrates how to translate the answers (manually retrieved from feature
request R5) to the questions Q4, Q6 and Q10 into the structured representation with 7 attributes
S1-S7.
R5 It would be convenient for users to show numbers of entries near the group.
(1) Locate the answer to each question from the request: {Q4, show numbers of entries}, {Q6,
convenient}, {Q10, near the group}
(2) Calculate the POS tags of each word in R5:
It/PRP would/MD be/VB convenient/JJ for/IN users/NNS to/TO show/VB numbers/NNS of/IN entries/NNS
near/IN the/DT group/NN ./.
(3) Fill in the template of structured representation below.
Table 1. The template of structured representation of answers
Request  Question  Answer                   Structured Representation of Answer
ID       ID        Text                     S1  S2  S3  S4  S5   S6   S7
R5       Q4        show numbers of entries  1   8   4   VB  NNS  TO   IN
R5       Q6        convenient               1   4   1   JJ  JJ   VB   IN
R5       Q10       near the group           1   12  3   IN  NN   NNS  .
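The attribute values in these rows can be computed programmatically. Below is a minimal sketch that computes S1-S7 for the Q4 answer of R5; the POS-tagged tokens and the answer span are hardcoded here for illustration, whereas in your implementation they would come from the tagger and your labeling:

```java
public class AttributeDemo {
    public static void main(String[] args) {
        // Tokens and POS tags of R5 (sentence index 1), as produced by the tagger
        String[] words = {"It", "would", "be", "convenient", "for", "users", "to", "show",
                          "numbers", "of", "entries", "near", "the", "group", "."};
        String[] tags  = {"PRP", "MD", "VB", "JJ", "IN", "NNS", "TO", "VB",
                          "NNS", "IN", "NNS", "IN", "DT", "NN", "."};
        // Answer to Q4: "show numbers of entries" -> words[7..10] (0-based)
        int first = 7, last = 10;
        int s1 = 1;                  // index of the sentence containing the answer
        int s2 = first + 1;          // 1-based index of the first answer word
        int s3 = last - first + 1;   // number of words in the answer
        String s4 = tags[first];     // POS tag of the first answer word
        String s5 = tags[last];      // POS tag of the last answer word
        String s6 = tags[first - 1]; // POS tag of the word immediately before the answer
        String s7 = tags[last + 1];  // POS tag of the word immediately following the answer
        System.out.println(s1 + "\t" + s2 + "\t" + s3 + "\t" + s4 + "\t"
                           + s5 + "\t" + s6 + "\t" + s7);
    }
}
```

Running this prints the Q4 row of Table 1: `1 8 4 VB NNS TO IN`.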
For each feature request in the Testing Sets, you only need to manually locate and retrieve the answer
to each question that you chose. That is, only columns 1-3 will be filled in for each feature request in
the Testing Sets.
3.4 Task 1 Deliverables
1. Java classes and compiled executables for calculating all the attributes in the structured
representations of the answers (manually retrieved from each feature request) to the two
questions that you choose.
2. questions.txt containing the two questions you choose. Use "Q1" and "Q2" as the question IDs.
3. frtrsf.txt containing 50 feature requests in the Training Set for your first question, and
lrtrsf.txt containing your answer labeling results in the structured representation template.
4. frtrss.txt containing 50 feature requests in the Training Set for your second question, and
lrtrss.txt containing your answer labeling results in the structured representation template.
5. frtesf.txt containing 10 feature requests in the Testing Set of your first question, and lrtesf.txt
containing your answer labeling results in the structured representation template (columns 1-3
only).
6. frtess.txt containing 10 feature requests in the Testing Set of your second question, and lrtess.txt
containing your answer labeling results in the structured representation template (columns 1-3
only).
Each line in a .txt file containing feature requests consists of the request ID and request description.
Request ID and request description are separated by a “TAB”. Here is an example:
R1 I have been looking for …
R2 I think Linux port is a good …
R3 Users should only have one instance…
R4 At the moment, anyone can change …
R5 It would be convenient for users to show numbers …
……
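Reading such a file reduces to splitting each line at the first TAB. A minimal sketch (the class name `ParseDemo` is just for illustration):

```java
public class ParseDemo {
    public static void main(String[] args) {
        // One line of a feature-request file: request ID and description separated by a TAB
        String line = "R5\tIt would be convenient for users to show numbers of entries near the group.";
        String[] parts = line.split("\t", 2); // limit 2: any further TABs stay in the description
        String id = parts[0];
        String description = parts[1];
        System.out.println(id + " -> " + description);
    }
}
```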
The .txt files with labeled requests in the Training Sets should have the same columns as shown in
Table 1 and the columns are separated by a ‘TAB’. Here is an example of the structured
representation for an answer to the question Q4. The .txt files with labeled requests in the Testing
Sets should only contain the first three columns in the following example.
R5	Q4	show numbers of entries	1	8	4	VB	NNS	TO	IN
……
All the files in the Task 1 deliverables should be packaged in one folder named "Task 1-{Your First
Name}{Your Last Name}". Please submit one zip file containing the entire folder.
Task 2: Compute Pattern of Answers to Each Question
3.5 Input and Output
The input of this task is all the labeled answers (structured representations) to the question for each
Training Set. The output should be a generalized structured representation (i.e., a pattern) of the
answers to each question. The pattern of answers consists of all the attributes in the structured
representation. You may take the following approach to derive a pattern from the structured
representations of the answers. For the numerical attributes S1, S2 and S3, choose ranges of values for
each attribute and calculate the percentage of answers in the Training Set that fall into each range.
For each attribute, you may try different ranges of values and sort them by their percentage of
occurrence in descending order. For the categorical attributes S4, S5, S6 and S7, list all the observed
POS tags for each attribute in the Training Set and sort them by their percentage of occurrence in
descending order. Note that the percentages for each attribute should sum to 100%.
Below is an example pattern of S1, S2, S4 and S6 of answers to the question Q4.
Table 2. An example pattern of answers to Q4
Representation Attribute  Pattern (value, percentage)
S1                        1-2, 70%   3-4, 30%
S2                        3-4, 60%   5-6, 30%   1-2, 10%
S4                        VB, 60%    NNS, 30%   TO, 10%
S6                        TO, 80%    IN, 20%
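The categorical part of such a pattern can be computed by counting tag frequencies. A minimal sketch, assuming the labeled S4 values of a 10-answer Training Set have already been collected into a list (the values below are hypothetical):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PatternDemo {
    public static void main(String[] args) {
        // Hypothetical S4 labels collected from a 10-answer Training Set
        List<String> s4 = Arrays.asList("VB", "VB", "VB", "VB", "VB", "VB",
                                        "NNS", "NNS", "NNS", "TO");
        // Count occurrences of each observed POS tag
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tag : s4) counts.merge(tag, 1, Integer::sum);
        // Print tags sorted by percentage of occurrence, descending
        counts.entrySet().stream()
              .sorted((a, b) -> b.getValue() - a.getValue())
              .forEach(e -> System.out.println(
                  e.getKey() + "," + (100 * e.getValue() / s4.size()) + "%"));
    }
}
```

With these labels the output matches the S4 row of Table 2: `VB,60%`, `NNS,30%`, `TO,10%`.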
3.6 Task 2 Deliverables
(1) Patterns of S1-S7 for answers to the first question. Name the file as “Task2-Q1.txt”.
(2) Patterns of S1-S7 for answers to the second question. Name the file as “Task2-Q2.txt”.
You are required to submit your derived patterns in the following format.
S1	{1,2},70%	{3,4},30%
……
S4	VB,60%	NNS,30%	TO,10%
……
(3) Package both files into one folder named "Task 2-{Your First Name}{Your Last Name}".
Please submit one zip file containing the entire folder.
Task 3: Automatically Retrieve Answers with Your Patterns
3.7 Implementation
This task asks you to implement a system to exploit your derived patterns from Task 2 to
automatically retrieve the answers to the two selected questions from the feature requests in your
Testing Sets. Your implementation must meet the requirements below.
(1) Write your code as a single ".java" class and name the Java class "{Your First Name}{Your
Last Name}".
(2) Declare the package of this Java class with "package retrieve.answers.auto;".
(3) Define static function “retrieve” as follows:
public static void retrieve(String testFilePath, String questionID) {
    // For each line in testFile, retrieve the answer to the question whose ID is questionID.
    // Create an "answer{questionID}.txt" file in the same folder as testFile.
    // Write the request ID and the answer of each line in testFile into answer{questionID}.txt, one per line.
}
Tips for retrieving answers automatically:
(1) Locate the first word of the answer. For each word in the given feature request, compute its
probability of being the first word of the answer, denoted by Pf. The first word of the answer can be
determined by S1, S2, S4 and S6. For example, suppose there is a word W in the feature request R, and
we need to calculate the probability of W being the first word of the answer to question Q4, denoted by
Pf(W|R,Q4). Assuming the values of S1, S2, S4 and S6 of W in the Testing Set for Q4 are 1, 5, VB
and TO respectively, then based on the pattern you derived in Task 2 (see Table 2), we can compute
that
Pf(W|R,Q4) = (70% + 30% + 60% + 80%) / 4 = 0.60
The above calculation assumes that all attributes in the structured representation are equally
important in determining the probability of a word W being the first word of the answer. You may
also assign a weight to each attribute to differentiate its importance. For example, if you believe that
S1 is less important than the others in determining the first word of the answer, you could assign
weights as: 0.1 for S1 and 0.3 each for S2, S4 and S6 (the sum of all weights should be equal to
ONE). Then:
Pf(W|R,Q4) = 70% * 0.1 + 30% * 0.3 + 60% * 0.3 + 80% * 0.3 = 0.58
Finally, you can choose the word with the highest Pf as the first word of the answer. Note that once
the computation is formulated, it must be applied consistently to all the feature requests in the Testing
Set for a specific question. That is, you may not change the weights of the attributes applied to the
test instances for a specific question once they are determined.
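The two computations above reduce to a weighted sum over the matched pattern percentages. A minimal sketch, with the percentages matched by W (S1=1, S2=5, S4=VB, S6=TO) hardcoded from the Table 2 example:

```java
import java.util.Locale;

public class PfDemo {
    public static void main(String[] args) {
        // Percentages from Table 2 matched by W's attribute values S1=1, S2=5, S4=VB, S6=TO
        double[] matched = {0.70, 0.30, 0.60, 0.80};

        double[] equal = {0.25, 0.25, 0.25, 0.25}; // all attributes equally important
        System.out.println(String.format(Locale.US, "%.2f", weightedPf(matched, equal)));

        double[] custom = {0.1, 0.3, 0.3, 0.3};    // S1 considered less important
        System.out.println(String.format(Locale.US, "%.2f", weightedPf(matched, custom)));
    }

    // Pf as the weighted sum of the matched pattern percentages
    static double weightedPf(double[] matched, double[] weights) {
        double pf = 0;
        for (int i = 0; i < matched.length; i++) pf += matched[i] * weights[i];
        return pf;
    }
}
```

This prints 0.60 for the equal weights and 0.58 for the custom weights, matching the two worked examples.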
(2) Locate the last word of the answer. For each word occurring after the first word in the same
sentence, compute its probability of being the last word of the answer, denoted by Pl. The last word
can be determined by the attributes S3, S5 and S7. The method of calculating Pl is similar to that of Pf
described in (1).
(3) By locating the first and last words of the answer, you will be able to automatically retrieve the
entire answer from the feature request.
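Steps (1)-(3) combine as follows. A minimal sketch with hypothetical Pf and Pl scores for the words of a one-sentence request (a real implementation computes these scores from the derived patterns):

```java
public class RetrieveDemo {
    public static void main(String[] args) {
        String[] words = {"It", "would", "be", "convenient", "for", "users", "to", "show",
                          "numbers", "of", "entries", "near", "the", "group"};
        // Hypothetical per-word probabilities of being the first / last word of the answer
        double[] pf = {0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.2, 0.6, 0.3, 0.1, 0.2, 0.3, 0.1, 0.1};
        double[] pl = {0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3, 0.2, 0.6, 0.3, 0.1, 0.2};

        // Step (1): the word with the highest Pf is the first word of the answer
        int first = 0;
        for (int i = 1; i < pf.length; i++) if (pf[i] > pf[first]) first = i;

        // Step (2): among words at or after the first word, pick the highest Pl as the last word
        int last = first;
        for (int i = first + 1; i < pl.length; i++) if (pl[i] > pl[last]) last = i;

        // Step (3): the answer is the span from the first to the last word
        StringBuilder answer = new StringBuilder();
        for (int i = first; i <= last; i++) {
            if (i > first) answer.append(' ');
            answer.append(words[i]);
        }
        System.out.println(answer);
    }
}
```

With these scores the retrieved span is "show numbers of entries".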
3.8 Task 3 Deliverables
(1) The "{Your First Name}{Your Last Name}.java" file.
(2) The "answer{questionID}.txt" files.
(3) Package both files into one folder named "Task 3-{Your First Name}{Your Last Name}".
Please submit one zip file containing the entire folder.