TABLE OF CONTENTS
1.0 INTRODUCTION
  1.1 An Overview of Our Thesis Work
    1.1.1 Context-aware Argument Mining Models
    1.1.2 Intrinsic Evaluation: Cross-validation
    1.1.3 Extrinsic Evaluation: Automated Essay Scoring
  1.2 Thesis Statements
  1.3 Proposal Outline
2.0 BACKGROUND
  2.1 Argumentation Theories
  2.2 Argument Mining in Different Domains
  2.3 Argument Mining Tasks and Features
    2.3.1 Argument Component Identification
    2.3.2 Argumentative Relation Classification
    2.3.3 Argumentation Structure Identification
3.0 EXTRACTING ARGUMENT AND DOMAIN WORDS FOR IDENTIFYING ARGUMENT COMPONENTS IN TEXTS – COMPLETED WORK
  3.1 Introduction
  3.2 Persuasive Essay Corpus
  3.3 Argument and Domain Word Extraction
  3.4 Prediction Models
    3.4.1 Stab & Gurevych 2014
    3.4.2 Nguyen & Litman 2015
  3.5 Experimental Results
    3.5.1 Proposed vs. Baseline Models
    3.5.2 Alternative Argument Word List
  3.6 Conclusions
4.0 IMPROVING ARGUMENT MINING IN STUDENT ESSAYS USING ARGUMENT INDICATORS AND ESSAY TOPICS – COMPLETED WORK
  4.1 Introduction
  4.2 Academic Essay Corpus
  4.3 Prediction Models
    4.3.1 Stab14
    4.3.2 Nguyen15v2
    4.3.3 wLDA+4
    4.3.4 wLDA+4 ablated models
  4.4 Experimental Results
    4.4.1 10-fold Cross Validation
    4.4.2 Cross-topic Validation
    4.4.3 Performance on Held-out Test Sets
  4.5 Conclusions
5.0 EXTRACTING CONTEXTUAL INFORMATION FOR IMPROVING ARGUMENTATIVE RELATION CLASSIFICATION – PROPOSED WORK
  5.1 Introduction
  5.2 Data
  5.3 Two Problem Formulations and Baseline Models
    5.3.1 Relation with Argument Topic
    5.3.2 Pair of Argument Components
    5.3.3 Baseline Models
    5.3.4 Evaluations
  5.4 Software Support
  5.5 Pilot Study
  5.6 Summary
6.0 IDENTIFYING ARGUMENT COMPONENT AND ARGUMENTATIVE RELATION FOR AUTOMATED ARGUMENTATIVE ESSAY SCORING – PROPOSED WORK
  6.1 Introduction
  6.2 Argument Strength Corpus
  6.3 Argument Mining Features for Automated Argument Strength Scoring
    6.3.1 First experiment: impact of performance of argument component identification
    6.3.2 Second experiment: impact of performance of argumentative relation identification
    6.3.3 Third experiment: only argument mining features
  6.4 Argument Mining Features for Predicting Peer Ratings of Academic Essays
  6.5 Summary
7.0 SUMMARY
8.0 TIMELINE OF PROPOSED WORK
APPENDIX A. LISTS OF ARGUMENT WORDS
APPENDIX B. PEER RATING RUBRICS FOR ACADEMIC ESSAYS
BIBLIOGRAPHY
1.0 INTRODUCTION
Argumentation can be defined as a social, intellectual, verbal activity that serves to justify or refute an opinion, consisting of statements directed towards obtaining the approbation of an audience. Originally rooted in Logic, Philosophy, and Law, computational argumentation has become an increasingly central area of study within Artificial Intelligence (AI); it aims at representing argument components and the interactions between components, evaluating arguments, and distinguishing legitimate from invalid arguments [Bench-Capon and Dunne, 2007].
With the rapid growth of textual data and tremendous advances in text mining, argument (argumentation) mining in text1 has emerged as a research field that bridges formal argumentation theories and everyday argumentative reasoning. Aiming at automatically identifying argument components (e.g., premises, claims, conclusions) in natural language text, and the argumentative relations (e.g., support, attack) between components, argument mining promises novel opportunities for opinion mining and automated essay evaluation, as well as substantial improvements to current legal information systems and policy modeling platforms. Argument mining has been studied in a variety of text genres such as legal documents [Moens et al., 2007, Mochales and Moens, 2008, Palau and Moens, 2009], scientific papers [Teufel and Moens, 2002, Teufel et al., 2009, Liakata et al., 2012], news articles [Palau and Moens, 2009, Goudas et al., 2014, Sardianos et al., 2015], user-generated online comments [Cabrio and Villata, 2012, Boltužić and Šnajder, 2014], and student essays [Burstein et al., 2003, Stab and Gurevych, 2014b, Rahimi et al., 2014, Ong et al., 2014]. Problem formulations of argument mining have ranged from the separation of argumentative from non-argumentative text, and the classification of argument components and argumentative relations, to the identification of argumentation structures/schemes.
1 Argument mining for short.
Essay 75:
(0) Do arts and music improve the quality of life?
(1) My view is that the [government should give priorities to invest more money on the basic social welfares such as education and housing instead of subsidizing arts relative programs]MajorClaim.
(2) [Art is not the key determination of quality of life, but education is]Claim. (3) [In order to make people better off, it is more urgent for governments to commit money to some fundamental help such as setting more scholarships in education section for all citizens]Premise. (4) This is simply because [knowledge and wisdom is the guarantee of the enhancement of the quality of people's lives for a well-rounded social system]Premise.
(5) Admittedly, [art, to some extent, serve a valuable function about enriching one's daily lives]Claim, for example, [it could bring release one's heavy burden of study pressure and refresh human bodies through a hard day from work]Premise. (6) However, [it is unrealistic to pursuit of this high standard of life in many developing countries, in which the basic housing supply has still been a huge problem with plenty of lower income family have squeezed in a small tight room]Premise. (7) By comparison to these issues, [the pursuit of art seems unimportant at all]Premise.
(8) To conclude, [art could play an active role in improving the quality of people's lives]Premise, but I think that [governments should attach heavier weight to other social issues such as education and housing needs]Claim because [those are the most essential ways enable to make people a decent life]Premise.

Figure 1: A sample student essay taken from the corpus in [Stab and Gurevych, 2014a]. The essay has sentences numbered and argument components enclosed in tags for easy look-up.
To illustrate different tasks in argument mining, let us consider the sample student essay in Figure 1. The first sentence in the example is the writing prompt. The MajorClaim, which states the author's stance towards the writing topic, is placed at the beginning of the essay's body, i.e., sentence 1. The student author used different Claims (controversial statements) to validate/support and attack the major claim, e.g., the claims in sentences {2, 5, 8}. The validity of the claims is underpinned/rebutted by Premises (reasons provided by the author), e.g., the premises in sentences {5, 6, 7}. As the first task in argument mining, Argument Component Identification aims at recognizing argumentative portions in the text (Argumentative Discourse Units – ADUs [Peldszus and Stede, 2013]), e.g., a subordinate clause in sentence 1, or the whole sentence 2, and classifying those ADUs according to their argumentative roles, e.g., MajorClaim, Claim, and Premise.
[Figure 2 here: a graph whose nodes are MajorClaim(1), Claim(2), Claim(5), Premise(5), Premise(6), and Premise(7), connected by Support and Attack edges.]

Figure 2: Graphical representation of a part of the argumentation structure in the example essay. Argumentative relations are illustrated based on the annotation by [Stab and Gurevych, 2014a].
The two sub-tasks are often combined into a multi-way classification problem by introducing the None class. Thus, the possible class labels for a candidate ADU are {MajorClaim, Claim, Premise, None}. However, determining the boundaries of candidate ADUs to prepare input for argument mining models is a nontrivial preprocessing task. To simplify the main argument mining task, sentences are usually taken as primary units [Moens et al., 2007], or the gold-standard boundaries are assumed to be available [Stab and Gurevych, 2014b].
The second task, Argumentative Relation Classification [Stab and Gurevych, 2014b], considers possible pairs of argument components in a definite scope, e.g., a paragraph,2 or pairs of an argument component and the argument topic. For each pair, the task determines whether one component supports or attacks the other. As shown in the example essay, the Claim in sentence 2 supports the MajorClaim in sentence 1: Support(Claim(2), MajorClaim(1)). We also have Attack(Claim(5), MajorClaim(1)) and Support(Premise(5), Claim(5)). Given such direct relations, one can infer Attack(Premise(5), MajorClaim(1)), and so on.
2 The definite scope is necessary to make the distribution less skewed. In fact, the number of pairs that hold an argumentative relation is far smaller than the total number of possible pairs.
While argumentative relation classification does not differentiate between direct and inferred relations, Argumentation Structure Identification [Mochales and Moens, 2011] aims at constructing the graphical representation of argumentation in which edges are direct attachments between argument components. Attachment is an abstraction of support/attack relations, and is illustrated as arrowhead connectors in Figure 2. Attachment between argument components does not necessarily correspond to the components' relative positions in the text. For example, Premise(6) is placed between Claim(5) and Premise(7) in the essay, but Premise(7) is the direct premise of Claim(5), as shown in the figure.
1.1 AN OVERVIEW OF OUR THESIS WORK
In education, teaching argumentation and argumentative writing to students is in particular need of attention [Newell et al., 2011, Barstow et al., 2015]. Automated essay scoring (AES) systems have proven effective in reducing teachers' workload and facilitating writing practice, especially at large scale [Shermis and Burstein, 2013]. AES research has recently shown interest in the automated assessment of different aspects of written arguments, e.g., evidence [Rahimi et al., 2014], and thesis and argument strength [Persing and Ng, 2013, Persing and Ng, 2015]. However, the application of argument mining to automatically scoring argumentative essays has received limited study [Ong et al., 2014, Song et al., 2014]. Motivated by the promising applications of argument mining as well as the need for automated support for argumentative writing in school, our research aims at building models that automatically mine arguments in natural language text, and at applying the argument mining outcomes to automatically score argumentative essays. In particular, we propose context-aware argument mining models to improve state-of-the-art argument component identification and argumentative relation classification. In order to make the proposed approaches more applicable to the educational context, our research conducts both intrinsic and extrinsic evaluation when comparing our proposed models to the prior work. Regarding intrinsic evaluation, we perform both random-fold cross-validation and cross-topic validation to assess the robustness of the models. For extrinsic evaluation, our research investigates the use of argument mining for automated essay scoring. Overall, our research on argument mining can be divided into three components with respect to their functional aspects.
1.1.1 Context-aware Argument Mining Models
The main focus of our research is building models for argument component identification and argumentative relation classification. As illustrated in [Stab and Gurevych, 2014a], context3 is crucial for identifying argument components and argumentation structures. However, context dependence has not been addressed adequately in prior work [Stab et al., 2014]. Most argument mining studies built prediction models that process each textual input4 in isolation from the surrounding text. To enrich the feature space of such models, history features such as the argumentative roles of one or more preceding components, and features extracted separately from preceding and/or following text spans, have usually been used [Teufel and Moens, 2002, Hirohata et al., 2008, Palau and Moens, 2009, Guo et al., 2010, Stab and Gurevych, 2014b]. However, the idea of using the surrounding text as a context-rich representation of the prediction input for feature extraction has been studied in only a few works [Biran and Rambow, 2011].
In many writing genres, e.g., debates, student essays, and scientific articles, the availability of writing topics provides valuable information that helps identify argumentative text as well as classify its argumentative roles [Teufel and Moens, 2002, Levy et al., 2014]. In particular, [Levy et al., 2014] defined the term Context Dependent Claim to emphasize the role of the discussion topic in distinguishing claims relevant to the topic from irrelevant statements. The idea of using topic and discourse information to help resolve ambiguities is commonly used in word sense disambiguation and sentiment analysis [Navigli, 2009, Liu, 2012]. Based on these observations, we hypothesize that argument component identification and argumentative relation classification can be improved with respect to prediction performance by considering contextual information at both local and global levels when developing prediction features.
3 The thesis differentiates between global context and local context. While global context refers to the main topic/thesis of the document, the local context is instantiated by the actual text segment covering the textual unit of interest, e.g., preceding and following sentences.
4 E.g., a candidate ADU in argument component identification, or a pair of argument components in argumentative relation classification.
Definition 1. The context segment of a textual unit is a text segment formed by neighboring sentences and the unit itself. The neighboring sentences are called context sentences, and must be in the same paragraph as the textual unit.
Instead of building prediction models that process each textual input in isolation, our context-aware approach considers the input within its context segment5 to enable advanced contextual features for argumentative relation classification. In particular, our approach aims at extracting discourse relations within the context segment to better characterize the rhetorical function of the unit in the entire text. In addition, the context segments, instead of their units, will be fed to textual entailment and semantic similarity scoring functions to extract semantic relation features. We expect that a set of scores over the possible pairs extracted from the two segments better represents the semantic relations of the two input units than their single score. As defining the context and identifying the boundaries of context segments are not a focus of our research, we propose to use different heuristics, e.g., window-size and topic segmentation, to approximate the context segment given a textual unit, and to evaluate the contribution of such techniques to the final argument mining performance.
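As a concrete illustration of this idea, the following Python sketch approximates a context segment with a fixed window of neighboring sentences and aggregates pairwise similarity scores over two segments. The window size, the similarity function, and all helper names are illustrative assumptions rather than the exact implementation proposed in this thesis.

    # Sketch: approximate a context segment with a +/- w sentence window inside the
    # same paragraph, then score two units by aggregating over sentence pairs.
    # The similarity function here (difflib ratio) is only a stand-in for the
    # textual entailment / semantic similarity scorers discussed in the text.
    from difflib import SequenceMatcher

    def context_segment(paragraph, unit_index, window=2):
        """Return the unit's sentence plus up to `window` neighbors on each side."""
        start = max(0, unit_index - window)
        end = min(len(paragraph), unit_index + window + 1)
        return paragraph[start:end]

    def sentence_similarity(a, b):
        # Stand-in scorer; a real system would use entailment or embedding similarity.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def segment_pair_scores(source_segment, target_segment):
        """Score all sentence pairs across the two context segments."""
        return [sentence_similarity(s, t) for s in source_segment for t in target_segment]

    # Example: features for a (source, target) component pair.
    paragraph = ["Art enriches daily life.",
                 "However, housing is a more urgent problem in developing countries.",
                 "By comparison, the pursuit of art seems unimportant."]
    src_seg = context_segment(paragraph, 1)
    tgt_seg = context_segment(paragraph, 2)
    scores = segment_pair_scores(src_seg, tgt_seg)
    features = {"max_sim": max(scores), "mean_sim": sum(scores) / len(scores)}
    print(features)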
Definition 2. Argument words are words that signal the argumentative content and are commonly used across different argument topics, e.g., 'believe', 'opinion'. In contrast, domain words are specific terminologies commonly used within the topic, e.g., 'art', 'education'. Domain words are a subset of the content words that form the argumentative content.
As a use of global context, we propose an approach that uses writing topics to guide a semi-supervised process for separating argument words from domain words.6 The extracted vocabularies of argument words and domain words are then used to derive novel features and constraints for an argument component identification model.
5 The term "context sentences" was used in [Qazvinian and Radev, 2010] to refer to sentences surrounding a citation that contain information about the cited source but do not explicitly cite it. In this thesis, we place no constraints on context sentences other than requiring them to be adjacent to the textual unit.
6 Our definition of argument and domain words shares similarities with the idea of shell language and content in [Madnani et al., 2012], in that we aim to model the lexical signals of argumentative content. However, while Madnani et al. emphasized the boundaries between argument shell and content, we do not require such a physical separation between the two aspects of an argument component.
1.1.2 Intrinsic Evaluation: Cross-validation
In educational settings, students can have writing assignments on a wide range of topics. Therefore, an argument mining model with practical application to student essays should yield good performance on new essays whose topic domains differ from those of the training essays. As a consequence, features which are less topic-specific will be more predictive when evaluated cross-topic. Given this inherent requirement of the argument mining tasks for student essays, our research emphasizes the evaluation of the robustness of argument mining models. In addition to random-fold cross-validation (i.e., training and testing data are randomly split from the corpus), we also conduct cross-topic validation (i.e., training and testing data come from essays of different writing topics [Burstein et al., 2003]) when comparing the proposed approaches with prior studies.
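The following Python sketch contrasts the two evaluation setups: random-fold cross-validation, where essays are split irrespective of topic, and cross-topic validation, where all essays of a writing topic fall entirely into either training or testing. The scikit-learn utilities and the toy data are assumptions for illustration only.

    # Sketch: random-fold vs. cross-topic validation over essays.
    # Essays sharing a writing topic never straddle train/test in the cross-topic setup.
    from sklearn.model_selection import KFold, GroupKFold

    essays = [f"essay_{i}" for i in range(12)]   # placeholder essay ids
    topics = [i % 4 for i in range(12)]          # 4 hypothetical writing topics

    # Random-fold cross-validation: essays are split irrespective of topic.
    for train, test in KFold(n_splits=3, shuffle=True, random_state=0).split(essays):
        pass  # train and test may mix essays from the same topic

    # Cross-topic validation: folds are grouped by topic, so test topics are unseen.
    for train, test in GroupKFold(n_splits=4).split(essays, groups=topics):
        test_topics = {topics[i] for i in test}
        train_topics = {topics[i] for i in train}
        assert test_topics.isdisjoint(train_topics)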
Beyond cross-topic evaluation, our research also uses different corpora to evaluate the effectiveness of the proposed approaches. The first corpus consists of persuasive essays, and its associated coding scheme specifies three different types of argument components: major claim, claim, and premise [Stab and Gurevych, 2014a]. The second corpus consists of academic writings collected from college Psychology classes, and has its sentences classified based on their argumentative roles: hypothesis, support finding, opposition finding, or non-argumentative [Barstow et al., 2015].
1.1.3 Extrinsic Evaluation: Automated Essay Scoring
Aiming at high-performance and robust models of argument mining, the second goal of our research is to seek an application of argument mining in automated argumentative essay evaluation. As proposed in the literature, a direct approach would be to use the prediction outcome (e.g., arguments identified by prediction models) to call students' attention not only to the organization of their writings but also to the plausibility of the arguments provided in the text [Burstein et al., 2004, Falakmasir et al., 2014]. Such feedback information also helps teachers quickly evaluate the writing performance of their students in order to provide better instruction. However, deploying an argument mining model in an existing computer-supported writing service, and evaluating its benefit to student learning, would require a great amount of time and effort. Thus, it is set as the long-term goal of our research. In the course of this thesis, we instead address the question of whether the outcome of automated argument mining can predict essay scores.
For this goal, our research uses two corpora to conduct automated essay scoring experiments. The first corpus consists of the academic essays that were used for our argument mining experiments. Each essay in the corpus was reviewed by student peers, and was given both textual comments and numerical ratings by its peer reviewers. Therefore, our research makes use of the peer ratings as the gold standard for the essay scoring experiment. The second corpus is the Argument Strength Corpus, in which argumentative student essays were annotated with argument strength scores [Persing and Ng, 2015]. The argumentative essays of this corpus have certain similarities with the persuasive essays in [Stab and Gurevych, 2014a] which are used for our argument mining study. Moreover, both corpora were originally used for automated essay scoring studies, thus the prior scoring models are natural baselines against which to evaluate our proposed approach. In this research we employ two approaches for applying argument mining to automated essay scoring. The first approach simply uses statistics of the argument components and argumentative relations identified by our argument mining models to train a score prediction model [Ong et al., 2014]. The second approach uses those statistics to augment the scoring model in [Persing and Ng, 2015].
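As a minimal sketch of the first approach, the snippet below turns predicted argument components and relations into simple count features and fits a regressor on human scores. The particular feature set, the toy data, and the Ridge model are illustrative assumptions, not the exact configuration of [Ong et al., 2014] or of our planned experiments.

    # Sketch: turn argument mining output for one essay into scoring features,
    # then fit a simple regressor on (features, human score) pairs.
    from sklearn.linear_model import Ridge

    def argument_features(components, relations, n_paragraphs):
        """components: list of predicted labels; relations: list of 'Support'/'Attack' labels."""
        return [
            components.count("MajorClaim"),
            components.count("Claim"),
            components.count("Premise"),
            relations.count("Support"),
            relations.count("Attack"),
            len(components) / max(n_paragraphs, 1),  # components per paragraph
        ]

    # Toy training data: predicted mining output per essay, plus paragraph counts.
    essays = [
        (["MajorClaim", "Claim", "Premise", "Premise"], ["Support", "Support"], 4),
        (["MajorClaim", "Claim"], ["Support"], 3),
        (["MajorClaim", "Claim", "Claim", "Premise", "Premise", "Premise"],
         ["Support", "Support", "Attack"], 5),
    ]
    X = [argument_features(c, r, p) for c, r, p in essays]
    y = [3.0, 2.0, 3.5]  # hypothetical argument strength scores
    model = Ridge(alpha=1.0).fit(X, y)
    print(model.predict(X))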
1.2 THESIS STATEMENTS
Motivated by the benefit of contextual information from writing topics and context segments in argument mining, we propose context-aware argument mining models that make use of additional context features derived from such contextual information. In this thesis, we aim to prove the following hypotheses about the effectiveness of our proposed context features:
• H1. Our proposed context features help improve argument mining performance.
This hypothesis is divided into two sub-hypotheses:
– H1-1. Adding the context features improves the argument component identification
in student essays in cross-fold and cross-topic validations. This hypothesis is proven
in §3 and §4.
– H1-2. Adding the context features improves the argumentative relation classification
in student essays in cross-fold and cross-topic validations. This hypothesis will be
tested in §5.
• H2. The prediction output of our proposed argument component identification and argumentative relation classification models for student essays improves automated argumentative essay scoring. This hypothesis will be tested in §6.
1.3 PROPOSAL OUTLINE
In the next chapter, we briefly discuss argument mining from its theoretical fundamentals to existing computational studies in different domains. Chapters 3 and 4 present our completed work on argument component identification. In Chapter 3, we present a novel algorithm to extract argument and domain words for use as new features and constraints to improve argument component identification in student essays. Chapter 4 presents an evaluation of our proposed model for automated argument component identification in student essays using cross-topic validation. Chapters 5 and 6 describe our proposed work on argumentative relation classification in student essays and on applying argument mining to automated argumentative essay scoring.
2.0 BACKGROUND

2.1 ARGUMENTATION THEORIES
From their ancient roots in dialectics and philosophy, models of argumentation have spread to core areas of AI including knowledge representation, non-monotonic reasoning, and multi-agent system research [Bench-Capon and Dunne, 2007]. This has given rise to computational argumentation, with two main approaches: abstract argumentation and structured argumentation [Lippi and Torroni, 2015].1 Abstract argumentation considers each argument as a primary element without internal structure, and focuses on the relations between arguments, or sets of them. In contrast, structured argumentation studies the internal structure (i.e., argument components and their interactions) of an argument, described in terms of some knowledge representation formalism. Structured argumentation models are those typically employed in argument mining when the goal is to extract argument components from natural language. In this section, we describe two notable structured argumentation theories: the Macro-structure of Argument by [Freeman, 1991], and the Argumentation Scheme by [Walton et al., 2008]. From the provided description of argumentation theories, we expect to give a concise yet sufficient introduction to related argument mining studies from a theoretical perspective.
Among the vast number of structured argumentation theories that have been proposed [Bentahar et al., 2010, Besnard et al., 2014], the premise-conclusion models of argument structure [Freeman, 1991, Walton et al., 2008] are the most commonly used in argument mining studies.
1 Abstract argumentation, which is also called macro argumentation, considers argumentation as a process. Structured argumentation, on the contrary, considers argumentation as a product and is also called micro argumentation [Mochales and Moens, 2011, Stab et al., 2014].
[Figure 3 here: a diagram in which Premise1 and Premise2 jointly support Conclusion1, which in turn supports Conclusion2.]

Figure 3: A complex macro-structure of argument consisting of a linked structure (i.e., the support of Premise1 and Premise2 for Conclusion1) and a serial structure (i.e., the support of the two premises for Conclusion2).
In fact, the two corpora of argumentative writings that are studied in this thesis have coding schemes derived from the premise-conclusion structure of argument. [Walton et al., 2008] gave a simple and intuitive description of argument which specifies an argument as a set of statements consisting of a conclusion, a set of premises, and an inference from the premises to the conclusion. In the literature, claims are sometimes used in place of conclusions, and premises are referred to as evidence or reasons [Freeley and Steinberg, 2008]. The conclusion is the central component of the argument, and is what "we seek to establish by our argument" [Freeley and Steinberg, 2008]. The conclusion statement should not be accepted without additional reasons provided in premises. The second component of an argument, i.e., the premise, is therefore necessary to underpin the plausibility of the conclusion. Premises are "connected series of sentences, statements or propositions that are intended to give reason" for the conclusion [Freeley and Steinberg, 2008]. In a more general representation, a premise can either support or attack the conclusion (i.e., give a reason or a refutation) [Besnard and Hunter, 2008, Peldszus and Stede, 2013, Besnard et al., 2014]. Based on the premise-conclusion standard, argument mining studies have proposed different argumentative relation schemes to cope with the great diversity of argumentation in natural language text, for instance claim justification [Biran and Rambow, 2011], claim support vs. attack [Stab and Gurevych, 2014b], and verifiability of support [Park and Cardie, 2014].
While premise-conclusion models do not differentiate the functions of different premises,2 they enable the Macro-structure of arguments, which specifies the different ways that premises and conclusions combine to form larger complexes [Freeman, 1991].3
2 Toulmin's argument structure theory [Toulmin, 1958] distinguishes the roles of different types of premises, i.e., data, warrant, and backing, in the argument.
Argument from cause to effect
• Major premise: Generally, if A occurs, then B will (might) occur.
• Minor premise: In this case, A occurs (might occur).
• Conclusion: Therefore, in this case, B will (might) occur.
Critical questions
1. Critique the major premise: How strong is the causal generalization (if it is true at all)?
2. Critique the minor premise: Is the evidence cited (if there is any) strong enough to warrant the generalization as stated?
3. Critique the production: Are there other factors that would or will interfere with or counteract the production of the effect in this case?

Figure 4: Argumentation scheme: Argument from Cause to Effect.
For example, [Freeman, 1991] identified four main macro-structures of arguments: linked, serial, convergent, and divergent, to represent whether different premises contribute together, in sequence, or independently to one or multiple conclusions. An example of a complex macro-structure of argument is shown in Figure 3. Based on Freeman's theory, [Peldszus and Stede, 2013] expanded the macro-structure to cover more complex attack and counter-attack relations. In argument mining, the argumentation structure identification task aims at identifying the macro-structure of arguments in text [Palau and Moens, 2009, Peldszus and Stede, 2015].
Another notable construct of the premise-conclusion abstraction is the Argumentation Scheme Theory [Walton et al., 2008]. The authors used the argumentation scheme notion to identify and evaluate reasoning patterns commonly used in everyday conversational argumentation and in other contexts, notably legal and scientific argumentation. In Argumentation Scheme Theory, arguments are instances of abstract argumentation schemes, each of which specifies premises, assumptions that implicitly hold, and exceptions that may undercut the argument. Each scheme has a set of critical questions that match the scheme and correspond to its premises, assumptions, and exceptions; such a set represents standard ways of critically probing into an argument to find aspects of it that are open to criticism. Figure 4 illustrates the Argument-from-Cause-to-Effect scheme consisting of two premises and a conclusion. As can be seen, argumentation schemes are distinguished by their content templates rather than their premise-conclusion structures. Identifying the argumentation scheme in a written argument has been considered helpful for recovering implicit premises and re-constructing the full argument [Feng and Hirst, 2011]. On the other hand, research has also been conducted to analyze the similarities and differences between argumentation schemes and discourse relations (i.e., Penn Discourse Treebank discourse relations [Prasad et al., 2008]), which is considered fruitful support for automated argument classification and processing [Cabrio et al., 2013].
3 In the Macro-structure of Argument Theory, the term 'argument' is thus not used for premises, but for the complex of one or more premises put forward in favor of the conclusion.
2.2 ARGUMENT MINING IN DIFFERENT DOMAINS
Argument mining is a relatively novel research domain [Mochales and Moens, 2011, Peldszus and Stede, 2013, Lippi and Torroni, 2015], so its problem formulation is not well-defined but rather is considered potentially relevant to any text mining application that targets argumentative text. Moreover, there is no consensus yet on an annotation scheme for argument components, or on the minimal textual units to be annotated. For these reasons, we follow [Peldszus and Stede, 2013] and consider in this study "argument mining as the automatic discovery of an argumentative text portion, and the identification of the relevant components of the argument presented there." We also borrow the term "argumentative discourse unit" [Peldszus and Stede, 2013] to refer to the textual units, e.g., text segments, sentences, or clauses, that are considered as argument components.
In the scientific domain, research has long focused on identifying the rhetorical status (i.e., the contribution to the overall text function of the article) of text segments, i.e., zones, to support summarization and information extraction for scientific publications [Teufel and Moens, 2002]. Different zone mining studies have also been conducted for different scientific domains, e.g., chemistry and biology, and have proposed different zone annotation schemes that target the full text or only the abstract section of the articles [Lin et al., 2006, Hirohata et al., 2008, Teufel et al., 2009, Guo et al., 2010, Liakata et al., 2012]. However, none of the zone mining models described local interactions across segments, and thus the argument structures embedded in the text are ignored. Despite this mismatch between zone mining and argument mining, the two areas solve a similar core problem, text classification, which makes zone mining an inspiration for argument mining models.
Two other domains in which argument mining has been intensively studied are legal documents and user-generated comments. In the legal domain, researchers seek applications of automated recognition of arguments and argumentation structures in legal documents to support visualizing and qualifying arguments. A wide range of argument mining tasks have been studied, including argumentative text identification [Moens et al., 2007], argument component classification (i.e., premise vs. conclusion), and argumentation structure identification [Mochales and Moens, 2008, Palau and Moens, 2009]. While the computational models for such argument mining tasks were evaluated using legal document corpora, those studies all employed the genre-independent premise-conclusion framework to represent the argument structure. Therefore, many prediction features used in argument mining models for legal text, e.g., indicative keywords for argumentation and discourse connectives, are generally applicable to other argumentative text genres, e.g., student essays.
In user-generated comments, argument mining has been studied as a natural extension of opinion mining. While opinion mining answers what people think about, for instance, a product [Somasundaran and Wiebe, 2009], argument mining identifies the reasons that explain the opinion. Among the first studies of arguments in user comments, [Cabrio and Villata, 2012] studied the acceptability of arguments in online debates by first determining whether two user comments support each other or not.4 [Boltužić and Šnajder, 2014] extended the work by mining user comments for more fine-grained relations, i.e., {explicit, implicit} × {support, attack}. [Park and Cardie, 2014] addressed a different aspect of argumentative relations, namely the verifiability of argumentative propositions in user comments. While the task does not determine whether a given proposition supports or opposes the debate topic, it provides a means to analyze the arguments in terms of the adequacy of their support, assuming support/attack propositions are already labeled.
4 In their study, arguments are pro and con user comments on the debate topic and were manually selected.
Argument mining in student essays is rooted in argumentative discourse analysis for automated essay scoring [Burstein et al., 2003]. In argumentative5 writing assignments, students are given a topic and asked to propose a thesis statement and justify support for the thesis. Oppositions are sometimes required to make the thesis risky and nontrivial [Barstow et al., 2015]. Classifying argumentative elements in student essays has been used to support automated essay grading [Ong et al., 2014], peer review assistance [Falakmasir et al., 2014], and writing feedback [Burstein et al., 2004]. [Burstein et al., 2003] built a discourse analyzer for persuasive essays that aimed at identifying different discourse elements (i.e., sentences) such as thesis, supporting idea, and conclusion. Similarly, [Falakmasir et al., 2014] aimed at identifying thesis and conclusion statements in student writings, and used the prediction outcome to scaffold peer reviewers in an online peer review system. [Stab and Gurevych, 2014a] annotated persuasive essays using a domain-independent scheme specifying three types of argument components (major claim, claim, and premise) and two types of argumentative relations (support and attack). [Stab and Gurevych, 2014b] utilized the corpus for automated argument component and argumentative relation identification. [Ong et al., 2014] developed a rule-based system that labels each sentence in student writings from psychology classes with an argumentative role, e.g., hypothesis, support, opposition, and found a strong relation between the presence of argumentative elements and essay scores. [Song et al., 2014] proposed to annotate argument analysis essays to identify responses to critical questions that judge the argument in the writing prompts. The annotations were then used as novel features to improve an existing essay scoring model.
While the studies in [Ong et al., 2014, Song et al., 2014] aimed at predicting the holistic score of the essays, research on automated essay scoring has recently investigated possibilities of grading essays on argument aspects, e.g., evidence [Rahimi et al., 2014], thesis clarity [Persing and Ng, 2013], and argument strength [Persing and Ng, 2015]. While these studies did not actually identify thesis statements or argument components in the essays, they provide strong baseline models as well as annotated data for research on the application of argument mining to essay score prediction.
5 The term "persuasive" was also used as an equivalent [Burstein et al., 2003, Stab and Gurevych, 2014a].
2.3 ARGUMENT MINING TASKS AND FEATURES

2.3.1 Argument Component Identification
To solve argumentative label identification tasks (e.g., argumentative vs. not, premise vs. conclusion, rhetorical status of a sentence), a wide variety of machine learning models have been applied, ranging from classification models, e.g., Naive Bayes, Logistic Regression, and Support Vector Machine (SVM), to sequence labeling models such as Hidden Markov Model (HMM) and Conditional Random Field (CRF). Especially for zone mining in scientific articles, sequence labeling is a more natural approach given the observation that the flow of scientific writing exposes typical moves of rhetorical roles across sentences. Studies have been conducted to explore both HMM and CRF for automatically labeling the rhetorical status of sentences in scientific publications using features derived from language models and relative sentence position [Lin et al., 2006, Hirohata et al., 2008, Liakata et al., 2012].
In the realm of argument mining, argument component identification studies have focused on deriving features that represent the argumentative discourse while relying on traditional classifiers such as SVM and Logistic Regression. Sequence labeling models have rarely been used, mostly due to the loose organization of the natural language texts studied here, e.g., student essays and user comments. Prior studies have often used seed lexicons, e.g., indicative phrases for argumentation [Knott and Dale, 1994] and discourse connectives [Prasad et al., 2008], to represent the organizational shell of argumentative content [Burstein et al., 2003, Palau and Moens, 2009, Stab and Gurevych, 2014b, Peldszus, 2014]. While the use of such lexicons has proven effective, their coverage is far from sufficient given the great diversity of argumentative writing in terms of both topic and style. Given that argumentative discourse consists of one language used to express claims and evidence and another language used to organize them, researchers have explored both supervised and unsupervised approaches to mine the organizational elements of argumentative text. [Madnani et al., 2012] used CRF to train a supervised sequence model using simple features like word frequency, word position, and regular expression patterns. To leverage the availability of large amounts of unprocessed data, [Séaghdha and Teufel, 2014] and [Du et al., 2014] built topic models based on LDA [Blei et al., 2003] to learn two language models: topic language and shell language (rhetorical language, cf. [Séaghdha and Teufel, 2014]). While [Madnani et al., 2012] and [Du et al., 2014] used data annotated for shell boundaries to evaluate how well the proposed models separate shell from content, [Séaghdha and Teufel, 2014] showed that features extracted from the learned language models help improve a supervised zone mining model. In a similar vein, we post-process LDA output to extract argument and domain words which are used to improve argument component identification.
In addition, contextual features have also been applied to represent the context-dependent nature of argument components. The most popular are history features that indicate the argumentative labels of one or more preceding components, and features extracted from preceding and following components [Teufel and Moens, 2002, Palau and Moens, 2009, Liakata et al., 2012, Stab and Gurevych, 2014b]. In many writing genres, e.g., debates, essays, and scientific articles, the availability of argumentative topics provides valuable information that helps identify argumentative portions in text as well as classify their argumentative roles. [Levy et al., 2014] proposed the context-dependent claim detection task in which a claim is determined with respect to a given context, i.e., the input topic. To represent the contextual dependency, the authors made use of the cosine similarity between the candidate sentence and the topic as a feature. For scientific writings, genre-specific contextual features have also been considered, including common words with headlines and section order [Teufel and Moens, 2002, Liakata et al., 2012]. As context features, we use the writing topic to guide the separation of argument words from domain words. We also use common words with surrounding sentences and with the writing topic as features.
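For illustration, the following sketch computes the kind of topic-related context features mentioned above: the cosine similarity between a candidate sentence and the writing topic, and the number of words shared with the topic. The bag-of-words representation and the helper names are simplifying assumptions.

    # Sketch: topic-related context features for a candidate argument component.
    import math
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def topic_features(candidate, topic):
        shared = set(candidate.lower().split()) & set(topic.lower().split())
        return {
            "cos_sim_topic": cosine_similarity(candidate, topic),
            "num_shared_topic_words": len(shared),
        }

    topic = "Do arts and music improve the quality of life?"
    candidate = "Art is not the key determination of quality of life, but education is."
    print(topic_features(candidate, topic))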
2.3.2 Argumentative Relation Classification
The step following argument component identification is determining the argumentative relations, e.g., attack and support, between those components, or between arguments formed by those components. Research has explored different argumentative relation schemes that can be applied to pairs of components, e.g., support vs. not [Biran and Rambow, 2011, Cabrio and Villata, 2012, Stab and Gurevych, 2014b], and implicit and explicit support and attack [Boltužić and Šnajder, 2014]. Because the instances being classified are pairs of textual units, features usually involve information from both elements (i.e., source and target) of the pair (e.g., word pairs, discourse indicators in source and target) and the relative position between them [Stab and Gurevych, 2014b]. Beyond surface-level features, features have also been extracted at the semantic level of the relation, including textual entailment and semantic similarity [Cabrio and Villata, 2012, Boltužić and Šnajder, 2014].
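The snippet below sketches this pair-based feature extraction: word pairs, discourse indicators in source and target, and relative position. The tiny indicator list and the helper names are illustrative assumptions rather than the feature set of any cited study.

    # Sketch: features for a (source, target) argument component pair.
    DISCOURSE_INDICATORS = {"because", "therefore", "however", "thus"}  # tiny illustrative list

    def pair_features(source, target, source_pos, target_pos):
        src_words = source.lower().split()
        tgt_words = target.lower().split()
        word_pairs = {f"{s}_{t}" for s in src_words for t in tgt_words}
        return {
            "num_word_pairs": len(word_pairs),
            "src_has_indicator": bool(DISCOURSE_INDICATORS & set(src_words)),
            "tgt_has_indicator": bool(DISCOURSE_INDICATORS & set(tgt_words)),
            "distance": target_pos - source_pos,   # relative position in the paragraph
            "source_precedes_target": source_pos < target_pos,
        }

    src = "knowledge and wisdom is the guarantee of the quality of people's lives"
    tgt = "it is more urgent for governments to commit money to education"
    print(pair_features(src, tgt, source_pos=4, target_pos=3))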
Unlike argument component identification, where textual units are sentences or clauses, textual units in argumentative relation classification vary from clauses [Stab and Gurevych, 2014b] to multiple sentences [Biran and Rambow, 2011, Cabrio and Villata, 2012, Boltužić and Šnajder, 2014]. However, only a few studies have investigated the use of discourse relations within the text fragment to support argumentative relation prediction. [Biran and Rambow, 2011] proposed that justifications of a claim usually contain discourse structure which characterizes the argumentation provided in the justification in support of the claim. However, their study made use of only discourse indicators, not semantic relations. On the other hand, [Cabrio et al., 2013] studied the similarities and differences between Penn Discourse Treebank [Prasad et al., 2008] discourse relations and argumentation schemes [Walton et al., 2008], and showed that some PDTB discourse relations can be appropriate interpretations of particular argumentation schemes. Inspired by these pioneering studies, our thesis proposes to consider each argumentative unit in relation to its surrounding text to enable advanced features extracted from the discourse context of the unit.
2.3.3 Argumentation Structure Identification
In contrast to the argumentative relation task, the argumentation structure task emphasizes attachment identification, that is, determining whether two argument components directly attach to each other based on their rhetorical functions with respect to the persuasive purpose of the text. Attachment is considered a generic argumentative relationship that abstracts both support and attack, and is restricted to tree structures in that a node attaches to (has an outgoing edge to) only one other node, while it can be attached to (have incoming edges) from one or more other nodes. [Palau and Moens, 2009] viewed legal argumentation as rooted at the final decision, which is attached by conclusions that are further attached by premises. They manually examined a set of legal texts and defined a context-free argumentative grammar to show the possibility of argumentative parsing for case law argumentation. [Peldszus and Stede, 2015] similarly assumed a tree-like representation of argumentation that has the central claim as the root node, to which claims point (i.e., support or attack). Their data-driven approach took a fully-connected graph of all argument components as input and determined the edge weights based on features extracted from each component, such as lemma, part-of-speech, and dependency, as well as the relative distance between the components. The minimum spanning tree of such a weighted graph is returned as the output argumentation structure of the text.
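To make the decoding step concrete, the following sketch scores all pairs of components and keeps a maximum-weight spanning tree as the predicted structure, using the NetworkX library and a placeholder scorer. It is a simplified, undirected illustration of the approach of [Peldszus and Stede, 2015], not a re-implementation.

    # Sketch: spanning-tree decoding over a fully connected component graph,
    # with edge weights standing in for a learned attachment scorer.
    import itertools
    import networkx as nx

    components = ["MajorClaim(1)", "Claim(2)", "Claim(5)", "Premise(5)", "Premise(6)", "Premise(7)"]

    def attachment_score(a, b):
        # Placeholder scorer; a real model would use lexical, syntactic, and
        # positional features of the two components.
        return abs(hash((a, b))) % 100 / 100.0

    graph = nx.Graph()
    for a, b in itertools.combinations(components, 2):
        graph.add_edge(a, b, weight=attachment_score(a, b))

    # Maximum spanning tree = spanning tree with the highest total attachment score.
    structure = nx.maximum_spanning_tree(graph, weight="weight")
    print(sorted(structure.edges()))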
Assuming that premises, conclusions, and their attachments were already identified, [Feng and Hirst, 2011] aimed at determining the argumentation scheme [Walton et al., 2008] of the argument, with the ultimate goal of recovering the implicit premises (enthymemes) of arguments. Besides general features (relative position between conclusion and premises, number of premises), the study included scheme-specific features which are different for each target scheme (in one-vs-others classification) and are based on pre-defined keywords and phrases.
A challenge to our context-aware argument mining model is determining the right context segment for a given argument component. An ideal context segment is the minimal context segment that expresses a complete justification in support of the argument component. Thus, identifying the ideal context segment of an argument component requires identifying the argumentation structure. To make the context-aware argument mining idea more practical and easier to implement, our research does not require that sentences in a context segment be semantically or topically related, although some kind of relatedness among those sentences might be useful for the final argument mining tasks. In the course of this thesis, context segments are determined using simple heuristics such as window-size and topic segmentation output. In the future, the use of argumentation structure identification for determining context segments is worth investigating.
3.0 EXTRACTING ARGUMENT AND DOMAIN WORDS FOR IDENTIFYING ARGUMENT COMPONENTS IN TEXTS – COMPLETED WORK

3.1 INTRODUCTION
Argument component identification studies often use lexical (e.g., n-grams) and syntactic (e.g., grammatical production rules) features with all possible values [Burstein et al., 2003, Stab and Gurevych, 2014b]. However, such large and sparse feature spaces can cause difficulty for feature selection. In our study [Nguyen and Litman, 2015], we propose an innovative algorithm that post-processes the output of an LDA topic model [Blei et al., 2003] to extract argument words (argument indicators, e.g. 'hypothesis', 'reason', 'think') and domain words (specific terms commonly used within the topic's domain, e.g. 'bystander', 'education'), which are used as novel features and constraints to improve the feature space. In particular, we keep only argument words from the unigram features, and remove higher-order n-gram features (e.g., bigrams, trigrams). Instead of production rules, we derive features from dependency parses, which enables us to both retain syntactic structures and incorporate abstracted lexical constraints. Our lexicon extraction algorithm is semi-supervised in that we use manually-selected argument seed words to guide the process.
Different data-driven approaches for sublanguage identification in argumentative texts have been proposed to separate organizational content (shell) from topical content, e.g., supervised sequence modeling [Madnani et al., 2012] and probabilistic topic models [Séaghdha and Teufel, 2014, Du et al., 2014]. Post-processing LDA [Blei et al., 2003] output has been studied to identify topics of visual words [Louis and Nenkova, 2013] and representative words of topics [Brody and Elhadad, 2010, Funatsu et al., 2014]. Our algorithm is similar to [Louis and Nenkova, 2013] in that we use seed words to guide the separation.
3.2 PERSUASIVE ESSAY CORPUS
The dataset for this study is an annotated corpus of persuasive essays [Stab and Gurevych, 2014a]. The essays are student writings in response to sample test questions of standardized English tests for foreign learners, and were posted online1 for others' feedback. In the essays, the writers state their opinions (labeled as MajorClaim) towards the writing topics and validate those opinions with convincing arguments consisting of controversial statements (i.e., Claim) that support or attack the major claims, and evidence (i.e., Premise) that underpins the validity of the claims. Three experts identified possible argument components, i.e., MajorClaim, Claim, and Premise, within each sentence, and connected the argument components using argumentative relations: Support and Attack. An example persuasive essay from the corpus is given below.
Example essay 1:
(0) Effects of Globalization (Decrease in Global Tension)
(1) During the history of the world, every change has its own positive and negative sides. (2) Globalization as a gradual change affecting all over the world is not an exception. (3) Although it has undeniable effects on the economics of the world; it has side effects which make it a controversial issue.
(4) [Some people prefer to recognize globalization as a threat to ethnic and religious values of people of their country]Claim. (5) They think that [the idea of globalization put their inherited culture in danger of uncontrolled change and make them vulnerable against the attack of imperialistic governments]Premise.
(6) Those who disagree, believe that [globalization contribute effectively to the global improvement of the world in many aspects]Claim. (7) [Developing globalization, people can have more access to many natural resources of the world]Premise and [it leads to increasing the pace of scientific and economic promotions of the entire world]Premise. (8) In addition, they admit that [globalization can be considered a chance for people of each country to promote their lifestyle through the stuffs and services imported from other countries]Premise.
(9) Moreover, [the proponents of globalization idea point out globalization results in considerable decrease in global tension]Claim due to [convergence of benefits of people of the world which is a natural consequence of globalization]Premise.
(10) In conclusion, [I would rather classify myself in the proponents of globalization as a speeding factor of global progress]MajorClaim. (11) I think [it is more likely to solve the problems of the world rather than intensifying them]Premise.
1 www.essayforum.com
According to the coding scheme in [Stab and Gurevych, 2014a], each essay has one and only one MajorClaim. An essay sentence (e.g., sentence 9) can simultaneously have multiple argument components which are clauses of the sentence (Argumentative spans), and text spans that do not belong to any argument component (None spans). An argument component can be either a clause or a whole sentence (e.g., sentence 4). Sentences that do not contain any argument component are labeled Non-argumentative (e.g., sentences {1, 2, 3}). The three experts achieved an inter-rater accuracy of 0.88 for argument component labels and a Krippendorff's αU of 0.72 for argument component boundaries.
Forming prediction inputs from the Persuasive Essay Corpus is complicated by the multiple-component sentences. As an illustration, let us consider sentence 9 in the example. We have the following text spans with their respective labels2:
Text span [Label]
Moreover, [None]
the proponents of globalization idea point out globalization results in considerable decrease in global tension [Claim]
due to [None]
convergence of benefits of people of the world which is a natural consequence of globalization [Premise]
. [None]
In this study, we use the model developed in [Stab and Gurevych, 2014b] as a baseline to evaluate our proposed approach. Following [Stab and Gurevych, 2014b], the None spans are not considered as prediction inputs. Therefore, a proper input of the prediction model is either a Non-argumentative sentence or an Argumentative span. Overall, the Persuasive Essay Corpus has 327 Non-argumentative sentences and 1346 Argumentative sentences. The distribution of argumentative labels is shown in Table 1.
2 A single punctuation mark is a proper span.
Argumentative label    #instances
Major-claim            90
Claim                  429
Premise                1033
Non-argumentative      327
Total                  1879

Table 1: Number of instances of each argumentative label in the Persuasive Essay Corpus.
3.3 ARGUMENT AND DOMAIN WORD EXTRACTION
In this section we briefly describe the algorithm for extracting argument and domain words from a development dataset using predefined argument keywords [Nguyen and Litman, 2015]. We recall that argument words are those that play the role of argument indicators and are commonly used across different argument topics, e.g. 'reason', 'opinion', 'think'. In contrast, domain words are specific terminologies commonly used within the topic, e.g. 'art', 'education'. Our notions of argument and domain languages share a similarity with the idea of shell language and content in [Madnani et al., 2012] in that we aim to model the lexical signals of argumentative content. However, while [Madnani et al., 2012] emphasized the boundaries between argument shell and content, we emphasize the lexical signals themselves and allow argument words to occur in the argument content. For example, the MajorClaim in Figure 1 has two argument words, 'should' and 'instead', which make the statement controversial.
The development data for the Persuasive Essay Corpus are 6794 unlabeled essays (Persuasive Set) with titles, collected from www.essayforum.com. We manually select 10 argument keywords/seeds, the 10 most frequent title words that seemed argument-related: agree, disagree, reason, support, advantage, disadvantage, think, conclusion, result, opinion. We extract domain seeds as the title words that are neither argument keywords nor stop words, obtaining 3077 domain seeds (with 136482 occurrences). Each domain seed is associated with its in-title occurrence frequency f.
Topic 1 reason exampl support agre think becaus disagre statement opinion believe therefor idea conclus ...
Topic 2 citi live big hous place area small apart town build communiti factori urban ...
Topic 3 children parent school educ teach kid adult grow childhood behavior taught ...
Table 2: Samples of top argument words (topic 1), and top domain words (topics 2 and 3)
extracted from the Persuasive Set. Words are stemmed.
All words in the development set, including seed words, are stemmed, and named entities are replaced with the corresponding NER labels by the Stanford parser. We run the GibbsLDA++ implementation [Phan and Nguyen, 2007] of LDA [Blei et al., 2003] on the development set and assign each identified LDA topic three weights: the domain weight (DW) is the sum of domain seed frequencies; the argument weight (AW) is the number of argument keywords3; and the combined weight is CW = AW - DW. For example, topic 2 in the LDA output of the Persuasive Set in Table 2 has AW = 5,4 DW = 0.15, CW = 4.85, and f(citi) = 381/136482 = 0.0028 given its 381 occurrences among the 136482 domain seed occurrences in the titles. LDA topics are ranked by CW, with the top topic having the highest CW value. We vary the number of LDA topics k and select the k with the highest CW ratio between the top-2 topics (k = 36). The argument word list is the LDA topic with the largest combined weight given the best k. Domain words are the top words of the other LDA topics, excluding argument and stop words.
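The topic-weighting step can be summarized in a few lines. The sketch below is ours and makes simplifying assumptions rather than reproducing the GibbsLDA++-based implementation: topics arrive as plain lists of stemmed top words, domain seeds map to their in-title frequencies f, and KEYWORD_BOOST is a hypothetical stand-in for the extra weight given to argument keywords (footnote 3).

```python
# Minimal sketch (ours) of the topic-weighting step that post-processes LDA output.

ARGUMENT_KEYWORDS = {"agre", "disagre", "reason", "support", "advantag",
                     "disadvantag", "think", "conclus", "result", "opinion"}
KEYWORD_BOOST = 1.0  # hypothetical per-keyword weight (see footnote 3)

def score_topic(topic_words, domain_seed_freq):
    """Return (AW, DW, CW) for one LDA topic given its top (stemmed) words."""
    aw = KEYWORD_BOOST * sum(w in ARGUMENT_KEYWORDS for w in topic_words)
    dw = sum(domain_seed_freq.get(w, 0.0) for w in topic_words)
    return aw, dw, aw - dw          # combined weight CW = AW - DW

def rank_topics(topics, domain_seed_freq):
    """Rank topics by CW; the top-ranked topic supplies the argument word list."""
    scored = [(score_topic(t, domain_seed_freq), t) for t in topics]
    return sorted(scored, key=lambda s: s[0][2], reverse=True)

# Toy example: one argument-like topic and one domain-like topic.
topics = [["reason", "exampl", "support", "agre", "think"],
          ["citi", "live", "big", "hous", "place"]]
seed_freq = {"citi": 381 / 136482, "hous": 0.002, "live": 0.001}
for (aw, dw, cw), words in rank_topics(topics, seed_freq):
    print(f"AW={aw:.1f} DW={dw:.4f} CW={cw:.4f} words={words}")
```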
Given 10 argument keywords, our algorithm returns a list of 263 argument words5 which
is a mixture of keyword variants (e.g. think, believe, viewpoint, opinion, argument, claim),
3 Argument keywords are weighted more than domain seeds to reduce the size disparity of the two seed sets.
4 Five argument keywords not shown in the table are: {more, conclusion, advantage, who, which}.
5 The complete list is shown in APPENDIX A.
connectives (e.g., therefore, however, despite), and other stop words. The algorithm also extracts 1582 domain words. We note that domain seeds are not necessarily present in the extracted domain words, partly because words with fewer than 3 occurrences are removed from LDA topics.6 Conversely, 6% of the Persuasive Set's domain word list is not in the domain seed set. Table 2 shows examples of top argument and domain words (stemmed) returned by the algorithm.
3.4 PREDICTION MODELS

3.4.1 Stab & Gurevych 2014
The model in [Stab and Gurevych, 2014b] (Stab14) uses the following features extracted from the Persuasive Essay Corpus (a small extraction sketch for a few of these features is given after the list):
• Structural features: #tokens and #punctuations in the argument component (AC)7, in the covering sentence, and preceding/following the AC within the sentence; the token ratio between the covering sentence and the AC. Two binary features indicate whether the token ratio is 1 and whether the sentence ends with a question mark. Five position features capture the covering sentence's position in the essay, and whether the AC is in the first/last paragraph and the first/last sentence of a paragraph.
• Lexical features: all n-grams of length 1-3 extracted from the text span that includes the AC and its preceding text not covered by other ACs in the sentence; verbs like 'believe'; adverbs like 'also'; and whether the AC has a modal verb.
• Syntactic features: #sub-clauses and depth of the syntactic parse tree of the AC's covering sentence; tense of the main verb and grammatical production rules (e.g., VP → VBG NP) from the sub-tree that represents the AC.
• Discourse markers: discourse connectives of 3 relations, Comparison, Contingency, and Expansion8, are extracted by the addDiscourse program [Pitler et al., 2009]. A binary feature indicates whether the corresponding discourse connective precedes the AC.
• First person pronouns: Five binary features indicate whether each of I, me, my, mine, and myself is present in the covering sentence. An additional binary feature indicates whether any of the five first person pronouns is present in the covering sentence.
• Contextual features: #tokens, #punctuations, #sub-clauses, and presence of a modal verb in the sentences preceding and following the AC.
6 Our implementation of the [Stab and Gurevych, 2014b] model obtained a performance improvement when removing rare n-grams, i.e., tokens with fewer than 3 occurrences. Thus, we applied the rare threshold of 3 in our pre-processing of the data.
7 Gold-standard boundaries are used to identify Argumentative spans of the component.
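To make the bulleted descriptions concrete, the fragment below is a small, hypothetical sketch (ours, not the thesis implementation) of how a few of the structural and first-person-pronoun features could be computed for one AC and its covering sentence; tokenization is naive whitespace splitting and the function names are illustrative.

```python
# Hypothetical sketch of a few Stab14-style structural and first-person-pronoun
# features for one argument component (AC) inside its covering sentence.

import string

FIRST_PERSON = ["I", "me", "my", "mine", "myself"]

def structural_features(sentence, ac_span):
    """ac_span = (start, end) character offsets of the AC inside `sentence`."""
    start, end = ac_span
    ac, before, after = sentence[start:end], sentence[:start], sentence[end:]
    feats = {
        "ac_tokens": len(ac.split()),
        "sent_tokens": len(sentence.split()),
        "tokens_before_ac": len(before.split()),
        "tokens_after_ac": len(after.split()),
        "ac_punct": sum(ch in string.punctuation for ch in ac),
        "ends_with_question": sentence.rstrip().endswith("?"),
    }
    feats["token_ratio"] = feats["sent_tokens"] / max(feats["ac_tokens"], 1)
    feats["ratio_is_one"] = feats["token_ratio"] == 1.0
    return feats

def first_person_features(sentence):
    tokens = set(sentence.split())
    feats = {f"has_{p}": p in tokens for p in FIRST_PERSON}
    feats["has_any_first_person"] = any(feats.values())
    return feats

sent = "I think it is more likely to solve the problems of the world."
print(structural_features(sent, (8, len(sent) - 1)))
print(first_person_features(sent))
```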
In this study, we re-implement Stab14 to use as a baseline model. To evaluate our proposed model (described below) we compare its performance with the performance reported
in [Stab and Gurevych, 2014b] as well as the performance of our implementation of Stab14.
3.4.2 Nguyen & Litman 2015
Our proposed model [Nguyen and Litman, 2015]9 (Nguyen15) improves Stab14 by using extracted argument and domain words as novel features and constraints, replacing its n-gram and production rule features. On the lexical side, argument words are expected to provide a much more compact representation of argument indicators than n-grams. On the structural side, instead of production rules, e.g., "S → NP VP", we use dependency parses to extract pairs of the subject and main verb of sentences, e.g., "I.think", "view.be". Dependency relations are minimal syntactic structures compared to production rules. To further make the features topic-independent, we keep only dependency pairs that do not include domain words. In summary, our proposed model takes all features from the baseline except n-grams and production rules, and adds the following features: argument words as unigrams; filtered dependency pairs, i.e., argumentative subject–verb pairs, used as skipped bigrams; and the numbers of argument and domain words (see Figure 5). Our proposed model is compact, with 956 original features compared to 5132 in the baseline.10
8 The authors of [Stab and Gurevych, 2014b] manually collected 55 Penn Discourse Treebank markers after removing those that do not indicate argumentative discourse, e.g., markers of Temporal relations. Because the list of 55 discourse markers was not publicly available, we used a program to extract discourse connectives.
9 In the paper, we named our model AD, which stands for Argument and Domain word-based model.
10 Counted in our implementation of Stab14. Because our implementation removes n-grams with fewer than 3 occurrences, it has a smaller feature space than the original model in [Stab and Gurevych, 2014b].
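The dependency-pair filtering described above can be sketched in a few lines. The code below is our illustration, not the original implementation; it assumes subject–verb dependency triples have already been produced by a dependency parser in (relation, head, dependent) form.

```python
# Our illustration of building "argumentative subject-verb pair" features:
# keep a dependent.head pair only if neither word is a domain word, so the
# feature stays topic-independent.

def subject_verb_pairs(dependency_triples):
    """dependency_triples: iterable of (relation, head, dependent) tuples,
    e.g. ('nsubj', 'think', 'I') from a dependency parser."""
    return {f"{dep}.{head}" for rel, head, dep in dependency_triples
            if rel in ("nsubj", "nsubjpass")}

def argumentative_pairs(pairs, domain_words):
    """Drop any pair that mentions a domain word (topic-specific term)."""
    return {p for p in pairs
            if not any(w in domain_words for w in p.split("."))}

triples = [("nsubj", "think", "I"), ("nsubj", "reduce", "globalization")]
domain = {"globalization", "citi", "educ"}
print(argumentative_pairs(subject_verb_pairs(triples), domain))  # {'I.think'}
```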
Feature group | Stab14 (Stab & Gurevych 2014b) | Nguyen15 (Nguyen & Litman 2015)
Lexical (I)   | 1-, 2-, 3-grams; verbs, adverbs, presence of modal verb; discourse connectives; singular first person pronouns | Argument words as unigrams; rest same as Stab14
Parse (II)    | Production rules; tense of main verb; #sub-clauses, depth of parse tree | Argumentative subject–verb pairs; rest same as Stab14
Structure (III) | #tokens, token ratio, #punctuation, sentence position, first/last paragraph, first/last sentence of paragraph | Stab14 features + #argument words + #domain words
Context (IV)  | #tokens, #punctuation, #sub-clauses, modal verb in preceding/following sentences | Same as Stab14

Figure 5: Feature illustration of Stab14 and Nguyen15. 1-, 2-, 3-grams and production rules in Stab14 are replaced by argument words and argumentative subject–verb pairs in Nguyen15.
3.5 EXPERIMENTAL RESULTS

3.5.1 Proposed vs. Baseline Models
This experiment replicates the one conducted in [Stab and Gurevych, 2014b]. We perform 10-fold cross validation and report the average results. In each run, models are trained using the LibLINEAR [Fan et al., 2008] algorithm with the top 100 features returned by the InfoGain feature selection algorithm applied to the training folds (a rough sketch of this protocol is given below). We use LightSIDE (lightsidelabs.com) to extract n-grams and production rules, the Stanford parser [Klein and Manning, 2003] to parse the texts, and Weka [Hall et al., 2009] to conduct the machine learning experiments. Table 3 (left) shows the performance of three models: BaseR and BaseI are, respectively, the reported performance and our implementation of Stab14 [Stab and Gurevych, 2014b], and Nguyen15 is our proposed model. Because of the skewed label distribution, all reported precision and recall values are unweighted averages of the by-class performances.
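The protocol sketched below is a rough scikit-learn analogue of this setup, not the Weka/LightSIDE pipeline actually used: mutual information stands in for InfoGain, LinearSVC (which is LIBLINEAR-backed) stands in for the LibLINEAR learner, and the feature selection is fit inside each training fold only.

```python
# Rough scikit-learn analogue (ours) of the evaluation protocol: top-100
# feature selection fit on the training folds only, followed by a linear
# classifier, inside 10-fold cross validation.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

def build_model(k_features=100):
    return Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=k_features)),
        ("clf", LinearSVC()),
    ])

def evaluate(X, y):
    """X: feature matrix; y: labels in {MajorClaim, Claim, Premise, Non-arg}."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(build_model(), X, y, cv=cv, scoring="accuracy").mean()
```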
           | Top 100 features              | Best number of features
Metric     | BaseR  | BaseI  | Nguyen15    | BaseI  | Nguyen15
#features  | 100    | 100    | 100         | 130    | 70
Accuracy   | 0.77   | 0.783  | 0.794+      | 0.803  | 0.828*
Kappa      | NA     | 0.626  | 0.649*      | 0.640  | 0.692*
Precision  | 0.77   | 0.760  | 0.756       | 0.763  | 0.793
Recall     | 0.68   | 0.687  | 0.697       | 0.680  | 0.735+

Table 3: Model performances with top 100 features (left) and best number of features (right). +, * indicate p < 0.1, p < 0.05 respectively in the Nguyen15 vs. BaseI comparison. Best values are in bold.
Metric        | AltAD  | Nguyen15
Accuracy      | 0.770  | 0.794*
Kappa         | 0.623  | 0.649*
Precision     | 0.748  | 0.756
Recall        | 0.688  | 0.697
F1:MajorClaim | 0.558  | 0.506
F1:Claim      | 0.468  | 0.527*
F1:Premise    | 0.826  | 0.844*
F1:None       | 1.000  | 1.000

Table 4: 10-fold performance with different argument word lists.
We note that there are performance disparities between BaseI (our implementation) and BaseR (the performance reported in [Stab and Gurevych, 2014b]). The differences may mostly be due to dissimilar feature extraction methods and NLP/ML toolkits. Comparing BaseI and Nguyen15 shows that our proposed model yields significantly higher Kappa and trending higher accuracy.
To further analyze the performance improvement of the Nguyen15 model, we use 75 randomly selected essays to train and estimate the best number of features for BaseI and Nguyen15 (w.r.t. F1 score) through a 9-fold cross validation, then test on the 15 remaining essays. As shown in Table 3 (right), Nguyen15's test performance is consistently better with a far smaller number of top features (70) than BaseI (130). Nguyen15 has 6 of 31 argument words not present in BaseI's 34 unigrams: analyze, controversial, could, debate, discuss, ordinal. Nguyen15 keeps only 5 dependency pairs: I.agree, I.believe, I.conclude, I.think and people.believe, while BaseI keeps up to 31 bigrams and 13 trigrams among its top features. These results indicate the dominance of our proposed features over generic n-grams and syntactic features.
3.5.2 Alternative Argument Word List
In this experiment, we study how well argument words transfer when the development data used to extract them is of a different genre than the test data. In a preliminary experiment, we ran the argument word extraction algorithm on a set of 254 academic writings (see §4.2 for a detailed description of this type of student essay) and extracted 429 argument words.11
To build a model based on the alternative argument word list (AltAD), we replace the argument words in Nguyen15 with those 429 argument words, re-filter the dependency pairs, and update the number of argument words. We follow the same setting as in the experiment above to train Nguyen15 and AltAD using the top 100 features. As shown in Table 4, AltAD performs worse than Nguyen15, except for a higher, though not significant, F1:MajorClaim. AltAD yields significantly lower accuracy, Kappa, F1:Claim and F1:Premise.
Comparing the two argument word lists gives us interesting insights. The two lists have 142 words in common: 9 discourse connectives (e.g., 'therefore', 'despite'), 72 content words (e.g., 'result', 'support'), and 61 stop words. 30 of the common argument words appear in the top 100 features of AltAD, but only 5 are content words: 'conclusion', 'topic', 'analyze', 'show', and 'reason'. This shows that while the two argument word lists have a fair number of words in common, the transferable part is mostly limited to function words, e.g., discourse connectives and stop words.
11 The five argument keywords for this development set were hypothesis, support, opposition, finding, study. In that experiment, we did not consider each essay as an input document for LDA. Instead, we broke essays into sections at citation sentences.
In contrast, 270 of the 285 words unique to AltAD are not selected among the top 100 features, and most of those are popular terms in academic writing, e.g., 'research', 'hypothesis', 'variable'. Moreover, Nguyen15's top 100 features contain 20 argument words unique to the model, and 19 of those are content words, e.g., 'believe', 'agree', 'discuss', 'view'. These non-transferable parts suggest that argument words should be learned from appropriate seeds and development sets for the best performance.
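The list-comparison analysis above amounts to simple set operations; the toy sketch below (ours) shows the kind of overlap statistics reported here.

```python
# Toy sketch of the overlap analysis: shared words, shared words that survive
# feature selection, and words unique to each list.

def compare_word_lists(list_a, list_b, top_features):
    a, b, top = set(list_a), set(list_b), set(top_features)
    common = a & b
    return {
        "common": len(common),
        "common_in_top_features": len(common & top),
        "unique_to_a": len(a - b),
        "unique_to_b": len(b - a),
    }

# The real lists have 263 (Persuasive Set) and 429 (Academic Set) argument words.
print(compare_word_lists(["think", "agree", "therefore", "research"],
                         ["hypothesis", "therefore", "research", "variable"],
                         top_features=["therefore", "think"]))
```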
3.6 CONCLUSIONS
Our proposed features are shown to efficiently replace generic n-grams and production rules in argument mining tasks, yielding significantly better performance. The core component of our feature extraction is a novel algorithm that post-processes LDA output to learn argument and domain words with minimal seeding. These results support our first sub-hypothesis (H1-1, §1.2) on the effectiveness of context features in argument component identification. Moreover, our analysis gives insights into the lexical signals of argumentative content. While argument word lists extracted from different data can have parts in common, there are non-transferable parts which are genre-dependent and necessary for the best performance.
4.0 IMPROVING ARGUMENT MINING IN STUDENT ESSAYS USING ARGUMENT INDICATORS AND ESSAY TOPICS – COMPLETED WORK
4.1 INTRODUCTION
Argument mining systems for student essays need to reliably identify argument components independently of particular writing topics. Prior argument mining studies have explored linguistic indicators of argument such as pre-defined indicative phrases for argumentation [Mochales and Moens, 2008]; syntactic structures, discourse markers, and first person pronouns [Burstein et al., 2003, Stab and Gurevych, 2014b]; and words and linguistic constructs that express rhetorical function [Séaghdha and Teufel, 2014]. However, only a few studies have attempted to abstract over the lexical items specific to argument topics when building new features, e.g., common words with the title [Teufel and Moens, 2002] or cosine similarity with the topic [Levy et al., 2014], or to perform cross-topic evaluations [Burstein et al., 2003]. In a classroom, students can have writing assignments on a wide range of topics, so features that work well when trained and tested on different topics (i.e., writing-topic independent features) are more desirable.
[Stab and Gurevych, 2014b] studied the argument component identification problem in persuasive essays and used linguistic features like n-grams and production rules (e.g., VP → VBG NP, NN → sign) in their argument mining system. While their features were effective, their feature space was large and sparse. Our prior work [Nguyen and Litman, 2015] (see §3) addressed that issue by replacing n-grams with a set of argument words learned in a semi-supervised manner, and by replacing production rules with subject–verb pairs from dependency (rather than constituent-based) parses, which were then filtered based on the learned argument versus domain word distinctions. While our new features were derived from a semi-automatically learned lexicon of argument and domain words, the role of using such a lexicon was not quantitatively evaluated. Moreover, neither [Stab and Gurevych, 2014b] nor our prior work used features that abstracted over topic lexicons, nor performed cross-topic evaluation.
In this chapter, we present our new study [Nguyen and Litman, 2016] that addresses
the above limitations in four ways. First, in §4.2 we introduce a newly annotated corpus of
academic essays from college classes and run all of our studies using both the new corpus and
the prior persuasive essay corpus [Stab and Gurevych, 2014a] (see §3.2). Second, we present
new features to model not only indicators of argument language but also to abstract over
essay topics. Third, we build ablated models that do not use the extracted argument and
domain words to derive new features and feature filters, so we can quantitatively evaluate
the utility of extracting such word lists. Finally, in addition to 10-fold cross validation,
we conduct cross-topic validation to evaluate model robustness when trained and tested on different writing topics.
Through experiments on two different corpora, we aim to provide support for the following three model-robustness hypotheses: models enhanced with our new features will outperform baseline models when evaluated using (h1) 10-fold cross validation and (h2) cross-topic validation; our new models will demonstrate topic-robustness in that (h3) their cross-topic and 10-fold cross validation performance levels will be comparable.
4.2 ACADEMIC ESSAY CORPUS
The Academic Essay Corpus consists of 115 student essays collected from a writing assignment in introductory university Psychology classes in 2014. The assignment requires each student to write the introduction of an observational study that she conducted. In the study, the student proposes one or two hypotheses about the effects of different observational variables on a dependent variable, e.g., the effect of gender on politeness. The student is asked to use relevant studies/theories to justify support for the hypotheses, and to present at least one theoretical opposition with a hypothesis. The students are required to write their introduction in the form of an argumentative essay and to follow the APA guideline of using citations whenever they refer to prior studies.
Argumentative label   | #sentences
Hypothesis            | 185
Finding               | 131
 – Support finding    | 50
 – Opposition finding | 81
Non-argumentative     | 2998
Total                 | 3314

Table 5: Number of sentences of each argumentative label in the Academic Essay Corpus.
Compared to the Persuasive Essay Corpus, while claims in the persuasive essays are mostly substantiated by personal experience, hypotheses in the academic essays are elaborated with findings from the literature. This is the most distinguishing difference between the two corpora.
We had two experts label each sentence of the essays as to whether it is a Hypothesis statement, a Support finding, or an Opposition finding (if so, it is an argumentative sentence; no sentence has multiple labels). As the focus of this study is the identification of argument components without regard to the argumentative relations between them, Support and Opposition sentences are grouped into the Finding category. The two annotators achieved an inter-rater kappa of 0.79 on sentence labels for the Hypothesis-Finding coding scheme. As an example, the last two paragraphs of an academic essay are given below. The essay's topic is "Amount of Bystanders Effect on Helping Behavior".
Example essay 2: (1) Several studies have been done in the past that also examine the ideas of the bystander effect and diffusion of responsibility, and their roles in social situations. (2) [Daniel M. Wegner conducted a study in 1978 that demonstrated the bystander effect on a college campus by comparing the ratio of bystanders to victim, which showed that the more bystanders in comparison to the victims led to less people helping (Wegner, 1983).]Support (3) [Another supporting study was conducted Rutkowski in 1983 that also demonstrated that with larger groups comes less help for victims in non-emergency situations due to less social pressure (Rutkowski, 1983).]Support (4) Although these studies demonstrate the bystander effect and diffusion of responsibility, other studies oppose these ideas. (5) [One strong study that opposes the bystander effect was done in 1980 by Junji Harada that showed that increase in group size, even in a face to face proximity, did not decrease the likelihood of being helped (Harada, 1980).]Opposition (6) In order to find out specifically the effects that the bystander effect has in diverse settings, this study focuses on a non-emergency situation on a college campus. (7) [The hypothesis, based on the bystander effect demonstrated in Wegner's study (1978), is that with more people around, less people will take the time to help the girl pick up her papers.]Hypothesis
In the example, the main content of the argumentative sentences, which expresses the argumentative role of each sentence (e.g., hypothesis, support, or opposition), is italicized. Given the annotation, the Finding sentences are {2, 3, 5}. Table 5 shows the label distribution in the corpus. As we can see, the dataset is very skewed, with Non-argumentative sentences making up more than 90% of the data. Also, while each essay has at least one Hypothesis statement, not all essays have Support and Opposition sentences.
4.3 PREDICTION MODELS

4.3.1 Stab14
As described in §3.4.1, the Stab14 model was developed using the Persuasive Essay Corpus. Despite the differences between persuasive essays and academic essays, the Stab14 model is also applicable to the Academic Essay Corpus. First, the two corpora share certain similarities in writing styles and coding schemes. Both corpora consist of student writings whose content is developed to elaborate a main hypothesis for a persuasive purpose. Regarding coding schemes, MajorClaims in persuasive essays correspond to Hypothesis statements in academic essays, and Claims match Support and Opposition findings. Premises in persuasive essays can be considered the student writer's elaborations of previous studies in academic essays. Second, most of the prediction features proposed in their study are generic and genre-independent, e.g., n-grams, grammatical production rules, and discourse connectives, and are expected to work for student writing in general. Therefore, we adapt the [Stab and Gurevych, 2014b] model, Stab14, to the Academic Essay Corpus as a baseline model to evaluate our approach. The version of Stab14 that works for the Persuasive Essay Corpus is described in §3.4.1.
As the Academic Essay Corpus is annotated at the sentence level and contains no information about argument component boundaries, Stab14 features that involve boundary information are not applicable to the Academic Essay Corpus. Therefore, the Stab14 model is adapted to the Academic Essay Corpus by simply extracting all features from the sentences and removing features that require both the argument component and the covering sentence, e.g., the token ratio.
4.3.2 Nguyen15v2
We implement two modified versions of the Nguyen15 model (§3.4.2) as the second baseline (Nguyen15v2),1 one for each corpus. Additional experiments with the Persuasive Essay Corpus showed that the argument and domain word count features were not effective, so we removed these two features from Nguyen15. For each version we apply the argument and domain word extraction algorithm (§3.3) to extract argument and domain words from a development dataset.
For the Academic Essay Corpus, we use 254 unannotated essays (Academic Set) with titles from Psychology classes in 2011 and 2013 as the development data. We select 5 argument keywords which were specified in the writing assignments: hypothesis, support, opposition, finding, study. Filtering out argument keywords and stop words from the essay titles of the Academic Set, we obtain 264 domain seeds (with 1588 occurrences). The argument and domain word extraction algorithm returns 11 LDA topics, 315 (stemmed) argument words,2 and 1582 (stemmed) domain words. The learned argument words are a mixture of keyword variants (e.g., research, result, predict), methodology terms (e.g., effect, observe, variable, experiment, interact), connectives (e.g., also, however, therefor), and other stop words. 86% of the learned domain words are not in the domain seed set. Table 6 shows examples of top argument and domain words (stemmed) returned by the algorithm.
1 In the paper [Nguyen and Litman, 2016], we named this model Nguyen15. We do not use the original name in this thesis because it might be confused with the Nguyen15 model described in §3.4.2.
2 The complete list is shown in APPENDIX A.
Topic 1 studi research observ result hypothesi time find howev
predict support expect oppos ...
Topic 2 respons stranger group greet confeder individu verbal
social size peopl sneez ...
Topic 3 more gender women polit femal male men behavior differ
prosoci express gratitud ...
Table 6: Samples of top argument words (topic 1), and top domain words (topics 2 and 3)
extracted from Academic Set. Words are stemmed.
4.3.3 wLDA+4
Our proposed model in this study, wLDA+4, is Nguyen15v2 (with the LDA-supported features) expanded with 4 new feature sets extracted from the covering sentences of the associated argument components. A summary of the features used in this model is given in Figure 6. To model the topic cohesion of essays, we include two common word counts:
1. Numbers of common words of the given sentence with the preceding sentence and with the essay title.
We also propose new lexical features as better indicators of argument language. We observe that in argumentative essays students usually use comparison language to compare and contrast ideas. However, not all comparison words are independent of the essay topics. For example, while adverbs (e.g., 'more') are commonly used across essays, adjectives (e.g., 'cheaper', 'richer') seem specific to particular topics. Thus, we introduce the following comparison features:
2. Comparison words: comparative and superlative adverbs. Comparison POS: two binary features indicating the presence of the RBR and RBS part-of-speech tags.
We also see that student authors may use plural first person pronouns (we, us, our, ours, and ourselves) as a rhetorical device to make their statements sound more objective/persuasive, for instance "we always find that we need the cooperation."
Feature group | Stab14 (Stab & Gurevych 2014b) | Nguyen15v2
Lexical (I)   | 1-, 2-, 3-grams; verbs, adverbs, presence of modal verb; discourse connectives; singular first person pronouns | Argument words as unigrams; rest same as Stab14
Parse (II)    | Production rules; tense of main verb; #sub-clauses, depth of parse tree | Argumentative subject–verb pairs; rest same as Stab14
Structure (III) | #tokens, token ratio, #punctuation, sentence position, first/last paragraph, first/last sentence of paragraph | Same as Stab14
Context (IV)  | #tokens, #punctuation, #sub-clauses, modal verb in preceding/following sentences | Same as Stab14
wLDA+4 (this study): all Nguyen15v2 features plus 4 new feature sets: 1. numbers of common words with title and preceding sentence; 2. comparative & superlative adverbs and POS; 3. plural first person pronouns; 4. discourse relation labels.

Figure 6: Feature illustration of Stab14, Nguyen15v2 and wLDA+4. 1-, 2-, 3-grams and production rules in Stab14 are replaced by argument words and argumentative subject–verb pairs in Nguyen15v2. wLDA+4 extends Nguyen15v2 with 4 new feature sets.
We supplement the first person pronoun set in the baseline models with the 5 plural first person pronouns:
3. Five binary features indicating whether each of the 5 plural first person pronouns is present.
We notice that many of the discourse connectives used in the baseline models are duplicates of our extracted argument words, e.g., 'however'. Thus, using both argument words and discourse connectives may inefficiently enlarge the feature space. To emphasize the discourse information, we include discourse relations, as identified by the addDiscourse program [Pitler et al., 2009], as new features (a small sketch of extracting the four new feature sets follows the list):
4. Three binary features showing whether each of the Comparison, Contingency, and Expansion discourse relations is present.3
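The sketch below (ours, not the thesis implementation) illustrates how the four new feature sets could be computed for one sentence, assuming Penn Treebank POS tags and discourse relations supplied by an addDiscourse-style tagger.

```python
# Hypothetical sketch of the four new wLDA+4 feature sets for one sentence.
# RBR = comparative adverb, RBS = superlative adverb (Penn Treebank tags).

PLURAL_FIRST_PERSON = {"we", "us", "our", "ours", "ourselves"}
DISCOURSE_RELATIONS = ("Comparison", "Contingency", "Expansion")

def new_feature_sets(tokens, pos_tags, title_tokens, prev_tokens, relations):
    """tokens/pos_tags describe the sentence; relations are the discourse
    relations detected in it by an external tagger (assumed input)."""
    toks = {t.lower() for t in tokens}
    feats = {
        # 1. topic cohesion: common words with the essay title / previous sentence
        "common_with_title": len(toks & {t.lower() for t in title_tokens}),
        "common_with_prev": len(toks & {t.lower() for t in prev_tokens}),
        # 2. comparison language (POS part; the adverbs themselves could also be kept)
        "has_RBR": "RBR" in pos_tags,
        "has_RBS": "RBS" in pos_tags,
    }
    # 3. plural first person pronouns
    for p in PLURAL_FIRST_PERSON:
        feats[f"has_{p}"] = p in toks
    # 4. discourse relation labels
    for rel in DISCOURSE_RELATIONS:
        feats[f"rel_{rel}"] = rel in relations
    return feats

print(new_feature_sets(
    tokens=["We", "expect", "more", "people", "around"],
    pos_tags=["PRP", "VBP", "RBR", "NNS", "RB"],
    title_tokens=["Bystanders", "Effect", "on", "Helping", "Behavior"],
    prev_tokens=["Other", "studies", "oppose", "these", "ideas"],
    relations=["Comparison"]))
```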
4.3.4 wLDA+4 ablated models
We propose two simple alternatives to wLDA+4 to examine the role of the argument and domain word lists in our argument mining task:
3 The Temporal discourse relation was not used in [Stab and Gurevych, 2014b] and thus is ignored in this study.
• woLDA: we disable the LDA-enabled features and constraints in wLDA+4, so that woLDA does not include argument words and uses all possible subject–verb pairs. All other features of wLDA+4 are applied to woLDA unchanged. Comparing woLDA to wLDA+4 will show the contribution of the extracted argument and domain words to the model performance.
• Seed: the extracted argument and domain word lists are replaced with only the seeds that were used to start the semi-supervised argument and domain word learning process (see §3.3 and §4.3.2). Comparing Seed to wLDA+4 will show whether the semi-supervised approach for expanding the seeds into larger/more comprehensive argument and domain word lexicons is necessary.
4.4 EXPERIMENTAL RESULT

4.4.1 10-fold Cross Validation
We first conduct 10-fold cross validations to evaluate our proposed model and the baseline models. All models are trained using the SMO implementation of SVM in Weka [Hall et al., 2009] (as in [Stab and Gurevych, 2014b]). LightSIDE (lightsidelabs.com) and the Stanford parser [Klein and Manning, 2003] are used to extract n-grams, parse trees, and named entities. We follow [Stab and Gurevych, 2014b] and use the top 100 features ranked by the InfoGain algorithm on the training folds to train the models. To obtain enough samples for a significance test when comparing model performance between 10-fold cross validation and cross-topic validation, we perform 10 runs of 10-fold cross validation (10×10 cross-validation) and report the average results over the 10 runs.4 We use T-tests to compare model performance given that each model evaluation returns 10 samples of 10-fold cross validation performance. As the two corpora are very class-skewed, we report unweighted precision and recall (a small sketch of these metrics is given below). Also, while accuracy is a common metric, kappa is a more meaningful value given our imbalanced data.
4 From our prior study [Nguyen and Litman, 2015], and additional experiments, we also noticed that the skewed distributions of our corpora make stratified 10-fold cross validation performance notably affected by the random seeds. Thus, we decided to conduct multiple cross validations in this experiment to reduce any effect of random folding.
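The sketch below (ours) shows how these metrics could be computed with scikit-learn; "unweighted" corresponds to macro-averaging the by-class precision and recall.

```python
# Sketch of the reported metrics: accuracy, Cohen's kappa, and unweighted
# (macro-averaged) precision/recall, i.e. by-class scores averaged without
# weighting by class frequency.

from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_score, recall_score)

def report(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Skewed toy data: accuracy looks high even though one class is missed entirely.
y_true = ["Non-arg"] * 8 + ["Hypothesis", "Finding"]
y_pred = ["Non-arg"] * 9 + ["Hypothesis"]
print(report(y_true, y_pred))
```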
Persuasive Essay Corpus
Metric    | Stab14 | Nguyen15v2 | woLDA  | Seed   | wLDA+4
Accuracy  | 0.787* | 0.792*     | 0.780* | 0.781* | 0.805
Kappa     | 0.639* | 0.649*     | 0.629* | 0.632* | 0.673
Precision | 0.741* | 0.745*     | 0.746* | 0.740* | 0.763
Recall    | 0.694* | 0.698*     | 0.695* | 0.695* | 0.720

Academic Essay Corpus
Metric    | Stab14 | Nguyen15v2 | woLDA  | Seed   | wLDA+4
Accuracy  | 0.934* | 0.942+     | 0.933* | 0.935* | 0.941
Kappa     | 0.558* | 0.635      | 0.528* | 0.564* | 0.629
Precision | 0.804* | 0.830+     | 0.829  | 0.826  | 0.825
Recall    | 0.628* | 0.695      | 0.594* | 0.637* | 0.695

Table 7: 10×10-fold cross validation results. Best values in bold. +: p < 0.1, *: p < 0.05 by T-test when comparing with wLDA+4.
Model performances are reported in Table 7.
Our first analysis concerns the performance improvement of our proposed model over the two baselines. We see that our model wLDA+4 significantly outperforms Stab14 on all reported metrics across both corpora. However, comparing wLDA+4 and Nguyen15v2 reveals inconsistent patterns. While wLDA+4 yields significantly higher performance than Nguyen15v2 when evaluated on the persuasive corpus, our proposed model performs worse than that baseline on the academic corpus. Looking at the individual metrics of these two models, we see that Nguyen15v2 has trending higher accuracy (p = 0.05) and trending higher precision (p = 0.09) than wLDA+4 on the academic corpus. The differences in kappa and recall between the two models are not significant. These results partially support our first model-robustness hypothesis (h1), in that our proposed features improve over both baselines under 10-fold cross validation in the persuasive corpus only.
We now turn to our feature ablation results. Removing the LDA-enabled features from wLDA+4, we see that woLDA's performance figures are all significantly worse than wLDA+4's, except for precision in the academic corpus. Furthermore, we find that argument keywords and domain seeds are poor substitutes for the full argument and domain word lists learned from these seeds. This is shown by the significantly lower performance of Seed compared to wLDA+4, except for precision in the academic corpus. Nonetheless, adding the features computed from just the argument keywords and domain seeds still helps Seed perform better than woLDA (with higher accuracy, kappa and recall in both the persuasive and academic corpora).
4.4.2 Cross-topic Validation
To better evaluate the models when predicting essays on unseen topics, we conduct cross-topic validation, where training and testing essays are from different topics [Burstein et al., 2003]. We examined 90 persuasive essays and categorized them into 12 groups: 11 single-topic groups, each corresponding to a major topic (with 4 to 11 essays), e.g., Technologies (11 essays), National Issues (10), School (8), Policies (7); and a mixed group of 17 essays on minor topics (each with fewer than 3 essays), e.g., Prepared Food (2 essays).
We manually split the 115 academic essays into 5 topics according to the studied variables: Attractiveness as a function of clothing color (20 essays), Email-response rate as a function of recipient size (22), Helping behavior with effects of gender and group size (31), Politeness as a function of gender (23), and Self-description and word choices with influences of gender and self-esteem (19).
Again, all models are trained using the top 100 features selected in the training folds. In each fold, we use the essays of one topic for evaluation and all other essays to train the model (a leave-one-topic-out sketch is given below). A T-test is used to compare each pair of by-fold performance sets.
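Cross-topic validation corresponds to a leave-one-topic-out split; the sketch below (ours) shows the idea using scikit-learn's LeaveOneGroupOut with essay topics as the groups, for any scikit-learn-style estimator.

```python
# Sketch of cross-topic validation: each fold holds out every essay of one
# topic for testing and trains on the remaining topics.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def cross_topic_scores(model, X, y, topics):
    """X, y: numpy arrays of instances/labels; topics: one topic label per
    instance (e.g., 'Technologies', 'School')."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=topics):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores  # one score per held-out topic; later compared by T-test
```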
We first evaluate the performance improvement of our model compared to the baselines. As shown in Table 8, wLDA+4 again yields higher performance than Stab14 on all metrics in both corpora, and the improvements are significant except for precision in the academic essays. Moreover, we generally observe a larger performance gap between wLDA+4 and Stab14 in cross-topic validation than in 10-fold cross validation.
Persuasive Essay Corpus
Metric    | Stab14 | Nguyen15v2 | woLDA  | Seed   | wLDA+4
Accuracy  | 0.780* | 0.796      | 0.774* | 0.776* | 0.807
Kappa     | 0.623* | 0.654+     | 0.618* | 0.623* | 0.675
Precision | 0.722* | 0.757*     | 0.751  | 0.734  | 0.771
Recall    | 0.670* | 0.695*     | 0.681* | 0.686* | 0.722

Academic Essay Corpus
Metric    | Stab14 | Nguyen15v2 | woLDA  | Seed   | wLDA+4
Accuracy  | 0.928* | 0.939+     | 0.931* | 0.935* | 0.944
Kappa     | 0.491* | 0.598+     | 0.474* | 0.547* | 0.630
Precision | 0.768  | 0.832      | 0.866  | 0.839* | 0.851
Recall    | 0.565* | 0.664      | 0.551* | 0.617* | 0.686

Table 8: Cross-topic validation results. Best values in bold. +: p < 0.1, *: p < 0.05 by T-test when comparing with wLDA+4.
More importantly, with cross-topic validation, wLDA+4 now yields better performance than Nguyen15v2 on all metrics in both the persuasive and academic corpora. In particular, our proposed model now even has trending higher accuracy and kappa than Nguyen15v2 in the academic corpus. This shows a clear contribution of our new features to the overall performance, and supports our second model-robustness hypothesis (h2) that our new features improve cross-topic performance over the baselines in both corpora.
With respect to the feature ablation results, our findings are consistent with the prior cross-fold results in that woLDA and Seed both have lower performance (often significantly) than wLDA+4 (with one exception). Seed again generally outperforms woLDA, indicating that deriving features from even impoverished argument and domain word lists is better than not using such lexicons at all.
Next, we compare wLDA+4's performance across the cross-fold and cross-topic experimental settings (using a T-test to compare the mean of the 10 samples of 10-fold cross validation performance against the mean of the cross-topic validation performance). In both corpora we see that wLDA+4 yields higher performance on all metrics in cross-topic than in 10-fold cross validation, except for recall in the academic corpus. Of these cross-topic performance figures, wLDA+4 has significantly higher precision and trending higher accuracy in the persuasive corpus. In the academic corpus, wLDA+4's cross-topic accuracy, precision and recall are all significantly better than the corresponding 10-fold cross validation figures. These results strongly support our third model-robustness hypothesis (h3) that our proposed model's cross-topic performance is as high as its 10-fold cross validation performance.
In contrast, the difference between Nguyen15v2's cross-topic and random-folding performance does not have a consistent direction. Stab14 returns significantly higher results in 10-fold cross validation than in cross-topic validation on both the persuasive and academic corpora. Also, woLDA's and Seed's cross-topic performances are largely worse than their 10-fold cross validation performances. Overall, the cross-topic validation shows the ability of our proposed model to perform reliably when the testing essays are from new topics, and the essential contribution of our new features to this high performance.
To conclude this section, we give a qualitative analysis of the top features selected in our proposed model. In each fold we record the top 100 features with their associated ranks. By the end of cross-topic validation, we have a pool of top features (about 200 for each corpus), each with an average rank. First, we see that argument words make up about 49% of the pooled features in both corpora, and the proportion of argumentative subject–verb pairs varies from 8% (in the persuasive corpus) to 15% (in the academic corpus). The new features introduced in wLDA+4 that are present in the top features include: the two common word counts; the RBR part-of-speech tag; the pronouns We and Our; and the discourse labels Comparison, Expansion, and Contingency. All of these are in the top 50 except the Comparison label, which has an average rank of 79 in the persuasive corpus. This shows the utility of our new feature sets. In particular, the effectiveness of the common word counts encourages us to study more advanced topic cohesion features in future work.
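The pooling of top features can be sketched as follows (our illustration): collect each fold's ranked top-100 list and average the ranks of each feature over the folds in which it was selected.

```python
# Our illustration of pooling the top-100 features across folds: record each
# selected feature's rank per fold, then average the ranks over the folds in
# which it appears.

from collections import defaultdict

def pool_top_features(per_fold_rankings):
    """per_fold_rankings: list of lists; each inner list holds one fold's
    top features ordered from best (rank 1) downward."""
    ranks = defaultdict(list)
    for ranking in per_fold_rankings:
        for rank, feat in enumerate(ranking, start=1):
            ranks[feat].append(rank)
    pooled = {f: sum(r) / len(r) for f, r in ranks.items()}
    return sorted(pooled.items(), key=lambda kv: kv[1])  # best average rank first

folds = [["we", "common_with_title", "RBR"],
         ["common_with_title", "we", "Expansion"]]
print(pool_top_features(folds))
```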
Stab’s test set
Nguyen’s test set
Metric
Stab best
Our SMO
Nguyen best
Our SMO
Our Lib-LINEAR
Accuracy
0.77
0.816
0.828
0.819
0.837
Kappa
–
0.682
0.692
0.679
0.708
Precision
0.77
0.794
0.793
0.762
0.811
Recall
0.68
0.726
0.735
0.703
0.755
Table 9: Model performance on test sets. Best values in bold.
4.4.3 Performance on Held-out Test Sets
The experiments above used 10×10-fold cross validation and cross-topic validation to investigate the robustness of the prediction features. Note that this required us to re-implement both baselines, as neither had previously been evaluated using cross-topic validation.5 However, since both baselines were evaluated on single held-out test sets of the Persuasive Essay Corpus that were available to us, our last experiment compares wLDA+4's performance with the best reported results of the original baseline implementations [Stab and Gurevych, 2014b, Nguyen and Litman, 2015] using exactly the same training/test splits. That is, we train wLDA+4 using the SMO classifier with the top 100 features on the two training sets of 72 essays [Stab and Gurevych, 2014b] and 75 essays [Nguyen and Litman, 2015], and report the corresponding held-out test performance in Table 9.
While our model's test performance is higher than that of [Stab and Gurevych, 2014b], our model has worse test results than [Nguyen and Litman, 2015]. This is reasonable, as our model was trained following the same configuration as in [Stab and Gurevych, 2014b]6, but was not optimized as in [Nguyen and Litman, 2015]. In fact, [Nguyen and Litman, 2015] obtained their best performing model using the LibLINEAR classifier with the top 70 features. If we keep our top 100 features but replace SMO with LibLINEAR, then wLDA+4 gains a performance improvement, with accuracy 0.84 and Kappa 0.71. Thus, the conclusions from our new cross-fold/cross-topic experiments also hold when wLDA+4 is compared directly with the published baseline test-set results.
5 While Nguyen15v2 (but not Stab14) had been evaluated using 10-fold cross validation, the random fold data cannot be replicated.
6 With respect to the cross validations, while our chosen setting favors Stab14, it still offers an acceptable evaluation as it is not the best configuration for either Nguyen15v2 or wLDA+4.
4.5 CONCLUSIONS
Motivated by practical argument mining for student essays (where essays may be written in response to different assignments), we have presented new features that model argument indicators and abstract over essay topics, and introduced a new corpus of academic essays to better evaluate the robustness of our models. Our proposed model in this study shows robustness in that it yields performance improvements under both cross-topic and 10-fold cross validation for different types of student essays, i.e., academic and persuasive. Moreover, our model's cross-topic performance is even higher than its cross-fold performance on almost all metrics.
Experimental results also show that while our model makes use of effective baseline features that are derived from the extracted argument and domain words, the high performance of our model, especially in cross-topic validation, is also due to our new features, which are generic and independent of essay topics. That is, to achieve the best performance, the new features are a necessary supplement to the learned, and noisy, argument and domain words. These results, along with the results obtained in Chapter 3, strongly support our first sub-hypothesis (H1-1, §1.2) on the effectiveness of contextual features in argument component identification.
5.0 EXTRACTING CONTEXTUAL INFORMATION FOR IMPROVING ARGUMENTATIVE RELATION CLASSIFICATION – PROPOSED WORK
5.1 INTRODUCTION
Research on classifying the argumentative relation between pairs of arguments or argument components has proposed a variety of features, ranging from the superficial level, e.g., word pairs and relative position, to the semantic level, e.g., semantic similarity and textual entailment. [Cabrio and Villata, 2012, Boltužić and Šnajder, 2014] studied online debate corpora and aimed at identifying whether user comments support or attack the debate topic.1 They proposed to use content-rich features including semantic similarity and textual entailment. In principle, they expec...