Week 5 - Assignment: Interpret Statistical Output
Turnitin®
This assignment will be submitted to Turnitin®.
Instructions
Now that you have analyzed your data, you will need to interpret the output that you obtained from your data
analysis. Specifically, you need to discuss what the data analysis findings mean in relation to your research
questions and hypotheses, and what actions should be taken as a result.
For this assignment, you must provide a narrative that discusses the key insights from your data analysis findings
and highlights the limitations of your analysis. Limitations should pertain to weaknesses in your design and
limits on your ability to make conclusions. For example, if you are not able to determine cause and effect, that
would be a limitation. If your dataset is small, that would be another limitation.
Length: 5-7 pages, not including title and reference pages
References: Include a minimum of 5 scholarly resources.
The completed assignment should demonstrate thoughtful consideration of the ideas and concepts presented in
the course by providing new thoughts and insights relating directly to this topic. The content should reflect
scholarly writing and current APA standards and should adhere to Northcentral University's Academic Integrity
Policy.
Upload your document and click the Submit to Dropbox button.
Due Date
Jan 9, 2022 11:59 PM
KJA
Korean Journal of Anesthesiology
Statistical Round
pISSN 2005-6419 • eISSN 2005-7563
What is the proper way to apply the
multiple comparison test?
Sangseok Lee1 and Dong Kyu Lee2
Department of Anesthesiology and Pain Medicine, 1Sanggye Paik Hospital, Inje University College of Medicine, 2Guro Hospital, Korea University School of Medicine, Seoul, Korea
Multiple comparison tests (MCTs) are performed several times on the means of experimental conditions. When the null hypothesis is rejected in the overall test, MCTs are performed to determine which experimental conditions have statistically significant mean differences, or whether a specific pattern exists among the group means. A problem occurs because the error rate increases when multiple hypothesis tests are performed simultaneously. Consequently, in an MCT, it is necessary to control the error rate at an appropriate level. In this paper, we discuss how to test multiple hypotheses simultaneously while limiting the type I error rate, which is caused by α inflation. To choose the appropriate test, we must maintain the balance between statistical power and the type I error rate. If a test is too conservative, a type I error is not likely to occur; however, the test may then have insufficient power, resulting in an increased probability of type II error. Most researchers hope to find the best way of adjusting the type I error rate to discriminate real differences in observed data without wasting too much statistical power. It is expected that this paper will help researchers understand the differences
between MCTs and apply them appropriately.
Keywords: Alpha inflation; Analysis of variance; Bonferroni; Dunnett; Multiple comparison; Scheffé; Statistics; Tukey;
Type I error; Type II error.
Multiple Comparison Test and Its Limitations
We are not always interested in comparing only two groups
per experiment. Sometimes (in practice, very often), we may
have to determine whether differences exist among the means
of three or more groups. The most common analytical method
used for such determinations is analysis of variance (ANOVA).1)

Corresponding author: Dong Kyu Lee, M.D., Ph.D.
Department of Anesthesiology and Pain Medicine, Guro Hospital, Korea University School of Medicine, 148 Gurodong-ro, Guro-gu, Seoul 08308, Korea
Tel: 82-2-2626-3237, Fax: 82-2-2626-1438
Email: entopic@naver.com
ORCID: https://orcid.org/0000-0002-4068-2363
Received: August 19, 2018. Revised: August 26, 2018. Accepted: August 27, 2018.
Korean J Anesthesiol 2018 October 71(5): 353-360
https://doi.org/10.4097/kja.d.18.00242

When the null hypothesis (H0) is rejected after ANOVA,
that is, in the case of three groups, H0: μA = μB = μC, we do not
know which group differs from which other group. The result of
ANOVA does not provide detailed information regarding the
differences among various combinations of groups. Therefore,
researchers usually perform additional analysis to clarify the
differences between particular pairs of experimental groups. If
the null hypothesis (H0) is rejected in the ANOVA for the three
groups, the following cases are considered:
μA ≠ μB ≠ μC, or μA ≠ μB = μC, or μA = μB ≠ μC, or μA = μC ≠ μB
In which of these cases is the null hypothesis rejected? The
only way to answer this question is to apply the ‘multiple comparison test’ (MCT), which is sometimes also called a ‘post-hoc
test.’
1) In this paper, we do not discuss the fundamental principles of ANOVA. For more details on ANOVA, see Kim TK. Understanding one-way ANOVA using conceptual figures. Korean J Anesthesiol 2017; 70: 22-6.
CC This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/
licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright ⓒ The Korean Society of Anesthesiologists, 2018
Online access in http://ekja.org
VOL. 71, NO. 5, October 2018
Applying the multiple comparison test
Meaning of P value and α Inflation
In a statistical hypothesis test, the significance probability, asymptotic significance, or P value (probability value) denotes the
probability of observing a result at least as extreme as the one
actually observed, given that H0 is true. The significance of an
experiment is a random variable that is defined in the sample
space of the experiment and has a value between 0 and 1.
Type I error occurs when H0 is statistically rejected even
though it is actually true, whereas type II error refers to a false
negative: H0 is statistically accepted even though H0 is false (Table 1).
When comparing three groups, they may form the following
three pairs: group 1 versus group 2, group 2 versus group 3, and
group 1 versus group 3. A pair for such a comparison is called a
'family.' The type I error that occurs when each family is compared
is called the 'family-wise error' (FWE). In other words, a method
developed to appropriately adjust the FWE is a multiple comparison
method. α inflation can occur when the same (unadjusted)
significance level is applied to the statistical analyses of several
families simultaneously [2]. For example, if one performs a
Student's t-test between two given groups A and B under a 5% α
error and obtains a nonsignificant result, the probability that this
conclusion is correct under H0 (the hypothesis that groups A and
B are the same) is 95%. Now consider another group, group C,
which we want to compare with groups A and B. If one performs
another Student's t-test between groups B and C and its result is
also nonsignificant, the real probability of obtaining nonsignificant
results for both families (A versus B, and B versus C) is
0.95 × 0.95 = 0.9025, that is, 90.25%; consequently,
Table 1. Types of Erroneous Conclusions in Statistical Hypothesis Testing

Statistical inference    Actual fact: H0 true     Actual fact: H0 false
H0 not rejected          Correct                  Type II error (β)
H0 rejected              Type I error (α)         Correct
the testing α error is 1 − 0.9025 = 0.0975, not 0.05. At the same
time, if the statistical analysis between groups A and C also has a
nonsignificant result, the probability of nonsignificance of all the
three pairs (families) is 0.95 × 0.95 × 0.95 = 0.857 and the actual
testing α error is 1 − 0.857 = 0.143, which is more than 14%.
Inflated α = 1 − (1 − α)^N, N = number of hypotheses tested
(equation 1)
The inflation of probability of type I error increases with
the increase in the number of comparisons (Fig. 1, equation 1).
Table 2 shows the increases in the probability of rejecting H0 according to the number of comparisons.
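Equation 1 can be checked directly against the worked example above; a minimal sketch (the function name is ours):

```python
def inflated_alpha(alpha, n):
    # Equation 1: family-wise type I error after n simultaneous tests,
    # each performed at significance level alpha.
    return 1 - (1 - alpha) ** n

# Reproduces the worked example in the text:
# two comparisons -> 0.0975, three comparisons -> 0.142625 (about 0.143)
```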
Unfortunately, controlling the significance level for an MCT
will probably increase the number of false negative cases, which
are not detected as statistically significant even though they are
really different (Table 1). False negatives (type II errors) can lead
to an increase in cost. Therefore, if this is the case, we may not
even want to attempt to control the significance level for the
MCT. Clearly, such deliberate avoidance increases the possibility
of false positive findings.
Classification (or Type) of Multiple Comparison: Single-step versus Stepwise Procedures
As mentioned earlier, repeated testing with given groups
results in the serious problem known as α inflation. Therefore,
numerous MCT methods have been developed in statistics over
the years.2) Most researchers in the field are interested in
understanding the differences between relevant groups. These
groups could be all pairs in the experiments, or one control group
and the other groups, or two or more groups (one subgroup) and
other experimental groups (another subgroup). Irrespective of the
type of pairs to be compared, all post hoc subgroup comparison
methods should be applied only under a significant overall
ANOVA result.3)

There are several methods for performing MCT, such as the
Tukey method, Newman-Keuls method, Bonferroni method,
Dunnett method, Scheffé's test, and so on. In this paper, we
discuss the best multiple comparison method for analyzing
given data, clarify how to distinguish between these methods,
and describe how to adjust the P value to prevent α inflation in
general multiple comparison situations. Further, we describe the
increase in type I error (α inflation), which should always be
considered in multiple comparisons, and the method for
controlling type I error that is applied in each corresponding
multiple comparison method.

Fig. 1. Depiction of the increasing error rate of multiple comparisons. The X-axis represents the number of simultaneously tested hypotheses, and the Y-axis represents the probability of rejecting at least one true null hypothesis (i.e., the probability of at least one P value less than 0.05). The curve follows the function 1 − (1 − α)^N, where N is the number of hypotheses tested.

Table 2. Inflation of Significance Level according to the Number of Multiple Comparisons

Number of comparisons    Significance level*
1                        0.05
2                        0.098
3                        0.143
4                        0.185
5                        0.226
6                        0.265

*Significance level (α) = 1 − (1 − α)^N, where N = number of hypothesis tests (Adapted from Kim TK. Korean J Anesthesiol 2017; 70: 22-6).
Usually, MCTs are categorized into two classes: single-step
and stepwise procedures. Stepwise procedures are further divided
into step-up and step-down methods. This classification
depends on the method used to handle type I error. As indicated
by its name, a single-step procedure assumes one hypothetical
type I error rate. Under this assumption, almost all pairwise
comparisons (multiple hypotheses) are performed (tested using
one critical value). In other words, every comparison is independent.
A typical example is Fisher's least significant difference (LSD) test.
Other examples are the Bonferroni, Sidak, Scheffé, Tukey,
Tukey-Kramer, Hochberg's GT2, Gabriel, and Dunnett tests.
The stepwise procedure handles type I error according to
previously selected comparison results, that is, it processes
pairwise comparisons in a predetermined order, and each comparison is performed only when the previous comparison result
is statistically significant. In general, this method improves the
statistical power of the process while preserving the type I error
rate throughout. Among the comparison test statistics, the most
significant test (for step-down procedures) or least significant
test (for step-up procedures) is identified, and comparisons
are successively performed when the previous test result is
significant. If one comparison during the process fails to
reject a null hypothesis, all the remaining comparisons are
declared nonsignificant and testing stops. This method does not
apply one common significance level as single-step methods do;
rather, it classifies all relevant groups into
statistically similar subgroups. The stepwise methods include
Ryan-Einot-Gabriel-Welsch Q (REGWQ), Ryan-Einot-Gabriel-Welsch F (REGWF), Student-Newman-Keuls (SNK), and
Duncan tests. These methods have different uses; for example,
the SNK test starts by comparing the two groups with the largest
difference, and the two groups with the second-largest difference
are compared only if there is a significant difference in
the prior comparison. Therefore, such a procedure is called a
step-down method, because the extent of the differences is
reduced as the comparisons proceed. It is noted that the critical
value for comparison varies for each pair; that is, it depends on
the range of mean differences between groups. The smaller the
range of comparison, the smaller the critical value for the range;
hence, although the power increases, the probability of type I
error also increases.
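The step-down logic described above can be sketched in code; the helper `is_significant` stands in for whichever test statistic a given procedure (e.g., SNK) actually uses, so this is an illustrative skeleton rather than any specific published algorithm:

```python
def step_down(comparisons, is_significant):
    """Test pairwise comparisons in order of decreasing mean difference.

    Once one comparison fails to reach significance, all remaining
    (smaller) differences are declared nonsignificant as well.
    """
    ordered = sorted(comparisons, key=lambda c: -abs(c["diff"]))
    results = {}
    stopped = False
    for c in ordered:
        if stopped:
            results[c["name"]] = False  # earlier failure: declared nonsignificant
        else:
            sig = is_significant(c)
            results[c["name"]] = sig
            if not sig:
                stopped = True
    return results

# Hypothetical example: only the largest difference passes the test.
pairs = [{"name": "B-A", "diff": 5.7},
         {"name": "B-C", "diff": 4.6},
         {"name": "C-A", "diff": 1.1}]
outcome = step_down(pairs, lambda c: abs(c["diff"]) > 5.0)
```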
All the aforementioned methods can be used only when the
assumption of equal variances holds. If the equal variance
assumption is violated in the ANOVA process, pairwise comparisons
should instead be based on the statistics of Tamhane's T2,
Dunnett's T3, Games-Howell, or Dunnett's C tests.
Tukey method
This test uses pairwise post-hoc testing to determine whether there is a difference between the means of all possible pairs
using a studentized range distribution. This method tests every
possible pair of all groups. Initially, the Tukey test was called
the ‘Honestly significant difference’ test, or simply the ‘T test,’4)
because this method was based on the t-distribution. It is noted
that the Tukey test is based on the same sample counts between
groups (balanced data) as ANOVA. Subsequently, Kramer
modified this method to apply it on unbalanced data, and it
became known as the Tukey-Kramer test. This method uses the
harmonic mean of the cell sizes of the two groups being compared. The statistical assumptions of ANOVA should be applied to the Tukey
method, as well.5)
Fig. 2 depicts the example results of one-way ANOVA and
2) There are four criteria for evaluating and comparing the methods of post-hoc multiple comparisons: 'conservativeness,' 'optimality,' 'convenience,' and 'robustness.' Conservativeness involves making a strict statistical inference throughout an analysis; in other words, the statistical result of a multiple comparison method has significance only with a certain controlled type I error, and a method without this property could produce a reckless result when there are small differences between groups. The second criterion is optimality. The optimal statistic gives the smallest CI, that is, the smallest standard error, among conservative statistics. Conservativeness is more important than optimality, because optimality is evaluated only among conservative statistics. The third criterion, convenience, literally means ease of calculation. Most statistical computer programs will handle the calculation; however, extensive mathematics may be required to understand a method's nature, which means that a criterion is less convenient to use if it is too complicated. The fourth criterion is 'insensitivity to assumption violation,' which is commonly referred to as robustness. In other words, when the assumption of equal variances in ANOVA is violated, some of the methods presented below are not robust. In that context, it is appropriate to use methods like Tamhane's T2, Games-Howell, Dunnett's T3, and Dunnett's C, which are available in some statistical applications [3].
3) This is true only if conducted as the post-hoc test of ANOVA.
4) It is different from and should not be confused with Student's t-test.
One-way ANOVA (dependent variable: value)

                  Sum of squares   df   Mean square   F       Sig.
Between groups     85.929           2   42.964        5.694   .020
Within groups      83.000          11    7.545
Total             168.929          13

Post hoc tests: multiple comparisons (Tukey HSD), dependent variable: value

Comparison   |Mean difference (I-J)|   Std. error   Sig.   95% confidence interval
A vs. B      5.700*                    1.843        .026    0.723 to 10.677
A vs. C      1.100                     1.843        .825   -3.877 to  6.077
B vs. C      4.600                     1.737        .055   -0.092 to  9.292

*The mean difference is significant at the 0.05 level.

Homogeneous subsets (Tukey HSD, subsets for alpha = 0.05)

Group   N   Subset 1   Subset 2
A       4   4.5000
C       5   5.6000     5.6000
B       5              10.2000
Sig.        .819       .065

Means for groups in homogeneous subsets are displayed.
a. Uses harmonic mean sample size = 4.615.
b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.

Fig. 2. An example of a one-way analysis of variance (ANOVA) result with the Tukey test for multiple comparison, performed using IBM® SPSS® Statistics (ver. 23.0, IBM Co., USA). Groups A, B, and C are compared. The Tukey honestly significant difference (HSD) test was performed under the significant result of ANOVA. The multiple comparison results presented statistical differences between groups A and B, but not between groups A and C or between groups B and C. However, in the last table, 'Homogeneous subsets,' there is a contradictory result: the differences between groups A and C and between groups B and C are not significant, although a significant difference existed between groups A and B. This inconsistent interpretation could have originated from insufficient evidence.
Tukey test for multiple comparisons. According to this figure,
the Tukey test is performed with one critical level, as described
earlier, and the results of all pairwise comparisons are presented
in one table under the section 'post-hoc test.' The results conclude
that groups A and B are different, whereas groups A and C are not
different and groups B and C are also not different. These odd
results continue in the last table, named 'Homogeneous subsets':
groups A and C are similar and groups B and C are also similar;
however, groups A and B are different. An inference of this type
differs from syllogistic reasoning. In mathematics, if A = B and
B = C, then A = C. However, in statistics, when A = B and B = C,
A is not necessarily the same as C, because all these results are
probable outcomes based on statistics. Such contradictory results
can originate from inadequate statistical power, that is, a small
sample size. The Tukey test is a generous method for detecting a
difference during pairwise comparison (it is less conservative);
to avoid this illogical result, an adequate sample size should be
guaranteed, which gives rise to smaller standard errors and
increases the probability of rejecting the null hypothesis.

5) Independent variables must be independent of each other (independence), dependent variables must satisfy the normal distribution (normality), and the variance of the dependent variable distribution by independent variables should be the same for each group (equivalence of variance).
Bonferroni method: α splitting (Dunn's method)
The Bonferroni method can be used to compare different
groups at the baseline, study the relationship between variables,
or examine one or more endpoints in clinical trials. It is applied
as a post-hoc test in many statistical procedures such as ANOVA
and its variants, including analysis of covariance (ANCOVA)
and multivariate ANOVA (MANOVA); multiple t-tests; and
Pearson’s correlation analysis. It is also used in several nonparametric tests, including the Mann-Whitney U test, Wilcoxon
signed rank test, and Kruskal-Wallis test by ranks [4], and as a
test for categorical data, such as Chi-squared test. When used
as a post hoc test after ANOVA, the Bonferroni method uses
thresholds based on the t-distribution; the Bonferroni method is
more rigorous than the Tukey test, which tolerates type I errors,
and more generous than the very conservative Scheffé’s method.
However, it has disadvantages, as well, since it is unnecessarily conservative (with weak statistical power). The adjusted
α is often smaller than required, particularly if there are many
tests and/or the test statistics are positively correlated. Therefore,
this method often fails to detect real differences. If the proposed
study requires that type II error should be avoided and possible effects should not be missed, we should not use Bonferroni
correction. Rather, we should use a more liberal method like
Fisher’s LSD, which does not control the family-wise error rate
(FWER).6) Another alternative, when the Bonferroni correction
would yield overly conservative results, is to use a stepwise
(sequential) method, for which the Bonferroni-Holm and Hochberg
procedures are suitable; they are less conservative than the
Bonferroni test [5].
Dunnett method
This is a particularly useful method to analyze studies having
control groups, based on modified t-test statistics (Dunnett’s
t-distribution). It is a powerful statistic and, therefore, can discover relatively small but significant differences among groups
or combinations of groups. The Dunnett test is used by researchers interested in testing two or more experimental groups
against a single control group. However, the Dunnett test has the
disadvantage that it does not compare the groups other than the
control group among themselves at all.
As an example, suppose there are three experimental groups
A, B, and C, in which an experimental drug is used, and a
control group in a study. In the Dunnett test, a comparison of
control group with A, B, C, or their combinations is performed;
however, no comparison is made between the experimental
groups A, B, and C. Therefore, the power of the test is higher
because the number of tests is reduced compared to the ‘all pairwise comparison.’
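The power advantage comes simply from the reduced number of families to be tested; a tiny sketch (function names are ours):

```python
def n_all_pairwise(k):
    # Number of families when every pair among k groups is compared.
    return k * (k - 1) // 2

def n_dunnett(k):
    # Number of families when k - 1 experimental groups are each
    # compared only against the single control group.
    return k - 1

# With a control plus experimental groups A, B, and C (k = 4),
# all-pairwise comparison needs 6 tests, while Dunnett needs only 3,
# so less of the family-wise alpha must be spent per comparison.
```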
On the other hand, the Dunnett method is capable of 'two-tailed' or 'one-tailed' testing, which makes it different from other
pairwise comparison methods. For example, if the effect of a
new drug is not known at all, the two-tailed test should be used
to confirm whether the effect of the new drug is better or worse
than that of a conventional control. Subsequently, a one-sided
test is required to compare the new drug and control. Since the
two-sided or single-sided test can be performed according to the
situation, the Dunnett method can be used without any restrictions.
Scheffé’s method: exploratory post-hoc method
Scheffé’s method is not a simple pairwise comparison test.
Based on F-distribution, it is a method for performing simultaneous, joint pairwise comparisons for all possible pairwise
combinations of each group mean [6]. It controls FWER after
considering every possible pairwise combination, whereas the
Tukey test controls the FWER only when all pairwise comparisons are made.7) This is why Scheffé's method is more conservative than other methods and has smaller power to detect differences. Since Scheffé's method generates hypotheses based on
all possible comparisons to confirm significance, this method is
preferred when theoretical background for differences between
groups is unavailable or previous studies have not been completely implemented (exploratory data analysis). The hypotheses
generated in this manner should be tested by subsequent studies
that are specifically designed to test new hypotheses. This is important in exploratory data analysis or the theoretic testing process (e.g., if a type I error is likely to occur in this type of study
and the differences should be identified in subsequent studies).
Follow-up studies testing specific subgroup contrasts discovered
through the application of Scheffé's method should use
Bonferroni methods, which are appropriate for theory-testing
studies. It is further noted that Bonferroni methods are less
sensitive to type I errors than Scheffé's method. Finally, Scheffé's
method enables simple or complex averaging comparisons in
both balanced and unbalanced data.

6) In this paper, we do not discuss Fisher's LSD, Duncan's multiple range test, and the Student-Newman-Keuls procedure. Since these methods do not control the FWER, they do not suit the purpose of this paper.
7) Basically, a multiple pairwise comparison should be designed according to the planned contrasts. A classical deductive multiple comparison is performed using predetermined contrasts, which are decided early in the study design step. By assigning a contrast to each group, pairing can be varied from some or all pairs of two selected groups to subgroups, including several groups that are independent or partially dependent on each other.
Violation of the assumption of equivalence of
variance
One-way ANOVA is performed only in cases where the
assumption of equivalence of variance holds. However, it is
a robust statistic that can be used even when there is a deviation from the equivalence assumption. In such cases, the
Games-Howell, Tamhane’s T2, Dunnett’s T3, and Dunnett’s C
tests can be applied.
The Games-Howell method is an improved version of the
Tukey-Kramer method and is applicable in cases where the
equivalence of variance assumption is violated. It is a t-test
using Welch’s degree of freedom. This method uses a strategy
for controlling the type I error for the entire comparison and is
known to maintain the preset significance level even when the
size of the sample is different. However, the smaller the number
of samples in each group, the more tolerant the type I error
control becomes. Thus, this method should be applied only when
the number of samples per group is six or more.
Tamhane’s T2 method gives a test statistic using the t-distribution by applying the concept of ‘multiplicative inequality’
introduced by Sidak. Sidak’s multiplicative inequality theorem
implies that the probability of occurrence of intersection of each
event is more than or equal to the probability of occurrence of
each event. Compared to the Games-Howell method, Sidak’s
theorem provides a more rigorous multiple comparison method
by adjusting the significance level. In other words, it is more
conservative than type I error control. Contrarily, Dunnett’s T3
method does not use the t-distribution but uses a quasi-normalized maximum-magnitude distribution (studentized maximum
modulus distribution), which always provides a narrower CI
than T2. The degrees of freedom are calculated using the Welch
methods, such as Games-Howell or T2. This Dunnett’s T3 test
is understood to be more appropriate than the Games-Howell test when the number of samples in the each group is less
than 50. It is noted that Dunnett’s C test uses studentized range
distribution, which generates a slightly narrower CI than the
Games-Howell test for a sample size of 50 or more in the experimental group; however, the power of Dunnett’s C test is better
than that of the Games-Howell test.
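For intuition, the Šidák-style per-test level (assumed here in its simplest form, 1 − (1 − α)^(1/m)) can be compared numerically with the Bonferroni split α/m; the actual T2 and T3 statistics are more involved than this sketch, and the function names are ours:

```python
def bonferroni_per_test(alpha, m):
    # Split the family-wise level evenly across m tests.
    return alpha / m

def sidak_per_test(alpha, m):
    # Per-test level that keeps the family-wise level exactly at alpha
    # for m independent tests: 1 - (1 - per_test)**m == alpha.
    return 1 - (1 - alpha) ** (1 / m)

bon = bonferroni_per_test(0.05, 3)  # about 0.01667
sid = sidak_per_test(0.05, 3)       # about 0.01695
# Sidak's per-test threshold is slightly larger (less strict per test)
# while still bounding the family-wise error at 0.05.
```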
Methods for Adjusting P value
Many research designs use numerous sources of multiple
comparison, such as multiple outcomes, multiple predictors,
subgroup analyses, multiple definitions for exposures and
outcomes, multiple time points for outcomes (repeated measures), and multiple looks at the data during sequential interim
monitoring. Therefore, multiple comparisons performed in such
situations are accompanied by an increased type I error problem,
and it is necessary to adjust the P value accordingly.
Various methods are used to adjust the P value. However, there
is no universally accepted single method to control multiple test
problems. Therefore, we introduce two representative methods
for multiple test adjustment: FWER and false discovery rate
(FDR).
Controlling the family-wise error rate: Bonferroni
adjustment
The classic approach for solving a multiple comparison problem involves controlling FWER. A threshold value of α less than
0.05, which is conventionally used, can be set. If the H0 is true
for all tests, the probability of obtaining a significant result from
this new, lower critical value is 0.05. In other words, if all the null
hypotheses, H0, are true, the probability that the family of tests
includes one or more false positives due to chance is 0.05. Usually, these methods are used when it is important not to make
any type I errors at all. The methods belonging to this category
are Bonferroni, Holm, Hochberg, Hommel adjustment, and so
on. The Bonferroni method is one of the most commonly used
methods to control FWER. With an increase in the number of
hypotheses tested, type I error increases. Therefore, the significance level is divided by the number of hypothesis tests. In this
manner, type I error can be lowered. In other words, the higher
the number of hypotheses to be tested, the more stringent the
criterion, the lesser the probability of production of type I errors,
and the lower the power.
For example, for performing 50 t-tests, one would set each
t-test to 0.05 / 50 = 0.001. Therefore, one should consider the
test as significant only for P < 0.001, not P < 0.05 (equation 2).
Adjusted alpha (α) = α / k (number of hypothesis tested)
(equation 2)
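Equation 2 and the 50-test example above can be sketched minimally (the function name is ours):

```python
def bonferroni_alpha(alpha, k):
    # Equation 2: split the overall significance level alpha
    # evenly across k hypothesis tests.
    return alpha / k

# The example from the text: 50 t-tests at an overall alpha of 0.05
# give a per-test threshold of 0.05 / 50 = 0.001, so a result is
# declared significant only when P < 0.001, not P < 0.05.
```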
The advantage of this method is that the calculation is
straightforward and intuitive. However, it is too conservative,
since when the number of comparisons increases, the level of
significance becomes very small and the power of the system decreases [7]. The Bonferroni correction is strongly recommended
for testing a single universal null hypothesis (H0) that all tests
are not significant. This is true for the following situations,
as well: to avoid type I error or perform many tests without a
preplanned hypothesis for the purpose of obtaining significant
results [8].
The Bonferroni correction is suitable when even one false positive
in a series of tests would be an issue. It is usually useful when there are
numerous multiple comparisons and one is looking for one or
two important ones. However, if many comparisons are required and many of them are expected to be important, the Bonferroni correction can have a high false negative rate [9].
Controlling the false discovery rate: Benjamini-Hochberg adjustment
An alternative to controlling the FWER is to control the
FDR using the Benjamini-Hochberg and Benjamini & Yekutieli
adjustments. The FDR is the expected proportion of incorrectly
rejected null hypotheses (type I errors) among all rejected
hypotheses. It is less conservative: by performing the comparison
procedure with greater power than FWER control, it accepts an
increased probability that some type I errors will occur [10].
Although FDR limits the number of false discoveries, some
will still be obtained; hence, these procedures may be used if
some type I errors are acceptable. In other words, it is a method
to filter the hypotheses that have errors in the test from the hypotheses that are judged important, rather than testing all the
hypotheses like FWER.
The Benjamini-Hochberg adjustment is very popular due to
its simplicity. Rearrange all the P values in order from the smallest to largest value. The smallest P value has a rank of i = 1, the
next smallest has i = 2, and so on.
p(1) ≤ p(2) ≤ p(3) ≤ … ≤ p(i) ≤ … ≤ p(m)
Compare each individual P value to its Benjamini-Hochberg
critical value (equation 3).
Benjamini-Hochberg critical value = (i / m)∙Q (equation 3)
(i, rank; m, total number of tests; Q, chosen FDR)
The largest P value for which P < (i / m)∙Q is significant, and
all the P values smaller than the largest value are also significant,
even the ones that are not less than their Benjamini-Hochberg
critical value.
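The steps above can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return a reject/accept flag for each P value under FDR level q.

    Rank the P values from smallest to largest, find the largest rank i
    with p(i) <= (i / m) * q, and reject every hypothesis whose P value
    is at or below p(i) -- even those not below their own critical value.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest = 0  # largest rank whose P value meets its critical value
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= (rank / m) * q:
            largest = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest:
            reject[idx] = True
    return reject

# Hypothetical example: only the smallest P value is rejected here,
# since 0.01 <= (1/3)*0.05 but 0.20 > (2/3)*0.05.
flags = benjamini_hochberg([0.01, 0.20, 0.50])
```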
When you perform this correcting procedure with an FDR
≥ 0.05, it is possible for individual tests to be declared significant
even though their P ≥ 0.05. Finally, only the hypotheses whose
P values fall within the rejection region determined by the FDR
adjustment will be rejected.
One should be careful when choosing the FDR. If we plan further experiments on interesting individual results, the additional cost of those experiments is low, and the cost of false negatives (missing potentially important findings) is high, then we should use a high FDR, such as 0.10 or 0.20, to ensure that important findings are not missed. Note also that both the Bonferroni correction and the Benjamini-Hochberg procedure assume the individual tests to be independent.
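The ranking-and-cutoff procedure above can be sketched in a few lines (an illustration of equation 3, not code from the paper):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Flag significance using the Benjamini-Hochberg step-up procedure.

    q is the chosen FDR; the critical value for rank i is (i / m) * q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices, smallest P first
    # find the largest rank i (1-based) with p(i) <= (i / m) * q
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # every P value ranked at or below the cutoff is significant, even
    # those that exceed their own critical value
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        significant[idx] = rank <= cutoff
    return significant

# 0.030 exceeds its own critical value (2/4)*0.05 = 0.025, yet is still
# significant because a larger P value (0.031) meets its criterion.
print(benjamini_hochberg([0.001, 0.030, 0.031, 0.9], q=0.05))
# → [True, True, True, False]
```

The example demonstrates the property stated above: once the largest qualifying P value is found, everything smaller is declared significant, including values that fail their own critical value.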
Conclusions and Implications
The purpose of the multiple comparison methods discussed in this paper is to control the 'overall significance level' of a set of inferences, whether performed as post hoc tests after ANOVA or as pairwise comparisons in various assays. The overall significance level is the probability that, given all the tested null hypotheses are true, at least one is rejected, or one or more CIs do not contain the true value.
In general, the common statistical errors found in medical
research papers arise from problems with multiple comparisons
[11]. This is because researchers attempt to test multiple hypotheses concurrently in a single experiment; the authors of this paper have already pointed out this issue.

Fig. 3. Comparative chart of multiple comparison tests (MCTs). Five representative methods (Dunnett, Newman-Keuls, Tukey HSD, Bonferroni (Dunn), and Scheffé) are listed along the X-axis, and the parameters compared among these methods (test statistics based on the t-, F-, or range distribution; range test versus pairwise MCT; single-step versus stepwise procedures; sample distribution; and conservativeness, from reckless to strict) are listed along the Y-axis. Some methods use the range test and pairwise MCT concomitantly. The Dunnett and Newman-Keuls methods are comparable with respect to conservativeness. The Dunnett method uses one significance level, and the Newman-Keuls method compares pairs using the stepwise procedure based on the changes in range test statistics during the procedure. According to the range between the groups, the significance level is changed in the Newman-Keuls method. HSD: honestly significant difference.
Since biomedical papers emphasize the importance of multiple comparisons, a growing number of journals have started
including a process of separately ascertaining whether multiple
comparisons are appropriately used during the submission
and review process. According to a study on the appropriateness of multiple comparisons in articles published in three medical journals over 10 years, 33% (47/142) of papers did not use any multiple comparison correction, 61% (86/142) applied a correction without rationale, and only 6.3% (9/142) used suitable correction methods [8]. The Bonferroni method was used in 35.9% of papers. Most (71%) of the papers provided little or no discussion, whereas only 29% offered some rationale for and/or discussion of the method [8]. The implications of these results are significant. Some authors decide not to use adjusted P values, or they compare the results of corrected and uncorrected P values, which complicates the interpretation of the results. This decision reduces the reliability of the results of published studies.
In a study, many situations may occur that affect the choice of MCTs. For example, groups might have different sample sizes; several multiple comparison tests were specifically developed to handle unequal groups. Power can also be a concern in a study, and some tests have more power than others.
Whereas all comparative tests are important in some studies,
only predetermined combinations of experimental groups or
comparators should be tested in others. When a special situation
affects a particular pairwise analysis, the selection of multiple
comparative analysis tests should be controlled by the ability
of specific statistics to address the questions of interest and
the types of data to be analyzed. Therefore, it is important that
researchers select the tests that best suit their data, the types of
information on group comparisons, and the power required for
analysis (Fig. 3).
In general, most pairwise MCTs are based on balanced data. Therefore, when there are large differences in sample sizes, care should be taken in selecting a multiple comparison procedure. The LSD, Sidak, Bonferroni, and Dunnett methods, which use the t-statistic, pose no problem, since they do not assume that the number of samples in each group is the same. The Tukey test, which uses the studentized range distribution, can be problematic, since it assumes that all sample sizes are equal. Therefore, the Tukey-Kramer test, which uses the harmonic mean of the sample sizes, can be used when the sample sizes differ. Finally, we must check whether the assumption of equal variances is satisfied. The multiple comparison methods mentioned previously all assume equal variances. Tamhane's T2, Dunnett's T3, Games-Howell, and Dunnett's C are multiple comparison tests that do not assume equal variances.
Although the Korean Journal of Anesthesiology has not formally examined this issue, the journal's view on the subject is not expected to differ significantly from the view expressed in this paper [8]. Therefore, it is important that all authors are aware of the problems posed by multiple comparisons, and further efforts are required to spread awareness of these problems and their solutions.
ORCID
Sangseok Lee, https://orcid.org/0000-0001-7023-3668
Dong Kyu Lee, https://orcid.org/0000-0002-4068-2363
References
1. Lee DK. Alternatives to P value: confidence interval and effect size. Korean J Anesthesiol 2016; 69: 555-62.
2. Kim TK. Understanding one-way ANOVA using conceptual figures. Korean J Anesthesiol 2017; 70: 22-6.
3. Stoline MR. The status of multiple comparisons: simultaneous estimation of all pairwise comparisons in one-way ANOVA designs. Am Stat
1981; 35: 134-41.
4. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc 1961; 56: 52-64.
5. Chen SY, Feng Z, Yi X. A general introduction to adjustment for multiple comparisons. J Thorac Dis 2017; 9: 1725-9.
6. Scheffé H. A method for judging all contrasts in the analysis of variance. Biometrika 1953; 40: 87-110.
7. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 1955; 50: 1096-121.
8. Armstrong RA. When to use the Bonferroni correction. Ophthalmic Physiol Opt 2014; 34: 502-8.
9. Streiner DL, Norman GR. Correction for multiple testing: is there a resolution? Chest 2011; 140: 16-8.
10. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B
(Method) 1995; 57: 289-300.
11. Lee S. Avoiding negative reviewer comments: common statistical errors in anesthesia journals. Korean J Anesthesiol 2016; 69: 219-26.
Dataset Analysis
Uchechukwu Ohiri
TIM-7101 V1: Statistics with Technology Applications
Northcentral University (NCU)
Dr. Nicholas Harkiolakis
January 2, 2021
Dataset Analysis
Description of the Problem
The video game dataset provided comprises columns labeled Date, Visits, VisitTime, TotalTime, Game, and Advertising. The three main variables used for this study are Visits, Game, and Advertising. The independent variables are the type of player (police officer or thief) and the advertising period (advertising or no advertising), while the dependent variable is the number of video game visits. The first objective is to determine whether the number of video game visits differs by type of player; the second objective is to determine whether the number of video game visits differs by advertising period. Since the data are normally distributed, independent t-tests will be used as the inferential model, comparing the two levels of each independent variable.
Hypotheses to be Tested
Given there are two objectives, the null and alternate hypotheses to be tested are:
H0:
• There is no statistically significant difference in the number of video game visits for the type of player.
• There is no statistically significant difference in the number of video game visits for the advertising period.
H1:
• There is a statistically significant difference in the number of video game visits for the type of player.
• There is a statistically significant difference in the number of video game visits for the advertising period.
Data Characteristics
The descriptive statistics report below is for the number of video game visits. The data properties included in the report are central tendency, variability measures, outlier detection, and other distribution attributes. The mean (average) is a measure of location obtained by dividing the sum of all the observations by the number of observations. The number of observations (count) is 44, the mean is 1.45, and the standard error is 0.40; the standard error indicates how precisely the sample mean estimates the population mean. The small standard error obtained for these data therefore suggests that the sample mean provides a fairly precise estimate of the population value. The median and mode of the number of video game visits are both zero. The median is a measure of location that splits the ordered frequency distribution into two equal parts, while the mode is the value that occurs most frequently (Trajkovski, 2016). The standard deviation, a measure of the spread (scatter) of the individual data values around the sample mean, is 2.67, and the sample variance of 7.14 shows how widely the individual observations vary from the sample mean. The kurtosis of the data is 2.60 and the skewness is 1.93; both measures compare the distribution's shape with that of a normal, symmetric distribution. Since these values deviate noticeably from zero, the data depart from a normal distribution and are positively (right-) skewed. The minimum and maximum values in the data set are 0 and 10, respectively, giving a range of 10. The range is based on the two most extreme values in the data and tends to increase with sample size. The sum of all values for video game visits is 64.
Table 1
Dataset of Visits

Statistic              Visits
Mean                   1.454545455
Standard Error         0.402758125
Median                 0
Mode                   0
Standard Deviation     2.671595164
Sample Variance        7.137420719
Kurtosis               2.601907486
Skewness               1.93103248
Range                  10
Minimum                0
Maximum                10
Sum                    64
Count                  44
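The figures in Table 1 can be checked by rebuilding the raw Visits column from the pivot counts (my own reconstruction, not the original data file: a "Sum of Visits" of 8 in the row labeled 1 implies eight one-visit days, and so on, with 27 zero-visit days bringing the count to 44):

```python
import math
import statistics

# Visits sample reconstructed from the pivot tables (an inference from the
# row totals): 27 zeros, eight 1s, two 3s, one 5, two 6s, one 7, two 8s,
# and one 10 give Count = 44 and Sum = 64.
visits = [0] * 27 + [1] * 8 + [3] * 2 + [5] + [6] * 2 + [7] + [8] * 2 + [10]

mean = statistics.mean(visits)
variance = statistics.variance(visits)       # sample variance (n - 1 denominator)
std_dev = statistics.stdev(visits)
std_err = std_dev / math.sqrt(len(visits))   # standard error of the mean

print(round(mean, 4), round(variance, 4), round(std_dev, 4), round(std_err, 4))
# → 1.4545 7.1374 2.6716 0.4028
```

These values reproduce the mean, variance, standard deviation, and standard error in Table 1, which supports the reconstruction; skewness and kurtosis are not in the standard library and would need the moment formulas directly.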
Table 2 below is a pivot table showing that the total number of video game visits for Police is 31, while the total for Thief is 33. Table 3 shows that the total number of video game visits during the advertising period is 59, while the total when there is no advertising is 5.
Table 2
Visits vs Game

Sum of Visits    Police    Thief    Grand Total
0                     0        0              0
1                     2        6              8
3                     3        3              6
5                     5                       5
6                     6        6             12
7                     7                       7
8                             10             10  (visits value 10, Thief only)
Grand Total          31       33             64

(rows 8 and 10: Police 8 and Thief 8 sum to 16 for the value 8; the value 10 occurs once, for Thief)

Table 3
Visits vs Advertising

Sum of Visits        No      Yes    Grand Total
0                     0        0              0
1                     2        6              8
3                     3        3              6
5                              5              5
6                             12             12
7                              7              7
8                             16             16
10                            10             10
Grand Total           5       59             64
Statistical Assumptions & Findings
T-tests assess the means of one or two data sets. The standard form tests the null hypothesis that the two sample means are equal against the alternate hypothesis that they are not equal. The significance level is set at 0.05; if the p-value is less than the significance level, the null hypothesis is rejected, because the difference between the two means is statistically significant, and vice versa. It is essential to note that the p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.
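As an illustrative sketch of the decision rule just described (not the computation actually performed in this paper, which uses the Chi-square test below), the two-sample t statistic can be computed directly; the p-value would then come from the t distribution:

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Identical samples give t = 0; a clearly larger first sample gives t > 0.
print(welch_t([1, 2, 3], [1, 2, 3]))  # → 0.0
```

The farther the statistic falls into the tails of the t distribution, the smaller the p-value, which is then compared with the 0.05 significance level as described above.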
Given that the data contain both numerical and non-numerical values, a Chi-square test is appropriate for the categorical data (Lee & Lee, 2018). The Chi-square test is a non-parametric test used on discrete data to assess the probability that non-random factors account for the association observed between non-numeric variables (Turhan, 2020). One application of the Chi-square test is the test of independence, as conducted in this scenario, with the goal of establishing an association between variables. The main assumptions of the Chi-square test include (Rana & Singhal, 2015; McHugh, 2013):
• The data are randomly obtained from a population.
• The data are frequencies or counts, as opposed to percentages or other data transformations.
• The cell values are adequate when the expected counts are 5 or more, and no cells contain zero values.
• The sample size is large enough to avoid a type II error, which arises when the null hypothesis is accepted although it is actually false. A sample size of 20-50 is considered the adequate minimum; this analysis uses a sample size of 64.
• The variables under consideration are mutually exclusive: each observation is counted only once in a given category.
To analyze the data, two contingency tables were created: one for the type of player (Police/Thief) and one for the advertising period (Yes/No). The Chi-square statistic is based on the relationship between the observed and the expected counts. The Chi-square statistics obtained are 23.9609 (type of player) and 22.34576 (advertising period). To determine whether the null hypotheses should be rejected based on the Chi-square statistic, the degrees of freedom (df) are obtained; the df is then used to find the p-value and the Chi-square critical value. The df for both data sets is 6, obtained by multiplying the number of rows minus one by the number of columns minus one. If the Chi-square statistic is higher than the critical value, the null hypothesis is rejected; likewise, if the p-value is less than the 0.05 significance level, the null hypothesis is rejected. The Chi-square critical value for both the type of player and the advertising period is 12.59159, because both variables have the same df and significance level. Lastly, the p-values obtained from the calculations are 0.000530978 for the type of player and 0.00104799 for the advertising period.
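The type-of-player calculation can be reproduced in a short script (a sketch: the observed counts are my reading of Table 2 with the zero-visits row dropped, which is what yields the reported df of (7 − 1) × (2 − 1) = 6; the survival function below is the closed form valid for even df):

```python
import math

# Observed counts by visits value (1, 3, 5, 6, 7, 8, 10) for Police and
# Thief, taken from Table 2 with the zero-count row dropped (an assumption
# that reproduces the reported df of 6).
observed = [[2, 6], [3, 3], [5, 0], [6, 6], [7, 0], [8, 8], [0, 10]]

def chi_square_independence(obs):
    """Chi-square statistic and df for a test of independence."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (o - expected) ** 2 / expected
    df = (len(obs) - 1) * (len(obs[0]) - 1)
    return chi2, df

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable; closed form for even df."""
    k = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(k))

chi2, df = chi_square_independence(observed)
p = chi2_sf(chi2, df)
print(round(chi2, 4), df, round(p, 6))  # → 23.9609 6 0.000531
```

This matches the reported statistic, df, and p-value for the type of player; note, however, that the zero cells in this table technically violate the expected-count assumption listed above.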
Results of the Inferential Analyses
The results of the inferential analyses show that the p-values are lower than the 0.05 significance level for both the type of player and the advertising period. For the first null hypothesis, the aim was to determine whether there is a difference in the number of video game visits by type of player (Police/Thief); the p-value from the Chi-square test is 0.000531. For the second null hypothesis, the aim was to determine whether there is a difference in the number of video game visits by advertising period (advertising/no advertising); the Chi-square test for this second analysis yielded a p-value of 0.001048.
References
Lee, S., & Lee, D. K. (2018). What is the proper way to apply the multiple comparison
test? Korean Journal of Anesthesiology, 71(5), 353–360.
https://doi.org/10.4097/kja.d.18.00242
McHugh, M. L. (2013). The Chi-square test of independence. Biochemia Medica, 23(2), 143–
149. https://doi.org/10.11613/bm.2013.018
Rana, R., & Singhal, R. (2015). Chi-square test and its application in hypothesis testing. ResearchGate. https://www.researchgate.net/publication/277935900_Chisquare_test_and_its_application_in_hypothesis_testing
Trajkovski, V. (2016). How to select appropriate statistical test in scientific articles. Journal of Special Education and Rehabilitation, 17(3-4), 5-28. https://doi.org/10.19057/jser.2016.7
Nihan, S. T. (2020). Karl Pearson's chi-square tests. Educational Research and Reviews, 15(9), 575-580. https://doi.org/10.5897/err2019.3817
EDITORIAL
HOW TO SELECT APPROPRIATE STATISTICAL TEST IN SCIENTIFIC ARTICLES
Vladimir TRAJKOVSKI
Journal of Special Education and Rehabilitation
Institute of Special Education and Rehabilitation
Faculty of Philosophy
Skopje, Republic of Macedonia
Received: 08.07.2016; Accepted: 20.07.2016

Abstract
Statistics is the mathematical science dealing with the collection, analysis, interpretation, and presentation of masses of numerical data in order to draw relevant conclusions. It is a form of mathematical analysis that uses quantified models, representations, and synopses for a given set of experimental data or real-life studies. Students and young researchers in the biomedical sciences and in special education and rehabilitation often declare that they chose to enroll in their study program because they lack knowledge of or interest in mathematics. This is a sad statement, but there is much truth in it. The aim of this editorial is to help young researchers select the statistics, statistical techniques, and statistical software appropriate for the purposes and conditions of a particular analysis. The most important statistical tests are reviewed in the article.
Corresponding address:
Vladimir TRAJKOVSKI
Journal of Special Education and Rehabilitation
“Ss Cyril and Methodius” University
Faculty of Philosophy
Institute of Special Education and Rehabilitation
Bull. Goce Delchev 9A 1000 Skopje
Republic of Macedonia
E-mail: vladotra@fzf.ukim.edu.mk
JOURNAL OF SPECIAL EDUCATION AND REHABILITATION 2016; 17(3-4): 5-28. DOI: 10.19057/jser.2016.7
Knowing how to choose the right statistical test is an important asset and decision in research data processing and in the writing of scientific papers. Young researchers and authors should know how to choose and how to use statistical methods. A competent researcher needs knowledge of statistical procedures; that might include an introductory statistics course, and it most certainly includes using a good statistics textbook. For this purpose, Statistics needs to be returned as a mandatory subject in the curriculum of the Institute of Special Education and Rehabilitation at the Faculty of Philosophy in Skopje. Young researchers need additional courses in statistics, and they need to train themselves to use statistical software in an appropriate way.
Keywords: statistical test selection, statistics, scientific article, statistical software
Introduction
Statistics is the mathematical science dealing with the collection, analysis, interpretation, and presentation of masses of numerical data in order to draw relevant conclusions. It is a form of mathematical analysis that uses quantified models, representations, and synopses for a given set of experimental data or real-life studies. Statistics is used in several different disciplines (both scientific and non-scientific) to make decisions and draw conclusions based on data (1).
Students and young researchers in the biomedical sciences and in special education and rehabilitation often declare that they chose to enroll in their study program because they lack knowledge of or interest in mathematics. This is a sad statement, but there is much truth in it. They often do not know how to perform the statistical processing of the data for their undergraduate, master's, and doctoral theses, and so they seek help from a statistician, for which they have to pay a certain amount of money. There
are very often wrongly selected statistical methods in these theses, which then lead to erroneous conclusions. Selecting the right statistical test can be a huge problem for young researchers. In research, meaningful conclusions can only be drawn from data collected with a valid scientific design and analyzed using appropriate statistical tests.
Regarding the selection of a statistical test, the most important question is: "what is the main study hypothesis?" In some cases there is no hypothesis; the investigator just wants to "see what is there". For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, testing some presupposed relationship) and which are exploratory (suggested by the data) (2).
In research studies, wrong statistical tests appear in many forms, such as the use of a paired test for unpaired data, the use of parametric statistical tests for data that do not follow the normal distribution, or the use of statistical tests incompatible with the type of data (3).
The availability of different types of statistical software makes performing statistical tests easy, but selecting the appropriate statistical test is still a problem. A systematic step-by-step approach is the best way to decide how to analyze data. It is recommended that you follow these steps (4):
• Specify the question you are asking.
• Put the question in the form of a statistical null hypothesis and an alternate hypothesis.
• Determine which variables are relevant to the question.
• Determine what kind of variable each
one is.
• Design a study that controls or randomizes the confounding variables.
• Based on the number of variables, the kinds of variables, the expected fit to the parametric assumptions, and the hypothesis to be tested, choose the best statistical test to use.
• If possible, do a power analysis to determine a good sample size for the study.
• Do the study.
• Examine the data to see if it meets the assumptions of the statistical test you chose. If it doesn't, choose a more appropriate test.
• Apply the statistical test you chose, and interpret the results.
• Show your results effectively, usually with a table or a figure.
Marusteri and Bacarea mention other things we should keep in mind when analyzing the data from a study:
• a decent understanding of some basic statistical terms and concepts;
• some knowledge of a few aspects of the data collected during the research/experiment (e.g., the types of data we have (nominal, ordinal, interval, or ratio), how the data are organized, how many study groups we have (usually at least experimental and control), whether the groups are paired or unpaired, and whether the sample(s) are drawn from a normally distributed/Gaussian population);
• a good understanding of the goal of our statistical analysis;
• parsing the entire statistical protocol in a well-structured, decision-tree/algorithmic manner, in order to avoid mistakes (5).
The aim of this editorial is to help young researchers select the statistics, statistical techniques, and statistical software appropriate for the purposes and conditions of a particular analysis. Some of these steps are explained in the following text.
Types of scales
Before we can conduct a statistical analysis,
we need to measure our dependent variable.
Exactly how the measurement is carried out
depends on the type of variable involved in
the analysis. Different types are measured
differently. Although procedures for
measurement differ in many ways, they can
be classified using a few fundamental
categories. In a given category, all of the
procedures share some properties that are
important to know about. There are four
types of scales.
Nominal scales
When measuring using a nominal scale, one
simply names or categorizes responses.
Gender, marital status, handedness, favorite
color, and religion are examples of
variables measured on a nominal scale. The
essential point about nominal scales is that
they do not imply any ordering among the
responses. For example, when classifying
people according to their favorite color,
there is no sense in which red is placed
“ahead of” yellow. Responses are merely
categorized. Nominal scales embody the
lowest level of measurement (6).
Ordinal scales
A researcher wishing to measure parents' satisfaction with the treatment of their child in a regular classroom might ask them to describe their feelings as either "very dissatisfied," "somewhat dissatisfied," "somewhat satisfied," or "very satisfied." The items in this scale are ordered, ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales allow comparisons of the degree to which two subjects possess the dependent variable. For example, our satisfaction ordering makes it meaningful to assert that one parent is more satisfied than another with their child's treatment. Such an assertion reflects the first person's use of a
verbal label that comes later in the list than
the label chosen by the second person. On the
other hand, ordinal scales fail to capture
important information that will be present in
the other scales we examine. In particular, the
difference between two levels of an ordinal
scale cannot be assumed to be the same as the
difference between two other levels. In a
satisfaction scale, for example, the difference
between the responses “very dissatisfied” and
“somewhat dissatisfied” is probably not
equivalent to the difference between
“somewhat dissatisfied” and “somewhat
satisfied.” Nothing in our measurement
procedure allows us to determine whether the
two differences reflect the same difference in
psychological satisfaction (6).
Interval scales
Interval scales are numerical scales, such as age (years), weight (kg), or bone length (cm), in which intervals have the same interpretation throughout. Interval data have a meaningful order and also have the quality that equal intervals between measurements represent equal changes in the quantity being measured. However, these types of data have no natural zero. An example is the Celsius temperature scale: because the Celsius scale has no natural zero, we cannot say that 50°C is twice as hot as 25°C. On an interval scale, the zero point can be chosen arbitrarily. IQ scores are also interval data, as they have no natural (absolute) zero (7).
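The Celsius point can be made concrete with a short sketch. The conversion helper below is our own illustration, not something from the text; the idea is that ratios only become meaningful once values sit on a true-zero (ratio) scale such as Kelvin:

```python
# Interval scale: Celsius has an arbitrary zero, so ratios of Celsius
# values are not meaningful. Kelvin has a true zero, so ratios are.

def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (a true-zero, ratio scale)."""
    return celsius + 273.15

naive_ratio = 50 / 25                                       # 2.0, but misleading
true_ratio = celsius_to_kelvin(50) / celsius_to_kelvin(25)  # about 1.08

print(f"naive Celsius ratio: {naive_ratio:.2f}")
print(f"Kelvin ratio:        {true_ratio:.2f}")
```

In absolute thermodynamic terms, 50°C is only about 8% "hotter" than 25°C, not twice as hot.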
Ratio scales
The ratio scale of measurement is the most
informative scale. It is an interval scale with
the additional property that its zero position
indicates the absence of the quantity being
measured. You can think of a ratio scale as
the three earlier scales rolled up in one.
Like a nominal scale, it provides a name or
category for each object (the numbers serve
as labels). Like an ordinal scale, the objects
are ordered (in terms of the ordering of the
numbers). Like an interval scale, the same
difference at two places on the scale has the
same meaning. And in addition, the same
ratio at two places on the scale also carries
the same meaning. An example of a ratio scale
is the amount of money you have in your
pocket right now (500 denars, 1000 denars,
etc.). Money is measured on a ratio scale
because, in addition to having the properties
of an interval scale, it has a true zero point:
if you have zero money, this implies the
absence of money. Since money has a true
zero point, it makes sense to say that
someone with 1000 denars has twice as
much money as someone with 500 denars
(or that Mark Zuckerberg has a million
times more money than you do) (6).
Normal distribution or not
This is another issue in the selection of the right statistical test. If you know the type of data (nominal, ordinal, interval, or ratio) and the distribution of the data (normal or non-normal), selecting the statistical test is very easy. There is no need to check the distribution in the case of ordinal and nominal data; the distribution should only be checked in the case of ratio and interval data. If your data follow the normal distribution, a parametric statistical test should be used; nonparametric tests should only be used when the normal distribution is not followed. There are various methods for checking normality, such as plotting a histogram, plotting a box-and-whisker plot, plotting a Q-Q plot, measuring skewness and kurtosis, and using a formal statistical test of normality (Kolmogorov-Smirnov test, Shapiro-Wilk test, etc.). Formal statistical tests like the Kolmogorov-Smirnov and Shapiro-Wilk tests are used frequently to check the distribution of data. All these tests are based on the null hypothesis that the data are taken from a population that follows the normal distribution. The P value is compared with the alpha level: if the P value is less than 0.05, the data do not follow the normal distribution, and a nonparametric test should be used. If the sample size is small, the chances of a non-normal distribution are increased (7).
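As a sketch, the Shapiro-Wilk decision rule described above can be run with SciPy (assuming `scipy` and `numpy` are available; the data here are simulated, not from any study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
roughly_normal = rng.normal(loc=50, scale=10, size=200)  # simulated Gaussian data
clearly_skewed = rng.exponential(scale=10, size=200)     # simulated skewed data

for name, sample in [("normal-ish", roughly_normal), ("skewed", clearly_skewed)]:
    result = stats.shapiro(sample)  # H0: the sample comes from a normal population
    choice = "parametric test" if result.pvalue >= 0.05 else "nonparametric test"
    print(f"{name}: W = {result.statistic:.3f}, p = {result.pvalue:.4f} -> {choice}")
```

Keep in mind that with very small samples these tests have little power, so a non-significant result does not by itself prove normality.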
Parametric and non-parametric
procedures
Parametric statistical procedures rely on
assumptions about the shape of the
distribution (assume a normal distribution) in
the underlying population and about the form
or parameters (means and standard
deviations) of the assumed distribution.
Nonparametric statistical procedures rely on
no or few assumptions about the shape or
parameters of the population distribution from
which the sample was drawn (8). Nonparametric methods are typically less
powerful and less flexible than their
parametric counterparts. Parametric methods
are preferred if the assumptions can be
justified. Sometimes a transformation can be
applied to the data to satisfy the assumptions,
such as log transformation (9). Table 1 shows
the use of parametric and non-parametric
statistical methods.
Table 1. Parametric vs non-parametric methods

Characteristic                   | Parametric                         | Non-parametric
Assumed distribution             | Normal                             | Any
Assumed variance                 | Homogeneous                        | Any
Typical data                     | Ratio or interval                  | Ordinal or nominal
Dataset relationships            | Independent                        | Any
Usual central measure            | Mean                               | Median
Benefits                         | Can draw more conclusions          | Simplicity; less affected by outliers
Tests                            | Choosing parametric test           | Choosing non-parametric test
Correlation test                 | Pearson                            | Spearman
Independent measures, 2 groups   | Independent-measures t-test        | Mann-Whitney test
Independent measures, >2 groups  | One-way independent-measures ANOVA | Kruskal-Wallis test
Repeated measures, 2 conditions  | Matched-pair t-test                | Wilcoxon test
Repeated measures, >2 conditions | One-way repeated measures ANOVA    | Friedman's test
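For instance, the "independent measures, 2 groups" row of the table maps onto SciPy as follows. This is an illustrative sketch with made-up data, assuming `scipy` and `numpy` are installed:

```python
import numpy as np
from scipy import stats

# Two independent groups (hypothetical measurements).
group_a = np.array([23.1, 25.4, 24.8, 26.0, 22.9, 25.5, 24.1, 23.8])
group_b = np.array([27.2, 28.9, 26.5, 29.1, 27.8, 28.0, 26.9, 28.4])

# Parametric choice (normal, ratio/interval data): independent-measures t-test.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric counterpart (ordinal or non-normal data): Mann-Whitney test.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(f"t-test:       t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {u_p:.4f}")
```

In practice only one of the two tests would be reported, chosen after checking the assumptions in the left-hand column of the table.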
Arithmetic Mean (or average): a measure of
location for a batch of data values; the sum of
all data values divided by the number of
elements in the distribution. Its accompanying
measure of spread is usually the standard
deviation. Unlike the median and the mode,
it is not appropriate to use the mean to
characterize a skewed distribution.
Median is another measure of location, just like the mean: the value that divides the frequency distribution in half when all data values are listed in order. It is insensitive to small numbers of extreme scores in a distribution. Therefore, it is the preferred measure of central tendency for a skewed distribution (in which the mean would be biased) and is usually paired with the interquartile range (dQ) as the accompanying measure of spread.
Interquartile range (dQ) is a measure of spread and is the counterpart of the standard deviation for skewed distributions. dQ is the distance between the upper and lower quartiles (QU − QL).
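A quick NumPy illustration with hypothetical right-skewed values shows why the median and dQ are preferred here: they are barely moved by the extreme score, whereas the mean is pulled toward it.

```python
import numpy as np

# Right-skewed, hypothetical data: one extreme score (40) inflates the mean.
data = np.array([1, 2, 2, 3, 4, 5, 6, 9, 15, 40])

q_lower, q_upper = np.percentile(data, [25, 75])
dq = q_upper - q_lower  # interquartile range: QU - QL

print(f"mean   = {data.mean():.1f}")  # pulled up by the outlier
print(f"median = {np.median(data)}")
print(f"dQ     = {dq}")
```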
Variance is a numerical value used to
indicate how widely individuals in a group
vary. If individual observations vary greatly
from the group mean, the variance is big; and
vice versa. It is important to distinguish
between the variance of a population and the
variance of a sample. They have different
notation, and they are computed differently.
The variance of a population is denoted by σ²,
and the variance of a sample by s².
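In NumPy the two conventions are selected with the `ddof` argument; a small sketch with made-up numbers:

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])

population_variance = np.var(data, ddof=0)  # sigma^2: divide by n
sample_variance = np.var(data, ddof=1)      # s^2: divide by n - 1

print(f"sigma^2 = {population_variance:.4f}")  # 2.9167
print(f"s^2     = {sample_variance:.4f}")      # 3.5000
```

The n − 1 (sample) version is the usual choice when the data are a sample used to estimate the variance of a larger population.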
Standard deviation (SD) is a measure of spread (scatter) of a set of data. Unlike variance, which is expressed in squared units of measurement, the SD is expressed in the same units as the original data. It is calculated from the deviations between each data value and the sample mean; it is the square root of the variance. For different purposes, n (the total number of values) or n-1 may be used in computing the variance/SD. If you have an SD calculated by dividing by n and want to convert it to an SD corresponding to a denominator of n-1, multiply the result by the square root of n/(n-1). If a distribution's SD is greater than its mean, the mean is inadequate as a representative measure of central tendency. For normally distributed data, approximately 68% of the distribution falls within ±1 SD of the mean, 95% within ±2 SD, and 99.7% within ±3 SD (the empirical rule).
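Both the n to n-1 conversion and the empirical rule are easy to verify numerically; a sketch with small made-up data and simulated normal draws, assuming `numpy` is available:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(data)

sd_n = np.std(data, ddof=0)                 # SD with denominator n
sd_converted = sd_n * np.sqrt(n / (n - 1))  # multiply by sqrt(n/(n-1))
sd_n1 = np.std(data, ddof=1)                # SD with denominator n-1, directly

print(sd_n, sd_converted, sd_n1)  # sd_converted equals sd_n1

# Empirical rule on a large simulated normal sample:
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.mean(np.abs(x) <= 1))  # close to 0.68
print(np.mean(np.abs(x) <= 2))  # close to 0.95
```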
Standard error (SE), or as it ...