Chapter 4 Psychometric Validity: Basic Concepts

Screencasted Lecture Link

The focus of this lecture is to provide an introduction to validity. This includes understanding some of the concerns of validity, different aspects of validity, and factors as they affect validity coefficients.

4.1 Navigating this Lesson

There is just over one hour of lecture. If you work through the materials with me, plan for an additional hour.

While the majority of R objects and data you will need are created within the R script that sources the chapter, occasionally there are some that cannot be created from within the R framework. Additionally, sometimes links fail. All original materials are provided at the GitHub site that hosts the book. More detailed guidelines for ways to access all these materials are provided in the OER’s introduction

4.1.1 Learning Objectives

Focusing on this week’s materials, make sure you can:

Distinguish between different types of validity based on short descriptions.
Compute and interpret validity coefficients.
Evaluate the incremental validity of an instrument-of-interest.
Define and interpret the standard error of estimate.
Develop a rationale that defends importance of establishing the validity of a measuring instrument.

4.1.2 Planning for Practice

In each of these lessons I provide suggestions for practice that allow you to select one or more problems that are graded in difficulty. The least complex is to change the random seed in the research and rework the problem demonstrated in the lesson. The results should map onto the ones obtained in the lecture.

The second option involves utilizing one of the simulated datasets available in this OER. The last lesson in the OER contains three simulations that could be used for all of the statistics-based practice suggestions. At this very first lesson which involves statistics, I strongly encourage you to select one of these sets and use it for all of the statistics-based homework. Working a single instrument “all the way through” a psychometric evaluation is a great way to (a) understand the psychometric evaluation workflow and (b) get to know an instrument.

Alternatively, Lewis and Neville’s (2015) Gendered Racial Microaggressions Scale for Black Women will be used in the lessons on exploratory factor analysis; Keum et al.’s Gendered Racial Microaggressions Scale for Asian American Women (Keum et al., 2018) will be used in the lessons on confirmatory factor analysis; and Conover et al.’s (2017) Ableist Microaggressions Scale is used in the lesson on invariance testing. Any of these would be suitable for the PCA and PAF homework assignments.

As a third option, you are welcome to use data to which you have access and is suitable for validity testing. In any case, please select a scale that has item-level data for which there is a theorized total scale score as well as two or more subscales (three subscales is ideal). With each of these options, plan to:

Create a correlation matrix that includes the instrument-of-interest and the variables that will have varying degrees of relation*.
With convergent and discriminant validity in mind, interpret the validity coefficients; this should include an assessment about whether the correlation coefficients (at least two different pairings) are statistically significantly different from each other.
With at least three variables, evaluate the degree to which the instrument demonstrates incremental validity

4.1.3 Readings & Resources

In preparing this chapter, I drew heavily from the following resource(s). Other resources are cited (when possible, linked) in the text with complete citations in the reference list.

Jhangiani, R. S., Chiang, I.-C. A., Cuttler, C., & Leighton, D. C. (2019). Reliability and Validity. In Research Methods in Psychology. https://doi.org/10.17605/OSF.IO/HF7DQ
Clark, L. A. & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319.
- In this manuscript, Clark and Watson (1995) create a beautiful blend of theoretical issues and practical suggestions for creating measures that evidence construct validity. From the practical perspective, the authors first guide potential scale constructors through the literature review and creating an item pool (including tips on writing items). The authors address structural validity by first beginning with strategies for constructing the test. In this section, the authors revisit the issue of dimensionality (i.e., alpha vs. factor analysis). Finally, the authors look at initial data collection (addressing sample size) and psychometric evaluation.

4.1.4 Packages

The packages used in this lesson are embedded in this code. When the hashtags are removed, the script below will (a) check to see if the following packages are installed on your computer and, if not (b) install them.

# will install the package if not already installed
# if(!require(tidyverse)){install.packages('tidyverse')}
# if(!require(MASS)){install.packages('MASS')}
# if(!require(psych)){install.packages('psych')}

4.2 Research Vignette

This lesson provides descriptions of numerous pathways for establishing an instrument’s validity. In fact, best practices involving numerous demonstrations of validity. Across several lessons, we will rework several of the correlational analyses reported in the research vignette. For this lesson in particular, the research vignette allows demonstrations of convergent/discriminant validity and incremental validity.

The research vignette for this lesson is the development and psychometric evaluation of the Perceptions of the LGBTQ College Campus Climate Scale (Szymanski & Bissonette, 2020). The scale is six items with responses rated on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). Higher scores indicate more negative perceptions of the LGBTQ campus climate. Szymanski and Bissonette have suggested that the psychometric evaluation supports using the scale in its entirety or as subscales composed of the following items:

College Response to LGBTQ students:
- My university/college is cold and uncaring toward LGBTQ students.
- My university/college is unresponsive to the needs of LGBTQ students.
- My university/college provides a supportive environment for LGBTQ students. [un]supportive; must be reverse-scored
LGBTQ Stigma:
- Negative attitudes toward LGBTQ persons are openly expressed on my university/college campus.
- Heterosexism, homophobia, biphobia, transphobia, and cissexism are visible on my university/college campus.
- LGBTQ students are harassed on my university/college campus.

A preprint of the article is available at ResearchGate.

Because data is collected at the item level (and I want this resource to be as practical as possible), I have simulated the data for each of the scales utilized in the research vignette at the item level. Simulating the data involved using factor loadings, means, standard deviations, and correlations between the scales. Because the simulation will produce “out-of-bounds” values, the code below re-scales the scores into the range of the Likert-type scaling and rounds them to whole values.

Five additional scales were reported in the Szymanski and Bissonette article (2020). Unfortunately, I could not locate factor loadings for all of them; and in two cases, I used estimates from a more recent psychometric analysis. When the individual item and their factor loadings were known, I assigned names based on item content (e.g., “lo_energy”) rather than using item numbers (e.g., “PHQ4”). When I am doing psychometric analyses, I prefer item-level names so that I can quickly see (without having to look up the item content) how the items are behaving. While the focus of this series of chapters is on the LGBTQ Campus Climate scale, this simulated data might be useful to you in one or more of the suggestions for practice (e.g., examining the psychometric characteristics of one or the other scales). The scales, their original citation, and information about how I simulated data for each are listed below.

Sexual Orientation-Based Campus Victimization Scale (Herek, 1993) is a 9-item item scale with Likert scaling ranging from 0 (never) to 3 (two or more times). Because I was not able to locate factor loadings from a psychometric evaluation, I simulated the data by specifying a 0.8 as a standardized factor loading for each of the items.
College Satisfaction Scale (Helm et al., 1998) is a 5-item scale with Likert scaling ranging from 1 (strongly disagree) to 7 (strongly agree). Higher scores represent greater college satisfaction. Because I was not able to locate factor loadings from a psychometric evaluation, I simulated the data by specifying a 0.8 as a standardized factor loading for each of the items.
Institutional and Goals Commitment (Pascarella & Terenzini, 1980) is a 6-item subscale from a 35-item measure assessing academic/social integration and institutional/goal commitment (5 subscales total). The measure had with Likert scaling ranging from 1 (strongly disagree) to 5 (strongly agree). Higher scores on the institutional and goals commitment subscale indicate greater intentions to persist in college. Data were simulated using factor loadings in the source article.
GAD-7 (Spitzer et al., 2006) is a 7-item scale with Likert scaling ranging from 0 (not at all) to 3 (nearly every day). Higher scores indicate more anxiety. I simulated data by estimating factor loadings from Brattmyr et al. (2022).
PHQ-9 (Kroenke et al., 2001) is a 9-item scale with Likert scaling ranging from 0 (not at all) to 3 (nearly every day). Higher scores indicate higher levels of depression. I simulated data by estimating factor loadings from Brattmyr et al. (2022).

#Entering the intercorrelations, means, and standard deviations from the journal article

Szymanski_generating_model <- '
        #measurement model
        CollegeResponse  =~ .88*cold + .73*unresponsive + .73*supportive 
        Stigma =~ .86*negative + .76*heterosexism + .71*harassed
        Victimization =~ .8*Vic1 + .8*Vic2 + .8*Vic3 + .8*Vic4 + .8*Vic5 + .8*Vic6 + .8*Vic7 + .8*Vic8 + .8*Vic9
        CollSat =~ .8*Sat1 + .8*Sat2 + .8*Sat3 + .8*Sat4 + .8*Sat5
        Persistence =~ .69*graduation_importance + .63*right_decision + .62*will_register + .59*not_graduate + .45*undecided + .44*grades_unimportant
        Anxiety =~ .851*nervous + .887*worry_control + .894*much_worry + 674*cant_relax + .484*restless + .442*irritable + 716*afraid
        Depression =~ .798*anhedonia + .425*down +  .591*sleep +  .913*lo_energy +  .441*appetite +  .519*selfworth +  .755*concentration +  .454*too_slowfast + .695*s_ideation
   
        #Means
         CollegeResponse ~ 2.71*1
         Stigma ~3.61*1
         Victimization ~ 0.11*1
         CollSat ~ 5.61*1
         Persistence ~ 4.41*1
         Anxiety ~ 1.45*1
         Depression ~1.29*1

         
        #Correlations
         CollegeResponse ~~ .58*Stigma
         CollegeResponse ~~ -.25*Victimization
         CollegeResponse ~~  -.59*CollSat
         CollegeResponse ~~  -.29*Persistence
         CollegeResponse ~~  .17*Anxiety
         CollegeResponse ~~  .18*Depression
         
         Stigma ~~ .37*Victimization
         Stigma ~~  -.41*CollSat
         Stigma ~~  -.19*Persistence
         Stigma ~~  .27*Anxiety
         Stigma ~~  .24*Depression
         
         Victimization ~~  -.22*CollSat
         Victimization ~~  -.04*Persistence
         Victimization ~~  .23*Anxiety
         Victimization ~~  .21*Depression
         
         CollSat ~~  .53*Persistence
         CollSat ~~  -.29*Anxiety
         CollSat ~~  -.32*Depression
         
         Persistence ~~  -.22*Anxiety
         Persistence ~~  -.26*Depression
         
         Anxiety ~~  .76*Depression
        '

set.seed(240218)
dfSzy <- lavaan::simulateData(model = Szymanski_generating_model,
                              model.type = "sem",
                              meanstructure = T,
                              sample.nobs=646,
                              standardized=FALSE)

#used to retrieve column indices used in the rescaling script below
col_index <- as.data.frame(colnames(dfSzy))

#The code below loops through each column of the dataframe and assigns the scaling accordingly
#Rows 1 thru 6 are the Perceptions of LGBTQ Campus Climate Scale
#Rows 7 thru 15 are the Sexual Orientation-Based Campus Victimization Scale
#Rows 16 thru 20 are the College Satisfaction Scale
#Rows 21 thru 26 are the Institutional and Goals Commitment Scale 
#Rows 27 thru 33 are the GAD7
#Rows 34 thru 42 are the PHQ9

for(i in 1:ncol(dfSzy)){  
  if(i >= 1 & i <= 6){   
    dfSzy[,i] <- scales::rescale(dfSzy[,i], c(1, 7))
  }
    if(i >= 7 & i <= 15){   
    dfSzy[,i] <- scales::rescale(dfSzy[,i], c(0, 3))
    }
      if(i >= 16 & i <= 20){  
    dfSzy[,i] <- scales::rescale(dfSzy[,i], c(1, 7))
      }
        if(i >= 21 & i <= 26){  
    dfSzy[,i] <- scales::rescale(dfSzy[,i], c(1, 5))
        }
  if(i >= 27 & i <= 33){  
    dfSzy[,i] <- scales::rescale(dfSzy[,i], c(0, 3))
  }
    if(i >= 34 & i <= 42){  
    dfSzy[,i] <- scales::rescale(dfSzy[,i], c(0, 3))
  }
}

#rounding to integers so that the data resembles that which was collected
library(tidyverse)
dfSzy <- dfSzy %>% round(0) 

#quick check of my work
#psych::describe(dfSzy) 

#Reversing the supportive item on the Perceptions of LGBTQ Campus Climate Scale so that the exercises will be consistent with the format in which the data was collected

dfSzy <- dfSzy %>%
  dplyr::mutate(supportiveNR = 8 - supportive)

#Reversing three items on the Institutional and Goals Commitments scale so that the exercises will be consistent with the format in which the data was collected

dfSzy <- dfSzy %>%
  dplyr::mutate(not_graduateNR = 8 - not_graduate)%>%
  dplyr::mutate(undecidedNR = 8 - undecided)%>%
  dplyr::mutate(grades_unimportantNR = 8 - grades_unimportant)

dfSzy <- dplyr::select(dfSzy, -c(supportive, not_graduate, undecided, grades_unimportant))

The optional script below will let you save the simulated data to your computing environment as either an .rds object (preserves any formatting you might do) or a.csv file (think “Excel lite”).

# to save the df as an .rds (think 'R object') file on your computer;
# it should save in the same file as the .rmd file you are working
# with saveRDS(dfSzy, 'SzyDF.rds') bring back the simulated dat from
# an .rds file dfSzy <- readRDS('SzyDF.rds')

# write the simulated data as a .csv write.table(dfSzy,
# file='SzyDF.csv', sep=',', col.names=TRUE, row.names=FALSE) bring
# back the simulated dat from a .csv file
dfSzy <- read.csv("SzyDF.csv", header = TRUE)

As we move into the lecture, allow me to provide a content advisory. Individuals who hold LGBTQIA+ identities are frequently the recipients of discrimination and harassment. If you are curious about why these items are considered to be stigmatizing or non-responsive, please do not ask a member of the LGBTQIA+ community to explain it to you; it is not their job to educate others on discrimination, harassment, and microaggressions. Rather, please read the article in its entirety. Additionally, resources such as The Trevor Project, GLSEN, and Campus Pride are credible sources of information for learning more.

4.3 Fundamentals of Validity

Validity (the classic definition) is the ability of a test to measure what it purports to measure. Supporting that definition are these notions:

Validity is extent of matching, congruence, or “goodness of fit” between the operational definition and concept it is supposed to measure.
An instrument is said to be valid if it taps the concept it claims to measure.
Validity is the appropriateness of the interpretation of the results of an assessment procedure for a given group of individuals, not to the procedure itself.
Validity is a matter of degree; it does not exist on an all-or-none basis.
Validity is always specific to some particular use or interpretation.
Validity is a unitary concept.
Validity involves an overall evaluative judgment.

Over the years (and, perhaps within each construct), validity has somewhat of an evolutionary path from a focus on content, to prediction, to theory and hypothesis testing.

When the focus is on content, we are concerned with the:

Assessment of what individuals had learned in specific content areas.
Relevance of its content (i.e., we compare the content to the content domain).

When the focus is on prediction, we are concerned with:

How different persons respond in a given situation (now or later).
The correlation coefficient between test scores (predictor) and the assessment of a criterion (performance in a situation)

A focus on theory and hypothesis testing adds:

A strengthened theoretical orientation.
A close linkage between psychological theory and verification through empirical and experimental hypothesis testing.
An emphasis on constructs in describing and understanding human behavior.

Constructs are broad categories, derived from the common features shared by directly observable behavioral variables. They are theoretical entities and not directly observable. Construct validity is at the heart of psychometric evaluation. We define construct validity as the fundamental and all-inclusive validity concept, insofar as it specifies what the test measures. Content and predictive validation procedures are among the many sources of information that contribute to the understanding of the constructs assessed by a test.

4.4 Validity Criteria

We have just stated that validity is an overall, evaluative judgment. Within that umbrella are different criteria by which we judge the validity of a measure. We casually refer to them as types, but each speaks to that unitary concept.

4.4.1 Content Validity

Content validity is concerned with the representativeness of the domain being assessed. Content validation procedures may differ depending on whether the test is in the educational/achievement context or if it is more of an attitude/behavioral survey.

In the educational/achievement context, content validation seeks to ensure the items on an exam are appropriate for the content domain being assessed.

A table of specifications is a two-way chart which indicates the instructionally relevant learning tasks to be measured. Percentages in the table indicate the relative degree of emphasis that each content area.

Let’s imagine that I was creating a table of specifications for items on a quiz for this very chapter. The columns represent the types of outcomes we might expect. The American Psychological Association often talks about KSAs (knowledge, skills, attitudes), so I will utilize those as a framework. You’ll notice that the number of items and percentages do not align mathematically. Because, in the exam, I would likely weight application items (e.g., “work the problem”) more highly than knowledge items (e.g., multiple choice, true/false), the relative weighting may differ.

Table of Specifications

Learning Objectives	Knowledge	Skills	Attitudes	% of test
Distinguish between different types of validity based on short descriptions.	6 items			30%
Compute and interpret validity coefficients.		2 items		15%
Evaluate the incremental validity of an instrument-of-interest.		1 item		20%
Define and interpret the standard error of estimate.	1 item			15%
Develop a rationale that defends importance of establishing the validity of a measuring instrument.			1 item	20%
TOTALS	7 items	3 items	1 item	100%

Subject matter experts (SMEs) are individuals chosen to evaluate items based on their degree of knowledge of the subject being assessed. If SMES are used, the researcher should:

Report how many SMEs and list their professional qualifications.
Report any directions the SMEs were given; if they were used to evaluate items, report the extent of agreement.

Empirical procedures for enhancing content validity of educational assessments may include:

Comparing item-level and total scores with grades; lower grades should get lower scores.
Analyzing individual errors.
Observing student work methods (have the students “think aloud” in front of an examiner).
Evaluating the role of speed, noting how many do not complete the test in the time allowed.
Correlating the scores with a reading comprehension test (if the exam is highly correlated, then it may be a test of reading and not another subject). Alternatively, if it is a reading comprehension test, give the student the questions (without the passage) to see how well they answered the questions on the basis of prior knowledge.

For surveys and tests outside of educational settings, content validation procedures ask, “Does the test cover a representative sample of the specified skills and knowledge?” and “Is test performance reasonably free from the influence of irrelevant variables?” Naturally, SMEs might be used.

An example of content validation from Industrial-Organizational Psychology is the job analysis which precedes the development of test for employee selection and classification. Not all tests require content analysis. In aptitude and personality tests we are probably more interested in other types of validity evaluation.

4.4.2 Face Validity: The “Un”validity

Face validity is concerned with the question, “How does an assessment look on the ‘face of it’?” Let’s imagine that on a qualification exam for electricians, a math item asks the electrician to estimate the amount of yarn needed to complete a project. The item may be more face valid if the calculation was with wire. Thus, face validity can often be improved by reformulating test items in terms that appear relevant and plausible for the context.

Face validity should never be regarded as a substitute for objectively determined validity. In contrast, it should not be assumed that when a (valid and reliable) test has been modified to increase its face validity, that its objective validity and reliability is unaltered. That is, it should be reevaluated.

4.4.3 Criterion-Related Validity

Criterion-related validity has to do with the test’s ability to predict an outcome (the criterion). If the criterion is something that occurs simultaneously, it is an assessment of concurrent validity; if it is in the future, it is an assessment of predictive validity.

A criterion is the “thing” that the test should, theoretically, be able to predict. That prediction could be occurring at the same time (concurrent validity) or at a future time (predictive validity). Regardless, the estimate of the criteria must be independent of the survey/assessment being evaluated. The table below provides examples of types of tests and concurrent and predictive validity criteria.

Type of Test	Concurrent Criteria Example	Predictive Criteria Example
A shorter (or cheaper) standardized achievement test	school grades, existing standardized tests	subsequent graduation/college admissions, cumulative GPA
Employee selection tests	decision made by a search committee	subsequent retention or promotion of the selected employee
Assessment of depression severity (shorter or cheaper)	diagnosis from a mental health professional; correlation with an established measure	inpatient hospitalization or act of self-harm

Contrasted groups is a specific type of criterion-related validity. Clearly differentiated groups (e.g., salesclerks versus executives; engineers versus musicians) are chosen to see if exam performance or profiles differ in predictable ways.

Criterion contamination occurs when test scores, themselves, are used to make decisions about the criteria. To prevent this:

No person who participates in the assignment of criterion ratings can have any knowledge of the examinee’s test scores.
The test scores must be kept strictly confidential.

There are a number of issues related to criterion-related validity.

Is the criterion choice appropriate?
- Criterion validity is only as good as the validity of the criterion to which one is making a comparison.
- In the 1980s and 1990s there was more attention in this area; that is critics questioned the quality of the criterion being used.
To what degree can the results of criterion-related validity be generalized?
- Most tests are developed (intentionally) for a local context, setting, or audience. Consequently, in the local context, the criterion-prediction sample is usually too small (i.e., 50 cases or less).
- Those who want to generalize the test to a broader population should evaluate the test in relation to the new purpose.
Is there a role for meta-analysis?
- Repeated validation studies of our tests, on different samples, results in a number of small-scale studies, each with their own validity coefficients.
- We can use meta-analytic procedures in reporting the results of validity coefficients when they are used for establishing criterion validity.

4.4.4 Construct Validity

Construct validity was introduced in 1954 in the first edition of APA’s testing standards and is defined as the extent to which the test may be said to measure a theoretical construct or trait. The overarching focus is on the role of psychological theory in test construction and the ability to formulate hypotheses that can be supported (or not) in the evaluation process. Construct validity is established by the accumulation of information from a variety of sources.

There are a number of sources that can be used to support construct validity.

4.4.5 Internal Consistency

In the next chapter, you will learn that internal consistency is generally considered to be an index of reliability. In the context of criterion-related validity, a goal is to ensure that the criterion is the total score on the test itself. To that end, some of the following could also support this aspect of validity:

Comparing high and low scorers. Items that fail to show a significantly greater proportion of “passes” in the upper than the lower group are considered invalid and are modified or eliminated.
Computing a biserial correlation between the item and total score.
Correlating the subtest score with the total score. Any subtest whose correlation with the total score is too low is eliminated.

Although some take issue with this notion, the degree of homogeneity (the degree to which items assess the same thing) has some bearing on construct validity. There is a tradeoff between items that measure a narrow slice of the construct definition (internal consistency estimates are likely to be higher) and those that sample the construct definition more broadly (internal consistency estimates are likely to be lower).

Admittedly, the contribution of internal consistency data is limited. In absence of external data, it tells us little about WHAT the test measures.

4.4.6 Structural Validity

4.4.6.1 Exploratory Factor Analysis

Exploratory factor analysis (EFA) is used to simplify the description of behavior by reducing the number of categories (factors or dimensions) to fewer than the number of items. In our research vignette the 6-item Perceptions of Campus Climate Scale will be represented by two factors (Szymanski & Bissonette, 2020) In instrument development, techniques like principal components analysis or principal axis factoring are used to identify clusters (latent factors) among items. We frequently treat these as scales and subscales.

Imagine the use of 20 tests to 300 people. There would be 190 correlations.

Irrespective of content, we can probably summarize the intercorrelations of tests with 5-6 factors.
When the clustering of tests includes vocabulary, analogies, opposites, and sentence completions, we might suggest a “verbal comprehension factor.”
Factorial validity is the correlation of the test with whatever is common to a group of tests or other indices of behavior. If our single test has a correlation of .66 with the factor on which it loads, then the “factorial validity of the new test as a measure of the common trait is .66.”

When EFA is utilized, the items are “fed” into an iterative process that analyzes the relations and “reveals” (or suggests – we are the ones who interpret the data) how many factors (think scales/subscales) and which items comprise them.

4.4.6.2 Confirmatory Factor Analysis

Confirmatory factor analysis (CFA) involves specifying, a priori, a proposed relationship of items, scales, and subscales and then testing its goodness of fit. In CFA (a form of structural equation modeling [SEM]), the latent variables (usually the higher order scales and total scale score) are positioned to cause the responses on the indicators/items.

Subsequent lessons provide examples of both EFA and CFA approaches to psychometrics.

4.4.7 Experimental Interventions

Construct validity is also supported by hypothesis testing and experimentation. If we expect that the construct assessed by the instrument is malleable (e.g., depression) and that an intervention could change it, then a random clinical trial that evaluated the effectiveness of an intervention (and it worked – depression scores declined) would simultaneously provide support for the intervention as well as the instrument.

4.4.8 Convergent and Discriminant Validity

In a psychometric evaluation, we will often administer our instrument-of-interest along with a battery of instruments that are more-and-less related. Convergent validity is supported when there are moderately high correlations between our tests and the instruments with which we expect moderately high correlations. In contrast, discriminant validity is established by low and/or non-significant correlations between our instrument-of-interest and instruments that should be unrelated. For example, we want a low and non-significant correlation between a quantitative reasoning test and scores on a reading comprehension test. Why? Because if the correlation is too high, the test cannot discriminate between reading comprehension and math.

There are no strict cut-offs to establish convergence or discrimination. We can even ask, “Could a correlation intended to support convergence be too high?” It is possible! Unless the instrument-of-interest offers advantages such as brevity or cost, then correlations that fall into the ranges of multicollinearity or singularity can indicate unnecessary duplication or redundancy.

In our research vignette, Szymanski and Bissonette (2020) conducted a correlation matrix that reports the bivariate relations between the LGBTQ Campus Climate full-scale as well as the College Response and Stigma subscales with measures that assess (a) LGBTQ victimization, (b) satisfaction with college, (c) persistence attitudes, and (d) anxiety, and (e) depression.

In order to produce this correlation matrix, we must first score each of the scales. In the items we prepared, the Perceptions of LGBTQ Campus Climate scale had one reverse-scored item. Similarly, the Institutional Goals and Commitments Scale had three reversed items. A first step in scoring is reversing these items.

The naming conventions that researchers use vary. I added an “NR” (for “needs reversing) to the original items so that I would remember to reverse-score them. I also am careful to reverse-score items into new variables. Otherwise, we risk getting confused about whether/not items are in their original or reversed formats.

Reverse-scoring the item is easily accomplished by subtracting the variable from “one plus” the scaling. Because both of these scales were on a 7-point scale, we will subtract the “NR” variables from 8.

# Reverse scoring the single item from the LGBTQ Campus Climate Scale
dfSzy <- dfSzy %>%
    dplyr::mutate(unsupportive = 8 - supportiveNR)

# Reversing three items on the Institutional and Goals Commitments
# scale

dfSzy <- dfSzy %>%
    dplyr::mutate(not_graduate = 8 - not_graduateNR) %>%
    dplyr::mutate(undecided = 8 - undecidedNR) %>%
    dplyr::mutate(grades_unimportant = 8 - grades_unimportantNR)

Next, we create scale and/or subscale scores. The sjstats::mean_n() function allows us to specify how many items (whole number) or what percentage of items should be present in order to get the mean. It is customary to require 75-80% of items to be present for scoring. Three-item variables might allow one missing (i.e., 66%). In the code below, I first make lists of the variables that belong in each scale and subscale, then I create the new variables.

# Making the list of variables
LGBTQ_Climate <- c("cold", "unresponsive", "unsupportive", "negative",
    "heterosexism", "harassed")
CollResponse <- c("cold", "unresponsive", "unsupportive")
Stigma <- c("negative", "heterosexism", "harassed")
Victimization <- c("Vic1", "Vic2", "Vic3", "Vic4", "Vic5", "Vic6", "Vic7",
    "Vic8", "Vic9")
CampSat <- c("Sat1", "Sat2", "Sat3", "Sat4", "Sat5")
Persist <- c("graduation_importance", "right_decision", "will_register",
    "not_graduate", "undecided", "grades_unimportant")
GAD7 <- c("nervous", "worry_control", "much_worry", "cant_relax", "restless",
    "irritable", "afraid")
PHQ9 <- c("anhedonia", "down", "sleep", "lo_energy", "appetite", "selfworth",
    "concentration", "too_slowfast", "s_ideation")

# Creating the new variables
dfSzy$LGBTQclimate <- sjstats::mean_n(dfSzy[, LGBTQ_Climate], 0.75)
dfSzy$CollegeRx <- sjstats::mean_n(dfSzy[, CollResponse], 0.66)
dfSzy$Stigma <- sjstats::mean_n(dfSzy[, Stigma], 0.66)
dfSzy$Victimization <- sjstats::mean_n(dfSzy[, Victimization], 0.8)
dfSzy$CampusSat <- sjstats::mean_n(dfSzy[, CampSat], 0.75)
dfSzy$Persistence <- sjstats::mean_n(dfSzy[, Persist], 0.8)
dfSzy$Anxiety <- sjstats::mean_n(dfSzy[, GAD7], 0.75)
dfSzy$Depression <- sjstats::mean_n(dfSzy[, PHQ9], 0.8)

# If the scoring code above does not work for you, try the format
# below which involves inserting to periods in front of the variable
# list. One example is provided. dfLewis$Belonging <-
# sjstats::mean_n(dfLewis[, ..Belonging_vars], 0.80)

A correlation matrix of the scaled scores allows us to compare our scale(s) of interest to others within its nomological net.

apaTables::apa.cor.table(dfSzy[c("LGBTQclimate", "CollegeRx", "Stigma",
    "Victimization", "CampusSat", "Persistence", "Anxiety", "Depression")],
    filename = "SzyCor.doc", table.number = 1, show.sig.stars = TRUE, landscape = TRUE)



Table 1 

Means, standard deviations, and correlations with confidence intervals
 

  Variable         M    SD   1            2            3           
  1. LGBTQclimate  4.00 0.63                                       
                                                                   
  2. CollegeRx     4.04 0.77 .83**                                 
                             [.81, .85]                            
                                                                   
  3. Stigma        3.96 0.76 .83**        .37**                    
                             [.80, .85]   [.31, .44]               
                                                                   
  4. Victimization 1.55 0.33 .01          -.17**       .20**       
                             [-.06, .09]  [-.25, -.10] [.13, .27]  
                                                                   
  5. CampusSat     4.24 0.70 -.49**       -.46**       -.35**      
                             [-.55, -.43] [-.52, -.40] [-.41, -.28]
                                                                   
  6. Persistence   3.03 0.42 -.21**       -.17**       -.17**      
                             [-.28, -.13] [-.25, -.10] [-.25, -.10]
                                                                   
  7. Anxiety       1.49 0.38 .17**        .12**        .17**       
                             [.10, .25]   [.04, .19]   [.09, .24]  
                                                                   
  8. Depression    1.52 0.29 .18**        .14**        .15**       
                             [.10, .25]   [.07, .22]   [.08, .23]  
                                                                   
  4            5            6            7         
                                                   
                                                   
                                                   
                                                   
                                                   
                                                   
                                                   
                                                   
                                                   
                                                   
                                                   
  -.17**                                           
  [-.25, -.10]                                     
                                                   
  -.04         .34**                               
  [-.11, .04]  [.27, .40]                          
                                                   
  .15**        -.20**       -.10*                  
  [.07, .23]   [-.28, -.13] [-.18, -.02]           
                                                   
  .15**        -.23**       -.10**       .54**     
  [.08, .23]   [-.30, -.15] [-.18, -.03] [.48, .59]
                                                   

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

Examination of these values follow some expected patterns. First, the LGBTQ climate score (i.e., the total scale score) is highly correlated with each of its subscales (College Response r = .83, p < 0.01; Stigma r = .83, p = 0.01). These strong correlations are somewhat misleading because half of the items on the total scale are the same items on each of the subscales. The correlation between the two subscales is still statistically significant, but much lower (r = 0.37, p = 0.01).

Convergent and discriminant validity are of interest when we compare the LGBTQ Climate total scale score and the College Response and Stigma subscales with the additional measures. Regarding the total LGBTQ Climate score, a very strong correlation was observed with campus satisfaction (r = -0.49, p < 0.01); less strong correlations were observed with and persistence (r = -0.21, p < 0.01), anxiety (r = 0.17, p < 0.01), and depression (r = 0.18, p < 0.01). Recalling that higher scores on the LGBTQ Campus Climate score indicate a negative climate, we see that as the LGBTQ campus climate becomes increasingly stigmatizing and nonresponsive, students experience lower overall campus satisfaction and are less likely to persist at that institution. The correlation between LGBTQ Campus Climate and victimization was non-significant (r = 0.01, p > 0.05).

In assessing patterns of convergent and discriminant validity, the researcher would also take the time to map out the subscales (i.e., College Response, Stigma) with the additional measures.

4.4.8.1 Determining Statistically Significant Differences Between Correlations

Without a formal test, it is inappropriate for researchers to declare that one correlation is stronger than another. The package cocor allows the comparisons of dependent (i.e., all respondents are from the same sample) and independent (i.e., correlations are compared between two different samples) where the correlations themselves can be overlapping (i.e., with one common variable) or non-overlapping (i.e., the variables in both correlations are different).

Because all of the correlations we computed are within the same sample, they are dependent. When assessing convergent and discriminant validity it is common to ask if the correlations between the additional measures are different between the subscales of the focal measure. Results could support the notion that the subscales are related and yet different. An example of this might be comparing the correlations between victimization with the College Response subscale (r = -.17, p < 0.01) and victimization with Stigma subscale (r = 0.20, p < 0.01). This test would be overlapping because the victimization variable is in common. Here’s the code:

cocor::cocor(formula = ~CollegeRx + Victimization | Stigma + Victimization,
    data = dfSzy)


  Results of a comparison of two overlapping correlations based on dependent groups

Comparison between r.jk (Victimization, CollegeRx) = -0.1741 and r.jh (Victimization, Stigma) = 0.2015
Difference: r.jk - r.jh = -0.3756
Related correlation: r.kh = 0.3734
Data: dfSzy: j = Victimization, k = CollegeRx, h = Stigma
Group size: n = 646
Null hypothesis: r.jk is equal to r.jh
Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
Alpha: 0.05

pearson1898: Pearson and Filon's z (1898)
  z = -8.9455, p-value = 0.0000
  Null hypothesis rejected

hotelling1940: Hotelling's t (1940)
  t = -9.0342, df = 643, p-value = 0.0000
  Null hypothesis rejected

williams1959: Williams' t (1959)
  t = -9.0340, df = 643, p-value = 0.0000
  Null hypothesis rejected

olkin1967: Olkin's z (1967)
  z = -8.9455, p-value = 0.0000
  Null hypothesis rejected

dunn1969: Dunn and Clark's z (1969)
  z = -8.7137, p-value = 0.0000
  Null hypothesis rejected

hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
  t = -9.0341, df = 643, p-value = 0.0000
  Null hypothesis rejected

steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
  z = -8.6124, p-value = 0.0000
  Null hypothesis rejected

meng1992: Meng, Rosenthal, and Rubin's z (1992)
  z = -8.5080, p-value = 0.0000
  Null hypothesis rejected
  95% confidence interval for r.jk - r.jh: -0.4678 -0.2926
  Null hypothesis rejected (Interval does not include 0)

hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
  z = -8.6124, p-value = 0.0000
  Null hypothesis rejected

zou2007: Zou's (2007) confidence interval
  95% confidence interval for r.jk - r.jh: -0.4568 -0.2921
  Null hypothesis rejected (Interval does not include 0)

Fisher’s z-test (z = -8.6124, p < 0.001) tells us that the correlations are statistically significantly different from each other; Zhou’s confidence intervals provide the CI for the size of the difference between the two correlations. That is, the difference could be as large as -0.4568 or as small as -0.2921.

Another type of correlation comparison is with a the total and/or subscales, looking at the relative magnitude of their correlation with different variables. For example, we might wish to ask if the LGBTQ Campus Climate total scale is different degrees of correlation with anxiety (r = 0.17, p < 0.01) and depression (r = .18, p < 0.01).

cocor::cocor(formula = ~LGBTQclimate + Anxiety | LGBTQclimate + Depression,
    data = dfSzy)


  Results of a comparison of two overlapping correlations based on dependent groups

Comparison between r.jk (LGBTQclimate, Anxiety) = 0.1724 and r.jh (LGBTQclimate, Depression) = 0.1795
Difference: r.jk - r.jh = -0.0071
Related correlation: r.kh = 0.5387
Data: dfSzy: j = LGBTQclimate, k = Anxiety, h = Depression
Group size: n = 646
Null hypothesis: r.jk is equal to r.jh
Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
Alpha: 0.05

pearson1898: Pearson and Filon's z (1898)
  z = -0.1914, p-value = 0.8482
  Null hypothesis retained

hotelling1940: Hotelling's t (1940)
  t = -0.1911, df = 643, p-value = 0.8485
  Null hypothesis retained

williams1959: Williams' t (1959)
  t = -0.1909, df = 643, p-value = 0.8487
  Null hypothesis retained

olkin1967: Olkin's z (1967)
  z = -0.1914, p-value = 0.8482
  Null hypothesis retained

dunn1969: Dunn and Clark's z (1969)
  z = -0.1909, p-value = 0.8486
  Null hypothesis retained

hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
  t = -0.1911, df = 643, p-value = 0.8485
  Null hypothesis retained

steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
  z = -0.1909, p-value = 0.8486
  Null hypothesis retained

meng1992: Meng, Rosenthal, and Rubin's z (1992)
  z = -0.1909, p-value = 0.8486
  Null hypothesis retained
  95% confidence interval for r.jk - r.jh: -0.0825 0.0678
  Null hypothesis retained (Interval includes 0)

hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
  z = -0.1909, p-value = 0.8486
  Null hypothesis retained

zou2007: Zou's (2007) confidence interval
  95% confidence interval for r.jk - r.jh: -0.0798 0.0656
  Null hypothesis retained (Interval includes 0)

Fisher’s z-test (z = -0.1909, p = 0.8486) tells us that the correlations are not statistically significantly different from each other; Zhou’s confidence intervals indicate that the differences range between -0.0798 and 0.0656. Because this interval crosses zero, we know that the difference could be zero, or reversed in direction.

4.4.8.2 Multitrait-Multimethod Matrix

The multitrait-multimethod matrix is a systematic experimental design for the dual approach of convergent and discriminant validation, which requires the assessment of two or more traits (classically, math, English, and reading scores) by two more methods (self, parent, and teacher). Conducting a web-based image search on this term will show a matrix of alpha coefficients and correlation coefficients that are interpreted in relationship to each other. Roughly:

alpha coefficients (internal consistency) should be the highest,
validity coefficients (correlations of the same trait assessed by different methods) should be higher than correlations between different traits measured by different methods,
validity coefficients (correlations of the same trait assessed by different methods) should be higher than different traits measured by the same method.

4.4.9 Incremental Validity

Incremental validity is the increase in predictive validity attributable to the test. It indicates the contribution the test makes to the selection of individuals who will meet the minimum standards in criterion performance. There are different ways to assess this – one of the most common is to first enter known predictors and then see if the instrument-of-interest continues to account variance over-and-above those that are entered.

In the Szymanski and Bissonette (2020) psychometric evaluation, the negative relations with satisfaction with college and intention to persist in college as well as positive relations with both anxiety and depression persisted even after controlling for LGBTQ victimization experiences.

I will demonstrate this procedure, predicting the contribution that the LGBTQ Campus Climate total scale score has on predicting intention to persist in college, over and above LGBTQ victimization.

The process is to use hierarchical linear regression. Two models are built. In the first mode (“PfV” stands [in my mind] for “Persistence from Victimization”), persistence is predicted from victimization. The second model adds the LGBTQ Campus Climate Scale. I asked for summaries of each model. Then the anova() function compares the model.

PfV <- lm(Persistence ~ Victimization, data = dfSzy)
PfVC <- lm(Persistence ~ Victimization + LGBTQclimate, data = dfSzy)
summary(PfV)


Call:
lm(formula = Persistence ~ Victimization, data = dfSzy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.19281 -0.34150 -0.02281  0.29669  1.32226 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)    3.09906    0.07947  38.997 <0.0000000000000002 ***
Victimization -0.04566    0.05023  -0.909               0.364    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4224 on 644 degrees of freedom
Multiple R-squared:  0.001281,  Adjusted R-squared:  -0.0002694 
F-statistic: 0.8263 on 1 and 644 DF,  p-value: 0.3637

From the PfV model we learn that victimization has a non-significant effect on intentions to persist in college (B = -0.046, p = 0.364). Further, the \(R^2\) is quite small (0.001).

summary(PfVC)


Call:
lm(formula = Persistence ~ Victimization + LGBTQclimate, data = dfSzy)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1696 -0.2842  0.0094  0.2569  1.3571 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    3.64440    0.12795  28.483 < 0.0000000000000002 ***
Victimization -0.04183    0.04919  -0.850                0.395    
LGBTQclimate  -0.13788    0.02568  -5.369          0.000000111 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4135 on 643 degrees of freedom
Multiple R-squared:  0.04413,   Adjusted R-squared:  0.04116 
F-statistic: 14.84 on 2 and 643 DF,  p-value: 0.0000004991

In the PfVC model, we see that the LGBTQ Campus Climate full scale score has a significant impact on intentions to persist. Specifically, for each additional point higher on the LGBTQ climate score, intentions to persist decrease by .14 points (p < 0.001). Together, the model accounts for 4% of the variance, representing a \(\Delta{R^2}\) of 4%.

# calculating R2 change
0.04413 - 0.001281

[1] 0.042849

anova(PfV, PfVC)

Analysis of Variance Table

Model 1: Persistence ~ Victimization
Model 2: Persistence ~ Victimization + LGBTQclimate
  Res.Df    RSS Df Sum of Sq      F       Pr(>F)    
1    644 114.88                                     
2    643 109.95  1     4.929 28.824 0.0000001108 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that there is a statistically significant difference between the models \(F(1, 643) = 28.824, p < 0.001)\).

Let’s try another model. With anxiety as our dependent variable, the code below asks if LGBTQ Campus Climate accounts for a proportion of the variance over-and-above victimization.

AfV <- lm(Anxiety ~ Victimization, data = dfSzy)
AfVC <- lm(Anxiety ~ Victimization + LGBTQclimate, data = dfSzy)
summary(AfV)


Call:
lm(formula = Anxiety ~ Victimization, data = dfSzy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.05943 -0.30935 -0.01528  0.31306  1.24148 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    1.21756    0.07131  17.073 < 0.0000000000000002 ***
Victimization  0.17427    0.04508   3.866             0.000122 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.379 on 644 degrees of freedom
Multiple R-squared:  0.02268,   Adjusted R-squared:  0.02116 
F-statistic: 14.95 on 1 and 644 DF,  p-value: 0.0001219

summary(AfVC)


Call:
lm(formula = Anxiety ~ Victimization + LGBTQclimate, data = dfSzy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.00814 -0.29249 -0.02563  0.29877  1.20705 

Coefficients:
              Estimate Std. Error t value         Pr(>|t|)    
(Intercept)    0.81071    0.11561   7.012 0.00000000000595 ***
Victimization  0.17141    0.04444   3.857         0.000126 ***
LGBTQclimate   0.10287    0.02320   4.433 0.00001092659570 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3736 on 643 degrees of freedom
Multiple R-squared:  0.05166,   Adjusted R-squared:  0.04872 
F-statistic: 17.52 on 2 and 643 DF,  p-value: 0.0000000392

anova(AfV, AfVC)

Analysis of Variance Table

Model 1: Anxiety ~ Victimization
Model 2: Anxiety ~ Victimization + LGBTQclimate
  Res.Df    RSS Df Sum of Sq      F     Pr(>F)    
1    644 92.513                                   
2    643 89.770  1    2.7436 19.651 0.00001093 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model is a little more exciting in that our first model (AfV) is statistically significant \((B = 0.174, p < 0.001)\). That is, victimization has a statistically significant effect on anxiety, accounting for 2% of the variance. In the second model, the LGBTQ Campus Climate total scale score is also significant \((B = 0.103, p < 0.001)\), and accounts for an additional 3% of variance (\(\Delta{R^2} = 0.029\). There is a statistically significant difference between models (F[1, 643] = 19.651, p < .001).

# calculating change in R2
0.05166 - 0.02268

[1] 0.02898

4.5 Factors Affecting Validity Coefficients

Keeping in mind that a validity coefficient is merely the correlation between the test and some criteria, the same elements that impact the magnitude and significance of a correlation coefficient will similarly affect a validity coefficient.

Nature of the group. A test that has high validity in predicting a particular criterion in one population, may have little or no validity in predicting the same criterion in another population. If a test is designed for use in diverse populations, information about the population generalizability should be reported in the technical manuals.

Sample heterogeneity. Other things being equal, if there is a linear relationship between X and Y, it will have a greater magnitude when the sample is heterogeneous.

Pre-selection. Just like internal and external validity in a research design can be threatened by selection issues, pre-selection can also impact the validity coefficients of a measure. For example, if we are evaluating a new test for job selection, we may select a group of newly hired employees. We plan to collect some measure of job performance at a later date. Our results may be limited by the criteria used to select the employees. Were they volunteers? Were they only those hired? Were they ALL of the applicants?

Validity coefficients may change over time. Consider the relationship between the college boards and grade point average at Yale University. Fifty years ago, \(r_{xy} = .72\); today \(r_{xy} = .52\). Why? The nature of the student body has become more diverse (50 years ago, the student body was predominantly White, high SES, and male).

The form of the relationship matters. The Pearson R assumes the relationship between the predictor and criterion variables is linear, uniform, and homoscedastistic (equal variability throughout the range of a bivariate distribution). When the variability is unequal throughout the range of the distribution the relationship is heteroscedastic.

Figure 4.1: Illustration of heteroscedasticity

There could also be other factors involved in the relationship between the instrument and the criterion:

curvilinearity
an undetected mechanism, such as a moderator

Finally, what is our threshold for acceptability?

Consider statistical significance – but also its limitations (e.g., power, Type I error, Type II error)
Consider the magnitude of the correlation; and also \(R^2\) (the proportion of variance accounted for)
Consider error:
- The standard error of the estimate shows the margin of error to be expected in the individuals predicted criterion score as the result of the imperfect validity of the instrument.

\[SE_{est} = SD_{y}\sqrt{1 - r_{xy}^{2}}\] Where:

\(r_{xy}^{2}\) is the square of the validity coefficient, and
\(SD_{y}\) is the standard deviation of the criterion scores.

If the validity were perfect (\(r_{xy}^{2}\) = 1.00), the error of estimate would be 0.00. If the validity were zero, the error of estimate would equal \(SD_{y}\).

Interpreting \(SE_{est}\):

If \(r_{xy}\) = .80, then \(\sqrt{1 - r_{xy}^{2}} = .60\).
Error is 60% as large as it would be by chance. Stated another way, predicting an individual’s criterion performance has a margin of error that is 40% smaller than it would be by chance.

To obtain the \(SE_{est}\), we merely multiply by the \(SD_{y}\). This puts error in the metric of the criterion variable.

Your Turn:

If \(r_{xy}\) = .25, then \(\sqrt{1 - r_{xy}^{2}} =\) ??

Make a statement about chance. Make a statement about margin of error.

4.6 Practice Problems

In each of these lessons, I provide suggestions for practice that allow you to select one or more problems that are graded in difficulty. With each of these options, I encourage you to interpret examine aspects of the construct validity through the creation and interpretation of validity coefficients. Ideally, you will examine both convergent/discriminant validity as well as incremental validity.

4.6.1 Problem #1: Play around with this simulation.

Copy the script for the simulation and then change (at least) one thing in the simulation to see how it impacts the results.

If calculating is new to you, perhaps you just change the number in “set.seed(210907)” from 210907 to something else. Your results should parallel those obtained in the lecture, making it easier for you to check your work as you go.

4.6.2 Problem #2: Conduct the reliability analysis selecting different variables.

The Szymanski and Bissonette (2020) article conducted a handful of incremental validity assessments. Select different outcome variables (e.g., depression) and/or use the subscales as the instrument-of-interest.

4.6.3 Problem #3: Try something entirely new.

Using data for which you have permission and access (e.g., IRB approved data you have collected or from your lab; data you simulate from a published article; data from an open science repository; data from other chapters in this OER), create validity coefficients and use three variables to estimate the incremental validity of the instrument-of-interest.

4.6.4 Grading Rubric

Assignment Component	Points Possible	Points Earned
1. Check and, if needed, format and score the data.	5	_____
2. Create a correlation matrix that includes the instrument-of-interest and the variables that will have varying degrees of relation.	5	_____
3. With convergent and discriminant validity in mind, interpret the validity coefficients; this should include an assessment about whether the correlation coefficients (at least two different pairings) are statistically significantly different from each other.	5	_____
4. With at least three variables, evaluate the degree to which the instrument demonstrates incremental validity (this should involve two regression equations and their statistical comparison).	5	_____
5. Explanation to grader.	5	_____
Totals	25	_____

4.7 Homeworked Example

Screencast Link

For more information about the data used in this homeworked example, please refer to the description and codebook located at the end of the introduction in first volume of ReCentering Psych Stats.

As a brief review, this data is part of an IRB-approved study, with consent to use in teaching demonstrations and to be made available to the general public via the open science framework. Hence, it is appropriate to use in this context. You will notice there are student- and teacher- IDs. These numbers are not actual student and teacher IDs, rather they were further re-identified so that they could not be connected to actual people.

Because this is an actual dataset, if you wish to work the problem along with me, you will need to download the ReC.rds data file from the Worked_Examples folder in the ReC_Psychometrics project on the GitHub.

The course evaluation items can be divided into three subscales:

Valued by the student includes the items: ValObjectives, IncrUnderstanding, IncrInterest
Traditional pedagogy includes the items: ClearResponsibilities, EffectiveAnswers, Feedback, ClearOrganization, ClearPresentation
Socially responsive pedagogy includes the items: InclusvClassrm, EquitableEval, MultPerspectives, DEIintegration

In this homework focused on validity we will score the total scale and subscales, create a correlation matrix of our scales with a different scale (or item), formally test to see if correlation coefficients are statistically significantly different from each other, conduct a test of incremental validity.

4.7.1 Check and, if needed, format data

big <- readRDS("ReC.rds")

Let’s check the structure…

str(big)

Classes 'data.table' and 'data.frame':  310 obs. of  33 variables:
 $ deID                   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CourseID               : int  57085635 57085635 57085635 57085635 57085635 57085635 57085635 57085635 57085635 57085635 ...
 $ Dept                   : chr  "CPY" "CPY" "CPY" "CPY" ...
 $ Course                 : Factor w/ 3 levels "Psychometrics",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ StatsPkg               : Factor w/ 2 levels "SPSS","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ Centering              : Factor w/ 2 levels "Pre","Re": 2 2 2 2 2 2 2 2 2 2 ...
 $ Year                   : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
 $ Quarter                : chr  "Fall" "Fall" "Fall" "Fall" ...
 $ IncrInterest           : int  5 3 4 2 4 3 5 3 2 5 ...
 $ IncrUnderstanding      : int  2 3 4 3 4 4 5 2 4 5 ...
 $ ValObjectives          : int  5 5 4 4 5 5 5 5 4 5 ...
 $ ApprAssignments        : int  5 4 4 4 5 3 5 3 3 5 ...
 $ EffectiveAnswers       : int  5 3 5 3 5 3 4 3 2 3 ...
 $ Respectful             : int  5 5 4 5 5 4 5 4 5 5 ...
 $ ClearResponsibilities  : int  5 5 4 4 5 4 5 4 4 5 ...
 $ Feedback               : int  5 3 4 2 5 NA 5 4 4 5 ...
 $ OvInstructor           : int  5 4 4 3 5 3 5 4 3 5 ...
 $ MultPerspectives       : int  5 5 4 5 5 4 5 5 5 5 ...
 $ OvCourse               : int  3 4 4 3 5 3 5 3 2 5 ...
 $ InclusvClassrm         : int  5 5 5 5 5 4 5 5 4 5 ...
 $ DEIintegration         : int  5 5 5 5 5 4 5 5 5 5 ...
 $ ClearPresentation      : int  4 4 4 2 5 3 4 4 4 5 ...
 $ ApprWorkload           : int  5 5 3 4 4 2 5 4 4 5 ...
 $ MyContribution         : int  4 4 4 4 5 4 4 3 4 5 ...
 $ InspiredInterest       : int  5 3 4 3 5 3 5 4 4 5 ...
 $ Faith                  : int  5 NA 4 2 NA NA 4 4 4 NA ...
 $ EquitableEval          : int  5 5 3 5 5 3 5 5 3 5 ...
 $ SPFC.Decolonize.Opt.Out: chr  "" "" "" "" ...
 $ ProgramYear            : Factor w/ 3 levels "Second","Transition",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ ClearOrganization      : int  3 4 3 4 4 4 5 4 4 5 ...
 $ RegPrepare             : int  5 4 4 4 4 3 4 4 4 5 ...
 $ EffectiveLearning      : int  2 4 3 4 4 2 5 3 2 5 ...
 $ AccessibleInstructor   : int  5 4 4 4 5 4 5 4 5 5 ...
 - attr(*, ".internal.selfref")=<externalptr>

We will need to create the three subscales. The codebook above, lists which variables go in each subscale score.

# Making the list of variables
ValuedVars <- c("ValObjectives", "IncrUnderstanding", "IncrInterest")
TradPedVars <- c("ClearResponsibilities", "EffectiveAnswers", "Feedback",
    "ClearOrganization", "ClearPresentation")
SRPedVars <- c("InclusvClassrm", "EquitableEval", "MultPerspectives", "DEIintegration")
Total <- c("ValObjectives", "IncrUnderstanding", "IncrInterest", "ClearResponsibilities",
    "EffectiveAnswers", "Feedback", "ClearOrganization", "ClearPresentation",
    "InclusvClassrm", "EquitableEval", "MultPerspectives", "DEIintegration")

# Creating the new variables
big$Valued <- sjstats::mean_n(big[, ValuedVars], 0.66)
big$TradPed <- sjstats::mean_n(big[, TradPedVars], 0.75)
big$SRPed <- sjstats::mean_n(big[, SRPedVars], 0.75)
big$Total <- sjstats::mean_n(big[, Total], 0.8)

# If the scoring code above does not work for you, try the format
# below which involves inserting to periods in front of the variable
# list. One example is provided. dfLewis$Belonging <-
# sjstats::mean_n(dfLewis[, ..Belonging_vars], 0.80)

4.7.2 Create a correlation matrix that includes the instrument-of-interest and the variables that will have varying degrees of relation

Unfortunately, data from the course evals don’t include any outside scales. However, I didn’t include the “Overall Instructor” (OvInstructor) in any of the items, so we could think of it as a way to look at convergent and discriminant validity.

apaTables::apa.cor.table(big[c("Valued", "TradPed", "SRPed", "OvInstructor")],
    filename = "ReC_cortable.doc", table.number = 1, show.sig.stars = TRUE,
    landscape = TRUE)



Table 1 

Means, standard deviations, and correlations with confidence intervals
 

  Variable        M    SD   1          2          3         
  1. Valued       4.25 0.68                                 
                                                            
  2. TradPed      4.25 0.76 .70**                           
                            [.63, .75]                      
                                                            
  3. SRPed        4.52 0.58 .56**      .71**                
                            [.48, .64] [.65, .76]           
                                                            
  4. OvInstructor 4.37 0.94 .63**      .80**      .67**     
                            [.56, .70] [.76, .84] [.60, .73]
                                                            

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

All the correlations are strong and positive. However, look at the correlation between Overall Instructor and SCRPed!

4.7.3 With convergent and discriminant validity in mind, interpret the validity coefficients; this should include an assessment about whether the correlation coefficients (at least two different pairings) are statistically significantly different from each other.

We need to see if these correlations are statistically significantly different from each other. I am interested in knowing if the correlations between Overall Instructor and each of the three course dimensions (Valued [r = 0.63, p < 0.01], TradPed [r = 0.80, p < 0.01], SRPed [r = 0.67, p < 0.01]) are statistically significantly different from each other.

cocor::cocor(formula = ~Valued + OvInstructor | TradPed + OvInstructor,
    data = big)


  Results of a comparison of two overlapping correlations based on dependent groups

Comparison between r.jk (OvInstructor, Valued) = 0.6344 and r.jh (OvInstructor, TradPed) = 0.7997
Difference: r.jk - r.jh = -0.1652
Related correlation: r.kh = 0.697
Data: big: j = OvInstructor, k = Valued, h = TradPed
Group size: n = 307
Null hypothesis: r.jk is equal to r.jh
Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
Alpha: 0.05

pearson1898: Pearson and Filon's z (1898)
  z = -5.4939, p-value = 0.0000
  Null hypothesis rejected

hotelling1940: Hotelling's t (1940)
  t = -6.2651, df = 304, p-value = 0.0000
  Null hypothesis rejected

williams1959: Williams' t (1959)
  t = -6.1447, df = 304, p-value = 0.0000
  Null hypothesis rejected

olkin1967: Olkin's z (1967)
  z = -5.4939, p-value = 0.0000
  Null hypothesis rejected

dunn1969: Dunn and Clark's z (1969)
  z = -5.9983, p-value = 0.0000
  Null hypothesis rejected

hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
  t = -6.2651, df = 304, p-value = 0.0000
  Null hypothesis rejected

steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
  z = -5.9444, p-value = 0.0000
  Null hypothesis rejected

meng1992: Meng, Rosenthal, and Rubin's z (1992)
  z = -5.9182, p-value = 0.0000
  Null hypothesis rejected
  95% confidence interval for r.jk - r.jh: -0.4644 -0.2333
  Null hypothesis rejected (Interval does not include 0)

hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
  z = -5.8868, p-value = 0.0000
  Null hypothesis rejected

zou2007: Zou's (2007) confidence interval
  95% confidence interval for r.jk - r.jh: -0.2282 -0.1089
  Null hypothesis rejected (Interval does not include 0)

Fisher’s z-test (z = -5.887, p < 0.001) indicates that the correlation of overall instructor with the valued subscale (r = 0.63) is lower than its correlation with the traditional pedagogy subscale (r = 0.80).

cocor::cocor(formula = ~TradPed + OvInstructor | SRPed + OvInstructor,
    data = big)


  Results of a comparison of two overlapping correlations based on dependent groups

Comparison between r.jk (OvInstructor, TradPed) = 0.7962 and r.jh (OvInstructor, SRPed) = 0.6751
Difference: r.jk - r.jh = 0.1211
Related correlation: r.kh = 0.7091
Data: big: j = OvInstructor, k = TradPed, h = SRPed
Group size: n = 298
Null hypothesis: r.jk is equal to r.jh
Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
Alpha: 0.05

pearson1898: Pearson and Filon's z (1898)
  z = 4.2785, p-value = 0.0000
  Null hypothesis rejected

hotelling1940: Hotelling's t (1940)
  t = 4.6684, df = 295, p-value = 0.0000
  Null hypothesis rejected

williams1959: Williams' t (1959)
  t = 4.5800, df = 295, p-value = 0.0000
  Null hypothesis rejected

olkin1967: Olkin's z (1967)
  z = 4.2785, p-value = 0.0000
  Null hypothesis rejected

dunn1969: Dunn and Clark's z (1969)
  z = 4.5174, p-value = 0.0000
  Null hypothesis rejected

hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
  t = 4.6684, df = 295, p-value = 0.0000
  Null hypothesis rejected

steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
  z = 4.4945, p-value = 0.0000
  Null hypothesis rejected

meng1992: Meng, Rosenthal, and Rubin's z (1992)
  z = 4.4834, p-value = 0.0000
  Null hypothesis rejected
  95% confidence interval for r.jk - r.jh: 0.1510 0.3855
  Null hypothesis rejected (Interval does not include 0)

hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
  z = 4.4678, p-value = 0.0000
  Null hypothesis rejected

zou2007: Zou's (2007) confidence interval
  95% confidence interval for r.jk - r.jh: 0.0676 0.1802
  Null hypothesis rejected (Interval does not include 0)

Fisher’s z-test (z = 4.4678, p < 0.001) indicates that the correlation of overall instructor with the traditional pedagogy subscale (r = 0.80) is higher than its correlation with the socially responsive pedagogy subscale (r = 0.67).

cocor::cocor(formula = ~Valued + OvInstructor | SRPed + OvInstructor, data = big)


  Results of a comparison of two overlapping correlations based on dependent groups

Comparison between r.jk (OvInstructor, Valued) = 0.6338 and r.jh (OvInstructor, SRPed) = 0.6717
Difference: r.jk - r.jh = -0.0379
Related correlation: r.kh = 0.5624
Data: big: j = OvInstructor, k = Valued, h = SRPed
Group size: n = 299
Null hypothesis: r.jk is equal to r.jh
Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
Alpha: 0.05

pearson1898: Pearson and Filon's z (1898)
  z = -1.0091, p-value = 0.3129
  Null hypothesis retained

hotelling1940: Hotelling's t (1940)
  t = -1.0355, df = 296, p-value = 0.3013
  Null hypothesis retained

williams1959: Williams' t (1959)
  t = -1.0071, df = 296, p-value = 0.3147
  Null hypothesis retained

olkin1967: Olkin's z (1967)
  z = -1.0091, p-value = 0.3129
  Null hypothesis retained

dunn1969: Dunn and Clark's z (1969)
  z = -1.0062, p-value = 0.3143
  Null hypothesis retained

hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
  t = -1.0355, df = 296, p-value = 0.3013
  Null hypothesis retained

steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
  z = -1.0060, p-value = 0.3144
  Null hypothesis retained

meng1992: Meng, Rosenthal, and Rubin's z (1992)
  z = -1.0058, p-value = 0.3145
  Null hypothesis retained
  95% confidence interval for r.jk - r.jh: -0.1948 0.0627
  Null hypothesis retained (Interval includes 0)

hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
  z = -1.0058, p-value = 0.3145
  Null hypothesis retained

zou2007: Zou's (2007) confidence interval
  95% confidence interval for r.jk - r.jh: -0.1129 0.0360
  Null hypothesis retained (Interval includes 0)

Fisher’s z-test (z = -1.006, p = 0.315) indicates that the correlation of overall instructor with the valued subscale (r = 0.4) is not statistically significantly different than its correlation with the socially responsive pedagogy subscale (r = 0.67).

4.7.4 With at least three variables, evaluate the degree to which the instrument demonstrates incremental validity (this should involve two regression equations and their statistical comparison)

Playing around with these variables, let’s presume our outcome of interested is the student’s valuation of the class (i.e., the “Valued by the Student variable”) and we usually predict it through traditional pedagogy. What does SRPed contribute over-and-above?

Please understand, that we would normally have a more robust dataset with other indicators – maybe predicting students’ grades?

Also, we are completely ignoring the multi-level nature of this data. The published manuscript takes a multi-level approach to analyzing the data and my lessons on multi-level modeling address this as well.

big <- na.omit(big)  #included b/c there was uneven missingness and the subsequent comparison required equal sample sizes in the regression models
Step1 <- lm(Valued ~ TradPed, data = big)
Step2 <- lm(Valued ~ TradPed + SRPed, data = big)
summary(Step1)


Call:
lm(formula = Valued ~ TradPed, data = big)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43330 -0.25471  0.04673  0.25388  1.79522 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  1.67482    0.18581   9.014 <0.0000000000000002 ***
TradPed      0.61426    0.04191  14.656 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4274 on 213 degrees of freedom
Multiple R-squared:  0.5021,    Adjusted R-squared:  0.4998 
F-statistic: 214.8 on 1 and 213 DF,  p-value: < 0.00000000000000022

summary(Step2)


Call:
lm(formula = Valued ~ TradPed + SRPed, data = big)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.39671 -0.22675  0.03228  0.24841  1.71917 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  1.44912    0.26349   5.500          0.000000109 ***
TradPed      0.56933    0.05602  10.162 < 0.0000000000000002 ***
SRPed        0.09116    0.07554   1.207                0.229    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.427 on 212 degrees of freedom
Multiple R-squared:  0.5055,    Adjusted R-squared:  0.5009 
F-statistic: 108.4 on 2 and 212 DF,  p-value: < 0.00000000000000022

In the first step we see that traditional pedagogy had a statistically significant effect on the valued dimension \(B = 0.614, p < 0.001)\). This model accounted for 50% of variance.

In the second step, socially responsive pedagogy was not a statistically significant predictor, over and above traditional pedagogy \(B = 0.091, p = 0.228\). This model accounted for 51% of variance.

We can formally compare these two models with an the anova() function in base R.

anova(Step1, Step2)

Analysis of Variance Table

Model 1: Valued ~ TradPed
Model 2: Valued ~ TradPed + SRPed
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    213 38.918                           
2    212 38.652  1   0.26554 1.4564 0.2288

We see that socially responsive pedagogy adds only a non-significant proportion of variance over traditional pedagogy \((F[1, 212] = 38.652, p = 0.229)\).

REFERENCES

Brattmyr, M., Lindberg, M. S., Solem, S., Hjemdal, O., & Havnen, A. (2022). Factor structure, measurement invariance, and concurrent validity of the Patient Health Questionnaire-9 and the Generalized Anxiety Disorder Scale-7 in a Norwegian psychiatric outpatient sample. BMC Psychiatry, 22(1), 461. https://doi.org/10.1186/s12888-022-04101-z

Conover, K. J., Israel, T., & Nylund-Gibson, K. (2017). Development and Validation of the Ableist Microaggressions Scale. The Counseling Psychologist, 45(4), 30. https://doi.org/10.1177/001 1000017715317

Helm, E. G., Sedlacek, W. E., & Prieto, D. O. (1998). The Relationship Between Attitudes Toward Diversity and Overall Satisfaction of University Students by Race. Journal of College Counseling, 1(2), 111–120. https://doi.org/10.1002/j.2161-1882.1998.tb00130.x

Herek, G. M. (1993). Documenting prejudice against lesbians and gay men on campus: The Yale Sexual Orientation Survey. Journal of Homosexuality, 25(4), 15–30. https://doi.org/10.1300/J082v25n04_02

Keum, B. T., Brady, J. L., Sharma, R., Lu, Y., Kim, Y. H., & Thai, C. J. (2018). Gendered Racial Microaggressions Scale for Asian American Women: Development and initial validation. Journal of Counseling Psychology, 65(5), 571–585. https://doi.org/10.1037/cou0000305

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613. https://doi.org/10.1046/j.1525-1497.2001.016009606.x

Lewis, J. A., & Neville, H. A. (2015). Construction and initial validation of the Gendered Racial Microaggressions Scale for Black women. Journal of Counseling Psychology, 62(2), 289–302. https://doi.org/10.1037/cou0000062

Messick, S. (2000). Consequences of test interpretation and use: The fusion of validity and values in psychological assessment. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment: Honoring Douglas N. Jackson at seventy. (pp. 3–20). Kluwer Academic/Plenum Publishers. https://doi.org/10.1007/978-1-4615-4397-8_1

Pascarella, E. T., & Terenzini, P. T. (1980). Predicting Freshman Persistence and Voluntary Dropout Decisions from a Theoretical Model. The Journal of Higher Education, 51(1), 60–75. https://doi.org/10.1080/00221546.1980.11780030

Spitzer, R. L., Kroenke, K., Williams, J. B. W., & Löwe, B. (2006). A Brief Measure for Assessing Generalized Anxiety Disorder: The GAD-7. Archives of Internal Medicine, 166(10), 1092. https://doi.org/10.1001/archinte.166.10.1092

Szymanski, D. M., & Bissonette, D. (2020). Perceptions of the LGBTQ College Campus Climate Scale: Development and psychometric evaluation. Journal of Homosexuality, 67(10), 1412–1428. https://doi.org/10.1080/00918369.2019.1591788