Research-Methodology

Literature Review: Measures for Validity


According to Brown (2006), there are five criteria for evaluating the validity of a literature review: purpose, scope, authority, audience, and format. Each of these criteria has been taken into account and appropriately addressed throughout the literature review process.

McNabb (2008), on the other hand, formulates three fundamental purposes of a literature review, described below:

First, the literature review shows the study's audience that the author is familiar with the major contributions other authors have already made to the research area. Second, it helps to identify the key issues in the research area and obvious gaps in the current literature.

Third, the literature review helps readers understand the principles and theories the author applies in different parts of the study.

  • Brown, R. B. (2006). Doing Your Dissertation in Business and Management: The Reality of Research and Writing. Sage Publications.
  • McNabb, D. E. (2008). Research Methods in Public Administration and Non-Profit Management: Qualitative and Quantitative Approaches (2nd ed.). M.E. Sharpe.
  • Wysocki, D. K. (2007). Readings in Social Research Methods. Cengage Learning.


How to Write a Literature Review | Guide, Examples, & Templates

Published on January 2, 2023 by Shona McCombes. Revised on September 11, 2023.

What is a literature review? A literature review is a survey of scholarly sources on a specific topic. It provides an overview of current knowledge, allowing you to identify relevant theories, methods, and gaps in the existing research that you can later apply to your paper, thesis, or dissertation topic.

There are five key steps to writing a literature review:

  • Search for relevant literature
  • Evaluate sources
  • Identify themes, debates, and gaps
  • Outline the structure
  • Write your literature review

A good literature review doesn’t just summarize sources—it analyzes, synthesizes, and critically evaluates to give a clear picture of the state of knowledge on the subject.


When you write a thesis, dissertation, or research paper, you will likely have to conduct a literature review to situate your research within existing knowledge. The literature review gives you a chance to:

  • Demonstrate your familiarity with the topic and its scholarly context
  • Develop a theoretical framework and methodology for your research
  • Position your work in relation to other researchers and theorists
  • Show how your research addresses a gap or contributes to a debate
  • Evaluate the current state of research and demonstrate your knowledge of the scholarly debates around your topic.

Writing literature reviews is a particularly important skill if you want to apply for graduate school or pursue a career in research. We’ve written a step-by-step guide that you can follow below.


Writing literature reviews can be quite challenging! A good starting point could be to look at some examples, depending on what kind of literature review you’d like to write.

  • Example literature review #1: “Why Do People Migrate? A Review of the Theoretical Literature” (Theoretical literature review about the development of economic migration theory from the 1950s to today.)
  • Example literature review #2: “Literature review as a research methodology: An overview and guidelines” (Methodological literature review about interdisciplinary knowledge acquisition and production.)
  • Example literature review #3: “The Use of Technology in English Language Learning: A Literature Review” (Thematic literature review about the effects of technology on language acquisition.)
  • Example literature review #4: “Learners’ Listening Comprehension Difficulties in English Language Learning: A Literature Review” (Chronological literature review about how the concept of listening skills has changed over time.)

You can also check out our templates with literature review examples and sample outlines at the links below.


Before you begin searching for literature, you need a clearly defined topic.

If you are writing the literature review section of a dissertation or research paper, you will search for literature related to your research problem and questions.

Make a list of keywords

Start by creating a list of keywords related to your research question. Include each of the key concepts or variables you’re interested in, and list any synonyms and related terms. You can add to this list as you discover new keywords in the process of your literature search.

  • Social media, Facebook, Instagram, Twitter, Snapchat, TikTok
  • Body image, self-perception, self-esteem, mental health
  • Generation Z, teenagers, adolescents, youth

Search for relevant sources

Use your keywords to begin searching for sources. Some useful databases to search for journals and articles include:

  • Your university’s library catalogue
  • Google Scholar
  • Project Muse (humanities and social sciences)
  • Medline (life sciences and biomedicine)
  • EconLit (economics)
  • Inspec (physics, engineering and computer science)

You can also use Boolean operators to help narrow down your search.
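As a rough illustration (this helper is not part of the original guide; the keyword groups are the hypothetical ones listed above), a few lines of Python can assemble a Boolean query by joining synonyms with OR and distinct concepts with AND:

```python
# Illustrative sketch (not from the guide): build a Boolean search string from
# keyword groups. Synonyms are joined with OR, distinct concepts with AND.
keyword_groups = [
    ["social media", "Instagram", "TikTok", "Snapchat"],
    ["body image", "self-esteem", "self-perception"],
    ["adolescents", "teenagers", "Generation Z"],
]

def boolean_query(groups):
    quoted = [[f'"{kw}"' if " " in kw else kw for kw in group] for group in groups]
    return " AND ".join("(" + " OR ".join(group) + ")" for group in quoted)

print(boolean_query(keyword_groups))
# ("social media" OR Instagram OR TikTok OR Snapchat) AND ("body image" OR ...)
```

Exact operator syntax and phrase quoting differ between databases, so check the help pages of the catalogue or database you are searching.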

Make sure to read the abstract to find out whether an article is relevant to your question. When you find a useful book or article, you can check the bibliography to find other relevant sources.

You likely won’t be able to read absolutely everything that has been written on your topic, so it will be necessary to evaluate which sources are most relevant to your research question.

For each publication, ask yourself:

  • What question or problem is the author addressing?
  • What are the key concepts and how are they defined?
  • What are the key theories, models, and methods?
  • Does the research use established frameworks or take an innovative approach?
  • What are the results and conclusions of the study?
  • How does the publication relate to other literature in the field? Does it confirm, add to, or challenge established knowledge?
  • What are the strengths and weaknesses of the research?

Make sure the sources you use are credible, and make sure you read any landmark studies and major theories in your field of research.

You can use our template to summarize and evaluate the sources you’re thinking about using.

Take notes and cite your sources

As you read, you should also begin the writing process. Take notes that you can later incorporate into the text of your literature review.

It is important to keep track of your sources with citations to avoid plagiarism . It can be helpful to make an annotated bibliography , where you compile full citation information and write a paragraph of summary and analysis for each source. This helps you remember what you read and saves time later in the process.
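If you prefer to keep these notes in a structured, machine-readable form, a minimal sketch might look like the following (this is not part of the original guide; the field names and the sample entry are illustrative):

```python
# Illustrative sketch: a minimal structure for annotated-bibliography entries
# kept while reading, so citation, summary, and evaluation stay together.
from dataclasses import dataclass

@dataclass
class AnnotatedSource:
    citation: str    # full reference in your chosen citation style
    summary: str     # what the study asked, how, and what it found
    evaluation: str  # strengths, weaknesses, and relevance to your question

sources = [
    AnnotatedSource(
        citation="Author, A. (2020). Example study title. Example Journal, 1(1), 1-10.",
        summary="Surveyed 500 adolescents on social media use and body image.",
        evaluation="Large sample, but cross-sectional design limits causal claims.",
    ),
]

for entry in sources:
    print(f"{entry.citation}\n  Summary: {entry.summary}\n  Evaluation: {entry.evaluation}")
```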


To begin organizing your literature review’s argument and structure, be sure you understand the connections and relationships between the sources you’ve read. Based on your reading and notes, you can look for:

  • Trends and patterns (in theory, method or results): do certain approaches become more or less popular over time?
  • Themes: what questions or concepts recur across the literature?
  • Debates, conflicts and contradictions: where do sources disagree?
  • Pivotal publications: are there any influential theories or studies that changed the direction of the field?
  • Gaps: what is missing from the literature? Are there weaknesses that need to be addressed?

This step will help you work out the structure of your literature review and (if applicable) show how your own research will contribute to existing knowledge. For example, a review of the social media and body image literature above might find that:

  • Most research has focused on young women.
  • There is an increasing interest in the visual aspects of social media.
  • But there is still a lack of robust research on highly visual platforms like Instagram and Snapchat—this is a gap that you could address in your own research.

There are various approaches to organizing the body of a literature review. Depending on the length of your literature review, you can combine several of these strategies (for example, your overall structure might be thematic, but each theme is discussed chronologically).

Chronological

The simplest approach is to trace the development of the topic over time. However, if you choose this strategy, be careful to avoid simply listing and summarizing sources in order.

Try to analyze patterns, turning points and key debates that have shaped the direction of the field. Give your interpretation of how and why certain developments occurred.

Thematic

If you have found some recurring central themes, you can organize your literature review into subsections that address different aspects of the topic.

For example, if you are reviewing literature about inequalities in migrant health outcomes, key themes might include healthcare policy, language barriers, cultural attitudes, legal status, and economic access.

Methodological

If you draw your sources from different disciplines or fields that use a variety of research methods, you might want to compare the results and conclusions that emerge from different approaches. For example:

  • Look at what results have emerged in qualitative versus quantitative research
  • Discuss how the topic has been approached by empirical versus theoretical scholarship
  • Divide the literature into sociological, historical, and cultural sources

Theoretical

A literature review is often the foundation for a theoretical framework. You can use it to discuss various theories, models, and definitions of key concepts.

You might argue for the relevance of a specific theoretical approach, or combine various theoretical concepts to create a framework for your research.

Like any other academic text, your literature review should have an introduction, a main body, and a conclusion. What you include in each depends on the objective of your literature review.

The introduction should clearly establish the focus and purpose of the literature review.

Depending on the length of your literature review, you might want to divide the body into subsections. You can use a subheading for each theme, time period, or methodological approach.

As you write, you can follow these tips:

  • Summarize and synthesize: give an overview of the main points of each source and combine them into a coherent whole
  • Analyze and interpret: don’t just paraphrase other researchers — add your own interpretations where possible, discussing the significance of findings in relation to the literature as a whole
  • Critically evaluate: mention the strengths and weaknesses of your sources
  • Write in well-structured paragraphs: use transition words and topic sentences to draw connections, comparisons and contrasts

In the conclusion, you should summarize the key findings you have taken from the literature and emphasize their significance.

When you’ve finished writing and revising your literature review, don’t forget to proofread thoroughly before submitting.



A literature review is a survey of scholarly sources (such as books, journal articles, and theses) related to a specific topic or research question.

It is often written as part of a thesis, dissertation, or research paper, in order to situate your work in relation to existing knowledge.

There are several reasons to conduct a literature review at the beginning of a research project:

  • To familiarize yourself with the current state of knowledge on your topic
  • To ensure that you’re not just repeating what others have already done
  • To identify gaps in knowledge and unresolved problems that your research can address
  • To develop your theoretical framework and methodology
  • To provide an overview of the key findings and debates on the topic

Writing the literature review shows your reader how your work relates to existing research and what new insights it will contribute.

The literature review usually comes near the beginning of your thesis or dissertation. After the introduction, it grounds your research in a scholarly field and leads directly to your theoretical framework or methodology.

A literature review is a survey of credible sources on a topic, often used in dissertations, theses, and research papers. Literature reviews give an overview of knowledge on a subject, helping you identify relevant theories and methods, as well as gaps in existing research. Literature reviews are set up similarly to other academic texts, with an introduction, a main body, and a conclusion.

An annotated bibliography is a list of source references that has a short description (called an annotation) for each of the sources. It is often assigned as part of the research process for a paper.

Cite this Scribbr article


McCombes, S. (2023, September 11). How to Write a Literature Review | Guide, Examples, & Templates. Scribbr. Retrieved February 15, 2024, from https://www.scribbr.com/dissertation/literature-review/


Purdue Online Writing Lab (Purdue OWL), College of Liberal Arts

Writing a Literature Review


A literature review is a document or section of a document that collects key sources on a topic and discusses those sources in conversation with each other (also called synthesis). The lit review is an important genre in many disciplines, not just literature (i.e., the study of works of literature such as novels and plays). When we say “literature review” or refer to “the literature,” we are talking about the research (scholarship) in a given field. You will often see the terms “the research,” “the scholarship,” and “the literature” used mostly interchangeably.

Where, when, and why would I write a lit review?

There are a number of different situations where you might write a literature review, each with slightly different expectations; different disciplines, too, have field-specific expectations for what a literature review is and does. For instance, in the humanities, authors might include more overt argumentation and interpretation of source material in their literature reviews, whereas in the sciences, authors are more likely to report study designs and results in their literature reviews; these differences reflect these disciplines’ purposes and conventions in scholarship. You should always look at examples from your own discipline and talk to professors or mentors in your field to be sure you understand your discipline’s conventions, for literature reviews as well as for any other genre.

A literature review can be a part of a research paper or scholarly article, usually falling after the introduction and before the research methods sections. In these cases, the lit review just needs to cover scholarship that is important to the issue you are writing about; sometimes it will also cover key sources that informed your research methodology.

Lit reviews can also be standalone pieces, either as assignments in a class or as publications. In a class, a lit review may be assigned to help students familiarize themselves with a topic and with scholarship in their field, get an idea of the other researchers working on the topic they’re interested in, find gaps in existing research in order to propose new projects, and/or develop a theoretical framework and methodology for later research. As a publication, a lit review usually is meant to help make other scholars’ lives easier by collecting and summarizing, synthesizing, and analyzing existing research on a topic. This can be especially helpful for students or scholars getting into a new research area, or for directing an entire community of scholars toward questions that have not yet been answered.

What are the parts of a lit review?

Most lit reviews use a basic introduction-body-conclusion structure; if your lit review is part of a larger paper, the introduction and conclusion pieces may be just a few sentences while you focus most of your attention on the body. If your lit review is a standalone piece, the introduction and conclusion take up more space and give you a place to discuss your goals, research methods, and conclusions separately from where you discuss the literature itself.

Introduction:

  • An introductory paragraph that explains what your working topic and thesis are
  • A forecast of key topics or texts that will appear in the review
  • Potentially, a description of how you found sources and how you analyzed them for inclusion and discussion in the review (more often found in published, standalone literature reviews than in lit review sections in an article or research paper)

Body:

  • Summarize and synthesize: Give an overview of the main points of each source and combine them into a coherent whole
  • Analyze and interpret: Don’t just paraphrase other researchers – add your own interpretations where possible, discussing the significance of findings in relation to the literature as a whole
  • Critically evaluate: Mention the strengths and weaknesses of your sources
  • Write in well-structured paragraphs: Use transition words and topic sentences to draw connections, comparisons, and contrasts.

Conclusion:

  • Summarize the key findings you have taken from the literature and emphasize their significance
  • Connect it back to your primary research question

How should I organize my lit review?

Lit reviews can take many different organizational patterns depending on what you are trying to accomplish with the review. Here are some examples:

  • Chronological: The simplest approach is to trace the development of the topic over time, which helps familiarize the audience with the topic (for instance if you are introducing something that is not commonly known in your field). If you choose this strategy, be careful to avoid simply listing and summarizing sources in order. Try to analyze the patterns, turning points, and key debates that have shaped the direction of the field. Give your interpretation of how and why certain developments occurred (as mentioned previously, this may not be appropriate in your discipline — check with a teacher or mentor if you’re unsure).
  • Thematic: If you have found some recurring central themes that you will continue working with throughout your piece, you can organize your literature review into subsections that address different aspects of the topic. For example, if you are reviewing literature about women and religion, key themes can include the role of women in churches and the religious attitude towards women.
  • Methodological: If your sources come from different disciplines or fields that use a variety of research methods, you can compare the results and conclusions that emerge from different approaches. For example:
  • Qualitative versus quantitative research
  • Empirical versus theoretical scholarship
  • Divide the research by sociological, historical, or cultural sources
  • Theoretical: In many humanities articles, the literature review is the foundation for the theoretical framework. You can use it to discuss various theories, models, and definitions of key concepts. You can argue for the relevance of a specific theoretical approach or combine various theoretical concepts to create a framework for your research.

What are some strategies or tips I can use while writing my lit review?

Any lit review is only as good as the research it discusses; make sure your sources are well-chosen and your research is thorough. Don’t be afraid to do more research if you discover a new thread as you’re writing. More info on the research process is available in our "Conducting Research" resources.

As you’re doing your research, create an annotated bibliography (see our page on this type of document). Much of the information used in an annotated bibliography can also be used in a literature review, so you’ll not only be partially drafting your lit review as you research but also developing your sense of the larger conversation going on among scholars, professionals, and any other stakeholders in your topic.

Usually you will need to synthesize research rather than just summarizing it. This means drawing connections between sources to create a picture of the scholarly conversation on a topic over time. Many student writers struggle to synthesize because they feel they don’t have anything to add to the scholars they are citing; here are some strategies to help you:

  • It often helps to remember that the point of these kinds of syntheses is to show your readers how you understand your research, to help them read the rest of your paper.
  • Writing teachers often say synthesis is like hosting a dinner party: imagine all your sources are together in a room, discussing your topic. What are they saying to each other?
  • Look at the in-text citations in each paragraph. Are you citing just one source for each paragraph? This usually indicates summary only. When you have multiple sources cited in a paragraph, you are more likely to be synthesizing them (not always, but often); a rough way to script this check is sketched after this list.
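As a purely illustrative heuristic (not part of the OWL resource), the citation-per-paragraph check can even be scripted; the pattern below only matches simple "Author (Year)" citations and the sample draft text is hypothetical:

```python
# Rough heuristic sketch: count in-text citations per paragraph to flag
# paragraphs that may only summarize a single source rather than synthesize.
import re

draft = """Smith (2019) found X. Jones (2020) and Lee (2021) report similar results.

Brown (2018) argues Y."""

citation_pattern = re.compile(r"\b[A-Z][a-z]+ \(\d{4}\)")  # matches "Author (Year)"

for i, paragraph in enumerate(draft.split("\n\n"), start=1):
    n = len(citation_pattern.findall(paragraph))
    verdict = "likely synthesis" if n > 1 else "possibly summary only"
    print(f"Paragraph {i}: {n} citation(s) -> {verdict}")
```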

The most interesting literature reviews are often written as arguments (again, as mentioned at the beginning of the page, this is discipline-specific and doesn’t work for all situations). Often, the literature review is where you can establish your research as filling a particular gap or as relevant in a particular way. You have some chance to do this in your introduction in an article, but the literature review section gives a more extended opportunity to establish the conversation in the way you would like your readers to see it. You can choose the intellectual lineage you would like to be part of and whose definitions matter most to your thinking (mostly humanities-specific, but this goes for sciences as well). In addressing these points, you argue for your place in the conversation, which tends to make the lit review more compelling than a simple reporting of other sources.

Critical Analysis of Reliability and Validity in Literature Reviews

Chetwynd, E. J., Department of Family Medicine, University of North Carolina at Chapel Hill (Journal of Human Lactation; DOI: 10.1177/08903344221100201)

Introduction

Literature reviews can take many forms depending on the field of specialty and the specific purpose of the review. The evidence base for lactation integrates research that cuts across multiple specialties (Dodgson, 2019), but the most common literature reviews accepted in the Journal of Human Lactation include scoping reviews, systematic reviews, and meta-analyses. Scoping reviews map out the literature in a particular topic area or answer a question about a particular concept or characteristic of the literature on a particular topic. They are broad, detailed, often focused on emerging evidence, and can be used to determine whether a more rigorous systematic review would be useful (Munn et al., 2018). To this end, a scoping review can draw from various sources of evidence, including expert opinion and policy documents, sometimes referred to as "grey literature" (Tricco et al., 2018).

A systematic review has a different purpose from a scoping review. According to the Cochrane Library (www.cochranelibrary.com), under the section heading "What is a Systematic Review?", a systematic review will "identify, appraise and synthesize all the empirical evidence that meets pre-specified eligibility criteria to answer a specific research question" (https://www.cochranelibrary.com/about/about-cochrane-reviews). Meta-analysis takes the process of systematic review one step further by pooling the data collected and presenting aggregated summary results (Ahn & Kang, 2018).

Each type of analysis or review requires a critical analysis of the methodologies used in the reviewed articles (Dodgson, 2021). In a scoping review, the results of the critical analysis are integrated and reported descriptively, since scoping reviews are designed to broadly encapsulate all of the research in a topic area and identify the current state of the science rather than including only research that meets specific established quality guidelines (Munn et al., 2018). Systematic reviews and meta-analyses use critical analysis differently: the quality of research methods and study instruments becomes an inclusion criterion for deciding which studies to analyze, so that the authors can ensure rigor in their processes (Page et al., 2021).

Reliability and validity are research-specific terms that may be applied throughout the scientific literature to assess many elements of research methods, designs, and outcomes; here, however, we focus specifically on their use for assessing measurement in quantitative research methodology. Specifically, we examine how they are used within literature review analyses to describe the nature of the instruments applied to measure study variables. Within this framework, reliability refers to the reproducibility of the study results, should the same measurement instruments be applied in different situations (Revelle & Condon, 2019). Validity tests the interpretation of study instruments and refers to whether they measure what they have been reported to be measuring, as supported by evidence and theory in the topic area of investigation (Clark & Watson, 2019). Reliability and validity can exist separately in a study; however, robust studies are both reliable and valid (Urbina & Monks, 2021).
In order to establish a benchmark for determining quality and rigor across all methodologies and reporting guidelines (Dodgson, 2019), the Journal of Human Lactation requires that the authors of any type of literature review include two summary tables. The first table illustrates the study design broadly, asking for the study aims, a description of the sample, and the research design of each reviewed article. The second required table is focused on measurement. It guides authors to list each study’s variables, the instruments used to measure each variable, and the reliability and validity of each of these study instruments (https://journals.sagepub.com/author-instructions/jhl#LiteratureReview; Simera et al., 2010).

The techniques used to establish measurement reliability and validity are sometimes described explicitly, using either statistical testing or other recognized forms of testing (Duckett, 2021). However, there are times when the methods for evaluating the measurements used have not been explicitly stated. This situation requires the authors of the review to have a clear understanding of reliability and validity in measurement in order to extrapolate the methods the researchers may have used.

Lactation is a topic area that incorporates many fields of specialty; therefore, this article will not be an exhaustive exploration of all types of tests for measurement reliability and validity. The aim, instead, is to provide readers with enough information to feel confident about finding and assessing implicit types of measurement reliability and validity within published manuscripts. Additionally, readers will be better able to evaluate the usefulness of reviews and the instruments included in those reviews. To that end, this article will: (1) describe types of reliability and validity used in measurement; (2) demonstrate how reliability and validity might be implemented; and (3) discuss how to critically review reliability and validity in literature reviews.
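As one concrete example of the kind of explicit statistical testing mentioned above (an illustrative sketch, not drawn from the article; the response data are hypothetical), internal-consistency reliability of a multi-item instrument is often summarized with Cronbach's alpha:

```python
# Illustrative sketch: Cronbach's alpha, a common internal-consistency
# reliability statistic for multi-item questionnaire instruments.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array-like, rows = respondents, columns = questionnaire items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from five respondents to a four-item scale
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
]
print(round(cronbach_alpha(responses), 2))  # values closer to 1 indicate higher internal consistency
```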



Assessing Validity in Systematic Reviews (Internal and External)


The validity of results and conclusions is a critical aspect of a systematic review. A systematic review that doesn’t answer a valid question or hasn’t used valid methods won’t have a valid result, and its conclusions won’t be generalizable to a larger population, which leaves it with little impact or value in the literature.

So then, how can you be sure that your systematic review has an acceptable level of validity? Look at it from the perspective of both external validity and internal validity.

What you’ll learn in this post

  • The definitions of internal validity and external validity in systematic reviews.
  • Why validity is so important to consider and assess when you write a systematic literature review.
  • How validity will help expand the impact and reach of your review paper.
  • The key relationship between bias and validity.

What is validity and why is it important for systematic reviews?

Validity for systematic reviews is how trustworthy the review’s conclusions are for a reader.

Systematic reviews compile different studies and present a summary of a range of findings.

It’s strength in numbers – and this strength is why they’re at the top of the evidence pyramid, the strongest form of evidence.

Many health organizations, for instance, use evidence from systematic reviews, especially Cochrane reviews, to draft practice guidelines. This is precisely why your systematic review must have trustworthy conclusions. These will give it impact and value, which is why you spent all that time on it.

Validity measures this trustworthiness. It depends on the strength of your review methodology. External validity and internal validity are the two main means of evaluation, so let’s look at each.

External validity in systematic reviews

External validity is how generalizable the results of a systematic review are. Can you generalize the results to populations not included in your systematic review? If “yes,” then you’ve achieved good external validity.

If you’re a doctor and read a systematic review that found a particular drug effective, you may wonder if you can use that drug to treat your patients. For example, this systematic review concluded antidepressants worked better than placebo in adults with major depressive disorder. But…

  • Can the results of this study also be applied to older patients with major depressive disorder?
  • How about for adolescents or certain cultures?
  • Is the treatment regimen self-manageable?

Various factors will impact the external validity. The main ones are…

Sample size

Sampling is key. The results of a systematic review with a larger sample size will typically be more generalizable than those with a smaller sample size.

This meta-analysis estimated how sample size affected treatment effects when different studies were pooled together. The authors found that treatment effects were 32% larger in studies with smaller sample sizes than in those with larger ones. Trials with smaller sample sizes can thus produce more exaggerated results than larger trials and, by extension, than the greater population.

Using a smaller sample size for your systematic review will lower its generalizability (and thus, external validity). The simple takeaway is:

Include as many studies as possible.

This will improve the external validity of your work.
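To make the pooling idea concrete (a schematic sketch with hypothetical numbers, not the cited meta-analysis), standard inverse-variance pooling weights each study by the precision of its estimate, so a small, noisy trial contributes far less than a large one:

```python
# Schematic sketch with hypothetical numbers: fixed-effect, inverse-variance
# pooling of study effect estimates. Studies with smaller standard errors
# (usually larger samples) receive proportionally more weight.
import math

studies = [
    (0.80, 0.40),  # small trial: large effect, wide uncertainty (estimate, SE)
    (0.35, 0.15),  # medium trial
    (0.30, 0.08),  # large trial: modest effect, narrow uncertainty
]

weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled effect = {pooled:.2f} (SE {pooled_se:.2f})")
# The pooled estimate (~0.33) sits far closer to the large trial than to the
# small one, illustrating why small-study exaggeration matters when pooling.
```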

Participant characteristics

Let’s say the conclusions of your systematic review are restricted to a specific sex, age, geographic region, socioeconomic profile, etc. This limits generalizability to participants with a different set of characteristics.

For example, this review concluded that a mean of 27.22% of medical students in China had anxiety (albeit with a range of 8.54% to 88.30%). That’s a key finding from 21 studies.

But what about medical students from a different country?

Or, for that matter, what about Chinese students not studying medicine? Will a similar percentage of them suffer from anxiety?

These questions don’t decrease the value of the findings. The review provides work to build on. But technically, its external validity faces some limitations.

Study setting

Let’s say that your systematic review examined a particular risk factor for a disease in a specific setting.

Can you extrapolate those findings to other settings?

For example, this study evaluated different determinants of population health in urban settings. The authors found that income, education, air quality, occupation status, mobility, and smoking habits impacted morbidity and mortality in different urban settings.

Are the same findings valid in other urban settings in a different country? Are the findings adaptable to rural settings?

Comparators

With what are you comparing your treatment of interest in your systematic review?

If you compare a new treatment with a placebo, you may find a vast difference in treatment effects. But if you compare a new treatment with another active treatment, the difference in effects may be less prominent. See this systematic review and meta-analysis of treatments for hypertrophic scar and keloid. This review examined two treatments and a placebo to increase its external validity.

The comparator you choose for your systematic review should ideally be a close match to real-world practice. This is another way of upping its external validity.

Reporting external validity

Many systematic review guidelines insist that you report internal validity yet overlook external validity. In fact, researchers don’t usually use the very term external validity. Many authors use “generalizability,” “applicability,” “feasibility,” or “interchangeability.” They are essentially different terms for the same thing.

The PRISMA guidelines are (as of this writing) what your systematic reviews should follow. Read this article to learn about navigating PRISMA. But even PRISMA doesn’t insist on external validity as much as internal validity.

For all these reasons, authors usually don’t see the need to stress external validity in systematic reviews. Researchers have pointed this out and emphasized the importance of reporting external validity.

Nevertheless, internal validity may receive greater attention and is also critical for your systematic review’s overall validity and worth.

Internal validity in systematic reviews

As the name implies, internal validity looks at the inside of the study rather than the external factors. It’s about how strong the study methodology is, and in a systematic review, it’s largely defined by the extent of bias.

Internal validity is easier to measure and achieve than external validity. This owes to the extensive work that’s gone into measuring it. Many organizations, such as the Cochrane Collaboration and the Joanna Briggs Institute, have developed tools for calculating bias (see below). A similar effort hasn’t gone into measuring external validity.

As a systematic reviewer, you must check the methodological quality of the studies in your systematic review and report the extent of different types of bias within them. This accumulates toward your own study’s internal validity.

Selection bias

Selection bias refers to the selection of participants in a trial.

If the baseline characteristics of two groups in a study are considerably different, selection bias is likely present.

For example, in a randomized controlled trial (RCT) of a new drug for heart failure, if one group has more diabetic patients than the other, then this group is likely to have lower treatment success.

Non-uniform allocation of the intervention between the two groups can negatively affect the results.

Strong randomization can reduce selection bias. This is why RCTs are considered the gold standard in evidence generation.

To check selection bias in an RCT in your systematic review, search for words that describe how randomization was done. If the study describes a random number table, sequence generation for randomization, and/or allocation concealment before patients are assigned to the different groups, then there’s probably no selection bias.

This neurological RCT is a good example of strong randomization, despite a relatively small population (n=35).
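For intuition about what sequence generation can look like in practice (a generic sketch, not taken from the cited trial), block randomization produces an allocation sequence that stays balanced between arms while remaining unpredictable:

```python
# Generic sketch: simple block randomization, one common way to generate a
# balanced but unpredictable allocation sequence for a two-arm trial.
import random

def block_randomize(n_participants, block_size=4, seed=None):
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_participants:
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)  # order within each block is random
        sequence.extend(block)
    return sequence[:n_participants]

print(block_randomize(12, seed=42))
# Balanced numbers of "treatment" and "control" assignments in a random order;
# in a real trial the sequence would also be concealed from recruiters.
```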

Performance bias

Performance bias checks if all treatment groups in a study have received a similar level of care. A varying level of care between the groups can bias the results. Researchers often blind or mask the study participants and caregivers to reduce performance bias. An RCT with no details about blinding or masking probably suffers from performance bias.

Blinding, however, isn’t always possible, so a non-blinded study may still have worth and still warrant inclusion in your review.

For example, a cancer drug trial may compare one drug given orally and another injected drug. Or a surgical technique trial may compare surgery with non-invasive treatment.

In both situations, blinding is not practical. The existing bias should be acknowledged in such cases.

Detection bias

Detection bias can occur if the outcome assessors are aware of the intervention the groups received. If an RCT mentions that the outcome assessors were blinded or masked, this suggests a low risk of detection bias.

Blinding of outcome assessors is important when an RCT measures subjective outcomes. This study , for instance, assessed postoperative pain after gallbladder surgery. The postoperative dressing was identical so that the patients would be unaware of (blinded from) the treatment received.

Attrition bias

Attrition bias results from incomplete study data.

Over the course of an RCT, patients may be excluded from analysis or may not return for follow-up. Both result in attrition. All RCTs have some attrition, but if the attrition rate is considerably different between the study groups, the results become skewed.

Attrition bias decreases when using intention-to-treat analysis. But in a per-protocol analysis, attrition bias is usually high. If a study uses both these analyses and finds the results are similar, the attrition bias is considered low.

For example, this RCT of a surgical procedure found that the intention-to-treat analysis and per-protocol analysis were similar. This suggests low attrition bias.

If you find the RCT included in your systematic review hasn’t performed an intention-to-treat analysis, then it’s likely that the included RCT suffers from attrition bias.
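As a toy illustration with hypothetical numbers (not taken from the cited RCT), the gap between intention-to-treat and per-protocol estimates shows how attrition can inflate apparent success:

```python
# Toy illustration with hypothetical numbers: intention-to-treat (ITT) keeps
# every randomized participant in the denominator, while per-protocol (PP)
# analyzes only those who completed the assigned treatment.
randomized = 100          # participants randomized to the treatment arm
completed = 80            # participants who finished the protocol
successes_completed = 60  # treatment successes observed among completers

# One common ITT convention: participants lost to follow-up count as non-successes
itt_rate = successes_completed / randomized
pp_rate = successes_completed / completed

print(f"ITT success rate: {itt_rate:.0%}")  # 60%
print(f"PP success rate:  {pp_rate:.0%}")   # 75%
# A large gap between the two estimates is a warning sign of attrition bias.
```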

Reporting bias

When there are remarkable differences between reported and unreported findings in an RCT, that’s usually a case of reporting bias.

This bias can also arise when study authors report only statistically significant results, leaving out the non-significant ones. Many journals encourage authors to share their data sets to overcome this bias.

For an expert look at risk of bias in systematic reviews, see this article.

Calculating and reporting internal validity/bias

As bias can hurt your review’s internal validity, you must identify the different types of bias present in the studies you include.

Many tools now exist to help with this. Which tool you use depends on the nature of the studies in your review.

  • For RCTs, try Cochrane’s risk-of-bias tool for randomized trials (RoB 2).
  • For non-randomized studies, try the ROBINS-I tool.
  • For case-control studies, there’s the Newcastle–Ottawa Scale (NOS).
  • The AMSTAR-2 tool can be used for checking systematic review quality.



Literature Review: What a Literature Review Is, Why It Is Important, and How It Is Done


Evaluating Literature Reviews and Sources


A good literature review evaluates a wide variety of sources (academic articles, scholarly books, government/NGO reports). It also evaluates literature reviews that study similar topics. This page offers you a list of resources and tips on how to evaluate the sources that you may use to write your review.

  • A Closer Look at Evaluating Literature Reviews: Excerpt from the chapters “Evaluating Introductions and Literature Reviews” in Fred Pyrczak’s Evaluating Research in Academic Journals: A Practical Guide to Realistic Evaluation (Chapters 4 and 5). This PDF offers sound advice on how to evaluate introductions and literature reviews by listing questions and tips. The first part focuses on introductions; the discussion of literature reviews begins on page 10 of the PDF (page 37 in the text).
  • Tips for Evaluating Sources (Print vs. Internet Sources): An excellent page that will guide you on what to ask to determine whether your source is a reliable one. Check the guide’s other topics, Evaluating Bibliographic Citations and Evaluation During Reading.

To be able to write a good Literature Review, you need to be able to read critically. Below are some tips that will help you evaluate the sources for your paper.

Reading critically (summary from How to Read Academic Texts Critically)

  • Who is the author? What is his/her standing in the field?
  • What is the author’s purpose? To offer advice, make practical suggestions, solve a specific problem, to critique or clarify?
  • Note the experts in the field: are there specific names/labs that are frequently cited?
  • Pay attention to methodology: is it sound? what testing procedures, subjects, materials were used?
  • Note conflicting theories, methodologies and results. Are there any assumptions being made by most/some researchers?
  • Theories: have they evolved over time?
  • Evaluate and synthesize the findings and conclusions. How does this study contribute to your project?

Useful links:

  • How to Read a Paper (University of Waterloo, Canada): This excellent paper teaches you how to read an academic paper and how to determine whether it is something to set aside or something to read deeply. It offers good advice for organizing your literature for the literature review, or just for class reading.

Criteria to evaluate sources:

  • Authority: Who is the author? What are his/her credentials? What university is he/she affiliated with? What is his/her area of expertise?
  • Usefulness: How is this source related to your topic? How current and relevant is it to your topic?
  • Reliability: Does the information come from a reliable, trusted source such as an academic journal?

Useful site - Critically Analyzing Information Sources (Cornell University Library)


The Library, Technological University of the Shannon: Midwest


Questionnaire validation practice: a protocol for a systematic descriptive literature review of health literacy assessments

BMJ Open, Volume 9, Issue 10

  • Melanie Hawkins 1 (http://orcid.org/0000-0001-5704-0490),
  • Gerald R Elsworth 1,
  • Richard H Osborne 2
  • 1 School of Health and Social Development, Faculty of Health, Deakin University, Burwood, Victoria, Australia
  • 2 Global Health and Equity, Faculty of Health, Arts and Design, Swinburne University of Technology, Hawthorn, Victoria, Australia
  • Correspondence to Melanie Hawkins; melanie.hawkins{at}deakin.edu.au

Introduction Contemporary validity testing theory holds that validity lies in the extent to which a proposed interpretation and use of test scores is justified, the evidence for which is dependent on both quantitative and qualitative research methods. Despite this, we hypothesise that development and validation studies for assessments in the field of health primarily report a limited range of statistical properties, and that a systematic theoretical framework for validity testing is rarely applied. Using health literacy assessments as an exemplar, this paper outlines a protocol for a systematic descriptive literature review about types of validity evidence being reported and if the evidence is reported within a theoretical framework.

Methods and analysis A systematic descriptive literature review of qualitative and quantitative research will be used to investigate the scope of validation practice in the rapidly growing field of health literacy assessment. This review method employs a frequency analysis to reveal potentially interpretable patterns of phenomena in a research area; in this study, patterns in types of validity evidence reported, as assessed against the criteria of the 2014 Standards for Educational and Psychological Testing, and in the number of studies using a theoretical validity testing framework. The search process will be consistent with the Preferred Reporting Items for Systematic Reviews and Meta-analyses statement. Outcomes of the review will describe patterns in reported validity evidence, methods used to generate the evidence and theoretical frameworks underpinning validation practice and claims. This review will inform a theoretical basis for future development and validity testing of health assessments in general.

Ethics and dissemination Ethics approval is not required for this systematic review because only published research will be examined. Dissemination of the review findings will be through publication in a peer-reviewed journal, at conference presentations and in the lead author’s doctoral thesis.

  • validity testing theory
  • health literacy
  • health assessment
  • measurement

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/ .

https://doi.org/10.1136/bmjopen-2019-030753


Strengths and limitations of this study

This is the first systematic literature review to examine types of validity evidence for a range of health literacy assessments within the framework of the authoritative reference for validity testing theory, The Standards for Educational and Psychological Testing.

The review is grounded in the contemporary definition of validity as a quality of the interpretations and inferences made from measurement scores rather than as solely based on the properties of a measurement instrument.

The search for the review will be limited only by the end search date (March 2019) because health literacy is a relatively new field and publications are not expected from earlier than about 30 years ago.

All definitions of health literacy and all types of health literacy assessment instruments will be included.

A limitation of the review is that the search will be restricted to studies published and instruments developed in the English language, and this may introduce an English language and culture bias.

Introduction

Historically, the focus of validation practice has been on the statistical properties of a test or other measurement instrument, and this has been adopted as the basis of validity testing for individual and population assessments in the field of health. 1 However, advancements in validity testing theory hold that validity lies in the justification of a proposed interpretation of test scores for an intended purpose, the evidence for which includes but is not limited to the test’s statistical properties. 2–7 Therefore, to validate means to investigate , through a range of methods, the extent to which a proposed interpretation and use of test scores is justified. 7–9 The term ‘test’ in this paper is used in the same sense as Cronbach uses it in his 1971 Test Validation chapter 8 to refer to all procedures for collecting data about individuals and populations. In health, these procedures include objective tests (eg, clinical assessments) and subjective tests (eg, patient questionnaires) or a combination of both and may involve quantitative (eg, questionnaire) or qualitative methods (eg, interview). The act of testing results in data that require interpretation. In the field of health, such interpretations are usually used for making decisions about individuals or populations. The process of validation needs to provide evidence that these interpretations and decisions are credible, and a theoretical framework to guide this process is warranted. 1 2 10

The authoritative reference for validity testing theory comes from education and psychology: the Standards for Educational and Psychological Testing (the Standards ). 3 The Standards define validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’ and that ‘the process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations’ (p.11). 3 A test’s proposed score interpretation and use is described in Kane’s argument-based approach to validation as an interpretation/use argument (IUA; also called an interpretive argument). 11 12 Validity testing theory requires test developers and users to generate and evaluate a range of validity evidence such that a validity argument can determine the plausibility of the IUA. 3 7 9 11 12 Despite this contemporary stance on validity testing theory and practice, the application of validity testing theory and methodology is not common practice for individual and population assessments in the field of health. 1 Furthermore, there are calls for developers, users and translators/adapters of health assessments to establish theoretically driven validation plans for IUAs such that validity evidence can be systematically collected and evaluated. 1 2 7 10

The Standards provide a theoretical framework that can be used or adapted to form a validation plan for development of a new test or to evaluate the validity of an IUA for a new context. 1 2 Based on the notion that construct validity is the foundation of test development and use, the theoretical framework of the Standards outlines five sources of evidence on which validity arguments should be founded: (1) test content, (2) response processes, (3) internal structure, (4) relationship of scores to other variables and (5) validity and the consequences of testing (table 1). 3

Table 1. The five sources of validity evidence. 3

Validity testing in the health context

Two of the five sources of validity evidence defined by the Standards (internal structure and relationship of scores to other variables) have a focus on the statistical properties of a test. However, the other three (test content, response processes and consequences of testing) are strongly reliant on evidence based on qualitative research methods. Greenhalgh et al have called for more credence and publication space to be given to qualitative research in the health sciences. 13 Zumbo and Chan (p.350, 2014) call specifically for more validity evidence from qualitative and mixed methods. 1 It is time to systematically assess if test developers and users in health are generating and integrating a range of quantitative and qualitative evidence to support inferences made from these data. 1

In chapter 1 of their book, Zumbo and Chan report the results of a systematic search of validation studies from the 1960s to 2010. Results from this search for the health sciences categories of ‘life satisfaction, well-being or quality of life’ and ‘health or medicine’ show that there has been a dramatic increase in the publication of validation studies since the 1990s, producing primarily what is classified as construct validity. 1 Given this was a snapshot review of validation practice during these years, the authors do not delve into the methods used to generate evidence for construct validity. However, Barry et al., in a systematic review investigating the frequency with which psychometric properties were reported for validity and reliability in health education and behaviour (also published in 2014), found that the primary methods used to generate evidence for construct validity were factor analysis, correlation coefficients, and χ². 14 This limited view of construct validity as simply correlation between items or tests measuring the same or similar constructs is at odds with the Standards, where evaluation and integration of evidence from perhaps several other sources (ie, test content, response processes, internal structure, relationships with theoretically predicted external variables, and intended and unintended consequences) is needed to determine the degree to which a construct is represented by score interpretations (p.11). 3
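To make concrete what "correlation as construct validity evidence" means in practice (an illustrative sketch with hypothetical scores, not data from the cited reviews), a convergent-validity check might correlate scores from a new instrument with an established instrument measuring the same construct:

```python
# Illustrative sketch with hypothetical scores: a Pearson correlation between a
# new instrument and an established instrument measuring the same construct is
# one common, but limited, piece of construct-validity evidence.
import numpy as np

new_instrument = np.array([12, 15, 9, 20, 17, 11, 14])
established_instrument = np.array([30, 38, 25, 49, 41, 28, 36])

r = np.corrcoef(new_instrument, established_instrument)[0, 1]
print(f"convergent correlation r = {r:.2f}")
# A high r supports convergent validity, but the Standards expect evidence from
# several sources (content, response processes, structure, consequences), not
# correlation alone.
```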

Health literacy

This literature review will examine validity evidence for health literacy assessments. Health literacy is a relatively new area of measurement, and there has been a rapid development in the definition and measurement of this multi-dimensional concept. 15–18 Health literacy is now a priority of the WHO, 19 and many countries have incorporated it into health policy, 20–24 and are including it in national health surveys. 25–27

Definitions of health literacy range from those for functional health literacy (ie, a focus on comprehension and numeric abilities) to multi-dimensional definitions such as that used by the WHO: ‘the cognitive and social skills which determine the motivation and ability of individuals to gain access to, understand and use information in ways which promote and maintain good health’. 28 The general purpose of health literacy assessment is to determine pathways to facilitate access to and improve understanding and use of health information and services, as well as to improve or support the health literacy responsiveness of health services. 28–31 However, these two uses of data (in general, to improve patient outcomes and to improve organisational procedures) may require evaluative integration of different types of evidence to justify score interpretations to inform patient interventions or organisational change. 3 7 9 11 32 A strong and coherent evidence-based conception of the health literacy construct is required to support score interpretations. 14 33–35 Decisions that arise from measurements of health literacy will affect individuals and populations and, as such, there must be a strong argument for the validity of score interpretations for each measurement purpose.

To enhance the quality and transparency of the proposed systematic descriptive literature review, this protocol paper outlines the scope and purpose of the review. 36 37 Using the theoretical framework of the five sources of validity evidence of the Standards , and health literacy assessments as an exemplar, the results of this systematic descriptive literature review will indicate current validation practice. The assumptions that underlie this literature review are that, despite the advancement of contemporary validity testing theory in education and psychology, a systematic theoretical framework for validity testing has not been applied in the field of health, and that validation practice for health assessments remains centred on general psychometric properties that typically provide insufficient evidence that the test is fit for its intended use. The purpose of the review is to investigate quantitative and qualitative validity evidence reported for the development and testing of health literacy assessments to describe patterns in the types of validity evidence reported, 38–45 and identify use of theory for validation practice. Specifically, the review will address the following questions:

What is being reported as validity evidence for health literacy assessment data?

Do the studies place the validity evidence within a validity testing framework, such as that offered by the Standards ?

Methods and analysis

Review method

This review is designed to provide the basis for a critique of validation practice for health literacy assessments within the context of the validity testing framework of the Standards. It is not an evaluation of the specific arguments that authors have made about validity from the data that have been gathered for individual measurement instruments. The review is intended to quantify the types of validity evidence being reported, so a systematic descriptive literature review was chosen as the most appropriate review technique. Described by King and He (2005) 42 as belonging towards the qualitative end of a continuum of review techniques, a descriptive literature review nevertheless employs a frequency analysis to reveal interpretable patterns in a research area; in this review, the patterns of interest are the types of validity evidence being reported for health literacy assessments and the number of studies that refer to a validity testing framework. A descriptive literature review can include qualitative and quantitative research and is based on a systematic and exhaustive review method. 38–41 43 44 The method for this review will be guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. 46
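As an illustration of the kind of frequency analysis a descriptive review relies on, the sketch below tabulates how often each of the Standards' five sources of validity evidence appears across a set of coded studies. It is a minimal example in Python with pandas; the records, field names and evidence labels are hypothetical and are not the review's actual extraction data.

```python
# Minimal sketch of a descriptive frequency analysis (illustrative data only).
import pandas as pd

# Hypothetical extraction records: one row per included study, with the sources
# of validity evidence coded against the Standards' framework.
records = pd.DataFrame([
    {"study": "Study A", "year": 1999, "evidence": ["internal structure"]},
    {"study": "Study B", "year": 2008, "evidence": ["test content", "internal structure"]},
    {"study": "Study C", "year": 2016, "evidence": ["response processes", "consequences of testing"]},
])

# Explode the evidence lists so each (study, evidence type) pair is one row,
# then count how often each source of validity evidence is reported.
long = records.explode("evidence")
print(long["evidence"].value_counts())

# Proportion of included studies reporting each source of evidence.
print((long.groupby("evidence")["study"].nunique() / len(records)).round(2))
```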

Eligibility criteria

This literature review is not an assessment of participant data but a collation of reported validity evidence. As such, the focus is not on the participants in the studies but on the evidence presented in support of the validity of interpretations and uses of health literacy assessment data. This means that it will be the type of study that is considered for inclusion rather than the type of study participant. Inclusion criteria are as follows:

Development/application/validation studies about health literacy assessments : We expect to find many papers that describe the development and initial validation studies of health literacy assessments. Papers that use an existing health literacy assessment to measure outcomes but do not claim to conduct validity testing will not be included. Studies of comparison (eg, participant groups) or of prediction (eg, health literacy and hospital admissions) will be included only if the authors openly claim that the study results contribute validation evidence for the health literacy assessment instrument.

Not limited by date : There will be no start date to the search such that papers about validation and health literacy assessments from the early days of health literacy measurement will be included in the search. Health literacy is a relatively new concept and the earliest papers are expected to date back only about 30 years. The end search date was in March 2019.

Studies published and health literacy assessments developed in the English language: Due to resource limitations, the search will be restricted to studies published in the English language and instruments developed in the English language. Translated instruments will be excluded. We realise that these exclusions introduce an English language and culture bias, and we note that a similar descriptive review of published studies about health literacy assessments developed in or translated to other languages is warranted.

Qualitative and quantitative research methods : Given that comprehensive validity testing includes both qualitative and quantitative methods, studies employing either or both will be included.

All definitions of health literacy : Definitions of health literacy have been accumulating over the past 30 years and reflect a range of health literacy testing methods as well as contexts, interpretations and uses of the data. We include all definitions of health literacy and all types of health literacy assessment instruments, which may include objective, subjective, uni-dimensional and multi-dimensional measurement instruments.

Exclusion criteria

Systematic reviews and other types of reviews captured by the search will not be included in the analysis. However, before being excluded, the reference lists will be checked for articles that may have been missed by the database search. Predictive, association or other comparative studies that do not explicitly claim in the abstract to contribute validity evidence will also not be included. Instruments developed in languages other than English, and translation studies, will be excluded as noted previously.

Information sources

Systematic electronic searches of the following databases will be conducted in EBSCOhost: MEDLINE Complete, Global Health, CINAHL Complete, PsycINFO and Academic Search Complete. EMBASE will also be searched. The electronic database search will be supplemented by searching for dissertations and theses through proquest.com, dissertation.com and openthesis.org. Reference lists of pertinent systematic reviews that are identified in the search will be scanned, as well as article reference lists and the authors’ personal reference lists, to ensure all relevant articles have been captured. The search terms will use medical subject headings and text words related to types of assessment instruments, health literacy, validation and validity testing. Peer reviewed full articles and examined theses will be included in the search.

Search strategy

An expert university librarian has been consulted as part of planning the literature search strategy. The strategy will focus on health literacy, types of assessment instruments, validation and validity, and methods used to determine the validity of interpretation and use of data from health literacy assessments. The search terms have been determined through scoping searches and examining search terms from other measurement and health literacy systematic reviews. The database searches were completed in March 2019 and the search terms used are described in online supplementary file 1 .

Supplemental material

Study selection

Literature search results will be saved and the titles and abstracts downloaded to EndNote X9 reference management software. Titles and abstracts of the search results will be screened for duplicates and against the inclusion and exclusion criteria. The full texts of articles that meet, or potentially meet, the eligibility criteria will then be obtained and screened. Excluded articles and reasons for exclusion will be recorded. The PRISMA flow diagram will be used to document the review process. 46
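Duplicate screening will be handled in EndNote in practice, but the underlying idea can be illustrated with a small sketch: normalise titles so that formatting differences between database exports do not hide duplicates, then keep the first occurrence of each normalised title. The function names and example titles below are illustrative only.

```python
import re

def normalise(title: str) -> str:
    # Lower-case, strip punctuation and collapse whitespace so trivial formatting
    # differences between database exports do not hide duplicate records.
    stripped = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", stripped).strip()

def deduplicate(titles):
    seen, unique = set(), []
    for title in titles:
        key = normalise(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

print(deduplicate([
    "Health Literacy Assessment: A Validation Study",
    "Health literacy assessment - a validation study.",  # same record, different database
]))
```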

Data extraction

The data extraction framework will be adapted from tables in Hawkins et al 2 (p.1702) and Cox and Owen (p.254). 47 Data extraction from eligible articles will be conducted by one reviewer (MH) and comprehensively checked by a second reviewer (GE).

Subjective and objective health literacy assessments will be identified along with those that combine objective and subjective items or scales. Data to be extracted will include the date and source of publication; the context of the study (eg, country, type of organisation/institution, type of investigation, representative population); statements about the use of a theoretical validity testing framework; the types of validity evidence reported; the methods used to generate the evidence; and the validation claims made by the authors of the papers, as based on their reported evidence.
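To make the extraction framework concrete, the sketch below models one extraction record with fields mirroring the list above. The field names and example values are hypothetical illustrations; the actual extraction form is adapted from Hawkins et al and Cox and Owen.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    citation: str                      # date and source of publication
    country: str                       # study context: country
    setting: str                       # study context: type of organisation/institution
    population: str                    # representative population
    assessment_type: str               # objective, subjective or combined
    uses_validity_framework: bool      # does the paper state a theoretical validity testing framework?
    evidence_types: list = field(default_factory=list)   # e.g. "test content", "internal structure"
    methods: list = field(default_factory=list)          # e.g. "cognitive interviews", "factor analysis"
    author_claims: str = ""            # validation claims made by the paper's authors

# Hypothetical example record.
record = ExtractionRecord(
    citation="Example et al, 2015, Journal of Health Literacy",
    country="Australia",
    setting="primary care",
    population="adults attending community health services",
    assessment_type="subjective",
    uses_validity_framework=False,
    evidence_types=["internal structure"],
    methods=["confirmatory factor analysis"],
)
print(record)
```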

Data synthesis and analysis

A descriptive analysis of extracted data, as based on the theoretical framework of the Standards , will be used to identify patterns in the types of validity evidence being reported, the methods used to generate the evidence and theoretical frameworks underlying validation practice. Where possible and relevant to the concept of validity, changes in validation practice and assessment of health literacy over time will be explored. It is possible that one study may use more than one method and generate more than one type of validity evidence. Statements about a theoretical underpinning to the generation of validity evidence will be collated.
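Where change over time is explored, a simple cross-tabulation of evidence type against publication decade is one way of surfacing patterns. The sketch below assumes records like those described above; the data shown are illustrative only.

```python
import pandas as pd

# Hypothetical long-format records: one row per (publication year, evidence type) pair.
long = pd.DataFrame([
    {"year": 1995, "evidence": "internal structure"},
    {"year": 2005, "evidence": "relations to other variables"},
    {"year": 2017, "evidence": "test content"},
    {"year": 2017, "evidence": "response processes"},
])
long["decade"] = (long["year"] // 10) * 10

# Cross-tabulate evidence type by decade to show how reporting patterns shift over time.
print(pd.crosstab(long["decade"], long["evidence"]))
```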

Patient and public involvement

Patients and the public were not involved in the development or design of this literature review.

With the increasing use of health assessment data for decision-making, the health of individuals and populations relies on test developers and users to provide evidence for validity arguments for the interpretations and uses of these data. This systematic descriptive literature review will collate existing validity evidence for health literacy assessments developed in English, identify patterns of reporting frequency according to the five sources of evidence in the Standards, and establish whether the validity evidence is being placed within a theoretical framework for validation planning. 3 The potential implications of this review include finding that, when assessed against the Standards’ theoretical framework, current validation practice in health literacy (and possibly in health assessment in general) has limited capacity for determining valid score interpretation and use. The Standards’ framework challenges the long-held perception in health assessment that validity refers to an assessment tool rather than to the interpretation of data for a specific use. 48 49

The validity of decisions based on research data is a critical aspect of health services research. Our understanding of the phenomena we research is dependent on the quality of our measurement of the constructs of interest, which, in turn, affects the validity of the inferences we make and actions we take from data interpretations. 6 7 Too often the measurement quality is considered separate to the decisions that need to be made. 6 50 However, questionable measurement (perhaps through use of an instrument that was developed using suboptimal methods, was inappropriately applied or through gaps in validity testing) cannot lead to valid inferences. 3 50 To make appropriate and responsible decisions for individuals, communities, health services and policy development, we must consider the integrity of the instruments, and the context and purpose of measurement, to justify decisions and actions based on the data.

A limitation of the review is that the search will be restricted to studies published and instruments developed in the English language, and this may introduce an English language and culture bias. A similar review of health literacy assessments developed in or translated to other languages is warranted. A further limitation is that we rely on the information authors provide in identified articles. It is possible that some authors have an incomplete understanding of the specific methods they are using and reporting, and may not accurately or clearly provide details on validity testing procedures employed. Documentation for decisions made during data extraction will be kept by the researchers.

Health literacy is a relatively new area of research. We are fortunate to be at the start of a burgeoning field and can include all publications about validity testing of English-language health literacy assessments. The inclusion of the earliest to the most recent publications provides the opportunity to understand changes and advancements in health literacy measurement and methods of analysis since the introduction of the concept of health literacy. Using health literacy assessments as an exemplar, the outcomes of this review will guide and inform a theoretical basis for the future practice of validity testing of health assessments in general to ensure, as far as is possible, the integrity of the inferences made from data for individual and population benefits.

Acknowledgments

The authors acknowledge and thank Rachel West, Deakin University Liaison Librarian, for her expertise and advice during the preparation of this systematic literature review.


Twitter @4MelanieHawkins

Contributors MH and RHO conceptualised the research question and analytical plan. Under supervision from RHO, MH led the development of the search strategy, selection criteria, data extraction criteria and analysis method, which was then comprehensively assessed and checked by GRE. MH drafted the initial manuscript and led subsequent drafts. GRE and RHO read and provided feedback on manuscript iterations. All authors approved the final manuscript. RHO is the guarantor.

Funding MH is funded by a National Health and Medical Research Council (NHMRC) of Australia Postgraduate Scholarship (APP1150679). RHO is funded in part through a National Health and Medical Research Council (NHMRC) of Australia Principal Research Fellowship (APP1155125).

Competing interests None declared.

Patient consent for publication Not required.

Ethics approval Ethics approval is not required for this systematic review because only published research will be examined. Dissemination will be through publication in a peer-reviewed journal and at conference presentations, and in the lead author’s doctoral thesis.

Provenance and peer review Not commissioned; externally peer reviewed.


Duke University Libraries

Literature Reviews


What is a literature review?


Definition: A literature review is a systematic examination and synthesis of existing scholarly research on a specific topic or subject.

Purpose: It serves to provide a comprehensive overview of the current state of knowledge within a particular field.

Analysis: Involves critically evaluating and summarizing key findings, methodologies, and debates found in academic literature.

Identifying Gaps: Aims to pinpoint areas where there is a lack of research or unresolved questions, highlighting opportunities for further investigation.

Contextualization: Enables researchers to understand how their work fits into the broader academic conversation and contributes to the existing body of knowledge.


tl;dr  A literature review critically examines and synthesizes existing scholarly research and publications on a specific topic to provide a comprehensive understanding of the current state of knowledge in the field.

What is a literature review NOT?

❌ An annotated bibliography

❌ Original research

❌ A summary

❌ Something to be conducted at the end of your research

❌ An opinion piece

❌ A chronological compilation of studies


Literature Reviews: An Overview for Graduate Students

While this 9-minute video from NCSU is geared toward graduate students, it is useful for anyone conducting a literature review.

Check out these books

  • Writing the literature review: A practical guide
  • Writing literature reviews: A guide for students of the social and behavioral sciences
  • So, you have to write a literature review: A guided workbook for engineers
  • Telling a research story: Writing a literature review
  • The literature review: Six steps to success
  • Systematic approaches to a successful literature review
  • Doing a systematic review: A student's guide



Validity and reliability of a questionnaire: a literature review


Questionnaires form an important part of research methodology. However, the research hypothesis of interest often has no standard questionnaire or item to measure it, and researchers frequently end up using self-made questionnaires. The main problem with such questionnaires is their validity and reliability. Although advances in technology have produced much software to measure these properties, without a basic knowledge of validity and reliability the software is of little use. This article highlights and summarises the important aspects of validity and reliability of a questionnaire.




Title: Artificial Intelligence for Literature Reviews: Opportunities and Challenges

Abstract: This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). A SLR is a rigorous and organised methodology that assesses and integrates previous research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process. The increasing role of AI in this field shows great potential in providing more effective support for researchers, moving towards the semi-automatic creation of literature reviews. Our study focuses on how AI techniques are applied in the semi-automation of SLRs, specifically in the screening and extraction phases. We examine 21 leading SLR tools using a framework that combines 23 traditional features with 11 AI features. We also analyse 11 recent tools that leverage large language models for searching the literature and assisting academic writing. Finally, the paper discusses current trends in the field, outlines key research challenges, and suggests directions for future research.


Health Services Research, vol. 54(5), October 2019

A systematic review of the validity and reliability of patient‐reported experience measures

Claudia Bull, Joshua Byrnes, Ruvini Hettiarachchi, Martin Downes

1 Centre for Applied Health Economics (CAHE), Griffith University, Brisbane, Queensland, Australia

2 Menzies Health Institute Queensland (MHIQ), Brisbane, Queensland, Australia

Objective

To identify patient-reported experience measures (PREMs), assess their validity and reliability, and assess any bias in the study design of PREM validity and reliability testing.

Data Sources/Study Setting

Articles reporting on PREM development and testing sourced from MEDLINE, CINAHL and Scopus databases up to March 13, 2018.

Study Design

Systematic review.

Data Collection/Extraction Methods

Critical appraisal of PREM study design was undertaken using the Appraisal tool for Cross-Sectional Studies (AXIS). Critical appraisal of PREM validity and reliability was undertaken using a revised version of the COSMIN checklist.

Principal Findings

Eighty-eight PREMs were identified, spanning four main health care contexts. PREM validity and reliability were supported by appropriate study designs. Internal consistency (n = 58, 65.2 percent), structural validity (n = 49, 55.1 percent), and content validity (n = 34, 38.2 percent) were the most frequently reported validity and reliability tests.

Conclusions

Careful consideration should be given when selecting PREMs, particularly as seven of the 10 validity and reliability criteria were not undertaken in ≥50 percent of the PREMs. Testing PREM responsiveness should be prioritized for the application of PREMs where the end user is measuring change over time. Assessing measurement error/agreement of PREMs is important to understand the clinical relevancy of PREM scores used in a health care evaluation capacity.

1. INTRODUCTION

Patient‐reported experience measures (PREMs) are tools that capture “what” happened during an episode of care, and “how” it happened from the perspective of the patient. 1 , 2 , 3 PREMs differ from patient‐reported outcome measures (PROMs), which aim to measure patients’ health status, 4 and the more subjective patient satisfaction measures, which are an indication of how well a patient's expectations were met, 5 a benchmark which is criticized for being too heavily influenced by past health care encounters. 6

Patient‐reported experience measures are gaining attention as an indicator of health care quality and can provide information regarding the patient‐centeredness of existing services as well as areas for potential improvement regarding health care delivery. 7 The purpose of employing PREMs is consistent with the Institute of Medicine's (IOM) definition of health care quality, defined as care that is patient‐centered, effective, efficient, timely, and equitable. 8 In recent years, PREMs have been used to inform pay‐for‐performance (P4P) and benchmarking schemes, adjunct with other health care quality domains, including clinical quality/effectiveness, health information technology, and resource use. 9 , 10 Such schemes see health care services financially rewarded for their performance across these domains of health care quality, as opposed to the traditional fee‐for‐service payment system, which may inadvertently promote low‐value care. 10 , 11

While there is evident merit behind utilizing PREMs in health care quality evaluations, there remains some conjecture regarding their use. Manary and colleagues 12 identify three main limitations expressed by critics of PREMs. Firstly, patient‐reported experience is largely seen as congruent with terms such as “patient satisfaction” and “patient expectation,” both of which are subjective terms that can be reflective of judgments on the adequacy of health care and not the quality . 12 , 13 , 14 Secondly, PREMs may be confounded by factors not directly related to the quality of health care experienced by the patient, such as health outcomes. 12 And finally, PREMs can be a reflection of patients’ preconceived health care “ideals” or expectations and not their actual care experience. 12 All three limitations are indicative of a blurring of concept boundaries and inappropriate interchanging of concepts. While this is not unique to PREMs, it does suggest a low level of concept maturity regarding patient‐reported experiences 15 and, consequently, is an area of research that warrants greater attention.

Despite these limitations, PREMs have gained international recognition as an indicator of health care quality. This is largely because: (a) they enable patients to comprehensively reflect on the interpersonal aspects of their care experience 16 ; (b) they can be utilized as a common measure for public reporting, benchmarking of institutions/centers and health care plans 10 ; and (c) they can provide patient‐level information that is useful in driving service quality improvement strategies. 17 , 18

Understanding the validity and reliability of PREMs is integral to the appropriate selection of instruments for quality assessment of health care services, in conjunction with other aspects, such as the clinical relevance of an instrument and the domains of patient‐reported experience that the PREM covers. Validity refers to the ability of an instrument to measure what it intends to measure, and reliability refers to the ability of an instrument to produce consistent results under similar circumstances, as well as to discriminate between the performance of different providers. 19 , 20 It is important to assess these properties in order to understand the risk of bias that may arise in employing certain instruments 21 and whether instruments are suitable for capturing patient‐reported experience data.

While two previously published systematic reviews have examined the psychometric testing of PREMs, one related to PREMs for inpatients, 16 and the other for emergency care service provision, 22 there has been no comprehensive examination of the tools available across a range of health care contexts. The aim of this systematic review was to identify PREMs, assess their validity and reliability, and assess any bias in the study design of PREM validity and reliability testing, irrespective of the health care context the PREMs are designed to be used in.

1.1. Objectives

  • To identify existing tools for measuring patient‐reported experiences in health care, irrespective of the context
  • To critically appraise bias in the study design employed in PREM validity and reliability testing, and
  • To critically appraise the results of validity and reliability testing undertaken for these PREMs.

2. METHODS

In conducting this systematic review, the authors conformed to the Preferred Reporting Items for Systematic Reviews and Meta‐Analysis (PRISMA) statement. 23 This review was registered with PROSPERO (registration number: CRD42018089935).

2.1. Search strategy and eligibility criteria

The databases searched were MEDLINE (Ovid), CINAHL Plus with full text (EBSCOhost), and Scopus (Elsevier). No date restriction was applied to the search strategy; records were searched up to March 13, 2018. Patient “satisfaction” was included in the search terms in order not to limit our results, as there is a blurring of these terms and some PREMs may be labeled as satisfaction measures.

Articles were included in the systematic review if they met all the following criteria:

  • Described the development and evaluation of PREMs
  • Published in English
  • Full‐text published in peer‐reviewed journals
  • Labeled as a satisfaction questionnaire, but framed around measuring patients’ experiences (eg, the Surgical In‐Patient Satisfaction (SIPs) instrument 24 )

Articles were excluded if they met any of the following criteria:

  • Instruments labeled as a satisfaction questionnaire which were: (a) framed around measuring patient levels of satisfaction; (b) inclusive of a global satisfaction question or visual analogue scale; and (c) developed based on satisfaction frameworks or content analyses
  • Patient expectation questionnaires
  • Quality of care questionnaires
  • Patient participation questionnaires
  • Related to patient experiences of a specific treatment or intervention (eg, insulin devices, hearing aids, food services, anesthesia, and medication/pharmaceutical dispensary) or specific program (eg, education programs)
  • Measuring emotional care received by patients (eg, empathy)
  • Studies where PREMs were completed entirely by proxy (completed by populations not receiving the care); however, if proxy‐reported data comprised only a small proportion of data collected (patient‐reported data also reported), then the study was still included
  • Quality improvement initiatives
  • Patient attitude scales
  • Patient experience questionnaires comprised of a single domain, or
  • PREMs superseded by a more up‐to‐date version of the same PREM with corresponding updated validity and reliability testing

The full search strategy for each database is provided in Appendix  S1 . All references were imported into EndNote (Version 8, Clarivate Analytics), and duplicates were removed. Two reviewers independently screened paper titles and abstracts for inclusion. Where the title and abstract were not informative enough to make a decision, the full‐text article was reviewed. Figure  1 depicts the PRISMA flow diagram of this process. The two reviewers handled disagreements regarding article inclusion or exclusion. Where a decision could not be made, a third reviewer adjudicated the final decision. Reference list handsearching was also employed for the identification of PREMs not identified through database searching, and updates for PREMs originally identified through the database searching.

Figure 1: PRISMA diagram of patient-reported experience measure search.

2.2. Data extraction

Descriptive data were independently extracted from the included articles by two reviewers into a standardized Excel extraction form (refer to Appendix S2). Discrepancies in the extracted data were discussed between the two reviewers, or adjudicated by a third if necessary. If there was insufficient information in the full-text article regarding the validity and reliability testing undertaken, the article was excluded.

2.3. Critical appraisal

To critically appraise bias in the study design employed in PREM validity and reliability testing, the Appraisal tool for Cross‐Sectional Studies (AXIS) 25 was used. This is a 20‐item appraisal tool developed in response to the increase in cross‐sectional studies informing evidence‐based medicine and the consequent importance of ensuring that these studies are of high quality and low bias. 25 The purpose of employing the AXIS tool in the present systematic review was to ensure that the results of PREM validity and reliability testing were supported by appropriate study designs and thus able to be interpreted as a robust representation of how valid and/or reliable a PREM is. The AXIS assesses the quality of cross‐sectional studies based on the following criteria: clarity of aims/objectives and target population; appropriate study design and sampling framework; justification for the sample size; measures taken to address nonresponders and the potential for response bias; risk factors/outcome variables measured in the study; clarity of methods and statistical approach; appropriate result presentation, including internal consistency; justified discussion points and conclusion; discussion of limitations; and identification of ethical approval and any conflicts of interest. 25

The scoring system conforms to a “yes,” “no,” or “do not know/comment” design. PREMs were categorized into quartiles: >15 AXIS criteria met, 10‐15 AXIS criteria met, 5‐9 AXIS criteria met, and ≤4 AXIS criteria met. The AXIS tool was used to appraise the most recent publication for each PREM as this was also reflective of the most recent version of validity and reliability testing that the PREM that had undergone.
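The banding rule described above can be expressed directly. The sketch below counts the "yes" answers across the 20 AXIS items and assigns the corresponding band; the answers shown are hypothetical, not the review's appraisals.

```python
def axis_band(yes_count: int) -> str:
    # Bands as described in the review: >15, 10-15, 5-9 and <=4 criteria met.
    if yes_count > 15:
        return ">15 AXIS criteria met"
    if yes_count >= 10:
        return "10-15 AXIS criteria met"
    if yes_count >= 5:
        return "5-9 AXIS criteria met"
    return "<=4 AXIS criteria met"

answers = ["yes"] * 16 + ["no"] * 2 + ["do not know"] * 2   # hypothetical 20-item appraisal
print(axis_band(answers.count("yes")))                       # -> ">15 AXIS criteria met"
```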

To assess the validity and reliability testing undertaken for PREMs included in this review, we employed a revised version of the COSMIN checklist (COnsensus-based Standards for the selection of health status Measurement INstruments) published in a recent systematic review of quality in shared decision making (SDM) tools. 19 The checklist comprises 10 psychometric measurement properties and subproperties, including internal consistency; reliability; measurement error/agreement; validity (inclusive of content validity, construct validity [structural validity, hypothesis testing, and cross-cultural validity], and criterion validity); responsiveness; and item response theory (IRT). Appendix S3 provides definitions for each of these measurement properties and identifies the appraisal parameters used to assess them. 26

Reporting of these measurement properties conforms to the following: “+” (meets criteria), “−” (does not meet criteria), or “?” (unclear or missing information). These scores were numerically coded, and PREMs were ranked within their corresponding context(s) (refer to Appendix S4). Where more than one article was identified for the validity and reliability testing of a PREM, all articles were used to critically appraise the PREM. If the same criteria were assessed in separate studies for a given PREM and provided conflicting results (eg, a “+” and a “−” score), then the more favorable result was recorded.
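A minimal sketch of this scoring rule is shown below. The numeric coding, and the assumption that "?" is treated as more favourable than "−", are illustrative choices rather than rules taken from the review itself.

```python
# Assumed ordering for illustration: "+" (meets criteria) > "?" (unclear) > "-" (does not meet criteria).
CODE = {"+": 1, "?": 0, "-": -1}

def combine(scores):
    """Keep the most favourable of the scores reported for one measurement property."""
    return max(scores, key=lambda s: CODE[s])

scores_for_one_prem = {
    "internal consistency": ["+", "?"],   # two studies, conflicting results
    "responsiveness": ["-", "?"],
}
print({prop: combine(vals) for prop, vals in scores_for_one_prem.items()})
# -> {'internal consistency': '+', 'responsiveness': '?'}
```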

Appraisals with both tools were undertaken by one author. A sample of the revised COSMIN checklist appraisal data was cross‐checked with a second reviewer. A Kappa measure was used to assess the level of inter‐rater agreement. A Kappa value of 0.5 depicted moderate agreement, >0.7 good agreement, and >0.8 very good agreement. 27
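The agreement check can be reproduced with a standard kappa implementation; the sketch below uses scikit-learn's cohen_kappa_score on made-up rating lists for the two reviewers.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel ratings from two reviewers for the same checklist items.
rater_1 = ["+", "+", "-", "?", "+", "-", "+", "?"]
rater_2 = ["+", "+", "-", "+", "+", "-", "+", "?"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(round(kappa, 3))   # values above 0.7 were interpreted as good agreement
```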

3. RESULTS

A total of 88 PREMs were identified through the systematic literature search. Greater than one-third of these instruments were contextually designed for inpatient care services (36.4 percent), 23.9 percent for primary care services and 12.5 percent for outpatient care services. Table 1 depicts the other contexts and conditions covered by the PREMs. Roughly 20 percent of instruments were developed in the UK, while other countries included the United States (19.3 percent), Norway (14.8 percent), and the Netherlands (13.6 percent). The most common mode of PREM administration was postal (45.7 percent), followed by face to face (33.1 percent), telephone (13.6 percent), and electronic (7.6 percent). The earliest PREMs detected through the systematic search were developed in 1993. 28 , 29 The median number of items per PREM was 27 (IQR: 21-35; range: 4-82), and the median number of domains was 5 (IQR: 4-7; range: 2-13). Extracted data can be identified in Appendix S2.

PREMs identified by individual, condition, setting, and country‐specific context

Abbreviations: PREM, patient‐reported experience measure; UAE, United Arab Emirates; UK, United Kingdom; USA, United States of America.

A proxy, not the recipient of care, completed PREMs on behalf of patients in 11.4 percent of the PREMs. This was typically only for a small portion (10-12 percent) of any given study's population. Over 40 percent of the PREMs were developed and tested in languages other than English. Few papers discuss formal translation processes being undertaken for PREMs.

3.1. AXIS critical appraisal

Table  2 identifies that 63 (70.5 percent) of the papers reporting on PREMs met >15 AXIS criteria. Over a quarter of studies met 10‐15 criteria (28.4 percent), and 1.1 percent (n = 1) met 5‐9 criteria. No PREM met ≤4 AXIS criteria. The median number of “yes” scores was 16 (IQR: 15‐17). The lowest scoring of all PREMs answered “yes” to only five of the 20 AXIS questions. 30 The highest scoring PREM answered “yes” to all questions. 31 Appendix  S5 presents the AXIS results for all PREMs from highest to lowest number of AXIS criteria met.

PREMs categorized according to proportion of AXIS criteria met

Abbreviations: AXIS, Appraisal tool for Cross‐Sectional Studies; PREM, patient‐reported experience measure.

Appendix S6 identifies that all studies were assessed as presenting clear study aims and utilizing appropriate study designs to answer their research questions (Q1 and Q2). Greater than 95 percent of PREMs appropriately sampled participants to be representative of the target population under investigation (Q5). Over 95 percent of the studies reported that there was no conflict of interest related to a funding source and the interpretation of results (Q19). Questions 13 (potential for response rate bias), 14 (description of nonresponders), and 20 (attainment of ethical approval or participant consent) were the criteria least frequently met by PREM papers.

3.2. Revised COSMIN checklist validity and reliability appraisal

Appendix S4 details the validity and reliability testing undertaken for the PREMs according to the revised COSMIN checklist. PREMs are ranked within their specified contexts according to the number of positive results obtained for the validity and reliability tests. Inter-rater reliability between two assessors for a portion of the COSMIN checklist appraisals was κ = 0.761, indicative of good agreement.

Some validity and reliability tests were undertaken more often than others (Table 3). The three psychometric tests most commonly meeting “+” criteria were internal consistency (n = 58, 65.9 percent), structural validity (n = 49, 55.7 percent), and content validity (n = 33, 37.5 percent). Seven of the 10 revised COSMIN checklist criteria were not undertaken in ≥50 percent of the PREMs: (a) reliability (n = 44, 50.0 percent); (b) hypotheses testing (n = 53, 60.2 percent); (c) cross-cultural validity (n = 65, 73.9 percent); (d) criterion validity (n = 79, 89.8 percent); (e) responsiveness (n = 82, 93.2 percent); (f) item response theory (n = 84, 95.5 percent); and (g) measurement error/agreement (n = 86, 97.8 percent). None of the studies undertook testing for all 10 validity and reliability criteria.

Psychometric quality of PREMs according to the Revised COSMIN checklist

Abbreviations: COSMIN, COnsensus Standards for the selection of health Measurement INstruments; IRT, item response theory; PREM, patient‐reported experience measure.

4. DISCUSSION

The purpose of this systematic review was threefold: to identify and describe peer‐reviewed PREMs, irrespective of their contextual basis; to critically appraise PREM validity and reliability; and to critically appraise any bias in the study design of PREM validity and reliability testing. It is integral to understand whether PREMs have been subject to rigorous validity and reliability testing as this reflects whether an instrument is able to appropriately capture patient‐reported experiences of health care. In turn, it is also important to ensure that the results of PREM validity and reliability testing are underpinned by a rigorous study design so that readers can be assured that the validity and reliability results are a robust representation of how valid and/or reliable that PREM is. To our knowledge, this is the first systematic review to examine PREMs across a range of health care contexts and settings.

This systematic review identified a total of 88 PREMs. Interestingly, roughly 20 percent of the identified PREMs were developed from 2015 onwards, and a quarter of all PREMs received some form of additional validity and reliability testing in this time frame as well. Given that 1993 was the earliest PREM development year identified through the search strategy, this indicates a significant increase in the desire for instruments that measure patient experiences.

Generally, the PREMs identified in this systematic review reflect a heavy emphasis on measuring singular events of health care. The Institute of Medicine's (IOM) 2001 report on crossing the quality chasm identified that despite a significant increase in chronic and complex conditions, health care systems are still devoted to acute episodes of care. 32 Overwhelmingly, this sentiment still rings true today, despite efforts to promote greater integration and coordination of care across and within health care services, as well as patient-centric, high-quality health care. 8 , 33 For example, this systematic review identified only one peer-reviewed PREM targeting chronic disease holistically (as opposed to a singular disease focus) and two PREMs focusing on the integration of care. Most PREMs related to short-term care episodes, largely in the hospital setting, though there are PREMs (eg, the CG-CAHPS 34 and health plan CAHPS 35 ) that examine patient experiences of care delivered over 6- to 12-month periods. By developing and utilizing PREMs that maintain a single-event, unidimensional focus of health care, we are inhibiting our ability to strive for international health care goals related to reducing health care fragmentation, and optimizing continuity, coordination, and quality within and between services. Consequently, future PREM development should aim to capture patient experiences of the continuity and coordination within and between their health care services and providers in order to mirror international shifts toward greater health care integration.

Encouragingly, nearly all PREM evaluation papers met ≥10 AXIS criteria (98.9 percent). Furthermore, all papers possessed appropriate study designs for their stated aims, and >95 percent of papers demonstrated appropriate sampling of participants to be representative of the target population under investigation. One PREM, 30 however, met only five out of 20 AXIS criteria, implying that this PREM should undergo further evaluative testing prior to use in patient experience evaluations. Generally, the results of the AXIS critical appraisal indicate that the study designs underpinning PREM validity and reliability testing were sound.

Unlike the recent systematic review of hospital‐related PREMs 16 where all instruments presented some form of validity and reliability testing, we identified two PREMs (CQI‐PC and GS‐PEQ) that did not present any testing in accordance with the revised COSMIN checklist. 36 , 37 This was either a consequence of not having done the testing, not presenting clear enough information to be scored “+” or “−,” or not having published this information in the peer‐reviewed literature. Evidently, both the CQI‐PC and GS‐PEQ instruments require further validity and reliability testing before being used in patient experience evaluations.

The most frequently undertaken reliability and validity criteria that also received positive results included internal consistency (a measure of reliability), structural validity, and content validity. This ultimately indicates that most PREMs measure the concept which they set out to measure, and do so consistently. Responsiveness—an instrument's ability to detect changes over time 19 —was not evident for >90 percent of PREMs. While some of the identified PREMs appear to have been developed for a once-off purpose, and thus exhibiting the ability to detect changes in patient experiences over time is not a property of significant importance, it was surprising to identify that responsiveness was not evident in most of the CAHPS suite of surveys. Most CAHPS surveys are employed annually on a nationwide scale, such as the HCAHPS, which has been used in this capacity since 2002 in US hospital reimbursement and benchmarking schemes. 38 , 39 However, only the CAHPS PCMH scored positively for responsiveness. The GPPS PREM also scored positively. This is the UK National General Practitioner Patient Survey, which has been undertaken annually since 2007. 40 It is important to note, though, that this information may be presented outside of the peer-reviewed literature, and consequently, what was captured in this systematic review may be an underrepresentation of all testing undertaken for these measures. The lack of testing for instrument responsiveness is consistent with previous systematic reviews, 16 , 22 both of which identified that responsiveness was not undertaken by any of the PREMs that they assessed. Evidently, testing responsiveness should be prioritized for instruments that are to be utilized on an annual or repeated basis.

The least prevalent property assessed using the COSMIN checklist was measurement error/agreement. Measurement error, in accordance with the revised COSMIN checklist, assesses whether the minimally important change (MIC) (the smallest measured change in participant experience scores that implies practical importance 41 ) is greater than or equal to the smallest detectable change (SDC) in participants' scores, or outside of the limits of agreement (LOA) (a technique used when comparing a new measuring technique to what is already practiced 42 ). Thus, in the clinical context, the MIC enables researchers to define a threshold of clinical relevancy. That is to say, a score above that threshold (as defined by the MIC) demonstrates that the intervention/program/service was clinically relevant and responsive to improving the patient experience. Given that the patient experience of health care is internationally recognized as a key determinant of health care quality, 32 , 43 and there is evidence to support the relationship between patient experience data and health care quality, 44 , 45 the clinical relevancy of improving patient experiences is likely to have implications for resource allocation and decision making in optimizing the quality of health care provided to patients. As such, assessing PREM measurement error/agreement should be undertaken, particularly in instances where PREM scores are being used to inform decision making and funding.
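For readers unfamiliar with these quantities, the expressions below give common formulations from the measurement literature (they are not quoted from the revised COSMIN checklist itself). Here SEM is the standard error of measurement, SD the sample standard deviation of scores, r a reliability coefficient such as the ICC, and d-bar and SD_d the mean and standard deviation of differences between repeated administrations.

```latex
\[
\mathrm{SEM} = \mathrm{SD}\,\sqrt{1 - r}, \qquad
\mathrm{SDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}, \qquad
\mathrm{LOA} = \bar{d} \pm 1.96\,\mathrm{SD}_{d}
\]
```

Under this framing, the measurement error criterion is judged favourably when the MIC is greater than or equal to the SDC, or when the MIC lies outside the limits of agreement.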

None of the PREMs were tested for all of the revised COSMIN checklist criteria. There are several reasons that this may be the case. For example, criterion validity was only undertaken in roughly 10 percent of the PREMs, as some authors recognized that there simply is no gold standard PREM available as a comparator in their given context. 46 , 47 Another reason could be inconsistencies in psychometric reporting guidelines and journal guidance regarding what constitutes adequate validity and reliability testing. A previous systematic review 48 examined the quality of survey reporting guidelines. The authors identified that there is generally a lack of validated reporting guidelines for survey instruments. Furthermore, the review highlighted that only a small portion of medical journals, where papers such as those included in this review may be published, provide guidance for the reporting of survey quality. 48 This indicates an area of research that warrants greater attention, as this is not just a limitation that impacts upon the quality of PREMs but one that affects a wide range of instruments.

4.1. Limitations

One major limitation of the current study was that grey sources of literature were not considered in the identification of PREMs. Consequently, we may have missed PREMs that would otherwise have fit the inclusion criteria. Furthermore, some PREMs were excluded because their supporting validity and reliability results had not yet been published. This was the case for the UK Renal Registry Patient‐Reported Experience Measure (UKRR‐PREM), whose developers had published the instrument 49 but were still in the process of publishing its psychometric evaluation at the time this review was undertaken. However, the purposeful selection of PREMs published in peer‐reviewed journals was intended to maximize the quality of the instruments evaluated.

A limitation of the AXIS appraisal tool is that a summative score cannot be derived to interpret the overall quality of the study being assessed 25 (ie, whether a study is deemed poor, moderate, or high quality). However, assessment of risk of bias imposed by a study design is standard practice in the appraisal of studies for systematic reviews. 50 For this study, PREMs were categorized into quartiles according to the proportion of AXIS criteria met, with full details of each PREM assessment provided in Appendix  S5 to enable readers to make an informed decision about PREMs that they may use in their own patient experience evaluations and research.

The revised COSMIN checklist also possessed some important limitations. Firstly, the revised version of the COSMIN checklist was used instead of the original checklist 51 as it was more user‐friendly given the large number of PREMs included in this systematic review. Secondly, the parameters of measurement for the validity and reliability tests comprising the checklist are very prescriptive. For example, the "structural validity" criterion stated that factors identified through exploratory factor analysis (EFA) had to explain at least 50 percent of the variance. 19 Yet other parameters, such as a significant Bartlett's test of sphericity ( P  < 0.05), the Kaiser‐Meyer‐Olkin (KMO) measure of sampling adequacy (acceptability typically regarded as >0.6), or factor loadings >0.4 (acceptable strength of loading on a factor), 52 , 53 can also be used to assess the quality of an EFA. This limited the authors' ability to assess the reliability and validity of instruments where tests other than those prescribed in the checklist were undertaken. Thirdly, the checklist fails to attribute rigor to the multidomain design of the included PREMs in measuring the same construct, which may positively influence how well a PREM captures the broad array of attributes of a patient‐reported experience. 54 Fourthly, the COSMIN fails to capture the importance of floor and ceiling effects, as well as the percentage of missing data. These were commonly reported statistics among the included PREMs and demonstrate: (a) the ability of the instrument to discern meaningful differences between patients reporting extremes of low and high experience scores; and (b) the burden and feasibility of completing the instrument. 55 Fifthly, the revised COSMIN checklist fails to provide a summative score indicating whether, overall, a PREM is or is not valid and reliable. Moreover, whether some tests of validity and reliability are more relevant or suitable than others to the overall validity and reliability of a PREM remains unknown, as does whether all tests ultimately need to be undertaken for a PREM to be labeled a valid and reliable measure. Thus, to assist the reader in making an informed choice in their PREM selection, Appendix S4 ranks the PREMs within their specified contexts according to the number of "+" scores obtained. Despite these limitations, the COSMIN checklist is currently the most comprehensive set of psychometric quality criteria for developing outcome measurement instruments and evaluating the method of development of these instruments. 56 Furthermore, the checklist has been applied in other similar systematic reviews 16 , 22 and was the most appropriate means of systematically measuring the psychometric rigor of the included PREMs.
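To make the alternative EFA adequacy parameters mentioned above concrete, the sketch below shows how they can be computed in Python with the third-party factor_analyzer package (an assumption on our part; it is not used or cited by the review). The file name, number of factors, and data are hypothetical; the thresholds (Bartlett P < 0.05, KMO > 0.6, loadings > 0.4, at least 50 percent variance explained) are those quoted in the text.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical item-level PREM responses: rows = respondents, columns = items.
df = pd.read_csv("prem_item_responses.csv")  # placeholder file name

# Sampling adequacy checks discussed in the text
chi_sq, p_value = calculate_bartlett_sphericity(df)   # want p < 0.05
kmo_per_item, kmo_overall = calculate_kmo(df)         # want overall KMO > 0.6

# Exploratory factor analysis; the number of factors is an analyst's choice
fa = FactorAnalyzer(n_factors=3, rotation="varimax")
fa.fit(df)

loadings = pd.DataFrame(fa.loadings_, index=df.columns)
_, _, cumulative_variance = fa.get_factor_variance()

print(f"Bartlett p = {p_value:.4f}, overall KMO = {kmo_overall:.2f}")
print(f"Cumulative variance explained: {cumulative_variance[-1]:.1%}")  # COSMIN-style: >= 50%
print("Items loading > 0.4 on some factor:", int((loadings.abs().max(axis=1) > 0.4).sum()))
```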

5. CONCLUSION

Patient‐reported experience measures are internationally recognized instruments for measuring the quality of health care services from the patient's perspective. The construct of patient‐reported experience appears to still be evolving, and though this systematic review identified PREMs across a range of contexts, PREMs remain largely designed to assess singular events of health care. The key messages of this systematic review are that while the testing of PREM validity and reliability has generally been undertaken in the context of appropriate study designs, there is large variability in both the number and type of validity and reliability tests undertaken for the PREMs identified. As such, it is important that PREM users are aware of the validity and reliability testing already undertaken for the PREM they have selected, and whether they themselves should undertake more robust testing. Further, the selection of PREMs for research and evaluation purposes should also consider other important selection criteria, such as whether a disease/condition- or setting-specific measure is more appropriate than a generic measure, and whether a PREM designed in the researcher's country is more appropriate than one designed in a different country, potentially with a different health care system in mind.

Supporting information

Acknowledgments.

Joint Acknowledgment/Disclosure Statement: All research was conducted at Griffith University, using University facilities and equipment. Two authors are employed by the University and two are undertaking PhD degrees. No other disclosures.

Bull C, Byrnes J, Hettiarachchi R, Downes M. A systematic review of the validity and reliability of patient‐reported experience measures. Health Serv Res. 2019;54:1023‐1035. doi:10.1111/1475-6773.13187


Systematic Review Article

Validity of student evaluation of teaching in higher education: a systematic review


  • 1 Department of Educational Foundations, University of Education, Winneba, Ghana
  • 2 Department of Education and Psychology, University of Cape Coast, Cape Coast, Ghana
  • 3 Department of Health, Physical Education and Recreation, University of Cape Coast, PMB, Cape Coast, Ghana
  • 4 Neurocognition and Action-Biomechanics-Research Group, Faculty of Psychology and Sports Science, Bielefeld University Postfach, Bielefeld, Germany

Introduction: Data obtained from students regarding the quality of teaching are used by higher education administrators to inform decisions concerning tenure, promotion, course development and instructional modifications, among others. This article provides a review regarding studies conducted to examine the validity of student evaluation of teaching, specifically focusing on the following objectives: (1) identify the context where studies have been conducted on student evaluation of teaching; (2) find out the methodologies usually employed for assessing the validity of student evaluation of teaching; and (3) establish the sources of measurement error in student evaluation of teaching.

Methods: The systematic review was conducted based on the PRISMA checklist. The databases searched include Scopus, Web of Science (WoS), Google Scholar, PubMed, MEDLINE, ERIC, JSTOR, PsycLIT, EconLit, APA PsycINFO and EBSCO using some specific keywords. After applying the four eligibility criteria, 15 papers were left to be analyzed.

Results: It was discovered that the generalizability theory approach was mostly used to understand the validity of student evaluation data. The review revealed that students were found at the centre of inconsistencies in the evaluation process.

Discussion: The general impression from the review is that the credibility and validity of teaching evaluation outcomes is questionable, considering the several sources of errors revealed. The study recommended closely studying these sources of errors (e.g., rating behaviours of students).

Introduction

Due to the intense competition existing among higher education institutions, universities today are undergoing a paradigm shift in students’ status ( Raza et al., 2010 ) such that students are now freely making decisions regarding the type of institution to attend, the programme to choose, and even the type of major courses to read, just like customers selecting their preferred commodities in a supermarket with several varieties of products available ( Raza and Khawaja, 2013 ). Given the significant role students play in the sustenance and running of higher education institutions, they are allowed to evaluate the quality of instruction and courses in their respective institutions. This exercise has become a common phenomenon in almost every university around the globe ( Rantanen, 2013 ).

Student evaluation of teaching is a fairly recent concept that has been introduced and utilised interchangeably with numerous expressions such as students' appraisal of teaching effectiveness ( Marsh, 2007 ), student appraisal of instructor performance ( Chuah and Hill, 2004 ), students' evaluation of educational quality (SEEQ) ( Lidice and Saglam, 2013 ), student course satisfaction ( Betoret, 2007 ), students' evaluation of instruction ( Clayson et al., 2006 ), or student course evaluation ( Chen, 2016 ; Duggan and Carlson-Bancroft, 2016 ). Notwithstanding the disparities in terminology, these concepts share a common underlying objective. Based on the definition and classification of fundamental higher education concepts outlined by the United Nations Educational, Scientific and Cultural Organization (UNESCO), students' appraisal of teaching has been explained as the process by which students assess the general teaching activities, the attitude of instructors, and course content relative to their learning experiences. It is important to emphasize that the critical features evaluated by students include the instructors' ability to explain issues to students, guide their learning, and drive class discussions. The instructors' competencies in unpacking course content and making courses relevant to students are also key in the assessment process (UNESCO, as cited in Vlăsceanu et al., 2004 ).

Arguably, student evaluation of courses and teaching influences tenure and promotion ( Kogan et al., 2010 ; Galbraith et al., 2012 ), students' university applications ( Alter and Reback, 2014 ), and students' choice of courses ( Wilhelm, 2004 ). In some advanced countries like the United States, students' evaluation data are used for official and unofficial ranking of institutions, and for auditing purposes ( Johnson, 2000 ). These uses of the data have sparked extensive scientific literature in areas such as psychology, education, economics and sociology ( Goos and Salomons, 2017 ). With this understanding, the central issue baffling researchers is the extent to which students' evaluation data can be understood as a pointer for examining the quality of teaching in higher education ( Taut and Rakoczy, 2016 ).

Validity theory and teaching evaluation

Validity in teaching evaluation is the soundness of the interpretation and uses of teaching evaluation results; it is a matter of degree and is all about bringing together pieces of evidence to support inferences made about the responses provided by the students ( Brookhart and Nitko, 2019 ). Thus, data from students' appraisal of teaching quality are highly valid when several pieces of evidence can be provided with regard to how the data were collected, the meaningfulness of the data, and how the data were used ( Brookhart and Nitko, 2019 ). For validity to be understood in this context, the following possibilities should be considered: (1) raters may not be accurate in their ratings, such that the scores given may not reflect the skills/abilities of the instructors or overall teaching quality; (2) the evaluation items may not be clear enough for raters to understand; and (3) students may rate characteristics of the course and instructor other than the actual psychological construct being rated. These issues, among others, are likely to influence the fairness of the evaluation exercise, and the results thereof may not be a true reflection of the construct being measured.

In practice, when students are requested to evaluate learning experiences, courses and teaching quality, there is a high likelihood of disagreement among them due to many systematic and unsystematic factors ( Eckes, 2015 ). The systematic factors include differences in rater behaviours, the difficulty of items, and the central tendency effect. The unsystematic factors, on the other hand, comprise variations in physical scoring or testing conditions, attention fluctuations of raters as a result of fatigue, errors in transcription, and several others ( Brennan, 2011 ). Whereas the systematic factors produce systematic errors, the unsystematic factors produce random errors ( Eckes, 2015 ). Random (unsystematic) errors can largely be dealt with through carefully planned assessment procedures; the case is different when systematic errors are present. Systematic errors, unlike random errors, can be readily identified in a data set. It is worth noting that random errors cancel out in large sample sizes and thus do not usually affect the validity of large data sets ( Eckes, 2015 ). Systematic errors, by contrast, create a pattern in the data set and distort the meaning derived from the data.
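A minimal simulation, with invented numbers, illustrates this contrast: zero-mean random error largely averages away across many raters, whereas a shared systematic effect (e.g., rater leniency) shifts every rating and persists in the aggregate.

```python
import numpy as np

rng = np.random.default_rng(42)
true_quality = 4.0      # hypothetical "true" teaching-quality score on a 1-5 scale
n_raters = 500

# Random (unsystematic) error: zero-mean noise that tends to cancel out on average
random_error = rng.normal(0.0, 0.8, n_raters)

# Systematic error: a constant leniency effect shared across raters
leniency_bias = 0.6

ratings_random_only = true_quality + random_error
ratings_with_bias = true_quality + random_error + leniency_bias

print(f"Mean with random error only : {ratings_random_only.mean():.2f}")  # close to 4.0
print(f"Mean with systematic bias   : {ratings_with_bias.mean():.2f}")    # shifted upward, ~4.6
```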

In most evaluation situations, the assignment of a score to depict the degree to which a particular trait is possessed by the object of measurement (i.e., lecturers) is largely based on the subjective judgement of the rater (i.e., the student). Rather than operating on collective grounds, student raters often differ considerably in deeply fixed, more or less individualised rating predispositions, thereby threatening the validity of the evaluation outcomes ( Eckes, 2015 ). Raters may also give similar scores to a particular lecturer on theoretically different criteria ( Barrett, 2005 ; Iramaneerat and Yudkowsky, 2007 ). Due to this, several raters (as in the case of students' appraisal of instruction) are, in most cases, utilised to cancel out random rater errors in the data obtained ( Houston and Myford, 2009 ). The intent of using numerous raters is to obtain an estimate of every lecturer's teaching quality that is independent of specific attributes of the raters. This is done by measuring the attributes of the raters and utilising such data to eliminate the faults of individual raters from the final score awarded to the lecturer ( Linacre, 1994 ).

Furthermore, raters usually carry out the teaching appraisal using items with rating scales, where points on the scale are required to signify consecutively higher degrees of performance on the construct (i.e., teaching). It is instructive to mention that most universities use instruments that rely on the analytic rating approach, where raters look out for specific features of the construct of interest and a score is assigned according to the extent to which the construct is present. With the holistic rating method, raters evaluate the whole performance and a single score is assigned to it ( Engelhard, 2011 ). With either scoring type, raters are required to differentiate between scale points and assign a rating that appropriately matches the performance; if this is not done properly, the validity of the assigned score is threatened ( Eckes, 2015 ). This can be a source of measurement error and, thereby, a threat to validity.

In reality, the variability in students' appraisal of courses and teaching arises from both distal (e.g., age, attitude, ethnicity, motivation) and proximal (e.g., item difficulty, rater severity, the structure of the rating scale) facets. Distal facets are those variables that may have a mediated or indirect influence on the ratings provided by students ( Eckes, 2015 ). Through a well-balanced and structured evaluation procedure, the negative effect of distal facets on the data can be minimised. Unlike distal facets, proximal facets have an immediate and direct effect on the rating scores awarded to the lecturers. Thus, the proximal facets, especially the person facet (i.e., lecturers in this study), play a significant role in understanding the validity dynamics of the evaluation data ( Eckes, 2015 ). In the teaching evaluation context, which is a form of rater-mediated assessment, the major proximal facets include instructors, raters, occasions, and rating items ( Brennan, 2011 ; Eckes, 2015 ).

Psychometric models for testing teaching evaluation outcomes

The measurement literature identifies three key measurement theories for the psychometric analysis of rater-mediated assessments, which include teaching evaluation by students. These are the Classical Measurement Theory (CMT), Generalizability Theory (GT), and Item Response Theory (IRT). The CMT stipulates that an actual/observed score rating ( X ) is a linear combination of a true score ( T ) and an error score ( E ) [ X  =  T  +  E ] ( Lord and Novick, 1968 ). Unlike the observed score, the true score and the error score components are unobserved, which requires some assumptions. In connection with teaching evaluation, any rating score provided by a rater/student is an observed score with two parts: the true score, which signifies the precise rating, and the error score, which denotes the imprecise component of the score. The CMT framework adopts statistical procedures such as regression, correlation and factor analysis with specific methodological designs (e.g., test-retest) to assess whether observed scores are devoid of measurement errors or not ( Feldt and Brennan, 1989 ). It is therefore common to see researchers obtain teaching evaluation ratings from students on two similar occasions and use correlation analysis to examine the consistency of the ratings. In this situation, a high correlation coefficient depicts a high degree of rater consistency and, consequently, little error in the ratings.
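A minimal sketch of this CMT-style test-retest logic is shown below, using simulated (entirely hypothetical) ratings rather than real evaluation data: each observed score is a true score plus occasion-specific error, and the correlation between the two occasions is read as an index of rating consistency.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_lecturers = 60

# CMT decomposition X = T + E: latent true scores plus occasion-specific error
true_scores = rng.normal(3.8, 0.5, n_lecturers)             # hypothetical true teaching quality
occasion_1 = true_scores + rng.normal(0, 0.3, n_lecturers)  # observed ratings, occasion 1
occasion_2 = true_scores + rng.normal(0, 0.3, n_lecturers)  # observed ratings, occasion 2

# A high test-retest correlation is interpreted as consistent (reliable) ratings
r, p = pearsonr(occasion_1, occasion_2)
print(f"Test-retest correlation r = {r:.2f} (p = {p:.3f})")
```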

The GT is a statistical theory and conceptual framework for assessing the dependability of a set of observed scores when generalizing across diverse universes of observations ( Cronbach et al., 1963 ; Shavelson and Webb, 1991 ). The GT, which has its foundation in the CMT, discards the idea of a single undifferentiated error of measurement and instead postulates that the error of measurement arises from multiple sources ( Brennan, 2011 ) (i.e., X  =  T  +  E 1  +  E 2 … +  E x ). These multiple sources of random measurement error (also known as facets), which inflate construct-irrelevant variance, may include raters (i.e., students in this context), rating items and occasions, and they can be estimated simultaneously within a single analysis (something the CMT cannot do) ( Eckes, 2015 ). Unlike the CMT, the GT has the capacity to evaluate the interaction between a number of facets and how such interaction(s) contribute to the variability in observed ratings. For example, how students in a particular class systematically rate lecturers' teaching on a particular item can be explored. Specifically, GT estimates the amount of each error source distinctly and offers an approach for improving the dependability of behavioural measurements. In a statistical sense, the GT combines CMT statistical procedures with the analysis of variance (ANOVA) ( Brennan, 2011 ).
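The sketch below illustrates the GT idea for the simplest fully crossed lecturer-by-rater design: variance components for the lecturer, rater, and residual facets are estimated from ANOVA mean squares and combined into a generalizability (G) coefficient. The design, data, and numbers are a textbook-style illustration under our own assumptions, not an analysis from any of the reviewed studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_p, n_r = 20, 30   # lecturers (objects of measurement) x student raters, fully crossed

# Simulated ratings = grand mean + lecturer effect + rater effect + residual (all invented)
X = (3.5
     + rng.normal(0, 0.7, (n_p, 1))      # lecturer (true-score) variance
     + rng.normal(0, 0.4, (1, n_r))      # rater main effect (e.g., severity/leniency)
     + rng.normal(0, 0.9, (n_p, n_r)))   # interaction/residual error

grand = X.mean()
ss_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum()
ss_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum()
ss_res = ((X - grand) ** 2).sum() - ss_p - ss_r
ms_p, ms_r, ms_res = ss_p / (n_p - 1), ss_r / (n_r - 1), ss_res / ((n_p - 1) * (n_r - 1))

# Variance component estimates for a random-effects p x r design
var_res = ms_res
var_p = max((ms_p - ms_res) / n_r, 0.0)
var_r = max((ms_r - ms_res) / n_p, 0.0)

# Relative G coefficient: lecturer variance over lecturer variance plus relative error
g_coef = var_p / (var_p + var_res / n_r)
print(f"var(lecturer)={var_p:.2f}, var(rater)={var_r:.2f}, var(residual)={var_res:.2f}")
print(f"G coefficient (relative, {n_r} raters) = {g_coef:.2f}")
```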

The IRT is a family of psychometric models that provide information about the characteristics of items on an instrument, the persons responding to these items, and the underlying trait being measured ( Yang and Kao, 2014 ). Within the context of teaching evaluation, the Many-Facet Rasch Modelling (MFRM), an extended form of the Rasch model within the IRT family, is a more appropriate approach to testing the dependability of students' responses ( Rasch, 1980 ; Linacre, 1989 ). The MFRM allows the evaluation of the characteristics of individual raters and items and of how these raters influence the process of rating ( McNamara and Knoch, 2012 ). The MFRM also evaluates the influence of other sources of non-random error, such as unreliable raters, inconsistency in ratings across occasions, and inconsistencies in the comparative difficulty of the items ( Linacre, 1994 ). For instance, the MFRM can indicate whether a single rater tends to systematically score some category of individuals differently than others, or whether a particular group of individuals performed systematically differently on a specific item than they did on others ( Linacre, 2003 ).
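For reference, the rating-scale form of the MFRM is commonly written as shown below; the notation is assumed here for the teaching-evaluation setting and is not reproduced from the studies under review.

```latex
% Many-Facet Rasch Model (rating-scale form): the log-odds of student rater j
% awarding lecturer n a rating in category k rather than k-1 on item i
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
% \theta_n : teaching quality of lecturer n (object of measurement)
% \delta_i : difficulty (endorsability) of evaluation item i
% \alpha_j : severity of student rater j
% \tau_k  : threshold of rating-scale category k relative to category k-1
```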

Compared to the CMT, the GT and MFRM have both been found in the literature to be more appropriate for assessing the dependability of observed data by estimating the true score and error score components, especially in rater-mediated assessment or evaluation ( Linacre, 2003 ). The utilisation of these two modelling approaches permits the identification of the various sources of measurement error associated with the observed data. The GT and MFRM procedures also compute reliability indices, which provide an idea of the dependability of the observed data ( Brennan, 2011 ). Moreover, while CMT and GT view the data principally from a group-level viewpoint, separating the sources of measurement error and calculating their extent, an MFRM analysis largely concentrates on individual-level information and, therefore, supports a detailed examination of the functioning, or behaviour, of every individual element of the facets being considered ( Linacre, 2001 ).

Rationale for the review

The validity of student evaluation of courses and teaching is a contentious issue ( Hornstein, 2017 ). While there is conceptual, theoretical and empirical support for the validity of students' appraisal of teaching, such data have been critiqued for several reasons ( Spooren et al., 2013 ). In particular, students have been found to evaluate instructional quality based on course characteristics (e.g., the difficult/easy nature of courses), student characteristics (e.g., students being friends with the instructor or disliking the course) and teacher characteristics (e.g., the strictness of the instructor), which are unrelated to the quality of teaching and course content ( Berk, 2005 ; Isely and Singh, 2005 ; Ko et al., 2013 ). Such contamination of the evaluation process has several implications for the quality of teaching and learning, instructor growth, tenure and promotion ( Ewing, 2012 ; Galbraith et al., 2012 ).

The uncertainties surrounding the validity of student evaluation of teaching are partly due to the diverse methodologies employed by researchers in the field. The measurement literature offers three major approaches (i.e., the CMT, GT and MFRM) that have been used for investigating the dependability of student evaluation of teaching. The question of which methodology provides the most comprehensive and accurate results in performance-mediated assessment has been extensively discussed in the measurement literature. In addition to what has been discussed earlier, two conclusions have been drawn: (1) GT is preferred to CMT because the weaknesses of CMT have been curtailed by GT (see Cronbach et al., 1972 ; Shavelson and Webb, 1991 ; Brennan, 2001b ); (2) MFRM, as compared to GT, offers more knowledge about the data, but combining both approaches offers an excellent picture of the rating process and supports the current idea of validity evidence ( Linacre, 2003 ; Kim and Wilson, 2009 ; Brennan, 2011 ; Lee and Cha, 2016 ).

This review aims to offer a systematic account of the literature on the dependability of student evaluation of teaching in higher education. Similar reviews have been conducted on the usefulness and validity of student appraisal of teaching ( Costin et al., 1971 ; Wachtel, 1998 ; Onwuegbuzie et al., 2006 ; Spooren et al., 2013 ); however, these studies were narrative and scoping in nature and largely focused on distal factors (e.g., gender, age, attitude, ethnicity, motivation), rather than the proximal factors (e.g., item difficulty, rater severity, rating scale functioning) 1 that are the focus of this review. The recent review conducted in 2013 by Spooren et al., for example, discussed validity issues in questionnaire design, dimensionality, online evaluation, teacher characteristics, and particularly, content and construct validity. In fact, none of the earlier reviews studied the methodologies, the sources of variability in terms of proximal factors, and the context in which studies have been carried out. These dimensions are necessary ingredients for understanding the teaching evaluation landscape in higher education and for appropriately driving professional practice, policy formulation and implementation. Unlike previous reviews, this study conducted a systematic review to achieve the following objectives: (1) to identify the context where studies have been conducted on student evaluation of teaching; (2) to find out the methodologies usually employed for assessing the validity of student evaluation of teaching; and (3) to understand the sources of measurement error in student evaluation of teaching.

This paper is significant for several reasons. First, the outcome of this review should enlighten administrators of higher education institutions about the sources of measurement error and the extent of dependability of student evaluation data. This information will help administrators understand how teaching evaluation can be carried out in a way that minimises error and, at the same time, the extent to which the outcome of the evaluation can be utilised. Secondly, this review should inform the direction of further studies in terms of (a) the methods future researchers should adopt in their investigations; (b) which study settings need more attention in terms of research; and (c) which specific factors should be studied closely to understand the variability in student evaluation of teaching.

Research protocol

This systematic review was conducted based on the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 extended checklist ( Page et al., 2021 ). All protocols followed in this research were guided by PRISMA.

Search plan and information sources

The databases searched included Scopus, Web of Science (WoS), Google Scholar, PubMed, MEDLINE, ERIC, JSTOR, PsycLIT, EconLit, APA PsycINFO and EBSCO. The search was done by three independent researchers and carried out between 15 and 26 December 2022. The search was conducted by combining each of three keywords (i.e., "validity," "reliability," and "variability") with each of four phrases (i.e., "student evaluation of teaching," "teaching quality appraisal," "student appraisal," and "teaching evaluation"). The Boolean operator "and" was used for combining the keywords and phrases. A language filter was applied to restrict the search to manuscripts written in the English language. After the initial search, 293 papers were retrieved. No year restrictions were used in deciding whether a paper was eligible. Duplicates were detected with the Zotero tool, and some further duplicates were deleted manually. In all, 41 duplicates were deleted.
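As an illustration of the search strategy just described, the short sketch below enumerates the 3 × 4 = 12 keyword-phrase combinations joined by the Boolean operator; the exact query syntax of each database is assumed rather than reported.

```python
from itertools import product

keywords = ["validity", "reliability", "variability"]
phrases = [
    "student evaluation of teaching",
    "teaching quality appraisal",
    "student appraisal",
    "teaching evaluation",
]

# Each keyword is combined with each phrase via the Boolean operator AND,
# yielding the 3 x 4 = 12 search strings described in the text.
queries = [f'"{kw}" AND "{ph}"' for kw, ph in product(keywords, phrases)]
for q in queries:
    print(q)
```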

Screening procedure

Two hundred and fifty-two (252) papers were independently screened by three researchers with the following educational backgrounds and expertise: psychology, programme evaluation, measurement in education and psychology, psychometrics, and research methodology. First, the papers were screened by critically considering the titles and abstracts, focusing on quantitative studies. After this phase, 113 papers were excluded and the remaining 139 papers were further screened for eligibility. The following criteria were applied in determining the eligibility of papers for the analysis:

1. Articles which investigated the validity of student evaluation of teaching using a statistical (quantitative) approach under CMT, GT and IRT, and not based on the opinions of students regarding the quality of such data, were included. We focused on quantitative studies because they are the only means by which inconsistent rating behaviours can be directly observed. Qualitative studies can only capture the opinions of students, lecturers, and other stakeholders concerning the validity of the outcome of teaching evaluation. Whereas these opinions may be useful, such views may not be objective but rather based on the intersubjective experiences of the respondents;

2. Studies which were conducted on the dependability of student appraisal of teaching in higher education. The study focused on higher education because there has not been consensus on the measurement of the quality of teaching at the pre-tertiary level of education ( Chetty et al., 2014 );

3. Studies focusing on how the distal factors contribute to the variations in student appraisal of teaching were also excluded. For example, articles that conducted factor analysis or questionnaire validation were excluded for two reasons: (1) several reviews have been conducted on the development and validation of appraisal questionnaires; (2) almost every higher education institution validates the questionnaire for the evaluation exercise. This was supported in the studies which were analyzed for this paper; all the studies used a well-validated instrument for data collection.

During the last screening phase, which is the application of the eligibility criteria, a more detailed reading was done by all four researchers. In the end, 124 out of 139 papers failed to meet the eligibility criteria, leading to a final sample of 15 papers that were analyzed and synthesized for the study (see Figure 1 ).

Figure 1. PRISMA flow chart.

Data analysis plan

Data were extracted from the 15 papers by coding the information into the following themes: (1) author(s) and publication year, (2) country where the study was conducted, (3) statistical approach adopted by the author(s), (4) the proximal factors studied, and (5) the key idea from the research (see Table 1 ). To check the appropriateness of the coding, all four investigators were involved in the process, following the recommendations of González-Valero et al. (2019) . Discrepancies were resolved among the investigators but, most importantly, the themes identified by the investigators were consistent. To ensure sufficient reliability, inter-rater agreement was calculated using the Fleiss' Kappa (Fk) statistical index. A coefficient of 0.77 was achieved for the information extraction and selection, which indicated adequate agreement ( Fleiss, 1971 ).
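As an illustration of how such an agreement coefficient is obtained, the sketch below applies the Fleiss' kappa implementation in the statsmodels Python package to a small set of invented coding decisions (four coders assigning theme codes to extracted items); it does not reproduce the authors' actual data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical coding decisions: rows = items of extracted information,
# columns = the four investigators, values = assigned theme/category codes.
codes = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 1],
    [3, 3, 3, 3],
    [1, 1, 2, 1],
    [2, 2, 2, 2],
    [3, 1, 3, 3],
])

# Convert raw assignments into a subjects-by-categories count table, then compute kappa
counts, _ = aggregate_raters(codes)
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa = {kappa:.2f}")
```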

Table 1. Summary of studies on student evaluation of teaching in higher education.

The context of the studies

The review showed that research on the sources of variability and the validity of student evaluation of teaching and courses has been conducted largely in North America and Asia. All the studies on the North American continent were conducted in the USA, specifically in New Mexico ( VanLeeuwen et al., 1999 ), Washington ( Gillmore et al., 1978 ), Illinois ( Kane et al., 1976 ), Massachusetts ( Mazor et al., 1999 ), and California ( Marsh and Overall, 1980 ). All of these US studies were carried out before the year 2000, which perhaps indicates a declining interest in the issue of the sources of variability in students' appraisal of teaching. The trend of studies in Asia differed from that in the USA (see Table 2 ). Studies investigating the sources of variability in student ratings of instructors in Asia began relatively recently, when Ibrahim (2011) examined the issue in the Sultanate of Oman using the GT approach. Since then, studies have been conducted by Samian and Noor (2012) in Malaysia, Börkan (2017) and Üstünlüoğlu and Can (2012) , both in Turkey, and Li et al. (2018) in China. A few of the studies were conducted in Europe ( n  = 3, 20%) ( Rindermann and Schofield, 2001 ; Spooren et al., 2014 ; Feistauer and Richter, 2016 ) and in Africa ( n  = 2, 13.3%) ( Quansah, 2020 , 2022 ). Although few studies were carried out in Europe and Africa, most of these were quite recent compared with those carried out in America.

Table 2. The context of previous studies.

Methodology utilized

The review revealed that all three measurement theories (i.e., CMT, GT, and MFRM) have been utilized to investigate the dependability of student ratings of lecturers' teaching and courses. However, the GT approach was found to be the most popular. Of the 15 empirical studies which met the criteria for this review, 9 (60%) used the GT approach (see Table 1 ). Comparatively, the use of the GT approach appears to be gaining ground and has come to dominate the higher education quality assurance literature in recent times (see Figure 2 ). A further four studies (26.7%) used the CMT and only two (13.3%) adopted the MFRM approach (see Figure 2 ) ( Börkan, 2017 ; Quansah, 2022 ).

Figure 2. Methods used in previous studies over time.

Main sources of measurement errors

The review revealed the following main sources of variability in student rating exercises: the student, the evaluation item/scale, the occasion, the teacher, the class, and the course. For all nine studies which adopted GT, the teacher, class or course was used as the object of measurement. In addition, the teacher facet was treated as the object of measurement in the two studies that adopted the MFRM. The findings on the specific sources of variability are presented below.

Student (rater)

It was found that all the studies reviewed included the student (i.e., rater) as a source of variability in teaching evaluation ratings (see Table 1 ). However, one of the studies ( Spooren et al., 2014 ) did not treat the student facet as a source of error in the ratings because the authors considered student variability to be desirable, since it represents disparities among individual students in their quality ratings of a course. Despite this argument ( Spooren et al., 2014 ), it was revealed that the students were inconsistent in how they responded to the items. The studies which considered students as a source of measurement error had three common findings: (1) the student was the main-effect variable with the largest contribution to the variability in the ratings; (2) the analyses showed a low level of validity of the responses provided by the students in rating instructors/courses; and (3) increasing the number of students who participated in the evaluation of each instructor was particularly useful for improving the validity of student ratings.
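The third finding can be illustrated with a small decision-study-style projection using the same G-coefficient form as in a crossed lecturer-by-rater design: with invented variance components, dependability rises steadily as more student raters are averaged per instructor.

```python
# Decision-study-style projection (variance components are invented for illustration):
# G coefficient as a function of the number of student raters per instructor.
var_lecturer = 0.30    # hypothetical true-score (lecturer) variance
var_residual = 0.90    # hypothetical rater-by-lecturer/residual error variance

for n_raters in (5, 10, 20, 50, 100):
    g = var_lecturer / (var_lecturer + var_residual / n_raters)
    print(f"{n_raters:>3} raters -> G = {g:.2f}")
```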

In a more detailed analysis using the MFRM, two studies ( Börkan, 2017 ; Quansah, 2022 ) revealed that although some students provided accurate responses during the teaching evaluation exercise, quite a number of them were inconsistent in rating their instructors. A wide range of other rating behaviours was reported in these two studies: (1) some students failed to discriminate between levels of teaching quality, course quality or instructor performance, indicating that these students provided similar ratings for completely different situations (i.e., lecturers with different teaching ability); (2) the majority of the students were influenced by (less salient) factors which are unrelated to the targeted teaching behaviours they were required to assess. Take, for example, a lecturer who is punctual to class yet has weak subject-matter and pedagogical knowledge; many students provided excellent ratings, influenced by the lecturer's punctuality even when punctuality was not being assessed; (3) the majority of the students were lenient in their ratings, providing high ratings for undeserving instructors or situations.

Evaluation items/scale

Nine (60%) of the 15 studies considered the item as a major source of variance in student ratings. Of these nine studies, seven found that the item had the smallest variance contribution, indicating that the evaluation items contributed very little measurement error to the ratings (see Kane et al., 1976 ; Gillmore et al., 1978 ; Mazor et al., 1999 ; VanLeeuwen et al., 1999 ; Ibrahim, 2011 ; Spooren et al., 2014 ; Quansah, 2020 ). The remaining two studies ( Börkan, 2017 ; Quansah, 2022 ) went beyond item-variance properties to examine scale functioning using the MFRM. These authors revealed poor scale functioning, with most responses clustered around the highest two categories of a 5-point scale, coupled with a lack of clarity of the response categories. Moreover, some of the items on the evaluation instruments were identified as unclear, redundant, or unable to measure the targeted trait ( Börkan, 2017 ; Quansah, 2022 ).

Occasion

Two of the studies considered the occasion as a source of variability ( Spooren et al., 2014 ; Quansah, 2020 ); both adopted the GT methodological approach. Findings from both studies revealed that occasion did not contribute significantly to measurement error during the ratings. It was further revealed that one-time evaluation data offered a greater advantage in terms of the precision of student responses than obtaining evaluation data for a lecturer on multiple occasions from the same group of students. However, one of the studies (i.e., Quansah, 2020 ) indicated that collecting the evaluation data in the middle of the semester yielded more accurate responses from the students than waiting until the end of the semester.

Instructor/class/course

Few studies considered instructor, class and course type as sources of measurement error ( Gillmore et al., 1978 ; Feistauer and Richter, 2016 ; Börkan, 2017 ; Quansah, 2022 ). These studies found that class and course type contributed very little to the variance in students' ratings. For instructors, two of the studies revealed that the instructors did not receive accurate and precise ratings from the students; in most cases, the instructors received higher ratings than expected.

Credibility of students’ evaluation data

Generally, the majority of the studies demonstrated that the validity of student evaluation data was low. Whereas the studies which utilized the GT and MFRM approaches showed a low level of validity of student ratings of instructors, the majority of the studies which adopted the CMT methodology found a high level of validity of the data; the exception was Marsh and Overall (1980) , who found a lack of consistency in the ratings by students at a university in Los Angeles. Studies by Rindermann and Schofield (2001) , Samian and Noor (2012) , and Üstünlüoğlu and Can (2012) revealed a high level of validity of student ratings. What is common to all the studies that adopted the CMT is that they employed the criterion validity approach, attempting to corroborate statistical results from one occasion to another through correlational analysis. These researchers appeared to focus on the stability of traits or the triangulation of evaluations.

Discussion

The purpose of this review was to analyse and synthesize existing studies on the validity of students' evaluation of teaching across the globe. The review attempted to understand the scope and context of research on the subject, the methodologies adopted in the available evidence, and the factors that account for errors in student responses during teaching evaluation. The outcome of the review showed that, over more than five decades, the available literature on the validity of teaching evaluation data has remained scant, with the majority of the studies conducted in Asia and North America while a few have been conducted in Europe and Africa. Interestingly, the recent interest in the subject area was found among researchers in Europe; meanwhile, scholars in North America have not conducted any research in the area since 1999. This finding provides insight into how researchers on different continents are directing their research focus to the area. The geographical distribution of studies and the scant available research can be tied to the complexity of studying the validity of teaching evaluation in higher education, given the highly statistical approach required ( Ashaari et al., 2011 ; Rosli et al., 2017 ).

The review showed that the majority of the studies adopted the GT approach while just a few utilised the MFRM. The popularity of GT can be explained by two factors. First, GT was introduced to address some weaknesses of the oldest measurement theory (i.e., CMT) in its use for performance-mediated assessments ( Shavelson and Webb, 1991 ). This led to the switch from CMT to GT, even though the transition took some time due to the complexities involved in using GT, such as developing syntaxes for running the analysis ( Brennan, 2001b ). Thus, studies adopting the GT have grown steadily from 1976 to 2020, which also explains the decreased use of CMT in recent times. Secondly, several software packages and syntaxes are available today, with accompanying instructions or guidelines, which have made the adoption of GT less difficult. Some of this software and these syntaxes include the GENOVA package ( Brennan, 2001a ), ETUDGEN ( Cardinet et al., 2010 ), MATLAB, SAS, SPSS ( Mushquash and O'Connor, 2006 ), EduG ( Cardinet et al., 2010 ), G-String, LISREL ( Teker et al., 2015 ), and R ( Hucbner and Lucht, 2019 ), among others. The availability of such statistical software and syntaxes has helped make the use of GT more popular. Concerning the MFRM, only two studies adopted this approach. The low adoption of MFRM analysis in the higher education quality assurance literature can be attributed to the fact that most scholars are not aware of the procedure, coupled with the complexity of the method, which requires a high level of expertise, especially when many facets are involved. There are also relatively few software packages or syntaxes that can perform MFRM analysis; the two known applications are FACETS and R ( Lunz et al., 1990 ). The programming nature of this software and, perhaps, the limited guidelines for its use may have discouraged researchers who have little background in measurement but are scholars in quality assurance.

A more significant aspect of the findings was that the student (i.e., the rater) contributed the largest amount of error to the teaching evaluation data. It was found that higher education students are inconsistent in their ratings, with most of them failing to provide ratings that discriminate across the varying levels of instructor performance. Most students were also influenced by (less salient) factors which are unrelated to the targeted construct being measured. Central to this finding is the fact that higher education students are not usually trained to respond to teaching evaluations ( Eckes, 2015 ). In many institutions, students are offered a brief orientation about what the teaching evaluation is about without proper training ( Dzakadzie and Quansah, 2023 ). It is not surprising that some students showed a lack of understanding of the response options on the evaluation instrument, even though these instruments had excellent psychometric properties. Higher education administrators should organize regular training programmes for students on how to rate accurately in order to reduce errors of measurement (such as halo effects, inconsistent rating, and inability to use the rating scales) during the teaching appraisal exercise. The training should include what behaviours students should look out for when appraising.

Further analysis from the review showed that a greater proportion of tertiary students are lenient in their ratings. Although the reasons for their leniency were not explored in the various studies, some factors are obvious considering the framework of teaching evaluation. A key concern is a negative critical culture in which students fear retaliation by the lecturer when they provide poor teaching evaluations. In such instances, students may feel reluctant to share honest opinions about the teaching activities and services they receive from the institution ( Adams and Umbach, 2012 ). This negative atmosphere can be worsened when the anonymity and confidentiality of student responses cannot be assured. This situation might have contributed to the finding that the instructors received very high evaluation scores. An interesting perspective on this issue is that several pieces of research have confirmed that grade inflation is positively associated with teaching evaluation outcomes ( Eiszler, 2002 ; Stroebe, 2020 ; Berezvai et al., 2021 ; Park and Cho, 2023 ). The takeaway from these studies is that some professors exchange lenient grading for excellent evaluation results. This may explain why one of the studies included in the systematic review showed that teaching evaluation conducted in the middle of the semester (before any assessment) had higher reliability than evaluation performed at the end of the semester (i.e., when some assessments had been conducted). While higher education institutions are encouraged to uphold anonymity and confidentiality, students should be oriented on why they need to provide honest responses without fear of reprisal. It is also suggested that higher education administrators orient professors/instructors on the benefits associated with accurate evaluation data. Additionally, teaching evaluation should be organised by the authorised department, preferably before the end of the semester. Other strategies can be explored to decouple students' grades from their evaluation responses.

The general impression across the available studies from different continents is that the credibility and validity of teaching evaluation outcomes are questionable, considering the several sources of error revealed. The majority of the evidence from the empirical papers reviewed offers little support for administrators relying on teaching evaluation results for critical decisions concerning instructors and policy. The outcome of the review calls for a close partnership among students, professors/instructors and the management of higher education institutions to ensure that the reliability of such data is improved. With a clear framework of responsibilities for all these parties and stakeholders, much progress can be made, considering the implications of the evaluation results for all parties. Despite the relevance of this review, it only included quantitative studies (and excluded studies which examined the opinions of stakeholders regarding the validity of teaching evaluation results). We therefore encourage future researchers to conduct a systematic review of the qualitative studies on the subject.

Conclusions and future research direction

The results from the review lead the researchers to the general conclusion that not much has been done in exploring the sources of variation in, and the validity of, student ratings of teachers/courses in institutions of higher education. This situation calls for more empirical research work to be conducted in the area as a matter of urgency. The limited studies available, however, converge on the conclusion that the validity of students' responses to the appraisal of teaching in higher education is still in doubt. Students are found at the centre of these inconsistencies in the rating process, due to several factors that could not be disentangled from the students themselves. This paper, therefore, serves as a prompt to researchers to conduct more studies in the area.

We recommend that further studies adopt the MFRM methodological framework, or possibly blend these procedures, to continue the discussion on the validity of evaluation responses by students. It is essential to note that merging these procedures (especially GT and MFRM) supports recent developments in validity theory, which recommend that multiple sources of validity evidence be gathered, assessed, and combined into a validity argument to support score interpretation and utilisation ( Kane, 2012 ; Fan and Bond, 2016 ). Since students played a pivotal role in understanding the variations in student ratings of teachers/courses in higher education, we recommend that further research closely study the behaviours of students during the evaluation exercise. This investigation should be extended to examining process data by observing the behavioural patterns of the students in the rating process through log files.

Although few studies examined the scale functioning quality of the evaluation forms used, this serves as a prompt for future studies to take a close look at the scale functioning of the evaluation instrument. It must be mentioned that most validation procedures for evaluation forms do not include scale category functioning. Future research should include the item as a source of measurement error for further careful examination, including scale category quality. Beyond Asia, future research should also be conducted on other continents such as Africa, Europe, South America, Antarctica, and Australia. This is essential to provide administrators of higher education with information regarding the utility of teaching evaluation data obtained from students. It is also needed to help understand the sources of measurement error, particularly proximal factors, and the validity of student ratings of teachers/courses in higher education around the globe.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

FQ: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. AC: Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. KA-G: Investigation, Methodology, Validation, Visualization, Writing – review & editing. JH: Funding acquisition, Investigation, Methodology, Validation, Visualization, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The study received funding from the School of Graduate Studies and Graduate Students Association (GRASSAG) of the University of Cape Coast, through the Samuel and Emelia Brew-Butler/SGS/GRASSAG-UCC Research Grant. The authors sincerely thank Bielefeld University, Germany, for providing financial support through the Open Access Publication Fund for the article processing charge.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. ^ Distal facets are those variables that may have a mediated or indirect influence on the ratings provided by students ( Eckes, 2015 ). Unlike distal facets, proximal facets have an immediate and direct effect on the rating scores awarded to the lecturers. Thus, the proximal facets, especially the person facet, play a significant role in understanding the validity dynamics of the evaluation data ( Eckes, 2015 ).

Note: Those sources marked with an asterisk were the studies included in the review.


Adams, M. J., and Umbach, P. D. (2012). Nonresponse and online student evaluations of teaching: understanding the influence of salience, fatigue, and academic environments. Res. High. Educ. 53, 576–591. doi: 10.1007/s11162-011-9240-5


Alter, M., and Reback, R. (2014). True for your school? How changing reputations alter demand for selective U.S. colleges. Educ. Eval. Policy Anal. 36, 346–370. doi: 10.3102/0162373713517934

Ashaari, N. S., Judi, H. M., Mohamed, H., and Wook, M. T. (2011). Student’s attitude towards statistics course. Procedia Soc. Behav. Sci. 18, 287–294. doi: 10.1016/j.sbspro.2011.05.041

Barrett, S. (2005). “Raters and examinations” in Applied Rasch measurement: a book of exemplars . eds. S. Alagumalai, D. C. Curtis, and N. Hungi (Dordrecht: Springer), 159–177.

Berezvai, Z., Lukáts, G. D., and Molontay, R. (2021). Can professors buy better evaluation with lenient grading? The effect of grade inflation on student evaluation of teaching. Assess. Eval. High. Educ. 46, 793–808. doi: 10.1080/02602938.2020.1821866

Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. Int. J. Teach. Learn. High. Educ. 17, 48–62.

Betoret, F. D. (2007). The influence of students’ and teachers’ thinking styles on student course satisfaction and on their learning process. Educ. Psychol. 27, 219–234. doi: 10.1080/01443410601066701

*Börkan, B. (2017). Exploring variability sources in student evaluation of teaching via many-facet Rasch model. J. Meas. Eval. Educ. Psychol. 8, 15–33. doi: 10.21031/epod.298462

Brennan, R. L. (2001a). “Manual for urGENOVA version 2.1” in Iowa testing programs occasional paper number 49 (Iowa City, IA: Iowa Testing Programs, University of Iowa).

Brennan, R. L. (2001b). Generalizability theory . New York: Springer-Verlag.

Brennan, R. L. (2011). Generalizability theory and classical test theory. Appl. Meas. Educ. 24, 1–21. doi: 10.1080/08957347.2011.532417

Brookhart, S. M., and Nitko, A. J. (2019). Educational assessment of students . Upper Saddle River, NJ: Pearson.

Cardinet, J., Johnson, S., and Pini, G. (2010). Applying generalizability theory using EduG . New York, NY: Routledge/Taylor and Francis Group.

Chen, L. (2016). Do student characteristics affect course evaluation completion?. Annual Conference of the Association for Institutional Research. New Orleans, LA.

Chetty, R., Friedman, J. N., and Rockoff, J. E. (2014). Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. Am. Econ. Rev. 104, 2593–2632. doi: 10.1257/aer.104.9.2593

Chuah, K. L., and Hill, C. (2004). Student evaluation of teacher performance: random pre-destination. J. Coll. Teach. Learn. 1, 109–114. doi: 10.19030/tlc.v1i6.1961

Clayson, D. E., Frost, T. F., and Sheffet, M. J. (2006). Grades and the student evaluation of instruction: a test of the reciprocity effect. Acad. Manage. Learn. Educ. 5, 52–65. doi: 10.5465/amle.2006.20388384

Costin, F., Greenough, W. T., and Menges, R. J. (1971). Student ratings of college teaching: reliability, validity, and usefulness. Rev. Educ. Res. 41, 511–535. doi: 10.3102/00346543041005511

Cronbach, L. J., Gleser, G. C., Nanda, H., and Rajaratnam, N. (1972). The dependability of behavioural measurements: theory of generalizability for scores and profiles . New York, NY: Wiley.

Cronbach, L. J., Rajaratnam, N., and Gleser, G. C. (1963). Theory of generalizability: a liberalization of reliability theory. Br. J. Stat. Psychol. 16, 137–163. doi: 10.1111/j.2044-8317.1963.tb00206.x

Duggan, M., and Carlson-Bancroft, A. (2016). How Emerson College increased participation rates in course evaluations and NSSE. Annual Conference of the Association for Institutional Research. New Orleans, LA.

Dzakadzie, Y., and Quansah, F. (2023). Modelling unit non-response and validity of online teaching evaluation in higher education using generalizability theory approach. Front. Psychol. 14:1202896. doi: 10.3389/fpsyg.2023.1202896


Eckes, T. (2015). Introduction to many-facet Rasch measurement: analysing and evaluating rater-mediated assessment (2). Frankfurt: Peter Lang GmbH.

Eiszler, C. F. (2002). College students’ evaluations of teaching and grade inflation. Res. High. Educ. 43, 483–501. doi: 10.1023/A:1015579817194

Engelhard, G. (2011). Evaluating the bookmark judgments of standard-setting panellists. Educ. Psychol. Meas. 71, 909–924. doi: 10.1177/0013164410395934

Ewing, A. M. (2012). Estimating the impact of relative expected grade on student evaluations of teachers. Econ. Educ. Rev. 31, 141–154. doi: 10.1016/j.econedurev.2011.10.002

Fan, J., and Bond, T. (2016). Using MFRM and SEM in the validation of analytic rating scales of an English speaking assessment. In Q. Zhang (Ed.). Conference Proceedings for Pacific Rim Objective Measurement Symposium (PROMS) 29–50.

*Feistauer, D., and Richter, T. (2016). How reliable are students’ evaluations of teaching quality? A variance components approach. Assess. Eval. High. Educ. , 10, 1–17. doi: 10.1080/02602938.2016.1261083

Feldt, L. S., and Brennan, R. L. (1989). “Reliability” in Educational measurement . ed. R. L. Linn. 3rd ed (New York: American Council on Education and MacMillan), 105–146.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382. doi: 10.1037/h0031619

Galbraith, C. S., Merrill, G. B., and Kline, D. M. (2012). Are student evaluations of teaching effectiveness valid for measuring student learning outcomes in business-related classes? A neural network and Bayesian analyses. Res. High. Educ. 53, 353–374. doi: 10.1007/s11162-011-9229-0

*Gillmore, G. M., Kane, M. T., and Naccarato, R. W. (1978). The generalizability of student ratings of instruction: estimation of the teacher and course components. J. Educ. Meas. , 15, 1–13, doi: 10.1111/j.1745-3984.1978.tb00051.x

González-Valero, G., Zurita-Ortega, F., Ubago-Jiménez, J. L., and Puertas-Molero, P. (2019). Use of meditation and cognitive behavioral therapies for the treatment of stress, depression and anxiety in students. A systematic review and meta-analysis. Int. J. Environ. Res. Public Health 16, 1–23. doi: 10.3390/ijerph16224394

Goos, M., and Salomons, A. (2017). Measuring teaching quality in higher education: assessing selection bias in course evaluations. Res. High. Educ. 58, 341–364. doi: 10.1007/s11162-016-9429-8

Hornstein, H. A. (2017). Student evaluations of teaching are inadequate assessment tool for evaluating faculty performance. Cogent Educ. 4, 13–42. doi: 10.1080/2331186X.2017.1304016

Houston, J. E., and Myford, C. M. (2009). Judges’ perception of candidates’ organization and communication in relation to oral certification examination ratings. Acad. Med. 84, 1603–1609. doi: 10.1097/ACM.0b013e3181bb2227

Hucbner, A., and Lucht, M. (2019). Generalizability theory in R. Pract. Assess. Res. Eval. 24, 5–12. doi: 10.7275/5065-gc10

*Ibrahim, A. M. (2011). Using generalizability theory to estimate the relative effect of class size and number of items on the dependability of student ratings of instruction. Psychol. Rep. , 109, 252–258, doi: 10.2466/03.07.11.PR0.109.4.252-258

Iramaneerat, C., and Yudkowsky, R. (2007). Rater errors in a clinical skills assessment of medical students. Eval. Health Prof. 30, 266–283. doi: 10.1177/0163278707304040

Isely, P., and Singh, H. (2005). Do higher grades lead to favourable student evaluations? J. Econ. Educ. 36, 29–42. doi: 10.3200/JECE.36.1.29-42

Johnson, R. (2000). The authority of the student evaluation questionnaire. Teach. High. Educ. 5, 419–434. doi: 10.1080/713699176

Kane, M. (2012). Validating score interpretations and uses. Lang. Test. 29, 3–17. doi: 10.1177/0265532211417210

*Kane, M. T., Gillmore, G. M., and Crooks, T. J. (1976). Student valuation of teaching: the generalizability of class means. J. Educ. Meas. , 13, 173–183, doi: 10.1111/j.1745-3984.1976.tb00009.x

Kim, S. C., and Wilson, M. (2009). A comparative analysis of the ratings in performance assessment using generalizability theory and the many-facet Rasch model. J. Appl. Meas. 10, 408–423.

PubMed Abstract | Google Scholar

Ko, J., Sammons, P., and Bakkum, L. (2013). Effective teaching: a review of research and evidence . Berkshire: CfBT Education Trust.

Kogan, L. R., Schoenfeld-Tacher, R., and Hellyer, P. W. (2010). Student evaluations of teaching: perceptions of faculty based on gender, position, and rank. Teach. High. Educ. 15, 623–636. doi: 10.1080/13562517.2010.491911

Lee, M., and Cha, D. (2016). A comparison of generalizability theory and many facet Rasch measurement in an analysis of mathematics creative problem-solving test. J. Curric. Eval. 19, 251–279. doi: 10.29221/jce.2016.19.2.251

*Li, G., Hou, G., Wang, X., Yang, D., Jian, H., and Wang, W. (2018). A multivariate generalizability theory approach to college students’ evaluation of teaching. Front. Psychol. 9:1065. doi: 10.3389/fpsyg.2018.01065

Lidice, A., and Saglam, G. (2013). Using students’ evaluations to measure educational quality. Procedia Soc. Behav. Sci. 70, 1009–1015. doi: 10.1016/j.sbspro.2013.01.152

Linacre, J. M. (1989). Many-facet Rasch measurement . Chicago: MESA Press.

Linacre, J. M. (1994). Many-facet Rasch measurement (2). Chicago: MESA Press.

Linacre, J. M. (2001). Generalizability Theory and Rasch Measurement. Rasch Measurement Transactions , 15, 806–807.

Linacre, J. M. (2003). A user’s guide to FACETS (computer program manual) . Chicago: MESA Press.

Lord, F. M., and Novick, M. R. (1968). Statistical theories of mental test scores . Reading, MA: Addison-Wesley.

Lunz, M., Wright, B., and Linacre, J. (1990). Measuring the impact of judge severity on examination scores. Appl. Meas. Educ. 3, 331–345. doi: 10.1207/s15324818ame0304_3

Marsh, H. W. (2007). “Students’ evaluations of university teaching: a multidimensional perspective” in The scholarship of teaching and learning in higher education: an evidence-based perspective (Dordrecht: Springer), 319–384.

*Marsh, H. W., and Overall, J. U. (1980). Validity of students’ evaluation of teaching effectiveness: cognitive and affective criteria. J. Educ. Psychol. , 72, 468–475, doi: 10.1037/0022-0663.72.4.468

*Mazor, K., Clauser, B., Cohen, A., Alper, E., and Pugnaire, M. (1999). The dependability of students’ ratings of preceptors. Acad. Med. , 74, S19–S21, doi: 10.1097/00001888-199910000-00028

McNamara, T. F., and Knoch, U. (2012). The Rasch wars: the emergence of Rasch measurement in language testing. Lang. Test. 29, 555–576. doi: 10.1177/0265532211430367

Mushquash, C., and O’Connor, B. P. (2006). SPSS and SAS programs for generalizability theory analysis. Behav. Res. Methods 38, 542–547. doi: 10.3758/BF03192810

Onwuegbuzie, A. J., Daniel, L. G., and Collins, K. M. T. (2006). A meta-validation model for assessing the score-validity of student teaching evaluations. Qual. Quant. 43, 197–209. doi: 10.1007/s11135-007-9112-4

Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., et al. (2021). Updating guidance for reporting systematic reviews: development of the PRISMA 2020 statement. J. Clin. Epidemiol. 134, 103–112. doi: 10.1016/j.jclinepi.2021.02.003

Park, B., and Cho, J. (2023). How does grade inflation affect student evaluation of teaching? Assess. Eval. High. Educ. 48, 723–735. doi: 10.1080/02602938.2022.2126429

*Quansah, F. (2020). An assessment of lecturers’ teaching using generalisability theory: a case study of a selected university in Ghana. South Afr. J. High. Educ. , 34, 136–150. doi: 10.20853/34-5-4212

Quansah, F. (2022). Item and rater variabilities in students’ evaluation of teaching in a university in Ghana: application of many-facet Rasch model. Heliyon 8, e12548–e12549. doi: 10.1016/j.heliyon.2022.e12548

Rantanen, P. (2013). The number of feedbacks needed for reliable evaluation: a multilevel analysis of the reliability, stability and generalisability of students’ evaluation of teaching. Assess. Eval. High. Educ. 38, 224–239. doi: 10.1080/02602938.2011.625471

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests . Chicago: MESA Press.

Raza, S. A., and Khawaja, F. N. (2013). Faculty development needs as perceived by departmental heads, teachers, and students of Pakistani universities. Lit. Inform. Comput. Educ. J. 4, 992–998. doi: 10.20533/licej.2040.2589.2013.0132

Raza, S. A., Majid, Z., and Zia, A. (2010). Perceptions of Pakistani university students about roles of academics engaged in imparting development skills: implications for faculty development. Bull. Educ. Res. 32, 75–91.

*Rindermann, H., and Schofield, N. (2001). Generalizability of multidimensional student ratings of university instruction across courses and teachers. Res. High. Educ. , 42, 377–399, doi: 10.1023/A:1011050724796

Rosli, M. K., Mistima, S., and Rosli, N. (2017). Students’ attitude and anxiety towards statistics a descriptive analysis. Res. Educ. Psychol. 1, 47–56.

*Samian, Y., and Noor, N. M. (2012). Students’ perception of good lecturer based on lecturer performance assessment. Procedia Soc. Behav. Sci. , 56, 783–790, doi: 10.1016/j.sbspro.2012.09.716

Shavelson, R. J., and Webb, N. M. (1991). Generalizability theory: A primer . Newbury Park, CA: SAGE.

Spooren, P., Brockx, B., and Mortelmans, D. (2013). On the validity of student evaluation of teaching: the state of the art. Rev. Educ. Res. 83, 598–642. doi: 10.3102/0034654313496870

*Spooren, P., Mortelmans, D., and Christiaens, W. (2014). Assessing the validity and reliability of a quick scan for student’s evaluation of teaching. Results from confirmatory factor analysis and G Theory. Stud. Educ. Eval. , 43, 88–94, doi: 10.1016/j.stueduc.2014.03.001

Stroebe, W. (2020). Student evaluations of teaching encourages poor teaching and contributes to grade inflation: a theoretical and empirical analysis. Basic Appl. Soc. Psychol. 42, 276–294. doi: 10.1080/01973533.2020.1756817

Taut, S., and Rakoczy, K. (2016). Observing instructional quality in the context of school evaluation. Learn. Instr. 46, 45–60. doi: 10.1016/j.learninstruc.2016.08.003

Teker, G. T., Güler, N., and Uyanik, G. K. (2015). Comparing the effectiveness of SPSS and EduG using different designs for generalizability theory. Educ. Sci.: Theory Pract. 15, 635–645. doi: 10.12738/estp.2015.3.2278

*Üstünlüoğlu, E., and Can, S. (2012). Student evaluation of teachers: a case study of tertiary level. Int. J. New Trends Educ. Implicat. , 3, 92–99

*VanLeeuwen, D. M., Dormody, T. J., and Seevers, B. S. (1999). Assessing the reliability of student evaluation of teaching (SET) with generalizability theory. J. Agric. Educ. , 40, 1–9, doi: 10.5032/jae.1999.04001

Vlăsceanu, L., Grünberg, L., and Pârlea, D. (2004). Quality assurance and accreditation: a glossary of basic terms and definitions . Bucharest: United Nations Educational, Scientific and Cultural Organization.

Wachtel, H. T. (1998). Student evaluation of college teaching effectiveness: a brief review. Assess. Eval. High. Educ. 23, 191–212. doi: 10.1080/0260293980230207

Wilhelm, W. B. (2004). The relative influence of published teaching evaluations and other instructor attributes on course choice. J. Mark. Educ. 26, 17–30. doi: 10.1177/0273475303258276

Yang, F. M., and Kao, S. T. (2014). Item response theory for measurement validity. Shanghai Arch. Psychiatry 26, 171–177. doi: 10.3969/j.issn.1002-0829.2014.03.010

Keywords: student evaluation, higher education, validity, teacher, student, courses

Citation: Quansah F, Cobbinah A, Asamoah-Gyimah K and Hagan JE (2024) Validity of student evaluation of teaching in higher education: a systematic review. Front. Educ. 9:1329734. doi: 10.3389/feduc.2024.1329734

Received: 29 October 2023; Accepted: 26 January 2024; Published: 08 February 2024.

Copyright © 2024 Quansah, Cobbinah, Asamoah-Gyimah and Hagan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: John Elvis Hagan Jr., [email protected]

COMMENTS

  1. Critical Analysis of Reliability and Validity in Literature Reviews

    Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Medical Research Methodology, 18(1), 1-7.

  2. (PDF) Validity and the review of literature

    The purpose of the present paper is to link the review of literature to the concept of construct validity using Messick's (1989, 1995) conception of validity as a unitary concept. Reviews of ...

  3. Literature Review: Measures for Validity


  4. Guidance on Conducting a Systematic Literature Review

    We can also evaluate the validity and quality of existing work against a criterion to reveal ... subtopics: (1) the definition, typology, and purpose of literature review and (2) the literature review process. The literature review process is further broken down into subtopics on formulating the research problem, developing and validating the ...

  5. Literature review as a research methodology: An overview and guidelines

    As mentioned previously, there are a number of existing guidelines for literature reviews. Depending on the methodology needed to achieve the purpose of the review, all types can be helpful and appropriate to reach a specific goal (for examples, please see Table 1). These approaches can be qualitative, quantitative, or have a mixed design depending on the phase of the review.

  6. Reliability vs. Validity in Research

    Reliability is about the consistency of a measure, and validity is about the accuracy of a measure. It's important to consider reliability and validity when you are creating your research design, planning your methods, and writing up your results, especially in quantitative research. Failing to do so can lead to several types of research ...

  7. Chapter 9 Methods for Literature Reviews

    Literature reviews play a critical role in scholarship because science remains, first and foremost, a cumulative endeavour (vom Brocke et al., 2009). As in any academic discipline, rigorous knowledge syntheses are becoming indispensable in keeping up with an exponentially growing eHealth literature, assisting practitioners, academics, and graduate students in finding, evaluating, and ...

  8. PDF Conducting a Literature Review

    1. What is a literature review; 2. Tools to help with the various stages of your review: searching, evaluating, analysing and interpreting, writing, publishing; 3. Additional resources ... Assessment of the validity of findings; interpretation and presentation of results; reference list; introduction; methods; discussion; conclusion

  9. Identifying Validity in Qualitative Research: A Literature Review

    Writers have searched for and found qualitative equivalents that parallel the traditional quantitative approach, the first being validation: "validity relates to the honesty and genuineness of the research data, while reliability relates to the reproducibility and stability of the data" (Creswell, 2013, p. 202).

  10. Appendix A Assessing Validity of Systematic Reviews

    The inclusion or exclusion of studies in a systematic review should be clearly defined a priori. The eligibility criteria used should specify the patients, interventions or exposures and outcomes of interest. In many cases the type of study design will also be a key component of the eligibility criteria.

  11. How to Write a Literature Review

    Examples of literature reviews. Step 1 - Search for relevant literature. Step 2 - Evaluate and select sources. Step 3 - Identify themes, debates, and gaps. Step 4 - Outline your literature review's structure. Step 5 - Write your literature review.

  12. Writing a Literature Review

    The lit review is an important genre in many disciplines, not just literature (i.e., the study of works of literature such as novels and plays). When we say "literature review" or refer to "the literature," we are talking about the research (scholarship) in a given field. You will often see the terms "the research," "the ...

  13. Critical Analysis of Reliability and Validity in Literature Reviews

    Chetwynd, E. J.: Literature reviews can take many forms depending on the field of specialty and the specific purpose of the review. The evidence base for lactation integrates research that cuts across multiple specialties (Dodgson, 2019), but the most common literature reviews accepted in the Journal of Human Lactation include scoping reviews, systematic reviews, …

  14. Assessing Validity in Systematic Reviews (Internal and External)

    Validity for systematic reviews is how trustworthy the review's conclusions are for a reader. Systematic reviews compile different studies and present a summary of a range of findings. It's strength in numbers - and this strength is why they're at the top of the evidence pyramid, the strongest form of evidence.

  15. A practical guide to data analysis in general literature reviews

    The general literature review is a synthesis and analysis of published research on a relevant clinical issue, and is a common format for academic theses at the bachelor's and master's levels in nursing, physiotherapy, occupational therapy, public health and other related fields. ... and many confounding factors. The validity and reliability ...

  16. Critical Analysis of Reliability and Validity in Literature Reviews

    Critical Analysis of Reliability and Validity in Literature Reviews. 2022 Aug; 38(3): 392-396. doi: 10.1177/08903344221100201.

  17. Ten Simple Rules for Writing a Literature Review

    Literature reviews are in great demand in most scientific fields. Their need stems from the ever-increasing output of scientific publications. For example, compared to 1991, in 2008 three, eight, and forty times more papers were indexed in Web of Science on malaria, obesity, and biodiversity, respectively. Given such mountains of papers, scientists cannot be expected to examine in detail every ...

  18. Evaluating Literature Reviews and Sources

    A good literature review evaluates a wide variety of sources (academic articles, scholarly books, government/NGO reports). It also evaluates literature reviews that study similar topics. This page offers you a list of resources and tips on how to evaluate the sources that you may use to write your review.

  19. Questionnaire validation practice: a protocol for a systematic

    Strengths and limitations of this study. This is the first systematic literature review to examine types of validity evidence for a range of health literacy assessments within the framework of the authoritative reference for validity testing theory, The Standards for Educational and Psychological Testing. The review is grounded in the contemporary definition of validity as a quality of the ...

  20. Getting started

    What is a literature review? Definition: A literature review is a systematic examination and synthesis of existing scholarly research on a specific topic or subject. Purpose: It serves to provide a comprehensive overview of the current state of knowledge within a particular field. Analysis: Involves critically evaluating and summarizing key findings, methodologies, and debates found in ...

  21. Full article: A construct validity analysis of the concept of

    Aim and research questions. A recent systematic review paper (Newell et al., 2019) recommended an analysis of psychological literacy construct validity as a priority for future research. This paper directly addresses this research gap by identifying any threats to Messick's (1995) six aspects of construct validity across a broad range of sources.

  22. Evaluating Content Validity Trends Across Years: A Systematic ...

    The data presented in this Data in Brief article provides an overview of the systematically conducted and scientifically oriented literature review on the method of content validity used in developing research instruments. The articles used in this study are obtained from various reputable journals indexed in Scopus.

  23. Validity and reliability of a questionnaire: a literature review

    Method: This study was performed in two stages: first, development of the instrument based on the literature review and semi-structured interviews with 14 speech and language pathologists; and second, evaluation of the psychometric properties. Content validity of the instrument was assessed by SLP experts who were experienced in the field of EBP.

  24. Assessing the Assessment: Evidence of Reliability and Validity in the

    In addition to conducting an in-depth review of the relevant literature and previous experiences and research with predecessors (e.g., PACT, NBPTS), ... Validity refers to the degree to which evidence and theory support the proposed interpretations and uses of test scores. The consensus view holds that validity entails developing a coherent ...

  25. [2402.08565] Artificial Intelligence for Literature Reviews

    This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). A SLR is a rigorous and organised methodology that assesses and integrates previous research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process. The increasing role of AI in this field shows great potential in ...

  26. A systematic review of the validity and reliability of patient‐reported

    The aim of this systematic review was to identify PREMs, assess their validity and reliability, and assess any bias in the study design of PREM validity and reliability testing, irrespective of the health care context the PREMs are designed to be used in. Objectives: To identify existing tools for measuring patient‐reported experiences ...

  27. Frontiers | Validity of student evaluation of teaching in higher education: a systematic review

    Introduction: Data obtained from students regarding the quality of teaching are used by higher education administrators to inform decisions concerning tenure, promotion, course development and instructional modifications, among others. This article provides a review regarding studies conducted to examine the validity of student evaluation of teaching, specifically focusing on the following ...

  28. How rigorous is active learning research in STEM education? An

    Active learning is a popular approach to teaching and learning that has gained traction through research on STEM educational improvement. There have been numerous university- and national/international-level efforts focused on transitioning courses from the lecture method to active learning. However, despite these large-scale changes, the active learning literature has not been assessed on its ...