Challenges and Opportunities of Meta-Analysis in Education Research

Meta-analyses are systematic summaries of research that use quantitative methods to find the mean effect size (standardized mean difference) for interventions. Critics of meta-analysis point out that such analyses can conflate the results of low- and high-quality studies, make improper comparisons, and result in statistical noise. All these criticisms are valid for low-quality meta-analyses. However, high-quality meta-analyses correct all these problems. Critics of meta-analysis often suggest that selecting high-quality RCTs is a more valid methodology. However, education RCTs do not show consistent findings, even when all factors are controlled. Education is a social science, and variability is inevitable. Scholars who try to select the best RCTs will likely select RCTs that confirm their bias. High-quality meta-analyses offer a more transparent and rigorous model for determining best practices in education. While meta-analyses are not without limitations, they are the best tool for evaluating educational pedagogies and programs.


INTRODUCTION
A common public perception of science is that research results consistently contradict one another. This might partly stem from the media's poor reporting on new research. The media tends to report on each new landmark study as if it stands in a vacuum, as the sole edict on what science proves. This is problematic because it assumes the newest study is always the most correct, rather than looking to see what the majority of research shows. In the past, researchers would complete systematic literature reviews to discover the scientific consensus on a topic. With this approach, a researcher reads all the studies on a topic and then writes about their findings. This can be problematic because it tends to be purely qualitative, and the researcher gets to present their interpretation without being beholden to any quantitative data.
A meta-analysis is similar to a literature review, except the authors also find the average statistical result for studies on a topic. Typically, meta-analysis results are reported as effect sizes, a statistic that expresses the difference between groups as a standardized mean difference, so we can compare multiple studies. Looking at research through meta-analysis is the most systematic way of examining research. The author must review all studies and then systematically synthesize quantitative results. Ideally, this removes as much bias as possible and provides an interpretation of the most normalized result on a topic. Meta-analysis also serves a fundamental scientific principle: replication. A scientific finding is only truly valid if it can be consistently replicated. Using meta-analysis, we can assess whether a finding has been well replicated. Replication is especially important in education research because scientific results tend to be more variable, and experiments are often carried out by those selling pedagogical products.
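To make the idea of a standardized mean difference concrete, the following is a minimal sketch of how one common effect size, Cohen's d, and its small-sample correction, Hedges' g, can be computed from group summary statistics. The function names and the study values are hypothetical and chosen only for illustration; individual meta-analyses may use other effect size formulas.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def hedges_g(d, n_t, n_c):
    """Apply the usual small-sample correction to Cohen's d."""
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    return d * correction

# Hypothetical study: the treatment group scores 5 points higher,
# and both groups have a standard deviation of 10 and 30 students.
d = cohens_d(mean_t=105, mean_c=100, sd_t=10, sd_c=10, n_t=30, n_c=30)
g = hedges_g(d, n_t=30, n_c=30)
print(round(d, 2), round(g, 2))  # prints 0.5 0.49
```

Because the difference is divided by the pooled standard deviation, studies that use different tests and score scales can be placed on the same footing.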
Over the last two decades, meta-analyses have been crucial in helping to determine what best practice in literacy instruction is. Most famously, the National Reading Panel conducted multiple meta-analyses, including one that compared systematic phonics and whole language instruction. Their research showed systematic phonics has a mean effect size of .44 (NRP, 2001). This is why many reading researchers today recommend systematic phonics instruction as part of a comprehensive literacy program.

Some scholars object to meta-analysis, and they usually cite three main arguments:
1. Meta-analysis ignores study quality.
2. Meta-analysis makes apples-to-oranges comparisons.
3. Meta-analysis tends to show random statistical results, but not meaningful results.

Quality
There are typically 4 main types of studies included in a meta-analysis:
1. Case studies: studies without control groups or done retrospectively.
2. Correlation studies: studies that look at the correlation between two datasets.
3. Quasi-experimental studies: studies that have a non-randomized treatment and control group.
4. Randomized Control Trials (RCTs): studies with randomized treatment and control groups.
Typically, an RCT is seen as a higher quality study than a quasi-experimental study, and a quasi-experimental study is seen as higher quality than a case study. Sample size, duration, fidelity tracking, attrition, and measurement also affect the quality of a study. Typically, higher-quality studies show, on average, lower effect sizes. For example, a large sample size, long duration RCT with standardized measurements is far more accurate than a small, short-duration case study that uses researcher-designed assessments.
Meta-analyses that do a poor job of controlling quality will typically include studies with varying levels of quality, such as case studies and RCTs, and report on one mean effect size. A well-done meta-analysis will either exclude low quality studies or show the difference in results for high versus low quality studies. For example, consider the results section of a fantastic meta-analysis by Fritton (2018). In this study, the authors used a sensitivity analysis to show how the mean effect size changed across varying levels of quality. Interestingly, the highest quality studies showed a similar effect size (.31) to the overall mean for the study (.28), suggesting that quality did not significantly impact results, an unusual finding.

Apples to Oranges
"Apples to Oranges'' is often used as a metaphor for comparisons that are too dissimilar to be meaningful.Within the context of meta-analysis, an example could be drawn from trying to find the mean effect of comprehension instruction and including multiple types of comprehension instruction together, as if they were the same thing.For example, vocabulary and strategy instruction are used to teach comprehension, but they are very different approaches.That said, good meta-analyses control for this by separating the results as moderator variables, as can be seen in Table 2 (Filderman, 2022).Moderator analysis can show what is the mean effect size for different types of studies, or outcomes.
For example, a moderator analysis could differentiate between the effect sizes of RCTs, quasi-experimental studies, and case studies.In contrast, multilevel modeling and regression analysis can be used to estimate the impact of multiple moderator variables at once.As can be seen in Table 3.
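The simplest form of moderator analysis, splitting studies by a single moderator and reporting a mean effect size for each level, can be sketched as follows. The study designs and effect sizes below are hypothetical and are not drawn from any meta-analysis cited here. The sketch also uses unweighted means for brevity, whereas published meta-analyses typically weight studies, for example by inverse variance.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (study design, effect size) pairs, for illustration only.
studies = [
    ("RCT", 0.20), ("RCT", 0.30), ("RCT", 0.25),
    ("quasi-experimental", 0.40), ("quasi-experimental", 0.50),
    ("case study", 0.90), ("case study", 0.70),
]

def mean_effect_by_moderator(rows):
    """Group effect sizes by moderator level and average each group."""
    groups = defaultdict(list)
    for level, effect in rows:
        groups[level].append(effect)
    return {level: mean(effects) for level, effects in groups.items()}

by_design = mean_effect_by_moderator(studies)
for design, effect in by_design.items():
    print(f"{design}: mean effect size = {effect:.2f}")
```

Reporting the subgroup means side by side, rather than one pooled mean, lets a reader see whether lower-quality designs are inflating the overall result.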

Statistical Noise
One less common criticism of meta-analysis is that the authors capture random effects and averages, not meaningful trends. Let's make a hypothetical example. Say we have 10 studies, and they show the following effects: .10, .20, .30, .40, .50, .60, .70, .80, .90, 1.0. You will find a mean effect size of .55, which is quite large. However, there is no discernible central trend within those studies. So by taking the mean, we have actually made the data less meaningful, as opposed to more meaningful. Of course, there are multiple tools for addressing this issue. Most typically, meta-analyses will report confidence intervals, which show the range within which the mean effect likely falls, and/or p values, which indicate how likely a result that large would be if chance alone were at work, alongside their mean effect sizes so that readers can discern whether the mean effect found was meaningful or random noise. Indeed, if you look back at the two graphics from well-done meta-analyses, they both included confidence intervals and p values alongside their effect sizes.
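As a rough illustration, the ten hypothetical effect sizes above can be summarized with a mean and a 95% confidence interval. This sketch treats each study as a single equally weighted observation and uses a t-based interval. That is a simplifying assumption made for illustration, not the weighting scheme of any particular meta-analysis. The critical value 2.262 is the two-sided 95% value of Student's t with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

# The ten hypothetical effect sizes from the example in the text.
effects = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]

# Two-sided 95% critical value of Student's t for 10 - 1 = 9 degrees of freedom.
T_CRIT_DF9 = 2.262

def mean_with_ci(values, t_crit):
    """Return the mean and a t-based confidence interval for that mean."""
    m = mean(values)
    standard_error = stdev(values) / math.sqrt(len(values))
    return m, m - t_crit * standard_error, m + t_crit * standard_error

m, lower, upper = mean_with_ci(effects, T_CRIT_DF9)
print(f"mean = {m:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
# prints: mean = 0.55, 95% CI = [0.33, 0.77]
```

The width of the interval, more than 40 points of effect size, is what signals to a reader that the mean alone says little about what any one study found.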

So, Are these Criticisms Valid?
All three of these criticisms are valid. However, they also only really apply to a poorly done meta-analysis. Meta-analysis is a relatively new technique for reviewing research, and it has evolved over the last 20 years. If you read meta-analyses done in the late 1990s, they often combine multiple poor-quality studies to produce one mean effect size. While more modern meta-analyses tend to be much more sophisticated, there is a lack of consistency within the field of education for meta-analysis methodology. For example, we reviewed meta-analyses on the topic of ESL education. We found 12 meta-analyses on ESL education research, dating back to 2009 (all of which can be found in the references section). Of these 12 meta-analyses, 6 included studies without control groups and did not use moderator analysis to compare the impact of studies with and without control groups. The 6 meta-analyses that did not control quality were not rigorous and, therefore, cannot be used as definitive proof of the scientific consensus.

THE ALTERNATIVE
Those who criticize meta-analysis often claim we should rely on high-quality RCTs instead. This is a problematic solution for two reasons. Firstly, researchers independently decide which RCTs are the most rigorous using complex processes. For example, many scholars have cited Balanced Literacy as the gold standard of reading instruction, based on a handful of RCTs reviewed by WWC (Hechinger, 2022). This suggestion was made in comparison to the findings of the NRP meta-analysis, which recommended systematic phonics instruction, based on dozens of studies.
Secondly, this methodology is based on the belief that well-done RCTs show precise outcomes and therefore do not need replication. But within the field of education, this is undoubtedly false. Let's look at some of the findings from the (Hansford, 2022) meta-analysis on language programs. There were 20 identified RCTs that looked at structured literacy phonics programs. The mean effect size was .48, and the 95% confidence intervals were [.31, .66]. In other words, we can be 95% confident that the mean effect of structured literacy RCTs falls between .31 and .66. This is a pretty wide range: .66 is a moderate to high effect size, and .31 is low. The lowest study showed an effect size of -.11 (Vaden-Kiernan, 2008). And the highest effect size was 1.16 (Farokhbakht, unlisted date). Neither effect size is particularly representative of the normal effect of a phonics intervention. However, a scholar with an agenda could point to either study to make a case for or against structured literacy.
The Vaden-Kiernan study is of far higher quality than the Farokhbakht study. If we examine the highest quality RCT studies, in this case, longitudinal RCTs with standardized assessments, we get 3 studies: (Vaden-Kiernan, 2008), (Torgesen, 2007), and (Bratsch, 2020).
These studies showed a mean effect size of .22, with 95% confidence intervals of [-.50, .95], suggesting a high degree of variability. The lowest study showed a mean effect size of -.11, and the highest study showed a mean effect size of .43 (Bratsch, 2020). Again, a biased academic could pick any of those three studies and argue for or against phonics/structured literacy.
All of these studies could also be apples-to-oranges comparisons, as each study looked at different demographics, programs, and styles of approaches. One study looked at a scripted DI approach (Vaden-Kiernan, 2008). One study looked at an Orton Gillingham approach (Torgesen, 2007). And one study looked at a speech-to-print approach (Bratsch, 2020).
However, we also see very different results even if we only look at RCT studies on the same program. For example, let's look at Read 180. In 2022, Hansford and McGlynn identified 12 RCTs on Read 180, with a mean effect size of .11 and 95% confidence intervals of [.04, .19]. Here the confidence intervals suggest a very narrow range. However, the highest effect size study (Interactive Inc, 2002) showed a mean effect size of .41, and the lowest effect size study (Fitzgerald, 2008) showed a mean effect size of 0 (for longitudinal outcomes). If we remove all the lowest quality studies and only include those that used standardized measurements, were longitudinal, and controlled for fidelity, we get 4 studies: (Interactive Inc, 2002), (Fitzgerald, 2008), (Meisch, 2011), and (Sprague, 2012). Together the studies show a mean effect size of .16, but the confidence intervals, [-.12, .40], are much wider than when all 12 RCTs are included.
Moreover, both the (Fitzgerald, 2008) study and the (Interactive Inc, 2002) study were within the highest quality category. Hence, the range of effect sizes was still 0 to .41. If we look at both quasi-experimental and RCT studies, 13 out of 19 mean effect sizes were between 0 and .29, with 95% confidence intervals of [0, .19]. While looking at all the studies together suggested a very consistent trend of a low effect, looking at only the highest quality studies made the found effect appear more random and made it difficult to find a meaningful trend.
That said, the Read 180 studies covered multiple grades and used different designs. Reading Recovery might be a better example. Within the (Hansford, 2022) meta-analysis of language programs, we identified 11 RCT studies on Reading Recovery, all of which looked at the identical grade. Moreover, all but two used the same basic design, comparing 1-on-1 intensive reading instruction for 20 weeks to a no-treatment control group. These 11 studies showed a mean effect size of .38, with 95% confidence intervals of [-.99, 1.24] (outliers included). All these studies are RCTs on the same grade and same program. All but 2 of these studies compared no treatment to treatment. And yet, a large range of effect sizes was found. The largest impact was found in (Iverson, 1999), with a mean effect size of 2.59, and the lowest was (Schmitt, 2004), with a mean effect size of -.50. Again, any scholar with an agenda could pick either RCT and make the opposite argument. Even if we take the two highest quality studies, in this case (Holliman, 2013) and the (Center for Research in Education and Social Policy, 2022), we still get opposite results. Both studies were large-scale longitudinal RCTs. (Holliman, 2013) showed a mean effect size of .48, and the (CRESP, 2022) study showed a mean effect size of -.19.
Inconsistent findings among RCTs create a sentiment that education science produces inconsistent findings and cannot be trusted. Again, any scholar with an agenda could pick either of these RCTs and make completely opposite arguments. Scholars on either side of the reading wars debate will likely want to point to the flaws in either study as a defense for their perspective. Indeed, pro-Reading Recovery scholars frequently point to the (Holliman, 2013) study as evidence that Reading Recovery works, and pro-structured literacy advocates, including Emily Hanford, frequently point to the (CRESP, 2022) study. Both the Holliman and CRESP studies have weaknesses. The Holliman study had poor fidelity controls in the control group, and the CRESP study had high attrition rates. Both studies compared intensive 1-on-1 reading instruction to no additional instruction, which is not an ideal study design. That said, the studies are both of higher-than-average quality compared to other studies in the Hansford 2022 meta-analysis.
Of course, instructional programs include multiple variables at once and are often compared to business-as-usual control groups. For this reason, it might be easier to isolate the fixed effect of a pedagogy than a program. (Bakken, 1997) and (Boyle, 1993) conducted RCTs on the effects of cognitive strategies on reading comprehension for intermediate students with learning disabilities. Both studies used active control groups, in which instruction was the same as in the treatment group, minus the instruction on cognitive strategies. Both studies used standardized assessments. Both studies had a sample of between 30 and 40 students. Both studies were short and lasted less than a month. However, in the Bakken study we found an effect size of 2.71 for the use of cognitive strategies, and in the Boyle study we found an effect size of .15. Both studies were of extremely high quality and similar, and yet they yielded completely different results. The above-discussed anomalies suggest that even the highest quality RCTs do not lead to precise or consistent results and that even a very high-quality RCT cannot be considered reliable evidence of efficacy in isolation.

Do Meta-Analyses Provide More Homogenous Effects?
While the above research suggests that RCTs do not provide homogenous results, this does not necessarily mean that meta-analyses do. Indeed, many meta-analyses at face value appear to show very different results for similar research questions. As of 2023, we can identify at least 14 peer-reviewed and experimental meta-analyses conducted on English phonics instruction. (Camilli, 2003) identified the lowest effect size of .24.
Conversely, (Weiser, 2011) found the highest effect size of .78. These results are seemingly very different; however, these meta-analyses examined very different questions. (Camilli, 2003) attempted to identify the fixed effect of systematic phonics versus unsystematic phonics, and (Weiser, 2011) was attempting to identify the random effect of encoding instruction. These research questions are fundamentally different. Comparatively, (Steubings, 2008), which had the same research question as Camilli, found a mean effect size of .31, which is statistically comparable. Similarly, (Hansford, 2022), (Piasta, 2011), (Ehri, 2001), and the (NRP, 2001) all looked at the random effect for general phonics instruction and found a mean effect size of between .40 and .45 for phonics, suggesting a very homogenous effect. While meta-analyses often produce heterogeneous effects, these differences usually have to do with the research question and methodology used.
Meta-analyses also have unique advantages for detecting outlier data. It is impossible to tell if the results from a single RCT represent outlier data when taken in isolation. However, tools like funnel plots, trim and fill, and IQR analysis can be used within a meta-analysis to identify if a single study is an outlier (Terrin, 2003). Funnel plots can be especially useful in visualizing whether there is outlier data related to sample size.
Small sample size studies often have larger effect sizes, partly because it is harder to effectively implement a new pedagogy with a larger group of teachers. Smaller sample size studies can also produce more random results, as individual outliers can have a greater impact (IntHout, 2015). Lastly, smaller sample size studies can be more easily replicated and used to "fish" for better results (Lee, 2012). Funnel plots are commonly used to compare the results of studies with the sample sizes of studies, to test whether smaller sample size studies increase heterogeneous effects. To help illustrate this point, we created a scatter plot of RCT studies on Read 180, based on the (Hansford, 2022) meta-analysis of Read 180. The results can be seen in Figure 1. Figure 1 shows that the negative effect sizes were associated with low sample size studies, suggesting that these low sample sizes led to more random and, thus, more heterogeneous results. Similarly, the two highest effect sizes found were both associated with studies that also had study samples below the median sample size. Most studies with a sample size above 1000 fell within the effect size range of 0 to .20, which, according to Cohen's guide, suggests a negligible outcome. This meta-analysis tool allows readers and researchers to better understand what an expected outcome might be for a pedagogy or program than any single RCT could provide.
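Of the outlier tools named above, IQR analysis is the simplest to sketch. The example below flags any effect size that falls more than 1.5 interquartile ranges outside the first or third quartile. The effect sizes are hypothetical and are not the values from any meta-analysis discussed here, and the 1.5 multiplier is a common convention rather than a rule taken from the cited sources.

```python
from statistics import quantiles

# Hypothetical effect sizes from eleven studies, for illustration only.
effects = [-0.10, 0.05, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60, 2.50]

def iqr_outliers(values, multiplier=1.5):
    """Return the values lying outside the conventional IQR fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

print(iqr_outliers(effects))  # prints [2.5]
```

A single RCT reporting an effect of 2.50 would look like strong evidence in isolation. Placed alongside the other ten studies, it is identifiable as an outlier.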

IMPLICATIONS FOR PRACTICE
Whether trying to measure the efficacy of a principle or a program, it is incredibly difficult to find a consistent effect across multiple RCTs. This difficulty stems from the fact that education is not a hard science; it is a social science. There is a large degree of variability in research results. Teacher quality, student motivation, demographics, study design, and study quality will all impact the effect size. Controlling all these variables consistently is nearly impossible. You, therefore, cannot expect results to be static across multiple studies. It is not rational to expect individual RCT studies to produce results that do not vary.
Even if high-quality RCTs did show consistent results, isolating the highest-quality RCTs is very difficult and requires people to make unbiased judgments. People are likely to be more critical of the studies that do not confirm their biases and less critical of the ones that do. We can only truly avoid cherry-picking results to support our biases by reviewing all of the relevant studies on a topic. This does not mean viewing all studies uncritically. Instead, a good meta-analysis uses objective criteria to identify how effect sizes varied according to study quality. Moreover, when factors limit the validity of a meta-analysis, such as a lack of studies with control groups, the authors should identify it as a limitation.
Using methodologies like moderator variable analysis, regression analysis, and multilevel modeling with meta-analysis is a far more transparent process than simply trying to select the most valid study.
However, we also see very different results even if we only look at RCT studies on the same program.For example, let's look at Read 180.In 2022 Hansford and Mcglynn identified 12 RCTs on Read 180, with a mean effect size of .11and 95% confidence intervals of [.04,

Table 2
Filderman Sensitivity Analysis

Table 3
Regression Analysis Example