GM Free Cymru

A critique of the new paper by Seralini et al -- statistics and significance

This critique, published by GM Watch, was undertaken by somebody with a professional background in statistical analysis and significance testing. It's very interesting -- and he tackles the argument, made by some critics of the Seralini study, that the groups of rats studied were too small to demonstrate any statistical significance in the findings.

Comment on Seralini's findings
By a former research analyst with a major government agency
Published by GMWatch, 1 October 2012

I was interested in the argument about Seralini et al's paper because of the assertions that were being made - that the entire study and all results were useless because the sample size was not large enough and that the Sprague-Dawley (SD) rats were prone to tumours, making it hard to show significant differences.

I saw the timing differences as key, even if the SD rat was susceptible, because it suggested to me that the treatment [with NK603 GM maize and/or Roundup] was at least accelerating the tumour progression. In the figures given in the paper I did not see any particular sensitivity over time in the SD rats when used as a control [not exposed to NK603/Roundup].

Also, Table 2 stood out, with the doubling and tripling of pathologies in treatment groups compared with controls, with as many as 8, 9, or even all 10 treatment rats in a group affected. This made me question how a study in which such high numbers of rats were affected could be dismissed.

I thought at first that perhaps Seralini's critics were claiming that the controls showed pathology incidence across the entire sample so high as to be statistically indistinguishable from the treatments. But this was not evident in the Seralini et al study.

Then the argument of "historical controls" came up [the research analyst deals with this below].

I also wanted to know what statistical rationale and method was going to be used to dismiss the entire study, as none was offered by the critics. The sample size that Seralini et al used is not generally considered so small that the results must be dismissed - or if it is, why is it sanctioned in the OECD protocol [OECD453 chronic toxicity protocol, on which Seralini based his study]?

There is too much going on here to dismiss the findings. Seralini et al walked us through the entire experience of the rats for the study duration. They should have spent more time dealing with the statistical issues, but that should not be fatal in terms of the science presented.

The bottom line is that you cannot just throw the results in the bin on the grounds that the sample size is too small. What Seralini et al present are the actual results for the total sample population, as comparisons of individual animal counts between treatment and control groups, binary or categorical, at intervals over the 2-year experiment. This 2-year study is designed to test whether the results differ from those of the conventional 90-day study. It is the first such experiment with this maize variety, and it provides an opportunity to see whether time and lifetime exposure make any difference compared to the proportionally much shorter time and duration of exposure in the 90-day studies.

In some cases, the death and tumour numbers are close, and the sample size of 10 might look small (though Table 1 of the paper reports that regulatory tests allow 10, but as a minimum). But the mortality is binary, not continuous, and 5 or 7 deaths out of 10, instead of 3 or 2, is a large difference when it is real. Binary means there are only two possible values, like off or on, zero or one. In this context it means that a rat can only live or die, present a tumour or not; there are no other outcomes. Significance testing methodology in this multiple-binary-results context is not straightforward to design, and results cannot just be casually stated or judged without evidence.

Binary variables are not analysed in terms of means or averages and standard deviations that can be tested for differences. Rather, they are analysed by special methods, such as the chi-square test of proportions, or (possibly relevant here) Fisher's exact test, which is appropriate when expected cell counts fall below 5. So clearly, results can be statistically tested even with cell counts below 5 (not to mention group sizes of 10), in part because it is the variation observed in the experiment's results that is most determining.
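To make this concrete, here is a minimal sketch of Fisher's exact test, implemented from first principles with only the standard library. The 2x2 table uses hypothetical counts (9/10 treated animals affected vs 2/10 controls) chosen for illustration; they are not Seralini et al.'s actual data.

```python
# Fisher's exact test on a 2x2 table [[a, b], [c, d]] of binary outcomes,
# showing that small groups can still yield a valid significance test.
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test: sum the hypergeometric probabilities
    of all tables with the same margins that are no more probable than the
    observed table."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d      # group sizes
    col1 = a + c                   # total number "affected"
    n = row1 + row2

    def prob(k):
        # probability that the first group contains exactly k affected animals
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # tiny tolerance guards against floating-point ties
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical counts: 9/10 treated rats affected vs 2/10 controls
p = fisher_exact_two_sided([[9, 1], [2, 8]])
print(f"p = {p:.4f}")  # well below the conventional 0.05 threshold
```

Even with only 10 animals per group, a difference of this size is unlikely to be chance: the exact test gives a p-value below 0.01.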

Conventional scientific standards of significance testing, regarding the likelihood of demonstrated health effects, are strict, and used to determine acceptance of a particular experiment pertaining to individual outcomes or endpoints. Seralini et al, however, present a chain of results on a suite of several outcomes and markers, by their time distributions, frequency, and progression for controls and treatment groups. This is more complicated to test.

To reject the entire set of results, no matter what their magnitude, and without evidence, even when the entire treatment sample group (10 out of 10) shows an effect several times (Table 2), or 9 out of 10 several additional times, on the basis of simple and unexplained assertions about sample size needs, and with no consideration of context, is not defensible.

Even worse is the use of so-called "historical controls" to dismiss Seralini et al.'s results: controls asserted to show higher variation, drawn from other arbitrarily selected but unspecified studies that are completely removed from Seralini et al.'s experimental model, conditions, and protocols. In Hammond et al. 2004 (Hammond, B., Dudek, R., Lemen, J., Nemeth, M., 2004. Results of a 13 week safety assurance study with rats fed grain from glyphosate tolerant corn. Food Chem. Toxicol. 42, 1003-1014), six so-called reference controls did not add any information or precision in comparison to the concurrent experimental control or the treatments, and were statistically the same. The effect outcomes reported were all continuous variables, with none binary. In this case, increasing sample size did not add anything but noise, and it is not scientifically justifiable to use the results of this study as a "historical control" to reject Seralini et al's results.

In Figures 1 and 2, the time distribution of the results is shifted to the left, towards earlier mortality and tumour presentation, in the treatment groups, compared to the controls. In other words, more rats in the treatment groups died earlier or presented tumours earlier than the controls, and this shows in the time distribution of both these outcomes.

The distribution of the treatment results went from near normal, which is what the controls look like, to what looks like log-normal. In other words, the controls' distribution looks mostly like the usual symmetrical bell shape, called "normal", whereas the treatment distributions are not symmetrical: their mass is shifted towards earlier times, with a long tail extending towards later times. It is customary practice to call such distributions "log-normal", since if you log transform the data, the resulting distribution tends toward the normal shape. The 2-year study timeline looks as if it made a big difference in showing effects that take time to manifest.
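The log-transform point can be illustrated with synthetic data (not the study's measurements): samples from a log-normal distribution are strongly asymmetric, and taking the logarithm recovers an approximately normal, symmetric shape.

```python
# Draw log-normal samples, then show the skewness of the raw values is
# clearly positive while the skewness of their logs is near zero.
import math
import random
import statistics

random.seed(0)
# lognormvariate draws exp(N(mu, sigma)) -- an asymmetric, long-tailed shape
times = [random.lognormvariate(5.0, 0.5) for _ in range(10_000)]
logs = [math.log(t) for t in times]

def skewness(xs):
    """Standardised third moment: 0 for a symmetric (e.g. normal) sample."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

print(f"skew(raw) = {skewness(times):+.2f}")  # clearly positive (long tail)
print(f"skew(log) = {skewness(logs):+.2f}")   # near zero (roughly normal)
```

The raw samples show a large positive skew, while the log-transformed samples are nearly symmetric, which is exactly the signature that leads analysts to label a distribution "log-normal".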

In addition, there is too much else going on that is coherent, from a weight-of-the-evidence perspective, with respect to the timing of mortality and tumours; tumour numbers and tumour progression; and the results of the biochemistry, to just dismiss the study and carry on as if it's not important. In Table 2, more of the treatment rats frequently had tumours and other anatomical pathologies (up to 9 or 10 out of 10 rats affected, with up to 26 individual pathologies or tumours in total per affected group) than the controls (2 to 6 rats out of 10, with 2 to 10 total pathologies). All of these results must be tested by a carefully designed statistical methodology.

One problem [with Seralini et al's paper] is the failure to provide the raw data, so that the statistical methods can be examined and augmented where needed, and statistical properties can be examined, compared, and tested. Even so, it would be cavalier to reject or accept the study findings solely on this basis.

If this is about safety of the human food supply and continuous exposure for population lifetimes and intergenerational timelines, then "safety" must be defined as the practical certainty that injury will not result. This could be defined as a 99% probability found in the data analysis. We don't want false negatives [false assurances of safety]. If there is anything apparent from Seralini et al., it is that existing 90-day studies are not of adequate design or sufficiently powerful to meet this criterion. That's stated as the last line in their paper.
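The "practical certainty" standard can be made concrete with a rough power calculation. The sketch below uses illustrative proportions (30% vs 60% of animals affected -- assumed numbers, not figures from the study) and the standard normal-approximation formula for comparing two proportions, to show what group sizes a strict alpha and 99% power actually demand.

```python
# Approximate per-group sample size needed to detect a difference between
# two proportions with a given significance level and power.
from math import sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.01, power=0.99):
    """Normal-approximation sample size for a two-sided two-proportion test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_b = NormalDist().inv_cdf(power)          # critical value for power
    p_bar = (p1 + p2) / 2                      # pooled proportion
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

# Illustrative: 30% affected in controls vs 60% in treatment
n = n_per_group(0.3, 0.6)
print(f"about {round(n)} animals per group at these strict thresholds")
```

Under these assumed numbers, meeting a 99% practical-certainty criterion requires group sizes far beyond 10, which supports the point that existing 90-day protocols are not powered to deliver that kind of assurance of safety.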

Some have argued that the frequency of tumours in controls may vary between laboratories due to undefined environmental or other conditions, and that this should be measured. Here, I agree with another scientist who expressed the view that perhaps if the differences between control and experimental groups were small, this would be necessary. But in this case, the differences are so striking that it is not necessary to carry out such work to be confident that there is a real effect of the test substances (NK603 and Roundup).

Too many things are going on here, in the gross anatomy observations and their timing, and also in the biochemical measures and mechanisms, all consistent with these observations.

In terms of the statistics concerns raised, it will take a very good critique showing inadequacy, and statistics supporting the null hypothesis [that GM maize NK603 and Roundup do not cause these toxic effects at these doses], to sweep all of these results into the bin.

Moreover, the female results stand out for tumour numbers and progression, and the biochemistry and proposed mechanisms are coherent.

The biochemistry and outcome coherence would be tough to dismiss, based on one simple significance test based on the number of animals alone, or the mortality differences. These differences are not just the mortality numbers, which in some cases beg for statistical testing because of small numbers, but the timing, which stands out in most treatments for both sexes. This clearly suggests something is going on, as shown in Figs. 1 and 2.

You only get this timing result because of the 2-year lifetime study duration. Since this is the first time this maize has been tested this way, that alone says you had better pay close attention.

The timing result will be very difficult to get around for the critics.

Table 2 is also striking in terms of the higher pathology frequencies in both sexes. This is compelling, and hard to get around.

It appears that the statistical methods used for the biochemistry have significance properties built in, but I did not see anything testing for the differences between control and treatments, although the total sample size for the biochemistry, across time and parameters, is much larger and has more power than the mortality data.

So small numbers perhaps cannot be used to dismiss the results. I read this as more of a dose-response comparison between controls and treatment, with treatment showing worse outcomes for liver and kidney markers, and other effects too, all consistent and explanatory of the presented mortality, tumour and other outcomes observed.

So overall, it is the big picture, of all the results taken together, that is too striking to dismiss based on an assertion that the sample size is too small.

"Safety" is the practical certainty that injury will not result. It is freedom from danger. This is the kind of commonsense meaning that most people would think of if they were told they were "safe".

Attempts to dismiss the results of this study based on small sample size invert this, and say that "safety" is declared if you fail to show with practical certainty that injury will result.

To me, it is irresponsible to call something "safe" based on this definition, which you won't find in any dictionary or thesaurus.

In the context of Seralini et al (2012), the responsible authorities would be irresponsible if they used this inverted definition as the standard for making judgments regarding the health effects found in this study. There are too many things happening in the study that need attention and investigation, for them to be thrown in the trash based on some not yet explained statistical standard or opinion.

Bottom line, something is going on in this study that cannot be - must not be - swept away. I conclude that GMOs must be assessed for safety using the lifetime of the test organism.