Recommendations for Reducing the Use of Fish and Amphibians in Endocrine-Disruption Testing of Biocides and Plant Protection Products in Europe

Intent. The intent of Learned Discourses is to provide a forum for open discussion. These articles reflect the professional opinions of the authors regarding scientific issues. They do not represent SETAC positions or policies. And, although they are subject to editorial review for clarity, consistency, and brevity, these articles are not peer reviewed. The Learned Discourses date from 1996 in the North America SETAC News and, when that publication was replaced by the SETAC Globe, continued there through 2005. The continued success of Learned Discourses depends on our contributors. We encourage timely submissions that inform and stimulate discussion. We expect that many of the articles will address controversial topics and promise to give dissenting opinions a chance to be heard. This section is dedicated to the memory of Dr. Peter M Chapman, who founded Learned Discourses and served as Editor until he passed away in 2017.

Rules. All submissions must be succinct: no longer than 1000 words, no more than 6 references, and at most 1 table or figure. Reference format must follow the journal requirement found at www.setacjournals.org. Topics must fall within IEAM's sphere of interest.

Submissions. All manuscripts should be sent via email as Word attachments to the IEAM Editorial Office (learned_discourses@setac.org).

In a nutshell: Recommendations for reducing the use of fish and amphibians in endocrine-disruption testing of biocides and plant protection products in Europe, by Lagadic et al. With the goal of reducing animal testing of endocrine disruptors, in vivo embryo assays offer an ethical alternative to corroborate mammalian toxicity results and address environmental protection goals for aquatic vertebrates.

doubling the number of animals used. Additionally, further animals are needed for range-finding tests to set the appropriate concentration range for the definitive testing.
Because of the required test organism numbers, this testing conflicts with the aim to reduce vertebrate testing and to employ it only as a last resort (see Regulations EC 1107/2009 and EU 528/2012). In the present article, we estimate the number of fish and amphibians required for ED testing, based on registered PPPs and biocides in Europe. To estimate animal numbers, we first estimated the number of biocides and pesticides that require an evaluation for ED properties. The PPP active substances, safeners, and synergists with status "approved" or "pending" were extracted from the European pesticide database (EC 2016). Excluded were attractants, desiccants, elicitors, plant activators, pruning substances, repellents, soil amendments, and viral inoculates. This extraction and exclusion process resulted in 475 PPP active substances. Biocides with status "approved" or "approval in progress" were selected from the European Chemicals Agency database (ECHA 2019). Duplicate substances were corrected for, yielding 297 biocide active substances.
The number of animals required for each test was estimated on the basis of information provided in the test guidelines (Table 1). We assumed, and advocate for, the use of a single control when a solvent is used. For the FSTRA, the fish species requiring the smallest sample size (i.e., zebrafish) was selected as a conservative approach. For each test, a range-finding test was included, with 3 test concentrations, a control, and half the replicates of a definitive test. A conservative failure rate (triggering test repetition) was also estimated for each test, based on the work of Burden et al. (2017), who surveyed test laboratories, and Salinas and Weltje (2018), who analyzed validation data.
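The estimation logic described above can be sketched as follows. All per-test figures in this example are placeholders rather than the Table 1 values, and inflating totals by 1/(1 − failure rate), the expected number of test attempts, is one simple convention for accounting for repetition; the article does not prescribe a specific one.

```python
# Hypothetical sketch of the per-substance animal estimate; the design
# parameters below are placeholders, not the Table 1 values.

def animals_per_substance(per_replicate, replicates, treatments,
                          failure_rate):
    """Animals for one definitive test plus its range-finding test."""
    # Definitive test: each treatment plus a single control
    # (one control is assumed even when a solvent is used).
    definitive = per_replicate * replicates * (treatments + 1)
    # Range-finding test: 3 concentrations + 1 control, half the replicates.
    range_finding = per_replicate * (replicates // 2) * (3 + 1)
    # Expected attempts until a valid test: 1 / (1 - failure_rate).
    return (definitive + range_finding) / (1.0 - failure_rate)

# Placeholder design: 10 fish/replicate, 4 replicates, 3 treatments,
# 20% failure rate, scaled to the 475 PPP active substances.
per_substance = animals_per_substance(10, 4, 3, 0.2)
print(per_substance, 475 * per_substance)  # -> 300.0 142500.0
```

Summing such per-substance figures across tests and substance lists is what produces totals of the magnitude reported below.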
Our estimates indicate that, for mechanistic data sufficiency, 304 000 animals would be needed to fulfill the requirement of ECHA-EFSA-JRC (2018) for pesticides and 190 080 animals for biocides. The estimates for MEOGRT and LAGDA are higher (1.9 and 1.2 million animals for pesticides and biocides, respectively). The use of such high numbers of animals is not compatible with the desire to reduce animal testing in the European Union (EU).
We propose the following ways to reduce the test animal numbers, without compromising the knowledge needed to conduct a proper ecotoxicity evaluation for EDs:

• An obvious recommendation is to optimize test protocols to achieve lower failure rates (e.g., better validation, fewer and/or less ambitious validity criteria) and for authorities to require only tests that follow protocols with a low failure rate.

• … a 31-36% reduction in the number of animals used in an FSTRA; this omission is rarely possible for the other tests.

• Make better use of embryo assays and mechanistic mammalian data. Embryo assays can provide in vivo mechanistic information, and fish embryo assays are currently undergoing Organisation for Economic Co-operation and Development (OECD) validation. For detecting thyroid activity, an amphibian embryo assay, the Xenopus Eleutheroembryonic Thyroid Assay (XETA), has been adopted by the OECD. These assays are considered "nonanimal" tests within the EU because they use embryos at a developmental stage where independent feeding has not yet started, whereas both the United Kingdom Animals (Scientific Procedures) Act 1986 and Directive 2010/63/EU on the protection of animals used for scientific purposes apply only to independently feeding larval forms. The embryo assays used for investigating endocrine mechanisms therefore appear to be an ethical alternative to tests conducted with older larval stages, juveniles, or adults (Halder et al. 2010).

Notes to Table 1: (a) Numbers are provided for zebrafish; using fathead minnow or medaka as test species would increase the number of animals by 30%. (b) The number of animals for the definitive test was based on the minimum number of animals per replicate, the number of replicates per treatment and control, and the number of treatments (cf. the respective guidelines). (c) The failure rate describes the probability of test repetition due to inability to meet guideline validity criteria.
Incorporating embryo assays into ED testing strategies is consistent with EU legislative animal welfare objectives, if the FSTRA or AMA were conducted only when embryo assays indicate endocrine activity.

• Avoid duplication of mechanistic studies across vertebrates. Because of the high level of conservation of the endocrine system and receptor homology, as well as of the key enzymes involved, extrapolation of qualitative screening-level information among vertebrates is warranted. There is high concordance between the results of amphibian or fish assays and rat assays for substances that interact with the estrogen, androgen, thyroid, and steroidogenesis (EATS) modalities (Ankley and Gray 2013; Pickford 2010). This has been confirmed by comparing protein sequence and/or structural information across species at the level of the primary amino acid sequence and functional domains (LaLone et al. 2013). Therefore, it is questionable whether additional mechanistic studies for aquatic vertebrates are needed when endocrine activity has been investigated sufficiently in mammalian assays.
Overall, the evaluation of ED properties for pesticides and biocides will result in a vast increase in the number of animals used in testing. However, utilizing cross-species extrapolation considerably reduces animal use, if no further tests on fish and amphibians are required when endocrine activity is sufficiently investigated in mammalian assays. In vivo embryo assays offer an ethical alternative to corroborate mammalian results and robustly address environmental protection goals for aquatic vertebrates.

Two Canadian federal monitoring programs use data on multiple indicators (or metrics) to evaluate aquatic habitat condition. The Environmental Effects Monitoring (EEM) Program requires the collection of field data on sentinel fish populations and benthic macroinvertebrates (a surrogate indicator for fish habitat) to evaluate the effects of pulp and paper, metal mining, and other industrial effluents on the condition of fish populations and fish habitat (see Environment Canada 2012a). The EEM assessment uses a univariate approach to determine statistically significant differences between and among reference- and exposure-site means for 9 metrics, one at a time. A difference in any 1 metric triggers an enhanced monitoring protocol. The Canadian Aquatic Biomonitoring Network (CABIN) monitors benthic macroinvertebrates to evaluate aquatic habitat conditions (Environment Canada 2012b) but instead uses a multivariate ordination of benthic community densities and evaluates individual exposure (test) sites relative to confidence ellipses (or confidence intervals) around minimally impaired reference sites, often using 3 ordination axes, 2 at a time (i.e., a bivariate approach using axes 1 and 2, axes 2 and 3, and axes 1 and 3).

Although Huebert et al. (2011) identified 3 areas for improvement in the EEM program (including pseudoreplication and the calculation of the Bray-Curtis Index), only the multiple-testing issue is discussed here. The problem arises because 9 biological metrics are evaluated individually with an α set at 0.10. This issue is known as multiplicity, and it results in an increased likelihood of finding a statistically significant difference when testing multiple hypotheses (or multiple endpoints) in 1 study (Streiner 2015). In this case, the probability of a Type 1 (false-positive) error is 1 − (0.9)^9 = 0.61, not 0.10. The use of a Bonferroni correction would maintain the likelihood of a false-positive error at 0.10 across all 9 tests, by changing α from 0.10 to α/n = 0.10/9 = 0.011 (Huebert et al. 2011). This proposal was criticized, because the Bonferroni correction reduces statistical power and increases the likelihood of a Type 2 (false-negative) error (Bosker et al. 2012). The Bonferroni correction is 1 of several possible options that address the issue of multiplicity. We briefly discuss this and other options below and recommend another solution to this issue.
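The two quantities in the paragraph above can be checked directly. This is a sketch; like the multiplicity argument itself, it treats the 9 tests as independent.

```python
# Family-wise Type 1 error for 9 independent tests at alpha = 0.10,
# and the Bonferroni-corrected per-test alpha.
alpha, n = 0.10, 9
fwer = 1 - (1 - alpha) ** n      # probability of >= 1 false positive
bonferroni = alpha / n           # per-test alpha keeping FWER near 0.10
print(round(fwer, 2), round(bonferroni, 3))  # -> 0.61 0.011
```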
The Bonferroni correction (and its stepwise variants) maintains the likelihood of a Type 1 error across multiple tests (Huebert et al. 2011) but sacrifices power (Bosker et al. 2012). Another method controls the False Discovery Rate (FDR) to address the loss of power. However, both of these methods fail to adjust for the potential interrelatedness of the metrics (Streiner 2015). Although procedures that adjust for correlated metrics in the FDR approach have been developed, their use is not widespread. Figure 1 illustrates the consequences of failing to recognize the effect of metric interrelatedness on confidence regions, using a hypothetical data set.
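For illustration, here is a minimal Benjamini-Hochberg step-up procedure, the standard FDR method, run on 9 hypothetical p-values standing in for the 9 EEM metrics. Note that, as discussed above, it still ignores correlation between metrics.

```python
# Minimal Benjamini-Hochberg (FDR) step-up procedure; p-values are
# hypothetical, one per EEM metric.

def benjamini_hochberg(pvals, q=0.10):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k / m) * q,
    # then reject the k smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.500]
print(benjamini_hochberg(pvals))                          # rejects 7 metrics
print([i for i, p in enumerate(pvals) if p <= 0.10 / 9])  # Bonferroni: only 2
```

The comparison shows the power trade-off: on these p-values, Bonferroni rejects 2 hypotheses where the FDR procedure rejects 7.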
Multivariate analysis of variance (MANOVA) provides a solution to the problems of multiple tests and correlated metrics (Lehmacher et al. 1991). In MANOVA, all metrics are evaluated simultaneously, resolving the multiple-comparison, false-positive issue with a single test. Correlations among metrics are also incorporated into MANOVA, so the issue of double-counting redundant metrics is addressed. However, because MANOVA evaluates all metrics simultaneously, a different null hypothesis (H0) is tested (i.e., H0: all metric differences are 0). As a result, rejection of a multivariate null hypothesis may require secondary tests (see Lehmacher et al. 1991) and additional statistical tools (e.g., discriminant functions or canonical variates analysis) to interpret the result and identify the important impacted metrics (e.g., see Bowman and Somers 2006). These secondary tests could be used to evaluate alternate hypotheses based on specific response patterns, inferring the cause of the impairment.
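To make the single-test idea concrete, here is a minimal sketch of a two-sample Hotelling's T² statistic, the 2-group special case of MANOVA, for 2 metrics. The data, and the choice to hand-code the 2×2 algebra, are hypothetical illustrations rather than the EEM analysis itself.

```python
# Minimal two-sample Hotelling's T^2 for 2 metrics: the 2-group
# special case of MANOVA, testing all metrics with a single statistic.
# Data and design are hypothetical, for illustration only.

def mean(values):
    return sum(values) / len(values)

def hotelling_t2(ref, exp):
    """ref, exp: lists of (metric1, metric2) pairs. Returns T^2."""
    n1, n2 = len(ref), len(exp)
    m1 = (mean([p[0] for p in ref]), mean([p[1] for p in ref]))
    m2 = (mean([p[0] for p in exp]), mean([p[1] for p in exp]))

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        return sxx, sxy, syy

    a, b = scatter(ref, m1), scatter(exp, m2)
    df = n1 + n2 - 2
    sxx, sxy, syy = (a[0] + b[0]) / df, (a[1] + b[1]) / df, (a[2] + b[2]) / df
    det = sxx * syy - sxy * sxy          # pooled covariance determinant
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # Mahalanobis distance of the mean difference, scaled by n1*n2/(n1+n2):
    md2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return (n1 * n2 / (n1 + n2)) * md2

reference = [(0, 0), (1, 1), (0, 1), (1, 0)]
exposure = [(2, 2), (3, 3), (2, 3), (3, 2)]
print(round(hotelling_t2(reference, exposure), 6))  # -> 48.0
```

For p metrics, T² converts to an F-value via F = T² × (n1 + n2 − p − 1) / (p × (n1 + n2 − 2)), which is how the single multivariate test yields the probability mentioned below.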
For regulatory programs like EEM and CABIN, evaluation of test sites using thresholds based on the normal range of reference-site variation is possible with both univariate and multivariate approaches (Bowman and Somers 2006). In the univariate case, the difference between the reference-area mean and the exposure-area mean (or a reference-area mean and a test site) is compared to a threshold based on the standard deviation, but without compensation for the effect of correlations between metrics (i.e., the confidence region for 2 metrics considered separately can differ appreciably from a confidence region that accounts for the correlation between the 2 metrics; see Figure 1). In the multivariate scenario, the difference between reference- and exposure-area means (i.e., centroids, or the reference-area centroid and a test site) is compared to a threshold that is a function of all of the metrics, their correlations, and the multivariate reference-area variation, to produce an F-value and associated probability, much as in the univariate case. A significant MANOVA can be followed with discriminant functions analysis and partial analyses (see Bowman and Somers 2006), secondary MANOVAs testing specific alternate hypotheses, or ANOVAs with a Bonferroni correction (Huebert et al. 2011).
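The contrast between the two thresholds can be sketched for 2 standardized metrics. The test-site coordinates (1, −1) used here are a hypothetical point chosen to lie against the correlation, mirroring the constant-Euclidean-distance case in Figure 1.

```python
# Euclidean distance (ED) vs generalized (Mahalanobis) distance (GD)
# from the reference centroid (the origin, for standardized metrics).
# The test site (1, -1) is hypothetical, as in Figure 1.

def distances(x, y, r):
    """ED and GD of point (x, y) for 2 standardized metrics with
    correlation r (covariance matrix [[1, r], [r, 1]])."""
    ed = (x * x + y * y) ** 0.5
    gd = ((x * x - 2 * r * x * y + y * y) / (1 - r * r)) ** 0.5
    return ed, gd

for r in (0.0, 0.3, 0.6, 0.9):
    ed, gd = distances(1.0, -1.0, r)
    print(r, round(ed, 2), round(gd, 2))
# ED stays 1.41 for every r, while GD grows with the correlation,
# so only GD can flag a site that defies the reference-site pattern.
```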
In summary, the use of traditional MANOVA-based approaches by EEM and CABIN should provide more rigorous data analyses that identify true environmental impairments signaling a need for enhanced monitoring. This statistical approach is valid and useful for any area of research involving a multiplicity of metrics.
Figure 1. Four different data sets (A-D) are presented. These data sets consist of 30 reference-site data points (yellow-filled circles), drawn from multivariate normal populations with increasing specified correlations (i.e., Pearson's r = A: 0.0, B: 0.3, C: 0.6, and D: 0.9) between 2 standardized metrics (i.e., means of 0 and SD of 1). The area within the dashed square represents the 95% confidence region defined by the normal range (i.e., mean ± t × SD; Bowman and Somers 2006) when both metrics are independent, whereas the ellipse (solid line) is the 95% confidence region for both metrics considered simultaneously. When the correlation between metrics is ignored, the test site (red-filled circle) lies within the normal range for each metric and is a constant Euclidean distance (ED = 1.41) from the reference-site mean (black-filled circle). By contrast, the ellipse and generalized distance (GD) incorporate the correlation between metrics and change as the correlation changes. The univariate approach uses the dashed square and ED, whereas the multivariate approach uses the ellipse and GD to determine whether a test site falls outside of the normal range (after Lehmacher et al. 1991).