Research design and data collection
This was a multireader, three-way crossover, randomized controlled trial involving 36 ultrasound specialists from multiple institutions, using images and videos collected during routine clinical practice (Figure 1). It was approved by the Institutional Review Board of the First Affiliated Hospital of Sun Yat-sen University (2019421) and was conducted in accordance with the Declaration of Helsinki. All participants provided written informed consent. This study of the clinical effectiveness of deep learning artificial intelligence in the detection of fetal intracranial malformations on ultrasound examination was registered at http://www.chictr.org.cn on July 5, 2021 (registration number: ChiCTR2100048233).
Fetal neurosonography images and videos with normal or abnormal intracranial findings were consecutively collected at the First Affiliated Hospital of Sun Yat-sen University (FAHSYSU) and the Affiliated Women's and Children's Hospital of Xiamen University (WCHXU) from January 1, 2021, to December 1, 2022. All images and videos met the following criteria: (1) the neurosonography data were obtained by an expert with more than 10 years of experience in fetal anatomical scanning; (2) the data included at least one of the three reference screening planes obtained according to guidelines;22,27,28 and (3) the image showed an appropriately magnified, intact skull, without obvious acoustic shadows or measurement caliper overlays. Color Doppler ultrasound data and any data that did not meet the inclusion criteria were excluded. Before testing, each item of data underwent quality control by two senior sonographers (ML and HX, each with over 15 years of experience) and was included only if the two experts reached a consensus.
Intracranial abnormal findings in the reference screening planes were classified into nine different patterns according to a textbook of prenatal brain ultrasound and the ISUOG practice guidelines:22,27,28 (1) non-visualization of the cavum septi pellucidi (CSP); (2) non-visualization of the septum pellucidum (SP); (3) crescent-shaped single ventricle; (4) mild ventriculomegaly (VM); (5) severe VM; (6) non-intraventricular cyst; (7) intraventricular cyst; (8) open fourth ventricle; and (9) mega cisterna magna. Together with the normal pattern, there were therefore 10 different patterns. All prenatal ultrasound diagnoses of the images and videos were confirmed by prenatal or postnatal magnetic resonance imaging (MRI), follow-up ultrasound, or autopsy. Ultrasound examinations were performed using machines from six different manufacturers, including GE Voluson 730 Expert/E6/E8/E10 (GE Healthcare, Zipf, Austria), Samsung UGEO WS80A (Samsung Medison, Seoul, South Korea), and Philips iU22 (Philips Healthcare, Bothell, WA, USA).
Randomization and crossover design
As shown in Figure 1, the eligible neurosonography images/videos were randomly grouped into three datasets (Dataset 1, Dataset 2, and Dataset 3), each containing a balanced proportion of the nine malformation patterns and the normal pattern. These datasets were read by three groups of sonographers in three reading modes (unassisted mode, simultaneous mode, and second-reading mode). The three dataset orders and the three reading modes were combined in a crossover design, constituting three types of tests (Figure 1).
Thirty-six ultrasound specialists from 32 different hospitals across the country, with three different levels of expertise, participated in the reading test. They were divided into three groups by expertise (n = 12 per group) and randomly assigned to one type of test using a random allocation sequence generated by a research assistant with a computer random number generator. This sequence was concealed until testing began. The expert group comprised professors with more than 10 years of experience; the group of competent sonographers comprised attending sonographers with 5 to 10 years of experience; and the trainee group comprised residents with 2 to 4 years of experience in fetal anatomical scanning. The three groups of ultrasound specialists had performed at least 10,000, 5,000, and 1,000 fetal ultrasound examinations, respectively. All sonographers were blinded to the diagnoses and had not previously reviewed these images or videos. To avoid carryover effects and contamination, the datasets were presented to each sonographer in a random order that varied from reader to reader.
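The allocation procedure is not described beyond the use of a computer random number generator. As an illustrative sketch only (assuming equal numbers per test type within each expertise stratum; reader IDs and the seed are hypothetical), such a concealed sequence and the per-reader dataset orders could be generated in R as follows:

```r
# Illustrative sketch only: random assignment of the 12 readers in one
# expertise stratum to the three test types, plus an independent random
# dataset order per reader.
set.seed(20210705)                              # arbitrary seed for reproducibility
readers   <- paste0("reader_", 1:12)            # hypothetical reader IDs
test_type <- sample(rep(1:3, each = 4))         # assumes 4 readers per test type
allocation <- data.frame(reader = readers, test = test_type)

# each reader views the three datasets in a random order to avoid carryover
dataset_order <- t(replicate(12, sample(c("Dataset 1", "Dataset 2", "Dataset 3"))))
```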
A computer program designed to perform the tests
Specifically, for this reading test, we designed a program that could display the images/videos on a computer screen together with the 10 pattern options (see Supplementary Movie 1). An applet embedded in the program also allowed offline diameter measurements. After reviewing each image/video, the reader selected the corresponding pattern, and the program automatically recorded the answer and the reading time for each data item. The display settings differed across the three reading modes. In unassisted mode there was no AI reference, whereas in simultaneous mode the image/video with the AI diagnosis was displayed in parallel with the original data from the beginning of the reading. In second-reading mode, after reading the original image/video and recording an initial diagnosis, the reader clicked the "Next" button, the same data were then displayed with the AI diagnosis, and the reader made a final diagnosis. Thus, in second-reading mode there were two sets of answers for each data item: before and after AI assistance. Prior to the study, all sonographers received full training on how to use the software and could participate in the test only after they were qualified in its use. PAICS was built on the real-time convolutional neural network (CNN) algorithm You Only Look Once version 3 (YOLOv3). Neurosonography images (n = 43,078) from normal fetuses (n = 13,400) and fetuses with central nervous system malformations (n = 2,448) at 18 to 40 weeks of gestation were obtained from the databases of two tertiary hospitals in China and randomly assigned to training, fine-tuning, and internal validation datasets (ratio 8:1:1) for the development and internal evaluation of PAICS. An additional image dataset (n = 812) was used to externally validate the performance of PAICS and to compare it with that of sonographers with different levels of expertise. The macro-average and micro-average AUCs in internal validation were 0.933 (0.798 to 1.000) and 0.977 (0.970 to 0.985), respectively, and the corresponding values in external validation were 0.902 (0.816 to 0.989) and 0.898 (0.885 to 0.911), all comparable to the results of expert sonographers (0.900 [0.778–0.990], P = 0.891; 0.900 [0.893–0.907], P = 0.788). In internal validation, the macro- and micro-average sensitivities of PAICS were 0.876 (0.596 to 0.999) and 0.959 (0.941 to 0.973), and the macro- and micro-average specificities were 0.990 (0.950 to 1.000) and 0.995 (0.993 to 0.997). In external validation, the macro- and micro-average sensitivities were 0.826 (0.624 to 1.000) and 0.817 (0.788 to 0.843), and the macro- and micro-average specificities were 0.980 (0.926 to 1.000) and 0.980 (0.976 to 0.983), respectively.15
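The development pipeline is described only at the level of the 8:1:1 ratio. As a minimal sketch (assuming a simple image-level random split, which the text does not specify), the partition could be produced in R as follows:

```r
# Minimal sketch: random 8:1:1 split of the development images into
# training, fine-tuning, and internal validation sets. An image-level
# split is assumed here; the actual PAICS pipeline may have split by case.
set.seed(42)
n    <- 43078                      # neurosonography images in the development set
idx  <- sample(n)                  # random permutation of image indices
cuts <- round(n * c(0.8, 0.9))     # boundaries for the 8:1:1 ratio
train_idx <- idx[1:cuts[1]]
tune_idx  <- idx[(cuts[1] + 1):cuts[2]]
valid_idx <- idx[(cuts[2] + 1):n]
```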
Questionnaire on the subjective evaluation of PAICS effectiveness
At the end of the test, the sonographers completed a questionnaire on their subjective assessment of the effectiveness of PAICS, beginning with whether PAICS was helpful in identifying the 10 patterns in the neurosonography data. If the answer was yes, the sonographer addressed the following three items: (1) the subjective value of the PAICS assistance, rated on a scale of 10 to 100; (2) whether the benefit of PAICS assistance came from (a) providing a diagnosis or (b) segmenting the location of the lesion; and (3) which assistance mode was preferred, (a) simultaneous mode or (b) second-reading mode. If the sonographer judged that the AI was not helpful, he or she answered each of the following three statements with one of three choices (0: no, 1: yes, 2: unknown): (1) the AI diagnosis was wrong; (2) the AI confused the diagnosis; and (3) the sonographer was confident in his or her own diagnosis and did not need further assistance.
Outcomes
The primary outcome was the average accuracy (average ACC), which scored the correct identification of all 10 patterns without prior knowledge of the diagnosis.
Secondary outcomes included the area under the receiver operating characteristic (ROC) curve (AUC), diagnostic accuracy (ACC), sensitivity (SEN), and specificity (SPE) for multiclass classification,9 as well as the reading time.
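For concreteness, the sketch below (ours, not the study's code) shows how macro- and micro-averaged SEN, SPE, and ACC can be derived from one-vs-rest confusion matrices over the 10 patterns; `truth` and `pred` are assumed to be factors of true and predicted pattern labels:

```r
# Macro/micro multiclass metrics from one-vs-rest confusion matrices.
multiclass_metrics <- function(truth, pred) {
  classes <- levels(truth)
  per_class <- t(sapply(classes, function(k) {
    c(tp = sum(truth == k & pred == k),
      fn = sum(truth == k & pred != k),
      fp = sum(truth != k & pred == k),
      tn = sum(truth != k & pred != k))
  }))
  # macro-averaging: compute per-pattern metrics, then take their mean
  sen <- per_class[, "tp"] / (per_class[, "tp"] + per_class[, "fn"])
  spe <- per_class[, "tn"] / (per_class[, "tn"] + per_class[, "fp"])
  acc <- (per_class[, "tp"] + per_class[, "tn"]) / rowSums(per_class)
  # micro-averaging: sum the per-pattern confusion matrices, then compute
  pooled <- colSums(per_class)
  tp <- pooled[["tp"]]; fn <- pooled[["fn"]]
  fp <- pooled[["fp"]]; tn <- pooled[["tn"]]
  list(macro = c(SEN = mean(sen), SPE = mean(spe), ACC = mean(acc)),
       micro = c(SEN = tp / (tp + fn), SPE = tn / (tn + fp),
                 ACC = (tp + tn) / (tp + fn + fp + tn)))
}
```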
Sample size estimation
Our study had three hypotheses: (1) the mean ACC in the simultaneous mode and (2) the mean ACC in the second-reading mode were each at least 3% higher (effect size = 3%) than the mean ACC in the unassisted mode; and (3) the two DL assistance modes differed in an unspecified direction, with one mode better than the other, as defined by a 3% difference in average ACC.
Under these conditions, each hypothesis was tested with a two-sided α of 0.0167 (α = 0.05 divided by 3 to adjust for multiple comparisons by the Bonferroni method) and a power of 90%. With the mean ACC of the primary outcome set to 80% in unassisted mode and increased by 3% in the two DL-assisted modes, the required sample size was 3,583 images/videos per group, for a total of 10,749. Increasing the power to 99% would require 5,908 images/videos per group, or 17,724 in total. In this study, the total sample size was much larger than estimated, ensuring sufficient power for subgroup analyses. Subgroup analyses were performed according to the level of expertise (experts, competent sonographers, trainees) and the type of image pattern (10 patterns).
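The exact formula behind these figures is not stated. As an illustration only, a standard two-proportion calculation with the same parameters can be run with R's built-in power.prop.test; the n it returns depends on the underlying method and need not reproduce the reported 3,583:

```r
# Illustrative power calculation, not the study's actual procedure:
# unaided mean ACC 80% vs. AI-assisted 83%, Bonferroni-adjusted alpha.
power.prop.test(p1 = 0.80, p2 = 0.83,
                sig.level = 0.05 / 3,   # two-sided alpha = 0.0167
                power = 0.90)           # set power = 0.99 for the 99% scenario
```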
Statistical analysis
Multiclass classification performance indicators, including diagnostic ACC, SEN, SPE, and AUC, with their 95% confidence intervals (CIs), were estimated by micro- and macro-analysis.9,29 Micro and macro ACC were calculated from the confusion matrix obtained by summing the confusion matrices of the individual patterns, and these ACCs were used to evaluate the diagnostic accuracy for specific patterns among the 10 patterns. ROC curves were plotted as sensitivity (true positive rate) versus 1 − specificity (false positive rate). Fleiss' kappa (FK) values were calculated to assess diagnostic agreement among the sonographers. Continuous variables are presented as mean ± standard deviation (SD) or median (interquartile range, IQR), as appropriate, and categorical variables are presented as numbers and percentages. Comparisons among three independent groups were made using ANOVA or Kruskal–Wallis tests for continuous variables and chi-square tests for categorical variables. Comparisons between two independent groups were made using the t test or Mann–Whitney U test for continuous variables and the chi-square test for categorical variables. Micro and macro AUCs between two groups were compared by the DeLong test and the paired t test, respectively. Micro and macro SEN, SPE, and ACC between two groups were compared by the chi-square test of proportions and the paired t test, respectively. McNemar's test was used to compare the preferred modes of all sonographers. Multiple comparisons were corrected by the Bonferroni method. All analyses were performed with R statistical software (version 4.0.2; R Core Team, 2020),30 and P values less than 0.0167 were considered significant for all analyses.
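As a minimal sketch of two of the tests named above, using the standard pROC and irr packages (object names are illustrative, not from the study's code):

```r
library(pROC)   # DeLong test for comparing correlated ROC curves
library(irr)    # Fleiss' kappa for multi-rater agreement

# DeLong comparison of micro-averaged AUCs for two reading modes on the
# same cases: y_pooled is the pooled one-vs-rest binary truth, and
# score_unaided / score_assisted are the corresponding scores.
roc_unaided  <- roc(y_pooled, score_unaided,  quiet = TRUE)
roc_assisted <- roc(y_pooled, score_assisted, quiet = TRUE)
roc.test(roc_unaided, roc_assisted, method = "delong", paired = TRUE)

# Fleiss' kappa: `ratings` is a cases-by-readers matrix of the 10 pattern labels.
kappam.fleiss(ratings)
```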