Towards equitable AI in oncology


In this Perspective, we consider the challenges and potential solutions for enabling the development and implementation of equitable AI models in oncology. We explore common sources of bias, including the choice of model approach, the current methods of cancer dataset curation, the need for rigorous validation and the contextual biases that might occur with AI models developed exclusively in HICs. Finally, we discuss successful examples that are helping to address the challenges of inequity in AI, including global data consortia as well as newer techniques such as synthetic data, and suggest specific actions that could be taken by key stakeholders such as AI developers, oncologists and regulatory agencies. Opportunities to optimize equity in AI model development are presented in Fig. 1.

Deep learning has received considerable attention in recent years owing to its ability to extract high-level features from data that are difficult or potentially impossible to detect using other methods; nonetheless, AI-based tools with potential utility in health care encompass a spectrum of approaches, each offering distinct strengths and applications. Deep learning, characterized by its dynamic and evolving nature, operates through deep neural networks comprising layers that progressively transform input data into measurable features via trainable non-linear operations. As the data pass through successive layers, the learned representations become increasingly abstract, and the final layer translates these deep features into the desired outputs, such as detecting the presence of cancer or predicting therapeutic outcomes. However, this reliance on complex operations for pattern recognition and representation learning also necessitates a large volume of data for effective learning.

This data-driven approach often results in a lack of transparency, commonly referred to as the ‘black box’ issue, in which the specific predictive features might remain unclear and not readily understandable. The challenge with such an approach is that AI models can inadvertently learn patterns inherent to specific sets of images, rather than biologically meaningful features, potentially leading to biased parameters that lack human interpretability. Data published in 2022 (ref. ) illustrate the ability of deep learning-based models to recognize race from chest radiographs, a feature otherwise imperceptible to the human eye. This capability poses a risk of amplifying existing biases present in training datasets, potentially exacerbating health disparities. When AI models incorrectly infer a patient’s race from medical imaging, this could lead to misguided assumptions about disease prevalence, severity and treatment efficacy specific to that group. For example, if an AI model incorrectly categorizes a patient as being from an ethnic group with a higher incidence of a specific cancer subtype, this could result in incorrect treatment decision-making and lead to other pertinent diagnoses or treatment considerations being neglected.

The black-box nature of many AI models poses a considerable challenge to the development of equitable AI because it becomes difficult to determine whether an AI-generated prediction reflects disease pathology or the effects of other confounders. This ambiguity hinders the effective identification and mitigation of biases, thus limiting the ability to ensure equity in AI applications. To tackle these challenges, alternative methods, such as conventional data science approaches designed to directly identify correlations, or techniques for visualizing the features learned by neural networks, offer potential insights into decision-making processes.
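To make the preceding description concrete, the minimal sketch below (a hypothetical illustration written in PyTorch, not a model from the literature) shows how stacked trainable non-linear layers transform an input image into progressively more abstract features before a final layer produces the prediction; the intermediate ‘deep features’ carry no guaranteed clinical meaning, which is the root of the black-box problem discussed above.

```python
# Hypothetical sketch: a small convolutional network illustrating how stacked
# trainable non-linear layers turn an image into progressively more abstract
# features before a final prediction layer.
import torch
import torch.nn as nn

class TinyImagingClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        # Early layers extract low-level patterns (edges, textures).
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # The final layer maps abstract 'deep features' to the desired output,
        # for example presence or absence of a tumour; the intermediate features
        # themselves have no guaranteed human-interpretable meaning.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = TinyImagingClassifier()
dummy_scan = torch.randn(4, 1, 64, 64)   # batch of 4 synthetic 64x64 images
print(model(dummy_scan).shape)           # torch.Size([4, 2]) class logits
```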

Feature engineering is an approach in which predetermined imaging features are selected to improve the performance of AI algorithms that use medical images for more accurate diagnosis and prediction of treatment responses. In contrast to deep learning, this approach offers a deliberate attempt to incorporate disease-specific knowledge to predict outcomes, thereby empowering domain experts. For example, quantitative vessel tortuosity (QVT), a feature that quantifies the twistedness, or tortuosity, of the tumour vasculature, capitalizes on the insight that increased angiogenesis, often indicative of more-aggressive tumours, results in a more tortuous vasculature. In this way, feature engineering enables domain experts to ensure better interpretability and control over the AI system as they can explicitly design features that align with the problem at hand. The implications of feature engineering extend to dataset size, a crucial consideration in LMIC settings, in which opportunities for the curation of large datasets might be limited by costs and logistical constraints. Pre-engineered features, such as QVT, might reduce the need for large datasets, unlike deep learning, which requires large training sets for effective unsupervised feature learning. This potentially provides a substantial advantage, emphasizing the feasibility of feature engineering in addressing the unique challenges of developing AI systems for use in LMICs (Box 1).
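As an illustration of the feature-engineering paradigm, the sketch below computes a simple hand-crafted tortuosity measure (arc length divided by end-to-end distance of a vessel centreline) and feeds it to a small classifier. The formula, data and outcome labels are synthetic assumptions for illustration only and do not reproduce the published QVT definition.

```python
# Hypothetical sketch of feature engineering: a hand-crafted 'tortuosity'
# measure used as input to a small, data-efficient classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tortuosity(centreline: np.ndarray) -> float:
    """Ratio of path length to end-to-end distance for an (N, 3) centreline."""
    arc_length = np.sum(np.linalg.norm(np.diff(centreline, axis=0), axis=1))
    chord = np.linalg.norm(centreline[-1] - centreline[0])
    return float(arc_length / max(chord, 1e-8))

# Synthetic example: one engineered feature per patient.
rng = np.random.default_rng(0)
vessels = [rng.normal(size=(50, 3)).cumsum(axis=0) for _ in range(40)]
X = np.array([[tortuosity(v)] for v in vessels])
y = (X[:, 0] > np.median(X[:, 0])).astype(int)   # toy outcome label

# A simple model on a single interpretable feature can be trained on far
# smaller cohorts than a deep network learning features from raw images.
clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.score(X, y))
```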

Imaging-based AI algorithms, which use medical images such as those obtained from radiological imaging and digitized pathology slides as their source of primary data, are influenced by the nature of the datasets used to develop them. Whether sourced from clinical trials or institutional records, the characteristics of the underlying datasets have a crucial role in determining the generalizability of the algorithm. As has been well documented, biases in AI systems stem to a large extent from the underlying data used to train the models. Most AI tools are currently developed using historical institutional datasets and are therefore susceptible to inheriting the biases arising from the disproportionate representation of different populations in many cancer datasets. Most of these data are homogeneous, often being selected from the same institution or the same group of local health-care facilities. Some instances of models being trained using clinical trial datasets have been reported, although this approach remains less common. Clinical trial datasets are curated from images acquired at multiple unrelated institutions, often in different countries, under controlled conditions and according to specific study protocols, ensuring the collection of standardized and well-annotated data. These datasets are also carefully curated and monitored, leading to higher standards of data quality and reliability. Clinical trial datasets therefore offer a higher standard of training data for AI algorithms; nonetheless, the historical under-representation of populations such as Black, Hispanic and Asian patients in clinical trials in the USA means that use of such datasets is likely to perpetuate existing disparities, and thus further exacerbate inequities. For example, imaging-based AI models trained primarily on data obtained from a population of mostly white patients might not perform as effectively or accurately in other populations, such as those of other ethnicities, and patients falling into more than one under-represented group, such as Hispanic female patients, might have even worse outcomes. Data published in 2021 indicate that augmenting the diversity and size of demographic groups within the training dataset, rather than relying solely on oversampling techniques, offers a more effective method of reducing the negative consequences of demographic disparities.
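One practical way to expose the disparities described above is to stratify evaluation metrics by demographic group rather than reporting a single aggregate number. The sketch below, using assumed column names and synthetic data, illustrates this kind of subgroup audit.

```python
# Illustrative sketch (assumed column names, synthetic data): stratifying model
# evaluation by demographic group to surface performance gaps that an
# aggregate metric would hide.
import pandas as pd
from sklearn.metrics import roc_auc_score

# df would normally come from a held-out test set.
df = pd.DataFrame({
    "y_true":  [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0],
    "y_score": [0.2, 0.9, 0.7, 0.1, 0.4, 0.3, 0.8, 0.6, 0.55, 0.65, 0.35, 0.25],
    "group":   ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
})

overall_auc = roc_auc_score(df["y_true"], df["y_score"])
per_group = df.groupby("group")[["y_true", "y_score"]].apply(
    lambda g: roc_auc_score(g["y_true"], g["y_score"])
)
print(f"overall AUC: {overall_auc:.2f}")
print(per_group)  # a large spread across groups flags a potential equity problem
```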

Furthermore, in the context of addressing the challenges of AI development and validation, considering the specific obstacles faced by LMICs in curating large datasets for AI validation is an essential step. Limitations to the development of AI tools in most LMICs include limited health-care system funding, and lack of adequately trained personnel and infrastructure, all of which impede the collection, storage and management of large amounts of data. Moreover, a substantial proportion of the data collected often remains in paper form, which, in addition to a lack of robust follow-up, adds further complexity to the process. The absence of high-speed internet, cloud computing and machine learning software in many LMICs often restricts the ability to establish the necessary infrastructure for local AI validation in these countries. Adding to these challenges, the number of clinical trials enrolling patients with cancer in LMICs is substantially lower than in the USA or Europe. Even when clinical trials do enrol patients in LMICs, the centres involved often lack the essential infrastructure or resources needed for tasks such as additional data collection or slide scanning. Consequently, often only researchers based in countries or regions with sufficient resources available are able to generate or collect adequate data and implement novel AI technologies, perpetuating a digital divide that exacerbates existing health-care disparities.

The current approach to the evaluation of FDA-approved AI tools for oncology underscores areas for improvement, with only a minority undergoing thorough assessments for clinical effectiveness. A recent study revealed challenges with the 510(k) pathway for artificial intelligence or machine learning (AI/ML)-based medical devices, as many such devices cite non-AI/ML-based devices as predicates, posing difficulties in establishing true equivalence. Concerns also arise over frequent changes in the AI/ML task along the predicate network, raising doubts about the suitability of this pathway for ensuring safety and effectiveness, especially for specific indications. Current evaluation criteria might be insufficient, particularly when comprehensive data are not required, and these criteria also often overlook unique AI-related challenges, such as the regulation of evolving or continuously learning algorithms. This disparity contrasts with FDA approval pathways for novel drugs, which demand higher evidence standards. Hence, tailored evaluation criteria and approval pathways are necessary to ensure that AI devices meet essential minimum standards of safety, efficacy and clinical utility. The FDA prioritizes clinical evaluation and model monitoring; nonetheless, a clear need exists to expand assessments across various cancer datasets and subpopulations, while recognizing an important commitment to safety and efficacy alongside support for innovation. Paige Prostate, the first AI-based computational pathology tool to receive FDA approval via the de novo pathway, underwent rigorous validation studies, including clinical and analytical assessments, labelling requirements, and design verification, which were implemented to mitigate potentially inaccurate results. Regulatory agencies need to commit to, and hold vendors to, a comprehensive and transparent validation process that includes details on the ethnicity of the patients providing source images, thereby demonstrating a commitment to equitable access and the applicability of the device across diverse patient populations. Achieving this balance between rigorous evaluation and progress through collaboration among stakeholders is essential, and will be fundamental to ensuring that AI tools for oncology meet stringent standards of performance and safety while also advancing accessibility in health-care delivery.

An analysis published in 2022 (ref. ) indicates that only one of almost 120 existing AI tools was trained and/or validated using datasets that included representation from geographically and ethnically diverse patient populations across the USA. Additionally, most AI tools are tested using limited benchmark datasets that fail to take into account technical and biological variability in the underlying parameters. For example, the average size of the validation set for AI tools receiving FDA approval between January 2010 and March 2020 was only 300 patients, which is insufficient to gauge device safety and efficacy. Moreover, almost all of the 130 AI devices were assessed using only retrospective studies at the time of submission, with none of the 54 high-risk (class II) devices receiving prospective evaluation. Furthermore, most of the devices did not undergo publicly reported assessment at multiple centres. Of the 41 devices for which the number of evaluation sites was reported, only a few were evaluated at multiple centres, and a substantial proportion of approved devices might have been evaluated at a small number of centres with limited geographical diversity. Addressing these shortcomings in the validation process will be crucial to achieving equitable AI. Comprehensive evaluation and validation methods should encompass larger and more diverse patient populations and include evaluations in prospective studies involving multiple centres. Applying these changes is expected to lead to the development of AI-based medical devices that are more reliable, more accurate and more effective in individuals from a wide range of backgrounds.
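A simple step towards the more rigorous validation advocated here is to report the uncertainty around performance estimates obtained from small, single-centre validation sets. The sketch below uses a percentile bootstrap, with synthetic data standing in for a 300-patient validation cohort, to attach a confidence interval to an AUC estimate; the cohort size and score model are assumptions for illustration.

```python
# Hedged sketch: bootstrap confidence interval for a validation metric,
# quantifying the uncertainty of a small, single-centre validation set.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the AUC."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic 'validation set' of 300 patients from a single site.
y_true = rng.integers(0, 2, 300)
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, 300), 0, 1)
print("AUC:", round(roc_auc_score(y_true, y_score), 2),
      "95% CI:", tuple(round(v, 2) for v in bootstrap_auc_ci(y_true, y_score)))
```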

One of the challenges associated with developing new and emerging AI models is that the resources needed to develop and validate these tools at scale might simply not be available in LMICs. Training LLMs, such as Gemini or GPT-4, requires vast resources, including substantial electricity consumption. For example, developing the Llama series of LLMs reportedly required 2.6 million kilowatt-hours of electricity, emitting nearly 1,000 tons of CO2. This intensive energy usage poses major challenges in terms of the infrastructure and costs of developing these models, especially in settings in which resources are limited. These limitations exacerbate existing disparities in access to technology as only well-funded entities are likely to be able to afford the resources required for the development and deployment of advanced AI models for oncology, contributing to a widening digital divide. Data from several studies published over the past few years also emphasize the effects of variations in technical parameters, such as imaging system performance and data preparation, on the performance of AI algorithms developed using medical images. This observation is particularly relevant in LMICs, in which scanners might be less advanced and/or more variable in performance, which has been shown to affect the performance of any AI tools developed using such images. For example, an AI algorithm developed using a 3 T MRI scanner might not be applicable in many African countries and potentially other LMICs, in which the scanners are typically ≤1.5 T. Minimizing these effects requires the consideration of inter-scanner differences and the development of approaches that account for them, such as adjustments according to scanner characteristics. However, implementing standard imaging protocols is often not feasible in routine clinical practice, particularly in LMICs, in which the costs of acquiring and maintaining advanced imaging equipment present a major financial challenge. To address this challenge, each device should be calibrated and tested at each new site of installation or usage, and multiple quality control checkpoints need to be established at different stages of evaluation, including continual monitoring to accommodate changes in technical parameters and algorithm performance.
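As a minimal illustration of accounting for inter-scanner differences, the sketch below applies per-scanner standardization to a single radiomic feature. This is a simplified stand-in for more complete harmonization methods (for example, ComBat), and the scanner labels, feature values and offsets are synthetic assumptions.

```python
# Minimal sketch of inter-scanner harmonization: per-scanner z-scoring of a
# radiomic feature so that downstream models see comparable distributions
# regardless of acquisition hardware. All values below are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "scanner": ["3T_vendorA"] * 50 + ["1.5T_vendorB"] * 50,
    "feature": np.concatenate([
        rng.normal(10.0, 2.0, 50),   # 3 T scanner: systematically higher values
        rng.normal(7.0, 1.5, 50),    # 1.5 T scanner common in many LMIC settings
    ]),
})

# Remove scanner-specific location and scale.
df["feature_harmonized"] = df.groupby("scanner")["feature"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df.groupby("scanner")["feature_harmonized"].agg(["mean", "std"]).round(2))
```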

Understanding the variability of imaging-based biomarkers within specific populations and harmonizing data analysis methods will be vital both for driving the development of clinical decision support systems and for mitigating bias. Lastly, the relative scarcity of curated data from modalities that are more readily accessible in resource-constrained settings, such as chest radiography and portable ultrasonography, compared with data from MRI or CT, underscores the need for dedicated efforts to develop AI tools compatible with these modalities for effective clinical deployment. To foster more equitable AI, a pressing need exists for an increased focus on developing algorithms tailored to more readily available data, including data acquired in LMICs, to maximize the benefit from these tools.
