Enrichment of lung cancer computed tomography collections with AI-derived annotations – Scientific Data

Public imaging datasets are critical for the development and evaluation of automated tools in cancer imaging. Unfortunately, many do not include annotations or image-derived features, complicating downstream analysis. Artificial intelligence-based annotation tools have been shown to achieve acceptable performance and can be used to automatically annotate large datasets. As part of the effort to enrich public data available within NCI Imaging Data Commons (IDC), here we introduce AI-generated annotations for two collections containing computed tomography images of the chest: NSCLC-Radiomics and a subset of the National Lung Screening Trial. Using publicly available AI algorithms, we derived volumetric annotations of thoracic organs-at-risk, their corresponding radiomics features, and slice-level annotations of anatomical landmarks and regions. The resulting annotations are publicly available within IDC, where the DICOM format is used to harmonize the data and achieve FAIR (Findable, Accessible, Interoperable, Reusable) data principles. The annotations are accompanied by cloud-enabled notebooks demonstrating their use. This study reinforces the need for large, publicly accessible curated datasets and demonstrates how AI can aid in cancer imaging.

National Cancer Institute (NCI) Imaging Data Commons (IDC) contains publicly available cancer imaging, image-derived, and image-related data, co-located with tools for exploration, visualization, and analysis. Public imaging data contributed by various initiatives, including those from The Cancer Imaging Archive (TCIA), is ingested into this repository, allowing users to query metadata corresponding to images, annotations, and clinical attributes of the publicly available collections to define relevant cohorts, or subsets, of data. The IDC platform is based on the Google Cloud Platform (GCP), which enables the co-location of data with cloud-based tools for its exploration and analysis. Using tools from GCP, users can form a subset of data (a cohort) that is specific to the task at hand. Users also have the option of creating and using virtual machines to run computationally intensive jobs. Lastly, all analysis steps can be documented using Google Colaboratory Python notebooks and shared with others.
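The metadata query step described above can be sketched as a BigQuery SQL query against IDC's public metadata tables. The table name `bigquery-public-data.idc_current.dicom_all` and the column names below follow IDC's published schema, but both can change between IDC data releases, so treat this as an illustrative sketch rather than a definitive recipe.

```python
# Illustrative sketch: define an IDC cohort (here, CT series from one
# collection) as a BigQuery SQL query. Table and column names follow IDC's
# public metadata tables but may differ between data releases -- verify
# against the current schema before relying on them.

def build_cohort_query(collection_id: str, modality: str = "CT") -> str:
    """Return a SQL query selecting series-level metadata for one collection."""
    return f"""
        SELECT DISTINCT collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
        FROM `bigquery-public-data.idc_current.dicom_all`
        WHERE collection_id = '{collection_id}' AND Modality = '{modality}'
    """

query = build_cohort_query("nsclc_radiomics")
print(query)

# To actually run the query (requires GCP credentials and the
# google-cloud-bigquery package, e.g. inside a Colaboratory notebook):
#   from google.cloud import bigquery
#   df = bigquery.Client(project="my-project").query(query).to_dataframe()
```

Running the query in a Colaboratory notebook returns a table of series identifiers that can then be used to retrieve the corresponding DICOM files for analysis.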

Publicly available imaging datasets that include annotations of organs, lesions, and other regions of interest can aid in the development of imaging biomarkers, but unfortunately, many datasets offer few annotations, if any. Using IDC, we chose to generate AI annotations for two collections: the Non-small Cell Lung Cancer (NSCLC) Radiomics dataset and a subset of the National Lung Screening Trial (NLST) dataset. The NSCLC-Radiomics collection contains labeled tumors but only partially labeled organs of interest (some combination of lung, esophagus, heart, and spinal cord). The NLST dataset, though widely used by many researchers, does not contain any image annotations.

In order to annotate the computed tomography (CT) images, we make use of publicly available pre-trained deep learning models for automatically generating annotations. These annotations include volumetric segmentations of organs, labels for the region of the body scanned (e.g., chest and abdomen), and landmarks that capture the inferior-to-superior extent of a selection of organs and bones. The first pre-trained model used is the nnU-Net framework for volumetric segmentation of thoracic organs. Since the collections we are analyzing concern lung cancer, we chose this model as it produces segmentations of the thoracic organs at risk (heart, aorta, trachea, and esophagus). These regions are routinely used during treatment planning, and could all be affected by the presence of lung cancer (and therefore used to develop new biomarkers or validate published ones). Multiple configurations of the nnU-Net framework (2D vs 3D, low vs high resolution, with and without test-time augmentation) were first applied to the NSCLC-Radiomics collection to evaluate their performance, and the best-performing configuration was chosen for NLST evaluation. To enrich information about bone and organ landmarks as well as the region of the body, the publicly available body part regression algorithm was employed. The Body Part Regression model is an unsupervised approach trained on a diverse set of CT data and produces a continuous score for each transverse slice in a 3D volume. These slice scores correspond to specific landmarks obtained from training data, and can then be used to infer the body part region.
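The last step above, going from continuous slice scores to a body part label, can be sketched as a simple lookup over score intervals. The score thresholds and region names below are hypothetical placeholders for illustration only; they are not the landmark scores of the published Body Part Regression model.

```python
# Illustrative sketch of mapping continuous slice scores from a body part
# regression model to coarse anatomical regions. The (region, lower-bound)
# pairs below are hypothetical placeholder values, NOT the landmark scores
# of the published model.

from typing import List

# Hypothetical score bounds, ordered inferior -> superior.
REGION_BOUNDS = [
    ("pelvis", 0.0),
    ("abdomen", 30.0),
    ("chest", 60.0),
    ("head-neck", 90.0),
]

def region_for_score(score: float) -> str:
    """Return the coarse body region whose score interval contains `score`."""
    region = REGION_BOUNDS[0][0]
    for name, lower in REGION_BOUNDS:
        if score >= lower:
            region = name
    return region

def scanned_regions(slice_scores: List[float]) -> List[str]:
    """Infer the regions covered by a volume, ordered inferior -> superior."""
    seen: List[str] = []
    for score in sorted(slice_scores):
        region = region_for_score(score)
        if region not in seen:
            seen.append(region)
    return seen

print(scanned_regions([25.0, 45.0, 75.0]))  # ['pelvis', 'abdomen', 'chest']
```

In practice, per-slice scores like these are what get encoded as slice-level annotations, and the inferred region set summarizes the scanned extent of each volume.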

To allow interoperability with existing tools, harmonize the representation of the annotations with that of the images being annotated, and implement the Findable, Accessible, Interoperable, and Reusable (FAIR) principles of data curation, we leverage the Digital Imaging and Communications in Medicine (DICOM) standard. Our dataset is encoded using standard DICOM objects containing volumetric segmentations, slice-level annotations, and segmentation-derived radiomics features. Furthermore, it is accompanied by the complete cloud-ready analysis workflow in the form of Google Colaboratory notebooks that can be used to recreate the dataset, and by examples demonstrating how to query and visualize those standard objects, and how to convert them into alternative representations.
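Among the simplest segmentation-derived features encoded in such objects is segment volume. A minimal pure-Python sketch of the underlying computation follows; the toy mask and voxel spacing are illustrative values, and real pipelines would extract many more features from the DICOM segmentations using a dedicated radiomics library.

```python
# Minimal sketch: compute a volume feature (in mm^3) from a binary
# segmentation mask -- the kind of segmentation-derived value that can be
# stored alongside the segmentation in a DICOM object. The mask and voxel
# spacing below are illustrative toy values.

def segment_volume_mm3(mask, spacing_mm):
    """mask: 3D nested list of 0/1 voxels; spacing_mm: (z, y, x) in mm."""
    voxel_volume = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(v for slc in mask for row in slc for v in row)
    return n_voxels * voxel_volume

# Toy 2x2x2 mask with 3 foreground voxels and 1x1x1 mm voxels -> 3.0 mm^3.
toy_mask = [[[1, 0], [1, 0]],
            [[0, 0], [0, 1]]]
print(segment_volume_mm3(toy_mask, (1.0, 1.0, 1.0)))  # 3.0
```

The same voxel-count-times-spacing logic scales directly to real volumes once the mask and spacing are read from the DICOM segmentation and image headers.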
