Abstract:
Objective: This study proposes ViTD-CycleGAN, an unsupervised learning model based on an improved CycleGAN, to synthesize computed tomography (CT) images from cone-beam computed tomography (CBCT) images. Our aim is to enhance the quality and realism of synthetic CT (sCT) images. Methods: The generator of ViTD-CycleGAN incorporates a U-Net framework built on a Vision Transformer (ViT) and depth-wise convolution (DW), in which the Transformer's self-attention mechanism extracts and preserves crucial features and fine detail. In addition, a gradient penalty and a pixel-wise loss function are introduced to improve training stability and image consistency. Results: On head-and-neck and chest datasets, quantitative evaluation metrics (mean absolute error, MAE; peak signal-to-noise ratio, PSNR; and structural similarity index measure, SSIM) indicate that the proposed model outperforms existing unsupervised learning methods. Ablation experiments show that the depth-wise convolution significantly improves model performance. Visual analysis confirms that the sCT images generated by ViTD-CycleGAN exhibit higher image quality and realism. Conclusion: Compared with other unsupervised learning methods, the proposed method improves the quality of CT images synthesized from CBCT and thus has potential value for clinical application.
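The two mechanisms named in the abstract, depth-wise convolution in the generator and a gradient penalty in the adversarial loss, can be illustrated with a minimal PyTorch-style sketch. This is an assumption-laden illustration, not the authors' code: the names (DWBlock, gradient_penalty) are hypothetical, and the penalty is written in the common WGAN-GP form, which the abstract does not itself specify.

```python
import torch
import torch.nn as nn

class DWBlock(nn.Module):
    """Hypothetical depth-wise separable convolution block of the kind the
    abstract attributes to the generator; names are illustrative only."""
    def __init__(self, channels: int):
        super().__init__()
        # Depth-wise convolution: one 3x3 filter per channel (groups=channels).
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        # Point-wise 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.pointwise(self.depthwise(x)))

def gradient_penalty(discriminator: nn.Module,
                     real: torch.Tensor,
                     fake: torch.Tensor) -> torch.Tensor:
    """WGAN-GP-style penalty (an assumed form): penalize the discriminator's
    gradient norm on samples interpolated between real and fake images."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    # Deviation of the per-sample gradient norm from 1, averaged over the batch.
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```

The pixel-wise loss mentioned alongside the penalty is commonly an L1 term between the sCT and the reference image (e.g. nn.L1Loss() in PyTorch); whether the paper uses L1 or another pixel-wise distance is not stated in the abstract.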