ACT-Net: Anchor-Context Action Detection in Surgery Videos

Luoying Hao; Yan Hu; Wenjun Lin; Qun Wang; Heng Li; Huazhu Fu; Jinming Duan; Jiang Liu

doi:10.1007/978-3-031-43996-4_19

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Recognizing and localizing fine-grained surgical actions is an essential component of a context-aware decision support system. However, most existing detection algorithms fail to classify actions accurately even when they localize them, because they do not consider the regularity of the surgical procedure across the whole video. This limitation hinders their application. Moreover, deploying predictions in clinical applications requires conveying model confidence to earn trust, which is unexplored in surgical action prediction. In this paper, to accurately detect the fine-grained actions that happen at every moment, we propose an anchor-context action detection network (ACT-Net), comprising an anchor-context detection (ACD) module and a class conditional diffusion (CCD) module, to answer the following questions: 1) where the actions happen; 2) what the actions are; 3) how confident the predictions are. Specifically, the proposed ACD module spatially and temporally highlights the regions interacting with the extracted anchor in the surgery video and outputs the action location and its class distribution based on anchor-context interactions. Considering the full distribution of action classes in videos, the CCD module adopts a denoising diffusion-based generative model conditioned on our ACD estimator to reconstruct the action predictions more accurately. Moreover, we use the stochastic nature of the diffusion model outputs to assess model confidence for each prediction. Our method achieves state-of-the-art performance, with an improvement of 4.0% mAP over the baseline on the surgical video dataset.
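The abstract's last idea, reading confidence off the stochastic outputs of a generative model, can be illustrated with a minimal sketch. This is not the authors' CCD module: `sample_prediction` is a hypothetical stand-in that simply perturbs fixed logits with Gaussian noise, and the summary statistics (majority-vote agreement and the variance of the winning class probability) are assumed choices, not ones stated in the source.

```python
import math
import random
from collections import Counter
from statistics import mean, pvariance


def softmax(logits):
    """Convert logits to a probability vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_prediction(prior_logits, noise_scale, rng):
    """Hypothetical stand-in for one stochastic sampling run.

    A real conditional diffusion model would run its reverse process here;
    this sketch only adds Gaussian noise to the conditioning logits.
    """
    noisy = [x + rng.gauss(0.0, noise_scale) for x in prior_logits]
    return softmax(noisy)


def summarize_samples(samples):
    """Aggregate repeated stochastic predictions into a label and confidence."""
    num_classes = len(samples[0])
    mean_probs = [mean(s[c] for s in samples) for c in range(num_classes)]
    # Each sample votes for its most probable class.
    votes = Counter(max(range(num_classes), key=s.__getitem__) for s in samples)
    label, count = votes.most_common(1)[0]
    return {
        "label": label,                      # majority-vote class index
        "agreement": count / len(samples),   # fraction of samples that agree
        "mean_probs": mean_probs,            # averaged class distribution
        "spread": pvariance([s[label] for s in samples]),  # variance for the winning class
    }


if __name__ == "__main__":
    rng = random.Random(0)
    prior = [2.0, 0.5, -1.0]  # assumed class logits from an upstream detector
    draws = [sample_prediction(prior, 1.0, rng) for _ in range(50)]
    print(summarize_samples(draws))
```

High agreement with low spread suggests a stable prediction; low agreement or high spread flags a prediction that may deserve review.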

Original language	English
Title of host publication	Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings
Editors	Hayit Greenspan, Anant Madabhushi, Parvin Mousavi, Septimiu Salcudean, James Duncan, Tanveer Syeda-Mahmood, Russell Taylor
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	196-206
Number of pages	11
ISBN (Print)	9783031439957
DOIs	https://doi.org/10.1007/978-3-031-43996-4_19
Publication status	Published - 2023
Externally published	Yes
Event	26th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2023 - Vancouver, Canada; Duration: 8 Oct 2023 → 12 Oct 2023

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	14228 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	26th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2023
Country/Territory	Canada
City	Vancouver
Period	8/10/23 → 12/10/23

Keywords

Action detection
Anchor-context
Conditional diffusion
Surgical video

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Cite this

Hao, L., Hu, Y., Lin, W., Wang, Q., Li, H., Fu, H., Duan, J., & Liu, J. (2023). ACT-Net: Anchor-Context Action Detection in Surgery Videos. In H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood, & R. Taylor (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings (pp. 196-206). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14228 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-43996-4_19

Hao, Luoying ; Hu, Yan ; Lin, Wenjun et al. / ACT-Net : Anchor-Context Action Detection in Surgery Videos. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings. editor / Hayit Greenspan ; Anant Madabhushi ; Parvin Mousavi ; Septimiu Salcudean ; James Duncan ; Tanveer Syeda-Mahmood ; Russell Taylor. Springer Science and Business Media Deutschland GmbH, 2023. pp. 196-206 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{a0742d577c5044f981400262f25fb3fa,

title = "ACT-Net: Anchor-Context Action Detection in Surgery Videos",

keywords = "Action detection, Anchor-context, Conditional diffusion, Surgical video",

author = "Luoying Hao and Yan Hu and Wenjun Lin and Qun Wang and Heng Li and Huazhu Fu and Jinming Duan and Jiang Liu",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.; 26th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2023 ; Conference date: 08-10-2023 Through 12-10-2023",

year = "2023",

doi = "10.1007/978-3-031-43996-4_19",

language = "English",

isbn = "9783031439957",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "196--206",

editor = "Hayit Greenspan and Anant Madabhushi and Parvin Mousavi and Septimiu Salcudean and James Duncan and Tanveer Syeda-Mahmood and Russell Taylor",

booktitle = "Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings",

address = "Germany",

}

Hao, L, Hu, Y, Lin, W, Wang, Q, Li, H, Fu, H, Duan, J & Liu, J 2023, ACT-Net: Anchor-Context Action Detection in Surgery Videos. in H Greenspan, A Madabhushi, P Mousavi, S Salcudean, J Duncan, T Syeda-Mahmood & R Taylor (eds), Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14228 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 196-206, 26th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2023, Vancouver, Canada, 8/10/23. https://doi.org/10.1007/978-3-031-43996-4_19

ACT-Net: Anchor-Context Action Detection in Surgery Videos. / Hao, Luoying; Hu, Yan; Lin, Wenjun et al.
Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings. ed. / Hayit Greenspan; Anant Madabhushi; Parvin Mousavi; Septimiu Salcudean; James Duncan; Tanveer Syeda-Mahmood; Russell Taylor. Springer Science and Business Media Deutschland GmbH, 2023. p. 196-206 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14228 LNCS).

TY - GEN

T1 - ACT-Net: Anchor-Context Action Detection in Surgery Videos

T2 - 26th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2023

AU - Hao, Luoying

AU - Hu, Yan

AU - Lin, Wenjun

AU - Wang, Qun

AU - Li, Heng

AU - Fu, Huazhu

AU - Duan, Jinming

AU - Liu, Jiang

N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

PY - 2023

Y1 - 2023

KW - Action detection

KW - Anchor-context

KW - Conditional diffusion

KW - Surgical video

UR - http://www.scopus.com/inward/record.url?scp=85174698943&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-43996-4_19

DO - 10.1007/978-3-031-43996-4_19

M3 - Conference contribution

AN - SCOPUS:85174698943

SN - 9783031439957

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 196

EP - 206

BT - Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings

A2 - Greenspan, Hayit

A2 - Madabhushi, Anant

A2 - Mousavi, Parvin

A2 - Salcudean, Septimiu

A2 - Duncan, James

A2 - Syeda-Mahmood, Tanveer

A2 - Taylor, Russell

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 8 October 2023 through 12 October 2023

ER -

Hao L, Hu Y, Lin W, Wang Q, Li H, Fu H et al. ACT-Net: Anchor-Context Action Detection in Surgery Videos. In Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, Taylor R, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 - 26th International Conference, Proceedings. Springer Science and Business Media Deutschland GmbH. 2023. p. 196-206. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-43996-4_19