In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc learns a generic representation from only 320k unlabeled documents via the Masked Sentence Modeling task. The code is available at
https://github.com/ZZR8066/GraphDoc.
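To make the two ideas summarized above concrete, the following is a minimal sketch of (1) attention restricted to a document graph over sentence nodes and (2) a masked-sentence-modeling objective. All names, the masking ratio, and the layer sizes are illustrative assumptions and are not taken from the released GraphDoc code.

```python
# Hypothetical sketch: graph attention over sentence nodes plus a
# masked-sentence-modeling (MSM) style reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head attention restricted to each node's graph neighborhood."""

    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x, adj):
        # x: (num_nodes, dim) sentence embeddings; adj: (num_nodes, num_nodes) 0/1 mask
        scores = self.q(x) @ self.k(x).T / x.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(x)


def masked_sentence_modeling_loss(encoder, x, adj, mask_ratio=0.15):
    """Mask a fraction of sentence embeddings and regress the originals."""
    target = x.clone()
    masked = torch.rand(x.size(0)) < mask_ratio
    masked[0] = True  # ensure at least one sentence is masked in this toy example
    x = x.clone()
    x[masked] = 0.0   # replace masked sentence embeddings with a zero vector
    pred = encoder(x, adj)
    return F.mse_loss(pred[masked], target[masked])


if __name__ == "__main__":
    num_sentences, dim = 8, 64
    x = torch.randn(num_sentences, dim)             # sentence embeddings of one document
    adj = torch.ones(num_sentences, num_sentences)  # fully connected toy graph
    encoder = GraphAttentionLayer(dim)
    loss = masked_sentence_modeling_loss(encoder, x, adj)
    loss.backward()
    print(f"toy MSM loss: {loss.item():.4f}")
```

The sketch omits the multimodal inputs (visual features and layout boxes) and the Transformer depth of the actual model; it only illustrates how a reconstruction loss over masked sentence nodes can drive self-supervised pre-training on unlabeled documents.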