Document Image Transformer (DiT) Mask R-CNN fine-tuning


Document Layout Analysis from Data Labeling to Model Retraining

  • keywords: computer vision, object detection, document layout analysis, Vision Transformer (ViT), Document Image Transformer (DiT), fine-tuning, retraining, PyTorch, detectron2
  • highlights:
    • label conversion from labelme bboxes to COCO masks
    • retraining a model pretrained for multi-class object detection into a single-class detector

LEFT inset: COCO mask labels for multi-class detection

RIGHT inset: labelme rectangle mask labels (converted from bounding boxes) for single-class detection

1. Custom data labeling and preprocessing

Note 1: common datasets for document layout

  • DocLayNet (2023) 28 GB, 80,863 pages {financial_reports, scientific_articles, laws_and_regulations, government_tenders, manuals, patents}
  • DocBank (2020) 50 GB, 500K pages from arXiv.org {scientific articles}
  • PubLayNet (2019) 96 GB, over 1 million PDFs from PubMed {scientific articles}
  • RVL-CDIP (2015) 37 GB, 400,000 grayscale images, 16 classes
    • {"letter", "form", "email", "handwritten", "advertisement", "scientific report", "scientific publication", "specification", "file folder", "news article", "budget", "invoice", "presentation", "questionnaire", "resume", "memo"}
  • PRImA Layout Analysis Dataset (2009) 1,240 pages {magazine pages, technical articles}
  • somewhat relevant: HierText (2023) 11,639 images {hierarchical annotations of text in natural scenes and documents}

Note 2: labeling tools

  • labelme https://github.com/wkentaro/labelme
    • very basic functionality, and the app can crash sometimes (remember to enable autosave)
    • outputs a separate label JSON file for each image
    • third-party tools can convert labelme output to COCO format, e.g. https://github.com/fcakyon/labelme2coco (see the sketch after this list)
  • label studio https://labelstud.io/
    • mature, but with a steeper learning curve
    • can natively export labels in COCO format (label data for all images are compiled into one JSON file)
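
For the labelme2coco package linked above, conversion can be as simple as the following; the exact function signature may vary between versions, so treat this as an assumed sketch:

```python
# assumed usage of fcakyon/labelme2coco -- check that package's README for the
# exact signature of your installed version
import labelme2coco

labelme2coco.convert("path/to/labelme/jsons", "path/to/coco/output")
```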

LabelMe label output format

  • the example JSON file below shows three different labels in both rectangle (bbox) and polygon (mask) formats
  • point coordinates for a rectangle are the two diagonal corners [[x1, y1], [x3, y3]]
  • point coordinates for a polygon (mask) are [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
  • use this custom script to convert labelme to COCO (all-in-one): master_convert.py (a sketch of the core conversion step follows the example below)
{"imagePath": "123.jpg",
  "imageData": null,
  "shapes":
      [
      {"shape_type": "rectangle", "points": [[204.9180327868852, 131.14754098360655], [1529.5081967213114, 1944.2622950819673]], "flags": {}, "group_id": null, "label": "section"},
      {"shape_type": "rectangle", "points": [[1585.245901639344, 78.68852459016394], [200.97022415523588, 1.4210854715202004e-14]], "flags": {}, "group_id": null, "label": "Header"},
      {"shape_type": "rectangle", "points": [[204.9180327868852, 1957.377049180328], [537.7049180327868, 2054.0983606557375]], "flags": {}, "group_id": null, "label": "section"},
      {"shape_type": "polygon", "points": [[204.9180327868852, 131.14754098360655], [204.9180327868852, 1944.2622950819673], [1529.5081967213114, 1944.2622950819673], [1529.5081967213114, 131.14754098360655]], "flags": {}, "group_id": null, "label": "section"},
      {"shape_type": "polygon", "points": [[1585.245901639344, 78.68852459016394], [1585.245901639344, 1.4210854715202004e-14], [200.97022415523588, 1.4210854715202004e-14], [200.97022415523588, 78.68852459016394]], "flags": {}, "group_id": null, "label": "Header"},
      {"shape_type": "polygon", "points": [[204.9180327868852, 1957.377049180328], [204.9180327868852, 2054.0983606557375], [537.7049180327868, 2054.0983606557375], [537.7049180327868, 1957.377049180328]], "flags": {}, "group_id": null, "label": "section"}
      ],
  "version": "5.1.1",
  "flags": {},
  "imageHeight": 2200,
  "imageWidth": 1700}

2. Setup environment

  • below are the PyTorch and detectron2 versions I installed and ran successfully on both an AWS EC2 g4dn.2xlarge instance (NVIDIA Turing T4, 16 GB VRAM) and a local workstation (NVIDIA Turing TU104, 8 GB VRAM)

```
torch:  2.0 ; cuda:  cu118
detectron2: 0.6
```

  • Install CUDA and then PyTorch
  • Install detectron2 (building detectron2 from source is recommended)
  • Then pip install -r dit_requirement.txt
  • Clone the unilm repo (DiT is part of unilm): git clone https://github.com/microsoft/unilm.git
    • If you use Python 3.10 or above, you will encounter this error: ImportError: cannot import name 'Iterable' from 'collections'
    • To fix it, modify unilm/dit/object_detection/ditod/table_evaluation/data_structure.py so that line 6 imports Iterable from collections.abc instead, as shown below
    • for more details, please check DiT_tutorial.ipynb
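
The one-line patch for that import error (the error message confirms the offending import) looks like this:

```python
# unilm/dit/object_detection/ditod/table_evaluation/data_structure.py, line 6
# before (fails on Python >= 3.10, where Iterable moved to collections.abc):
# from collections import Iterable
# after:
from collections.abc import Iterable
```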

3. Model retraining

```python
# train specs
cfg.SOLVER.IMS_PER_BATCH = 2   # batch size
cfg.SOLVER.MAX_ITER = 100000   # max_iteration(100000) = images_count(500) * epochs(400) / batch_size(2)
cfg.SOLVER.BASE_LR = 25e-5     # LR (0.00025)
cfg.SOLVER.LR_SCHEDULER_NAME = "WarmupCosineLR"
cfg.SOLVER.WARMUP_ITERS = int(0.2 * cfg.SOLVER.MAX_ITER)
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "value"
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0
cfg.SOLVER.AMP.ENABLED = True  # mixed-precision training

# data augmentation during retraining
from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.data import transforms as T

class MyTrainer_Aug(MyTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        mapper = DatasetMapper(cfg, is_train=True,
                               use_instance_mask=True,  # must enable, otherwise gt_mask error
                               augmentations=[
                                   T.RandomBrightness(0.9, 1.1),
                                   T.RandomFlip(prob=0.5),
                                   # T.Resize((800, 1200)),  # fixed (h, w) resize
                                   T.ResizeShortestEdge(800, 1200),  # shorter edge scaled to 800, longer edge capped at 1200 (ratio locked); a larger cap of 2000 ran out of VRAM
                               ])
        return build_detection_train_loader(cfg, mapper=mapper)
```
  • use trainer = MyTrainer_Aug(cfg) instead of trainer = MyTrainer(cfg); a sketch of the assumed setup follows
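
MyTrainer is not shown in this excerpt; presumably it is a detectron2 DefaultTrainer subclass with an evaluator attached. A minimal sketch under that assumption, plus the training launch:

```python
import os
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator

class MyTrainer(DefaultTrainer):
    """Assumed base trainer: DefaultTrainer plus a COCO-style evaluator."""
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "eval")
        return COCOEvaluator(dataset_name, output_dir=output_folder)

trainer = MyTrainer_Aug(cfg)          # augmented trainer from the block above
trainer.resume_or_load(resume=False)  # start from the pretrained DiT weights
trainer.train()
```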

4. Evaluation (performance metrics)
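
For a detectron2 model, the standard COCO metrics (AP, AP50, AP75 for both boxes and masks) can be computed along these lines; "docs_val" is a placeholder for the registered validation split:

```python
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

evaluator = COCOEvaluator("docs_val", output_dir="./eval")  # placeholder dataset name
val_loader = build_detection_test_loader(cfg, "docs_val")
metrics = inference_on_dataset(trainer.model, val_loader, evaluator)
print(metrics)  # COCO-style bbox and segm AP, AP50, AP75
```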

5. Batch inference (active learning)

  • Script that uses the retrained model to generate output in labelme format (a minimal sketch follows this list)
    • CPU inference (even with multiprocessing) is about two orders of magnitude slower than single-process CUDA inference.
    • e.g. AWS g4dn.2xlarge CUDA inference of 500 images took about 340 s.
  • CUDA version (~0.6 s per image)
  • CPU version, multiprocessing (20+ s per image)
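
A minimal sketch of such a script, assuming the training cfg points at the fine-tuned weights and using a single placeholder class "section" (the input folder is also a placeholder):

```python
import glob
import json
import os

import cv2
from detectron2.engine import DefaultPredictor

# cfg is the training config with cfg.MODEL.WEIGHTS set to the fine-tuned
# checkpoint; set cfg.MODEL.DEVICE = "cpu" for the CPU variant
predictor = DefaultPredictor(cfg)

for path in glob.glob("images/*.jpg"):            # placeholder input folder
    img = cv2.imread(path)
    inst = predictor(img)["instances"].to("cpu")  # predicted instances
    shapes = []
    for box in inst.pred_boxes.tensor.numpy():
        x1, y1, x2, y2 = box.tolist()
        shapes.append({"shape_type": "rectangle",
                       "points": [[x1, y1], [x2, y2]],
                       "flags": {}, "group_id": None,
                       "label": "section"})        # placeholder class name
    out = {"imagePath": os.path.basename(path), "imageData": None,
           "shapes": shapes, "version": "5.1.1", "flags": {},
           "imageHeight": img.shape[0], "imageWidth": img.shape[1]}
    with open(os.path.splitext(path)[0] + ".json", "w") as f:
        json.dump(out, f)
```

Because the emitted JSON matches the labelme format shown in section 1, the predictions can be loaded back into labelme for correction, which is the active-learning loop this section refers to.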

6. Data Augmentation (create a new dataset)
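
One way to create an augmented copy of a dataset offline is a horizontal flip that rewrites both the image and its labelme labels; a minimal sketch (the helper name and the .jpg naming scheme are assumptions):

```python
import json

import cv2

def make_flipped_copy(img_path, json_path):
    """Hypothetical helper: write a horizontally flipped image and a matching
    labelme JSON with mirrored x coordinates."""
    img = cv2.imread(img_path)
    h, w = img.shape[:2]
    cv2.imwrite(img_path.replace(".jpg", "_flip.jpg"), cv2.flip(img, 1))

    with open(json_path) as f:
        label = json.load(f)
    for shape in label["shapes"]:
        shape["points"] = [[w - x, y] for x, y in shape["points"]]
    label["imagePath"] = label["imagePath"].replace(".jpg", "_flip.jpg")
    with open(json_path.replace(".json", "_flip.json"), "w") as f:
        json.dump(label, f)
```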

7. Result

  • combined image sets side by side with OpenCV hconcat (see the sketch after this list)
  • please note the fine-tuning was carried out with a small dataset of 20 images
  • therefore, the inference result from this "lightly" tuned model (right inset below) is not ideal, albeit the prediction is single-class
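
The side-by-side comparison images can be produced along these lines (file names are placeholders; cv2.hconcat requires matching heights, hence the resize):

```python
import cv2

left = cv2.imread("original_model_pred.jpg")    # placeholder file names
right = cv2.imread("finetuned_model_pred.jpg")
right = cv2.resize(right, (left.shape[1], left.shape[0]))  # match sizes for hconcat
combined = cv2.hconcat([left, right])
cv2.imwrite("side_by_side.jpg", combined)
```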

LEFT inset: original model for multi-class detection

RIGHT inset: fine-tuned model for single-class detection
