slideDupIdentify

Sander W. van der Laan edited this page Jan 3, 2024 · 2 revisions

slideDupIdentify identifies and organizes multiplicate files based on specified criteria, such as study type and stain name. It prioritizes multiplicates according to certain rules and provides options to output information about the multiplicates and log statistics.

The criteria are passed as command-line options: the study type (--study_type), the stain name (--stain), and the output file name (--output). The --verbose option prints information about the multiplicates, and --log writes log statistics.

Image file names are expected to have the form study_typestudy_number.[additional_info.]stain.[random_info.]file_extension, e.g., AE1234.T01-12345.CD34.ndpi, where AE is the study_type, 1234 is the study_number, T01-12345 is the optional additional_info, CD34 is the stain name, and ndpi is the file_extension. The random_info is also optional and can be any string of characters, e.g., 2017-12-22_23.54.03. The file_extension of the original image files is expected to be ndpi or TIF.
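The naming convention above can be illustrated with a small parser. This is a minimal sketch, not the code used by slideDupIdentify.py: the names SlideName and parse_slide_name are hypothetical, and it assumes that study_number consists of digits only and that the study type and stain are known in advance (as they are on the command line). Because random_info may itself contain dots, the sketch locates the stain among the dot-separated fields and treats everything before it as additional_info and everything after it as random_info.

```python
import re
from typing import NamedTuple, Optional


class SlideName(NamedTuple):
    """Fields of one image file name (hypothetical helper type)."""
    study_type: str
    study_number: str      # kept as a string to preserve leading zeros
    additional_info: str   # empty string when absent
    stain: str
    random_info: str       # empty string when absent
    extension: str


def parse_slide_name(filename: str, study_type: str, stain: str) -> Optional[SlideName]:
    """Split a file name into its fields, or return None if it does not fit.

    Assumption: study_number is made of digits only.
    """
    parts = filename.split(".")
    if len(parts) < 3:
        return None
    head, middle, extension = parts[0], parts[1:-1], parts[-1]
    # The first field must be the study type directly followed by the number.
    match = re.fullmatch(re.escape(study_type) + r"(\d+)", head)
    if match is None or stain not in middle:
        return None
    pos = middle.index(stain)
    return SlideName(
        study_type=study_type,
        study_number=match.group(1),
        additional_info=".".join(middle[:pos]),
        stain=stain,
        random_info=".".join(middle[pos + 1:]),
        extension=extension,
    )


if __name__ == "__main__":
    print(parse_slide_name("AE1234.T01-12345.CD34.ndpi", "AE", "CD34"))
    print(parse_slide_name("AE1234.CD34.2017-12-22_23.54.03.ndpi", "AE", "CD34"))
```

Files that share the same study_number and stain after parsing would form one group of multiplicates.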

The script will move all files with the same study_number and stain name to the duplicate folder. It will prioritize the files based on the following criteria:

  • There is an ndpi > keep ndpi, keep_this_one
  • Different creation date > keep latest file, different_date_kept_latest
  • Same date, different type > keep ndpi, same_date_diff_type_kept_ndpi
  • Same date, same type, different checksum > keep biggest, same_date_same_type_diff_checksum_biggest
  • Same date, same type, same checksum > keep first one, same_date_same_type_same_checksum_keep_this_one
  • When none of the above apply > cannot_assign_priority
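The rules above can be read as a first-match cascade. The sketch below shows one possible reading and is not the script's actual logic: Candidate and pick_keeper are hypothetical names, the first rule is assumed to apply only when exactly one file in the group is an ndpi, and when several ndpi files remain the first one in the list is kept. Creation date, file size, and checksum are assumed to have been collected beforehand.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One file in a group sharing study_number and stain (hypothetical type)."""
    name: str
    extension: str   # e.g. "ndpi" or "TIF"
    created: date
    size: int        # bytes
    checksum: str


def pick_keeper(group: List[Candidate]) -> Tuple[Optional[Candidate], str]:
    """Return (file to keep, priority label) for one group of multiplicates."""
    if not group:
        raise ValueError("group must contain at least one file")
    ndpi = [c for c in group if c.extension.lower() == "ndpi"]
    # Assumption: the first rule fires only when exactly one file is an ndpi.
    if len(ndpi) == 1:
        return ndpi[0], "keep_this_one"
    # Different creation dates: keep the latest file.
    if len({c.created for c in group}) > 1:
        return max(group, key=lambda c: c.created), "different_date_kept_latest"
    types = {c.extension.lower() for c in group}
    # Same date, different types: keep an ndpi (the first one, by assumption).
    if len(types) > 1 and ndpi:
        return ndpi[0], "same_date_diff_type_kept_ndpi"
    if len(types) == 1:
        # Same date and type: compare checksums, then fall back to size or order.
        if len({c.checksum for c in group}) > 1:
            biggest = max(group, key=lambda c: c.size)
            return biggest, "same_date_same_type_diff_checksum_biggest"
        return group[0], "same_date_same_type_same_checksum_keep_this_one"
    # No rule applies, so no file is selected.
    return None, "cannot_assign_priority"


if __name__ == "__main__":
    a = Candidate("AE1234.CD34.ndpi", "ndpi", date(2020, 1, 1), 10, "aaa")
    b = Candidate("AE1234.CD34.TIF", "TIF", date(2021, 1, 1), 20, "bbb")
    kept, label = pick_keeper([a, b])
    print(kept.name, label)
```

Returning the label together with the kept file mirrors the list above, where each rule is paired with the label it assigns.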

Example usage:

python slideDupIdentify.py --study_type AE --stain CD34 --output duplicate_files

Required argument(s):

  • --image-folder, -i Specify the folder where images are located (default: current directory). Required.
  • --studytype, -t Specify the study type prefix, e.g., AE. Required.
  • --stain, -s Specify the stain name, e.g., CD34. Required.
  • --out_file, -o Specify the output file name (without extension) to write duplicate information. Required.

Optional argument(s):

  • --force, -f Force overwrite if the output file already exists. Optional.
  • --dry_run, -d Perform a dry run (report in the terminal, no actual file operations). Optional.
  • --debug, -D Print debug information. Optional.
  • --verbose, -V Print the number of duplicate samples identified. Optional.
  • --help, -h Print this help message and exit. Optional.
  • --version, -v Print the version number and exit. Optional.
