
PolypGen Dataset

Tag: medical
Task: instance segmentation
Release Year: 2022
License: CC BY 4.0
Download: 2 GB

Introduction #

Released 2022-11-16 · Sharib Ali, Debesh Jha, Noha Ghatwary et al.

PolypGen: A Polyp Segmentation and Detection Generalisation Dataset is one of the most comprehensive detection and pixel-level segmentation datasets, curated by a team of computational scientists and expert gastroenterologists. The authors curated the data from six unique centres, incorporating more than 300 patients, which is crucial for automating polyp detection and segmentation. Polyps in the colon are widely known cancer precursors identified by colonoscopy; a polyp’s number, size, and surface structure are linked to the risk of colon cancer. The dataset includes both single-frame and sequence data, with 3762 annotated polyp labels whose boundaries were precisely delineated and verified by six senior gastroenterologists.

PolypGen data was collected from 6 different centres, using videos and frames from more than 300 unique patients. The general purpose of this diverse dataset is to allow the robust design of deep learning models and their validation to assess their generalisability. In this context, the authors have proposed different dataset configurations for training and out-of-sample validation, together with generalisation assessment metrics that reveal the strength of deep learning methods. Below, the authors provide a comprehensive description of dataset collection, annotation strategies and their quality, ethical guidelines, and metric evaluation strategies.

| Dataset | Findings | Size | Availability |
| --- | --- | --- | --- |
| Kvasir-SEG | Polyps | 1000 images† | Open academic |
| HyperKvasir | GI findings including polyps | 110,079 images and 374 videos | Open academic |
| Kvasir-Capsule | GI findings including polyps | 4,741,504 images | Open academic |
| CVC-ColonDB | Polyps | 380 images†† | By request• |
| ETIS-Larib Polyp DB | Polyps | 196 images† | Open academic |
| EDD2020 | GI lesions including polyps | 386 images | Open academic |
| CVC-ClinicDB | Polyps | 612 images† | Open academic |
| CVC-VideoClinicDB | Polyps | 11,954 images† | By request• |
| ASU-Mayo polyp database | Polyps | 18,781 images† | By request• |
| KID | Angiectasia, bleeding, inflammations | 2371 images, 47 videos | Open academic• |
| Atlas of GI Endoscope | GI lesions | 1295 images | Unknown• |
| El salvador atlas | GI lesions | 5071 video clips | Open academic♣️ |
| PolypGen | Multi-centre colon polyps | 1537 images† and 2225 video sequence frames | Open academic |

Table 1. An overview of existing gastrointestinal (GI) lesion datasets including polyps: the number of images or videos along with the availability type is provided.

| Centre | System info | Ethical approval | Patient consenting type |
| --- | --- | --- | --- |
| Ambroise Paré Hospital, Paris, France | Olympus Exera 195 | N° IDRCB: 2019-A01602-55 | Endospectral study |
| Istituto Oncologico Veneto, Padova, Italy | Olympus endoscope H190 | NA | Generic patients consent |
| Centro Riferimento Oncologico, IRCCS, Italy | Olympus VG-165, CV180, H185 | NA | Generic patients consent |
| Oslo University Hospital, Oslo, Norway | Olympus Evis Exera III, CF 190 | Exempted† | Written informed consent |
| John Radcliffe Hospital, Oxford, UK | GIF-H260Z, EVIS Lucera CV260 | REC Ref: 16/YH/0247 | Universal consent |
| University of Alexandria, Alexandria, Egypt | Olympus Exera 160AL, 180AL | NA | Written informed consent |

Table 2. Data collection information for each centre: Data acquisition system and patient consenting information.

A consortium of six different medical data centres (hospitals) was built where each data centre provided videos and image frames from at least 50 unique patients. The videos and image samples were collected and sent by the senior gastroenterologists involved in this project. The collected dataset consisted of both polyp and normal mucosa colonoscopy acquisitions. To incorporate the nature of polyp occurrences and maintain heterogeneity in the data distribution, the following protocol was adhered to for establishing the dataset:

  • Single frame sampling from each patient video incorporated different viewpoints
  • Sequence frame sampling consisted of both visible and invisible polyp frames (in most cases) with a minimal gap
  • While single frame data consisted of all polyp instances in that patient, sequence frame data consisted of only a localised targeted polyp
  • Positive sequences included both positive and negative polyp instances, but only from videos with a confirmed polyp location, while negative sequences used only patient videos with normal mucosa

An overview of the number of positive and negative samples is presented in Fig. 2 a. In total, 3762 positive frames are released, comprising 484, 1166, 457, 677, 458 and 520 frames from centres C1, C2, C3, C4, C5 and C6, respectively. These frames consist of 1537 single frames (1449 frames from C1-C5, also provided in the EndoCV2021 challenge, and 88 frames from C6) and 2225 sequence frames, with the majority of sequence data sampled from centres C2 (865), C4 (450), and C6 (432). The polyp counts with pixel-level annotation for small (≤ 100 × 100 pixels), medium (> 100 × 100 pixels and ≤ 200 × 200 pixels), and large (> 200 × 200 pixels) polyps from each centre are shown as a histogram plot (Fig. 2 b), which also includes frames with no polyp that lie in close proximity to a polyp. The total number of polyp annotations that the authors release is 3447. All these polyp samples are verified by expert gastroenterologists.
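The frame counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers stated in the text; the variable names are illustrative and are not part of the dataset or its tooling.

```python
# Per-centre positive frame counts as quoted in the text (centres C1..C6).
positive_per_centre = {
    "C1": 484, "C2": 1166, "C3": 457, "C4": 677, "C5": 458, "C6": 520,
}

# Single frames: 1449 from C1-C5 (EndoCV2021) plus 88 from C6.
single_frames = 1449 + 88
# Sequence frames as quoted in the text.
sequence_frames = 2225

total_positive = sum(positive_per_centre.values())
print(total_positive)                    # 3762
print(single_frames + sequence_frames)   # 3762

# Share of the positive frames contributed by each centre, in percent.
share = {c: round(100 * n / total_positive, 1) for c, n in positive_per_centre.items()}
print(share)
```

Both totals agree with the 3762 positive frames stated in the text, so the per-centre and per-type breakdowns are mutually consistent.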

Fig.2a
Fig.2b

Figure 2. PolypGen dataset: (a) Positive (both single and sequence frames) and negative samples (sequence only) from each centre, and (b) polyp size-based histogram plot for positive samples showing variable-sized annotated polyps in the dataset (small is ≤ 100×100 pixels; medium is > 100×100 ≤ 200×200, and large is > 200×200 pixels). Null represents no polyp present in the sample.
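As a minimal sketch, the size bins from the Figure 2 caption can be written as a small helper. It assumes the polyp size is supplied as an area in pixels; the text does not say whether the authors measure the mask area or the bounding-box area, and the function name `polyp_size_category` is hypothetical.

```python
SMALL_MAX_AREA = 100 * 100    # small: area <= 100x100 pixels
MEDIUM_MAX_AREA = 200 * 200   # medium: > 100x100 and <= 200x200 pixels


def polyp_size_category(area_px: int) -> str:
    """Map a polyp area in pixels to the bins named in the Figure 2 caption.

    Assumption: an area of 0 stands for "no polyp present" (the "null" bin).
    """
    if area_px < 0:
        raise ValueError("area must be non-negative")
    if area_px == 0:
        return "null"
    if area_px <= SMALL_MAX_AREA:
        return "small"
    if area_px <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


for area in (0, 9_000, 10_000, 25_000, 40_000, 40_001):
    print(area, polyp_size_category(area))
```

The boundaries are inclusive on the upper side of each bin, matching the "≤" signs in the caption.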

The authors have provided both still image frames and continuous short video sequence data with their corresponding annotations. The positive and negative samples in the polyp generalisation (PolypGen) dataset are further detailed below.

Fig.3

Figure 3. Sample polyp annotations from each centre: Segmentation area with boundaries and corresponding bounding box/boxes overlaid on images from all six centres. Samples range from small-sized polyps (< 10000 pixels), including some flat polyps, to large-sized polyps (≥ 40000 pixels), and also include polyps during the resection procedure, such as polyps with blue dyes.

Positive samples

Positive samples consist of video frames from patients with a diagnosed polyp. A selected frame may not contain the polyp itself, in which case the polyp may be located near the chosen frame. Nevertheless, the majority of these frames contain at least one polyp. For the positive sequence samples, the continuity of the appearance and disappearance of the polyp, as in a real scenario, has been taken into account, so these sequences can contain a mixture of polyp instances and frames with normal mucosa. Table 3 details the characteristics of the 23 sequences included in the dataset. It can be observed from Figure 4 that varied-sized polyps are included in the dataset with variable viewpoints, occlusions and instruments. Exemplary pixel-level annotations of positive polyp samples for each centre and their corresponding bounding boxes are presented in Fig. 3.

| Sequence | Description | Artifact |
| --- | --- | --- |
| seq1 | Normal mucosa | Light reflections; green patch |
| seq2 | 5 mm polyp at 6 o’clock | Partially covered with stool; reflections; green patch |
| seq3 | Polyp at distance, 4 o’clock | Light reflection from liquid; green patch |
| seq4 | 2-3 mm polyp | Liquid covering half of the image; green patch |
| seq5 | 5 mm polyp caught by a snare | Partial occlusion by biopsy instrument |
| seq6 | Polyp covering half of the circumference | Cap; green patch |
| seq7 | Normal mucosa | Light reflection; some remnant stool; green patch |
| seq8 | Typical flat cancer | Light reflection; green patch |
| seq9 | 2 mm polyp at 2 o’clock | Light reflection; green patch |
| seq10 | Subtle small protrusions | Some remnant stool |
| seq11 | Polyp at 2-3 o’clock | Light reflections in the periphery |
| seq12 | Dye lifted 4-5 mm polyp | Low contrast |
| seq13 | 6-7 mm polyp caught with a snare | Low contrast; small reflections |
| seq14 | Paris 1 p polyp, large long stalk, JNET 2a | Lifted by Indigo Carmine, snare placed around the stalk |
| seq15 | Paris 1 s JNET 2a polyp and 1 Paris 1 sp to the left | Lifted by Indigo Carmine |
| seq16 | Paris 1 p polyp, large long stalk, JNET 2a | Lifted by Indigo Carmine |
| seq17 | Paris 1 sp polyp | Light reflections make surface assessment impossible |
| seq18 | Difficult interpretation | Blurry image and reduced view |
| seq19 | Paris 1 p polyp, large long stalk, JNET 2a | Less contrast and slightly occluded |
| seq20 | Half of the polyp visible | Blurry image, with some blood on the mucosa |
| seq21 | Two adenomas polyp | Blurry image |
| seq22 | Adenomas polyp | Blurry image makes exact diagnosis impossible |
| seq23 | Serrated polyp | Perfect clean mucosa, minor light reflections |

Table 3. Summarised attributes of the positive sample sequences: A total of 23 sequences are provided as positive sample sequences for patients with polyp instances during colonoscopy examination. Here JNET refers to the Japan NBI Expert Team classification score. These sequences depict polyps of different sizes and locations, with different artefacts and varying visibility. One selected image from each sequence is shown in Fig. 4.

Negative samples

Negative samples mostly refer to the negative sequences released in this dataset, i.e. no polyp frames. These sequences are taken from patient videos which consisted of confirmed absence of polyps (i.e., normal mucosa) in the acquired videos or at areas away from the polyp occurrences. It includes cases with anatomies such as colon linings, light reflections and mucosa covered with stool that may be confused with polyps (see Figure 5 and corresponding negative sequence attributes in Table 4).

Annotation strategies and quality assurance

A team of 6 senior gastroenterologists (with over 20 years of experience in endoscopy), two experienced post-doctoral researchers, and one PhD student were involved in the data collection, data sorting, annotation, and review of annotation quality. For details on data collection and data sorting, please refer to the section Video acquisition, collection and dataset construction. All annotations were performed by a team of three experienced researchers using an online annotation tool called Labelbox. The dataset was divided equally between the three annotators, each of whom annotated a specific group of frames, and all the annotated frames were then revised by the senior gastroenterologists’ team. Each annotation was later cross-validated for accurate segmentation margins by the team and by the centre expert. An independent binary review was then assigned to a senior gastroenterologist, in most cases an expert from a different centre. A protocol for manual annotation of polyps was designed to minimise heterogeneity in the manual delineation process, and it was discussed in detail with the clinical experts and the annotators during several weekly meetings. Here, the authors present only a brief summary of the aspects that require care during annotation. Expert endoscopists also provided example samples to the annotators, especially for the video annotations. The protocol is listed below (refer to Fig. 3 for final ground truth annotations):

Fig4

Figure 4. Positive sequence data: Representative samples chosen from 23 sequences of the provided positive samples data. A summary description is provided in Table 3. Parts of the images have been cropped for visualization.

  • Clear raised polyps: Boundary pixels should include only protruded regions. Care has to be taken when delineating along the normal colon folds.
  • Inked polyp regions: Only the part of the object that appears non-inked should be delineated.
  • Polyps with instrument parts: The annotation should not include the instrument; it must be delineated carefully and may form more than one object.
  • Pedunculated polyps: The annotation should include all raised regions unless they appear on the fold.
  • Flat polyps: Zoom into the regions identified as flat polyps before manual delineation, and consult the centre expert if needed.

Fig5

Figure 5. Negative sequence data: Representative samples chosen from each sequence of the provided negative sample data. A summary description is provided in Table 4. Parts of the images have been cropped for visualization.

  • Video sequence annotation: One sample from an expert gastroenterologist was provided for sequences that showed difficulty in distinguishing between mucosa and polyp. Polyps that are distant and not clearly visible were also not annotated as polyps.
  • Handling occlusion: For polyps occluded by stool or an instrument, the annotation was required to exclude the obstructed parts of the mucosa.
  • Cancerous mucosa: Mucosa that were already cancerous but did not appear as polyps were excluded from the annotation. However, a raised mucosal surface that characterised adenomatous polyps was included.

Each of these annotated masks was reviewed by expert gastroenterologists. During this review process, a binary score was provided by the experts depending on whether the annotations were clinically acceptable or not. Some of the experts also provided feedback on the annotation and these images were placed into ambiguous categories for further rectification based on expert feedback. This ambiguous category was then jointly annotated by two researchers and further sent for review to one expert. The outcome of these quality checks is provided in Figure 6. It can be observed that a large fraction (30.5%) of annotations were rejected (excluding ambiguous batches, the total annotations were 2213, among which only 1537 were accepted and 676 frames were rejected). Similarly, the ambiguous batch that included correction of annotations after the first review also recorded 34.17% of rejected frames on the second review.
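The rejection rate quoted for the first review round follows directly from the accepted and rejected counts. A short check, using only the figures stated in the paragraph above (variable names are illustrative):

```python
# First-round review outcome for single-frame data (ambiguous batches excluded).
accepted = 1537
rejected = 676

total_reviewed = accepted + rejected
rejection_rate = 100 * rejected / total_reviewed

print(total_reviewed)            # 2213
print(round(rejection_rate, 1))  # 30.5
```

The accepted count of 1537 also matches the number of single frames released in the dataset.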

| Sequence | Description | Artifact |
| --- | --- | --- |
| seq1_neg | Normal vascular pattern | Light reflections in the periphery; not clean lens |
| seq2_neg | Normal vascular pattern | Contracted bowel; light reflections |
| seq3_neg | Mucosa not satisfactorily visualized | Stool covers the field of view |
| seq4_neg | Reduced vascular pattern | Light reflections and a small amount of stool |
| seq5_neg | Reduced vascular pattern | Light reflections |
| seq6_neg | Normal vascular pattern | Light reflections; biopsy forceps |
| seq7_neg | Normal vascular pattern | Very close to the luminal wall |
| seq8_neg | Normal vascular pattern | Blurry; semi-opaque liquid; cap |
| seq9_neg | Normal vascular pattern | Blurry; semi-opaque liquid; cap |
| seq10_neg | Not possible to assess the mucosa | Blurry; occluded |
| seq11_neg | Normal vascular pattern | Light reflections in the periphery; bubble on the lens |
| seq12_neg | Normal vascular pattern | Not a clean lens, mucosa covered by stool |
| seq13_neg | Probably normal vascular pattern; Not possible to assess the mucosa | Air bubbles; remnant stool; too close to the mucosa, blur, reflections |
| seq14_neg | Clean bowel, normal vascular pattern | Very close to the mucosa in all |
| seq15_neg | Clean bowel, normal vascular pattern | Some bubbles and light reflections |
| seq16_neg | Clean bowel, normal vascular pattern | Some bubbles and light reflection |
| seq17_neg | Clean bowel, normal vascular pattern | Very close to the mucosa in all |
| seq18_neg | Clean bowel, normal vascular pattern, well distended | Some stool residues |
| seq19_neg | Clean bowel, normal vascular pattern, well distended | Some liquid residues |
| seq20_neg | Clean bowel, normal vascular pattern, well distended | Some stool residues and reflections |
| seq21_neg | Clean bowel, normal vascular pattern | Very close, minor stool residues in last images |
| seq22_neg | Clean bowel, normal vascular pattern, well distended | Some liquid and stool residues, reflections |
| seq23_neg | Perfect clean bowel, normal vascular pattern, well distended | Some light reflections |

Table 4. Summarised attributes of the negative sample sequences: A total of 23 sequences are provided as negative sample sequences for patients with no polyp during colonoscopy examination. These sequences depict different artefacts, varying visibility of vascular patterns, and occlusion of the mucosa.

Fig6

Figure 6. Annotation quality review: Total curated frames along with accepted and rejected frame numbers during annotation quality review by experts for single-frame data. Annotated frames with % of flat and protruded polyps categorised during annotation are also provided

A subset of this dataset (centres C1 to C5, excluding C6) forms the training data of the authors' EndoCV2021 challenge (Addressing generalisability in polyp detection and segmentation), an event held in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI 2021), Nice, France. The current release adds further positive and negative frames for both single and sequence data, as well as data from a 6th centre (C6). The presented version does not include training and test splits, and users are free to apply their own strategies as applicable to the nature of their work. To access the complete dataset, users are requested to create a Synapse account; the compiled dataset can then be downloaded at the source, where it is published under the Creative Commons 4.0 International (CC BY) licence. The dataset can only be used for educational and research purposes, and users must cite the paper. All collected data has been obtained through written patient consent or through ethical approval as tabulated in Table 2.
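Since no splits are shipped, one possible strategy for studying generalisation is to hold out one centre at a time. The sketch below is an illustration only and is not one of the configurations proposed by the authors; the function name `leave_one_centre_out`, the `(image_name, centre)` input format, and the example file names are all hypothetical.

```python
from collections import defaultdict


def leave_one_centre_out(samples):
    """Yield (held_out_centre, train_names, test_names) for every centre.

    `samples` is an iterable of (image_name, centre) pairs. How the centre
    label is obtained (for example from a tag or a file name) is left to the
    caller and is an assumption of this sketch.
    """
    by_centre = defaultdict(list)
    for name, centre in samples:
        by_centre[centre].append(name)

    for held_out in sorted(by_centre):
        train = [n for c in sorted(by_centre) if c != held_out for n in by_centre[c]]
        test = list(by_centre[held_out])
        yield held_out, train, test


# Hypothetical file names, used only to show the shape of the output.
demo = [("a.jpg", "C1"), ("b.jpg", "C1"), ("c.jpg", "C2"), ("d.jpg", "C6")]
splits = list(leave_one_centre_out(demo))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each image appears in exactly one test fold, so results from the folds can be compared across centres without overlap.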


Summary #

PolypGen: A Polyp Segmentation and Detection Generalisation Dataset from EndoCV2021 Challenge is a dataset for instance segmentation, semantic segmentation, object detection, and classification tasks. It is used in the medical industry.

The dataset consists of 8037 images with 3734 labeled objects belonging to a single class (polyp).

Images in the PolypGen dataset have pixel-level instance segmentation annotations. Due to the nature of the instance segmentation task, the annotations can be automatically transformed for semantic segmentation (only one mask for every class) or object detection (bounding boxes for every object). There are 4916 (61% of the total) unlabeled images (i.e. without annotations). There are no pre-defined train/val/test splits in the dataset. Alternatively, the dataset can be split into 2 cases: negative (4275 images) and positive (3762 images). Institute and sequence tags are also included. The dataset was released in 2022 by the UK-NOR-EGY-IT-SWE-FR-US joint research group.
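The conversion from instance masks to a semantic mask or to bounding boxes can be sketched in plain Python. This is an illustrative sketch, not the Dataset Ninja or Supervisely implementation: masks are assumed to be equally sized lists of rows of 0/1 values, and the function names are hypothetical.

```python
def to_semantic_mask(instance_masks):
    """Merge binary instance masks into a single class mask (pixel-wise OR).

    Requires at least one mask; all masks must have the same shape.
    """
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    semantic = [[0] * width for _ in range(height)]
    for mask in instance_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    semantic[y][x] = 1
    return semantic


def to_bounding_box(mask):
    """Return the tight (x_min, y_min, x_max, y_max) box, or None if the mask is empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return min(xs), min(ys), max(xs), max(ys)


polyp_a = [[0, 0, 0, 0, 0],
           [0, 1, 1, 0, 0],
           [0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0]]
polyp_b = [[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 1, 1],
           [0, 0, 0, 1, 1]]

semantic = to_semantic_mask([polyp_a, polyp_b])
boxes = [to_bounding_box(m) for m in (polyp_a, polyp_b)]
print(boxes)  # [(1, 1, 2, 2), (3, 2, 4, 3)]
```

Box coordinates here are inclusive pixel indices; a detection framework may expect a different convention, such as exclusive maxima or width and height.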


Explore #

The PolypGen dataset has 8037 images. The "Explore" tool on Dataset Ninja displays dataset images with their annotations and offers extended visualization capabilities such as zoom, translation, an objects table, custom filters and more.


Class balance #

There is 1 annotation class in the dataset. The table below gives the general statistics and balance for this class.

| Class | Images | Objects | Count on image (average) | Area on image (average) |
| --- | --- | --- | --- | --- |
| polyp (mask) | 3121 | 3734 | 1.2 | 10.11% |

Images #

Explore every single image in the dataset with respect to the number of annotations of each class it has. Click a row to preview selected image. Sort by any column to find anomalies and edge cases. Use horizontal scroll if the table has many columns for a large number of classes in the dataset.

Object distribution #

Interactive heatmap chart for every class with object distribution shows how many images are in the dataset with a certain number of objects of a specific class. Users can click cell and see the list of all corresponding images.

Class sizes #

The table below gives various size properties of objects for every class. Click a row to see the image with annotations of the selected class. Sort columns to find classes with the smallest or largest objects or understand the size differences between classes.

| Class | Object count | Avg area | Max area | Min area | Min height | Max height | Avg height | Min width | Max width |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| polyp (mask) | 3734 | 8.45% | 80.91% | 0% | 1px (0.09%) | 1080px (100%) | 330px (33.07%) | 1px (0.07%) | 1716px (100%) |

Spatial Heatmap #

The heatmaps below give the spatial distributions of all objects for every class. These visualizations provide insights into the most probable and rare object locations on the image. It helps analyze objects' placements in a dataset.

Spatial Heatmap

Objects #

Table contains all 3734 objects. Click a row to preview an image with annotations, and use search or pagination to navigate. Sort columns to find outliers in the dataset.

Rows 1-10 of 3734

| Object ID | Class | Image name | Image size (height x width) | Height | Width | Area |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | polyp (mask) | C1_962OLCV1_100H0004.jpg | 1080 x 1350 | 307px (28.43%) | 369px (27.33%) | 1.94% |
| 2 | polyp (mask) | C3_C3_EndoCV2021_00464.jpg | 1080 x 1350 | 102px (9.44%) | 241px (17.85%) | 0.99% |
| 3 | polyp (mask) | seq9_7_endocv2021_positive_228.jpg | 1080 x 1350 | 346px (32.04%) | 139px (10.3%) | 2.39% |
| 4 | polyp (mask) | seq5_EndoCV2021_seq5_236.jpg | 1064 x 1350 | 329px (30.92%) | 377px (27.93%) | 6.07% |
| 5 | polyp (mask) | seq13_EndoCV2021_seq13_788.jpg | 1064 x 1350 | 902px (84.77%) | 568px (42.07%) | 20.32% |
| 6 | polyp (mask) | C3_C3_EndoCV2021_00429.jpg | 1080 x 1350 | 741px (68.61%) | 783px (58%) | 29.03% |
| 7 | polyp (mask) | seq12_EndoCV2021_seq12_476.jpg | 1010 x 900 | 97px (9.6%) | 96px (10.67%) | 0.69% |
| 8 | polyp (mask) | seq21_seq6_C6_396.jpg | 710 x 910 | 118px (16.62%) | 147px (16.15%) | 1.59% |
| 9 | polyp (mask) | seq21_seq6_C6_396.jpg | 710 x 910 | 193px (27.18%) | 432px (47.47%) | 9.74% |
| 10 | polyp (mask) | seq22_seq7_C6_390.jpg | 700 x 830 | 174px (24.86%) | 193px (23.25%) | 3.87% |
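The percentage values listed next to each pixel height and width are the object dimensions relative to the image dimensions. A small sketch that reproduces them for two of the rows above (the helper name `relative_size` is hypothetical):

```python
def relative_size(height_px, width_px, image_height, image_width):
    """Return object height and width as percentages of the image size."""
    return (
        round(100 * height_px / image_height, 2),
        round(100 * width_px / image_width, 2),
    )


# Object 1: 307px x 369px in a 1080 x 1350 (height x width) image.
first = relative_size(307, 369, 1080, 1350)
# Object 10: 174px x 193px in a 700 x 830 image.
tenth = relative_size(174, 193, 700, 830)

print(first)  # (28.43, 27.33)
print(tenth)  # (24.86, 23.25)
```

Note that the Area column is not the product of these two percentages; it appears to be the share of image pixels covered by the mask, which is smaller than the area of its bounding box.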

License #

PolypGen: A Polyp Segmentation and Detection Generalisation Dataset from EndoCV2021 Challenge is released under the CC BY 4.0 license.

Source

Citation #

If you make use of the PolypGen data, please cite the following reference:

[1] Ali, Sharib, Jha, Debesh, Ghatwary, Noha et al. (2021) PolypGen: A 
multi-center polyp detection and segmentation dataset for generalisability 
assessment. arXiv.

[2] Ali, Sharib, et al. "Assessing generalisability of deep learning-based polyp 
detection and segmentation methods through a computer vision challenge." arXiv 
preprint arXiv:2202.12031 (2022).

[3] Ali S, Dmitrieva M, Ghatwary N, Bano S, Polat G, Temizel A, et al. Deep 
learning for detection and segmentation of artefact and disease instances in 
gastrointestinal endoscopy. Medical Image Analysis. 2021:102002.

Source

If you are happy with Dataset Ninja and use provided visualizations and tools in your work, please cite us:

@misc{ visualization-tools-for-polypgen-dataset,
  title = { Visualization Tools for PolypGen Dataset },
  type = { Computer Vision Tools },
  author = { Dataset Ninja },
  howpublished = { \url{ https://datasetninja.com/polypgen } },
  url = { https://datasetninja.com/polypgen },
  journal = { Dataset Ninja },
  publisher = { Dataset Ninja },
  year = { 2025 },
  month = { may },
  note = { visited on 2025-05-01 },
}

Download #

Dataset PolypGen can be downloaded in Supervisely format:

As an alternative, it can be downloaded with dataset-tools package:

pip install --upgrade dataset-tools

… using the following Python code:

import dataset_tools as dtools

dtools.download(dataset='PolypGen', dst_dir='~/dataset-ninja/')

Make sure not to overlook the python code example available on the Supervisely Developer Portal. It will give you a clear idea of how to effortlessly work with the downloaded dataset.
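Once downloaded, the annotations can be inspected with the standard library alone. The sketch below assumes the usual Supervisely project layout, in which each dataset folder has an `ann/` directory holding one JSON file per image with an `objects` list whose entries carry a `classTitle`; check the downloaded files before relying on this. The helper name `count_objects` is hypothetical, and the demo runs on a tiny synthetic folder rather than on the real dataset.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_objects(dataset_dir):
    """Count objects per class title and images without any object.

    Assumes annotation files live under some `ann/` folder as `*.json`.
    """
    counts = Counter()
    unlabeled = 0
    for ann_path in sorted(Path(dataset_dir).expanduser().rglob("ann/*.json")):
        with ann_path.open(encoding="utf-8") as f:
            objects = json.load(f).get("objects", [])
        if not objects:
            unlabeled += 1
        counts.update(obj.get("classTitle", "unknown") for obj in objects)
    return counts, unlabeled


# Synthetic stand-in for a downloaded project, used only for this demo.
with tempfile.TemporaryDirectory() as root:
    ann_dir = Path(root) / "ds" / "ann"
    ann_dir.mkdir(parents=True)
    labeled = {"objects": [{"classTitle": "polyp"}, {"classTitle": "polyp"}]}
    (ann_dir / "a.jpg.json").write_text(json.dumps(labeled), encoding="utf-8")
    (ann_dir / "b.jpg.json").write_text(json.dumps({"objects": []}), encoding="utf-8")
    counts, unlabeled = count_objects(root)

print(dict(counts), unlabeled)  # {'polyp': 2} 1
```

Pointing `count_objects` at the real download directory should, if the layout assumption holds, give totals comparable to the object and unlabeled-image counts reported in the Summary section.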


Disclaimer #

Our gal from the legal dep told us we need to post this:

Dataset Ninja provides visualizations and statistics for some datasets that can be found online and can be downloaded by general audience. Dataset Ninja is not a dataset hosting platform and can only be used for informational purposes. The platform does not claim any rights for the original content, including images, videos, annotations and descriptions. Joint publishing is prohibited.

You take full responsibility when you use datasets presented at Dataset Ninja, as well as other information, including visualizations and statistics we provide. You are in charge of compliance with any dataset license and all other permissions. You are required to visit the dataset's homepage and make sure that you can use it. In case of any questions, get in touch with us at hello@datasetninja.com.