OpenFace-CQUPT committed · Commit cad1498 (verified) · 1 Parent(s): d334b38

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -21,13 +21,13 @@ We developed a domain-speciffc large language-vision assistant (PA-LLaVA) for pa
  ### Introduction
  These public datasets contain substantial amounts of data unrelated to human pathology. To obtain the human pathology image-text data, we performed two cleaning processes on the raw data, as illustrated in the follow figture: (1) Removing nonpathological images. (2) Removing nonhuman pathology data. Additionally, we excluded image-text pairs with textual descriptions of fewer than 20 words. Ultimately, we obtained 518,413 image-text pairs (named "PCaption-0.5M" ) for the aligned training dataset.

- Instruction fine-tuning phase we only cleaned PMC-VQA in the same way and obtained 15,788 question-answer pairs related to human pathology. Lastly, we combined PathVQA and Human pathology data obtained from PMC-VQA, thereby constructing a dataset of 35543 question-answer pairs.data.
+ Instruction fine-tuning phase we only cleaned PMC-VQA in the same way and obtained 15,788 question-answer pairs related to human pathology. Lastly, we combined PathVQA and Human pathology data obtained from PMC-VQA, thereby constructing a dataset of 35543 question-answer pairs data.

  #### Data Cleaning Process

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/IAeFWhH8brZYDaTJnew2N.png)

- ### Get the Dataset
+ ## Get the Dataset

  ### Step 1 Download the public datasets.
  Here we only provide the download link for the public dataset and expose the image id index of our cleaned dataset on HuggingFace.
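
The README text in this diff states that image-text pairs whose descriptions have fewer than 20 words were excluded. As a rough illustration only, that rule might look like the sketch below. It assumes words are counted by splitting on whitespace and that each pair is an `(image_id, caption)` tuple; neither the tokenization nor the data layout is given in the README, and `filter_short_captions` is a hypothetical helper name, not part of the authors' code.

```python
from typing import Iterable, List, Tuple

# Threshold taken from the README: captions with fewer than 20 words are dropped.
MIN_WORDS = 20


def filter_short_captions(
    pairs: Iterable[Tuple[str, str]],
    min_words: int = MIN_WORDS,
) -> List[Tuple[str, str]]:
    """Keep only (image_id, caption) pairs whose caption has at least min_words words.

    Assumption: a "word" is a whitespace-delimited token. The README does not
    say how words were counted.
    """
    return [
        (image_id, caption)
        for image_id, caption in pairs
        if len(caption.split()) >= min_words
    ]


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    sample = [
        ("img_001", "Short caption."),
        ("img_002", " ".join(["word"] * 20)),
    ]
    kept = filter_short_captions(sample)
    print([image_id for image_id, _ in kept])
```

The two removal steps in the README (nonpathological images, nonhuman pathology data) are not sketched here because the diff does not describe how they were carried out.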