OpenFace-CQUPT
committed on
Update README.md
README.md
CHANGED
@@ -21,13 +21,13 @@ We developed a domain-specific large language-vision assistant (PA-LLaVA) for pa
|
|
21 |
### Introduction
|
22 |
These public datasets contain substantial amounts of data unrelated to human pathology. To obtain human pathology image-text data, we applied two cleaning steps to the raw data, as illustrated in the following figure: (1) removing non-pathological images and (2) removing non-human pathology data. Additionally, we excluded image-text pairs whose textual descriptions contain fewer than 20 words. Ultimately, we obtained 518,413 image-text pairs (named "PCaption-0.5M") as the alignment training dataset.
|
23 |
|
24 |
-
For the instruction fine-tuning phase, we cleaned only PMC-VQA in the same way and obtained 15,788 question-answer pairs related to human pathology. Lastly, we combined PathVQA with the human pathology data obtained from PMC-VQA, constructing a dataset of 35,543 question-answer pairs
|
25 |
|
26 |
#### Data Cleaning Process
|
27 |
|
28 |
![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/IAeFWhH8brZYDaTJnew2N.png)
|
29 |
|
30 |
-
|
31 |
|
32 |
### Step 1: Download the public datasets
|
33 |
Here we provide only the download links for the public datasets and publish the image ID index of our cleaned dataset on Hugging Face.
|
|
|
21 |
### Introduction
|
22 |
These public datasets contain substantial amounts of data unrelated to human pathology. To obtain human pathology image-text data, we applied two cleaning steps to the raw data, as illustrated in the following figure: (1) removing non-pathological images and (2) removing non-human pathology data. Additionally, we excluded image-text pairs whose textual descriptions contain fewer than 20 words. Ultimately, we obtained 518,413 image-text pairs (named "PCaption-0.5M") as the alignment training dataset.
|
23 |
|
24 |
+
For the instruction fine-tuning phase, we cleaned only PMC-VQA in the same way and obtained 15,788 question-answer pairs related to human pathology. Lastly, we combined PathVQA with the human pathology data obtained from PMC-VQA, constructing a dataset of 35,543 question-answer pairs.
|
25 |
|
26 |
#### Data Cleaning Process
|
27 |
|
28 |
![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/IAeFWhH8brZYDaTJnew2N.png)
|
29 |
|
30 |
+
## Get the Dataset
|
31 |
|
32 |
### Step 1: Download the public datasets
|
33 |
Here we provide only the download links for the public datasets and publish the image ID index of our cleaned dataset on Hugging Face.
|
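Once a public dataset has been downloaded, the published image ID index can be used to keep only the cleaned subset. The snippet below is a minimal sketch of that selection, combined with the 20-word caption rule from the Introduction. It is not the authors' cleaning code: the record layout (`image_id` and `caption` keys), the function names, and counting words by whitespace splitting are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator

# Threshold taken from the README: pairs with fewer than 20 words are excluded.
MIN_CAPTION_WORDS = 20


def has_enough_words(caption: str, min_words: int = MIN_CAPTION_WORDS) -> bool:
    """Return True if the caption has at least `min_words` words.

    Words are counted by whitespace splitting (an assumption; the README
    does not say how words were counted).
    """
    return len(caption.split()) >= min_words


def select_cleaned_pairs(
    records: Iterable[Dict[str, str]],
    keep_ids: Iterable[str],
) -> Iterator[Dict[str, str]]:
    """Yield records whose image ID is in the index and whose caption is long enough.

    `records` is a hypothetical iterable of dicts with "image_id" and
    "caption" keys; adapt the key names to the actual dataset files.
    """
    keep = set(keep_ids)  # set lookup keeps the filter linear in the dataset size
    for record in records:
        if record["image_id"] in keep and has_enough_words(record["caption"]):
            yield record
```

In practice, `keep_ids` would be read from the ID index file and `records` from the downloaded dataset's annotation file; both file formats need to be checked against the actual releases.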