Update README.md
README.md CHANGED
@@ -11,7 +11,7 @@ tags:
 datasets:
 - assin
 - assin2
-- stjiris/portuguese-legal-sentences-
+- stjiris/portuguese-legal-sentences-v1.0
 widget:
 - source_sentence: "O advogado apresentou as provas ao juíz."
   sentences:
@@ -36,11 +36,11 @@ model-index:
       type: Pearson Correlation
       value: 0.8249826985133595
 ---
-# stjiris/bert-large-portuguese-cased-legal-mlm-sts-
+# stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0
 This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.
-stjiris/bert-large-portuguese-cased-legal-mlm-sts-
+stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0 derives from [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) large.
 
-It was trained using the MLM technique with a learning rate 3e-5 [Legal Sentences from +-30000 documents](https://huggingface.co/datasets/stjiris/portuguese-legal-sentences-
+It was trained using the MLM technique with a learning rate of 3e-5 on [Legal Sentences from ~30000 documents](https://huggingface.co/datasets/stjiris/portuguese-legal-sentences-v1.0) for 130k training steps (the setting with the best performance for our semantic search system implementation).
 
 It is adapted to the Portuguese legal domain and fine-tuned for STS on Portuguese datasets: [assin](https://huggingface.co/datasets/assin), [assin2](https://huggingface.co/datasets/assin2) and the Portuguese subset of [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt).
 
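The card names the STS fine-tuning datasets but the diff does not include the training script. For orientation only, below is a minimal sketch of the usual sentence-transformers STS recipe (CosineSimilarityLoss over normalised similarity scores); the dataset columns, starting checkpoint and hyperparameters are assumptions, not the authors' configuration.

```python
# Hypothetical sketch of STS fine-tuning with sentence-transformers.
# Column names, hyperparameters and the starting checkpoint are assumptions.
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

# In the card's setup this would presumably start from the MLM-adapted checkpoint
model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')

# assin2 scores relatedness on a 1-5 scale; normalise to [0, 1] for CosineSimilarityLoss
train_data = load_dataset('assin2', split='train')
train_examples = [
    InputExample(texts=[row['premise'], row['hypothesis']],
                 label=row['relatedness_score'] / 5.0)
    for row in train_data
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# One pass over assin2 with linear warmup; the card's run also uses assin and stsb_multi_mt (pt)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```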
@@ -55,7 +55,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]
 
-model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-sts-
+model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
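Since the card positions the model for semantic search, a short follow-up (not part of the README) shows how the embeddings produced by the snippet above can be compared:

```python
# Hypothetical follow-up: score the two example sentences with cosine similarity
from sentence_transformers import util

score = util.cos_sim(embeddings[0], embeddings[1])
print(score)  # (1, 1) tensor; values near 1 mean high similarity
```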
@@ -75,8 +75,8 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['This is an example sentence', 'Each sentence is converted']
 
 # Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-
-model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-
+tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
+model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm-sts-v1.0')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
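The last hunk cuts the Transformers snippet off right after tokenization; the rest of that block in the README (the forward pass plus the mean_pooling step named in the hunk header) follows the standard sentence-transformers template. A reconstruction, not a verbatim quote of the card:

```python
import torch

# Standard mean-pooling template (reconstruction of the part of the snippet not shown in the diff):
# average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element: per-token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Compute token embeddings with the model loaded above, then pool to one vector per sentence
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
```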