Token Classification
GLiNER
PyTorch
multilingual
thegenerativegeneration committed
Commit aea4f45 · verified · 1 Parent(s): be55c64

Create README.md

  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
+ ---
+ license: apache-2.0
+ language:
+ - multilingual
+ library_name: gliner
+ datasets:
+ - urchade/pile-mistral-v0.1
+ - thegenerativegeneration/custom_PII_wisperlanguages
+ pipeline_tag: token-classification
+ ---
+
+ Fine-tuned, for far too short a time, on the following categories:
+
+ ```python
+ PII_CATEGORIES3 = [
+     "identifying person name", "non-identifying person name",
+     "identifying company name", "non-identifying company name",
+     "identifying address", "non-identifying address",
+     "identifying city", "non-identifying city",
+     "identifying postal code", "non-identifying postal code",
+     "identifying gender", "non-identifying gender",
+     "identifying religious belief", "non-identifying religious belief",
+     "identifying nationality", "non-identifying nationality",
+     "identifying age", "non-identifying age",
+     "other identifying information",
+ ]
+ ```
+
+ Fine-tuned on these languages (the ones supported by Whisper):
+
+ ```python
+ LANGUAGES = {
+     "en": "english",
+     "zh": "chinese",
+     "de": "german",
+     "es": "spanish",
+     "ru": "russian",
+     "ko": "korean",
+     "fr": "french",
+     "ja": "japanese",
+     "pt": "portuguese",
+     "tr": "turkish",
+     "pl": "polish",
+     "ca": "catalan",
+     "nl": "dutch",
+     "ar": "arabic",
+     "sv": "swedish",
+     "it": "italian",
+     "id": "indonesian",
+     "hi": "hindi",
+     "fi": "finnish",
+     "vi": "vietnamese",
+     "he": "hebrew",
+     "uk": "ukrainian",
+     "el": "greek",
+     "ms": "malay",
+     "cs": "czech",
+     "ro": "romanian",
+     "da": "danish",
+     "hu": "hungarian",
+     "ta": "tamil",
+     "no": "norwegian",
+     "th": "thai",
+     "ur": "urdu",
+     "hr": "croatian",
+     "bg": "bulgarian",
+     "lt": "lithuanian",
+     "la": "latin",
+     "mi": "maori",
+     "ml": "malayalam",
+     "cy": "welsh",
+     "sk": "slovak",
+     "te": "telugu",
+     "fa": "persian",
+     "lv": "latvian",
+     "bn": "bengali",
+     "sr": "serbian",
+     "az": "azerbaijani",
+     "sl": "slovenian",
+     "kn": "kannada",
+     "et": "estonian",
+     "mk": "macedonian",
+     "br": "breton",
+     "eu": "basque",
+     "is": "icelandic",
+     "hy": "armenian",
+     "ne": "nepali",
+     "mn": "mongolian",
+     "bs": "bosnian",
+     "kk": "kazakh",
+     "sq": "albanian",
+     "sw": "swahili",
+     "gl": "galician",
+     "mr": "marathi",
+     "pa": "punjabi",
+     "si": "sinhala",
+     "km": "khmer",
+     "sn": "shona",
+     "yo": "yoruba",
+     "so": "somali",
+     "af": "afrikaans",
+     "oc": "occitan",
+     "ka": "georgian",
+     "be": "belarusian",
+     "tg": "tajik",
+     "sd": "sindhi",
+     "gu": "gujarati",
+     "am": "amharic",
+     "yi": "yiddish",
+     "lo": "lao",
+     "uz": "uzbek",
+     "fo": "faroese",
+     "ht": "haitian creole",
+     "ps": "pashto",
+     "tk": "turkmen",
+     "nn": "nynorsk",
+     "mt": "maltese",
+     "sa": "sanskrit",
+     "lb": "luxembourgish",
+     "my": "myanmar",
+     "bo": "tibetan",
+     "tl": "tagalog",
+     "mg": "malagasy",
+     "as": "assamese",
+     "tt": "tatar",
+     "haw": "hawaiian",
+     "ln": "lingala",
+     "ha": "hausa",
+     "ba": "bashkir",
+     "jw": "javanese",
+     "su": "sundanese",
+     "yue": "cantonese",
+ }
+ ```
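
The README does not show how the category labels are meant to be used, so the following is only a minimal sketch, not the author's method. It assumes the labels are passed to the `gliner` library's `GLiNER.from_pretrained` and `predict_entities`, and that the returned entities are dicts with `start`, `end`, `text`, `label`, and `score` keys. The helper names `redact_identifying` and `predict_and_redact`, the placeholder `model_id`, and the sample sentence with its hand-written entities are all hypothetical. The sketch also assumes the predicted spans do not overlap.

```python
from typing import Dict, List

# Entity dicts mimic the assumed shape of GLiNER's predict_entities output:
# {"start": int, "end": int, "text": str, "label": str, "score": float}
Entity = Dict[str, object]


def redact_identifying(text: str, entities: List[Entity]) -> str:
    """Replace spans with an identifying label by a [LABEL] tag (hypothetical helper)."""
    keep = [
        e for e in entities
        if str(e["label"]).startswith("identifying")
        or e["label"] == "other identifying information"
    ]
    # Replace from right to left so earlier character offsets stay valid.
    for e in sorted(keep, key=lambda e: e["start"], reverse=True):
        tag = "[" + str(e["label"]).upper() + "]"
        text = text[: e["start"]] + tag + text[e["end"]:]
    return text


def predict_and_redact(text, labels, model_id="your-namespace/your-model", threshold=0.5):
    """Run a GLiNER model and redact identifying spans.

    model_id is a placeholder, not the real repository name.
    """
    # Imported lazily so redact_identifying works without gliner installed.
    from gliner import GLiNER

    model = GLiNER.from_pretrained(model_id)
    entities = model.predict_entities(text, labels, threshold=threshold)
    return redact_identifying(text, entities)


# Hand-written entities standing in for model output (illustrative only).
sample = "Anna Schmidt lives in Berlin."
sample_entities = [
    {"start": 0, "end": 12, "text": "Anna Schmidt",
     "label": "identifying person name", "score": 0.9},
    {"start": 22, "end": 28, "text": "Berlin",
     "label": "non-identifying city", "score": 0.8},
]
print(redact_identifying(sample, sample_entities))
```

Only the spans labeled as identifying are replaced. Spans with a "non-identifying" label are left untouched, which is the distinction the category list above appears to be designed for.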