o òÜÓhÖ$ã@sÆUdZddlmZmZmZddlmZmZddlm Z e e¡ZddiZ dZdZd Zd ZdZdZd ZedededededediZeeefed<dd„e ¡DƒZeeefed<Gdd„deƒZdS)z Tokenization classes for CANINE.é)ÚDictÚListÚOptionalé)Ú AddedTokenÚPreTrainedTokenizer)Úloggingznielsr/canine-séiiàiàiàiàiàz[CLS]z[SEP]z[BOS]z[MASK]z[PAD]z [RESERVED]ÚSPECIAL_CODEPOINTScCói|]\}}||“qS©r)Ú.0Ú codepointÚnamerrúd/var/www/html/ai/venv/lib/python3.10/site-packages/transformers/models/canine/tokenization_canine.pyÚ ;órÚSPECIAL_CODEPOINTS_BY_NAMEc sHeZdZdZeZeeƒeeƒeeƒeeƒee ƒee ƒddf‡fdd„ Zede fdd„ƒZd d „Zdedeefdd „Zdede fdd„Zde defdd„Zdd„Z d$dee deee dee fdd„Z d%dee deee dedee f‡fdd„ Z d$dee deee dee fdd„Zd$d ed!eefd"d#„Z‡ZS)&ÚCanineTokenizeraé Construct a CANINE tokenizer (i.e. a character splitter). It turns text into a sequence of characters, and then converts each character into its Unicode code point. [`CanineTokenizer`] inherits from [`PreTrainedTokenizer`]. Refer to superclass [`PreTrainedTokenizer`] for usage examples and documentation concerning parameters. Args: model_max_length (`int`, *optional*, defaults to 2048): The maximum sentence length the model accepts. Fr c st|tƒrt|dddn|}t|tƒrt|dddn|}t|tƒr(t|dddn|}t|tƒr6t|dddn|}t|tƒrDt|dddn|}t|tƒrRt|dddn|}i|_t ¡D] \} }| |j|<q[dd„|j ¡Dƒ|_t|_t |jƒ|_ tƒjd||||||||dœ| ¤ŽdS)NF)ÚlstripÚrstripTcSrrr)r rrrrrris ÿz,CanineTokenizer.__init__..)Ú bos_tokenÚ eos_tokenÚ sep_tokenÚ cls_tokenÚ pad_tokenÚ mask_tokenÚadd_prefix_spaceÚmodel_max_lengthr) Ú isinstanceÚstrrÚ_special_codepointsr ÚitemsÚ_special_codepoint_stringsÚUNICODE_VOCAB_SIZEÚ_unicode_vocab_sizeÚlenÚ_num_special_tokensÚsuperÚ__init__)ÚselfrrrrrrrrÚkwargsrr©Ú __class__rrr)Ns4ÿø ÷zCanineTokenizer.__init__ÚreturncCs|jS©N)r%)r*rrrÚ vocab_size|szCanineTokenizer.vocab_sizecCs$dd„t|jƒDƒ}| |j¡|S)NcSsi|]}t|ƒ|“qSr)Úchr)r Úirrrrrz-CanineTokenizer.get_vocab..)Úranger0ÚupdateÚadded_tokens_encoder)r*ÚvocabrrrÚ get_vocab€szCanineTokenizer.get_vocabÚtextcCst|ƒS)z5Tokenize a string (i.e. perform character splitting).)Úlist)r*r8rrrÚ _tokenize…szCanineTokenizer._tokenizeÚtokencCs*zt|ƒWStytd|›dƒ‚w)zaConverts a token (i.e. a Unicode character) in an id (i.e. its integer Unicode code point value).zinvalid token: 'ú')ÚordÚ TypeErrorÚ ValueError)r*r;rrrÚ_convert_token_to_id‰s ÿz$CanineTokenizer._convert_token_to_idÚindexcCs:z|tvr t|WSt|ƒWStytd|›ƒ‚w)z˜ Converts a Unicode code point (integer) in a token (str). In case it's a special code point, convert to human-readable format. zinvalid id: )r r1r>r?)r*rArrrÚ_convert_id_to_tokens ÿz$CanineTokenizer._convert_id_to_tokencCs d |¡S)NÚ)Újoin)r*ÚtokensrrrÚconvert_tokens_to_stringœs z(CanineTokenizer.convert_tokens_to_stringNÚtoken_ids_0Útoken_ids_1cCs4|jg}|jg}|||}|dur|||7}|S)a˜ Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A CANINE sequence has the following format: - single sequence: `[CLS] X [SEP]` - pair of sequences: `[CLS] A [SEP] B [SEP]` Args: token_ids_0 (`List[int]`): List of IDs to which the special tokens will be added. token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs. Returns: `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. N)Úsep_token_idÚcls_token_id©r*rGrHÚsepÚclsÚresultrrrÚ build_inputs_with_special_tokensŸsz0CanineTokenizer.build_inputs_with_special_tokensÚalready_has_special_tokenscsT|rtƒj||ddSdgdgt|ƒdg}|dur(|dgt|ƒdg7}|S)aÄ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method. Args: token_ids_0 (`List[int]`): List of IDs. token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs. already_has_special_tokens (`bool`, *optional*, defaults to `False`): Whether or not the token list is already formatted with special tokens for the model. Returns: `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. T)rGrHrPérN)r(Úget_special_tokens_maskr&)r*rGrHrPrNr,rrrRºsÿz'CanineTokenizer.get_special_tokens_maskcCsH|jg}|jg}t|||ƒdg}|dur"|t||ƒdg7}|S)aÓ Create a mask from the two sequences passed to be used in a sequence-pair classification task. A CANINE sequence pair mask has the following format: ``` 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second sequence | ``` If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). Args: token_ids_0 (`List[int]`): List of IDs. token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs. Returns: `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s). rNrQ)rIrJr&rKrrrÚ$create_token_type_ids_from_sequencesÖsz4CanineTokenizer.create_token_type_ids_from_sequencesÚsave_directoryÚfilename_prefixcCsdS)Nrr)r*rTrUrrrÚsave_vocabularyöszCanineTokenizer.save_vocabularyr/)NF)Ú__name__Ú __module__Ú__qualname__Ú__doc__Ú&PRETRAINED_POSITIONAL_EMBEDDINGS_SIZESÚmax_model_input_sizesr1ÚCLSÚSEPÚPADÚMASKr)ÚpropertyÚintr0r7r rr:r@rBrFrrOÚboolrRrSrVÚ __classcell__rrr,rr>s\ ÷.ÿÿ ÿ þÿÿ ÿÿþÿÿ ÿ þ rN)rZÚtypingrrrÚtokenization_utilsrrÚutilsrÚ get_loggerrWÚloggerr[r$r_r]r^ÚBOSr`ÚRESERVEDr rbr Ú__annotations__r"rrrrrrÚs. ÿ ô"