"""Tokenization classes for LLaMA."""
import os
from shutil import copyfile
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple

import sentencepiece as spm

from ...convert_slow_tokenizer import import_protobuf
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging


if TYPE_CHECKING:
    from ...tokenization_utils_base import TextInput

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
    },
    "tokenizer_file": {
        "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
    },
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"hf-internal-testing/llama-tokenizer": 2048}

SPIECE_UNDERLINE = "▁"

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \
answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure \
that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
correct. If you don't know the answer to a question, please don't share false information."""


class LlamaTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is
    no padding token in the original model.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
        eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        pad_token (`str` or `tokenizers.AddedToken`, *optional*):
            A special token used to make arrays of tokens the same size for batching purposes. Will then be ignored by
            attention mechanisms or loss computation.
        sp_model_kwargs (`Dict[str, Any]`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
            to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

              - `nbest_size = {0,1}`: No sampling is performed.
              - `nbest_size > 1`: samples from the nbest_size results.
              - `nbest_size < 0`: assuming that nbest_size is infinite and samples from all hypotheses (lattice)
                using forward-filtering-and-backward-sampling algorithm.

            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
              BPE-dropout.

        add_bos_token (`bool`, *optional*, defaults to `True`):
            Whether or not to add a `bos_token` at the start of sequences.
        add_eos_token (`bool`, *optional*, defaults to `False`):
            Whether or not to add an `eos_token` at the end of sequences.
        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
            Whether or not to clean up spaces after decoding; cleanup consists of removing potential artifacts like
            extra spaces.
        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
            Whether or not the default system prompt for Llama should be used.
        spaces_between_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not to add spaces between special tokens.
        legacy (`bool`, *optional*):
            Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622
            and #25224 which includes fixes to properly handle tokens that appear after special tokens. A simple
            example:

            - `legacy=True`:
            ```python
            >>> from transformers import T5Tokenizer

            >>> tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=True)
            >>> tokenizer.encode("Hello <extra_id_0>.")
            [8774, 32099, 3, 5, 1]
            ```
            - `legacy=False`:
            ```python
            >>> from transformers import T5Tokenizer

            >>> tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
            >>> tokenizer.encode("Hello <extra_id_0>.")  # the extra space `[3]` is no longer here
            [8774, 32099, 5, 1]
            ```
            Check out the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details.

    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token=None,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        clean_up_tokenization_spaces=False,
        use_default_system_prompt=False,
        spaces_between_special_tokens=False,
        legacy=None,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, normalized=False, special=True) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, normalized=False, special=True) if isinstance(pad_token, str) else pad_token

        if legacy is None:
            logger.warning_once(
                f"You are using the default legacy behaviour of the {self.__class__}. This is expected, and simply"
                " means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to"
                " use the new behaviour, set `legacy=False`. This should only be set if you understand what it means,"
                " and thoroughly read the reason why this was added as explained in"
                " https://github.com/huggingface/transformers/pull/24565"
            )
            legacy = True

        self.legacy = legacy
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.use_default_system_prompt = use_default_system_prompt
        self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            sp_model_kwargs=self.sp_model_kwargs,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            use_default_system_prompt=use_default_system_prompt,
            spaces_between_special_tokens=spaces_between_special_tokens,
            legacy=legacy,
            **kwargs,
        )

    @property
    def unk_token_length(self):
        return len(self.sp_model.encode(str(self.unk_token)))

    def get_spm_processor(self, from_slow=False):
        tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        if self.legacy or from_slow:  # no dependency on protobuf
            tokenizer.Load(self.vocab_file)
            return tokenizer

        with open(self.vocab_file, "rb") as f:
            sp_model = f.read()
            model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
            model = model_pb2.ModelProto.FromString(sp_model)
            normalizer_spec = model_pb2.NormalizerSpec()
            normalizer_spec.add_dummy_prefix = False
            model.normalizer_spec.MergeFrom(normalizer_spec)
            sp_model = model.SerializeToString()
            tokenizer.LoadFromSerializedProto(sp_model)
        return tokenizer

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        state["sp_model_proto"] = self.sp_model.serialized_model_proto()
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

    @property
    def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        """Returns vocab as a dict"""
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> List[str]:
        """
        Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
        first token is special.
        """
        if self.legacy or len(text) == 0:
            return super().tokenize(text, **kwargs)

        tokens = super().tokenize(SPIECE_UNDERLINE + text.replace(SPIECE_UNDERLINE, " "), **kwargs)

        if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
            tokens = tokens[1:]
        return tokens

    def _tokenize(self, text, **kwargs):
        """
        Returns a tokenized string.

        We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
        SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
        `['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
        `unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
        `self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
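
        For instance (illustrative only; the exact pieces depend on the loaded vocabulary):

        ```python
        >>> tokenizer._tokenize("▁Hey")  # encodes "<unk>▁Hey", then strips the leading `unk_token` pieces
        ['▁Hey']
        ```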
        """
        tokens = self.sp_model.encode(text, out_type=str)
        if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
            return tokens

        # 1. Encode string + prefix ex: "<unk> Hey"
        tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
        # 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
        return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        # since we manually add the prefix space, we have to remove it when decoding
        if tokens[0].startswith(SPIECE_UNDERLINE):
            tokens[0] = tokens[0][1:]

        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for i, token in enumerate(tokens):
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special and i != 0 and self.legacy:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string

    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.

        Returns:
            `Tuple(str)`: Paths to the files saved.
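
        Example (illustrative, assuming `./saved` exists):

        ```python
        >>> tokenizer.save_vocabulary("./saved")
        ('./saved/tokenizer.model',)
        ```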
        zVocabulary path (z) should be a directoryN-rr   r   wb)ospathisdirr(   errorjoinVOCAB_FILES_NAMESabspathr   isfiler   r=   r-   rO   write)r0   save_directoryrz   out_vocab_fileficontent_spiece_modelr%   r%   r3   save_vocabulary2  s"   (

zLlamaTokenizer.save_vocabularyc                 C   sL   | j r| jgng }| jr| jgng }|| | }|d ur$|| | | }|S r4   )r   bos_token_idr    eos_token_idr0   token_ids_0token_ids_1r   r   outputr%   r%   r3    build_inputs_with_special_tokensM  s   z/LlamaTokenizer.build_inputs_with_special_tokensr   r   already_has_special_tokensc                    s   |rt  j||ddS | jrdgng }| jrdgng }|du r*|dgt|  | S |dgt|  | | dgt|  | S )a  
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
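
        Example (illustrative, with the defaults `add_bos_token=True` and `add_eos_token=False`):

        ```python
        >>> tokenizer.get_special_tokens_mask([3923, 374], None)
        [1, 0, 0]
        ```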
        T)r   r   r   rb   Nr   )r.   get_special_tokens_maskr   r    r5   )r0   r   r   r   r   r   r2   r%   r3   r   X  s(   z&LlamaTokenizer.get_special_tokens_maskc                 C   s`   | j r| jgng }| jr| jgng }dgt|| |  }|dur.|dgt|| |  7 }|S )a  
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
        sequence pair mask has the following format:

        ```
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
        ```

        if token_ids_1 is None, only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (`List[int]`):
                List of ids.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
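
        Example (illustrative, with the defaults `add_bos_token=True` and `add_eos_token=False`):

        ```python
        >>> tokenizer.create_token_type_ids_from_sequences([3923, 374], [264, 1296])
        [0, 0, 0, 1, 1, 1]
        ```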
        r   Nrb   )r   r   r    r   r5   r   r%   r%   r3   $create_token_type_ids_from_sequences}  s   z3LlamaTokenizer.create_token_type_ids_from_sequencesc                 C   sT   t d| jj d d}|d| jrdnd}tddd	d
}|d|}|S )aA  
        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
        to fine-tune a model with more flexible role ordering!

        The output should look something like:

        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
        <bos>[INST] Prompt [/INST]

        The reference for this chat template is [this code
        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
        in the original repository.
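
        Example (illustrative, with `use_default_system_prompt=False`):

        ```python
        >>> chat = [{"role": "user", "content": "Hello!"}]
        >>> tokenizer.apply_chat_template(chat, tokenize=False)
        '<s>[INST] Hello! [/INST]'
        ```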
        """
        logger.warning_once(
            "\nNo chat template is defined for this tokenizer - using the default template "
            f"for the {self.__class__.__name__} class. If the default is not appropriate for "
            "your model, please set `tokenizer.chat_template` to an appropriate template. "
            "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n"
        )
        template = (
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
            "{% else %}"
            "{% set content = message['content'] %}"
            "{% endif %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ ' ' + content.strip() + ' ' + eos_token }}"
            "{% endif %}"
            "{% endfor %}"
        )
        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        return template
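

# A minimal usage sketch (illustrative; not part of the original module). It assumes a
# SentencePiece model file named "tokenizer.model" is available in the working directory.
if __name__ == "__main__":
    tokenizer = LlamaTokenizer("tokenizer.model", legacy=False)  # legacy=False requires protobuf for the normalizer patch
    ids = tokenizer.encode("Hello world")        # bos_token_id is prepended since add_bos_token=True
    print(tokenizer.convert_ids_to_tokens(ids))  # e.g. ['<s>', '▁Hello', '▁world'] (vocabulary-dependent)
    print(tokenizer.decode(ids))                 # round-trips back to the input text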