"""Tokenization classes for OpenAI GPT."""

import json
import os
import re
import unicodedata
from typing import Optional, Tuple

from ...tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
from ...utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {"openai-gpt": "https://huggingface.co/openai-gpt/resolve/main/vocab.json"},
    "merges_file": {"openai-gpt": "https://huggingface.co/openai-gpt/resolve/main/merges.txt"},
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"openai-gpt": 512}


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens
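
# Hedged illustration (not part of the upstream module; runnable via
# `python -m transformers.models.openai.tokenization_openai`): whitespace_tokenize
# only strips surrounding whitespace and splits on runs of whitespace; all
# finer-grained splitting happens in the tokenizer classes below.
if __name__ == "__main__":
    print(whitespace_tokenize("  hello   world\n"))  # ['hello', 'world']
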

class BasicTokenizer(object):
    """
    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).

    Args:
        do_lower_case (`bool`, *optional*, defaults to `True`):
            Whether or not to lowercase the input when tokenizing.
        never_split (`Iterable`, *optional*):
            Collection of tokens which will never be split during tokenization. Only has an effect when
            `do_basic_tokenize=True`
        tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
            Whether or not to tokenize Chinese characters.

            This should likely be deactivated for Japanese (see this
            [issue](https://github.com/huggingface/transformers/issues/328)).
        strip_accents (`bool`, *optional*):
            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
            value for `lowercase` (as in the original BERT).
        do_split_on_punc (`bool`, *optional*, defaults to `True`):
            In some instances we want to skip the basic punctuation splitting so that later tokenization can capture
            the full context of the words, such as contractions.
    """

    def __init__(
        self,
        do_lower_case=True,
        never_split=None,
        tokenize_chinese_chars=True,
        strip_accents=None,
        do_split_on_punc=True,
    ):
        if never_split is None:
            never_split = []
        self.do_lower_case = do_lower_case
        self.never_split = set(never_split)
        self.tokenize_chinese_chars = tokenize_chinese_chars
        self.strip_accents = strip_accents
        self.do_split_on_punc = do_split_on_punc

    def tokenize(self, text, never_split=None):
        """
        Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.

        Args:
            never_split (`List[str]`, *optional*)
                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
                [`PreTrainedTokenizer.tokenize`]) List of token not to split.
        """
        # union() returns a new set by concatenating the two sets.
        never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
        text = self._clean_text(text)

        if self.tokenize_chinese_chars:
            text = self._tokenize_chinese_chars(text)
        # prevents treating the same character with different unicode codepoints as different characters
        unicode_normalized_text = unicodedata.normalize("NFC", text)
        orig_tokens = whitespace_tokenize(unicode_normalized_text)
        split_tokens = []
        for token in orig_tokens:
            if token not in never_split:
                if self.do_lower_case:
                    token = token.lower()
                    if self.strip_accents is not False:
                        token = self._run_strip_accents(token)
                elif self.strip_accents:
                    token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token, never_split))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text, never_split=None):
        """Splits punctuation on a piece of text."""
        if not self.do_split_on_punc or (never_split is not None and text in never_split):
            return [text]
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # "Chinese character" here means anything in the CJK Unified Ideographs blocks.
        # Hiragana, Katakana and Hangul live in other blocks and are handled like
        # every other whitespace-separated script.
        if (
            (cp >= 0x4E00 and cp <= 0x9FFF)
            or (cp >= 0x3400 and cp <= 0x4DBF)
            or (cp >= 0x20000 and cp <= 0x2A6DF)
            or (cp >= 0x2A700 and cp <= 0x2B73F)
            or (cp >= 0x2B740 and cp <= 0x2B81F)
            or (cp >= 0x2B820 and cp <= 0x2CEAF)
            or (cp >= 0xF900 and cp <= 0xFAFF)
            or (cp >= 0x2F800 and cp <= 0x2FA1F)
        ):
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xFFFD or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

def get_pairs(word):
    """
    Return set of symbol pairs in a word. word is represented as tuple of symbols (symbols being variable-length
    strings)
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def text_standardize(text):
    """
    fixes some issues the spacy tokenizer had on books corpus also does some whitespace standardization
    """
    text = text.replace("—", "-")
    text = text.replace("–", "-")
    text = text.replace("―", "-")
    text = text.replace("…", "...")
    text = text.replace("´", "'")
    text = re.sub(r"""(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)""", r" \1 ", text)
    text = re.sub(r"\s*\n\s*", " \n ", text)
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()

class OpenAIGPTTokenizer(PreTrainedTokenizer):
    """
    Construct a GPT Tokenizer. Based on Byte-Pair-Encoding with the following peculiarities:

    - lowercases all inputs,
    - uses `SpaCy` tokenizer and `ftfy` for pre-BPE tokenization if they are installed, fallback to BERT's
      `BasicTokenizer` if not.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        merges_file (`str`):
            Path to the merges file.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        try:
            import ftfy
            from spacy.lang.en import English

            _nlp = English()
            self.nlp = _nlp.tokenizer
            self.fix_text = ftfy.fix_text
        except ImportError:
            logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")
            self.nlp = BasicTokenizer(do_lower_case=True)
            self.fix_text = None

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        with open(merges_file, encoding="utf-8") as merges_handle:
            merges = merges_handle.read().split("\n")[1:-1]
        merges = [tuple(merge.split()) for merge in merges]
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {}

        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def do_lower_case(self):
        return True

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        return dict(self.encoder, **self.added_tokens_encoder)

    def bpe(self, token):
        word = tuple(token[:-1]) + (token[-1] + "</w>",)
        if token in self.cache:
            return self.cache[token]
        pairs = get_pairs(word)

        if not pairs:
            return token + "</w>"

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = " ".join(word)
        if word == "\n  </w>":
            word = "\n</w>"
        self.cache[token] = word
        return word

    def _tokenize(self, text):
        """Tokenize a string."""
        split_tokens = []
        if self.fix_text is None:
            # Using BERT's BasicTokenizer
            text = self.nlp.tokenize(text)
            for token in text:
                split_tokens.extend(list(self.bpe(token).split(" ")))
        else:
            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)
            text = self.nlp(text_standardize(self.fix_text(text)))
            for token in text:
                split_tokens.extend(list(self.bpe(token.text.lower()).split(" ")))
        return split_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an id in a token (BPE) using the vocab."""
        return self.decoder.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        out_string = "".join(tokens).replace("</w>", " ").strip()
        return out_string

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )
        merge_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        index = 0
        with open(merge_file, "w", encoding="utf-8") as writer:
            writer.write("#version: 0.2\n")
            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
                        " Please check that the tokenizer is not corrupted!"
                    )
                    index = token_index
                writer.write(" ".join(bpe_tokens) + "\n")
                index += 1

        return vocab_file, merge_file
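
# Hedged, self-contained sketch (not part of the upstream module): build a toy
# vocab.json / merges.txt pair on disk and run the tokenizer on it. The token
# inventory and merges below are invented purely for illustration; a real
# checkpoint would normally be loaded with
# `OpenAIGPTTokenizer.from_pretrained("openai-gpt")` instead.
if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        toy_vocab = {"<unk>": 0, "l": 1, "o": 2, "w</w>": 3, "ow</w>": 4, "low</w>": 5}
        toy_merges = "#version: 0.2\no w</w>\nl ow</w>\n"
        vocab_path = os.path.join(tmp, "vocab.json")
        merges_path = os.path.join(tmp, "merges.txt")
        with open(vocab_path, "w", encoding="utf-8") as vf:
            json.dump(toy_vocab, vf)
        with open(merges_path, "w", encoding="utf-8") as mf:
            mf.write(toy_merges)

        toy_tokenizer = OpenAIGPTTokenizer(vocab_path, merges_path)
        print(toy_tokenizer.tokenize("low"))                     # ['low</w>']
        print(toy_tokenizer.convert_tokens_to_ids(["low</w>"]))  # [5]
        print(toy_tokenizer.decode([5]))                         # low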