Language Model
1 Reading the dataset

with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[:40])
# Replace newlines and carriage returns with spaces
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
# Keep only the first 10,000 characters
corpus_chars = corpus_chars[:10000]

2 Building the character index

idx_to_char = list(set(corpus_chars))  # deduplicate to get the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # character-to-index mapping
vocab_size = len(char_to_idx)
print(vocab_size)

corpus_indices = [char_to_idx[char] for char in corpus_chars]  # convert each character to its index, giving a sequence of indices
sample =