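Language Model

Introduction

1. Task description

This article describes an implementation of an LSTM-based language model. Given an input word sequence (word-segmented Chinese or tokenized English), the model computes its ppl (language model perplexity, a measure of how fluent the sentence is). For background on recurrent neural network language models, refer to the paper. Compared with traditional methods, approaches based on recurrent neural networks handle rare words better.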

2. Results

Comparison of ppl under the three configurations (small, medium, large):

small config	train	valid	test
paddle	40.962	118.111	112.617
tensorflow	40.492	118.329	113.788

medium config	train	valid	test
paddle	45.620	87.398	83.682
tensorflow	45.594	87.363	84.015

large config	train	valid	test
paddle	37.221	82.358	78.137
tensorflow	38.342	82.311	78.121

3. Dataset

This task uses the PTB dataset, which can be downloaded from: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

Quick start

1. Running the model for the first time

Training or fine-tuning

Start training with the following command:

!python train.py --use_gpu True --data_path data/data11325/simple-examples/data --model_type small --rnn_model basic_lstm

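The data directory must be specified with --data_path. The default training file is ptb.train.txt (override with --train_file), the default validation file is ptb.valid.txt (override with --eval_file), and the default test file is ptb.test.txt (override with --test_file).
Model size, set with --model_type: the default is small; medium and large are also available.
Model type, set with --rnn_model: the default is static; the options are static|padding|cudnn|basic_lstm.
The default batch size depends on the model size and can be overridden with --batch_size.
The default number of training epochs depends on the model size and can be overridden with --max_epoch.
By default the model is saved to the models directory under the current directory.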
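
The options listed above could be declared with argparse roughly as follows. This is a minimal sketch, not the project's actual args.py: the flag names and file-name defaults are taken from the command and the training log in this article, while `str2bool`, `parse_args`, and the convention that 0 means "use the value from the model configuration" are assumptions made for illustration.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: turn 'True'/'False' strings into booleans,
    because the command passes `--use_gpu True` as a separate token."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def parse_args(argv=None):
    """Parse the training options (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(description="LSTM language model (sketch)")
    parser.add_argument("--data_path", type=str, required=True,
                        help="directory that contains the PTB text files")
    parser.add_argument("--train_file", type=str, default="ptb.train.txt")
    parser.add_argument("--eval_file", type=str, default="ptb.valid.txt")
    parser.add_argument("--test_file", type=str, default="ptb.test.txt")
    parser.add_argument("--model_type", type=str, default="small",
                        choices=["small", "medium", "large"])
    parser.add_argument("--rnn_model", type=str, default="static",
                        choices=["static", "padding", "cudnn", "basic_lstm"])
    # Assumption: 0 means "fall back to the value tied to the model size".
    parser.add_argument("--batch_size", type=int, default=0)
    parser.add_argument("--max_epoch", type=int, default=0)
    parser.add_argument("--use_gpu", type=str2bool, default=False)
    parser.add_argument("--save_model_dir", type=str, default="models")
    return parser.parse_args(argv)
```

Calling `parse_args(["--data_path", "data", "--use_gpu", "True"])` in this sketch yields a namespace with `use_gpu=True` and the default file names; an unknown `--rnn_model` value is rejected by `choices`.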

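Advanced usage

1. Task definition and modeling

The goal of this task is to predict the probability of the next word, given an input word sequence.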

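2. Model principles

This task uses an RNN, the usual choice for sequence tasks. It implements a two-layer LSTM network and uses the LSTM output to predict the probability of the next word. The cross-entropy between each predicted distribution and the actual next word is computed, the values are summed, and e is raised to that power to obtain the perplexity (ppl). The current computation depends on the sentence length and still needs to be improved.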
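
To make the computation concrete, here is a small NumPy sketch. It is an illustration under assumptions, not the code from train.py: the function names and the toy probabilities are invented here. It shows the summed form described above next to the common per-token form (mean instead of sum), which is one way to remove the dependence on sentence length.

```python
import numpy as np


def token_cross_entropy(probs, targets):
    """Return -log p(target) for every position.

    probs:   [num_steps, vocab_size] predicted next-word distributions
    targets: [num_steps] ids of the words that actually came next
    """
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked)


def ppl_summed(probs, targets):
    """exp of the summed cross-entropy; grows with sequence length."""
    return float(np.exp(token_cross_entropy(probs, targets).sum()))


def ppl_per_token(probs, targets):
    """exp of the mean cross-entropy; comparable across lengths."""
    return float(np.exp(token_cross_entropy(probs, targets).mean()))
```

With a uniform distribution over a 4-word vocabulary, the per-token value stays at 4 for any sequence length, while the summed value is 4 raised to the number of steps.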

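Because of the nature of the data, the last hidden state and last cell state of each batch are used as the initial hidden state and initial cell state of the next batch. The next section explains what is special about the data.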
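
The state hand-off between batches can be sketched with a single-layer LSTM in plain NumPy. This is an assumption-laden illustration, not the project's Paddle code: `lstm_step`, `run_batch`, `run_epoch`, and the weight layout are all hypothetical.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def lstm_step(x, h, c, W, b):
    """One LSTM time step. x: [batch, in_dim]; h, c: [batch, hidden]."""
    z = np.concatenate([x, h], axis=1) @ W + b  # [batch, 4 * hidden]
    i, f, o, g = np.split(z, 4, axis=1)         # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def run_batch(batch, init_hidden, init_cell, W, b):
    """Run every time step of one batch; return last hidden and last cell."""
    h, c = init_hidden, init_cell
    for t in range(batch.shape[1]):
        h, c = lstm_step(batch[:, t, :], h, c, W, b)
    return h, c


def run_epoch(batches, hidden_size, W, b, carry_state=True):
    """Process batches in order, optionally carrying state between them."""
    batch_size = batches[0].shape[0]
    init_hidden = np.zeros((batch_size, hidden_size))
    init_cell = np.zeros((batch_size, hidden_size))
    for batch in batches:
        last_hidden, last_cell = run_batch(batch, init_hidden, init_cell, W, b)
        if carry_state:
            # The next batch continues where this one stopped.
            init_hidden, init_cell = last_hidden, last_cell
    return last_hidden, last_cell
```

With `carry_state=True`, splitting a long sequence into consecutive batches gives the same final state as running it in one piece, which is why the order of the batches matters.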

3. Data format

The data format for this task is simple: each line is a word sequence that has already been segmented (tokenized, for English).

Example sentences:

aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter
pierre <unk> N years old will join the board as a nonexecutive director nov. N
mr. <unk> is chairman of <unk> n.v. the dutch publishing group

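Special note: the PTB data is unusual. It is taken from articles, so adjacent sentences may come from the same paragraph or from neighboring paragraphs. The PTB data must therefore not be shuffled.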
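
One common way to batch such data without shuffling is to join all lines into one long id stream and cut it into parallel, contiguous rows. The sketch below assumes that scheme; it is not taken from reader.py, and the function names and the `<eos>` end-of-line token are assumptions introduced for illustration.

```python
import collections

import numpy as np


def build_vocab(lines):
    """Map each word to an id, most frequent first; '<eos>' ends each line."""
    counter = collections.Counter()
    for line in lines:
        counter.update(line.split() + ["<eos>"])
    # Sort by (-count, word) so the ids are deterministic.
    pairs = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(pairs)}


def lines_to_ids(lines, vocab):
    """Flatten the lines into one id stream, keeping the original order.

    Every word must be in vocab; PTB already maps rare words to <unk>.
    """
    ids = []
    for line in lines:
        ids.extend(vocab[word] for word in line.split() + ["<eos>"])
    return ids


def contiguous_batches(ids, batch_size, num_steps):
    """Yield (x, y) id arrays of shape [batch_size, num_steps], unshuffled.

    Row r of batch k+1 continues row r of batch k, so the last state of
    one batch is a meaningful initial state for the next. y is x shifted
    by one position (the next word).
    """
    data = np.asarray(ids, dtype=np.int64)
    batch_len = len(data) // batch_size
    data = data[: batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y
```

Shuffling the batches, or the lines before flattening, would break the row-to-row continuity that this layout relies on.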

4. Directory structure

.
├── train.py             # training script
├── reader.py            # data reading
├── args.py              # argument parsing
├── config.py            # training configuration
├── data                 # downloaded data
└── language_model.py    # model definition
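5. How to build your own model

Custom data: segment (or tokenize) your own data, place it in the data directory, and update the file names in reader.py. If your sentences are unrelated to one another, comment out the following state-update code in train.py: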
```
init_hidden = np.array(fetch_outs[1])
init_cell = np.array(fetch_outs[2])
```
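Changing the network structure: only an LSTM-based language model is implemented. You can replace it with a GRU, self-attention, or another structure to suit your needs; the network is defined in language_model.py.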

In[1]

# Unzip the dataset
!cd data/data11325 && unzip -qo simple-examples.zip

In[2]

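# Run training on the GPU with the small model
# Training is limited to 3 epochs to avoid excessive logs
# Increasing the number of epochs eventually gives better results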
!echo "training"
!python train.py --use_gpu True --data_path data/data11325/simple-examples/data --model_type small --rnn_model basic_lstm --max_epoch=3

training
2019-09-06 15:27:04,782 - [161] [line:278] - INFO: Running with args : Namespace(batch_size=0, data_path='data/data11325/simple-examples/data', enable_ce=False, eval_file='ptb.valid.txt', max_epoch=3, model_type='small', para_init=False, parallel=True, profile=False, rnn_model='basic_lstm', save_freeze_dir='freeze', save_model_dir='models', test_file='ptb.test.txt', train_file='ptb.train.txt', use_gpu=True, use_py_reader=False)
2019-09-06 15:27:04,782 - [161] [line:46] - ERROR: Exception module 'paddle.fluid' has no attribute 'is_compiled_with_cuda'
W0906 15:27:05.684434   161 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0906 15:27:05.688470   161 device_context.cc:267] device: 0, cuDNN Version: 7.3.
2019-09-06 15:27:05,706 - [161] [line:239] - WARNING: 
     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
2019-09-06 15:27:05,706 - [161] [line:318] - INFO: begin to load data
2019-09-06 15:27:05,851 - [161] [line:38] - INFO: vocab word num 10000
2019-09-06 15:27:06,139 - [161] [line:327] - INFO: finished load data
I0906 15:27:06.207988   161 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0906 15:27:06.212198   161 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
2019-09-06 15:27:14,326 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[232]; Time: 0.03484 s; ppl: 841.91589, lr: 1.00000
2019-09-06 15:27:22,457 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[464]; Time: 0.03489 s; ppl: 619.10516, lr: 1.00000
2019-09-06 15:27:30,568 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[696]; Time: 0.03513 s; ppl: 502.20526, lr: 1.00000
2019-09-06 15:27:38,671 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[928]; Time: 0.03495 s; ppl: 431.69897, lr: 1.00000
2019-09-06 15:27:46,776 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1160]; Time: 0.03492 s; ppl: 388.54611, lr: 1.00000
2019-09-06 15:27:54,875 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1392]; Time: 0.03481 s; ppl: 349.14734, lr: 1.00000
2019-09-06 15:28:02,985 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1624]; Time: 0.03489 s; ppl: 322.18011, lr: 1.00000
2019-09-06 15:28:11,115 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1856]; Time: 0.03479 s; ppl: 302.02744, lr: 1.00000
2019-09-06 15:28:19,220 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[2088]; Time: 0.03497 s; ppl: 282.88611, lr: 1.00000
2019-09-06 15:28:27,336 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[2320]; Time: 0.03477 s; ppl: 267.54971, lr: 1.00000
2019-09-06 15:28:27,407 - [161] [line:354] - INFO: 
Train epoch:[0]; epoch Time: 81.26715; ppl: 267.49921; avg_time: 28.62169 steps/s 

2019-09-06 15:28:30,555 - [161] [line:376] - INFO: Valid ppl: 179.66524
2019-09-06 15:28:30,768 - [161] [line:384] - INFO: Saved model to: models.
2019-09-06 15:28:39,090 - [161] [line:230] - INFO: -- Epoch:[1]; Batch:[232]; Time: 0.03542 s; ppl: 148.95425, lr: 1.00000
2019-09-06 15:28:47,364 - [161] [line:230] - INFO: -- Epoch:[1]; Batch:[464]; Time: 0.03536 s; ppl: 157.15070, lr: 1.00000
2019-09-06 15:28:56,247 - [161] [line:230] - INFO: -- Epoch:[1]; Batch:[696]; Time: 0.04438 s; ppl: 152.67220, lr: 1.00000

In[12]

# Load the model and run inference
!python infer.py --rnn_model basic_lstm

20
(20, 20, 1)
(400, 1)
ppl: 2026.6036376953125

Click the link to try this project hands-on in AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/122290

Source: oschina

Link: https://my.oschina.net/u/4067628/blog/4254171
