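Language Model
Introduction
1. Task description
This article describes an implementation of an LSTM-based language model. Given an input word sequence (word-segmented Chinese or tokenized English), the model computes the sequence's perplexity (ppl), a language-model measure of how fluent the sentence is. For an introduction to recurrent neural network language models, see the relevant paper. Compared with traditional methods, RNN-based methods handle rare words better.
2. Results
Perplexity (ppl) under the small, medium, and large configurations: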
small config | train | valid | test |
---|---|---|---|
paddle | 40.962 | 118.111 | 112.617 |
tensorflow | 40.492 | 118.329 | 113.788 |

medium config | train | valid | test |
---|---|---|---|
paddle | 45.620 | 87.398 | 83.682 |
tensorflow | 45.594 | 87.363 | 84.015 |

large config | train | valid | test |
---|---|---|---|
paddle | 37.221 | 82.358 | 78.137 |
tensorflow | 38.342 | 82.311 | 78.121 |
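3. Dataset
This task uses the PTB dataset, which can be downloaded from: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Quick start
1. Running the model for the first time
Training or fine-tuning
Start training with the following command: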
!python train.py --use_gpu True --data_path data/data11325/simple-examples/data --model_type small --rnn_model basic_lstm
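- The data directory must be specified (--data_path). The default training file is ptb.train.txt and can be changed with --train_file; the default validation file is ptb.valid.txt and can be changed with --eval_file; the default test file is ptb.test.txt and can be changed with --test_file.
- Model size (--model_type): small by default; medium and large are also available.
- Model type (--rnn_model): static by default; the options are static|padding|cudnn|basic_lstm.
- The default batch size depends on the model size and can be overridden with --batch_size.
- The default number of training epochs depends on the model size and can be overridden with --max_epoch.
- By default the model is saved to the models directory under the current directory.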
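To make the options above concrete, here is a minimal sketch of how such flags could be parsed with argparse. It is not the project's args.py: the helper names `str2bool` and `build_parser` are hypothetical, and treating 0 as "use the default tied to the model size" for --batch_size and --max_epoch is an assumption (the training log only shows batch_size=0 when the flag is not passed).

```python
import argparse


def str2bool(value):
    """Turn command-line strings such as 'True' or 'false' into a bool."""
    return str(value).lower() in ("true", "1", "yes")


def build_parser():
    """Hypothetical parser mirroring the flags listed above (not the real args.py)."""
    parser = argparse.ArgumentParser(description="LSTM language model training (sketch)")
    parser.add_argument("--use_gpu", type=str2bool, default=False)
    parser.add_argument("--data_path", type=str, required=True,
                        help="directory that holds the PTB text files")
    parser.add_argument("--train_file", type=str, default="ptb.train.txt")
    parser.add_argument("--eval_file", type=str, default="ptb.valid.txt")
    parser.add_argument("--test_file", type=str, default="ptb.test.txt")
    parser.add_argument("--model_type", type=str, default="small",
                        choices=["small", "medium", "large"])
    parser.add_argument("--rnn_model", type=str, default="static",
                        choices=["static", "padding", "cudnn", "basic_lstm"])
    # Assumption: 0 means "fall back to the value tied to the model size".
    parser.add_argument("--batch_size", type=int, default=0)
    parser.add_argument("--max_epoch", type=int, default=0)
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real command line.
    args = build_parser().parse_args(
        ["--use_gpu", "True", "--data_path", "data", "--rnn_model", "basic_lstm"])
    print(args)
```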
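Advanced usage
1. Task definition and modeling
The goal of this task is to predict the probability of the next word, given an input word sequence.
2. How the model works
The task uses an RNN, the usual choice for sequence tasks. It implements a two-layer LSTM network and uses the LSTM output to predict the probability of the next word. The cross-entropy between each predicted distribution and the actual next word is computed, the values are summed, and e is raised to that sum to obtain the perplexity (ppl). The current calculation depends on sentence length and still needs improvement.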
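The arithmetic can be sketched in a few lines of plain Python. With a one-hot target, the cross-entropy at each step is the negative log of the probability the model gave to the true next word. The first function follows the description above literally; the second divides by the number of tokens, which is the usual length-independent definition of perplexity. Both function names are hypothetical, and neither is claimed to match what train.py computes.

```python
import math


def summed_ppl(true_token_probs):
    """exp of the summed cross-entropy, as described in the text; grows with length."""
    return math.exp(-sum(math.log(p) for p in true_token_probs))


def per_token_ppl(true_token_probs):
    """exp of the mean cross-entropy: the usual length-normalised perplexity."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Probabilities the model assigned to the true next word at four positions.
    probs = [0.25, 0.25, 0.25, 0.25]
    print(summed_ppl(probs))     # about 256
    print(per_token_ppl(probs))  # about 4
```

A model that gives every true word probability 0.25 has a per-token perplexity of about 4 at any sequence length, while the summed form keeps growing as the sequence gets longer.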
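Because of a peculiarity of the data, the last hidden state and last cell state of each batch are used as the initial hidden state and initial cell state of the next batch. The peculiarity is explained in the next section.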
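The state hand-off can be illustrated with a framework-free sketch. Here `run_batch` stands in for one training step that returns the loss together with the last hidden and cell states; both it and `toy_run_batch` are hypothetical placeholders, not Paddle calls, and the toy arithmetic exists only to make the carried state visible.

```python
def train_epoch(batches, run_batch, hidden_size):
    """Feed each batch's final LSTM state into the next batch as its initial state."""
    init_hidden = [0.0] * hidden_size  # states start at zero for the first batch
    init_cell = [0.0] * hidden_size
    losses = []
    for inputs, targets in batches:
        loss, last_hidden, last_cell = run_batch(inputs, targets, init_hidden, init_cell)
        losses.append(loss)
        # Carry the state forward; dropping these two assignments would reset
        # the state for every batch, which suits independent sentences.
        init_hidden, init_cell = last_hidden, last_cell
    return losses, init_hidden, init_cell


def toy_run_batch(inputs, targets, hidden, cell):
    """Stand-in for a real training step: accumulates the inputs into the state."""
    new_cell = [c + sum(inputs) for c in cell]
    new_hidden = [0.5 * c for c in new_cell]
    return 0.0, new_hidden, new_cell


if __name__ == "__main__":
    batches = [([1, 2], [2, 3]), ([3, 4], [4, 5])]
    print(train_epoch(batches, toy_run_batch, hidden_size=2))
```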
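3. Data format
The data format for this task is simple: each line is a word sequence that has already been segmented into words (tokenized, for English).
Sample lines look like this: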
aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter
pierre <unk> N years old will join the board as a nonexecutive director nov. N
mr. <unk> is chairman of <unk> n.v. the dutch publishing group
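Lines in this format have to be mapped to integer ids before they reach the network. The sketch below shows one common way to do that, building a vocabulary ordered by descending frequency. It is an illustration only: `build_vocab` and `encode` are hypothetical names, and the actual logic in reader.py may differ (for example in how line ends are handled).

```python
from collections import Counter


def build_vocab(lines):
    """Map each word to an id; more frequent words get smaller ids."""
    counts = Counter(word for line in lines for word in line.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}


def encode(lines, vocab):
    """Replace every word in every line with its id."""
    return [[vocab[word] for word in line.split()] for line in lines]


if __name__ == "__main__":
    sample = [
        "pierre <unk> N years old will join the board as a nonexecutive director nov. N",
        "mr. <unk> is chairman of <unk> n.v. the dutch publishing group",
    ]
    vocab = build_vocab(sample)
    print(len(vocab))
    print(encode(sample, vocab)[1])
```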
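Special note: PTB is unusual in that its text comes from articles, so adjacent sentences may belong to the same paragraph or to neighboring paragraphs. For this reason the PTB data must not be shuffled.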
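One way to batch such data without shuffling is to cut the token stream into batch_size contiguous rows and slide a window of num_steps tokens along them. Row i of each batch then continues row i of the previous batch, which is what makes carrying the LSTM state across batches meaningful. This is a sketch of that well-known scheme under the assumption that the whole corpus is already one flat list of token ids; `ptb_batches` is a hypothetical name, not a function from reader.py.

```python
def ptb_batches(token_ids, batch_size, num_steps):
    """Yield (inputs, targets) windows in corpus order, without shuffling."""
    row_len = len(token_ids) // batch_size
    # Each row is a contiguous slice of the corpus; leftover tokens are dropped.
    rows = [token_ids[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    # One token is reserved because targets are the inputs shifted by one.
    steps_per_epoch = (row_len - 1) // num_steps
    for step in range(steps_per_epoch):
        start = step * num_steps
        inputs = [row[start:start + num_steps] for row in rows]
        targets = [row[start + 1:start + num_steps + 1] for row in rows]
        yield inputs, targets


if __name__ == "__main__":
    for inputs, targets in ptb_batches(list(range(20)), batch_size=2, num_steps=3):
        print(inputs, targets)
```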
4. Directory structure
.
├── train.py # training code
├── reader.py # data reading
├── args.py # argument parsing
├── config.py # training configuration
├── data # downloaded data
└── language_model.py # model definition
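5. Building your own model
- Custom data: segment (or tokenize) your own data first, put it in the data directory, and update the file names in reader.py. If your sentences are unrelated to one another, you can comment out the lines in train.py that update the initial state: init_hidden = np.array(fetch_outs[1]) and init_cell = np.array(fetch_outs[2]).
- Changing the network structure: only an LSTM-based language model is implemented. You can replace it with another architecture, such as GRU or self-attention, to suit your needs; all of these are defined in language_model.py.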
# Unpack the dataset
!cd data/data11325 && unzip -qo simple-examples.zip
# Run training on the GPU with the small model
# Training is limited to 3 epochs to keep the log short
# Raising the number of epochs eventually gives better results
!echo "training"
!python train.py --use_gpu True --data_path data/data11325/simple-examples/data --model_type small --rnn_model basic_lstm --max_epoch=3
training
2019-09-06 15:27:04,782 - [161] [line:278] - INFO: Running with args : Namespace(batch_size=0, data_path='data/data11325/simple-examples/data', enable_ce=False, eval_file='ptb.valid.txt', max_epoch=3, model_type='small', para_init=False, parallel=True, profile=False, rnn_model='basic_lstm', save_freeze_dir='freeze', save_model_dir='models', test_file='ptb.test.txt', train_file='ptb.train.txt', use_gpu=True, use_py_reader=False)
2019-09-06 15:27:04,782 - [161] [line:46] - ERROR: Exception module 'paddle.fluid' has no attribute 'is_compiled_with_cuda'
W0906 15:27:05.684434 161 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0906 15:27:05.688470 161 device_context.cc:267] device: 0, cuDNN Version: 7.3.
2019-09-06 15:27:05,706 - [161] [line:239] - WARNING: You can try our memory optimize feature to save your memory usage:
    # create a build_strategy variable to set memory optimize option
    build_strategy = compiler.BuildStrategy()
    build_strategy.enable_inplace = True
    build_strategy.memory_optimize = True
    # pass the build_strategy to with_data_parallel API
    compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)
    !!! Memory optimize is our experimental feature !!!
    some variables may be removed/reused internal to save memory usage,
    in order to fetch the right value of the fetch_list, please set the
    persistable property to true for each variable in fetch_list
    # Sample
    conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
    # if you need to fetch conv1, then:
    conv1.persistable = True
2019-09-06 15:27:05,706 - [161] [line:318] - INFO: begin to load data
2019-09-06 15:27:05,851 - [161] [line:38] - INFO: vocab word num 10000
2019-09-06 15:27:06,139 - [161] [line:327] - INFO: finished load data
I0906 15:27:06.207988 161 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0906 15:27:06.212198 161 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
2019-09-06 15:27:14,326 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[232]; Time: 0.03484 s; ppl: 841.91589, lr: 1.00000
2019-09-06 15:27:22,457 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[464]; Time: 0.03489 s; ppl: 619.10516, lr: 1.00000
2019-09-06 15:27:30,568 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[696]; Time: 0.03513 s; ppl: 502.20526, lr: 1.00000
2019-09-06 15:27:38,671 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[928]; Time: 0.03495 s; ppl: 431.69897, lr: 1.00000
2019-09-06 15:27:46,776 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1160]; Time: 0.03492 s; ppl: 388.54611, lr: 1.00000
2019-09-06 15:27:54,875 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1392]; Time: 0.03481 s; ppl: 349.14734, lr: 1.00000
2019-09-06 15:28:02,985 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1624]; Time: 0.03489 s; ppl: 322.18011, lr: 1.00000
2019-09-06 15:28:11,115 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[1856]; Time: 0.03479 s; ppl: 302.02744, lr: 1.00000
2019-09-06 15:28:19,220 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[2088]; Time: 0.03497 s; ppl: 282.88611, lr: 1.00000
2019-09-06 15:28:27,336 - [161] [line:230] - INFO: -- Epoch:[0]; Batch:[2320]; Time: 0.03477 s; ppl: 267.54971, lr: 1.00000
2019-09-06 15:28:27,407 - [161] [line:354] - INFO: Train epoch:[0]; epoch Time: 81.26715; ppl: 267.49921; avg_time: 28.62169 steps/s
2019-09-06 15:28:30,555 - [161] [line:376] - INFO: Valid ppl: 179.66524
2019-09-06 15:28:30,768 - [161] [line:384] - INFO: Saved model to: models.
2019-09-06 15:28:39,090 - [161] [line:230] - INFO: -- Epoch:[1]; Batch:[232]; Time: 0.03542 s; ppl: 148.95425, lr: 1.00000
2019-09-06 15:28:47,364 - [161] [line:230] - INFO: -- Epoch:[1]; Batch:[464]; Time: 0.03536 s; ppl: 157.15070, lr: 1.00000
2019-09-06 15:28:56,247 - [161] [line:230] - INFO: -- Epoch:[1]; Batch:[696]; Time: 0.04438 s; ppl: 152.67220, lr: 1.00000
# Load the model and run inference
!python infer.py --rnn_model basic_lstm
20 (20, 20, 1) (400, 1) ppl: 2026.6036376953125
Click the link to try this project hands-on in AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/122290
Source: oschina
Link: https://my.oschina.net/u/4067628/blog/4254171