Text Classification in Practice
This example also uses torchtext, a module of text-preprocessing utilities for NLP.

1. Reading the data

Data source: Stanford's IMDb dataset (Stanford's Large Movie Review Dataset).

    import os
    import random

    from tqdm import tqdm

    def read_imdb(folder='train', data_root="./dataset/aclImdb_v1/aclImdb"):
        # Collect [review_text, label] pairs; label is 1 for 'pos' and 0 for 'neg'.
        data = []
        for label in ['pos', 'neg']:
            folder_name = os.path.join(data_root, folder, label)
            for file in tqdm(os.listdir(folder_name)):
                with open(os.path.join(folder_name, file), 'rb') as f:
                    # Decode as UTF-8, strip newlines, and lowercase the review.
                    review = f.read().decode('utf-8').replace('\n', '').lower()
                    data.append([review, 1 if label == 'pos' else 0])
        random.shuffle(data)  # shuffle in place so pos/neg samples are mixed
        return data

    DATA_ROOT = "/home