text-classification

FastText recall is 'nan' but precision is a number

旧时模样 submitted on 2021-02-10 05:31:20
Question: I trained a supervised model in FastText using the Python interface and I'm getting weird results for precision and recall. First, I trained a model:

    model = fasttext.train_supervised("train.txt", wordNgrams=3, epoch=100, pretrainedVectors=pretrained_model)

Then I get results for the test data:

    def print_results(N, p, r):
        print("N\t" + str(N))
        print("P@{}\t{:.3f}".format(1, p))
        print("R@{}\t{:.3f}".format(1, r))

    print_results(*model.test('test.txt'))

But the results are always odd, because
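
For reference, a minimal sketch of the same train/test loop (file names as in the question; the pretrainedVectors argument is omitted because its value is not shown in the excerpt). fastText's test() returns the triple (number of samples, precision@1, recall@1), and a common cause of recall being reported as 'nan' is a test file in which no gold labels are found, e.g. lines missing the default "__label__" prefix; the per-label report from test_label() is a quick way to check this.

```python
import fasttext

# train.txt / test.txt must contain labels with the default "__label__" prefix,
# e.g. "__label__positive some text ...". Without these, fastText has no gold
# labels to compute recall against and the result can come back as 'nan'.
model = fasttext.train_supervised("train.txt", wordNgrams=3, epoch=100)

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

# model.test returns (number of samples, precision@1, recall@1)
print_results(*model.test("test.txt"))

# Per-label precision/recall helps spot classes that never occur in test.txt
print(model.test_label("test.txt"))
```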

Intent classification with large number of intent classes

眉间皱痕 submitted on 2021-01-29 08:21:46
Question: I am working on a data set of approximately 3,000 questions and I want to perform intent classification. The data set is not labelled yet, but from the business perspective there is a requirement to identify approximately 80 distinct intent classes. Let's assume my training data has an approximately equal number of examples per class and is not heavily skewed towards some of the classes. I intend to convert the text to word2vec or GloVe embeddings and then feed it into my classifier. I am familiar with
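
As a starting point, here is a hedged sketch of the averaged-embedding baseline described above. The names texts, labels and embeddings are hypothetical placeholders (a list of question strings, their intent ids, and a token-to-vector lookup such as pre-trained GloVe or word2vec vectors), and a linear classifier is just one reasonable baseline for ~80 classes on ~3,000 examples, not the only option.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

EMBED_DIM = 300  # dimensionality of the assumed pre-trained vectors


def embed(text, embeddings, dim=EMBED_DIM):
    """Average the vectors of in-vocabulary tokens; zeros if none are found."""
    vecs = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


def train_intent_classifier(texts, labels, embeddings):
    """texts: list[str], labels: list[int] (0..79), embeddings: dict[str, np.ndarray]."""
    X = np.vstack([embed(t, embeddings) for t in texts])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42, stratify=labels)
    clf = LogisticRegression(max_iter=1000)  # simple linear baseline
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```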

How to use Bert for long text classification?

…衆ロ難τιáo~ submitted on 2021-01-14 04:14:19
Question: We know that BERT has a maximum length limit of 512 tokens, so if an article is much longer than 512 tokens, say 10,000 tokens, how can BERT be used? Answer 1: You basically have three options: You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient. You can split your text into multiple subtexts, classify each of them and
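
As a concrete illustration of the first two options, here is a hedged sketch using the Hugging Face transformers tokenizer. The checkpoint name and long_text are placeholders, and averaging per-chunk predictions is one common aggregation strategy, not the only one.

```python
from transformers import AutoTokenizer

# Any BERT checkpoint with a 512-token limit works; this one is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "..."  # placeholder for an article much longer than 512 tokens

# Option 1: truncate and keep only the first 512 tokens (special tokens included).
enc = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")

# Option 2: split into overlapping 512-token chunks, classify each chunk,
# then aggregate the per-chunk predictions (e.g. mean of the logits or a vote).
chunks = tokenizer(long_text, truncation=True, max_length=512, stride=50,
                   return_overflowing_tokens=True, padding="max_length",
                   return_tensors="pt")

print(enc["input_ids"].shape)     # (1, <=512)
print(chunks["input_ids"].shape)  # (num_chunks, 512)
```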

Sliding window for long text in BERT for Question Answering

岁酱吖の submitted on 2021-01-05 00:51:51
Question: I've read a post which explains how the sliding window works, but I cannot find any information on how it is actually implemented. From what I understand, if the input is too long, a sliding window can be used to process the text. Please correct me if I am wrong. Say I have a text "In June 2017 Kaggle announced that it passed 1 million registered users". Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding). In June 2017 Kaggle
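
Below is a small, self-contained sketch of the window arithmetic described in the question, using plain Python on whitespace tokens. Real QA pipelines do the same thing on sub-word tokens and prepend the question to every chunk; in the Hugging Face tokenizer, for example, passing stride together with return_overflowing_tokens=True produces such overlapping chunks, where stride there means the number of tokens shared between consecutive chunks.

```python
# Word-level windows of size max_len that advance by `stride` words,
# so consecutive chunks overlap by (max_len - stride) words.
def sliding_window(words, max_len, stride):
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + max_len])
        if start + max_len >= len(words):  # last window already reaches the end
            break
    return chunks


text = "In June 2017 Kaggle announced that it passed 1 million registered users"
for chunk in sliding_window(text.split(), max_len=6, stride=3):
    print(chunk)
# ['In', 'June', '2017', 'Kaggle', 'announced', 'that']
# ['Kaggle', 'announced', 'that', 'it', 'passed', '1']
# ['it', 'passed', '1', 'million', 'registered', 'users']
```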

Oversampling after splitting the dataset - Text classification

本秂侑毒 submitted on 2021-01-01 13:33:30
Question: I am having some issues with the steps to follow for over-sampling a dataset. What I have done is the following:

    # Separate input features and target
    y_up = df.Label
    X_up = df.drop(columns=['Date', 'Links', 'Paths'], axis=1)

    # Set up testing and training sets
    X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)

    class_0 = X_train_up[X_train_up.Label==0]
    class_1 = X_train_up[X_train_up.Label==1]

    # upsample minority
    class_1_upsampled =
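
For comparison, here is a hedged sketch of the usual order of operations: split first, then oversample only the training portion so that duplicated minority rows never leak into the test set. The DataFrame df, its binary Label column and the dropped columns mirror the snippet above, and scikit-learn's resample is one common way to do the upsampling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# `df` is assumed to hold the text features plus a binary 'Label' column.
train_df, test_df = train_test_split(df, test_size=0.30, random_state=27,
                                     stratify=df.Label)

class_0 = train_df[train_df.Label == 0]  # majority class
class_1 = train_df[train_df.Label == 1]  # minority class

# Upsample the minority class with replacement until it matches the majority,
# then shuffle the recombined training set.
class_1_upsampled = resample(class_1, replace=True, n_samples=len(class_0),
                             random_state=27)
train_up = pd.concat([class_0, class_1_upsampled]).sample(frac=1, random_state=27)

X_train = train_up.drop(columns=['Label', 'Date', 'Links', 'Paths'])
y_train = train_up.Label
X_test = test_df.drop(columns=['Label', 'Date', 'Links', 'Paths'])
y_test = test_df.Label
```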