Training a model to identify names appearing in a sentence

Question

I have a dataset containing the names of about 238583 people. A name can contain more than one word, for example: Willie Enriquez, James J Johnson, D.J. Khaled. My problem is to identify these names when they appear in a sentence. I am trying to create a machine learning model that can identify whether the input is a name or not. My trouble is figuring out the input and output of this model. Since I have a bunch of names, I can train a model to recognise a name when the input is a name, but what about the other words in the sentence? The model should also be able to identify words that are not names. Assuming the sentences can contain any other words, what would be the ideal dataset for this purpose? Does it make sense to train a model on a random bunch of words and tag them as NonNames?
(The full sentences in which the names appear are not available. The user can type absolutely anything he/she wants.)

Thank you.

Answer 1:

The specifics of the answer may vary according to which model you are using, but the general idea is more or less the following:

You are trying to solve a classification task, more precisely a binary classification task in which you want to distinguish proper names (judging from your examples) from other expressions.

The input to your model, in the most general case, is the set of features of the example that you want to classify: you should decide which features you think are useful for distinguishing such names (e.g., number of words, contains a capital letter, every word is capitalized, contains dotted letters, contains any word that you already have in your dataset, etc.). The output is the class, that is, 0/1 for non-names/names.
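As a rough illustration of the feature list above, here is a minimal sketch in plain Python. The function name extract_features, the feature names, and the small known_name_words set are all made up for this example; in practice the set would be built from the words of your own name dataset, and you would likely want more features than these.

```python
import re

def extract_features(candidate, known_name_words):
    """Turn a candidate string into a dict of simple 0/1 and count features.

    known_name_words is assumed to be a set of lowercased words taken
    from the existing name dataset (hypothetical helper input).
    """
    words = candidate.split()
    return {
        # number of whitespace-separated words
        "num_words": len(words),
        # at least one uppercase letter anywhere
        "has_capital": int(any(ch.isupper() for ch in candidate)),
        # every word starts with an uppercase letter
        "all_words_capitalized": int(bool(words) and all(w[0].isupper() for w in words)),
        # a single letter followed by a dot, as in "D.J."
        "has_dotted_letter": int(bool(re.search(r"\b[A-Za-z]\.", candidate))),
        # at least one word already seen in the name dataset
        "has_known_word": int(any(w.lower().strip(".") in known_name_words for w in words)),
    }

# Tiny stand-in for the words of the real name dataset.
known = {"willie", "enriquez", "james", "johnson", "khaled"}

print(extract_features("D.J. Khaled", known))
print(extract_features("the quick fox", known))
```

The resulting dict (or its values as a vector) is what you would feed to whichever classifier you choose.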

You then train your model with positive examples taken from the dataset that you have and negative examples (i.e. non-names) built from random words.
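To make the training step concrete, here is a small self-contained sketch that uses a perceptron written in plain Python. The perceptron is only a stand-in chosen to keep the example dependency-free; the answer does not prescribe a particular model, and any binary classifier would fit. The names featurize, train_perceptron and is_name, the three features, and the toy word lists are assumptions made for this example.

```python
def featurize(text):
    """Map a candidate string to a small numeric feature vector."""
    words = text.split()
    return [
        1.0,                                                        # bias term
        float(len(words)),                                          # number of words
        float(bool(words) and all(w[0].isupper() for w in words)),  # every word capitalized
        float("." in text),                                         # contains a dot
    ]

def score(weights, text):
    return sum(w * v for w, v in zip(weights, featurize(text)))

def train_perceptron(examples, epochs=20):
    """Classic perceptron update: adjust weights only on mistakes."""
    weights = [0.0] * len(featurize(""))
    for _ in range(epochs):
        for text, label in examples:
            predicted = 1 if score(weights, text) > 0 else 0
            error = label - predicted
            weights = [w + error * v for w, v in zip(weights, featurize(text))]
    return weights

def is_name(text, weights):
    return score(weights, text) > 0

# Positive examples come from the name dataset, negatives from random words.
names = ["Willie Enriquez", "James J Johnson", "D.J. Khaled"]
non_names = ["table", "running quickly", "the", "blue car door"]
examples = [(n, 1) for n in names] + [(w, 0) for w in non_names]

weights = train_perceptron(examples)
print(is_name("Mary Smith", weights))
print(is_name("green apple", weights))
```

One thing to watch for: if the random negatives are all lowercase, the model can end up relying almost entirely on capitalization, which will fail when users type names in lowercase. Mixing in capitalized non-names and lowercased names would likely give a more realistic training set.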

If the user can enter full sentences, then you will need a preprocessing step where you extract all word sequences of length N (word n-grams) and classify each of them individually with your previously trained model.
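The n-gram preprocessing step could look roughly like the sketch below. The functions word_ngrams and find_names and the max_n parameter are hypothetical names for this example, and the set lookup passed as the classifier is only a placeholder for the model you actually trained.

```python
def word_ngrams(sentence, max_n=3):
    """Return every run of 1..max_n consecutive words in the sentence."""
    words = sentence.split()
    spans = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            spans.append(" ".join(words[start:start + n]))
    return spans

def find_names(sentence, classify, max_n=3):
    """Keep the n-grams that the supplied classifier accepts as names."""
    return [span for span in word_ngrams(sentence, max_n) if classify(span)]

# Placeholder classifier: exact lookup in a tiny name set.
# In practice this would be the trained model's predict function.
name_set = {"Willie Enriquez", "James J Johnson", "D.J. Khaled"}

candidates = word_ngrams("I met James J Johnson")
found = find_names("I met James J Johnson", lambda span: span in name_set)
print(candidates)
print(found)
```

Here max_n should be at least the number of words in the longest name you expect. Note that overlapping n-grams (for example "James J" and "James J Johnson") may both be accepted by a real classifier, so you may want to keep only the longest match.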

Source: https://stackoverflow.com/questions/51476682/training-a-model-to-identify-names-appearing-in-a-sentence

Tags

machine-learning

nlp

ner