Extracting Related Date and Location from a sentence

Question

I'm working with written text (paragraphs of articles and books) that includes both locations and dates. I want to extract from the texts pairs that contain locations and dates that are associated with one another. For example, given the following phrase:

The man left Amsterdam on January and reached Nepal on October 21st

I would have an output such as this:

>>>[(Amsterdam, January), (Nepal, October 21st)]

I tried splitting the text on "connecting words" (such as "and") and working on each part as follows: find words that indicate a location ("at", "in", "from", "to", etc.) and words that indicate a date or time ("on", "during", etc.), and join what is found. However, this proved problematic, as there are too many words that indicate location and date, and sometimes the basic "find" method cannot distinguish between them.

Assume that I can identify a date as such and that, given a word that starts with a capital letter, I can determine whether it is a location. The main issue is connecting the two and making sure they are actually related.

I figured that tools like nltk and spaCy would assist me here, but there isn't enough documentation to help me find an exact solution to this kind of problem.

Any help would be appreciated!

Answer 1:

This looks like a Named Entity Recognition (NER) problem. The steps below set up a tagger that labels the locations and dates in a sentence. For a detailed understanding, please refer to this article.

Download Stanford NER from here
Unzip the downloaded archive and save the folder on a local drive
Copy “stanford-ner.jar” from the folder and save it just outside the folder.
Download the caseless models from https://stanfordnlp.github.io/CoreNLP/history.html by clicking on “caseless”. The models from the first download also work; however, the caseless models help identify named entities even when they are not capitalized as formal grammar rules require.
Run the following Python code. Please note that this code was tested on a 64-bit Windows 10 machine with Python 2.7.

Note: Please ensure that all the paths are updated to the paths on the local machine

#Import the required libraries.
import os
from nltk.tag import StanfordNERTagger

#Set environmental variables programmatically.
#Set the classpath to the path where the jar file is located
os.environ['CLASSPATH'] = "<your path>/stanford-ner-2015-04-20/stanford-ner.jar"
#Set the Stanford models to the path where the models are stored
os.environ['STANFORD_MODELS'] = '<your path>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner'

#Set the java jdk path. This code worked with this particular java jdk
java_path = "C:/Program Files/Java/jdk1.8.0_191/bin/java.exe"
os.environ['JAVAHOME'] = java_path


#Set the path to the model that you would like to use
stanford_classifier  =  '<your path>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz'

#Build NER tagger object
st = StanfordNERTagger(stanford_classifier)

#A sample text for NER tagging
text = 'The man left Amsterdam on January and reached Nepal on October 21st'

#Tag the sentence and print output
tagged = st.tag(str(text).split())
print(tagged)
#[(u'The', u'O'), 
# (u'man', u'O'), 
# (u'left', u'O'), 
# (u'Amsterdam', u'LOCATION'), 
# (u'on', u'O'), 
# (u'January', u'DATE'), 
# (u'and', u'O'), 
# (u'reached', u'O'), 
# (u'Nepal', u'LOCATION'), 
# (u'on', u'O'), 
# (u'October', u'DATE'), 
# (u'21st', u'DATE')]

This approach works in the majority of cases.
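The tagger output labels each token but does not yet produce the (location, date) pairs asked for in the question. One possible way to get there is sketched below. It is only a heuristic, not part of the original answer: it assumes that consecutive tokens with the same label belong to one entity, and that a date refers to the closest location mentioned before it. The helper names `merge_entities` and `pair_locations_with_dates` are made up for this illustration, and the `tagged` list is copied from the output above rather than produced by running the tagger.

```python
from itertools import groupby


def merge_entities(tagged):
    """Collapse runs of consecutive tokens that share a non-'O' label
    into a single (text, label) entity."""
    entities = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label != 'O':
            text = ' '.join(token for token, _ in group)
            entities.append((text, label))
    return entities


def pair_locations_with_dates(tagged):
    """Heuristic (an assumption, not a general rule): pair each LOCATION
    with the first DATE that follows it."""
    pairs = []
    pending_location = None
    for text, label in merge_entities(tagged):
        if label == 'LOCATION':
            # Remember the most recent location until a date shows up.
            pending_location = text
        elif label == 'DATE' and pending_location is not None:
            pairs.append((pending_location, text))
            pending_location = None
    return pairs


# Tagger output copied from the example above.
tagged = [('The', 'O'), ('man', 'O'), ('left', 'O'),
          ('Amsterdam', 'LOCATION'), ('on', 'O'), ('January', 'DATE'),
          ('and', 'O'), ('reached', 'O'), ('Nepal', 'LOCATION'),
          ('on', 'O'), ('October', 'DATE'), ('21st', 'DATE')]

print(pair_locations_with_dates(tagged))
# [('Amsterdam', 'January'), ('Nepal', 'October 21st')]
```

This position-based pairing breaks down when the date comes before the location ("On January 5th he left Amsterdam") or when several locations share one date. For such sentences a dependency parse is probably a better basis for deciding which date belongs to which location.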

Source: https://stackoverflow.com/questions/61372354/extracting-related-date-and-location-from-a-sentence

Tags

python

nlp

nltk

linguistics