Question
I have a text in which some words may repeat. For each word occurrence, I need to detect its position in the text, like this:
{
    "index": 10,
    "word": "soul",
    "characterOffsetBegin": 1606,
    "characterOffsetEnd": 1609
}
I have implemented the following approach, which partially works:
var seen = new Map();
tokens.forEach(token => { // for each token
    let item = {
        "word": token
    }
    var pattern = "\\b($1)\\b";
    var wordRegex = new RegExp(pattern.replace('$1', token), "g");
    // calculate token begin end
    var match = null;
    while ((match = wordRegex.exec(text)) !== null) {
        if (match.index > (seen.get(token) || -1)) {
            var wordStart = match.index;
            var wordEnd = wordStart + token.length - 1;
            item.characterOffsetBegin = wordStart;
            item.characterOffsetEnd = wordEnd;
            seen.set(token, wordEnd);
            break;
        }
    }
});
This works in most cases, as shown here:
function aggressive_tokenizer(text) {
    // most punctuation
    text = text.replace(/([^\w\.\'\-\/\+\<\>,&])/g, " $1 ");
    // commas if followed by space
    text = text.replace(/(,\s)/g, " $1");
    // single quotes if followed by a space
    text = text.replace(/('\s)/g, " $1");
    // single quotes if last char
    text = text.replace(/('$)/, " $1");
    text = text.replace(/(\s+[`'"‘])(\w+)\b(?!\2)/g, " $2")
    // periods before newline or end of string
    text = text.replace(/\. *(\n|$)/g, " . ");
    // replace punct
    // ignore "-" since may be in slang scream
    text = text.replace(/[\\?\^%<>=!&|+\~]/g, "");
    text = text.replace(/[…;,.:*#\)\({}\[\]]/g, "");
    // finally split remainings into words
    text = text.split(/\s+/)
    return text;
}
var seen = new Map();
var text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
var tokens = aggressive_tokenizer(text);
var indexes = tokens.map(token => { // for each token
    let item = {
        "word": token
    }
    var pattern = "\\b($1)\\b";
    var wordRegex = new RegExp(pattern.replace('$1', token), "g");
    // calculate token begin end
    var match = null;
    while ((match = wordRegex.exec(text)) !== null) {
        if (match.index > (seen.get(token) || -1)) {
            var wordStart = match.index;
            var wordEnd = wordStart + token.length - 1;
            item.characterOffsetBegin = wordStart;
            item.characterOffsetEnd = wordEnd;
            seen.set(token, wordEnd);
            break;
        }
    }
    return item;
});
console.log(indexes);
However, I have found that in some circumstances the offsets are missing:
var text = "'Lorem ipsum 'dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Here I have added a "'" to some words: "'Lorem" and "'dolor" (in English this would be something like a contraction, such as "'Cause"). Now it no longer works as expected:
{
    "word": "'Lorem"
}
This is probably because of the pattern = "\\b($1)\\b"; that I'm using to match the word exactly and get the right begin and end character offsets. The tokenizer emits text like 'Cause as the token 'Cause, keeping the apostrophe so that the token can be analyzed further (for example, transforming 'cause into because in an NLP pipeline), so I cannot remove the "'" from those tokens.
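The suspicion about \b can be checked in isolation. \b only matches between a word character (\w) and a non-word character, and an apostrophe is a non-word character, so there is no boundary between the start of the string (or a space) and a leading "'". A minimal check, using a shortened sample string that is not the full text from above:

```javascript
// Shortened sample, chosen only for illustration.
const sample = "'Lorem ipsum";

// Same shape as the pattern in the question: \b(token)\b
const withBothBoundaries = /\b('Lorem)\b/.test(sample);

// Dropping the leading \b lets the apostrophe-prefixed token match.
const withTrailingBoundaryOnly = /('Lorem)\b/.test(sample);

// An ordinary word still matches with both boundaries.
const ordinaryWord = /\b(ipsum)\b/.test(sample);

console.log(withBothBoundaries, withTrailingBoundaryOnly, ordinaryWord);
// prints: false true true
```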
Another attempt is to use the regex
pattern = "(?<!\\S)$1(?!\\S)";
which works in the case of 'Lorem, but could fail in other cases.
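One such failing case is a word followed directly by punctuation: the "," after amet is a non-space character, so (?!\S) rejects the match. The sketch below first shows that, then tries a different direction that avoids word-boundary regexes entirely. It assumes the tokens are listed in the same order as they occur in the text and that each token still appears verbatim in the text; locateTokens and the sample data are hypothetical names made up for this illustration, not part of the original code.

```javascript
// Shortened sample, chosen only for illustration.
const sampleText = "'Lorem ipsum dolor sit amet, dolor";

// The whitespace-lookaround pattern from above. Tokens are not regex-escaped
// here because the sample tokens contain no regex metacharacters.
const whitespaceBound = (token) => new RegExp("(?<!\\S)" + token + "(?!\\S)");
const loremFound = whitespaceBound("'Lorem").test(sampleText); // true
const ametFound = whitespaceBound("amet").test(sampleText);    // false: followed by ","

// Hypothetical helper: walk the text once, searching for each token at or
// after the end of the previous match. This is a plain substring search,
// not a whole-word match, so it relies on the tokens being in text order.
function locateTokens(text, tokens) {
  let cursor = 0;
  return tokens.map((token) => {
    const item = { word: token };
    if (token === "") return item; // nothing to locate for empty tokens
    const start = text.indexOf(token, cursor);
    if (start !== -1) {
      item.characterOffsetBegin = start;
      item.characterOffsetEnd = start + token.length - 1; // inclusive end
      cursor = start + token.length;
    }
    return item;
  });
}

const sampleTokens = ["'Lorem", "ipsum", "dolor", "sit", "amet", "dolor"];
const located = locateTokens(sampleText, sampleTokens);
console.log(loremFound, ametFound);
console.log(located);
```

With this sample, 'Lorem is located at offsets 0 to 5 and the second dolor at 29 to 33. A token that the tokenizer has altered, so that it no longer appears verbatim in the text, would still be returned without offsets.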
Source: https://stackoverflow.com/questions/64032621/detect-exact-words-positions-in-text-in-javascript