Parsing strings: extracting words and phrases [JavaScript]

没有蜡笔的小新 2020-12-01 19:20

I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms. Thus splitting the respective string by the space-character is not suffic

10 answers
  • 2020-12-01 19:55
    var str = 'foo bar "lorem ipsum" baz';  
    var results = str.match(/("[^"]+"|[^"\s]+)/g);
    

    ... returns the array you're looking for.
    Note, however:

    • The bounding quotes are included in each phrase; strip them by running replace(/^"([^"]+)"$/,"$1") on each result.
    • Spaces between the quotes stay intact. So, if there are three spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/g," ") on each result (without the g flag, only the first run of whitespace is replaced).
    • If there's no closing " after ipsum (i.e. an incorrectly quoted phrase), you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']
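
    The notes above can be folded into one small helper. This is only a sketch: the name parseTerms is made up for illustration, and it assumes you want the quotes stripped and whitespace runs inside a phrase collapsed to a single space.

```javascript
// Hypothetical helper (not from the answer): split a query into words
// and quoted phrases, then clean up each token.
function parseTerms(input) {
  // match() returns null when nothing matches, so fall back to [].
  const tokens = input.match(/"[^"]+"|[^"\s]+/g) || [];
  return tokens.map((token) =>
    token
      .replace(/^"([^"]+)"$/, "$1") // drop the bounding quotes of a phrase
      .replace(/\s+/g, " ") // collapse every whitespace run inside it
  );
}

console.log(parseTerms('foo bar "lorem   ipsum" baz'));
// [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

    An unclosed quote still degrades to plain words, as described in the last note.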
  • 2020-12-01 19:57

    Expanding on the accepted answer, here's a search-engine-style parser that:

    • can match phrases or words
    • treats phrases as regular expressions
    • does a boolean OR across multiple properties (e.g. item.title and item.body)
    • handles negation of words or phrases when they are prefixed with -

    Treating phrases as regular expressions makes the UI simpler for my purposes.

    const matchOrIncludes = (str, search, useMatch = true) => {
      if (useMatch) {
        let result = false
        try {
          result = str.match(search)
        } catch (err) {
          return false
        }
        return result
      }
      return str.includes(search)
    }
    
    
    const itemMatches = (item, searchString, fields) => {
      const keywords = searchString.toString().replace(/\s\s+/g, ' ').trim().toLocaleLowerCase().match(/(-?"[^"]+"|[^"\s]+)/g) || []
      for (let i = 0; i < keywords.length; i++) {
        const negateWord = keywords[i].startsWith('-')
        let word = keywords[i].replace(/^-/,'')
        const isPhraseRegex = word.startsWith('"')
        if (isPhraseRegex) {
          word = word.replace(/^"(.+)"$/,"$1")
        }
        let word_in_item = false
        for (const field of fields) {
          if (item[field] && matchOrIncludes(item[field].toLocaleLowerCase(), word, isPhraseRegex)) {
            word_in_item = true
            break
          }
        }
        if ((! negateWord && ! word_in_item) || (negateWord && word_in_item)) {
          return false
        }
      }
      return true
    }
    
    const item = {title: 'My title', body: 'Some text'}
    console.log(itemMatches(item, 'text', ['title', 'body']))
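
    To see what the tokenizing step of such a parser produces on its own, here is a minimal sketch. The function tokenizeQuery and the shape of the objects it returns are assumptions made for illustration, not part of the answer's code, and unlike the answer it does not lowercase the input.

```javascript
// Hypothetical helper: tokenize a search string only, with no matching.
function tokenizeQuery(searchString) {
  // A token is an optionally negated quoted phrase, or a run of
  // characters that are neither quotes nor whitespace.
  const raw = String(searchString).trim().match(/-?"[^"]+"|[^"\s]+/g) || [];
  return raw.map((token) => {
    const negated = token.startsWith("-");
    const body = negated ? token.slice(1) : token;
    const isPhrase = body.startsWith('"');
    return {
      term: isPhrase ? body.replace(/^"(.+)"$/, "$1") : body,
      negated,
      isPhrase,
    };
  });
}

console.log(tokenizeQuery('title -"some text" -draft'));
```

    Each entry records the bare term, whether it was prefixed with -, and whether it was quoted, which is the information the matching loop above works from.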

  • 2020-12-01 19:58

    A simple regular expression will do, but it leaves the quotation marks in place, e.g.:

    'foo bar "lorem ipsum" baz'.match(/("[^"]*")|([^\s"]+)/g)
    // output: ['foo', 'bar', '"lorem ipsum"', 'baz']
    

    edit: beaten to it by shyamsundar, sorry for the double answer

  • 2020-12-01 20:01
    'foo bar "lorem ipsum" baz'.match(/"[^"]*"|\w+/g);
    

    The bounding quotes are included in the result, though.
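
    One difference worth knowing when choosing between these patterns: \w+ only matches letters, digits and underscores, so it splits words on hyphens and dots, while [^\s"]+ keeps any run of non-space, non-quote characters together. A quick comparison (the sample input is made up for illustration):

```javascript
const input = 'foo-bar "lorem ipsum" v1.2';

// Words as runs of "word characters" only.
const byWordChars = input.match(/"[^"]*"|\w+/g);
// Words as runs of anything that is not whitespace or a quote.
const byNonSpace = input.match(/"[^"]*"|[^\s"]+/g);

console.log(byWordChars);
// [ 'foo', 'bar', '"lorem ipsum"', 'v1', '2' ]
console.log(byNonSpace);
// [ 'foo-bar', '"lorem ipsum"', 'v1.2' ]
```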
