Pouch DB Fast Search

Submitted by 别等时光非礼了梦想 on 2021-01-29 13:51:00

Question


I want to search my PouchDB database quickly. The query below is very slow even for a small dataset of 1,000 items with an index on the field. I suspect it is because I am using a regular expression. I even tried anchoring the regex as "^" + search (matching only the start of the string), and it still takes a while (10 seconds).

What is the best way to do an OR search on the fields below?

Here is the code:

db_items.find({
  selector: {name: {$regex: RegExp("^" + search, "i")}},
  fields: ['_id', 'name', 'unit_price', 'category', 'quantity', 'item_id'],
  sort: ['name']
});

I applied the code from the answer and am still having performance issues: it takes 20 seconds on 10k documents, even with an index on the name field.

item_index_creation.push(db_items.createIndex({
  index: {
    fields: ['name']
  }
}));


function item_view_index(doc) {
  const regex = /[\s\.;]+/gi;
  ['name'].forEach(field => {
    if (doc[field]) {
      const words = doc[field].replaceAll(regex, ',').split(',');
      words.forEach(word => {
        word = word.trim();
        if (word.length) {
          emit(word.toLocaleLowerCase(), [field, word]);
        }
      });
    }
  });
}

// This is taking 20+ seconds on 11,000 documents
const search_results = await db_items.query(item_view_index, {
  include_docs: true,
  reduce: false,
  descending: descending,
  startkey: descending ? search + '\uFFF0' : search,
  endkey: descending ? search : search + '\uFFF0'
});

var results = search_results.rows;
var db_response = [];
for (var k = 0; k < results.length; k++) {
  var row = results[k].doc;
  var item = {
    unit_price: to_currency_no_money(row.unit_price),
    image: default_image,
    label: row.name + ' - ' + to_currency_no_money(row.unit_price),
    category: row.category,
    quantity: to_quantity(row.quantity),
    value: row.item_id
  };
  db_response.push(item);
}
response(db_response);

Answer 1:


Unfortunately you are not going to get fast searches using the likes of $regex because[1]

$regex, $ne, and $not cannot use on-disk indexes, and must use in-memory filtering instead.

So there it is - very inefficient in your case.

In a comment under your post you said

...But can use anything else that searches start of word. I can make all docs lowercase if needed. Want to match "search%"

Great! That's easy enough. However, if you absolutely have to use a Mango query this won't fly; the best choice for your requirement as stated is to use map/reduce.

For simplicity, the snippet below creates a map/reduce view indexing the words in two fields, name and category. Here's the map function.

function(doc) {
      const regex = /[\s\.;]+/gi;
      ['name', 'category'].forEach(field => {
        if (doc[field]) {
          const words = doc[field].replaceAll(regex,',').split(',');
          words.forEach(word => {
            word = word.trim();
            if (word.length) {
              emit(word.toLocaleLowerCase(), [field, word]);
            }
          });
        }
      });
    }

The highlights of the map function are

  • Multiple document fields are sourced
  • All indexed words are lowercase
  • In the index row value, the field and the original word are preserved

Note from the last point that you may, for example, post-process the results to filter out certain fields, display the original word, etc.
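For instance, here is a minimal post-processing sketch. The sample rows are hypothetical, but they have the shape the map function above emits: each row's value is [field, originalWord].

```javascript
// Hypothetical sample of rows as returned by the view query;
// each row.value is [field, originalWord], as emitted by the map function.
const rows = [
  { key: 'lorem', value: ['name', 'Lorem'] },
  { key: 'lorem', value: ['category', 'lorem'] },
  { key: 'ipsum', value: ['name', 'Ipsum'] }
];

// Keep only matches from the 'name' field, and display the original word
// (with its original casing) rather than the lowercased index key.
const nameMatches = rows
  .filter(row => row.value[0] === 'name')
  .map(row => row.value[1]);

console.log(nameMatches); // [ 'Lorem', 'Ipsum' ]
```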

OK now to the query which is a prefix search.

async function search(term, descending) {
 // snip snip
    term = term.toLocaleLowerCase();
    const result = await db.query(view_index, {
      include_docs: false,
      reduce: false,
      descending: descending,
      startkey: descending ? term + '\uFFF0' : term,
      endkey: descending ? term : term + '\uFFF0'
    });
 // snip snip
}

Note the descending property: when descending is true, the startkey and endkey must be swapped.
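The key-swap logic can be isolated in a small helper (prefixRange is a hypothetical name, not part of the PouchDB API):

```javascript
// Sketch of the key-swap logic: the range always spans the prefix up to
// prefix + '\uFFF0', but when descending the two bounds trade places.
function prefixRange(term, descending) {
  const low = term;
  const high = term + '\uFFF0';
  return descending
    ? { startkey: high, endkey: low }
    : { startkey: low, endkey: high };
}

console.log(prefixRange('lor', false)); // { startkey: 'lor', endkey: 'lor\uFFF0' }
console.log(prefixRange('lor', true));  // { startkey: 'lor\uFFF0', endkey: 'lor' }
```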

Run the snippet below, which indexes 10k documents using words from a Lorem Ipsum generator. It provides relatively fast responses. Beware that this demo uses the memory adapter, so your mileage may vary; still, with some tuning for your specific use case you should get decent response times. Good luck!

p.s. I recommend running the snippet fullscreen.

const LoremIpsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Amet consectetur adipiscing elit duis tristique. Sed risus ultricies tristique nulla aliquet enim. Pharetra pharetra massa massa ultricies mi. Nisl nisi scelerisque eu ultrices vitae auctor eu. Tellus rutrum tellus pellentesque eu. Justo donec enim diam vulputate ut pharetra sit amet aliquam. Non pulvinar neque laoreet suspendisse interdum consectetur libero. Pellentesque habitant morbi tristique senectus et netus et malesuada. Nisl purus in mollis nunc sed. Ultricies integer quis auctor elit. Ultrices in iaculis nunc sed augue lacus viverra.Eu scelerisque felis imperdiet proin fermentum. Ultrices dui sapien eget mi proin sed libero. Bibendum neque egestas congue quisque egestas diam in arcu cursus. Pretium lectus quam id leo in vitae turpis massa sed. Sed vulputate mi sit amet mauris commodo quis. Elementum eu facilisis sed odio morbi quis commodo odio. Sed viverra ipsum nunc aliquet bibendum enim facilisis gravida neque. Id interdum velit laoreet id donec ultrices. Aenean et tortor at risus viverra adipiscing at. Gravida rutrum quisque non tellus orci ac auctor. Lobortis mattis aliquam faucibus purus in massa tempor nec feugiat. Tempus imperdiet nulla malesuada pellentesque elit eget gravida cum. Justo donec enim diam vulputate. Pharetra vel turpis nunc eget lorem dolor sed viverra. Neque gravida in fermentum et. Vel eros donec ac odio. Egestas egestas fringilla phasellus faucibus.Aliquam eleifend mi in nulla posuere sollicitudin aliquam ultrices sagittis. Sit amet cursus sit amet dictum sit amet justo donec.";
const view_index = "view_index";
const max_docs = 10000;

async function search(term, descending) {
  const resultView = gel('results');
  resultView.classList.add('hide');

  term = term.trim();

  if (term.length) {
    let start = Date.now();
    term = term.toLocaleLowerCase();
    const result = await db.query(view_index, {
      include_docs: false,
      reduce: false,
      descending: descending,
      startkey: descending ? term + '\uFFF0' : term,
      endkey: descending ? term : term + '\uFFF0'
    });

    const mark = Date.now() - start;
    const matches = `<div>${uniquify(result.rows).join("<br/>")}</div>`;
    const metrics = `<div>Matches: ${result.rows.length}/${mark}ms</div>`;
    resultView.innerHTML = `<div>${metrics}${matches}</div>`
    resultView.classList.remove('hide');
  }
}
// gather up unique words that matched.  This is sloppy!
function uniquify(rows) {
  let arr = [];
  rows.forEach(row => {
    // note we're displaying the case-sensitive word!
    if (arr.indexOf(row.value[1]) === -1) arr.push(row.value[1]);
  });
  return arr;
}

//
// pouchdb boilerplate code
//
const gel = id => document.getElementById(id);

let db;

// init example db instance
async function initDb() {

  const start = Date.now();

  db = new PouchDB('test', {
    adapter: 'memory'
  });

  // declare view_index
  const ddoc = {
    _id: '_design/' + view_index,
    views: {}
  };
  ddoc.views[view_index] = {
    map: function(doc) {
      const regex = /[\s\.;]+/gi;
      ['name', 'category'].forEach(field => {
        if (doc[field]) {
          const words = doc[field].replaceAll(regex,
            ',').split(',');
          words.forEach(word => {
            word = word.trim();
            if (word.length) {
              emit(word.toLocaleLowerCase(), [field, word]);
            }
          });
        }
      });
    }.toString()
  };
  // install the map/reduce design doc
  await db.put(ddoc);

  // insert the docs. 
  console.log(`adding ${max_docs} documents...`);
  await db.bulkDocs(getDocsToInstall());

  // force the index to build
  console.log(`Waiting for index '${view_index}' to build...`);
  await db.query(view_index, {
    reduce: true
  });

  console.log(`db inited in ${Date.now()-start}ms`);
}



// canned test documents
function getDocsToInstall() {
  const docs = new Array(max_docs);
  const words = LoremIpsum.split(' ');
  const getWord = () => {
    return words[Math.floor(Math.random() * Math.floor(words.length))];
  }
  for (let i = 0; i < docs.length; i++) {
    docs[i] = {
      name: getWord(),
      category: getWord()
    };
  }
  return docs;
}


initDb().then(() => {
  // setup input listener
  const searchFn = () => {
    search(gel('term').value, gel('sort').checked);
  };
  gel('term').addEventListener('input', searchFn);
  gel('sort').addEventListener('input', searchFn);
  gel('view').classList.remove('hide')
});
.hide {
  display: none
}

.label {
  text-align: right;
  margin-right: 1em;
}

.hints {
  font-size: smaller;
}
<script src="//cdn.jsdelivr.net/npm/pouchdb@7.1.1/dist/pouchdb.min.js"></script>
<script src="https://github.com/pouchdb/pouchdb/releases/download/7.1.1/pouchdb.memory.min.js"></script>
<table id='view' class='hide'>
  <tr>
    <td class='label'>
      <label for='term'>Term</label>
    </td>
    <td>
      <input id='term' type='text' />
    </td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td class='hints'><em>hint:</em> Lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit
    </td>
  </tr>
  <tr>
    <td class='label'>
      <label for='sort'>Sort Desc</label>
    </td>
    <td>
      <input id='sort' type='checkbox' />
    </td>
  </tr>
</table>
<div style='margin-top:2em'></div>
<hr/>
<div id='results' class='hide'>
</div>

[1] PouchDB Guide - Further reading




Answer 2:


We had the same issue using pouchdb-find and secondary indexes in PouchDB.

Our PouchDB had 5K items, and searching by term was really slow (around 20 s on low-performance devices).

We then decided to switch our search to the allDocs function, taking advantage of our document ids. We wanted to search by term only in documents with type: "product", whose ids look like "product::{someProductId}". So we went with something like the following, and it greatly improved our search performance (down to about 3 s on low-performance devices):

db.allDocs({
    include_docs: true,
    attachments: false,
    startkey: "product::", // search only in docs starting with product::
    endkey: "product::\ufff0"
  })
  .then(function(result) {
    const rows = result.rows.filter(
      i =>
        i.doc &&
        i.doc.name &&
        (i.doc.name.toLowerCase().indexOf(term.toLowerCase()) > -1 ||
         i.doc.category.toLowerCase().indexOf(term.toLowerCase()) > -1)
    );
    return rows;
  });



Answer 3:


I've built a demo: https://jsbin.com/ralemisufo/1/edit?js,output Note that for shorter code I've elided all error handling and async behavior from the snippets in this answer.

PouchDB (and CouchDB) are not optimized for arbitrary queries; they shine when they can use a precomputed index. Unfortunately, a regular expression is always an arbitrary query, so no index whatsoever can be used (Docs). You have to structure your data access patterns in a way the database can help with.

The demo stores (upon "Init") 10,000 documents, identified by randomized words, in a fresh database. Each document already contains a lower-cased version of the identifying name:

const docs = [];
for (let i = 0; i < 10000; i++) {
  const name = randomizedWord();
  docs.push({
    _id: randomizedWord(),
    name: name,
    lname: name.toLowerCase(),
    position: i,
  });
}

(For efficient insertion, the bulk API is used).

When the data is inserted, a Mango index for the name is created; I assume you have code for this already, since PouchDB would otherwise throw an error when sorting the data. Afterwards, an additional Mango index for the lower-cased name is created, taking a few extra seconds. Keep in mind that this time is spent only once: subsequent inserts automatically keep the index up to date:

db.createIndex({
  index: {fields: ['name']}
});
db.createIndex({
  index: {fields: ['lname']}
});

We can now query the data with a trick: a prefix match can be simulated in PouchDB and CouchDB by combining $gte with the prefix and $lt with the prefix plus the highest possible character. For example, all documents with a name starting with xa have a name greater than or equal to xa and lower than xa\uffff (Collation Rules):

db.find({
  selector: {lname: {
    $gte: "xa",
    $lt: "xa\uffff",
  }},
  sort: ["lname"],
})

On my machine, the "Faster" version usually takes <100 ms, while the "Question" version clocks in at around 2000 ms. Of course, this is an uncontrolled micro-benchmark, but it may be a starting point for further investigation. Let me know if you have additional questions about PouchDB and CouchDB indexing.



Source: https://stackoverflow.com/questions/58999498/pouch-db-fast-search
