Network-efficient difference between two strings in Javascript

£可爱£侵袭症+ 提交于 2019-12-06 01:53:55

I'm not very familiar with this topic but I can point you to an open source (Apache License 2.0) project which may be very useful.

It is a Diff, Match and Patch library written in several languages, including JavaScript, from a Google engineer and it is used in several online collaborative editing services.

Here are a list of resources:

A simple approach, assuming that you know the copy on the server isn't going to change, would just be to send a list of edits (deletions and additions), with the deletions represented as a start and end index, and the additions represented as a start index and the text to insert.

If you have more than a simple diff algorithm to work with (I'm not sure exactly what you mean by "string comparison is not the issue"), you could also detect moved or copied chunks of text, and send those as the start and end index of the moved or copied piece of text, as well as the destination to insert it.

Note that you'll need to make sure to keep track of whether your indices refer to the original document, or the document as edited so far. An easy approach to avoid this problem is to always perform the edits from the end of the document towards the beginning; then earlier edits won't affect the offsets specified by later edits.

For an example of an approach like this, see the ed format that diff -e outputs. This is basically input that could be fed into the ed line-oriented text editor. If you want the absolute smallest diffs to send across you may want to do character based indexing rather than line based indexing, but the same basic approach could work.

Any edits the user's performing can be efficiently broken down into: delete from X for length Y; insert at X text "whatever". X and Y are offsets in characters from the start of the text; Y is a number of characters; "whatever" is any string of characters. You say you need no help computing the diff, but an example is here, except it's richer in its output than you need, but does identify "removals and insertions", so, just change the output part.

The exact format in which you send the data to the server can be tuned, but I don't think there's much mileage in doing that -- pending measurement, I'd start by sending the commands as D for delete or I for insert, the numbers in decimal, the inserted string in quoted form. Once you have some statistics on actual transfers being performed, you can see how much overhead is in the numbers (decimal vs binary) and quotes, but I suspect that may not be all that meaningful (if it proves to be, there are all sort of things you can try, such as giving offsets from the latest point of insertion or deletion, rather than always from the start, to make things faster).

You can sample what the user is doing every few seconds, and just send the incremental changes over those last few seconds (if any) -- this way, each packet you're sending will be small, and if the net connection or the user's computer/browser crash, the user won't have lost much work.

You could just send changes every 500ms, so, whatever changes were made in the last 500ms would be sent, but you only send data when there was a change.

In this you could then send the position of the changed word(s) and just send the entire word, but I would have the position be from the front of the text.

It won't be several sentences worth, but there may be several words involved, but, if you send them in order of change then the result should be consistent.

Because there are so many ways to do edits--even within short periods of time like 500ms--including dragging and dropping, or cutting and pasting, large sections of text around within the document or from outside it--I don't know if there's going to be something that will cover all scenarios really well. This is certainly a non-answer to your question at face value, but I would consider carefully the trouble of developing and maintaining something like this compared to changing the interface to restrict the text size and breaking existing texts into smaller pieces.

Maybe that's not possible in your situation, but if it is, I would guess it would be much less trouble in the end to dodge the issue in this way and just send full documents after an edit.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!