How to split a string into words. Ex: “stringintowords” -> “String Into Words”?

粉色の甜心

What is the right way to split a string into words? (The string doesn't contain any spaces or punctuation marks.)

For example: "stringintowords" -> "String Into Words"

13 answers

情歌与酒

2020-11-29 20:32

This can actually be done (to a certain degree) without a dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, train an unsupervised segmentation algorithm (e.g. Morfessor) on it, and apply the learned model to new domain names. I'm not sure how well it would work, though (but it would be interesting).

心在旅途

2020-11-29 20:38

There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.

In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation overall, you don't need to evaluate every one of these against subsequent characters: you only need to continue evaluating the most likely one.
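To make the idea concrete, here is a minimal Python sketch of a Viterbi-style segmenter under a unigram model, where each word is scored independently of its neighbours. The probability table `WORD_PROB` and the function name `viterbi_segment` are made up for illustration; a real model would estimate the probabilities from a corpus.

```python
import math

# Hypothetical unigram probabilities; a real model would be estimated from a corpus.
WORD_PROB = {
    "string": 0.2, "into": 0.3, "words": 0.2, "in": 0.1,
    "to": 0.1, "word": 0.05, "s": 0.001, "ring": 0.049,
}
MAX_WORD_LEN = max(len(w) for w in WORD_PROB)


def viterbi_segment(text):
    """Return (log_probability, words) for the most likely segmentation of text."""
    n = len(text)
    # best[i] = (log-prob of the best segmentation of text[:i],
    #            start index of the last word in that segmentation)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in WORD_PROB and best[start][0] > -math.inf:
                score = best[start][0] + math.log(WORD_PROB[word])
                if score > best[end][0]:
                    # Only the best way of reaching `end` is kept.
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return -math.inf, []  # no segmentation into known words
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return best[n][0], words[::-1]


print(viterbi_segment("stringintowords")[1])  # ['string', 'into', 'words']
```

Keeping a single best entry per prefix is what avoids re-examining every earlier segmentation when the next characters are read.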

情话喂你

2020-11-29 20:40

Your best bet would be to compare a substring starting at position 0 with a dictionary and, when you find a match, extract that word and start a new dictionary search from that point... but it's going to be very error-prone, and you'll have issues with plurals and apostrophes (sinks, sink's), and other parts of speech.

EDIT

Would "singleemotion" become "single emotion" or "sin glee motion"?
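A small Python sketch of this greedy scan shows how the ambiguity arises. The toy `DICTIONARY`, the function name `greedy_split`, and the `longest` flag are all assumptions made for the example.

```python
# Hypothetical toy dictionary; a real one would be far larger.
DICTIONARY = {"sin", "single", "glee", "emotion", "motion"}


def greedy_split(text, longest=False):
    """Repeatedly strip a dictionary word off the front of text.

    With longest=False the shortest matching prefix wins;
    with longest=True the longest matching prefix wins.
    Returns None if no dictionary word starts at the current position.
    """
    words, pos = [], 0
    while pos < len(text):
        lengths = range(1, len(text) - pos + 1)
        if longest:
            lengths = reversed(lengths)
        for length in lengths:
            candidate = text[pos:pos + length]
            if candidate in DICTIONARY:
                words.append(candidate)
                pos += length
                break
        else:
            return None  # the greedy scan is stuck; it never backtracks
    return words


print(greedy_split("singleemotion"))                # ['sin', 'glee', 'motion']
print(greedy_split("singleemotion", longest=True))  # ['single', 'emotion']
```

Both outputs consist only of dictionary words, so the dictionary alone cannot say which one was intended.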

[愿得一人]

2020-11-29 20:42
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.

You could look at this problem as choosing where to split the string, which also means choosing how many splits there will be. So there are Sum(i = 0 to n - 1, (n - 1) choose i) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n-1).
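The count is easy to check by brute force. The following Python sketch (the function name `all_splittings` is made up here) picks every subset of the n-1 gaps between characters and cuts there:

```python
from itertools import combinations


def all_splittings(text):
    """Yield every way to cut a non-empty text into contiguous non-empty pieces."""
    n = len(text)
    gaps = range(1, n)  # the n-1 positions between adjacent characters
    for k in range(n):  # k = number of cuts, from 0 to n-1
        for cuts in combinations(gaps, k):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]


splits = list(all_splittings("cat"))
print(splits)       # [['cat'], ['c', 'at'], ['ca', 't'], ['c', 'a', 't']]
print(len(splits))  # 4
```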

Granted, a lot of this computation rests on common subproblems, so dynamic programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations in the worst case, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i_0, j_0, i_1, j_1, ...) with the condition that j_k = i_(k+1).
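As a rough Python sketch of that matrix, the code below assumes half-open index pairs, so `M[i][j]` describes `text[i:j]` and the end of one word is the start of the next, matching the j_k = i_(k+1) condition. The toy `DICTIONARY` and both function names are assumptions for the example.

```python
# Hypothetical toy dictionary for illustration.
DICTIONARY = {"sin", "single", "glee", "emotion", "motion"}


def word_matrix(text, dictionary):
    """Build M where M[i][j] is True iff text[i:j] is a dictionary word (0 <= i < j <= n)."""
    n = len(text)
    return [[i < j and text[i:j] in dictionary for j in range(n + 1)]
            for i in range(n + 1)]


def segmentations(text, matrix, start=0):
    """Yield every word list covering text[start:], dropping a branch
    as soon as its next piece is not a word."""
    if start == len(text):
        yield []
        return
    for end in range(start + 1, len(text) + 1):
        if matrix[start][end]:
            for rest in segmentations(text, matrix, end):
                yield [text[start:end]] + rest


M = word_matrix("singleemotion", DICTIONARY)
print(list(segmentations("singleemotion", M)))
# [['sin', 'glee', 'motion'], ['single', 'emotion']]
```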

If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records, for each character of the preprocessed string, how many spaces precede it in the original string, like so:
```
Needle:       isashort    
Haystack:     This is a short phrase    
Preprocessed: thisisashortphrase   
NumSpaces   : 000011233333444444 
```
And your answer would come from:
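```
location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
// Count spaces only up to the last matched character, at index location + length - 1
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)  // arguments are (start, length)
```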
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
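For reference, a runnable Python sketch of the same NumSpaces lookup might look like this. It assumes plain ASCII text in which only the space character counts as whitespace, and the function name `find_original_phrase` is invented for the example.

```python
def find_original_phrase(needle, haystack):
    """Find needle (lowercase, no spaces) in haystack and return the matching
    original text with its spaces and capitalization, or None if absent."""
    preprocessed = []  # lowercase haystack characters with spaces removed
    num_spaces = []    # num_spaces[k] = spaces before preprocessed[k] in haystack
    spaces_seen = 0
    for ch in haystack:
        if ch == " ":
            spaces_seen += 1
        else:
            preprocessed.append(ch.lower())
            num_spaces.append(spaces_seen)
    location = "".join(preprocessed).find(needle) if needle else -1
    if location == -1:
        return None
    last = location + len(needle) - 1  # index of the last matched character
    start = location + num_spaces[location]
    end = last + num_spaces[last] + 1  # exclusive end index in the original
    return haystack[start:end]


print(find_original_phrase("isashort", "This is a short phrase"))  # is a short
print(find_original_phrase("madduckets", "Welcome to Mad Duckets!"))  # Mad Duckets
```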

执念已碎

2020-11-29 20:43

Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate

S[i] = isWord(w[1..i]) or (for some j in {2..i}: S[j-1] and isWord(w[j..i])).

This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
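A minimal Python sketch of this table, using 0-based half-open slices instead of the 1-based indices above, could look as follows. The toy `DICTIONARY` stands in for a real `isWord`, and storing the winning split point in place of a plain Boolean is the "store the winning split" variant.

```python
# Hypothetical toy dictionary standing in for a real isWord lookup.
DICTIONARY = {"string", "into", "words", "in", "to"}


def is_word(w):
    return w in DICTIONARY


def split_into_words(w):
    """Return one splitting of w into dictionary words, or None if none exists."""
    n = len(w)
    # split_at[i] is the start index of the last word in some valid splitting
    # of w[:i], or None if w[:i] cannot be split (so it doubles as S[i]).
    split_at = [None] * (n + 1)
    split_at[0] = 0  # the empty prefix is trivially splittable
    for i in range(1, n + 1):
        for j in range(i):
            if split_at[j] is not None and is_word(w[j:i]):
                split_at[i] = j
                break
    if split_at[n] is None:
        return None
    # Follow the stored split points backwards to rebuild the words.
    words, i = [], n
    while i > 0:
        j = split_at[i]
        words.append(w[j:i])
        i = j
    return words[::-1]


words = split_into_words("stringintowords")
print(" ".join(word.capitalize() for word in words))  # String Into Words
```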

萌比男神i

2020-11-29 20:43

Actually, with the dictionary this problem can be solved in O(n) time. More precisely in (k + 1) * n at worst, where n is the number of characters in the string and k is the length of the longest word in the dictionary.

Besides, the algorithm allows you to skip junk.

Here's a working implementation in Common Lisp that I created some time ago: https://gist.github.com/3381522
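The gist itself is not reproduced here. As an independent illustration of how a (k + 1) * n bound can arise, the Python sketch below considers, at each position, one "skip this character as junk" option plus at most k candidate words ending there, and keeps the choice that skips the fewest characters. The toy `DICTIONARY`, the function name, and the fewest-junk criterion are assumptions, not necessarily what the Lisp code does, and the bound counts each dictionary lookup as a single step.

```python
# Hypothetical toy dictionary for illustration.
DICTIONARY = {"mad", "duck", "ducks"}
MAX_LEN = max(len(w) for w in DICTIONARY)  # k in the text above


def segment_skipping_junk(text):
    """Split text into dictionary words, skipping as few junk characters as possible."""
    n = len(text)
    # cost[i] = fewest junk characters needed to cover text[:i];
    # prev[i] = (previous index, word or None) for the last step taken to reach i.
    cost = [0] + [n + 1] * n
    prev = [None] * (n + 1)
    for i in range(1, n + 1):
        # Option 1: treat text[i-1] as junk.
        cost[i], prev[i] = cost[i - 1] + 1, (i - 1, None)
        # Options 2..k+1: a dictionary word of length 1..k ends at position i.
        for length in range(1, min(MAX_LEN, i) + 1):
            word = text[i - length:i]
            if word in DICTIONARY and cost[i - length] < cost[i]:
                cost[i], prev[i] = cost[i - length], (i - length, word)
    # Follow the back-pointers, collecting words and ignoring junk steps.
    words, i = [], n
    while i > 0:
        i, word = prev[i]
        if word is not None:
            words.append(word)
    return words[::-1]


print(segment_skipping_junk("madxducks"))  # ['mad', 'ducks']
```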
