analysis | 易学教程

Ruby Text Analysis

Question: Is there a Ruby gem, or anything else, for text analysis? Word frequency, pattern detection, and so forth (preferably with an understanding of French).

Answer 1: The generalization of word frequencies is the language model, e.g. uni-grams (single-word frequencies), bi-grams (frequencies of word pairs), tri-grams (frequencies of word triples), and in general n-grams. You should look for an existing toolkit for language models; it is not a good idea to reinvent the wheel here. There are a few standard toolkits
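To make the n-gram idea concrete, here is a minimal sketch in Python (not one of the toolkits the answer refers to). The function name `ngram_counts` and the sample sentence are made up for illustration, and the tokenizer is a deliberately naive assumption: it lowercases the text and keeps runs of letters, accented ones included.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams (tuples of n consecutive words) in text."""
    # Naive tokenizer (an assumption): lowercase, keep runs of Unicode letters,
    # so accented French characters such as "é" or "ç" stay inside words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Slide a window of length n over the word list.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams)


sample = "le chat dort et le chat mange"
unigrams = ngram_counts(sample, 1)   # single-word frequencies
bigrams = ngram_counts(sample, 2)    # word-pair frequencies
print(unigrams.most_common(2))
print(bigrams[("le", "chat")])       # -> 2
```

A real language-model toolkit adds smoothing and proper tokenization on top of raw counts like these, which is why the answer advises against rebuilding one by hand.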

How are exponents calculated?

I'm trying to determine the asymptotic run-time of one of my algorithms, which uses exponents, but I'm not sure how exponents are calculated programmatically. I'm specifically looking for the pow() algorithm used for double-precision floating-point numbers. I've had a chance to look at fdlibm's implementation. The comments describe the algorithm used:

    Method: Let x = 2^n * (1+f)
      1. Compute and return log2(x) in two pieces:
             log2(x) = w1 + w2,
         where w1 has 53-24 = 29 bit trailing zeros.
      2. Perform y*log2(x) = n+y' by simulating multi-precision
         arithmetic, where |y'|<=0.5.
      3.
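The shape of steps 1 and 2 can be sketched in plain double precision. This is not fdlibm's code: it skips the two-piece log2 and the simulated multi-precision arithmetic, so it is less accurate. The final recombination, x**y = 2**n * 2**y', is the mathematical identity assumed here, not a quotation of the truncated step 3. The name `pow_via_log2` is hypothetical.

```python
import math


def pow_via_log2(x, y):
    """Approximate x**y for x > 0 as 2**(y * log2(x)), split as 2**n * 2**y'."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    t = y * math.log2(x)      # y*log2(x), in ordinary double precision
    n = round(t)              # integer part n, leaving |y'| <= 0.5
    y_prime = t - n
    # Scaling by 2**n is an exact exponent adjustment (ldexp);
    # only the small remainder y' needs a real exponential.
    return math.ldexp(2.0 ** y_prime, n)


print(pow_via_log2(2.0, 10.0))
print(pow_via_log2(9.0, 0.5), math.pow(9.0, 0.5))
```

In this sketch every step is a fixed amount of work on fixed-width doubles, whatever the values of x and y.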

Search times for binary search tree

Does anyone know how to figure out the search time for a binary search tree (i.e. worst case, best case, and average case)?

For a non-self-balancing tree (possible but unusual for a search tree), the worst case is O(n), which occurs for the degenerate binary tree (a linked list). In this case, you have to search, on average, half the list before finding your desired element. The best case is O(log_2 n) for a perfectly balanced tree, since you cut the search space in half at every tree level. The average case is somewhere in between those two and depends entirely on the data :-) Since you rarely get to control the
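The two extremes described above can be checked with a small experiment. The sketch below builds an unbalanced BST twice from the same 1023 keys, once in sorted order (degenerate) and once midpoints-first (perfectly balanced), and counts the nodes visited by a search. All names (`Node`, `insert`, `search_steps`, `balanced_order`) are illustrative, not from any library.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into a plain (non-self-balancing) BST; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to do


def search_steps(root, key):
    """Return the number of nodes visited while searching for key."""
    steps, node = 0, root
    while node is not None:
        steps += 1
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return steps


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


def balanced_order(lo, hi):
    """Yield lo..hi-1 midpoints first, which produces a balanced tree."""
    if lo >= hi:
        return
    mid = (lo + hi) // 2
    yield mid
    yield from balanced_order(lo, mid)
    yield from balanced_order(mid + 1, hi)


n = 1023
degenerate = build(range(n))             # sorted inserts: effectively a linked list
balanced = build(balanced_order(0, n))   # perfect tree with 10 levels
print(search_steps(degenerate, n - 1))   # -> 1023
print(search_steps(balanced, n - 1))     # -> 10
```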

Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collection of text

Question: I'm working on a project where I need to analyze a page of text, and collections of pages of text, to determine dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, is there an algorithm, or several, that would achieve my goals below? What I want to do is similar to the word clouds built from a URL or RSS feed that you find on the web, except I don't want the visualization. They are used all the time for analyzing the

Recurrence T(n) = T(n^(1/2)) + 1

I've been looking at this recurrence and wanted to check if I was taking the right approach.

T(n) = T(n^(1/2)) + 1
     = T(n^(1/4)) + 1 + 1
     = T(n^(1/8)) + 1 + 1 + 1
     ...
     = 1 + 1 + 1 + ... + 1 (a total of sqrt(n) times)
     = n^(1/2)

So the answer would come to a theta bound of n^(1/2).

Hint: assume n = 2^(2^m), i.e. m = log_2 log_2 n, and you know that 2^(2^(m-1)) * 2^(2^(m-1)) = 2^(2^m). So, if you define S(m) = T(n), your S will be:

S(m) = S(m-1) + 1 → S(m) = Θ(m) → S(m) = T(n) = Θ(log_2 log_2 n)

Extend it for the general case.

In a recursion like T(n) = T(n/2) + 1, in each iteration, we reduce the height of the tree to half. This
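The Θ(log_2 log_2 n) bound from the hint can be sanity-checked numerically. The sketch below counts how many times the recurrence unfolds for n = 2^(2^m); the base case "stop once n <= 2" is an assumption, since the question does not state one, and `sqrt_steps` is a made-up name.

```python
import math


def sqrt_steps(n):
    """Count how many times T(n) = T(sqrt(n)) + 1 unfolds before n <= 2."""
    steps = 0
    while n > 2:              # assumed base case: T(n) is constant for n <= 2
        n = math.isqrt(n)     # integer square root, exact for n = 2**(2**m)
        steps += 1
    return steps


for m in range(1, 7):
    n = 2 ** (2 ** m)
    # Columns: m, number of unfoldings, log2(log2(n)); all three agree.
    print(m, sqrt_steps(n), math.log2(math.log2(n)))
```

For n = 2^64 the loop runs 6 times, not 2^32 times, which is the gap between log_2 log_2 n and the n^(1/2) guess in the question.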

Find out the real file type

I am working on an ASP web page that handles file uploads. Only certain types of files are allowed to be uploaded, like .XLS, .XML, .CSV, .TXT, .PDF, .PPT, etc. I have to decide whether a file really has the type its extension claims. In other words, if a trojan.exe was renamed to harmless.pdf and uploaded, the application must be able to find out that the uploaded file is NOT a .PDF file. What techniques would you use to analyze these uploaded files? Where can I get the best information about the format of these files?

One way would be to check for certain signatures or magic numbers in the
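A signature check of the kind the answer starts to describe might look like the sketch below. It is written in Python for brevity, not in ASP, and the table is deliberately tiny: `%PDF-` is the PDF header and `D0 CF 11 E0 A1 B1 1A E1` is the OLE2 compound-file header used by legacy .xls and .ppt files. The function name and the three-way return value are assumptions made for illustration.

```python
import os

# Leading-byte signatures for a few of the allowed types (extend as needed).
OLE2 = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy Office container
SIGNATURES = {
    ".pdf": [b"%PDF-"],
    ".xls": [OLE2],
    ".ppt": [OLE2],
}


def extension_matches_content(filename, head):
    """Compare the first bytes of an upload with what its extension implies.

    Returns True or False when a signature is known for the extension, and
    None for types without a magic number (plain-text formats like .csv, .txt).
    """
    ext = os.path.splitext(filename)[1].lower()
    signatures = SIGNATURES.get(ext)
    if signatures is None:
        return None
    return any(head.startswith(sig) for sig in signatures)


# A Windows executable starts with b"MZ", so a renamed .exe fails the check.
print(extension_matches_content("harmless.pdf", b"MZ\x90\x00\x03"))  # -> False
print(extension_matches_content("report.pdf", b"%PDF-1.4\n"))        # -> True
print(extension_matches_content("notes.txt", b"hello"))              # -> None
```

A matching signature is necessary but not sufficient: it shows that the file begins like a PDF, not that it is a well-formed or harmless one.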

Where is the Query Analyzer in SQL Server Management Studio 2008 R2?

I have some SQL that's getting run, and it is taking too long to return the results / parse / display, etc. in an ASP.NET C# application. I have SQL Server Management Studio 2008 R2 installed to connect to a remote SQL Server 2000 machine. Is there a Query Analyzer or profiler I can use to see what's going on? I'm not sure if I'm sending too many requests, if the requests are taking too long, if there are additional indexes I can add to speed things up, etc. EDIT: Are there any free tools out there that are replacements for the Microsoft tools?

Default locations: Programs > Microsoft SQL Server 2008 R2 > SQL

Difference between average case and amortized analysis

I am reading an article on amortized analysis of algorithms. The following is a text snippet. Amortized analysis is similar to average-case analysis in that it is concerned with the cost averaged over a sequence of operations. However, average case analysis relies on probabilistic assumptions about the data structures and operations in order to compute an expected running time of an algorithm. Its applicability is therefore dependent on certain assumptions about the probability distribution of algorithm inputs. An average case bound does not preclude the possibility that one will get “unlucky”
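One standard illustration of the distinction, not taken from the quoted article, is appending to an array that doubles its capacity when full. The cost model below is an assumption: 1 unit per write plus 1 unit per element copied on reallocation. No probability distribution is involved; the bound holds for every sequence of appends, which is what separates an amortized bound from an average-case one.

```python
def append_costs(num_appends):
    """Return the cost of each append into an array that doubles when full.

    Assumed cost model: 1 unit to write the new element, plus 1 unit per
    existing element copied when the array has to be reallocated.
    """
    capacity, size, costs = 1, 0, []
    for _ in range(num_appends):
        cost = 1
        if size == capacity:
            cost += size       # copy every existing element to the new array
            capacity *= 2
        size += 1
        costs.append(cost)
    return costs


costs = append_costs(1000)
print(max(costs))                # -> 513: one individual append is expensive
print(sum(costs) / len(costs))   # -> 2.023: the per-operation average stays below 3
```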

Showing an image with pylab.imshow()

Question: I'm relatively new to all this, and I started working through the image analysis tutorial here: http://www.pythonvision.org/basic-tutorial I have installed all the modules, but I didn't get very far before hitting a snag. When I try to perform the pylab.imshow(dna) step, it returns the following error:

    In [10]: pylab.imshow(dna)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-10-fc86cadb4e46> in <module>()
    ----> 1