extraction | 易学教程

How to extract links from a webpage using lxml, XPath and Python?

Question: I've got this XPath query:

    /html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on. However, I cannot seem to use it with lxml.

    from lxml import etree

    parsedPage = etree.HTML(page)  # Create parse tree from valid page.

    # XPath query
    hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
    for x in hyperlinks:
        print(x)  # Print links in <a> tags containing the title attribute
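One thing worth checking with a query like this: browsers such as Firefox add a <tbody> element to the DOM even when the page source has none, so a path that requires tbody can match in a browser tool and still return nothing on the raw markup. Whether that is the cause here is an assumption, since the page itself is not shown. The sketch below deliberately uses the standard library's html.parser rather than lxml, and a hypothetical SAMPLE_PAGE without tbody, to show the same "href of every a[title] inside a table cell" extraction without depending on tbody.

```python
from html.parser import HTMLParser


class TitledLinkCollector(HTMLParser):
    """Collect href values of <a title=...> tags that sit inside a <td>."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0  # number of <td> elements currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            attributes = dict(attrs)
            # Keep only links that carry both a title and an href.
            if "title" in attributes and "href" in attributes:
                self.hrefs.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


def extract_titled_links(page):
    collector = TitledLinkCollector()
    collector.feed(page)
    return collector.hrefs


# Hypothetical sample page: note there is no <tbody> in the raw markup.
SAMPLE_PAGE = """
<html><body><table>
<tr><td><a title="First" href="/one">one</a></td>
    <td><a href="/untitled">no title</a></td></tr>
<tr><td><a title="Second" href="/two">two</a></td></tr>
</table></body></html>
"""

links = extract_titled_links(SAMPLE_PAGE)
for link in links:
    print(link)
```

With lxml itself, the equivalent experiment would be to drop the tbody step from the path and compare the results.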

Temporal Extraction (i.e. Extract date/time entities from free form text) - How?

Question: Has anyone found a simple but effective way to extract date references from text? I've done a fair amount of searching for temporal extraction tools, but there isn't a lot out there. There are a few white papers, but the topic seems to fall into a subset of the whole semantic web thingy and hasn't been given much attention. I'm just looking for something that is 80% effective. There is no need to capture things like "the month after Jan 2009", but basic common date entities would be nice. I'm open to all
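For the "80% effective" case described above, a handful of regular expressions already goes a long way. The following is only a minimal sketch: the pattern set, the names MONTHS, DATE_PATTERN and extract_dates, and the sample sentence are all illustrative assumptions, and relative expressions such as "the month after Jan 2009" are intentionally out of scope.

```python
import re

# Month names and their common abbreviations.
MONTHS = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
    r"|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?"
    r"|Nov(?:ember)?|Dec(?:ember)?)"
)

DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"                      # 2009-01-15
    r"|\d{1,2}/\d{1,2}/\d{2,4}"               # 1/15/2009
    rf"|{MONTHS}\.?\s+\d{{1,2}},?\s+\d{{4}}"  # Jan 15, 2009
    rf"|\d{{1,2}}\s+{MONTHS}\.?,?\s+\d{{4}}"  # 15 January 2009
    rf"|{MONTHS}\.?\s+\d{{4}}"                # Jan 2009
    r")\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return every substring of text that looks like a common date."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


sample = (
    "We met on Jan 15, 2009 and again on 2009-03-02; "
    "the report is due 4/1/2010, with a review in March 2011."
)
found = extract_dates(sample)
print(found)
```

The sketch only finds date-like strings; it does not validate them (for example, it would accept 13/45/2009) or convert them to date objects.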

C# Selenium QA - how to extract the value of custom attribute

Question: I have a complex DOM structure with dynamically changing values of the HTML id attributes. Selenium returns "null" when I attempt to extract the values of the "custom" attributes. I need to extract the value of parentId (namely, I need the "qqq-bbb", which is different on every reload).

    // Selenium (C#): evaluates to "null"
    string someStringName = someIWebElement.GetAttribute("parentId");

    // HTML tag (example): in Chrome the parentId is visible and unique
    <g id="333-aaa" parentId="qqq-bbb

How to extract data from a file in C

Question: I have a .dat file containing 6 columns of N numbers like so:

    -4.997740e-01 -1.164187e+00  3.838383e-01  6.395961e+01 -1.938013e+02 -4.310365e-02
    -1.822405e+00  4.470735e-01 -2.691410e-01 -8.528020e+01 -1.358874e+02 -7.072167e-01
     9.932887e-01 -2.157249e+00 -2.303825e+00 -5.508925e+01 -3.548236e+02  1.250405e+00
    -1.871123e+00  1.505421e-01 -6.550555e-01 -3.254452e+02 -5.501001e+01  8.776851e-01
     1.370605e+00 -1.028076e+00 -1.137059e+00  6.096598e+01 -4.472264e+02 -1.268752e+00
    ............ ...........
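The question asks about C, but the parsing logic is the same in any language: read a line, split it on whitespace, convert six fields to floating point, and skip anything that does not fit. As a language-neutral illustration, here is a minimal Python sketch of that loop. The function name parse_columns and the inline sample (the first two rows shown above) are assumptions for illustration; in C the same structure would typically be a loop around a line reader and a six-field numeric conversion.

```python
def parse_columns(text, expected_columns=6):
    """Parse whitespace-separated numeric rows, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != expected_columns:
            continue  # blank line, filler dots, or wrong column count
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            continue  # a field was not a valid number
    return rows


# First two rows of the sample data, followed by a filler line.
SAMPLE = """\
-4.997740e-01 -1.164187e+00 3.838383e-01 6.395961e+01 -1.938013e+02 -4.310365e-02
-1.822405e+00 4.470735e-01 -2.691410e-01 -8.528020e+01 -1.358874e+02 -7.072167e-01
............ ...........
"""

rows = parse_columns(SAMPLE)
columns = list(zip(*rows))  # transpose: one tuple per column
print(len(rows), "rows,", len(columns), "columns")
```

Reading from a real file would only change where the text comes from, for example by iterating over an open file object instead of a string.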

Microdata extraction from HTML in Java

Question: I really need help extracting Microdata embedded in HTML5. My goal is to get structured data from a webpage, just like this Google tool does: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but have not found a workable solution. Currently, I use the any23 library, but I can't find any documentation, only javadocs, which don't provide enough information for me. I use any23's Microdata Extractor but am getting stuck at the third parameter: "org.w3c.dom.Document in

Using scrapy to find specific text from multiple websites

I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't find how to add the specific keyword to search for. What the script needs to do is find the keyword and report the link in which it was found. Could anyone point me to where I could read more about this? I have been reading scrapy's documentation, but I can't seem to find this. Thank you.

    class FinalSpider(scrapy.Spider):
        name = "final"
        allowed_domains = ['example.com']
        start_urls = [URL % starting_number]

        def __init__(self):
            self.page_number = starting_number

Extract Video frames in Android

Question: I wanted to know if it is possible to extract frames from a running video in Android. I need to extract frames at regular intervals and send them for further processing. Would someone be able to find an answer for me? Thanks, Abhi

Answer 1: You can use MediaMetadataRetriever. I'm currently using it by calling:

    mediaMetadataRetriever.getFrameAtTime(timeUs, MediaMetadataRetriever.OPTION_CLOSEST);

This returns the frame's bitmap. Note that it is only supported since API Level 10.

Answer 2: You can use the code

How to extract comment out of header file using python, perl, or sed?

I have a header file like this:

    /*
     * APP 180-2 ALG-254/258/772 implementation
     * Last update: 03/01/2006
     * Issue date: 08/22/2004
     *
     * Copyright (C) 2006 Somebody's Name here
     * All rights reserved.
     *
     * Redistribution and use in source and binary forms, with or without
     * modification, are permitted provided that the following conditions
     * are met:
     * 1. Redistributions of source code must retain the above copyright
     *    notice, this list of conditions and the following disclaimer.
     * 2. Redistributions in binary form must reproduce the above copyright
     *    notice, this list of conditions and the
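For a header like the one above, where the comment of interest is the first /* ... */ block in the file, a non-greedy regular expression is usually enough. This is a minimal Python sketch under that assumption; the names BLOCK_COMMENT and first_block_comment and the SAMPLE_HEADER text are hypothetical, and the approach does not handle comment markers that appear inside string literals.

```python
import re

# Non-greedy match so we stop at the first closing "*/".
BLOCK_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)


def first_block_comment(source):
    """Return the text of the first /* ... */ comment, or None if absent."""
    match = BLOCK_COMMENT.search(source)
    if match is None:
        return None
    lines = []
    for line in match.group(1).splitlines():
        # Drop the leading " * " decoration that many headers use.
        lines.append(re.sub(r"^\s*\*? ?", "", line).rstrip())
    return "\n".join(lines).strip()


# Hypothetical header used only to exercise the function.
SAMPLE_HEADER = """/*
 * Example library header
 * Last update: 03/01/2006
 *
 * All rights reserved.
 */
#ifndef EXAMPLE_H
#define EXAMPLE_H
int example(void); /* trailing comment */
#endif
"""

comment = first_block_comment(SAMPLE_HEADER)
print(comment)
```

To collect every block comment instead of the first, BLOCK_COMMENT.findall(source) returns the inner text of each match.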

What is the best perl module to extract text from a pdf? [closed]

Question: As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 7 years ago. What is the best way to extract text from a PDF?

Answer 1: The CAM::PDF module is pretty useful for extracting text and maintaining some

How do I extract ecdf values out of ecdfplot()

If I use the ecdfplot() function of the latticeExtra package, how do I get the actual values calculated, i.e. the y-values which correspond to the ~x|g input? I've been looking at ?ecdfplot but there's no description of it. For the usual high-level function ecdf() it works with the command plot=FALSE, but this does not work for ecdfplot(). The reason I want to use ecdfplot() rather than ecdf() is that I need to calculate the ecdf() values for a grouping variable. I know I could do this by hand too, but I'm quite convinced that there is a high road as well. Here is a small example:

    u <- rnorm(100,0,1)
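Independently of latticeExtra, the values being asked for are simple to compute by hand: within each group, the ECDF at x is the fraction of that group's observations that are less than or equal to x. The sketch below shows that calculation in Python rather than R. It is not the ecdfplot() API, and the function name ecdf_by_group and the sample data are assumptions for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def ecdf_by_group(values, groups):
    """Return {group: [(x, F(x)), ...]} with F the per-group empirical CDF."""
    grouped = defaultdict(list)
    for value, group in zip(values, groups):
        grouped[group].append(value)

    result = {}
    for group, members in grouped.items():
        ordered = sorted(members)
        n = len(ordered)
        # F(x) = (number of observations <= x) / n, one point per distinct x.
        result[group] = [
            (x, bisect_right(ordered, x) / n) for x in sorted(set(ordered))
        ]
    return result


values = [3, 1, 2, 10, 20]
groups = ["a", "a", "a", "b", "b"]
curves = ecdf_by_group(values, groups)
for group, points in curves.items():
    print(group, points)
```

The same per-group calculation in R would amount to applying ecdf() to each subset of the data.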