data-extraction | 易学教程

Retrieving Data in Java

Question: I'm a Java novice. Is it possible to get data from a website and then store it in some sort of data structure? For example, the program gets the value of a stock from Yahoo Finance at a given time and stores it. Like I said, I'm not that proficient with Java, and I'd like to know whether this can be done. If it can, is it very hard to do?
Answer 1: public class GetYahooData { public ArrayList<JSONObject> getOutputFromUrl(String url) { ArrayList<JSONObject> output = new ArrayList<JSONObject>();

Why HTML code is different when parsing site using Jsoup than using browser

Question: I am on the website http://www.flashscore.com/nhl/ and I am trying to extract the links from the 'Today's Matches' table. I am trying it with the following code, but it does not work. Can you point out where the mistake is? final Document page = Jsoup .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1") .cookie("_ga","GA1.2.47011772.1485726144") .referrer("http://d.flashscore.com/x/feed/proxy-local") .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,

Having issue in extracting data from 100 web pages in a single loop

Question: I am sort of stuck. My goal is to extract data from a website that has several hundred pages. It's a sports website, and I have to extract the team names and other relevant data. So far I have been successful: I ran the loop for 6-7 pages and it works perfectly well. But when I change the loop to about a month (25), it retrieves incomplete data. For instance, if the destination date is 25 October, it may stop randomly at 10-12 October. I am using phpQuery and my internet connection is 1MB

Need to extract text messages out of an HTML document

Question: Hello, I have a long HTML document; this is only the part that interests me: <iframe class="goog-te-menu-frame skiptranslate" src="javascript:void(0)" frameborder="0" style="display: none; visibility: visible;"></iframe><div class="chatbox3"><div class="chatbox2"><div class="chatbox"><div class="logwrapper" style="top: 89px; margin-right: 168px;"><div class="logbox"><div style="position: relative; min-height: 100%;"><div class="logitem"><p class="statuslog">You're now chatting with a random
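
One possible way to pull the message text out of markup like this is to walk the document and keep the text of every `<p>` nested inside a `div` with class `logitem`. The sketch below uses Python's standard `html.parser` module. Only the class names `logbox`, `logitem`, and `statuslog` come from the fragment above; the sample HTML, the `msg` class, and the names `LogItemParser` and `extract_messages` are hypothetical, since the rest of the original document is not shown.

```python
from html.parser import HTMLParser

# Hypothetical fragment: only the class names logbox, logitem and statuslog
# come from the question; the text and the "msg" class are made up.
SAMPLE_HTML = (
    '<div class="logbox"><div class="logitem">'
    '<p class="statuslog">Status line</p></div>'
    '<div class="logitem"><p class="msg">Hello there</p></div></div>'
)


class LogItemParser(HTMLParser):
    """Collect the text of each <p> that sits inside a div.logitem."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: is it a logitem?
        self.in_p = False     # currently inside a <p> we want to keep
        self.current = []     # text pieces of the <p> being read
        self.messages = []    # finished message strings

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append("logitem" in classes)
        elif tag == "p" and any(self.div_stack):
            self.in_p = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "p" and self.in_p:
            self.in_p = False
            self.messages.append("".join(self.current).strip())

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)


def extract_messages(html):
    """Return the text of every <p> found inside a div.logitem."""
    parser = LogItemParser()
    parser.feed(html)
    parser.close()
    return parser.messages


print(extract_messages(SAMPLE_HTML))
```

If the real page marks messages differently (for example with other tags or classes), the tag and class checks in `handle_starttag` would need to change accordingly.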

ICR for machine printed text?

Question: I know ICR is basically used for handwritten (hand-printed) data recognition, but can we leverage ICR to extract distorted (bad-quality) machine-printed text by any chance? If not, what is the best way to solve the following problem? I have an unstructured document which may run to 2 or more pages; within the document there are a few date fields which will be handwritten. Now I want to convert this to a text file. I have tried some full-page OCR tools (OmniPage, ABBYY, etc.) which have ICR modules to

Extract Data from a Web Page - using VBA

Question: Using VBA, I need to extract data from a web page (http://emops.tse.com.tw/t21/sii/t21sc03_2011_9_e.htm). I am able to fetch all the data using the following code: With ActiveSheet.QueryTables.Add(Connection:="URL;http://emops.tse.com.tw/t21/sii/t21sc03_2012_2_e.htm", Destination:=Range("$A$1")) .Name = "67083361_zpid" .FieldNames = True .RowNumbers = False .FillAdjacentFormulas = False .PreserveFormatting = True .RefreshOnFileOpen = False .BackgroundQuery = True .RefreshStyle = xlInsertDeleteCells

Unable to extract date of birth from a given format

Question: I have a set of text files from which I have to extract the date of birth. The code below extracts the date of birth from most of the files but fails when the data is in the format below. How can I extract the DOB? The data is far from uniform. Data: data=""" Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM" Jacob's Date of birth is 9/15/1963 Name:Annie; DOB:10/30/1970 Code: import re pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?(?:\/|-)\d{2,4})).*
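
A minimal sketch of one way to handle the three sample lines, assuming the date of birth is always the first date that follows a "DOB" or "Date of birth" label on a line. The pattern, the name `extract_dobs`, and that assumption are illustrative and are not taken from the question's (truncated) code; the sample lines are adapted from the data above.

```python
import re

# Sample lines adapted from the question's data.
data = """Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970"""

# A label ("DOB" or "Date of birth"), then the shortest run of non-digits,
# then the first date written as M/D/Y or M-D-Y.
DOB_PATTERN = re.compile(
    r"(?:DOB|date\s+of\s+birth)\D*?(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    re.IGNORECASE,
)


def extract_dobs(text):
    """Return the first labelled date on each line; skip lines without one."""
    dobs = []
    for line in text.splitlines():
        match = DOB_PATTERN.search(line)
        if match:
            dobs.append(match.group(1))
    return dobs


print(extract_dobs(data))
# ['12/23/1955', '9/15/1963', '10/30/1970']
```

Matching `\D*?` between the label and the date skips filler such as "/Sex: " or " is " without running past the first date, which is why the visit timestamp on the first line is not picked up.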

Paradox database file

Question: I found Paradox database files with different extensions: db, mb, dat, px, XG0, XG1, XG2, XG3, XG4, YG0, YG1, YG2, YG3, and YG4. I have already found a way to open the db and px files using the Gnumeric spreadsheet, and I found some of the data I need in the db file. But the rest of the data is not in the db file, so I have to open the remaining files. I can't find software that can read them.
Answer 1: Unfortunately, Borland has never documented the

ruby: Extracting fields from nested json

Question: I am trying to teach myself Ruby and solve a problem at work. My ultimate goal is to extract three of the many fields in a JSON response from an API, manipulate them, and dump them to CSV for executive reporting. The structure of the JSON is: { "status": 200, "data": { "total": 251, "alerts": [ { "dataPoint": "x", "ackedBy": "x", "dataSourceInstance": "x", "dataSource": "x", "host": "x", "endOn": 0, "ackedOn": 1385085190, "dataSourceInstanceId": 588384, "hostId": 935, "type": "alert", "dataSourceId":
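
The extract-and-dump step can be sketched as follows. The question asks about Ruby; this sketch is in Python and is only meant to show the shape of the traversal (`data` → `alerts` → selected fields). The sample response is a hypothetical, shortened stand-in for the real API output, and the choice of `host`, `dataPoint`, and `ackedOn` as the three fields is an assumption, although all three names appear in the structure above.

```python
import csv
import io
import json

# Hypothetical response mirroring the visible part of the question's structure.
SAMPLE_RESPONSE = """
{
  "status": 200,
  "data": {
    "total": 2,
    "alerts": [
      {"dataPoint": "cpu", "host": "web01", "ackedOn": 1385085190, "type": "alert"},
      {"dataPoint": "disk", "host": "db01", "ackedOn": 0, "type": "alert"}
    ]
  }
}
"""

# Assumed choice of three fields; the question does not say which ones it needs.
FIELDS = ["host", "dataPoint", "ackedOn"]


def alerts_to_csv(response_text, fields=FIELDS):
    """Parse the response and return CSV text with one row per alert."""
    payload = json.loads(response_text)
    alerts = payload["data"]["alerts"]
    buffer = io.StringIO()
    # extrasaction="ignore" drops every key that is not listed in `fields`.
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for alert in alerts:
        writer.writerow(alert)
    return buffer.getvalue()


print(alerts_to_csv(SAMPLE_RESPONSE))
```

Any manipulation of the values (for example converting the `ackedOn` epoch timestamp to a readable date) would go inside the loop, before `writer.writerow`.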

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Question: Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi-real-time basis and feed it into a data warehouse. (Someone is bound to ask what semi-real-time means; the answer is as frequently as I reasonably can, but I will be pragmatic: as a benchmark, let's say we are hoping for every 15 minutes.) How much data? At peak times we are talking approx. 80-100k rows per minute hitting the OLTP side; off-peak this will drop significantly to