data-extraction

Retrieving Data in Java

淺唱寂寞╮ submitted on 2020-01-04 13:22:30
Question: I'm a Java novice. Is it possible to get data from a website and then store it in some sort of data structure? For example, the program gets the value of a stock from Yahoo Finance at a given time and stores it. Like I said, I'm not that proficient with Java, and I'd like to know whether this can be done. If it can, is it very hard to do?
Answer 1: public class GetYahooData { public ArrayList<JSONObject> getOutputFromUrl(String url) { ArrayList<JSONObject> output = new ArrayList<JSONObject>();

Why HTML code is different when parsing site using Jsoup than using browser

≯℡__Kan透↙ submitted on 2019-12-25 09:26:28
Question: I am on the website http://www.flashscore.com/nhl/ and I am trying to extract the links from the 'Today's Matches' table. I am trying it with the following code, but it does not work. Can you point out where the mistake is? final Document page = Jsoup .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1") .cookie("_ga","GA1.2.47011772.1485726144") .referrer("http://d.flashscore.com/x/feed/proxy-local") .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,

Having issue in extracting data from 100 web pages in a single loop

痞子三分冷 submitted on 2019-12-25 05:57:44
Question: I am sort of stuck. My goal is to extract data from a website that has several hundred pages. It's a sports website, and I have to extract the team names and other relevant data. So far I have been successful. I ran the loop for 6-7 pages and it works perfectly well, but when I change the loop to about a month (25), it retrieves incomplete data. For instance, if the destination date is 25 October, it may stop randomly at 10-12 October. I am using phpQuery and my internet connection is 1MB

Need to extract text messages out of an HTML document

ⅰ亾dé卋堺 submitted on 2019-12-25 01:18:19
Question: Hello, I have a long HTML document; this is the only part that interests me: <iframe class="goog-te-menu-frame skiptranslate" src="javascript:void(0)" frameborder="0" style="display: none; visibility: visible;"></iframe><div class="chatbox3"><div class="chatbox2"><div class="chatbox"><div class="logwrapper" style="top: 89px; margin-right: 168px;"><div class="logbox"><div style="position: relative; min-height: 100%;"><div class="logitem"><p class="statuslog">You're now chatting with a random

ICR for machine printed text?

£可爱£侵袭症+ submitted on 2019-12-24 13:55:08
Question: I know ICR is basically used for recognizing handwritten (hand-printed) data, but can we leverage ICR to extract distorted (bad-quality) machine-printed text by any chance? If not, what is the best way to solve the following problem? I have an unstructured document which may run to 2 or more pages; within the document there are a few date fields which will be handwritten. Now I want to convert this to a text file. I have tried some full-page OCR tools (OmniPage, ABBYY, etc.) which have ICR modules to

Extract Data from a Web Page - using VBA

无人久伴 submitted on 2019-12-23 04:03:06
Question: Using VBA, I need to extract data from the webpage http://emops.tse.com.tw/t21/sii/t21sc03_2011_9_e.htm I am able to fetch all the data using the following code: With ActiveSheet.QueryTables.Add(Connection:="URL;http://emops.tse.com.tw/t21/sii/t21sc03_2012_2_e.htm", Destination:=Range("$A$1")) .Name = "67083361_zpid" .FieldNames = True .RowNumbers = False .FillAdjacentFormulas = False .PreserveFormatting = True .RefreshOnFileOpen = False .BackgroundQuery = True .RefreshStyle = xlInsertDeleteCells

Unable to extract date of birth from a given format

大城市里の小女人 submitted on 2019-12-23 03:17:41
Question: I have a set of text files from which I have to extract the date of birth. The code below extracts the date of birth from most of the files but fails when it appears in the formats below. How can I extract the DOB? The data is far from uniform. Data: data=""" Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM" Jacob's Date of birth is 9/15/1963 Name:Annie; DOB:10/30/1970 Code: import re pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?(?:\/|-)\d{2,4})).*
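For the three sample lines quoted in this question, one possible approach is a single label-then-date pattern. This is only a sketch, not the asker's code or an accepted answer: it assumes the DOB is the first numeric date following a "DOB" or "Date of birth" label on the same line, and the names DOB_PATTERN and extract_dobs are made up for illustration.

```python
import re

# Assumption: the DOB is the first numeric date after a "DOB" or
# "Date of birth" label on the same line; separators are "/" or "-".
DOB_PATTERN = re.compile(
    r"(?:DOB|date\s+of\s+birth)"         # the label, in either spelling
    r"[^\d\n]*"                          # non-digit filler such as "/Sex: " or " is "
    r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",  # the date itself
    re.IGNORECASE,
)


def extract_dobs(text):
    """Return every labelled date of birth found in text, in order."""
    return DOB_PATTERN.findall(text)


# The three sample lines from the question, copied as given.
sample = """Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970"""

print(extract_dobs(sample))
```

Because the filler between label and date excludes digits, the second date on the first sample line (11/15/2014) is not picked up. Labels or date styles other than the three shown would need additional alternatives in the pattern.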

Paradox database file

我与影子孤独终老i submitted on 2019-12-22 05:36:40
Question: I found Paradox database files with different extensions. There are a db file, mb file, dat file, px file, XG0 file, XG1 file, XG2 file, XG3 file, XG4 file, YG0 file, YG1 file, YG2 file, YG3 file and YG4 file. I already found a way to open the db file and px file using the Gnumeric spreadsheet. I found some of the needed data in the db file, but the rest of the data is not in the db file, so I have to open the rest of the files. I can't find software that can read those files.
Answer 1: Unfortunately, Borland has never documented the

ruby: Extracting fields from nested json

随声附和 submitted on 2019-12-22 05:08:14
Question: I am trying to teach myself Ruby and solve a problem at work. My ultimate goal is to extract three of the many fields in a JSON response from an API, manipulate them, and dump them to CSV for executive reporting. The structure of the JSON is: { "status": 200, "data": { "total": 251, "alerts": [ { "dataPoint": "x", "ackedBy": "x", "dataSourceInstance": "x", "dataSource": "x", "host": "x", "endOn": 0, "ackedOn": 1385085190, "dataSourceInstanceId": 588384, "hostId": 935, "type": "alert", "dataSourceId":
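The question is about Ruby, but the shape of the task (walk into data.alerts, keep a few keys, write CSV rows) is the same in any language. The sketch below shows it in Python using only the standard library. The three fields kept (host, dataSource, ackedOn) are an assumption, since the excerpt does not say which three are needed, and alerts_to_csv is a hypothetical helper name.

```python
import csv
import io
import json

# Assumption: these three keys are the ones wanted; the question does not say.
FIELDS = ["host", "dataSource", "ackedOn"]


def alerts_to_csv(raw_json, fields=FIELDS):
    """Pick the given fields from each entry in data.alerts and return CSV text."""
    payload = json.loads(raw_json)
    alerts = payload["data"]["alerts"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for alert in alerts:
        # Missing keys become empty cells instead of raising KeyError.
        writer.writerow({name: alert.get(name, "") for name in fields})
    return out.getvalue()


# A cut-down payload with the same nesting as the structure in the question.
sample = json.dumps({
    "status": 200,
    "data": {
        "total": 1,
        "alerts": [
            {"host": "x", "dataSource": "x", "ackedOn": 1385085190, "hostId": 935}
        ],
    },
})

print(alerts_to_csv(sample))
```

Any manipulation step (for example converting the ackedOn epoch value to a readable date) would go inside the loop before writerow.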

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

心已入冬 submitted on 2019-12-20 10:33:35
Question: Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is as frequently as I reasonably can, but I will be pragmatic; as a benchmark, let's say we are hoping for every 15 min) and feed it into a data warehouse. How much data? At peak times we are talking approx. 80-100k rows per min hitting the OLTP side; off-peak this will drop significantly to