bigdata

Data preparation for uploading into a Redis server

Submitted by ⅰ亾dé卋堺 on 2019-12-11 08:43:09
Question: I have a 10 GB .xml file which I want to upload into a Redis server using mass insertion. I need advice on how to convert this .xml data into key/value pairs or any other data structure supported by Redis. I am working with the Stack Overflow dumps; for example, take comments.xml. Data pattern:

    row Id="5" PostId="5" Score="9" Text="this is a super theoretical AI question. An interesting discussion! but out of place..." CreationDate="2014-05-14T00:23:15.437" UserId="34"

Let's say I…
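One way to feed such rows into Redis, sketched below under stated assumptions: stream the XML with iterparse, store each row as a Redis hash, and send the commands in pipelined batches. The comment:<Id> key scheme, the file name, and the batch size are illustrative choices, not part of the question; for redis-cli --pipe style mass insertion the script would emit raw Redis protocol instead of calling a client.

    # A minimal sketch, assuming redis-py is installed, Redis runs on
    # localhost:6379, and each <row> element carries the attributes shown above.
    # The comment:<Id> key scheme is an assumed naming convention.
    import xml.etree.ElementTree as ET
    import redis

    r = redis.Redis(host="localhost", port=6379)
    pipe = r.pipeline(transaction=False)

    BATCH = 10_000
    count = 0

    # iterparse streams the 10 GB file instead of loading it all into memory.
    for _, elem in ET.iterparse("comments.xml", events=("end",)):
        if elem.tag != "row":
            continue
        row = dict(elem.attrib)                      # Id, PostId, Score, Text, ...
        pipe.hset(f"comment:{row['Id']}", mapping=row)
        count += 1
        if count % BATCH == 0:
            pipe.execute()                           # flush a batch to the server
        elem.clear()                                 # free the parsed element

    pipe.execute()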

Reading an Excel file in Hadoop MapReduce

Submitted by 一笑奈何 on 2019-12-11 08:34:52
Question: I am trying to read an Excel file containing some data for aggregation in Hadoop. The MapReduce program seems to be working fine, but the output produced is in a non-readable format. Do I need to use a special InputFormat reader for Excel files in Hadoop MapReduce? My configuration is as below:

    Configuration conf = getConf();
    Job job = new Job(conf, "LatestWordCount");
    job.setJarByClass(FlightDetailsCount.class);
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    FileInputFormat…
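Since .xlsx is a binary, zipped format, the default TextInputFormat cannot split it into meaningful lines; you either need an Excel-aware InputFormat/RecordReader (for example one built on Apache POI) or a pre-conversion step. Below is a minimal sketch of the pre-conversion approach, an assumed workaround rather than the questioner's setup; the file names are placeholders and openpyxl is assumed to be available.

    # Convert the workbook to tab-separated text so the stock TextInputFormat
    # can split it line by line. File names are placeholders.
    from openpyxl import load_workbook

    wb = load_workbook("flight_details.xlsx", read_only=True)
    ws = wb.active                                   # first (active) sheet

    with open("flight_details.tsv", "w", encoding="utf-8") as out:
        for row in ws.iter_rows(values_only=True):   # tuples of cell values
            out.write("\t".join("" if v is None else str(v) for v in row))
            out.write("\n")

The converted file can then be copied into HDFS (for example with hdfs dfs -put) and processed by the unchanged MapReduce job.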

Postgres fails fetching data in Python

Submitted by 匆匆过客 on 2019-12-11 08:28:41
Question: I am using Python with the psycopg2 module to get data from a Postgres database. The database is quite large (tens of GB). Everything appears to be working, and I am creating objects from the fetched data. However, after ~160,000 created objects I get the following error: I suppose the reason is the amount of data, but I could not get anywhere searching for a solution online. I am not aware of using any proxy and have never used one on this machine before; the database is on localhost. Answer 1: It's…
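A common reason for psycopg2 failing partway through a very large fetch is that a default client-side cursor materialises the whole result set in memory. A server-side (named) cursor streams rows in batches instead. A minimal sketch, with the connection parameters, the query, and the make_object helper all placeholder assumptions:

    import psycopg2

    conn = psycopg2.connect(dbname="mydb", user="me", host="localhost")
    cur = conn.cursor(name="big_fetch")      # naming the cursor makes it server-side
    cur.itersize = 10_000                    # rows fetched per network round trip

    cur.execute("SELECT id, payload FROM big_table")
    for row in cur:                          # iterates in itersize-sized batches
        obj = make_object(row)               # hypothetical object construction

    cur.close()
    conn.close()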

Spark Standalone Mode: Worker not starting properly in Cloudera

Submitted by 会有一股神秘感。 on 2019-12-11 08:17:40
Question: I am new to Spark. After installing Spark using the parcels available in Cloudera Manager, I configured the files as shown in the link below from Cloudera Enterprise: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html After this setup, I started all the nodes in Spark by running /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-all.sh, but I could not start the worker nodes…

Alternatives for problems involving very large array indices and very large stored values

Submitted by 点点圈 on 2019-12-11 08:15:26
Question: Please suggest some alternatives for solving problems in which the brute-force solution uses arrays with very large indices and stores very large values (very large meaning beyond the range of int). I am using Java to solve this problem. Sample problem: putting a large number of pebbles into a very large group of buckets, and then calculating the average number of pebbles in each bucket. One way is to declare a big array, keep placing pebbles at the indices specified by the user, and then calculate…
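For the pebble example, the usual alternative to a huge dense array is a sparse map keyed by bucket index plus running totals, so memory scales with the number of buckets actually touched rather than with the largest index. A minimal sketch of that idea follows, written in Python for brevity even though the question is about Java, where the analogue would be a HashMap with long or BigInteger keys and values; the sample indices and counts are arbitrary.

    from collections import defaultdict

    buckets = defaultdict(int)          # only indices that receive pebbles are stored

    def add_pebbles(index: int, count: int) -> None:
        buckets[index] += count         # index may exceed the 32-bit int range

    add_pebbles(7_000_000_000, 3)       # an index far beyond int range is fine
    add_pebbles(12, 5)

    average = sum(buckets.values()) / len(buckets)
    print(average)                      # 4.0 over the two non-empty buckets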

What is a job history server in Hadoop, and why is it mandatory to start the history server before starting Pig in MapReduce mode?

Submitted by 一曲冷凌霜 on 2019-12-11 07:50:51
Question: Before starting Pig in MapReduce mode you always have to start the history server; otherwise, while trying to execute Pig Latin statements, logs like the ones below are generated:

    2018-10-18 15:59:13,709 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. **Redirecting to job history server**
    2018-10-18 15:59:14,713 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0…

Shifting with resampling in time-series data

Submitted by 别来无恙 on 2019-12-11 07:14:36
Question: Assume that I have this time-series data:

               A  B
    timestamp
    1          1  2
    2          1  2
    3          1  1
    4          0  1
    5          1  0
    6          0  1
    7          1  0
    8          1  1

I am looking for a resample value that would give me a specific count of occurrences for at least some frequency. If I resample the data from 1 to 8 with 2S, I get a different maximum than if I start from 2 to 8 with the same window size (2S).

    ds = series.resample(str(tries) + 'S').sum()
    for shift in range(1, 100):
        tries = 1
        series = pd.read_csv("file.csv", index_col='timestamp…
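A minimal sketch of what the truncated loop above appears to be doing: resampling the same data into fixed 2-second windows that start at different offsets and comparing the per-window sums. It reuses the sample values and assumes the integer timestamps can be treated as seconds.

    import pandas as pd

    df = pd.DataFrame(
        {"A": [1, 1, 1, 0, 1, 0, 1, 1], "B": [2, 2, 1, 1, 0, 1, 0, 1]},
        index=pd.to_datetime(list(range(1, 9)), unit="s"),  # timestamps 1..8 s
    )

    for shift in range(0, 2):                          # start at second 1, then 2
        window = df.iloc[shift:].resample("2s").sum()  # 2-second buckets
        print(f"offset {shift}: max A per window = {window['A'].max()}")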

Flume to read a Facebook page/feed/post

Submitted by 笑着哭i on 2019-12-11 06:34:52
Question: Does anyone know how to use Flume so that it reads data from a Facebook page? I want to have a Flume agent that reads a specific Facebook page, extracts all the information such as posts/feeds, and pushes the data into Hadoop databases. Answer 1: As mentioned in Flume Streaming Data from Facebook, the sentiment_analysis project has an overview containing the following: 1) sample PHP code for the Facebook HTTP GETs and POSTs; 2) Flume configuration for a Facebook HTTP source; 3) the Flume agent…

Aggregating a JSON array into Map<key, List> in Spark 2.x

Submitted by 丶灬走出姿态 on 2019-12-11 06:24:52
Question: I am quite new to Spark. I have an input JSON file which I am reading as val df = spark.read.json("/Users/user/Desktop/resource.json"). The contents of resource.json look like this:

    {"path":"path1","key":"key1","region":"region1"}
    {"path":"path112","key":"key1","region":"region1"}
    {"path":"path22","key":"key2","region":"region1"}

Is there any way to process this DataFrame and aggregate the result as Map<key, List<data>>, where data is each JSON object in which the key is present? For example, the expected…
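One way to get that grouping is groupBy on key followed by collect_list. The sketch below uses PySpark for brevity (the same functions exist in the Scala API shown in the question); the struct of path and region and the final driver-side dict are assumptions about what the expected output looks like.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("group-by-key").getOrCreate()
    df = spark.read.json("/Users/user/Desktop/resource.json")

    grouped = (
        df.groupBy("key")
          .agg(F.collect_list(F.struct("path", "region")).alias("data"))
    )
    grouped.show(truncate=False)

    # Pulling the result back to the driver as a key -> list mapping:
    result = {row["key"]: row["data"] for row in grouped.collect()}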

Read only the n-th column of a headerless text file with R and sqldf

Submitted by 佐手、 on 2019-12-11 06:22:26
Question: I have a problem similar to this question: selecting every Nth column using SQLDF or read.csv.sql. I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, with only a 32-bit system available). This file has no header, so the code in the thread above did not work and I decided to write a new post. Do you have an idea how to solve this problem? I thought about something like that, but any results with fread or read.table…
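If stepping outside R is acceptable, the same "parse only the columns you need" idea can be expressed with pandas' usecols; this is an assumed alternative rather than an sqldf or fread answer, and the file name and column positions below are placeholders.

    import pandas as pd

    wanted = [0, 99, 499_999]            # zero-based positions of the wanted columns

    df = pd.read_csv(
        "big_table.txt",
        sep=r"\s+",                      # space-separated values
        header=None,                     # the file has no header row
        usecols=wanted,                  # only these columns are parsed and kept
    )
    print(df.shape)                      # (150, 3) for the dimensions quoted above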