bigdata

PrimeFaces DataExporter for big data

Posted by 萝らか妹 on 2019-12-13 00:08:45
Question: I have 65,000 rows to fetch from the DB into Excel, but the PrimeFaces DataExporter component cannot write that much data to Excel. What can I use for this? Is there a library for it?

Answer 1: You can use:

Apache POI (some examples, a quick guide): very easy to use, with excellent simple examples.
Jasper Reports (just a link): needs some time to figure out.

If you need just one Excel export, use Apache POI. If you have a lot of reports, I would recommend Jasper Reports, because you can have…
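Since the answer recommends Apache POI for a large export, here is a minimal sketch using POI's streaming SXSSF workbook, which keeps only a small window of rows in memory while older rows are flushed to disk. The column layout and file name are hypothetical:

    import java.io.FileOutputStream;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;

    public class BigExport {
        public static void main(String[] args) throws Exception {
            // Keep at most 100 rows in memory; earlier rows are flushed to temp files.
            try (SXSSFWorkbook wb = new SXSSFWorkbook(100)) {
                Sheet sheet = wb.createSheet("data");
                for (int i = 0; i < 65000; i++) {
                    Row row = sheet.createRow(i);
                    row.createCell(0).setCellValue(i);            // hypothetical id column
                    row.createCell(1).setCellValue("value-" + i); // hypothetical name column
                }
                try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
                    wb.write(out);
                }
                wb.dispose(); // remove the temporary files backing flushed rows
            }
        }
    }

With the plain in-memory XSSF workbook all 65,000 rows would sit on the heap at once, which is usually what makes large exports fail.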

What is the difference between FAILED and ERROR in Spark application states

Posted by 若如初见. on 2019-12-12 22:51:02
Question: I am trying to create a state diagram of a submitted Spark application, and I am somewhat lost on when an application is considered FAILED. The states are from here: https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala

Answer 1: This stage is very important, since when it comes to Big Data, Spark is awesome, but let's face it, we haven't solved the problem yet! When a task/job fails, Spark restarts it…
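For reference, the linked DriverState.scala enumerates the driver states roughly as follows (paraphrased here as a Java enum; check the linked file for the authoritative wording of the comments):

    // Paraphrase of org.apache.spark.deploy.master.DriverState for reference.
    public enum DriverState {
        SUBMITTED,   // submitted but not yet scheduled on a worker
        RUNNING,     // has been allocated to a worker to run
        FINISHED,    // previously ran and exited cleanly
        RELAUNCHING, // exited non-zero or due to worker failure, not yet running again
        UNKNOWN,     // state temporarily unknown due to master failure recovery
        KILLED,      // a user manually killed this driver
        FAILED,      // the driver exited non-zero and was not supervised
        ERROR        // unable to run or restart due to an unrecoverable error,
                     // e.g. a missing jar file
    }

On that reading, FAILED describes the driver process itself exiting with a non-zero status, while ERROR means the master could not even run or restart the driver.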

Spark partitioning/cluster enforcing

Posted by 放肆的年华 on 2019-12-12 21:31:19
Question: I will be using a large number of files structured as /day/hour-min.txt.gz, covering a total of 14 days, on a cluster of 90 nodes/workers. I am reading everything with wholeTextFiles(), as it is the only way that lets me split the data appropriately. All the computations will be done on a per-minute basis (so essentially per file), with a few reduce steps at the end. There are roughly 20,000 files. How do I partition them efficiently? Do I let Spark decide? Ideally, I think each…
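As a starting point, wholeTextFiles() accepts a minPartitions hint, and the result can be repartitioned explicitly afterwards; a minimal sketch, with the HDFS path and file count assumed:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PartitionFiles {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("per-file-partitions");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hint at roughly one partition per file (~20,000 files).
                JavaPairRDD<String, String> files =
                    sc.wholeTextFiles("hdfs:///data/*/hour-*.txt.gz", 20000);
                // wholeTextFiles may still pack several small files into one
                // partition, so enforce the layout explicitly if it matters.
                JavaPairRDD<String, String> perFile = files.repartition(20000);
                System.out.println("partitions: " + perFile.getNumPartitions());
            }
        }
    }

The minPartitions argument is only a hint, while repartition() forces a shuffle, so it is worth checking first whether the default placement is already good enough.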

How to store millions of statistics records efficiently?

Posted by 烂漫一生 on 2019-12-12 19:22:25
Question: We have about 1.7 million products in our e-shop, and we want to keep a record of how many views each product gets over a one-year period, sampling the views at least every 2 hours. The question is what structure to use for this task. Right now we keep stats for the last 30 days in records with two columns, classified_id and stats, where stats is a stripped-down JSON of the form date:views,date:views,... For example, a record would look like 345422,{051216:23212,051217:64233}, where 051216…
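One widely used alternative to packing a JSON blob per product is one row per product per time bucket, which keeps writes append-only and lets SQL aggregate over arbitrary ranges. A rough JDBC sketch, assuming a hypothetical table stats(classified_id BIGINT, bucket TIMESTAMP, views INT) in PostgreSQL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.time.Instant;

    public class RecordViews {
        public static void main(String[] args) throws Exception {
            try (Connection c =
                     DriverManager.getConnection("jdbc:postgresql://localhost/eshop");
                 PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO stats (classified_id, bucket, views) VALUES (?, ?, ?)")) {
                // One row per (product, 2-hour bucket):
                // ~1.7M products x 12 buckets/day, so plan for time partitioning.
                ps.setLong(1, 345422L);
                ps.setTimestamp(2, Timestamp.from(Instant.parse("2016-12-05T00:00:00Z")));
                ps.setInt(3, 23212);
                ps.addBatch();
                // ...batch up many rows, then flush them in one round trip.
                ps.executeBatch();
            }
        }
    }

At roughly 20 million new rows per day this only stays manageable with the table partitioned by time (day or week), so old partitions can be rolled up or dropped cheaply.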

Enriching DataStream using static DataSet in Flink streaming

Posted by ⅰ亾dé卋堺 on 2019-12-12 19:06:08
Question: I am writing a Flink streaming program in which I need to enrich a DataStream of user events using a static data set (an information base, IB). For example, say we have a static data set of buyers and an incoming clickstream of events; for each event, we want to add a boolean flag indicating whether the doer of the event is a buyer. An ideal way to achieve this would be to partition the incoming stream by user id and to have the buyers set available as a DataSet partitioned again by…
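Short of a true co-partitioned join, one simple pattern is a rich function that loads the buyer set once per parallel task in open() and then flags each event; a minimal sketch against the Flink Java API, with the event type and the loading code both placeholders:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;

    public class EnrichWithBuyers {
        // Events are reduced to their user id here for brevity.
        public static DataStream<Tuple2<String, Boolean>> enrich(DataStream<String> userIds) {
            return userIds.map(new RichMapFunction<String, Tuple2<String, Boolean>>() {
                private transient Set<String> buyers;

                @Override
                public void open(Configuration parameters) {
                    // Load the static information base once per task,
                    // e.g. from a file or a lookup service (stubbed here).
                    buyers = new HashSet<>();
                    buyers.add("user-42");
                }

                @Override
                public Tuple2<String, Boolean> map(String userId) {
                    return Tuple2.of(userId, buyers.contains(userId));
                }
            });
        }
    }

This duplicates the buyer set on every task rather than partitioning it; for a large or changing information base, Flink's broadcast state or a connected stream is the better fit.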

What's the difference between a watermark and a trigger in Flink?

Posted by 你。 on 2019-12-12 18:27:27
Question: I read that "...the ordering operator has to buffer all elements it receives. Then, when it receives a watermark it can sort all elements that have a timestamp that is lower than the watermark and emit them in the sorted order. This is correct because the watermark signals that not more elements can arrive that would be intermixed with the sorted elements..." (https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams). Hence, it seems that the watermark serves as a signal to…
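The division of labor in code: the watermark declares how far event time has advanced (here tolerating 5 seconds of out-of-orderness), while the trigger decides when a window actually emits. A minimal sketch against the Flink Java API of that era; the Event type is hypothetical, and EventTimeTrigger is in fact already the default for event-time windows:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger;

    public class WatermarkVsTrigger {
        public static class Event {
            public String key;
            public long millis;
            public long value;
        }

        public static void pipeline(DataStream<Event> events) {
            events
                // Watermark: event time has reached (max timestamp seen - 5s);
                // elements older than that are no longer expected.
                .assignTimestampsAndWatermarks(
                    new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
                        @Override
                        public long extractTimestamp(Event e) {
                            return e.millis;
                        }
                    })
                .keyBy(e -> e.key)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Trigger: fire the window once the watermark passes its end.
                .trigger(EventTimeTrigger.create())
                .sum("value");
        }
    }

So the watermark is a statement about the completeness of the input, and the trigger is a policy about when to act on it.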

Which DB manager for a 100 GB table? [closed]

Posted by 大憨熊 on 2019-12-12 17:25:47
Question (closed as off-topic 2 years ago): I am carrying out a 2G/3G/4G data-capture project as part of my studies. I have to store this data and run queries on it. My table: [freq {float}, dbm {float}, timestamp {int}]. I receive about 15 GB of data per day, from 100,000 to 200,000 entries per minute, and that runs for 6 days. I could use a simple DBMS…

Terms of use data from Facebook API

Posted by 穿精又带淫゛_ on 2019-12-12 15:06:30
Question: An easy question: if I get data from the Facebook Graph search API (e.g. a list of all pages or events, anything I can retrieve), can I use it on my own site and publish it? What about commercial use? Or are there restrictions? Thanks.

EDIT: I need public data such as fan pages or public events, but I want to use it outside of Facebook, on another site (presenting the data in my own way). A good example is Socialbakers, with its big statistics from FB. Is it possible to build a site like that on top of Facebook API data?

Answer 1: It…

Process multiple file using awk

Posted by 社会主义新天地 on 2019-12-12 12:19:30
Question: I have to process lots of txt files (16 million rows each) using awk. I have to read, for example, ten files:

File #1:
en sample_1 200
en.n sample_2 10
en sample_3 10

File #2:
en sample_1 10
en sample_3 67

File #3:
en sample_1 1
en.n sample_2 10
en sample_4 20
...

I would like output like this:

source title f1 f2 f3 sum(f1,f2,f3)
en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20

Here is the first version of my code:

#! /bin/bash
…
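Since the script itself is cut off here, a reference sketch of the same merge in Java (file names hypothetical): one map keyed by source and title, with one counter column per input file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.Stream;

    public class MergeCounts {
        public static void main(String[] args) throws IOException {
            String[] files = {"file1.txt", "file2.txt", "file3.txt"}; // hypothetical
            Map<String, long[]> counts = new LinkedHashMap<>();
            for (int f = 0; f < files.length; f++) {
                final int col = f;
                // Stream line by line: the inputs are 16 million rows each.
                try (Stream<String> lines = Files.lines(Paths.get(files[f]))) {
                    lines.forEach(line -> {
                        String[] p = line.split("\\s+"); // source, title, count
                        counts.computeIfAbsent(p[0] + " " + p[1],
                                k -> new long[files.length])[col] += Long.parseLong(p[2]);
                    });
                }
            }
            System.out.println("source title f1 f2 f3 sum(f1,f2,f3)");
            counts.forEach((key, cols) -> {
                long sum = 0;
                StringBuilder sb = new StringBuilder(key);
                for (long c : cols) { sb.append(' ').append(c); sum += c; }
                System.out.println(sb.append(' ').append(sum));
            });
        }
    }

An awk solution would have the same shape: one associative array indexed by source and title, with a per-file column filled in as each file is read.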

How to vectorize text file in mahout?

Posted by 早过忘川 on 2019-12-12 11:43:10
Question: I have a text file with a label and a tweet per line:

positive,I love this car
negative,I hate this book
positive,Good product.

I need to convert each line into a vector value. If I use the seq2sparse command, the whole document gets converted to a vector, but I need each line converted to a vector, not the whole document. For example: key: positive, value: vectorvalue(tweet). How can we achieve this in Mahout?

/* Here is what I have done */
StringTokenizer str = new StringTokenizer(line, ",");
String label = str.nextToken();
…
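One way to get one vector per line is to bypass seq2sparse and encode each tweet directly, writing label/vector pairs into a sequence file; a rough sketch using Mahout's feature encoders (classic Mahout 0.x APIs assumed; the vector dimension and output path are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class LineVectorizer {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");
            try (SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                    new Path("tweets.seq"), Text.class, VectorWritable.class)) {
                String[] lines = {"positive,I love this car", "negative,I hate this book"};
                for (String line : lines) {
                    int comma = line.indexOf(',');
                    String label = line.substring(0, comma);
                    String tweet = line.substring(comma + 1);
                    // Hash each word into a fixed-size sparse vector: one per line.
                    Vector v = new RandomAccessSparseVector(1000);
                    for (String word : tweet.split("\\s+")) {
                        encoder.addToVector(word, v);
                    }
                    writer.append(new Text(label), new VectorWritable(v));
                }
            }
        }
    }

The key is written as the label and the value as the per-tweet vector, which matches the key/value layout asked for above.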