bigdata

Storing Apache Hadoop Data Output to MySQL Database

我们两清 submitted on 2019-12-06 01:07:27
I need to store the output of a map-reduce program in a database. Is there any way to do this? If so, is it possible to store the output in multiple columns and tables based on requirements? Please suggest some solutions. Thank you.

Michal: A great example is shown on this blog; I tried it and it works really well. I quote the most important parts of the code. First, you must create a class representing the data you would like to store. The class must implement the DBWritable interface:

    public class DBOutputWritable implements Writable, DBWritable {
        private String name;
        private int count;
        public
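
For context, a minimal sketch of the driver-side configuration that usually accompanies such a class, assuming a MySQL table named output with columns name and count (the connection URL, credentials, table and column names are all illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    public class DriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // register the JDBC driver and connection details (illustrative values)
            DBConfiguration.configureDB(conf,
                    "com.mysql.jdbc.Driver",
                    "jdbc:mysql://localhost:3306/mydb",
                    "dbuser", "dbpassword");

            Job job = Job.getInstance(conf, "store-mapreduce-output-in-mysql");
            job.setJarByClass(DriverSketch.class);

            // the reducer emits DBOutputWritable keys; DBOutputFormat ignores the values
            job.setOutputKeyClass(DBOutputWritable.class);
            job.setOutputValueClass(NullWritable.class);
            job.setOutputFormatClass(DBOutputFormat.class);

            // one target table per job: its name followed by the column names to fill
            DBOutputFormat.setOutput(job, "output", "name", "count");

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that DBOutputFormat.setOutput targets a single table, so writing to multiple tables typically means running one job per table or writing a custom OutputFormat.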

Split a dataset by rows into smaller files in R

大城市里の小女人 submitted on 2019-12-05 23:19:37
I am analyzing a dataset of 1.14 GB (1,232,705,653 bytes). Reading the data in R:

    trade = read.csv("commodity_trade_statistics_data.csv")

shows that it has 8,225,871 instances and 10 attributes. Since I intend to analyze the dataset through a data-wrangling web app that limits imports to 100 MB, how can I split the data into files of at most 100 MB each? The split should be by rows, and each file should contain the header.

eastclintw00d: Split the dataframe into the desired number of chunks. Here is an example with the built-in mtcars dataset:

    no
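
A sketch of that approach applied to the trade data itself, assuming a chunk count chosen so each output file stays under 100 MB (the chunk count and output file names are illustrative); every chunk is written with its own header:

    trade <- read.csv("commodity_trade_statistics_data.csv")

    no_of_chunks <- 15                                   # illustrative: pick so each file is < 100 MB
    chunk_id <- cut(seq_len(nrow(trade)), breaks = no_of_chunks, labels = FALSE)
    chunks <- split(trade, chunk_id)

    for (i in seq_along(chunks)) {
      # write.csv repeats the header in every file; row.names = FALSE avoids an extra index column
      write.csv(chunks[[i]], file = paste0("trade_part_", i, ".csv"), row.names = FALSE)
    }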

Failing to write offset data to zookeeper in kafka-storm

孤街浪徒 submitted on 2019-12-05 21:54:46
Question: I was setting up a Storm cluster to calculate real-time trending and other statistics, but I am having problems introducing the "recovery" feature into this project, i.e. having the offset that was last read by the kafka-spout be remembered (the source code for the kafka-spout comes from https://github.com/apache/incubator-storm/tree/master/external/storm-kafka). I start my kafka-spout in this way:

    BrokerHosts zkHost = new ZkHosts("localhost:2181");
    SpoutConfig kafkaConfig = new
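
For reference, a sketch of how that configuration is usually completed so the spout commits its last-read offset to ZooKeeper, based on the storm-kafka module linked above (the topic name, consumer id and ZooKeeper addresses are illustrative; the package names match the incubator-storm era of that module):

    import java.util.Arrays;

    import backtype.storm.spout.SchemeAsMultiScheme;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class SpoutConfigSketch {
        public static KafkaSpout buildSpout() {
            BrokerHosts zkHost = new ZkHosts("localhost:2181");

            // zkRoot + "/" + id is the ZooKeeper path under which offsets are stored
            SpoutConfig kafkaConfig = new SpoutConfig(zkHost, "trending", "/kafkastorm", "trending-spout");
            kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            // the ensemble the offsets are written to; if these stay null the spout
            // falls back to the Storm cluster's own ZooKeeper
            kafkaConfig.zkServers = Arrays.asList("localhost");
            kafkaConfig.zkPort = 2181;

            return new KafkaSpout(kafkaConfig);
        }
    }

Keeping the same id across restarts is what lets the spout resume from the stored offset.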

Element-wise mean of several big.matrix objects in R

与世无争的帅哥 submitted on 2019-12-05 21:36:35
I have 17 file-backed big.matrix objects (dim 10985 x 52598, 4.3 GB each) for which I would like to calculate the element-wise mean. The result can be stored in another big.matrix (gcm.res.outputM). biganalytics::apply() doesn't work, as MARGIN can be set to 1 OR 2 only. I tried to use two for loops, as shown here:

    gcm.res.outputM <- filebacked.big.matrix(10958, 52598, separated = FALSE,
        backingfile = "gcm.res.outputM.bin", backingpath = NULL,
        descriptorfile = "gcm.res.outputM.desc", binarydescriptor = FALSE)
    for(i in 1:10958){
      for(j in 1:52598){
        t <- rbind(gcm.res.output1[i,j], gcm.res.output2[i
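
A sketch of a block-wise alternative that avoids touching elements one at a time, assuming the 17 input matrices have been attached into a list called mats and gcm.res.outputM has been created as above (the list name and block size are illustrative):

    library(bigmemory)

    block_size <- 1000                       # illustrative: number of columns per block
    n_col <- ncol(gcm.res.outputM)

    for (s in seq(1, n_col, by = block_size)) {
      e <- min(s + block_size - 1, n_col)
      # pull one block of columns from every big.matrix into ordinary matrices and sum them
      acc <- mats[[1]][, s:e]
      for (k in 2:length(mats)) {
        acc <- acc + mats[[k]][, s:e]
      }
      gcm.res.outputM[, s:e] <- acc / length(mats)
    }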

How to replace null, NaN or infinite values with a default value in Spark Scala

白昼怎懂夜的黑 submitted on 2019-12-05 21:28:48
I'm reading CSVs into Spark and setting the schema so that all columns are DecimalType(10,0). When I query the data, I get the following error:

    NumberFormatException: Infinite or NaN

If I have NaN/null/infinite values in my dataframe, I would like to set them to 0. How do I do this? This is how I'm attempting to load the data:

    var cases = spark.read.option("header",false).
      option("nanValue","0").
      option("nullValue","0").
      option("positiveInf","0").
      option("negativeInf","0").
      schema(schema).
      csv(...

Any help would be greatly appreciated.

If you have NaN values in multiple columns, you can use na
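
A minimal sketch of that DataFrameNaFunctions approach, assuming the frame has been loaded as cases (the column names in the second call are illustrative):

    // replace null and NaN in every numeric column with 0
    val filled = cases.na.fill(0)

    // or restrict the replacement to specific columns
    val filledSome = cases.na.fill(0, Seq("col_a", "col_b"))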

Fastest way to load huge .dat into array

落花浮王杯 submitted on 2019-12-05 20:30:03
I have searched extensively on Stack Exchange for a neat solution for loading a huge (~2 GB) .dat file into a numpy array, but didn't find one. So far I have managed to load it as a list really quickly (<1 min):

    list = []
    f = open('myhugefile0')
    for line in f:
        list.append(line)
    f.close()

Using np.loadtxt freezes my computer and takes several minutes to load (~10 min). How can I open the file as an array without the allocation issue that seems to bottleneck np.loadtxt?

EDIT: The input data is a float (200000, 5181) array. One line example: 2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15 1
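
One workaround that keeps the parsing in fast C code is numpy.fromfile with a text separator; a sketch assuming the file is plain whitespace-delimited floats with no header (pandas.read_csv is another common choice for the same job):

    import numpy as np

    # parses the whole file into a flat array without loadtxt's per-line Python overhead;
    # a single-space separator matches any run of whitespace, including newlines
    flat = np.fromfile('myhugefile0', dtype=np.float64, sep=' ')

    # reshape using the known number of columns from the question (5181)
    data = flat.reshape(-1, 5181)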

Oozie Sqoop action fails to import data to Hive

与世无争的帅哥 submitted on 2019-12-05 18:50:20
I am facing an issue while executing an Oozie Sqoop action. In the logs I can see that Sqoop is able to import the data to a temp directory, and then Sqoop creates Hive scripts to import the data. It fails while importing the temp data into Hive, and I do not get any exception in the logs. Below is the Sqoop action I am using:

    <workflow-app name="testSqoopLoadWorkflow" xmlns="uri:oozie:workflow:0.4">
      <credentials>
        <credential name='hive_credentials' type='hcat'>
          <property>
            <name>hcat.metastore.uri</name>
            <value>${HIVE_THRIFT_URL}</value>
          </property>
          <property>
            <name>hcat.metastore.principal</name>
            <value>${KERBEROS_PRINCIPAL}<
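
Since the excerpt is cut off before the action body, one commonly suggested check for this symptom is whether the Hive client configuration is shipped with the action, so the hive-import step can reach the metastore; a sketch of what that part of the Sqoop action could look like (the action name, command and paths are illustrative):

    <action name="sqoopLoad" cred="hive_credentials">
      <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --connect ${JDBC_URL} --table SOURCE_TABLE --hive-import --hive-table target_db.target_table</command>
        <!-- ship the Hive client configuration so the hive-import step can find the metastore -->
        <file>${wf:appPath()}/hive-site.xml#hive-site.xml</file>
      </sqoop>
      <ok to="end"/>
      <error to="kill"/>
    </action>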

Can Apache Sqoop and Flume be used interchangeably?

梦想的初衷 submitted on 2019-12-05 18:44:36
I am new to Big Data. From some of the answers to "What's the difference between Flume and Sqoop?", both Flume and Sqoop can pull data from a source and push it to Hadoop. Can anyone please specify exactly where Flume is used and where Sqoop is? Can both be used for the same tasks?

Flume and Sqoop are designed to work with different kinds of data sources. Sqoop works with any RDBMS that supports JDBC connectivity. Flume, on the other hand, works well with streaming data sources such as log data that is generated continuously in your environment. Specifically, Sqoop could be used

Opening an HDFS file in a browser

喜你入骨 submitted on 2019-12-05 18:38:33
I am trying to open a file (present at the HDFS location /user/input/Summary.txt) in my browser using the URL hdfs://localhost:8020/user/input/Summary.txt, but I am getting an error in Firefox:

    Firefox doesn't know how to open this address, because the protocol (hdfs) isn't associated with any program.

If I change the protocol from hdfs to http (which ideally should not work), then I get the message:

    It looks like you are making an HTTP request to a Hadoop IPC port. This is not the correct port for the web interface on this daemon.

This is present in the core-site.xml file:
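
A browser cannot speak the hdfs:// RPC protocol; the file has to be fetched over HTTP instead, for example through the NameNode web UI or the WebHDFS REST API if it is enabled. A sketch using Python's requests, assuming the Hadoop 2 default NameNode HTTP port 50070 (Hadoop 3 uses 9870):

    import requests

    # WebHDFS exposes HDFS over HTTP; op=OPEN redirects to a DataNode that streams the file
    url = "http://localhost:50070/webhdfs/v1/user/input/Summary.txt"
    resp = requests.get(url, params={"op": "OPEN"})
    resp.raise_for_status()
    print(resp.text)

The same address with ?op=OPEN appended can also be pasted straight into the browser.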

Mini-batch training of a scikit-learn classifier where I provide the mini-batches

最后都变了- submitted on 2019-12-05 17:41:30
I have a very big dataset that cannot be loaded into memory. I want to use this dataset as the training set for a scikit-learn classifier, for example a LogisticRegression. Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches?

Some of the classifiers in sklearn have a partial_fit method. This method allows you to pass mini-batches of data to the classifier, so that a gradient descent step is performed for each mini-batch. You would simply load a mini-batch from disk, pass it to partial_fit, release the mini-batch from
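
A minimal sketch of that loop, assuming the data can be streamed from disk in chunks (here via pandas.read_csv with chunksize; the file name, chunk size and "label" column are illustrative). SGDClassifier with logistic loss is the partial_fit counterpart of LogisticRegression:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    # optimizes logistic regression with SGD; use loss="log" on older scikit-learn versions
    clf = SGDClassifier(loss="log_loss")

    classes = np.array([0, 1])      # every label must be declared on the first partial_fit call

    for chunk in pd.read_csv("big_training_set.csv", chunksize=10_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)      # one SGD pass over this mini-batch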