bigdata

Storing Apache Hadoop Data Output to MySQL Database

我们两清 submitted on 2019-12-06 01:07:27
I need to store the output of a map-reduce program in a database. Is there any way to do this? If so, is it possible to store the output in multiple columns and tables based on requirements? Please suggest some solutions. Thank you.

Michal: A great example is shown on this blog; I tried it and it works really well. I quote the most important parts of the code. First, you must create a class representing the data you would like to store. The class must implement the DBWritable interface:

    public class DBOutputWritable implements Writable, DBWritable {
        private String name;
        private int count;
        public
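
For context, a minimal sketch of the driver-side configuration that usually accompanies such a class, assuming a MySQL table named output with columns name and count (the connection URL, credentials, table and column names are all illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    public class DriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // register the JDBC driver and connection details (illustrative values)
            DBConfiguration.configureDB(conf,
                    "com.mysql.jdbc.Driver",
                    "jdbc:mysql://localhost:3306/mydb",
                    "dbuser", "dbpassword");

            Job job = Job.getInstance(conf, "store-mapreduce-output-in-mysql");
            job.setJarByClass(DriverSketch.class);

            // the reducer emits DBOutputWritable keys; DBOutputFormat ignores the values
            job.setOutputKeyClass(DBOutputWritable.class);
            job.setOutputValueClass(NullWritable.class);
            job.setOutputFormatClass(DBOutputFormat.class);

            // one target table per job: its name followed by the column names to fill
            DBOutputFormat.setOutput(job, "output", "name", "count");

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that DBOutputFormat.setOutput targets a single table, so writing to multiple tables typically means running one job per table or writing a custom OutputFormat.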

Split a dataset by rows into smaller files in R

大城市里の小女人 submitted on 2019-12-05 23:19:37
I am analyzing a dataset of 1.14 GB (1,232,705,653 bytes). Reading the data in R:

    trade = read.csv("commodity_trade_statistics_data.csv")

shows that it has 8,225,871 instances and 10 attributes. Since I intend to analyze the dataset through a data-wrangling web app that limits imports to 100 MB, how can I split the data into files of at most 100 MB each? The split should be by rows, and each file should contain the header.

eastclintw00d: Split the dataframe into the desired number of chunks. Here is an example with the built-in mtcars dataset:

    no
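
A sketch of that approach applied to the trade data itself, assuming a chunk count chosen so each output file stays under 100 MB (the chunk count and output file names are illustrative); every chunk is written with its own header:

    trade <- read.csv("commodity_trade_statistics_data.csv")

    no_of_chunks <- 15                                   # illustrative: pick so each file is < 100 MB
    chunk_id <- cut(seq_len(nrow(trade)), breaks = no_of_chunks, labels = FALSE)
    chunks <- split(trade, chunk_id)

    for (i in seq_along(chunks)) {
      # write.csv repeats the header in every file; row.names = FALSE avoids an extra index column
      write.csv(chunks[[i]], file = paste0("trade_part_", i, ".csv"), row.names = FALSE)
    }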

Failing to write offset data to zookeeper in kafka-storm

孤街浪徒 submitted on 2019-12-05 21:54:46
Question: I was setting up a Storm cluster to calculate real-time trending and other statistics, but I am having problems introducing the "recovery" feature into this project, i.e. having the offset that was last read by the kafka-spout be remembered (the source code for the kafka-spout comes from https://github.com/apache/incubator-storm/tree/master/external/storm-kafka). I start my kafka-spout in this way:

    BrokerHosts zkHost = new ZkHosts("localhost:2181");
    SpoutConfig kafkaConfig = new
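
For reference, a sketch of how that configuration is usually completed so the spout commits its last-read offset to ZooKeeper, based on the storm-kafka module linked above (the topic name, consumer id and ZooKeeper addresses are illustrative; the package names match the incubator-storm era of that module):

    import java.util.Arrays;

    import backtype.storm.spout.SchemeAsMultiScheme;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class SpoutConfigSketch {
        public static KafkaSpout buildSpout() {
            BrokerHosts zkHost = new ZkHosts("localhost:2181");

            // zkRoot + "/" + id is the ZooKeeper path under which offsets are stored
            SpoutConfig kafkaConfig = new SpoutConfig(zkHost, "trending", "/kafkastorm", "trending-spout");
            kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            // the ensemble the offsets are written to; if these stay null the spout
            // falls back to the Storm cluster's own ZooKeeper
            kafkaConfig.zkServers = Arrays.asList("localhost");
            kafkaConfig.zkPort = 2181;

            return new KafkaSpout(kafkaConfig);
        }
    }

Keeping the same id across restarts is what lets the spout resume from the stored offset.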

Element-wise mean of several big.matrix objects in R

与世无争的帅哥 submitted on 2019-12-05 21:36:35
I have 17 file-backed big.matrix objects (dim 10985 x 52598, 4.3 GB each) for which I would like to calculate the element-wise mean. The result can be stored in another big.matrix (gcm.res.outputM). biganalytics::apply() doesn't work, as MARGIN can be set to 1 OR 2 only. I tried to use two for loops, as shown here:

    gcm.res.outputM <- filebacked.big.matrix(10958, 52598, separated = FALSE,
        backingfile = "gcm.res.outputM.bin", backingpath = NULL,
        descriptorfile = "gcm.res.outputM.desc", binarydescriptor = FALSE)
    for(i in 1:10958){
      for(j in 1:52598){
        t <- rbind(gcm.res.output1[i,j], gcm.res.output2[i
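
A sketch of a block-wise alternative that avoids touching elements one at a time, assuming the 17 input matrices have been attached into a list called mats and gcm.res.outputM has been created as above (the list name and block size are illustrative):

    library(bigmemory)

    block_size <- 1000                       # illustrative: number of columns per block
    n_col <- ncol(gcm.res.outputM)

    for (s in seq(1, n_col, by = block_size)) {
      e <- min(s + block_size - 1, n_col)
      # pull one block of columns from every big.matrix into ordinary matrices and sum them
      acc <- mats[[1]][, s:e]
      for (k in 2:length(mats)) {
        acc <- acc + mats[[k]][, s:e]
      }
      gcm.res.outputM[, s:e] <- acc / length(mats)
    }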

How to replace null, NaN or infinite values with a default value in Spark Scala

白昼怎懂夜的黑 submitted on 2019-12-05 21:28:48
I'm reading CSVs into Spark and setting the schema so that all columns are DecimalType(10,0). When I query the data, I get the following error:

    NumberFormatException: Infinite or NaN

If I have NaN/null/infinite values in my dataframe, I would like to set them to 0. How do I do this? This is how I'm attempting to load the data:

    var cases = spark.read.option("header",false).
      option("nanValue","0").
      option("nullValue","0").
      option("positiveInf","0").
      option("negativeInf","0").
      schema(schema).
      csv(...

Any help would be greatly appreciated.

If you have NaN values in multiple columns, you can use na
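
A minimal sketch of that DataFrameNaFunctions approach, assuming the frame has been loaded as cases (the column names in the second call are illustrative):

    // replace null and NaN in every numeric column with 0
    val filled = cases.na.fill(0)

    // or restrict the replacement to specific columns
    val filledSome = cases.na.fill(0, Seq("col_a", "col_b"))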

Fastest way to load huge .dat into array

落花浮王杯 submitted on 2019-12-05 20:30:03
I have searched extensively on Stack Exchange for a neat solution for loading a huge (~2 GB) .dat file into a numpy array, but didn't find one. So far I have managed to load it as a list really quickly (<1 min):

    list = []
    f = open('myhugefile0')
    for line in f:
        list.append(line)
    f.close()

Using np.loadtxt freezes my computer and takes several minutes to load (~10 min). How can I open the file as an array without the allocation issue that seems to bottleneck np.loadtxt?

EDIT: The input data is a float (200000, 5181) array. One line example: 2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15 1
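
One workaround that keeps the parsing in fast C code is numpy.fromfile with a text separator; a sketch assuming the file is plain whitespace-delimited floats with no header (pandas.read_csv is another common choice for the same job):

    import numpy as np

    # parses the whole file into a flat array without loadtxt's per-line Python overhead;
    # a single-space separator matches any run of whitespace, including newlines
    flat = np.fromfile('myhugefile0', dtype=np.float64, sep=' ')

    # reshape using the known number of columns from the question (5181)
    data = flat.reshape(-1, 5181)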

Oozie Sqoop action fails to import data to Hive

与世无争的帅哥 submitted on 2019-12-05 18:50:20
I am facing an issue while executing an Oozie Sqoop action. In the logs I can see that Sqoop is able to import the data to a temp directory, and then Sqoop creates Hive scripts to import the data. It fails while importing the temp data into Hive, and I do not get any exception in the logs. Below is the Sqoop action I am using:

    <workflow-app name="testSqoopLoadWorkflow" xmlns="uri:oozie:workflow:0.4">
      <credentials>
        <credential name='hive_credentials' type='hcat'>
          <property>
            <name>hcat.metastore.uri</name>
            <value>${HIVE_THRIFT_URL}</value>
          </property>
          <property>
            <name>hcat.metastore.principal</name>
            <value>${KERBEROS_PRINCIPAL}<
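
Since the excerpt is cut off before the action body, one commonly suggested check for this symptom is whether the Hive client configuration is shipped with the action, so the hive-import step can reach the metastore; a sketch of what that part of the Sqoop action could look like (the action name, command and paths are illustrative):

    <action name="sqoopLoad" cred="hive_credentials">
      <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --connect ${JDBC_URL} --table SOURCE_TABLE --hive-import --hive-table target_db.target_table</command>
        <!-- ship the Hive client configuration so the hive-import step can find the metastore -->
        <file>${wf:appPath()}/hive-site.xml#hive-site.xml</file>
      </sqoop>
      <ok to="end"/>
      <error to="kill"/>
    </action>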

Can Apache Sqoop and Flume be used interchangeably?

梦想的初衷 submitted on 2019-12-05 18:44:36
I am new to Big Data. From some of the answers to "What's the difference between Flume and Sqoop?", both Flume and Sqoop can pull data from a source and push it to Hadoop. Can anyone please specify exactly where Flume is used and where Sqoop is? Can both be used for the same tasks?

Flume and Sqoop are designed to work with different kinds of data sources. Sqoop works with any RDBMS that supports JDBC connectivity. Flume, on the other hand, works well with streaming data sources such as log data that is generated continuously in your environment. Specifically, Sqoop could be used

Opening an HDFS file in a browser

喜你入骨 submitted on 2019-12-05 18:38:33
I am trying to open a file (present at the HDFS location /user/input/Summary.txt) in my browser using the URL hdfs://localhost:8020/user/input/Summary.txt, but I am getting an error in Firefox:

    Firefox doesn't know how to open this address, because the protocol (hdfs) isn't associated with any program.

If I change the protocol from hdfs to http (which ideally should not work), then I get the message:

    It looks like you are making an HTTP request to a Hadoop IPC port. This is not the correct port for the web interface on this daemon.

This is present in the core-site.xml file:
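
A browser cannot speak the hdfs:// RPC protocol; the file has to be fetched over HTTP instead, for example through the NameNode web UI or the WebHDFS REST API if it is enabled. A sketch using Python's requests, assuming the Hadoop 2 default NameNode HTTP port 50070 (Hadoop 3 uses 9870):

    import requests

    # WebHDFS exposes HDFS over HTTP; op=OPEN redirects to a DataNode that streams the file
    url = "http://localhost:50070/webhdfs/v1/user/input/Summary.txt"
    resp = requests.get(url, params={"op": "OPEN"})
    resp.raise_for_status()
    print(resp.text)

The same address with ?op=OPEN appended can also be pasted straight into the browser.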

Mini-batch training of a scikit-learn classifier where I provide the mini-batches

最后都变了- submitted on 2019-12-05 17:41:30
I have a very big dataset that cannot be loaded into memory. I want to use this dataset as the training set for a scikit-learn classifier, for example a LogisticRegression. Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches?

Some of the classifiers in sklearn have a partial_fit method. This method allows you to pass mini-batches of data to the classifier, so that a gradient descent step is performed for each mini-batch. You would simply load a mini-batch from disk, pass it to partial_fit, release the mini-batch from
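
A minimal sketch of that loop, assuming the data can be streamed from disk in chunks (here via pandas.read_csv with chunksize; the file name, chunk size and "label" column are illustrative). SGDClassifier with logistic loss is the partial_fit counterpart of LogisticRegression:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    # optimizes logistic regression with SGD; use loss="log" on older scikit-learn versions
    clf = SGDClassifier(loss="log_loss")

    classes = np.array([0, 1])      # every label must be declared on the first partial_fit call

    for chunk in pd.read_csv("big_training_set.csv", chunksize=10_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)      # one SGD pass over this mini-batch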