bigdata

Hadoop NodeManager and ResourceManager not starting

时光毁灭记忆、已成空白 submitted on 2019-12-07 08:05:48
Question: I am trying to set up the latest Hadoop 2.2 single-node cluster on Ubuntu 13.10 64-bit. The OS is a fresh installation, and I have tried using both Java 6 64-bit and Java 7 64-bit. After following the steps from this and, after failing, from this link, I am not able to start the NodeManager with the commands:

sbin/yarn-daemon.sh start nodemanager
sudo sbin/yarn-daemon.sh start nodemanager

or the ResourceManager with:

sbin/yarn-daemon.sh start resourcemanager
sudo sbin/yarn-daemon.sh

Convert any Elasticsearch response to simple field value format

白昼怎懂夜的黑 submitted on 2019-12-07 07:53:26
In Elasticsearch, when doing a simple query like:

GET miindex-*/mytype/_search
{
  "query": {
    "query_string": {
      "analyze_wildcard": true,
      "query": "*"
    }
  }
}

it returns a format like:

{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 28, "max_score": 1, "hits": [ ...

So I parse response.hits.hits to get the actual records. However, if you are doing another type of query, e.g. an aggregation, the response is totally different, like:

{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total":
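A minimal sketch in Python of one way to flatten either response shape into plain field/value records, assuming the response has already been parsed into a dict (the aggregation handling and field names here are illustrative assumptions, not from the original question):

def flatten_response(resp):
    # Plain search: records live under hits.hits, with the document body in _source.
    hits = resp.get("hits", {}).get("hits")
    if hits:
        return [hit["_source"] for hit in hits]
    # Aggregation: emit one flat record per bucket, keyed by the aggregation name.
    rows = []
    for name, agg in resp.get("aggregations", {}).items():
        for bucket in agg.get("buckets", []):
            rows.append({name: bucket["key"], "doc_count": bucket["doc_count"]})
    return rows

search_resp = {"hits": {"total": 28, "hits": [{"_source": {"field1": "a", "field2": 1}}]}}
print(flatten_response(search_resp))  # [{'field1': 'a', 'field2': 1}]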

How to iterate over each column in a DataFrame in Spark Scala

穿精又带淫゛_ submitted on 2019-12-07 07:49:29
Suppose I have a DataFrame with multiple columns. I want to iterate over each column, do some calculation, and update that column. Is there any good way to do that?

Update: In the example below I have a DataFrame with two integer columns, c1 and c2. Each column's value is divided by the sum of its column.

import org.apache.spark.sql.expressions.Window
val df = Seq((1,15), (2,20), (3,30)).toDF("c1","c2")
val result = df.columns.foldLeft(df)((acc, colname) =>
  acc.withColumn(colname, sum(acc(colname)).over(Window.orderBy(lit(1)))/acc(colname)))

Output:

scala> result.show()
+---+------------------+
| c1|
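For readers on the Python API, a rough PySpark analogue of the same fold-over-the-columns pattern (my own sketch, not from the original question; it mirrors the posted Scala code, which replaces each value with the column sum divided by that value):

from functools import reduce
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 15), (2, 20), (3, 30)], ["c1", "c2"])

# A window over the whole frame, matching Window.orderBy(lit(1)) in the Scala version.
w = Window.orderBy(F.lit(1))

# Fold over the column names, rewriting each column in turn.
result = reduce(
    lambda acc, c: acc.withColumn(c, F.sum(F.col(c)).over(w) / F.col(c)),
    df.columns,
    df,
)
result.show()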

Mini-batch training of a scikit-learn classifier where I provide the mini-batches

偶尔善良 submitted on 2019-12-07 07:33:32
Question: I have a very big dataset that cannot be loaded into memory. I want to use this dataset as the training set for a scikit-learn classifier, for example a LogisticRegression. Is there a possibility to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches?
Answer 1: I believe that some of the classifiers in sklearn have a partial_fit method. This method allows you to pass mini-batches of data to the classifier, such that a gradient descent step is performed for each
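A minimal sketch of the partial_fit pattern the answer points to, using SGDClassifier with a logistic loss as the out-of-core stand-in for LogisticRegression (the batch generator below is synthetic, purely for illustration):

import numpy as np
from sklearn.linear_model import SGDClassifier

def batches(n_batches=100, batch_size=256, n_features=20, seed=0):
    # Stand-in for reading chunks of the real dataset from disk.
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

clf = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn releases
classes = np.array([0, 1])            # all labels must be declared on the first call
for X_batch, y_batch in batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)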

Long lag time importing large .CSVs in R with the header in the second row

南笙酒味 submitted on 2019-12-07 06:49:38
Question: I am working on developing an application that ingests data from .csv files and then does some calculations on it. The challenge is that the .csv files can be very large. I have reviewed a number of posts here discussing the import of large .csv files using various functions and libraries. Some examples are below:

### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###
system.time(read.csv('../data/2008.csv', header = T))
# user  system elapsed
# 88.301  2.416  90.716
library(data.table)

How to get the first non-null value from a column of values in BigQuery?

ⅰ亾dé卋堺 submitted on 2019-12-07 06:40:17
Question: I am trying to extract the first non-null value from a column of values based on timestamp. Can somebody share your thoughts on this? Thank you.

What have I tried so far?

FIRST_VALUE( column ) OVER ( PARTITION BY id ORDER BY timestamp)

Input:
id,column,timestamp
1,NULL,10:30 am
1,NULL,10:31 am
1,'xyz',10:32 am
1,'def',10:33 am
2,NULL,11:30 am
2,'abc',11:31 am

Output (expected):
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am

Answer 1: Try
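The usual way to get this in BigQuery is FIRST_VALUE with IGNORE NULLS over a frame that spans the whole partition; whether that is what the truncated answer goes on to say is not shown here. A sketch of running it from Python with the google-cloud-bigquery client (the table path is a placeholder; column names follow the question):

from google.cloud import bigquery

client = bigquery.Client()

# FIRST_VALUE(... IGNORE NULLS) over the whole partition returns, for every row,
# the earliest non-null value for that id ordered by timestamp.
sql = """
SELECT
  id,
  FIRST_VALUE(column IGNORE NULLS) OVER (
    PARTITION BY id
    ORDER BY timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS column,
  timestamp
FROM `my_project.my_dataset.my_table`
"""

for row in client.query(sql).result():
    print(row["id"], row["column"], row["timestamp"])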

Basic addition in TensorFlow?

核能气质少年 submitted on 2019-12-07 05:43:56
Question: I want to make a program where I enter a set of x1, x2 values and it outputs a y. All of the TensorFlow tutorials I can find start with image recognition. Can someone help me by providing either code or a tutorial on how to do this in Python? Thanks in advance.

Edit: the x1, x2 coordinates I was planning to use would be like 1, 1 with y being 2, or 4, 6 with y being 10. I want to provide the program with data to learn from. I have tried to learn from the TensorFlow website but it seemed
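A minimal sketch of what the question seems to ask for: fitting a single linear layer on (x1, x2) -> y pairs with tf.keras (the tiny training set is made up to match the question's examples, and a reasonably recent TensorFlow 2.x is assumed):

import numpy as np
import tensorflow as tf

# (x1, x2) -> y pairs; here y = x1 + x2, as in the question's examples.
X = np.array([[1., 1.], [4., 6.], [2., 3.], [5., 5.]], dtype=np.float32)
y = np.array([2., 10., 5., 10.], dtype=np.float32)

# A single dense unit with no activation is plain linear regression.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(np.array([[3., 4.]], dtype=np.float32)))  # should land close to 7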

Find connected components of a big graph using Boost

a 夏天 submitted on 2019-12-07 04:56:30
I have written code for finding the connected components of a very big graph (80 million edges), but it doesn't work: when the number of edges is close to 40 million it crashes.

int main(){
    using namespace boost;
    {
        int node1, node2;
        typedef adjacency_list <vecS, vecS, undirectedS> Graph;
        Graph G;
        std::ifstream infile("pairs.txt");
        std::string line;
        while (std::getline(infile, line)) {
            std::istringstream iss(line);
            iss >> node1 >> node2;
            add_edge(node1, node2, G);
        }
        cout << "writing file" << endl;
        int j = 0;
        ofstream out;
        out.open("connected_component.txt");
        std::vector<int> component(num_vertices(G));
        int

Apache Spark: In Spark SQL, are SQL queries vulnerable to SQL injection? [duplicate]

心不动则不痛 submitted on 2019-12-07 03:09:47
Question: This question already has an answer here: Spark SQL security considerations (1 answer). Closed 2 years ago.

Scenario: Say there is a table in Hive and it is queried using the SparkSQL below in Apache Spark, where the table name is passed as an argument and concatenated into the query. In the case of a non-distributed system, I have a basic understanding of SQL-injection vulnerability, and in the context of JDBC I understand the usage of createStatement/preparedStatement in those kinds of scenarios. But
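Spark SQL has no prepared-statement mechanism for identifiers such as table names, so a common mitigation is to validate the incoming name against the catalog before interpolating it into the query string. A rough PySpark sketch of that idea (my own illustration, not taken from the linked answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def query_table(table_name: str):
    # Accept only names that actually exist in the current database,
    # rather than trusting whatever string the caller supplied.
    known = {t.name for t in spark.catalog.listTables()}
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name!r}")
    return spark.sql(f"SELECT * FROM {table_name} LIMIT 10")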

Is a MySQL view faster for querying a DB that contains 40 million rows? [closed]

烂漫一生 submitted on 2019-12-06 16:29:16
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed 5 years ago.

I have a table that contains 40 million rows. I want to reduce the query time. Is it possible to do that using views? If yes, could you please explain how?
Answer 1: Yes, it is possible to reduce query time using a view, because it has a clustered index assigned and they'll store temporary