bigdata

How to convert json array<String> to csv in spark sql

谁说胖子不能爱 submitted on 2019-12-08 07:00:46
Question: I have tried this query to get the required experience from LinkedIn data:

    Dataset<Row> filteredData = spark
        .sql("select full_name, experience from (select *, explode(experience['title']) exp from tempTable)"
            + " a where lower(exp) like '%developer%'");

But that gave an error, and finally I tried the query below, but I got more rows with the same name:

    Dataset<Row> filteredData = spark
        .sql("select full_name, explode(experience) from (select *, explode(experience['title']) exp from tempTable)"
            + " a where lower
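
One way to avoid the duplicate names is to filter on the array without exploding it, then join the titles back into a single comma-separated string. A minimal PySpark sketch, assuming (as the snippets suggest) that experience is an array of structs with a title field, and Spark 2.4+ for exists():

    # PySpark sketch; table and column names follow the question, the schema is assumed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("tempTable")

    result = (
        df
        # keep rows where any title in the array matches, without multiplying rows
        .where(F.expr("exists(experience.title, t -> lower(t) like '%developer%')"))
        # turn the array<string> of titles into one comma-separated value
        .select("full_name",
                F.concat_ws(",", F.col("experience.title")).alias("titles_csv"))
    )
    result.show(truncate=False)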

how to read a huge file into a buffer

两盒软妹~` submitted on 2019-12-08 05:08:18
Question: I have some code to read a file:

    FILE* file = fopen(fileName.c_str(), "r");
    assert(file != NULL);
    size_t BUF_SIZE = 10 * 1024 * 1024;
    char* buf = new char[BUF_SIZE];
    string contents;
    while (!feof(file)) {
        int ret = fread(buf, BUF_SIZE, 1, file);
        assert(ret != -1);
        contents.append(buf);
    }

I know the size of the file in advance, so I allocate a buffer to store the content from the file in this line:

    char* buf = new char[BUF_SIZE];

If the file I need to read is very large, for example up to several
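
For comparison, the same chunked-read pattern sketched in Python rather than C++ (the file name is hypothetical): append exactly the number of bytes each read returns and stop when a read comes back empty, instead of testing for end-of-file before reading.

    # Chunked-read sketch; 'data.bin' is a hypothetical file name.
    CHUNK_SIZE = 10 * 1024 * 1024          # 10 MiB per read, mirroring BUF_SIZE above

    chunks = []
    with open("data.bin", "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)     # returns at most CHUNK_SIZE bytes
            if not chunk:                  # empty result means end of file
                break
            chunks.append(chunk)           # keep only the bytes actually read

    contents = b"".join(chunks)
    print(len(contents), "bytes read")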

Inserting json date object in mongodb from R

不打扰是莪最后的温柔 submitted on 2019-12-08 04:14:14
Question: I am trying to insert forecasted values from a forecasting model, along with timestamps, into MongoDB from R. The following code converts the R data frame into JSON and then BSON. However, when the result is inserted into MongoDB, the timestamp is not recognized as a date object.

    mongo1 <- mongo.create(host = "localhost:27017", db = "test", username = "test", password = "test")
    rev <- data.frame(ts = c("2017-01-06 05:30:00", "2017-01-06 05:31:00", "2017-01-06 05:32:00", "2017-01-06 05:33:00", "2017-01-06 05:34:00"
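
The question uses rmongodb, but the underlying point is the same in any driver: the timestamp has to reach the driver as a date value, not a string, for MongoDB to store it as a BSON Date. A minimal sketch with pymongo (connection details and collection name are hypothetical):

    # pymongo sketch; host, database and collection names are hypothetical.
    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    coll = client["test"]["forecasts"]

    # A datetime object (not a string) is stored as a BSON Date,
    # which MongoDB then treats as a real date in queries and aggregations.
    coll.insert_one({
        "ts": datetime(2017, 1, 6, 5, 30, 0),
        "forecast": 123.4,
    })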

Convert any Elasticsearch response to simple field value format

假如想象 submitted on 2019-12-08 04:07:19
Question: In Elasticsearch, when doing a simple query like:

    GET miindex-*/mytype/_search
    {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      }
    }

it returns a response in a format like:

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "failed": 0 },
      "hits": {
        "total": 28,
        "max_score": 1,
        "hits": [ ...

So I parse response.hits.hits to get the actual records. However, if you are doing another type of query, e.g. an aggregation, the response is totally different, like: {
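
One way to normalize both shapes into a flat list of field/value records is sketched below in Python. The helper name is hypothetical, and the aggregation branch assumes bucket-style aggregations (terms, histogram, etc.) that follow the standard aggregations.<name>.buckets layout.

    # Hypothetical helper that flattens either response shape into a list of dicts.
    def flatten_response(resp):
        rows = []
        if "aggregations" in resp:
            # Bucket aggregations keep data under aggregations.<agg_name>.buckets
            for agg_name, agg in resp["aggregations"].items():
                for bucket in agg.get("buckets", []):
                    rows.append({agg_name: bucket.get("key"),
                                 "doc_count": bucket.get("doc_count")})
        else:
            # Plain searches keep documents under hits.hits[n]._source
            for hit in resp.get("hits", {}).get("hits", []):
                rows.append(hit.get("_source", {}))
        return rows

Calling flatten_response(response) then yields the same list-of-records format regardless of which kind of query produced the response.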

How to Iterate each column in a Dataframe in Spark Scala

你离开我真会死。 submitted on 2019-12-08 03:58:19
Question: Suppose I have a dataframe with multiple columns. I want to iterate over each column, do some calculation, and update that column. Is there any good way to do that?

Answer 1: Update: In the example below I have a dataframe with two integer columns, c1 and c2. Each column's value is divided by the sum of its column.

    import org.apache.spark.sql.expressions.Window
    val df = Seq((1,15), (2,20), (3,30)).toDF("c1","c2")
    val result = df.columns.foldLeft(df)((acc, colname) =>
      acc.withColumn(colname, sum(acc(colname
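
The same fold-over-columns idea, sketched in PySpark for comparison (the data and column names c1, c2 follow the answer above): compute every column's sum once, then rebuild the frame with each column divided by its own total.

    # PySpark sketch of visiting every column and rescaling it by its column sum.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 15), (2, 20), (3, 30)], ["c1", "c2"])

    # One pass to get the sum of every column.
    totals = df.agg(*[F.sum(c).alias(c) for c in df.columns]).first().asDict()

    # Rebuild the frame, replacing each column with value / column_sum.
    result = df.select(*[(F.col(c) / F.lit(totals[c])).alias(c) for c in df.columns])
    result.show()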

How can I efficiently create a user graph based on transaction data using Python?

霸气de小男生 submitted on 2019-12-08 03:40:29
Question: I'm attempting to create a graph of users in Python using the networkx package. My raw data is individual payment transactions, where the payment data includes a user, a payment instrument, an IP address, etc. My nodes are users, and I create an edge whenever two users have shared an IP address. From that transaction data, I've created a Pandas DataFrame of unique [user, IP] pairs. To create edges, I need to find [user_a, user_b] pairs where both users share an IP. Let's call this DataFrame
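
A sketch of one join-based approach in Python (DataFrame contents and column names are hypothetical): self-join the unique [user, IP] pairs on IP to get every pair of users that shared an address, then feed those pairs to networkx as edges.

    # Sketch: build the user graph from unique (user, ip) pairs.
    import pandas as pd
    import networkx as nx

    pairs = pd.DataFrame({
        "user": ["a", "b", "b", "c"],
        "ip":   ["1.1.1.1", "1.1.1.1", "2.2.2.2", "2.2.2.2"],
    })

    # Self-join on ip to get user pairs sharing an address; keep user_x < user_y
    # so each undirected edge appears only once.
    edges = pairs.merge(pairs, on="ip")
    edges = edges[edges["user_x"] < edges["user_y"]][["user_x", "user_y"]].drop_duplicates()

    G = nx.Graph()
    G.add_nodes_from(pairs["user"].unique())
    G.add_edges_from(edges.itertuples(index=False, name=None))
    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")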

How does Locality Sensitive Hashing (LSH) work?

陌路散爱 submitted on 2019-12-08 02:49:54
Question: I've already read this question, but unfortunately it didn't help. What I don't understand is what we do once we know which bucket our high-dimensional query vector q is assigned to: suppose that, using our family of locality-sensitive hash functions h_1, h_2, ..., h_n, we have translated q into a low-dimensional (n-dimensional) hash code c. Then c is the index of the bucket to which q is assigned and where (hopefully) its nearest neighbors are also assigned; let's say that there are 100 vectors
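
A small sketch of that lookup step with random-hyperplane hashing in Python (all sizes and names are illustrative): hash q to its bucket code c, pull only the candidates stored in that bucket, and then run an exact nearest-neighbour search over that small candidate set.

    # LSH lookup sketch with random hyperplanes; all parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_planes = 64, 16
    planes = rng.normal(size=(n_planes, d))      # plays the role of h_1..h_n

    def hash_code(v):
        # sign of the projection onto each hyperplane -> n-bit bucket code c
        return ((planes @ v) > 0).tobytes()

    # Index: bucket code -> vectors assigned to that bucket
    data = rng.normal(size=(1000, d))
    buckets = {}
    for x in data:
        buckets.setdefault(hash_code(x), []).append(x)

    # Query: fetch the candidates in q's bucket, then do the exact search there.
    q = rng.normal(size=d)
    candidates = buckets.get(hash_code(q), [])
    if candidates:
        dists = [np.linalg.norm(q - x) for x in candidates]
        nearest = candidates[int(np.argmin(dists))]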

Python + Beam + Flink

匆匆过客 submitted on 2019-12-08 00:40:47
Question: I've been trying to get the Apache Beam Portability Framework to work with Python and Apache Flink, and I can't seem to find a complete set of instructions to get the environment working. Are there any references with a complete list of prerequisites and steps to get a simple Python pipeline working?

Answer 1: Overall, for the local portable runner (ULR), see the wiki; quoting from there:

Run a Python-SDK pipeline:

Compile the container as a local build:

    ./gradlew :beam-sdks-python-container:docker

Start ULR
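
Once a job endpoint is up (a Beam Flink job server or the ULR), a minimal sketch of submitting a Python pipeline through the portable runner looks like the following; the localhost:8099 endpoint and the LOOPBACK environment are assumptions about a typical local setup, not part of the quoted wiki steps.

    # Portable-runner sketch; the job endpoint and environment type are assumptions.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        "--environment_type=LOOPBACK",
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "beam", "flink"])
         | "Upper" >> beam.Map(str.upper)
         | "Print" >> beam.Map(print))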

Using Kinesis Analytics to construct real time sessions

和自甴很熟 submitted on 2019-12-08 00:39:18
Question: Is there an example somewhere, or can someone explain, how to use Kinesis Analytics to construct real-time sessions (i.e. sessionization)? It is mentioned that this is possible here: https://aws.amazon.com/blogs/aws/amazon-kinesis-analytics-process-streaming-data-in-real-time-with-sql/ in the discussion of custom windows, but no example is given. Typically this is done in SQL using the LAG function, so you can compute the time difference between consecutive rows. This post: https://blog
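
The gap-based logic that the LAG approach implements can be sketched outside Kinesis Analytics as well; a small pandas illustration (timeout and sample data are made up): a new session starts whenever the gap to the user's previous event exceeds the timeout.

    # Pandas sketch of gap-based sessionization; threshold and data are illustrative.
    import pandas as pd

    SESSION_TIMEOUT = pd.Timedelta(minutes=30)

    events = pd.DataFrame({
        "user": ["u1", "u1", "u1", "u2"],
        "ts": pd.to_datetime(["2019-12-07 10:00", "2019-12-07 10:05",
                              "2019-12-07 11:00", "2019-12-07 10:02"]),
    }).sort_values(["user", "ts"])

    # Equivalent of LAG(): time since the previous event of the same user.
    gap = events.groupby("user")["ts"].diff()

    # A session id increments when there is no previous event or the gap is too large.
    new_session = (gap.isna() | (gap > SESSION_TIMEOUT)).astype(int)
    events["session_id"] = new_session.groupby(events["user"]).cumsum()
    print(events)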

Number of MapReduce tasks

孤街醉人 submitted on 2019-12-07 21:40:16
Question: I need some help with how to get the correct number of Map and Reduce tasks in my application. Is there any way to discover this number? Thanks

Answer 1: It is not possible to get the actual number of map and reduce tasks for an application before its execution, since factors such as task failures followed by re-attempts and speculative execution attempts cannot be accurately determined prior to execution; however, an approximate number of tasks can be derived. The total number of Map tasks for
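
A back-of-the-envelope sketch of that approximation in Python (the sizes are made up): map tasks roughly equal the number of input splits, i.e. total input size divided by the split size, while the number of reduce tasks is whatever the job configures.

    # Rough arithmetic only; the input size and split size below are assumptions.
    import math

    total_input_bytes = 10 * 1024**3     # hypothetical 10 GB of input
    split_size_bytes = 128 * 1024**2     # assumed split/block size of 128 MB

    approx_map_tasks = math.ceil(total_input_bytes / split_size_bytes)
    print("approximate map tasks:", approx_map_tasks)   # 80

    # Reduce tasks are not derived from the data; they come from the job setting
    # mapreduce.job.reduces (default 1 if left unconfigured).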