bigdata

How to convert json array<String> to csv in spark sql

谁说胖子不能爱 submitted on 2019-12-08 07:00:46
Question: I have tried this query to get the required experience from LinkedIn data:

    Dataset<Row> filteredData = spark
        .sql("select full_name, experience from (select *, explode(experience['title']) exp from tempTable)"
            + " a where lower(exp) like '%developer%'");

But that gave an error, and finally I tried the query below, but I got more rows with the same name:

    Dataset<Row> filteredData = spark
        .sql("select full_name, explode(experience) from (select *, explode(experience['title']) exp from tempTable)"
            + " a where lower
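
One way to avoid the duplicate names is to filter on the array without exploding it, then join the titles back into a single comma-separated string. A minimal PySpark sketch, assuming (as the snippets suggest) that experience is an array of structs with a title field, and Spark 2.4+ for exists():

    # PySpark sketch; table and column names follow the question, the schema is assumed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("tempTable")

    result = (
        df
        # keep rows where any title in the array matches, without multiplying rows
        .where(F.expr("exists(experience.title, t -> lower(t) like '%developer%')"))
        # turn the array<string> of titles into one comma-separated value
        .select("full_name",
                F.concat_ws(",", F.col("experience.title")).alias("titles_csv"))
    )
    result.show(truncate=False)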

how to read a huge file into a buffer

两盒软妹~` submitted on 2019-12-08 05:08:18
Question: I have some code to read a file:

    FILE* file = fopen(fileName.c_str(), "r");
    assert(file != NULL);
    size_t BUF_SIZE = 10 * 1024 * 1024;
    char* buf = new char[BUF_SIZE];
    string contents;
    while (!feof(file)) {
        int ret = fread(buf, BUF_SIZE, 1, file);
        assert(ret != -1);
        contents.append(buf);
    }

I know the size of the file in advance, so I allocate a buffer to store the content from the file in this line:

    char* buf = new char[BUF_SIZE];

If the file I need to read is very large, for example up to several
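
For comparison, the same chunked-read pattern sketched in Python rather than C++ (the file name is hypothetical): append exactly the number of bytes each read returns and stop when a read comes back empty, instead of testing for end-of-file before reading.

    # Chunked-read sketch; 'data.bin' is a hypothetical file name.
    CHUNK_SIZE = 10 * 1024 * 1024          # 10 MiB per read, mirroring BUF_SIZE above

    chunks = []
    with open("data.bin", "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)     # returns at most CHUNK_SIZE bytes
            if not chunk:                  # empty result means end of file
                break
            chunks.append(chunk)           # keep only the bytes actually read

    contents = b"".join(chunks)
    print(len(contents), "bytes read")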

Inserting json date object in mongodb from R

不打扰是莪最后的温柔 submitted on 2019-12-08 04:14:14
Question: I am trying to insert forecasted values from a forecasting model, along with timestamps, into MongoDB from R. The following code converts the R data frame into JSON and then BSON. However, when the result is inserted into MongoDB, the timestamp is not recognized as a date object.

    mongo1 <- mongo.create(host = "localhost:27017", db = "test", username = "test", password = "test")
    rev <- data.frame(ts = c("2017-01-06 05:30:00", "2017-01-06 05:31:00", "2017-01-06 05:32:00", "2017-01-06 05:33:00", "2017-01-06 05:34:00"
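
The question uses rmongodb, but the underlying point is the same in any driver: the timestamp has to reach the driver as a date value, not a string, for MongoDB to store it as a BSON Date. A minimal sketch with pymongo (connection details and collection name are hypothetical):

    # pymongo sketch; host, database and collection names are hypothetical.
    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    coll = client["test"]["forecasts"]

    # A datetime object (not a string) is stored as a BSON Date,
    # which MongoDB then treats as a real date in queries and aggregations.
    coll.insert_one({
        "ts": datetime(2017, 1, 6, 5, 30, 0),
        "forecast": 123.4,
    })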

Convert any Elasticsearch response to simple field value format

假如想象 submitted on 2019-12-08 04:07:19
Question: In Elasticsearch, when doing a simple query like:

    GET miindex-*/mytype/_search
    {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      }
    }

it returns a response in a format like:

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "failed": 0 },
      "hits": {
        "total": 28,
        "max_score": 1,
        "hits": [ ...

So I parse response.hits.hits to get the actual records. However, if you are doing another type of query, e.g. an aggregation, the response is totally different, like: {
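
One way to normalize both shapes into a flat list of field/value records is sketched below in Python. The helper name is hypothetical, and the aggregation branch assumes bucket-style aggregations (terms, histogram, etc.) that follow the standard aggregations.<name>.buckets layout.

    # Hypothetical helper that flattens either response shape into a list of dicts.
    def flatten_response(resp):
        rows = []
        if "aggregations" in resp:
            # Bucket aggregations keep data under aggregations.<agg_name>.buckets
            for agg_name, agg in resp["aggregations"].items():
                for bucket in agg.get("buckets", []):
                    rows.append({agg_name: bucket.get("key"),
                                 "doc_count": bucket.get("doc_count")})
        else:
            # Plain searches keep documents under hits.hits[n]._source
            for hit in resp.get("hits", {}).get("hits", []):
                rows.append(hit.get("_source", {}))
        return rows

Calling flatten_response(response) then yields the same list-of-records format regardless of which kind of query produced the response.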

How to Iterate each column in a Dataframe in Spark Scala

你离开我真会死。 submitted on 2019-12-08 03:58:19
Question: Suppose I have a dataframe with multiple columns. I want to iterate over each column, do some calculation, and update that column. Is there any good way to do that?

Answer 1: Update: In the example below I have a dataframe with two integer columns, c1 and c2. Each column's value is divided by the sum of its column.

    import org.apache.spark.sql.expressions.Window
    val df = Seq((1,15), (2,20), (3,30)).toDF("c1","c2")
    val result = df.columns.foldLeft(df)((acc, colname) =>
      acc.withColumn(colname, sum(acc(colname
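
The same fold-over-columns idea, sketched in PySpark for comparison (the data and column names c1, c2 follow the answer above): compute every column's sum once, then rebuild the frame with each column divided by its own total.

    # PySpark sketch of visiting every column and rescaling it by its column sum.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 15), (2, 20), (3, 30)], ["c1", "c2"])

    # One pass to get the sum of every column.
    totals = df.agg(*[F.sum(c).alias(c) for c in df.columns]).first().asDict()

    # Rebuild the frame, replacing each column with value / column_sum.
    result = df.select(*[(F.col(c) / F.lit(totals[c])).alias(c) for c in df.columns])
    result.show()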

How can I efficiently create a user graph based on transaction data using Python?

霸气de小男生 submitted on 2019-12-08 03:40:29
Question: I'm attempting to create a graph of users in Python using the networkx package. My raw data is individual payment transactions, where the payment data includes a user, a payment instrument, an IP address, etc. My nodes are users, and I create an edge whenever two users have shared an IP address. From that transaction data, I've created a Pandas DataFrame of unique [user, IP] pairs. To create edges, I need to find [user_a, user_b] pairs where both users share an IP. Let's call this DataFrame
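
A sketch of one join-based approach in Python (DataFrame contents and column names are hypothetical): self-join the unique [user, IP] pairs on IP to get every pair of users that shared an address, then feed those pairs to networkx as edges.

    # Sketch: build the user graph from unique (user, ip) pairs.
    import pandas as pd
    import networkx as nx

    pairs = pd.DataFrame({
        "user": ["a", "b", "b", "c"],
        "ip":   ["1.1.1.1", "1.1.1.1", "2.2.2.2", "2.2.2.2"],
    })

    # Self-join on ip to get user pairs sharing an address; keep user_x < user_y
    # so each undirected edge appears only once.
    edges = pairs.merge(pairs, on="ip")
    edges = edges[edges["user_x"] < edges["user_y"]][["user_x", "user_y"]].drop_duplicates()

    G = nx.Graph()
    G.add_nodes_from(pairs["user"].unique())
    G.add_edges_from(edges.itertuples(index=False, name=None))
    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")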

How does Locality Sensitive Hashing (LSH) work?

陌路散爱 submitted on 2019-12-08 02:49:54
Question: I've already read this question, but unfortunately it didn't help. What I don't understand is what we do once we know which bucket our high-dimensional query vector q is assigned to: suppose that, using our family of locality-sensitive hash functions h_1, h_2, ..., h_n, we have translated q into a low-dimensional (n-dimensional) hash code c. Then c is the index of the bucket to which q is assigned and where (hopefully) its nearest neighbors are also assigned; let's say that there are 100 vectors
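
A small sketch of that lookup step with random-hyperplane hashing in Python (all sizes and names are illustrative): hash q to its bucket code c, pull only the candidates stored in that bucket, and then run an exact nearest-neighbour search over that small candidate set.

    # LSH lookup sketch with random hyperplanes; all parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_planes = 64, 16
    planes = rng.normal(size=(n_planes, d))      # plays the role of h_1..h_n

    def hash_code(v):
        # sign of the projection onto each hyperplane -> n-bit bucket code c
        return ((planes @ v) > 0).tobytes()

    # Index: bucket code -> vectors assigned to that bucket
    data = rng.normal(size=(1000, d))
    buckets = {}
    for x in data:
        buckets.setdefault(hash_code(x), []).append(x)

    # Query: fetch the candidates in q's bucket, then do the exact search there.
    q = rng.normal(size=d)
    candidates = buckets.get(hash_code(q), [])
    if candidates:
        dists = [np.linalg.norm(q - x) for x in candidates]
        nearest = candidates[int(np.argmin(dists))]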

Python + Beam + Flink

匆匆过客 submitted on 2019-12-08 00:40:47
Question: I've been trying to get the Apache Beam Portability Framework to work with Python and Apache Flink, and I can't seem to find a complete set of instructions to get the environment working. Are there any references with a complete list of prerequisites and steps to get a simple Python pipeline working?

Answer 1: Overall, for the local portable runner (ULR), see the wiki; quoting from there:

Run a Python-SDK pipeline:

Compile the container as a local build:

    ./gradlew :beam-sdks-python-container:docker

Start ULR
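
Once a job endpoint is up (a Beam Flink job server or the ULR), a minimal sketch of submitting a Python pipeline through the portable runner looks like the following; the localhost:8099 endpoint and the LOOPBACK environment are assumptions about a typical local setup, not part of the quoted wiki steps.

    # Portable-runner sketch; the job endpoint and environment type are assumptions.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        "--environment_type=LOOPBACK",
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "beam", "flink"])
         | "Upper" >> beam.Map(str.upper)
         | "Print" >> beam.Map(print))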

Using Kinesis Analytics to construct real time sessions

和自甴很熟 submitted on 2019-12-08 00:39:18
Question: Is there an example somewhere, or can someone explain, how to use Kinesis Analytics to construct real-time sessions (i.e. sessionization)? It is mentioned that this is possible here: https://aws.amazon.com/blogs/aws/amazon-kinesis-analytics-process-streaming-data-in-real-time-with-sql/ in the discussion of custom windows, but no example is given. Typically this is done in SQL using the LAG function, so you can compute the time difference between consecutive rows. This post: https://blog
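
The gap-based logic that the LAG approach implements can be sketched outside Kinesis Analytics as well; a small pandas illustration (timeout and sample data are made up): a new session starts whenever the gap to the user's previous event exceeds the timeout.

    # Pandas sketch of gap-based sessionization; threshold and data are illustrative.
    import pandas as pd

    SESSION_TIMEOUT = pd.Timedelta(minutes=30)

    events = pd.DataFrame({
        "user": ["u1", "u1", "u1", "u2"],
        "ts": pd.to_datetime(["2019-12-07 10:00", "2019-12-07 10:05",
                              "2019-12-07 11:00", "2019-12-07 10:02"]),
    }).sort_values(["user", "ts"])

    # Equivalent of LAG(): time since the previous event of the same user.
    gap = events.groupby("user")["ts"].diff()

    # A session id increments when there is no previous event or the gap is too large.
    new_session = (gap.isna() | (gap > SESSION_TIMEOUT)).astype(int)
    events["session_id"] = new_session.groupby(events["user"]).cumsum()
    print(events)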

Number of MapReduce tasks

孤街醉人 submitted on 2019-12-07 21:40:16
Question: I need some help with how to get the correct number of Map and Reduce tasks in my application. Is there any way to discover this number? Thanks

Answer 1: It is not possible to get the actual number of map and reduce tasks for an application before its execution, since factors such as task failures followed by re-attempts and speculative execution attempts cannot be accurately determined prior to execution; however, an approximate number of tasks can be derived. The total number of Map tasks for
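
A back-of-the-envelope sketch of that approximation in Python (the sizes are made up): map tasks roughly equal the number of input splits, i.e. total input size divided by the split size, while the number of reduce tasks is whatever the job configures.

    # Rough arithmetic only; the input size and split size below are assumptions.
    import math

    total_input_bytes = 10 * 1024**3     # hypothetical 10 GB of input
    split_size_bytes = 128 * 1024**2     # assumed split/block size of 128 MB

    approx_map_tasks = math.ceil(total_input_bytes / split_size_bytes)
    print("approximate map tasks:", approx_map_tasks)   # 80

    # Reduce tasks are not derived from the data; they come from the job setting
    # mapreduce.job.reduces (default 1 if left unconfigured).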