bigdata

Process a huge GeoJSON file with jq

左心房为你撑大大i submitted on 2019-12-02 21:36:54
Question: Given a GeoJSON file as follows:

    {
      "type": "FeatureCollection",
      "features": [
        {
          "type": "Feature",
          "properties": {
            "FEATCODE": 15014
          },
          "geometry": {
            "type": "Polygon",
            "coordinates": [ .....

I want to end up with the following:

    {
      "type": "FeatureCollection",
      "features": [
        {
          "tippecanoe": {"minzoom": 13},
          "type": "Feature",
          "properties": {
            "FEATCODE": 15014
          },
          "geometry": {
            "type": "Polygon",
            "coordinates": [ .....

i.e. I have added a tippecanoe object to each feature in the array.
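The transformation itself is small; below is a minimal Python sketch (the file names are hypothetical, and json.load pulls the whole document into memory, so this only illustrates the shape of the change rather than a streaming answer for a truly huge file):

    import json

    # Hypothetical file names; this loads the entire GeoJSON into memory.
    with open("input.geojson") as f:
        data = json.load(f)

    # Add a tippecanoe object to every feature, as described in the question.
    for feature in data["features"]:
        feature["tippecanoe"] = {"minzoom": 13}

    with open("output.geojson", "w") as f:
        json.dump(data, f)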

Understanding and building a social network algorithm

◇◆丶佛笑我妖孽 submitted on 2019-12-02 21:22:13
I am not sure whether this is the right platform to ask this question, but my problem statement is: I have a book shop and x clients (x is huge). A client can tell me whether a book is good or bad (not recommended). I have internal logic to group similar books together, so if a client says a book is bad, he is also saying that similar books are bad and should not be shown to him. I oblige and hide those books. Clients can also interact among themselves, and have a mutual confidence level between them. A case arises when client A says book X1 is bad. Hence I blacklist X1, X2, X3, X4, etc. But his friend
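One way to make the last part concrete is a confidence-weighted propagation rule; the sketch below is purely illustrative (the function, the 0-to-1 confidence scale, and the threshold are all assumptions, not something stated in the question):

    from typing import Set

    def propagate_blacklist(blacklisted: Set[str], confidence: float,
                            threshold: float = 0.5) -> Set[str]:
        """Hypothetical rule: a friend inherits client A's blacklist only if
        the mutual confidence between them is at or above the threshold."""
        return set(blacklisted) if confidence >= threshold else set()

    # Client A blacklists X1..X4; the confidence between A and B is 0.7,
    # so B inherits the whole list under this rule.
    print(propagate_blacklist({"X1", "X2", "X3", "X4"}, confidence=0.7))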

In Spark, how does broadcast work?

青春壹個敷衍的年華 submitted on 2019-12-02 21:11:20
This is a very simple question: in Spark, broadcast can be used to send variables to executors efficiently. How does this work? More precisely: when are values sent, as soon as I call broadcast, or when the values are used? Where exactly is the data sent: to all executors, or only to the ones that will need it? Where is the data stored: in memory, or on disk? Is there a difference in how simple variables and broadcast variables are accessed? What happens under the hood when I call the .value method? Short answer: values are sent the first time they are needed in an executor. Nothing
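For reference, a minimal PySpark sketch of the API being discussed (the data and app name are placeholders): the driver registers the value with sc.broadcast(), and tasks read it through .value on the executors:

    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-demo")
    lookup = sc.broadcast({"a": 1, "b": 2})   # register the value on the driver

    rdd = sc.parallelize(["a", "b", "a"])
    # .value is resolved inside the tasks running on the executors
    total = rdd.map(lambda k: lookup.value.get(k, 0)).sum()
    print(total)   # 4
    sc.stop()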

R vector size limit: “long vectors (argument 5) are not supported in .C”

好久不见. submitted on 2019-12-02 21:03:58
I have a very large matrix I'm trying to run through glmnet on a server with plenty of memory. It works fine even on very large data sets up to a certain point, after which I get the following error:

    Error in elnet(x, ...) : long vectors (argument 5) are not supported in .C

If I understand correctly, this is caused by a limitation in R which cannot have any vector with length longer than INT_MAX. Is that correct? Are there any available solutions to this that don't require a complete rewrite of glmnet? Do any of the alternative R interpreters (Riposte, etc.) address this limitation? Thanks!
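As a rough sanity check of where that limit bites, here is a back-of-the-envelope calculation in Python (the assumption, not confirmed here, is that glmnet hands the x matrix to the compiled routine as one flat vector, so the limit applies to nrow * ncol elements rather than to bytes):

    # INT_MAX is the largest length the old .C interface can describe.
    INT_MAX = 2**31 - 1                      # 2,147,483,647

    n_rows, n_cols = 5_000_000, 500          # hypothetical matrix shape
    n_elements = n_rows * n_cols
    print(n_elements, n_elements > INT_MAX)  # 2500000000 True -> would trigger the error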

Fastest way to cross-tabulate two massive logical vectors in R

假装没事ソ submitted on 2019-12-02 20:47:36
For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulation? I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon. Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big; I chose a total size just under 2.5GB (2.4GB)). I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time).

    set.seed(0)
    N = 3E8
    p = 0.02
    x = sample(c(TRUE, FALSE), N,
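The question is about R, but the counting trick that usually wins here (encode each pair of logicals as a 2-bit code and tally the four codes in one pass) is easy to sketch outside R; the following numpy version is only an illustration of that idea, with a much smaller N:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000_000                     # small stand-in for the 3E8 in the question
    x = rng.random(N) < 0.02
    y = rng.random(N) < 0.02

    # Each pair (x, y) becomes 2*x + y in {0, 1, 2, 3}; one bincount then gives
    # the whole 2x2 table without building an intermediate cross product.
    counts = np.bincount(2 * x.astype(np.int8) + y, minlength=4)
    table = counts.reshape(2, 2)       # rows: x = False/True, cols: y = False/True
    print(table)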

Hive ORDER BY on a column not visible in the output

回眸只為那壹抹淺笑 submitted on 2019-12-02 20:40:27
Question: Let's say I have a table test with columns a, b, and c, and test2 with the same columns. Can I create a view of test and test2 joined together and ordered by field c from table test, without showing c in the final output? In my case:

    CREATE VIEW test_view AS
    SELECT a, b
    FROM (SELECT * FROM test ORDER BY c) t
    JOIN test2 ON t.a = test2.a;

OK, I tested it and it is not possible because of the shuffle phase, so maybe there is another solution to somehow do it? The tables are too big to do a broadcast join. Of course I
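For comparison, here is a hedged PySpark sketch of the same idea (table and column names come from the question; whether the sort order actually survives the join and the final projection is exactly what is in doubt): sort on c, then project it away:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("order-by-hidden-column").getOrCreate()
    test = spark.table("test")
    test2 = spark.table("test2")

    result = (
        test.join(test2, test["a"] == test2["a"])
            .orderBy(test["c"])           # sort on a column that is not exposed
            .select(test["a"], test["b"])
    )
    result.show()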

Fastest way to compare row and previous row in pandas dataframe with millions of rows

折月煮酒 submitted on 2019-12-02 19:19:49
I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row. As an example, this is a simplified version of my problem:

       User  Time  Col1                 newcol1  newcol2  newcol3  newcol4
    0  1     6     [cat, dog, goat]     0        0        0        0
    1  1     6     [cat, sheep]         0        0        0        0
    2  1     12    [sheep, goat]        0        0        0        0
    3  2     3     [cat, lion]          0        0        0        0
    4  2     5     [fish, goat, lemur]  0        0        0        0
    5  3     9     [cat, dog]           0        0        0        0
    6  4     4     [dog, goat]          0        0        0        0
    7  4     11    [cat]                0        0        0        0

At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2
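The standard alternative to an explicit loop is to compare each row with a shifted copy of the column, which is vectorised; the example below is only a small illustration built from the sample data above (the actual comparison rules in the question are not shown, so the two derived columns here are assumptions):

    import pandas as pd

    df = pd.DataFrame({
        "User": [1, 1, 1, 2, 2, 3, 4, 4],
        "Time": [6, 6, 12, 3, 5, 9, 4, 11],
    })

    # shift() aligns each row with the one before it, so whole-column
    # comparisons replace the row-by-row Python loop.
    df["same_user_as_prev"] = df["User"].eq(df["User"].shift())
    df["time_delta_from_prev"] = df["Time"] - df["Time"].shift()
    print(df)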

How to check Spark Version [closed]

大兔子大兔子 submitted on 2019-12-02 17:51:39
I want to check the Spark version in CDH 5.7.0. I have searched on the internet but am not able to understand. Please help. Thanks.

BruceWayne (in addition to @Binary Nerd): If you are using Spark, use the following to get the Spark version:

    spark-submit --version

or log in to Cloudera Manager, go to the Hosts page, and run "Inspect hosts in cluster".

You can get the Spark version by using any of the following commands:

    spark-submit --version
    spark-shell --version
    spark-sql --version

You can visit the site below to find the Spark version used in CDH 5.7.0: http://www.cloudera.com/documentation/enterprise/release
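As a small aside to the CLI options above, the version string is also exposed programmatically from inside a running PySpark session (a hedged sketch, not part of the original answers):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    print(sc.version)   # prints the Spark version string of the running cluster
    sc.stop()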

Apache Spark vs Akka [closed]

孤街醉人 submitted on 2019-12-02 17:14:59
Could you please tell me the difference between Apache Spark and Akka? I know that both frameworks are meant for programming distributed and parallel computations, yet I don't see the link or the difference between them. Moreover, I would like to know the use cases suitable for each of them. Apache Spark is actually built on Akka. Akka is a general-purpose framework to create reactive, distributed, parallel and resilient concurrent applications in Scala or Java. Akka uses the Actor model to hide all the thread-related code and gives you really simple and helpful interfaces to implement a scalable and

How to validate history data?

孤人 submitted on 2019-12-02 16:16:23
Question: Currently we are reading the date using a calendar instance and picking the last one month of records using Spark SQL. Now we need the following: in case extra events are added to a previous day, we must also be able to manually insert summary start and end dates, in case we need a manual re-run of the job for a previous time period. For example, a manual re-run table could look like this:

    rprtng_period_type_cd  summary_start_date  summary_end_date  summary_iv
    M                      2018-01-01          2018-01-31        2018-01
    D                      2018-03-05          2018-03-05        2018-03-05
    D                      2018-03
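A hedged PySpark sketch of how such a control table might drive the date window (the column names come from the sample above, but the control-table name, the events table, and its event_date column are hypothetical, since the question is cut off):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("manual-rerun-window").getOrCreate()

    # Hypothetical control table holding the manual re-run rows shown above.
    window = spark.sql(
        "SELECT summary_start_date, summary_end_date "
        "FROM rerun_control WHERE rprtng_period_type_cd = 'M'"
    ).first()

    events = spark.table("events")          # hypothetical source table
    in_window = events.filter(
        (events["event_date"] >= window["summary_start_date"]) &
        (events["event_date"] <= window["summary_end_date"])
    )
    in_window.show()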