bigdata

importance of PCA or SVD in machine learning

对着背影说爱祢 submitted on 2019-11-28 15:07:49
Question: All this time (especially in the Netflix contest), I keep coming across blogs (or leaderboard forum posts) that mention how applying a simple SVD step to the data helped them reduce sparsity or, in general, improved the performance of the algorithm at hand. I have been trying for a long time to work out why this is, but I cannot figure it out. In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling stuff like
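The question trails off, but the "simple SVD step" it refers to is usually a low-rank factorization of the user-item matrix. Below is a minimal sketch of that idea, assuming a scipy sparse ratings matrix and using scikit-learn's TruncatedSVD; the shapes and parameters are made up for illustration.

from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Fake sparse "users x items" matrix standing in for Netflix-style rating data.
ratings = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# Keep only the top-k singular vectors; this is the "simple SVD step".
svd = TruncatedSVD(n_components=20, random_state=0)
user_factors = svd.fit_transform(ratings)   # dense array of shape (1000, 20)
item_factors = svd.components_              # dense array of shape (20, 500)

# Multiplying the factors back gives a dense, de-noised low-rank approximation,
# which is why the SVD step reduces sparsity and noise at the same time.
approx = user_factors @ item_factors
print(approx.shape, svd.explained_variance_ratio_.sum())

Working with the 20 latent factors instead of the raw 500 sparse columns is what typically helps the downstream algorithm.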

Spark Fixed Width File Import Large number of columns causing high Execution time

假如想象 submitted on 2019-11-28 14:30:18
I am getting a fixed-width .txt source file from which I need to extract 20K columns. Given the lack of libraries to process fixed-width files with Spark, I have developed code which extracts the fields from fixed-width text files. The code reads the text file as an RDD with sparkContext.textFile("abc.txt"), then reads a JSON schema to get the column names and the width of each column. In the function I read the fixed-length string and, using the start and end positions, use the substring function to create the Array. I map the function over the RDD, convert the above RDD to a DF, map the column names and write to
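A minimal PySpark sketch of the approach described above, assuming the schema is given as (column name, width) pairs; the file path, schema contents, and output location are placeholders, not the asker's actual values.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fixed-width").getOrCreate()

# Assumed schema: (column_name, width) pairs, e.g. loaded from the JSON schema file.
schema = [("col1", 10), ("col2", 5), ("col3", 8)]

# Precompute (name, start, end) slices once, outside the per-row function.
offsets = []
pos = 0
for name, width in schema:
    offsets.append((name, pos, pos + width))
    pos += width

def parse_line(line):
    # Slice each fixed-width field out of the raw line.
    return [line[start:end].strip() for _, start, end in offsets]

rdd = spark.sparkContext.textFile("abc.txt").map(parse_line)
df = rdd.toDF([name for name, _, _ in offsets])
df.write.mode("overwrite").parquet("/tmp/fixed_width_output")  # placeholder path

Precomputing the slice offsets once and doing plain string slicing per row keeps the per-record work cheap, which matters when there are 20K columns.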

Confusion in hashing used by LSH

不羁的心 submitted on 2019-11-28 14:21:15
Matrix M is the signature matrix, produced via MinHashing of the actual data; it has documents as columns and words as rows, so a column represents a document. Now it says that every stripe (b in number, r in length) has its columns hashed, so that a column falls into a bucket. If two columns fall into the same bucket for >= 1 stripes, then they are potentially similar. Does that mean I should create b hash tables and find b independent hash functions? Or is just one enough, with every stripe sending its columns to the same collection of buckets (but wouldn't that cancel out the stripes)?
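The usual reading (for example in the MMDS treatment of LSH) is one bucket collection per stripe, so a collision in stripe i only pairs columns on their stripe-i rows. A toy sketch of that bookkeeping, with made-up signature values:

from collections import defaultdict

# signatures[doc] is a list of minhash values; 6 rows split into b=3 stripes of r=2.
signatures = {
    "doc_a": [3, 7, 1, 9, 4, 2],
    "doc_b": [3, 7, 5, 8, 4, 2],
    "doc_c": [6, 0, 5, 8, 1, 1],
}
b, r = 3, 2

# One hash table per stripe, so stripe i of one column can only collide with
# stripe i of another column.
tables = [defaultdict(list) for _ in range(b)]
for doc, sig in signatures.items():
    for i in range(b):
        stripe = tuple(sig[i * r:(i + 1) * r])
        tables[i][hash(stripe)].append(doc)

# Candidate pairs: columns sharing a bucket in at least one stripe.
candidates = set()
for table in tables:
    for docs in table.values():
        for j in range(len(docs)):
            for k in range(j + 1, len(docs)):
                candidates.add(tuple(sorted((docs[j], docs[k]))))
print(candidates)

The same hash function can be reused for every stripe; what has to stay separate is the bucket space per stripe, otherwise stripe 1 of one document could collide with stripe 2 of another and the banding would indeed be cancelled out.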

R foverlaps equivalent in Python

左心房为你撑大大i submitted on 2019-11-28 12:43:14
I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python 3. What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same: table_A occurs on a per-second basis and can have multiple entries at each interval, while table_B has one entry every 0-10 minutes at irregular intervals. This
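One pandas option for this kind of "time falls inside [start_time, end_time)" join is an IntervalIndex lookup. The sketch below uses invented data and column names that mirror the description, and assumes the table_B intervals do not overlap.

import pandas as pd

table_a = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-01 00:00:01", "2019-01-01 00:03:30",
                            "2019-01-01 00:09:59"]),
    "value": [1.0, 2.0, 3.0],
})
table_b = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-01-01 00:00:00", "2019-01-01 00:05:00"]),
    "end_time":   pd.to_datetime(["2019-01-01 00:05:00", "2019-01-01 00:10:00"]),
    "label": ["first", "second"],
})

# Build an interval per table_B row and look up which interval each table_A
# timestamp falls into (closed="left" mirrors start_time <= t < end_time).
intervals = pd.IntervalIndex.from_arrays(table_b["start_time"],
                                         table_b["end_time"], closed="left")
idx = intervals.get_indexer(table_a["time"])  # -1 means no containing interval

joined = table_a.join(table_b.iloc[idx].reset_index(drop=True))
print(joined)

In this sketch every timestamp matches an interval; rows with idx == -1 would need to be filtered out or left as missing, depending on whether an inner or left join is wanted.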

aggregation using ffdfdply function in R

*爱你&永不变心* submitted on 2019-11-28 11:49:54
Question: I tried aggregation on a large dataset using the 'ffbase' package and its ffdfdply function in R. Let's say I have three variables called Date, Item and sales. I want to aggregate the sales over Date and Item using the sum function. Could you please guide me to the proper syntax in R? Here is what I tried: grp_qty <- ffdfdply(x=data[c("sales","Date","Item")], split=as.character(data$sales), FUN = function(data) summaryBy(Date+Item~sales, data=data, FUN=sum)). I would appreciate your

Operation Time Out Error in cqlsh console of cassandra

爱⌒轻易说出口 submitted on 2019-11-28 10:54:21
I have a three-node Cassandra cluster and I have created one table which has more than 2,000,000 rows. When I execute select count(*) from userdetails in cqlsh, I get this error: OperationTimedOut: errors={}, last_host=192.168.1.2. When I run the count for fewer rows or with limit 50,000 it works fine. count(*) actually pages through all the data, so a select count(*) from userdetails without a limit would be expected to time out with that many rows. Some details here: http://planetcassandra.org/blog/counting-key-in-cassandra/ You may want to consider maintaining the count
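If the count is only needed occasionally, one workaround is to let a client page through the table and count the rows itself. A hedged sketch with the DataStax Python driver; the keyspace name and the userid column are assumptions, and the contact point is taken from the error message above.

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["192.168.1.2"])
session = cluster.connect("mykeyspace")   # hypothetical keyspace name

# fetch_size makes the driver page through the table in small chunks instead of
# asking a single coordinator to scan everything within one timeout window.
stmt = SimpleStatement("SELECT userid FROM userdetails", fetch_size=5000)
count = sum(1 for _ in session.execute(stmt))
print(count)

cluster.shutdown()

This still scans the whole table, just without the server-side timeout; for frequent counts, maintaining the count separately (as suggested above) is the better fit.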

Prepare my bigdata with Spark via Python

雨燕双飞 submitted on 2019-11-28 10:29:06
Question: My quantized data, 100m in size:

('1424411938', [3885, 7898])
('3333333333', [3885, 7898])

Desired result:

(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])

So what I want is to transform the data so that I group 3885 (for example) with all the data[0] that have it. Here is what I did in python:

def prepare(data):
    result = []
    for point_id, cluster in data:
        for index, c in enumerate(cluster):
            found = 0
            for res in result:
                if c == res[0]:
                    found = 1
            if(found == 0):
                result.append(
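The nested-loop accumulation above is quadratic in the number of distinct cluster values. Since the title asks about preparing the data with Spark, here is a minimal PySpark sketch of the same inversion using flatMap and groupByKey; the SparkContext setup and the tiny in-memory sample are illustrative only.

from pyspark import SparkContext

sc = SparkContext(appName="invert-clusters")
data = sc.parallelize([
    ('1424411938', [3885, 7898]),
    ('3333333333', [3885, 7898]),
])

# Emit (cluster_value, point_id) for every value in each list, then group the
# point ids by cluster value.
inverted = (data
            .flatMap(lambda kv: [(c, kv[0]) for c in kv[1]])
            .groupByKey()
            .mapValues(list))

print(inverted.collect())
# e.g. [(3885, ['1424411938', '3333333333']), (7898, ['1424411938', '3333333333'])]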

How to change sqoop metastore?

折月煮酒 submitted on 2019-11-28 09:31:10
I am using sqoop version 1.4.2. I am trying to change the sqoop metastore from the default hsqldb to MySQL. I have configured the following properties in the sqoop-site.xml file:

<property>
  <name>sqoop.metastore.client.enable.autoconnect</name>
  <value>false</value>
  <description>If true, Sqoop will connect to a local metastore
    for job management when no other metastore arguments are provided.
  </description>
</property>
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:mysql://ip:3206/sqoop?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>sqoop.metastore.client

Load a small random sample from a large csv file into R data frame

本小妞迷上赌 submitted on 2019-11-28 09:19:49
The csv file to be processed does not fit into memory. How can one read ~20K random lines of it to do basic statistics on the selected data frame? You can also just do it in the terminal with perl: perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt This won't necessarily get you exactly 20,000 lines (here it grabs about 1% of the total lines), but it will be really fast, and you'll have a copy of both files in your directory. You can then load the smaller file into R however you want. Try this, based on examples 6e and 6f on the sqldf github home page:
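If exactly 20,000 lines are wanted rather than an approximate 1%, a single-pass reservoir sample keeps memory constant regardless of file size. This sketch is in Python rather than R, with a placeholder file name, and is offered only as an alternative to the perl one-liner above.

import csv
import random

def reservoir_sample(path, k=20000, seed=0):
    random.seed(seed)
    sample = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                 # keep the header row separately
        for i, row in enumerate(reader):
            if i < k:
                sample.append(row)            # fill the reservoir first
            else:
                j = random.randint(0, i)      # keep row with probability k/(i+1)
                if j < k:
                    sample[j] = row
    return header, sample

header, rows = reservoir_sample("big.csv", k=20000)  # "big.csv" is a placeholder

The sampled rows can then be written back out to a small csv and read into R as usual.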

How to expand one column in Pandas to many columns?

倖福魔咒の submitted on 2019-11-28 08:54:11
Question: As the title says, I have one column (series) in pandas, and each row of it is a list like [0,1,2,3,4,5]. Each list has 6 numbers. I want to change this column into 6 columns; for example, [0,1,2,3,4,5] will become 6 columns, with 0 as the first column, 1 as the second, 2 as the third, and so on. How can I do that?

Answer 1: Not as fast as @jezrael's solution, but elegant :-) apply with pd.Series:

df.a.apply(pd.Series)

   0  1  2  3  4  5
0  0  1  2  3  4  5
1  0  1  2  3  4  5

or

df.a.apply(pd.Series, index=list(
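For larger frames, building the new columns directly from the Python lists is usually much faster than apply(pd.Series). A self-contained sketch; the column name "a" matches the answer above, while the a_0..a_5 output names are just an illustrative choice.

import pandas as pd

df = pd.DataFrame({"a": [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]]})

# Convert the column of lists into a DataFrame in one step, then name the
# resulting columns however you like.
expanded = pd.DataFrame(df["a"].tolist(), index=df.index)
expanded.columns = [f"a_{i}" for i in range(expanded.shape[1])]

result = pd.concat([df.drop(columns="a"), expanded], axis=1)
print(result)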