bigdata

How to parallelize computation on “big data” dictionary of lists?

守給你的承諾、 submitted on 2021-02-18 19:00:17
Question: I have a question about doing calculations on a Python dictionary of lists. In this case the dictionary has millions of keys, and the lists are similarly long. There seems to be disagreement about whether parallelization can be used here, so I'll ask the question more explicitly. Here is the original question: Optimizing parsing of massive python dictionary, multi-threading. This is a toy (small) Python dictionary: example_dict1 = {'key1': [367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
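
A minimal sketch of one way to parallelize per-key work on such a dictionary, assuming the computation for each key is independent. The second key and the sum() body are placeholders for whatever the real per-key calculation is:

```python
from multiprocessing import Pool

example_dict1 = {'key1': [367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
                 'key2': [754, 915, 622, 149, 279, 192, 312, 203, 742, 846]}  # made-up second key

def process_item(item):
    # Placeholder per-key computation; replace sum() with the real work.
    key, values = item
    return key, sum(values)

if __name__ == '__main__':
    with Pool() as pool:
        # A large chunksize keeps inter-process overhead low when the
        # dictionary has millions of keys.
        results = dict(pool.imap_unordered(process_item,
                                           example_dict1.items(),
                                           chunksize=10_000))
    print(results)
```

Each (key, list) item is pickled to a worker process, so this only pays off when the per-key computation is heavier than that serialization cost.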

Is there a faster way than fread() to read big data?

£可爱£侵袭症+ submitted on 2021-02-18 11:27:28
Question: Hi, first of all, I already searched Stack Overflow and Google and found posts such as this one: Quickly reading very large tables as dataframes. While those are helpful and well answered, I'm looking for more information. I am looking for the best way to read/import "big" data that can go up to 50-60 GB. I am currently using the fread() function from data.table, and it is the fastest function I know of at the moment. The PC/server I work on has a good CPU (workstation) and 32 GB of RAM, but

multithreading for data from dataframe pandas

守給你的承諾、 submitted on 2021-02-18 11:11:50
Question: I'm struggling to use multithreading to calculate relatedness between a list of customers who have different shopping items in their baskets. I have a pandas data frame of 1,000 customers, which means I have to calculate the relatedness about 1 million times, and this takes too long to process. An example of the data frame looks like this:

ID  Item
1   Banana
1   Apple
2   Orange
2   Banana
2   Tomato
3   Apple
3   Tomato
3   Orange

Here is the simplified version of the code: import pandas as pd def
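
Since the pairwise scoring is CPU-bound, a process pool tends to help more than threads here (the GIL keeps pure-Python threads from running the computation in parallel). A minimal sketch, assuming each customer's basket can be reduced to a set of items and using Jaccard overlap as a stand-in for the real relatedness function:

```python
import pandas as pd
from itertools import combinations
from multiprocessing import Pool

df = pd.DataFrame({
    'ID':   [1, 1, 2, 2, 2, 3, 3, 3],
    'Item': ['Banana', 'Apple', 'Orange', 'Banana', 'Tomato',
             'Apple', 'Tomato', 'Orange'],
})

# One set of items per customer, built once so the workers never touch the frame.
baskets = df.groupby('ID')['Item'].apply(set).to_dict()

def relatedness(pair):
    # Stand-in metric (Jaccard overlap); swap in the real relatedness function.
    a, b = pair
    union = baskets[a] | baskets[b]
    return a, b, len(baskets[a] & baskets[b]) / len(union) if union else 0.0

if __name__ == '__main__':
    pairs = list(combinations(baskets, 2))
    with Pool() as pool:
        scores = pool.map(relatedness, pairs, chunksize=1_000)
    print(scores)
```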

Create columns from row with same ID

给你一囗甜甜゛ submitted on 2021-02-16 14:25:06
Question: I have a df like this:

Id   username    age
1    michael.    34
6.   Mike.       65
7.   Stephanie.  14
1.   Mikael.     34
6.   Mick.       65

As you can see, usernames are not written the same way for the same Id. I would like to regroup all usernames for the same Id onto the same row, like this:

Id   username    username_2   Age
1    michael.    mikael.      34
6.   Mike.       Mick.        65
7.   Stephanie.               14

Thanks. Answer 1: You can create a MultiIndex by numbering duplicated Id values with cumcount, then reshape with unstack, and finally clean up with add_prefix and reset_index: df1
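
The answer's code is cut off in this excerpt; a minimal sketch of the cumcount/unstack approach it describes might look like this (the values follow the toy frame above, with the Ids treated as plain integers):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 6, 7, 1, 6],
    'username': ['michael.', 'Mike.', 'Stephanie.', 'Mikael.', 'Mick.'],
    'age': [34, 65, 14, 34, 65],
})

# Number each occurrence of an Id (0, 1, ...) so duplicates become columns.
occurrence = df.groupby('Id').cumcount()

df1 = (df.set_index(['Id', 'age', occurrence])['username']
         .unstack()
         .add_prefix('username_')
         .reset_index())
# Columns come out as username_0 / username_1; rename them if the exact
# username / username_2 labels from the question are needed.
print(df1)
```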

Error during wrapup: long vectors not supported yet: in glm() function

倾然丶 夕夏残阳落幕 submitted on 2021-02-11 16:42:44
Question: I found several questions on Stack Overflow regarding this topic (some of them without any answer), but nothing related (so far) to this error in regression. I'm running a probit model in R with (I'm guessing) too many fixed effects (year and place): myprobit <- glm(factor(Y) ~ factor(T) + factor(X1) + factor(X2) + factor(X3) + factor(YEAR) + factor(PLACE), family = binomial(link = "probit"), data = DT) The PLACE variable has about 1000 unique values and YEAR has 8 values. The dataset DT has 13

What is the fastest way to record real-time data in python with least memory loss

孤街醉人 submitted on 2021-02-11 14:53:12
Question: In every step of a loop I have some data that I want saved to my hard disk in the end. One way: list = [] for i in range(1e10): list.append(numpy_array_i) pickle.dump(list, open(self.save_path, "wb"), protocol=4) But I worry: 1) I will run out of memory because of the list; 2) if something crashes, all the data will be lost. Because of this I have also thought of a way to save the data in real time, such as: file = make_new_csv_or_xlsx_file() for i in range(1e10): file.write_in_a_new_line(numpy
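
One incremental alternative, sketched under the assumption that each step produces a NumPy array: keep the file open and pickle each record as it arrives, so memory use stays flat and a crash loses at most the record currently being written. The file name and loop body are placeholders:

```python
import pickle
import numpy as np

save_path = "records.pkl"   # placeholder path

with open(save_path, "wb") as f:
    for i in range(1000):                    # stand-in for the real loop
        numpy_array_i = np.random.rand(10)   # stand-in for the real data
        pickle.dump(numpy_array_i, f, protocol=4)
        f.flush()                            # hand each record to the OS right away

# Reading back: repeated pickle.load until the file is exhausted.
records = []
with open(save_path, "rb") as f:
    while True:
        try:
            records.append(pickle.load(f))
        except EOFError:
            break
```

Calling os.fsync(f.fileno()) after each flush gives stronger crash safety at the cost of throughput.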

What is the fastest way to read several lines of data from a large file

可紊 submitted on 2021-02-11 12:47:41
Question: My application needs to read thousands of lines from a large CSV file of around 300 GB with a billion lines; each line contains several numbers. The data look like this: 1, 34, 56, 67, 678, 23462, ... 2, 3, 6, 8, 34, 5 23,547, 648, 34657 ... I tried fgets, reading the file line by line in C, but it took really long; even wc -l in Linux took quite a while just to read all of the lines. I also tried to write all the data to a sqlite3 database based on the logic of the

What happens if a coordinator node goes down during a write in Apache Cassandra?

你离开我真会死。 submitted on 2021-02-11 12:36:01
Question: Pretty much the title, but I realize there are a lot of different edge cases here, and I am somehow not able to find a credible source on this. Answer 1: If the coordinator goes down mid-request, Cassandra drivers are designed to handle that case with a retry policy, which you can configure. More details. Source: https://stackoverflow.com/questions/37722828/what-happens-if-a-coordinator-node-goes-down-during-a-write-in-apache-cassandra
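
A minimal sketch of configuring such a retry policy, assuming the DataStax Python driver (cassandra-driver); the contact point, keyspace, table, and the RetryOnceOnWriteTimeout class are placeholders, not part of the original answer:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import RetryPolicy

class RetryOnceOnWriteTimeout(RetryPolicy):
    """Retry a timed-out write once before giving up."""
    def on_write_timeout(self, query, consistency, write_type,
                         required_responses, received_responses, retry_num):
        if retry_num == 0:
            return self.RETRY, consistency
        return self.RETHROW, None

profile = ExecutionProfile(retry_policy=RetryOnceOnWriteTimeout())
cluster = Cluster(['127.0.0.1'],                     # placeholder contact point
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect('my_keyspace')             # placeholder keyspace
session.execute("INSERT INTO users (id, name) VALUES (uuid(), 'test')")  # placeholder table
```

Blanket write retries are only safe for idempotent statements, which is why the driver's default policy is conservative about retrying writes.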