pandas

Pandas: break categorical column to multiple columns

北城余情 submitted on 2021-02-07 02:52:33
Question: Imagine a Pandas dataframe of the following format:

    id  type  v1  v2
     1  A      6   9
     1  B      4   2
     2  A      3   7
     2  B      3   6

I would like to convert this dataframe into the following format:

    id  A_v1  A_v2  B_v1  B_v2
     1     6     9     4     2
     2     3     7     3     6

Is there an elegant way of doing this?

Answer 1: You could use set_index to move the type and id columns into the index, and then unstack to move the type index level into the column index. You don't have to worry about the v values -- where the index levels go dictates the arrangement of the values.
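A minimal sketch of that answer on the example data; the column-flattening step at the end is one common way to get the A_v1-style names, not the only one:

    import pandas as pd

    df = pd.DataFrame({'id': [1, 1, 2, 2],
                       'type': ['A', 'B', 'A', 'B'],
                       'v1': [6, 4, 3, 3],
                       'v2': [9, 2, 7, 6]})

    # Move id and type into the index, then pivot the type level into columns.
    out = df.set_index(['id', 'type']).unstack('type')

    # unstack leaves a (value, type) MultiIndex on the columns; flatten it
    # into names like A_v1 and sort so the A columns come before the B ones.
    out.columns = [f'{typ}_{val}' for val, typ in out.columns]
    out = out[sorted(out.columns)].reset_index()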

Pandas - select column using other column value as column name

百般思念 submitted on 2021-02-07 00:52:29
Question: I have a dataframe that contains a column, let's call it "names". "names" holds the names of other columns. I would like to add a new column whose value in each row is taken from the column named in that row's "names" entry. Example:

Input dataframe:

    pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b']})

     a |  b | names
    ---|----|------
     1 | -1 | 'a'
     2 | -2 | 'b'
     3 | -3 | 'a'
     4 | -4 | 'b'

Output dataframe: pd.DataFrame.from_dict({"a": [1,
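A hedged sketch of one way to do that per-row lookup, assuming the result should land in a new column (the name "picked" is hypothetical, since the expected output above is cut off in the source):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame.from_dict({"a": [1, 2, 3, 4],
                                 "b": [-1, -2, -3, -4],
                                 "names": ['a', 'b', 'a', 'b']})

    # For each row, pick the value from the column named in "names".
    # factorize maps each name to a column position; fancy indexing on the
    # underlying array then selects one value per row.
    pos, cols = pd.factorize(df['names'])
    df['picked'] = df[cols].to_numpy()[np.arange(len(df)), pos]

    # A slower but very readable alternative:
    # df['picked'] = df.apply(lambda row: row[row['names']], axis=1)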

pandas apply function to multiple columns and multiple rows

我只是一个虾纸丫 submitted on 2021-02-06 23:15:23
Question: I have a dataframe of consecutive pixel coordinates in the columns 'xpos' and 'ypos', and I want to calculate the angle in degrees of each path segment between consecutive pixels. Currently I have the solution presented below, which works fine and is speedy enough for the size of my file, but iterating through all the rows does not seem like the pandas way to do it. I know how to apply a function to different columns, and how to apply functions to different rows of columns, but can't figure out how
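A vectorized sketch of the angle computation, replacing the row loop with diff and arctan2 (the sample coordinates are made up):

    import numpy as np
    import pandas as pd

    # Made-up sample of consecutive pixel coordinates.
    df = pd.DataFrame({'xpos': [0, 1, 2, 2], 'ypos': [0, 1, 1, 0]})

    # diff() gives the step from the previous pixel for every row at once;
    # arctan2 turns each (dy, dx) step into an angle, converted to degrees.
    # The first row has no predecessor, so its angle is NaN.
    df['angle'] = np.degrees(np.arctan2(df['ypos'].diff(), df['xpos'].diff()))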

Dask dataframe split partitions based on a column or function

做~自己de王妃 submitted on 2021-02-06 20:48:40
Question: I have recently begun looking at Dask for big data. I have a question about efficiently applying operations in parallel. Say I have some sales data like this:

    customerKey  productKey  transactionKey  grossSales  netSales  unitVolume  volume  transactionDate
    20353        189         219548          0.921058    0.921058  1           1       2017-02-01 00:00:00
    2596618      189         215015          0.709997    0.709997  1           1       2017-02-01 00:00:00
    30339435     189         215184          0
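A sketch of one way to get partitions aligned with a column, assuming the goal is to run per-customer work inside each partition (the tiny from_pandas frame just stands in for the real data):

    import dask.dataframe as dd
    import pandas as pd

    # Stand-in for the real sales data; in practice this would come from
    # dd.read_parquet or dd.read_csv.
    pdf = pd.DataFrame({'customerKey': [20353, 2596618, 20353, 2596618],
                        'grossSales': [0.921058, 0.709997, 0.5, 0.25]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # set_index shuffles the rows so that all rows for a given customerKey
    # end up in the same partition; map_partitions can then apply an ordinary
    # pandas function per partition without crossing customer boundaries.
    ddf = ddf.set_index('customerKey')
    totals = ddf.map_partitions(lambda part: part.groupby(level=0)['grossSales'].sum())
    print(totals.compute())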

Performance decrease for huge amount of columns. Pyspark

我们两清 submitted on 2021-02-06 20:18:54
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). Task: create the wide DF via groupBy and pivot, transform the columns into a vector, and feed it into KMeans from pyspark.ml. So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it. Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode, for a frame of roughly 500x9000. On the other hand, this processing
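A sketch of the pipeline the question describes, assuming an existing SparkSession and a wide frame wide_df produced by the groupBy/pivot step (the names wide_df and 'id' are placeholders):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Assumption: wide_df is the ~500x9000 frame, with an 'id' column plus
    # numeric feature columns.
    feature_cols = [c for c in wide_df.columns if c != 'id']

    # Pack the feature columns into a single vector column and cache the
    # result, since KMeans will scan it repeatedly.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    assembled = assembler.transform(wide_df).select('features').cache()

    # Train one model per candidate cluster count, as in the question
    # (7 different values of k).
    for k in range(2, 9):
        model = KMeans(k=k, featuresCol='features').fit(assembled)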
