pandas

Pandas: break categorical column to multiple columns

北城余情 submitted on 2021-02-07 02:52:33
Question: Imagine a Pandas dataframe of the following format:

    id  type  v1  v2
     1  A      6   9
     1  B      4   2
     2  A      3   7
     2  B      3   6

I would like to convert this dataframe into the following format:

    id  A_v1  A_v2  B_v1  B_v2
     1     6     9     4     2
     2     3     7     3     6

Is there an elegant way of doing this?

Answer 1: You could use set_index to move the type and id columns into the index, and then unstack to move the type index level into the column index. You don't have to worry about the v values -- where the index levels go dictates the arrangement of the values.
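A minimal sketch of that answer on the example data; the column-flattening step at the end is one common way to get the A_v1-style names, not the only one:

    import pandas as pd

    df = pd.DataFrame({'id': [1, 1, 2, 2],
                       'type': ['A', 'B', 'A', 'B'],
                       'v1': [6, 4, 3, 3],
                       'v2': [9, 2, 7, 6]})

    # Move id and type into the index, then pivot the type level into columns.
    out = df.set_index(['id', 'type']).unstack('type')

    # unstack leaves a (value, type) MultiIndex on the columns; flatten it
    # into names like A_v1 and sort so the A columns come before the B ones.
    out.columns = [f'{typ}_{val}' for val, typ in out.columns]
    out = out[sorted(out.columns)].reset_index()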

Pandas - select column using other column value as column name

百般思念 submitted on 2021-02-07 00:52:29
Question: I have a dataframe that contains a column, let's call it "names". "names" holds the names of other columns. I would like to add a new column whose value in each row is taken from the column named in that row's "names" entry. Example:

Input dataframe:

    pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b']})

     a |  b | names
    ---|----|------
     1 | -1 | 'a'
     2 | -2 | 'b'
     3 | -3 | 'a'
     4 | -4 | 'b'

Output dataframe: pd.DataFrame.from_dict({"a": [1,
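A hedged sketch of one way to do that per-row lookup, assuming the result should land in a new column (the name "picked" is hypothetical, since the expected output above is cut off in the source):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame.from_dict({"a": [1, 2, 3, 4],
                                 "b": [-1, -2, -3, -4],
                                 "names": ['a', 'b', 'a', 'b']})

    # For each row, pick the value from the column named in "names".
    # factorize maps each name to a column position; fancy indexing on the
    # underlying array then selects one value per row.
    pos, cols = pd.factorize(df['names'])
    df['picked'] = df[cols].to_numpy()[np.arange(len(df)), pos]

    # A slower but very readable alternative:
    # df['picked'] = df.apply(lambda row: row[row['names']], axis=1)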

pandas apply function to multiple columns and multiple rows

我只是一个虾纸丫 submitted on 2021-02-06 23:15:23
Question: I have a dataframe of consecutive pixel coordinates in the columns 'xpos' and 'ypos', and I want to calculate the angle in degrees of each path segment between consecutive pixels. Currently I have the solution presented below, which works fine and is speedy enough for the size of my file, but iterating through all the rows does not seem like the pandas way to do it. I know how to apply a function to different columns, and how to apply functions to different rows of columns, but can't figure out how
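A vectorized sketch of the angle computation, replacing the row loop with diff and arctan2 (the sample coordinates are made up):

    import numpy as np
    import pandas as pd

    # Made-up sample of consecutive pixel coordinates.
    df = pd.DataFrame({'xpos': [0, 1, 2, 2], 'ypos': [0, 1, 1, 0]})

    # diff() gives the step from the previous pixel for every row at once;
    # arctan2 turns each (dy, dx) step into an angle, converted to degrees.
    # The first row has no predecessor, so its angle is NaN.
    df['angle'] = np.degrees(np.arctan2(df['ypos'].diff(), df['xpos'].diff()))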

Dask dataframe split partitions based on a column or function

做~自己de王妃 submitted on 2021-02-06 20:48:40
Question: I have recently begun looking at Dask for big data. I have a question about efficiently applying operations in parallel. Say I have some sales data like this:

    customerKey  productKey  transactionKey  grossSales  netSales  unitVolume  volume  transactionDate
    20353        189         219548          0.921058    0.921058  1           1       2017-02-01 00:00:00
    2596618      189         215015          0.709997    0.709997  1           1       2017-02-01 00:00:00
    30339435     189         215184          0
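A sketch of one way to get partitions aligned with a column, assuming the goal is to run per-customer work inside each partition (the tiny from_pandas frame just stands in for the real data):

    import dask.dataframe as dd
    import pandas as pd

    # Stand-in for the real sales data; in practice this would come from
    # dd.read_parquet or dd.read_csv.
    pdf = pd.DataFrame({'customerKey': [20353, 2596618, 20353, 2596618],
                        'grossSales': [0.921058, 0.709997, 0.5, 0.25]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # set_index shuffles the rows so that all rows for a given customerKey
    # end up in the same partition; map_partitions can then apply an ordinary
    # pandas function per partition without crossing customer boundaries.
    ddf = ddf.set_index('customerKey')
    totals = ddf.map_partitions(lambda part: part.groupby(level=0)['grossSales'].sum())
    print(totals.compute())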

Performance decrease for huge amount of columns. Pyspark

我们两清 submitted on 2021-02-06 20:18:54
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). Task: create the wide DF via groupBy and pivot, transform the columns into a vector, and feed it into KMeans from pyspark.ml. So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it. Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode, for a frame of roughly 500x9000. On the other hand, this processing
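A sketch of the pipeline the question describes, assuming an existing SparkSession and a wide frame wide_df produced by the groupBy/pivot step (the names wide_df and 'id' are placeholders):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Assumption: wide_df is the ~500x9000 frame, with an 'id' column plus
    # numeric feature columns.
    feature_cols = [c for c in wide_df.columns if c != 'id']

    # Pack the feature columns into a single vector column and cache the
    # result, since KMeans will scan it repeatedly.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    assembled = assembler.transform(wide_df).select('features').cache()

    # Train one model per candidate cluster count, as in the question
    # (7 different values of k).
    for k in range(2, 9):
        model = KMeans(k=k, featuresCol='features').fit(assembled)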
