pandas

pandas append on columns with different names

Submitted by 為{幸葍}努か on 2021-02-07 21:00:37
Question: How do I append two different dataframes that have different column names?

```python
a = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "countryid": [22, 36, 21, 64],
    "famousfruit": ["banana", "apple", "mango", "orange"],
    "famousanimal": ["monkey", "elephant", "monkey", "horse"],
    "waterlvl": [23, 43, 41, 87]
}).set_index("id")
>> a

b = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "cid": [25, 27, 98, 67],
    "FAM_FRUIT": ["grapes", "pineapple", "avacado", "orange"],
    "FAM_ANI": ["giraffe", "dog", "cat", "horse"],
}).set_index("id")
>> b
```

How to append …
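The excerpt cuts off before the expected result, but a common way to append frames whose columns differ only in naming is to rename one frame's columns onto the other's and then concatenate. A minimal sketch follows; the column mapping is an assumption inferred from the sample data, not something stated in the question.

```python
import pandas as pd

a = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "countryid": [22, 36, 21, 64],
    "famousfruit": ["banana", "apple", "mango", "orange"],
    "famousanimal": ["monkey", "elephant", "monkey", "horse"],
    "waterlvl": [23, 43, 41, 87],
}).set_index("id")

b = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "cid": [25, 27, 98, 67],
    "FAM_FRUIT": ["grapes", "pineapple", "avacado", "orange"],
    "FAM_ANI": ["giraffe", "dog", "cat", "horse"],
}).set_index("id")

# Assumed correspondence between the two naming schemes.
mapping = {"cid": "countryid", "FAM_FRUIT": "famousfruit", "FAM_ANI": "famousanimal"}

# Rename b's columns to a's names and stack the rows; any column b lacks
# (here "waterlvl") is filled with NaN in the appended rows.
combined = pd.concat([a, b.rename(columns=mapping)])
print(combined)
```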

lambda function to scale column in pandas dataframe returns: “'float' object has no attribute 'min'”

Submitted by 人盡茶涼 on 2021-02-07 20:49:40
Question: I am just getting started in Python and machine learning and have encountered an issue which I haven't been able to fix myself or with any other online resource. I am trying to scale a column in a pandas dataframe using a lambda function in the following way:

```python
X['col1'] = X['col1'].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
```

and get the following error message:

```
'float' object has no attribute 'min'
```

I have tried to convert the data type into integer and the following error is returned:
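The excerpt is truncated before the second error, but the first one has a clear cause: Series.apply hands each element of the column to the lambda as a plain scalar, so x is a float and has no .min() or .max(). A minimal sketch of the usual fix, computing the minimum and maximum on the whole column instead (the toy data below is only for illustration):

```python
import pandas as pd

X = pd.DataFrame({"col1": [3.0, 7.5, 1.2, 9.9]})  # illustrative stand-in data

# Min-max scaling needs column-wide statistics, so take them from the Series
# itself rather than from the individual elements that .apply() would pass in.
col_min, col_max = X['col1'].min(), X['col1'].max()
X['col1'] = (X['col1'] - col_min) / (col_max - col_min)
print(X)
```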

java.lang.IllegalArgumentException when applying a Python UDF to a Spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2021-02-07 20:39:38
Question: I'm testing the example code provided in the documentation of pandas_udf (https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf), using PySpark 2.3.1 on my local machine:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
```

…
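The excerpt stops inside the UDF body; for context, the documentation example it refers to continues by normalizing the "v" column within each group, roughly as sketched below. As for the java.lang.IllegalArgumentException in the title, a frequent culprit with Spark 2.3.x/2.4.x is a pyarrow release of 0.15 or newer, whose changed Arrow IPC format breaks grouped pandas_udf calls; pinning pyarrow below 0.15 (or setting the ARROW_PRE_0_15_IPC_FORMAT=1 environment variable on driver and executors) is the documented workaround. Treat both the completion and the diagnosis as assumptions, since the question is cut off before the traceback.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    # Standardize "v" within each group, as in the Spark 2.3 docs example.
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).show()
```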

I have to compare data from each row of a Pandas DataFrame with data from the rest of the rows, is there a way to speed up the computation?

Submitted by 前提是你 on 2021-02-07 20:38:04
Question: Let's say I have a pandas DataFrame (loaded from a csv file) with this structure (the number of var and err columns is not fixed, and it varies from file to file):

```
var_0; var_1; var_2;
32;    9;     41;
47;    22;    41;
15;    12;    32;
3;     4;     4;
10;    9;     41;
43;    21;    45;
32;    14;    32;
51;    20;    40;
```

Let's discard the err_ds_j and the err_mean columns for the sake of this question. I have to perform an automatic comparison of the values of each row with the values of the other rows; as an example: I have to compare …
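The question is cut off before the exact comparison rule, but the generic way to speed up all-pairs row comparisons in pandas is to drop to NumPy and let broadcasting compare every row against every other row at once instead of looping. A minimal sketch, assuming an element-wise >= comparison (the data is rebuilt from the sample above, with the trailing semicolons dropped):

```python
import numpy as np
import pandas as pd
from io import StringIO

csv = """var_0;var_1;var_2
32;9;41
47;22;41
15;12;32
3;4;4
10;9;41
43;21;45
32;14;32
51;20;40"""
df = pd.read_csv(StringIO(csv), sep=";")

values = df.to_numpy()
# Broadcasting builds an (n_rows, n_rows, n_cols) boolean cube in one shot:
# pairwise[i, j, k] is True where row i's k-th value is >= row j's k-th value.
pairwise = values[:, None, :] >= values[None, :, :]
print(pairwise.shape)  # (8, 8, 3)

# Example aggregation: for each pair of rows, how many columns satisfy the rule.
print(pairwise.sum(axis=2))
```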

understanding lambda functions in pandas

Submitted by 房东的猫 on 2021-02-07 20:35:11
Question: I'm trying to solve a problem for a course in Python and found that someone has implemented a solution for the same problem on GitHub. I'm just trying to understand the solution given there. I have a pandas dataframe called Top15 with 15 countries, and one of the columns in the dataframe is 'HighRenew'. This column stores the percentage of renewable energy used in each country. My task is to convert the values in the 'HighRenew' column into boolean datatype. If the value for a particular country is …
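The excerpt is truncated before the conversion rule. As a sketch only, assuming the rule is "True when the country's value is at or above the column median" (a common variant of this exercise, not something stated in the excerpt), a lambda-based conversion could look like this, with toy data standing in for the real Top15 frame:

```python
import pandas as pd

# Illustrative stand-in for the course dataframe; 'HighRenew' initially holds
# the renewable-energy percentage for each country.
Top15 = pd.DataFrame(
    {"HighRenew": [17.9, 69.6, 10.2, 61.9, 33.7]},
    index=["China", "Brazil", "Japan", "Canada", "Germany"],
)

# Assumed threshold: the median of the column.
median = Top15['HighRenew'].median()
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda pct: pct >= median)
print(Top15.dtypes)  # HighRenew is now bool
print(Top15)
```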

Pandas calculate length of consecutive equal values from a grouped dataframe

Submitted by 寵の児 on 2021-02-07 20:34:55
Question: I want to do what they've done in the answer here: Calculating the number of specific consecutive equal values in a vectorized way in pandas, but using a grouped dataframe instead of a series. So given a dataframe with several columns

```
A  B  C
------------
x  x  0
x  x  5
x  x  2
x  x  0
x  x  0
x  x  3
x  x  0
y  x  1
y  x  10
y  x  0
y  x  5
y  x  0
y  x  0
```

I want to groupby columns A and B, then count the number of consecutive zeros in C. After that I'd like to return counts of the number of times each length of …
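The excerpt ends mid-sentence, but the stated goal (per-group lengths of consecutive zeros, then a frequency count of those lengths) can be handled with the usual run-labelling trick: a cumulative sum over "value changed" flags gives every consecutive run its own id. A minimal sketch using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("xxxxxxx") + list("yyyyyy"),
    "B": ["x"] * 13,
    "C": [0, 5, 2, 0, 0, 3, 0, 1, 10, 0, 5, 0, 0],
})

def zero_run_lengths(s):
    run_id = (s != s.shift()).cumsum()      # new id whenever the value changes
    runs = s.groupby(run_id)
    return runs.size()[runs.first() == 0]   # keep only runs made of zeros

# Length of every run of consecutive zeros within each (A, B) group...
run_lengths = df.groupby(["A", "B"])["C"].apply(zero_run_lengths)

# ...and how often each run length occurs per group.
print(run_lengths.groupby(level=[0, 1]).value_counts())
```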

Error when changing date format in dataframe index

Submitted by 拟墨画扇 on 2021-02-07 20:34:44
Question: I have the following df:

```
                     A           B
2018-01-02  100.000000  100.000000
2018-01-03  100.808036  100.325886
2018-01-04  101.616560  102.307700
```

I am looking forward to changing the date format of the index, so I tried (using @jezrael's response in the linked question "Format pandas dataframe index date"):

```python
df.index = rdo.index.strftime('%d-%m-%Y')
```

But it outputs:

```
AttributeError: 'Index' object has no attribute 'strftime'
```

My desired output would be:

```
                     A           B
02-01-2018  100.000000  100.000000
03-01-2018  100.808036  100.325886
04-…
```
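The AttributeError usually means the index is a plain object/string Index rather than a DatetimeIndex, so it has no strftime. A minimal sketch of the usual fix, converting the index with pd.to_datetime first (the frame below is rebuilt from the sample in the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [100.000000, 100.808036, 101.616560],
     "B": [100.000000, 100.325886, 102.307700]},
    index=["2018-01-02", "2018-01-03", "2018-01-04"],
)

# Convert the string index to datetimes, then format it as day-month-year.
df.index = pd.to_datetime(df.index).strftime('%d-%m-%Y')
print(df)
```

Note that the resulting index holds strings again, so it loses its date semantics; if the goal is only a different display format, it can be preferable to keep a DatetimeIndex and format only when printing or exporting.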

How do I parse a csv with pandas that has a comma delimiter and space?

Submitted by 帅比萌擦擦* on 2021-02-07 20:30:42
Question: I currently have the following data.csv, which has a comma delimiter:

```
name,day
Chicken Sandwich,Wednesday
Pesto Pasta,Thursday
Lettuce, Tomato & Onion Sandwich,Friday
Lettuce, Tomato & Onion Pita,Friday
Soup,Saturday
```

The parser script is:

```python
import pandas as pd

df = pd.read_csv('data.csv', delimiter=',', error_bad_lines=False, index_col=False)
print(df.head(5))
```

The output is:

```
Skipping line 4: expected 2 fields, saw 3
Skipping line 5: expected 2 fields, saw 3

               name        day
0  Chicken Sandwich  Wednesday
```
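The skipped lines are the ones whose dish name itself contains a comma. One workaround, assuming the day column never contains a comma, is to split each line only on its last comma by passing a regular-expression separator (which requires the Python parsing engine); properly quoting the name field in the CSV would be the cleaner long-term fix.

```python
import pandas as pd

# Split only on the last comma of each line: a comma followed by no further
# commas before the end of the line. Regex separators need engine="python".
df = pd.read_csv('data.csv', sep=r',(?=[^,]*$)', engine='python')
print(df)
```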