data-analysis

how to map rows of two different dataframes based on a condition in pandas

岁酱吖の submitted on 2019-12-11 07:57:43
Question: I have two dataframes.

df1:

```
       Names
one    Sri is a good player
two    Ravi is a mentor
three  Kumar is a cricketer player
```

df2:

```
values
sri
NaN
sri, is
kumar,cricketer player
```

I am trying to get the row in df1 that contains all of the items of each entry in df2. My expected output is:

```
values                  Names
sri                     Sri is a good player
NaN
sri, is                 Sri is a good player
kumar,cricketer player  Kumar is a cricketer player
```

I tried:

```
df1["Names"].str.contains("|".join(df2["values"].values.tolist()))
```

I also tried other approaches, but I cannot achieve my expected output.
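A minimal sketch of one way to do this. The single `str.contains` with `"|".join` cannot work here because each df2 entry may hold several comma-separated tokens that must all match, and the case differs between the tokens and the names (this reading is an assumption, since the question is truncated):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Names": ["Sri is a good player", "Ravi is a mentor", "Kumar is a cricketer player"]},
    index=["one", "two", "three"],
)
df2 = pd.DataFrame({"values": ["sri", None, "sri, is", "kumar,cricketer player"]})

def first_match(val):
    """Return the first name in df1 containing every comma-separated token of val."""
    if pd.isna(val):
        return None
    tokens = [t.strip().lower() for t in val.split(",")]
    for name in df1["Names"]:
        if all(tok in name.lower() for tok in tokens):
            return name
    return None

df2["Names"] = df2["values"].map(first_match)
print(df2)
```

`first_match` is a hypothetical helper name; a merge-based solution would also work if the match had to be exact rather than substring-based.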

pandas calculating mean per month

允我心安 submitted on 2019-12-11 07:00:32
Question: I created the following dataframe:

```
availability = pd.DataFrame(propertyAvailableData).set_index("createdat")
monthly_availability = availability.fillna(value=0).groupby(pd.TimeGrouper(freq='M'))
```

This gives the following output:

```
            2015-08-18  2015-09-09  2015-09-10  2015-09-11  2015-09-12
createdat
2015-08-12         1.0         1.0         1.0         1.0         1.0
2015-08-17         0.0         0.0         0.0         0.0         0.0
2015-08-18         0.0         1.0         1.0         1.0         1.0
2015-08-18         0.0         0.0         0.0         0.0         0.0
2015-08-19         0.0         1.0         1.0         1.0         1.0
2015-09-03         0.0         1.0         1.0         1.0         1.0
2015-09-03         0.0         1…
```
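Note that `pd.TimeGrouper` was deprecated and later removed; a version-stable way to get a per-month mean is to group on the index converted to monthly periods. A minimal sketch (the column name is invented for illustration, since the real frame is not shown in full):

```python
import pandas as pd

idx = pd.to_datetime(["2015-08-12", "2015-08-17", "2015-08-18", "2015-09-03"])
availability = pd.DataFrame({"booked": [1.0, 0.0, 1.0, 1.0]}, index=idx)
availability.index.name = "createdat"

# fill gaps with 0, then average within each calendar month
monthly = availability.fillna(0).groupby(availability.index.to_period("M")).mean()
print(monthly)
```

`pd.Grouper(freq='M')` is the direct modern replacement for `pd.TimeGrouper`, but the `to_period` form above works across old and new pandas versions alike.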

How to add a new column and aggregate values in R

半世苍凉 submitted on 2019-12-11 06:59:53
Question: I am completely new to gnuplot and am only trying this because I need to learn it. I have values in three columns, where the first represents the filename (date and time, one-hour interval) and the remaining two columns represent two different quantities, Prop1 and Prop2:

```
Datetime           Prop1  Prop2
20110101_0000.txt      2      5
20110101_0100.txt      2      5
20110101_0200.txt      2      5
...
20110101_2300.txt      2      5
20110201_0000.txt      2      5
20110101_0100.txt      2      5
...
20110201_2300.txt      2      5
...
```

I need to aggregate the data by the…
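Although the question is truncated before the aggregation rule is stated, the usual task with data of this shape is one row per day, combining Prop1 and Prop2 over the hourly files. In R that would be `aggregate(cbind(Prop1, Prop2) ~ day, data, sum)`; a pandas sketch of the same idea, kept in Python for consistency with the other entries, with the daily sum as an assumed aggregate:

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": ["20110101_0000.txt", "20110101_0100.txt", "20110102_0000.txt"],
    "Prop1": [2, 2, 3],
    "Prop2": [5, 5, 6],
})

# derive the day from the first 8 characters of the filename
df["day"] = pd.to_datetime(df["Datetime"].str[:8], format="%Y%m%d")
daily = df.groupby("day")[["Prop1", "Prop2"]].sum()
print(daily)
```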

How can I create a new dataframe comparing values and getting only most recent data in R?

一个人想着一个人 submitted on 2019-12-11 06:49:56
Question: I have a data frame with Gini Index data for countries. Plenty of the values are NA, so I want to create a new data frame that has, for each country, the most recent Gini Index measured for it. For example, if Brazil has a value for 2012, 2013, and 2015, the new data frame will have only the value for 2015. This is what the data looks like:

```
   Country.Name  Country.Code  X2014  X2015  X2016  X2017
8  Argentina     ARG            41.4     NA   42.4     NA
9  Armenia       ARM            31.5   32.4   32.5     NA
13 Austria       AUT            30.5   30.5…
```
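One way to express "most recent non-NA value per row" is a forward fill across the year columns followed by taking the last column. The question targets R, where the same idea can be done with `zoo::na.locf` across columns; the sketch below uses pandas to stay consistent with the other entries, with the sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Country.Name": ["Argentina", "Armenia", "Austria"],
    "Country.Code": ["ARG", "ARM", "AUT"],
    "X2014": [41.4, 31.5, 30.5],
    "X2015": [None, 32.4, 30.5],
    "X2016": [42.4, 32.5, None],
    "X2017": [None, None, None],
})

year_cols = ["X2014", "X2015", "X2016", "X2017"]
# carry the last observed value rightward, then keep the final column
df["latest_gini"] = df[year_cols].ffill(axis=1).iloc[:, -1]
print(df[["Country.Name", "latest_gini"]])
```

`latest_gini` is an invented column name; a country with no values at all in the year columns would end up with NaN, which matches the intent.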

how to remove entire column if a particular row has duplicate values in a dataframe in python

隐身守侯 submitted on 2019-12-11 05:55:51
Question: I have a dataframe like this:

```
   Name    City
0  sri     chennai
1  pedhci  pune
2  bahra   pune
```

There is a duplicate in the City column. I tried:

```
df["City"].drop_duplicates()
```

but it returns only that particular column. My desired output is:

```
   Name    City
0  sri     chennai
1  pedhci  pune
```

Answer 1: You can use:

```
df2 = df.drop_duplicates(subset='City')
```

if you want to store the result in a new dataframe, or:

```
df.drop_duplicates(subset='City', inplace=True)
```

if you want to update df. This produces:

```
>>> df
   City     Name…
```
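A runnable version of the accepted approach; `drop_duplicates` keeps the first occurrence by default, and `keep='last'` (or `keep=False` to drop every duplicated row) covers the other variants:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["sri", "pedhci", "bahra"],
                   "City": ["chennai", "pune", "pune"]})

# keep only the first row seen for each city
output_df = df.drop_duplicates(subset="City")
print(output_df)
```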

How can I produce a table of transition types in R?

喜夏-厌秋 submitted on 2019-12-11 02:36:29
Question: I have some data with a number of different ids and a list of their states at different times (t1, t2, t3, etc.), and I'd like to generate a table giving information about the different types of state change that happen. It would look like this for the sample data (copied below):

```
  x y z
x 0 2 0
y 1 2 1
z 1 0 2
```

This shows, for example, that x changed to y twice and y changed to x once. Does anyone know how I might be able to do this in R?

SAMPLE DATA:

```
id <- c(…
```
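In R the classic one-liner for a single sequence is `table(head(state, -1), tail(state, -1))`, applied within each id. The same transition count can be sketched in pandas (kept in Python for consistency with the other entries) by pairing each state with the next state inside each id and cross-tabulating. The sample data below is invented, since the question's data is cut off:

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "state": ["x", "y", "y", "x", "z", "z"],
})

# pair each state with the following state, without crossing id boundaries
df["next"] = df.groupby("id")["state"].shift(-1)
pairs = df.dropna(subset=["next"])
transitions = pd.crosstab(pairs["state"], pairs["next"])
print(transitions)
```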

python pandas unable to display summary of large dataframe

半世苍凉 submitted on 2019-12-11 01:58:05
Question: I recently upgraded to pandas version 0.13 and am experiencing a problem where, no matter how big my dataframe is (the biggest one has 25 columns and 158430 rows), pandas prints out the entire dataframe (well, not the entire thing, just a few rows in each column, but it's still messy!) instead of printing the summary table, which is much cleaner for such large data frames. I was just wondering whether anyone else is having this problem, or has had it in the past, and…
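The switch that controls this behaviour is the `display.large_repr` option: `'truncate'` (the newer default, which prints a clipped frame) versus `'info'` (the old summary view). A sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((158430, 25)))

# show the df.info()-style summary instead of a truncated frame
pd.set_option("display.large_repr", "info")
summary = repr(df)

# restore the default behaviour afterwards
pd.reset_option("display.large_repr")
```

The summary view only kicks in for frames larger than the `display.max_rows` / `display.max_columns` thresholds, which this 158430-row frame easily exceeds.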

how to split and categorize value in a column of a pandas dataframe

梦想的初衷 submitted on 2019-12-11 01:47:52
Question: I have a df:

```
   keys
0  one
1  two,one
2  " "
3  five,one
4  " "
5  two,four
6  four
7  four,five
```

and two lists:

```
actual = ["one", "two"]
syn = ["four", "five"]
```

I am creating a new column df["val"]. I am trying to get the categories of the cells in df["keys"]: if any one of the keys is present in actual, then I want to add "actual" in a new column on the same row; if none of the values is present in actual, then I want the corresponding df["val"] to be "syn"; and it should not do anything to the whitespace cells. My…
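The requirement is cut off above, so the rule below is one reading of it: a cell whose tokens include anything from actual is tagged "actual", otherwise anything from syn is tagged "syn", and whitespace-only cells are left untouched. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one",
                            " ", "two,four", "four", "four,five"]})
actual = ["one", "two"]
syn = ["four", "five"]

def categorize(cell):
    if cell.strip() == "":          # leave whitespace cells alone
        return cell
    tokens = [t.strip() for t in cell.split(",")]
    if any(t in actual for t in tokens):
        return "actual"
    if any(t in syn for t in tokens):
        return "syn"
    return ""

df["val"] = df["keys"].apply(categorize)
print(df)
```

`categorize` is a hypothetical helper; under a different reading (e.g. "syn" wins whenever any token falls outside actual), the two `any` checks would simply swap priority.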

Spark-sqlserver connection

拟墨画扇 submitted on 2019-12-11 01:37:03
Question: Can we connect Spark with SQL Server? If so, how? I am new to Spark; I want to connect the server to Spark and work directly against SQL Server instead of uploading a .txt or .csv file. Please help. Thank you.

Answer 1: Here are some code snippets. A DataFrame is used to create the table t2 and insert data. The SqlContext is used to load the data from the t2 table into a DataFrame. I added the spark.driver.extraClassPath and spark.executor.extraClassPath settings to my spark-defaults.conf file.

```
//Spark 1.4.1…
```
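On more recent Spark versions the same connection is usually made through the DataFrame JDBC reader with the Microsoft JDBC driver on the classpath. The sketch below is untested connection boilerplate, not a verified setup: the host, port, database, table, and credential values are all placeholders, and the jar path must point at a real mssql-jdbc driver:

```python
from pyspark.sql import SparkSession

# the mssql JDBC jar must be visible to both driver and executors
spark = (SparkSession.builder
         .appName("sqlserver-demo")
         .config("spark.jars", "/path/to/mssql-jdbc.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.t2")
      .option("user", "myuser")
      .option("password", "mypassword")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

df.show()
```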

Simple (working) handwritten digit recognition: how to improve it?

自作多情 submitted on 2019-12-10 23:27:37
Question: I just wrote this very simple handwritten digit recognition. Here is an 8 KB archive with the following code plus ten .PNG image files. It works: the sample image is correctly recognized (the inline images are not reproduced here). In short, each digit of the database (50x50 pixels = 2500 coefficients) is summarized into a 10-coefficient vector (by keeping the 10 biggest singular values; see low-rank approximation with SVD). Then, for the digit to be recognized, we minimize the distance to the digits in the database.

```
from scipy import misc
import numpy as np…
```
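The pipeline described (each 50x50 digit compressed to its 10 largest singular values, then nearest-neighbour matching on those vectors) can be sketched as follows. Random matrices stand in for the .PNG database here, so this only illustrates the mechanics, not real recognition accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(img, k=10):
    # keep the k largest singular values as a compact descriptor
    return np.linalg.svd(img, compute_uv=False)[:k]

# stand-in database: one 50x50 "image" per digit 0-9
database = {d: rng.random((50, 50)) for d in range(10)}
db_features = {d: features(img) for d, img in database.items()}

def recognize(img):
    f = features(img)
    # nearest neighbour in singular-value space
    return min(db_features, key=lambda d: np.linalg.norm(db_features[d] - f))

print(recognize(database[3]))
```

Discarding the singular vectors makes the descriptor invariant to rotations of the row and column spaces, which is part of why so few coefficients suffice; the original question's code presumably averages several samples per digit, which this sketch omits.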