pandas

timedelta64 and datetime conversion

流过昼夜 posted on 2021-02-05 08:04:01
Question: I have two datetime (Timestamp) columns in my dataframe, df['start'] and df['end'], and I'd like to get the duration between the two dates. So I create the duration column:

    df['duration'] = df['start'] - df['end']

However, the duration column is now formatted as numpy.timedelta64 instead of datetime.timedelta, as I would expect.

    >>> df['duration'][0]
    numpy.timedelta64(0,'ns')

While

    >>> df['start'][0] - df['end'][0]
    datetime.timedelta(0)

Can someone explain to me why the array
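
The question is cut off in this listing, so the following is only a minimal sketch of the behaviour it describes, not the original answer. The sample timestamps are made up for illustration. A datetime column is stored in a NumPy-backed array, so subtracting two such columns gives a timedelta64 column; individual values can still be converted to datetime.timedelta, or the whole column to plain seconds.

```python
import datetime

import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "start": pd.to_datetime(["2021-02-05 08:00:00", "2021-02-05 09:30:00"]),
    "end": pd.to_datetime(["2021-02-05 08:00:00", "2021-02-05 09:00:00"]),
})
df["duration"] = df["start"] - df["end"]

# The column dtype is a timedelta64 variant (NumPy kind "m").
print(df["duration"].dtype)

# A single element can be converted to a plain datetime.timedelta.
first = df["duration"].iloc[0]
as_py = pd.Timedelta(first).to_pytimedelta()

# For arithmetic it is often simpler to convert the whole column to seconds.
seconds = df["duration"].dt.total_seconds()
```

Wrapping the element in pd.Timedelta is a defensive choice: it accepts both a numpy.timedelta64 and a pandas.Timedelta, so the conversion does not depend on which of the two a given pandas version returns.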

Pandas filtering based on OR AND

倖福魔咒の posted on 2021-02-05 08:03:54
Question: I am trying to filter rows in a pandas df like this:

    df1 = df0[(df0.col1=='a') | (df0.col2=='b' & df0.col3=='c')]

I believe I used proper parentheses, but I get:

    cannot compare a dtyped [object] array with a scalar of type [bool]

Basically, the condition I want is: a OR (b AND c) is true.

Answer 1: Boolean indexing. Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by
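
The quoted answer is truncated here, but the point about parentheses can be sketched as follows. In Python, & binds more tightly than ==, so 'b' & df0.col3 is evaluated first unless each comparison is wrapped in its own parentheses. The sample rows below are invented for illustration.

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
df0 = pd.DataFrame({
    "col1": ["a", "x", "x", "x"],
    "col2": ["z", "b", "b", "z"],
    "col3": ["c", "c", "z", "c"],
})

# Each comparison gets its own parentheses before combining with | and &.
mask = (df0.col1 == "a") | ((df0.col2 == "b") & (df0.col3 == "c"))
df1 = df0[mask]
print(df1)
```

Rows 0 and 1 are kept: row 0 satisfies the first condition, and row 1 satisfies both halves of the second.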

How to extract table from website using python

青春壹個敷衍的年華 posted on 2021-02-05 08:02:48
Question: I have been trying to extract the table from a website, but I am lost. Can anyone help me? My goal is to extract the table of the scope page: https://training.gov.au/Organisation/Details/31102

    import requests
    from bs4 import BeautifulSoup

    url = "https://training.gov.au/Organisation/Details/31102"
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, 'lxml')
    table = soup.find(id="ScopeQualification")
    [row.text.split() for row in table.find_all("tr")]

Answer 1: find OrganisationId

pandas: multiply column depending on other column

别来无恙 posted on 2021-02-05 07:59:05
Question: I have a dataframe with columns a and b. I want to multiply column a by value x if b is true and by value y if b is false. What is the best way to achieve this?

Answer 1: You could do it in 2 steps:

    df.loc[df.b, 'a'] *= x
    df.loc[df.b == False, 'a'] *= y

Or in 1 step using where:

    In [366]: df = pd.DataFrame({'a':randn(5), 'b':[True, True, False, True, False]})
              df
    Out[366]:
              a      b
    0  0.619641   True
    1 -2.080053   True
    2  0.379665  False
    3  0.134897   True
    4  1.580838  False

    In [367]: df.a *= np.where(df.b, 5
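
The one-step variant is cut off in this listing. As a self-contained sketch of the same idea, np.where can build a per-row factor from the boolean column; the values of x and y and the sample data below are placeholders, not taken from the original answer.

```python
import numpy as np
import pandas as pd

# Placeholder data and factors, chosen only to make the result easy to check.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [True, False, True, False]})
x, y = 5, 10

# np.where picks x where b is True and y where b is False, row by row.
df["a"] = df["a"] * np.where(df["b"], x, y)
print(df)
```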

Loop through XML in Python

瘦欲@ posted on 2021-02-05 07:57:25
Question: My data set is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <depts xmlns="http://SOMELINK" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" date="2021-01-15">
      <dept dept_id="00001" col_two="00001value" col_three="00001false" name = "some_name">
        <owners>
          <currentowner col_four="00001value" col_five="00001value" col_six="00001false" name = "some_name">
            <addr col_seven="00001value" col_eight="00001value" col_nine="00001false"/>
            <
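
The sample data is truncated in this listing, so the sketch below uses a shortened document whose closing tags are an assumption. It shows one way to loop over namespaced XML with the standard library: because the root declares a default namespace, every element lookup needs a namespace mapping (the prefix d is an arbitrary choice).

```python
import xml.etree.ElementTree as ET

# Shortened sample; the closing tags are assumed, since the original is cut off.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""

root = ET.fromstring(xml_text.encode("utf-8"))
ns = {"d": "http://SOMELINK"}  # prefix "d" is arbitrary; the URI must match the xmlns

rows = []
for dept in root.findall("d:dept", ns):
    for owner in dept.findall("d:owners/d:currentowner", ns):
        addr = owner.find("d:addr", ns)
        rows.append({
            "dept_id": dept.get("dept_id"),
            "owner_name": owner.get("name"),
            "col_seven": addr.get("col_seven") if addr is not None else None,
        })
print(rows)
```

A list of dicts like rows can be passed to pd.DataFrame(rows) if a table is the end goal.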

How to reindex a pandas dataframe within a function?

為{幸葍}努か posted on 2021-02-05 07:55:30
Question: I'm trying to add column headers with empty values to my dataframe (just like this answer), but within a function that is already modifying it, like so:

    mydf = pd.DataFrame()

    def myfunc(df):
        df['newcol1'] = np.nan  # this works
        list_of_newcols = ['newcol2', 'newcol3']
        df = df.reindex(columns=df.columns.tolist() + list_of_newcols)  # this does not
        return

    myfunc(mydf)

If I run the lines individually in an IPython console, it will add them. But run as a script, newcol1 will be added but 2 and 3
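
No answer is included in this listing, so the following is only one possible fix. reindex returns a new DataFrame, and assigning it to the local name df inside the function does not change the object the caller holds. A minimal sketch that returns the new frame and rebinds it at the call site:

```python
import numpy as np
import pandas as pd


def myfunc(df):
    # Work on a copy so the caller's frame is never modified in place.
    df = df.copy()
    df["newcol1"] = np.nan
    list_of_newcols = ["newcol2", "newcol3"]
    # reindex builds a new frame; return it instead of discarding it.
    return df.reindex(columns=df.columns.tolist() + list_of_newcols)


mydf = pd.DataFrame()
mydf = myfunc(mydf)  # rebind the name to the returned frame
print(mydf.columns.tolist())
```

Returning the frame makes the function behave the same in a script and in an interactive console, since nothing depends on in-place side effects.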

Pandas NLTK tokenizing “unhashable type: 'list'”

我们两清 posted on 2021-02-05 07:55:10
Question: Following this example: Twitter data mining with Python and Gephi: Case synthetic biology

CSV to: df['Country', 'Responses']

    'Country'
    Italy
    Italy
    France
    Germany

    'Responses'
    "Loren ipsum..."
    "Loren ipsum..."
    "Loren ipsum..."
    "Loren ipsum..."

1. tokenize the text in 'Responses'
2. remove the 100 most common words (based on brown.corpus)
3. identify the remaining 100 most frequent words

I can get through steps 1 and 2, but get an error on step 3:

    TypeError: unhashable type: 'list'

I believe it's because
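
The question is cut off before its own diagnosis, so the cause assumed here is a common one rather than a confirmed one: after tokenizing, each cell holds a list, and counting the cells themselves tries to hash a list. The sketch below flattens the token lists before counting. To stay self-contained it uses str.split in place of an NLTK tokenizer and a tiny hand-written set in place of the 100 most common Brown corpus words; both are stand-ins.

```python
from collections import Counter

import pandas as pd

# Invented responses; only the column names come from the question.
df = pd.DataFrame({
    "Country": ["Italy", "Italy", "France", "Germany"],
    "Responses": ["the cat sat", "the dog sat", "a cat ran", "the end"],
})

# Step 1: tokenize (str.split is a stand-in for an NLTK tokenizer).
df["tokens"] = df["Responses"].str.lower().str.split()

# Step 2: drop common words (hypothetical stand-in for the Brown-based list).
common = {"the", "a"}
df["filtered"] = df["tokens"].apply(lambda toks: [t for t in toks if t not in common])

# Step 3: count individual tokens, not whole lists, by flattening first.
counts = Counter(tok for toks in df["filtered"] for tok in toks)
top = counts.most_common(2)
print(top)
```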

Wrong Dates in Dataframe and Subplots

不羁的心 posted on 2021-02-05 07:54:30
Question: I am trying to plot the data in my CSV file. Currently my dates are not shown properly in the plot, even though I am converting them. How can I change it to show the proper date format as defined (Y-m-d)? My second question is that I am currently plotting all the data in one plot, but want one subplot for every Valuegroup. My code looks like the following:

    import pandas as pd
    import matplotlib.pyplot as plt

    csv_loader = pd.read_csv('C:/Test.csv', encoding='cp1252', sep=';', index_col=0).dropna()
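
The rest of the code and the CSV layout are not shown, so the following is a sketch under assumptions: the column names Date and Value are hypothetical, and only Valuegroup is taken from the question. It draws one subplot per group and formats the x-axis ticks as Y-m-d with a DateFormatter.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Invented data; "Date" and "Value" are hypothetical column names.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]),
    "Valuegroup": ["A", "A", "B", "B"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

groups = list(df.groupby("Valuegroup"))
fig, axes = plt.subplots(nrows=len(groups), ncols=1, sharex=True, squeeze=False)

for ax, (name, group) in zip(axes[:, 0], groups):
    ax.plot(group["Date"], group["Value"])
    ax.set_title(str(name))
    # Show tick labels as Y-m-d instead of the default date format.
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))

fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Passing squeeze=False keeps axes two-dimensional even when there is only one group, so the loop works for any number of groups.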

Adding row shifting in pandas dataframe

你离开我真会死。 posted on 2021-02-05 07:54:30
Question: I have a pandas df, which I created by using the shift() function while iterating over the original df:

    for i in range(2, 4):
        df["lag_{}".format(i)] = df.x.shift(i)

So there will be the actual x column and lag2-lag10 columns with shifted x values. I have trained a regression model on this dataset to make one-step-forward predictions. I would like to add a new row at the end of the dataframe with a NaN value for x and shifted values from the last position, to be able to use these new lags for fitting
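
No answer is included in this listing. One way to get such a row, sketched below under the assumption that the frame has a default integer index, is to extend the index by one position (which leaves x as NaN in the new row) and then recompute the lag columns, so the new row picks up the most recent known values. The sample values are invented.

```python
import numpy as np
import pandas as pd

# Invented series; assumes a default integer index.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Extend the index by one row; the new row's x is NaN.
df = df.reindex(df.index.tolist() + [df.index[-1] + 1])

# Recompute the lags so the new row carries the latest known values.
for i in range(2, 4):
    df["lag_{}".format(i)] = df["x"].shift(i)

print(df.tail(1))
```

The last row now has NaN for x, with lag_2 and lag_3 holding the values from two and three rows earlier.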

Removing specific word in a string in pandas

空扰寡人 posted on 2021-02-05 07:53:28
Question: I'm trying to remove several words from each value of a column, but nothing is happening.

    stop_words = ["and","lang","naman","the","sa","ko","na",
                  "yan","n","yang","mo","ung","ang","ako","ng",
                  "ndi","pag","ba","on","un","Me","at","to",
                  "is","sia","kaya","I","s","sla","dun","po","b","pro"]
    newdata['Verbatim'] = newdata['Verbatim'].replace(stop_words,'', inplace = True)

I'm trying to generate a word cloud from the result of the replacement, but I am getting the same words (that doesn't mean
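
No answer is included in this listing, so this is one possible approach rather than the original fix. Two things likely work against the snippet: Series.replace with a plain list matches whole cell values rather than words inside a string, and calling it with inplace=True and assigning the result back typically stores None in the column. The sketch below uses a regular expression with word boundaries instead; the stop-word list is shortened and the sample sentences are invented.

```python
import re

import pandas as pd

# Shortened stop-word list and invented sample text.
stop_words = ["and", "lang", "the", "sa", "ko", "na"]
newdata = pd.DataFrame({"Verbatim": ["the cat and the hat", "sana all", "ko na lang"]})

# \b ensures only whole words match, so "sa" does not eat into "sana".
pattern = r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b"

cleaned = newdata["Verbatim"].str.replace(pattern, "", regex=True)
# Collapse the leftover runs of spaces and trim the ends.
newdata["Verbatim"] = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
print(newdata["Verbatim"].tolist())
```

The match is case-sensitive, as in the original list, which contains both "Me" and "I" in capitalised form.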