pandas

How to loop through pandas df column, finding if string contains any string from a separate pandas df column?

久未见 submitted on 2021-02-05 11:16:31
Question: I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.
|---------------------|------------------|
| sentenceCol         | other column     |
|---------------------|------------------|
|'this is from france'| 15               |
|---------------------|------------------|
DF B contains a column that is a list of countries.
|---------------------|------------------|
| country             | other column     |
|---------------------|------------------|
|'france'             | 33               |
|------------------
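One way to approach the question in the title, sketched under assumptions: the frame names `df_a` and `df_b`, the second sample row in each frame, and the output column names are made up for illustration. Instead of looping, the country list is joined into one escaped regular expression and matched with the vectorized string methods.

```python
import re

import pandas as pd

# Toy data modeled on the excerpt; the second row of each frame is invented.
df_a = pd.DataFrame(
    {"sentenceCol": ["this is from france", "no country here"], "other": [15, 7]}
)
df_b = pd.DataFrame({"country": ["france", "spain"], "other": [33, 12]})

# Build one alternation pattern; re.escape guards against regex metacharacters.
pattern = "|".join(re.escape(c) for c in df_b["country"])

# True where the sentence contains at least one country name.
df_a["has_country"] = df_a["sentenceCol"].str.contains(
    pattern, case=False, regex=True
)

# First matching country per sentence (NaN when nothing matches).
df_a["matched_country"] = df_a["sentenceCol"].str.extract(
    f"({pattern})", flags=re.IGNORECASE, expand=False
)
```

Note that this matches substrings, so a short country name could match inside a longer word; wrapping the pattern in word boundaries (`\b`) is one possible refinement.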

pySpark, aggregate complex function (difference of consecutive events)

喜你入骨 submitted on 2021-02-05 11:16:29
Question: I have a DataFrame (df) whose columns are userid (the user id) and day (the day). I'm interested in computing, for every user, the average time interval between the days on which he/she was active. For instance, for a given user the DataFrame may look something like this:
    userid  day
    1       2016-09-18
    1       2016-09-20
    1       2016-09-25
If the DataFrame is a pandas DataFrame, I could compute the quantity I'm interested in like this:
    import numpy as np
    np.mean(np.diff(df[df.userid==1].day))
However, this is quite
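As a reference point for the pandas computation in the excerpt, here is a minimal sketch that extends it from one user to all users at once. It is pandas, not PySpark; the rows for user 2 are invented for illustration. The same shape (previous day per user, then a per-user mean of the gaps) is what a Spark solution would typically express with a window partitioned by userid.

```python
import pandas as pd

# User 1 comes from the excerpt; user 2 is made up for illustration.
df = pd.DataFrame(
    {
        "userid": [1, 1, 1, 2, 2],
        "day": pd.to_datetime(
            ["2016-09-18", "2016-09-20", "2016-09-25", "2016-09-01", "2016-09-04"]
        ),
    }
)

# Consecutive differences only make sense on sorted days within each user.
df = df.sort_values(["userid", "day"])

# Gap in days to the previous active day of the same user (NaN for the first day).
gaps = df.groupby("userid")["day"].diff().dt.days

# Average gap per user; NaN entries are skipped by mean().
mean_gap_days = gaps.groupby(df["userid"]).mean()
```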

pySpark, aggregate complex function (difference of consecutive events)

こ雲淡風輕ζ submitted on 2021-02-05 11:15:48
Question: I have a DataFrame (df) whose columns are userid (the user id) and day (the day). I'm interested in computing, for every user, the average time interval between the days on which he/she was active. For instance, for a given user the DataFrame may look something like this:
    userid  day
    1       2016-09-18
    1       2016-09-20
    1       2016-09-25
If the DataFrame is a pandas DataFrame, I could compute the quantity I'm interested in like this:
    import numpy as np
    np.mean(np.diff(df[df.userid==1].day))
However, this is quite

How to preserve date format when creating an Excel file?

和自甴很熟 submitted on 2021-02-05 11:14:27
Question: I have an .xlsx file that I import into Python to create a pandas DataFrame. One of the columns in the .xlsx file is formatted as a date, mm-dd-yyyy, and is imported that way. I then delete some unneeded columns from that DataFrame and export it using the xlsxwriter engine to create another Excel file:
    writer = pd.ExcelWriter('Sample_Master_Data_edited.xlsx', engine='xlsxwriter', date_format='mm/dd/yyyy')
When I do that, the date column's format changes and time automatically
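The excerpt is cut off, so the following is only a sketch under the assumption that the problem is a time component showing up in the exported cells. pandas stores dates as datetime values, and `pd.ExcelWriter` formats those with `datetime_format` rather than `date_format`, so setting both is one common fix. The helper name `write_with_date_format` and the sample frame are hypothetical.

```python
import pandas as pd


def write_with_date_format(df: pd.DataFrame, path: str) -> None:
    """Write df to an Excel file, displaying dates and datetimes as mm/dd/yyyy."""
    # date_format applies to datetime.date cells, datetime_format to datetime cells
    # (which is what a datetime64 column produces).
    with pd.ExcelWriter(
        path,
        engine="xlsxwriter",
        date_format="mm/dd/yyyy",
        datetime_format="mm/dd/yyyy",
    ) as writer:
        df.to_excel(writer, index=False)


# Made-up sample data standing in for the imported sheet.
df = pd.DataFrame(
    {"sample_date": pd.to_datetime(["2021-02-05", "2021-02-06"]), "value": [1, 2]}
)

# Usage (requires the xlsxwriter package):
# write_with_date_format(df, "Sample_Master_Data_edited.xlsx")
```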

iterrows cannot iterate over DataFrame. Error: tuple object has no attribute “A”

荒凉一梦 submitted on 2021-02-05 11:13:50
Question: When I try to iterate over a DataFrame, the dtype somehow changes.
    dates = pd.date_range('20130101',periods=6)
    df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
    df
                       A         B         C         D
    2013-01-01 -1.328046 -0.545127 -0.033153  1.190336
    2013-01-02 -0.549147  0.447161  1.179931  0.397521
    2013-01-03 -0.106707 -0.327574 -0.933817 -1.032949
    2013-01-04 -0.519988 -1.007374 -0.794482 -1.757222
    2013-01-05 -0.739735  1.220599 -1.387994 -0.116178
    2013-01-06  0.262876 -0.679471 -0.568768 -0.277880
now
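The error in the title is consistent with how `DataFrame.iterrows()` works: it yields `(index, row)` tuples, so attribute access on the un-unpacked item fails. The sketch below reuses the frame from the excerpt (with fresh random values) and shows unpacking, plus `itertuples()` as an alternative; the variable names `total` and `a_values` are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Each item from iterrows() is a tuple: (index label, row as a Series).
first_item = next(df.iterrows())

# Unpack the tuple, then access the column on the row Series.
total = 0.0
for idx, row in df.iterrows():
    total += row["A"]  # row.A works as well

# itertuples() yields namedtuples, so attribute access works directly.
a_values = [row.A for row in df.itertuples()]
```

For plain column arithmetic, a vectorized expression such as `df["A"].sum()` avoids row iteration altogether.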

Pandas Multiindex count on levels

谁都会走 submitted on 2021-02-05 10:59:47
Question: The data:
    index = [('A', 'aa', 'aaa'),
             ('A', 'aa', 'aab'),
             ('B', 'bb', 'bbb'),
             ('B', 'bb', 'bbc'),
             ('C', 'cc', 'ccc')
            ]
    values = [0.07, 0.04, 0.04, 0.06, 0.07]
    s = pd.Series(data=values, index=pd.MultiIndex.from_tuples(index))
    s
    A  aa  aaa    0.07
           aab    0.04
    B  bb  bbb    0.04
           bbc    0.06
    C  cc  ccc    0.07
Getting the mean over the first two levels is easy:
    s.mean(level=[0,1])
Result:
    A  aa    0.055
    B  bb    0.050
    C  cc    0.070
But getting a count over the first two levels does not work the same way:
    #s.count(level=[0,1]) # does not work
I
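A minimal sketch of one way to get both aggregates with the same call shape: grouping by index levels. This is a suggested approach rather than the asker's code; it also sidesteps the `level=` argument of `Series.mean`, which newer pandas versions no longer accept.

```python
import pandas as pd

index = [
    ("A", "aa", "aaa"),
    ("A", "aa", "aab"),
    ("B", "bb", "bbb"),
    ("B", "bb", "bbc"),
    ("C", "cc", "ccc"),
]
values = [0.07, 0.04, 0.04, 0.06, 0.07]
s = pd.Series(data=values, index=pd.MultiIndex.from_tuples(index))

# Group by the first two index levels, then aggregate.
grouped = s.groupby(level=[0, 1])
means = grouped.mean()    # mean per (level 0, level 1) pair
counts = grouped.count()  # number of non-null values per pair
```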

pandas resample to get monthly average with time series data

帅比萌擦擦* submitted on 2021-02-05 10:54:06
Question: I'm using the time series dataset from Tableau (https://community.tableau.com/thread/194200), containing daily furniture sales, and I want to resample it to get average monthly sales. I tried using resample in pandas to get the monthly mean. Furniture was sold on four days in January, and there were no sales on the remaining days of the month.
    Order Date  Sales
    ...
    2014/1/6    2573.82
    2014/1/7    76.728
    2014/1/16   127.104
    2014/1/20   38.6
    ...
    y_furniture = furniture['Sales'].resample('MS').mean()
I want the result to
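The excerpt stops before stating the desired result, so this sketch only contrasts two readings of "monthly average" on the four January rows shown. It assumes the elided rows belong to other months; `per_calendar_day` is an illustrative name. `resample('MS').mean()` averages over the days that have a row, whereas dividing the monthly sum by the number of days in the month treats days without sales as zero.

```python
import pandas as pd

# Only the four January rows visible in the excerpt.
furniture = pd.DataFrame(
    {
        "Order Date": ["2014/1/6", "2014/1/7", "2014/1/16", "2014/1/20"],
        "Sales": [2573.82, 76.728, 127.104, 38.6],
    }
)
furniture["Order Date"] = pd.to_datetime(furniture["Order Date"], format="%Y/%m/%d")
furniture = furniture.set_index("Order Date")

# Mean over days that have a sale (what the excerpt's code computes).
y_furniture = furniture["Sales"].resample("MS").mean()

# Mean over every calendar day of the month, counting no-sale days as zero.
monthly_sum = furniture["Sales"].resample("MS").sum()
per_calendar_day = monthly_sum / monthly_sum.index.days_in_month.to_numpy()
```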

Finding the Index with maximum number of rows

蓝咒 submitted on 2021-02-05 09:44:38
Question: My task: For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names. The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate. Question 5: Which state has the most counties in it
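A rough sketch of how such a count could be done. The column names `SUMLEV`, `STNAME`, and `CTYNAME`, and the convention that `SUMLEV == 50` marks county-level rows, are assumptions about census.csv that the excerpt does not confirm; the rows below are a toy stand-in, not the real dataset.

```python
import pandas as pd

# Toy stand-in for census.csv; column names and SUMLEV codes are assumptions.
census_df = pd.DataFrame(
    {
        "SUMLEV": [40, 50, 50, 40, 50],
        "STNAME": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
        "CTYNAME": [
            "Alabama",
            "Autauga County",
            "Baldwin County",
            "Alaska",
            "Anchorage Municipality",
        ],
    }
)

# Keep county-level rows only, so state summary rows are not counted.
counties = census_df[census_df["SUMLEV"] == 50]

# Number of county rows per state, then the state label with the largest count.
counties_per_state = counties.groupby("STNAME").size()
state_with_most = counties_per_state.idxmax()
```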

Finding the Index with maximum number of rows

淺唱寂寞╮ submitted on 2021-02-05 09:44:06
Question: My task: For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names. The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate. Question 5: Which state has the most counties in it

X and Y label being cut in matplotlib plots

我与影子孤独终老i submitted on 2021-02-05 09:40:44
Question: I have this code:
    import pandas as pd
    from pandas import datetime
    from pandas import DataFrame as df
    import matplotlib
    from pandas_datareader import data as web
    import matplotlib.pyplot as plt
    import datetime
    start = datetime.date(2016,1,1)
    end = datetime.date.today()
    stock = 'fb'
    fig = plt.figure(dpi=1400)
    data = web.DataReader(stock, 'yahoo', start, end)
    fig, ax = plt.subplots(dpi=720)
    data['vol_pct'] = data['Volume'].pct_change()
    data.plot(y='vol_pct', ax = plt.gca(), title = 'this is the
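A minimal sketch of two common remedies for clipped axis labels, using synthetic data in place of the `DataReader` download (the volume values, labels, and title are made up). `fig.tight_layout()` adjusts the margins so labels fit inside the figure, and `bbox_inches="tight"` on `savefig` crops the saved image around everything that was drawn.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded stock data.
idx = pd.date_range("2016-01-01", periods=30, freq="D")
data = pd.DataFrame({"Volume": np.arange(1, 31) * 1000.0}, index=idx)
data["vol_pct"] = data["Volume"].pct_change()

# Create a single figure and plot on its axes explicitly.
fig, ax = plt.subplots(figsize=(6, 4), dpi=100)
data.plot(y="vol_pct", ax=ax, title="Daily volume change")
ax.set_xlabel("Date")
ax.set_ylabel("Volume change (fraction)")

# Recompute margins so tick and axis labels fit inside the figure.
fig.tight_layout()

# Save with a bounding box that includes all drawn artists.
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
plt.close(fig)
```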