pandas

Record linking two large CSVs in Python?

Submitted by 给你一囗甜甜゛ on 2021-02-10 15:40:41
Question: I'm somewhat new to Pandas and the Python Record Linkage Toolkit, so please forgive me if the answer is obvious. I'm trying to cross-reference one large dataset, "CSV_1", against another, "CSV_2", in order to create a third CSV consisting only of matches, concatenating all columns from CSV_1 and CSV_2 regardless of overlap so the original records are preserved. For example:

CSV_1
Name    City  Date
Examp.  Bton  7/11

CSV_2
Name_of_thing     City_of_Origin  Time
THE EXAMPLE, LLC  Bton, USA       7/11/2020 00:00
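A minimal sketch of one way to do this with the Record Linkage Toolkit, assuming fuzzy matching on the name and city columns is acceptable; the file names, column names, and thresholds below are illustrative, not from the original post:

```python
import pandas as pd
import recordlinkage

# Load both files; column names follow the example above.
df1 = pd.read_csv("CSV_1.csv")   # Name, City, Date
df2 = pd.read_csv("CSV_2.csv")   # Name_of_thing, City_of_Origin, Time

# Generate candidate pairs (a full index is fine for a sketch;
# block on a shared key for genuinely large files).
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(df1, df2)

# Compare names and cities with fuzzy string similarity.
compare = recordlinkage.Compare()
compare.string("Name", "Name_of_thing", method="jarowinkler",
               threshold=0.85, label="name")
compare.string("City", "City_of_Origin", method="jarowinkler",
               threshold=0.85, label="city")
features = compare.compute(pairs, df1, df2)

# Keep pairs where both comparisons passed the threshold.
matches = features[features.sum(axis=1) == 2]

# Concatenate every column from both frames for the matched rows,
# preserving the original records side by side.
left = df1.loc[matches.index.get_level_values(0)].reset_index(drop=True)
right = df2.loc[matches.index.get_level_values(1)].reset_index(drop=True)
result = pd.concat([left, right], axis=1)
result.to_csv("matches.csv", index=False)
```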

geopandas cannot read a geojson properly

Submitted by 筅森魡賤 on 2021-02-10 15:38:41
Question: I am trying the following. After downloading http://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json :

In [2]: import geopandas
In [3]: geopandas.read_file('./gz_2010_us_050_00_20m.json')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-83a1d4a0fc1f> in <module>
----> 1 geopandas.read_file('./gz_2010_us_050_00_20m.json')
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/io/file.py
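One workaround sketch, assuming the failure comes from this particular file not being valid UTF-8 (which geopandas.read_file expects by default); the latin-1 encoding choice is an assumption, not something confirmed by the truncated traceback above:

```python
import json
import geopandas as gpd

# Load the raw GeoJSON with a permissive single-byte encoding
# (latin-1 here is an assumption about the file's actual encoding).
with open("./gz_2010_us_050_00_20m.json", encoding="latin-1") as f:
    gj = json.load(f)

# Build the GeoDataFrame directly from the parsed features,
# bypassing geopandas.read_file.
counties = gpd.GeoDataFrame.from_features(gj["features"])
print(counties.head())
```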

GridSearchCV on a working pipeline returns ValueError

Submitted by ぐ巨炮叔叔 on 2021-02-10 15:16:04
Question: I am using GridSearchCV in order to find the best parameters for my pipeline. My pipeline seems to work well, as I can apply:

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

and I get a decent result. But GridSearchCV obviously doesn't like something, and I cannot figure out what. My pipeline:

feats = FeatureUnion([('age', age), ('education_num', education_num), ('is_education_favo', is_education_favo), ('is_marital_status_favo', is_marital_status_favo), ('hours_per_week', hours
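A common cause of a ValueError in this situation is a param_grid whose keys are not prefixed with the pipeline step names: GridSearchCV expects keys of the form "<step>__<parameter>". A minimal sketch under that assumption, with illustrative step names and synthetic data that do not come from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Keys must be "<step name>__<parameter name>"; a bare "n_estimators"
# key would raise ValueError: Invalid parameter.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [3, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```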

how to read only a chunk of csv file fast?

Submitted by 柔情痞子 on 2021-02-10 15:13:05
Question: I'm using this answer on how to read only a chunk of a CSV file with pandas. The suggestion to use pd.read_csv('./input/test.csv', iterator=True, chunksize=1000) works excellently, but it returns a <class 'pandas.io.parsers.TextFileReader'>, so I'm converting it to a dataframe with pd.concat(pd.read_csv('./input/test.csv', iterator=True, chunksize=25)), but that takes as much time as reading the whole file in the first place! Any suggestions on how to read only a chunk of the file fast?

Answer 1: pd.read
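Concatenating the whole iterator still reads the entire file, which is why it is no faster. If only the first N rows are needed, a minimal sketch (the path and sizes are just placeholders) is to use nrows, or to pull a single chunk off the iterator:

```python
import pandas as pd

# Option 1: read only the first 1000 rows and stop.
head = pd.read_csv("./input/test.csv", nrows=1000)

# Option 2: keep the iterator, but pull just one chunk from it.
reader = pd.read_csv("./input/test.csv", iterator=True, chunksize=1000)
first_chunk = next(reader)          # a DataFrame with up to 1000 rows

print(head.shape, first_chunk.shape)
```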

Error in removing punctuation: 'float' object has no attribute 'translate'

Submitted by 不想你离开。 on 2021-02-10 15:08:45
Question: I am trying to remove punctuation from a column in a data frame by doing the following:

def remove_punctuation(text):
    return text.translate(table)

df['data'] = df['data'].map(lambda x: remove_punctuation(x))

But I am getting the following error: 'float' object has no attribute 'translate'. I checked the dtype of the column like this:

from pandas.api.types import is_string_dtype
is_string_dtype(df['data'])

and got the following output: True. I am not sure what's going wrong here? I have also
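One likely explanation (an assumption, since the post is truncated above): missing values in an object column are float NaN, and is_string_dtype returns True for any object-dtype column, so a single NaN is enough to trigger this error. A minimal sketch that guards against non-strings, with a hypothetical translation table and data:

```python
import string
import numpy as np
import pandas as pd

table = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    # NaN comes through as a float, so only translate real strings.
    if isinstance(text, str):
        return text.translate(table)
    return text

df = pd.DataFrame({"data": ["hello, world!", np.nan, "it's fine..."]})
df["data"] = df["data"].map(remove_punctuation)
print(df)
```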

Drop rows in pandas if they contain “???”

Submitted by 非 Y 不嫁゛ on 2021-02-10 14:55:17
Question: I'm trying to drop rows in pandas that contain "???". It works for every other value except for "???", and I do not know what the problem is. This is my code (I have tried both variants):

df = df[~df["text"].str.contains("?????", na=False)]
df = df[~df["text"].str.contains("?????")]

The error that I'm getting: re.error: nothing to repeat at position 0. It works for every other value except for "????". I have googled it and looked all over this website, but I couldn't find any solutions.

Answer 1: The parameter
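The error arises because str.contains treats its pattern as a regular expression by default, and ? is a regex quantifier with nothing before it to repeat. A minimal sketch (with hypothetical data) of the two standard fixes: disable regex matching, or escape the pattern first:

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["fine", "???", "also fine", "what???"]})

# Fix 1: treat the pattern as a literal string, not a regex.
clean = df[~df["text"].str.contains("???", regex=False, na=False)]

# Fix 2: keep regex matching but escape the metacharacters.
clean = df[~df["text"].str.contains(re.escape("???"), na=False)]

print(clean)
```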

How to insert missing dates and forward fill columns after grouping by another column in pandas dataframe

Submitted by 拜拜、爱过 on 2021-02-10 14:53:47
Question: I have data available on a monthly basis (for different securities) which I want to convert to a daily basis by adding the missing dates and forward filling the monthly data across all the days of the month (i.e. data on 12/3/2015 = data on 12/1/2015, and so on for all securities). My data looks like this:

x = pd.DataFrame({'ticker': ['a','a','a','b','b'],
                  'dt': ['12/1/2015','1/1/2016','2/1/2016','1/1/2016','2/1/2016'],
                  'score': [2.8,3.8,3.8,1.9,1.7]})

I tried creating a multi-index using dates and
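One sketch of a groupby-plus-resample approach, assuming the goal is a daily row per ticker with the last monthly score carried forward (this is an alternative to the original poster's multi-index attempt, not a reconstruction of it):

```python
import pandas as pd

x = pd.DataFrame({'ticker': ['a', 'a', 'a', 'b', 'b'],
                  'dt': ['12/1/2015', '1/1/2016', '2/1/2016',
                         '1/1/2016', '2/1/2016'],
                  'score': [2.8, 3.8, 3.8, 1.9, 1.7]})

x['dt'] = pd.to_datetime(x['dt'])

# For each ticker, upsample the monthly series to daily frequency
# and forward fill the score within that ticker only.
daily = (x.set_index('dt')
          .groupby('ticker')['score']
          .resample('D')
          .ffill()
          .reset_index())

print(daily.head())
```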
