data-cleaning

Filling data using .fillna() on data pulled from Quandl

て烟熏妆下的殇ゞ submitted on 2021-02-08 03:49:22
Question: I've pulled some stock data from Quandl for both Crude Oil (WTI) prices and the Caterpillar (CAT) price. When I concatenate the two dataframes I'm left with some NaNs. My ultimate goal is to run pearsonr() to assess the correlation (along with p-values), but I can't get pearsonr() to work because of all the NaNs, so I'm trying to clean them up. When I use the .fillna() method it doesn't seem to be working. I've even tried .interpolate() as well as .dropna(). None of them appear
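
A minimal sketch of one way this is typically resolved, assuming two Quandl price series named wti and cat with datetime indexes (both names are hypothetical stand-ins). A common pitfall is that .fillna() and .ffill() return a new object rather than modifying the frame in place:

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical stand-ins for the two Quandl series; real code would use quandl.get
    idx = pd.date_range('2020-01-01', periods=10, freq='D')
    wti = pd.Series(np.random.randn(10).cumsum() + 50, index=idx, name='WTI')
    cat = pd.Series(np.random.randn(8).cumsum() + 140, index=idx[2:], name='CAT')

    # Concatenating on mismatched date indexes produces NaNs
    df = pd.concat([wti, cat], axis=1)

    # ffill/fillna return a NEW frame -- reassign the result (or pass inplace=True),
    # otherwise the call appears to do nothing
    df = df.ffill().dropna()  # dropna removes leading rows with nothing to fill from

    r, p = pearsonr(df['WTI'], df['CAT'])
    print(f'r = {r:.3f}, p = {p:.3g}')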

Converting from bytes to French text in Python

牧云@^-^@ submitted on 2021-02-02 02:18:33
Question: I am cleaning the monolingual Europarl corpus for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is a .gz file, which I downloaded using wget. I want to extract the text and see what it looks like in order to further process the corpus. Using the following code to extract the text from the gzip file, I obtained data whose class is bytes:

    with gzip.open(file_path, 'rb') as f_in:
        print('type(f_in)=', type(f_in))
        text = f_in.read()
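
A minimal sketch of the decoding step, assuming the corpus is UTF-8 encoded (Europarl v7 releases generally are; 'latin-1' is the usual fallback if accented characters come out wrong). Opening the gzip file in text mode decodes to str on the fly:

    import gzip

    file_path = 'europarl-v7.fr.gz'  # hypothetical local path

    # 'rt' mode plus an encoding yields str lines instead of bytes
    with gzip.open(file_path, 'rt', encoding='utf-8') as f_in:
        for i, line in enumerate(f_in):
            print(line.rstrip())
            if i >= 4:  # peek at the first five lines only
                break

    # Bytes already read into memory can be decoded directly instead:
    # text = raw_bytes.decode('utf-8')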

Cleaning data within a PySpark dataframe

只谈情不闲聊 submitted on 2021-01-29 14:26:52
Question: I have the following file, data.json, which I am trying to clean using PySpark:

    {"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
    {"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
    {"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
    {"positionmessage":{"callsign": , "name": , "mmsi": 200,"timestamplast"
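
A minimal sketch of one approach, assuming a local SparkSession. In Spark's default permissive mode, malformed lines like the last one parse to nulls, so the incomplete records can be dropped after flattening the nested struct:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('clean-positions').getOrCreate()

    # Each line of data.json is a standalone JSON object (JSON Lines)
    df = spark.read.json('data.json')

    # Flatten the nested positionmessage struct into top-level columns
    flat = df.select(
        F.col('positionmessage.callsign').alias('callsign'),
        F.col('positionmessage.name').alias('name'),
        F.col('positionmessage.mmsi').alias('mmsi'),
        F.col('positionmessage.timestamplast').alias('timestamplast'),
    )

    # Drop rows where any required field is null (the malformed record comes
    # back all-null because it could not be parsed)
    clean = flat.dropna(how='any')
    clean.show(truncate=False)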

Compare each pair of dates in two columns in Python efficiently

早过忘川 submitted on 2021-01-29 02:12:00
Question: I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that each start date is before its end date (i.e. start_date < end_date). I have over 14,000 observations to run through. The data look like this:

           Start       End
    0  2008-10-01  2008-10-31
    1  2006-07-01  2006-12-31
    2  2000-05-01  2002-12-31
    3  1971-08-01  1973-12-31
    4  1969-01-01  1969-12-31

I have added a column to write the result to, even though I just want to highlight
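
A minimal sketch of a vectorized check, assuming the columns are named Start and End as shown. Parsing to datetime once and comparing whole columns avoids any row-by-row loop, so 14,000 rows are no problem:

    import pandas as pd

    # Hypothetical sample mirroring the question's data
    df = pd.DataFrame({
        'Start': ['2008-10-01', '2006-07-01', '2000-05-01'],
        'End':   ['2008-10-31', '2006-12-31', '2002-12-31'],
    })

    # Parse once; errors='coerce' turns unparseable entries into NaT
    df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
    df['End'] = pd.to_datetime(df['End'], errors='coerce')

    # Vectorized comparison across all rows at once
    df['valid'] = df['Start'] < df['End']

    # Rows that violate the start-before-end rule
    print(df[~df['valid']])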

Replace NaN values with a median?

若如初见. submitted on 2021-01-28 03:30:23
Question: I am trying to use Pandas to replace all NaN values in a table with the median across a particular range. I am working with a larger dataset, but for example:

    np.random.seed(0)
    rng = pd.date_range('2020-09-24', periods=20, freq='0.2H')
    df = pd.DataFrame({'Date': rng,
                       'Val': np.random.randn(len(rng)),
                       'Dist': np.random.randn(len(rng))})
    df.loc[df.Dist <= -0.6, 'Dist'] = np.nan  # .loc avoids chained-assignment warnings
    df.loc[df.Val <= -0.5, 'Val'] = np.nan

What I want to do is replace the NaN values for Val and Dist with the median value for each
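
A minimal sketch, repeating the setup above so the snippet runs on its own. The simplest reading fills each column's NaNs with that column's overall median; the commented variant assumes "a particular range" means an hourly time bucket:

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    rng = pd.date_range('2020-09-24', periods=20, freq='0.2H')
    df = pd.DataFrame({'Date': rng,
                       'Val': np.random.randn(len(rng)),
                       'Dist': np.random.randn(len(rng))})
    df.loc[df.Dist <= -0.6, 'Dist'] = np.nan
    df.loc[df.Val <= -0.5, 'Val'] = np.nan

    # Fill each column's NaNs with that column's median (NaNs are skipped
    # when the median itself is computed)
    df[['Val', 'Dist']] = df[['Val', 'Dist']].fillna(df[['Val', 'Dist']].median())

    # Per-time-window variant (assumption: "range" = hourly bucket):
    # hour = df['Date'].dt.floor('H')
    # for col in ['Val', 'Dist']:
    #     df[col] = df[col].fillna(df.groupby(hour)[col].transform('median'))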

Dealing with sparse categories in Pandas - replace everything not in top categories with “Other”

你离开我真会死。 submitted on 2021-01-27 14:11:36
Question: I often come across the following problem when cleaning data: there are a few common categories (say, the top 10 movie genres) and lots and lots of others that are sparse. Usual practice here is to combine the sparse genres into "Other", for example. This is easily done when there are not many sparse categories:

    # Join bungalows, as they are sparse classes, into one
    df.property_type.replace(['Terraced bungalow', 'Detached bungalow',
                              'Semi-detached bungalow'], 'Bungalow', inplace=True)

but
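
A minimal sketch of the general top-N case, using a hypothetical genre column: count the values once, keep the N most frequent, and map everything else to "Other":

    import pandas as pd

    # Hypothetical data
    df = pd.DataFrame({'genre': ['Drama', 'Comedy', 'Drama', 'Noir',
                                 'Comedy', 'Western', 'Drama']})

    top_n = 2
    top = df['genre'].value_counts().nlargest(top_n).index

    # Keep values in the top N; replace everything else with "Other"
    df['genre'] = df['genre'].where(df['genre'].isin(top), 'Other')
    print(df['genre'].value_counts())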
