data-cleaning

Filling data using .fillna() on data pulled from Quandl

て烟熏妆下的殇ゞ submitted on 2021-02-08 03:49:22
Question: I've pulled some stock data from Quandl for both Crude Oil (WTI) prices and the Caterpillar (CAT) price. When I concatenate the two dataframes I'm left with some NaNs. My ultimate goal is to run pearsonr() to assess the correlation (along with p-values), but I can't get pearsonr() to work because of all the NaNs, so I'm trying to clean them up. When I use the .fillna() method it doesn't seem to be working. I've even tried .interpolate() as well as .dropna(). None of them appear
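
A minimal sketch of one way this is typically resolved, assuming two Quandl price series named wti and cat with datetime indexes (both names are hypothetical stand-ins). A common pitfall is that .fillna() and .ffill() return a new object rather than modifying the frame in place:

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical stand-ins for the two Quandl series; real code would use quandl.get
    idx = pd.date_range('2020-01-01', periods=10, freq='D')
    wti = pd.Series(np.random.randn(10).cumsum() + 50, index=idx, name='WTI')
    cat = pd.Series(np.random.randn(8).cumsum() + 140, index=idx[2:], name='CAT')

    # Concatenating on mismatched date indexes produces NaNs
    df = pd.concat([wti, cat], axis=1)

    # ffill/fillna return a NEW frame -- reassign the result (or pass inplace=True),
    # otherwise the call appears to do nothing
    df = df.ffill().dropna()  # dropna removes leading rows with nothing to fill from

    r, p = pearsonr(df['WTI'], df['CAT'])
    print(f'r = {r:.3f}, p = {p:.3g}')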

Converting from bytes to French text in Python

牧云@^-^@ submitted on 2021-02-02 02:18:33
Question: I am cleaning the monolingual Europarl corpus for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is a .gz file, which I downloaded using wget. I want to extract the text and see what it looks like in order to further process the corpus. Using the following code to extract the text from the gzip file, I obtained data whose class is bytes:

    with gzip.open(file_path, 'rb') as f_in:
        print('type(f_in)=', type(f_in))
        text = f_in.read()
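
A minimal sketch of the decoding step, assuming the corpus is UTF-8 encoded (Europarl v7 releases generally are; 'latin-1' is the usual fallback if accented characters come out wrong). Opening the gzip file in text mode decodes to str on the fly:

    import gzip

    file_path = 'europarl-v7.fr.gz'  # hypothetical local path

    # 'rt' mode plus an encoding yields str lines instead of bytes
    with gzip.open(file_path, 'rt', encoding='utf-8') as f_in:
        for i, line in enumerate(f_in):
            print(line.rstrip())
            if i >= 4:  # peek at the first five lines only
                break

    # Bytes already read into memory can be decoded directly instead:
    # text = raw_bytes.decode('utf-8')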

Cleaning data within a PySpark dataframe

只谈情不闲聊 submitted on 2021-01-29 14:26:52
Question: I have the following file, data.json, which I am trying to clean using PySpark:

    {"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
    {"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
    {"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
    {"positionmessage":{"callsign": , "name": , "mmsi": 200,"timestamplast"
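
A minimal sketch of one approach, assuming a local SparkSession. In Spark's default permissive mode, malformed lines like the last one parse to nulls, so the incomplete records can be dropped after flattening the nested struct:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('clean-positions').getOrCreate()

    # Each line of data.json is a standalone JSON object (JSON Lines)
    df = spark.read.json('data.json')

    # Flatten the nested positionmessage struct into top-level columns
    flat = df.select(
        F.col('positionmessage.callsign').alias('callsign'),
        F.col('positionmessage.name').alias('name'),
        F.col('positionmessage.mmsi').alias('mmsi'),
        F.col('positionmessage.timestamplast').alias('timestamplast'),
    )

    # Drop rows where any required field is null (the malformed record comes
    # back all-null because it could not be parsed)
    clean = flat.dropna(how='any')
    clean.show(truncate=False)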

Compare each pair of dates in two columns in Python efficiently

早过忘川 submitted on 2021-01-29 02:12:00
Question: I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that each start date is before its end date (i.e. start_date < end_date). I have over 14,000 observations to run through. The data look like this:

           Start       End
    0  2008-10-01  2008-10-31
    1  2006-07-01  2006-12-31
    2  2000-05-01  2002-12-31
    3  1971-08-01  1973-12-31
    4  1969-01-01  1969-12-31

I have added a column to write the result to, even though I just want to highlight
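
A minimal sketch of a vectorized check, assuming the columns are named Start and End as shown. Parsing to datetime once and comparing whole columns avoids any row-by-row loop, so 14,000 rows are no problem:

    import pandas as pd

    # Hypothetical sample mirroring the question's data
    df = pd.DataFrame({
        'Start': ['2008-10-01', '2006-07-01', '2000-05-01'],
        'End':   ['2008-10-31', '2006-12-31', '2002-12-31'],
    })

    # Parse once; errors='coerce' turns unparseable entries into NaT
    df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
    df['End'] = pd.to_datetime(df['End'], errors='coerce')

    # Vectorized comparison across all rows at once
    df['valid'] = df['Start'] < df['End']

    # Rows that violate the start-before-end rule
    print(df[~df['valid']])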

Replace NaN values with a median?

若如初见. submitted on 2021-01-28 03:30:23
Question: I am trying to use Pandas to replace all NaN values in a table with the median across a particular range. I am working with a larger dataset, but for example:

    np.random.seed(0)
    rng = pd.date_range('2020-09-24', periods=20, freq='0.2H')
    df = pd.DataFrame({'Date': rng,
                       'Val': np.random.randn(len(rng)),
                       'Dist': np.random.randn(len(rng))})
    df.loc[df.Dist <= -0.6, 'Dist'] = np.nan  # .loc avoids chained-assignment warnings
    df.loc[df.Val <= -0.5, 'Val'] = np.nan

What I want to do is replace the NaN values for Val and Dist with the median value for each
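
A minimal sketch, repeating the setup above so the snippet runs on its own. The simplest reading fills each column's NaNs with that column's overall median; the commented variant assumes "a particular range" means an hourly time bucket:

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    rng = pd.date_range('2020-09-24', periods=20, freq='0.2H')
    df = pd.DataFrame({'Date': rng,
                       'Val': np.random.randn(len(rng)),
                       'Dist': np.random.randn(len(rng))})
    df.loc[df.Dist <= -0.6, 'Dist'] = np.nan
    df.loc[df.Val <= -0.5, 'Val'] = np.nan

    # Fill each column's NaNs with that column's median (NaNs are skipped
    # when the median itself is computed)
    df[['Val', 'Dist']] = df[['Val', 'Dist']].fillna(df[['Val', 'Dist']].median())

    # Per-time-window variant (assumption: "range" = hourly bucket):
    # hour = df['Date'].dt.floor('H')
    # for col in ['Val', 'Dist']:
    #     df[col] = df[col].fillna(df.groupby(hour)[col].transform('median'))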

Dealing with sparse categories in Pandas - replace everything not in top categories with “Other”

你离开我真会死。 submitted on 2021-01-27 14:11:36
Question: I often come across the following problem when cleaning data: there are a few common categories (say, the top 10 movie genres) and lots and lots of others that are sparse. Usual practice here is to combine the sparse genres into "Other", for example. This is easily done when there are not many sparse categories:

    # Join bungalows, as they are sparse classes, into one
    df.property_type.replace(['Terraced bungalow', 'Detached bungalow',
                              'Semi-detached bungalow'], 'Bungalow', inplace=True)

but
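
A minimal sketch of the general top-N case, using a hypothetical genre column: count the values once, keep the N most frequent, and map everything else to "Other":

    import pandas as pd

    # Hypothetical data
    df = pd.DataFrame({'genre': ['Drama', 'Comedy', 'Drama', 'Noir',
                                 'Comedy', 'Western', 'Drama']})

    top_n = 2
    top = df['genre'].value_counts().nlargest(top_n).index

    # Keep values in the top N; replace everything else with "Other"
    df['genre'] = df['genre'].where(df['genre'].isin(top), 'Other')
    print(df['genre'].value_counts())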
