data-cleaning

Replace multiple values using a reference table

百般思念 submitted on 2021-02-19 07:34:09
Question: I'm cleaning a database; one of the fields is “country”, but the country names in my database do not match the output I need. I thought of using the str_replace function, but I have over 50 countries that need to be fixed, so that is not the most efficient way. I already prepared a CSV file with the original country input and the output I need for reference. Here is what I have so far:

    library(stringr)
    library(dplyr)
    library(tidyr)
    library(readxl)
    database1 <- read_excel("database.xlsx")
    database1
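The question uses R, but the core idea, joining the data against the prepared reference table and keeping the mapped name wherever a match exists, looks much the same in any dataframe library. As a rough illustration only (not the asker's code), here is a minimal Python/pandas sketch; the reference file name and its column names (country, country_clean) are assumptions:

    import pandas as pd

    # Main data plus the reference table that maps original country names to the desired output.
    db = pd.read_excel("database.xlsx")             # assumed to contain a 'country' column
    ref = pd.read_csv("country_reference.csv")      # assumed columns: 'country', 'country_clean'

    # Left-join on the original name, then keep the cleaned value where a match was found.
    db = db.merge(ref, on="country", how="left")
    db["country"] = db["country_clean"].fillna(db["country"])
    db = db.drop(columns="country_clean")

Countries that are absent from the reference table simply keep their original spelling, which makes it easy to spot what still needs mapping.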

Extract phone number from a noisy string

落爺英雄遲暮 submitted on 2021-02-14 18:35:00
Question: I have a column in a table that contains random data along with phone numbers in different formats. The column may contain: a name, a phone number, an email address, HTML tags, or an address (with numbers). Examples:

    1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546
    2) John Smith
    3) xxx@yyy.com
    4) John Smith 8 999 888 77 77

How can I extract just the phone numbers?
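One common approach for this kind of noisy text is to strip out the HTML, split the string into digit-bearing chunks, and keep only the pieces that normalize to a plausible phone length. A rough Python sketch of that idea; the assumption that valid numbers have exactly 11 digits (like the +7 / 8-prefixed examples above) is mine, not the asker's:

    import re

    def extract_phones(text: str) -> list[str]:
        # Drop HTML tags such as <br> so they don't glue adjacent values together.
        text = re.sub(r"<[^>]+>", " ", text)
        # Split into chunks on anything that cannot be part of a phone number.
        chunks = re.split(r"[^\d\s+\-()]+", text)
        phones = []
        for chunk in chunks:
            digits = re.sub(r"\D", "", chunk)
            if len(digits) == 11:
                # Handles "+79005346546" and spaced forms like "8 999 888 77 77".
                phones.append(digits)
            else:
                # Otherwise test each space-separated token on its own,
                # e.g. "5283041 79005346546" keeps only the 11-digit token.
                for token in chunk.split():
                    token_digits = re.sub(r"\D", "", token)
                    if len(token_digits) == 11:
                        phones.append(token_digits)
        return phones

On the four examples in the question this returns the two occurrences of 79005346546, nothing, nothing, and 89998887777 respectively.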

What's the most efficient way to convert time-series data into cross-sectional data?

本小妞迷上赌 submitted on 2021-02-11 14:30:25
Question: I have the dataset below, where date is the index:

    date        value
    2020-01-01  100
    2020-02-01  140
    2020-03-01  156
    2020-04-01  161
    2020-05-01  170
    ...

And I want to transform it into this other dataset:

    value_t0  value_t1  value_t2  value_t3  value_t4  ...
    100       NaN       NaN       NaN       NaN       ...
    140       100       NaN       NaN       NaN       ...
    156       140       100       NaN       NaN       ...
    161       156       140       100       NaN       ...
    170       161       156       140       100       ...

First I thought about using pandas.pivot_table, but that would just provide a different layout grouped
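Each column of the target table is just the value series lagged by one more step, which is exactly what pandas' shift() produces. A minimal sketch (column names value_t0 … value_t4 follow the question; the number of lags is an arbitrary choice here):

    import pandas as pd

    df = pd.DataFrame(
        {"value": [100, 140, 156, 161, 170]},
        index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                              "2020-04-01", "2020-05-01"]),
    )

    n_lags = 5  # how many value_t* columns to build
    wide = pd.concat(
        {f"value_t{k}": df["value"].shift(k) for k in range(n_lags)},
        axis=1,
    )
    print(wide)  # each column is the series shifted down by k rows, NaN-padded at the top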

Replacing values of a dataframe column using list values and the list name in R

*爱你&永不变心* submitted on 2021-02-11 07:00:09
Question: I would like to replace values inside a column with a list name, conditional on the values being among that list's values:

    df <- data.frame(Activity = c("Checking emails", "Playing games", "Reading", "Watching TV",
                                  "Watching YouTube", "Watching TV", "Relaxing", "Getting ready",
                                  "Working/ studying", "Relaxing"))
    mylist <- list(Tech_activity = c("Browsing social media", "Checking emails", "Video calling",
                                     "On my computer/ PC", "Watching YouTube", "Browsing the internet",
                                     "On my phone",
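The question is in R and its list is cut off, but the underlying operation is just an item-to-category lookup: invert the "category -> members" mapping into "member -> category" and map each activity through it. A rough Python sketch of that idea, with a made-up second category since only Tech_activity survives in the excerpt:

    import pandas as pd

    activities = pd.Series(["Checking emails", "Playing games", "Reading",
                            "Watching TV", "Watching YouTube", "Relaxing"])

    # Category -> members, mirroring the question's named list (Other_activity is invented here).
    categories = {
        "Tech_activity": ["Checking emails", "Watching YouTube", "Browsing the internet"],
        "Other_activity": ["Reading", "Relaxing", "Playing games", "Watching TV"],
    }

    # Invert to member -> category, then map; activities not found anywhere are left unchanged.
    lookup = {item: name for name, items in categories.items() for item in items}
    replaced = activities.map(lookup).fillna(activities)
    print(replaced)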

R: how to vectorize a function with multiple if/else conditions

梦想与她 submitted on 2021-02-10 20:15:08
Question: I am new to vectorizing functions in R. I have code similar to the following:

    library(truncnorm)
    library(microbenchmark)
    num_obs = 10000
    Observation = seq(1, num_obs)
    Obs_Type = sample(1:4, num_obs, replace = T)
    Upper_bound = runif(num_obs, 0, 1)
    Lower_bound = runif(num_obs, 2, 4)
    mean = runif(num_obs, 10, 15)
    df1 = data.frame(Observation, Obs_Type, Upper_bound, Lower_bound, mean)
    df1$draw_value = 0
    Trial_func = function(df1){
      for (i in 1:nrow(df1)){
        if (df1[i, "Obs_Type"] == 1){
          # If Type == 1; then a = -Inf, b =
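The code is truncated, but the usual way to avoid a row-by-row loop for this kind of draw is to build the truncation bounds as whole vectors (one value per row, chosen by Obs_Type) and make a single vectorized call to the truncated-normal sampler. A rough Python/NumPy sketch of that idea using scipy's truncnorm; the mapping from Obs_Type to bounds below is invented, since the original mapping is cut off:

    import numpy as np
    from scipy.stats import truncnorm

    rng = np.random.default_rng(0)
    n = 10_000
    obs_type = rng.integers(1, 5, size=n)     # types 1..4, like sample(1:4, ...)
    mean = rng.uniform(10, 15, size=n)

    # Pick lower/upper truncation bounds per row from the type (these mappings are made up).
    lower = np.select([obs_type == 1, obs_type == 2, obs_type == 3, obs_type == 4],
                      [-np.inf,        0.0,           2.0,           5.0])
    upper = np.select([obs_type == 1, obs_type == 2, obs_type == 3, obs_type == 4],
                      [0.0,            np.inf,        4.0,           np.inf])

    # scipy's truncnorm takes standardized bounds; one vectorized call draws every row at once.
    a = (lower - mean) / 1.0
    b = (upper - mean) / 1.0
    draws = truncnorm.rvs(a, b, loc=mean, scale=1.0, size=n, random_state=rng)

The same pattern applies in R: compute the per-row bound vectors with vectorized conditionals, then call the sampler once on the whole vectors instead of once per row.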

How to filter out positional data based on distance from a known reference trajectory?

假如想象 submitted on 2021-02-08 07:22:49
Question: I have an 87,288-point dataset that I need to filter. The filtering fields for the dataset are an X position and a Y position, given as latitude and longitude. Plotted, the data looks like this: (plot omitted). The problem is, I only need data along a certain path, which is known in advance. Something like this: (plot omitted). I already know how to filter data in a Pandas DataFrame, but given that the path is not linear, I need an effective strategy to clear out all the noisy data with a certain degree of precision (since the dataset is so
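One common strategy (not from the question itself) is to measure each point's distance to the reference trajectory and keep only the points within some tolerance; densifying the path into many vertices and querying a KD-tree keeps this fast even for tens of thousands of points. A rough sketch, assuming hypothetical column names lon/lat and a path given as an array of coordinates; note the distances here are in degrees, so you would either project to a metric CRS first or choose the threshold accordingly:

    import numpy as np
    import pandas as pd
    from scipy.spatial import cKDTree

    def filter_near_path(df: pd.DataFrame, path_xy: np.ndarray, tol: float) -> pd.DataFrame:
        """Keep rows of df whose (lon, lat) lies within `tol` of the reference path."""
        # Densify the path so distance-to-nearest-vertex approximates distance-to-path.
        dense = []
        for p, q in zip(path_xy[:-1], path_xy[1:]):
            dense.append(np.linspace(p, q, num=50))
        dense = np.vstack(dense)

        tree = cKDTree(dense)
        dist, _ = tree.query(df[["lon", "lat"]].to_numpy())
        return df[dist <= tol]

    # Example with made-up data: a piecewise-linear reference path and random points around it.
    path = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 1.5]])
    pts = pd.DataFrame({"lon": np.random.uniform(0, 2, 1000),
                        "lat": np.random.uniform(0, 2, 1000)})
    kept = filter_near_path(pts, path, tol=0.05)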

Filling missing data using .fillna() on data pulled from Quandl

徘徊边缘 submitted on 2021-02-08 03:50:12
Question: I've pulled some stock data from Quandl for both Crude Oil prices (WTI) and the Caterpillar (CAT) price. When I concatenate the two dataframes together I'm left with some NaNs. My ultimate goal is to run pearsonr() to assess the correlation (along with p-values), but I can't get pearsonr() to work because of all the NaNs, so I'm trying to clean them up. When I use the .fillna() function it doesn't seem to be working. I've even tried .interpolate() as well as .dropna(). None of them appear to work.
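A frequent cause of "nothing seems to happen" here is that fillna(), interpolate(), and dropna() return a new object by default, so the result has to be assigned back (or inplace=True used). A minimal sketch of the intended workflow; the stand-in price series and the column names WTI/CAT are placeholders, not the asker's data:

    import pandas as pd
    from scipy.stats import pearsonr

    # Stand-ins for the two Quandl pulls; the dates only partially overlap, producing NaNs.
    wti = pd.Series([50.0, 51.2, 49.8, 52.1],
                    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"]),
                    name="WTI")
    cat = pd.Series([140.0, 141.5, 139.9, 142.3],
                    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]),
                    name="CAT")

    combined = pd.concat([wti, cat], axis=1)

    # fillna()/dropna() return NEW frames: assign the result back, or nothing changes.
    combined = combined.ffill().dropna()

    r, p = pearsonr(combined["WTI"], combined["CAT"])
    print(f"Pearson r = {r:.3f}, p-value = {p:.3g}")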