Issue with rpy2 handling NA/missing value in dataframe from R to Python

爷,独闯天下 submitted on 2019-12-22 18:31:51

Question


I've encountered a problem when using the rpy2 package to convert a data frame saved in R to Python.

import os
os.environ['R_HOME'] = '/Library/Frameworks/R.framework/Resources'

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

# define a trivial dataframe in R
ro.r('n = c(1,2)')
ro.r("b = c(NA,'def')")
ro.r("temp_df = data.frame(n,b)")

# the dataframe in R shows missing value in one cell as NA
temp_rdf = ro.r('temp_df')
print(temp_rdf)

  n    b
1 1 <NA>
2 2  def

# yet the converted Python dataframe replaces the missing value with a string
temp_pydf = pandas2ri.ri2py(temp_rdf)
print(temp_pydf)

     n    b
1  1.0  def
2  2.0  def

I did some searching and found the post "Rpy2 pandas2ri.ri2py() is converting NA values to integers". It explains why this happens but doesn't provide a solution. I want null values in Python wherever the R data frame has NA. How can I do this?


Answer 1:


Updates: http://rpy.sourceforge.net/rpy2/doc-2.2/html/rinterface.html

The link above may have useful help on some settings. If you search that page for "NA " (including the trailing space) and go to the second hit, there is an entry that looks like it relates to your NA problem.

Original post: assuming the "def" shown in your output is coming in as a string, you could replace it with a string that you are confident is not a value in your data, and then use that string in lieu of the NA value that is not coming through:

This sample code illustrates the concept.

x = "def"
type(x)                      # a plain Python str
x = x.replace("def", "NA")   # swap the stand-in string for a marker of your choice
x                            # now holds "NA"

The remaining problem is that your result has two rows that both say 'def': one where the value really came from the data, and another where NA was converted to 'def'. To tell them apart:

  1. Convert 'def' to something else in R.
  2. Bring your data into Python.
  3. Now any 'def' in the result means NA.
  4. Use it as such, or convert it to something you can live with.

Is this a problem you encounter often?

  1. If so, create a test function to check your data for 'def'.

  2. If it is found, replace it with something crazy that you know the data will not contain, like: my_crazy_replacementValue

  3. After the conversion, replace "def" with your desired stand-in for NA.

  4. Finally, replace my_crazy_replacementValue with "def".
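
The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `protect_real_values` and `restore_missing` are made up for this example, step 2 would really be done on the R side before the conversion, and the list stands in for one column of the converted data.

```python
# Hypothetical helper names; the placeholder string is the one suggested above.
SENTINEL = "my_crazy_replacementValue"


def protect_real_values(values, na_artifact="def", sentinel=SENTINEL):
    """Steps 1-2: check for genuine 'def' entries and hide them behind the sentinel.

    In practice this would happen in R before the data is converted.
    """
    if sentinel in values:
        raise ValueError("sentinel already occurs in the data; pick another one")
    return [sentinel if v == na_artifact else v for v in values]


def restore_missing(values, na_artifact="def", sentinel=SENTINEL):
    """Steps 3-4: treat leftover 'def' as missing, then put the real 'def' back."""
    restored = []
    for v in values:
        if v == na_artifact:
            restored.append(None)         # this 'def' was produced from an NA
        elif v == sentinel:
            restored.append(na_artifact)  # this was a genuine 'def' in the data
        else:
            restored.append(v)
    return restored


# A column as it might look after conversion: the first 'def' stands for NA,
# the sentinel marks a genuine 'def' that was protected beforehand.
converted_column = ["def", SENTINEL, "abc"]
print(restore_missing(converted_column))  # [None, 'def', 'abc']
```

The order matters: the leftover 'def' values have to be turned into None before the sentinel is turned back into 'def', otherwise the two cases collapse again.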

In Python, the most common value for NA is, I think, None. Unfortunately, you cannot replace a value with None using:

string.replace()

It seems reasonable that there should be a better answer: a "Pythonic" way of converting a specified value in a data frame to None. I have to review pandas data frames when I get a chance, and then I may log back in and edit this paragraph (or maybe someone else will beat me to it). I hope the above helps you in the interim.
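
For what it's worth, one candidate for that "Pythonic" route is `DataFrame.replace` in pandas. The sketch below is not tested against rpy2 itself: the frame is built by hand to mimic the converted data, and the stand-in string `NA_PLACEHOLDER` is a hypothetical name chosen for this example. Note that pandas normally marks missing values with NaN rather than None.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in string that marks the cells which were NA in R.
PLACEHOLDER = "NA_PLACEHOLDER"

# Hand-built frame mimicking the converted data, with the stand-in in place of NA.
temp_pydf = pd.DataFrame({"n": [1.0, 2.0], "b": [PLACEHOLDER, "def"]})

# Replace every occurrence of the stand-in with a real missing value.
cleaned = temp_pydf.replace(PLACEHOLDER, np.nan)

print(cleaned)
print(cleaned["b"].isna())  # True for the first row, False for the second
```

Unlike `str.replace()`, this works on whole cells rather than substrings, so a genuine value that merely contains the stand-in text is left alone.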



Source: https://stackoverflow.com/questions/42231400/issue-with-rpy2-handling-na-missing-value-in-dataframe-from-r-to-python
