Cannot find col function in pyspark

Submitted by 佐手、 on 2019-11-26 15:44:35

Question


In PySpark 1.6.2, I can import the col function with

from pyspark.sql.functions import col

but when I look it up in the GitHub source code, I find no col function in the functions.py file. How can Python import a function that doesn't exist?


Answer 1:


It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods.

If you check the source carefully, you'll find col listed among the other entries of _functions. This dictionary is then iterated over, and _create_function is used to generate a wrapper for each entry. Each generated function is assigned directly to the corresponding name in globals.

Finally __all__, which defines a list of items exported from the module, just exports all globals excluding ones contained in the blacklist.

If this mechanism is still not clear, you can create a toy example:

  • Create a Python module called foo.py with the following content:

    # Creates a function assigned to the name foo
    globals()["foo"] = lambda x: "foo {0}".format(x)
    
    # Exports all entries from globals which start with foo
    __all__ = [x for x in globals() if x.startswith("foo")]
    
  • Place it somewhere on the Python path (for example in the working directory).

  • Import foo:

    from foo import foo
    
    foo(1)
    

An undesired side effect of this metaprogramming approach is that the generated functions might not be recognized by tools that depend purely on static code analysis. This is not a critical issue and can be safely ignored during development.

Depending on the IDE, installing type annotations might resolve the problem (see for example zero323/pyspark-stubs#172).
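
To see why static tools miss such names, the following sketch compares a static and a runtime view of a module body modeled on the toy example. It uses only the standard ast module; the variable names are made up for illustration, and real linters are more sophisticated than this simple name collector.

```python
import ast

# Module body modeled on the toy foo.py (list() snapshots the keys).
source = '''
globals()["foo"] = lambda x: "foo {0}".format(x)
__all__ = [x for x in list(globals()) if x.startswith("foo")]
'''

# Static view: names bound by def/class statements or plain assignments.
tree = ast.parse(source)
static_names = set()
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        static_names.add(node.name)
    elif isinstance(node, ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                static_names.add(target.id)

# Runtime view: execute the module body and inspect its namespace.
namespace = {}
exec(compile(tree, "foo.py", "exec"), namespace)

foo_is_static = "foo" in static_names
foo_at_runtime = "foo" in namespace
```

Here foo_is_static is False because the assignment target is a subscript on globals() rather than a plain name, while foo_at_runtime is True once the code has actually run.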




Answer 2:


As of VS Code 1.26.1, this can be solved by modifying the python.linting.pylintArgs setting:

"python.linting.pylintArgs": [
        "--generated-members=pyspark.*",
        "--extension-pkg-whitelist=pyspark",
        "--ignored-modules=pyspark.sql.functions"
    ]

The issue is explained on GitHub: https://github.com/DonJayamanne/pythonVSCode/issues/1418#issuecomment-411506443




Answer 3:


In PyCharm, the col function and others are flagged as "not found".

A workaround is to import the functions module and call col through it.

For example:

from pyspark.sql import functions as F
df.select(F.col("my_column"))



Answer 4:


I ran into a similar problem trying to set up a PySpark development environment with Eclipse and PyDev. PySpark uses a dynamic namespace. To get it to work, I needed to add PySpark to the "Forced Builtins" list.



Source: https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark
