How to fix pyspark NLTK Error with OSError: [WinError 123]?

Question

I get an unexpected error when converting an RDD to a DataFrame:

import nltk
from nltk import pos_tag
my_rdd_of_lists = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x))
my_df = spark.createDataFrame(my_rdd_of_lists)

This error always appears when I call an nltk function on the RDD. When I ran the same line with any numpy method instead, it did not fail.

Error message:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 323, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

And

OSError: [WinError 123] Nazwa pliku, nazwa katalogu lub składnia etykiety woluminu jest niepoprawna: 'C:\\C:\\Users\\Olga\\Desktop\\Spark\\spark-2.4.5-bin-hadoop2.7\\jars\\spark-core_2.11-2.4.5.jar'

(The Polish text reads: "The filename, directory name, or volume label syntax is incorrect.")
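The path in the message starts with the drive prefix twice ('C:\\C:\\'), which Windows rejects as invalid syntax. As a minimal sketch, and only as a diagnostic aid, a check like the following could be run over sys.path to see whether such a doubled prefix is already present there. The helper name has_doubled_drive is hypothetical, and the snippet does not explain where the doubling comes from.

```python
import re
import sys

# Matches a Windows drive prefix that appears twice in a row, e.g. "C:\C:\Users".
_DOUBLED_DRIVE = re.compile(r"^[A-Za-z]:[\\/]+[A-Za-z]:[\\/]")


def has_doubled_drive(path):
    """Return True if the path starts with two drive prefixes (hypothetical helper)."""
    return bool(_DOUBLED_DRIVE.match(path))


# Report any suspicious entries on the interpreter's search path.
for entry in sys.path:
    if has_doubled_drive(entry):
        print("suspicious path:", entry)
```

If nothing is printed, the doubled prefix is probably being built at run time rather than stored in sys.path.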

This is the part I don't know how to resolve. I thought it was a problem with environment variables, but everything there seems to be fine:

SPARK_HOME: C:\Users\Olga\Desktop\Spark\spark-2.4.5-bin-hadoop2.7
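To confirm what the Python process actually sees, rather than what is set in the system dialog, the variables can be read from os.environ. This is a small sketch. Besides SPARK_HOME from the question, the other names checked (HADOOP_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON) are an assumption about which variables matter in a typical Windows setup, and read_variables is a hypothetical helper.

```python
import os

# SPARK_HOME comes from the question; the other names are assumed to be relevant.
NAMES = ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")


def read_variables(names, environ=None):
    """Map each name to its value in the environment, or None if it is unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


# Print the values as the running interpreter sees them.
for name, value in read_variables(NAMES).items():
    print(f"{name} = {value!r}")
```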

I've also printed my sys.path:

import sys
for i in sys.path:
    print(i)

And got:

C:\Users\Olga\Desktop\Spark\spark-2.4.5-bin-hadoop2.7\python
C:\Users\Olga\AppData\Local\Temp\spark-22c0eb38-fcc0-4f1f-b8dd-af83e15d342c\userFiles-3195dcc7-0fc6-469f-9afc-7752510f2471
C:\Users\Olga\Desktop\Spark\spark-2.4.5-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip
C:\Users\Olga
C:\Users\Olga\Anaconda3\python37.zip
C:\Users\Olga\Anaconda3\DLLs
C:\Users\Olga\Anaconda3\lib
C:\Users\Olga\Anaconda3

C:\Users\Olga\Anaconda3\lib\site-packages
C:\Users\Olga\Anaconda3\lib\site-packages\win32
C:\Users\Olga\Anaconda3\lib\site-packages\win32\lib
C:\Users\Olga\Anaconda3\lib\site-packages\Pythonwin
C:\Users\Olga\Anaconda3\lib\site-packages\IPython\extensions
C:\Users\Olga\.ipython

Everything looks fine here as well. Please help, I don't know what to do. Earlier parts of the code ran without any errors. Should I install nltk in some other way to use it with Spark?

Answer 1:

Hi Milva, to solve the OSError in your code you can just import the os module, so that the code is given all the permissions it needs to run your program, like this:

import os

Hope this answer helps you.

Answer 2:

It seems that it was a problem with the installed packages.

I uninstalled nltk, pandas and numpy with pip, and then did the same with conda.

After that I listed my packages and found one with a weird name, "-umpy", which seemed to be a bug.

I could not uninstall it, neither from the command prompt nor with Anaconda Navigator. So I just found it among the files on my computer and removed it. Then I installed nltk once again.

After that it started working correctly and the error did not appear again.
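An entry such as "-umpy" looks like what pip can leave behind when an uninstall or upgrade is interrupted: a partially renamed package directory in site-packages. Below is a minimal sketch for spotting such leftovers. The helper name find_leftovers is hypothetical, and the assumption that leftovers start with "-" or "~" is based on how pip names these temporary directories, not on anything stated in the answer.

```python
import os
import sysconfig


def find_leftovers(site_packages_dir):
    """Return entries whose names start with '-' or '~' (assumed uninstall remnants)."""
    if not os.path.isdir(site_packages_dir):
        return []
    return [
        name
        for name in sorted(os.listdir(site_packages_dir))
        if name.startswith(("-", "~"))
    ]


if __name__ == "__main__":
    # "purelib" is the site-packages directory of the running interpreter.
    directory = sysconfig.get_paths()["purelib"]
    for name in find_leftovers(directory):
        print(os.path.join(directory, name))
```

Anything it prints can be inspected by hand before deleting it, which is what the manual clean-up described above amounts to.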

Source: https://stackoverflow.com/questions/61059445/how-to-fix-pyspark-nltk-error-with-oserror-winerror-123

Tags

python

pyspark

conda