apache-spark

spark-nlp 'JavaPackage' object is not callable

浪子不回头ぞ submitted on 2021-02-10 12:53:09
Question: I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code:

    import sparknlp
    from pyspark.sql import SparkSession
    from sparknlp.pretrained import PretrainedPipeline

    # create or get Spark Session
    # spark = sparknlp.start()
    spark = SparkSession.builder \
        .appName("ner")\
        .master("local[4]")\
        .config("spark.driver.memory","8G")\
        .config("spark.driver.maxResultSize", "2G") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5")\
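
The 'JavaPackage' object is not callable error usually means the spark-nlp jar never made it onto the JVM behind the session, for example when the jar version does not match the installed Python package. A minimal sketch of the simplest workaround, assuming the sparknlp PyPI package is installed and the machine can download the matching jar:

    # Sketch only: sparknlp.start() builds a SparkSession with the matching
    # spark-nlp jar on the classpath, avoiding a manual spark.jars.packages pin.
    import sparknlp

    spark = sparknlp.start()
    print(sparknlp.version(), spark.version)  # sanity-check that both sides loaded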

PySpark: An error occurred while calling o51.showString. No module named XXX

你。 submitted on 2021-02-10 11:49:34
Question: My pyspark version is 2.2.0. I ran into a strange problem, which I have simplified as follows. The file structure:

    |root
    |-- cast_to_float.py
    |-- tests
        |-- test.py

In cast_to_float.py, my code:

    from pyspark.sql.types import FloatType
    from pyspark.sql.functions import udf

    def cast_to_float(y, column_name):
        return y.withColumn(column_name, y[column_name].cast(FloatType()))

    def cast_to_float_1(y, column_name):
        to_float = udf(cast2float1, FloatType())
        return y.withColumn(column_name, to_float
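
A "No module named XXX" error surfacing from o51.showString means the failure happens on the executors, which do not automatically have the driver's local modules on their PYTHONPATH. A minimal sketch of one common remedy, assuming the missing module is cast_to_float.py itself, is to ship the file with the SparkContext (or pass it via spark-submit --py-files):

    # Sketch only: distribute the module so UDFs can import it on the executors.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-module-fix").getOrCreate()
    spark.sparkContext.addPyFile("cast_to_float.py")  # path is illustrative

    from cast_to_float import cast_to_float  # import after the file is registered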

pyspark. zip arrays in a dataframe

笑着哭i submitted on 2021-02-10 09:32:27
Question: I have the following PySpark DataFrame:

    +------+----------------+
    |    id|            data|
    +------+----------------+
    |     1|    [10, 11, 12]|
    |     2|    [20, 21, 22]|
    |     3|    [30, 31, 32]|
    +------+----------------+

At the end, I want to have the following DataFrame:

    +--------+----------------------------------+
    |      id|                              data|
    +--------+----------------------------------+
    | [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
    +--------+----------------------------------+

In order to do this, first I extract the data arrays as
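
The target DataFrame is essentially a transpose: each inner array holds one position from every row. A minimal sketch, under the assumption that every data array has a fixed length of 3 and the data is small enough that collect_list ordering is not a concern:

    # Sketch only: collect the ids into one array and each array position into
    # its own array, then wrap the three collected arrays in an outer array.
    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, [10, 11, 12]), (2, [20, 21, 22]), (3, [30, 31, 32])], ["id", "data"]
    )

    result = df.orderBy("id").agg(
        F.collect_list("id").alias("id"),
        F.array(*[F.collect_list(F.col("data")[i]) for i in range(3)]).alias("data"),
    )
    result.show(truncate=False)

Note that collect_list does not strictly guarantee row order after a shuffle, so for larger data a posexplode- or window-based transpose is safer.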

Ignore Spark Cluster Own Jars

强颜欢笑 submitted on 2021-02-10 09:22:14
Question: I would like to use my own Spark jars in my application. More concretely, I have an mllib jar that is not released yet and contains a bug fix for BisectingKMeans, so my idea is to use it in my Spark cluster (locally it works perfectly). I've tried many things: extraClassPath, userClassPathFirst, the jars option... many options that do not work. My last idea is to use the sbt Shade rule to rename all org.apache.spark.* packages to shadespark.*, but when I deploy it, it is still using the
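
For reference, the standard knobs here are the userClassPathFirst settings the asker already mentions trying; a minimal sketch of how they are normally wired up, with an illustrative jar path, looks like this:

    # Sketch only: the executor-side flag can be set on the session builder, but
    # the driver-side flag usually has to be passed to spark-submit (or set in
    # spark-defaults.conf) because the driver JVM is already running by the time
    # this code executes.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("patched-mllib")
        .config("spark.jars", "/path/to/patched-mllib_2.11.jar")
        .config("spark.executor.userClassPathFirst", "true")
        .getOrCreate()
    )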

How to parse and transform json string from spark data frame rows in pyspark

为君一笑 submitted on 2021-02-10 07:57:07
Question: How do I parse and transform a JSON string from Spark dataframe rows in pyspark? I'm looking for help with how to parse the JSON string into a JSON struct (output 1), and how to transform the JSON string into columns a, b and id (output 2). Background: I get JSON strings via an API with a large number of rows (jstr1, jstr2, ...), which are saved to a spark df. I can read the schema for each row separately, but that is not a solution, since it is very slow given the large number of rows. Each jstr has the same schema; the columns/keys a
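
Since every jstr shares one schema, it only needs to be inferred once and can then be applied to all rows with from_json. A minimal sketch, where the keys id, a and b come from the question but the sample values are made up:

    # Sketch only: infer the schema from a single sample row, then parse every row.
    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [('{"id": 1, "a": 10, "b": "x"}',), ('{"id": 2, "a": 20, "b": "y"}',)],
        ["jstr"],
    )

    sample = df.select("jstr").first()[0]
    schema = spark.read.json(spark.sparkContext.parallelize([sample])).schema

    parsed = df.withColumn("json", F.from_json("jstr", schema))  # output 1: struct column
    flat = parsed.select("json.id", "json.a", "json.b")          # output 2: flat columns
    flat.show()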

How to open a file which is stored in HDFS in pySpark using with open

本秂侑毒 submitted on 2021-02-10 06:37:27
Question: How do I open a file which is stored in HDFS? Here the input file is from HDFS. If I give the file as below, I won't be able to open it; it shows "file not found".

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    sc = SparkContext(conf=conf)

    def getMovieName():
        movieNames = {}
        with open("/user/sachinkerala6174/inData/movieStat") as f:
            for line in f:
                fields = line.split("|")
                mID = fields[0]
                mName = fields[1]
                movieNames[int(fields[0])] = fields[1]
        return movieNames

    nameDict
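
Python's built-in open() only looks at the driver's local filesystem, so it cannot see an HDFS path. A minimal sketch of one common fix, reusing the path from the question and keeping the same id-to-name dictionary shape:

    # Sketch only: let Spark read the HDFS file and build the lookup dict on the driver.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf())

    def getMovieNames(path="hdfs:///user/sachinkerala6174/inData/movieStat"):
        movieNames = {}
        for line in sc.textFile(path).collect():
            fields = line.split("|")
            movieNames[int(fields[0])] = fields[1]
        return movieNames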

how to deploy war file in spark-submit command (spark)

我们两清 submitted on 2021-02-10 06:36:17
Question: I am using spark-submit --class main.Main --master local[2] /user/sampledata/parser-0.0.1-SNAPSHOT.jar to run Java Spark code. Is it possible to run this code using a war file instead of a jar, since I am looking to deploy it on Tomcat? I tried with a war file but it gives a ClassNotFoundException.

Source: https://stackoverflow.com/questions/40734240/how-to-deploy-war-file-in-spark-submit-command-spark

Unsupported Array error when reading JDBC source in (Py)Spark?

烈酒焚心 submitted on 2021-02-10 06:27:50
Question: I am trying to convert a postgreSQL DB to a Dataframe. Following is my code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("Connect to DB") \
        .getOrCreate()

    jdbcUrl = "jdbc:postgresql://XXXXXX"

    connectionProperties = {
        "user": " ",
        "password": " ",
        "driver": "org.postgresql.Driver"
    }

    query = "(SELECT table_name FROM information_schema.tables) XXX"
    df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
    table_name_list = df.select("table_name")
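
The "Unsupported Array" error from the JDBC reader usually means one of the columns being pulled has a PostgreSQL array type that Spark's JDBC dialect cannot map. A minimal sketch of a common workaround, with an invented table and column name, is to cast the array column to text inside the pushdown query (or select only the supported columns):

    # Sketch only: cast the problematic array column to text before Spark sees it.
    query = "(SELECT id, tags::text AS tags FROM some_table) AS src"
    df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)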

column object not callable spark

依然范特西╮ submitted on 2021-02-10 06:22:29
Question: I tried to install Spark and run the commands given in the tutorial (https://spark.apache.org/docs/latest/quick-start.html) but get the following error:

    P-MBP:spark-2.0.2-bin-hadoop2.4 prem$ ./bin/pyspark
    Python 2.7.13 (default, Apr 4 2017, 08:44:49)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to
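
The preview cuts off before the failing command, but "column object is not callable" generally means a Column was used where a method or function was expected, which often happens when the quick-start for a newer Spark release is run against an older install such as 2.0.2. A minimal, version-agnostic sketch of the quick-start line-filtering step that avoids Column methods entirely (the README.md path is illustrative):

    # Sketch only: RDD-style filter works the same across 2.x releases.
    lines = sc.textFile("README.md")
    numSparkLines = lines.filter(lambda line: "Spark" in line).count()
    print(numSparkLines)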