apache-spark

Issue with union on an empty DataFrame

Submitted by 拈花ヽ惹草 on 2021-01-29 20:00:28
Question: I wanted to append a DataFrame to another, empty DataFrame in a loop and finally write the result to a location. My code:

val myMap = Map(1001 -> "rollNo='12'", 1002 -> "rollNo='13'")
val myHiveTableData = spark.table(<table_name>)
val allOtherIngestedData = spark.createDataFrame(sc.emptyRDD[Row], rawDataHiveDf.schema)
myMap.keys.foreach { i =>
  val filteredDataDf = myHiveTableData.where(myMap(i))
  val othersDf = myHiveTableData.except(filteredDataDf)
  allOtherIngestedData.union(othersDf)
  filteredDataDf
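The behaviour in the snippet above follows from DataFrames being immutable: union returns a new DataFrame rather than modifying allOtherIngestedData in place, so the result of each union call is discarded unless it is reassigned. Below is a minimal PySpark sketch of the accumulate-by-reassignment pattern; the table name and filter map are placeholders, not the asker's real objects.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders mirroring the question's myMap and Hive table.
my_map = {1001: "rollNo = '12'", 1002: "rollNo = '13'"}
my_hive_table_data = spark.table("some_db.some_table")  # hypothetical table name

# union() returns a NEW DataFrame and never modifies the frame it is called on,
# so the running result has to be reassigned on every iteration.
all_other_ingested = None
for cond in my_map.values():
    filtered_df = my_hive_table_data.where(cond)
    others_df = my_hive_table_data.subtract(filtered_df)  # PySpark analogue of Scala's except()
    all_other_ingested = (others_df if all_other_ingested is None
                          else all_other_ingested.union(others_df))
```

Once the loop finishes, the accumulated frame can be written out in a single step; functools.reduce(DataFrame.union, list_of_frames) is an equivalent way to collapse a list of frames without the None check.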

How to extract JSON data from Zeppelin SQL

Submitted by 妖精的绣舞 on 2021-01-29 19:26:19
Question: I am querying the test_tbl table on Zeppelin. The table structure looks like this:

%sql desc stg.test_tbl

col_name | data_type | comment
id       | string    |
title    | string    |
tags     | string    |

The tags column contains JSON data such as:

{"name":[{"family": null, "first": "nelson"}, {"pos_code":{"house":"tlv", "id":"A12YR"}}]}

I want to see the JSON data as columns, so my query is:

select *, tag.*
from stg.test_tbl as t
lateral view explode(t.tags.name) name as name
lateral view explode
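Since tags is declared as string, t.tags.name cannot be addressed as a struct field directly; the column has to be parsed first. A minimal PySpark sketch of that idea is below, assuming a simplified schema that only covers the family/first part of the example JSON (the nested pos_code object would need extra fields):

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# One hypothetical row; the real data lives in stg.test_tbl, where tags is a string column.
df = spark.createDataFrame(
    [("1", "a title", '{"name":[{"family":null,"first":"nelson"}]}')],
    ["id", "title", "tags"],
)

# tags is a plain string, so it must be parsed into a struct before exploding.
tag_schema = T.StructType([
    T.StructField("name", T.ArrayType(T.StructType([
        T.StructField("family", T.StringType()),
        T.StructField("first", T.StringType()),
    ])))
])

parsed = df.withColumn("tags_parsed", F.from_json("tags", tag_schema))
(parsed
 .select("id", "title", F.explode("tags_parsed.name").alias("tag"))
 .select("id", "title", "tag.*")
 .show())
```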

PySpark and time series data: how to smartly avoid overlapping dates?

Submitted by 走远了吗. on 2021-01-29 18:40:26
Question: I have the following sample Spark dataframe:

import pandas as pd
import pyspark
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

raw_df = pd.DataFrame([
    (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
    (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
    (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
    (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
    (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18
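One common way to merge overlapping intervals per key is a window over each id ordered by start time: carry the running maximum end time of all earlier rows, start a new group whenever a row begins after that maximum, and then aggregate each group. The sketch below assumes hypothetical column names id, start, and end:

```python
import datetime as dt
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows with assumed column names (id, start, end).
df = spark.createDataFrame(
    [(484, dt.datetime(2019, 8, 5, 18, 30), dt.datetime(2019, 8, 9, 18, 40)),
     (484, dt.datetime(2019, 8, 4, 18, 30), dt.datetime(2019, 8, 6, 18, 40)),
     (484, dt.datetime(2019, 8, 2, 18, 30), dt.datetime(2019, 8, 3, 18, 40))],
    ["id", "start", "end"],
)

w = Window.partitionBy("id").orderBy("start")

merged = (
    df
    # Largest end time seen in all earlier rows of the same id.
    .withColumn("prev_max_end",
                F.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
    # A new group starts when the current interval does not overlap any earlier one.
    .withColumn("new_group",
                F.when(F.col("start") > F.col("prev_max_end"), 1).otherwise(0))
    .withColumn("grp", F.sum("new_group").over(w))
    .groupBy("id", "grp")
    .agg(F.min("start").alias("start"), F.max("end").alias("end"))
)
merged.show()
```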

Perform a user-defined function on a column of a large PySpark dataframe based on some columns of another PySpark dataframe on Databricks

Submitted by 三世轮回 on 2021-01-29 18:10:15
Question: My question is related to my previous one at How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks. I have worked out part of it and am now stuck on another problem. I have a small PySpark dataframe like:

df1:
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|topic| termIndices| termWeights| terms|
+-----+---------------------------
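The question is cut off here, but for the general pattern it describes, applying a function over a large frame using values from a very small one, a common approach is to collect the small frame to the driver, broadcast it, and look it up inside a UDF applied to the large frame. The sketch below uses hypothetical column names (topic, terms, text) rather than the asker's real schema:

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical small lookup frame and large frame; the real columns come from df1.
small_df = spark.createDataFrame(
    [(0, ["apple", "banana"]), (1, ["spark", "hive"])], ["topic", "terms"])
large_df = spark.createDataFrame(
    [(0, "doc a"), (1, "doc b"), (0, "doc c")], ["topic", "text"])

# Collect the small frame once on the driver and broadcast it to the executors.
lookup = {row["topic"]: row["terms"] for row in small_df.collect()}
lookup_bc = spark.sparkContext.broadcast(lookup)

@F.udf(returnType=T.ArrayType(T.StringType()))
def terms_for_topic(topic):
    # Runs on the executors; reads the broadcast copy instead of shuffling a join.
    return lookup_bc.value.get(topic, [])

large_df.withColumn("topic_terms", terms_for_topic("topic")).show()
```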

StandardScaler returns NaN

Submitted by 时间秒杀一切 on 2021-01-29 17:50:06
Question: Environment: spark-1.6.0 with scala-2.10.4. Usage:

// row of df : DataFrame = (String, String, double, Vector) as (id1, id2, label, feature)
val df = sqlContext.read.parquet("data/Labeled.parquet")
val SC = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(df)
val scaled = SC.transform(df)
  .drop("feature").withColumnRenamed("scaled", "feature")

The code follows the example at http://spark.apache.org/docs/latest/ml-features.html#standardscaler. NaN exists in
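With withStd(true) every feature is divided by its column standard deviation, and any NaN already present in the input vectors survives that transformation, so checking the input is a reasonable first step. The sketch below is written against the PySpark 2.x API rather than the question's Spark 1.6 Scala setup, and the rows are made up:

```python
import math
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows in the question's (id1, id2, label, feature) layout.
df = spark.createDataFrame(
    [("a", "x", 1.0, Vectors.dense([1.0, 2.0])),
     ("b", "y", 0.0, Vectors.dense([float("nan"), 3.0]))],
    ["id1", "id2", "label", "feature"],
)

# Any NaN already present in a feature vector stays NaN after scaling,
# so flagging such rows up front narrows down where the NaN comes from.
@F.udf(returnType=T.BooleanType())
def has_nan(v):
    return any(math.isnan(x) for x in v.toArray())

df.filter(has_nan("feature")).show()
```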

Combine pivoted and aggregated columns in a PySpark DataFrame

Submitted by 有些话、适合烂在心里 on 2021-01-29 17:48:26
Question: My question is related to this one. I have a PySpark DataFrame named df, as shown below.

date       | recipe | percent | volume
----------------------------------------
2019-01-01 | A      | 0.03    | 53
2019-01-01 | A      | 0.02    | 55
2019-01-01 | B      | 0.05    | 60
2019-01-02 | A      | 0.11    | 75
2019-01-02 | B      | 0.06    | 64
2019-01-02 | B      | 0.08    | 66

If I pivot it on recipe and aggregate both percent and volume, I get column names that concatenate recipe and the aggregated variable. I can use alias to clean things up
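When a pivot is combined with several aggregations, Spark names each output column as the pivot value plus the alias of the aggregation, so the alias given inside agg controls the suffix. A minimal sketch using the question's sample data (column naming as in Spark 2.x and later):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Data copied from the question's example table.
df = spark.createDataFrame(
    [("2019-01-01", "A", 0.03, 53), ("2019-01-01", "A", 0.02, 55),
     ("2019-01-01", "B", 0.05, 60), ("2019-01-02", "A", 0.11, 75),
     ("2019-01-02", "B", 0.06, 64), ("2019-01-02", "B", 0.08, 66)],
    ["date", "recipe", "percent", "volume"],
)

# With two aggregations, the output columns are "<recipe>_<alias>",
# e.g. A_avg_percent and A_total_volume.
pivoted = (df.groupBy("date")
             .pivot("recipe")
             .agg(F.avg("percent").alias("avg_percent"),
                  F.sum("volume").alias("total_volume")))
pivoted.show()
```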

NoSuchFieldException: parentOffset - Hive on Spark

Submitted by 安稳与你 on 2021-01-29 17:43:09
Question: I'm trying to run Hive on Spark locally and have followed all the configuration steps on the official Hive site. On the Hive console, I first created a simple table and tried to insert a few values into it:

set hive.cli.print.current.db=true;
create temporary table sketch_input (id int, category char(1));
insert into table sketch_input values
  (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
  (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
  (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11,
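The question is truncated before the error appears in context, but one low-effort triage step is to run equivalent statements through Spark SQL directly; if they succeed there, the problem is more likely in the Hive-on-Spark wiring (mismatched Hive and Spark builds are a frequent source of NoSuchFieldException-style errors) than in the SQL itself. A small, self-contained sketch with made-up values:

```python
from pyspark.sql import SparkSession

# Hive support enabled to stay close to the question's setup; the data is made up.
spark = (SparkSession.builder
         .appName("sketch-input-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TEMPORARY VIEW sketch_input AS
    SELECT * FROM VALUES (1, 'a'), (2, 'a'), (6, 'b'), (7, 'b') AS t(id, category)
""")
spark.sql("SELECT category, count(*) AS cnt FROM sketch_input GROUP BY category").show()
```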

Getting an error when connecting to a local SQL Server database from Databricks via a JDBC connection

Submitted by 房东的猫 on 2021-01-29 17:40:34
Question: Basically, I'm trying to connect to a SQL Server database on my local machine from Databricks using a JDBC connection. I'm following the procedure described in the documentation on the Databricks website, and I used the following code from it:

jdbcHostname = "localhost"
jdbcDatabase = "TestDB"
jdbcPort = "3306"
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
  "jdbcUsername" : "user1",
  "jdbcPassword" :
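The snippet above uses MySQL's URL scheme and port even though the target is SQL Server, and the connection-property keys the JDBC driver reads are user and password, not jdbcUsername/jdbcPassword. Below is a sketch of what a SQL Server read typically looks like; host, table, and credentials are placeholders, and note that localhost inside a Databricks notebook refers to the cluster driver, so the database host must be reachable from the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: the host has to be reachable from the Databricks cluster.
jdbc_hostname = "<host-reachable-from-cluster>"
jdbc_port = 1433                      # SQL Server's default port, not MySQL's 3306
jdbc_database = "TestDB"

jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(
    jdbc_hostname, jdbc_port, jdbc_database)

connection_properties = {
    "user": "user1",                  # the JDBC keys are "user" and "password"
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

df = spark.read.jdbc(url=jdbc_url, table="dbo.SomeTable",
                     properties=connection_properties)
```

The same properties dictionary can be reused for spark.read.jdbc and df.write.jdbc.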

Running spark-submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35): how to connect to the remote cluster from Airflow

Submitted by 不羁的心 on 2021-01-29 16:35:03
Question: I have been trying to run spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) and Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster. When I try the SSH hook it says the following (though I still have many doubts about using SSHOperator and BashOperator):

Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko

Answer 1: You can try using Livy. In the following
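Since the answer is cut off after mentioning Livy: Livy exposes a REST endpoint (port 8998 by default) that accepts batch submissions, so the Airflow host only needs HTTP access to the Spark cluster rather than Spark binaries or config files. A minimal sketch with a placeholder host and application path:

```python
import json
import requests

# Placeholders: Livy's default port is 8998; the application file must sit at a
# path (e.g. HDFS) that the Spark cluster itself can read.
livy_url = "http://<spark-cluster-host>:8998/batches"

payload = {
    "file": "hdfs:///apps/my_job.py",   # hypothetical application path
    "args": ["2021-01-29"],
    "conf": {"spark.executor.memory": "2g"},
}

resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.json())    # returns the batch id and its state
```

Inside Airflow this call can live in a PythonOperator, and recent Airflow releases also ship a Livy provider with a LivyOperator that wraps the same API.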

FPGrowth/Association Rules using Sparklyr

Submitted by 点点圈 on 2021-01-29 16:32:08
Question: I am trying to build an association-rules algorithm using sparklyr and have been following this blog, which is really well explained. However, there is a section just after the FPGrowth algorithm is fitted where the author extracts the rules from the returned "FPGrowthModel object", and I am not able to reproduce this to extract my rules. The piece of code where I am struggling is:

rules = FPGmodel %>% invoke("associationRules")

Could someone please explain where FPGmodel comes
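invoke("associationRules") simply calls the associationRules method of Spark ML's FPGrowthModel on the JVM side, and FPGmodel is whatever object the blog's FPGrowth fit returned. For reference, a minimal sketch of the same model in PySpark (made-up transactions), showing where associationRules lives:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# Hypothetical transactions; "items" is the column FPGrowth reads by default.
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)            # the fitted model is what invoke("associationRules") targets

model.freqItemsets.show()     # frequent itemsets
model.associationRules.show() # antecedent, consequent, confidence (and lift in newer Spark)
```

In sparklyr terms, FPGmodel is the fitted model object returned by the blog's FPGrowth step; piping it into invoke("associationRules") returns the same rules table as model.associationRules above.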