apache-spark

Issue with union on an empty DataFrame

Submitted by 拈花ヽ惹草 on 2021-01-29 20:00:28
Question: I wanted to append a DataFrame to another, empty DataFrame in a loop and finally write the result to a location. My code:

val myMap = Map(1001 -> "rollNo='12'", 1002 -> "rollNo='13'")
val myHiveTableData = spark.table(<table_name>)
val allOtherIngestedData = spark.createDataFrame(sc.emptyRDD[Row], rawDataHiveDf.schema)
myMap.keys.foreach { i =>
  val filteredDataDf = myHiveTableData.where(myMap(i))
  val othersDf = myHiveTableData.except(filteredDataDf)
  allOtherIngestedData.union(othersDf)
  filteredDataDf
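The behaviour in the snippet above follows from DataFrames being immutable: union returns a new DataFrame rather than modifying allOtherIngestedData in place, so the result of each union call is discarded unless it is reassigned. Below is a minimal PySpark sketch of the accumulate-by-reassignment pattern; the table name and filter map are placeholders, not the asker's real objects.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders mirroring the question's myMap and Hive table.
my_map = {1001: "rollNo = '12'", 1002: "rollNo = '13'"}
my_hive_table_data = spark.table("some_db.some_table")  # hypothetical table name

# union() returns a NEW DataFrame and never modifies the frame it is called on,
# so the running result has to be reassigned on every iteration.
all_other_ingested = None
for cond in my_map.values():
    filtered_df = my_hive_table_data.where(cond)
    others_df = my_hive_table_data.subtract(filtered_df)  # PySpark analogue of Scala's except()
    all_other_ingested = (others_df if all_other_ingested is None
                          else all_other_ingested.union(others_df))
```

Once the loop finishes, the accumulated frame can be written out in a single step; functools.reduce(DataFrame.union, list_of_frames) is an equivalent way to collapse a list of frames without the None check.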

How to extract JSON data from Zeppelin SQL

Submitted by 妖精的绣舞 on 2021-01-29 19:26:19
Question: I am querying the test_tbl table on Zeppelin. The table structure looks like this:

%sql desc stg.test_tbl

col_name | data_type | comment
id       | string    |
title    | string    |
tags     | string    |

The tags column contains JSON data such as:

{"name":[{"family": null, "first": "nelson"}, {"pos_code":{"house":"tlv", "id":"A12YR"}}]}

I want to see the JSON data as columns, so my query is:

select *, tag.*
from stg.test_tbl as t
lateral view explode(t.tags.name) name as name
lateral view explode
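Since tags is declared as string, t.tags.name cannot be addressed as a struct field directly; the column has to be parsed first. A minimal PySpark sketch of that idea is below, assuming a simplified schema that only covers the family/first part of the example JSON (the nested pos_code object would need extra fields):

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# One hypothetical row; the real data lives in stg.test_tbl, where tags is a string column.
df = spark.createDataFrame(
    [("1", "a title", '{"name":[{"family":null,"first":"nelson"}]}')],
    ["id", "title", "tags"],
)

# tags is a plain string, so it must be parsed into a struct before exploding.
tag_schema = T.StructType([
    T.StructField("name", T.ArrayType(T.StructType([
        T.StructField("family", T.StringType()),
        T.StructField("first", T.StringType()),
    ])))
])

parsed = df.withColumn("tags_parsed", F.from_json("tags", tag_schema))
(parsed
 .select("id", "title", F.explode("tags_parsed.name").alias("tag"))
 .select("id", "title", "tag.*")
 .show())
```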

PySpark and time series data: how to smartly avoid overlapping dates?

Submitted by 走远了吗. on 2021-01-29 18:40:26
Question: I have the following sample Spark dataframe:

import pandas as pd
import pyspark
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

raw_df = pd.DataFrame([
    (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
    (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
    (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
    (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
    (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18
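One common way to merge overlapping intervals per key is a window over each id ordered by start time: carry the running maximum end time of all earlier rows, start a new group whenever a row begins after that maximum, and then aggregate each group. The sketch below assumes hypothetical column names id, start, and end:

```python
import datetime as dt
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows with assumed column names (id, start, end).
df = spark.createDataFrame(
    [(484, dt.datetime(2019, 8, 5, 18, 30), dt.datetime(2019, 8, 9, 18, 40)),
     (484, dt.datetime(2019, 8, 4, 18, 30), dt.datetime(2019, 8, 6, 18, 40)),
     (484, dt.datetime(2019, 8, 2, 18, 30), dt.datetime(2019, 8, 3, 18, 40))],
    ["id", "start", "end"],
)

w = Window.partitionBy("id").orderBy("start")

merged = (
    df
    # Largest end time seen in all earlier rows of the same id.
    .withColumn("prev_max_end",
                F.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
    # A new group starts when the current interval does not overlap any earlier one.
    .withColumn("new_group",
                F.when(F.col("start") > F.col("prev_max_end"), 1).otherwise(0))
    .withColumn("grp", F.sum("new_group").over(w))
    .groupBy("id", "grp")
    .agg(F.min("start").alias("start"), F.max("end").alias("end"))
)
merged.show()
```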

Perform a user-defined function on a column of a large PySpark dataframe based on some columns of another PySpark dataframe on Databricks

Submitted by 三世轮回 on 2021-01-29 18:10:15
Question: My question is related to my previous one at How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks. I have worked out part of it and am now stuck on another problem. I have a small PySpark dataframe like:

df1:
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|topic| termIndices| termWeights| terms|
+-----+---------------------------
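The question is cut off here, but for the general pattern it describes, applying a function over a large frame using values from a very small one, a common approach is to collect the small frame to the driver, broadcast it, and look it up inside a UDF applied to the large frame. The sketch below uses hypothetical column names (topic, terms, text) rather than the asker's real schema:

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical small lookup frame and large frame; the real columns come from df1.
small_df = spark.createDataFrame(
    [(0, ["apple", "banana"]), (1, ["spark", "hive"])], ["topic", "terms"])
large_df = spark.createDataFrame(
    [(0, "doc a"), (1, "doc b"), (0, "doc c")], ["topic", "text"])

# Collect the small frame once on the driver and broadcast it to the executors.
lookup = {row["topic"]: row["terms"] for row in small_df.collect()}
lookup_bc = spark.sparkContext.broadcast(lookup)

@F.udf(returnType=T.ArrayType(T.StringType()))
def terms_for_topic(topic):
    # Runs on the executors; reads the broadcast copy instead of shuffling a join.
    return lookup_bc.value.get(topic, [])

large_df.withColumn("topic_terms", terms_for_topic("topic")).show()
```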

StandardScaler returns NaN

Submitted by 时间秒杀一切 on 2021-01-29 17:50:06
Question: Environment: spark-1.6.0 with scala-2.10.4. Usage:

// row of df : DataFrame = (String, String, double, Vector) as (id1, id2, label, feature)
val df = sqlContext.read.parquet("data/Labeled.parquet")
val SC = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(df)
val scaled = SC.transform(df)
  .drop("feature").withColumnRenamed("scaled", "feature")

The code follows the example at http://spark.apache.org/docs/latest/ml-features.html#standardscaler. NaN exists in
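With withStd(true) every feature is divided by its column standard deviation, and any NaN already present in the input vectors survives that transformation, so checking the input is a reasonable first step. The sketch below is written against the PySpark 2.x API rather than the question's Spark 1.6 Scala setup, and the rows are made up:

```python
import math
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows in the question's (id1, id2, label, feature) layout.
df = spark.createDataFrame(
    [("a", "x", 1.0, Vectors.dense([1.0, 2.0])),
     ("b", "y", 0.0, Vectors.dense([float("nan"), 3.0]))],
    ["id1", "id2", "label", "feature"],
)

# Any NaN already present in a feature vector stays NaN after scaling,
# so flagging such rows up front narrows down where the NaN comes from.
@F.udf(returnType=T.BooleanType())
def has_nan(v):
    return any(math.isnan(x) for x in v.toArray())

df.filter(has_nan("feature")).show()
```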

Combine pivoted and aggregated columns in a PySpark DataFrame

Submitted by 有些话、适合烂在心里 on 2021-01-29 17:48:26
Question: My question is related to this one. I have a PySpark DataFrame named df, as shown below.

date       | recipe | percent | volume
----------------------------------------
2019-01-01 | A      | 0.03    | 53
2019-01-01 | A      | 0.02    | 55
2019-01-01 | B      | 0.05    | 60
2019-01-02 | A      | 0.11    | 75
2019-01-02 | B      | 0.06    | 64
2019-01-02 | B      | 0.08    | 66

If I pivot it on recipe and aggregate both percent and volume, I get column names that concatenate recipe and the aggregated variable. I can use alias to clean things up
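When a pivot is combined with several aggregations, Spark names each output column as the pivot value plus the alias of the aggregation, so the alias given inside agg controls the suffix. A minimal sketch using the question's sample data (column naming as in Spark 2.x and later):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Data copied from the question's example table.
df = spark.createDataFrame(
    [("2019-01-01", "A", 0.03, 53), ("2019-01-01", "A", 0.02, 55),
     ("2019-01-01", "B", 0.05, 60), ("2019-01-02", "A", 0.11, 75),
     ("2019-01-02", "B", 0.06, 64), ("2019-01-02", "B", 0.08, 66)],
    ["date", "recipe", "percent", "volume"],
)

# With two aggregations, the output columns are "<recipe>_<alias>",
# e.g. A_avg_percent and A_total_volume.
pivoted = (df.groupBy("date")
             .pivot("recipe")
             .agg(F.avg("percent").alias("avg_percent"),
                  F.sum("volume").alias("total_volume")))
pivoted.show()
```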

NoSuchFieldException: parentOffset - Hive on Spark

Submitted by 安稳与你 on 2021-01-29 17:43:09
Question: I'm trying to run Hive on Spark locally and have followed all the configuration steps on the official Hive site. On the Hive console, I first created a simple table and tried to insert a few values into it:

set hive.cli.print.current.db=true;
create temporary table sketch_input (id int, category char(1));
insert into table sketch_input values
  (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
  (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
  (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11,
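The question is truncated before the error appears in context, but one low-effort triage step is to run equivalent statements through Spark SQL directly; if they succeed there, the problem is more likely in the Hive-on-Spark wiring (mismatched Hive and Spark builds are a frequent source of NoSuchFieldException-style errors) than in the SQL itself. A small, self-contained sketch with made-up values:

```python
from pyspark.sql import SparkSession

# Hive support enabled to stay close to the question's setup; the data is made up.
spark = (SparkSession.builder
         .appName("sketch-input-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TEMPORARY VIEW sketch_input AS
    SELECT * FROM VALUES (1, 'a'), (2, 'a'), (6, 'b'), (7, 'b') AS t(id, category)
""")
spark.sql("SELECT category, count(*) AS cnt FROM sketch_input GROUP BY category").show()
```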

Getting an error when connecting to a local SQL Server database from Databricks via a JDBC connection

Submitted by 房东的猫 on 2021-01-29 17:40:34
Question: Basically, I'm trying to connect to a SQL Server database on my local machine from Databricks using a JDBC connection. I'm following the procedure described in the documentation on the Databricks website, and I used the following code from it:

jdbcHostname = "localhost"
jdbcDatabase = "TestDB"
jdbcPort = "3306"
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
  "jdbcUsername" : "user1",
  "jdbcPassword" :
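The snippet above uses MySQL's URL scheme and port even though the target is SQL Server, and the connection-property keys the JDBC driver reads are user and password, not jdbcUsername/jdbcPassword. Below is a sketch of what a SQL Server read typically looks like; host, table, and credentials are placeholders, and note that localhost inside a Databricks notebook refers to the cluster driver, so the database host must be reachable from the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: the host has to be reachable from the Databricks cluster.
jdbc_hostname = "<host-reachable-from-cluster>"
jdbc_port = 1433                      # SQL Server's default port, not MySQL's 3306
jdbc_database = "TestDB"

jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(
    jdbc_hostname, jdbc_port, jdbc_database)

connection_properties = {
    "user": "user1",                  # the JDBC keys are "user" and "password"
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

df = spark.read.jdbc(url=jdbc_url, table="dbo.SomeTable",
                     properties=connection_properties)
```

The same properties dictionary can be reused for spark.read.jdbc and df.write.jdbc.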

Running spark-submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35): how to connect to the remote cluster from Airflow

Submitted by 不羁的心 on 2021-01-29 16:35:03
Question: I have been trying to run spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) and Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster. When I try the SSH hook it says the following (though I still have many doubts about using SSHOperator and BashOperator):

Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko

Answer 1: You can try using Livy. In the following
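Since the answer is cut off after mentioning Livy: Livy exposes a REST endpoint (port 8998 by default) that accepts batch submissions, so the Airflow host only needs HTTP access to the Spark cluster rather than Spark binaries or config files. A minimal sketch with a placeholder host and application path:

```python
import json
import requests

# Placeholders: Livy's default port is 8998; the application file must sit at a
# path (e.g. HDFS) that the Spark cluster itself can read.
livy_url = "http://<spark-cluster-host>:8998/batches"

payload = {
    "file": "hdfs:///apps/my_job.py",   # hypothetical application path
    "args": ["2021-01-29"],
    "conf": {"spark.executor.memory": "2g"},
}

resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.json())    # returns the batch id and its state
```

Inside Airflow this call can live in a PythonOperator, and recent Airflow releases also ship a Livy provider with a LivyOperator that wraps the same API.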

FPGrowth/Association Rules using Sparklyr

Submitted by 点点圈 on 2021-01-29 16:32:08
Question: I am trying to build an association-rules algorithm using sparklyr and have been following this blog, which is really well explained. However, there is a section just after the FPGrowth algorithm is fitted where the author extracts the rules from the returned "FPGrowthModel object", and I am not able to reproduce this to extract my rules. The piece of code where I am struggling is:

rules = FPGmodel %>% invoke("associationRules")

Could someone please explain where FPGmodel comes
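invoke("associationRules") simply calls the associationRules method of Spark ML's FPGrowthModel on the JVM side, and FPGmodel is whatever object the blog's FPGrowth fit returned. For reference, a minimal sketch of the same model in PySpark (made-up transactions), showing where associationRules lives:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# Hypothetical transactions; "items" is the column FPGrowth reads by default.
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)            # the fitted model is what invoke("associationRules") targets

model.freqItemsets.show()     # frequent itemsets
model.associationRules.show() # antecedent, consequent, confidence (and lift in newer Spark)
```

In sparklyr terms, FPGmodel is the fitted model object returned by the blog's FPGrowth step; piping it into invoke("associationRules") returns the same rules table as model.associationRules above.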