bigdata

ValueError: as_list() is not defined on an unknown TensorShape

Submitted by 纵然是瞬间 on 2021-01-29 05:20:20
Question: I am working through the example from this web page, and here is what I got after these steps:

    jobs_train, jobs_test = jobs_df.randomSplit([0.6, 0.4])
    zuckerberg_train, zuckerberg_test = zuckerberg_df.randomSplit([0.6, 0.4])
    train_df = jobs_train.unionAll(zuckerberg_train)
    test_df = jobs_test.unionAll(zuckerberg_test)
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml import Pipeline
    from sparkdl import DeepImageFeaturizer
    featurizer = DeepImageFeaturizer(inputCol=
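The preview cuts off at the DeepImageFeaturizer call. For context, a sketch of how the published sparkdl transfer-learning example typically completes the pipeline; the column names "image", "features", "label" and the "InceptionV3" model name come from that tutorial, not from the question itself:

    # Featurize images with a pre-trained network, then train a classifier on top.
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    lr = LogisticRegression(maxIter=20, regParam=0.05,
                            elasticNetParam=0.3, labelCol="label")
    pipeline = Pipeline(stages=[featurizer, lr])
    model = pipeline.fit(train_df)

The ValueError itself ("as_list() is not defined on an unknown TensorShape") is frequently reported as a TensorFlow/Keras version mismatch with sparkdl, so pinning the versions the tutorial was written against is worth checking.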

Why is a boolean field not working in Hive?

Submitted by 不想你离开。 on 2021-01-29 02:21:19
Question: I have a column in my Hive table whose datatype is boolean. When I tried to import data from CSV, it was stored as NULL. This is my sample table:

    CREATE TABLE IF NOT EXISTS Engineanalysis(
      EngineModel String,
      EnginePartNo String,
      Location String,
      Position String,
      InspectionReq boolean)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

My sample data:

    AB01,AS01-IT01,AIRFRAME,,0
    AB02,AS01-IT02,AIRFRAME,,1
    AB03,AS01-IT03,AIRFRAME,,1
    AB04,AS01-IT04,AIRFRAME,,1
    AB05,AS01-IT05
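Hive's default text SerDe only parses the literals true/false as boolean, so the 0/1 values in this CSV come through as NULL. A minimal sketch of the usual workaround, staging the flag as INT and converting on insert (the staging table name here is hypothetical):

    -- Stage the raw CSV with the flag as INT instead of BOOLEAN.
    CREATE TABLE IF NOT EXISTS Engineanalysis_stg(
      EngineModel STRING, EnginePartNo STRING, Location STRING,
      Position STRING, InspectionReq INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

    -- Casting INT to BOOLEAN maps non-zero to true, 0 to false.
    INSERT INTO TABLE Engineanalysis
    SELECT EngineModel, EnginePartNo, Location, Position,
           CAST(InspectionReq AS BOOLEAN)
    FROM Engineanalysis_stg;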

Get describe() columns from a group by

Submitted by 一笑奈何 on 2021-01-28 00:11:51
Question: I'm interested in getting describe() data from a grouped Pandas dataset. The data refer to vacations taken by different people; in addition, the number of places visited in each city is stored.

       City      Name     Places
    0  Seattle   Alice    10
    1  Seattle   Bob      11
    2  Portland  Mallory   7
    3  Seattle   Mallory   5
    4  Memphis   Bob       6
    5  Portland  Mallory   9
    6  Memphis   Alice     1
    7  Memphis   Alice    20
    8  Seattle   Alice    14
    9  Seattle   Bob      10

I want to get the data from DataFrame.describe(), and the new dataframe should look like this: Name City
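The desired layout is truncated above, but a minimal sketch of per-group describe() (assuming the standard wide output is acceptable):

    import pandas as pd

    df = pd.DataFrame({
        'City': ['Seattle', 'Seattle', 'Portland', 'Seattle', 'Memphis',
                 'Portland', 'Memphis', 'Memphis', 'Seattle', 'Seattle'],
        'Name': ['Alice', 'Bob', 'Mallory', 'Mallory', 'Bob',
                 'Mallory', 'Alice', 'Alice', 'Alice', 'Bob'],
        'Places': [10, 11, 7, 5, 6, 9, 1, 20, 14, 10],
    })

    # One row per (Name, City) pair, with count/mean/std/min/25%/50%/75%/max.
    stats = df.groupby(['Name', 'City'])['Places'].describe()
    print(stats)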

Not able to delete data from HDFS, even after leaving safe mode?

Submitted by 回眸只為那壹抹淺笑 on 2021-01-27 21:00:29
Question: I used this command to leave safe mode:

    hdfs dfsadmin -safemode leave

But even then, when I use this command to delete files:

    hdfs dfs -rm -r /user/amandeep/share/

it shows the following error:

    15/06/18 23:35:05 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    rm: Cannot delete /user/amandeep/share/lib/lib_20150615024237. Name node is in safe mode.

Source: https://stackoverflow.com/questions/30922639/not-able-to-delete-the
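The error shows the NameNode still reports safe mode when the delete runs, so the leave command may not have taken effect, for example because it was run as a user without HDFS superuser rights or against a different NameNode. A sketch of the usual checks, using standard HDFS commands (running as the hdfs superuser is an assumption about the cluster setup):

    hdfs dfsadmin -safemode get                        # confirm what the NameNode reports
    sudo -u hdfs hdfs dfsadmin -safemode leave         # leave safe mode as the HDFS superuser
    hdfs dfs -rm -r -skipTrash /user/amandeep/share/   # trash is disabled here anyway (interval = 0)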

Spark throws an error when reading a Hive table

Submitted by 谁说胖子不能爱 on 2021-01-27 13:56:17
Question: I am trying to do select * from db.abc in Hive; this Hive table was loaded using Spark. It does not work and shows an error:

    Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

When I use the following properties I was able to query Hive:

    set hive.mapred.mode=nonstrict;
    set hive.optimize.ppd=true;
    set hive.optimize.index.filter=true;
    set hive.tez.bucket.pruning=true;
    set hive.explain.user=false;
    set hive.fetch.task.conversion=none;

now when
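This error is commonly reported when Spark writes plain (non-ACID) files into a table that Hive treats as managed/transactional, as Hive 3 does by default; Hive's ACID reader then cannot derive a bucket id from the file names. A sketch of one frequently cited workaround, recreating the table as external and non-transactional (the column list and location below are hypothetical placeholders):

    -- Point a non-transactional external table at the Spark-written files.
    CREATE EXTERNAL TABLE db.abc_ext (
      col1 STRING,      -- replace with db.abc's real schema
      col2 INT
    )
    STORED AS ORC
    LOCATION '/apps/hive/warehouse/db.db/abc'   -- hypothetical path
    TBLPROPERTIES ('transactional'='false');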

CSS3 transform: translate maximum value?

Submitted by 陌路散爱 on 2021-01-27 13:17:46
Question: I created an experiment that infinite-scrolls the first billion digits of Pi, to find/create a scrolling solution that performs well with a massive dataset. I started testing with iScroll and ran into an issue: the demo works great (in Chrome) until around 33 million pixels, i.e. transform: translate(0px, 3.35545e+07px). You can see the issue by running the following commands in the dev tools console, then scrolling:

    scroller.scrollTo(0, -33553700);
    scroller._execEvent('scroll');

Any CSS experts know
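The ~33,553,700 px figure is suspiciously close to 2^25 (33,554,432), which suggests an internal fixed-point limit in the compositor rather than anything in the CSS spec. The usual way around it is to never let the translate offset grow that large: virtualize the list and recycle a small window of rows. A sketch of that idea (all names below are hypothetical, not iScroll API):

    // Keep the applied translate within one "window" and re-render the
    // slice of digits whenever the scroll position crosses a window boundary.
    const content = document.getElementById('scroller-content');
    const ROW_HEIGHT = 20;        // px per rendered row (assumption)
    const WINDOW = 1 << 20;       // ~1M px, far below the observed limit

    function onScroll(scrollTop) {
      const offset = scrollTop % WINDOW;
      const firstRow = Math.floor((scrollTop - offset) / ROW_HEIGHT);
      renderRowsFrom(firstRow);   // hypothetical hook that swaps row content
      content.style.transform = `translate(0px, ${-offset}px)`;
    }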

Pyspark simple re-partition and toPandas() fails to finish on just 600,000+ rows

Submitted by 痴心易碎 on 2021-01-27 04:08:01
Question: I have JSON data that I am reading into a data frame with several fields, repartitioning it based on two columns, and converting to Pandas. This job keeps failing on EMR on just 600,000 rows of data with some obscure errors. I have also increased the memory settings of the Spark driver and still don't see any resolution. Here is my pyspark code:

    enhDataDf = (
        sqlContext
        .read.json(sys.argv[1])
    )
    enhDataDf = (
        enhDataDf
        .repartition('column1', 'column2')
        .toPandas()
    )
    enhDataDf = sqlContext
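Note that toPandas() collects the entire dataset onto the driver, so the repartition() just before it does nothing to reduce driver memory pressure. A sketch of the usual mitigations (the Arrow flag is a standard Spark 2.3+ setting, and this assumes a SparkSession named spark is available; the column names are taken from the question):

    # Arrow-based conversion greatly reduces driver memory churn in toPandas().
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # Collect only the columns Pandas actually needs; toPandas() still
    # materializes every remaining row on the driver.
    enhDataDf = (
        sqlContext
        .read.json(sys.argv[1])
        .select('column1', 'column2')
        .toPandas()
    )
    # If this still fails, raise --driver-memory on spark-submit, or keep the
    # processing in Spark rather than converting to Pandas at all.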

Reuse tasks in Airflow

Submitted by 允我心安 on 2021-01-24 10:58:06
Question: I'm trying out Airflow to orchestrate some of my data pipelines. I have multiple tasks for each ingestion pipeline, and the tasks are repeated across multiple ingestion pipelines. How can I reuse a task across DAGs in Airflow?

Answer 1: Just like an object is an instance of a class, an Airflow task is an instance of an Operator (strictly speaking, BaseOperator). So write a "re-usable" (aka generic) operator and use it hundreds of times across your pipeline(s) simply by passing different params
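A minimal sketch of the generic-operator idea the answer describes (the operator name, params, and logic are hypothetical; imports assume Airflow 1.10/2.x):

    from airflow.models import BaseOperator

    class IngestOperator(BaseOperator):
        """Generic ingestion step, reusable across any number of DAGs."""

        def __init__(self, source, target, **kwargs):
            super().__init__(**kwargs)
            self.source = source
            self.target = target

        def execute(self, context):
            # Real ingestion logic goes here; each DAG varies only the params.
            self.log.info("Ingesting %s into %s", self.source, self.target)

    # In any DAG file:
    # ingest = IngestOperator(task_id="ingest_orders",
    #                         source="s3://bucket/orders",   # hypothetical
    #                         target="warehouse.orders",
    #                         dag=dag)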