Exceptions when reading tutorial CSV file in the Cloudera VM


Question


I'm trying to do a Spark tutorial that comes with the Cloudera Virtual Machine, but even though I'm using the correct line-ending encoding, I cannot execute the scripts because I get tons of errors. The tutorial is part of the Coursera Introduction to Big Data Analytics course. The assignment can be found here.

So here's what I did. Install the IPython shell (if not yet done):

sudo easy_install ipython==1.2.1

Open/start the shell (either with spark-csv 1.2.0 or 1.4.0):

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.2.0

Set the line endings to Windows style, because the file uses Windows-style line endings and the course says to do so. If you don't do this, you'll get other errors.

sc._jsc.hadoopConfiguration().set('textinputformat.record.delimiter','\r\n')

Trying to load the CSV file:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',header = 'true',inferSchema = 'true',path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

But I get a very long list of errors, starting like this:

Py4JJavaError: An error occurred while calling o23.load.: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)

The full error message can be seen here. And this is the /etc/hive/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
  <!-- that are implied by Hadoop setup variables.                                                -->
  <!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
  <!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
  <!-- resource).                                                                                 -->

  <!-- Hive Execution Parameters -->

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>cloudera</value>
  </property>

  <property>
    <name>hive.hwi.war.file</name>
    <value>/usr/lib/hive/lib/hive-hwi-0.8.1-cdh4.0.0.jar</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>

  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>

Any help or ideas on how to solve this? I guess it's a pretty common error, but I couldn't find any solution yet.

One more thing: is there a way to dump such long error messages into a separate log-file?


Answer 1:


Summary of the discussion: Executing the following command solved the issue:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

see https://www.coursera.org/learn/bigdata-analytics/supplement/tyH3p/setup-pyspark-for-dataframes for more info.
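If you want to verify that Spark now picks up the copied Hive configuration, a quick sanity check (not part of the original answer, using the sqlCtx HiveContext that the pyspark shell already provides) is to query the metastore:

sqlCtx.sql("SHOW TABLES").show()

If the copy worked, this lists the Hive tables instead of throwing the SessionHiveMetaStoreClient error from the question.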




Answer 2:


It seems that there are two problems. First, the Hive metastore was offline on some occasions, and second, the schema could not be inferred. Therefore I created a schema manually and added it as an argument when loading the CSV file. Anyway, I would love to understand whether this somehow works with inferSchema='true'.

Here's my version with the manually defined schema. First, make sure the Hive metastore is started:

sudo service hive-metastore restart

Then, have a look at the first part of the CSV file to understand its structure. I used this command:

head /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv

Now, open the Python shell. See the original posting for how to do that. Then define the schema:

from pyspark.sql.types import *
schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("cool", IntegerType(), True),
    StructField("date", StringType(), True),
    StructField("funny", IntegerType(), True),
    StructField("id", StringType(), True),
    StructField("stars", IntegerType(), True),
    StructField("text", StringType(), True),
    StructField("type", StringType(), True),
    StructField("useful", IntegerType(), True),
    StructField("user_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("full_address", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("neighborhood", StringType(), True),
    StructField("open", StringType(), True),
    StructField("review_count", IntegerType(), True),
    StructField("state", StringType(), True)])

Then load the CSV file by specifying the schema. Note that there is no need to set the Windows line endings:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      schema=schema,
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

Then test the result with any method executed on the dataset. I tried getting the count, which worked perfectly.

yelp_df.count()
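Beyond the count, a couple of standard DataFrame calls (not part of the original answer) are useful to confirm that the manually defined schema was applied as intended:

yelp_df.printSchema()
yelp_df.select('business_id', 'stars').show(5)

printSchema() should list the column names and types exactly as defined above, and show(5) prints the first rows of the selected columns.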

Thanks to the help of @yaron, we could figure out how to load the CSV with inferSchema. First, you must set up the Hive metastore correctly:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

Then, start the Python shell and DO NOT change the line endings to the Windows style. Keep in mind that this delimiter change is persistent for the whole session, so if you changed it to Windows style before, you need to reset it to '\n' (a minimal reset sketch is shown after the load call below). Then load the CSV file with inferSchema set to 'true':

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      inferSchema='true',
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
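For completeness, here is a minimal sketch of resetting the record delimiter back to Unix line endings, using the same Hadoop configuration property as in the original posting; run it before the load above if you previously switched the delimiter to '\r\n':

sc._jsc.hadoopConfiguration().set('textinputformat.record.delimiter', '\n')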


Source: https://stackoverflow.com/questions/36966550/exceptions-when-reading-tutorial-csv-file-in-the-cloudera-vm
