Hive - Varchar vs String: is there any advantage if the storage format is the Parquet file format?

Question


I have a Hive table that will hold billions of records. It's time-series data, so the table is partitioned per minute, and each minute holds around 1 million records.

I have a few fields in my table: VIN number (17 chars), Status (2 chars), etc.

So my question is: during table creation, if I choose Varchar(X) over String, is there any storage or performance penalty?

A few limitations of varchar, from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string:

  1. If we provide more than "x" characters it will silently truncate, so keeping it as string is future-proof (see the sketch after this list).

  2. Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.

  3. There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.
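For illustration, here is a minimal sketch of the truncation behavior from point 1. It assumes a reachable HiveServer2 endpoint with the Hive JDBC driver on the classpath; the JDBC URL and table name are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VarcharTruncationDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint; adjust host/port/database for your cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS truncation_demo (status VARCHAR(2)) STORED AS PARQUET");
      // 'ERROR' has 5 characters; per the documented behavior, VARCHAR(2)
      // silently keeps only 'ER' with no error or warning.
      stmt.execute("INSERT INTO truncation_demo VALUES ('ERROR')");
      try (ResultSet rs = stmt.executeQuery("SELECT status FROM truncation_demo")) {
        while (rs.next()) {
          System.out.println(rs.getString(1)); // prints: ER
        }
      }
    }
  }
}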

What is the cost I have to pay for using string instead of varchar, in terms of storage and performance?


Answer 1:


Let's try to understand how it is implemented in the API:

org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter 

This is where the magic begins:

private DataWriter createWriter(ObjectInspector inspector, Type type) {
    // switch over the column's primitive category; other cases elided
    switch (...) {
        ...
        case STRING:
            return new StringDataWriter((StringObjectInspector) inspector);
        case VARCHAR:
            return new VarcharDataWriter((HiveVarcharObjectInspector) inspector);
        ...
    }
}

The createWriter method of the DataWritableWriter class checks the column's datatype (varchar or string) and instantiates the corresponding writer class for that type.

Now let's move on to the VarcharDataWriter class.

private class VarcharDataWriter implements DataWriter {
    private HiveVarcharObjectInspector inspector;

    public VarcharDataWriter(HiveVarcharObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
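      // unwrap the HiveVarchar, then write its text as UTF-8 bytes via Binary.fromString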
      String v = inspector.getPrimitiveJavaObject(value).getValue();
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

OR

to the StringDataWriter class:

private class StringDataWriter implements DataWriter {
    private StringObjectInspector inspector;

    public StringDataWriter(StringObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
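      // a plain Java String this time, written through the very same Binary.fromString call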
      String v = inspector.getPrimitiveJavaObject(value);
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

The addBinary call in both classes writes the value as UTF-8 encoded bytes (Binary.fromString performs the UTF-8 encoding). As the code shows, the only difference is that the varchar writer first unwraps the HiveVarchar via getValue(); the bytes that reach the file are the same.

Short answer to the question: in Parquet, both string and varchar values end up as the same UTF-8 encoded binary, so storage is effectively identical. Performance-wise, Hive is a schema-on-read tool; the ParquetRecordReader knows how to read a record and simply reads bytes, so there won't be any performance difference between the varchar and string datatypes.
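To sanity-check that claim, here is a small sketch (assuming the hive-common and parquet-column artifacts are on the classpath) comparing the bytes Binary.fromString produces for the same text arriving as a plain String versus as a HiveVarchar:

import java.util.Arrays;

import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.parquet.io.api.Binary;

public class EncodingCheck {
  public static void main(String[] args) {
    String asString = "1HGCM82633A004352";                 // a 17-char VIN as a Hive STRING
    HiveVarchar asVarchar = new HiveVarchar(asString, 17); // the same value as VARCHAR(17)

    byte[] stringBytes = Binary.fromString(asString).getBytes();
    byte[] varcharBytes = Binary.fromString(asVarchar.getValue()).getBytes();

    // Both writers funnel through Binary.fromString, which encodes UTF-8,
    // so the bytes landing in the Parquet file are identical.
    System.out.println(Arrays.equals(stringBytes, varcharBytes)); // true
  }
}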




Answer 2:


The best way is to go with String. Varchar is internally stored as a string anyway. If you definitely need bounded datatypes, create a view on top of the same data, as shown below.
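A hedged sketch of that view approach, with a made-up endpoint and table/column names (borrowing the VIN and Status fields from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TypedViewDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint and base table, purely for illustration.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      // The base table keeps future-proof STRING columns;
      // the view presents bounded varchar types to consumers that want them.
      stmt.execute("CREATE VIEW IF NOT EXISTS events_typed AS "
          + "SELECT CAST(vin AS VARCHAR(17)) AS vin, "
          + "CAST(status AS VARCHAR(2)) AS status "
          + "FROM events");
    }
  }
}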

The only difference I see is that String is unbounded, while Varchar is bounded by its declared maximum length. String only consumes storage for the data it actually holds.

Vectorization support is also available for String.




Answer 3:


I will restrict this discussion to the ORC format, given that it has become the de facto standard for Hive storage. I don't believe performance is really a question between VARCHAR and STRING in Hive itself. The encoding of the data (see the link below) is the same in both cases for the ORC format. This applies even when you are using a custom SerDe: everything is treated as STRING and the encoding is then applied.

The real issue for me is how STRING is consumed by third-party tools and programming languages. If the end use has no documented issue with STRING, it is easy to move forward with STRING over VARCHAR(n). This is especially useful when working with ETL that requires mapping elements across a pipeline and you don't want to risk size errors being silently ignored. Coming back to third-party tools: SAS, for example, has a number of documented issues with reading the STRING type when connected to Hive. This will become a pain point for some, and for others a point of awareness in their architecture. For example, a database connecting to Hive via JDBC or ODBC might read the data as VARCHAR(max), which can imply a number of challenges that need to be considered.
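One way to see what a client actually receives is to inspect the JDBC metadata for a STRING column. A minimal sketch, with a hypothetical endpoint and table; how a downstream tool interprets the reported precision is exactly what varies between tools:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class JdbcTypeProbe {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT vin FROM events LIMIT 1")) {
      ResultSetMetaData md = rs.getMetaData();
      // The driver reports a type name and a precision for the column;
      // a tool that maps that precision to VARCHAR(max) inherits its limits.
      System.out.println(md.getColumnTypeName(1)); // e.g. STRING
      System.out.println(md.getPrecision(1));      // the driver-reported maximum width
    }
  }
}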

I would suggest treating this, rather than performance in Hive itself, as the major factor. I have not come across anything so far that suggests VARCHAR performs better than STRING when deciding which type to use.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-StringColumnSerialization

Another point is that VARCHAR now supports vectorization too. In any case, a UDF that receives a VARCHAR will treat it as a STRING, so that point is negated.
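For completeness, a GenericUDF that accepts varchar (or string) input directly looks roughly like this. This is a sketch only: the class name, function name, and uppercasing logic are made up, and argument checking is omitted:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical UDF: accepts varchar (or string) directly, instead of relying
// on the implicit varchar-to-string conversion that non-generic UDFs get.
public class UpperVarchar extends GenericUDF {
  private PrimitiveObjectInspector argOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    argOI = (PrimitiveObjectInspector) arguments[0];
    // return a plain string inspector for simplicity
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object raw = arguments[0].get();
    if (raw == null) {
      return null;
    }
    // For a varchar argument this yields a HiveVarchar; toString() is its value.
    return argOI.getPrimitiveJavaObject(raw).toString().toUpperCase();
  }

  @Override
  public String getDisplayString(String[] children) {
    return "upper_varchar(" + children[0] + ")";
  }
}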

Please correct me if your understanding differs, and share a reference link if it helps.



Source: https://stackoverflow.com/questions/45191793/hive-varchar-vs-string-is-there-any-advantage-if-the-storage-format-is-parqu
