orc

Spark read multiple directories into multiple dataframes

大兔子大兔子 submitted on 2019-12-06 11:16:38
Question: I have a directory structure on S3 looking like this:

foo
|-base
   |-2017
      |-01
         |-04
            |-part1.orc, part2.orc ....
|-A
   |-2017
      |-01
         |-04
            |-part1.orc, part2.orc ....
|-B
   |-2017
      |-01
         |-04
            |-part1.orc, part2.orc ....

Meaning that for directory foo I have multiple output tables, base, A, B, etc., in a given path based on the timestamp of a job. I'd like to left join them all, based on a timestamp and the master directory, in this case foo. This would mean reading in each output table base, A, B,
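
A minimal PySpark sketch of one way to read each output table for a given timestamp and left-join them, under assumptions the question leaves open: the s3a URI, the date path 2017/01/04, and the join key "id" are placeholders, not details from the original post.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    root = "s3a://bucket/foo"        # hypothetical bucket and prefix
    date_path = "2017/01/04"         # the job timestamp being read

    # One DataFrame per output table for that timestamp.
    tables = {name: spark.read.orc("{}/{}/{}".format(root, name, date_path))
              for name in ["base", "A", "B"]}

    # Left-join A, B, ... onto base; "id" stands in for the real join key.
    result = tables["base"]
    for name in ["A", "B"]:
        result = result.join(tables[name], on="id", how="left")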

Java 使用 Tess4J 进行 图片文字识别 笔记

拟墨画扇 submitted on 2019-12-05 13:24:40
My recent work required recognizing text in images, and I found Tess4J online, so here is a summary of the problems I ran into while using it. A brief introduction to Tess4J: http://tess4j.sourceforge.net/ (may require a proxy to access). The project homepage is very concise. Tess4J is an open-source project that wraps Tesseract OCR for Java via JNA, released under the Apache License, v2.0. It supports TIFF, JPEG, GIF, PNG, and BMP image formats, multi-page TIFF images, and the PDF document format (TIFF support is a big highlight). Next, a look at Tesseract OCR itself: https://code.google.com/p/tesseract-ocr/ is a Google-backed open-source OCR project. It supports multiple languages (the current 3.02 release covers English, Simplified Chinese, and Traditional Chinese) and runs on Windows, Linux, and Mac OS X. In practice Tesseract's recognition rate is very high (I only used it on digits, and with clear images no errors occurred). Most code samples found online install Tesseract OCR on Windows and drive recognition through CMD commands, whereas Tess4J provides JNI bindings for Tesseract and also ships some image-manipulation utility classes, for example image scaling

Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive

我只是一个虾纸丫 submitted on 2019-12-04 05:04:53
The issue arises when executing SHOW CREATE TABLE on an ORC table and then executing the resulting CREATE TABLE statement. Using show create table, you get this:

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create the table with those clauses, you will then get a casting error when selecting. The error looks like:

Failed with exception java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable

To fix this, just
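
As a hedged illustration of the difference the title asks about (not necessarily the fix the truncated text was about to give): in Hive, STORED AS ORC is shorthand that also sets the ORC SerDe, while spelling out only INPUTFORMAT and OUTPUTFORMAT leaves the SerDe at its default. A minimal sketch using PySpark with Hive support; the table and column names are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Shorthand form: SerDe, input format, and output format are set together.
    spark.sql("""
        CREATE TABLE demo_orc_short (id INT, name STRING)
        STORED AS ORC
    """)

    # Spelled-out form: the ROW FORMAT SERDE line is the part that the
    # snippet quoted above does not show.
    spark.sql("""
        CREATE TABLE demo_orc_long (id INT, name STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    """)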

Hadoop ORC file - How it works - How to fetch metadata

佐手、 submitted on 2019-12-03 15:48:34
I am new to the ORC file format. I went through many blogs but didn't get a clear understanding. Please help and clarify the questions below. Can I fetch the schema from an ORC file? I know that in Avro the schema can be fetched. How does it actually provide schema evolution? I know that a few columns can be added, but how do I do it? The only way I know of creating an ORC file is by loading data into a Hive table that stores its data in ORC format. How does the ORC file index work? What I know is that an index is maintained for every stripe. But as the file is not sorted, how does it help in looking up data in the list of stripes? How does it help in skipping stripes while
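
On the first question, one way to see the schema embedded in an ORC file is simply to read the file back, since ORC stores its schema in the file footer. A minimal PySpark sketch; the path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The schema comes from the ORC footer itself; no external metadata is needed.
    df = spark.read.orc("/path/to/orc_table")   # hypothetical location
    df.printSchema()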

Parquet vs ORC vs ORC with Snappy

心不动则不痛 submitted on 2019-12-02 13:50:52
I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests come out the opposite of the documents I went through. Here are some details of my data:

Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB

Parquet was worst as far as compression for my table is concerned. My tests with the above tables yielded
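
For reference, a minimal PySpark sketch of how such a size comparison can be set up (a Spark-side equivalent of the Hive tables above, not the poster's actual setup); the source and output paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("/path/to/source.csv", header=True)     # hypothetical source

    # Same data written three ways; compare the resulting directory sizes.
    df.write.orc("/tmp/out_orc_zlib", compression="zlib")       # zlib, Hive's usual ORC default
    df.write.orc("/tmp/out_orc_snappy", compression="snappy")   # ORC with Snappy
    df.write.parquet("/tmp/out_parquet", compression="snappy")  # Parquet with Snappy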

pyspark Loading multiple partitioned files in a single load

强颜欢笑 submitted on 2019-12-02 10:57:22
I am trying to load multiple files in a single load. They are all partitioned files. When I tried it with 1 file it works, but when I listed 24 files it gives me this error, and I could not find any documentation of the limitation or of a workaround other than doing a union after the load. Are there any alternatives? Code below to re-create the problem:

basepath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc',
         '/file/df201603.orc', '/file/df201604.orc',
         '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc',
         '/file/df201606.orc', '/file
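
A hedged sketch of the usual alternative to one read per file followed by a union: DataFrameReader.load accepts a list of paths, and a basePath option can be supplied for partition discovery. The paths reuse the ones from the question; whether this sidesteps the poster's specific error is an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    basepath = '/file/'
    paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc']

    # load() takes a list of paths, so every file comes back in one DataFrame
    # without a union; basePath tells Spark where partition discovery starts.
    df = (spark.read
               .format('orc')
               .option('basePath', basepath)
               .load(paths))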

Hive: Merging Configuration Settings not working

[亡魂溺海] submitted on 2019-12-01 11:17:11
On Hive 2.2.0, I am filling an ORC table from another source table of size 1.34 GB, using the query:

INSERT INTO TABLE TableOrc SELECT * FROM Table;  ---- (1)

The query creates the TableOrc table with 6 ORC files, which are much smaller than the block size of 256 MB.

-- FolderList1
-rwxr-xr-x  user1  supergroup  65.01 MB  1/1/2016, 10:14:21 AM  1  256 MB  000000_0
-rwxr-xr-x  user1  supergroup  67.48 MB  1/1/2016, 10:14:55 AM  1  256 MB  000001_0
-rwxr-xr-x  user1  supergroup  66.3 MB   1/1/2016, 10:15:18 AM  1  256 MB  000002_0
-rwxr-xr-x  user1  supergroup  63.83 MB  1/1/2016, 10:15:41 AM  1  256 MB  000003_0
-rwxr-xr-x  user1

Aggregating multiple columns with custom function in Spark

浪尽此生 submitted on 2019-11-30 10:21:47
Question: I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames over multiple columns. I have a table like this of the type (name, item, price):

john | tomato | 1.99
john | carrot | 0.45
bill | apple  | 0.99
john | banana | 1.29
bill | taco   | 2.59

I would like to aggregate each person's items and their cost into a list like this:

john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)

Is this possible in DataFrames? I
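
One common way to get this result without a custom aggregation function is to collect a list of (item, price) structs per name. A minimal PySpark sketch with the sample rows hard-coded for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    rows = [("john", "tomato", 1.99), ("john", "carrot", 0.45),
            ("bill", "apple", 0.99), ("john", "banana", 1.29),
            ("bill", "taco", 2.59)]
    df = spark.createDataFrame(rows, ["name", "item", "price"])

    # Pair item with price, then gather all pairs for each name into one array.
    result = (df.groupBy("name")
                .agg(F.collect_list(F.struct("item", "price")).alias("items")))
    result.show(truncate=False)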