orc

Spark read multiple directories into multiple dataframes

大兔子大兔子 submitted on 2019-12-06 11:16:38
Question: I have a directory structure on S3 looking like this:

foo
|-base
   |-2017
      |-01
         |-04
            |-part1.orc, part2.orc ....
|-A
   |-2017
      |-01
         |-04
            |-part1.orc, part2.orc ....
|-B
   |-2017
      |-01
         |-04
            |-part1.orc, part2.orc ....

Meaning that for directory foo I have multiple output tables, base, A, B, etc., in a given path based on the timestamp of a job. I'd like to left join them all, based on a timestamp and the master directory, in this case foo. This would mean reading in each output table base, A, B,
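
A minimal PySpark sketch of one way to read each output table for a given timestamp and left-join them, under assumptions the question leaves open: the s3a URI, the date path 2017/01/04, and the join key "id" are placeholders, not details from the original post.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    root = "s3a://bucket/foo"        # hypothetical bucket and prefix
    date_path = "2017/01/04"         # the job timestamp being read

    # One DataFrame per output table for that timestamp.
    tables = {name: spark.read.orc("{}/{}/{}".format(root, name, date_path))
              for name in ["base", "A", "B"]}

    # Left-join A, B, ... onto base; "id" stands in for the real join key.
    result = tables["base"]
    for name in ["A", "B"]:
        result = result.join(tables[name], on="id", how="left")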

Java 使用 Tess4J 进行 图片文字识别 笔记

拟墨画扇 submitted on 2019-12-05 13:24:40
My recent work required recognizing text in images, and I found Tess4J online, so here is a summary of the problems I ran into while using it. A brief introduction to Tess4J: http://tess4j.sourceforge.net/ (may require a proxy to access). The project homepage is very concise. Tess4J is an open-source project that wraps Tesseract OCR for Java via JNA, released under the Apache License, v2.0. It supports TIFF, JPEG, GIF, PNG, and BMP image formats, multi-page TIFF images, and the PDF document format (TIFF support is a big highlight). Next, a look at Tesseract OCR itself: https://code.google.com/p/tesseract-ocr/ is a Google-backed open-source OCR project. It supports multiple languages (the current 3.02 release covers English, Simplified Chinese, and Traditional Chinese) and runs on Windows, Linux, and Mac OS X. In practice Tesseract's recognition rate is very high (I only used it on digits, and with clear images no errors occurred). Most code samples found online install Tesseract OCR on Windows and drive recognition through CMD commands, whereas Tess4J provides JNI bindings for Tesseract and also ships some image-manipulation utility classes, for example image scaling

Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive

我只是一个虾纸丫 submitted on 2019-12-04 05:04:53
The issue arises when executing SHOW CREATE TABLE on an ORC table and then executing the resulting CREATE TABLE statement. Using show create table, you get this:

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create the table with those clauses, you will then get a casting error when selecting. The error looks like:

Failed with exception java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable

To fix this, just
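
As a hedged illustration of the difference the title asks about (not necessarily the fix the truncated text was about to give): in Hive, STORED AS ORC is shorthand that also sets the ORC SerDe, while spelling out only INPUTFORMAT and OUTPUTFORMAT leaves the SerDe at its default. A minimal sketch using PySpark with Hive support; the table and column names are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Shorthand form: SerDe, input format, and output format are set together.
    spark.sql("""
        CREATE TABLE demo_orc_short (id INT, name STRING)
        STORED AS ORC
    """)

    # Spelled-out form: the ROW FORMAT SERDE line is the part that the
    # snippet quoted above does not show.
    spark.sql("""
        CREATE TABLE demo_orc_long (id INT, name STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    """)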

Hadoop ORC file - How it works - How to fetch metadata

佐手、 submitted on 2019-12-03 15:48:34
I am new to the ORC file format. I went through many blogs but didn't get a clear understanding. Please help and clarify the questions below. Can I fetch the schema from an ORC file? I know that in Avro the schema can be fetched. How does it actually provide schema evolution? I know that a few columns can be added, but how do I do it? The only way I know of creating an ORC file is by loading data into a Hive table that stores its data in ORC format. How does the ORC file index work? What I know is that an index is maintained for every stripe. But as the file is not sorted, how does it help in looking up data in the list of stripes? How does it help in skipping stripes while
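
On the first question, one way to see the schema embedded in an ORC file is simply to read the file back, since ORC stores its schema in the file footer. A minimal PySpark sketch; the path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The schema comes from the ORC footer itself; no external metadata is needed.
    df = spark.read.orc("/path/to/orc_table")   # hypothetical location
    df.printSchema()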

Parquet vs ORC vs ORC with Snappy

心不动则不痛 submitted on 2019-12-02 13:50:52
I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests come out the opposite of the documents I went through. Here are some details of my data:

Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB

Parquet was worst as far as compression for my table is concerned. My tests with the above tables yielded
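
For reference, a minimal PySpark sketch of how such a size comparison can be set up (a Spark-side equivalent of the Hive tables above, not the poster's actual setup); the source and output paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("/path/to/source.csv", header=True)     # hypothetical source

    # Same data written three ways; compare the resulting directory sizes.
    df.write.orc("/tmp/out_orc_zlib", compression="zlib")       # zlib, Hive's usual ORC default
    df.write.orc("/tmp/out_orc_snappy", compression="snappy")   # ORC with Snappy
    df.write.parquet("/tmp/out_parquet", compression="snappy")  # Parquet with Snappy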

pyspark Loading multiple partitioned files in a single load

强颜欢笑 submitted on 2019-12-02 10:57:22
I am trying to load multiple files in a single load. They are all partitioned files. When I tried it with 1 file it works, but when I listed 24 files it gives me this error, and I could not find any documentation of the limitation or of a workaround other than doing a union after the load. Are there any alternatives? Code below to re-create the problem:

basepath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc',
         '/file/df201603.orc', '/file/df201604.orc',
         '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc',
         '/file/df201606.orc', '/file
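
A hedged sketch of the usual alternative to one read per file followed by a union: DataFrameReader.load accepts a list of paths, and a basePath option can be supplied for partition discovery. The paths reuse the ones from the question; whether this sidesteps the poster's specific error is an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    basepath = '/file/'
    paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc']

    # load() takes a list of paths, so every file comes back in one DataFrame
    # without a union; basePath tells Spark where partition discovery starts.
    df = (spark.read
               .format('orc')
               .option('basePath', basepath)
               .load(paths))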

Hive: Merging Configuration Settings not working

[亡魂溺海] submitted on 2019-12-01 11:17:11
On Hive 2.2.0, I am filling an ORC table from another source table of size 1.34 GB, using the query:

INSERT INTO TABLE TableOrc SELECT * FROM Table;  ---- (1)

The query creates the TableOrc table with 6 ORC files, which are much smaller than the block size of 256 MB.

-- FolderList1
-rwxr-xr-x  user1  supergroup  65.01 MB  1/1/2016, 10:14:21 AM  1  256 MB  000000_0
-rwxr-xr-x  user1  supergroup  67.48 MB  1/1/2016, 10:14:55 AM  1  256 MB  000001_0
-rwxr-xr-x  user1  supergroup  66.3 MB   1/1/2016, 10:15:18 AM  1  256 MB  000002_0
-rwxr-xr-x  user1  supergroup  63.83 MB  1/1/2016, 10:15:41 AM  1  256 MB  000003_0
-rwxr-xr-x  user1

Aggregating multiple columns with custom function in Spark

浪尽此生 submitted on 2019-11-30 10:21:47
Question: I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames over multiple columns. I have a table like this of the type (name, item, price):

john | tomato | 1.99
john | carrot | 0.45
bill | apple  | 0.99
john | banana | 1.29
bill | taco   | 2.59

I would like to aggregate each person's items and their cost into a list like this:

john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)

Is this possible in DataFrames? I
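
One common way to get this result without a custom aggregation function is to collect a list of (item, price) structs per name. A minimal PySpark sketch with the sample rows hard-coded for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    rows = [("john", "tomato", 1.99), ("john", "carrot", 0.45),
            ("bill", "apple", 0.99), ("john", "banana", 1.29),
            ("bill", "taco", 2.59)]
    df = spark.createDataFrame(rows, ["name", "item", "price"])

    # Pair item with price, then gather all pairs for each name into one array.
    result = (df.groupBy("name")
                .agg(F.collect_list(F.struct("item", "price")).alias("items")))
    result.show(truncate=False)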