parquet

Exclusive | 10 Common Programming Mistakes Data Scientists Make (with Solutions)

僤鯓⒐⒋嵵緔 Submitted on 2019-11-29 00:22:07
A data scientist is "someone who is better at statistics than any software engineer and better at software engineering than any statistician". Many data scientists have a statistics background but little experience in software engineering. I am a senior data scientist, ranked in the top 1% on Stackoverflow for Python programming, and I work with many (junior) data scientists. Below are the 10 common mistakes I see most often, together with solutions:

1. Not sharing the data referenced in the code
2. Hardcoding paths that others cannot access
3. Mixing code with data
4. Committing data to Git along with the source code
5. Writing functions instead of DAGs
6. Writing for loops
7. Not writing unit tests
8. Not documenting code
9. Saving data as csv or pickle files
10. Using Jupyter notebooks

1. Not sharing the data referenced in the code

Data science needs both code and data. So for someone else to reproduce your results, they need to be able to access the data. It sounds obvious, but many people forget to share the data their code refers to.

    import pandas as pd
    df1 = pd.read_csv('file-i-dont-have.csv')  # fails
    do_stuff(df1)

Solution: use d6tpipe (https://github.com/d6t/d6tpipe) to share the data files referenced in your code, upload them to S3/web/Google Drive etc., or save them to a database so others can retrieve the files (but do not add them to git, see below).

2. Hardcoding paths that others cannot access

Similar to mistake 1,
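The fix for mistake 1 comes down to reading data from a shared location instead of a file that only exists on your machine. A minimal sketch, assuming the data has been uploaded to an S3 bucket (the bucket and key below are hypothetical placeholders) and that the s3fs package is installed so pandas can resolve s3:// URLs:

```python
import pandas as pd

# Hypothetical shared location: bucket and key are placeholders.
# pandas can read s3:// URLs directly when s3fs is installed.
DATA_URL = "s3://my-team-bucket/projects/example/input.csv"

df = pd.read_csv(DATA_URL)
print(df.head())  # stand-in for the article's do_stuff(df1)
```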

What is a parquet file?

核能气质少年 Submitted on 2019-11-28 22:08:26
Apache Parquet is a columnar storage format usable by any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

The origin of Parquet: we created Parquet to make a compressed, efficient columnar data representation available to every project in the Hadoop ecosystem. Parquet was built from the ground up with complex nested data structures in mind and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to the simple flattening of nested namespaces.

File format
Read this to understand the format.

    4-byte magic number "PAR1"
    <Column 1 Chunk 1 + Column Metadata>
    <Column 2 Chunk 1 + Column Metadata>
    ...
    <Column N Chunk 1 + Column Metadata>
    <Column 1 Chunk 2 + Column Metadata>
    <Column 2 Chunk 2 + Column Metadata>
    ...
    <Column N Chunk 2 + Column Metadata>
    ...
    <Column 1 Chunk M + Column Metadata>
    <Column 2 Chunk M + Column Metadata>
    ...
    <Column N Chunk M + Column
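The row-group / column-chunk layout above can be inspected programmatically. A minimal sketch with pyarrow (file name hypothetical, assuming a reasonably recent pyarrow): write a tiny table, then read back the footer metadata, which reports the row groups and per-column chunks shown in the layout.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny two-column table so there is something to inspect.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet")

# The footer metadata mirrors the layout above: row groups containing one chunk per column.
meta = pq.read_metadata("example.parquet")
print(meta.num_row_groups, meta.num_columns)
print(meta.row_group(0).column(0))  # metadata for column 1, chunk 1
```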

Spark SQL saveAsTable is not compatible with Hive when partition is specified

若如初见. Submitted on 2019-11-28 20:50:33
An edge case: when saving a parquet table in Spark SQL with a partition,

    // schema definition
    final StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("time", DataTypes.StringType, true),
        DataTypes.createStructField("accountId", DataTypes.StringType, true),
        ...

    DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);
    df.coalesce(1)
      .write()
      .mode(SaveMode.Append)
      .format("parquet")
      .partitionBy("year")
      .saveAsTable("tblclick8partitioned");

Spark warns: Persisting partitioned data source relation into Hive metastore in Spark SQL specific
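One workaround often suggested for this incompatibility is to skip saveAsTable entirely: write plain partitioned parquet files and declare a Hive-compatible external table over them yourself. A rough sketch in PySpark (the question uses the Java API; the path is a placeholder, and a Hive-enabled SparkSession named spark plus the DataFrame df from the snippet above are assumed):

```python
# Write ordinary partitioned parquet files (no Spark-specific table format).
df.coalesce(1).write.mode("append").partitionBy("year") \
    .parquet("/warehouse/tblclick8partitioned")  # hypothetical location

# Declare a Hive-compatible external table over that location,
# then register the partitions so Hive can see them.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tblclick8partitioned (
        time STRING,
        accountId STRING
    )
    PARTITIONED BY (year STRING)
    STORED AS PARQUET
    LOCATION '/warehouse/tblclick8partitioned'
""")
spark.sql("MSCK REPAIR TABLE tblclick8partitioned")
```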

How to read partitioned parquet files from S3 using pyarrow in python

*爱你&永不变心* Submitted on 2019-11-28 20:22:50
Question: I am looking for ways to read data from multiple partitioned directories on S3 using Python.

    data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
    data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet

pyarrow's ParquetDataset module has the capability to read from partitions, so I have tried the following code:

    >>> import pandas as pd
    >>> import pyarrow.parquet as pq
    >>> import s3fs
    >>> a = "s3://my_bucker/path/to/data_folder/"
    >>> dataset = pq
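A common way to finish that snippet is to hand ParquetDataset an s3fs filesystem so it can discover the partition directories itself. A sketch under the same bucket layout as the question (bucket name taken from the snippet above, everything else assumed):

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Point the dataset at the partitioned folder; pyarrow discovers the
# serial_number=*/cur_date=* directories as partition columns.
dataset = pq.ParquetDataset("my_bucker/path/to/data_folder", filesystem=fs)
df = dataset.read().to_pandas()
print(df.columns)
```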

How do you control the size of the output file?

☆樱花仙子☆ Submitted on 2019-11-28 18:29:23
In Spark, what is the best way to control the file size of the output files? For example, in log4j we can specify a max file size, after which the file rotates. I am looking for a similar solution for parquet files. Is there a max file size option available when writing a file? I have a few workarounds, but none is good. If I want to limit files to 64 MB, one option is to repartition the data and write to a temp location, and then merge the files together using the file sizes in the temp location. But getting the correct file size is difficult. It's impossible for Spark to control the size of Parquet
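One knob worth mentioning here: newer Spark versions (2.2+) can cap the number of rows per output file, which indirectly bounds file size once you know the approximate row width. A hedged PySpark sketch (the DataFrame, output path, and the row-count figure are assumptions):

```python
# Cap each output parquet file at roughly N rows; pick N so that
# N * (average row size) stays near the 64 MB target.
(df.write
   .option("maxRecordsPerFile", 1000000)
   .mode("overwrite")
   .parquet("/tmp/output"))
```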

JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

こ雲淡風輕ζ Submitted on 2019-11-28 18:24:31
I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. As far as I have found, converting the messages to Parquet is done with Hive, Pig, or Spark. I need to convert to Parquet without involving these, using only Java. To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications

How to read a Parquet file into Pandas DataFrame?

我的未来我决定 Submitted on 2019-11-28 18:11:32
How do I read a modestly sized Parquet data set into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark. I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to be going through an external
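For completeness, the way this is usually done today is pandas' own parquet reader backed by pyarrow (or fastparquet), with no cluster involved. A minimal sketch with a hypothetical local file:

```python
import pandas as pd

# Requires pyarrow (or fastparquet) to be installed; no Hadoop/Spark needed.
df = pd.read_parquet("data/example.parquet", engine="pyarrow")
print(df.head())
```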

“Failed to find data source: parquet” when making a fat jar with maven

女生的网名这么多〃 Submitted on 2019-11-28 12:33:29
I am assembling a fat jar with the Maven assembly plugin and run into the following issue:

    Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: parquet. Please find packages at http://spark-packages.org
        at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
        at org.apache.spark.sql.execution.datasources.DataSource

Spark Dataframe validating column names for parquet writes (scala)

核能气质少年 Submitted on 2019-11-28 12:21:09
I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format. However, some of the JSON events contain spaces in their keys; I want to log such events and filter/drop them from the data frame before converting it to Parquet, because ,;{}()\n\t= are considered special characters in the Parquet schema (CatalystSchemaConverter), as listed in [1] below, and thus should not be allowed in the column names. How can I do such validation in the DataFrame on the column names and drop such an event altogether without erroring out the Spark
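A sketch of one way to do that validation, expressed in PySpark rather than the question's Scala (the DataFrame name and output path are assumed): check each column name against the characters Parquet rejects and drop the offending columns, or skip the whole batch, before writing.

```python
import re

# Characters the Parquet schema (CatalystSchemaConverter) rejects in column names,
# plus the space character from the question.
FORBIDDEN = re.compile(r"[ ,;{}()\n\t=]")

bad_cols = [c for c in df.columns if FORBIDDEN.search(c)]
if bad_cols:
    print("dropping columns with invalid names: %s" % bad_cols)
    df = df.drop(*bad_cols)  # or route/skip the whole batch instead of dropping

df.write.mode("append").parquet("/tmp/events")
```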

Cast int96 timestamp from parquet to golang

醉酒当歌 Submitted on 2019-11-28 10:58:53
Question: I have this 12-byte array (int96) that I need to cast to a timestamp: [128 76 69 116 64 7 0 0 48 131 37 0]. How do I cast it to a timestamp? I understand the first 8 bytes should be cast to an int64 of milliseconds representing an epoch datetime.

Answer 1: The first 8 bytes are time in nanoseconds, not milliseconds. They are not measured from the epoch either, but from midnight. The date part is stored separately in the last 4 bytes as a Julian day number. Here is the result of an experiment I did earlier that may help. I stored '2000
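Based on the layout the answer describes (first 8 bytes = nanoseconds within the day, last 4 bytes = Julian day number, both little-endian), the array from the question can be decoded as follows; the sketch is in Python rather than Go, since the arithmetic is the same in any language:

```python
import struct
from datetime import datetime, timedelta

raw = bytes([128, 76, 69, 116, 64, 7, 0, 0, 48, 131, 37, 0])

# int96 layout: little-endian int64 nanoseconds-within-day, then int32 Julian day number.
nanos_in_day, julian_day = struct.unpack("<qi", raw)

UNIX_EPOCH_JULIAN_DAY = 2440588  # Julian day number of 1970-01-01
date = datetime(1970, 1, 1) + timedelta(days=julian_day - UNIX_EPOCH_JULIAN_DAY)
timestamp = date + timedelta(microseconds=nanos_in_day // 1000)
print(timestamp)  # roughly 2018-10-24 02:12:53.410000
```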