Hive count tuple?

我的未来我决定 提交于 2019-12-07 16:56:25
Daniel Koverman

Based on the information you've provided, the output you want is not possible. Right now you have a table that looks like this:

res                           field
---                           -----
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
the distance                  title
the distance                  title
open book                     title
daria                         title
fashion nugget                album
fashion nugget                album
fashion nugget                album
fashion nugget                album
cake                          artist
cake                          artist
cake                          artist
cake                          artist

Because you partitioned it, Hive happens to store it in three different folders but this doesn't affect the results of the query. I added some extra tracks and I imagine with the extra tracks you would want the output to be (correct me if I'm wrong):

title                  album                       artist              count
baby one more time     hit me baby one mroe time   britney spears      9
the distance           fashion nuggets             cake                2
open book              fashion nuggets             cake                1
daria                  fashion nuggets             cake                1

But there is no way to tell that "open book" has anything to do with "fashion nuggets" or "cake", just like there is no way to tell that "baby one more time" is associated with "britney spears". You could try to match on the counts but then you would end up with something like this

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
null                   fashion nuggets             cake                3
the distance           null                        null                1
open book,daria        null                        null                1

I think you wanted a table with columns like this

title                  album                         artist
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
the distance           fashion nuggets               cake
the distance           fashion nuggets               cake
open book              fashion nuggets               cake
daria                  fashion nuggets               cake

but still partitioned on maybe artist and/or album. With or without the partitioning, you can write the query as if the table is not partitioned (it doesn't effect results as long as the data isn't corrupted, only performance). It will affect how you create and populate the table, however. Let me know if this is what you wanted an I'll edit this answer to answer that question instead.


THE EDIT AS PROMISED:

Okay, creating the table without any partitions is straightforward:

CREATE TABLE results (title string, album string, artist string)

Creating the table with with partitions is almost as straightforward, you just need to first decided what to partition on. If you partition on artist it will mean you can run queries particular to a single or set of artists without having to process information for other artists. If you partition by artist and album you can do the same thing with albums as well. This does come at the cost of breaking a large file into smaller files, and generally MapReduce (and therefore Hive) works better with large files. I wouldn't worry about partitioning at all unless you dealing with at least 10's of GBs and feel like you have a handle on how paritioning works and HiveQL in general. But for completeness, partitioning by artist:

CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);

and partitioned by artist then by album. Partitioning by (artist string, album string) vs (album string, artist string) won't change your results, but you should put the logical top of the hierarchy first.

CREATE TABLE (title string) PARTITIONED BY (artist string, album string);

Populating this table won't be easy if the only information we have access to are from the tables titles, artists, and albums since we have a huge list of titles, artists, and albums but no way to tell which title goes with which album for example. I hope you have some data where these relationships are still intact or your data set is still intact. Without knowing the form of this hypothetical data, I can't provide an answer for how to populate your tables. But if you have partitioned tables, this answer might be useful to you if you don't want to manually specify every artist and album(since every artist gets there own partition, and within though partition every album gets it's own partition).

EDIT: The asker has xml files which have the title, ablum, arist relationships intact. More information on this in the comments.

Now the meat of the question is counting unique tuples. This will be the same regardless of how data was partitioned, if at all. We do this using the GROUP BY clause. When you specify one column (or partition, which can be thought of as a column with special properties), you break the data down into groups which have distinct values for that column. If you specify several columns, you break the data down into groups with have distinct values for their combination of columns. This is the we take advantage of to count distinct tuples:

SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist

and here we are:

title                  album                       artist              count
baby one more time     hit me baby one mroe time   britney spears      9
the distance           fashion nuggets             cake                2
open book              fashion nuggets             cake                1
daria                  fashion nuggets             cake                1
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!