Question
I am trying to generate a stacked area graph in R using the command below:
ggplot(p_ash_r_100,aes(x=SMPL_TIME,y=SMPL_CNT,col=EVENT,group=1))+ geom_area()
Here EVENT is the third variable, which I want to chart against time and sample counts from the Oracle DB.
But the graph produced by the above command is empty.
My questions are:
1. How do I fix the empty graph problem?
2. How can I keep only the top 10 events by amount of data, either when plotting or beforehand? This is easy to do in Excel, as shown in the image file.
My dataset looks like this:
> p_ash_r_100
SMPL_TIME SQL_ID MODULE EVENT SMPL_CNT
1 11-APR-17 09:00 03d5x9busf1d8 SQL*Plus CPU 1
2 11-APR-17 09:00 2pb7bzzadj0pn OGG-RCASI004-OPEN_DATA_SOURCE db file sequential read 1
3 11-APR-17 09:00 NO_SQL GoldenGate CPU 1
4 11-APR-17 09:00 NO_SQL MMON_SLAVE CPU 1
5 11-APR-17 09:00 NO_SQL NO_SQL Log archive I/O 1
6 11-APR-17 09:00 NO_SQL XStream CPU 1
7 11-APR-17 09:00 acuzxh557cq81 GoldenGate db file sequential read 1
8 11-APR-17 09:00 cqtby4bsrmxzh GoldenGate CPU 1
9 11-APR-17 09:00 dgzp3at57cagd GoldenGate db file sequential read 2
10 11-APR-17 09:00 fjp9t92a5yx1v GoldenGate db file sequential read 1
11 11-APR-17 09:00 guh1sva39p9db GoldenGate db file sequential read 1
12 11-APR-17 09:01 0hz0dhgwk12cd GoldenGate direct path write 1
13 11-APR-17 09:01 2jafq5d4n0akv GoldenGate CPU 1
14 11-APR-17 09:01 37cspa0acgqxp GoldenGate db file sequential read 2
15 11-APR-17 09:01 79rugrngrvpt1 OGG-RADDR025-OPEN_DATA_SOURCE db file sequential read 1
16 11-APR-17 09:01 7k6zp92kbv28m GoldenGate CPU 1
17 11-APR-17 09:01 7nvtkfc0bt8vv GoldenGate db file sequential read 1
18 11-APR-17 09:01 7pvpzvd1g769d GoldenGate CPU 1
19 11-APR-17 09:01 9gduk46rmt5jy GoldenGate db file sequential read 1
20 11-APR-17 09:01 NO_SQL GoldenGate CPU 7
An image of the dataset is added below for ease of understanding.
The final graph I want is something like this one from Excel:
Value filters in Excel used to get the top 10 events:
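Answer 1:
I'll start with the second question, which is easier. Using the dplyr package, you can use top_n to get the rows with the n largest values in a given column. For example (p_ash_r_100a is the copy of the data with SMPL_TIME converted to a date-time, created further down):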
> top_n(p_ash_r_100a, 3, SMPL_CNT) %>% arrange(desc(SMPL_CNT))
# A tibble: 3 × 5
SMPL_TIME SQL_ID MODULE EVENT SMPL_CNT
<dttm> <chr> <chr> <chr> <int>
1 2017-04-11 09:01:00 NO_SQL GoldenGate CPU 7
2 2017-04-11 09:00:00 dgzp3at57cagd GoldenGate db file sequential read 2
3 2017-04-11 09:01:00 37cspa0acgqxp GoldenGate db file sequential read 2
Note that you will get more than n rows if there are ties for nth place. Thus top_n(p_ash_r_100, 10, SMPL_CNT) will return the entire sample data set because of the 17-way tie for 4th.
As for the first question, the documentation for geom_area provides a clue:
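If the goal is the top events by total sample count, rather than the largest individual rows, one possible approach is to total SMPL_CNT per EVENT first and then keep only the rows for the highest-ranked events. The following is a minimal sketch under assumptions: it requires dplyr 1.0.0 or later for slice_max, and the data frame ash, the variable n_keep, and the values in it are made up for illustration (only the column names come from the question).

```r
library(dplyr)

# Hypothetical data with the same column names as p_ash_r_100.
ash <- tibble(
  SMPL_TIME = c("09:00", "09:00", "09:00", "09:01", "09:01", "09:01"),
  EVENT     = c("CPU", "db file sequential read", "Log archive I/O",
                "CPU", "db file sequential read", "direct path write"),
  SMPL_CNT  = c(4L, 6L, 1L, 9L, 3L, 1L)
)

n_keep <- 2  # assumption for this toy data; use 10 on the real data

# Total samples per event, then the n_keep events with the largest totals.
# with_ties = FALSE returns exactly n_keep events even when totals are tied.
top_events <- ash %>%
  count(EVENT, wt = SMPL_CNT, name = "total") %>%
  slice_max(total, n = n_keep, with_ties = FALSE)

# Keep only the rows that belong to those events.
ash_top <- ash %>%
  semi_join(top_events, by = "EVENT")
```

The filtered data frame can then go through the same summarising and plotting steps as the full data set.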
An area plot is the continuous analog of a stacked bar chart (see geom_bar), and can be used to show how composition of the whole varies over the range of x.
This suggests that geom_area expects the column mapped to x to be numeric. Based on the listing for p_ash_r_100, SMPL_TIME appears to be a character vector. With the lubridate package, we can convert SMPL_TIME to a date-time with dmy_hm:
library(dplyr)
library(lubridate)
# Parse strings such as "11-APR-17 09:00" into date-times
p_ash_r_100a <- p_ash_r_100 %>%
  mutate_at(vars(SMPL_TIME), dmy_hm)
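As a quick sanity check, the conversion can be tried on a single string in the same format as the listing. This sketch assumes lubridate is installed and that the session locale uses English month abbreviations; the variable name x is only for illustration.

```r
library(lubridate)

# One timestamp in the same day-month-year hour:minute format as SMPL_TIME.
x <- dmy_hm("11-APR-17 09:00")

# The result is a date-time (POSIXct), which ggplot2 treats as continuous.
class(x)
format(x, "%Y-%m-%d %H:%M")
```

If the call returns NA with a parsing warning, the strings do not match the assumed format and a different lubridate parser may be needed.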
However, this isn't enough to get the plot you want, since there are multiple values of y for each combination of x and fill (which is the correct aesthetic for geom_area, not "col"). We need to summarise the data before plotting:
p_ash_r_100a %>%
  group_by(SMPL_TIME, EVENT) %>%
  summarise(total = sum(SMPL_CNT)) %>%
  ggplot(aes(SMPL_TIME, total, fill = EVENT)) +
  geom_area()
Yet the plot is still not correct. This is because not every combination of SMPL_TIME and EVENT is represented in the data set. We need to explicitly tell geom_area that y is equal to zero for those missing rows. One way is to use the handy fill argument in tidyr::spread.
group_by(p_ash_r_100a, SMPL_TIME, EVENT) %>%
  summarise(smpl_sum = sum(SMPL_CNT)) %>%
  spread(EVENT, smpl_sum, fill = 0) %>%
  gather(EVENT, smpl_sum, CPU, `db file sequential read`,
         `direct path write`,
         `Log archive I/O`) %>%
  ggplot(aes(x = SMPL_TIME, y = smpl_sum, fill = EVENT)) +
  geom_area()
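The spread/gather round trip requires listing every event name by hand. A possible alternative, not part of the original answer, is tidyr::complete, which adds the missing SMPL_TIME/EVENT combinations directly. The sketch below assumes dplyr 1.0.0 or later (for the .groups argument); the data frame smpl and its values are hypothetical, and only the column names follow the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: "direct path write" has no row at 09:00.
smpl <- tibble(
  SMPL_TIME = as.POSIXct(c("2017-04-11 09:00:00", "2017-04-11 09:00:00",
                           "2017-04-11 09:01:00", "2017-04-11 09:01:00",
                           "2017-04-11 09:01:00"), tz = "UTC"),
  EVENT     = c("CPU", "db file sequential read",
                "CPU", "db file sequential read", "direct path write"),
  SMPL_CNT  = c(4L, 5L, 9L, 4L, 1L)
)

filled <- smpl %>%
  group_by(SMPL_TIME, EVENT) %>%
  # .groups = "drop" ungroups the result so complete() sees all combinations
  summarise(smpl_sum = sum(SMPL_CNT), .groups = "drop") %>%
  # add a zero row for every SMPL_TIME/EVENT pair absent from the data
  complete(SMPL_TIME, EVENT, fill = list(smpl_sum = 0L))
```

The result can be piped into ggplot(aes(x = SMPL_TIME, y = smpl_sum, fill = EVENT)) + geom_area() as in the code above, and it keeps working when new event names appear in the data.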
Source: https://stackoverflow.com/questions/43357180/why-is-my-stacked-area-graph-in-ggplot2-empty