Question
I would like to create a scatter plot in ggplot2 which displays male test_scores on the x-axis and female test_scores on the y-axis using the dataset below. I can easily create a geom_line plot splitting male and female and putting the date ("dts") on the x-axis.
library(tidyverse)
#create data
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
"2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
test <- round(runif(10,.5,1),2)
semester <- data.frame(dts = as.Date(dts), sex = sex, test_scores = test)
#show the geom_line plot
ggplot(semester, aes(x = dts, y = test_scores, color = sex)) + geom_line()
It seems with only one time series, ggplot2 does better with the data in wide format than long format. For instance, I could easily create two columns, "male_scores" and "female_scores" and plot those against each other, but I would like to keep my data tidy and in long format.
Cheers and thank you.
Answer 1:
You've over-tidied. Tidying data isn't just the mechanism of making it as long as possible; it's making it as wide as necessary.
For example, if you had location as X and Y for animal sightings, you wouldn't have two rows, one with a "label" column containing "X" and the X coordinate in a "value" column, and another with "Y" in the "label" column and the Y coordinate in the "value" column, unless you really were storing the data in a key-value store, but that's another story...
Widen your data and put the test scores for male and female into test_score_male and test_score_female; then they are the x and y aesthetics for your scatter plot.
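The widening step described above could be sketched as follows. This is only one possible approach, not necessarily what the answerer had in mind: it assumes tidyr's pivot_wider(), a fixed seed for reproducibility, and that the two scores recorded per sex on 2011-01-02 should be averaged (that choice is an assumption, since the thread does not say how to handle duplicates). The name `wide` is introduced here for illustration.

```r
library(tidyverse)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Recreate the long-format data from the question
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
         "2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
semester <- data.frame(dts = as.Date(dts), sex = sex,
                       test_scores = round(runif(10, .5, 1), 2))

# One row per date, one score column per sex
wide <- semester %>%
  mutate(sex = if_else(sex == "M", "male", "female")) %>%
  pivot_wider(
    names_from   = sex,
    values_from  = test_scores,
    names_prefix = "test_score_",
    values_fn    = mean  # assumption: average duplicate date/sex scores
  )

# Male scores on the x-axis, female scores on the y-axis
ggplot(wide, aes(x = test_score_male, y = test_score_female)) +
  geom_point()
```

The long `semester` data frame stays untouched, so it can remain the tidy source of truth while `wide` is used only for this plot.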
Answer 2:
The problem with keeping the data long is that you will not have a corresponding X value for a given Y value. The reason for this is the structure of the dataset:
dts sex test_scores
1 2011-01-02 M 0.67
2 2011-01-02 F 0.78
3 2011-01-03 M 0.58
4 2011-01-04 F 0.58
5 2011-01-05 M 0.51
If you were to use the code:
# Fails: each subset is shorter than the data, so the aesthetics do not line up with its rows
ggplot(semester, aes(x = semester$test_scores[semester$sex=='M'],
y = semester$test_scores[semester$sex=='F'],
color = sex)) + geom_point()
ggplot2 will throw an error. The main reason is that, once you subset the male scores, there are no corresponding female scores for that subset. You need to first collapse the data down to the date level. As you correctly point out, the data is no longer in long format at that point.
For this one-off plot, I would recommend creating a wide dataset. There are multiple ways of doing that, but that is a different topic.
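To make the mismatch concrete, here is a rough sketch that first checks how the male and female subsets line up, then collapses each to the date level and pairs them by date. It uses base R's aggregate() and merge() as one of the several possible ways; averaging repeated dates and the names `male`, `female`, and `paired` are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Recreate the long-format data from the question
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
         "2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
semester <- data.frame(dts = as.Date(dts), sex = sex,
                       test_scores = round(runif(10, .5, 1), 2))

male   <- semester[semester$sex == "M", ]
female <- semester[semester$sex == "F", ]

# Each subset has 5 rows while the full data has 10, and the dates
# in the two subsets are not in the same order
c(nrow(male), nrow(female), nrow(semester))
identical(male$dts, female$dts)

# Collapse each subset to one (mean) score per date, then pair by date
male_by_date   <- aggregate(test_scores ~ dts, data = male, FUN = mean)
female_by_date <- aggregate(test_scores ~ dts, data = female, FUN = mean)
paired <- merge(male_by_date, female_by_date, by = "dts",
                suffixes = c("_male", "_female"))

ggplot(paired, aes(x = test_scores_male, y = test_scores_female)) +
  geom_point()
```

After the merge, every row of `paired` holds a male and a female score for the same date, which is exactly the X/Y correspondence the long format lacks.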
Source: https://stackoverflow.com/questions/41618041/scatter-plot-in-ggplot-one-numeric-variable-across-two-groups