Accessing google docs revision history through the API using R?

痴心易碎 提交于 2019-12-22 05:04:43

问题


I wish to download and analyse the revision history of one of my google docs using R, finding out statistics like how many edits did whom.

I see that there are already some ways for accessing google docs using R.

Does anyone know upfront (before I go ahead and try to hack my way), if or how it might be (reasonably easily) done?

Thanks.


回答1:


The Google Documents List API gives you access to a list of revisions, the contents of each revision, a way to make sure that a revision is stored for a given update, etc.

The list of revisions may give you what you're looking for, but you'll need to be more specific on what statistics you want. Note that it seems that history data on who made what change can be a bit incomplete - see API problems with revision history - Google Documents List API | Google Groups




回答2:


The googledrive package includes some low-level API functions that can be used for this. For example, here's how we can get a list of revisions for one Google doc:

library(googledrive)
library(tidyverse)

# replace this with the ID of your google doc
# this doc is private, it wont work for you
fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"

# Get the name of the file and some other metadata
file <- build_request(
  path = "drive/v3/files/{fileId}",
  method = "GET",
  params = list(
    fileId = fileId,
    fields = "*"
  ),
  token = drive_token()
)
file_ret <-  process_response(make_request(file))

# Now for this doc, query the Drive API to get get URLs and other meta-data for all the revisions available to us

req2 <- build_request(
  path = "drive/v2/files/{fileId}/revisions",
  method = "GET",
  params = list(
    fileId = fileId
  ),
  token = drive_token()
)
revs2 <-  process_response(make_request(req2))

# See 
# https://developers.google.com/drive/api/v2/reference/revisions#resource
# for an explanation of each variable that we have here

# tidy revisions into a dataframe
revs2_df <-
  map_df(
    revs2$items,
    `[`,
    c(
      "kind",
      "etag" ,
      "id",
      "selfLink"   ,
      "mimeType"     ,
      "modifiedDate",
      "published"   ,
      "lastModifyingUserName"
    )
  )
# get exportLinks URLs out of its nest
revs2_export_url <- map_df(revs2$items, "exportLinks")
# bind together
revs2_df_bind <- bind_cols(revs2_df, revs2_export_url)

The result includes, for each revision, the date, time, name of the user that did the revision, and URLs to export that revision into a download file:

# A tibble: 140 x 16
   kind   etag  id    selfLink mimeType modifiedDate published lastModifyingUs… `application/rt…
   <chr>  <chr> <chr> <chr>    <chr>    <chr>        <lgl>     <chr>            <chr>           
 1 drive… "\"H… 28367 https:/… applica… 2017-09-12T… FALSE     Gayoung Park     https://docs.go…
 2 drive… "\"H… 28487 https:/… applica… 2017-09-12T… FALSE     Gayoung Park     https://docs.go…
 3 drive… "\"H… 28862 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 4 drive… "\"H… 29221 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 5 drive… "\"H… 29258 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 6 drive… "\"H… 29434 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 7 drive… "\"H… 29454 https:/… applica… 2017-09-18T… FALSE     Gayoung Park     https://docs.go…
 8 drive… "\"H… 29603 https:/… applica… 2017-09-18T… FALSE     Gayoung Park     https://docs.go…
 9 drive… "\"H… 30108 https:/… applica… 2017-09-18T… FALSE     Gayoung Park     https://docs.go…
10 drive… "\"H… 30115 https:/… applica… 2017-09-21T… FALSE     Gayoung Park     https://docs.go…
# ... with 130 more rows, and 7 more variables: `application/vnd.oasis.opendocument.text` <chr>,
#   `text/html` <chr>, `application/pdf` <chr>, `application/epub+zip` <chr>,
#   `application/zip` <chr>,
#   `application/vnd.openxmlformats-officedocument.wordprocessingml.document` <chr>,
#   `text/plain` <chr>

We can then loop over the export URLs to download all the revisions, and compare the size or word count or whatever, and eventually get some plots like this:

Full code for those plots is here: https://gist.github.com/benmarwick/1feaa2b2f0d7bc5f7e97903b8ff92aed

Be aware that there are some severe limitations to the Google Drive revision history that is available via the API. For example,

  • When many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured.
  • When many edits happen in a short period of time, these are combined by Google into one revision, and we cannot see them individually. We do not have good temporal resolution.
  • Older edits are deleted by Google to save space. We do not know what their rules are for this.


来源:https://stackoverflow.com/questions/9522334/accessing-google-docs-revision-history-through-the-api-using-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!