How to show corpus text in R tm package?

一个人的身影 2020-12-14 23:49

I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain text corpus in the R tm package?

I've loaded a corpu

8 answers
  • 2020-12-15 00:00

    I can confirm that as of tm 0.6-1, inspect no longer pretty-prints the text. You can pair it with the qdap package, which I maintain, to convert the corpus easily to a data.frame as follows:

    library(qdap)
    data("crude", package = "tm")  # sample corpus shipped with tm
    as.data.frame(crude)
    

    To make it more like the old inspect behavior you can use:

    as.data.frame(crude) %>%
        with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))
    

    This looks like:

    Diamond Shamrock Corp said that effective today it had cut its
    contract prices for crude oil by 1.50 dlrs a barrel. The reduction
    brings its posted price for West Texas Intermediate to 16.00 dlrs a
    barrel, the copany said. "The price reduction today was made in the
    light of falling oil product prices and a weak crude oil market," a
    company spokeswoman said. Diamond is the latest in a line of U.S. oil
    companies that have cut its contract, or posted, prices over the last
    two days citing weak oil markets. Reuter
    
    
    OPEC may be forced to meet before a scheduled June session to
    readdress its production cutting agreement if the organization wants
    to halt the current slide in oil prices, oil industry analysts said.
    "The movement to higher oil prices was never to be as easy as OPEC
    thought. They may need an emergency meeting to sort out the
    problems," said Daniel Yergin, director of Cambridge Energy Research
    Associates, CERA. Analysts and oil industry sources said the problem
    OPEC faces is excess oil supply in world oil markets. "OPEC's problem
    is not a price problem but a production issue and must be addressed
    in that way," said Paul Mlotok, oil analyst with Salomon Brothers
    Inc. He said the market's earlier optimism about OPE
    .
    .
    .
    
  • 2020-12-15 00:02

    Here is a simple and direct way to display the text of a corpus:

    strwrap(corpus[[1]])
    

    For the crude data this will output:

    [1] "Diamond Shamrock Corp said that effective today it had cut its contract"      
    [2] "prices for crude oil by 1.50 dlrs a barrel.  The reduction brings its posted" 
    [3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."   
    [4] "\"The price reduction today was made in the light of falling oil product"     
    [5] "prices and a weak crude oil market,\" a company spokeswoman said.  Diamond is"
    [6] "the latest in a line of U.S. oil companies that have cut its contract, or"    
    [7] "posted, prices over the last two days citing weak oil markets.  Reuter"
    
  • 2020-12-15 00:03

    You can try converting your corpus text into a dataframe, and accessing the required text from the dataframe itself. I have used the built-in sample data "crude" (from the tm package) as an example.

    library(tm)
    data("crude")
    dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)
    
    dataframe[1,]
    [1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
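
The \n sequences in the printed row above are embedded line breaks. Printing a character vector shows them escaped, whereas writeLines() renders them and strwrap() reflows the text. Here is a minimal base-R sketch of the difference, using a made-up string txt as a stand-in for real document content (an assumption for illustration, not taken from crude):

```r
# Hypothetical stand-in for one document's content: a single string
# with embedded newlines and indentation.
txt <- "First line of the document\nsecond line.\n    An indented third line."

# print() shows the string on one line with the newlines escaped.
print(txt)

# writeLines() writes the string to the console and renders the newlines.
writeLines(txt)

# strwrap() treats single newlines as whitespace and re-wraps the text,
# returning a character vector with one element per output line.
wrapped <- strwrap(txt, width = 40)
writeLines(wrapped)
```

Which one to use depends on whether the original line breaks should be preserved (writeLines) or the text should be re-wrapped to a chosen width (strwrap).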
    
  • 2020-12-15 00:10

    This works for me to print the content text with the latest version of tm:

    corpus[[1]]$content
    

    Note: This is more or less what Ricky suggested in the previous comment. Sorry, I wanted to write a comment, but my rep is only 25 (a minimum of 50 rep is needed to comment).

  • 2020-12-15 00:11
    > inspect(crude[1])
    <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
    
    $`reut-00001.xml`
    <<PlainTextDocument (metadata: 15)>>
    Diamond Shamrock Corp said that
    effective today it had cut its contract prices for crude oil by
    1.50 dlrs a barrel.
        The reduction brings its posted price for West Texas
    Intermediate to 16.00 dlrs a barrel, the copany said.
        "The price reduction today was made in the light of falling
    oil product prices and a weak crude oil market," a company
    spokeswoman said.
        Diamond is the latest in a line of U.S. oil companies that
    have cut its contract, or posted, prices over the last two days
    citing weak oil markets.
     Reuter
    
  • 2020-12-15 00:12

    From the tm Vignette, this works:

    writeLines(as.character(doc.corpus[[8]]))

    Where '8' is whatever element number you wish.
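
To print several documents rather than one, the same call can be repeated in a loop. This is a minimal sketch, assuming the tm package is installed. It uses tm's bundled crude corpus instead of the doc.corpus object from the answer, and the separator line is purely illustrative:

```r
library(tm)

# Load the sample corpus of Reuters articles shipped with tm.
data("crude")

# Print the first two documents, each preceded by its document id.
for (i in 1:2) {
  doc <- crude[[i]]
  cat("--- document", meta(doc, "id"), "---\n")
  writeLines(as.character(doc))
  cat("\n")
}
```

Changing the range 1:2 selects other documents; as.character() extracts the text of each document before writeLines() prints it.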
