Splitting .ttl or .nt file - Spark Scala

Posted by 你离开我真会死。 on 2019-12-11 09:02:08

Question


I'm new to Scala. I need to read a .ttl file line by line, split each line on particular delimiters, and put the extracted values into the corresponding columns of a DataFrame.

< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko .
< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .
< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> .

I want to have this output

+---------------------------+---------------------+----------------------------+
|S                          |P                    |O                           |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name      |"James Earl Carter          |
|http://website/Jimmy_Car   |http://web/country   |http://website/United_States|
|http://website/Jimmy_Car   |http://web/birthPlace|http://web/Georgia_(US)     |
+---------------------------+---------------------+----------------------------+

I tried this code

import scala.util.Try

case class T(S: Option[String], P: Option[String], O: Option[String])

val triples = sc.textFile("triples_test.ttl")
  .map(_.split(" |\\< |\\> |\\ . "))
  .map(p => T(Try(p(0)).toOption, Try(p(1)).toOption, Try(p(2)).toOption))
  .toDF()

And I got this result

+--------------------------+----------------------+------------------------+
|S                         |P                     |O                       |
+--------------------------+----------------------+------------------------+
|<http://website/Jimmy_Car |<http://web/name      |"James                  |
|<http:///website/Jimmy_Car|<http://web/country   |<http://web/country     |
|<http://website/Jimmy_Car |<http://web/birthPlace|<http://web/Georgia_(US)|
+--------------------------+----------------------+------------------------+

To remove the "<" at the beginning of each value, I added "|<" to the split pattern:

val triples = sc.textFile("triples_test.ttl")
  .map(_.split(" |\\< |\\> |\\ . |<"))
  .map(p => T(Try(p(0)).toOption, Try(p(1)).toOption, Try(p(2)).toOption))
  .toDF()

And I got this result:

+---+---------------------+---+
|S  |P                    |O  |
+---+---------------------+---+
|   |http://web/name      |   |
|   |http://web/country   |   |
|   |http://web/birthPlace|   |
+---+---------------------+---+

How can I solve this problem?


Answer 1:


In case it is not clear how to replace your split-based code with Spark's built-in regex functionality, the code below shows one way. Make sure you understand how regular expressions work before using this approach.

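import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // for toDF and $"col"; assumes a SparkSession named spark (as in spark-shell)

// Sample rows with the three fields already separated
val df = Seq(
        ("< http://website/Jimmy_Carter>", "<http://web/name>", "\"James Earl Carter, Jr.\"@ko ."),
        ("< http://website/Jimmy_Car>", "<http://web/country>", "<http://website/United_States> ."),
        ("< http://website/Jimmy_Car>", "<http://web/birthPlace>", "<http://web/Georgia_(US)> .")
    ).toDF("S", "P", "O")

// Group 1 captures the text between the leading delimiter and the trailing one
val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""
val dfA = df.withColumn("S", regexp_extract($"S", url_regex, 1))
            .withColumn("P", regexp_extract($"P", url_regex, 1))
            .withColumn("O", regexp_extract($"O", url_regex, 1))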

This will output:

+---------------------------+---------------------+----------------------------+
|S                          |P                    |O                           |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name      |James Earl Carter           |
|http://website/Jimmy_Car   |http://web/country   |http://website/United_States|
|http://website/Jimmy_Car   |http://web/birthPlace|http://web/Georgia_(US)     |
+---------------------------+---------------------+----------------------------+
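
The sample DataFrame above starts with S, P and O already in separate columns, whereas the question starts from raw lines of text. The following is only a minimal sketch of one way to bridge that gap in plain Scala, and it is not part of the original answer. The names TripleLineSketch, Line and parseLine are hypothetical, and the pattern assumes one simple "<subject> <predicate> object ." statement per line, which holds for the sample data but not for Turtle in general.

```scala
object TripleLineSketch {
  // Hypothetical pattern for one simple line of the form:
  //   <subject> <predicate> object .
  // Subject and predicate must be IRIs in angle brackets (an optional space
  // after "<" is tolerated, as in the sample data). The object is either an
  // IRI in angle brackets or any other text, such as a quoted literal.
  private val Line =
    """^\s*<\s*([^>]*)>\s+<\s*([^>]*)>\s+(?:<\s*([^>]*)>|(.*?))\s*\.\s*$""".r

  // Returns Some((s, p, o)) when the line fits the pattern, None otherwise.
  def parseLine(line: String): Option[(String, String, String)] =
    line match {
      case Line(s, p, iri, other) =>
        // Only one of the two object groups takes part in a match;
        // the other one is null.
        Some((s, p, Option(iri).getOrElse(other)))
      case _ => None
    }

  // Assumed Spark usage (needs spark.implicits._ in scope for toDF):
  //   sc.textFile("triples_test.ttl")
  //     .flatMap(l => TripleLineSketch.parseLine(l).toList)
  //     .toDF("S", "P", "O")
}
```

With this sketch the literal in the first sample line is kept whole, quotes and language tag included, because the line is never split on spaces. Lines that do not fit the pattern are dropped rather than producing empty columns.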

A short explanation of how the regex works, even though it is not the subject of the post:

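  1. (?:"|<{1}\s?) matches values that start with ", with <, or with < followed by one whitespace character
  2. (.*) captures the content in between as group 1
  3. (?:>(?:\s\.)?|,\s.*) matches values that end with >, with > ., or with a comma, a whitespace character and any remaining text (,\s.*); the last case handles the "James Earl Carter, Jr." literal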
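
To see the three parts of the pattern at work without starting Spark, here is a small plain-Scala sketch that applies the same pattern to the sample values. UrlRegexDemo and extract are hypothetical names introduced only for this illustration. Returning an empty string on no match is meant to mirror what regexp_extract does in Spark.

```scala
import scala.util.matching.Regex

object UrlRegexDemo {
  // Same pattern as url_regex in the answer.
  val urlRegex: Regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$""".r

  // Returns capture group 1, or "" when the pattern does not match.
  def extract(value: String): String =
    urlRegex.findFirstMatchIn(value).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "< http://website/Jimmy_Carter>", // starts with "< ", ends with ">"
      "<http://web/Georgia_(US)> .",    // starts with "<", ends with "> ."
      "\"James Earl Carter, Jr.\"@ko ." // starts with a quote, cut at ", "
    )
    samples.foreach(s => println(s"$s -> ${extract(s)}"))
  }
}
```

One limitation worth noting: for quoted literals the pattern only matches when the text contains a comma followed by whitespace. A literal such as "Plains"@en . has no such comma, so extract returns an empty string for it.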



Answer 2:


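You can't read a Turtle file like this: Turtle allows prefixes, abbreviations and statements that span several lines, so splitting line by line is not reliable. Plus, regex is a very naive way of reading N-Triples. Don't reinvent the wheel; use https://github.com/banana-rdf/banana-rdf instead.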



Source: https://stackoverflow.com/questions/55121544/splitting-ttl-or-nt-file-spark-scala
