Question
I'm new to Scala and need to read a TTL file line by line, split each line on particular delimiters, and put the extracted values into the respective columns of a DataFrame. The file looks like this:
< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko .
< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .
< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> .
I want to have this output
+-------------------------+---------------------+-----------------------+
|S                        |P                    |O                      |
+-------------------------+---------------------+-----------------------+
|http://website/Jimmy_Car |http://web/name      |"James Earl Carter     |
|http:///website/Jimmy_Car|http://web/country   |http://web/country     |
|http://website/Jimmy_Car |http://web/birthPlace|http://web/Georgia_(US)|
+-------------------------+---------------------+-----------------------+
I tried this code
import scala.util.Try
case class T(S: Option[String], P: Option[String], O: Option[String])
val triples = sc.textFile("triples_test.ttl")
  .map(_.split(" |\\< |\\> |\\ . "))
  .map(p => T(Try(p(0)).toOption, Try(p(1)).toOption, Try(p(2)).toOption))
  .toDF()
And I got this result
+--------------------------+----------------------+------------------------+
|S                         |P                     |O                       |
+--------------------------+----------------------+------------------------+
|<http://website/Jimmy_Car |<http://web/name      |"James                  |
|<http:///website/Jimmy_Car|<http://web/country   |<http://web/country     |
|<http://website/Jimmy_Car |<http://web/birthPlace|<http://web/Georgia_(US)|
+--------------------------+----------------------+------------------------+
To remove the leading "<" from each term, I added "|<" to the split pattern:
val triples = sc.textFile("triples_test.ttl")
  .map(_.split(" |\\< |\\> |\\ . |<"))
  .map(p => T(Try(p(0)).toOption, Try(p(1)).toOption, Try(p(2)).toOption))
  .toDF()
And I got this result:
+---+---------------------+---+
|S  |P                    |O  |
+---+---------------------+---+
|   |http://web/name      |   |
|   |http://web/country   |   |
|   |http://web/birthPlace|   |
+---+---------------------+---+
How can I solve this problem?
Answer 1:
In case it is not clear how to replace your code with Spark's built-in regex functionality, here is an example. Make sure you understand how regular expressions work before using this approach.
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  ("< http://website/Jimmy_Carter>", "<http://web/name>", "\"James Earl Carter, Jr.\"@ko ."),
  ("< http://website/Jimmy_Car>", "<http://web/country>", "<http://website/United_States> ."),
  ("< http://website/Jimmy_Car>", "<http://web/birthPlace>", "<http://web/Georgia_(US)> .")
).toDF("S", "P", "O")

// Capture group 1 holds the term without its delimiters.
val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""
val dfA = df.withColumn("S", regexp_extract($"S", url_regex, 1))
  .withColumn("P", regexp_extract($"P", url_regex, 1))
  .withColumn("O", regexp_extract($"O", url_regex, 1))
This will output:
+---------------------------+---------------------+----------------------------+
|S                          |P                    |O                           |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name      |James Earl Carter           |
|http://website/Jimmy_Car   |http://web/country   |http://website/United_States|
|http://website/Jimmy_Car   |http://web/birthPlace|http://web/Georgia_(US)     |
+---------------------------+---------------------+----------------------------+
A brief explanation of how the regex works, even though it is not the subject of the post:
(?:"|<{1}\s?) matches values that start with ", with <, or with < followed by a whitespace character.
(.*) captures the content in between as the 1st group.
(?:>(?:\s\.)?|,\s.*) matches values that end with > or with > . or with ,\s.* (a comma, a whitespace character, then anything); the last alternative covers the James Earl Carter case.
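To try the pattern without starting Spark, here is a minimal sketch that applies the same regex with scala.util.matching.Regex. The object name TripleRegexDemo and the helper extract are made up for this illustration; returning an empty string when nothing matches is meant to mirror what regexp_extract does.

```scala
import scala.util.matching.Regex

object TripleRegexDemo {
  // Same pattern as url_regex in the answer, compiled as a plain Scala regex.
  val urlRegex: Regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$""".r

  // Hypothetical helper: returns capture group 1, or "" when the pattern does not match.
  def extract(term: String): String =
    urlRegex.findFirstMatchIn(term).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "< http://website/Jimmy_Carter>",   // "<" followed by a space
      "<http://web/name>",                // plain "<...>"
      "<http://web/Georgia_(US)> .",      // "<...>" followed by " ."
      "\"James Earl Carter, Jr.\"@ko ."   // quoted literal containing ", "
    )
    samples.foreach(s => println(s"$s -> ${extract(s)}"))
  }
}
```

One thing this makes easy to check: a quoted literal that contains no comma, such as "Georgia"@en ., matches neither closing alternative, so the helper returns an empty string for it. The pattern fits the sample data rather than literals in general.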
Answer 2:
You can't read a Turtle file like this. Besides, regex is a very naive way of reading N-Triples. Don't reinvent the wheel: use https://github.com/banana-rdf/banana-rdf instead.
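To illustrate why line-by-line parsing is fragile, here is a small sketch in plain Scala. The object LineParserDemo, the pattern TripleLine and the function parseLine are hypothetical names invented for this example, not part of Spark or banana-rdf. The parser handles the one-triple-per-line shape from the question, but returns nothing for a valid Turtle document that uses a prefix and the ; shorthand.

```scala
object LineParserDemo {
  // Hypothetical line-level pattern for "<subject> <predicate> object ." on a single line.
  private val TripleLine =
    """^\s*<\s*([^>]*)>\s+<\s*([^>]*)>\s+(.+?)\s*\.\s*$""".r

  // Strips the angle brackets from an IRI object; literals are kept verbatim.
  private def cleanObject(o: String): String =
    if (o.startsWith("<") && o.endsWith(">")) o.substring(1, o.length - 1).trim
    else o

  // Returns Some((S, P, O)) for a line in the simple shape above, None otherwise.
  def parseLine(line: String): Option[(String, String, String)] = line match {
    case TripleLine(s, p, o) => Some((s.trim, p.trim, cleanObject(o)))
    case _                   => None
  }

  // One triple per line, as in the question's sample.
  val ntLines: Seq[String] = Seq(
    "< http://website/Jimmy_Carter> <http://web/name> \"James Earl Carter, Jr.\"@ko .",
    "< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .",
    "< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> ."
  )

  // Valid Turtle for two of the same triples, written with a prefix and ';'.
  val turtleLines: Seq[String] = Seq(
    "@prefix web: <http://web/> .",
    "<http://website/Jimmy_Car> web:country <http://website/United_States> ;",
    "    web:birthPlace <http://web/Georgia_(US)> ."
  )

  def main(args: Array[String]): Unit = {
    // All three single-line triples are parsed.
    println(ntLines.flatMap(l => parseLine(l).toList).size)
    // None of the Turtle lines match, although they encode two triples.
    println(turtleLines.flatMap(l => parseLine(l).toList).size)
  }
}
```

For input that really has one triple per line, a function like parseLine could be mapped over the lines returned by sc.textFile before calling toDF. For general Turtle, a parser library such as the one linked above is the safer route.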
Source: https://stackoverflow.com/questions/55121544/splitting-ttl-or-nt-file-spark-scala