Easiest way to import a simple csv file to a graph with OrientDB ETL

纵饮孤独 提交于 2019-12-10 17:47:40

问题


I would like to import a very simple directed graph file in csv to OrientDB. Concretely, the file is the roadNet-PA dataset from the SNAP collection https://snap.stanford.edu/data/roadNet-PA.html. The first lines of the file are as follows:

# Directed graph (each unordered pair of nodes is saved once)
# Pennsylvania road network
# Nodes: 1088092 Edges: 3083796
# FromNodeId    ToNodeId
0       1
0       6309
0       6353
1       0
6353    0
6353    6354

There is only one type of vertex (a road intersection) and edges have no information (I suppose OrientDB lightweight edges are the best option for this). Note also that vertices are spaced with tabs.

I've tried to create a simple etl to import the file with no success. Here is the etl:

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roadNet-PA.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "   ", "skipFrom": 1, "skipTo": 4 } },
    { "vertex": { "class": "Intersection" } },
    { "edge": { "class": "Road" } }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 

The etl works but it does not import the file as I expect. I suppose the problem is in the transformers. My idea is to read the csv line by line and create and edge connecting both vertices, but I'm not sure how to express this in an etl file. Any ideas?


回答1:


Try this:

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roadNet-PA.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "\t", "skipFrom": 1, "skipTo": 4,
               "columnsOnFirstLine": false, 
               "columns":["id", "to"] } },
    { "vertex": { "class": "Intersection" } },
    { "merge": { "joinFieldName":"id", "lookup":"Intersection.id" } },
    { "edge": {
       "class": "Road",
       "joinFieldName": "to",
       "lookup": "Intersection.id",
       "unresolvedLinkAction": "CREATE"
      }
    },
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "wal": false,
       "batchCommit": 1000,
       "tx": true,
       "txUseLog": false,
       "useLightweightEdges" : true,
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 

To speedup loading I suggest you to shutdown the server, and import the ETL by using "plocal:" instead of "remote:". Example replacing the existent with:

       "dbURL": "plocal:/orientdb/databases/roads",



回答2:


It finally worked. I've moved the merge before vertex line as suggested by Luca. I've also changed the 'id' field to 'from' to avoid the error "property key is reserved for all elements id". Here is the snippet:

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roads.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "\t",
               "columnsOnFirstLine": false, 
               "columns":["from", "to"] } },
    { "merge": { "joinFieldName":"from", "lookup":"Intersection.from" } },
    { "vertex": { "class": "Intersection" } },
    { "edge": {
       "class": "Road",
       "joinFieldName": "to",
       "lookup": "Intersection.from",
       "unresolvedLinkAction": "CREATE"
      }
    },
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "wal": false,
       "batchCommit": 1000,
       "tx": true,
       "txUseLog": false,
       "useLightweightEdges" : true,
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["from:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 


来源:https://stackoverflow.com/questions/30967506/easiest-way-to-import-a-simple-csv-file-to-a-graph-with-orientdb-etl

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!