Python - Parsing JSON formatted text file with regex

邮差的信 submitted on 2021-02-04 08:29:27

Question


I have a text file that is formatted like JSON, except that everything is on a single line (it could be a MongoDB file). Could someone point me toward how to extract values from it with a Python regex?

The text shows up like this:

{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":

I want to get the "fileAssetId" and "filename" values. I have tried loading the line with Python's json module, but I get an error.

For the fileAssetId I tried this regex:

regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")

But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991

I'm not too sure how to get the data as it is displayed.


Answer 1:


How about using positive lookahead and lookbehind:

(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")

matches the fileAssetId and

(?<=\"filename\":\").+?(?=\")

matches the filename.

For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)

To get a list of all matches use re.findall or re.finditer instead of re.match.

re.findall(pattern, string) returns a list of matching strings.

re.finditer(pattern, string) returns an iterator of match objects.
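
As a minimal sketch of the difference between the two calls, the lookaround patterns from this answer can be run against a shortened sample. The second record and the variable names are made up for illustration and are not part of the original data.

```python
import re

# Shortened sample; the second record is hypothetical and only there to show several matches.
text = (
    '{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},'
    '{"fileAssetId":"11111111-2222-3333-4444-555555555555","filename":"Notes.txt"}'
)

# The lookbehind anchors on the key; the lookahead stops at the closing quote.
asset_id_pattern = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_pattern = re.compile(r'(?<="filename":").+?(?=")')

# findall returns the matched substrings as a list of str.
asset_ids = asset_id_pattern.findall(text)
filenames = filename_pattern.findall(text)
print(asset_ids)
print(filenames)

# finditer yields match objects, which also expose where each match starts.
spans = [(m.group(0), m.start()) for m in filename_pattern.finditer(text)]
print(spans)
```

findall is usually enough when only the values are needed; finditer is the better fit when the position of each match matters as well.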




Answer 2:


You can use Python's walk method and check each entry with re.match.

If the string you have cannot be converted to a Python dict, you can use a regex on its own:

print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))

Solution for your example:

import re

example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'

regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))

Executing this yields:

fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
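
Before falling back to a regex, it may be worth checking whether the leading part of the text is valid JSON on its own. This is a different technique from the regex above: a sketch using the standard library's json.JSONDecoder.raw_decode, which parses one value from the start of a string and reports where it stopped. The sample below is a shortened copy of the question's text, and it assumes the first object is complete, which may not hold for the real file.

```python
import json

# Shortened copy of the question's text: one complete object followed by a truncated one.
sample = (
    r'{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    r'"filename":"Reports.pdf"},"createdBy":1531,"id":3041},'
    r'{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":'
)

decoder = json.JSONDecoder()

# raw_decode parses a single JSON value starting at index 0 and ignores what follows.
# It returns the parsed value and the index at which parsing stopped.
record, end = decoder.raw_decode(sample)

print(record["d"]["fileAssetId"])
print(record["d"]["filename"])
print(sample[end:end + 12])  # start of the part that was not parsed
```

If the text does not begin with a complete JSON value, raw_decode raises json.JSONDecodeError, and the regex approach remains the fallback.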



Answer 3:


Try adding \n to the string that you are writing to the file (\n means newline).




Answer 4:


Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:

import regex  # third-party module from PyPI, not the standard-library re

json_pattern = (
    # (?(DEFINE)...) declares named sub-patterns without matching anything itself
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    # fraction is '.' followed by one or more digits, as on json.org (so 1.50 is accepted)
    r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    # the actual match starts here
    r'(?&document)'
)

json_regex = regex.compile(json_pattern)

# json_document_text is the string to check
match = json_regex.match(json_document_text)

You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). The regex turned out simpler than I expected. I have not tested it extensively, but it has worked on the several hundred files I tried. I will update this answer if I find any issues.



Source: https://stackoverflow.com/questions/47454689/python-parsing-json-formatted-text-file-with-regex
