Introduction
Hi! I'm trying to extract JSon from a 300K line text file that has a combination of Text output and JSon format from HTTP Result. The big size in lines makes it unable to retain the JSon manually.
Problematic
Don't have much choice, i probably need to fix it manually using a command-line. Here's how it's looks like inside the file:
[2K 100.00% - C: 164148 / 164149 - S: 263 - F: 3686 - dhcp-140-247-148-215.fas.harvard.edu:443 - id3.sshws.me
[2K 100.00% - C: 164149 / 164149 - S: 263 - F: 3686 - public-1300503051.cos.ap-shanghai.myqcloud.com:443 - id3.sshws.me
[2K
[
{
"Request": {
"ProxyHost": "pro.ant.design",
"ProxyPort": 443,
"Bug": "pro.ant.design",
"Method": "HEAD",
"Target": "id3.sshws.me",
"Payload": "GET wss://pro.ant.design/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
},
"ResponseLine": [
"HTTP/1.1 101 Switching Protocol",
"Server: cloudflare"
]
},
{
"Request": {
"ProxyHost": "industrialtech.ft.com",
"ProxyPort": 443,
"Bug": "industrialtech.ft.com",
"Method": "HEAD",
"Target": "id3.sshws.me",
"Payload": "GET wss://industrialtech.ft.com/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
},
"ResponseLine": [
"HTTP/1.1 101 Switching Protocol",
"Server: cloudflare"
]
}
]
Several problem to this if using RegEx is:
It has multiple JSon object
The Text string that doesn't part of JSon has [ and :
I realize the problem when trying to use sed regex.
sed '/^[/,/^]/!d'
You can remove all lines that start with [ and any non-whitespace char:
sed '/^\[[^[:space:]]/d' file > newfile
Details:
^ - start of a line
\[ - [ char
[^[:space:]] - any non-whitespace chars.
Alternate way to this is; to get advantage for special char. If someone wanted to remove the progress bar from the output and extract only appropiate output:
Use nano <output_file>
You will see that there's new line unicode got readed as ^M^[ in the first text. I assume it's the same as [crlf]
Use sed -e "/\^M^[/d" remove lines that contains specific unicode.
Use \ to escape ^ as RegEx.
Make sure to always find the pattern readed from the terminal and not inside a text editor app, as some of them couldn't read Unicodes.
Related
I have the following JSON structure, generated by Zabbix Discovery key, with the following data:
[{
"{#SERVICE.NAME}": ".WindowsService1",
"{#SERVICE.DISPLAYNAME}": ".WindowsService1 - Testing",
"{#SERVICE.DESCRIPTION}": "Application Test 1 - Master",
"{#SERVICE.STATE}": 0,
"{#SERVICE.STATENAME}": "running",
"{#SERVICE.PATH}": "E:\\App\\Test\\bin\\testingApp.exe",
"{#SERVICE.USER}": "LocalSystem",
"{#SERVICE.STARTUPTRIGGER}": 0,
"{#SERVICE.STARTUP}": 1,
"{#SERVICE.STARTUPNAME}": "automatic delayed"
},
{
"{#SERVICE.NAME}": ".WindowsService2",
"{#SERVICE.DISPLAYNAME}": ".WindowsService2 - Testing",
"{#SERVICE.DESCRIPTION}": "Application Test 2 - Slave",
"{#SERVICE.STATE}": 0,
"{#SERVICE.STATENAME}": "running",
"{#SERVICE.PATH}": "E:\\App\\Test\\bin\\testingApp.exe",
"{#SERVICE.USER}": "LocalSystem",
"{#SERVICE.STARTUPTRIGGER}": 0,
"{#SERVICE.STARTUP}": 1,
"{#SERVICE.STARTUPNAME}": "automatic delayed"
}]
So, what i want to do is: Use JSONPath to get ONLY the object that {#SERVICE.NAME} == WindowsService1...
The problem is, i am trying to create the JSONPath but it's giving me a couple of errors.
Here's what i tried, and what i discovered so far:
JSONPath:
$.[?(#.{#SERVICE.NAME} == '.WindowsService1')]
Error output:
jsonPath: Unexpected token '{': _$_v.{#SERVICE.NAME} ==
'.WindowsService1'
I also tried doing the following JSONPath, to match Regular Expression:
$.[?(#.{#SERVICE.NAME} =~ '^(.WindowsService1$)')]
It gave me the same error - So the problem is not after the == or =~ ...
What i discovered is, if i REMOVE the curly braces {}, the hashtag # and replace the dot . in "Service name" with _ (Underline), in JSONPath and in JSON data, it works, like this:
Data without # {} . :
[{
"SERVICE_NAME": ".WindowsService1",
[...]
JSONPath following new data structure:
$.[?(#.SERVICE_NAME == '.WindowsService1')]
But the real problem is, i need to maintain the original strucutre, with the curly braces, dots, and hashtags...
How can i escape those and stop seeing this error?
Thank you...
$.[?(#['{#SERVICE.NAME}'] == '.WindowsService1')]
UPDATED CODE: It is working but now the problem is that the code is attaching same random_value to every Path.
Following is my code with a sample chunk of text. I want to read Path and it's value then add (/some unique random alphabet and number combination) at the end of every Path value without changing the already existed value. For example I want the Path to be like
"Path" : "already existed value/1A" e.t.c something like that.
I am unable to make the exact regex pattern of replacing it.
Any help would be appreciated.
It can be done by json parse but the requirement of the task is to do it via REGEX.
from io import StringIO
import re
import string
import random
reader = StringIO("""{
"Bounds": [
{
"HasClip": true,
"Lang": "no",
"Page": 0,
"Path": "//Document/Sect[2]/Aside/P",
"Text": "Potsdam, den 9. Juni 2021 ",
"TextSize": 12.0
}
],
},
{
"Bounds": [
{
"HasClip": true,
"Lang": "de",
"Page": 0,
"Path": "//Document/Sect[3]/P[4]",
"Text": "this is some text ",
"TextSize": 9.0,
}
],
}""")
def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))
text = reader.read()
random_value = id_generator()
pattern = r'"Path": "(.*?)"'
replacement = '"Path": "\\1/'+random_value+'"'
text = re.sub(pattern, replacement, text)
#This is working but it is only attaching one same random_value on every Path
print(text)
Use group 1 in the replacement:
replacement = '"Path": "\\1/1A"'
See live demo.
The replacement regex \1 puts back what was captured in group 1 of the match via (.*?).
Since you already have a json structure, maybe it would help to use the json module to parse it.
import json
myDict = json.loads("your json string / variable here")
# now myDict is a dictionary that you can use to loop/read/edit/modify and you can then export myDict as json.
I'm trying to open a bunch of JSON files using read_json In order to get a Dataframe as follow
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but I'm getting the following error
_meta = pd.DataFrame(
columns=list(["id", "owner", "pet_id"]])
).astype({
"id":int,
"owner":"object",
"pet_id": "object"
})
ddf = dd.read_json(f"mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files looks like
[
{
"id": 1,
"owner": "Charlie",
"pet_id": "pet_1"
},
{
"id": 2,
"owner": "Charlie",
"pet_id": "pet_2"
}
]
As far I understand the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it the meta= argument
PD:
I also tried doing it in the following way
{
"id": [1, 2],
"owner": ["Charlie", "Charlie"],
"pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
I have a NiFi flow that takes JSON files and evaluates a JSON Path argument against them. It work perfectly except when dealing with records that contain Korean text. The Jayway JSONPath evaluator does not seem to recognize the escape "\" in the headline field before the double quote character. Here is an example (newlines added to help with formatting):
{"data": {"body": "[이데일리 김관용 기자] 우리 군이 2018 남북정상회담을 앞두고 남
북간 군사적 긴장\r\n완화와 평화로운 회담 분위기 조성을 위해 23일 0시를 기해 군사분계선
(MDL)\r\n일대에서의 대북확성기 방송을 중단했다.\r\n\r\n국방부는 이날 남북정상회담 계기
대북확성기 방송 중단 관련 내용을 발표하며\r\n“이번 조치가 남북간 상호 비방과 선전활동을
중단하고 ‘평화, 새로운 시작’을\r\n만들어 나가는 성과로 이어지기를 기대한다”고 밝혔
다.\r\n\r\n전방부대 우리 군 장병이 대북확성기 방송을 위한 장비를 점검하고 있다.\r\n[사
진=국방부공동취재단]\r\n\r\n\r\n\r\n▶ 당신의 생활 속 언제 어디서나 이데일리 \r\n▶
스마트 경제종합방송 ‘이데일리 TV’ | 모바일 투자정보 ‘투자플러스’\r\n▶ 실시간 뉴스와
속보 ‘모바일 뉴스 앱’ | 모바일 주식 매매 ‘MP트래블러Ⅱ’\r\n▶ 전문가를 위한 국내 최상의
금융정보단말기 ‘이데일리 마켓포인트 3.0’ | ‘이데일리 본드웹 2.0’\r\n▶ 증권전문가방송
‘이데일리 ON’ 1666-2200 | ‘ON스탁론’ 1599-2203\n<ⓒ종합 경제정보 미디어 이데일리 -
무단전재 & 재배포 금지> \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 대북확성기 방송, 23일 0시부터 중단\"",
"id": "EDYM00251_1804232beO/5WAUgdlYbHS853hYOGrIL+Tj7oUjwSYwT"}}
If this object is in my file, the JSON path evaluates blanks for all the path arguments. Is there a way to force the Jayway engine to recognize the "\"? It appears to function correctly in other languages.
I was able to resolve this after understanding the difference between definite and indefinite paths. The Jayway github README points out the following will make a path indefinite and return a list:
When evaluating a path you need to understand the concept of when a
path is definite. A path is indefinite if it contains:
.. - a deep scan operator
?(<expression>) - an expression
[<number>, <number> (, <number>)] - multiple array indexes Indefinite paths
always returns a list (as represented by current JsonProvider).
My JSON looked like the following:
{
"Version":"14",
"Items":[
{"data": {"body": "[이데일리 ... \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 ... 중단\"",
"id": "1"}
},
{"data": {"body": "[이데일리 ... \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 ... 중단\"",
"id": "2"}
...
}
]
}
This JSON path selector I was using ($.data.headline) did not grab the values as I expected. It instead returned null values.
Changing it to $.Items[*].data.headline or $..data.headline returns a list of each headline.
I keep getting that there is an error uploading/importing my JSON file into Firebase. I initially had an excel spreadsheet that I saved as a CSV file, then I used a CSV to JSON converter.
I validated the JSON file (which have the .json extension) with a couple of online tools.
Though, I'm still getting an error.
Here is an example of my JSON:
{
"Rk": 1,
"Tm": "SEA",
"H/A": "H",
"DOW": "Sun",
"Opp": "CLE",
"QB": "Russell Wilson",
"Grade": "BLUE",
"Def mu pts": 4,
"Inj status": 0,
"Notes": "Got to wonder if not having a proven power RB under center will negatively impact Wilson's production.",
"TFS $50K": "$8,300",
"Init sal": "$8,300",
"Var": "$0",
"WC": 0
}
The issue is your key's..
Firebase keys must be:
UTF-8 encoded, cannot contain . $ # [ ] / or ASCII control characters
0-31 or 127
your $50k key and the H/A are the issues.