I have a multiline string variable containing several lines of system log, like the one below, and I would like to extract only the JSON parts:
System 123456
Logs start 2021-07-03 12:00:00
<event> {log_in_json}
<event> {log_in_json}
I am using find to search the string variable, but this only gets me the first occurrence. Could anyone advise?
start = var.find("<event>")
end = var.find("}}")
extracted_line = var[start:end+len("}}")]
json_str = extracted_line.lstrip("<event>")
print(json_str)
Using the optional second argument to the find method, we can set the starting
point for the search. So, second and following times around, we'll start where
we previously found the last match (end), until the method returns -1:
var = '''
System 123456
Logs start 2021-07-03 12:00:00
<event> {log_in_json}
<event> {log_in_json2}
'''
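start = var.find('<event>')
while start != -1:  # find returns -1 when there is no further match; 0 is a valid position
    end = var.find("}", start)
    extracted_line = var[start:end + len("}")]
    # Slice the marker off: lstrip('<event> ') strips a set of characters, not a prefix
    json_str = extracted_line[len('<event>'):].strip()
    print(json_str)
    start = var.find('<event>', end)
# {log_in_json}
# {log_in_json2}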
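If every event sits on its own line, as in the sample, a line-based variant avoids tracking offsets altogether. The following is only a sketch under that assumption: `extract_events` is a hypothetical helper name, and the JSON payloads in `sample` are made up for illustration, since the question only shows placeholders.

```python
import json


def extract_events(text, marker="<event>"):
    """Return the parsed JSON payload of every line that starts with marker."""
    payloads = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(marker):
            # Slice the marker off by length, then parse the rest as JSON
            payloads.append(json.loads(line[len(marker):].strip()))
    return payloads


# Hypothetical log; the payloads are invented for this example
sample = '''
System 123456
Logs start 2021-07-03 12:00:00
<event> {"user": "alice", "action": "login"}
<event> {"user": "bob", "action": "logout"}
'''

events = extract_events(sample)
print(events)
```

Parsing with json.loads also copes with nested objects, where searching for the first "}" would cut the payload short.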
I'm using the following code to check the coherence value. The code below works well when I set the coherence type to "u_mass", but if I want to compute "c_v", an IndexError occurs.
Previous text processing:
# Remove Stopwords, Form Bigrams, Trigrams and Lemmatization
def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    ## Remove numbers, but not words that contain numbers.
    texts_out = [[word for word in simple_preprocess(str(doc)) if not word.isdigit()] for doc in texts_out]
    ## Remove words of three characters or fewer.
    texts_out = [[word for word in simple_preprocess(str(doc)) if len(word) > 3] for doc in texts_out]
    return texts_out
data_ready = process_words(data_words)
# Create Dictionary
id2word = corpora.Dictionary(data_ready)
#dictionary.filter_extremes(no_below=10, no_above=0.2) #filter out tokens
# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in data_ready]
# View:the produced corpus shown above is a mapping of (word_id, word_frequency).
print(corpus[:1])
print('Number of unique tokens: %d' % len(id2word))
print('Number of documents: %d' % len(corpus))
The output is :
[[(0, 1), (1, 1), (2, 1), (3, 1)]]
Number of unique tokens: 6558
Number of documents: 23141
Now I set a base model:
## set a base model
num_topics = 5
chunksize = 100
passes = 10
iterations = 100
eval_every = 1
lda_model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                     alpha='auto', eta='auto',
                     iterations=iterations, num_topics=num_topics,
                     passes=passes, eval_every=eval_every)
The last step is where the problem occurs:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_ready, dictionary=id2word, coherence="c_v")
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Here is the error:
IndexError: index 0 is out of bounds for axis 0 with size 0
If I change the coherence to 'u_mass', however, the code above runs successfully. I don't understand why, or how to fix it.
!pip install gensim==4.1.0
It seems that downgrading solves everything.
Just in case anyone else runs into the same issue:
Apparently the error described here persists in gensim 4.2.0. Downgrading to 4.1.0 worked well for me.
I'm not sure if I'm returning the value from the function incorrectly, but when I try to access its info, it has the above error:
Cannot index into a null array
I've tried a couple of different ways, and I'm not sure whether I'm returning this incorrectly from the function or just accessing the returned info incorrectly. In the related question "Cannot index into null array", some of that poster's array elements were null. But when I print my info to the screen before I exit the function, it is populated. How do I return the value found in the function so that I can loop through the contents in my main code and use one of the strings in the object? This is a continuation of "parsing repeated pattern".
#parse data out of cpp code and loop through to further process
#function
Function Get-CaseContents{
[cmdletbinding()]
Param ( [string]$parsedCaseMethod, [string]$parseLinesGroupIndicator)
Process
{
# construct regex
$fullregex = [regex]"_stprintf[\s\S]*?_T\D*", # Start of error message, capture until digits
"(?<sdkErr>\d+)", # Error number, digits only
"\D[\s\S]*?", # match anything, non-greedy
"(?<sdkDesc>\((.+?)\))", # Error description, anything within parentheses, non-greedy
"([\s\S]*?outError\s*=(?<sdkOutErr>\s[a-zA-Z_]*))", # Capture OutErr string and parse out part after underscore later
"[\s\S]*?", # match anything, non-greedy
"(?<sdkSeverity>outSeverity\s*=\s[a-zA-Z_]*)", # Capture severity string and parse out part after underscore later
'' -join ''
# run the regex
$Values = $parsedCaseMethod | Select-String -Pattern $fullregex -AllMatches
# Convert Name-Value pairs to object properties
$result = foreach ($match in $Values.Matches){
[PSCustomObject][ordered]@{
sdkErr = $match.Groups['sdkErr']
sdkDesc = $match.Groups['sdkDesc']
sdkOutErr = $match.Groups['sdkOutErr']
sdkSeverity = ($match.Groups['sdkSeverity'] -split '_')[-1] #take part after _
}
}
#Write-Host "result:" $result -ForegroundColor Green
$result
return $Values
...
#main code
...
#call method to get case info (sdkErr, sdkDesc, sdkOutErr, sdkSeverity)
$ValuesCase = Get-CaseContents -parsedCaseMethod $matchFound -parseLinesGroupIndicator "_stprintf" #need to get returned info back
$result = foreach ($match in $ValuesCase.Matches){
[PSCustomObject][ordered]@{
sdkErr = $match.Groups['sdkErr']
sdkDesc = $match.Groups['sdkDesc']
sdkOutErr = $match.Groups['sdkOutErr']
sdkSeverity = ($match.Groups['sdkSeverity'] -split '_')[-1] #take part after _
} #result
} #foreach ValuesCase
The example of string sent to the function to parse is:
...
case kRESULT_STATUS_Undefined_Opcode:
_stprintf( outDevStr, _T("8004 - (Comm. Err 04) - %s(Undefined Opcode)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
case kRESULT_STATUS_Comm_Timeout:
_stprintf( outDevStr, _T("8005 - (Comm. Err 05) - %s(Timeout sending command)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
case kRESULT_STATUS_TXD_Failed:
_stprintf( outDevStr, _T("8006 - (Comm. Err 06) - %s(TXD Failed--Send buffer overflow.)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
...
Another thing I tried is (but it also had the index into null array issue):
foreach($matchRegex in $ValuesCase.Matches)
{
$sdkOutErr = $matchRegex.Groups['sdkOutErr']
Write-Host sdkOutErr -ForegroundColor DarkMagenta
}
Ultimately, I need to grab $sdkOutErr to further process. I'll need to use the other variables too in the returned object, but this is the first one I need. I love the way the output is formatted in the function, but probably don't know how to return the info and use what is returned. I'm not sure what to search for to figure out the issue other than the error message, which leads me to believe I'm returning the info wrong. I don't think I need to return $result, because I think that's just a string with the values in the $values.Matches in the function. I need to access the values returned as I mentioned.
I checked, and the contents sent to the function is not blank.
I tried returning $results, and it looks like this when I write-Host, which would be difficult to access each sdkOutErr:
@{sdkErr=1000; sdkDesc=(Out of Memory); sdkOutErr= NO_MEMORY; sdkSeverity=FATAL} @{sdkErr=1002; sdkDesc=(Failed to load DLL); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL}
@{sdkErr=1003; sdkDesc=(Failed to load DLL); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL} @{sdkErr=1004; sdkDesc=(Failed to open); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL}
@{sdkErr=1005; sdkDesc=(Unable to access the specified profile); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL} @{sdkErr=100 ...
How can I return this from the function so that it's not a null array/index, and so that the data is accessible if I use a foreach loop (or two) in the main code to get the sdkOutErr (to start)?
I'm fairly new to (complicated)powershell and I have a feeling I need a map inside the array in my function, but I'm not sure.
Before I returned the function Values or results, it was printing something like this out. Once I added in main $ValuesCase=Get-CaseContents... (returning $values from function), or $parsedCase = Get-CaseContents... (returning $results from function), it stopped showing this on the screen:
sdkErr sdkDesc              sdkOutErr   sdkSeverity
------ -------              ---------   -----------
1000   (Out of Memory)      NO_MEMORY   FATAL
1002   (Failed to load DLL) OTHER_ERROR FATAL
1003   (Failed to load DLL) OTHER_ERROR FATAL
1004   (Failed to open)     OTHER_ERROR FATAL
Regarding "I tried returning $results, and it looks like this when I write-Host, which would be difficult to access each sdkOutErr":
Getting all the sdkOutErr values is not as difficult as you might imagine:
$results.sdkOutErr # this will output the `sdkOutErr` value from each object in the array
Or, outside the function:
(Get-CaseContents -parsedCaseMethod $matchFound -parseLinesGroupIndicator "_stprintf").sdkOutErr
Another option, which might perform better if the result set is large, is to use ForEach-Object to grab just the sdkOutErr values:
$fullResults = Get-CaseContents -parsedCaseMethod $matchFound -parseLinesGroupIndicator "_stprintf"
$sdkOutErrValuesOnly = $fullResults |ForEach-Object -MemberName sdkOutErr
I have a JSON-like string that represents a nested structure. It is not real JSON, in that the names and values are not quoted. I want to parse it into a nested structure, e.g. a list of lists.
#example:
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
and the result should be like this:
x_list = list(a=1,b=2,c=c(1,2,3),d=list(e="something"))
Is there a convenient function I don't know about that does this kind of parsing?
Thanks.
If all of your data is consistent, there is a simple solution involving regex and the jsonlite package. The code is:
if(!require(jsonlite, quiet=TRUE)){
#if library is not installed: installs it and loads it into the R session for use.
install.packages("jsonlite",repos="https://ftp.heanet.ie/mirrors/cran.r-project.org")
library(jsonlite)
}
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
json_x_string = "{\"a\":1, \"b\":2, \"c\":[1,2,3], \"d\":{\"e\":\"something\"}}"
fromJSON(json_x_string)
s <- gsub( "([A-Za-z]+)", "\"\\1\"", gsub( "([A-Za-z]*)=", "\\1:", x_string ) )
fromJSON( s )
The first section checks if the package is installed. If it is it loads it, otherwise it installs it and then loads it. I usually include this in any R code I'm writing to make it simpler to transfer between pcs/people.
Your string is x_string, we want it to look like json_x_string which gives the desired output when we call fromJSON().
The regex is split into two parts because it's been a while - I'm pretty sure this could be made more elegant. Then again, this depends on if your data is consistent so I'll leave it like this for now. First it changes "=" to ":", then it adds quotation marks around all groups of letters. Calling fromJSON(s) gives the output:
fromJSON(s)
$a
[1] 1
$b
[1] 2
$c
[1] 1 2 3
$d
$d$e
[1] "something"
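For readers who work in Python rather than R, the same two-step substitution can be sketched as follows. This is an illustration of the technique, not part of the original answer: `parse_loose` is a hypothetical name, and the sketch shares the answer's assumption that keys and unquoted values consist of letters only.

```python
import json
import re


def parse_loose(text):
    """Turn an unquoted, '='-separated structure into Python objects via JSON."""
    # Step 1: replace every 'name=' with 'name:'
    with_colons = re.sub(r"([A-Za-z]*)=", r"\1:", text)
    # Step 2: wrap every run of letters in double quotes
    quoted = re.sub(r"([A-Za-z]+)", r'"\1"', with_colons)
    return json.loads(quoted)


parsed = parse_loose("{a=1, b=2, c=[1,2,3], d={e=something}}")
print(parsed)
```

Values that mix letters and digits, contain spaces, or include quotes would break the second substitution, so this only suits data as regular as the example.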
I would rather avoid JSON parsing because of its lack of extensibility and flexibility, and stick to a solution based on regex + recursion.
Here is an extensible base implementation that parses your input string as desired.
The main recursion function:
# Parse string
parse.string = function(.string){
regex = "^((.*)=)??\\{(.*)\\}"
# Recursion termination: element parsing
if(iselement(.string)){
return(parse.element(.string))
}
# Extract components
elements.str = gsub(regex, "\\3", .string)
elements.vector = get.subelements(elements.str)
# Recursively parse each element
parsed.elements = list(sapply(elements.vector, parse.string, USE.NAMES = F))
# Extract list's name and return
name = gsub(regex, "\\2", .string)
names(parsed.elements) = name
return(parsed.elements)
}
Helper functions:
library(stringr)
# Test if the string is a base element
iselement = function(.string){
grepl("^[^[:punct:]]+=[^\\{\\}]+$", .string)
}
# Parse element
parse.element = function(element.string){
splits = strsplit(element.string, "=")[[1]]
element = splits[2]
# Parse numeric elements
if(!is.na(as.numeric(element))){
element = as.numeric(element)
}
# TODO: Extend here to include vectors
# Reformat and return
element = list(element)
names(element) = splits[1]
return(element)
}
# Get subelements from a string
get.subelements = function(.string){
# Regex of allowed elements - Extend here to include more types
elements.regex = c("[^, ]+?=\\{.+?\\}", #Sublist
"[^, ]+?=\\[.+?\\]", #Vector
"[^, ]+?=[^=,]+") #Base element
str_extract_all(.string, pattern = paste(elements.regex, collapse = "|"))[[1]]
}
Parsing results:
string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
string_2 = "{a=1, b=2, c=[1,2,3], d=somthing}"
named_string = "xyz={a=1, b=2, c=[1,2,3], d={e=something, f=22}}"
named_string_2 = "xyz={d={e=something, f=22}}"
parse.string(string)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 2
#
# [[1]]$c
# [1] "[1,2,3]"
#
# [[1]]$d
# [[1]]$d$e
# [1] "something"
I have recently started using R and have a task that involves parsing JSON in R to get a non-JSON format. For this, I am using the fromJSON() function. I have tried to parse the JSON from a text file. It runs successfully with a single-row entry, but when I try it with multiple row entries, I get the following error:
fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
lexical error: invalid char in json text.
[{'CategoryType':'dining','City':
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
"mumbai","Location":"all"}] [{"JourneyType":"Return","Origi
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: after array element, I expect ',' or ']'
:"mumbai","Location":"all"} {"JourneyType":"Return","Origin
(right here) ------^
The errors above come from three different formats in which I tried to supply the JSON text, but the result was the same; only the reported location changed.
Please help me identify the cause of this error, or tell me if there is a more efficient way of performing the task.
The original file I have is an Excel sheet with multiple columns, one of which contains JSON text. What I tried so far is extracting just the JSON column, converting it to tab-separated text, and then parsing it as:
fromJSON("D:/Eclairs/Printing/test3.txt")
Please also suggest whether this can be done more efficiently. I also need to map all the other columns in the Excel sheet to the parsed (non-JSON) text.
Example:
[{"CategoryType":"dining","City":"mumbai","Location":"all"}]
[{"CategoryType":"reserve-a-table","City":"pune","Location":"Kothrud,West Pune"}]
[{"Destination":"Mumbai","CheckInDate":"14-Oct-2016","CheckOutDate":"15-Oct-2016","Rooms":"1","NoOfPax":"3","NoOfAdult":"3","NoOfChildren":"0"}]
Consider reading in the text line by line with readLines(), iteratively saving the JSON dataframes to a growing list:
library(jsonlite)
con <- file("C:/Path/To/Jsons.txt", open="r")
jsonlist <- list()
while (length(line <- readLines(con, n=1, warn = FALSE)) > 0) {
jsonlist <- append(jsonlist, list(fromJSON(line)))
}
close(con)
jsonlist
# [[1]]
# CategoryType City Location
# 1 dining mumbai all
# [[2]]
# CategoryType City Location
# 1 reserve-a-table pune Kothrud,West Pune
# [[3]]
# Destination CheckInDate CheckOutDate Rooms NoOfPax NoOfAdult NoOfChildren
# 1 Mumbai 14-Oct-2016 15-Oct-2016 1 3 3 0
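The same line-by-line idea, sketched in Python for comparison. `load_json_lines` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real file so the sketch is self-contained. Note that JSON requires double-quoted strings, which is why the single-quoted variant in the question fails with "invalid char in json text".

```python
import io
import json


def load_json_lines(handle):
    """Parse one JSON document per non-empty line and return them as a list."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Report which line is malformed instead of failing on the whole file
            raise ValueError(f"line {line_number}: {err}") from err
    return records


# Two rows taken from the question's example data
sample = io.StringIO(
    '[{"CategoryType":"dining","City":"mumbai","Location":"all"}]\n'
    '[{"CategoryType":"reserve-a-table","City":"pune","Location":"Kothrud,West Pune"}]\n'
)
records = load_json_lines(sample)
print(records[1][0]["City"])
```

Each line is parsed on its own, so several top-level arrays in one file no longer trigger the "trailing garbage" error.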
I am trying to write a function that reads a list of IDs from an external JSON file, which it does. It even puts the data into the database when the program loads. My issue is that I can't seem to match the list IDs in a comparison. Here is my current code:
def check(account):
    global ID_account
    import json, httplib
    if not hasattr(BigWorld, 'iddata'):
        UID_DB = account['databaseID']
        UID = ID_account
        try:
            conn = httplib.HTTPConnection('URL')
            conn.request('GET', '/ids.json')
            conn.sock.settimeout(2)
            resp = conn.getresponse()
            qresp = resp.read()
            BigWorld.iddata = json.loads(qresp)
            LOG_NOTE('[ABRO] Request of URL data successful.')
            conn.close()
        except:
            LOG_NOTE('[ABRO] Http request to URL problem. Loading local data.')
        if UID_DB is not None:
            list = BigWorld.iddata["ids"]
            #print (len(list) - 1)
            for n in range(0, (len(list) - 1)):
                #print UID_DB
                #print list[n]
                if UID_DB == list[n]:
                    #print '[ABRO] userid located:'
                    #print UID_DB
                    UID = UID_DB
        else:
            LOG_NOTE('[ABRO] userid not set.')
        if 'databaseID' in account and account['databaseID'] != UID:
            print '[ABRO] Account not active in database, game closing...... '
            BigWorld.quit()
My JSON file looks like this:
{
"ids":[
"1001583757",
"500687699",
"000000000"
]
}
When I run this with all the commented-out prints enabled, it seems to execute perfectly fine until it tries to do the match inside the for loop. Even when the prints show UID_DB and list[n] holding the same values, it does not set my variable and reports no errors; it simply acts as if there was no match. Am I possibly missing a loop break? Here is the Python log, starting with the print of the list length:
INFO: 2
INFO: 1001583757
INFO: 1001583757
INFO: 1001583757
INFO: 500687699
INFO: [ABRO] Account not active, game closing......
As you can see from the log, it never prints the "userid located" message, so it is not matching them. It just continues with the loop and uses the default ID I defined above the function. Any ideas would help, as I've been poking and prodding this thing for 3 days now.
The answer was found by @VikasNehaOjha: all that was missing was a conversion so that the types match before the comparison. I did this by adding:
list[n] = int(list[n])
That resolved my issue, and the comparison finally matched.
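A minimal sketch of why the comparison failed, assuming (as the fix implies) that the database ID is an integer while the JSON file stores the IDs as strings. It is written for Python 3, unlike the Python 2 code in the question, and `is_known_id` is a hypothetical helper rather than part of the original program.

```python
import json

# Same structure as the ids.json file shown in the question
raw_json = '{"ids": ["1001583757", "500687699", "000000000"]}'


def is_known_id(database_id, raw):
    """Return True if database_id matches an entry of the "ids" list in raw."""
    # Normalise both sides to int so "1001583757" and 1001583757 compare equal
    known = {int(entry) for entry in json.loads(raw)["ids"]}
    return int(database_id) in known


# A string never equals an integer, even if both print identically
print("1001583757" == 1001583757)
print(is_known_id(1001583757, raw_json))
print(is_known_id(42, raw_json))
```

Testing membership in a set also sidesteps the index loop; in the question's code, range(0, len(list) - 1) stops one short, so the last entry in the list is never compared.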