Parsing pseudo-HTML/XML logfiles into a data frame (Symantec Altiris) [R] - html

I've been asked to help parse some log files for a Symantec application (Altiris) and they were delivered to me in a pseudo-HTML/XML format. I've managed to use readLines() and grepl() to get the logs into a decent character vector format and clean out the junk, but can't get it into a data-frame.
As of right now, an entry looks something like this (since I can't post real data), all in a character vector with structure chr[1:312]:
[310] "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"
I've had no luck with XML parsing and it does look more like HTML to me, and when I tried htmlTreeParse(x) I just ended up with a massive pyramid of tags.

If you're working with pseudo-XML, it's probably best to define the parsing rules yourself. I like stringr and dplyr for stuff like this.
Here's a two-element vector (instead of 312 in your case):
vec <- c(
"<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
"<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
)
Convert it to a data.frame object:
df <- data.frame(vec, stringsAsFactors = FALSE)
And select out your data based on their character index positions, relative to the positions of your variables of interest:
require(stringr)
require(dplyr)
df %>%
mutate(
severityStr = str_locate(vec, "severity")[, "start"],
hostnameStr = str_locate(vec, "hostname")[, "start"],
sourceStr = str_locate(vec, "source")[, "start"],
moduleStr = str_locate(vec, "module")[, "start"],
processStr = str_locate(vec, "process")[, "start"],
pidStr = str_locate(vec, "pid")[, "start"],
endStr = str_locate(vec, ">")[, "start"],
severity = substr(vec, severityStr + 10, hostnameStr - 4),
hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
source = substr(vec, sourceStr + 8, moduleStr - 4),
module = substr(vec, moduleStr + 8, processStr - 4),
process = substr(vec, processStr + 9, pidStr - 4),
pid = substr(vec, pidStr + 5, endStr - 3)) %>%
select(severity, hostname, source, module, process, pid)
Here's the resulting data frame:
severity hostname source module process pid
1 4 computername125 PackageDownload herpderp.dll masterP.exe 234
2 5 computername126 PackageDownload herpderp.dll masterP.exe 235
This solution is robust enough to handle string inputs of different lengths. For example, it would read pid in correctly even if it's 95 (two digits instead of three).

Related

Running timeseries graphing function in Rmd producing cluttered x-axis labels (not present in test code)

I have a folder of xx .csv timeseries that I want to graph and knit into a clean HTML document. I have a ggplot code that produces the plot that I want using a single timeseries.csv. However, when I try to put the bones of that ggplot code in a function inside of a for loop to run each of the timeseries.csv files through the function I get a some plots with pretty different formatting.
Plot generated with my test ggplot code:
Plot generated with function and for loop:
Changes I'm trying to make to the ugly Rmd plot:
Nicely space the x-axis tick marks to whole mins (i.e. "11:14:00", "11:15:00")
Connect the data points (solved with subbing geom_line() with geom_path())
Example Rmd Code Below. Please Note that the graphs produced still have nice formatting, I'm not sure how to reproduce this problem sort of posting a 500 row dataframe. I also don't know how to post my rmd code without SO using the formatting commands in this post, so I threw in at 3 of " around my header formatting and at the end of the code to disable it.
Edits and Updates
I am getting a persistent error geom_path: Each group consists of only one observation. Do you need to adjust the group
aesthetic?.
As suggested by the commenters I tried removing plot() and using the the createChlDiffPlot() directly and replacing plot() with print(). Both produce the same ugly plots as before.
Replaced geom_line() with geom_path(). The points are now connected! x-axis cluttering is still there.
Time variable is reading as hms num
Many thanks for any help on this!
```
---
title: "Chl Filtration"
output:
flexdashboard::flex_dashboard:
theme: yeti
orientation: rows
editor_options:
chunk_output_type: console
---
```{r setup}
library(flexdashboard)
library(dplyr)
library(ggplot2)
library(hms)
library(ggthemes)
library(readr)
library(data.table)
#### Example Data
df1 <- data.frame(Time = as_hms(c("11:22:33","11:22:34","11:22:35","11:22:38","11:23:00","11:23:01","11:23:02")),
Chl_ug_L_Up = c(0.2,0.1,0.25,-0.2,-0.3,-0.15,0.1),
Chl_ug_L_Down = c(0.5,0.4,0.3,0.2,0.1,0,-0.1))
df2 <- data.frame(Time = as_hms(c("08:02:33","08:02:34","08:02:35","08:02:40","08:02:42","08:02:43","08:02:49")),
Chl_ug_L_Up = c(-0.2,-0.1,-0.25,0.2,0.3,0.15,-0.1),
Chl_ug_L_Down = c(-0.1,0,0.1,0.2,0.3,0.4,0.1))
data_directory = "./" # data folder in R project folder in the real deal
output_directory = "./" # output graph directory in R project folder
write_csv(df1, file.path(data_directory, "SO_example_df1.csv"))
write_csv(df2, file.path(data_directory, "SO_example_df2.csv"))
#### Function to create graphs
createChlDiffPlot = function(aTimeSeriesFile, aFileName, aGraphOutputDirectory, aType)
{
aFile_Mod = aTimeSeriesFile %<>%
select(Time, Chl_ug_L_Up, Chl_ug_L_Down) %>%
mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down)
one_plot = ggplot(data = aFile_Mod, aes(x = Time, y = Chl_diff)) + # tried adding 'group = 1' in aes to connect points
geom_path(size = 1, color = "green") +
geom_point(color = "green") +
theme_gdocs() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_blank()) +
labs(x = "", y = "Chl Difference", title = paste0(aFileName, " - ", "Filtration"))
one_graph_name = paste0(gsub(".csv", "", aFileName), "_", aType, ".pdf")
ggsave(one_graph_name, one_plot, dpi = 600, width = 7, height = 5, units = "in", device = "pdf", aGraphOutputDirectory)
return(one_plot)
}
"``` ### remove the quotes when running example
Plots - After Velocity Adjustment
=====================================" ### remove quotes when running example
```{r, fig.width=13.5, fig.height=5}
all_files_Filtration = list.files(data_directory, pattern = ".csv")
# Loop to plot function
for(file in 1 : length(all_files_Filtration))
{
file_name = all_files_Filtration[file]
one_file = fread(file.path(data_directory, file_name))
# plot the time series agains
plot(createChlDiffPlot(one_file, file_name, output_directory, "Velocity_Paired"))
}
"``` #remove quotes when running example
```
I finally figured it out.
1) Replacing geom_line() with geom_path() connected the data points when rendered in Rmd.
2) df1$Time was formatted as a difftime object. When I looked at the dataframe in the global environment, Time :hmsnum 11:11:09 .... This made me think my format was ok, but when I ran class(df1$Time) I got [1] "hms" "difftime". With a quick google I found out difftime objects are not quite the same as hms, and my original time was generated by subtracting times. I added a conversion into my mutate function:
select(Time, Chl_ug_L_Up, Chl_ug_L_Down) %>%
mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down,
Time = as_hms(Time)) # convert difftime objecct to hms
ggplot I think has some auto-formatting for hms variables, which is why difftime variable was producing ugly crowded x- axes.

Can prefix beam search commonly used in speech recognition with CTC be implemented in such a simpler way?

I am learning about speech recognition recently, and I have learned that the idea of prefix beam search is to merge paths with the same prefix, such as [1,1,_] and [_,1,_] (as you can see, _ indicates blank mark).
Based on this understanding, I implemented a version of mine, which can be simplified using pseudo code like this:
def prefix_beam_search(y, beam_size, blank):
seq_len, n_class = y.shape
logY = np.log(y)
beam = [([], 0)]
for t in range(seq_len):
buff = []
for prefix, p in beam:
for i in range(n_class):
new_prefix = list(prefix) + [i]
new_p = p + logY[t][i]
buff.append((new_prefix, new_p))
# merge the paths with same prefix'
new_beam = defaultdict(lambda: ninf)
for prefix, p in buff:
# 'norm_prefix' can simplify the path, [1,1,_,2] ==> [1,2]
# However, the ending 'blank' is retained, [1,1,_] ==> [1,_]
prefix = norm_prefix(prefix, blank)
new_beam[prefix] = logsumexp(new_beam[prefix], p)
# choose the best paths
new_beam = sorted(new_beam.items(), key=lambda x: x[1], reverse=True)
beam = new_beam[: beam_size]
return beam
But most of the versions I found online (according to the paper) are like this:
def _prefix_beam_decode(y, beam_size, blank):
T, V = y.shape
log_y = np.log(y)
beam = [(tuple(), (0, ninf))]
for t in range(T):
new_beam = defaultdict(lambda: (ninf, ninf))
for prefix, (p_b, p_nb) in beam:
for i in range(V):
p = log_y[t, i]
if i == blank:
new_p_b, new_p_nb = new_beam[prefix]
new_p_b = logsumexp(new_p_b, p_b + p, p_nb + p)
new_beam[prefix] = (new_p_b, new_p_nb)
continue
end_t = prefix[-1] if prefix else None
new_prefix = prefix + (i,)
new_p_b, new_p_nb = new_beam[new_prefix]
if i != end_t:
new_p_nb = logsumexp(new_p_nb, p_b + p, p_nb + p)
else:
new_p_nb = logsumexp(new_p_nb, p_b + p)
new_beam[new_prefix] = (new_p_b, new_p_nb)
if i == end_t:
new_p_b, new_p_nb = new_beam[prefix]
new_p_nb = logsumexp(new_p_nb, p_nb + p)
new_beam[prefix] = (new_p_b, new_p_nb)
beam = sorted(new_beam.items(), key=lambda x: logsumexp(*x[1]), reverse=True)
beam = beam[:beam_size]
return beam
The results of the two are different, and my version tends to return longer strings. And I don't quite understand the main two aspects:
Are there any details of my version that are not thoughtful?
The common version while generate new prefix by new_prefix = prefix + (i,) regardless of whether the end of the previous are the same as the given 's'. For example, the old prefix is [a,a,b] and when a new character s is added, both [a,a,b] and [a,a,b,b] are saved. What is the purpose if this? And does it cause double counting?
Looking forward to your answer, thanks in advance!
When you choose the best paths in your code, you don't want to differentiate between [1,_] and [1] since both correspond to the same prefix [1].
If you have for example:
[1], [1,_], [1,2]
then you want the probability of [1] and [1,_] both to have the sum of the two.
probability([1]) = probability([1])+probability([1,_])
probability([1,_]) = probability([1])+probability([1,_])
And after sorting with these probabilities, you may want to keep so many that the number of true prefixes is beam_size.
For example, you have [1], [1,_], [2], [3].
Of which probabilities are: 0.1, 0.08, 0.11, 0.15
Then the probabilities with which you want to sort them are:
0.18, 0.18, 0.11, 0.15, respectively (0.18 = 0.1 + 0.08)
Sorted: [1]:0.18, [1,_]: 0.18, [3]:0.15, [2]:0.11
And if you have beam_size 2, for example, then you may want to keep
[1], [1,_] and [3] so that you have 2 prefixes in your beam, because [1] and [1,_] count as the same prefix (as long as the next character is not 1 - that's why we keep track of [1] and [1,_] separately).

Python: KeyError: 'style_code' even when style_code is in my json data

I might just need new eyes to look at this but im not sure why im yielding this error KeyError: 'style_code'
I know that style code is in my json data and i looked up why this error occurs and it said it was because it cannot find it in the list. Here is my code of the json data...
data1['task'].append({
'profile': profiles_select.get(),
'proxy pool': pool_combo.get(),
'captcha': captcha,
'task_id': num_id,
'style_code': stylecode_entry.get(),
'delay': int(delay_entry.get()),
'size': size,
'splash': splash,
'browser': browser
})
num_id = num_id + 1
with open('tasks_ys.txt', 'w', encoding='utf-8') as outfile:
json.dump(data1, outfile, indent=2)
As you can see 'style_code' is clearly stated in there. Here is my code which is giving me the error.
with open('tasks_ys.txt', 'r') as json_file:
load_data = json.load(json_file)
for task in load_data['task']:
style_c = load_data['style_code']
product_lbl = Label(ys_tasks_frame, text=style_c, bg='#1a2228', fg=fgcolor,
font=("Candara", 12))
product_lbl.place(x=150, y=task_y)
id_lbl = Label(ys_tasks_frame, text=num_t, bg='#1a2228', fg=fgcolor, font=("Candara", 12))
id_lbl.place(x=25, y=task_y)
num_t = num_t + 1
task_y = task_y + 25
Can anyone help me and point out what is wrong about the code?

Is there a way to take a list of strings and create a JSON file, where both the key and value are list items?

I am creating a python script that can read scanned, and tabular .pdfs and extract some important data and insert it into a JSON to later be implemented into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is I have never worked with any JSON files before but that was the format I was recommended to output to. The scraping script works, the pre-processing could be a lot cleaner, but for now it works. The issue I run into is the keys, and values are in the same list, and some of the values because they had a decimal point are two different list items. Not really sure where to even start.
I don't really know where to start, I suppose since I know what the indexes of the list are I can easily assign keys and values, but then it may not be applicable to any .pdf, that is the script cannot be coded explicitly.
import PyPDF2 as pdf2
import textract
with "TestSpec.pdf" as filename:
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.pdfFileReader(pdfFileObj)
num_pages = pdfReader.numpages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(0)
count += 1
text += pageObj.extractText()
if text != "":
text = text
else:
text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
'''
This function takes the byte data extracted from scanned PDFs, and cleans it of all
unnessary data.
Requires re
'''
stringedText = str(x)
cleanText = stringedText.replace('\n','')
splitText = re.split(r'\W+', cleanText)
caseingText = [word.lower() for word in splitText]
cleanOne = [word for word in caseingText if word != 'n']
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]
return clean
cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. Just not sure how to do that.
here is a screenshot of the data from my sample pdf
So, i have figured out some of this. I am still having issues with grabbing the last 3rd of the data i need without explicitly programming it in. but here is what i have so far. Once i have everything working then i will worry about optimizing it and condensing.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json
# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()
# checks if extracted data is in string form or picture, if picture textract reads data.
# it then closes the pdf file
if text != "":
text = text
else:
text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()
# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Removed escaped lines and replaced them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)
# Creating the PrimerData dictionary
with open("PrimerData.json",'w') as file:
primerDataSlice = clean[clean.index("molecular"): -1]
primerData = re.split(": |\n", primerDataSlice)
primerKeys = primerData[0::2]
primerValues = primerData[1::2]
primerDict = {"Primer Data": dict(zip(primerKeys,primerValues))}
# Generatring the JSON array "Primer Data"
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)
# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall('(\d{2}[\/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[\/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code. A minimal working example with input would help. As for JSON handling, python dictionaries can dump to json easily. See examples here.
https://docs.python-guide.org/scenarios/json/
Get a json string from a dictionary and write to a file. Figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
So, I did figure this out, the problem was really just that because of the way my pre-processing was pulling all the data into a single list wasn't really that great of an idea considering that the keys for the dictionary never changed.
Here is the semi-finished result for making the Dictionary and JSON file.
# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]
# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") +
10: clean2.index("Order No.") + 18]
# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])
def gc_content(x):
count = 0
for i in x:
if i == 'G' or i == 'C':
count += 1
else:
count = count
return round((count / bases) * 100, 1)
gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") +
10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10
primerDict = {"Primer Data": {
"Sequence": sequence,
"Bases": bases,
"TM (50mM NaCl)": tm,
"% GC content": gc,
"Molecular weight": moleWeight,
"ug/0D260": dilWeight,
"Dilution volume (uL)": dilution
},
"Shipment Info": {
"Ref. No.": refNo,
"Order No.": orderNo,
"Ordered by": ordered,
"Date of Order": dateOrder,
"Received By": received,
"Date Received": str(dateReceived.strftime("%d-%b-%Y"))
}}
# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)

How to decode a csv file with long lines in tensorflow with tf.decode_csv?

How to decode a csv file with long lines(e.g., with many items per line so as not realistic to list them one by one for output) with tf.TextLineReader() and tf.decode_csv?
The typical usage is:
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [1,1,1,1,1]
a,b,c,d,e = tf.decode_csv(records=value,record_defaults=record_defaults, field_delim=" ")
When we have thousands of items in a line, it's impossible to assign them one by one as (a,b,c,d,e) above, can all the items be decoded to a list or something like that?
Lets say you have 1800 columns of data. You can use this as record default:
record_defaults=[[1]]*1800
and then use
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
to read them.
Well, tf.decode_csv returns a list, so you can simply do:
record_defaults = [[1], [1], [1], [1], [1]]
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
all_columns
Out: [<tf.Tensor 'DecodeCSV:0' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:1' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:2' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:3' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:4' shape=() dtype=int32>
]
You can then evaluate it as usual:
sess = tf.Session()
sess.run(all_columns)
Out: [1, 1, 1, 1, 1]
Note that you need to pass a rank 1 record_defaults. If you have some problems with hanging queue.
Here is the way I am mixing differents dtypes in the record_defaults:
record_defaults = [tf.constant(.1, dtype=tf.float32) for count in range(100)] # 5 fp32 features
record_defaults.extend([tf.constant(1, dtype=tf.int32) for count in range(2)]) # 2 int32 features