Generator usage for log analysis

I have working Python code to analyze logs. Logs are at least 10 MB in size and can sometimes reach 250-300 MB depending on failures and retries.
I use a generator that yields the big file in chunks; the chunk size is configurable, and I normally yield 1 or 2 MB at a time. So I analyze logs in 1 MB chunks to verify some tests.
My problem is that using a generator brings up some edge cases. When analyzing a log I check for the subsequent appearance of some patterns, as follows; only if all 4 lists are seen do I keep them for the next verification part of the code. The following 4 patterns can be seen in the logs once or twice, not more.
listA
listB
listC
listD
If all of these occur subsequently, I keep them all to evaluate in the next step; otherwise I ignore them.
However, there is now a small change: some patterns (lists, as I use the regex findall method to find patterns) can fall into the next chunk, completing the check there. So in the following, I have 3 matching cases: chunks 3-4, 5-6, and 7-8 each create the condition to take into account.
---- chunk 1 -----
listA
listB
----- chunk 2 -----
nothing
----- chunk 3 -----
listA
listB
----- chunk 4 -----
listC
listD
----- chunk 5 -----
listA
----- chunk 6 -----
listB
listC
listD
---- chunk 7 ------
listA
listB
listC
----- chunk 8 ------
listD
---------------------
Usually it does not happen like this: some patterns (B, C, D) are mostly seen consecutively in the logs, but listA can be seen 20, at most 30, rows earlier than the rest. Still, any scenario like the above can happen.
Please advise a good approach; I'm not sure what to use. I know the next() function can be used to check the next chunk. In such a case,
should I use any([listA, listB, listC, listD]) and, if any of the patterns occurs, check the rest one by one in the next chunk, like the following?
if any([listA, listB, listC, listD]):
Then check here which of the patterns were not seen, keep them in a notSeen list, and check them one by one in the next chunk?
next_chunk = next(gen_func(chunksize))
isListA = re.findall(pattern, next_chunk)
Or maybe I am completely missing an easier approach for this little project. Please let me know your thoughts, as you might have experienced such a situation before.

I have used next_chunk = next(gen_func(chunksize))
and added the necessary if statements underneath to check only the next log piece, because I would arrange the log chunks suitably with the generator.
I share only part of the code, as the rest is confidential:
import re, os

def __init__(self, logfile):
    self.read = self.ReadLog(logfile)
    self.search = self.SearchData(logfile)
    self.file = os.path.abspath(logfile)
    self.data = self.read.read_in_chunks
    r_search_combined, scan_result, r_complete, r_failed = [], [], [], []

def test123(self, r_reason: str, cyc: int, b_r):
    ''' Usage : verify_log.py
    --log ${LOGS_FOLDER}/log --reason r_low 1 <True | False>'''
    ret = False
    r_count = 2*int(cyc) if b_r.lower() == "true" else int(cyc)
    r_search_combined, scan_result, r_complete, r_failed = [], [], [], []
    result_pattern = self.search.r_scan_pattern()

    def check_patterns(chunk):
        search_cached = re.findall(self.search.r_search_cached, chunk)
        search_full = re.findall(self.search.r_search_full, chunk)
        scan_complete = re.findall(self.search.r_scan_complete, chunk)
        scan_result = re.findall(result_pattern, chunk)
        r_complete = re.findall(self.search.r_auth_complete, chunk)
        return search_cached, search_full, scan_complete, scan_result, r_complete

    with open(self.file) as rf:
        for idx, piece in enumerate(self.data(rf), start=1):
            is_failed = re.findall(self.search.r_failure, piece)
            if is_failed:
                print(f'general failure received : {is_failed}')
                r_failed.extend(is_failed)
            is_r_search_cached, is_r_search_full, is_scan_complete, is_scan, is_r_complete = check_patterns(piece)
            if (is_r_search_cached or is_r_search_full) and all([is_scan_complete, is_scan, is_r_complete]):
                if is_r_search_cached:
                    r_search_combined.extend(is_r_search_cached)
                if is_r_search_full:
                    r_search_combined.extend(is_r_search_full)
                scan_result.extend(is_scan)
                r_complete.extend(is_r_complete)
            elif (is_r_search_cached or is_r_search_full) and not any([is_scan, is_r_complete]):
                next_piece = next(self.data(rf))
                _, _, _, is_scan_next, is_r_complete_next = check_patterns(next_piece)
                if (is_r_search_cached or is_r_search_full) and all([is_scan_next, is_r_complete_next]):
                    r_search_combined.extend(is_r_search_cached)
                    r_search_combined.extend(is_r_search_full)
                    scan_result.extend(is_scan_next)
                    r_complete.extend(is_r_complete_next)
            elif (is_r_search_cached or is_r_search_full) and is_scan and not is_r_complete:
                next_piece = next(self.data(rf))
                _, _, _, _, is_r_complete_next = check_patterns(next_piece)
                if (is_r_search_cached or is_r_search_full) and all([is_scan, is_r_complete_next]):
                    r_search_combined.extend(is_r_search_cached)
                    r_search_combined.extend(is_r_search_full)
                    scan_result.extend(is_scan)
                    r_complete.extend(is_r_complete_next)
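For what it's worth, one alternative to peeking with next() is to rescan a rolling window of the last two chunks each time, so a group of patterns split across one boundary always appears whole in some window; duplicate hits are filtered by their absolute position in the stream. A minimal sketch (the pattern strings and chunk contents below are illustrative, not the real regexes from the code above):

```python
import re

def scan_across_chunks(chunks, patterns, keep=2):
    """Find the patterns appearing in order, even across chunk boundaries.

    Rescans a rolling window of the last `keep` chunks and de-duplicates
    matches by their absolute character span in the whole stream.
    """
    combined = ".*?".join(f"({p})" for p in patterns)  # e.g. (listA).*?(listB)...
    window, offset = [], 0          # offset = stream position of window[0]
    seen, results = set(), []
    for chunk in chunks:
        window.append(chunk)
        if len(window) > keep:      # slide the window forward
            offset += len(window[0])
            window.pop(0)
        text = "".join(window)
        for m in re.finditer(combined, text, flags=re.DOTALL):
            span = (offset + m.start(), offset + m.end())
            if span not in seen:    # skip groups already seen in an earlier window
                seen.add(span)
                results.append(m.groups())
    return results

chunks = ["listA\nlistB\n", "nothing\n", "listA\nlistB\n", "listC\nlistD\n",
          "listA\n", "listB\nlistC\nlistD\n", "listA\nlistB\nlistC\n", "listD\n"]
print(scan_across_chunks(chunks, ["listA", "listB", "listC", "listD"]))
# finds the three groups from chunks 3-4, 5-6 and 7-8
```

A window of two chunks covers a group split across one boundary; if listA can trail the rest by more than a chunk, raise keep. This also avoids consuming the generator out of band: note that next(self.data(rf)) in the code above builds a fresh generator over the same file handle, so the peeked piece is read off the file and will likely never be seen by the outer loop again.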


Python- Return a verified variable from a function to the main program

Can anyone please direct me to an example where one sends a user-input variable to a checking function or module and returns the validated input, assigning/updating the initialised variable? I am trying to re-create something I did in C++ many years ago: validating an integer, in this particular case that the number of bolts input for a building-frame connection is indeed one. Any direction would be greatly appreciated, as my internet searches and trawls through my copy of Python Crash Course have yet to shed any light! Many thanks in anticipation that someone will feel benevolent towards a Python newbie!
Regards, Steve
Below is one of my numerous attempts at this; really I would just like to abandon it and use a while loop and a function call. In this one, apparently I am not allowed to put > (line 4) between a str and an int, despite my attempt to force N to be an int in the penultimate line!
def int_val(N):
    #checks
    # check 1. n > 0 for real entries
    N > 0
    isinstance(N, int)
    N=N
    return N
    print("N not ok enter again")

#N = input("Input N the Number of bolts ")
# Initialiase N=0
#N = 0
# Enter the number of bolts
N = input("Input N the Number of bolts ")
int_val(N)
print("no of bolts is", N)
Is something like this what you have in mind? It takes advantage of the fact that using the built-in int function will convert a string to an integer if possible, but otherwise throw a ValueError.
def str_to_posint(s):
    """Return value if converted and greater than zero else None."""
    try:
        num = int(s)
        return num if num > 0 else None
    except ValueError:
        return None

while True:
    s = input("Enter number of bolts: ")
    if num_bolts := str_to_posint(s):
        break
    print(f"Sorry, \"{s}\" is not a valid number of bolts.")

print(f"{num_bolts = }")
Output:
Enter number of bolts: twenty
Sorry, "twenty" is not a valid number of bolts.
Enter number of bolts: 20
num_bolts = 20

Understanding the output of a pyomo model, number of solutions is zero?

I am using this code to create and solve a simple problem:
import pyomo.environ as pyo
from pyomo.core.expr.numeric_expr import LinearExpression

model = pyo.ConcreteModel()
model.nVars = pyo.Param(initialize=4)
model.N = pyo.RangeSet(model.nVars)
model.x = pyo.Var(model.N, within=pyo.Binary)
model.coefs = [1, 1, 3, 4]
model.linexp = LinearExpression(constant=0,
                                linear_coefs=model.coefs,
                                linear_vars=[model.x[i] for i in model.N])
def caprule(m):
    return m.linexp <= 50
model.capme = pyo.Constraint(rule=caprule)
model.obj = pyo.Objective(expr=model.linexp, sense=maximize)
results = SolverFactory('glpk', executable='/usr/bin/glpsol').solve(model)
results.write()
And this is the output:
# ==========================================================
# = Solver Results =
# ==========================================================
# ----------------------------------------------------------
# Problem Information
# ----------------------------------------------------------
Problem:
- Name: unknown
Lower bound: 50.0
Upper bound: 50.0
Number of objectives: 1
Number of constraints: 2
Number of variables: 5
Number of nonzeros: 5
Sense: maximize
# ----------------------------------------------------------
# Solver Information
# ----------------------------------------------------------
Solver:
- Status: ok
Termination condition: optimal
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Error rc: 0
Time: 0.09727835655212402
# ----------------------------------------------------------
# Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
number of solutions displayed: 0
It says the number of solutions is 0, and yet it does solve the problem:
print(list(model.x[i]() for i in model.N))
Will output this:
[1.0, 1.0, 1.0, 1.0]
Which is a correct answer to the problem. What am I missing?
The interface between pyomo and glpk sometimes (always?) seems to return 0 for the number of solutions. I'm assuming there is some issue with the generalized interface between the pyomo core module and the various solvers that it interfaces with. When I use glpk and cbc solvers on this, it reports the number of solutions as zero. Perhaps those solvers don't fill that data element in the generalized interface. Somebody w/ more experience in the data glob returned from the solver may know precisely. That said, the main thing to look at is the termination condition, which I've found to be always accurate. It reports optimal.
I suspect that you have some mixed code from another model in your example. When I fix a typo or two (you missed the pyo prefix on a few things), it solves fine and gives the correct objective value as 9. I'm not sure where 50 came from in your output.
(slightly cleaned up) Code:
import pyomo.environ as pyo
from pyomo.core.expr.numeric_expr import LinearExpression

model = pyo.ConcreteModel()
model.nVars = pyo.Param(initialize=4)
model.N = pyo.RangeSet(model.nVars)
model.x = pyo.Var(model.N, within=pyo.Binary)
model.coefs = [1, 1, 3, 4]
model.linexp = LinearExpression(constant=0,
                                linear_coefs=model.coefs,
                                linear_vars=[model.x[i] for i in model.N])
def caprule(m):
    return m.linexp <= 50
model.capme = pyo.Constraint(rule=caprule)
model.obj = pyo.Objective(expr=model.linexp, sense=pyo.maximize)
solver = pyo.SolverFactory('glpk') #, executable='/usr/bin/glpsol').solve(model)
results = solver.solve(model)
print(results)
model.obj.pprint()
model.obj.display()
Output:
Problem:
- Name: unknown
Lower bound: 9.0
Upper bound: 9.0
Number of objectives: 1
Number of constraints: 2
Number of variables: 5
Number of nonzeros: 5
Sense: maximize
Solver:
- Status: ok
Termination condition: optimal
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Error rc: 0
Time: 0.00797891616821289
Solution:
- number of solutions: 0
number of solutions displayed: 0
obj : Size=1, Index=None, Active=True
Key : Active : Sense : Expression
None : True : maximize : x[1] + x[2] + 3*x[3] + 4*x[4]
obj : Size=1, Index=None, Active=True
Key : Active : Value
None : True : 9.0

Troubleshooting a Loop in R

I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000) {
  tryCatch({
    url_i <- paste0(part1, i+1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
In this loop, I added a line of code that prints which iteration the loop is currently on and the cumulative number of results the loop has scraped so far. I also added a tryCatch so that, in the worst-case scenario, the loop skips an iteration that produces an error; however, I was not anticipating that happening very often.
However, I am noticing that this loop produces errors and skips iterations far more often than I expected. My guess was that between certain times the API might simply have recorded no comments containing the keyword, thus adding no new results to the list. For example (left column is the iteration number, right column is the cumulative results):
iteration_number cumulative_results
1 13432 5673
2 13433 5673
3 13434 5673
But in my case, can someone please help me understand why this loop is producing so many errors, resulting in so many skipped iterations?
Thank you!
Your issue is almost certainly caused by rate limits imposed on your requests by the Pushshift API. When doing scraping tasks like this, the server may track how many requests a client has made within a certain time interval and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting and is a way for websites to limit abuse, charge customers for usage, or both.
Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (Also: See their /meta endpoint).
I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:
}, error = function(e){})
to:
}, error = function(e){ message(e) })
After a while, I got output like:
HTTP error 429
The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is add a call to Sys.sleep(1) into your for loop, where 1 is the number of seconds to pause execution.
I modified your script as follows:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000) {
  tryCatch({
    Sys.sleep(1) # Changed. Change the value as needed.
    url_i <- paste0(part1, i+1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){
    message(e) # Changed. Prints to the console on error.
  })
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Note that you may have to use a number larger than 1, and your script will accordingly take longer to run.
As the errors seem to come randomly, depending on API availability, you could retry, setting a maximum number of attempts:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
maxattempt <- 3
for (i in 1:1000) {
  attempt <- 1
  r_i <- NULL
  while (is.null(r_i) && attempt <= maxattempt) {
    if (attempt > 1) { print(paste("retry: i =", i)) }
    attempt <- attempt + 1
    url_i <- paste0(part1, i+1, part2, i, part3)
    try({
      r_i <- data.frame(fromJSON(url_i))
      results[[i]] <- r_i
      myvec_i <- sapply(results, NROW)
      print(c(i, sum(myvec_i)))
    })
  }
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Result:
[1] 1 250
[1] 2 498
[1] 3 748
[1] 4 997
[1] 5 1247
[1] 6 1497
[1] 7 1747
[1] 8 1997
[1] 9 2247
[1] 10 2497
...
[1] 416 101527
[1] 417 101776
Error in open.connection(con, "rb") :
Timeout was reached: [api.pushshift.io] Operation timed out after 21678399 milliseconds with 0 out of 0 bytes received
[1] "retry: i = 418"
[1] 418 102026
[1] 419 102276
...
In the example above, a retry occurred for i = 418.
Note that for i in 1:500 the results are already 250 MB in size, meaning you could expect around 25 GB for i in 1:50000: do you have enough RAM?

Error in eval(expr, envir, enclos) while using Predict function

When I try to run predict() on the dataset, it keeps giving me this error:
Error in eval(expr, envir, enclos) : object 'LoanRange' not found
Here is part of the dataset:
LoanRange Loan.Type N WAFICO WALTV WAOrigRev WAPTValue
1 0-99999 Conventional 109 722.5216 63.55385 6068.239 0.6031879
2 0-99999 FHA 30 696.6348 80.00100 7129.650 0.5623650
3 0-99999 VA 13 698.6986 74.40525 7838.894 0.4892977
4 100000-149999 Conventional 860 731.2333 68.25817 6438.330 0.5962638
5 100000-149999 FHA 285 673.2256 82.42225 8145.068 0.5211495
6 100000-149999 VA 125 704.1686 87.71306 8911.461 0.5020074
7 150000-199999 Conventional 1291 738.7164 70.08944 8125.979 0.6045117
8 150000-199999 FHA 403 672.0891 84.65318 10112.192 0.5199632
9 150000-199999 VA 195 694.1885 90.77495 10909.393 0.5250807
10 200000-249999 Conventional 1162 740.8614 70.65027 8832.563 0.6111419
11 200000-249999 FHA 348 667.6291 85.13457 11013.856 0.5374226
12 200000-249999 VA 221 702.9796 91.76759 11753.642 0.5078298
13 250000-299999 Conventional 948 742.0405 72.22742 9903.160 0.6106858
Following is the code used for predicting the count data N, after determining overdispersion:
model2=glm(N~Loan.Type+WAFICO+WALTV+WAOrigRev+WAPTValue, family=quasipoisson(link = "log"), data = DF)
summary(model2)
This is what I have done to create a sequence of counts and use the predict function:
countaxis <- seq(0, 1500, 150)
Y <- predict(model2, list(N=countaxis, type = "response"))
At this step, I get the error -
Error in eval(expr, envir, enclos) : object 'LoanRange' not found
Can someone please point out where the problem is here?
Think about what exactly you are trying to predict. You are providing the predict function values of N (via countaxis), but in fact the way you set up your model, N is your response variable and the remaining variables are the predictors. That's why R is asking for LoanRange. It actually needs values for LoanRange, Loan.Type, ..., WAPTValue in order to predict N. So you need to feed predict inputs that let the model try to predict N.
For example, you could do something like this:
# create some fake data to predict N
newdata1 = data.frame(rbind(c("0-99999", "Conventional", 722.5216, 63.55385, 6068.239, 0.6031879),
c("150000-199999", "VA", 12.5216, 3.55385, 60.239, 0.0031879)))
colnames(newdata1) = c("LoanRange" ,"Loan.Type", "WAFICO" ,"WALTV" , "WAOrigRev" ,"WAPTValue")
# ensure that numeric variables are indeed numeric and not factors
newdata1$WAFICO = as.numeric(as.character(newdata1$WAFICO))
newdata1$WALTV = as.numeric(as.character(newdata1$WALTV))
newdata1$WAPTValue = as.numeric(as.character(newdata1$WAPTValue))
newdata1$WAOrigRev = as.numeric(as.character(newdata1$WAOrigRev))
# make predictions - this will output values of N
predict(model2, newdata = newdata1, type = "response")

Egg dropping in worst case

I have been trying to write an algorithm to compute the maximum number of trials required in the worst case for the egg dropping problem. Here is my Python code:
def eggDrop(n,k):
    eggFloor=[ [0 for i in range(k+1) ] ]* (n+1)
    for i in range(1, n+1):
        eggFloor[i][1] = 1
        eggFloor[i][0] = 0
    for j in range(1, k+1):
        eggFloor[1][j] = j
    for i in range(2, n+1):
        for j in range(2, k+1):
            eggFloor[i][j] = 'infinity'
            for x in range(1, j + 1):
                res = 1 + max(eggFloor[i-1][x-1], eggFloor[i][j-x])
                if res < eggFloor[i][j]:
                    eggFloor[i][j] = res
    return eggFloor[n][k]

print eggDrop(2, 100)
The code outputs 7 for 2 eggs and 100 floors, but the answer should be 14. I don't know what mistake I have made in the code. What is the problem?
The problem is in this line:
eggFloor=[ [0 for i in range(k+1) ] ]* (n+1)
You want this to create a list containing (n+1) lists of (k+1) zeroes. What the * (n+1) does is slightly different - it creates a list containing (n+1) copies of the same list.
This is an important distinction - because when you start modifying entries in the list - say,
eggFloor[i][1] = 1
this actually changes element [1] of all of the lists, not just the ith one.
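A quick snippet makes the sharing visible (a small 2×3 grid for brevity):

```python
# "*" on an outer list repeats references to the same inner list
grid = [[0 for i in range(3)]] * 2
grid[0][0] = 1
print(grid)  # [[1, 0, 0], [1, 0, 0]] - both rows changed

# a nested comprehension builds a fresh inner list for each row
grid = [[0 for i in range(3)] for j in range(2)]
grid[0][0] = 1
print(grid)  # [[1, 0, 0], [0, 0, 0]] - only the first row changed
```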
To instead create separate lists that can be modified independently, you want something like:
eggFloor=[ [0 for i in range(k+1) ] for j in range(n+1) ]
With this modification, the program returns 14 as expected.
(To debug this, it might have been a good idea to write a function to print out the eggFloor array and display it at various points in your program, so you could compare it with what you were expecting. It would soon become pretty clear what was going on!)
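For reference, here is the original function with the comprehension fix applied and the print moved onto its own line. The 'infinity' string is also swapped for float('inf'): comparing an int against a string happened to work in Python 2 but raises a TypeError in Python 3.

```python
def eggDrop(n, k):
    # eggFloor[i][j] = worst-case number of trials with i eggs and j floors
    eggFloor = [[0 for i in range(k + 1)] for j in range(n + 1)]
    for i in range(1, n + 1):
        eggFloor[i][1] = 1   # one floor: one trial
        eggFloor[i][0] = 0   # zero floors: no trials
    for j in range(1, k + 1):
        eggFloor[1][j] = j   # one egg: floor-by-floor search
    for i in range(2, n + 1):
        for j in range(2, k + 1):
            eggFloor[i][j] = float('inf')
            for x in range(1, j + 1):  # drop from floor x: egg breaks or survives
                res = 1 + max(eggFloor[i - 1][x - 1], eggFloor[i][j - x])
                if res < eggFloor[i][j]:
                    eggFloor[i][j] = res
    return eggFloor[n][k]

print(eggDrop(2, 100))  # 14
```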