Optimizing web parsing with BeautifulSoup: are there functions I don't know about? - html

I have a multiprocessing loop that uses urllib and BeautifulSoup to scan webpages for data and then runs some if statements. Each process takes about 3 seconds to run: about 2.95 of those seconds are spent getting the HTML, and the remainder is spent running the ifs and cutting out the very small amount of data that I need.
Each webpage consists of about 622 lines, roughly 125,000 characters. I only need two or three lines, 200-300 characters. I am looking for a way to shorten the time this loop runs. Is there a function that will let me skip the first 500 lines of HTML? Does anyone have other recommendations? For now I am using tags and attributes to determine what info I need, but if I could just say 'I want to read only lines 500-700', wouldn't that be faster?
Given that the entire pool of processes takes nearly three minutes to run, any amount of time I can shave off will be helpful to me. Here's what I am using so far to pick apart the HTML.
import urllib.request
import bs4 as bs

# l[y] is the URL handled by this process
source = urllib.request.urlopen(l[y]).read()
soup = bs.BeautifulSoup(source, 'lxml')
for row in soup.html.body.find_all('table', attrs={'class': 'table-1'}):
    for i, j in zip(row.find_all('a'), row.find_all('td', attrs={'width': '130', 'align': 'right'})):
        pass  # run ifs
Thank you for reading.

Related

why dont i get all multiples of ten when using this line of code?

Can anyone tell me why I'm not printing all multiples of ten? It seems to skip over chunks at random.
if pygame.time.get_ticks() % 10 == 0:
    print(pygame.time.get_ticks())
The docs here say: 'Return the number of milliseconds since pygame.init() was called.'
You check whether the elapsed milliseconds returned by get_ticks end in a 0, and if so you print.
Your game logic is unlikely to finish in under a millisecond, so each call returns a number whose last digit moves around and has roughly a 1/10 chance of being 0, which is when you print. Even if you made this call at regular intervals shorter than 10 msec, say every 3 msec, you would only get a return value ending in 0 every 30 msec. Your calls will be more erratically timed than that, but statistically about 1 call in 10 will end in a 0. So there will be significant gaps between prints even if the check is called reasonably frequently.
By the way, you should not call get_ticks() twice. Call it once and save the result in a temporary variable, which you then test and print. But that is beside the point.
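The "every 3 msec" case above can be checked with a small sketch in plain Python, without pygame. The function names are hypothetical, and the second helper is only one possible way to react once per 10 msec interval; it is not taken from the answer.

```python
def ticks_ending_in_zero(step_ms, total_ms):
    """Sample a clock every step_ms and keep the readings divisible by 10."""
    return [t for t in range(0, total_ms + 1, step_ms) if t % 10 == 0]


def crossed_ten_ms_boundary(previous_ticks, current_ticks):
    """True if at least one multiple of 10 lies in (previous_ticks, current_ticks]."""
    return current_ticks // 10 > previous_ticks // 10


# Sampling every 3 ms only lands on a multiple of 10 every 30 ms.
print(ticks_ending_in_zero(3, 120))     # [0, 30, 60, 90, 120]

# Comparing against the previous reading catches the multiple of 10
# that was skipped between two calls.
print(crossed_ten_ms_boundary(27, 31))  # True: 30 was passed
print(crossed_ten_ms_boundary(31, 34))  # False: no multiple of 10 in between
```

In a game loop, the previous reading would be stored after each frame and compared with the new value from a single get_ticks() call.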

How do I wait for a random amount of time before executing the next action in Puppeteer?

I would love to be able to wait for a random amount of time (let's say a number between 5-12 seconds, chosen at random each time) before executing my next action in Puppeteer, in order to make the behaviour seem more authentic/real world user-like.
I'm aware of how to do it in plain Javascript (as detailed in the Mozilla docs here), but can't seem to get it working in Puppeteer using the waitFor call (which I assume is what I'm supposed to use?).
Any help would be greatly appreciated! :)
You can use vanilla JS to wait a random 5-12 seconds between actions.
await page.waitFor((Math.floor(Math.random() * (12 - 5 + 1)) + 5) * 1000)
Where:
5 is the start number
12 is the end number
(12 - 5 + 1) is the count of possible whole-second values, so the result never exceeds 12
1000 converts seconds to milliseconds
(PS: However, if your question is about waiting 5-12 seconds randomly before every action, then you should have a class with a wrapper, which is a different issue until you update your question.)
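The bounds of a random whole-second delay are easy to get wrong, so here is the arithmetic as a small Python sketch (not Puppeteer code). The function name and the parameter r, which stands in for the value of Math.random(), are assumptions for illustration.

```python
import math


def random_delay_ms(r, low_s=5, high_s=12):
    """Map r in [0, 1) to a whole number of seconds in [low_s, high_s], in ms."""
    span = high_s - low_s + 1          # number of possible values: 8 for 5..12
    return (math.floor(r * span) + low_s) * 1000


print(random_delay_ms(0.0))       # 5000, the smallest possible delay
print(random_delay_ms(0.999999))  # 12000, the largest possible delay

# Multiplying by the end number instead of the span overshoots:
# floor(0.999999 * 12) + 5 is 16, not 12.
print(math.floor(0.999999 * 12) + 5)  # 16
```

Scaling by the size of the range rather than by the upper bound is what keeps the largest value at 12 seconds.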

Is it possible to use topic modeling for a single document

Is it rational to use topic modelling for a single document? To be more precise, is it mathematically okay to use the LDA Gibbs method for a single document? If so, what should the values of k and seed be?
Also, what is the role of k and seed for a single document as well as for a large set of documents?
k and seed are parameters of the function LDA (in RStudio).
Also let me know if I am wrong anywhere in this question.
To tell you about my project, I am trying to find the main topics that can be used to represent the content of a single document.
I have already tried k = 4, 7, 10. Part of my question is also which value of k would be better.
It really depends on the document. A document could be a 700-page book or a single sentence. Your k (I think you mean the number of topics?) is also going to depend on the document. If your document is the entire Wikipedia corpus, 1500 topics might be appropriate; if your document is a list of comments about movies, then 20 topics might be appropriate. Optimizing that number can be done using the elbow method; check out 17.
The seed can be pretty much arbitrary. It's just a lever so your results can be replicated, and the model runs if you leave it blank. I would say try it, check your coherence, and eyeball your topics; if it looks right, then sure, you can train an LDA on one document. A single document should process pretty fast.
Here is an example in Python of using the seed parameter. My data set is 1,048,575 rows; note the seed is much higher:
ldamallet = gensim.models.wrappers.LdaMallet(
    mallet_path, corpus=bow_corpus, num_topics=20, alpha=.1,
    id2word=dictionary, iterations=1000, random_seed=569356958)

How to do a base conversion with Little Man Computer?

I need to convert a decimal number to a base between 2 and 9 using Little Man Computer. How do I proceed?
I believe successive division is the best method. As I see it, I need to write code that divides two numbers, then saves the integer quotient for the next division, and stores all of the remainders in an array of indefinite size, but I've been struggling with the division code for hours now. I tried searching for code that divides two numbers, but everything I tried has mistakes or doesn't work. I'm stuck at the easiest part of the problem, and I can't imagine how I'm ever going to write self-modifying code that manages an array of ever-increasing line positions and backtracks through it at the end to extract all the remainders. I'm at a loss here; any help would be appreciated.
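The successive-division idea described above can be sketched in Python before translating it to Little Man Computer instructions. This is not LMC code. The function names are hypothetical, and the division deliberately uses only comparison, subtraction and counting, on the assumption that those map most directly onto the LMC's subtract and branch instructions.

```python
def divide_by_subtraction(dividend, divisor):
    """Return (quotient, remainder) using only subtraction and counting."""
    quotient = 0
    while dividend >= divisor:
        dividend -= divisor
        quotient += 1
    return quotient, dividend


def to_base(number, base):
    """Return the digits of a non-negative integer in the given base (2 to 9)."""
    if not 2 <= base <= 9:
        raise ValueError("base must be between 2 and 9")
    remainders = []
    while True:
        number, remainder = divide_by_subtraction(number, base)
        remainders.append(remainder)   # least significant digit comes first
        if number == 0:
            break
    return remainders[::-1]            # read the remainders backwards


print(to_base(10, 2))   # [1, 0, 1, 0]
print(to_base(255, 9))  # [3, 1, 3]
```

The list plays the role of the array of remainders, and reversing it corresponds to backtracking through the stored values at the end.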

How can I get better randomization in my sql query?

I am attempting to get a random bearing, from 0 to 359.9.
SET bearing = FLOOR((RAND() * 359.9));
I may call the procedure that runs this request within the same while loop, immediately one after the next. Unfortunately, the randomization seems to be anything but unique. e.g.
Results
358.07
359.15
357.85
I understand how randomization works, and I know because of my quick calls to the same function, the ticks used to generate the random number are very close to one another.
In any other situation, I would wait a few milliseconds in between calls or reinit my Random object (such as in C#), which would greatly vary my randomness. However, I don't want to wait in this situation.
How can I increase randomness without waiting?
"I understand how randomization works, and I know because of my quick calls to the same function, the ticks used to generate the random number are very close to one another."
That's not quite right. Where folks get into trouble is when they re-seed a random number generator repeatedly with the current time, and because they do it very quickly the time is the same and they end up re-seeding the RNG with the same seed. This results in the RNG spitting out the same sequence of numbers each time it is re-seeded.
Importantly, by "the same" I mean exactly the same. An RNG is either going to return an identical sequence or a completely different one. A "close" seed won't result in a "similar" sequence. You will either get an identical sequence or a totally different one.
The correct solution to this is not to stagger your re-seeds, but actually to stop re-seeding the RNG. You only need to seed an RNG once.
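The re-seeding behaviour can be illustrated with a short Python sketch. It uses Python's own generator rather than MySQL's or C#'s, and the timestamp value and function name are made up for illustration.

```python
import random


def draws_after_seeding(seed, count=3):
    """Create a fresh generator from seed and return its first few values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


same_tick = 1_700_000_000  # hypothetical clock reading used as the seed

first = draws_after_seeding(same_tick)
second = draws_after_seeding(same_tick)         # re-seeded within the same tick
neighbour = draws_after_seeding(same_tick + 1)  # a "close" seed

print(first == second)     # True: same seed, identical sequence
print(first == neighbour)  # False: a close seed gives an unrelated sequence

# Seeding once and continuing to draw keeps producing new values.
rng = random.Random(same_tick)
stream = [rng.random() for _ in range(6)]
```

Seeding once and reusing the generator avoids the repeated sequences that come from re-seeding with an unchanged clock value.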
Anyways, that is neither here nor there. MySQL's RAND() function does not require explicit seeding. When you call RAND() without arguments the seeding is taken care of for you meaning you can call it repeatedly without issue. There's no time-based limitation with how often you can call it.
Actually your SQL looks fine as is. There's something missing from your post, in fact. Since you're calling FLOOR() the result you get should always be an integer. There's no way you'll get a fractional result from that assignment. You should see integral results like this:
187
274
89
345
That's what I got from running SELECT FLOOR(RAND() * 359.9) repeatedly.
Also, for what it's worth, RAND() will never return 1.0. Its range is 0 ≤ RAND() < 1.0. You are safe using 360 vs. 359.9:
SET bearing = FLOOR(RAND() * 360);
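The range argument can be checked with a small Python sketch, where r stands in for the value returned by RAND(). The function names are hypothetical, and the tenths variant is an assumption about what "0 to 359.9" was meant to allow; it is not part of the answer above.

```python
import math
import random


def bearing_whole_degrees(r):
    """Map r in [0, 1) to an integer bearing from 0 to 359."""
    return math.floor(r * 360)


def bearing_tenths(r):
    """Map r in [0, 1) to a bearing from 0.0 to 359.9 in steps of 0.1."""
    return math.floor(r * 3600) / 10


print(bearing_whole_degrees(0.0))       # 0
print(bearing_whole_degrees(0.999999))  # 359, never 360 because r < 1
print(bearing_tenths(0.999999))         # 359.9

# Typical use with a real random source:
sample = bearing_whole_degrees(random.random())
```

Because the input is always strictly below 1, multiplying by 360 and flooring can never produce 360 itself.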