Perl HTML::TableExtract or other method for multi-page table - HTML

I'm trying to extract elements from a table. I have successfully used get and HTML::TableExtract to pull elements out of the table. The problem is that the table spans multiple pages, navigated with an arrow button that shows the additional pages. How would I extract these other pages, given that they are not new links but, I think, generated with JS or similar?
Specifically, I am trying to extract the table under "Data for this Data Range" at:
http://ycharts.com/companies/GOOG/pe_ratio#series=type:company,id:GOOG,calc:pe_ratio,,id:AAPL,type:company,calc:pe_ratio,,id:AMZN,type:company,calc:pe_ratio&zoom=3&startDate=&endDate=&format=real&recessions=false
See how there is the "Viewing x of 45" text and the First, Previous, Next, Last buttons. The rest of the table elements can be viewed with Next; how would I extract these in Perl?
Update:
Hi Simbabque, thanks for the response.
I see that clicking Next calls:
ng-click="getHistoricalData(historicalData.currentPage+1)"
Is there a way I can call this method? I tried to use click, but it is not bound to a name (JS?).
I am trying WWW::Mechanize::Firefox now, but I feel like there must be an easier way to use regular Mech to call the function and re-read the page.

The website builds up the tables using AJAX requests. Those are a little harder to parse. You can use WWW::Mechanize to fetch the initial page and then hit the AJAX endpoints for the table data. It helps you keep track of cookies and the like automatically.
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://ycharts.com/companies/GOOG/pe_ratio#series=type:company,id:GOOG,calc:pe_ratio,,id:AAPL,type:company,calc:pe_ratio,,id:AMZN,type:company,calc:pe_ratio&zoom=3&startDate=&endDate=&format=real&recessions=false');

# Replay the AJAX call that the Next button triggers
my $response = $mech->post(
    'http://ycharts.com/companies/GOOG/pe_ratio/data_ajax',
    {
        startDate => '1/1/1962',
        endDate   => '12/3/2013',
        pageNum   => 4,
    }
);

if ( $response->is_success ) {
    print $response->decoded_content;    # or whatever
}
else {
    die $response->status_line;
}
This is just a basic example and will not work as-is: it gives a 403 Forbidden. Probably more data is required. Use Firebug or a similar tool to inspect what is happening. For example, there's another call to http://ping.chartbeat.net/ping?h=ycharts.com&p=%2Fcompanies%2FGOOG%2Fpe_ratio&u=o3m6snxteynby1b8&d=ycharts.com&g=20054&n=1&f=00001&c=10.81&x=200&y=1812&o=1663&w=658&j=30&R=0&W=1&I=0&E=109&e=6&b=1903&t=usmc0fjfd1j0h87g&V=16&_ happening automatically every now and again, with varying parameters. That is most likely required to keep the session going.
This page is pretty sophisticated. This might not be the best approach.
You could also try to use WWW::Mechanize::Firefox or even Selenium to remote-operate a browser. That will be better suited as it takes care of all the AJAX stuff that is happening.
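If you go the Selenium route, here is a minimal sketch using Selenium's Python bindings; the CSS selector for the Next button is an assumption based on the ng-click snippet above, so inspect the page for the real markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
driver.get('http://ycharts.com/companies/GOOG/pe_ratio')

for _ in range(5):  # walk the first few pages of the table
    html = driver.page_source  # hand this to HTML::TableExtract-style parsing
    # Hypothetical selector: the element whose ng-click calls getHistoricalData
    driver.find_element(By.CSS_SELECTOR, '[ng-click*="getHistoricalData"]').click()
    time.sleep(2)  # crude wait for the AJAX-refreshed table

driver.quit()

Clicking the element runs the page's own JavaScript, so there is no need to reverse-engineer the AJAX call at all.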
Or you could look for a public API that just hands over that data voluntarily. I bet there is one around... or just pay for a ycharts pro account and hit the download button. ;-)

Pug with JavaScript

If I have the following piece of code:
script.
  function getProductParams(params) {
    return JSON.parse(localStorage.getItem(params));
  }
each product in products
  - var getVariable = getProductParams("ids");
This piece of code doesn't work; I'm guessing that the - code runs on the server side, while script. runs on the client?
How can I access a variable from localStorage in Pug and use it for comparison with variables received from the server?
I want to make something like this:
script.
  function getProductParams(params) {
    return JSON.parse(localStorage.getItem(params));
  }
each product in products
  - var getVariable = getProductParams("ids");
  - if (getVariable.includes(#{product._id})) {
    // create the element in html
  - } else {
    // call next product and compare if we have it
  - }
The simple answer is that you can't do this. localStorage is exactly that - local. Your server can't directly access it unless you explicitly pass it back using some other format/method.
What your server can access easily is the cookie. If you store the data in a cookie and use cookie-parser, then you can handle this list easily. This use case is exactly why cookies exist.
Another option would be to pass the data to the server in the body of a POST request, but your question doesn't provide a lot of information as to what exactly you're doing and whether this is an option or not.
It would also be much easier for you to accomplish all of the sorting and filtering in your route instead of the Pug template. That way you have full access to the cookie, request body, etc. directly, and you don't have to pass all that down to the template too.
Then, when your template gets the list, it's a really simple matter to render it:
each product in products
  //create the element in html

Obtain list of My Places from Google Maps

I am trying to obtain the list of places the user has saved on Google Maps. Now I know there isn't an API for this (for whatever reason), but I saw here:
"My Places" Google Maps API
that apparently there used to be a way to obtain a URL for the list, but it does not seem to work with my list of places.
E.g.
https://www.google.com/maps/#46.889424,0.1194148,6z/data=!4m3!11m2!2s1KbZtik1IdXyNhwfXEb3P9vaZvzU!3e3
Does not seem to work if I append &output=kml or &output=json
I created this list on Google Maps, then hit share and obtained that link.
I even tried parsing the resulting HTML, but it seems everything is handled by some JavaScript engine and I can't find any reference to Google IDs there. I don't even know how they handle clicks!
Any help? There must be a way to retrieve this information programmatically!
EDIT:
I managed to get something working by visiting the shared link, then processing the HTML and storing the window.APP_INITIALIZATION_STATE variable. I then convert it to a JavaScript array and loop over it. Deep inside the array/map structure, I managed to get the Google name and Google place ID out of that array. That works to a point, but for lists over 20 items long, Google only returns the first 20 and waits for the user to 'scroll down' to fetch the next 20. That scroll triggers another call to get the next 20 results, which looks a bit like:
https://www.google.com/search?tbm=map&fp=1&authuser=0&hl=en&gl=nl&pb=!4m8!1m3!1d54065472.4384380........
I can see the original feature ID included at the end of the URL, but I have no idea how to construct this URL in full to get the next 20 items... Any ideas?
Your saved places list actually has what you could call a feature ID attribute. This isn't a common practice and Google frowns upon this technique, but take a look at this URL:
https://www.google.com/maps/preview/entity?authuser=0&hl=en&gl=us&pb=!1m10!1s0x0%3A0x3743ae09a161976b!3m8!1m3!1d14318.72623152007!2d-98.2296425!3d26.2070353!3m2!1i1024!2i768!4f13.1!12m3!2m2!1i392!2i106!13m57!2m2!1i203!2i100!3m2!2i4!5b1!6m6!1m2!1i86!2i86!1m2!1i408!2i200!7m42!1m3!1e1!2b0!3e3!1m3!1e2!2b1!3e2!1m3!1e2!2b0!3e3!1m3!1e3!2b0!3e3!1m3!1e8!2b0!3e3!1m3!1e3!2b1!3e2!1m3!1e9!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e10!2b0!3e4!2b1!4b1!9b0!14m3!1snyc5W-WeHY3r5gLwkoRI!7e81!15i10112!15m19!2b1!5m4!2b1!3b1!5b1!6b1!10m1!8e3!14m1!3b1!17b1!24b1!25b1!26b1!30m1!2b1!36b1!52b1!53b1!21m28!1m6!1m2!1i0!2i0!2m2!1i458!2i768!1m6!1m2!1i974!2i0!2m2!1i1024!2i768!1m6!1m2!1i0!2i0!2m2!1i1024!2i20!1m6!1m2!1i0!2i748!2m2!1i1024!2i768!22m1!1e81!29m0!30m1!3b1
Highlighted is the feature ID from the link you posted:
https://www.google.com/maps/#46.889424,0.1194148,6z/data=!4m3!11m2!2s1KbZtik1IdXyNhwfXEb3P9vaZvzU!3e3
Along with other Maps parameters. When you hit that link, you're actually manually triggering the same callback that Google's own Maps scripts use to parse the data and feed it back to the Maps UI. If you look at array item 2, or {c:..}, you'll find a stringified array with the contents of your list. Depending on the programming language you're using, all it takes is a little tweaking (find/replace, loop through, lint and trim, etc.) of this array and you can pull your results. The cool thing is that if you add or remove a place, the next time you hit that endpoint the result is updated in real time.
Some people may call it a "hack", but it gets the job done. :)
Hope this points you in a useful direction in case you haven't found a solution; give it a shot.
Note that the URL has to be pasted in its entirety; SO truncated the hyperlink. Copy and paste the whole thing in one shot and Google will return a text file with the arrays. In my case I curl the URLs I need and parse the returned strings as needed to pull data from Google where their API has limitations. Just a tip. :)
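If you'd rather script that step than use curl, here is a minimal Python sketch of the same fetch-and-trim idea; the prefix stripping is an assumption based on the junk characters Google prepends to these responses:

import json
import requests

# Paste the full /maps/preview/entity URL (with the complete pb parameter) here.
url = "https://www.google.com/maps/preview/entity?authuser=0&hl=en&gl=us&pb=..."

resp = requests.get(url, headers={"Referer": "https://www.google.com/"})
text = resp.text
# Strip everything before the first bracket, then decode the array.
data = json.loads(text[text.index("["):])
print(data[2])  # per the answer above, item 2 should hold the stringified list contents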
Also check Joel's answer; he did some research and refined some of the following information.
Pagination
You can use this tool to decode the pb parameter. PB stands for protocol buffer (protobuf), and Google uses its own kind of it for Maps. You can find different decoders for this by googling.
In my case, the pagination was done via one parameter (8iX0). It always seems to come with another, similar parameter (7i20), but I don't know what that one does. I can't yet confirm that this is always the case, but from my experience you're basically looking for two integers that are 20/40/60 etc. apart.
Here's what this looks like for me:
page 2 (7i20, 8i20)
page 3 (7i20, 8i40)
page 4 (7i20, 8i60)
From this information, I tried 7i20 8i00 for page 1, and that seemed to work. For lists with >100 items, it just continues like that (8i120, 8i140, etc.).
Here's a code snippet in Python (quick & dirty). Make sure to add (long) delays if your list has many pages, as you will eventually get rate-limited by captchas if you don't. Notice the 8i%s0 in the URL; make sure to put the %s back in when you paste your pb-block.
url = "https://www.google.com:443/search?tbm=map&pb=!7i20!8i%s0!..."
headers = {"Referer": "https://www.google.com/"}
def fetch_stops_from_maps():
new_results = -1
page = 0
results = []
while new_results != 0:
new_results = 0
x = requests.get(url % page, headers=headers)
txt = html.unescape(x.text)
txt = txt.split("\n")[1]
results = re.findall(r"\[null,null,[0-9]{1,2}\.[0-9]{4,15},[0-9]{1,2}\.[0-9]{4,15}]", txt)
print(len(results))
for cord in results:
# curr = the description you can manually type in when saving
curr = txt.split(cord)[1].split("\"]]")[0]
curr = curr[curr.rindex(",\"") + 2:]
cords = str(cord).split(",")
lat = cords[2]
lon = cords[3][:-1]
results.append(s)
new_results += 1
page += 2
Actually getting the correct URL
Getting the correct URL currently seems to be the hardest part of doing this, and I have not fully figured it out either. However, for my use case this is not really important, so I extracted the correct pb-block once and called it a day.
As explained in the other answers, the ID of the list is visible in the basic URL (here, the 2sXX...) when you navigate to the list in your browser. It seems to usually be 24-32 (?) characters long.
.../maps/<coords>/data=!4m3!11m2!2sXXXX...XXXX!3e3
If you have this ID, you can put it into an existing protobuf block and it may work (I have only tested this with 3 different lists, all created by the same account, so this theory is far from proven).
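In code, that swap is nothing more than a string replacement. A sketch, where pb_block.txt and the replacement ID are placeholders:

# pb_block.txt holds a pb parameter captured once, e.g. via Burp as described below.
pb = open("pb_block.txt").read()
# Swap the list ID from the share link for the ID of the list you want.
pb = pb.replace("1KbZtik1IdXyNhwfXEb3P9vaZvzU", "<other-list-id>")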
Now, how do you get the block? I would just share the one I have, but because I only understand parts of what it does, I fear it may contain some personal info. Instead, I will share my process of getting it. For this I use Burp Suite. It's a program mainly used for web security testing and it has a free community edition; for our use case it is the perfect tool, because it lets you easily tinker with requests, change small parts, send the request again and immediately see whether your changes changed the response. That said, for extracting the pb-block you should be able to use any program that can intercept browser traffic.
Here's the basic rundown with Burp:
* From GMaps, share a list that has >20 items (this is important) and copy the public link
* In Burp, go to the tab "Proxy", make sure "Intercept" is off and click "Open browser" to open the integrated Chromium browser
* There, paste the link and wait until Maps has loaded completely
* In Burp, turn "Intercept" on, then in Google Maps, scroll down in the list until it starts loading new results (always blocks of 20)
* Burp has now intercepted all requests the browser made since you turned intercepting on. Click "Forward" and go through the requests until you see one in the format
GET /search?tbm=map&authuser=0&hl=de&gl=de&pb=!7i20....
This is what you're looking for.
Optionally, you can now right-click into the request text and click "Send to Repeater", then switch to the Repeater tab. There you can edit the request and send it again, seeing the response immediately. For example, after removing the authuser, hl, gl, q, ech and psi URL parameters, the request still works flawlessly. If you remove the tch=1 parameter, the response you get will be in a more human-readable format.
In the request text you should now be able to simply search for the list ID you got from the link earlier and replace it with the ID of another list (the search bar is at the bottom in Burp). As I said, this worked for me, but it may be that the pb-block contains some additional metadata that makes lists from different Google accounts, or different types of lists, incompatible with specific pb-blocks. Just a theory though. Let me know how it goes!
Further automating
I have theorised that one could automate getting the pb-block using requests-html, because it can fully load HTML pages, but it isn't maintained anymore. Another option (probably the better one) is Selenium Wire, as you should be able to load the page and intercept the requests, like we did in Burp; see the sketch below. Seems like a whole lot of work tho :D
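A rough sketch of the Selenium Wire idea (untested; assumes Chrome and the selenium-wire package):

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://www.google.com/maps/...")  # the shared list link
# Scroll the list here (e.g. via driver.execute_script) so Maps fires the pagination request.
for request in driver.requests:
    if request.response and "search?tbm=map" in request.url:
        print(request.url)  # this URL contains the pb-block we're after
driver.quit()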
The only API I was able to find was this:
https://www.google.com/bookmarks/?output=xml
Used in a browser, you would have to first log in through Google's OAuth. It would then return your saved places. I'm not sure at the moment how you would embed the authentication to do this programmatically, but this might send you in the right direction.
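One possible way to reuse your browser's authenticated session from a script (an assumption on my part, not verified against this endpoint) is the browser_cookie3 package:

import requests
import browser_cookie3  # pip install browser-cookie3

# Borrow the logged-in Google cookies from your local browser profile.
cookies = browser_cookie3.chrome(domain_name=".google.com")
resp = requests.get("https://www.google.com/bookmarks/?output=xml", cookies=cookies)
print(resp.text)  # XML of your saved places, if the session is accepted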
I was able to extract the data I needed from my Google Maps list. Below are some comments that expand on some of the other answers here, along with a script that extracts all of the relevant data points from the network response.
Obtaining the underlying URL
You can easily find this URL by opening the devtools in your browser, going to the Network tab, and refreshing the webpage or scrolling down the list until it loads new results (the list must be larger than 20 results). You should be able to find the network request that starts with https://www.google.com/search?tbm=map&pb... and go from there.
Increase the results size
I was able to increase the number of results returned from the request by changing the value of the 7i20 parameter. From what I can tell, the 7iXX parameter is the size of the page, and the 8iXX parameter is the starting point. I haven't tested how large you can make the page limit, but I tested 100 and it seemed to work fine. This should make dealing with larger lists much easier. A sketch of the relationship is below.
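For instance, a tiny helper capturing that relationship (a sketch; the rest of the pb-block is elided):

# 7i = page size, 8i = starting offset into the list.
def page_params(page_size, page_number):
    return "!7i%d!8i%d" % (page_size, page_size * page_number)

print(page_params(100, 0))  # !7i100!8i0 -> the first 100 places
print(page_params(20, 2))   # !7i20!8i40 -> places 41-60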
Parsing out the data
Instead of using regex to parse out the relevant data from the response, I found that the response is basically just a massive JSON object, and I was able to identify the indexes for specific types of data, such as the name of the place, location, notes, etc. See the script below.
If you look at the buildResults function in the script below, you can see the exact indexes used to extract specific pieces of information. These may of course change over time if the network response changes format at all, so use them as a starting point in case the specific values aren't at those indexes anymore. Hopefully they would still be close to those locations.
Script to parse the data (JavaScript / Node.js)
// Insert the raw text content from the network response from the
// https://www.google.com/search?tbm=map&pb... url below.
const rawInput = null

function prepare(input) {
  // There are 5 random characters before the JSON object we need to remove.
  // Also I found that the newlines were messing up the JSON parsing,
  // so I removed those and it worked.
  const preparedForParsing = input.substring(5).replace(/\n/g, '')
  const json = JSON.parse(preparedForParsing)
  const results = json[0][1].map(array => array[14])
  return results
}

function prepareLookup(data) {
  // this function takes a list of indexes as arguments,
  // constructs them into a line of code and then
  // execs the retrieval in a try/catch to handle data not being present
  return function lookup(...indexes) {
    const indexesWithBrackets = indexes.reduce((acc, cur) => `${acc}[${cur}]`, '')
    const cmd = `data${indexesWithBrackets}`
    try {
      const result = eval(cmd)
      return result
    } catch (e) {
      return null
    }
  }
}

function buildResults(preparedData) {
  const results = []
  for (const place of preparedData) {
    const lookup = prepareLookup(place)
    // Use the indexes below to extract certain pieces of data
    // or as a starting point for exploring the data response.
    const result = {
      address: {
        street_address: lookup(183, 1, 2),
        city: lookup(183, 1, 3),
        zip: lookup(183, 1, 4),
        state: lookup(183, 1, 5),
        country_code: lookup(183, 1, 6),
      },
      name: lookup(11),
      tags: lookup(13),
      notes: lookup(25, 15, 0, 2),
      placeId: lookup(78),
      phone: lookup(178, 0, 0),
      coordinates: {
        long: lookup(208, 0, 2),
        lat: lookup(208, 0, 3),
      },
    }
    results.push(result)
  }
  return results
}

const preparedData = prepare(rawInput)
const listResults = buildResults(preparedData)
console.log(listResults)

Changing html on a view based on if get parameter is set on Codeigniter

Before I used CodeIgniter, I had a page show certain HTML as long as the URL had no GET parameters, and then have some of the HTML be replaced by other markup as soon as something like this was set in the URL:
localhost/signup.php?success
Now my question is: what is the best way to do this in CodeIgniter? Would I have to use one of those parameters on the controller's function (which I still can't get my head around)? And if so, how? Or, if I just put PHP logic in the view like I used to do in plain PHP, what would I check for if not a GET parameter? Thanks.
There are too many ways to achieve this particular thing:
* routes.php
* extending the controller and using the constructor, so that your rules apply to every extended controller
* flashdata
Before you start, please read up on frameworks and watch some video tutorials on how to make a simple blog system etc. I myself wouldn't just jump into the concept; study up.
I mentioned flashdata, and that is how these things (success, alert, warning bars) are usually done.
By default, GET parameters are not enabled or useful in CodeIgniter, but URI segments work the same way. So...
If you had a controller called signup.php and a function inside it called success, you could link to that with:
localhost/signup/success
Then, if you have loaded the URL helper, which I always do in config/autoload.php, or just with:
$this->load->helper('url');
you could say:
if ($this->uri->segment(2) == 'success') {
    // Show success message or load a view for it...
} else {
    // The second URI segment is NOT 'success', so do something else...
}
But... CodeIgniter is just a framework for PHP. If it's possible in PHP, it's possible in CodeIgniter. You can simply go into the config/config.php file and enable query strings, but I would strongly suggest using URI segments and reading up on them, as well as on the URL helper.

PHP avoid browser reposting $_POST on page refresh?

I wonder what techniques I can use to stop users from posting a form twice when they refresh the page and choose to submit again.
E.g. I have a form inside register.php and process it inside register.php as well.
1st, I could process it in another file, e.g. register_process.php, and redirect to register.php, but then I would have to create about 20 new pages and relocate a lot of code; I don't want that option.
2nd, I could play with headers. I don't remember the exact trick, but I had some bad experience with that: users saw old data on the page after refreshing it...
3rd, I could just redirect upon success to some dummy.php and from dummy.php jump back to register.php. Then even if they refresh the page, the browser would not re-post. However, it does not protect against them using the back button and choosing to re-post. I know I could expire the page, but I find it an annoying experience for me and probably other users to see a "page expired" error.
4th, I could use some unique "access key" for each form, generated once the page loads, that is posted with the form and cannot be reused. However, I kind of struggle with the logic of that feature: how do I know a key was used without storing it in a MySQL DB? I don't think time-based access is great either, because some users can take a long time between opening the page and submitting the form.
I need more suggestions for how to stop users from re-posting a form.
Try this:
<?php
session_start();
if (strcasecmp($_SERVER['REQUEST_METHOD'], "POST") === 0) {
    // Stash the POST data in the session, then redirect to the same URL as a GET
    $_SESSION['postdata'] = $_POST;
    header("Location: " . $_SERVER['PHP_SELF'] . "?" . $_SERVER['QUERY_STRING']);
    exit;
}
if (isset($_SESSION['postdata'])) {
    // Restore the stashed data so the rest of the script can use $_POST as usual
    $_POST = $_SESSION['postdata'];
    unset($_SESSION['postdata']);
}
This will basically save the POST data and cause the browser to re-request as a GET request.
5th: use AJAX (e.g. via jQuery) and submit the form data in the background when the form is submitted. Output a response to the screen. Mark that form as submitted, or save that to a session, and when they refresh they will not be able to submit the form again.
In my opinion it is the best way to do it anyway. I had a scoreboard with 20 or more forms and it worked really well to send the data without refreshing. You can return a response and make the page look very professional. Using jQuery you can also add some great form validation to make sure the required fields are filled in.
I think that the best solution would be something like this:
* create an md5 or base64 hash of the posted data
* compare this hash with a session variable (let's call it $_SESSION['repost'])
* if the hashes match, skip whatever the save would do and output a warning
* if the hashes do not match, or no hash is present:
  * assign the current hash to the session variable
  * do whatever the save should do
Regarding the first option: I don't know about the design (programming design ^^) of your website, but you should only need one page. Let's say you call redirect.php with all the parameters; you then call your controller, and your controller should know, regardless of the parameters, what to do with them.
It is a good habit to practice some abstraction and good design when programming; it helps a lot in these situations.
Another way is to store the last POST in a session variable and check whether the new one is equal, like this:
<?php
// ... program ...
if (isset($_SESSION['PREPOST']) && $_SESSION['PREPOST'] == $_POST) {
    // DO SOMETHING
}
// ... program ...
$_SESSION['PREPOST'] = $_POST;
?>

Web automation

I'm developing an interface between an old web-based application and another one. That old web-based application works fine, but there is no API to communicate with it.
Is there any programmatic way to tell a web form something like: enter this value in this field, this one in that other field, and submit the form?
UPDATE: I am looking for something like this:
WebAutomation w = new WebAutomation("http://apphost/report");
w.forms[0].input[3].value = 123;
w.forms[0].input[4].value = "hello";
Response r = w.forms[0].submit();
...
Despite the tag on your question, the answer is going to be highly language-specific. There is also going to be a wide range of solutions, depending on how complex a solution you are willing to implement and how flexible a result you are looking for.
On the one hand, you can accomplish a lot in a very short period of time with something like Python's mechanize (sketched below); on the other hand, you can really get into the guts and have a lot of control by automating a browser using a COM object such as SHDocVw (Windows-only, of course).
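To give a feel for the mechanize route, a minimal sketch; the form index and field names are assumptions, so inspect the page for the real ones:

import mechanize  # pip install mechanize

br = mechanize.Browser()
br.open("http://apphost/report")
br.select_form(nr=0)    # pick the first form on the page; adjust as needed
br["input_3"] = "123"   # field names, not indexes; these are placeholders
br["input_4"] = "hello"
response = br.submit()
print(response.read())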
Or, as LoveMeSomeCode suggested, you can really hit your head against the concrete and start forging POST requests, but good luck figuring out what the server expects if it is doing any front-end processing of the form data.
EDIT:
One more option, if you are looking for something you can come up to speed on quickly, is AutoIt's IE module, which basically provides a programmatic interface over an instance of Internet Explorer (it's all COM underneath, of course). Keep in mind that this will likely be the least supportable option you could choose. I have personally used it to produce proof-of-concept automation suites that were then migrated to a more robust C# implementation where I handled the COM calls myself.
In .NET: http://watin.sourceforge.net/
In ruby: http://wtr.rubyforge.org/
Cross platform: http://seleniumhq.org/
You can, but you have to mock up a POST request. The fields (textboxes, radio buttons, etc.) are transmitted as key-value pairs back to the resource. You need to make a request to this resource (whichever one is used in the SUBMIT action of the FORM tag) and put all your field-value pairs in a POST payload on the request.
Here's a good program to see what values are being transmitted: http://www.httpwatch.com
Or, you can use Firebug, a free Firefox extension.
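Once you know which key-value pairs the form sends, forging the POST is straightforward. A minimal sketch with Python's requests library, where the URL and field names are placeholders taken from the question's pseudocode:

import requests

# Key-value pairs matching the form's input names (hypothetical here).
payload = {
    "input_3": "123",
    "input_4": "hello",
}
response = requests.post("http://apphost/report", data=payload)
print(response.status_code, response.text[:200])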
The Perl module WWW::Mechanize does exactly that. Your example would look something like this:
use WWW::Mechanize;

my $agent = WWW::Mechanize->new;
$agent->get("http://apphost/report");

my $response = $agent->submit_form(
    with_fields => {
        field_1_name => 123,
        field_2_name => "hello",
    },
);
There is also a Python port, and I guess similar libraries exist for many other languages.