Trying to crawl some data on Google Sheets but getting beat by XPath

I'm trying to make a sheet for study purposes on the stock market, and I'm using this website to get the data from, taking this stock as an example.
My goals here are:
I want to grab some of the indicators from this div area (such as P/L, LPA, M. LÍQUIDA, and others);
And some of the numbers from this table's first column (such as rows 11, 15, and others).
My issues:
I can't fetch the data I want from the div with the IMPORTXML function, neither by copying the XPath nor by trying to find a specific class name to match.
I can fetch the specific number that I want, but it returns 3 different values from 3 different rows (I want only the first one), due to the XPath that I'm using: //table/tbody/tr[11]/td[2]/span.
There are 2 more tables down the page that match the same XPath, and the function is returning the values from row #11 of the other tables as well, as you can see here. The only thing that distinguishes the 3 tables from one another is their parent divs, but I can't figure out how to work with those divs. Is there any way to fix this, or any function that automatically discards the other 2 rows?
Can someone shed some light? :(

It's almost always easier to find the values you need by referencing text near them. This should work to get the 20,76 from the first table:
(//*[contains(text(), 'P/L')]/following::strong)[1]
As far as the second table goes, this should get 52.562,18 M:
(//span[contains(text(), 'Receita Líquida')]/following::td)[1]
If you need to get different columns, you can just pass a higher index; this will return -0,07%, for instance:
(//span[contains(text(), 'Receita Líquida')]/following::td)[5]
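To use any of these from Google Sheets, wrap them in IMPORTXML as usual; a minimal sketch, assuming the page URL sits in cell A1:
=IMPORTXML(A1,"(//*[contains(text(), 'P/L')]/following::strong)[1]")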
I also highly recommend getting some sort of XPath tester add-on for your browser to play around with these, if you don't already have one. I use ChroPath:
Firefox - https://addons.mozilla.org/en-US/firefox/addon/chropath-for-firefox/
Chrome - https://chrome.google.com/webstore/detail/chropath/ljngjbnaijcbncmcnjfhigebomdlkcjo?hl=en-US

Google Spreadsheets ArrayFormula: How to split and transpose a cell-range?

Hello everybody and thanks a lot for your help.
Here's my problem:
What I have:
I have a table with raw data in 53 rows and numerous columns which I would like to reduce and restructure into three columns: City, Date and Value.
https://docs.google.com/spreadsheets/d/1bsdC8lrtSGk957ae8Z0VRGnDqTZfFLPpLkfoid0UbIQ/edit?usp=sharing
What I've done so far:
For a single row, I used the following formula to make everything work as I wanted it to:
=ArrayFormula({SPLIT(TRANSPOSE(Base_Data!A2)&"|"&TRANSPOSE(Base_Data!AJ1:1&"|"&Base_Data!AJ2:2),"|")})
What I want:
I'd like to extend the formula to work for the entire area, all 53 rows. Does anyone have a tip for this? The solution doesn't have to be a formula; a script would work, too.
I've set up a new sheet called "New_Data [Erik]" and placed the following formula into A2:
=ArrayFormula(SPLIT(FLATTEN(Base_Data!A2:A&"\"&Base_Data!AJ1:1&"\"&Base_Data!AJ2:54),"\",0,1))
If this is a one-time conversion, I'd recommend copying the results in place. To do that, select A:C, hit Ctrl-C to Copy and then Ctrl-Alt-V to Paste Special. A small clipboard icon will appear. Click it and choose "Paste Values Only."
If you'll need this functionality ongoing, just understand that FLATTEN is a not-yet-official function of Google Sheets, which means that while Google Sheets may very well make it official, they may also decide to do away with it at any time. (This is why I suggest copying and pasting the results in place if it's just a one-time conversion.)
Not sure what you're trying to get at there. If you are trying to leave out all columns but 3, just do ={Base_Data!A2:A, Base_Data!E2:E} and add as many columns as you require, comma-separated, within the curly brackets.
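For example, to keep exactly three columns (the third column here is just a hypothetical choice), you would write:
={Base_Data!A2:A, Base_Data!E2:E, Base_Data!H2:H}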

Google Sheets IMPORTXML failure - Can't find the correct path to the table from the link

I'm trying to retrieve a table which updates twice per day. On other websites I was able to find the element, but I've noticed that the approach I use doesn't work on every website where I've tried it.
In this case the issue is:
In Google Sheets, using IMPORTXML, I can't find the correct path to the table from the link or identify the element.
The website for this example is: http://lotopolonia.com/tabel/arhiva/index.php
1. I need to retrieve the dates and numbers.
2. They are updated twice per day, and I want my sheet to update by adding just the newest line at the top of the others. But I'll deal with this one after I solve the first.
I looked at the XPath tutorial from W3C and understood the syntax a bit.
The problem is how to correctly identify the elements and nodes in the inspector to retrieve the data I need.
Also, I've installed a Chrome extension (XPath Helper) which shows XPath better than what I get from Chrome's inspector.
I tried the following:
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='second_row']/td[@class='colon2']")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='second_row']/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='first_row'][1]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//*[@class='table_01']/table/tbody/tr[@class='first_row'][1]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[3]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[*]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='second_row'][1]/child::td[*]")
The formulas look OK, without errors, but for all the above requests I get the same result: imported content is empty.
Unfortunately I've run out of ideas on how to interpret those elements...
Any idea how to go on?
Cheers
How about this answer? I used //table[@class='table_01']/tr[position()>2] as the XPath. "A1" has http://lotopolonia.com/tabel/arhiva/index.php.
=IMPORTXML(A1,"//table[@class='table_01']/tr[position()>2]")
Using table[@class='table_01'], retrieve the table. Note that there is no tbody in this path: browsers insert tbody into the DOM you see in the inspector, but IMPORTXML parses the raw HTML, which here doesn't contain it - that is likely why all the paths with /tbody/ returned empty content.
Using tr[position()>2], retrieve the dates and numbers.
Note:
If you want to retrieve the whole table, please use =IMPORTXML(A1,"//table[@class='table_01']/tr").
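If you only want a single column, say the dates, appending a cell index to the row XPath should work; this is a guess, assuming the dates sit in the first cell of each row:
=IMPORTXML(A1,"//table[@class='table_01']/tr[position()>2]/td[1]")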
If this was not what you want, I'm sorry.

Reshape the dataset into a more relational format (transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I cannot come up with the right idea of how to do that, even though it seems to be a very primitive transformation. Is there any silver bullet for this?
Original records (csv) have, in a way, a json-like structure, so my first approach was to represent the original data as a vector and then transpose it (but in this case my resulting table looks like a sparse matrix - the rows I have transposed are blank in the rest of their values).
Another way I'm considering is to serialize it into JSONs and then de-serialize into a new spreadsheet (jsonize()) - in this case, I'm having problems with merging them properly.
In both ways I have it "half-working".
Can anyone suggest a simple and reliable algorithm for this?
Any language, RegEx, any tools or code snippets are very much appreciated.
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, I think, but in all cases you can basically use the fact that the 'Course' rows start with "Code:", which is never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
1. Open the file in a text editor that supports regular expressions in find/replace.
2. Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for ^Code: and replace with ,,,,,Code: (the ^ anchor belongs in the search pattern only, not in the replacement).
3. If you now import the file into OpenRefine, you'll have a project with 10 columns (the 10th column is caused by the trailing comma at the end of the course data row).
4. You can now use Transpose (or just rename) on the right-most columns, which contain the course data, while leaving the left-most columns, which contain the student details.
5. Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet).
6. Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines.
7. Finally, rename the columns as you want and remove any extraneous columns.
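Since the question welcomes any language, here is a minimal Python sketch of the same idea, assuming the input is a plain CSV laid out exactly as in the sample above (the file names are hypothetical):

import csv

rows_out = []
course_code = course_desc = None

with open("courses.csv", newline="") as f:  # hypothetical input file
    for row in csv.reader(f):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if row[0].startswith("Code:"):
            # Course header row: remember the code and description for the rows below.
            course_code, course_desc = row[1], row[3]
        elif row[0] == "Student Name":
            continue  # per-course column-header row, drop it
        else:
            # Student row: prefix it with the current course's code and description.
            rows_out.append([course_code, course_desc] + row)

with open("reshaped.csv", "w", newline="") as f:  # hypothetical output file
    writer = csv.writer(f)
    writer.writerow(["Code", "Course Description", "Student Name",
                     "Lecture Hours", "Labs Hours", "Test Score", "Status"])
    writer.writerows(rows_out)

This is the same fill-down logic as the OpenRefine steps: the course values are carried forward until the next 'Code:' row replaces them.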

Best way to parse a big and intricate JSON file with OpenRefine (or R)

I know how to parse JSON cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendars of 4730 Airbnb rooms, identified by their IDs.
Here is an example of one JSON file: https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess it implies a series of nested forEach calls, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries that confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution with R (or Python) would also be fine for me.
Rather than just creating your project as text and working with GREL to parse it out...
The best way is to select the JSON record part that you want to work with using our visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw)
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project, and now you're in Rows mode. However, if you think that Records mode might be better suited for the context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, which is why I highlight the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode.
Feel free to take this example and make it better and even more helpful for all, and add it to our Wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OpenRefine (OR) array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.
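Since a Python solution would also be fine per the question, here is a minimal sketch of the same flattening done outside OpenRefine. The calendar_months -> days structure and the available field come from the question; the date and price field names are assumptions about the payload and may differ:

import json
import urllib.request

# The calendar_months URL from the question goes here.
url = "https://fr.airbnb.com/api/v2/calendar_months?..."

data = json.load(urllib.request.urlopen(url))

rows = []
for month in data["calendar_months"]:  # twelve items, one per month
    for day in month["days"]:          # one dict per day in that month
        rows.append((
            day.get("date"),       # assumed field name
            day.get("available"),  # confirmed above via value.parseJson().available
            day.get("price"),      # assumed field name
        ))

Each tuple in rows then corresponds to one of the per-day rows you would end up with after the two split steps above.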

Return a certain part of a string in MySQL

I am fetching results from a database table which contains the text of multiple pages.
These pages have links in their content.
I am trying to get all the links from the pages in a table, but I am also getting the unwanted text.
For example, this could be the content of a certain part of a page:
line 1: This is the link for lalalaal <a href="page5.html">click</a>
line 2 if you want to go to page lalalala2 click
Now I only want the part starting at <a href and ending at </a> in the result record. If there is more than one anchor tag in the text, then each anchor tag should be treated as a separate record.
The returned result should be like:
ID value
1 ' click '
2 ' click '
I have tried the following queries:
Select * from [Database.tablename] where value between <a href and </a>;
Select * from [Database.tablename] locate '(<a href, Value)>0' and locate (</a>, value)>0;
but neither of the two queries gives me the wanted result...
This sort of text extraction is probably best addressed using regular expressions.
MySQL has some regular-expression support, but on its own it can only tell you which rows contain an <a></a> pair. Even identifying that there is at least one link inside a record doesn't help you extract the (possibly many) links and treat them as different records themselves.
To successfully extract those links, at least to my knowledge, you need a tool better suited to working with regular expressions. Most languages (Perl, PHP, Python, Java, etc.) support them, some natively, some through available libraries. You can select only the records containing links (using REGEXP), and extract every link via code.
Another way of handling this would be performing the query on MySQL, exporting the results to a text file, and working on its contents with shell scripting (for instance, using sed under UNIX/Linux).
If you need it implemented using only MySQL, then my best guess is trying a stored procedure (to be able to work on the results record by record). Still, I cannot think of an implementation of such an SP that guarantees detecting and successfully extracting every possible link inside a record as one record per link.
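For illustration, here is a minimal Python sketch of that select-then-extract approach; the connection details and the table/column names (pages, id, value) are hypothetical:

import re
import mysql.connector  # MySQL Connector/Python

# Hypothetical credentials and schema, purely for illustration.
conn = mysql.connector.connect(user="user", password="secret", database="mydb")
cur = conn.cursor()

# Let MySQL pre-filter to the rows that contain at least one anchor tag...
cur.execute("SELECT id, value FROM pages WHERE value REGEXP '<a[^>]*>'")

links = []
for row_id, text in cur.fetchall():
    # ...then let Python split out every anchor, one record per link.
    for anchor in re.findall(r'<a\s[^>]*>.*?</a>', text, flags=re.I | re.S):
        links.append((row_id, anchor))

for row_id, anchor in links:
    print(row_id, anchor)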