Extracting and Constructing Tables from HTML Files using Julia

Here's a public link to an example html file. I would like to extract each set of CAN and yearly tax information (example highlighted in red in the image below) from the file and construct a dataframe that looks like the one below.
Target Fields
Example DataFrame
| Row |          CAN | Crtf_NoCrtf | Tax_Year | Land_Value | Improv_Value | Total_Value | Total_Tax |
|-----+--------------+-------------+----------+------------+--------------+-------------+-----------|
|   1 | 184750010210 | Yes         |     2016 |      16720 |       148330 |      165050 |   4432.24 |
|   2 | 184750010210 | Yes         |     2015 |      16720 |       128250 |      144970 |   3901.06 |
|   3 | 184750010210 | Yes         |     2014 |      16720 |       109740 |      126460 |   3412.63 |
|   4 | 184750010210 | Yes         |     2013 |      16720 |       111430 |      128150 |   3474.46 |
|   5 | 184750010210 | Yes         |     2012 |      16720 |        99340 |      116060 |   3146.17 |
|   6 | 184750010210 | Yes         |     2011 |      16720 |       102350 |      119070 |   3218.80 |
|   7 | 184750010210 | Yes         |     2010 |      16720 |       108440 |      125160 |   3369.97 |
|   8 | 184750010210 | Yes         |     2009 |      16720 |       113870 |      130590 |   3458.14 |
|   9 | 184750010210 | Yes         |     2008 |      16720 |       122390 |      139110 |   3629.85 |
|  10 | 184750010210 | Yes         |     2007 |      16720 |       112820 |      129540 |   3302.72 |
|  11 | 184750010210 | Yes         |     2006 |      12380 |       112760 |             |   3623.12 |
|  12 | 184750010210 | Yes         |     2005 |      19800 |       107400 |             |   3882.24 |
Additional Information
If it is not possible to insert the CAN into each row that is okay; I can export the CAN numbers separately and find a way to attach them to the dataframe containing the tax values. I have looked into using Beautiful Soup for Python, but I am an absolute novice with Python, and the rest of the scripts I am writing are in Julia, so I would prefer to keep everything in one language.
Is there any way to achieve what I am trying to do? I have looked at Gumbo.jl but cannot find any detailed documentation or tutorials.

So Gumbo.jl will parse the HTML and give you a programmatic representation of the structure of the HTML file (called a DOM - Document Object Model). This is typically a tree of HTML tags, which you can traverse to extract the data you need.
To make this easier, what you really want is a way to query the DOM, so that you can extract the data you need without having to traverse the entire tree yourself. The Cascadia.jl project does this for you. It is built on top of Gumbo, and uses CSS selectors as the query language.
So for your example, you could use something like the following to extract all the CAN fields:
julia> using Gumbo
julia> using Cascadia
julia> h=parsehtml(read("/Users/aviks/Download/z1.html", String))
julia> c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root)
13-element Array{Gumbo.HTMLNode,1}:
Gumbo.HTMLElement{:span}:
<span class="value">184750010210</span>
...
#print all the CAN values
julia> for x in c
println( x.children[1].text )
end
184750010210
186170040070
175630130020
172640020290
168330020230
156340030160
118210000020
190490040500
173480080430
161160010050
153510060090
050493000250
050470630910
Hopefully this gives you an idea of how to extract all the data you need.
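To go from there to the DataFrame in your question, you can apply the same pattern to the yearly tax rows and push the parsed values into a DataFrame. Treat the sketch below as hedged: "table.taxHistory" is a made-up selector (inspect the real page and substitute whatever actually matches the tax-history tables), it assumes one such table per CAN in the same order as the CAN values, it assumes each cell holds plain text in the order year / land / improvements / total value / total tax, and it leaves out the Crtf_NoCrtf column, which you would pull the same way:
using Gumbo, Cascadia, DataFrames

h = parsehtml(read("z1.html", String))

# CAN values, one per property (recent Cascadia versions use eachmatch instead of matchall)
cans = [x.children[1].text for x in
        eachmatch(Selector("td:containsOwn(\"CAN:\") + td span"), h.root)]

df = DataFrame(CAN=String[], Tax_Year=Int[], Land_Value=Int[], Improv_Value=Int[],
               Total_Value=Union{Int,Missing}[], Total_Tax=Float64[])

# "table.taxHistory" is a placeholder -- substitute the selector that actually matches
# the tax-history tables on the page. Assumes one such table per CAN, in the same order.
for (can, tbl) in zip(cans, eachmatch(Selector("table.taxHistory"), h.root))
    for row in eachmatch(Selector("tr"), tbl)
        cells = eachmatch(Selector("td"), row)
        length(cells) < 5 && continue    # skip header rows and anything malformed
        # assumed cell order: tax year, land value, improvement value, total value, total tax
        vals = [isempty(c.children) ? "" : replace(strip(c.children[1].text), "," => "")
                for c in cells]
        push!(df, (can, parse(Int, vals[1]), parse(Int, vals[2]), parse(Int, vals[3]),
                   something(tryparse(Int, vals[4]), missing), parse(Float64, vals[5])))
    end
end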

The answer above is a bit out of date, since the readall() function no longer exists. I'll update that answer below.
Here's a general breakdown of the package ecosystem for Julia (as of the time of writing this answer):
Requests.jl is used to download the HTML file itself (note that in avik's answer, he reads the HTML file from his local machine)
Cascadia.jl is required to search the parsed document using CSS selectors (e.g. the selector that you would find if you were to use SelectorGadget).
Gumbo.jl is required to parse the resulting HTML
The key thing to remember is that Gumbo stores objects in tree format as HTMLNodes or HTMLElements. So most objects have "parents" and "children." To get the data you need, it's simply a matter of filtering with the right selector (using Cascadia) and then going to the correct point in the Gumbo tree.
An updated version of avik's answer:
using Requests, Cascadia, Gumbo
# r = get(url) # Normally, you'd put a url here, but I couldn't find a way to grab it without having to download it and read it locally
# h = parsehtml(String(r.data)) # Then normally you'd execute this
# Instead, I'm going to read in the html file as a string and give it to Gumbo
h = parsehtml(readstring("z1.html"))
# Exploring with the various structure of Gumbo objects:
println(fieldnames(h.root))
println(fieldnames(h.root.children))
println(size(h.root.children))
# aviks code:
c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root);
for x in c
println( x.children[1].text )
end
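On current Julia (1.x) the same snippet looks roughly like this: HTTP.jl has replaced Requests.jl, readstring is now read(path, String), and Cascadia's matchall has become eachmatch. Treat it as a sketch against today's package versions rather than a tested drop-in:
using HTTP, Cascadia, Gumbo

# r = HTTP.get(url)                      # HTTP.jl replaces Requests.jl for downloading
# h = parsehtml(String(r.body))
h = parsehtml(read("z1.html", String))   # read the locally saved file as a String

for x in eachmatch(Selector("td:containsOwn(\"CAN:\") + td span"), h.root)
    println(x.children[1].text)
end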
This particular webpage is more difficult to scrape than most, since it doesn't have a great CSS structure.
There's some nice documentation on workflow on the Cascadia README, but I still had some questions after reading it. For anyone else (like me, yesterday) who comes to this page looking for guidance on web scraping in Julia, I've created a jupyter notebook with a simple example that will hopefully help you understand the workflow in greater detail.

Related

pyqt4 - MySQL: How to print single/multiple row(s) of a table in the TableViewWidget

I've recently tried to create an executable with Python 2.7 which can read a MySQL database.
The database (named 'montre') contains two tables: patient and proto_1.
Here is the content of those tables:
mysql> select * from proto_1;
+----+------------+---------------------+-------------+-------------------+---------------+----------+
| id | Nom_Montre | Date_Heure          | Temperature | Pulsion_cardiaque | Taux_oxy_sang | Humidite |
+----+------------+---------------------+-------------+-------------------+---------------+----------+
|  1 | montre_1   | 2017-11-27 19:33:25 |       22.30 |              NULL |          NULL |     NULL |
|  2 | montre_1   | 2017-11-27 19:45:12 |       22.52 |              NULL |          NULL |     NULL |
+----+------------+---------------------+-------------+-------------------+---------------+----------+
mysql> select * from patient;
+----+-----------+--------+------+------+---------------------+------------+--------------+
| id | nom       | prenom | sexe | age  | date_naissance      | Nom_Montre | commentaires |
+----+-----------+--------+------+------+---------------------+------------+--------------+
|  2 | RICHEMONT | Robert | M    |   37 | 1980-04-05 23:43:00 | montre_3   | essaye2      |
|  3 | PIERRET   | Mandy  | F    |   22 | 1995-04-05 10:43:00 | montre_4   | essaye3      |
| 14 | PIEKARZ   | Allan  | M    |   22 | 1995-06-01 10:32:56 | montre_1   | Healthy man  |
+----+-----------+--------+------+------+---------------------+------------+--------------+
As I'm only used to coding in C (no OOP), I didn't create classes in the Python project (shame on me...). But I managed, in two files, to create something (with mysql.connector) which can print my database on the command line and execute subroutines like looking-for(), etc.
Now I want to create a GUI for users with PyQt. Unfortunately, I saw that the structure is totally different, with classes etc. But okay, I tried to work through this and I've created a GUI which can display the table "patient". But I couldn't find (in the Qt documentation) how to reuse the programs I've already written for the display, nor how to display only certain rows of my patient table in a tableWidget, for example (using QtSql).
For example, if I want to display the whole patient table, I use this line (PyQt):
self.model.setTable("patient")
That part I got, but it bothers me that no MySQL code is needed to display the table, so I don't know how to select only the rows we want to see and display them. If we only want to see, for example, the ID n°2, how do I display only Robert in the tableWidget?
To recap, I want to know:
Whether I can take the code I've written and combine it with PyQt
How to display (in the tableWidget) only the rows selected by MySQL. Is that possible?
Please find in the URL my code for a better understanding of my problem :
https://drive.google.com/file/d/1nxufjJfF17P5hN__CBEcvrbuHF-23aHN/view?usp=sharing
I hope I was clear, thank you all for your help !

MySQL -> HTML Report, Styled like a Pivot Table

Ok, I'd like to start off by apologizing (profusely), since this seems to be a common question. Most of the examples seem to be somewhat similar, as well, but - for the life of me, I cannot wrap my brain around how to apply the myriad of quality responses to my specific table. And, I'm sure it's probably just the easiest thing in the world, what with all the very thorough responses/examples/links to resources with explanations/etc.
So, I suppose I'll just get right to it. The basics:
We host off-site copies of our clients' backups.
We need to know how much space they're using.
We are not at all consistent in Naming Convention, folder vs. disk per client, etc.
We need to automate a 'report', monthly, with data as follows:
-[C.Srv 01]---Size(GB)--Free(%)
Client 01 [Total] [AVG]
Server 01 109.43 25
Server 02 415.19 25
WHERE C.Srv = [Specified Cloud Server]
Clients Get a Total Size(GB) and an Average Free(%)
My MySQL table is this:
# Name DataType Length/Set Unsigned Allow NULL ZeroFill Default
1. ID INT 11 AUTO_INCREMENT
2. Client TEXT
3. Server TEXT
4. C.Srv TEXT
5. Size DECIMAL 10,2
6. Free DECIMAL 10,4
So, for Example, let's say I have this...
___ ________ ________ _________ _________ _______
ID | CLIENT | SERVER | C.SRV | SIZE | FREE
---|--------|--------|---------|---------|-------
1 | a | adc | cs_01 | 109.43 | 0.2504
2 | a | asql | cs_01 | 415.19 | 0.2504
3 | b | bdc | cs_01 | 583.91 | 0.1930
4 | b | bdev | cs_01 | 316.52 | 0.1930
5 | b | bsql | cs_01 | 1259.56 | 0.1930
6 | c | cdc | cs_01 | 355.30 | 0.7631
7 | d | ddc | cs_01 | 398.21 | 0.5808
Is it possible to get something pretty, in HTML (preferably), that has the basic structure of this...
_______ __________ ________
CS_01 | Size(GB) | Free(%)
-------|----------|--------
-a | 524.62 | 25.04%
-------|----------|--------
adc | 109.43 | 25.04%
asql | 415.19 | 25.04%
-b | 2178.88 | 19.30%
-------|----------|--------
bdc | 583.91 | 19.30%
bdev | 316.52 | 19.30%
bsql | 1259.56 | 19.30%
+c | 355.30 | 76.31%
-------|----------|--------
+d | 398.21 | 58.08%
_______|__________|________
Or, am I just S.O.L.? Format, I can mess with in CSS, or whatever (I hope), just so long as it's in that basic structure. (I don't know if it matters, but the final goal will be to collapse at the Client Level; in case that somehow factors into the approach/data-gathering.)

Multiple Data Sources in Microsoft Excel SQL Query

I have a lot of spreadsheets that pull transactional information from our ERP software into Excel using the Microsoft Query that we then perform other calculations on automatically. Recently we upgraded our ERP system, but management made the decision to leave the transactional history in the old databases to have a clean one going forward in the new system. I still need to have some "rolling 12 months" graphs, but if I use only the old database, I'm missing new data and if I use only the new, I'm missing the last 11 months data.
Is there a way that I can write a query in Excel to pull data from the old database PartTran table and merge it with the new database PartTran table without user intervention each time? For instance, I don't want my users (if possible) to have to have two queries that they copy and paste into one Excel table. The schema of the tables (at least the columns I need) are identically named and defined.
If you want to take a bit of a fun, hacky Excel approach, you could do the "copy-paste" bit FOR your users behind the scenes. Given two similar tables OLD and NEW with structures
+-----+------+-------+------------+
| id | foo | bar | date |
+-----+------+-------+------------+
| 95 | blah | $25 | 2015-06-01 |
| 96 | bork | $12 | 2015-07-01 |
| 97 | bump | $200 | 2015-08-01 |
| 98 | fizz | | 2015-09-01 |
| 99 | buzz | $50 | 2015-10-01 |
| 100 | char | ($1) | 2015-11-01 |
| 101 | mope | | 2015-12-01 |
+-----+------+-------+------------+
and
+----+-----+-------+------------+------+---------+
| id | foo | bar | date | fizz | buzz |
+----+-----+-------+------------+------+---------+
| 1 | cat | ($10) | 2016-01-01 | 285B | 1110111 |
| 2 | dog | $25 | 2016-02-01 | 27F5 | 1110100 |
| 3 | ant | $100 | 2016-03-01 | 1F91 | 1001111 |
+----+-----+-------+------------+------+---------+
... you can union together the data for these two datasets with some prudent excel wizardry as below:
Your UNION table ( named using alt+j+t+a ) should have the following items:
New natural ID
DataSet pointer ( name of old or new table )
Derived ID from original dataset
Columns of data you want from Old & New DataSets
example:
+---------+------------+------------+----+------+-----+------------+------+------+
| UnionId | SourceName | SourceRank | id | foo | bar | date | fizz | buzz |
+---------+------------+------------+----+------+-----+------------+------+------+
| 1 | OLD | | | | | | | |
| 2 | NEW | | | | | | | |
+---------+------------+------------+----+------+-----+------------+------+------+
You will then make judicious use of Indirect() and VlookUp() to derive the lookup id and column targets. Sample code below
SourceRank - helper column
=COUNTIFS([SourceName],[#SourceName],[UnionId],"<="&[#UnionId])
id - the id from the original DataSet
=SMALL(INDIRECT([#SourceName]&"[id]"),[#SourceRank])
Everything else is just VlookUp madness!! Although I've taken the liberty of copying the sample code below for reference
foo =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[foo]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
bar =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[bar]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
date =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[date]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
fizz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[fizz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
buzz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[buzz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
Output
You'll likely want to make prudent use of If() and/or IfError() to help your users ignore the new column references to the old table and those rows that do not yet have data. Without that, however, you'll end up with something like the below.
This is both ready to accept & read new inputs to both OLD and NEW DataSets and is sortable to get rid of those pesky placeholder rows...
Hope this helps! Happy coding!

MacVim+NERDTree: How to open a file as a split in furthest horizontal split

I've been browsing mvim docs and have tested out the various commands, but I can't seem to find one that solves my issue.
Here is what I have:
/========================================================\
| | | |
| | | |
| | file 1 | |
| | | |
| |______________________| |
| NERDTree | | File 3 |
| | | |
| | file 2 | |
| | | |
\__________|______________________|______________________/
What I'd like to have:
/========================================================\
| | | |
| | | |
| | file 1 | File 4 |
| | | |
| |______________________|______________________|
| NERDTree | | |
| | | |
| | file 2 | File 3 |
| | | |
\__________|______________________|______________________/
I'm able to move things far right, into a new vsplit, as well as far top and far bottom.
New NERDTree files are opening by default in the File 1/File 2 vsplit.
Any help is appreciated, thanks!
It seems as though my particular setup at that time may have been the issue, and I think I understand why. First, how to do what I asked:
Open up nerdtree with :NERDTree
Open your first file with Enter or o
Open second file in horizontal split pane with i
From each of the 2 horizontal panes, create your third and fourth panes with s. This will open the selected file in a vertical split of the last buffer you interacted with, splitting each of them in half.
Bear in mind that you'll need to be in the pane you'd like to split before selecting the file to open from NERDTree.
My issue arose primarily from my panes already being in the orientation of my topmost diagram above. Every time I tried to create a horizontal split with File 3, the split would just wind up in the first column of files.
I think I may see why now, though. With mvim you can interact with the mouse - and that's the only way to get directly from that furthest column to NERDTree without touching any other buffers (as far as I can tell). With regular vim, you wouldn't be able to have the furthest column as the last-interacted window, and therefore would never be able to split it.

Search Replace in MySQL: remove directory structure but keep filename

I am changing directory structures in a Drupal installation and need to remove all path data except the file name itself.
So the basic structure is:
+-------------+--------------+---------+-----------+-------------+----------+-------+----------------------------------------------------------------------------------+-----------------------+
| entity_type | bundle | deleted | entity_id | revision_id | language | delta | field_filename_value | field_filename_format |
+-------------+--------------+---------+-----------+-------------+----------+-------+----------------------------------------------------------------------------------+-----------------------+
The filename is stored in field_filename_value. Here's a sample record:
+-------------+--------------+---------+-----------+-------------+----------+-------+----------------------------------------------------------------------------------+-----------------------+
| entity_type | bundle | deleted | entity_id | revision_id | language | delta | field_filename_value | field_filename_format |
+-------------+--------------+---------+-----------+-------------+----------+-------+----------------------------------------------------------------------------------+-----------------------+
| node | presentation | 0 | 11 | 11 | und | 0 | /really long path name/with lots of words/167 Clarence Ashley - Coo Coo Bird.mp3 | NULL |
+-------------+--------------+---------+-----------+-------------+----------+-------+----------------------------------------------------------------------------------+-----------------------+
That ridiculous filename value needs to be changed from:
/really long path name/with lots of words/167 Clarence Ashley - Coo Coo Bird.mp3
To this:
167 Clarence Ashley - Coo Coo Bird.mp3
Setting aside the bad practice of using spaces in file/directory names, how would you correct this? Is it possible using MySQL features alone?
As an added challenge, some files may be more than 2 directories deep.
Use substring_index
select substring_index('http://www.example.com/dev/archive/examples/test.htm','/',-1)
(both of the above are taken from MySQL String Last Index Of)
How you would use it is easy, but just to explain: with a count of -1, SUBSTRING_INDEX returns the part of the string after the last '/', cutting off everything to the left of it.