I want to get transaction times from a website https://explorer.flitsnode.app/address/FieXP1irJKvmWUiqV18AFdDZD8bgWvfRiC/
but when I request the HTML I don't get the full page data.
I get everything except the contents of the table I need - "Transactions of address".
I have the CSS selector for the table, #txaddr, but it returns just the header row (Timestamp, Block, Hash, ...).
My code so far - I added a few comments to it.
import bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def NodeRewardTime(link):
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soup = bs4.BeautifulSoup(webpage, 'html5lib')  # pip install html5lib
    all_results = soup.select("#txaddr")  # CSS selector for the entire table
    try:
        [print(x.text) for x in all_results]  # prints results
    except:
        print("No data to show")

link = "https://explorer.flitsnode.app/address/FieXP1irJKvmWUiqV18AFdDZD8bgWvfRiC/"
NodeRewardTime(link)
input("End")
Output: TimestampBlockHashAmount (FLS)Balance (FLS)TX Type [End]
If you inspect the page, you can see that the table's data is loaded separately in JSON format (from the get_address_transactions endpoint used below).
The following will print the data in a table format:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import json

def NodeRewardTime(link):
    req = Request(link, headers={"User-Agent": "Mozilla/5.0"})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, "html5lib")
    json_data = json.loads(soup.text)
    return "\n".join(" | ".join(i) for i in json_data["data"])

URL = "https://explorer.flitsnode.app/get_address_transactions?address=fiexp1irjkvmwuiqv18afddzd8bgwvfric"
print(NodeRewardTime(URL))
Outputs:
2020-08-14 00:00 | 562586 | cfc5fc6e81c0f31aaac85c2e3e6e727ce00cfdf4b938e7092472ce6f549b7fbf | 3.67999999 | 1003.67999999 | MASTERNODE
2020-08-13 16:37 | 562211 | 68f08eefef36aecd33645b13f3c95d0c3160ade5bc180b1f3b32ced670d97bef | -3.67999999 | 1000.00000000 | OUT
2020-08-12 18:58 | 561193 | 31958481f27f3d40ef5df4f437169f169f58b7b9556cc8ea5c381d4daf6d96b2 | 3.67999999 | 1003.67999999 | MASTERNODE
2020-08-11 22:00 | 560155 | 7ae289b8250fd94af10aa5e0a884149f548c7e3d1c6e05e7d78ac80284b3833a | -36.79999990 | 1000.00000000 | OUT
2020-08-11 15:02 | 559828 | 618185e5f12436e4c5fc97d45d36098ca56662780bbd037abfedfa316219571e | 3.67999999 | 1036.79999990 | MASTERNODE
2020-08-10 14:52 | 558579 | 3afeaa5e9e9130f03fac0303de680d790d075f1bbbae95e730bcf90fc33b82b9 | 3.67999999 | 1033.11999991 | MASTERNODE
2020-08-09 12:37 | 557281 | 0943156c88cc667502aef84b8143ba89f84cc069e342c86e028cae034abf3b36 | 3.67999999 | 1029.43999992 | MASTERNODE
2020-08-08 12:10 | 556044 | 31f56c608a02ae8f90b0e113dc60a4f35eec86b91c0be7242c4409bab2f4ece2 | 3.67999999 | 1025.75999993 | MASTERNODE
2020-08-07 09:07 | 554717 | 3e3e73db2491dec2071088a080a86567d769a6979c0304bfc26bfa194bfa8e5f | 3.67999999 | 1022.07999994 | MASTERNODE
2020-08-06 07:47 | 553471 | 92605aff1c7ee92302323b22ea4b2d812e71afa3e07be8a80e8a62d3f7281314 | 3.67999999 | 1018.39999995 | MASTERNODE
2020-08-05 04:47 | 552123 | 286261dc57262a2d2e34e1e3fd8c008946d6a08cf8a00617b2b66c14af3f2a82 | 3.67999999 | 1014.71999996 | MASTERNODE
2020-08-04 02:14 | 550794 | ccc75788a0b2c1b441fe9f2c3594c39ce9dcc90583112d795fd3666942c0014d | 3.67999999 | 1011.03999997 | MASTERNODE
2020-08-02 22:32 | 549388 | d2587f7a8adf268b881a22cf8b441382093916a95ab1c9f2f91c8a0ce59a281b | 3.67999999 | 1007.35999998 | MASTERNODE
2020-08-01 23:04 | 548196 | 1279fada75e56f2397288ce9eb4fcc7d04d10b15ea646189df75a117a2585707 | 3.67999999 | 1003.67999999 | MASTERNODE
... and on
Each row comes back as one complete record, so loop over the rows and keep only the fields you need for your output.
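For example, a minimal sketch (assuming each row is ordered timestamp, block, hash, amount, balance, TX type, as in the output above) that keeps only the timestamps of the MASTERNODE reward rows:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import json

URL = "https://explorer.flitsnode.app/get_address_transactions?address=fiexp1irjkvmwuiqv18afddzd8bgwvfric"
req = Request(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(req).read(), "html5lib")
rows = json.loads(soup.text)["data"]

# keep the timestamp (first field) of rows whose TX type (last field) is MASTERNODE
reward_times = [row[0] for row in rows if row[-1] == "MASTERNODE"]
print("\n".join(reward_times))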
Related
I am new to JSON and am trying to parse the data returned by the following URL:
https://api.binance.com/api/v3/klines?symbol=LTCBTC&interval=5m
The data is public if you want to see the exact output
I am in an Oracle 18c database trying to use json_table but I am not sure how to format the query or reference the columns as the JSON has no names, just values.
If I just paste in one record from the array, as follows, then I can get a column with all of its values; but I need to parse the entire array and get the output into a table.
SELECT *
FROM json_table( '[1617210000000,"0.00325500","0.00326600","0.00325400","0.00326600","780.81000000",1617210299999,"2.54374363",210,"569.58000000","1.85545803","0"]' , '$[*]'
COLUMNS (value PATH '$' ))
I have been searching Google for days and have not found an example of what I am trying to do; all the examples use JSON with name:value pairs.
Thank you in advance.
The raw data is an array of arrays, so you can use $[*] to get the individual arrays, and then numbered positions to get the values from each of those arrays:
SELECT *
FROM json_table(
'[[...], [...], ...]', -- use actual data, as CLOB?
'$[*]'
COLUMNS (
open_time PATH '$[0]',
open PATH '$[1]',
high PATH '$[2]',
low PATH '$[3]',
close PATH '$[4]',
volume PATH '$[5]',
close_time PATH '$[6]',
quote_av PATH '$[7]',
number_of_trades PATH '$[8]',
taker_buy_base_av PATH '$[9]',
taker_buy_quote_av PATH '$[10]',
ignore PATH '$[11]'
)
)
I've taken the column names from the API documentation. Not sure why some are strings, presumably a precision thing; but you can obviously specify the data types. (And there are lots of examples of converting epoch timestamps to Oracle dates/timestamps if you want to do that.)
db<>fiddle with four entries, and an additional column for ordinality, which you might not want/need.
IDX | OPEN_TIME | OPEN | HIGH | LOW | CLOSE | VOLUME | CLOSE_TIME | QUOTE_AV | NUMBER_OF_TRADES | TAKER_BUY_BASE_AV | TAKER_BUY_QUOTE_AV | IGNORE
--: | :------------ | :--------- | :--------- | :--------- | :--------- | :----------- | :------------ | :--------- | :--------------- | :---------------- | :----------------- | :-----
1 | 1617423900000 | 0.00356800 | 0.00357100 | 0.00356400 | 0.00356800 | 358.71000000 | 1617424199999 | 1.27964866 | 90 | 313.96000000 | 1.12008826 | 0
2 | 1617424200000 | 0.00356800 | 0.00357000 | 0.00356600 | 0.00356800 | 349.47000000 | 1617424499999 | 1.24704741 | 105 | 283.05000000 | 1.01005077 | 0
3 | 1617424500000 | 0.00357000 | 0.00357900 | 0.00357000 | 0.00357400 | 412.32000000 | 1617424799999 | 1.47359944 | 127 | 53.73000000 | 0.19203676 | 0
4 | 1617424800000 | 0.00357500 | 0.00357500 | 0.00356500 | 0.00356600 | 910.58000000 | 1617425099999 | 3.25045272 | 198 | 463.30000000 | 1.65400945 | 0
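If you want to sanity-check that field order against the live API before wiring up the Oracle query, a quick look at the raw response from outside the database (a small Python sketch, purely for inspection) shows the array-of-arrays shape:

import json
from urllib.request import urlopen

url = "https://api.binance.com/api/v3/klines?symbol=LTCBTC&interval=5m"
klines = json.load(urlopen(url))
# each element is one candle: [open_time, open, high, low, close, volume, close_time, ...]
print(len(klines), klines[0])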
I have a json object like this:
[{"ID": "101",
"OagCode": "1000",
"house": [{"from": [{"CneeCode":"30100"}], "ShprCode": "20100"}]},
{"ID": "102",
"OagCode": "1001",
"house": [{"from": [{"CneeCode":"30101"}], "ShprCode": "20101"},
{"from": [{"CneeCode":"30102"}], "ShprCode": "20102"},
{"from": [{"CneeCode":"30103"}], "ShprCode": "20103"}]}]
I want to convert this JSON to a dataframe in such a way that the interior list expands and forms a dataframe with proper values, as follows:
+-----+---------+----------+----------+
| ID | OagCode | CneeCode | ShprCode |
+-----+---------+----------+----------+
| 101 | 1000 | 30100 | 20100 |
| 102 | 1001 | 30101 | 20101 |
| 102 | 1001 | 30102 | 20102 |
| 102 | 1001 | 30103 | 20103 |
+-----+---------+----------+----------+
Is there a way to convert the above stated json to dataframe without using loops?
I have tried the orient argument and it doesn't work.
Use json_normalize:
from pandas.io.json import json_normalize

# j is the list of dicts shown in the question
df = json_normalize(j, record_path='house', meta=['ID', 'OagCode'])
print (df)
CneeCode ShprCode ID OagCode
0 30100 20100 101 1000
1 30101 20101 102 1001
2 30102 20102 102 1001
3 30103 20103 102 1001
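Note that in recent pandas versions (1.0 and later) json_normalize is exposed at the top level of pandas, so you can avoid the deprecated pandas.io.json import:

import pandas as pd

df = pd.json_normalize(j, record_path='house', meta=['ID', 'OagCode'])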
I want to add a unique row number to my dataframe in pyspark and don't want to use the monotonically_increasing_id & partitionBy methods.
I think this question might be a duplicate of similar questions asked earlier, but I'm still looking for advice on whether I am doing it the right way or not.
Following is a snippet of my code:
I have a csv file with below set of input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this csv file into a dataframe.
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD=empRDD.map(lambda x: (list(x[0]) + [x[1]]))
newRDD.take(2);
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the index value to each row's list, I lost the dataframe schema.
newdf=newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show();
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way? Or is there a better way to add the row number while preserving the schema of the dataframe in pyspark?
Is it feasible to use the zipWithIndex method to add unique consecutive row numbers for a large dataframe as well? And can we use this row_id to re-partition the dataframe so that the data is distributed uniformly across the partitions?
I have found a solution and it's very simple.
Since my dataframe has no column that holds the same value across all rows, using row_number with a partitionBy clause on an existing column does not generate row numbers that are unique across the whole dataframe.
Let's add a new column to the existing dataframe with some constant default value in it:
from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

emp_df = emp_df.withColumn("new_column", lit("ABC"))
and create a window with partitionBy on that column "new_column":
w = Window().partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
you will get the desired results:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
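Alternatively, if you want to keep the zipWithIndex approach from the question and simply avoid losing the schema, one possible sketch (assuming emp_df is the dataframe loaded above) is to extend the original schema with a LongType field and rebuild the dataframe from the indexed RDD:

from pyspark.sql.types import LongType, StructField, StructType

# append a row_id field to the original schema so no column information is lost
new_schema = StructType(emp_df.schema.fields + [StructField("row_id", LongType(), False)])
new_df = spark.createDataFrame(
    emp_df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],)),
    new_schema)
new_df.show()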
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+
I have 4 columns in a Hive database table. The first two columns are of type STRING; the 3rd and 4th contain JSON. How do I extract the JSON data into separate columns?
The SerDes available in Hive seem to handle only records that are pure JSON, but I have both normal (STRING) and JSON columns. How can I extract the data into separate columns here?
Example:
abc 2341 {max:2500e0,value:"20",Type:"1",ProviderType:"ABC"} {Name:"ABC",minA:1200e0,StartDate:1483900200000,EndDate:1483986600000,Flags:["flag4","flag3","flag2","flag1"]}
xyz 6789 {max:1300e0,value:"10",Type:"0",ProviderType:"foo"} {Name:"foo",minA:3.14159e0,StartDate:1225864800000,EndDate:1225864800000,Flags:["foo","foo"]}
Given a fixed JSON structure (shown here with quoted keys, i.e. valid JSON):
create table mytable (str string,i int,jsn1 string, jsn2 string);
insert into mytable values
('abc',2341,'{"max":2500e0,"value":"20","Type":"1","ProviderType":"ABC"}','{"Name":"ABC","minA":1200e0,"StartDate":1483900200000,"EndDate":1483986600000,"Flags":["flag4","flag3","flag2","flag1"]}')
,('xyz',6789,'{"max":1300e0,"value":"10","Type":"0","ProviderType":"foo"}','{"Name":"foo","minA":3.14159e0,"StartDate":1225864800000,"EndDate":1225864800000,"Flags":["foo","foo"]}')
;
select str,i
,jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
,jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate
,jsn2_Flags
from mytable
lateral view json_tuple (jsn1,'max','value','Type','ProviderType')
j1 as jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
lateral view json_tuple (jsn2,'Name','minA','StartDate','EndDate','Flags')
j2 as jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate,jsn2_Flags
;
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| str | i | jsn1_max | jsn1_value | jsn1_type | jsn1_providertype | jsn2_name | jsn2_mina | jsn2_startdate | jsn2_enddate | jsn2_flags |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| abc | 2341 | 2500.0 | 20 | 1 | ABC | ABC | 1200.0 | 1483900200000 | 1483986600000 | ["flag4","flag3","flag2","flag1"] |
| xyz | 6789 | 1300.0 | 10 | 0 | foo | foo | 3.14159 | 1225864800000 | 1225864800000 | ["foo","foo"] |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
Here's a public link to an example html file. I would like to extract each set of CAN and yearly tax information (example highlighted in red in the image below) from the file and construct a dataframe that looks like the one below.
Target Fields
Example DataFrame
| Row | CAN | Crtf_NoCrtf | Tax_Year | Land_Value | Improv_Value | Total_Value | Total_Tax |
|-----+--------------+-------------+----------+------------+--------------+-------------+-----------|
| 1 | 184750010210 | Yes | 2016 | 16720 | 148330 | 165050 | 4432.24 |
| 2 | 184750010210 | Yes | 2015 | 16720 | 128250 | 144970 | 3901.06 |
| 3 | 184750010210 | Yes | 2014 | 16720 | 109740 | 126460 | 3412.63 |
| 4 | 184750010210 | Yes | 2013 | 16720 | 111430 | 128150 | 3474.46 |
| 5 | 184750010210 | Yes | 2012 | 16720 | 99340 | 116060 | 3146.17 |
| 6 | 184750010210 | Yes | 2011 | 16720 | 102350 | 119070 | 3218.80 |
| 7 | 184750010210 | Yes | 2010 | 16720 | 108440 | 125160 | 3369.97 |
| 8 | 184750010210 | Yes | 2009 | 16720 | 113870 | 130590 | 3458.14 |
| 9 | 184750010210 | Yes | 2008 | 16720 | 122390 | 139110 | 3629.85 |
| 10 | 184750010210 | Yes | 2007 | 16720 | 112820 | 129540 | 3302.72 |
| 11 | 184750010210 | Yes | 2006 | 12380 | 112760 | | 3623.12 |
| 12 | 184750010210 | Yes | 2005 | 19800 | 107400 | | 3882.24 |
Additional Information
If it is not possible to attach the CAN to each row, that is okay; I can export the CAN numbers separately and find a way to join them to the dataframe containing the tax values. I have looked into using Beautiful Soup for Python, but I am an absolute novice with Python, and the rest of the scripts I am writing are in Julia, so I would prefer to keep everything in one language.
Is there any way to achieve what I am trying to do? I have looked at Gumbo.jl but cannot find any detailed documentation/tutorials.
So Gumbo.jl will parse the HTML and give you a programmatic representation of the structure of the HTML file (called a DOM - Document Object Model). This is typically a tree of HTML tags, which you can traverse to extract the data you need.
To make this easier, what you really want is a way to query the DOM, so that you can extract the data you need without having to traverse the entire tree yourself. The Cascadia.jl project does this for you. It is built on top of Gumbo, and uses CSS selectors as the query language.
So for your example, you could use something like the following to extract all the CAN fields:
julia> using Gumbo
julia> using Cascadia
julia> h=parsehtml(read("/Users/aviks/Download/z1.html", String))
julia> c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root)
13-element Array{Gumbo.HTMLNode,1}:
Gumbo.HTMLElement{:span}:
<span class="value">184750010210</span>
...
#print all the CAN values
julia> for x in c
println( x.children[1].text )
end
184750010210
186170040070
175630130020
172640020290
168330020230
156340030160
118210000020
190490040500
173480080430
161160010050
153510060090
050493000250
050470630910
Hopefully this gives you an idea of how to extract all the data you need.
The current answer is a bit out of date since the readall() function no longer exists. I'll update his answer below.
Here's a general breakdown of the package ecosystem for Julia (as of the time of writing this answer):
Requests.jl is used to download the HTML file itself (note that in avik's answer, he reads the HTML file from his local machine)
Cascadia.jl is required to search for CSS tags (e.g. the tag that you would find if you were to use Selector Gadget).
Gumbo.jl is required to parse the resulting HTML
The key thing to remember is that Gumbo stores objects in tree format as HTMLNodes or HTMLElements. So most objects have "parents" and "children." To get the data you need, it's simply a matter of filtering with the right selector (using Cascadia) and then going to the correct point in the Gumbo tree.
An updated version of avik's answer:
using Requests, Cascadia, Gumbo
# r = get(url) # Normally, you'd put a url here, but I couldn't find a way to grab it without having to download it and read it locally
# h = parsehtml(String(r.data)) # Then normally you'd execute this
# Instead, I'm going to read in the html file as a string and give it to Gumbo
h = parsehtml(readstring("z1.html"))
# Exploring with the various structure of Gumbo objects:
println(fieldnames(h.root))
println(fieldnames(h.root.children))
println(size(h.root.children))
# aviks code:
c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root);
for x in c
println( x.children[1].text )
end
This particular webpage is more difficult to scrape than most, since it doesn't have a great CSS structure.
There's some nice documentation on workflow on the Cascadia README, but I still had some questions after reading it. For anyone else (like me, yesterday) who comes to this page looking for guidance on web scraping in Julia, I've created a jupyter notebook with a simple example that will hopefully help you understand the workflow in greater detail.