Webscraping the data using R - html

Aim: I am trying to scrape the historical daily stock prices for all companies from the webpage http://www.nepalstock.com/datanepse/previous.php. The following code runs; however, it always returns the daily stock prices for the most recent date (Feb 5, 2015) only. In other words, the output is the same irrespective of the date I enter. I would appreciate any help in this regard.
library(RHTMLForms)
library(RCurl)
library(XML)
url <- "http://www.nepalstock.com/datanepse/previous.php"
forms <- getHTMLFormDescription(url)
# we are interested in the second list with date forms
# forms[[2]]
# HTML Form: http://www.nepalstock.com/datanepse/
# Date: [ ]
get_stock<-createFunction(forms[[2]])
#create sequence of dates from start to end and store it as a list
date_daily<-as.list(seq(as.Date("2011-08-24"), as.Date("2011-08-30"), "days"))
# determine the number of elements in the list
num<-length(date_daily)
daily_1<-lapply(date_daily,function(x){
  show(x) # displays the particular date
  readHTMLTable(htmlParse(get_stock(Date = x)), which = 7)
})
# the page contains 18 tables; table 7 is the one we want
# change the colnames
col_name<-c("SN","Traded_Companies","No_of_Transactions","Max_Price","Min_Price","Closing_Price","Total_Share","Amount","Previous_Closing","Difference_Rs.")
daily_2<-lapply(daily_1,setNames,nm=col_name)
Output:
> head(daily_2[[1]],5)
SN Traded_Companies No_of_Transactions Max_Price Min_Price Closing_Price Total_Share Amount
1 1 Agricultural Development Bank Ltd 24 489 471 473 2,868 1,359,038
2 2 Arun Valley Hydropower Development Company Limited 40 365 360 362 8,844 3,199,605
3 3 Alpine Development Bank Limited 11 297 295 295 150 44,350
4 4 Asian Life Insurance Co. Limited 10 1,230 1,215 1,225 898 1,098,452
5 5 Apex Development Bank Ltd. 23 131 125 131 6,033 769,893
Previous_Closing Difference_Rs.
1 480 -7
2 363 -1
3 303 -8
4 1,242 -17
5 132 -1
> tail(daily_2[[1]],5)
SN Traded_Companies No_of_Transactions Max_Price Min_Price Closing_Price Total_Share Amount Previous_Closing
140 140 United Finance Ltd 4 255 242 242 464 115,128 255
141 141 United Insurance Co.(Nepal)Ltd. 3 905 905 905 234 211,770 915
142 142 Vibor Bikas Bank Limited 7 158 152 156 710 109,510 161
143 143 Western Development Bank Limited 35 320 311 313 7,631 2,402,497 318
144 144 Yeti Development Bank Limited 22 139 132 139 14,355 1,921,511 134
Difference_Rs.
140 -13
141 -10
142 -5
143 -5
144 5

Here's one quick approach. Note that the site uses a POST request to send the date to the server.
library(rvest)
library(httr)
page <- "http://www.nepalstock.com/datanepse/previous.php" %>%
  POST(body = list(Date = "2015-02-01")) %>%
  html()
page %>%
  html_node(".dataTable") %>%
  html_table(header = TRUE)
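For readers who want to see what "sending the date with a POST request" means at the HTTP level, here is a minimal Python sketch using only the standard library. It builds the request but does not send it. The helper name `build_date_request` is made up for illustration; the form field name `Date` is taken from the code above, and whether the server still accepts it is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "http://www.nepalstock.com/datanepse/previous.php"

def build_date_request(date_str):
    """Build (but do not send) a POST request carrying the Date form field."""
    # Form fields are URL-encoded and placed in the request body.
    body = urlencode({"Date": date_str}).encode("ascii")
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return Request(URL, data=body)

req = build_date_request("2015-02-01")
print(req.get_method(), req.data)
```

Passing `req` to `urllib.request.urlopen` would actually submit the form. A plain GET of the same URL carries no date, which would be consistent with always getting the most recent day's table.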


mysql return 360 degrees returned with strengths

I have a table of wind directions and strengths over a 24-hour period; sample data is at the bottom of this question.
Only directions that have strengths are stored in the database. I'm currently using the following SQL:
SELECT winddirection, avespeed
FROM wp_weather_data
WHERE ID%10 = 0
What I would like to return is an entry for every degree (with a 0 value for any degree not in the database) and, where there are multiple entries for a given degree, only the highest value. The results also need to be in ascending order of degrees.
Is this possible?
This is so I can plot a wind distribution chart with a polar chart plugin in WordPress.
Sample data returned from the above SQL (winddirection, avespeed):
294 2
271 3
269 2
285 3
289 2
123 1
130 1
144 1
160 0
168 0
161 0
135 0
138 0
331 0
115 0
136 0
161 0
267 0
114 0
265 0
204 0
248 1
206 0
199 1
250 2
244 3
257 3
272 5
267 5
208 3
221 3
223 4
253 6
233 5
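The requested result (one row per degree, the maximum speed per degree, zeros for missing degrees, ascending order) can be sketched in Python as below. This is only an illustration of the target shape, not a MySQL answer; the function name `wind_by_degree` is made up, and the sample pairs are a few rows copied from the data above. In SQL the same idea is commonly expressed as a LEFT JOIN from a table of the numbers 0 to 359 combined with MAX() and GROUP BY.

```python
def wind_by_degree(readings):
    """Return [(degree, max_speed)] for degrees 0..359, with 0 where no reading exists."""
    strongest = [0] * 360
    for degree, speed in readings:
        slot = degree % 360  # keep any 360 reading in range by mapping it to 0
        strongest[slot] = max(strongest[slot], speed)
    # enumerate() yields the degrees in ascending order.
    return list(enumerate(strongest))

# (winddirection, avespeed) pairs taken from the sample data above
sample = [(294, 2), (271, 3), (267, 0), (267, 5), (161, 0), (161, 0)]
rose = wind_by_degree(sample)
print(rose[267], rose[294], rose[0])
```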

Retrieving Tree in Hierarchical data in MySQL

I have stored some hierarchical data of categories where each category is related to others. The trick is that a single category can have multiple parents (maximum 3, minimum 0).
The table structures are:
category table
id - Primary Key
name - Name of the Category
ref_id - Reference ID that is being used for relationship
id  name                          ref_id
--  ----------------------------  ------
1   everything                    -1
2   computing                     0
3   artificial intelligence       1
4   data science                  2
5   machine learning (ML)         3
6   programming                   4
7   web technologies              5
8   programming languages         7
9   content technologies          8
10  operating systems             9
11  algorithms                    10
12  software development systems  102
category_relation table
id  child_ref_id  parent_ref_id
--  ------------  -------------
1   0             -1
2   1             0
3   2             0
4   3             1
5   3             2
6   4             102
7   5             0
8   7             4
9   8             0
10  9             0
11  10            0
12  10            4
13  102           0
As you can see in the diagram, the relationship is fairly complicated: algorithms has two parents, computing and programming; similarly, machine learning (ML) has two parents, artificial intelligence and data science.
How can I retrieve all the children of a specific category, e.g. computing? I need to retrieve all the children down to the third level, i.e. programming languages and algorithms.
MySQL dump of the database: https://github.com/codersrb/multi-parent-hierarchy/blob/main/taxonomy.sql
Assuming the data structure is fixed with a good PK, in MySQL 8.x you can do:
with recursive
n (id, name, ref_id, lvl) as (
  select id, name, ref_id, 1 from category where id = 2 -- starting node
  union all
  select c.id, c.name, c.ref_id, n.lvl + 1
  from n
  join category_relation r on r.parent_ref_id = n.ref_id
  join category c on c.ref_id = r.child_ref_id
)
select * from n where lvl <= 3
Result:
id name ref_id lvl
---- --------------------------------------- ------- ---
2 computing 0 1
3 artificial intelligence 1 2
4 data science 2 2
7 web technologies 5 2
9 content technologies 8 2
10 operating systems 9 2
11 algorithms 10 2
62 information science 61 2
103 software / systems development 102 2
165 scientific computing 165 2
296 image processing 316 2
297 text processing 317 2
301 Google 321 2
322 computer vision 343 2
5 machine learning (ML) 3 3
5 machine learning (ML) 3 3
6 programming 4 3
18 models 17 3
21 classification 20 3
27 data preparation 26 3
28 data analysis 27 3
29 imbalanced datasets 28 3
50 visualization 49 3
61 information retrieval 60 3
68 k-means 67 3
71 Random Forest algorithm 70 3
104 project management 103 3
105 software development methodologies 104 3
107 web development 106 3
113 kNN model 112 3
132 CRISP-DM methodology 131 3
143 data 142 3
153 SMOTE 153 3
154 MSMOTE 154 3
157 backward feature elimination 157 3
158 forward feature selection 158 3
176 deep feature synthesis (DFS) 177 3
196 unsupervised learning 197 3
210 mean-shift 211 3
212 DBSCAN 213 3
246 naïve Bayes algorithm 247 3
248 decision tree algorithm 249 3
249 support vector machine (SVM) algorithm 250 3
251 neural networks 252 3
252 artificial neural networks (ANN) 253 3
281 deep learning 300 3
281 deep learning 300 3
285 image classification 304 3
285 image classification 304 3
286 natural language processing (NLP) 305 3
286 natural language processing (NLP) 305 3
288 text representation 307 3
294 visual recognition 314 3
295 optical character recognition (OCR) 315 3
295 optical character recognition (OCR) 315 3
296 image processing 316 3
298 machine translation (MT) 318 3
299 speech recognition 319 3
300 TensorFlow 320 3
302 R 322 3
304 Android 324 3
322 computer vision 343 3
323 object detection 344 3
324 instance segmentation 345 3
325 edge detection 346 3
326 image filters 347 3
327 feature maps 348 3
328 stride 349 3
329 padding 350 3
335 text preprocessing 356 3
336 tokenization 357 3
337 case normalization 358 3
338 removing punctuation 359 3
339 stop words 360 3
340 stemming 361 3
341 lemmatization 362 3
342 Porter algorithm 363 3
350 word2vec 371 3
351 Skip-gram 372 3
364 convnets 385 3
404 multiplicative update algorithm 716 3
If you want to remove duplicates you can use DISTINCT. For example:
with recursive
n (id, name, ref_id, lvl) as (
  select id, name, ref_id, 1 from category where id = 2 -- starting node
  union all
  select c.id, c.name, c.ref_id, n.lvl + 1
  from n
  join category_relation r on r.parent_ref_id = n.ref_id
  join category c on c.ref_id = r.child_ref_id
)
select distinct * from n where lvl <= 3
See running example at DB Fiddle.
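To experiment with the recursive query without a MySQL 8 server, the following Python sketch loads only the twelve sample categories and thirteen relations from the question into an in-memory SQLite database, which also supports WITH RECURSIVE. It is a sketch under assumptions, not the answer's exact query: the depth check is moved inside the recursive step (`n.lvl < 3`) so the walk stops early, and only id, name and lvl are selected. Because it uses the question's excerpt rather than the full dump, the result is much shorter than the output listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, ref_id INTEGER);
CREATE TABLE category_relation (
    id INTEGER PRIMARY KEY, child_ref_id INTEGER, parent_ref_id INTEGER);
""")
conn.executemany("INSERT INTO category VALUES (?, ?, ?)", [
    (1, "everything", -1), (2, "computing", 0),
    (3, "artificial intelligence", 1), (4, "data science", 2),
    (5, "machine learning (ML)", 3), (6, "programming", 4),
    (7, "web technologies", 5), (8, "programming languages", 7),
    (9, "content technologies", 8), (10, "operating systems", 9),
    (11, "algorithms", 10), (12, "software development systems", 102),
])
conn.executemany("INSERT INTO category_relation VALUES (?, ?, ?)", [
    (1, 0, -1), (2, 1, 0), (3, 2, 0), (4, 3, 1), (5, 3, 2), (6, 4, 102),
    (7, 5, 0), (8, 7, 4), (9, 8, 0), (10, 9, 0), (11, 10, 0), (12, 10, 4),
    (13, 102, 0),
])

QUERY = """
WITH RECURSIVE n (id, name, ref_id, lvl) AS (
    SELECT id, name, ref_id, 1 FROM category WHERE id = 2  -- starting node
    UNION ALL
    SELECT c.id, c.name, c.ref_id, n.lvl + 1
    FROM n
    JOIN category_relation r ON r.parent_ref_id = n.ref_id
    JOIN category c ON c.ref_id = r.child_ref_id
    WHERE n.lvl < 3  -- stop descending once level 3 has been reached
)
SELECT DISTINCT id, name, lvl FROM n ORDER BY lvl, id
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On this excerpt, machine learning (ML) is reached through both of its parents, so DISTINCT is what collapses it to a single row at level 3.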

BeautifulSoup parsed only one Column instead of entire Wikipedia table in Python

I am trying to parse the first table located here using BeautifulSoup in Python. It parsed the first column, but for some reason it didn't parse the entire table. Any help is appreciated!
Note: I am trying to parse entire table and convert into pandas dataframe
My Code:
import requests
from bs4 import BeautifulSoup
WIKI_URL = requests.get("https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records").text
soup = BeautifulSoup(WIKI_URL, features="lxml")
print(soup.prettify())
my_table = soup.find('table',{'class':'wikitable sortable'})
links=my_table.findAll('a')
print(links)
It only parsed one column because you did a findAll for only the links, which are the items in the first column. To parse the entire table, do a findAll for the table rows <tr> and then, within each row, a findAll for the table data cells <td>. Right now you are just finding the links and printing them.
my_table = soup.find('table',{'class':'wikitable sortable'})
# walk every row, then every data cell in that row
for row in my_table.findAll('tr'):
    print(','.join([td.get_text(strip=True) for td in row.findAll('td')]))
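The row-then-cell idea can be tried offline with the sketch below. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages. The class name `TableRows` and the inline `SAMPLE` markup are made up for illustration. Unlike the loop above, it also collects `<th>` header cells, so the header row is not left empty.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> still belongs to the open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

SAMPLE = """
<table class="wikitable sortable">
  <tr><th>Team</th><th>Won</th></tr>
  <tr><td><a href="#">Michigan</a></td><td>953</td></tr>
  <tr><td><a href="#">Texas</a></td><td>908</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
rows = parser.rows
for row in rows:
    print(",".join(row))
```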
NOTE: Accept B.Adler's solution as it is good work and sound advice. This solution is simply so you can see some alternatives as you are learning.
Whenever I see <table> tags, I'll usually check out pandas first to see if I can find what I need from the tables that way. pd.read_html() will return a list of dataframes, and you can work/manipulate those to extract what you need.
import pandas as pd
WIKI_URL = "https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records"
tables = pd.read_html(WIKI_URL)
You can also look through the dataframes to see which has the data you want.
I just used dataframe in index position 2 for this one, which is the first table you were looking for
table = tables[2]
Output:
print (table)
0 1 ... 6 7
0 Team Won ... Total Games Conference
1 Michigan 953 ... 1331 Big Ten
2 Ohio State 1 911 ... 1289 Big Ten
3 Notre Dame 2 897 ... 1263 Independent
4 Boise State 448 ... 618 Mountain West
5 Alabama 3 905 ... 1277 SEC
6 Oklahoma 896 ... 1274 Big 12
7 Texas 908 ... 1311 Big 12
8 USC 4 839 ... 1239 Pac-12
9 Nebraska 897 ... 1325 Big Ten
10 Penn State 887 ... 1319 Big Ten
11 Tennessee 838 ... 1281 SEC
12 Florida State 5 544 ... 818 ACC
13 Georgia 819 ... 1296 SEC
14 LSU 797 ... 1259 SEC
15 Appalachian State 617 ... 981 Sun Belt
16 Georgia Southern 387 ... 616 Sun Belt
17 Miami (FL) 630 ... 1009 ACC
18 Auburn 759 ... 1242 SEC
19 Florida 724 ... 1182 SEC
20 Old Dominion 76 ... 121 C-USA
21 Coastal Carolina 112 ... 180 Sun Belt
22 Washington 735 ... 1234 Pac-12
23 Clemson 744 ... 1248 ACC
24 Virginia Tech 743 ... 1262 ACC
25 Arizona State 614 ... 1032 Pac-12
26 Texas A&M 741 ... 1270 SEC
27 Michigan State 701 ... 1204 Big Ten
28 West Virginia 750 ... 1292 Big 12
29 Miami (OH) 690 ... 1195 MAC
.. ... ... ... ... ...
101 Memphis 482 ... 1026 The American
102 Kansas 582 ... 1271 Big 12
103 Wyoming 526 ... 1122 Mountain West
104 Louisiana 510 ... 1098 Sun Belt
105 Colorado State 520 ... 1124 Mountain West
106 Connecticut 508 ... 1107 The American
107 SMU 489 ... 1083 The American
108 Oregon State 530 ... 1173 Pac-12
109 UTSA 38 ... 82 C-USA
110 Kansas State 526 ... 1207 Big 12
111 New Mexico 483 ... 1103 Mountain West
112 Temple 468 ... 1094 The American
113 Iowa State 524 ... 1214 Big 12
114 Tulane 520 ... 1197 The American
115 Northwestern 535 ... 1240 Big Ten
116 UAB 126 ... 284 C-USA
117 Rice 470 ... 1108 C-USA
118 Eastern Michigan 453 ... 1089 MAC
119 Louisiana-Monroe 304 ... 727 Sun Belt
120 Florida Atlantic 87 ... 205 C-USA
121 Indiana 479 ... 1195 Big Ten
122 Buffalo 370 ... 922 MAC
123 Wake Forest 450 ... 1136 ACC
124 New Mexico State 430 ... 1090 Independent
125 UTEP 390 ... 1005 C-USA
126 UNLV11 228 ... 574 Mountain West
127 Kent State 341 ... 922 MAC
128 FIU 64 ... 191 C-USA
129 Charlotte 20 ... 65 C-USA
130 Georgia State 27 ... 94 Sun Belt
[131 rows x 8 columns]

Percentage by Row Group

I have a matrix with rows grouped by Dept (Department). I am trying to get the actual hours / required hours percentage in a column for each row group, but I can only get the total %, not the % by group. Ex:
I should get this:
Total Employee Req Hrs Rep Hrs % Billable hrs % NonBill Hrs % Time Off %
Dept A Total 672 680 101 575 85 140 21 8 1
Emp1 168 170 101 150 89 50 29 0 0
Emp2 168 165 98 120 71 20 12 8 4
Emp3 168 175 104 155 92 20 12 0 0
Emp4 168 170 101 150 89 50 29 0 0
Dept B Total 420 428 102 365 87 80 19 4 .1
Emp5 168 170 101 150 89 50 29 0 0
Emp6 84 84 98 60 71 10 12 4 4
Emp7 168 175 104 155 92 20 12 0 0
G Total 1092 1108 101 940 86 190 17 12 1
But I get this:
Total Employee Req Hrs Rep Hrs % Billable hrs % NonBill Hrs % Time Off %
Dept A Total 1684 1675 101 1250 86 225 17 12 1
Emp1 168 170 101 150 89 50 29 0 0
Emp2 168 165 98 120 71 20 12 8 4
Emp3 168 175 104 155 92 20 12 0 0
Emp4 168 170 101 150 89 50 29 0 0
Dept B Total 1092 1108 101 1250 86 225 17 12 1
Emp5 168 170 101 150 89 50 29 0 0
Emp6 84 84 98 60 71 10 12 4 4
Emp7 168 175 104 155 92 20 12 0 0
G Total 1092 1108 101 940 86 190 17 12 1
The totals are correct but the % is wrong.
I have several Datasets because the report only runs the department you are in, except for the VPs who can see all departments.
I Insert the percentage columns into the matrix and have tried several expressions with no results including:
=Fields!ActHrs.Value/Fields!ReqHrs.Value
=Sum(Fields!ActHrs.Value, "Ut_Query")/Sum(Fields!ReqHrs.Value, "Ut_Query")
=Sum(Fields!ActHrs.Value, "Ut_Query","Dept")/Sum(Fields!ReqHrs.Value, "Ut_Query","Dept")
=Sum(Fields!ActHrs.Value,"Dept", "Ut_Query")/Sum(Fields!ReqHrs.Value, "Dept","Ut_Query")
Plus more I can't even remember.
I tried creating new groups, and even a new matrix.
There must be a simple way to get the percentage by group, but I have not found an answer on any of the internet boards.
OK, I figured this out, but it doesn't make much sense. If I try:
=Textbox29/TextBox28 I get error messages about undefined variables.
If I go to the textbox properties, rename the textboxes to Act and Req, and use:
=Act/Req I get the right answer.
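Independently of the reporting tool, the per-group figure is a ratio of sums taken within each department, not a sum over the whole dataset. A minimal Python sketch of that arithmetic follows. The function name `percent_by_group` is made up, and the rows are the per-employee Req Hrs and Rep Hrs values from the example above.

```python
from collections import defaultdict

# (dept, employee, required_hours, reported_hours) from the example rows above
rows = [
    ("Dept A", "Emp1", 168, 170),
    ("Dept A", "Emp2", 168, 165),
    ("Dept A", "Emp3", 168, 175),
    ("Dept A", "Emp4", 168, 170),
    ("Dept B", "Emp5", 168, 170),
    ("Dept B", "Emp6", 84, 84),
    ("Dept B", "Emp7", 168, 175),
]

def percent_by_group(rows):
    """Return {dept: reported / required as a whole-number percentage}."""
    required = defaultdict(int)
    reported = defaultdict(int)
    for dept, _emp, req_hrs, rep_hrs in rows:
        required[dept] += req_hrs
        reported[dept] += rep_hrs
    # Divide the group sums; do not sum or average the per-employee percentages.
    return {dept: round(100 * reported[dept] / required[dept]) for dept in required}

pct = percent_by_group(rows)
print(pct)
```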

MySql query - Grouping by date ranges?

I'm trying to write a query that will return data grouped by date ranges and am wondering if there's a way to do it in one query (including the calculation), or if I need to write three separate ones? The dates in my table are stored as unix timestamps.
For example, my records look like this:
id type timestamp
312 87 1299218991
313 87 1299299232
314 90 1299337639
315 87 1299344130
316 87 1299348977
497 343 1304280210
498 343 1304280392
499 343 1304280725
500 343 1304280856
501 343 1304281015
502 343 1304281200
503 343 1304281287
504 343 1304281447
505 343 1304281874
566 90 1305222137
567 343 1305250276
568 343 1305387869
569 343 1305401114
570 343 1305405062
571 343 1305415659
573 343 1305421418
574 343 1305431457
575 90 1305431756
576 343 1305432456
577 259 1305441833
578 259 1305442234
580 343 1305456152
581 343 1305467261
582 343 1305483902
I'm trying to write a query that will find all records with a "created" date between:
2011-05-01 and 2011-06-01 (Month)
2011-03-01 and 2011-06-01 (Quarter)
2010-05-01 AND 2011-06-01 (Year)
I tried the following (in this case, I hardcoded the unix value for just the month to see if I could get it to work... ):
SELECT COUNT(id) AS idCount,
MIN(FROM_UNIXTIME(timestamp)) AS fromValue,
MAX(FROM_UNIXTIME(timestamp)) AS toValue
FROM uc_items
WHERE ADDDATE(FROM_UNIXTIME(timestamp), INTERVAL 1 MONTH)
>= FROM_UNIXTIME(1304233200)
But, it doesn't seem to work because the fromValue is: 2011-04-02 21:12:56 and the toValue is 2011-10-25 06:20:14, which, obviously, isn't a date between 5/1/2011 and 6/1/2011.
This ought to work:
SELECT COUNT(id) AS idCount,
FROM_UNIXTIME(MIN(timestamp)) AS fromValue,
FROM_UNIXTIME(MAX(timestamp)) AS toValue
FROM uc_items
WHERE timestamp BETWEEN UNIX_TIMESTAMP('2011-05-01') AND UNIX_TIMESTAMP('2011-06-01 23:59:59')
Also, as a performance tip: avoid applying functions to columns of your tables in a WHERE clause (e.g., you have WHERE ADDDATE(FROM_UNIXTIME(timestamp))). Doing that prevents MySQL from using any index on the timestamp column.
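The same principle (convert the date boundaries to Unix timestamps once, then compare the raw column) can be sketched in Python for all three ranges at once. This is an illustration under assumptions: the boundaries are taken at midnight UTC, whereas MySQL's UNIX_TIMESTAMP() uses the session time zone, and the ranges are half-open, so 2011-06-01 itself is excluded, unlike the BETWEEN above. The helper name `to_unix` is made up, and the rows are a few (id, timestamp) pairs copied from the question.

```python
from datetime import datetime, timezone

def to_unix(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# A few (id, timestamp) rows copied from the question's sample data
items = [
    (312, 1299218991), (314, 1299337639), (497, 1304280210),
    (505, 1304281874), (566, 1305222137), (582, 1305483902),
]

# All three ranges end at 2011-06-01; only the start differs.
end = to_unix(2011, 6, 1)
starts = {
    "month": to_unix(2011, 5, 1),
    "quarter": to_unix(2011, 3, 1),
    "year": to_unix(2010, 5, 1),
}

# Count the rows whose raw timestamp falls in [start, end) for each range.
counts = {
    label: sum(1 for _id, ts in items if start <= ts < end)
    for label, start in starts.items()
}
print(counts)
```

In MySQL, one query can return all three counts in the same way with conditional aggregation, for example one SUM(timestamp >= ...) expression per range, since a comparison evaluates to 1 or 0 there.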