Extract html table and create column in R - html

I'm trying to extract the table at the following URL: https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html
So far I've been trying to use:
library(RCurl)
library(XML)

url <- getURL("https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html", .opts = list(ssl.verifypeer = FALSE))
parse <- xmlParse(url, isHTML = TRUE)
r <- readHTMLTable(parse)
But only NULL tables are returned. I have some knowledge of XML, and as far as I understand, I can scrape HTML web pages just like XML files, but I can't find the specific node I need in order to do that.
Also, I would like to know how I can create a column with the Fecha value.
Thanks!
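In case it helps, here is a minimal sketch with rvest (untested against this page; whether it works depends on the tables actually being present in the raw HTML rather than built by JavaScript, and the table index and date pattern below are guesses):
library(rvest)

url <- "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
page <- read_html(url)

# Convert every <table> node to a data frame, then inspect which one holds the offer data
tables <- page %>% html_elements("table") %>% html_table()
str(tables)

# Assuming (hypothetically) the offer data sits in the first table:
offers <- tables[[1]]

# Add a Fecha column; here the date is pulled from the URL, but you could scrape the Fecha value from the page instead
offers$Fecha <- regmatches(url, regexpr("\\d{4}-\\d{2}-\\d{2}", url))
If the SSL verification problem you hit with getURL() shows up here too, you can fetch the page first with httr::GET(url, httr::config(ssl_verifypeer = FALSE)) and pass httr::content(resp, as = "text") to read_html().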

Related

R dbplyr mysql column conversion

I have a table in MySQL that looks something like this:
tbl <- tibble(
  Result = c("0.1", "<0.0001", "1.1"),
  Unit = c("mg/L", "ug/L", "mg/L"),
  Pref_Unit = c("mg/L", "mg/L", "mg/L"),
  Conversion = c(1, 1000, 1)
)
What I would like to do, using dbplyr, pool, and RMariaDB, is convert the Result column to the preferred unit using the conversion factor in the table, while preserving the "<". I also want to split the Result column into a numeric column containing only the number and a censored column indicating whether the Result contained a "<".
With regular dplyr, I would do something like this:
tbl <- tbl %>%
  mutate(numb_Result = as.numeric(gsub("<", "", Result)),
         cen_Result = grepl("<", Result)) %>%
  mutate(new_Result = ifelse(cen_Result,
                             paste0("<", numb_Result * Conversion),
                             paste0(numb_Result * Conversion)))
But that doesn't work with the database table. Any help would be appreciated.
The challenge is most likely that dbplyr does not have translations defined for gsub and grepl. Here are a couple of possibilities for you to test:
library(dplyr)
library(dbplyr)
tbl <- tibble(
  Result = c("0.1", "<0.0001", "1.1"),
  Unit = c("mg/L", "ug/L", "mg/L"),
  Pref_Unit = c("mg/L", "mg/L", "mg/L"),
  Conversion = c(1, 1000, 1)
)

remote_table <- tbl_lazy(tbl, con = simulate_mssql())

remote_table %>%
  mutate(has_sign = ifelse(substr(Result, 1, 1) == "<", 1, 0)) %>%
  mutate(removed_sign = ifelse(has_sign == 1, substr(Result, 2, nchar(Result)), Result)) %>%
  mutate(num_value = as.numeric(removed_sign)) %>%
  mutate(converted = as.character(1.0 * num_value * Conversion)) %>%
  mutate(new_Result = ifelse(has_sign, paste0("<", converted), converted))
There are dbplyr translations for ifelse, substr, nchar, as.numeric, as.character, and paste0, so I expected this to work. However, I keep getting an error because the translator requires the start and stop arguments of substr to be constants, so it does not like nchar(Result) being passed as an argument. This might be fixed in a more recent version of the package.
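If it helps, a quick way to check how a given dbplyr version handles a computed stop position is to translate just that one expression (a sketch; the output, or error, depends on the installed version):
library(dbplyr)

# Translate a single expression against a simulated MSSQL connection to see how
# (or whether) substr() with nchar(Result) as the stop argument gets handled
translate_sql(substr(Result, 2, nchar(Result)), con = simulate_mssql())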
My second attempt:
remote_table %>%
  mutate(has_sign = ifelse(substr(Result, 1, 1) == "<", 1, 0),
         character_length = nchar(Result),
         remove_first = sql(REPLACE(Result, "<", ""))) %>%
  mutate(removed_sign = ifelse(has_sign == 1, remove_first, Result)) %>%
  mutate(num_value = as.numeric(removed_sign)) %>%
  mutate(converted = as.character(1.0 * num_value * Conversion)) %>%
  mutate(new_Result = ifelse(has_sign, paste0("<", converted), converted))
This produces the expected SQL translation, but as I am using a simulated database connection, I have not been able to test whether it returns the expected output. The downside of this approach is that it uses the SQL function REPLACE directly (it passes untranslated into the SQL code), which is less elegant than a fully translated solution.
There are probably more elegant ways to do this. But hopefully between these two you can find a suitable solution.
Thank you Simon!
I had found a similar solution that does work in the actual SQL database environment (note that I also had to pass the column name as a variable, result_col, hence the use of !!sym()):
tbl %>%
  mutate(numb_res = REGEXP_REPLACE(!!sym(result_col), "<", ""),
         cen_res = !!sym(result_col) %like% "<%") %>%
  mutate(numb_res = numb_res * Conversion) %>%
  mutate(!!result_col := case_when(
    cen_res == 1 ~ paste0("<", numb_res),
    TRUE ~ paste0(numb_res)
  ))
It seems you are correct that there is no SQL translation for as.character() and as.numeric(), but just doing the multiplication on the character column is enough to make it numeric, and similarly, pasting the values together with "<" makes it back into a character.
I think this is working for me, but I will investigate your answer as well.
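A quick way to sanity-check the SQL this kind of pipeline generates, without touching the database, is dbplyr's simulated backends (a sketch; simulate_mysql() only previews the SQL, and REGEXP_REPLACE()/%like% are not R functions but are passed through to the backend):
library(dplyr)
library(dbplyr)

lazy_tbl <- tbl_lazy(
  tibble(Result = c("0.1", "<0.0001", "1.1"), Conversion = c(1, 1000, 1)),
  con = simulate_mysql()
)

lazy_tbl %>%
  mutate(numb_res = REGEXP_REPLACE(Result, "<", ""),
         cen_res = Result %like% "<%") %>%
  mutate(new_Result = ifelse(cen_res,
                             paste0("<", numb_res * Conversion),
                             paste0(numb_res * Conversion))) %>%
  show_query()  # prints the generated SQL instead of executing it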

sqlalchemy - Creating a table with a dictionary

I'm working on a project where the user can also create a table. Initially I was getting tables as JSON from the user and adding them as a column of a table named application, but because of some problems I now have to let the user create a table directly.
To get to the actual question, let's assume there is a table like this:
name = "t_name"
rows = ["column1","column2","column3"]
How can I convert this to:
t_name = Table(
    't_name', meta,
    Column('column1', String),
    Column('column2', String),
    Column('column3', String),
)
I solved the problem in a similar way.
columns_names = ['id', 'date', 'name', 'username', 'password']
columns_types = [Integer, DateTime]
primary_key_flags = [True, False]

# Pad the lists so every remaining column is a non-key VARCHAR(80)
while len(columns_types) < len(columns_names):
    columns_types.append(VARCHAR(80))
    primary_key_flags.append(False)

Table(name, meta,
      *(Column(column_name, column_type, primary_key=primary_key_flag, nullable=True)
        for column_name, column_type, primary_key_flag
        in zip(columns_names, columns_types, primary_key_flags)))
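For the simple case in the original question, where every column is a plain String, a more direct sketch (assuming meta is an existing MetaData instance) could be:
from sqlalchemy import Table, Column, String, MetaData

meta = MetaData()
name = "t_name"
rows = ["column1", "column2", "column3"]

# Build one String column per entry in rows and unpack them into the Table() constructor
t_name = Table(name, meta, *(Column(col, String) for col in rows))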

Scrapy and incomplete data, how do I store everything correctly?

I'm still a beginner with Scrapy, but this problem really got me scratching my head. I've got a web store from which I need to extract data. The data is all on one page, but most of the time it is incomplete: it always has a name, but not always an amount or a description. It's structured in repeating classes like the ones listed below; note that this example has all three data fields filled.
I need:
The product name, located in <h4 class="mod-article-tile__title">
The product amount, located in <span class="price__unit">
The product description, located in <div class="mod-article-tile__info">
I managed to extract the data I need like this:
import pprint
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'aldi'
    allowed_domains = ['aldi.nl']
    base_url = 'https://www.aldi.nl/onze-producten/a-merken.html'
    start_urls = ['https://www.aldi.nl/onze-producten/a-merken.html']

    def parse(self, response):
        products = response.xpath('//*[@class="mod-article-tile__content"]').extract()
        name = response.xpath('//*[@class="mod-article-tile__title"]/text()').extract()
        amount = response.xpath('//*[@class="price price--50 price--right mod-article-tile__price"]/text()').extract()
        info = response.xpath('//*[@class="mod-article-tile__info"]/p/text()').extract()
        i = 0
        for product in products:
            pprint.pprint(name[i] + " : " + amount[i] + ", " + info[i])
            i += 1
However, this doesn't take incomplete data into account: since not all lists have the same length, an IndexError is thrown and the data isn't matched up correctly. I tried iterating over product instead, but I can't call xpath on it afterwards because it's a string.
So, is there a way to use xpath on the string result, or another way to extract the data from product? Or should I instead check whether the parsed data is empty and insert empty values there?
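One common Scrapy pattern for this (a sketch based only on the class names given above, untested against the live site) is to loop over the selector objects instead of calling .extract() first, and run relative XPath queries per product; missing fields then simply come back as None (use .extract_first() instead of .get() on older Scrapy versions):
def parse(self, response):
    # Iterate over selectors, not extracted strings, so each product keeps its own sub-tree
    for product in response.xpath('//*[@class="mod-article-tile__content"]'):
        yield {
            'name': product.xpath('.//*[@class="mod-article-tile__title"]/text()').get(),
            'amount': product.xpath('.//*[@class="price price--50 price--right mod-article-tile__price"]/text()').get(),
            'info': product.xpath('.//*[@class="mod-article-tile__info"]/p/text()').get(),
        }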
Oh, and also, I can't seem to remove the pesky \n\t characters that appear everywhere. I tried:
def clean_string(self, string):
    result = string.replace('\\n', '')
    result = result.replace('\\t', '')
    return result.strip()
But it didn't do the trick. Anyone able to drop a hint to resolve that?
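On the \n\t issue, one likely cause (an assumption, since I can't see the raw strings) is that the scraped text contains real newline and tab characters, while '\\n' in the code above is the two-character sequence backslash-n, which never matches them. A sketch that collapses all whitespace instead:
import re

def clean_string(self, string):
    # Collapse runs of real whitespace (newlines, tabs, repeated spaces) into single spaces
    return re.sub(r'\s+', ' ', string).strip()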

Date is getting converted to a number while making HTML of a data frame

While converting the data frame to HTML, the Date column is getting converted to a number.
library(xtable)
print(xtable(Data), type = "html", file = "Data.html", timestamp = date())
The first column of this data frame is in Date format, and it is being converted to a number.
You could try tableHTML, which handles dates. As a quick example:
library(tableHTML)
Data <- data.frame(a = 1:10, b = as.Date('2017-01-01'))
mytable <- tableHTML(Data, rownames = FALSE)
mytable
And to write it in a file, you can use:
write_tableHTML(mytable, file = 'Data.html')
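If you would rather stay with xtable, a minimal workaround (a sketch with a made-up data frame in place of yours) is to format the Date column as character before building the table:
library(xtable)

Data <- data.frame(Date = as.Date('2017-01-01') + 0:9, Value = 1:10)

# Format the Date column as text so xtable prints the date string
# rather than the underlying numeric day count
Data$Date <- format(Data$Date, "%Y-%m-%d")
print(xtable(Data), type = "html", file = "Data.html", timestamp = date())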

Programming R - Error in if (errors) return(odbcGetErrMsg(channel)) else return(invisible(stat)) :

I am new to R and I am trying to do something that feels simple, but I can't get my code to work.
When I run my code (an sqlQuery call which saves the data to a SQL database), it works fine with the actual database name, but when I use an object as the database name instead I get the following error: Error in if (errors) return(odbcGetErrMsg(channel)) else return(invisible(stat)) : argument is not interpretable as logical
The way I am using the object name in my R code is, for example, select * from ",object,".dbo.tstTable. The object dataBase holds the date of the previous Friday.
StartCode(Server = "Server01",DB=dataBase,WH=FALSE) POLICYLIST <- sqlQuery(channel1," SELECT DISTINCT [POLICY_ID] FROM ",dataBase,".[dbo].[policy] ") StartCode(Server = "SERVER02",DB="DataQuality",WH=FALSE) sqlQuery(channel1,"drop table DQ1") sqlSave (channel1, POLICYLIST, "DQ1")
Finally figured out why my code was not working; I changed it to the code below. I just needed to add paste(). Apologies for my stupid question!
StartCode(Server = "Server01", DB = dataBase, WH = FALSE)
POLICYLIST <- sqlQuery(channel1, paste(" SELECT DISTINCT [POLICY_ID] FROM ", dataBase, ".[dbo].[policy] "))
StartCode(Server = "SERVER02", DB = "DataQuality", WH = FALSE)
sqlQuery(channel1, "drop table DQ1")
sqlSave(channel1, POLICYLIST, "DQ1")