R: Extracting elements between characters in a web page

R: Extracting elements between characters in a web page - html

I have two lines of info from a web page that I want to parse into a data.frame.
[104] " $1775 / 2br - 1112ft² - Wonderful two bedroom two bathroom with balcony! (14001 NE 183rd Street )"
[269] " var pID = \"4619136687\";"
I'd like it to look like this.
postID |rent|type|size|description |location
4619136687|1775|2br |1112|Wonderful two bedroom...|14001 NE 183rd Street
I was able to use the sub() command to get the ID but I'm not exactly familiar with regex in the sub() command to parse out what I need when there are spaces, such as in line [104].
sub(".*pID = \"(.*)\";.*","\\1", " var pID = \"4619136687\";")
Any help would be wonderful, Thanks!

Related

Unable to import json file into spyder

This is my first time using any sort of code. I have been following along with an interactive tutorial and I seem to be stuck at the very first step, trying to import a json file containing info regarding football competition data. It seems fairly straightforward but error message after error message has started to drive me insane.
I am trying to load the data into python in order to follow along with a tutorial (I will leave a link below). I believe I have saved my files and data in the same way as in the tutorial but when I change the file directory and run: import json I get a few different error messages if someone could advise on what I’m doing wrong it would be greatly appreciated. My goal is to load in the data which I have downloaded from GitHub and open the competitions JSON file.
I am also happy to provide any information required to help answer this question.
YouTube video:https://youtu.be/GTtu0t03FMO
error messages:
FileNotFoundError: [Errno 2] No such file or directory: 'Statsbomb/data/competitions.json'
JSONDecodeError:Expecting value
#Load in Statsbomb competition and match data
#This is a library for loading json files.
import json
#Load the competition file
#Got this by searching 'how do I open json in Python'
with open('Statsbomb/data/competitions.json') as f:
competitions = json.load(f)
#Womens World Cup 2019 has competition ID 72
competition_id=72
#Womens World Cup 2019 has competition ID 72
competition_id=72
#Load the list of matches for this competition
with open('Statsbomb/data/matches/'+str(competition_id)+'/30.json') as f:
matches = json.load(f)
#Look inside matches
matches[0]
matches[0]['home_team']
matches[0]['home_team']['home_team_name']
matches[0]['away_team']['away_team_name']
#Print all match results
for match in matches:
home_team_name=match['home_team']['home_team_name']
away_team_name=match['away_team']['away_team_name']
home_score=match['home_score']
away_score=match['away_score']
describe_text = 'The match between ' + home_team_name + ' and ' + away_team_name
result_text = ' finished ' + str(home_score) + ' : ' + str(away_score)
print(describe_text + result_text)
#Now lets find a match we are interested in
home_team_required ="England"
away_team_required ="Sweden"
#Find ID for the match
for match in matches:
home_team_name=match['home_team']['home_team_name']
away_team_name=match['away_team']['away_team_name']
if (home_team_name==home_team_required) and (away_team_name==away_team_required):
match_id_required = match['match_id']
print(home_team_required + ' vs ' + away_team_required + ' has id:' + str(match_id_required))
#Exercise:
#1, Edit the code above to print out the result list for the Mens World cup
#2, Edit the code above to find the ID for England vs. Sweden
#3, Write new code to write out a list of just Sweden's results in the tournament.

with open('Statsbomb/data/matches/'+str(competition_id)+'/30.json') as f:
matches = json.load(f)
try:
with open('Statsbomb/data/matches/'+str(competition_id)+'/3.json') as f:
matches = json.load(f)

Unexpected behavior of gsub in R

Excuse me for not being more specific in the title, but I don't know how to explain this without an example.
I have a .html file that looks like this:
<TR><TD>log p-value:</TD><TD>-2.797e+02</TD></TR>
<TR><TD>Information Content per bp:</TD><TD>1.736</TD></TR>
<TR><TD>Number of Target Sequences with motif</TD><TD>894.0</TD></TR>
<TR><TD>Percentage of Target Sequences with motif</TD><TD>47.58%</TD></TR>
<TR><TD>Number of Background Sequences with motif</TD><TD>10864.6</TD></TR>
<TR><TD>Percentage of Background Sequences with motif</TD><TD>22.81%</TD></TR>
<TR><TD>Average Position of motif in Targets</TD><TD>402.4 +/- 261.2bp</TD></TR>
<TR><TD>Average Position of motif in Background</TD><TD>400.6 +/- 246.8bp</TD></TR>
<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>-0.0</TD></TR>
<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>1.48</TD></TR>
I read it in:
html = readLines("file.html")
I am interested in whatever is between </TD><TD> and </TD></TR>. When I run the following, I get the result I want:
mypattern = '<TR><TD>log p-value:</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
[1] "-2.797e+02"
It works well for almost all lines I want to match, but when I do the same thing for the last two lines, it does not extract anything.
mypattern = '<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
mypattern = '<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
Why is this happening?
Thank you for your help.

If your data structure is really like this. You have a xml file with keys and values so I assume it is easier to utilize this!
library(xml2)
xd <- read_xml("file.html", as_html = TRUE)
key_values <- xml_text(xml_find_all(xd, "//td"))
is_key <- as.logical(seq_along(key_values) %% 2)
setNames(key_values[!is_key], key_values[is_key])

First, I'll say that I would actually solve this problem like this:
gsub(".+>([^<]+)</TD></TR>", "\\1", html)
#> [1] "-2.797e+02" "1.736" "894.0"
#> [4] "47.58%" "10864.6" "22.81%"
#> [7] "402.4 +/- 261.2bp" "400.6 +/- 246.8bp" "-0.0"
#> [10] "1.48"
But, to answer the question of why your way didn't work, we need to checkout the help file for R regular expressions (help("regex")):
Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ? ...
The patterns that you had trouble with included parentheses, which you needed to escape (note the double backslash, since backslashes themselves need to be escaped):
mypattern = '<TR><TD>Multiplicity \\(# of sites on avg that occur together\\)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
# [1] "1.48"

LuaLaTex using fontspec package and luacode reading JSON file

I'm using Latex since years but I'm new to embedded luacode (with Lualatex). Below you can see a simplified example:
\begin{filecontents*}{data.json}
[
{"firstName":"Max", "lastName":"Möller"},
{"firstName":"Anna", "lastName":"Smith"}
];
\end{filecontents*}
\documentclass[11pt]{article}
\usepackage{fontspec}
%\setmainfont{Carlito}
\usepackage{tikz}
\usepackage{luacode}
\begin{document}
\begin{luacode}
require("lualibs.lua")
local file = io.open('data.json','rb')
local jsonstring = file:read('*a')
file.close()
local jsondata = utilities.json.tolua(jsonstring)
tex.print('\\begin{tabular}{cc}')
for key, value in pairs(jsondata) do
tex.print(value["firstName"] .. ' & ' .. value["lastName"] .. '\\\\')
end
tex.print('\\hline\\end{tabular}')
\end{luacode}
\end{document}
When executing Lualatex following error occurs:
LuaTeX error [\directlua]:6: attempt to index field 'json' (a nil value) [\directlua]:6: in main chunk. \end{luacode}
When commenting the line \usepackage{fontspec} the output will be produced. Alternatively, the error can be avoided by commenting utilities.json.tolua(jsonstring) and all following lua-code lines.
So the question is: How can I use both "fontspec" package and json-data without generating an error message? Apart from this I have another question: How to enable german umlauts in output of luacode (see first "lastName" in example: Möller)?
Ah, I'm using TeX Live 2015/Debian on Ubuntu 16.04.
Thank you,
Jerome

Error while trying to parse json into R

I have recently started using R and have a task regarding parsing json in R to get a non-json format. For this, i am using the "fromJSON()" function. I have tried to parse json as a text file. It runs successfully when i do it with just a single row entry. But when I try it with multiple row entries, i get the following error:
fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
lexical error: invalid char in json text.
[{'CategoryType':'dining','City':
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
"mumbai","Location":"all"}] [{"JourneyType":"Return","Origi
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: after array element, I expect ',' or ']'
:"mumbai","Location":"all"} {"JourneyType":"Return","Origin
(right here) ------^
The above errors are due to three different formats in which i tried to parse the json text, but the result was the same, only the location suggested by changed.
Please help me to identify the cause of this error or if there is a more efficient way o performing the task.
The original file that i have is an excel sheet with multiple columns and one of those columns consists of json text. The way i tried right now is by extracting just the json column and converting it to a tab separated text and then parsing it as:
fromJSON("D:/Eclairs/Printing/test3.txt")
Please also suggest if this can be done more efficiently. I need to map all the columns in the excel to the non-json text as well.
Example:
[{"CategoryType":"dining","City":"mumbai","Location":"all"}]
[{"CategoryType":"reserve-a-table","City":"pune","Location":"Kothrud,West Pune"}]
[{"Destination":"Mumbai","CheckInDate":"14-Oct-2016","CheckOutDate":"15-Oct-2016","Rooms":"1","NoOfPax":"3","NoOfAdult":"3","NoOfChildren":"0"}]

Consider reading in the text line by line with readLines(), iteratively saving the JSON dataframes to a growing list:
library(jsonlite)
con <- file("C:/Path/To/Jsons.txt", open="r")
jsonlist <- list()
while (length(line <- readLines(con, n=1, warn = FALSE)) > 0) {
jsonlist <- append(jsonlist, list(fromJSON(line)))
}
close(con)
jsonlist
# [[1]]
# CategoryType City Location
# 1 dining mumbai all
# [[2]]
# CategoryType City Location
# 1 reserve-a-table pune Kothrud,West Pune
# [[3]]
# Destination CheckInDate CheckOutDate Rooms NoOfPax NoOfAdult NoOfChildren
# 1 Mumbai 14-Oct-2016 15-Oct-2016 1 3 3 0

octave/matlab read text file line by line and save only numbers into matrix

I have a question regarding octave or matlab data post processing.
I have files exported from fluent like below:
"Surface Integral Report"
Mass-Weighted Average
Static Temperature (k)
crossplane-x-0.001 1242.9402
crossplane-x-0.025 1243.0017
crossplane-x-0.050 1243.2036
crossplane-x-0.075 1243.5321
crossplane-x-0.100 1243.9176
And I want to use octave/matlab for post processing.
If I read first line by line, and save only the lines with "crossplane-x-" into a new file, or directly save the data in those lines into a matrix. Since I have many similar files, I can make plots by just calling their titles.
But I go trouble on identify lines which contain the char "crossplane-x-". I am trying to do things like this:
clear, clean, clc;
% open a file and read line by line
fid = fopen ("h20H22_alongHGpath_temp.dat");
% save full lines into a new file if only chars inside
txtread = fgetl (fid)
num_of_lines = fskipl(fid, Inf);
char = 'crossplane-x-'
for i=1:num_of_lines,
if char in fgetl(fid)
[x, nx] = fscanf(fid);
print x
endif
endfor
fclose (fid);
Would anybody shed some light on this issue ? Am I using the right function ? Thank you.

Here's a quick way for your specific file:
>> S = fileread("myfile.dat"); % collect file contents into string
>> C = strsplit(S, "crossplane-x-"); % first cell is the header, rest is data
>> M = str2num (strcat (C{2:end})) % concatenate datastrings, convert to numbers
M =
1.0000e-03 1.2429e+03
2.5000e-02 1.2430e+03
5.0000e-02 1.2432e+03
7.5000e-02 1.2435e+03
1.0000e-01 1.2439e+03

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

R: Extracting elements between characters in a web page - html

Related

Unable to import json file into spyder

Unexpected behavior of gsub in R

LuaLaTex using fontspec package and luacode reading JSON file

Error while trying to parse json into R

octave/matlab read text file line by line and save only numbers into matrix

Categories

Resources