How can I read a specific line or lines from HTML in R?
I have an "HTMLInternalDocument" object as the result of the following code:
url<-myURL
html<-htmlTreeParse(url,useInternalNodes=T)
Now I need to get specific lines from this html object as text, for example to count the number of characters in each line.
How can I do that in R?
Seeing that you are using the XML library, you will need to use one of its getNodeSet-style functions such as xpathApply. This requires some knowledge of XPath, which the function uses to query the HTMLInternalDocument. You can learn more with ?xpathApply.
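For instance, a minimal sketch (using an inline HTML string instead of a fetched URL so the example is self-contained):

```r
library(XML)

# Parse a small inline document (stands in for htmlTreeParse(url, ...))
doc <- htmlTreeParse("<html><body><p>first</p><p>second line</p></body></html>",
                     useInternalNodes = TRUE)

# Extract the text of every <p> node via an XPath expression
paras <- unlist(xpathApply(doc, "//p", xmlValue))
paras          # "first" "second line"
nchar(paras)   # character count of each extracted node
```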
Using the XML library is over-complicating the problem. As Grothendieck pointed out, readLines, a base function, will do the job. Something like this:
x <- 10 ## or any other index you want to subset on
html <- readLines(url)
html[x]
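Since the goal is counting characters per line, you can feed the result of readLines straight into nchar. A self-contained sketch using a text connection in place of the URL:

```r
# Three lines standing in for a downloaded page
page <- "<html>\n<body>Hello</body>\n</html>"
html <- readLines(textConnection(page))

html[2]        # the second line: "<body>Hello</body>"
nchar(html[2]) # its character count
```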
I'm using Screaming Frog to extract data from JSON generated from a URL.
The generated JSON has this form:
{"ville":[{"codePostal":"13009","ville":"VAUFREGE","popin":"ouverturePopin","zoneLivraison":"1300913982","url":""},{"codePostal":"13009","ville":"LES BAUMETTES","popin":"ouverturePopin","zoneLivraison":"1300913989","url":""},{"codePostal":"13009","ville":"MARSEILLE 9EME ARRON","popin":"ouverturePopin","zoneLivraison":"1300913209","url":""}]}
I'm using this regex in Custom > Extraction in Screaming Frog to extract the values of "codePostal":
"codePostal":".*?"
Problem is, it doesn't extract anything.
When I test my regex in regex101, it seems correct.
Do you have any clue about what is wrong?
Thanks.
Regards.
Have you tried saving the output to see what Screaming Frog actually sees? At first it doesn't matter whether your regex works.
That said, don't forget that SF is a Java-based tool, so Java's regex engine is what evaluates your expression; make sure you test your regular expressions against the correct dialect.
You need to specify group extractors enclosed in parentheses. For instance, in your example you need ("codePostal":".*?") as the extractor.
In addition if you simply want to extract the value, you could use the following instead.
"codePostal":"(.*?)"
It's not a problem with your regular expression. The problem seems to be with the content type: Screaming Frog isn't properly reading application/json content types for scraping. Hopefully they will fix this bug.
EDIT: I know similar questions like this have been asked on SO but nothing compares to how simple I need this to be :)
I have a classic-asp web page that reads a CSV file and spits it onto the page as HTML. The content in the CSV file, however, contains some paragraphs with properly formed sentences.
In short, I need to display the punctuation of these paragraphs, which includes commas.
This is a snippet of what my parsing looks like:
sRows = oInStream.ReadLine
arrRows = Split(sRows, ",")
If arrRows(0) = aspFileName And arrRows(1) = "minCamSys" Then
    minCamSys1 = arrRows(2)
    minCamSys2 = arrRows(3)
    minCamSys3 = arrRows(4)
End If
How can I alter my Split() so that commas can be displayed without breaking the CSV format?
I would prefer to use double quotes around the data that contains a comma (as is usually the CSV standard when importing to Excel). For example:
Peter,Jeff,"Jim was from Ontario, Canada",Scott
I would like to avoid the use of a library as this is a simple in-house application.
Thank you!
Well folks, the answer was right in front of my face. It's kind of silly really, but for this application it will suffice.
I swapped out the comma delimiter for a pipe (|). The new code looks like this:
sRows = oInStream.readLine
arrRows = Split(sRows,"|")
This may not be a great solution but for this simple application it is all that is necessary.
I want to give some of the entries of my data.frame a special id or class so that I can use it later in HTML, after turning the data.frame into an HTML table with knitr. Is that possible? I intend to use this later with jQuery DataTables for special formatting.
There are many ways to add a new column to your data.frame. You might try something like:
mydf$newID <- seq(nrow(mydf))
Or
mydf <- transform(mydf, newID = seq(nrow(mydf)))
And many others...
Or, using the data.table package:
mydt[,newID:=.I]
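A quick illustration of the base-R version (mydf here is a made-up example frame):

```r
mydf <- data.frame(value = c(10, 20, 30))

# One sequential ID per row, usable later as an HTML id/class hook
mydf$newID <- seq(nrow(mydf))
mydf$newID  # 1 2 3
```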
I have three files: Conf.txt, Temp1.txt and Temp2.txt. I have used a regex to fetch some values from Conf.txt. I want to substitute those values (which appear under the same names in Temp1.txt and Temp2.txt) and create two new files, say Temp1_new.txt and Temp2_new.txt.
For example: in Conf.txt I have a value named IP1, and the same name appears in Temp1.txt and Temp2.txt. I want to create Temp1_new.txt and Temp2_new.txt, replacing IP1 with, say, 192.X.X.X in Temp1.txt and Temp2.txt.
I would appreciate it if someone could help me with Tcl code to do this.
Judging from the information provided, there basically are two ways to do what you want:
File-semantics-aware;
Brute-force.
The first way is to read the source file, parse it to produce certain structured in-memory representation of its content, then serialize this content to the new file after replacing the relevant value(s) in the produced representation.
The brute-force method means treating the contents of the source file as plain text (or a series of text strings) and running something like regsub or string map on this text to produce the new text, which you then save to the new file.
The first way should generally be favoured, especially for complex cases, as it removes any chance of replacing irrelevant bits of text. The brute-force way may be simpler to code (if there's no handy library, see below) and is therefore good for throw-away scripts.
Note that for certain file formats there are ready-made libraries which can automate what you need. For instance, the XSLT facilities of the tdom package can be used to manipulate XML files, INI-style files can be modified using the appropriate library, and so on.
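A minimal brute-force sketch in Tcl, assuming the name/value pairs have already been extracted from Conf.txt (the values mapping below is hypothetical). It is shown on an in-memory string; in your script the template text would come from [read]-ing Temp1.txt, and the result would be written out to Temp1_new.txt:

```tcl
# Hypothetical mapping already parsed out of Conf.txt
set values {IP1 192.0.2.10 IP2 192.0.2.11}

# Stand-in for the text [read] from Temp1.txt / Temp2.txt
set template "primary IP1 backup IP2"

# string map replaces every occurrence of each placeholder
set newText [string map $values $template]
puts $newText   ;# primary 192.0.2.10 backup 192.0.2.11
```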
I apologize if this has been asked previously, but I haven't been able to find an example online or elsewhere.
I have a very dirty data file in text format (it may be JSON). I want to analyze the data in R, and since I am still new to the language, I want to read in the raw data and manipulate it as needed from there.
How would I go about reading JSON from a text file on my machine? Additionally, if it isn't JSON, how can I read in the raw data as-is (not parsed into columns, etc.) so I can go ahead and figure out how to parse it as needed?
Thanks in advance!
Use the rjson package. In particular, look at the fromJSON function in the documentation.
If you want further pointers, then search for rjson at the R Bloggers website.
If you want to use the JSON-related packages in R, there are a number of other posts on SO answering this. I presume you already searched for JSON [r] on this site; there's plenty of info there.
If you just want to read in the text file line by line and process later on, then you can use either scan() or readLines(). They appear to do the same thing, but there's an important difference between them.
scan() lets you define what kind of objects you want to find, how many, and so on. Read the help file for more info. You can use scan to read in every word/number/sign as an element of a vector, e.g. scan(filename, ""). You can also use specific delimiters to separate the data. See also the examples in the help files.
To read line by line, you use readLines(filename) or scan(filename,"",sep="\n"). It gives you a vector with the lines of the file as elements. This again allows you to do custom processing of the text. Then again, if you really have to do this often, you might want to consider doing this in Perl.
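Side by side, on a small temporary file, the two line-by-line approaches give the same result:

```r
tf <- tempfile()
writeLines(c("first line", "second line"), tf)

readLines(tf)                    # "first line" "second line"
scan(tf, what = "", sep = "\n")  # the same vector, via scan
```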
If your file is in JSON format, you may try the packages jsonlite, RJSONIO or rjson. All three packages provide the function fromJSON.
To install a package you use the install.packages function. For example:
install.packages("jsonlite")
Once the package is installed, you can load it with the library function:
library(jsonlite)
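A quick check that the parser works, on an inline JSON string (the field names here are made up):

```r
library(jsonlite)

# fromJSON accepts a JSON string directly as well as a file path
x <- fromJSON('{"name": "Ana", "age": 30}')
x$name  # "Ana"
x$age   # 30
```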
Line-delimited JSON generally has one object per line, so you need to read line by line, collecting the objects as you go. For example:
con <- file('myBigJsonFile.json')
open(con)
objects <- list()
index <- 1
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  objects[[index]] <- fromJSON(line)
  index <- index + 1
}
close(con)
After that, all the data is in the objects variable, from which you can extract the information you want.