Matlab text string/html parse - html

I am trying to get information from a website (html) into MATLAB. I am able to get the html from online into a string using:
urlread('http://www.websiteNameHere.com...');
Once I have the string I have a very LONG string variable, containing the entire html file contents. From this variable, I am looking for the value/characters in very specific classes. For example, the html/website will have a bunch of lines, and then will have the classes of interest in the following form:
...
<h4 class="price">
<span class="priceSort">$39,991</span>
</h4>
<div class="mileage">
<span class="milesSort">19,570 mi.</span>
</div>
...
<h4 class="price">
<span class="priceSort">$49,999</span>
</h4>
<div class="mileage">
<span class="milesSort">9,000 mi.</span>
</div>
...
I need to be able to get the information between <span class="priceSort"> and </span>; i.e., $39,991 and $49,999 in the above example. What is the best way to go about this? If the opening and closing tags were unique and matched each other (such as <price> and </price>), I would have no problem...
I also need to know the most robust method, since I would like to be able to find <span class="milesSort"> and other information of this sort too. Thanks!

Try this and let us know if it works for you -
url_data = urlread('http://www.websiteNameHere.com...');
start_string = '<span class="priceSort">'; %// For your next case, edit this to <span class="milesSort">
stop_string = '</span>';
N1 = numel(start_string);
N2 = numel(stop_string);
start_string_ind = strfind(url_data,start_string);
for count1 = 1:numel(start_string_ind)
    relative_stop_string_ind = strfind(url_data(start_string_ind(count1)+N1:end),stop_string);
    string_found_start_ind = start_string_ind(count1)+N1;
    string_found = url_data(string_found_start_ind:string_found_start_ind+relative_stop_string_ind(1)-2);
    disp(string_found);
end

Simple solution using strsplit
s = urlread('http://www.websiteNameHere.com...');
x = 'class="priceSort">'; %starting string x
y = 'class="milesSort">'; %starting string y
z = '</span>'; %ending string z
s2 = strsplit(s,x); %split for starting string x
s3 = strsplit(s,y); %split for starting string y
result1 = cell(size(s2,2)-1,1); %create cell array 1
result2 = cell(size(s3,2)-1,1); %create cell array 2
%loop through values ignoring first value
%(change ind=2:size(s2,2) to ind=1:size(s2,2) to see why)
%starting string x loop
for ind=2:size(s2,2)
    m = strsplit(s2{1,ind},z);
    result1{ind-1} = m{1,1};
end
%starting string y loop
for ind=2:size(s3,2)
    m = strsplit(s3{1,ind},z);
    result2{ind-1} = m{1,1};
end
Hope this helps

Related

Get only inner text from webelement

I want to get only the innerText from a web element: just "Name" from the anchor tag. I have access to the WebDriver element associated with the tag in the example below (anchorElement).
I tried anchorElement.getText() and anchorElement.getAttribute("innerText"). Both return "Name, sort Z to A". What should I do here?
<a id="am-accessible-userName" href="javascript:void(0);" class="selected">
Name
<span class="util accessible-text">, sort Z to A</span>
<span class="jpui iconwrap sortIcon" id="undefined" tabindex="-1">
<span class="util accessible-text" id="accessible-"></span>
<i class="jpui angleup util print-hide icon" id="icon-undefined" aria-hidden="true"></i>
</span>
</a>
A bit of Javascript can pick out just the child text node:
RemoteWebDriver driver = ...
WebElement anchorElement = driver.findElement(By.id("am-accessible-userName"));
String rawText = (String) driver.executeScript(
"return arguments[0].childNodes[0].nodeValue;",
anchorElement);
So anchorElement is passed into the Javascript as arguments[0] there.
Clearly childNodes[0] is assuming where the text node is. If that's not safe, you could iterate the childNodes too, perhaps checking for childNode.nodeName === "#text"
As per the HTML the desired element is a Text Node and also the First Child Node of the <a> tag. So to extract the text Name you can use the following code block :
Java Binding Art :
WebElement myElement = driver.findElement(By.xpath("//a[@class='selected' and @id='am-accessible-userName']"));
String myText = (String)((JavascriptExecutor)driver).executeScript("return arguments[0].firstChild.textContent;", myElement);
System.out.println(myText);
The text alone can be obtained by using JavaScript to iterate through the child nodes of a given web element and returning the text when the current node is a text node.
Note: A trimmed value of node text will be returned.
public String getInnerText(WebDriver e, String xpathStr){
WebElement ele = e.findElement(By.xpath(xpathStr));
return ((String) ((JavascriptExecutor) e).executeScript("var children = arguments[0].childNodes;\n" +
"for(child in children){\n" +
" if(children[child].nodeName === \"#text\"){" +
" return children[child].nodeValue };\n" +
"}" , ele )).trim();
}

Octave - putting words into vector

I am working on creating an email filter. I have a sample email which says something like this:
Hi how are you lets meet up
I want to put each one of these words into a vector. I am looking for something like this.
Words = ['Hi';'how','are','you','lets','meet','up']
and when I enter
words(1), I want it to display Hi.
I really don't know where to start. I found answers for different languages such as Ruby and JS. But not for Octave.
Adding to Andy's answer about cells, you can collect your email as a string and process it using string operations such as strtok, strsplit etc. e.g.
octave:7> s = 'Hi how are you lets meet up';
octave:8> words = strsplit(s, ' ')
words =
{
[1,1] = Hi
[1,2] = how
[1,3] = are
[1,4] = you
[1,5] = lets
[1,6] = meet
[1,7] = up
}
octave:9> words{1}
ans = Hi
Use Cell Arrays of Strings:
octave:1> words = {'hi', 'how', 'are', 'you', 'lets', 'meet', 'up'};
octave:2> words{1}
ans = hi
and you can use indexing:
octave:4> words{3:4}
ans = are
ans = you
If you are wondering why this returns a different result:
octave:5> words(3:4)
ans =
{
[1,1] = are
[1,2] = you
}
then read here:
So with ‘{}’ you access elements of a cell array, while with ‘()’ you access a sub array of a cell array.

Writing items from function to separate text files?

I'm running some web scraping, and now have a list of 911 links saved in the following (I included 5 to demonstrate how they're stored):
every_link = ['http://www.millercenter.org/president/obama/speeches/speech-4427', 'http://www.millercenter.org/president/obama/speeches/speech-4425', 'http://www.millercenter.org/president/obama/speeches/speech-4424', 'http://www.millercenter.org/president/obama/speeches/speech-4423', 'http://www.millercenter.org/president/obama/speeches/speech-4453']
These URLs link to presidential speeches over time. I want to store each individual speech (so, 911 unique speeches) in different text files, or be able to group by president. I'm trying to pass the following function on to these links:
def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub('',item_str)
    item_str_processed_final = item_str_processed.replace('—',' ')
for l in every_link:
    processURL(l)
So, I would want to save the words from all the processed speeches to unique text files. This might look like the following, with obama_44xx representing individual text files:
obama_4427 = "blah blah blah"
obama_4425 = "blah blah blah"
obama_4424 = "blah blah blah"
...
I'm trying the following:
for l in every_link:
    processURL(l)
    obama.write(processURL(l))
But that's not working...
Is there another way I should go about this?
Okay, so you have a couple of issues. First of all, your processURL function doesn't actually return anything, so when you try to write the return value of the function, it's going to be None. Maybe try something like this:
def processURL(link):
    open_url = urllib2.urlopen(link).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub('',item_str)
    item_str_processed_final = item_str_processed.replace('—',' ')
    splitlink = link.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1].split("-")[1]
    filename = "{0}_{1}".format(president, speech_num)
    return filename, item_str_processed_final # returning a tuple
for link in every_link:
    filename, content = processURL(link) # yay tuple unpacking
    with open(filename, 'w') as f:
        f.write(content)
This will write each file to a filename that looks like president_number. So for example, it will write Obama's speech with id number 4427 to a file called obama_4427. Lemme know if that works!
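To check what the split produces without touching the network, here is a minimal sketch of just the filename step, written for Python 3. The helper names filename_for and group_by_president are made up for illustration, and the sketch assumes every link has the same layout as the ones in the question. The second helper is one possible way to cover the "group by president" part of the question.

```python
from collections import defaultdict

def filename_for(link):
    """Build 'president_number' from a speech URL (assumes the layout shown in the question)."""
    parts = link.split("/")
    president = parts[4]                   # e.g. 'obama'
    speech_num = parts[-1].split("-")[1]   # 'speech-4427' -> '4427'
    return "{0}_{1}".format(president, speech_num)

def group_by_president(links):
    """Map each president name to the list of filenames derived from that president's links."""
    groups = defaultdict(list)
    for link in links:
        name = filename_for(link)
        groups[name.split("_")[0]].append(name)
    return dict(groups)

links = [
    'http://www.millercenter.org/president/obama/speeches/speech-4427',
    'http://www.millercenter.org/president/obama/speeches/speech-4425',
]
print(filename_for(links[0]))
print(group_by_president(links))
```

If a link does not follow this layout (fewer path segments, or no "-" in the last segment), the indexing raises an IndexError, so it may be worth validating links before the loop.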
You have to call the processURL function and have it return the text you want written. After that, you simply have to add the writing to disk code within the loop. Something like this:
def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
    item_str = item_div.text.lower()
    #item_str_processed = punctuation.sub('',item_str)
    #item_str_processed_final = item_str_processed.replace('—',' ')
    return item_str
for l in every_link:
    speech_text = processURL(l).encode('utf-8').decode('ascii', 'ignore')
    speech_num = l.split("-")[1]
    with open("obama_"+speech_num+".txt", 'w') as f:
        f.write(speech_text)
The .encode('utf-8').decode('ascii', 'ignore') is purely for dealing with non-ascii characters in the text. Ideally you would handle them in a different way, but that depends on your needs (see Python: Convert Unicode to ASCII without errors).
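As a rough illustration of what that encode/decode pair does, and of one possible "different way", here is a small Python 3 sketch. The function names to_ascii_drop and to_ascii_fold are hypothetical. The second one uses unicodedata.normalize to decompose accented letters first, so the base letter survives; whether that is appropriate depends on the text.

```python
import unicodedata

def to_ascii_drop(text):
    """Drop every non-ASCII character outright (same effect as the encode/decode pair)."""
    return text.encode('utf-8').decode('ascii', 'ignore')

def to_ascii_fold(text):
    """Decompose accented letters (NFKD) before dropping, so 'e' survives from an accented 'e'."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = 'caf\u00e9'          # 'cafe' with an accented final letter
print(to_ascii_drop(sample))  # the accented letter disappears entirely
print(to_ascii_fold(sample))  # the base letter is kept
```

Characters with no ASCII decomposition, such as the long dash the question replaces with a space, are still dropped by both functions, so the explicit replace step remains useful.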
Btw, the 2nd link in your list is 404. You should make sure your script can handle that.
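One way to handle that, sketched for Python 3 (where urllib.request and urllib.error take the place of urllib2), is to catch the HTTP error and skip the link. The name fetch_or_none and its opener parameter are assumptions for illustration; the parameter is only there so the function can be tried without a network connection.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch_or_none(url, opener=urlopen):
    """Return the page body as bytes, or None if the server answers with an HTTP error."""
    try:
        return opener(url).read()
    except HTTPError as err:
        # err.code holds the status, e.g. 404 for a missing page
        print("skipping {0}: HTTP {1}".format(url, err.code))
        return None
```

Inside the loop you would then call fetch_or_none(l) and continue to the next link whenever it returns None, instead of letting one dead link stop the whole run.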

HTMLAgilityPack getting <P> and <STRONG> text

Hey all I am looking for a way to get this HTML code:
<DIV class=schedule_block>
<DIV class=channel_row><SPAN class=channel>
<DIV class=logo><IMG src='/images/channel_logos/WGNAMER.png'></DIV>
<P><STRONG>2</STRONG><BR>WGNAMER </P></SPAN>
using the HtmlAgilityPack.
I have been trying this:
For Each channel In doc.DocumentNode.SelectNodes(".//div[@class='channel_row']")
Dim info = New Dictionary(Of String, Object)()
With channel
info!Logo = .SelectSingleNode(".//img").Attributes("src").Value
info!Channel = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(0).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(2).InnerText
End With
.......
I can get the Logo but it comes up with a blank string for the Channel and for the Station it says
Index was out of range. Must be non-negative and less than the size of
the collection.
I've tried all types of combinations:
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(1).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(3).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(1).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(2).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(3).InnerText
What do I need to do in order to correct this?
If the whitespace is actually there, it counts as a child node. So:
Dim channelSpan = .SelectSingleNode(".//span[@class='channel']")
info!Channel = channelSpan.ChildNodes(3).ChildNodes(0).InnerText
info!Station = channelSpan.ChildNodes(3).ChildNodes(2).InnerText

How to parse HTML tags in Matlab using regexp?

I'm short on time and specifically wanted to extract a string like the one below. Problem is the tag isn't of the form <a> data </a>.
Given,
s = <em style="font-size:medium"> 5,888 </em>
how to extract out just 5,888 in matlab?
You will find useful info here, or here, or here, all of which are google-first-page results and would have been faster than asking a question here.
Anyway, quick-'n-dirty way: You can filter on the <> symbols:
>> s = '<em style="font-size:medium"> 5,888 </em> <sometag> test </sometag>'
>> a = regexp(s, '[<>]');
>> s( cell2mat(arrayfun(@(x,y)x:y, a(2:2:end-1)+1, a(3:2:end)-1, 'uni',false)) )
ans =
5,888 test
Or, slightly more robust and much cleaner, replace everything between any tags (including the tags) with emptyness:
>> s = regexprep(s, '<.*?>', '')
s =
5,888 test
Thanks folks for your help. I'm basically trying to get the population of a US county in Matlab. Thought I'd share my code, though it's not the most elegant. Might help some soul. :)
county = 'morris';
state = 'ks';
county = strrep(county, ' ' , '+');
str = sprintf('https://www.google.com/search?&q=population+%s+%s',county,state);
s = urlread(str);
pop = regexp(s,'<em[^>]*>(.*?)</em>', 'tokens');
pop = char(pop{:});
pop = strrep(pop, ',' , '');
pop = str2num(pop);