Using MATLAB to parse HTML for URL in anchors, help fast

I'm on a strict time limit and I really need a regex to parse this type of anchor (they're all in this format)
<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..></a>
for the URL
20120620_0512_c2_1024.jpg
I know it's not a full URL; it's relative. Please help.
Here's my code so far
year = datestr(now,'yyyy');
timestamp = datestr(now,'yyyymmdd');
html = urlread(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed/' year '/c2/' timestamp '/']);
links = regexprep(html, '<a href=.*?>', '');

Try the following:
url = 'http://sohowww.nascom.nasa.gov/data/REPROCESSING/Completed/2012/c2/20120620/';
html = urlread(url);
t = regexp(html, '<a href="([^"]*\.jpg)">', 'tokens');
t = [t{:}]'
The resulting cell array (truncated):
t =
'20120620_0512_c2_1024.jpg'
'20120620_0512_c2_512.jpg'
...
'20120620_2200_c2_1024.jpg'
'20120620_2200_c2_512.jpg'
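For comparison, here is a minimal Python sketch of the same capture-group idea. The HTML excerpt is a made-up stand-in for the directory listing (the real page would be fetched over HTTP), and `urljoin` is used to turn the relative file names into absolute URLs, a step the question hints at but the MATLAB answer does not perform.

```python
import re
from urllib.parse import urljoin

# Base URL taken from the answer above.
base = 'http://sohowww.nascom.nasa.gov/data/REPROCESSING/Completed/2012/c2/20120620/'

# Hypothetical excerpt of the listing, for illustration only.
html = '''
<a href="/data/REPROCESSING/Completed/2012/c2/">Parent Directory</a>
<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>
<a href="20120620_0512_c2_512.jpg">20120620_0512_c2_512.jpg</a>
'''

# Capture only href values that end in .jpg, like the MATLAB 'tokens' call.
links = re.findall(r'<a href="([^"]*\.jpg)">', html)

# Resolve each relative name against the directory URL.
absolute = [urljoin(base, name) for name in links]

for url in absolute:
    print(url)
```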

I think this is what you are looking for:
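htmlLink = '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..></a>';
link = regexprep(htmlLink, '(<a href=")([^"]*)(">.*</a>)', '$2')
link =
20120620_0512_c2_1024.jpg
regexprep also works for cell arrays of strings, so this works too:
htmlLinksCellArray = { ...
    '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..></a>', ...
    '<a href="20120620_0512_c2_1025.jpg">20120620_0512_c2_102..></a>', ...
    '<a href="20120620_0512_c2_1026.jpg">20120620_0512_c2_102..></a>' };
linksCellArray = regexprep(htmlLinksCellArray, '(<a href=")([^"]*)(">.*</a>)', '$2')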
linksCellArray =
'20120620_0512_c2_1024.jpg' '20120620_0512_c2_1025.jpg' '20120620_0512_c2_1026.jpg'

Related

I'm having trouble with sending a form using POST to retrieve data in R

I'm having trouble collecting doctors from https://www.uchealth.org/providers/. I've found out it's a POST method but with httr I can't seem to create the form. Here's what I have
url = 'https://www.uchealth.org/providers/'
formA = list(title = 'Search', onClick = 'swapForms();', doctor-area-of-care-top = 'Cancer Care')
formB = list(Search = 'swapForms();', doctor-area-of-care = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
I'm fairly certain formB is the correct one. However, I can't test it because I get an error when trying to create the list. I believe this is because you can't use "-" characters in names, although I could be wrong about that. Could somebody help, please?
I can't comment yet, but try this to create the list. The code below worked for me.
library(httr)
url = 'https://www.uchealth.org/providers/'
formB = list(Search = 'swapForms();', `doctor-area-of-care` = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
When a name contains spaces or other special characters (such as "-"), you have to wrap it in backticks, as in the code above.

Need help in scraping the img src from a website; below is my code

Below is my web scraping code for a website; it clicks a form which redirects to a page. From that page I need to extract [img] src url and export it into csv in a text form. I used the code below to extract a content from a td tag. When I run the same code it doesn't work because the td tag has no content but only a img tag. Any help will be appreciated. I am new to web-scraping. Thanks in Advance.
browser.find_element_by_css_selector(".textinput[value='APPLY']").click()
#select_finder = "//tr[contains(text(), 'NB')]//a"
select_finder = "//td[text()='NB']/../td[2]/a"
browser.find_element_by_css_selector(".content a").click()
assert "Application Details" in browser.title
file_data = []
try:
    assert "Application Details" in browser.title
    enlargement = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[3]/td[2]/b").text
    enlargement_answer1 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[2]").text
    enlargement_answer2 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[3]").text
    enlargement_text = enlargement + enlargement_answer1 + enlargement_answer2
    considerations = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[2]/b").text
    considerations_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[3]").text
    considerations_text = considerations + considerations_answer
    alteration = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[6]/b").text
    alteration_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[7]").text
    alteration_text = alteration + alteration_answer
    units = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[5]/td[3]/b").text
    units_answer = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[5]/td[4]").text
    units_text = units + units_answer
    occupancy = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[3]/b").text
    occupancy_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[4]").text
    occupancy_text = occupancy + occupancy_answer
    coo = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[3]/b").text
    coo_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[4]").text
    coo_text = coo + coo_answer
    floors = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[3]/b").text
    floors_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[4]").text
    floors_text = floors + floors_answer
except (NoSuchElementException, AssertionError) as e:
    # The *_text values are strings, so assign defaults instead of calling .append()
    floors_text = "No Zoning Characteristics Present"
    coo_text = "n/a"
    occupancy_text = "n/a"
    units_text = "n/a"
    alteration_text = "n/a"
    considerations_text = "n/a"
    enlargement_text = "n/a"
# newline='' stops the csv module from writing blank rows on Windows
with open('DOB.csv', 'a', newline='') as f:
    wr = csv.writer(f, dialect='excel')
    wr.writerow((block_number, lot_number, houseno, street, condo_text,
                 vacant_text, city_owned_text, file_data, floors_text, coo_text, occupancy_text, units_text, alteration_text,
                 considerations_text, enlargement_text))
browser.close()
Since you say you are new to web scraping, I encourage you to read up a bit: http://selenium-python.readthedocs.io/locating-elements.html
You are using XPath exclusively and in ways that are not recommended.
From the docs: "You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute."
Try using other locators to get your image.
for example: driver.find_element_by_css_selector("img[src='images/box_check.gif']")
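Once the `img` element is located, its URL is in the `src` attribute rather than in `.text`, which is empty for an image. A minimal sketch, assuming `cell` is the Selenium WebElement for the `td` in question; the helper name `image_sources` is hypothetical, and the string "css selector" is the value of Selenium's `By.CSS_SELECTOR`, used directly so the snippet needs no extra import:

```python
def image_sources(cell):
    """Return the src attribute of every <img> inside a WebElement such as a <td>."""
    # Relative lookup: search only inside the given cell, not the whole page.
    images = cell.find_elements("css selector", "img")
    # An <img> has no text content; the URL is stored in its src attribute.
    return [img.get_attribute("src") for img in images]
```

The returned list could then be joined into one string, for example with `"; ".join(...)`, and written to the CSV alongside the other text columns.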

How to get a table cell value when parsing with XPath if the cell contains a value like <19.00 or >23.99

I need to parse an HTML table whose cells contain HTML special characters, as you can see in the image.
I need the data of each cell, including those special characters. Right now, when I parse the table with XPath, it ignores such a cell and returns its value as empty.
Both images are attached here.
$table_head = $summary_nodes->childNodes->item(0);
$table_body = $summary_nodes->childNodes->item(1);
$head = [];
$body = [];
// print_r($table_head);
foreach($table_head->childNodes as $h_index => $h_node){
    $head_temp = [];
    foreach($h_node->childNodes as $cell_index => $cell){
        $head_temp[] = trim($cell->nodeValue);
    }
    $head[] = $head_temp;
}
foreach($table_body->childNodes as $b_index => $b_node){
    $body_temp = [];
    // print_r($b_node);
    foreach($b_node->childNodes as $cell_index => $cell){
        print_r($cell);
        $body_temp[] = trim($cell->nodeValue);
    }
    $body[] = $body_temp;
}
return ['table_ready'=>array_merge([$head[count($head)-1]], $body), 'headers'=> $head];
Hello friends, I found the answer. The data itself contained characters that look like HTML markup (such as the < in <19.00), so they conflicted with the surrounding HTML, and the parser silently dropped them. Make sure the real data does not contain such characters unescaped. If you need a character that has a special meaning in HTML, use its HTML entity code instead (for example &lt; for < and &gt; for >).
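The effect is easy to reproduce outside PHP. The sketch below uses Python's standard library rather than the DOM/XPath API from the question: it escapes the raw values before building the table row and shows that a parser then returns them intact. The class name `CellCollector` and the sample row are illustrative.

```python
import html
from html.parser import HTMLParser

raw_values = ["<19.00", ">23.99"]

# Escape the data before embedding it in markup: "<" becomes "&lt;", ">" becomes "&gt;".
escaped = [html.escape(value) for value in raw_values]
row = "<tr>" + "".join("<td>" + value + "</td>" for value in escaped) + "</tr>"


class CellCollector(HTMLParser):
    """Collects the text content of each <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Entities are already decoded here, so the original characters come back.
        if self._in_cell:
            self.cells[-1] += data


parser = CellCollector()
parser.feed(row)
parser.close()
print(parser.cells)
```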

Is there a simple way to have a local webpage display a variable passed in the URL?

I am experimenting with a Firefox extension that will load an arbitrary URL (only via HTTP or HTTPS) when certain conditions are met.
With certain conditions, I just want to display a message instead of requesting a URL from the internet.
I was thinking about simply hosting a local webpage that would display the message. The catch is that the message needs to include a variable.
Is there a simple way to craft a local web page so that it can display a variable passed to it in the URL? I would prefer to just use HTML and CSS, but adding a little inline javascript would be okay if absolutely needed.
As a simple example, when the extension calls something like:
folder/messageoutput.html?t=Text%20to%20display
I would like to see:
Message: Text to display
shown in the browser's viewport.
You can use the "search" property of the Location object to extract the variables from the end of your URL:
var a = window.location.search;
In your example, a will equal "?t=Text%20to%20display".
Next, you will want to strip the leading question mark from the beginning of the string. The if statement is just in case the browser doesn't include it in the search property:
var s = a;
if(s.substr(0, 1) == "?"){s = s.substr(1);}
Just in case you get a URL with more than one variable, you may want to split the query string at ampersands to produce an array of name-value pair strings:
var R = s.split("&");
Next, split the name-value pair strings at the equal sign to separate the name from the value. Store the name as the key to an array, and the value as the array value corresponding to the key:
var L = R.length;
var NVP = new Array();
var temp = new Array();
for(var i = 0; i < L; i++){
    temp = R[i].split("=");
    // decode %20 and similar escapes so the text displays as typed
    NVP[temp[0]] = decodeURIComponent(temp[1] || "");
}
Almost done. Get the value with the name "t":
var t = NVP['t'];
Last, insert the variable text into the document. A simple example (that will need to be tweaked to match your document structure) is:
var containingDiv = document.getElementById("divToShowMessage");
var tn = document.createTextNode(t);
containingDiv.appendChild(tn);
getArg('t');
function getArg(param) {
    var vars = {};
    window.location.href.replace( location.hash, '' ).replace(
        /[?&]+([^=&]+)=?([^&]*)?/gi, // regexp
        function( m, key, value ) { // callback
            vars[key] = value !== undefined ? value : '';
        }
    );
    if ( param ) {
        return vars[param] ? vars[param] : null;
    }
    return vars;
}

extending href programmatically in typo3

What's the best way to extend the href attribute programmatically in TYPO3?
The links were set by the RTE like
<a class="download" target="_blank" href="fileadmin/ablage/test_material/pdf_1.pdf">
and shall be changed to
<a class="download" target="_blank" href="fileadmin/ablage/test_material/pdf_1.pdf#zoom=100">
Untested code:
you could try to add the section to the parameter
lib.parseFunc_RTE.tags.link.typolink.parameter.append = TEXT
lib.parseFunc_RTE.tags.link.typolink.parameter.append {
    value = #zoom=100
    if.equals.data = parameters:0
    if.equals.substring = -3,3
    if.value = pdf
}
or you can try to use "section"
lib.parseFunc_RTE.tags.link.typolink.section.cObject = TEXT
lib.parseFunc_RTE.tags.link.typolink.section.cObject {
    value = zoom=100
    if.equals.data = parameters:0
    if.equals.substring = -3,3
    if.value = pdf
}
BUT the most important issue is the "if" statement. I assume that the first parameter is the name of the file (I do not remember). The last 3 characters should be "pdf". If you use DAM, you need to retrieve the UID and get the file type from there.
Just a rough guess: this could give you a hint as to which parameters you have:
lib.parseFunc_RTE.tags.link.typolink.parameter.append = TEXT
lib.parseFunc_RTE.tags.link.typolink.parameter.append {
    data = parameters : allParams
    htmlSpecialChars = 1
    wrap = ?debug=|
}
Just a side note: this would affect all RTE fields!
If there is a fixed class for that link, you could use jQuery...
jQuery(document).ready(function(){
    $('.download').each(function(){
        var linkhref = $(this).attr('href');
        $(this).attr('href', linkhref + '#zoom=100');
    });
});
This code does it.
parseFunc_RTE.tags.link.typolink.parameter.append = TEXT
parseFunc_RTE.tags.link.typolink.parameter.append {
    value = #zoom=100
    if.equals.data = parameters : allParams
    if.equals.substring = -3,3
    if.value = pdf
}