Trouble with Xpath in Google Spreadsheets (ImportXML) - html

This is a great site, and I've already had a lot of questions answered simply by scrolling and searching through other postings. Unfortunately, I can't seem to track down an answer that specifically helps this problem, and figured I would try posting and looking for help-
I'm using ImportXML and google spreadsheets to 'scrape'a few product descriptions from a retail site. It's been working fine for the most part, and I have done it in 2 ways:
1) Specific call to the description part of a post:
=ImportXML(A1,"//div[#class='desc']")
2) Call to the entire 'product Card', which also returns info such as product title, price, time posted, and places these items in adjacent cells in my Google spreadsheet:
=ImportXML(A1,"//div[#class='productCard']")
Both have worked fine, but I've ran into a different problem using each method. If I can resolve even one of these problems, then I'll happily scrap the other method, I just need one of them to work. The problems are:
Method 1) The website prohibits sellers from including contact information in product postings-- when they include an email address anyways, the site automatically blocks it, so that in the posting it simply appears as "...you can reach me at [obscured]" or something like that. The [obscured] appears in a different colour text and is obviously treated differently somehow. When I scrape these descriptions using Method 1, ImportXML appears to get 'bumped' when it hits the word [obscured], and it passed the remaining text from that product description to the next cell over in my spreadsheet. This ruins the entire organization of the sheet, and I'd like to find a way where I can get ImportXML to just ignore the [obscured], and still place the entire text of the product description in one cell.
Method 2) My call for the entire 'product Card' is as follows:
=ImportXML(A1,"//div[#class='productCard']")
As mentioned, this works fine (for most products), and I don't mind the additional info (price, date, etc.) being posted in adjacent cells.
However, the website also allows certain products to be 'featured', where they appear in a different colour box on the site, and are therefore more likely to get a buyer's attention.
Using this method, the 'featured' products are not scraped or imported into my spreadsheet, but are simply passed over.
The source code (on actual site) (via 'inspect element' in Safari) for both the description (Method 1) and product card (Method 2) look as follows (for a normal product (a) and a featured product (b)):
(a)
<div id="productSearchResults">
<div class="productCard tracked">
<div>...</div>
<div class="stats">...</div>
<div class="desc collapsed descFull">...</div>
</div>
(b)
<div id="productSearchResults">
<div class="productCard featured tracked">
<div>...</div>
<div class="stats">...</div>
<div class="desc collapsed descFull">...</div>
</div>
You can see in both (a) an (b) the 'desc' class that I call in Method 1, which seems to work fine.
From my reading on this site, I think I've learned that a given class can't have more than one word, and therefore the use of "desc collapsed descFull" and "productCard tracked" and "productCard featured tracked" don't represent classes with 3, 2 and 3 words in the title, but instead cases where multiple classes have been assigned?
Regardless, the call to 'desc' (Method 1) works fine and seems to get all descriptions.
In method 2 therefore, I would have thought that a call to 'productCard' would get the info for all products, both featured and regular, as 'featured' is an extra class assigned to some 'productCard's. If I call all 'productCard's, shouldn't the normal AND featured ones be returned? This is currently not the case. I've tried calling just 'tracked' and just 'featured' as classes, and neither returns anything, so my logic that they are their own class equivalent to 'productCard' may be flawed.
In summary, the 'desc' call in Method 1 works fine, and even gets descriptions for 'featured' products. However, when contact information is included in the description and is displayed as [obscured] it bumps my data into the next cell in the spreadsheet, immediately following the word. This throws off and ruins all organization.
In Method 2, I am not getting the featured products at all, which greatly weakens what I am trying to do. Can either (or both!) of these problems be fixed??
Thanks so so much for any help you can give me.
***UPDATE: As seen in the comments below, use of the 'contain' as suggested improved Method 2 by retrieving both regular and featured products. However, featured product cards have extra text elements, and since the entire card is being scraped in this method, featured products do not match the cell alignment that regular products do. If there is a way to fix Method 1, this would therefore be much better.
As outlined in the comments below, the [obscured] text appears in a 'span' that follows underneath/indented from the
<div class="desc descFull collapsed"
as
<span class="obscureText">[obscured]</span>
Is there any way that I can import the 'desc's as I have been, but tell the XPath to essentially 'ignore' the [obscured] span, or at least deal with it in a way that doesn't make description text immediately after [obscured] appear one cell over?
Thanks so much everyone!

You can wrap your function with the concatenate()-function to make sure it all shows up in one cell:
=concatenate(ImportXML(A1,"//div[#class='productCard']"))

Related

What does the "ved" parameter in a google search refer to?

I've spent like two hours or more trying to figure out what a "ved" parameter on a Google search means. A curious person I am.
My finds so far:
$ved value changes-
1 - every different search result (diff keywords)
2 - every different resulted block (the url blocks/boxed on the resulted google search, but they are quite similar, as I'll write down below)
3 - every different geolocation perhaps
Consider these tests or lookups:
1-
Diff keywords, but first block/position in list:
&ved=2ahUKEwidsaSd4M_1AhVlk_0HHUxOCQYQFnoECAsQAg
&ved=2ahUKEwj2pZyN5s_1AhVRmuYKHZ5IB5EQFnoECAcQAg
I thought the "ved" value refers to the block/position of a url in the result list, but no.
2-
Twree different urls, first and second from the 1st and 2nd blocks of first page, then third from a "much farther on the list" block:
ved=2ahUKEwjq1-Wb1s_1AhW6SWwGHZwpBMwQFnoECD8QAQ
ved=2ahUKEwjq1-Wb1s_1AhW6SWwGHZwpBMwQFnoECCAQAQ
ved=2ahUKEwiZ2NDe1s_1AhVaTmwGHThIA5U4PBAWegQIGRAB
The same website url, from different countries (not considering blocks or position in list):
&ved=2ahUKEwiopK2X08_1AhUgxzgGHQEbDkcQFnoECBIQAQ
&ved=2ahUKEwjpueqC1M_1AhWJq3IEHYEDAfc4FBAWegQIDBAB
&ved=2ahUKEwih09Wz08_1AhUY7WEKHQYdBB8QFnoECEIQAQ
Very similar they are.
I'd really love to know what they mean. Any ideas are appreciated too!
I found an interesting article explaining the subject : https://moz.com/blog/inside-googles-ved-parameter
TL;DR:
A ved code contains up to five separate parameters, which each tell you something about the link that was clicked on:
1st (parameter1: Link index) gives you an idea of where the link was on the page.
2nd (parameter2: Link type) is a number that corresponds to the 'type' of the link that was clicked.
3rd (parameter7: Start result position) is the cumulative result position of the first result on the page.
4th (parameter 6: Result position) indicates the position of your page in the search results.
5th (parameter 5: Sub-result position) like the (parameter 6), except it tells you the position in a list of sub-results, such as breadcrumbs, or one-page sitelinks.

What is the problem with my JS function searching html classes and conditionally hiding or showing?

I am trying to modify a Dawn 2.0 theme to, on the product page, only render images with certain tags depending on an input selector. I have one product which consists of 8 items, each item has an image. This product can be ordered in three variations:
The first variation would be the product with 4 items.
The second variation would be the product with 6 items.
The third variation would be the product with 8 items.
There is a variant selector in the products page:
and the three variants have been created on the Shopify admin products page. The solution that I'm attempting to use is this one. I chose it as it's the only one I found reference to elsewhere (apart from buying an add on which I don't want to do), it seemed a logical approach to me, though I have no particular loyalty to this approach if anyone can think of a better way to achieve what I'm trying to.
This solution involves using values assigned to image's alt text attribute via the Shopify admin products page. The alt text values applied are:
set_all for the first 4 images, because these 4 items would be included in all three of variants (the one with only x4 itmes, the one with x6 items, and the one with x8 items).
set_6&8 for the 5th and 6th images, because these 2 items would be included in both the 2nd variant (the one with x6 items) and the 3rd variant (the one with x8 items).
set_8 for the 7th and 8th images, because these 2 items would only be included in the 3rd variant (the one with x8 items)
Then interpolating that alt text value into a data tag in the html:
data-thumbnail-set-number="{{media.alt}}"
which is applied to the <li> that each image is housed within and then writing a js function which:
hides all images
stores in local variable the alt text value of the featured image ('featured image' being the one, and only, image that you can apply to each variant in the Shopify admin products page. Therefore if the user selects a different variant a different featured image with different alt text is attached to that variant).
shows the images whose data-thumbnail-set-number value matches the value from the variant's featured_image.alt text. JS being:
'./assets/global.js' in VariantSelects class, line ~561
onVariantChange() {
...
this.filterThumbnails();
...
};
filterThumbnails() {
if (this.currentVariant.featured_media != null && this.currentVariant.featured_media.alt != null) {
document.querySelectorAll('[data-thumbnail-set-number]').forEach(function(el) {
el.style.display = 'none';
});
// show thumbnails for selected set
var selected_set = this.currentVariant.featured_media.alt;
document.querySelectorAll('[data-thumbnail-set-number="' + selected_set + '"]').forEach(function(el) {
el.style.display = 'block';
});
}
else {
//show all thumbnails
document.querySelectorAll('[data-thumbnail-set-number]').forEach(function(el) {
el.style.display = 'block';
});
}
// console.log("thumbnail update", this.currentVariant.featured_media.alt)
}
This all works apart from two problems:
I need to write further conditions in the JS to not just match the featured_image.alt text but to also include:
the set_all images as those items come with all 3 variations, and for the set_8 variant to include the set_6&8 images. I should be able to deal with this fine, will probably hard code the data-thumbnail-set-number that each three conditional is looking for.
Even though the JS function seems to be effective at searching and showing the images with a data-thumbnail-set-number value that matches the variant's featured image alt text value, the page also always seems to render one other image, which does not have the data-thumbnail-set-number class applied to the <li> at all, and it seems that that image is always the first of the images saved to the main product.
Despite a good deal of time looking I can't seem to understand why issue 2. is happening, and how to stop it. Can anybody help me solve this?
PEBCAK 🥴
To solve issue 2, I found that another <li> is inserted in the main-product.liquid above the <li> which iterates and shows all images in the image colleciton just to show just the featured image. I don't really understand why this featured image is separated out for rendering apart from the others in the collection, but it it. I just needed to add the data-thumbnail-set-number="{{media.alt}}" into this newly found featured image <li>.
As i took some time to write out this question I thought that I'd leave it here in case it's of help to anyone else.

Filtering name with two condition

I have a sample list of student and grades/subject in this file
enter image description here
https://docs.google.com/spreadsheets/d/1NeHlUaRnbvdJ2yJ38fUETGgBoYseQ8CuXmwRCwObAlM/edit#gid=0
On the range A16:A I'd like to see the list of names who has the grades of around 90-100 when I check any of the checkbox on B15:k15
the first example is when I check all of the boxes
I will only see the first name on the list because he is the only one with the 90-100 scores on all subject
2nd example when I check B15 and C15
I will only see the 1st and 2nd names on the list because he's those who only able to get a 90-100 score on those two subjects.
Is there a way to do this kind of filtering? thank you so much
Since this is your first post, I'm going to go with the approach I think you'll find easiest to understand. It's a long formula (which I've placed in a new sheet called "Erik Help" in A16), but it's just a repeat of the same element several times:
=FILTER(A2:K11, IF(B15=TRUE, B2:B11>=90, B2:B11^0), IF(C15=TRUE, C2:C11>=90, B2:B11^0), IF(D15=TRUE, D2:D11>=90, D2:D11^0), IF(E15=TRUE, E2:E11>=90, E2:E11^0), IF(F15=TRUE, F2:F11>=90, F2:F11^0), IF(G15=TRUE, G2:G11>=90, G2:G11^0), IF(H15=TRUE, H2:H11>=90, H2:H11^0), IF(I15=TRUE, I2:I11>=90, I2:I11^0), IF(J15=TRUE, J2:J11>=90, J2:J11^0), IF(K15=TRUE, K2:K11>=90, K2:K11^0))
The first argument of FILTER tells the function what to filter (in this case A2:K11).
After that, an IF statement is set up to check each checkbox. If the checkbox is checked, the FILTER will only include students who obtained a 90 or higher in that subject.
If the checkbox is NOT checked, then the student is automatically included (that's the part that says "B2:B11^0" etc., since anything to the zero-power equals 1, and 1 and TRUE are the same to Google Sheets). In other words, if no checkboxes were checked, then all students would read TRUE for all subjects, i.e., all students would be included (or, to think of it another way, no one is rules out). While the ^0 is not strictly necessary (i.e., any number other than zero is the same as TRUE), I think it's better formula practice and easier for others to understand if TRUE is represented either as TRUE or as 1.
I also set conditional formatting on A15:A, to bold the name as you had it. (The conditional formatting rule says, in English, "If anything is there, use bold.") You can see the rule by clicking anywhere in the range A15:A, then selecting Format > Conditional formatting from the menu and clicking the open the rule that appears in the window to the right of the screen.

HTML-assigned 'id' Missing in DOM

Within the Moodle (v. 3.5.7) Atto editor (using both Chrome and Firefox) I've been trying to assign an ID to a particular row class, "span9". My ultimate objective is to assign this a unique ID and reference this element via jquery so as to append another element within it.
The ISSUE is that once I add an ID (id="checklist01") and click save, the ID simply does not appear in the DOM, and seems to not exist. When I re-enter the atto editor however, voila, there it is just sitting there. So it's NOT being removed completely... just not expressed somehow?
I have 2 screenshots linked below showing (1) the editor view, with the element and assigned ID highlighted, and (2) a screenshot of the DOM once the changes have been saved, with that same area highlighted, without the assigned ID.
Screenshots of ID Missing from DOM
Bootstrap ver. 4
So far I've tried switching the placement of the id in the atto editor (class coming first vs second after ); tried to add a "span" in front of the id (for some reason, I was desperate); and really just searched all over for someone who has encountered something similar.
I'm not sure how much help the html will provide, but here it is:
<div class="row-fluid colored">
<div class="iconbox span3">
h4>Your Completion Status (%)</h4>
</div>
<div id="checklist01" class="span9">
</div>
</div>
I found the reason for the removal of id attributes.
id attributes are removed because "Checklist" activity used safe HTML function of Moodle. If you want to access id attributes of description HTML follow below steps.
Go to mod\checklist\locallib.php file.
Then search formatted_intro() function (which is around line number 880).
In that function they used Moodle's format_text() function to return description text.
In that function, they have used 3 parameters.
string $text The text to be formatted.
int $format Identifier of the text format to be used
object/array $options text formatting options
Replace
$opts = array('trusted' => $CFG->enabletrusttext);
to
$opts = array('trusted' => $CFG->enabletrusttext,'allowid'=>true);
Then save your file and check. By following the above steps you can use id attributes.

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This patter recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of list in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩)and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use attribute = value css selector with $ ends with operator
If there can only be one occurrence per page
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
gender = soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text if soup.select_one("[href$='typeid=19'], [href$='typeid=15']") is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select_one("[href$='typeid=19'], [href$='typeid=15']")]
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
OutPut:
[女孩]