parsing dynamic website but can't find request

I am trying to parse this website:
https://www3.wipo.int/branddb/en/#
and get trademark data based on certain search conditions (for example, getting trademarks that contain a certain phrase such as "love the"). It is a dynamic website, so the URL doesn't change when I search. I am trying to follow the steps in this Stack Overflow answer:
Crawling data but the url doesn't change
The answer mentions that after clicking the search button, the Network tab of the browser's dev tools will show an XHR request, and then I can do further JSON parsing on it. I can't find the request for this website (it shows about 20 different requests in the Network tab, though none seem to change with my query). How do I go about parsing this? Also, could you share what to do after finding the correct request?
Thanks a ton
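
For the second half of the question (what to do once the request has been found), here is a minimal Python sketch using only the standard library. It assumes the search turns out to be a form-encoded POST returning JSON, which is not confirmed for this site. The endpoint, parameter names and JSON keys must be copied from the real request in the Network tab; `build_request`, `extract_records` and `search` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint, params, headers=None):
    """Rebuild a POST request from values copied out of the Network tab."""
    body = urllib.parse.urlencode(params).encode("utf-8")
    request = urllib.request.Request(endpoint, data=body, method="POST")
    # Copy over any headers the browser sent that the server may expect.
    for name, value in (headers or {}).items():
        request.add_header(name, value)
    return request


def extract_records(payload, path):
    """Parse a JSON response and walk down `path` to the list of records."""
    node = json.loads(payload)
    for key in path:
        node = node[key]
    return node


def search(endpoint, params, path, headers=None):
    """Send the rebuilt request and return the records found under `path`."""
    request = build_request(endpoint, params, headers)
    with urllib.request.urlopen(request) as response:
        return extract_records(response.read().decode("utf-8"), path)
```

Filtering the Network tab to XHR/Fetch entries and clearing it just before clicking search usually makes the relevant request easier to spot. If the page sends a JSON body instead of form fields, `build_request` would need to encode `params` with `json.dumps` and set a `Content-Type: application/json` header.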

Related

Can Power Automate extract data from the HTML source of a web page?

I’ve had success doing web scraping with Microsoft Access, using MSXML2.XMLHTTP objects and Regex. I’ve been exploring the web scraping possibilities of Power Automate, and see that it doesn’t have regex, but can execute regex scripts from Excel. The problem: accessing the relevant data in the first place.
Take a look at this link: https://letterboxd.com/tiff_net/list/2022-toronto-international-film-festival/ When you try to extract data from one of the entries, nothing useful is available.
And yet all the information I want is contained behind it:
You can display the source behind a web page by using Edge as your browser and adding “view-source:” to the beginning of the web address you want to go to. But then what? How do you get the HTML source into a variable where you can work on it? With MSXML2.XMLHTTP, you just access the responseText property. Can something like this be done with Power Automate or are you limited to scraping sites with extraction-compatible objects?

HTTP 403 on images loaded from googleusercontent.com

First off, I don't think my problem is related to these questions: question 1, or question 2.
Because I'm not using authentication anywhere, or any library either (I don't need to).
I'm simply loading some publicly-available album art images in my web application:
<!-- urlList is an array that contains URLs like the examples given below -->
<img *ngFor="let url of urlList" [src]="url">
Example URLs:
Glass Mansion, Summertime, Side Effects
99% of the time, it works. But sometimes I get 403 errors on the console for those exact same URLs.
I know they're not related to authentication, because, well. These URLs are publicly accessible.
Debugging this has been difficult, because a few page refreshes later, it magically works again. There's nothing out of the ordinary in logs either (except the GET 403 errors).
What in the world is happening?
I'm using Angular v7.2.15. Browser: Google Chrome

Add the referrerpolicy="no-referrer" attribute:
<img src="your-google-link-here" referrerpolicy="no-referrer"/>

Within several Google APIs (the Gmail API, for example), Google uses HTTP 403 and/or HTTP 429 to rate-limit certain requests over certain time periods. I do not know what method you are using or whether you are using some sort of API, nor do I know how busy or large your web app is, but rate limiting or fair-use compliance could be coming into play.
Gmail API Rate Limit Info Source - https://developers.google.com/gmail/api/v1/reference/quota
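
If rate limiting does turn out to be the cause, the usual client-side response is to retry with an increasing delay. Here is a minimal Python sketch of that idea. `fetch_with_backoff` is a hypothetical helper, `fetch` stands for whatever function performs the request and returns `(status_code, body)`, and the delays are arbitrary. It is written in Python for illustration only, so an Angular app would need the equivalent logic in TypeScript.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning 403/429 or attempts run out.

    `fetch` is any zero-argument callable returning (status_code, body).
    """
    status, body = fetch()
    for attempt in range(1, max_attempts):
        if status not in (403, 429):
            break
        # Double the wait after each failed attempt: 1s, 2s, 4s, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body
```

Passing `sleep` as a parameter lets the delay be replaced in tests.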

Google Apps Script to download pdfs from UN ODS

Background
The UN Secretary-General and other organs issue hundreds of reports to the General Assembly each year, and there is no unified list of these reports, as there is for other documents. There is, however, a simplified URL for reading these reports using their document codes: http://undocs.org/[document code], where the document codes have the format A/[Session]/[Document Number]. For example, the document code "A/71/1" can be accessed at "https://undocs.org/A/71/1".
I'm trying to download all of these documents for the past 15 years, but instead of manually typing in each of these, I'd like to set up a Google Apps Script to do it for me.
Problem
When I try to use the simple method UrlFetchApp.fetch("http://undocs.org/A/71/1"); for example, it fetches an error page saying that I am using an unauthorized method of accessing the page. This is the same page that shows up if you block cookies or sometimes when you try to access the page in an incognito window.
Now, I'm not looking to hack into the UN, but simply to download some PDFs that are up for public access. I need to figure out what sort of parameters I need to pass with the .fetch() method for the request to be authorized by the page.
Note: I scoured the undocs.org site looking for any guidance, and I found none.
tl;dr
Trying to access United Nations Official Document System using the UrlFetchApp from Google Apps Script, but I can't figure out how to get the request to be authorized.

Short answer - I don't think you'll be able to get it with a one-line fetch.
If you look at the HTML returned when you fetch https://undocs.org/A/71/1, you'll see that it embeds a frame that gets its content from https://daccess-ods.un.org/access.nsf/Get?OpenAgent&DS=A/71/1&Lang=E. Then, if you look at the HTML returned by that frame, you'll see two things:
A frame that loads https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234
A redirect to the actual PDF at https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/206/02/PDF/N1620602.pdf?OpenElement
I presume that the first link sets a cookie indicating that the login has occurred, which the second link then verifies before returning the content.
Things you could try:
A multi-step fetch, where you first get the content from undocs.org, parse it to get the link to the actual PDF, then log in and fetch the PDF. Google Apps Script would have to persist cookies between fetches though.
Write your script in a different tool (such as Python).
Use a spider/crawler tool to navigate the UN site as if it were a real human.
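
The multi-step fetch above can be sketched in Python with the standard library alone. This is a rough sketch rather than a verified recipe. It assumes the inner page exposes the login step as a frame/iframe `src` and the PDF redirect as a `<meta http-equiv="refresh">` tag, which may not match what the UN site actually serves (a JavaScript redirect, for instance, would need different parsing). The names `LinkCollector`, `collect_links`, `make_opener` and `download_document` are made up for illustration.

```python
import http.cookiejar
import re
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect frame/iframe sources and meta-refresh targets from a page."""

    def __init__(self):
        super().__init__()
        self.frames = []
        self.redirects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("frame", "iframe") and attrs.get("src"):
            self.frames.append(attrs["src"])
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # Assumed form of the content attribute: "0; URL=https://..."
            match = re.search(r"url\s*=\s*(.+)", attrs.get("content") or "", re.I)
            if match:
                self.redirects.append(match.group(1).strip("'\" "))


def collect_links(html, base_url):
    """Return (frame URLs, redirect URLs) found in html, made absolute."""
    parser = LinkCollector()
    parser.feed(html)
    return ([urljoin(base_url, u) for u in parser.frames],
            [urljoin(base_url, u) for u in parser.redirects])


def make_opener():
    """Build an opener that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def download_document(symbol, opener=None):
    """Follow undocs.org -> inner frame -> login frame -> PDF for one symbol."""
    if opener is None:
        opener = make_opener()
    start_url = "https://undocs.org/" + symbol
    outer = opener.open(start_url).read().decode("utf-8", "replace")
    frames, _ = collect_links(outer, start_url)
    if not frames:
        raise ValueError("no frame found on " + start_url)
    inner = opener.open(frames[0]).read().decode("utf-8", "replace")
    login_frames, redirects = collect_links(inner, frames[0])
    for url in login_frames:
        opener.open(url).read()  # visited only so the cookie gets stored
    if not redirects:
        raise ValueError("no redirect found on " + frames[0])
    return opener.open(redirects[0]).read()

# Example use (performs network requests):
#   pdf_bytes = download_document("A/71/1")
#   open("A_71_1.pdf", "wb").write(pdf_bytes)
```

Because every request goes through the same opener, any cookie set by the login frame is sent along with the later PDF request. Apps Script's `UrlFetchApp` does not do this automatically.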

How can I find a specific word (such as Norway) in a LinkedIn profile? I am trying to find it in Inspect

How can I find a specific word (such as Norway) in a LinkedIn profile? I am trying to find it using Inspect in the browser. I want to find all the people who went to Norway, but sometimes it isn't mentioned in their main summary and is hidden in their full summary.
Any other suggestions are welcome.
Thanks.

There's no designated area where people would put where they've been; they could put it anywhere in their profile. You can simply search the page in the Inspector with the Cmd+F (Mac) or Ctrl+F (Windows) shortcut and keep pressing Enter until you find it. It's tedious, and there's no guarantee that all the content has been loaded at any given time (for example, some of it may require AJAX requests).
However, if you want to get this kind of information in a more structured and automated way, I recommend checking out the LinkedIn API. You can either use the REST API, which requires OAuth, or the JavaScript SDK, which requires an API key. The latter would be my preferred choice.
The API provides a list of basic profile fields you can get back from the JSON response.
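
The manual page search described above can also be approximated with a short script once the fully expanded profile has been saved as HTML (for example via the browser's "Save page as"). Here is a minimal Python sketch under that assumption. `TextExtractor` and `find_word` are hypothetical names, and the script only sees whatever content was loaded when the page was saved.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def find_word(html, word, context=30):
    """Return text snippets around each whole-word, case-insensitive match."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so snippets read as plain sentences.
    text = " ".join(" ".join(parser.chunks).split())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [text[max(0, m.start() - context):m.end() + context]
            for m in pattern.finditer(text)]

# Example use on a saved page:
#   html = open("profile.html", encoding="utf-8").read()
#   for snippet in find_word(html, "Norway"):
#       print(snippet)
```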

Restrict Access to the JSON URL within Drupal from the App

I am working in Drupal, which generates JSON that can be accessed through a URL. The URL in turn is parsed by the app (made in Titanium) to show data.
Now the problem is that this URL can also be accessed publicly by anyone, and they can see all the details. The app, on the other hand, shows the same data only to restricted users.
My question is: how can I block anyone who opens the URL in a browser while still allowing the app to access the data through the same URL?
The URL looks like this:-
http://site.com/section/allowed-users-in-the-list.json
Many Thanks.

Look into $_SERVER['HTTP_REFERER']; it tells you where the request is coming from. You can do validation there too.
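
The kind of validation meant here can be sketched as follows, in Python rather than PHP purely for illustration. `referer_allowed` is a hypothetical helper, and `headers` stands for the incoming request headers as a dictionary. The Referer header is supplied by the client, so the app would have to send it explicitly, and any other client can send the same value.

```python
from urllib.parse import urlparse


def referer_allowed(headers, allowed_hosts):
    """Return True when the Referer header points at one of allowed_hosts."""
    referer = headers.get("Referer", "")
    # hostname is None when the header is missing or is not a full URL.
    host = urlparse(referer).hostname
    return host is not None and host in allowed_hosts
```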
