Using Scrapy response, XPath element doesn't exist, although Google Chrome Inspect Element shows that it does exist - html

I am encountering a problem in which elements that I am trying to select using their XPath do not exist according to the Scrapy response. However, when I inspect the same page in Google Chrome, the element DOES exist.
This problem is occurring on a LinkedIn scrape after using LinkedIn advanced search and getting to a results page. I want to scrape links in the results container.
For example: on the results page for a search on "John", there should be a div element with id="results-container" according to Inspect Element in Google Chrome. When I use Scrapy's response.xpath('//div[@id="results-container"]'), no selectors are returned.
url of result page: https://www.linkedin.com/vsearch/p?firstName=John&openAdvancedForm=true&locationType=Y&rsid=4319659841436374935558&orig=ADVS

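Did you try opening the URL you provided in a private browser window (sometimes called incognito mode)?
If you do, you will see that you get a LinkedIn registration form instead of the results page. Scrapy is not logged in either, so the response it receives does not contain the results container that your logged-in Chrome session shows.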
As alecxe suggests in his comment, try using the LinkedIn API (it is REST). You can get XML responses, which you can parse to gather the information you need.
Alternatively, you could try to log in with Scrapy, store the authentication credentials, and repeat your request (but I would use the API anyway).
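A quick way to confirm this from Python is to check the HTML that the spider actually received (in Scrapy, the response body text) rather than what Chrome renders. The following is a minimal sketch using only the standard library; the two HTML strings and the helper name has_element_id are hypothetical stand-ins, not real LinkedIn markup.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the id attribute of every element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(html, element_id):
    """Return True if the raw HTML contains an element with the given id."""
    collector = IdCollector()
    collector.feed(html)
    return element_id in collector.ids


# Hypothetical page bodies: what a logged-out client might receive,
# and what a logged-in browser session might show.
logged_out_body = '<html><body><form id="registration-form"></form></body></html>'
logged_in_body = (
    '<html><body><div id="results-container">'
    '<a href="/in/john">John</a></div></body></html>'
)

print(has_element_id(logged_out_body, "results-container"))  # False
print(has_element_id(logged_in_body, "results-container"))   # True
```

If the same check on the body your spider downloaded returns False, the XPath is fine and the problem is the page being served.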

Related

Unable to use the top sites url of google chrome in html anchor tag

I was making my own homepage to replace the default Google Chrome homepage on my phone.
I wanted to use the top sites provided by Chrome to skip the extra work of adding sites manually.
I somehow found the URL
chrome://explore
which works as expected when entered manually in Chrome.
But when I use the URL in the href of an anchor tag, it simply doesn't work.
Is there a way to make it work, or is there another website that provides the same thing?
Comment: please share some more details, such as any errors in the console or your code.
P.S. I can't comment directly yet (new user), which is why I am adding the comment here.
Chromium-based browsers do not allow this, because it is not secure.
Long explanation:
If it were allowed, any website could request such a URL, for example with fetch, and read pages like your chrome://history.

Communicating between a Chrome extension popup and an iFrame embedded in that popup

I have an iFrame embedded in my Chrome extension popup which displays a webpage that I am in control of. I am able to send data from the iFrame to the Chrome popup script using sendMessage from my embedded website and onMessageExternal from my popup script, but I would also like to send data the other way around (such as the extension id - I can’t access this value within an iframe).
I’ve read about methods such as using the window.postMessage() function available in HTML5, and have investigated the method discussed here, although I am not sure the second method would work in the context I intend to use it in. If I were to use postMessage, I would not be able to confirm that the message was sent by my extension as there is no domain for me to check against unless I hardcoded my plugin id in, which I would like to avoid.
Is there another method of doing what I am trying to do, or would postMessage be the best way? I want to avoid query strings to make it somewhat more difficult to send an illegitimate request to my webpage. I’m not doing anything with sensitive data, I’m just using the data to make changes to the behaviour of the webpage based on whether it is running in an extension or running natively in the browser, and using the extension id for logging purposes.

How to find the HTTP request from google chrome inspect element?

Forgive me if I don't use the proper terminology. I have a webpage that I'm trying to scrape information from. The problem is that when I view the page source, the data I want to scrape is not there. I've encountered this problem before: the main HTTP request triggers other requests, so the information I'm looking for is actually somewhere else, which I find using the Network tab of Google Chrome's inspector. I manually search the various documents and XHR files for the one that has the correct information. This is sometimes long and tedious. I can also use Google Chrome's inspect feature to inspect the element that contains the information I want, and that brings up the correct source code, but I can't seem to figure out where or how I can use that to quickly find the corresponding HTTP headers.
Restated in short: can I use the Inspect Element feature of Google Chrome and then ask it to show me the corresponding network event (HTTP request) that produced that code?
I'll add the case study I'm working on.
http://www.flashscore.com/tennis/atp-singles/acapulco/results/
shows the different matches that took place at a tennis tournament. I'm trying to scrape the match hrefs, but if you view the source of the page you'll see they're not there.
Thanks
"Can I use the Inspect Element feature of Google Chrome and then ask it to show me the corresponding network event (HTTP request) that produced that code?"
No. This isn't something that the browser keeps track of.
In most situations, the HTTP response passes through a good deal of JavaScript code before eventually being turned into elements on the page. Tracing which HTTP response was "responsible" for a given element would involve a great deal of data flow analysis, and is impractical for a browser to do.
One way:
Open Firefox, install the LiveHttpHeaders add-on, and run it; you will see the expected headers.
There is a similar add-on for Google Chrome, but I have not tested it.
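Since the browser does not record which response produced an element, one workaround for the tedious manual search is to export the recorded network traffic as a HAR file (a JSON format that Chrome's Network panel can save) and search the response bodies for a piece of text you can see in the inspected element. The following is only a sketch: the function names, the sample data, and the example.com URLs are hypothetical, and a real export may omit bodies for some requests.

```python
import base64
import json


def load_har(path):
    """Load a HAR file exported from the browser's network panel."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_requests_containing(har, needle):
    """Return the URLs of HAR entries whose response body contains needle."""
    matches = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text", "")
        # Some bodies are stored base64-encoded and flagged as such.
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        if needle in text:
            matches.append(entry["request"]["url"])
    return matches


# Hypothetical, hand-written HAR data standing in for a real export.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/results/"},
                "response": {"content": {"text": "<html><body></body></html>"}},
            },
            {
                "request": {"url": "https://example.com/feed"},
                "response": {"content": {"text": '{"match": "/match/abc123/"}'}},
            },
        ]
    }
}

print(find_requests_containing(sample_har, "/match/abc123/"))
# ['https://example.com/feed']
```

With a real export you would call load_har on the saved file and search for an href or label copied from the inspected element.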

Using Instagram API for simple web page

So I am working on a fairly simple project, basically a web page that should list the captions from a certain instagram account. It's all designed, it just needs to be lit up with the content. Have a look at http://evanshellborn.com/speechofthebeets/.
I found that you can see a json file containing all the necessary data at instagram.com/{username}/media. So in my case, https://www.instagram.com/beets_are_life/media/. So before I put that page actually online, I was on my local machine, and I did a JSON call to that page and it worked perfectly. So I built it all out and my web page loaded the captions just like I wanted it to.
Then I put it online (http://evanshellborn.com/speechofthebeets), but it doesn't work. Have a look at the script at the bottom of the page: on my localhost that code works and the captions get loaded, but on the live page I get an access not allowed error in the console. So I think Instagram doesn't allow this sort of direct access anymore, and you have to go through their API.
Now I've tried looking at the API but it seems rather confusing. Basically what I'm asking for is a different JSON url that would give me the same result as https://www.instagram.com/beets_are_life/media/, but that would work from the live page.
I think https://api.instagram.com/v1/users/{user-id}/?access_token=ACCESS-TOKEN would work, just replacing {user-id} with the appropriate user_id. But where do I get an access token?
From reading https://www.instagram.com/developer/authentication/, it looks like you get one when a user puts in their user credentials. But I don't want to have anyone log in, I just want a simple web page.
Hopefully that made sense. How can I do what I want?
It looks like the URL https://www.instagram.com/beets_are_life/media/ does not support JSONP (there is no callback support), so you cannot call it from client-side JavaScript; the request will fail in the browser with an Access-Control-Allow-Origin error. You have to make this call on the server side, with your server acting as a proxy.
I guess https://www.instagram.com/<USER_NAME>/media/ is not a publicly documented API, which is why it does not support JSONP. Instagram uses it for its own website, and since that is same-origin, it works for them on the client side.
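As a rough illustration of the server-side proxy idea, here is a minimal sketch using only the Python standard library. Everything named here is an assumption for illustration: UPSTREAM_URL is a placeholder, the /captions path is hypothetical, and whether any given Instagram URL can be fetched this way is not something this sketch establishes. The page's script would request /captions from its own server, so the browser's same-origin restriction no longer applies.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: placeholder upstream; substitute a JSON endpoint you are allowed to use.
UPSTREAM_URL = "https://example.com/media.json"


def fetch_upstream(url):
    """Fetch the upstream JSON on the server, where browser CORS rules do not apply."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def proxy_get(path, fetch=fetch_upstream):
    """Return (status, content_type, body) for a GET request to path."""
    if path != "/captions":
        return 404, "text/plain", b"not found"
    try:
        return 200, "application/json", fetch(UPSTREAM_URL)
    except OSError:
        # urllib's URLError and HTTPError are subclasses of OSError.
        return 502, "text/plain", b"upstream request failed"


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, content_type, body = proxy_get(self.path)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    """Serve the proxy on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ProxyHandler).serve_forever()
```

Calling run() starts the server; keeping the fetch logic in proxy_get makes it easy to try with a fake fetch function before pointing it at a real upstream.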
This link will help you embed an Instagram post in a simple HTML web page.
There is a button at the bottom of the post on Instagram. When you click it, a menu pops up; then click Embed.
A box then pops up.
Just copy and paste the HTML from it, and you are done.
It will fetch the post for you.

Google Chrome Extensions: Get Current Page HTML (incl. Ajax or updated HTML)

In my Google Chrome extension, I need to be able to get the current page HTML, including any updated Ajax HTML (unlike the browser's View Source command, which doesn't update it).
Is there a way to get it as a string in my Extension?
Suppose my extension is a right-click context menu called "View Actual HTML Source" which should print the current HTML to the console, or maybe count the number of certain tags there. I wasn't able to find an easy answer to this.
You can get the current state of the DOM as HTML programmatically using document.documentElement.innerHTML
Or just use Developer Tools
I followed the exact solution here, and this gave me the Page Source HTML:
Getting the source HTML of the current page from chrome extension
The solution is to inject the HTML into the Popup.