R: extracting data from password secured site - html

I am trying to extract data from a password-secured site through R. I have tried many options, but I only get the HTML of the non-logged-in page, even after passing login credentials. Here is one of my attempts:
library(RCurl)
loginurl <- "https://login.recruit.naukri.com/"
dataurl <- "http://resdex.naukri.com/search/setSrchSess?SRCHTYPE=ez&SRCH_INC_KEYWORD=XXX"
pars <- list(username = "XXX", password = "XXX")
agent <- "Mozilla/5.0" # placeholder; `agent` was not defined in the original snippet
curl <- getCurlHandle()
curlSetOpt(cookiejar = "", useragent = agent, followlocation = TRUE, curl = curl)
html <- postForm(loginurl, .params = pars, curl = curl)
html <- getURL(dataurl, curl = curl)
Both HTML results are the same: they are the HTML of the non-logged-in page.
The same happens with other commands as well.

How to fix: "Operation returned an invalid status code 'BadRequest'" when connecting Power BI Embedded to an SSAS Multidimensional cube

I have an on-premises SSAS (Multidimensional) cube with a live connection to Power BI. The report then has to be shown in a portal with Power BI Embedded. I used the 'App owns data' method with a 'master user' account. This part works.
But when I try to add Row Level Security (RLS), it keeps giving errors. The report will be shown to customers (outside the organization). Based on their login (authentication is handled by the portal itself), they need to see only their own data.
I tried to connect by sending JSON that adds username, roles, datasets and customData.
The username contains the actual Active Directory username, which has permissions within SSAS.
The customData contains the value I want to filter on.
The role 'Test' currently exists for testing purposes.
The role 'Test' is set up in SSAS with read permissions, and the specific 'company' dimension is set up with the following 'Allowed member set':
STRTOMEMBER('[Dim Company].[BK_Company].&[{'+CUSTOMDATA()+'}]')
This is based on another topic which used this as the solution.
I have tried using USERNAME() as the RLS filter, but it seems I can only use actual account names in this field. Our current Active Directory does not hold all customer names.
var rls = new EffectiveIdentity(@"domain\powerbiportal", new List<string> { report.DatasetId }, new List<string> { "Test" }, "19164");
var tokenRequest = new GenerateTokenRequest("view", identities: new List<EffectiveIdentity> { rls });
var tokenResponse = client.Reports.GenerateTokenInGroupAsync("[ID]", report.Id, tokenRequest).Result;
The JSON that is sent:
{
  "accessLevel": "view",
  "identities": [
    {
      "username": "domain\\powerbiportal",
      "roles": [
        "Test"
      ],
      "datasets": [
        "[dataset]"
      ],
      "customData": "19164"
    }
  ]
}
The error I get is the following:
Operation returned an invalid status code 'BadRequest'
After I contacted Microsoft Support, the problem was fixed with two changes.
First, uncheck the box 'enable read permissions' in the 'cell data' tab of the role. The cell was empty in my case, but for some reason it still caused the problem.
Second, the filter statement had to be:
{STRTOMEMBER('[Dim Company].[BK_Company].&[' + CUSTOMDATA() + ']')}
instead of
STRTOMEMBER('[Dim Company].[BK_Company].&[{'+CUSTOMDATA()+'}]')
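To see why the second change matters, here is a minimal Python sketch that only mimics the string concatenation done inside the two MDX expressions. The function names are made up for illustration, and "19164" is the customData value from the question. In the old expression the braces end up inside the member key passed to STRTOMEMBER, while in the fixed expression the key is clean and the braces wrap the whole STRTOMEMBER call, which is how MDX writes a set.

```python
def old_strtomember_argument(custom_data: str) -> str:
    # Mirrors '[Dim Company].[BK_Company].&[{' + CUSTOMDATA() + '}]'
    return "[Dim Company].[BK_Company].&[{" + custom_data + "}]"


def fixed_strtomember_argument(custom_data: str) -> str:
    # Mirrors '[Dim Company].[BK_Company].&[' + CUSTOMDATA() + ']'
    return "[Dim Company].[BK_Company].&[" + custom_data + "]"


if __name__ == "__main__":
    # The old form looks up a member whose key is literally "{19164}".
    print(old_strtomember_argument("19164"))
    # The fixed form looks up the member with key "19164".
    print(fixed_strtomember_argument("19164"))
```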

A Python client for the ContextualWeb News API using RapidAPI

I am trying to consume ContextualWeb News API. The endpoint is described here:
https://rapidapi.com/contextualwebsearch/api/web-search
Here is the request snippet in Python as described in RapidAPI:
response = unirest.get("https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/NewsSearchAPI?autoCorrect=true&pageNumber=1&pageSize=10&q=Taylor+Swift&safeSearch=false",
  headers={
    "X-RapidAPI-Host": "contextualwebsearch-websearch-v1.p.rapidapi.com",
    "X-RapidAPI-Key": "XXXXXX"
  }
)
How do I send the request and parse the response? Can you provide a complete code example for the News API?
Use Python 3.x for the code below. It is a complete example that passes the string "Taylor Swift" and parses the response. Let me know if you get stuck anywhere.
import requests  # install from: http://docs.python-requests.org/en/master/
# Replace the following string value with your valid X-RapidAPI-Key.
Your_X_RapidAPI_Key = "XXXXXXXXXXXXXXXXXXX"
# The query parameters (update according to your search query)
q = "Taylor%20Swift"  # the search query, already URL-encoded
pageNumber = 1  # the number of the requested page
pageSize = 10  # the size of a page
autoCorrect = True  # auto-correct spelling
safeSearch = False  # filter results for adult content
response = requests.get(
    "https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/NewsSearchAPI?q={}&pageNumber={}&pageSize={}&autoCorrect={}&safeSearch={}".format(
        q, pageNumber, pageSize, autoCorrect, safeSearch),
    headers={
        "X-RapidAPI-Key": Your_X_RapidAPI_Key
    }
).json()
# Get the number of items returned
totalCount = response["totalCount"]
# Get the list of most frequent searches related to the input search query
relatedSearch = response["relatedSearch"]
# Go over each resulting item
for webPage in response["value"]:
    # Get the web page metadata
    url = webPage["url"]
    title = webPage["title"]
    description = webPage["description"]
    keywords = webPage["keywords"]
    provider = webPage["provider"]["name"]
    datePublished = webPage["datePublished"]
    # Get the web page image (if it exists)
    imageUrl = webPage["image"]["url"]
    imageHeight = webPage["image"]["height"]
    imageWidth = webPage["image"]["width"]
    thumbnail = webPage["image"]["thumbnail"]
    thumbnailHeight = webPage["image"]["thumbnailHeight"]
    # An example: output the web page URL, title and published date
    print("Url: %s. Title: %s. Published Date:%s." % (url, title, datePublished))
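A possible refinement, sketched under assumptions: the query string can be built with the standard library instead of hand-encoding "Taylor%20Swift", and the image fields can be read defensively, since the comment above says the image exists only sometimes. The helper names build_query and summarize are hypothetical, and the sample item is made-up data that only reuses the key names from the answer; it is not real API output.

```python
from urllib.parse import urlencode


def build_query(q, page_number=1, page_size=10, auto_correct=True, safe_search=False):
    """Return the URL-encoded query string for the search parameters."""
    params = {
        "q": q,  # plain text; urlencode takes care of escaping
        "pageNumber": page_number,
        "pageSize": page_size,
        "autoCorrect": str(auto_correct).lower(),
        "safeSearch": str(safe_search).lower(),
    }
    return urlencode(params)


def summarize(web_page):
    """Pick a few fields from one result item, tolerating a missing image."""
    image = web_page.get("image") or {}
    return {
        "url": web_page.get("url"),
        "title": web_page.get("title"),
        "datePublished": web_page.get("datePublished"),
        "imageUrl": image.get("url"),  # None when there is no image
    }


if __name__ == "__main__":
    print(build_query("Taylor Swift"))
    # Made-up sample item without an "image" key.
    sample = {"url": "https://example.com/a", "title": "Example", "datePublished": "2019-01-01"}
    print(summarize(sample))
```

The resulting query string can be appended after the "?" of the endpoint URL, or the same dictionary can be passed to requests.get through its params argument.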

Retrieve custom attribute from user profile in Google API Scripts- Google Admin Directory

This is about G Suite users. The following works in Google Admin Directory using the Google Admin SDK. It retrieves the email address and full name of the user.
var myemail = Session.getActiveUser().getEmail();
var mycontact = AdminDirectory.Users.get(myemail);
var myname = mycontact.name.fullName;
There is a custom attribute in the user profile named "Department". The following does NOT retrieve anything; the value comes back empty (null):
var mydept = mycontact.Department;
How can one retrieve custom attribute from user profile in G suite?
According to the Directory API reference for Users: get, you need to set the projection to "custom".
projection - What subset of fields to fetch for this user.
Acceptable values are:
"basic": Do not include any custom fields for the user. (default)
"custom": Include custom fields from schemas requested in customFieldMask.
"full": Include all fields associated with this user.
Then you should specify the schema that holds the custom data in customFieldMask:
customFieldMask (string) A comma-separated list of schema names. All fields from these schemas are fetched. This should only be set when projection=custom.
So something like:
var mycontact = AdminDirectory.Users.get({
  "userKey": myemail,
  "projection": "full",
  "customFieldMask": "Define Schema Here"
});
You can then call Logger.log(mycontact) to see how to access the returned custom fields.
For a custom schema, you can just use the full projection to get all custom schema fields.
For the standard department field, see user.organizations[0].department
https://developers.google.com/admin-sdk/directory/v1/reference/users
If you get the error:
Resource Not Found: userKey
try passing the user key as the first argument and the options as the second:
mycontact = AdminDirectory.Users.get(
  myemail, {
    projection: 'full'
  });
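Once the full projection is returned, custom fields sit under the customSchemas property of the user resource, keyed first by schema name and then by field name. Below is a minimal Python sketch of that lookup on a plain dictionary shaped like the returned user. The schema name "EmployeeInfo", the field values and both helper names are hypothetical. In Apps Script the equivalent access would be along the lines of mycontact.customSchemas.EmployeeInfo.Department, with the real schema name substituted.

```python
def get_custom_field(user, schema_name, field_name, default=None):
    """Read one custom field from a Directory API user resource (as a dict)."""
    schemas = user.get("customSchemas") or {}
    return (schemas.get(schema_name) or {}).get(field_name, default)


def get_standard_department(user, default=None):
    """Read the built-in department from the first organization entry."""
    organizations = user.get("organizations") or []
    if organizations:
        return organizations[0].get("department", default)
    return default


if __name__ == "__main__":
    # Made-up user resource; "EmployeeInfo" is a hypothetical schema name.
    user = {
        "primaryEmail": "someone@example.com",
        "organizations": [{"department": "Finance"}],
        "customSchemas": {"EmployeeInfo": {"Department": "Treasury"}},
    }
    print(get_custom_field(user, "EmployeeInfo", "Department"))
    print(get_standard_department(user))
```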

How to parse encoded HTML

I'm working on a digest email to send to users of my company's app. For this I'm going through each user's emails and trying to find some basic information about each email (from, subject, timestamp, and, the aspect that's causing me difficulty, an image).
I assumed Nokogiri's search('img') function would be fine to pull out images. Unfortunately it looks like most emails have a lot of garbage embedded in the URLs of those images, like newlines ("\n"), escape characters ("\"), and the string "3D" for some reason. For example:
<img src=3D\"https://=\r\nd3ui957tjb5bqd.cloudfront.net/images/emails/1/logo.png\"
This is causing the search to only pull out pieces of the actual URLs/src's:
#(Element:0x3fd0c8e83b80 {
name = "img",
attributes = [
#(Attr:0x3fd0c8e82a28 { name = "src", value = "3D%22https://=" }),
#(Attr:0x3fd0c8e82a14 { name = "d3ui957tjb5bqd.cloudfront.net", value = "" }),
#(Attr:0x3fd0c8e82a00 { name = "width", value = "3D\"223\"" }),
#(Attr:0x3fd0c8e829ec { name = "heigh", value = "t=3D\"84\"" }),
#(Attr:0x3fd0c8e829d8 { name = "alt", value = "3D\"Creative" }),
#(Attr:0x3fd0c8e829c4 { name = "market", value = "" }),
#(Attr:0x3fd0c8e829b0 { name = "border", value = "3D\"0\"" })]
})
Does anyone have an idea why this is happening, and how to remove all this junk?
I'm getting decent results from lots of gsub's and safety checks but it feels pretty tacky.
I've also tried Sanitize.clean which doesn't work and the PermitScrubber mentioned in "How to sanitize html string except image url?".
The mail body is encoded as quoted-printable. You will need to decode the body before you parse it with Nokogiri. You can do this fairly easily in Ruby using unpack:
decoded = encoded.unpack('M').first
Before trying to decode, check which encoding is used by looking at the mail headers. Not all mail is encoded this way, and there are other types of encoding.
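The same idea, sketched in Python with only the standard library: the email module reads the Content-Transfer-Encoding header and decodes the body accordingly, so "=3D" becomes "=" and the "=" plus line break sequences (soft line breaks) disappear before the HTML is parsed. The message below is a made-up single-part example built around the image tag from the question, and the names ImageSrcCollector and image_sources are hypothetical. Multipart messages would need their parts walked individually.

```python
import email
from html.parser import HTMLParser

# Made-up single-part message using the <img> tag from the question.
RAW_MESSAGE = (
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "\r\n"
    '<img src=3D"https://=\r\n'
    'd3ui957tjb5bqd.cloudfront.net/images/emails/1/logo.png" width=3D"223">\r\n'
)


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(raw_message):
    message = email.message_from_string(raw_message)
    # decode=True undoes quoted-printable or base64 based on the header.
    body = message.get_payload(decode=True)
    charset = message.get_content_charset() or "utf-8"
    collector = ImageSrcCollector()
    collector.feed(body.decode(charset, errors="replace"))
    return collector.sources


if __name__ == "__main__":
    print(image_sources(RAW_MESSAGE))
```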
I am not an expert in scraping, but you can get the image URL from the src attribute of the element matched by a CSS selector:
.at_css("img")['src']
For example:
require "open-uri"
require "nokogiri"
doc = URI.open(url_link) # url_link is the address of the page to scrape
page = Nokogiri::HTML(doc)
page.css("div.col-xs-12.visible-xs.visible-sm div.school-image").each do |pic|
  img = pic.at_css("img")['src'].downcase if pic.at_css("img")
end

Using rvest or httr to log in to non-standard forms on a webpage

I am attempting to use rvest to spider a webpage that requires an email/password login on a form.
rm(list=ls())
library(rvest)
### Trying to sign into a form using email/password
url <-"http://www.perfectgame.org/" ## page to spider
pgsession <-html_session(url) ## create session
pgform <-html_form(pgsession)[[1]] ## pull form from session
set_values(pgform, `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com")
set_values(pgform, `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")
submit_form(pgsession,pgform,submit=`ctl00$Header2$HeaderTop1$Button1`)
This gives me the following error message:
Error in submit_request(form, submit) :
object 'ctl00$Header2$HeaderTop1$Button1' not found
If I submit the form without specifying the submit parameter, I get this:
Submitting with 'ctl00$Header2$HeaderTop1$Button1'
Error in function (type, msg, asError = TRUE) : <url> malformed
I also tried passing the parameters directly to httr, as mentioned in the question "How can I POST a simple HTML form in R?", but the "submit" parameter did not accept the submit button with backticks (``), with quotation marks, or without any quotes:
library(httr)
url <- "http://www.perfectgame.org/Rankings/Players/Default.aspx?gyear=2015&num=500"
fd <- list(
  submit = `ctl00$Header2$HeaderTop1$Button1`,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")
resp<-POST(url, body=fd, encode="form")
content(resp)
Any ideas for how I can log in from an R session and spider the data that's behind the login wall?
Your rvest code isn't storing the modified form, so in your example you're just submitting the original pgform without the values filled in. Try:
library(rvest)
url <-"http://www.perfectgame.org/" ## page to spider
pgsession <-html_session(url) ## create session
pgform <-html_form(pgsession)[[1]] ## pull form from session
# Note the new variable assignment
filled_form <- set_values(pgform,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")
submit_form(pgsession,filled_form)
And I now see a nice 200 status code response instead of an error. Note that because the desired submit button appears to be the first submit button, we don't need to give it as an argument. Otherwise we would pass its name as a string (straight quotes, not backticks).
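The underlying point, that set_values returns a modified copy and leaves pgform untouched, can be illustrated outside R. The Python sketch below is only an analogy: the Form class and this set_values function are made up for illustration and are not part of rvest. A call whose result is discarded changes nothing, while assigning the result gives the filled form.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Form:
    """Hypothetical stand-in for a parsed HTML form."""
    fields: dict


def set_values(form, **values):
    # Returns a NEW form; the form passed in is left unchanged.
    return replace(form, fields={**form.fields, **values})


if __name__ == "__main__":
    original = Form(fields={"username": "", "password": ""})

    set_values(original, username="me")  # result discarded, nothing changes
    print(original.fields)

    filled = set_values(original, username="me", password="secret")
    print(filled.fields)
```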