Scrapy XPath not returning desired results. Any idea?

Please look at this page http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845. As you would have guessed, I am trying to scrape all the fields on this page. All fields are yield-ed properly except the Answer field. What I find odd is that the page structure for the question and answer is almost the same (Table[1] and Table[2]); the question scrapes perfectly but the Answer does not. Here are my xpaths:
question:
['q_main'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[1]/tbody/tr/td/text()').extract()
works perfectly
Answer:
['q_answer'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[2]/tbody/tr[2]/td/text()').extract()
returns a blank. I have reproduced the full XPath, as returned by/verified in XPath Helper and the console.
What am I overlooking? What am I not able to see?

It seems your XPath has a problem.
Check out this demo from the Scrapy shell:
In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract()
Out[1]:
[u'\r\n\r\n',
u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY (SHRI PIYUSH GOYAL)\r\n\r\n ',
u'(a) & (b): So far 29 coal mines have been auctioned under the provisions of Coal Mines (Special Provisions) \r\nAct, 2015 and the Rules made thereunder. The auction process for non-regulated sector viz. Iron and Steel, \r\nCement and Captive Power was based on forward bidding process where bidders had to submit their final price \r\noffer above the applicable floor price. In case of Power sector which is a regulated one, reverse bidding \r\nmethodology was adopted where bidders had to submit bids below the applicable ceiling price, which shall be \r\ntaken as fuel cost in determination of power tariff. In case, bid price reaches Rs. zero in reverse bidding, \r\nthe bidding is based on additional premium payable to the concerned State Government, over and above the \r\nfixed reserve price of Rs. 100/- per tonne.\r\n\r\n',
u'\r\nRevenue which would accrue to the coal bearing State Government concerned comprises of Upfront payment \r\nas prescribed in the tender document, Auction proceeds and Royalty on per tonne of coal production. State-wise \r\ndetails of 29 coal mines auctioned so far along-with specified end-uses and estimated revenue which would accrue \r\nto coal bearing state during the life of mine/lease period as given below:\r\n',
u'\r\n\r\nS.No\tState\t\tSpecified End \u2013Use\t\t\tName of Coal Mine\t\tEstimated Revenueduring \r\n\t\t\t\t\t\t\t\t\t\t\t\tthe life of mine/lease \r\n\t\t\t\t\t\t\t\t\t\t\t\tperiod (Rs. In Crores)\r\n1\tChattishgarh\tNon-Regualted Sector\t\t\tChotia\t\t\t\t51596\r\n\t\t\t\t\t\t\t\tGare Palma IV-4\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-5\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-7\t\r\n\t\t\t\t\t\t\t\tGare-Palma Sector-IV/8\r\n2\tJharkhand\tNon-Regualted Sector\t\t\tBrinda and Sasai\t\t49272\r\n\t\t\t\t\t\t\t\tDumri\r\n\t\t\t\t\t\t\t\tKathautia\r\n\t\t\t\t\t\t\t\tLohari\r\n\t\t\t\t\t\t\t\tMeral\r\n\t\t\t\t\t\t\t\tMoitra\r\n\t\t\tPower\t\t\t\t\tGaneshpur\r\n\t\t\t\t\t\t\t\tJitpur\r\n\t\t\t\t\t\t\t\tTokisud North\r\n3\tMadhya Pradesh\tNon-Regualted Sector\t\t\tBicharpur\t\t\t42811\r\n\t\t\t\t\t\t\t\tMandla North\r\n\t\t\t\t\t\t\t\tMandla-South\r\n\t\t\t\t\t\t\t\tSialGhoghri\r\n\t\t\tPower\t\t\t\t\tAmelia North\r\n4\tMaharashtra\tNon-Regualted Sector\t\t\tBelgaon\t\t\t\t2738\r\n\t\t\t\t\t\t\t\tMarkiMangli III\r\n\t\t\t\t\t\t\t\tNerad Malegaon\r\n5\tOdisha\t\tPower\t\t\t\t\tMandakini\t\t\t33741\r\n\t\t\t\t\t\t\t\tTalabira-I\r\n\t\t\t\t\t\t\t\tUtkal - C\r\n6\tWest Bengal\tNon-Regualted Sector\t\t\tArdhagram\t\t\t13354\r\n\t\t\tPower\t\t\t\t\tSarisatolli\r\n\t\t\t\t\t\t\t\tTrans Damodar\r\n\tTotal\t\t\t\t\t\t\t(29) coal blocks\t\t193512\r\n',
u'\r\n\r\n\r\nCoal mine has been assigned to successful bidder as Designated Custodian in view of a court case.\r\n\r\n',
u'\r\nIn addition, an estimated amount of Rs. 1,41,854 Crores would accrue to coal bearing States from allotment \r\nof 38 coal mines to Central and State PSU\u2019s.\r\n\r\n',
u'Out of these 29 coal mines, 16 are operational coal mines included in Schedule-II of the Act and 13 are \r\nnon-operational included in Schedule-III of the Act. Milestones for development and production of coal \r\nfrom the auctioned coal mines have been prescribed under the Coal Mines Development and Production Agreement \r\nsigned with the Successful Bidder. \r\n\r\n ',
u'(c) & (d): Yes, Sir. A few complaints were received regarding cartelization in bidding. It is not possible to \r\nconclusively establish the same until investigation are carried out by Competent Authority. ',
u'\r\n\r\n\r\nThe Government has not approved the recommendation of NA for declaration of successful bidder in case of \r\n4 coal mines namely Gare Palma IV/2&3, Gare Palma IV/1 and Tara as final closing bid price was not found \r\nto be reflecting fair value. ',
u'\r\n\r\n\r\n']
This sometimes happens when you are dealing with tables; for more information you can refer to this.

At least part of the source of your difficulty lies in the fact that the code you see in the console is not the source html that your spider gets as a response (and on which the selectors operate).
In particular, it is extremely common for a <table> to not include a <tbody>; but when your browser translates the html to the DOM tree, it slaps in <tbody> tags. And there was a time when much of the layout of webpages was actually accomplished with (crazily) nested tables. As a result, the DOM of such a website will typically have many more <tbody> elements than the html source.
What this means in practical terms is that:
It is generally a good idea to find a relatively simple xpath (or CSS selector, or ...) for the element(s) you want to select -- not the behemoth you sometimes get from your developer tools.
It is generally a bad idea to include /tbody in your xpath (unless there is an associated attribute, indicating that the tag exists in the source html).
For the site in question,
response.xpath('//td[@class="griditemq"]').extract()
returns a list with the first element the question and the second element the answer.
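The effect of a browser-inserted <tbody> can be reproduced without Scrapy. The sketch below is only an illustration: it uses Python's standard xml.etree.ElementTree instead of Scrapy selectors, and the markup is a hypothetical, simplified stand-in for the real page source (the class names are borrowed from the selectors above).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw HTML the spider receives.
# Note that the source contains no <tbody>; a browser would add one to its DOM.
RAW_HTML = """
<html><body>
<table id="grid">
  <tr><td class="mainheaderq">QUESTION</td></tr>
  <tr><td class="griditemq">Question text</td></tr>
  <tr><td class="mainheaderq">ANSWER</td></tr>
  <tr><td class="griditemq">Answer text</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path copied from browser dev tools: it expects a <tbody> that is not in the source.
with_tbody = [td.text for td in root.findall(".//table/tbody/tr/td")]

# Short, attribute-based path: independent of the table nesting.
by_class = [td.text for td in root.findall(".//td[@class='griditemq']")]

print(with_tbody)  # no matches, because the source has no <tbody>
print(by_class)    # the question cell, then the answer cell
```

The same reasoning applies to the Scrapy selectors: the path that goes through /tbody matches nothing in the raw source, while the class-based path finds both cells.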

Related

Employing a large discrete observation space in OpenAI Gym

I am creating a custom environment in OpenAI Gym, and I'm having some trouble navigating the observation space.
Every timestep, the agent is given two potential students to accept or deny admission to - these are randomized and are part of the observation space. As the reward is based on which students are currently enrolled (who we have accepted in the past), we need to keep track of who has been accepted and who has not within the state space (there are a limited number of spots available to students). Each student has a 'major' (1-15) and a 'minor' (1-5) which, in the simulator I built, have weights associated with them that have a bearing on the reward, so they must be included in the state space. After a number of timesteps (varies depending on the major/minor combination), students graduate and can be removed from the list of enrolled students (and removed from being represented in the state space).
Thus, I currently have something like:
from gym import spaces

# The dict needs a name other than `spaces`; otherwise it shadows the gym
# `spaces` module and the spaces.Dict(...) call below fails.
student_spaces = {
    'potential_student_I': spaces.Tuple(((spaces.Discrete(15), spaces.Discrete(5)))),
    'potential_student_II': spaces.Tuple(((spaces.Discrete(15), spaces.Discrete(5)))),
    'enrolled_student_I': spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))),
    'enrolled_student_II': spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))),
    'enrolled_student_III': spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))),
}
self.observation_space = spaces.Dict(student_spaces)
In the above code, there's only room for three potential accepted students to be represented. These are spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))) rather than spaces.Tuple(((spaces.Discrete(15), spaces.Discrete(5)))) because the list doesn't necessarily need to be filled, so there are extra options for 'NULL'.
Is there a better way to do this? I thought about maybe using one-hot encoding or something similar. Ideally this environment could have up to 50 enrolled students, which obviously is not efficient if I continue representing the observation space the way I currently am. I plan on using a neural net because of the large state space, but I'm caught up on how to efficiently represent the observation space.
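One possible reading of the "one-hot or something similar" idea is a fixed-length count vector: one slot per (major, minor) combination, holding how many enrolled students have that combination. The sketch below is only an illustration in plain Python; the names N_MAJORS, N_MINORS and encode_enrolled are hypothetical, and it assumes that the order of enrolled students does not matter. It deliberately ignores how long each student has been enrolled, which would have to be tracked separately if time to graduation matters to the agent.

```python
N_MAJORS = 15  # majors are numbered 1-15 in the question
N_MINORS = 5   # minors are numbered 1-5 in the question


def encode_enrolled(students):
    """Return a flat list of 75 counts, one per (major, minor) combination.

    `students` is an iterable of (major, minor) pairs using the 1-based
    numbering from the question. The length of the result does not depend
    on how many students are enrolled, so no 'NULL' slots are needed.
    """
    counts = [0] * (N_MAJORS * N_MINORS)
    for major, minor in students:
        if not (1 <= major <= N_MAJORS and 1 <= minor <= N_MINORS):
            raise ValueError(f"invalid student: {(major, minor)}")
        counts[(major - 1) * N_MINORS + (minor - 1)] += 1
    return counts


# Three enrolled students, two of them with the same major/minor combination.
obs = encode_enrolled([(1, 1), (15, 5), (1, 1)])
print(len(obs), obs[0], obs[-1])
```

With this layout the observation stays 75 entries long whether 3 or 50 students are enrolled.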

Fuel consumption data via OBD2 is wrong - can you help me out?

So I am trying to get real-time fuel consumption data from my car (2021 Kia Sorento PHEV) via OBD2. I've read up on the topic and it seems to be simple enough.
Fuel consumption in liters per hour (PID 5E (hex) / 94 (dec), "Engine fuel rate") divided by speed in km/h, multiplied by 100 == liters/100 km.
The problem is: the results are... absurd. When I coast around town at 50 km/h and the gauge cluster reads an instant fuel consumption of ~3-4 liters/100 km, the OBD2 data suggests a usage of ~17-21 liters/100 km.
I've started to calculate the fuel rate in l/h manually using MAP, AFR, etc. data from the OBD2 port, and I arrive at the same value for liters/hour and therefore at the same absurd instant fuel consumption values.
OBD2 Bluetooth dongles and popular apps like "Car Scanner" or Torque also report this insanely high instant fuel consumption.
So I am asking you guys: is there some alternate formula for fuel consumption that I (and the developers of all those Android apps) am not aware of?
Thanks :)
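For reference, the conversion described in the question can be sketched as below. This is only the unit arithmetic (liters per hour divided by km per hour gives liters per km, hence the factor of 100); the function name is hypothetical and nothing here talks to an OBD2 adapter.

```python
def liters_per_100km(fuel_rate_l_per_h: float, speed_km_per_h: float) -> float:
    """Convert a fuel rate (L/h) and a speed (km/h) into L/100 km."""
    if speed_km_per_h <= 0:
        # Standing still, consumption per distance is undefined.
        raise ValueError("speed must be positive")
    liters_per_km = fuel_rate_l_per_h / speed_km_per_h
    return liters_per_km * 100.0


# At 50 km/h, a fuel rate of 1.75 L/h corresponds to about 3.5 L/100 km,
# while a fuel rate of 8.5 L/h corresponds to about 17 L/100 km.
low = liters_per_100km(1.75, 50.0)
high = liters_per_100km(8.5, 50.0)
print(round(low, 2), round(high, 2))
```

Read the other way round, the 17-21 L/100 km reported at 50 km/h means the fuel rate coming from the port is roughly 8.5-10.5 L/h.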
Instantaneous consumption can show some "wild" results.
Top Gear's Richard Hammond made reference to this in one series when he pointed out he was getting 99mpg going downhill.
If you want an accurate check of fuel consumption then the most accurate that I know of is to "brim" the tank, drive then "brim" the tank again. You then have distance and fuel consumption.

Confused about Rewards in David Silver Lecture 2

While watching the Reinforcement Learning course by David Silver on YouTube (and the slides: Lecture 2, MDP), I found the "Reward" and "Value Function" really confusing.
I tried to understand the "given rewards" marked on the slide (P11), but I cannot figure out why they are what they are. For example, "Class 1: R = -2" but "Pub: R = +1".
Why the negative reward for Class and the positive reward for Pub? Why the different values?
How do you calculate the reward with the discount factor? (P17 and P18)
I think the lack of intuition for Reinforcement Learning is the main reason why I have encountered this kind of problem...
So, I'd really appreciate it if someone can give me a little hint.
You usually set the reward and the discount such that using RL you will drive the agent to solve a task.
In the student example the goal is to pass the exam. The student can spend his time attending a class, sleeping, on Facebook or at the pub. Attending a class is something "boring", so the student doesn't see the immediate benefits of doing it. Hence the negative reward. On the contrary, going to the pub is fun and gives a positive reward. However, only by attending all 3 classes the student can pass the exam and get the big final reward.
Now the question is: how much does the student value immediate vs future rewards? The discount factor tells you that: a small discount gives more importance to immediate rewards, because future rewards just "fade" in the long run. If we use a small discount, the student may prefer to always go to the pub or to sleep. With a discount close to 0, already after one step all rewards get close to 0 as well, so at each state the student will try to maximize the immediate reward, because after that "nothing else matters".
On the contrary, high discounts (max 1) value long-term rewards more: in this case the optimal student will attend all classes and pass the exam.
Choosing the discount can be tricky, especially if there is no terminal state (in this case "sleep" is terminal), because with a discount of 1 the agent may ignore the number of steps used to reach the highest reward. For instance, if classes gave a reward of -1 instead of -2, it would make no difference to the agent whether it spent time alternating between "class" and "pub" indefinitely and passed the exam at some later point, because with discount 1 the rewards never fade, so even after 10 years the student will still get +10 for passing the exam.
Think also of a virtual agent having to reach a goal position. With discount 1, the agent would not learn to reach it in the least number of steps: as long as it reaches it, it's all the same to the agent.
Besides that, there is also a numerical problem with discount 1. Since the goal is to maximize the cumulative sum of the discounted reward, if rewards are not discounted (and the horizon is infinite) the sum may not converge.
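The effect of the discount can be made concrete with a small calculation. The sketch below sums the discounted rewards along one trajectory from the student example (three classes at -2 each, then +10 for passing the exam); the function name and the choice of trajectory are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a list of rewards."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Attend all three classes (-2 each), then pass the exam (+10).
study_path = [-2, -2, -2, 10]

myopic = discounted_return(study_path, 0.0)      # only the first reward counts
balanced = discounted_return(study_path, 0.9)    # the exam reward is partly faded
undiscounted = discounted_return(study_path, 1.0)  # plain sum of the rewards
print(myopic, round(balanced, 2), undiscounted)
```

With a discount of 0 the trajectory is worth only its first reward (-2), so studying looks bad; with 0.9 it is worth about 1.87, and with 1 it is worth the plain sum of 4.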
Q1) First of all, you should not forget that the rewards are given by the environment. The actions taken by the agent do not change the rewards defined by the environment, but of course they affect the reward gained along the trajectory that is followed.
In the example these +1 and -2 are just funny examples :) "As a student" you get bored during the class, so the reward of it is -2, while you have fun in the pub, so the reward is +1. Don't get confused with the reasons behind these numbers, they are environment given.
Q2) Let's do the calculation for the state with the value 4.1 in "Example: State-Value Function for Student MRP (2)":
v(s) = (-2) + 0.9 * [(0.4 * 1.9) + (0.6 * 10)] = (-2) + 6.084 = 4.084 ≈ 4.1
Here David is using the Bellman Equation for MRPs. You can find it on the same slide.
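The same one-step calculation can be checked in code. This is a minimal sketch of a single Bellman backup for an MRP, v(s) = R + gamma * sum of P(s') * v(s'); the function name is hypothetical, and the numbers are the ones used in the calculation above.

```python
def bellman_backup(reward, gamma, transitions):
    """One-step Bellman backup for an MRP state.

    `transitions` is a list of (probability, successor_value) pairs.
    """
    expected_next_value = sum(p * v for p, v in transitions)
    return reward + gamma * expected_next_value


# Immediate reward -2, discount 0.9, successors valued 1.9 (prob. 0.4) and 10 (prob. 0.6).
v = bellman_backup(-2, 0.9, [(0.4, 1.9), (0.6, 10)])
print(round(v, 3))  # 4.084, shown as 4.1 on the slide
```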

Angular.js parse html tags in JSON

Is it possible to parse html tags in a JSON value? Possibly through a filter? I have the following JSON.
{
"title" : "Auto Donation Program",
"shortname" : "auto_donation_program",
"summary": "Donated vehicles find new homes through this program. Recipients are eligible to apply if they have been actively participating at Vineyard Cincinnati or The Healing Center under the guidelines of the program for six months.",
"description" : "<h2>Give your automobile to a new home to help a family in need</h2><p>Please contact Deena Casagrande at (513) 346-4080 Ext. 207 to make arrangements for auto donations. Please do not drop your car off in the parking lot.</p><h2>Tax Benefits</h2><p>It seems that every non-profit these days is encouraging you to donate your vehicle to charity and \"get a tax deduction.\" But there’s a simple distinction between donating your car to The Healing Center versus donating it almost anywhere else.</p><p>As of January 1, 2005, the rules on how much you can write off your taxes were tightened. If the organization sells your car, as most do, you can deduct only the amount they sold it for--and they may sell it for far less than it’s worth. However, if the organization gives your car to someone who will drive it, as The Healing Center does, you can claim full Blue Book value--a significant difference on your taxes. (It’s important to note that when you donate a vehicle, you receive a tax deduction, not a tax credit.)</p><h2>So where does your car end up? </h2><p>Those on the receiving end of The Healing Center’s auto donation program must fill out a detailed questionnaire, meet the eligibility requirements of the program, and be approved by the Benevolence Review Team. Vehicles are given to single parent families or individuals needing transportation for employment or who are enrolled in school to obtain employment.</p>"
}
Displayed in my template as:
<p>{{service.description}}</p>
This worked for me:
<div ng-bind-html="service.description | to_trusted"></div>
Filter
angular.module('app')
    .filter('to_trusted', ['$sce', function($sce) {
        // Wrap the string with $sce.trustAsHtml so ng-bind-html will render it as HTML.
        return function(text) {
            return $sce.trustAsHtml(text);
        };
    }]);

Retrieving a fully qualified street address from ZIP / postal code

I have a form in which my users need to enter the following location data:
Full address line (street address, apartment, suite, unit, building, floor)
House number
City
State / province / region
ZIP / Postal code
Country
To simplify the completion of this form, I would like to automatically fill in the fully qualified address (address line, city, state/province, etc.) by letting the user enter only their country, ZIP code and house number.
Is it correct that these 3 items are sufficient to lookup the address in the United States? Or is less or more information necessary? And is the answer to this question different for every country? Moreover, is there a service, API, or library that can be utilized for this purpose (e.g. Google Maps or OpenStreetMap)?
Great questions!
Is it correct that these 3 items are sufficient to look up the address in the United States?
No. Unfortunately these three will get you down to ~hundreds of possible addresses in the US.
Is the answer to this question different for every country?
Yes! Postal systems vary greatly from country to country, and your users will have different expectations about what they need to supply - Brits don't expect to have to enter a full address, for example.
With the UK, Canada and Australia you can usually get to a single address from the house number and postcode. But you cannot guarantee this. There may be sub-premise information or business information which requires a bit of interaction with the user to check you have the right address.
Some countries, such as France, do not have complete premise number coverage. With these you can take the premise number & postcode but depending upon the town you have to alter your behavior to either trust and accept the input or prompt them for a correction.
Another important consideration when planning your workflow is the need to allow for people who do not know their postcode / ZIP. It does not happen often, but sometimes people have just moved, or occasionally a property's postcode / ZIP changes, so it is important to be flexible about the information you require.
Is there a service, API, or library that can be utilized for this purpose?
Yes - there are several solutions around that offer the ability to capture global addresses. Experian Data Quality (my company) offers a hosted or on-premises solution that allows for this.
Try it out here - on the right hand side under the "Do you want to know more?" you can switch countries, the prompt updates and the interaction occurs if needed.
I can only answer about US addresses (I work at SmartyStreets), but the answer is no, that won't work.
Kudos for your desire to improve the user experience. Unfortunately, I would not recommend trying this, and here's why:
A US ZIP code, in its entirety, is actually 11 digits long (12 with the check digit):
The first three digits are the SCF (Sectional Center Facility), kind of like a region code
The first five digits are your typical 5-digit ZIP code that specifies a set of carrier routes
The next 4 digits are more precise, often narrowing down an address to block-level.
The next 2 digits are seldom used except in barcodes, but they indicate the delivery point. In theory, this would specify a particular house, apartment, or mailbox, but in reality, sometimes the 11-digit code is ambiguous (common in large complexes, street blocks, or PO facilities). It's typical for the delivery point to correlate to the house or apartment number, but not always.
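The breakdown above can be sketched as a small parser. This is only an illustration of the digit layout just described; the function name, the dictionary keys, and the sample code are hypothetical, and the check digit is not handled.

```python
def split_zip11(code: str) -> dict:
    """Split an 11-digit US ZIP code into the parts described above."""
    # Ignore separators such as hyphens or spaces.
    digits = "".join(ch for ch in code if ch.isdigit())
    if len(digits) != 11:
        raise ValueError("expected 11 digits (ZIP5 + 4 + delivery point)")
    return {
        "scf": digits[:3],               # Sectional Center Facility, a region code
        "zip5": digits[:5],              # the familiar 5-digit ZIP code
        "plus4": digits[5:9],            # the +4 add-on, often block-level
        "delivery_point": digits[9:11],  # usually only seen in barcodes
    }


# Made-up code, not a real address.
parts = split_zip11("12345-6789-01")
print(parts)
```

A user who types only the 5-digit part supplies just the "zip5" piece, which is why so many candidate addresses remain.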
So in your situation:
Knowing the country narrows down the possibilities to just 350,000,000+ addresses
Knowing the 5-digit ZIP code narrows it down to somewhere around 10,000+ addresses (important note: not everyone knows their 5-digit ZIP code, and ZIP codes change. What's more, users may not be sure whether to enter their PO box ZIP code or their house ZIP code. And what if their house doesn't receive mail? Or what if they're in the military and their 5-digit ZIP is in flux?)
Knowing the house number may narrow down the address candidates to anywhere from 1-1000. It depends how "big" the ZIP code is. (But ZIP codes are not polygons).
So no, it is not sufficient to know these three parts of the address. The country is practically worthless at that point, and the ZIP code is locality/city-specific at best. The house number might appear dozens, if not hundreds, of times in a ZIP code. (I grew up in the boonies where our house number was unique, but that's rare.)
And yes, the answer to this question varies country to country, but this reasoning holds true for most developed countries. Less developed countries don't have such organization to their postal system.
Is there a service that can do this? Not if you don't want your users to scroll through dozens or hundreds of results. If they have to look through more than just a couple, you're better off just asking them to type their full address.
I answered a very similar question just the other day. You might find it useful.
So now that I've rained doomsday on your idea, how about an alternative? Of course I'm partial to SmartyStreets' autocomplete, which suggests addresses, geo-located close to the user, as they're typing. I should mention that it's free. It doesn't actually verify the address until the user is finished or has chosen one of the suggestions, but it does reduce keystrokes.
Further on this UX vein, I'd recommend putting country as the first field of your address form. This way, you can alter the form's format based on the country they choose. If you use a service like LiveAddress, you can have the user type their address in a format comfortable to them in a single field, rather than across multiple text boxes in your arbitrary order, since LiveAddress can parse their input.
You could easily achieve this by using the Google Maps reverse geocoding API. Here's a link to its documentation.
I don't know of any country where there is a one-to-one mapping between a post code and a street address. Except Singapore. Postal Codes in SG
In that particular case you can use the post code to fill in the remaining fields; in any other case you can derive the city name and the street address, but probably not the house number.
Example 1: (derive full street address from post code)
https://geocode.xyz/339696?geoit=xml
<geodata>
<latt>1.32035</latt>
<longt>103.87430</longt>
<elevation/>
<standard>
<stnumber>88</stnumber>
<addresst>88 GEYLANG BAHRU</addresst>
<postal>339696</postal>
<city>Singapore</city>
<prov>SG</prov>
<countryname>Singapore</countryname>
<confidence>0.5</confidence>
</standard>
</geodata>
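A response like the one above can be turned into form fields with a few lines of Python. The sketch below parses a copy of the XML from Example 1 held in a string, so no network request is made; the function name is hypothetical, and the element names are simply those visible in the sample response.

```python
import xml.etree.ElementTree as ET

# Copy of the response shown in Example 1.
RESPONSE_XML = """<geodata>
<latt>1.32035</latt>
<longt>103.87430</longt>
<elevation/>
<standard>
<stnumber>88</stnumber>
<addresst>88 GEYLANG BAHRU</addresst>
<postal>339696</postal>
<city>Singapore</city>
<prov>SG</prov>
<countryname>Singapore</countryname>
<confidence>0.5</confidence>
</standard>
</geodata>"""


def parse_standard_address(xml_text: str) -> dict:
    """Return the children of <standard> as a {tag: text} dictionary."""
    root = ET.fromstring(xml_text)
    standard = root.find("standard")
    if standard is None:
        raise ValueError("no <standard> element in response")
    return {child.tag: (child.text or "").strip() for child in standard}


address = parse_standard_address(RESPONSE_XML)
print(address["addresst"], address["city"], address["postal"])
```

The "confidence" field is kept as a string here; whether to trust a result enough to pre-fill the form is a decision for the calling code.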
Example 2: (Get most common street address, and other variations of city name)
https://geocode.xyz/27777?region=DE&geoit=xml
<geodata>
<latt>53.06060</latt>
<longt>8.58388</longt>
<elevation/>
<standard>
<stnumber>20</stnumber>
<addresst>20 Bokenbusch</addresst>
<postal>27777</postal>
<city>Ganderlesee</city>
<prov>DE</prov>
<countryname>Germany</countryname>
<confidence>0.5</confidence>
</standard>
<alt>
<loc>
<city>Ganderkesee</city>
<latt>53.06868</latt>
<longt>8.57437</longt>
<cc>951</cc>
</loc>
<loc>
<city>Bremen</city>
<latt>53.07675</latt>
<longt>8.57559</longt>
<cc>172</cc>
</loc>
<loc>
<city>Schierbrok</city>
<latt>53.08639</latt>
<longt>8.58037</longt>
<cc>166</cc>
</loc>
The number in "cc" indicates how many street addresses in that city share the given post code.
Good luck!