The following JSON is not passing validation. The validator complains at the first string value, the one starting with "Can an officer…".
It looks like the start of a valid string value for the key. What on earth could be wrong with this?
{
"DUI": {
"Can an officer arrest me because he smelled alcohol on my breath?":"<br />No, odor alone is not sufficient basis for arrest. However, odor combined with other observations such as weaving, slurred speech, and bloodshot eyes
may be enough to give an officer probable cause to arrest you for DUI.",
"Can I be convicted if I refused to take the breath test or the result was below .08?":"<br />Yes, in Washington the prosecutor can prove DUI one of two ways: 1. Blood or breath test result above .08, OR 2. Proof the person was under the
influence of or affected by liquor or drugs.
<br /><br />Additionally, if a person refused to take a test, that fact may be introduced as evidence at trial.",
"How can I be arrested if I wasn't driving my car?":"<br />A person who is in physical control of a vehicle and appears to be under the influence of drugs or alcohol may be arrested and charged under RCW
46.61.504.",
"What can I do now that I have been charged?":"<br />Contact an attorney to find out what options are available to you."
}
}
The validator doesn't like the raw line breaks inside the string values of your pasted JSON; JSON strings cannot contain unescaped newlines:
Parse error on line 3:
...ohol on my breath?":"<br />No, odor alon
-----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '['
I pasted the JSON from your question into the validator and removed unnecessary line breaks - this works:
{
"DUI": {
"Can an officer arrest me because he smelled alcohol on my breath?": "<br />No, odor alone is not sufficient basis for arrest. However, odor combined with other observations such as weaving, slurred speech, and bloodshot eyes may be enough to give an officer probable cause to arrest you for DUI.",
"Can I be convicted if I refused to take the breath test or the result was below .08?": "<br />Yes, in Washington the prosecutor can prove DUI one of two ways: 1. Blood or breath test result above .08, OR 2. Proof the person was under the influence of or affected by liquor or drugs.<br /><br />Additionally, if a person refused to take a test, that fact may be introduced as evidence at trial.",
"How can I be arrested if I wasn't driving my car?": "<br />A person who is in physical control of a vehicle and appears to be under the influence of drugs or alcohol may be arrested and charged under RCW 46.61.504.",
"What can I do now that I have been charged?": "<br />Contact an attorney to find out what options are available to you."
}
}
http://jsonlint.com/
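For completeness, the underlying rule is that JSON forbids raw control characters (including line breaks) inside string values; they must be escaped as \n or removed, which is exactly what the cleaned-up version above does. A quick way to confirm this outside the online validator is a small Python check (Python is just for illustration here):
import json

broken = '{"q": "line one\nline two"}'   # literal newline inside the string value
fixed = '{"q": "line one\\nline two"}'   # the same text with an escaped newline

try:
    json.loads(broken)
except ValueError as e:
    print("broken:", e)                  # Invalid control character ...

print("fixed:", json.loads(fixed))       # {'q': 'line one\nline two'}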
I was testing SARSA with lambda = 1 on Windy Grid World. If exploration causes the same state-action pair to be visited many times before reaching the goal, the eligibility trace gets incremented each time without any decay, so it explodes and causes everything to overflow.
How can this be avoided?
If I've understood your question correctly, the problem is that the trace for a given state gets incremented too much. In this case, a potential solution is to use replacing traces instead of the classic accumulating traces.
The idea in replacing traces is to reset the trace to a fixed value (typically 1) each time the state is visited, instead of adding to it. In other words, the difference between the two kinds of traces comes down to whether a visit increments the trace or resets it.
You can find more information in the classic Sutton & Barto book Reinforcement Learning: An Introduction, specifically in Section 7.8.
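If it helps, here is a minimal sketch (not taken from the book, just an illustration) of the two update rules for a tabular trace, where e maps (state, action) pairs to trace values, and gamma and lam are the discount and trace-decay parameters:
# Accumulating traces: the visited pair keeps growing, which is what
# explodes with lambda = 1 and repeated visits before reaching the goal.
def accumulating_update(e, s, a, gamma, lam):
    for key in e:
        e[key] *= gamma * lam
    e[(s, a)] = e.get((s, a), 0.0) + 1.0

# Replacing traces: the visited pair is reset to 1 instead, so it stays
# bounded no matter how often the same state-action pair is revisited.
def replacing_update(e, s, a, gamma, lam):
    for key in e:
        e[key] *= gamma * lam
    e[(s, a)] = 1.0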
Please look at this page http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845. As you would have guessed, I am trying to scrape all the fields on this page. All fields are yielded properly except the Answer field. What I find odd is that the page structure for the question and the answer is almost the same (Table[1] and Table[2]); the question scrapes perfectly but the Answer does not. Here are my XPaths:
question:
['q_main'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[1]/tbody/tr/td/text()').extract()
This works perfectly.
Answer:
['q_answer'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[2]/tbody/tr[2]/td/text()').extract()
This returns a blank. I have reproduced the full XPath, as returned by and verified in XPath Helper and the console.
What am I overlooking? What am I not able to see?
It seems like your XPath has a problem. Check out this demo from the Scrapy shell:
In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract()
Out[1]:
[u'\r\n\r\n',
u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY (SHRI PIYUSH GOYAL)\r\n\r\n ',
u'(a) & (b): So far 29 coal mines have been auctioned under the provisions of Coal Mines (Special Provisions) \r\nAct, 2015 and the Rules made thereunder. The auction process for non-regulated sector viz. Iron and Steel, \r\nCement and Captive Power was based on forward bidding process where bidders had to submit their final price \r\noffer above the applicable floor price. In case of Power sector which is a regulated one, reverse bidding \r\nmethodology was adopted where bidders had to submit bids below the applicable ceiling price, which shall be \r\ntaken as fuel cost in determination of power tariff. In case, bid price reaches Rs. zero in reverse bidding, \r\nthe bidding is based on additional premium payable to the concerned State Government, over and above the \r\nfixed reserve price of Rs. 100/- per tonne.\r\n\r\n',
u'\r\nRevenue which would accrue to the coal bearing State Government concerned comprises of Upfront payment \r\nas prescribed in the tender document, Auction proceeds and Royalty on per tonne of coal production. State-wise \r\ndetails of 29 coal mines auctioned so far along-with specified end-uses and estimated revenue which would accrue \r\nto coal bearing state during the life of mine/lease period as given below:\r\n',
u'\r\n\r\nS.No\tState\t\tSpecified End \u2013Use\t\t\tName of Coal Mine\t\tEstimated Revenueduring \r\n\t\t\t\t\t\t\t\t\t\t\t\tthe life of mine/lease \r\n\t\t\t\t\t\t\t\t\t\t\t\tperiod (Rs. In Crores)\r\n1\tChattishgarh\tNon-Regualted Sector\t\t\tChotia\t\t\t\t51596\r\n\t\t\t\t\t\t\t\tGare Palma IV-4\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-5\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-7\t\r\n\t\t\t\t\t\t\t\tGare-Palma Sector-IV/8\r\n2\tJharkhand\tNon-Regualted Sector\t\t\tBrinda and Sasai\t\t49272\r\n\t\t\t\t\t\t\t\tDumri\r\n\t\t\t\t\t\t\t\tKathautia\r\n\t\t\t\t\t\t\t\tLohari\r\n\t\t\t\t\t\t\t\tMeral\r\n\t\t\t\t\t\t\t\tMoitra\r\n\t\t\tPower\t\t\t\t\tGaneshpur\r\n\t\t\t\t\t\t\t\tJitpur\r\n\t\t\t\t\t\t\t\tTokisud North\r\n3\tMadhya Pradesh\tNon-Regualted Sector\t\t\tBicharpur\t\t\t42811\r\n\t\t\t\t\t\t\t\tMandla North\r\n\t\t\t\t\t\t\t\tMandla-South\r\n\t\t\t\t\t\t\t\tSialGhoghri\r\n\t\t\tPower\t\t\t\t\tAmelia North\r\n4\tMaharashtra\tNon-Regualted Sector\t\t\tBelgaon\t\t\t\t2738\r\n\t\t\t\t\t\t\t\tMarkiMangli III\r\n\t\t\t\t\t\t\t\tNerad Malegaon\r\n5\tOdisha\t\tPower\t\t\t\t\tMandakini\t\t\t33741\r\n\t\t\t\t\t\t\t\tTalabira-I\r\n\t\t\t\t\t\t\t\tUtkal - C\r\n6\tWest Bengal\tNon-Regualted Sector\t\t\tArdhagram\t\t\t13354\r\n\t\t\tPower\t\t\t\t\tSarisatolli\r\n\t\t\t\t\t\t\t\tTrans Damodar\r\n\tTotal\t\t\t\t\t\t\t(29) coal blocks\t\t193512\r\n',
u'\r\n\r\n\r\nCoal mine has been assigned to successful bidder as Designated Custodian in view of a court case.\r\n\r\n',
u'\r\nIn addition, an estimated amount of Rs. 1,41,854 Crores would accrue to coal bearing States from allotment \r\nof 38 coal mines to Central and State PSU\u2019s.\r\n\r\n',
u'Out of these 29 coal mines, 16 are operational coal mines included in Schedule-II of the Act and 13 are \r\nnon-operational included in Schedule-III of the Act. Milestones for development and production of coal \r\nfrom the auctioned coal mines have been prescribed under the Coal Mines Development and Production Agreement \r\nsigned with the Successful Bidder. \r\n\r\n ',
u'(c) & (d): Yes, Sir. A few complaints were received regarding cartelization in bidding. It is not possible to \r\nconclusively establish the same until investigation are carried out by Competent Authority. ',
u'\r\n\r\n\r\nThe Government has not approved the recommendation of NA for declaration of successful bidder in case of \r\n4 coal mines namely Gare Palma IV/2&3, Gare Palma IV/1 and Tara as final closing bid price was not found \r\nto be reflecting fair value. ',
u'\r\n\r\n\r\n']
This sometimes happens when you are dealing with tables; you can refer to this for more information.
At least part of the source of your difficulty lies in the fact that the code you see in the console is not the source html that your spider gets as a response (and on which the selectors operate).
In particular, it is extremely common for a <table> to not include a <tbody>; but when your browser translates the html to the DOM tree, it slaps in <tbody> tags. And there was a time when much of the layout of webpages was actually accomplished with (crazily) nested tables. As a result, the DOM of such a website will typically have many more <tbody> elements than the html source.
What this means in practical terms is that:
It is generally a good idea to find a relatively simple xpath (or CSS selector, or ...) for the element(s) you want to select -- not the behemoth you sometimes get from your developer tools.
It is generally a bad idea to include /tbody in your xpath (unless there is an associated attribute, indicating that the tag exists in the source html).
For the site in question,
response.xpath('//td[#class="griditemq"]').extract()
returns a list with the first element the question and the second element the answer.
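As a rough illustration of how that selector fits into a spider (the spider name and item keys below are made up), something like the following should work:
import scrapy

class QuestionAnswerSpider(scrapy.Spider):
    name = "loksabha_qa"  # illustrative name
    start_urls = [
        "http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845",
    ]

    def parse(self, response):
        cells = response.xpath('//td[@class="griditemq"]')
        # The first cell holds the question, the second holds the answer.
        yield {
            "q_main": cells[0].xpath('.//text()').extract(),
            "q_answer": cells[1].xpath('.//text()').extract(),
        }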
Is it possible to parse html tags in a JSON value? Possibly through a filter? I have the following JSON.
{
"title" : "Auto Donation Program",
"shortname" : "auto_donation_program",
"summary": "Donated vehicles find new homes through this program. Recipients are eligible to apply if they have been actively participating at Vineyard Cincinnati or The Healing Center under the guidelines of the program for six months.",
"description" : "<h2>Give your automobile to a new home to help a family in need</h2><p>Please contact Deena Casagrande at (513) 346-4080 Ext. 207 to make arrangements for auto donations. Please do not drop your car off in the parking lot.</p><h2>Tax Benefits</h2><p>It seems that every non-profit these days is encouraging you to donate your vehicle to charity and \"get a tax deduction.\" But there’s a simple distinction between donating your car to The Healing Center versus donating it almost anywhere else.</p><p>As of January 1, 2005, the rules on how much you can write off your taxes were tightened. If the organization sells your car, as most do, you can deduct only the amount they sold it for--and they may sell it for far less than it’s worth. However, if the organization gives your car to someone who will drive it, as The Healing Center does, you can claim full Blue Book value--a significant difference on your taxes. (It’s important to note that when you donate a vehicle, you receive a tax deduction, not a tax credit.)</p><h2>So where does your car end up? </h2><p>Those on the receiving end of The Healing Center’s auto donation program must fill out a detailed questionnaire, meet the eligibility requirements of the program, and be approved by the Benevolence Review Team. Vehicles are given to single parent families or individuals needing transportation for employment or who are enrolled in school to obtain employment.</p>"
}
Displayed in my template as:
<p>{{service.description}}</p>
This worked for me:
<div ng-bind-html="'{{service.description}}' | to_trusted"></div>
Filter
angular.module('app')
  .filter('to_trusted', ['$sce', function($sce) {
    return function(text) {
      return $sce.trustAsHtml(text);
    };
  }]);
Google Maps API does a great job trying to locate a match for nearly every query. But if I'm only interested in real locations, how can I filter out Google's guesses?
For example, according to Google, "under a rock" is located at "The Rock, Shifnal, Shropshire TF11, UK". But a person who answers the question, "Where are you?" with "Under a rock" does not mean to indicate that they are in Shropshire, UK. Instead they just don't want to tell you — well, either that or they are in real trouble, thankfully with web access, stuck under some rock.
I have several million user generated location strings that I'm attempting to find coordinates for. If someone writes "under a rock" I'd rather just leave the coordinates null instead of putting an obviously wrong point in Shropshire, UK.
Here are some other examples:
under a rock => Shropshire, UK
planet earth => Cheshire, UK
nowhere => Scituate, RI, USA
travelling => Madrid, Spain
hiding => Anderson, CA, USA
global => Midland, TX, USA
on the web => North Part, ON, Canada
internet => Frisco, TX, USA
worldwide => Mie Prefecture, Japan
Ultimately I'm after a solid way to return coordinates from a string but return false if the location is like the above.
I need to build a function that returns the following:
Twin Cities => Return the colloquial coordinates of Minneapolis-St. Paul
right behind you => false [Google gets this one "right" -- at least for my purposes]
under a rock => false
nowhere => false
Canada => Return coordinates
Mission District San Francisco => Return coordinates
Chicago => Return coordinates
a galaxy far far away => false [Google also gets this "right" -- zero results]
What do you recommend?
Here's a comma-delimited array for you to play along with at home:
'twin cities','right behind you','under a rock','nowhere','canada','mission district san francisco','chicago','a galaxy far far away','london, england','1600 pennsylvania ave, washington, d.c.','california','41.87194,12.56738','global','worldwide','on the internet','mars'
And here's the url format:
'http://maps.googleapis.com/maps/api/geocode/json?address=' + query + '&sensor=false'
ex: http://maps.googleapis.com/maps/api/geocode/json?address=twin+cities&sensor=false
It seems most of your incorrect results have a "partial_match" attribute set to "true".
e.g.
Twin Cities, no partial match:
http://maps.googleapis.com/maps/api/geocode/json?address=Twin%20Cities&sensor=false
under a rock, 10+ results, all with partial match:
http://maps.googleapis.com/maps/api/geocode/json?address=under%20a%20rock&sensor=false
Though the original purpose of this attribute is not to tell whether a locality is correct or not, it's still pretty accurate on the dataset you provided.
From Google Maps API documentation:
partial_match indicates that the geocoder did not return an exact match for the original request, though it was able to match part of the requested address. You may wish to examine the original request for misspellings and/or an incomplete address.
Partial matches most often occur for street addresses that do not exist within the locality you pass in the request. Partial matches may also be returned when a request matches two or more locations in the same locality. For example, "21 Henr St, Bristol, UK" will return a partial match for both Henry Street and Henrietta Street. Note that if a request includes a misspelled address component, the geocoding service may suggest an alternate address. Suggestions triggered in this way will not be marked as a partial match.
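As a rough sketch of that heuristic against the same endpoint used in the question (illustration only, not production code):
import requests

GEOCODE_URL = "http://maps.googleapis.com/maps/api/geocode/json"

def coordinates_or_none(query):
    # Return (lat, lng) only when Google has at least one non-partial match.
    params = {"address": query, "sensor": "false"}
    results = requests.get(GEOCODE_URL, params=params).json().get("results", [])
    good = [r for r in results if not r.get("partial_match")]
    if not good:
        return None
    location = good[0]["geometry"]["location"]
    return location["lat"], location["lng"]

print(coordinates_or_none("twin cities"))   # coordinates for Minneapolis-St. Paul
print(coordinates_or_none("under a rock"))  # None (all results are partial matches)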
This might not be a direct answer to your question.
If you are currently going through thousands of user inputs saved in the database and filtering out the invalid ones, I think it is too late and not feasible. The output can only be as good as the input.
The better way is to make the input as valid as possible; end users don't always know what they want.
I would suggest that users enter their address through autocomplete, so that you will always have a valid address:
The user enters text and selects one of the suggestions.
A marker and an info window are shown.
When the user confirms the input, you save the info window text as the user input, not the raw text input.
This way, you don't need to validate or filter user input.
I know there are Bayes classifier implementations in JavaScript. I've never tried them, though; I currently use a Ruby implementation which works correctly.
You could have two classifications (Real and Unreal), training each of them with as many samples as you want (30 or 50 samples each?). The better trained your classifier is, the more accurate it will be.
Then you'd test each location string before calling the Google Maps API, to filter out the Unreal ones.
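I use Ruby, but transposed to Python for the sake of a sketch (the training strings below are invented placeholders, a real training set would need far more of them, and scikit-learn stands in for whatever Bayes library you pick), the idea looks roughly like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: real-looking locations vs. obvious non-locations.
samples = ["chicago", "mission district san francisco", "london, england", "canada",
           "under a rock", "nowhere", "planet earth", "on the web"]
labels = ["real", "real", "real", "real",
          "unreal", "unreal", "unreal", "unreal"]

classifier = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
classifier.fit(samples, labels)

# Only send strings the classifier thinks are real to the geocoding API.
for query in ["twin cities", "hiding", "worldwide", "california"]:
    print(query, "->", classifier.predict([query])[0])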
To truly succeed here you are going to have to build a database-driven system that facilitates both positive and negative lookups, with AI that gets smarter over time, just like Google did. I don't believe there is a single algorithm that will filter out results based on cosmetics alone.
I looked around and found a site that contains every city in the world. Unfortunately, it doesn't give them as a single list, so you'd have to spend a bit of time harvesting the data. The site is http://www.fallingrain.com/world/index.html.
They seem to be using individual directories for organizing countries, states, and cities, broken down further by alphabet. It is, however, the only comprehensive source that I could find.
If you manage to get all of these locations into a database, then you will have the beginnings of a positive lookup system for your queries. You'll also need to start building separate lists of bi-, tri-, and quad-city areas, as well as popular destinations and landmarks.
You should also store a negative lookup table for all known mismatches. People have a tendency to generate similar false data and typos across large populations, so the most popular "nowhere" and "planet earth" answers will be repeated over and over again, in every language you can think of.
One of the benefits of this strategy is that you can run relational queries against your data to get matches in bulk as well as one at a time. Since some false negatives will occur at the beginning, your main decision is what to do with unmatched items. You may want to adopt a strategy where you can both reject non-matches and substitute partial matches with the nearest actual match.
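If it helps to make the flow concrete, here is a very rough sketch of the positive/negative lookup idea (all names are placeholders for your own tables and geocoding call):
known_places = {}   # verified string -> coordinates, grows over time
known_junk = set()  # strings already confirmed to be non-locations

def resolve_location(raw, geocode):
    key = raw.strip().lower()
    if key in known_junk:        # negative lookup: "nowhere", "planet earth", ...
        return None
    if key in known_places:      # positive lookup: previously verified strings
        return known_places[key]
    result = geocode(key)        # fall back to the geocoder or manual review
    if result is None:
        known_junk.add(key)      # remember the miss so it is never re-queried
    else:
        known_places[key] = result
    return result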
Anyhow, I hope this helps. It is a bit of effort but if it's important it will be worth it. Who knows, you may end up with a database that's actually worth something. Maybe even a Google maps gateway service for companies/developers who need the same functionality. (:
Take care.
I'm not a Natural Language Processing student, yet I know it's not a trivial strcmp(n1, n2).
Here's what I've learned so far:
Comparing personal names can't be solved 100% reliably.
There are ways to achieve a certain degree of accuracy.
The answer will be locale-specific; that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
Build (or copy) an algorithm which grades the relationship between 2 input names.
Find an indexing method (for names in my database, for hash tables, etc.).
Note:
My task isn't about finding names in text, but about comparing 2 names, e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))

def name_compare(name1, name2):
    return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
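For instance, a minimal sketch (using the third-party jellyfish package for the Soundex codes; any Soundex implementation would do) that normalizes the comma ordering first:
import jellyfish  # third-party package providing soundex()

def normalize(name):
    # Turn "Tsakala, Berry" into "Berry Tsakala" before encoding.
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = first + " " + last
    return name

def soundex_key(name):
    return [jellyfish.soundex(word) for word in normalize(name).split()]

print(soundex_key("Tsakala, Berry"))   # same codes as...
print(soundex_key("Berry Tsakala"))    # ...this one
print(soundex_key("Bernard Tsakala"))  # "Bernard" encodes differently from "Berry"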
We've just been doing this sort of work non-stop lately and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-english names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match to Berry we would flip the word order around and then get the match.
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively, people-name matching is not that complicated. It's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched - and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list, or may be able to look up details of the person on the internet - but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
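A minimal sketch of the alias-list plus word-order-flip approach described above (the alias groups are illustrative examples, not a real dataset):
# Illustrative alias groups only; a real list would be much larger.
ALIASES = [{"berry", "bernard"}, {"dick", "richard"}, {"steve", "stephen"}]

def same_forename(a, b):
    a, b = a.lower(), b.lower()
    return a == b or any({a, b} <= group for group in ALIASES)

def split_name(name):
    # "Tsakala, Berry" or "Berry J. Tsakala" -> ("berry", "tsakala")
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        parts = first.split() + [last]
    else:
        parts = name.split()
    # Keep the first and last words; discard middle names and initials.
    return parts[0].lower(), parts[-1].lower()

def names_match(name1, name2):
    f1, l1 = split_name(name1)
    f2, l2 = split_name(name2)
    if l1 == l2 and same_forename(f1, f2):
        return True
    # If that fails, flip the word order and try again.
    return f1 == l2 and same_forename(l1, f2)

print(names_match("Berry Tsakala", "Bernard Tsakala"))    # True, via the alias list
print(names_match("Tsakala, Berry", "Berry J. Tsakala"))  # True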
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with the Tanimoto coefficient when using UTF-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher().
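For example (a minimal sketch; the exact scores are approximate):
from difflib import SequenceMatcher

def similarity(name1, name2):
    # ratio() works on the raw Unicode strings, so accented characters
    # lower the score gracefully instead of breaking the comparison.
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()

print(similarity("Berry Tsakala", "Bernard Tsakala"))  # roughly 0.86
print(similarity("José García", "Jose Garcia"))        # roughly 0.82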