I am trying to create a HIT for Amazon MTurk using an HTML file

import boto3

# Make the client object against the sandbox endpoint
MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'
mturk = boto3.client('mturk',
    aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
    region_name='us-east-1',
    endpoint_url=MTURK_SANDBOX
)

# Read the HTML question file
questionfile = open("/home/nm6088/mturk files/prac4_Jun27/index.html", "r")
questions = questionfile.read()
localRequirements = [{
    'QualificationTypeId': '00000000000000000071',
    'Comparator': 'EqualTo',
    'LocaleValues': [{
        'Country': 'US'
    }],
    'RequiredToPreview': True
}]
hit = mturk.create_hit(
    Title='Write a simple version of the test',
    Description='A test HIT that requires the user to write a simple text.',
    Keywords='simple, qualification, test',
    Reward='0.01',
    MaxAssignments=1,
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=600,
    AutoApprovalDelayInSeconds=200,
    Question=questions,
    QualificationRequirements=localRequirements
)
print("A new HIT has been created. You can preview it here:")
print("https://workersandbox.mturk.com/mturk/preview?groupId=" + hit['HIT']['HITGroupId'])
print("HITID = " + hit['HIT']['HITId'] + " (Use to Get Results)")
Running this raises:
botocore.exceptions.ClientError: An error occurred (ParameterValidationError) when calling the CreateHIT operation: There was an error parsing the XML question or answer data in your request. Please make sure the data is well-formed and validates against the appropriate schema. Details: cvc-elt.1.a: Cannot find the declaration of element 'HTMLQuestion'. (1656352060543 s)
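The validator cannot find a schema declaration for the HTMLQuestion element, which usually means the HTML is being passed bare, or the wrapper is missing its xmlns attribute. A minimal sketch of wrapping the file contents in the HTMLQuestion envelope, reusing the questions variable from above:

# Sketch: wrap the raw HTML in the HTMLQuestion XML envelope that
# CreateHIT validates against. The xmlns attribute is required.
html_question = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
{}
]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>""".format(questions)

# Then pass the wrapped XML instead of the bare HTML:
# hit = mturk.create_hit(..., Question=html_question, ...)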


Is there a way to take a list of strings and create a JSON file, where both the keys and the values are list items?

I am creating a Python script that can read scanned, tabular PDFs, extract some important data, and insert it into a JSON file that will later be loaded into a database (I will also be developing the DB as a project for learning MongoDB).
My issue is that I have never worked with JSON files before, but JSON was the format I was recommended to output to. The scraping script works, and while the pre-processing could be a lot cleaner, for now it does the job. The problem is that the keys and values end up in the same list, and some values are split across two list items because they contained a decimal point.
Since I know the indexes of the list, I could assign keys and values by position, but then the script would not generalize to other PDFs; I would rather not hard-code it.
import re  # needed by cleanText below

import PyPDF2 as pdf2
import textract

filename = "TestSpec.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""

# Read every page in turn
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

# If no text could be extracted (scanned PDF), fall back to OCR via textract
if text == "":
    text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
    '''
    Takes the byte data extracted from scanned PDFs and cleans out all
    unnecessary data.
    Requires re.
    '''
    stringedText = str(x)
    noNewlines = stringedText.replace('\n', '')
    splitText = re.split(r'\W+', noNewlines)
    casedText = [word.lower() for word in splitText]
    cleanOne = [word for word in casedText if word != 'n']
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean

cleaned = cleanText(text)
This is the current output:
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
I want the output set up as JSON, like:
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. I'm just not sure how to do that.
Here is a screenshot of the data from my sample PDF.
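As a starting point for the pairing question, a minimal sketch that assumes keys and values simply alternate in a flat list (the sample data here is made up):

# Sketch: pair up a flat list where keys and values alternate.
flat = ["date", "21feb2019", "sequence id", "lacz-rp"]
paired = dict(zip(flat[0::2], flat[1::2]))
print(paired)  # {'date': '21feb2019', 'sequence id': 'lacz-rp'}

Values that were split at a decimal point would need to be rejoined before pairing, which requires knowing which fields are numeric.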
So, I have figured out some of this. I am still having issues grabbing the last third of the data I need without explicitly programming it in, but here is what I have so far. Once I have everything working I will worry about optimizing and condensing it.
# For PDF reading
import PyPDF2 as pdf2
import textract
# For data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json

# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()

# Checks if the extracted data is in string form or a picture; if a picture,
# textract reads the data. It then closes the pdf file.
if text == "":
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()

# Converts text to string from byte data for preprocessing
stringedText = str(text)

# Replace escaped '\n' sequences with actual newlines, and lowercase everything
formattedText = stringedText.replace('\\n', '\n').lower()

# Slice the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)

# Creating the PrimerData dictionary
with open("PrimerData.json", 'w') as file:
    primerDataSlice = clean[clean.index("molecular"): -1]
    primerData = re.split(r": |\n", primerDataSlice)
    primerKeys = primerData[0::2]
    primerValues = primerData[1::2]
    primerDict = {"Primer Data": dict(zip(primerKeys, primerValues))}
    # Generating the JSON array "Primer Data"
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall(r'(\d{2}[/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[/\- ]\d{2,4})', clean2)
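One gotcha with that last line: because the pattern contains capture groups, re.findall returns a list of tuples rather than strings, with the full date as the first element of each tuple. A small illustration with a shortened pattern:

# With capture groups, re.findall returns tuples rather than strings;
# take the first element of each tuple to get the full date text.
matches = re.findall(r'(\d{2}[/\- ](jan|feb)[/\- ]\d{2,4})', "21 feb 2019")
dates = [m[0] for m in matches]  # ['21 feb 2019']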
Without input data it is difficult to give you working code; a minimal working example with input would help. As for JSON handling, Python dictionaries dump to JSON easily. See the examples here:
https://docs.python-guide.org/scenarios/json/
Get a JSON string from a dictionary and write it to a file; then figure out how to parse the text into a dictionary.
import json

d = {"Date": "21feb2019", "Sequence ID": "lacz-rp", "Sequence 5'-3'": "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
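To finish off that last comment, a minimal sketch of the write (the filename is just an illustration):

# Persist the JSON string; the filename here is a placeholder.
with open("primer_data.json", "w") as f:
    f.write(json_data)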
So, I did figure this out. The problem was really that my pre-processing pulled all the data into a single list, which wasn't a great idea given that the keys for the dictionary never change.
Here is the semi-finished result for building the dictionary and the JSON file.
import re
import json
from datetime import date

# Imports assumed from the earlier snippets plus Biopython
# (older releases that still ship Bio.Alphabet)
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqUtils import molecular_weight as mw
from Bio.SeqUtils import MeltingTemp as mt

# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]

# Collecting shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
    r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
    clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") + 10: clean2.index("Order No.") + 18]

# Finding and grabbing the sequence data, then computing the
# GC content and melting temp (TM)
bases = int(clean2[clean2.index("bases") - 3: clean2.index("bases") - 1])
# clean2 is a string, so this iterates character by character
seqList = [char for char in clean2 if re.match(r'^[AGCT]+$', char)]
sequence = "".join(seqList[:bases])

def gc_content(x):
    # Fraction of G/C bases, as a percentage
    count = 0
    for i in x:
        if i == 'G' or i == 'C':
            count += 1
    return round((count / bases) * 100, 1)

gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") + 10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10

primerDict = {
    "Primer Data": {
        "Sequence": sequence,
        "Bases": bases,
        "TM (50mM NaCl)": tm,
        "% GC content": gc,
        "Molecular weight": moleWeight,
        "ug/OD260": dilWeight,
        "Dilution volume (uL)": dilution
    },
    "Shipment Info": {
        "Ref. No.": refNo,
        "Order No.": orderNo,
        "Ordered by": ordered,
        "Date of Order": dateOrder,
        "Received By": received,
        "Date Received": dateReceived.strftime("%d-%b-%Y")
    }
}

# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

Passing arguments as a JSON object

I am trying to link my Django web app to an Azure ML API. I have a Django form with all the required inputs for the Azure API.
def post(self, request):
    form = CommentForm(request.POST)
    url = 'https://ussouthcentral.services.azureml.net/workspaces/7061a4b24ea64942a19f74ed36e4b438/services/ae2c257d6e164dca8d433ad1a1f9feb4/execute?api-version=2.0&format=swagger'
    api_key = ''  # Replace this with the API key for the web service
    headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}
    if form.is_valid():
        age = form.cleaned_data['age']
        bmi = form.cleaned_data['bmi']
        args = {"age": age, "bmi": bmi}
        json_data = str.encode(json.dumps(args))
        print(type(json_data))
        r = urllib.request.Request(url, json_data, headers)
        try:
            response = urllib.request.urlopen(r)
            result = response.read()
            print(result)
        except urllib.error.HTTPError as error:
            print("The request failed with status code: " + str(error.code))
            print(json_data)
            # Print the headers - they include the request ID and the timestamp,
            # which are useful for debugging the failure
            print(error.info())
            print(json.loads(error.read()))
    return render(request, self.template_name)
When I try to submit the form I get this type error:
TypeError('POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.',)
I am also getting status code 400 and the error below:
{'error': {'code': 'BadArgument', 'message': 'Invalid argument provided.', 'details': [{'code': 'RequestBodyInvalid', 'message': 'No request body provided or error in deserializing the request body.'}]}}
The arguments, per print(json_data), are:
b'{"age": 0, "bmi": 22.0}'
Can someone help me with this?
Try using JsonResponse:
https://docs.djangoproject.com/en/2.1/ref/request-response/#jsonresponse-objects
Also, I don't think you need a template for an API response.
return JsonResponse({'foo': 'bar'})
Thanks all. I found the error: the data was not in the order/structure the service expected.
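For anyone hitting the same 400: classic Azure ML Studio web services generally expect the body wrapped in an Inputs/GlobalParameters envelope rather than a bare dict. A hedged sketch, where "input1" and the field names are assumptions that must match the service's own schema:

import json

# Assumed envelope for a classic Azure ML Studio "execute" endpoint;
# "input1" and the field names must match the service's schema.
payload = {
    "Inputs": {
        "input1": [
            {"age": 0, "bmi": 22.0}
        ]
    },
    "GlobalParameters": {}
}
json_data = str.encode(json.dumps(payload))  # urllib wants bytes, not str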

Saving a leaflet map as an HTML file

I've created a leaflet map of corn yield in Kansas using USDA NASS data. The problem I'm running into is exporting the map to an HTML file with the command:
htmlwidgets::saveWidget(my_interactive_map, "kansas_corn2.html")
but I get this error:
Error in system.file(config, package = package) : 'package' must be of length 1
However, I can produce an HTML file by using Export > Save as Web Page... from the Viewer pane.
How can I achieve the same export result from the command line?
My code for making the map is:
my_interactive_map <- tm_shape(STATE) +
  tm_polygons("Value", textNA = "Not Reported",
              title = unit_desc, palette = c('#8290af', '#512888', '#190019'),
              auto.palette.mapping = FALSE, n = 6, style = "quantile",
              contrast = 0.9, colorNA = "#C0C0C0", border.col = "#E8E8E8",
              showNA = FALSE, legend.is.portrait = FALSE, legend.hist = FALSE,
              popup.vars = c("County: " = "COUNTY_NAME", "Value: " = "Value")) +
  tm_credits("U.S. Department of Agriculture, National Agriculture Statistics Service") +
  tm_format_World(title = paste(year_filt, prodn_practice_desc, commodity_desc,
                                statisticcat_desc, "by", agg_level_desc,
                                "for", state, sep = " "))
my_interactive_map
You appear to be using the tmap library. For that you could use the function detailed here:
library(tmap)
save_tmap(my_interactive_map, "kansas_corn2.html")

Ruby: Handling a JSON response that is not what is expected

I searched online and read through the documents, but have not been able to find an answer. I am fairly new to Ruby, and as part of learning it I wanted to make the script below.
The script essentially does a carrier lookup on a list of numbers provided through a CSV file. The CSV file has a single column with the header "number".
Everything runs fine UNTIL the API gives me an output that is different from the others. In this example, it tells me that one of the numbers in my file is not a valid US number. This then causes my script to stop running.
I am looking for a way to either ignore it (I read about begin/rescue/end, but was not able to get it to work) or, ideally, either create a separate file with those errors or just put the data into the main file.
Any help would be much appreciated. Thank you.
Ruby Code:
require 'csv'
require 'uri'
require 'net/http'
require 'json'

number = 0

CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  uri = URI("https://api.message360.com/api/v3/carrier/lookup.json?PhoneNumber=#{number}")
  req = Net::HTTP::Post.new(uri)
  req.basic_auth 'XXX', 'XXX'
  res = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => true) { |http|
    http.request(req)
  }
  json = JSON.parse(res.body)
  new = json["Message360"]["Carrier"].values
  CSV.open("new.csv", "ab") do |csv|
    csv << new
  end
end
File Data:
number
5556667777
9998887777
Example of a good response (shown as the parsed Ruby hash):
{"Message360"=>{"ResponseStatus"=>1, "Carrier"=>{"ApiVersion"=>"3", "CarrierSid"=>"XXX", "AccountSid"=>"XXX", "PhoneNumber"=>"+19495554444", "Network"=>"Cellco Partnership dba Verizon Wireless - CA", "Wireless"=>"true", "ZipCode"=>"92604", "City"=>"Irvine", "Price"=>0.0003, "Status"=>"success", "DateCreated"=>"2018-05-15 23:05:15"}}}
The response that causes the script to stop:
{
"Message360": {
"ResponseStatus": 0,
"Errors": {
"Error": [
{
"Code": "ER-M360-CAR-111",
"Message": "Allowed Only Valid E164 North American Numbers.",
"MoreInfo": []
}
]
}
}
}
It would appear you can just check json["Message360"]["ResponseStatus"] first: 0 or 1 indicates failure or success.
I'd probably also add a rescue to catch any other errors (malformed JSON, network issues, etc.):
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  ...
  json = JSON.parse(res.body)

  if json["Message360"]["ResponseStatus"] == 1
    new = json["Message360"]["Carrier"].values
    CSV.open("new.csv", "ab") do |csv|
      csv << new
    end
  else
    # handle bad response
  end
rescue StandardError => e
  # Request failed for some reason; log e and the number?
  # Note: rescue directly inside a do/end block requires Ruby 2.6+;
  # on older versions, wrap the block body in begin/end.
end

Check_MK - Custom check params specified in WATO not being passed to the check function

I am working on a Check_MK plugin and can't seem to get the WATO-specified params passed to the check function when it runs, for one check in particular.
The check param rule shows in WATO.
It writes correct-looking values to rules.mk.
Clicking the "Analyze check parameters" icon from a host's service discovery shows the rule as active.
The check parameters displayed in service discovery show the title from the WATO file, so it seems like it is associating things correctly.
Running cmk -D <hostname> shows the check as always having the default values, though.
I have been staring at it for a while and am out of ideas.
Check_MK version: 1.2.8p21 Raw
Bulk of check file:
factory_settings["elasticsearch_status_default"] = {
    "min": (600, 300)
}

def inventory_elasticsearch_status(info):
    for line in info:
        yield restore_whitespace(line[0]), {}

def check_elasticsearch_status(item, params, info):
    for line in info:
        name = restore_whitespace(line[0])
        message = restore_whitespace(line[2])
        if name == item:
            return get_status_state(params["min"], name, line[1], message, line[3])

check_info['elasticsearch_status'] = {
    "inventory_function": inventory_elasticsearch_status,
    "check_function": check_elasticsearch_status,
    "service_description": "ElasticSearch Status %s",
    "default_levels_variable": "elasticsearch_status_default",
    "group": "elasticsearch_status",
    "has_perfdata": False
}
WATO file:
group = "checkparams"
#subgroup_applications = _("Applications, Processes & Services")

register_check_parameters(
    subgroup_applications,
    "elasticsearch_status",
    _("Elastic Search Status"),
    Dictionary(
        elements = [
            ( "min",
              Tuple(
                  title = _("Minimum required status age"),
                  elements = [
                      Age(title = _("Warning if below"), default_value = 600),
                      Age(title = _("Critical if below"), default_value = 300),
                  ]
              )
            )
        ]
    ),
    None,
    match_type = "dict",
)
Entry in rules.mk from WATO rule:
checkgroup_parameters.setdefault('elasticsearch_status', [])
checkgroup_parameters['elasticsearch_status'] = [
( {'min': (3600, 1800)}, [], ALL_HOSTS ),
] + checkgroup_parameters['elasticsearch_status']
Let me know if any other information would be helpful!
I posted the question here as well, and the mystery got solved.
I was matching the WATO rule to item None (the 5th positional arg in the WATO file), but since this check had multiple items inventoried under it (none of which had the ID None), the rule was applying to the host but not to any of the specific service checks.
The fix was to replace that param with:
TextAscii( title = _("Status Description"), allow_empty = True),
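For clarity, a sketch of the corrected registration with the item spec swapped in; everything else is unchanged from the WATO file above:

register_check_parameters(
    subgroup_applications,
    "elasticsearch_status",
    _("Elastic Search Status"),
    Dictionary(
        elements = [
            ( "min",
              Tuple(
                  title = _("Minimum required status age"),
                  elements = [
                      Age(title = _("Warning if below"), default_value = 600),
                      Age(title = _("Critical if below"), default_value = 300),
                  ]
              )
            )
        ]
    ),
    # Item spec: lets the rule match the inventoried items by name
    # instead of only the single item None.
    TextAscii( title = _("Status Description"), allow_empty = True),
    match_type = "dict",
)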