I'm currently trying to create a JSON object to insert into a MongoDB database. However, when I create the JSON object, the script appears to reach the end of the creation and then never continues on to the insert statement that should place the object in the database. Here is an example of my code. For security reasons, I've changed some of the sensitive values to 99999, but I've verified that the authentication section of the code works.
'''
client = MongoClient('9999999', 999999) # LAN
db = client.999999
site = client.99999
db.authenticate('9999999', '999999') # user authentication
collection = site.9999999
test = {
    'Timestamp': timestamp(),
    'Device Type': device,
    'Instrument Serial': instrumentserial,
    'Calibration Date': cal,
    'Survey Date': dateofsurvey,
    'Survey Time': timeofsurvey,
    'Site Name': nameofsite,
    'Survey Location': survloc,
    'Frequency Survey Reference': survref,
    'Voltage': v,
    'Insulation': ins,
    'Tuner Select': tunerselect,
    'Start Frequency': frequency,
    'Reference Level': referencelevel,
    'RBW': rbw,
    'Instrument Mode': instrumentmode,
    'Detector Mode': detectormode,
    'Sweep Rate': sweeprate,
    'Trigger Mode': triggermode,
    'Trigger Level': triggerlevel,
}
collection.insert_one(test)
'''
All the variables have been parsed from an XML file. Any help would be greatly appreciated, as the code appears to just stop.
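For what it's worth, a minimal connect-and-insert sketch with explicit error reporting can help surface whatever is silently stopping the script; the host, credentials, and database/collection names below are placeholders, and db.authenticate() only exists on older PyMongo versions:
from pymongo import MongoClient
from pymongo.errors import PyMongoError

try:
    client = MongoClient("example-host", 27017, serverSelectionTimeoutMS=5000)
    db = client["example_db"]
    db.authenticate("example_user", "example_password")  # pre-PyMongo-4 only
    collection = db["example_collection"]

    result = collection.insert_one({"Timestamp": "2018-01-01T00:00:00"})
    print("Inserted document with _id:", result.inserted_id)
except PyMongoError as exc:
    print("MongoDB operation failed:", exc)
Printing the inserted _id (or the exception) confirms whether the insert statement is actually being reached.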
I have built a custom connector to connect to the Vimeo API via OAuth2. Everything is working well, but it appears I need to come up with a solution to deal with pagination, as I am only getting back 25 items per page.
I see the documentation on how to use Table.GenerateByPage and getNextPage here:
https://learn.microsoft.com/en-us/power-query/samples/trippin/5-paging/readme#tablegeneratebypage
As well as the implementation within the example GitHub custom connector
https://github.com/microsoft/DataConnectors/blob/master/samples/Github/github.pq
A sample of functions from that example:
Github.Contents = (url as text) =>
    let
        content = Web.Contents(url),
        link = GetNextLink(content),
        json = Json.Document(content),
        table = Table.FromList(json, Splitter.SplitByNothing())
    in
        table meta [Next=link];
Github.PagedTable = (url as text) => Table.GenerateByPage((previous) =>
    let
        // If we have a previous page, get its Next link from metadata on the page.
        next = if (previous <> null) then Value.Metadata(previous)[Next] else null,
        // If we have a next link, use it, otherwise use the original URL that was passed in.
        urlToUse = if (next <> null) then next else url,
        // If we have a previous page, but don't have a next link, then we're done paging.
        // Otherwise retrieve the next page.
        current = if (previous <> null and next = null) then null else Github.Contents(urlToUse),
        // If we got data back from the current page, get the link for the next page
        link = if (current <> null) then Value.Metadata(current)[Next] else null
    in
        current meta [Next=link]);
GetNextLink = (response, optional request) =>
    let
        // extract the "Link" header if it exists
        link = Value.Metadata(response)[Headers][#"Link"]?,
        links = Text.Split(link, ","),
        splitLinks = List.Transform(links, each Text.Split(Text.Trim(_), ";")),
        next = List.Select(splitLinks, each Text.Trim(_{1}) = "rel=""next"""),
        first = List.First(next),
        removedBrackets = Text.Range(first{0}, 1, Text.Length(first{0}) - 2)
    in
        try removedBrackets otherwise null;
However, my issue is that the pagination metadata returned by the Vimeo API comes through in the JSON response body rather than in the headers, as is assumed in the documentation and examples. Is there an easy way or helper function within Power Query/M that would allow me to look into the body of the JSON response, grab the pagination JSON objects (as below), and build out my code from there?
Here is what comes back regarding pagination from Vimeo's API within the JSON body:
"total": 1012,
"page": 1,
"per_page": 25,
"paging": {
"next": "/users/{our-user-id}/videos?page=2",
"previous": null,
"first": "/users/{our-user-id}/videos?page=1",
"last": "/users/{our-user-id}/videos?page=41"
},
Many thanks for any help - it is very much appreciated!
Best,
-Josh
It's hard to put something together from just that info, but see if this helps:
GetNextLink = (response) =>
    // response is data already run through Web.Contents()
    // looks for a row that contains "first":
    // x would evaluate to be "first": "/users/{our-user-id}/videos?page=1",
    // y would parse x to get /users/{our-user-id}/videos?page=1
    let
        Source = Lines.FromBinary(response),
        x = List.RemoveNulls(List.Transform(List.Positions(Source), each if Text.Contains(Source{_}, """first"":") then Source{_} else null)){0},
        y = Text.BetweenDelimiters(x, ": """, """")
    in
        y
I upload 10 files every day at 11 p.m. with a cron job to a bucket on GCS. Each file is a .csv between 2 and 30 KB in size. The file name is always YYYY-MM-DD-ID.csv.
A Cloud Function is called every time I upload a file into that bucket, to send those .csv files to BigQuery. The trigger type is Cloud Storage on finalise/create events.
My issue is the following:
On BigQuery, each value in each row/column is multiplied by some factor. Sometimes it's 1 (so the value is the same), often 2, and sometimes 3. I attached one example below showing the difference between BigQuery (BQ) and Google Cloud Storage (GCS).
It seems that the Cloud Function is called multiple times. It's not the code itself but rather duplicate deliveries of the trigger message to the Cloud Function. When I go to the logs tab for today, I can see that the Cloud Function upload_to_bigquery has been called multiple times.
I have tried to fix it, but I made a mistake. I thought I could keep a temporary file on the Cloud Function instance, but I cannot. My workaround was to write the filename I am uploading to BigQuery into a .txt file, and before uploading a new file to BigQuery, read that .txt file and check whether the current file already exists in that list. If the filename is already present, skip it; otherwise, add the filename to the list and do my stuff.
if file_to_upload not in text:
    text.append(file_to_upload)
    with open("all_uploaded_files.txt", "w") as text_file:
        for item in text:
            text_file.write(item + "\n")
    bucket = storage_client.bucket('sfr-test-data')
    blob = bucket.blob("all_uploaded_files.txt")
    blob.upload_from_filename("all_uploaded_files.txt")
    ## do my things here
else:
    print("file already uploaded")
    # skip to new file to upload
But even if I could do that, this solution is not viable. The temporary file would become huge after months or years, and it would be a mess. Do you know the easiest way to fix this issue?
Cloud Function: upload_to_big_query - main.py
import os
from datetime import datetime

import pandas as pd
from google.cloud import bigquery, storage

BUCKET = "xxx"
GOOGLE_PROJECT = "xxx"

HEADER_MAPPING = {
    "Source/Medium": "source_medium",
    "Campaign": "campaign",
    "Last Non-Direct Click Conversions": "last_non_direct_click_conversions",
    "Last Non-Direct Click Conversion Value": "last_non_direct_click_conversion_value",
    "Last Click Prio Conversions": "last_click_prio_conversions",
    "Last Click Prio Conversion Value": "last_click_prio_conversion_value",
    "Data-Driven Conversions": "dda_conversions",
    "Data-Driven Conversion Value": "dda_conversion_value",
    "% Change in Conversions from Last Non-Direct Click to Last Click Prio": "last_click_prio_vs_last_click",
    "% Change in Conversions from Last Non-Direct Click to Data-Driven": "dda_vs_last_click"
}

SPEND_HEADER_MAPPING = {
    "Source/Medium": "source_medium",
    "Campaign": "campaign",
    "Spend": "spend"
}
tables_schema = {
    "google-analytics": [
        bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
        bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
        bigquery.SchemaField("goal", bigquery.enums.SqlTypeNames.STRING, mode='REQUIRED'),
        bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("last_non_direct_click_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
        bigquery.SchemaField("last_non_direct_click_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("last_click_prio_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
        bigquery.SchemaField("last_click_prio_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("dda_conversions", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("dda_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("last_click_prio_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("dda_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE')
    ],
    "google-analytics-spend": [
        bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
        bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
        bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("spend", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
    ]
}
def download_from_gcs(file):
    client = storage.Client()
    bucket = client.get_bucket(BUCKET)
    blob = bucket.get_blob(file['name'])
    file_name = os.path.basename(os.path.normpath(file['name']))
    blob.download_to_filename(f"/tmp/{file_name}")
    return file_name

def load_in_bigquery(file_object, dataset: str, table: str):
    client = bigquery.Client()
    table_id = f"{GOOGLE_PROJECT}.{dataset}.{table}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        schema=tables_schema[table]
    )
    job = client.load_table_from_file(file_object, table_id, job_config=job_config)
    job.result()  # Wait for the job to complete.
def __order_columns(df: pd.DataFrame, spend=False) -> pd.DataFrame:
    # We want the source and medium columns at the third position for a spend
    # data frame and at the fourth position for other data frames, because the
    # spend data frame doesn't have a goal column.
    pos = 2 if spend else 3
    cols = df.columns.tolist()
    cols[pos:2] = cols[-2:]
    cols = cols[:-2]
    return df[cols]

def __common_transformation(df: pd.DataFrame, date: str, goal: str) -> pd.DataFrame:
    # For any kind of data frame, we add date and week columns based on the
    # file name, and we split the Source/Medium column from the csv into two
    # separate columns.
    week_of_the_year = datetime.strptime(date, '%Y-%m-%d').isocalendar()[1]
    df.insert(0, 'date', date)
    df.insert(1, 'week', week_of_the_year)
    mapping = SPEND_HEADER_MAPPING if goal == "spend" else HEADER_MAPPING
    print(df.columns.tolist())
    df = df.rename(columns=mapping)
    print(df.columns.tolist())
    print(df)
    df["source_medium"] = df["source_medium"].str.replace(' ', '')
    df[["source", "medium"]] = df["source_medium"].str.split('/', expand=True)
    df = df.drop(["source_medium"], axis=1)
    df["week"] = df["week"].astype(int, copy=False)
    return df
def __transform_spend(df: pd.DataFrame) -> pd.DataFrame:
    df["spend"] = df["spend"].astype(float, copy=False)
    df = __order_columns(df, spend=True)
    return df[df.columns[:6]]

def __transform_attribution(df: pd.DataFrame, goal: str) -> pd.DataFrame:
    df.insert(2, 'goal', goal)
    df["last_non_direct_click_conversions"] = df["last_non_direct_click_conversions"].astype(int, copy=False)
    df["last_click_prio_conversions"] = df["last_click_prio_conversions"].astype(int, copy=False)
    df["dda_conversions"] = df["dda_conversions"].astype(float, copy=False)
    return __order_columns(df)

def transform(df: pd.DataFrame, file_name) -> pd.DataFrame:
    goal, date, *_ = file_name.split('_')
    df = __common_transformation(df, date, goal)
    # We only add goal in the attribution df (google-analytics table).
    return __transform_spend(df) if "spend" in file_name else __transform_attribution(df, goal)
def main(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    file_name = download_from_gcs(file)
    df = pd.read_csv(f"/tmp/{file_name}")
    transformed_df = transform(df, file_name)
    with open(f"/tmp/bq_{file_name}", "w") as file_object:
        file_object.write(transformed_df.to_csv(index=False))
    with open(f"/tmp/bq_{file_name}", "rb") as file_object:
        table = "google-analytics-spend" if "spend" in file_name else "google-analytics"
        load_in_bigquery(file_object, dataset='attribution', table=table)
You might prefer to check this thread:
BigQuery displaying wrong results - Duplicating data from Cloud Function?
In short: the function should be idempotent, and the state of the process (whether the data/file was loaded into BQ or not) should be kept outside of the cloud function. A text file (in some GCS bucket, not inside the cloud function's memory, which can be erased as soon as the execution finishes) is an option, but GCS has plenty of drawbacks in this particular case. Firestore, for example, is a much, much better choice.
You might consider the following algorithm:
When your cloud function starts, it should calculate some hash code based on the input data - file/object metadata, file/object data, or a combination of both. That hash should be unique for the given set of data.
The cloud function connects to a predefined Firestore collection (the project and the name can be provided in environment variables) and checks whether a document/record with the given hash as its id already exists.
If that hash already exists (the document exists) in the Firestore collection, the cloud function simply finishes its execution and does not do anything else (it can log, add some additional details to the Firestore document if required, etc.).
If that hash is not found (the document does not exist), the cloud function creates a new document with the given hash as its id. Some metadata details can be added to that document if needed.
Once the document is created, the cloud function continues its main workflow.
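A minimal sketch of that check in Python, assuming the google-cloud-firestore client, a placeholder collection name, and the standard Cloud Storage trigger payload keys (it also seeds the __state field discussed below):
import hashlib

from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

def already_processed(event: dict) -> bool:
    """Return True if this object was already handled; otherwise record it as in progress."""
    # Hash the bucket, object name and generation so retried deliveries of the
    # same upload always map to the same document id.
    key = f"{event['bucket']}/{event['name']}/{event.get('generation', '')}"
    doc_id = hashlib.sha256(key.encode()).hexdigest()

    doc = firestore.Client().collection("processed_files").document(doc_id)
    try:
        # create() fails if the document already exists, which makes the check atomic.
        doc.create({"name": event["name"], "__state": "IN_PROGRESS"})
        return False
    except AlreadyExists:
        return True
The first thing main() would then do is something like: if already_processed(event): return.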
A few things to bear in mind.
1/ IAM permissions - the service account under which the cloud function runs should have the relevant permissions on Firestore. Obviously, the Firestore API has to be enabled in the given project.
2/ What happens if the cloud function creates a new Firestore document but then fails to load the data into BigQuery (for any reason)? A check on the document's existence alone may not be enough, so a proper 'state' should be maintained in the Firestore document. For example, when a new document is created, a __state field is set to a value such as IN_PROGRESS; once the data is loaded, the cloud function comes back to Firestore and updates that field to DONE (see the sketch after this list). Even that is not enough: because you run a load job, there may be cases where the load actually succeeds but the cloud function still fails (for any reason, including a timeout), and you might want to think about how to handle that case as well. In any case, having some state monitoring in Firestore helps to understand and investigate the loading process. Automating that monitoring might need a separate cloud function, but that is a separate story.
3/ As I mentioned in the thread linked above (see the reasoning in that answer), loading data from inside the cloud function's memory is a risky idea. I would suggest thinking about that part of your algorithm again.
4/ It might be a good idea to move the loaded file/object from the "input" bucket to some "processed" (or "archive") bucket on success, and to a "failure" bucket if the load failed. That should be done in the cloud function code. A failed outcome can also be reflected in the Firestore document (i.e. setting the __state field to FAILURE), as shown in the sketch after this list.
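A rough sketch of that completion step, assuming the google-cloud-storage and google-cloud-firestore clients; the collection and bucket names are placeholders:
from google.cloud import firestore, storage

def finalize(doc_id: str, bucket_name: str, blob_name: str, success: bool) -> None:
    """Record the final state in Firestore and move the object out of the input bucket."""
    # Update the tracking document created at the start of the function.
    firestore.Client().collection("processed_files").document(doc_id).update(
        {"__state": "DONE" if success else "FAILURE"}
    )

    # Copy the object to an archive or failure bucket, then delete the original.
    client = storage.Client()
    source_bucket = client.bucket(bucket_name)
    blob = source_bucket.blob(blob_name)
    target = client.bucket("my-archive-bucket" if success else "my-failure-bucket")
    source_bucket.copy_blob(blob, target, blob_name)
    blob.delete()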
I'm trying to update a file if it exists in a particular folder and has a specific name. In this instance the object in question is in a Team Drive. I followed the documentation to compose the q parameter for the list call, and I also tried switching back to v2. As far as I can tell, the query is composed exactly correctly, yet even though I see multiple objects present in the target folder, the list call fails to see them. I've tried name = '' and name contains ''. There seems to be enough input validation put in place by the Google team, as the API bombs when I get creative. Any pointers?
def import_or_replace_csv_to_td_folder(self, folder_id, local_fn, remote_fn, mime_type):
    DRIVE = build('drive', 'v3', http=creds.authorize(Http()))
    query = "'{0}' in parents and name = '{1}'".format(folder_id, remote_fn)
    print("Searching for previous versions of this file : {0}".format(query))
    check_if_already_exists = DRIVE.files().list(q=query, fields="files(id, name)").execute()
    name_and_location_conflict = check_if_already_exists.get('files', [])
    if not name_and_location_conflict:
        body = {'name': remote_fn, 'mimeType': mime_type, 'parents': [folder_id]}
        out = DRIVE.files().create(body=body, media_body=local_fn, supportsTeamDrives=True, fields='id').execute().get('id')
        return out
    else:
        if len(name_and_location_conflict) == 1:
            file_id = name_and_location_conflict[0]['id']
            DRIVE.files().update(fileId=file_id, supportsTeamDrives=True, media_body=local_fn).execute()
            return file_id
        else:
            raise MultipleConflictsError("There are multiple documents matching parent folder and file name. Unclear which requires a version update")
When I tried to replace the 'name' parameter with 'title' (which used to work in v2, based on some answers I reviewed), the API barfed:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://www.googleapis.com/drive/v3/files?q=%27xxxxxxxxxxxxxxxx%27+in+parents+and+title+%3D+%27Somefile_2018-09-27.csv%27&fields=files%28id%2C+name%29&alt=json returned "Invalid Value">
Thanks @tehhowch,
Indeed, extra measures are necessary when the target is in a Team Drive: the includeTeamDriveItems option needs to be set, otherwise Team Drive locations are not included by default:
check_if_already_exists = DRIVE.files().list(
    q=query,
    fields="files(id, name)",
    supportsTeamDrives=True,
    includeTeamDriveItems=True
).execute()
I am trying to get Elasticsearch CloudWatch metrics using boto, but whatever I do, I do not get any values back. Below is a snippet of my code; the same code works if I use it for RDS metrics, for example.
import datetime
import boto.ec2.cloudwatch
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=5)
metric="CPUUtilization"
region = boto.regioninfo.RegionInfo(
    name='ap-southeast-1',
    endpoint='monitoring.ap-southeast-1.amazonaws.com')
conn = boto.ec2.cloudwatch.CloudWatchConnection(region=region)
data = conn.get_metric_statistics(60, start, end, metric, "AWS/ES", "Average", {"DomainName": "My-es-name"})
print data
[]
However, if I change the namespace to RDS (with the proper dimension value), it works fine. The code is about as simple as it can get, and I am not sure what is wrong here. Can anyone help me figure this out?
What am I doing wrong here?
Thanks
I found the solution.
To pull Elasticsearch metrics for a specific domain name, you need to also indicate your ClientId in the dimensions.
My examples below are in boto3, but to run it with your code (boto2), I believe you only need to amend the dimensions as follows, assuming your syntax was otherwise right:
data = conn.get_metric_statistics(60, start, end, metric, "AWS/ES", "Average", {"ClientId":"My-client-id", "DomainName": "My-es-name"})
Try the code below (boto3). It worked for me.
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.resource('cloudwatch', region_name='ap-southeast-1')
cpu = cloudwatch.Metric('AWS/ES', 'CPUUtilization')
cpu_usage = cpu.get_statistics(
    Dimensions=[
        {'Name': 'ClientId', 'Value': 'YOUR-CLIENT-ID'},
        {'Name': 'DomainName', 'Value': 'YOUR-DOMAIN-NAME'}
    ],
    StartTime=(datetime.utcnow() - timedelta(minutes=5)).isoformat(),
    EndTime=datetime.utcnow().isoformat(),
    Period=60,
    Statistics=['Average']
)
If you prefer to use a client, use the following instead:
client = boto3.client('cloudwatch', region_name='ap-southeast-1')
response = client.get_metric_statistics(
    Namespace='AWS/ES',
    MetricName='CPUUtilization',
    Dimensions=[
        {'Name': 'ClientId', 'Value': 'YOUR-CLIENT-ID'},
        {'Name': 'DomainName', 'Value': 'YOUR-DOMAIN-NAME'}
    ],
    StartTime=(datetime.utcnow() - timedelta(minutes=5)).isoformat(),
    EndTime=datetime.utcnow().isoformat(),
    Period=60,
    Statistics=['Average']
)
I am working on a project and I am required to store information entered into a form in a database column as JSON. The form does not have a model of its own, but all its values will be stored as JSON in a column of another model. Here is the model:
class Document(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    document = models.JSONField(default=dict)  # a callable default avoids sharing one dict between instances
    category = models.CharField(max_length=255)
Now I am required to store JSON data from different forms (different categories) in the document column. Here is one such form:
class InformalLetterForm(forms.Form):
    sender_name = forms.CharField(max_length=45)
    sender_address = forms.CharField(max_length=255)
    date = forms.DateTimeField()
    message_body = forms.CharField()
    receiver_name = forms.CharField(max_length=255)
How do I serialize data entered in such a form into a JSON object to be stored in a database column (i.e. the document column above)?
I have searched online, but I have only seen serialization done for data from model forms.
Thanks for any help.
You can use the form's .cleaned_data attribute; it returns a dictionary with the form's data as Python objects. You can then serialize that dictionary with json.dumps() from Python's json library. Let's take an example from the docs:
>>> data = {'subject': 'hello',
... 'message': 'Hi there',
... 'sender': 'foo@example.com',
... 'cc_myself': True}
>>> f = ContactForm(data)
>>> f.is_valid()
True
>>> f.cleaned_data
{'cc_myself': True, 'message': 'Hi there', 'sender': 'foo@example.com', 'subject': 'hello'}
Here you have a dictionary with your data, now let's make it a json:
import json
# An example simple dict
d = {'a': 1, 'b': 2}
json.dumps(d)
# '{"a": 1, "b": 2}'
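To tie that back to the Document model in the question, here is a minimal sketch; the import paths and the category label are assumptions, and DjangoJSONEncoder takes care of values such as the DateTimeField before they go into the JSON column:
import json

from django.core.serializers.json import DjangoJSONEncoder

from .forms import InformalLetterForm  # adjust to your app layout
from .models import Document          # the model shown above

def save_informal_letter(user, data):
    """Validate the posted data and store it in Document.document as JSON."""
    form = InformalLetterForm(data)
    if not form.is_valid():
        return None
    # cleaned_data holds Python objects (e.g. a datetime for `date`), so round-trip
    # it through DjangoJSONEncoder to get JSON-safe values before storing it.
    payload = json.loads(json.dumps(form.cleaned_data, cls=DjangoJSONEncoder))
    return Document.objects.create(
        user=user,
        document=payload,
        category="informal_letter",  # placeholder label
    )
In a view this would be called with something like save_informal_letter(request.user, request.POST).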