I am trying to use bigrquery's bq_table_load() command to move a tab-delimited CSV file from Google Cloud Storage to BigQuery. It works, but it doesn't automatically recognize the column names. Doing the same thing interactively (i.e. in the BigQuery cloud console) works well. Comparing the job metadata for the two jobs (the R-induced job vs. the cloud console job), I note that the column delimiter is not set to TAB for the R job. This is despite me including this in my command call, e.g. as follows:
bq_table_load(<x>,<uri>, fieldDelimiter="Tab", source_format = "CSV", autodetect=TRUE)
I tried all sorts of variations of this, but nothing seems to work (i.e. the R job always ends up with the comma delimiter set). Here are some of the variations I tried:
bq_table_load(<x>,<uri>, field_delimiter="Tab", source_format = "CSV", autodetect=TRUE)
bq_table_load(<x>,<uri>, field_delimiter="\t", source_format = "CSV", autodetect=TRUE)
bq_table_load(<x>,<uri>, field_delimiter="tab", source_format = "CSV", autodetect=TRUE)
Any suggestions?
You can define the schema using a schema file; a sample is given below.
Sample bq load command, where $schema_dir/$TABLENAME.json represents the schema file:
bq --nosync load --source_format=CSV --skip_leading_rows=3 --allow_jagged_rows=TRUE --max_bad_records=10000 \
--allow_quoted_newlines=TRUE $projectid:$dataset.$TABLENAME \
$csv_data_path/$FILENAME $schema_dir/$TABLENAME.json
Sample schema file:
[
  {
    "mode": "NULLABLE",
    "name": "C1",
    "type": "STRING"
  }
]
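If you are driving the load from Python rather than the bq CLI, the same idea (an explicit schema plus an explicit tab delimiter) can be expressed with the google-cloud-bigquery client. A minimal sketch, assuming a placeholder bucket, table ID and a single STRING column:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",  # an actual tab character, not the word "Tab"
    skip_leading_rows=1,   # skip the header row
    schema=[bigquery.SchemaField("C1", "STRING", mode="NULLABLE")],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.tsv",      # placeholder URI
    "my-project.my_dataset.my_table",  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete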
I am experiencing a strange issue with my Python code. Its objective is the following:
Retrieve a .csv from S3
Convert that .csv into JSON (it's an array of objects)
Add a few key/value pairs to each object in the array, and change the original keys
Validate the JSON
Send the JSON to an /output S3 bucket
Load the JSON into DynamoDB
Here's what the .csv looks like:
Prefix,Provider
ABCDE,Provider A
QWERT,Provider B
ASDFG,Provider C
ZXCVB,Provider D
POIUY,Provider E
And here's my python script:
import json
import boto3
import ast
import csv
import os
import datetime as dt
from datetime import datetime
import jsonschema
from jsonschema import validate
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
providerCodesSchema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "providerCode": {"type": "string", "maxLength": 5},
            "providerName": {"type": "string"},
            "activeFrom": {"type": "string", "format": "date"},
            "activeTo": {"type": "string"},
            "apiActiveFrom": {"type": "string"},
            "apiActiveTo": {"type": "string"},
            "countThreshold": {"type": "string"}
        },
        "required": ["providerCode", "providerName"]
    }
}
datestamp = dt.datetime.now().strftime("%Y/%m/%d")
timestamp = dt.datetime.now().strftime("%s")
updateTime = dt.datetime.now().strftime("%Y/%m/%d/%H:%M:%S")
nowdatetime = dt.datetime.now()
yesterday = nowdatetime - dt.timedelta(days=1)
nintydaysfromnow = nowdatetime + dt.timedelta(days=90)
def lambda_handler(event, context):
    filename_json = "/tmp/file_{ts}.json".format(ts=timestamp)
    filename_csv = "/tmp/file_{ts}.csv".format(ts=timestamp)
    keyname_s3 = "newloader-ptv/output/{ds}/{ts}.json".format(ds=datestamp, ts=timestamp)
    json_data = []
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        key_name = record['s3']['object']['key']
        s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
        data = s3_object['Body'].read()
        contents = data.decode('latin')
        with open(filename_csv, 'a', encoding='utf-8') as csv_data:
            csv_data.write(contents)
        with open(filename_csv, encoding='utf-8-sig') as csv_data:
            csv_reader = csv.DictReader(csv_data)
            for csv_row in csv_reader:
                json_data.append(csv_row)
    for elem in json_data:
        elem['providerCode'] = elem.pop('Prefix')
        elem['providerName'] = elem.pop('Provider')
    for element in json_data:
        element['activeFrom'] = yesterday.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['activeTo'] = nintydaysfromnow.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['apiActiveFrom'] = " "
        element['apiActiveTo'] = " "
        element['countThreshold'] = "3"
        element['updateDate'] = updateTime
    try:
        validate(instance=json_data, schema=providerCodesSchema)
    except jsonschema.exceptions.ValidationError as err:
        print(err)
        err = "Given JSON data is InValid"
        return None
    with open(filename_json, 'w', encoding='utf-8-sig') as json_file:
        json_file.write(json.dumps(json_data, default=str))
    with open(filename_json, 'r', encoding='utf-8-sig') as json_file_contents:
        response = s3.put_object(Bucket=bucket_name, Key=keyname_s3, Body=json_file_contents.read())
    for jsonElement in json_data:
        table = dynamodb.Table('privateProviders-loader')
        table.put_item(Item=jsonElement)
    print("finished enriching JSON")
    os.remove(filename_csv)
    os.remove(filename_json)
    return None
I'm new to Python, so please forgive any amateur mistakes in the code.
Here's my issue:
When I deploy the code, and add a valid .csv into my S3 bucket, everything works.
When I then add an invalid .csv into my S3 bucket, it again works as expected: the import fails because the validation kicks in and tells me the problem.
However, when I add the valid .csv back into the S3 bucket, I get the same CloudWatch log as I did for the invalid .csv, and my Dynamo isn't updated, nor is an output JSON file sent to /output in S3.
With some troubleshooting I've noticed the following behaviour:
When I first deploy the code, the first .csv loads as expected (dynamo table updated + JSON file sent to S3 + cloudwatch logs documenting the process)
If I enter the same valid .csv into the S3 bucket, it gives me the same nice-looking CloudWatch logs, but none of the other actions take place (Dynamo not updated, etc.)
If I add the invalid .csv, that seems to break the cycle: I get a CloudWatch log showing the validation has kicked in, but if I then reload the valid .csv, which just previously produced good CloudWatch logs (though no actual outputs), I now get a repeat of the validation error log.
In short, the first time the function is invoked it seems to work; the second time it doesn't.
It seems as though the Python function is caching something or not closing out when finished. I've played about with the return command etc., but nothing I've tried works. I've sunk many hours into moving parts of the code around, thinking the structure or order of events is the problem, and the code above gives me the closest behaviour to what's expected, given that it seems to work completely the first and only time I load the .csv into S3.
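For context on the caching suspicion: AWS Lambda reuses the execution environment between invocations, so module-level code (including the timestamp-based /tmp file names above) runs only on a cold start and is then shared by every warm invocation. A minimal sketch of that interaction, with illustrative names:
import datetime as dt

# Module-level code runs once per container (cold start). On warm
# invocations Lambda reuses the same process, so this value -- and any
# /tmp file name derived from it -- is identical across invocations.
cold_start_ts = dt.datetime.now().strftime("%s")

def lambda_handler(event, context):
    # Recomputing inside the handler (and opening the temp file with "w"
    # rather than "a") gives each invocation a fresh file, so rows from a
    # previous upload cannot leak into the next run.
    per_invocation_ts = dt.datetime.now().strftime("%s")
    filename_csv = "/tmp/file_{ts}.csv".format(ts=per_invocation_ts)
    with open(filename_csv, "w", encoding="utf-8") as f:
        f.write("contents for this invocation only\n")
    return None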
Any help or general pointers would be massively appreciated.
Thanks
P.S. Here's an example of the CloudWatch log when validation kicks in and stops an invalid .csv from being processed. If I then add a valid .csv to S3, the function is triggered, but I get this same error, even though the file is actually good.
2021-06-29T22:12:27.709+01:00 'ABCDEE' is too long
2021-06-29T22:12:27.709+01:00 Failed validating 'maxLength' in schema['items']['properties']['providerCode']:
2021-06-29T22:12:27.709+01:00 {'maxLength': 5, 'type': 'string'}
2021-06-29T22:12:27.709+01:00 On instance[2]['providerCode']:
2021-06-29T22:12:27.709+01:00 'ABCDEE'
2021-06-29T22:12:27.710+01:00 END RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9
2021-06-29T22:12:27.710+01:00 REPORT RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9 Duration: 482.43 ms Billed Duration: 483
I have already created a table called Sensors and identified Sensor as the hash key. I am trying to add items to the table from my .json file. The items in my file look like this:
{
  "Sensor": "Room Sensor",
  "SensorDescription": "Turns on lights when person walks into room",
  "ImageFile": "rmSensor.jpeg",
  "SampleRate": "1000",
  "Locations": "Baltimore, MD"
}
{
  "Sensor": "Front Porch Sensor",
  "SensorDescription": " ",
  "ImageFile": "fpSensor.jpeg",
  "SampleRate": "2000",
  "Locations": "Los Angeles, CA"
}
There are 20 different sensors in the file. I was using the following command:
aws dynamodb batch-write-item \
--table-name Sensors \
--request-items file://sensorList.json \
--returned-consumed-capacity TOTAL
I get the following error:
Error parsing parameter '--request-items': Invalid JSON: Extra data: line 9 column 1 (char 189)
I've tried adding --table-name Sensors to the command line, and it says Unknown options: --table-name, Sensors. I've tried put-item and a few others. I am trying to understand what my errors are, what I need to change in my .json (if anything), and what I need to change in my command line. Thanks!
Your input file is not valid JSON: you are missing a comma to separate the two objects, and you need to enclose everything in brackets [ ... ].
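Note also that batch-write-item expects the --request-items file to follow the BatchWriteItem request shape (a map from table name to a list of PutRequest entries with DynamoDB-typed attribute values), not a bare array of plain objects. If you would rather keep the file as a simple JSON array, a boto3 sketch along these lines (file and table names taken from the question) can do the write instead:
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Sensors")

# Expects the corrected file: a single JSON array of sensor objects
with open("sensorList.json") as f:
    sensors = json.load(f)

# batch_writer buffers items and sends them in batches of up to 25
# (the BatchWriteItem limit), retrying unprocessed items automatically.
with table.batch_writer() as batch:
    for sensor in sensors:
        batch.put_item(Item=sensor)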
I want to generate a schema from a newline-delimited JSON file, where each row has a variable set of key/value pairs. File size can vary from 5 MB to 25 MB.
Sample Data:
{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}
Expected schema:
[
{"name": "col1", "type": "INTEGER"},
{"name": "col2", "type": "STRING"},
{"name": "col3", "type": "FLOAT"},
{"name": "col4", "type": "DATE"}
]
Notes:
There is no scope to use any external tool, as files are loaded into an inbound location dynamically. The code will be triggered by an event as soon as a file arrives and will then perform the schema comparison.
Your first problem is that JSON does not have a date type, so you will get str there.
What I would do, if I were you, is this:
import json

# Wherever your input comes from
inp = """{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}"""

schema = {}

# Split it at newlines
for line in inp.split('\n'):
    # each line contains a "dict"
    tmp = json.loads(line)
    for key in tmp:
        # if we have not seen the key before, add it
        if key not in schema:
            schema[key] = type(tmp[key])
        # otherwise check the type
        else:
            if schema[key] != type(tmp[key]):
                raise Exception("Schema mismatch")

# format however you like
out = []
for item in schema:
    out.append({"name": item, "type": schema[item].__name__})
print(json.dumps(out, indent=2))
I'm using python types for simplicity, but you can write your own function to get the type, e.g. if you want to check if a string is actually a date.
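If you want the BigQuery-style type names from the expected schema rather than Python type names, a small helper like the sketch below could replace schema[key] = type(tmp[key]); the DATE check is just a strptime attempt and assumes YYYY-MM-DD strings:
from datetime import datetime

def bq_type(value):
    # bool must be checked before int, since bool is a subclass of int
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "DATE"
        except ValueError:
            return "STRING"
    return "STRING"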
I am extracting data from Aurora to S3 using AWS DMS, and would like to use a csvDelimiter of my choice, which is ^A (i.e. control-A, octal representation \001), when loading data to S3. How do I do that? By default, when S3 is used as a target for DMS, it uses "," as the delimiter:
compressionType=NONE;csvDelimiter=,;csvRowDelimiter=\n;
But I want to use something like the below:
compressionType=NONE;csvDelimiter='\001';csvRowDelimiter=\n;
But it prints the delimiter as literal text in the output:
I'\001'12345'\001'Abc'
I am using the AWS DMS console to set the target endpoint.
I tried the delimiters below, but none of them worked:
\\001
\u0001
'\u0001'
\u01
\001
Actual Result:
I'\001'12345'\001'Abc'
Expected Result:
I^A12345^AAbc
Here is what I did to resolve this:
I used the AWS command line to set this delimiter on my target S3 endpoint.
https://docs.aws.amazon.com/translate/latest/dg/setup-awscli.html
AWS CLI command:
aws dms modify-endpoint --endpoint-arn arn:aws:dms:us-west-2:000001111222:endpoint:OXXXXXXXXXXXXXXXXXXXX4 --endpoint-identifier dms-ep-tgt-s3-abc --endpoint-type target --engine-name s3 --extra-connection-attributes "bucketFolder=data/folderx;bucketname=bkt-xyz;CsvRowDelimiter=^D;CompressionType=NONE;CsvDelimiter=^A;" --service-access-role-arn arn:aws:iam::000001111222:role/XYZ-Datalake-DMS-Role --s3-settings ServiceAccessRoleArn=arn:aws:iam::000001111222:role/XYZ-Datalake-DMS-Role,BucketName=bkt-xyz,CompressionType=NONE
Output:
{
  "Endpoint": {
    "Status": "active",
    "S3Settings": {
      "CompressionType": "NONE",
      "EnableStatistics": true,
      "BucketFolder": "data/folderx",
      "CsvRowDelimiter": "\u0004",
      "CsvDelimiter": "\u0001",
      "ServiceAccessRoleArn": "arn:aws:iam::000001111222:role/XYZ-Datalake-DMS-Role",
      "BucketName": "bkt-xyz"
    },
    "EndpointType": "TARGET",
    "ServiceAccessRoleArn": "arn:aws:iam::000001111222:role/XYZ-Datalake-DMS-Role",
    "SslMode": "none",
    "EndpointArn": "arn:aws:dms:us-west-2:000001111222:endpoint:OXXXXXXXXXXXXXXXXXXXX4",
    "ExtraConnectionAttributes": "bucketFolder=data/folderx;bucketname=bkt-xyz;CompressionType=NONE;CsvDelimiter=\u0001;CsvRowDelimiter=\u0004;",
    "EngineDisplayName": "Amazon S3",
    "EngineName": "s3",
    "EndpointIdentifier": "dms-ep-tgt-s3-abc"
  }
}
Note: after you run the AWS CLI command, the DMS console will not show you the delimiter in the endpoint (it is not visible because it is a special character). But once you run the task, it appears in the data in your S3 files.
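The same change can also be scripted with boto3 if you prefer; a sketch under the assumption that your endpoint uses the settings shown above (the ARNs, bucket and folder are placeholders to substitute):
import boto3

dms = boto3.client("dms")

response = dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-west-2:000001111222:endpoint:OXXXXXXXXXXXXXXXXXXXX4",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::000001111222:role/XYZ-Datalake-DMS-Role",
        "BucketName": "bkt-xyz",
        "BucketFolder": "data/folderx",
        "CompressionType": "NONE",
        "CsvDelimiter": "\u0001",    # ^A (control-A)
        "CsvRowDelimiter": "\u0004", # ^D
    },
)
print(response["Endpoint"]["S3Settings"])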
I want to test my method which imports a CSV file, but I don't know how to generate fake CSV files to test it.
I tried a lot of solutions I already found on Stack Overflow, but they aren't working in my case.
Here is the original CSV file:
firstname,lastname,home_phone_number,mobile_phone_number,email,address
orsay,dup,0154862548,0658965848,orsay.dup#gmail.com,2 rue du pré paris
richard,planc,0145878596,0625147895,richard.planc#gmail.com,45 avenue du general leclerc
person.rb
def self.import_data(file)
  filename = File.join Rails.root, file
  CSV.foreach(filename, headers: true, col_sep: ',') do |row|
    firstname, lastname, home_phone_number, mobile_phone_number, email, address = row
    person = Person.find_or_create_by(firstname: row["firstname"], lastname: row['lastname'], address: row['address'] )
    if person.is_former_email?(row['email']) != true
      person.update_attributes({firstname: row['firstname'], lastname: row['lastname'], home_phone_number: row['home_phone_number'], mobile_phone_number: row['mobile_phone_number'], address: row['address'], email: row['email']})
    end
  end
end
person_spec.rb:
require "rails_helper"
RSpec.describe Person, :type => :model do
describe "CSV file is valid" do
file = #fake file
it "should read in the csv" do
end
it "should have result" do
end
end
describe "import valid data" do
valid_data_file = #fake file
it "save new people" do
Person.delete_all
expect { Person.import_data(valid_data_file)}.to change{ Person.count }.by(2)
expect(Person.find_by(lastname: 'dup').email).to eq "orsay.dup#gmail.com"
end
it "update with new email" do
end
end
describe "import invalid data" do
invalid_data_file = #fake file
it "should not update with former email" do
end
it "should not import twice from CSV" do
end
end
end
I successfully used the Faked CSV Gem from https://github.com/jiananlu/faked_csv to achieve your purpose of generating a CSV File with fake data.
Follow these steps to use it:
Open your command line (i.e. on OSX open Spotlight with CMD+Space, and enter "Terminal")
Install Faked CSV Gem by running command gem install faked_csv. Note: If using a Ruby on Rails project add gem 'faked_csv' to your Gemfile, and then run bundle install
Validate that the Faked CSV Gem installed successfully by typing faked_csv --version in the terminal
Create a configuration file for the Faked CSV Gem, where you define how to generate fake data. For example, the one below will generate a CSV file with 200 rows (edit this to as many as you wish) and comma-separated columns for each field. If the value of a field's type is prefixed with faker: then refer to the "Usage" section of the Faker Gem https://github.com/stympy/faker for examples.
my_faked_config.csv.json
{
  "rows": 200,
  "fields": [
    {
      "name": "firstname",
      "type": "faker:name:first_name",
      "inject": ["luke", "dup", "planc"]
    },
    {
      "name": "lastname",
      "type": "faker:name:last_name",
      "inject": ["schoen", "orsay", "richard"]
    },
    {
      "name": "home_phone_number",
      "type": "rand:int",
      "range": [1000000000, 9999999999]
    },
    {
      "name": "mobile_phone_number",
      "type": "rand:int",
      "range": [1000000000, 9999999999]
    },
    {
      "name": "email",
      "type": "faker:internet:email"
    },
    {
      "name": "address",
      "type": "faker:address:street_address",
      "rotate": 200
    }
  ]
}
Run the following command to use the configuration file my_faked_config.csv.json to generate a CSV file named my_faked_data.csv in the current folder, containing the fake data: faked_csv -i my_faked_config.csv.json -o my_faked_data.csv
Since the generated file may not include the associated label for each column, simply insert the following line manually at the top of my_faked_data.csv: firstname,lastname,home_phone_number,mobile_phone_number,email,address
Review the final contents of the my_faked_data.csv CSV file containing the fake data, which should appear similar to the following:
my_faked_data.csv
firstname,lastname,home_phone_number,mobile_phone_number,email,address
Kyler,Eichmann,8120675609,7804878030,norene#bergnaum.io,56006 Fadel Mission
Hanna,Barton,9424088332,8720530995,anabel#moengoyette.name,874 Leannon Ways
Mortimer,Stokes,5645028548,9662617821,moses#kihnlegros.org,566 Wilderman Falls
Camden,Langworth,2622619338,1951547890,vincenza#gaylordkemmer.info,823 Esmeralda Pike
Nikolas,Hessel,5476149226,1051193757,jonathon#ziemannnitzsche.name,276 Reinger Parks
...
Modify your person_spec.rb unit test using the technique shown below, which passes mock data in to test the functionality of the import_data method in your person.rb file.
person_spec.rb
require 'rails_helper'
RSpec.describe Person, type: :model do
  describe 'Class' do
    subject { Person }
    it { should respond_to(:import_data) }
    let(:data) { "firstname,lastname,home_phone_number,mobile_phone_number,email,address\rKyler,Eichmann,8120675609,7804878030,norene#bergnaum.io,56006 Fadel Mission" }
    describe "#import_data" do
      it "save new people" do
        # Stub File.open so the CSV read uses the in-memory string instead of a real file
        File.stub(:open).with("filename", {:universal_newline=>false, :headers=>true}) {
          StringIO.new(data)
        }
        Person.import_data("filename")
        expect(Person.find_by(firstname: 'Kyler').mobile_phone_number).to eq "7804878030"
      end
    end
  end
end
Note: I used it myself to generate a large CSV file with meaningful fake data for my Ruby on Rails CSV app. My app allows a user to upload a CSV file containing specific column names and persist it to a PostgreSQL database and it then displays the data in a Paginated table view with the ability to Search and Sort using AJAX.
Use OpenOffice or Excel (a spreadsheet program) and save the file out as a .csv file in the save options.