How to use the '$' special character in Athena SQL DML queries?

I have data coming into S3 from Mixpanel, and Mixpanel adds a '$' character before some event properties. Sample:
"event": "$ae_session",
"properties": {
"time": 1646816604,
"distinct_id": "622367f395dd06c26f311c46",
"$ae_session_length": 17.2,
"$app_build_number": "172",
"$app_release": "172",...}
As the '$' special character is not supported in Athena, I need some way of escaping it to proceed from here. I would really appreciate any help regarding this.
The error I am getting in subsequent DML queries after creating my DDL table:
HIVE_METASTORE_ERROR: Error: name expected at the position 262 of
'struct<distinct_id:string,
sheetid:string,
addedUserId:string,
memberId:string,
communityId:string,
businessId:string,
time:timestamp,
communityBusinessType:string,
initialBusinessType:string,
sheetRowIndex:string,
dataType:varchar(50),
screenType:varchar(50),
rowIndex:int,
$ae_session_length:int>' but '$' is found.
(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
Since I cannot change the column names, as they are populated directly from Mixpanel at a daily interval, I really think there should be a workaround for this somehow!
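(Not an authoritative fix, just a sketch of one possible workaround: declare the Mixpanel properties blob as a plain string column in the DDL and pull the '$'-prefixed keys out at query time with Presto's bracket-quoted JSON path syntax. The table and column names below are placeholders, not your actual schema.)
SELECT
  json_extract_scalar(properties, '$.distinct_id')           AS distinct_id,
  json_extract_scalar(properties, '$["$ae_session_length"]') AS ae_session_length,
  json_extract_scalar(properties, '$["$app_build_number"]')  AS app_build_number
FROM mixpanel_events;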

Related

MongoDB Error while Updating - error on remote shard - caused by cursor id

I have about 8 million documents in my collection, and I want to remove the special characters in one of the fields.
I will post my statement below.
I am using the mongo shell in the MongoDB Compass tool.
The update runs for about 30-50 minutes and then throws the following error:
MongoServerError: Error on remote shard thisisjustforstack.com:27000 :: caused by :: cursor id 1272890412590646833 not found
I also see that after throwing this error, it did not update all documents.
db.getCollection('TEST_Collection').aggregate([
    { $match: { '1List.Comment': { $exists: true } } },
    { $project: { '1List.Comment': 1 } }
]).forEach(function (doc) {
    // Field names starting with a digit need bracket notation in JavaScript.
    doc['1List'].Comment = doc['1List'].Comment.replace(/[^a-zA-Z 0-9 ]/g, '');
    db.TEST_Collection.updateMany(
        { "_id": doc._id },
        { "$set": { "1List.Comment": doc['1List'].Comment } }
    );
});
Can somebody please help me get this update statement working without running into some sort of timeout? I have read something about noCursorTimeout(), but I am not sure how to use it with my statement in the shell.
Thank you all!
Cursor timeout can't be disabled on individual aggregation cursors.
But you can set it in the global config:
mongod --setParameter cursorTimeoutMillis=3600000 #1 hour
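If restarting mongod with that flag is not convenient, the same parameter should also be settable at runtime from the shell (same one-hour value as above):
db.adminCommand({ setParameter: 1, cursorTimeoutMillis: 3600000 })  // 1 hour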
Anyway, I think dividing the task into small batches is a better option.
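A rough, untested sketch of that batching idea (collection and field names are taken from the question), using bulkWrite so each write round trip stays small:
const BATCH_SIZE = 1000;
let batch = [];
db.TEST_Collection.find(
    { '1List.Comment': { $exists: true } },
    { '1List.Comment': 1 }
).forEach(function (doc) {
    const cleaned = doc['1List'].Comment.replace(/[^a-zA-Z 0-9 ]/g, '');
    batch.push({
        updateOne: {
            filter: { _id: doc._id },
            update: { $set: { '1List.Comment': cleaned } }
        }
    });
    if (batch.length >= BATCH_SIZE) {
        db.TEST_Collection.bulkWrite(batch); // flush a batch of updates
        batch = [];
    }
});
if (batch.length > 0) {
    db.TEST_Collection.bulkWrite(batch); // flush the remainder
}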

AWS Athena and handling JSON

I have millions of files with the following (poor) JSON format:
{
"3000105002":[
{
"pool_id": "97808",
"pool_name": "WILDCAT (DO NOT USE)",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
},
{
"pool_id": "96838",
"pool_name": "DRY & ABANDONED",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
}]
}
I've tried to generate an Athena DDL that would accommodate this type of structure (especially the api field) with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The api key is not actually within any of the files, so I hope there is a way that Athena can accommodate just the string-type value for these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use a Create Table As Select (CTAS) query that converts the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
external_location = 's3://path/to/ctas_partitioned/',
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT
regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
regexp_extract(line, ' "pool_name": "([^"])",', 1) as pool_name,
...
FROM json_lines_table;
You will also improve the performance of queries against the new table, as it uses the Parquet format.
Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
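For completeness, the json_lines_table referenced in the CTAS query is not defined in the answer; a minimal sketch of one possible definition (the table name and LOCATION are placeholders) is a table with a single string column over the raw files, so each physical line becomes one row for regexp_extract to scan:
CREATE EXTERNAL TABLE json_lines_table (
  line string
)
LOCATION 's3://foo/';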

Importing JSON dates into SAS gives incorrect $ format where datetime data expected

I am trying to import data containing some date columns/fields into SAS. The data are in JSON format and hence need to be converted before import; for this I use the SAS JSON libname engine.
But when I convert/import the data, SAS does not interpret the dates as proper dates and does not let me manipulate the data with date constraints and so on. Instead, SAS imports the dates with Format = $ (i.e. as character), although the values do show up in the imported data. SAS imports the data without errors, but every date field other than 'date_fi' is not properly formatted as a date.
I am using the following script:
filename resp "C:\Temp\transaktioner_2017-07.json" lrecl=1000000000;
filename jmap "C:\Temp\transaktioner.map";
filename head "c:\temp\header.txt";

options metaserver="DOMAIN" metaport=8561
        metarepository="Foundation" metauser="USER"
        metapass='CENSORED';

libname CLIENT sasiola tag=SOMETAG port=10011
        host="DOMAIN"
        signer="https://CENSORED";

proc http HEADEROUT=head
     url='http://VALID_PATH/acubiz_sas/_design/view/_view/bymonth?key="2017-07"'
     method="GET" CT="application/json" out=resp;
run;

libname space JSON fileref=resp map=jmap; *automap=create;
LIBNAME SASDATA BASE "D:\SASData"; * outencoding='UTF-8';

Data SASDATA.Transaktioner;
  Set space.Rows_value;
run;

data _null_;
  if exist("Acubiz.EMS_TRANSAKTIONER", "DATA") then
    rc=dosubl("proc sql noprint; drop table Acubiz.EMS_TRANSAKTIONER; quit;");
run;

data Acubiz.EMS_TRANSAKTIONER;
  set sasdata.transaktioner;
run;

proc metalib;
  omr (library="/Shared Data/SAS Visual Analytics/Autoload/AcubizEMSAutoload/Acubiz_EMS"
       repname="Foundation");
  folder="/Shared Data/SAS Visual Analytics/Autoload/AcubizEMSAutoload";
  select ("EMS_TRANSAKTIONER");
run;
quit;

libname CLIENT clear;
libname space clear;
For this conversion, I use the following JSON map file, called 'transaktioner.map'.
The field date_fi imports in the proper date format, which I can manipulate as a date in SAS Visual Analytics, but confirmeddate_fi does not.
The most important parts of the map file are shown here:
{
"NAME": "date_fi",
"TYPE": "NUMERIC",
"INFORMAT": [ "e8601dt19", 19, 0 ],
"FORMAT": ["DATETIME", 20],
"PATH": "/root/rows/value/date_fi",
"CURRENT_LENGTH": 20
},
{
"NAME": "confirmeddate_fi",
"TYPE": "NUMERIC",
"INFORMAT": [ "e8601dt19", 19, 0 ],
"FORMAT": ["DATETIME", 20],
"PATH": "/root/rows/value/confirmeddate_fi",
"CURRENT_LENGTH": 20
},
Do any of you know how I might import the data and have the date fields interpreted as such?
I have been messing with different informats in the JSON map file to solve this riddle and have managed to get to where I can import the data without errors, but SAS does not interpret the date fields as dates.
The actual fields are explained here with some examples (taken from the imported data):
Reference that works
date_fi: "2017-07-14T00:00:00" (Apparantly never timestamped but use T00:00:00 - checked 9 instances)
Should work
invoicedate_fi: "2017-08-01T00:00:00" (Apparantly never timestamped but use T00:00:00 - checked 9 instances)
invoicedate_fi: "2017-07-19T00:00:00"
invoicedate_fi: "2017-07-17T00:00:00"
arrivaldate_fi: "2017-08-13T00:00:00" (Apparently never timestamped but uses T00:00:00 - checked 9 instances)
departuredate_fi: "2017-08-09T00:00:00" (Apparently never timestamped but uses T00:00:00 - checked 9 instances)
Do not work as numeric - even though they are specified as dates in map-file (for use with SAS JSON Libname)
markedreadydate_fi: "2017-08-02T11:41:56" (This field is often but not always timestamped)
markedreadydate_fi: "2017-07-31T15:08:03"
markedreadydate_fi: "2017-07-19T00:00:00"
confirmeddate_fi: "2017-07-21T00:00:00" (This field is often but not always timestamped)
confirmeddate_fi: "2017-08-06T20:11:26"
confirmeddate_fi: "2017-07-14T18:38:41"
confirmeddatefinance_fi: "2017-07-31T15:54:10" (This field is often but not always timestamped)
confirmeddatefinance_fi: "2017-08-17T10:33:32"
confirmeddatefinance_fi: "2017-07-26T08:21:34"
markedreadydate_fi: "2017-07-19T00:00:00" (This field is often but not always timestamped)
Does anyone have pertinent info on this issue? I am at my wit's end and have exhausted SAS Tech Support about this date issue.
PS: As a proof of concept, we are importing approx. 110,000 rows, and the import finishes without any errors.
A good PDF explaining the different ISO formats in SAS can be found here.
Apparently the solution is to import the date columns as CHARACTER instead of numeric, and then do the conversion to datetime format in the SAS code like so:
Data SASDATA.Transaktioner(drop=
arrivaldate_fi_temp
departuredate_fi_temp
confirmeddate_fi_temp
confirmeddatefinance_fi_temp
datetoshow_fi_temp
date_fi_temp
invoicedate_fi_temp
markedreadydate_fi_temp
);
Set space.Rows_value(rename=(
confirmeddate_fi=confirmeddate_fi_temp
datetoshow_fi=datetoshow_fi_temp
date_fi=date_fi_temp
invoicedate_fi=invoicedate_fi_temp
markedreadydate_fi=markedreadydate_fi_temp
arrivaldate_fi=arrivaldate_fi_temp
departuredate_fi=departuredate_fi_temp
confirmeddatefinance_fi=confirmeddatefinance_fi_temp
));
*length invoicedate_fi 8.;
format
confirmeddate_fi
datetoshow_fi
date_fi
invoicedate_fi
markedreadydate_fi
arrivaldate_fi
departuredate_fi
confirmeddatefinance_fi
datetime20.;
if confirmeddate_fi_temp ne '' then confirmeddate_fi=input(confirmeddate_fi_temp,E8601DT19.); else confirmeddate_fi=.;
if datetoshow_fi_temp ne '' then datetoshow_fi=input(datetoshow_fi_temp,E8601DT19.); else datetoshow_fi=.;
if date_fi_temp ne '' then date_fi=input(date_fi_temp,E8601DT19.); else date_fi=.;
if invoicedate_fi_temp ne '' then invoicedate_fi=input(invoicedate_fi_temp,E8601DT19.); else invoicedate_fi=.;
if markedreadydate_fi_temp ne '' then markedreadydate_fi=input(markedreadydate_fi_temp,E8601DT19.); else markedreadydate_fi=.;
if arrivaldate_fi_temp ne '' then arrivaldate_fi=input(arrivaldate_fi_temp,E8601DT19.); else arrivaldate_fi=.;
if departuredate_fi_temp ne '' then departuredate_fi=input(departuredate_fi_temp,E8601DT19.); else departuredate_fi=.;
if confirmeddatefinance_fi_temp ne '' then confirmeddatefinance_fi=input(confirmeddatefinance_fi_temp,E8601DT19.); else confirmeddatefinance_fi=.;
run;
I will then remove all NUMERIC type specifics for the date fields in the map file. This way the JSON libname engine does NOT take care of interpreting the date formats; the SAS code does.
I.e. the map file specification must be changed back to something like this for all date fields:
{
"NAME": "date_fi",
"TYPE": "CHARACTER",
"PATH": "/root/rows/value/date_fi",
"CURRENT_LENGTH": 19
},

OpenEdge ABL reserved keyword as temp-table field name (inferred from JSON data)

I am stuck with the following situation:
My method receives a response from an external REST API call. The JSON response structure is as below:
{
"members": [
{
"email_address": "random#address.org",
"status": "randomstatus"
},
...etc...
]}
I am reading this into a temp-table with READ-JSON (inferring the ABL schema from the JSON data) and trying to process the temp-table. And this is where I am stuck:
when I try to put together a query that references the temp-table field "status", an error is raised.
Example:
hQuery:QUERY-PREPARE('FOR EACH ' + httSubscriber:NAME + ' WHERE ' + hBuffer:BUFFER-FIELD(iStatus):NAME + ' = "randomstatus"').
gives:
**Unable to understand after -- "members WHERE".(247)
I have tried referencing directly by name as well, same result.
Probably the "status" is a reserved keyword in ABL. Might that be the case? And how can I get over this issue to reference that "status" field?
Unfortunately the format and key names of JSON response are not under my control and I have to work with that.
You could use SERIALIZE-NAME in the temp-table definition to internally rename the field in question. You would then have to refer to the field by another name, but in its serialized form it would still be known as status.
Here's an example where the status field is renamed to exampleStatus.
DEFINE TEMP-TABLE ttExample NO-UNDO
FIELD exampleStatus AS CHARACTER SERIALIZE-NAME "status".
/* Code to read json goes here... */
/* Access the field */
FOR EACH ttExample:
DISPLAY ttExample.exampleStatus.
END.
I've been known to do silly things like this:
JSONData = replace( JSONData, '"status":', '"xstatus":' ).
Try prefixing the field with the temp-table name (hard-coded or via string appending): ... + '.' + hBuffer:BUFFER-FIELD(iStatus):NAME (...)
It should help the compiler understand you're talking about the field. Since the qualified name is not restricted, this should force its hand and allow you to query.
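A minimal sketch of that qualification, assuming the buffer handle from the question and that the buffer's NAME attribute matches the inferred temp-table name:
hQuery:QUERY-PREPARE('FOR EACH ' + httSubscriber:NAME
    + ' WHERE ' + hBuffer:NAME + '.' + hBuffer:BUFFER-FIELD(iStatus):NAME
    + ' = "randomstatus"').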

Parse complex JSON string contained in Hadoop

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach since I use the HCatalog Loader to load data from Hive.)
But the problem is the structure of my data; each value of the Map structure contains the value part of a complex JSON document. For example,
1. My table looks like this (WARNING: the type of 'complex_data' is not STRING but a MAP of <STRING, STRING>!):
TABLE temp_table
(
user_id BIGINT COMMENT 'user ID.',
complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains the following (the values I want to get are marked with two *s, so basically #'d'#'f' from each PARSED_STRING(complex_data#'c')):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD 'temp_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data per each row
-- eg) ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSE AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- IT ALSO DOES NOT WORK
I guessed that "attempt1" was failed because the value doesn't contain full JSON. However, when I CONCAT like "attempt2", I generate additional \ mark with. (so each line starts with {\"key\": ) I'm not sure that this additional marks breaks the parsing rule or not. In any case, I want to parse the given JSON string so that Pig can understand. If you have any method or solution, please Feel free to let me know.
I finally solved my problem by using the jyson library with a jython UDF.
I know that I could solve it using Java or other languages, but I think that jython with jyson is the simplest answer to this issue.
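The answer does not include the actual UDF, but a minimal sketch of the jython/jyson approach might look like the following (the UDF name, output schema, and field access are assumptions based on the question's #'d'#'f' goal, not the poster's real code):
# extract_df.py -- register in Pig with:
#   REGISTER 'extract_df.py' USING jython AS json_udf;
from com.xhaus.jyson import JysonCodec as json

@outputSchema("values:bag{t:tuple(f:chararray)}")
def extract_d_f(json_string):
    # Parse the chararray taken from complex_data#'c' and return each #'d'#'f' value as a bag.
    if json_string is None:
        return None
    out = []
    for item in json.loads(json_string):
        d = item.get('d', {})
        out.append((d.get('f'),))
    return out
From Pig it would then be called as something like: values = FOREACH data GENERATE json_udf.extract_d_f(complex_data#'c');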