I have a Get Metadata activity output, which is a JSON listing of the blobs in my container. I want to feed these names into my ForEach activity, where a U-SQL query is performed on each blob according to its file name. Is this possible?
You need to include either a SELECT or an EXTRACT. Since you are pulling from files, you are going to want to use EXTRACT.
If I understand your question correctly, you want to run different U-SQL scripts based on the file name.
There are a couple ways to do this:
1) Use If Condition activities in Data Factory to call different U-SQL scripts based on the file name. Nesting the If Conditions will allow you to have more than two options. There are several string manipulation functions to help you with this; for example, one path could be taken when @contains(item().name, 'a') is true.
{
    "name": "<Name of the activity>",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@equals(item().name, '<file name>')",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "<U-SQL script 1>"
            }
        ],
        "ifFalseActivities": [
            {
                "<U-SQL script 2>"
            }
        ]
    }
}
2) The second option is to use a single U-SQL script and do the split from there. Again, string manipulation functions can help via pattern matching. This has some organizational advantages: you can store the unique scripts in stored procedures, and the U-SQL script would simply check the file name passed in and call the relevant stored proc (a sketch of that variant follows the script below).
// This would be added by Data Factory
DECLARE @fileName string = "/Samples/Data/SearchLog.tsv";

IF @fileName == "/Samples/Data/SearchLog.tsv"
THEN
    @searchlog =
        EXTRACT UserId          int,
                Start           DateTime,
                Region          string,
                Query           string,
                Duration        int?,
                Urls            string,
                ClickedUrls     string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();

    OUTPUT @searchlog
    TO @fileName
    USING Outputters.Csv();
ELSE
    @searchlog =
        EXTRACT UserId          int,
                Start           DateTime,
                Region          string,
                Query           string,
                Duration        int?,
                Urls            string,
                ClickedUrls     string
        FROM @fileName
        USING Extractors.Tsv();

    OUTPUT @searchlog
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
END;
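As referenced above, a minimal hedged sketch of the stored-procedure variant (MyDb.dbo.ProcessSearchLog and MyDb.dbo.ProcessOther are illustrative names, and the procedures are assumed to have been created in the catalog beforehand):

// This would be added by Data Factory
DECLARE @fileName string = "/Samples/Data/SearchLog.tsv";

// Dispatch to the relevant stored procedure based on the file name.
IF @fileName == "/Samples/Data/SearchLog.tsv"
THEN
    MyDb.dbo.ProcessSearchLog(@fileName);
ELSE
    MyDb.dbo.ProcessOther(@fileName);
END;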
Something to think about is that Data Lake Analytics will be more efficient if you can combine multiple files into one statement. You can have multiple EXTRACT and OUTPUT statements. I would encourage you to explore whether you could use pattern matching in your EXTRACT statements (see the sketch below) to split the U-SQL processing without needing the foreach loop in Data Factory.
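A hedged sketch of that pattern matching, using a U-SQL file set with a {FileName} virtual column (the path and schema are just the SearchLog sample from above):

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string,
            FileName        string   // virtual column captured from the input path
    FROM "/Samples/Data/{FileName}.tsv"
    USING Extractors.Tsv();

// FileName is now a regular column, so WHERE clauses (or several rowsets
// filtered on it) can split the processing per file inside one script.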
I have millions of files with the following (poor) JSON format:
{
"3000105002":[
{
"pool_id": "97808",
"pool_name": "WILDCAT (DO NOT USE)",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
},
{
"pool_id": "96838",
"pool_name": "DRY & ABANDONED",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
}]
}
I've tried to generate an Athena DDL that would accommodate this type of structure (especially the api field) with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The api key is not actually within any of the files, so I hope there is a way that Athena can accommodate just the string-type value for these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use Create-Table-As-Select (CTAS) query that will convert the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
external_location = 's3://path/to/ctas_partitioned/',
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT
regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
regexp_extract(line, '"pool_name": "([^"]+)"', 1) as pool_name,
...
FROM json_lines_table;
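Here json_lines_table stands for a table that exposes each raw line of the JSON files as a single string column. A minimal hedged sketch of such a table (the name and location are illustrative):

CREATE EXTERNAL TABLE json_lines_table (
  line string
)
LOCATION 's3://foo/';

-- With the default text SerDe and no field delimiter occurring in the data,
-- each line of each file ends up in the single "line" column.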
Queries against the new table will also be faster, since the data is now in Parquet format.
Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
I am trying to automate/ease a procedure to review firewall rules within ELK (ElasticSearch, Logstash, Kibana).
I have some data obtained from a CSV, which is structured like this:
Source;Destination;Service;Action;Comment
10.0.0.0/8 172.16.0.0/16 192.168.0.0/24 23.2.20.6;10.0.0.1 10.0.0.2 10.0.0.3;udp:53
tcp:53;accept;No.10: ID: INC0000000001
My objective is to import this data into ELK by parsing each field (for subnet and/or IP address) and, if possible, adding a sequential field (IP_Source1, IP_Destination2, etc.) containing each one.
Is this possible, to your knowledge? How?
Thanks for any hint you may be able to provide
You can create a Logstash configuration with a file input.
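A minimal hedged sketch of the input (the path is illustrative):

input {
  file {
    path => "/path/to/firewall_rules.csv"   # illustrative path
    start_position => "beginning"
    sincedb_path => "/dev/null"             # re-read the file on each run (handy while testing)
  }
}

Then use a csv filter first. It should look like this: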
filter {
  csv {
    columns => ["source", "destination", "service", "action", "comment"]
    separator => ";"
  }
}
The next filter will need to be a ruby filter, to split the space-separated addresses into sequential fields.
filter {
  ruby {
    code => "
      # Split the space-separated source field and emit ip_source1, ip_source2, ...
      arr = event.get('source').split(' ')
      arr.each.with_index(1) do |a, index|
        event.set('ip_source' + index.to_s, a)
      end
    "
  }
}
Finally, output to Elasticsearch.
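A minimal hedged sketch of the output (the host and index name are illustrative):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "firewall-rules"
  }
}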
I have not tested this code, but I hope it gives you some good hints.
I am trying to search my database using a string, such as "A". I was just watching this Firebase tutorial Common SQL Queries converted for the Firebase Database - The Firebase Database For SQL Developers #4 and it explains that, in order to search the database for a string (in a certain location), you must use:
firebase.database().ref.child("child_name_here")
.queryOrdered(byChild: "child_name_here")
.queryStarting(atValue: "value_here_uppercase")
.queryEnding(atValue: "value_here_uppercase\\uf8ff")
You must use two \\ in the ending value as an escape character in order to get one \.
When I try this with my Firebase database, it does not work. Here is my database:
{
"Schools": {
"randomUID": {
"location" : "anyTown, anyState",
"name" : "anyName"
}
}
}
Here is my query:
databaseReference.child("Schools")
.queryOrdered(byChild: "name")
.queryStarting(atValue: "A")
.queryEnding(atValue: "A\\uf8ff") ...
When I go to print the snapshot from Firebase, I get nothing back.
If I get rid of the ending .queryEnding(atValue: "A\\uf8ff"), the database returns all of the schools in the Schools node.
How can I search the Firebase database using a String?
queryStarting() and queryEnding() can be used with numbers, for example to get objects with someField varying from 3 to 10, and also with strings, as in the prefix search below.
For matching a whole string exactly, you can use queryEqualToValue().
This shows all customers whose name starts with Wick (it's not Swift, but it may give you an idea):
// sample
let query = 'Wick'
clientsRef.orderByChild('name')
.startAt(query)
.endAt(query + '\uf8ff')
.once('value', (snapshot) => {
....
})
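For Swift, an untested hedged sketch of the same prefix query against your Schools node. Note that in Swift the ending character should be written as the Unicode escape \u{f8ff} (a single character); "\\uf8ff" produces a literal backslash followed by uf8ff, which is why your queryEnding matched nothing:

let query = "A"
databaseReference.child("Schools")
    .queryOrdered(byChild: "name")
    .queryStarting(atValue: query)
    .queryEnding(atValue: query + "\u{f8ff}")
    .observeSingleEvent(of: .value) { snapshot in
        // snapshot contains only schools whose name starts with the query
        print(snapshot)
    }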
I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed by using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach, since I use HCatLoader to load data from Hive.)
But the problem is the structure of my data: each value of the map structure contains the value part of a complex JSON. For example,
1. My table looks like this (WARNING: the type of 'complex_data' is not STRING, but a MAP<STRING, STRING>!):
TABLE temp_table
(
user_id BIGINT COMMENT 'user ID.',
complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains (the values I want to get are marked with two *s, so basically #'d'#'f' from each PARSED_STRING(complex_data#'c')):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD temp_table USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data per each row
-- eg) ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
--     ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- THIS ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value doesn't contain full JSON. However, when I CONCAT as in "attempt2", additional \ marks are generated (so each line starts with {\"key\": ). I'm not sure whether these additional marks break the parsing or not. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a jython UDF.
I know that I could solve it by using Java or other languages.
But I think that jython with jyson is the simplest answer to this issue.
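For reference, an untested hedged sketch of what such a jython UDF could look like (the UDF name, output schema, and jar paths are illustrative assumptions):

-- Pig side
REGISTER '/path/to/jyson.jar';
REGISTER 'json_udf.py' USING jython AS json_udf;
data = LOAD temp_table USING org.apache.hive.hcatalog.pig.HCatLoader();
parsed = FOREACH data GENERATE json_udf.extract_d_f(complex_data#'c');

# json_udf.py
from com.xhaus.jyson import JysonCodec as json

@outputSchema("values:bag{t:tuple(f:chararray)}")
def extract_d_f(json_string):
    # Parse the JSON array and collect each element's d.f value into a bag.
    out = []
    for item in json.loads(json_string):
        out.append((item['d']['f'],))
    return out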
I want to do a simple search by text in a specific field of a specific collection in ArangoDB. Something like this (in SQL):
SELECT * FROM procedures WHERE procedures.name LIKE '%hemogram%'
I need to search in a string field of an object (document?) that is part of an array that is a field of my actual document:
[
{
"name": "Unimed",
"procedures": [
{
"type": "Exames",
"name": "Endoscopia"
},
{
"type": "Exame",
"name": "Hemograma"
}
]
}
]
I want to retrieve, for example, all procedures whose name is like a given string, searching across all documents of my clinics collection.
I've been reading about fulltext indexes, but I couldn't understand how to create or use them.
Any help would be great!
EDIT
I almost got what I wanted. My problem now is returning just the information I want.
FOR clinic IN clinics
FILTER LIKE(clinic.procedures[*].name, '%hemogram%', true)
RETURN {
clinic_name: clinic.name,
procedure: clinic.procedures
}
This returns all the procedures of a given clinic (procedures is an array inside a clinic), and not only the procedures whose name field is LIKE my search string. How can I achieve this?
ArangoDB does matching in a similar way to SQL. You do, however, have to address the field you want to execute the LIKE on:
(You have to have an object as the top level, with the mandatory attributes _key etc.)
FOR document IN myCollection
FILTER LIKE(document.procedures.name, '%hemogram%')
RETURN document;
You might want to rethink your data model. Your documents store two different types of entities: clinics and procedures.
You could store clinics in a clinics collection, and their procedures in a procedures collection (with optional de-duplication enabled by the separation into two collections). Then link clinics to the procedures, either by an array of procedure _ids in clinic records, or by using an edge collection to link clinic and procedure documents with edges.
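For illustration, a hedged sketch of such a separated model (the keys and links are made up):

// document in the clinics collection
{ "_key": "C1", "name": "Unimed", "procedures": ["P1", "P2"] }

// documents in the procedures collection
{ "_key": "P1", "type": "Exames", "name": "Endoscopia" }
{ "_key": "P2", "type": "Exame", "name": "Hemograma" }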
If you want to keep the current data model, use the following AQL query:
FOR clinic IN clinics
FOR proc IN clinic.procedures
FILTER LIKE(proc.name, "%hemo%", true)
RETURN MERGE(
UNSET(clinic, "procedures"),
{procedure: proc}
)
It's not possible to use the abbreviated syntax ([*] operator) in your case.
The approach is to remove the procedures attribute from the document and add a new attribute procedure with the matched procedure object as its value. The problem is that multiple results will be returned if LIKE() finds more than a single procedure. You could add a LIMIT 1 below the FILTER, but that may not be desired.
To return a single result instead, with procedures attribute reduced to matching procedures, a sub-query is required:
FOR clinic IN clinics
LET p = (
FOR proc IN clinic.procedures
FILTER LIKE(proc.name, "%hemo%", true)
RETURN proc
)
RETURN MERGE(
clinic,
{procedures: p}
)