Converting CSV string to multiple columns in Apache Drill

Using: Apache Drill
I am trying to bring the following data into a more structured form:
"apple","juice", "box:12,shipment_id:143,pallet:B12"
"mango", "pulp", "box:7,shipment_id:133,pallet:B19,route:11"
"grape", "jam", "box:10"
Desired output:
fruit, product, box_id, shipment_id, pallet_id, route_id
apple,juice, 12, 143, B12, null
mango, pulp, 7, 133, B19, 11
grape, jam, 10, null, null, null
The dataset runs into a couple of GBs. Drill reads the input into three columns, with the entire key:value string in the last column. I have achieved the desired output by performing string manipulation (REGEXP_REPLACE and CONCAT) on that column, then reading the result as JSON (CONVERT_FROM), and finally separating it into different columns using KVGEN and FLATTEN.
The execution time is pretty high due to the regex functions. Is there a better approach?
(PS: execution time is compared to using a pyspark job to achieve the desired output).
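For reference, here is a minimal sketch of the pipeline described above, using direct map-member access after CONVERT_FROM instead of KVGEN/FLATTEN (the file path and the t.kv.* field names are illustrative, and the query assumes Drill exposes the headerless CSV through its default columns[] array):
SELECT
  t.fruit,
  t.product,
  t.kv.box          AS box_id,
  t.kv.shipment_id  AS shipment_id,
  t.kv.pallet       AS pallet_id,
  t.kv.route        AS route_id
FROM (
  SELECT
    columns[0] AS fruit,
    columns[1] AS product,
    -- turn 'box:12,shipment_id:143,pallet:B12' into '{"box":"12","shipment_id":"143","pallet":"B12"}'
    CONVERT_FROM(
      CONCAT('{"', REGEXP_REPLACE(REGEXP_REPLACE(columns[2], ':', '":"'), ',', '","'), '"}'),
      'JSON') AS kv
  FROM dfs.`/path/to/fruits.csv`
) t;
Keys missing from a given row (route on most rows, for example) come back as NULL, which matches the desired output. The regex work is the same as in the KVGEN/FLATTEN variant, so on its own this is unlikely to change the timings much.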

I do not see any way to do this 100% within Apache Drill without intermediate storage.
You could try a custom function written in Java to make the transformation easier to express.
Since you have already done the work,
have you tried saving the result to a Parquet file with the CTAS command? http://drill.apache.org/docs/create-table-as-ctas-command/
This would make subsequent queries a lot faster.
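A hedged sketch of that CTAS step (the dfs.tmp workspace, table name, and file path are placeholders; in practice the inner SELECT would be the full flattening query rather than the raw columns shown here):
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`fruits_parquet` AS
SELECT columns[0] AS fruit,
       columns[1] AS product,
       columns[2] AS attributes
FROM dfs.`/path/to/fruits.csv`;
Once the reshaped data has been written as Parquet, the expensive regex work is paid only once, and every later query reads the columnar files directly.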

Related

How do I convert a column of JSON strings into a parquet table

I am trying to convert some data that I am receiving into a parquet table that I can eventually use for reporting, but feel like I am missing a step.
I receive CSV files in the format "id", "event", "source", where the "event" column is a GZIP-compressed JSON string. I've been able to set up a dataframe that extracts the three columns, including unzipping the JSON string. So I now have a table with
id | event | source | unencoded_event
Where the unencoded_event is the JSON string.
What I'd like to do at this point is to take that one string column of JSON and parse it out into individual columns. Based on a comment from another developer (that the process of converting to parquet is smart enough to just use the first row of my results to figure out schema), I've tried this:
df1 = spark.read.json(df.select("unencoded_event").rdd).write.format("parquet").saveAsTable("test")
But this just gives me a single-column table whose only column, _corrupt_record, contains the JSON string again.
What I'm trying to do is take this schema:
{
"agent"
--"name"
--"organization"
"entity"
--"name"
----"type"
----"value"
}
And get the table to, ultimately, look like:
AgentName | Organization | EventType | EventValue
Is the step I'm missing just explicitly defining the schema or have I oversimplified my approach?
Potential complications here: the JSON schema is actually more involved than above; I've been assuming I can expand out the full schema into a wider table and then just return the smaller set I care about.
I have also tried taking a single result from the file (so, a single JSON string), saving it as a JSON file and trying to read from it. Doing so works, i.e., doing the spark.read.json(myJSON.json) parses the string into the arrays I was expecting. This is also true if I copy multiple strings.
This doesn't work if I take my original results and try to save them. If I try to save just the column of strings as a json file
dfWrite = df.select(col("unencoded_event"))
dfWrite.write.mode("overwrite").json(write_location)
then read them back out, it doesn't behave the same way: each row is still treated as a string.
I did find one solution that works. This is not a perfect solution (I'm worried that it's not scalable), but it gets me to where I need to be.
I can select the data using get_json_object() for each column I want (sorry, I've been fiddling with column names and the like over the course of the day):
dfResults = df.select(get_json_object("unencoded_event", "$.agent[0].name").alias("userID"),
get_json_object("unencoded_event", "$.entity[0].identifier.value").alias("itemID"),
get_json_object("unencoded_event", "$.entity[0].detail[1].value").alias("itemInfo"),
get_json_object("unencoded_event", "$.recorded").alias("timeStamp"))
The big thing I don't love about this is that it appears I can't use filter/search options with get_json_object(). That's fine for the foreseeable future, because right now I know where all the data should be and don't need to filter.
I believe I can also use from_json() but that requires defining the schema within the notebook. This isn't a great option because I only need a small part of the JSON, so it feels like unnecessary effort to define the entire schema. (I also don't have control over what the overall schema would be, so this becomes a maintenance issue.)
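If only those few fields are needed, a partial schema keeps the from_json() definition small, since attributes that are not listed are simply dropped during parsing. A hedged sketch in Spark SQL, assuming the dataframe has been registered as a temporary view named events and that agent and entity are arrays, as the JSON paths above suggest:
SELECT
  parsed.agent[0].name         AS AgentName,
  parsed.agent[0].organization AS Organization,
  parsed.entity[0].name.type   AS EventType,
  parsed.entity[0].name.value  AS EventValue
FROM (
  SELECT from_json(unencoded_event,
    'agent ARRAY<STRUCT<name:STRING, organization:STRING>>, entity ARRAY<STRUCT<name:STRUCT<type:STRING, value:STRING>>>'
  ) AS parsed
  FROM events
) t;
Because the schema only names the fields actually selected, the maintenance burden stays limited even if the upstream JSON keeps growing.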

Storing large amounts of queryable JSON

I am trying to find a database solution that is capable of the following:
1. Store flat, random JSON structures separated by table name (random_json_table_1, random_json_table_2, for example).
2. Capable of handling a large number of insert operations (10,000+/second).
3. Able to query the random JSON structures, for example: SELECT * FROM random_json_table_1 WHERE JSON_SELECT('data', '$.city.busses') NOT NULL AND JSON_SELECT('data', '$.city.busStops', 'length') > 5.
4. SELECT queries must run fast over gigabytes of data.
I had a look at Amazon Athena and it looks a bit promising but I am curious if there are any other solutions out there.
You may consider BigQuery.
Regarding 2), there is the BigQuery streaming interface.
And for 4), you can play with BigQuery public data (e.g. the popular Bitcoin transactions table) to see how fast BigQuery can be.
Below is a sample query using BigQuery standard SQL, showing how to filter data that is stored as a JSON string.
#standardSQL
SELECT JSON_EXTRACT(json_text, '$') AS student
FROM UNNEST([
'{"age" : 1, "class" : {"students" : [{"name" : "Jane"}]}}',
'{"age" : 2, "class" : {"students" : []}}',
'{"age" : 10,"class" : {"students" : [{"name" : "John"}, {"name": "Jamie"}]}}'
]) AS json_text
WHERE CAST(JSON_EXTRACT_SCALAR(json_text, '$.age') AS INT64) > 5;
It feels like Google's BigQuery managed database might be of value to you. Its documentation describes a soft limit of 100,000 rows per second for streaming inserts and a maximum of 10,000 rows per single insert request. For queries, BigQuery advertises itself as being able to process petabyte-sized tables within acceptable limits.
Here is a link to the main page for BigQuery:
https://cloud.google.com/bigquery/
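For comparison, here is a hedged sketch of how the example filter from the question might look against a real table, assuming the JSON is kept in a STRING column named data and that JSON_EXTRACT_ARRAY is available (project and dataset names are placeholders):
#standardSQL
SELECT *
FROM `my_project.my_dataset.random_json_table_1`
WHERE JSON_EXTRACT(data, '$.city.busses') IS NOT NULL
  AND ARRAY_LENGTH(JSON_EXTRACT_ARRAY(data, '$.city.busStops')) > 5;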

Homogenize a field with different date formats in MySQL

I am working with MySQL Workbench.
I have a huge database in a CSV that contains, among other things, 3 columns with different date formats.
To be able to load this CSV file into my database, I have to set the 3 date columns as text; otherwise, they do not upload properly.
Here an example of my data:
inDate, outDate
19-01-10, 02-02-10
04-01-11 12:02, 2011-01-11 11:31
29-01-11 6:57, 29-03-2010
30-03-10, 01-04-2010
2012-12-03 05:39:27.040, 12-12-12 17:04
2012-12-04 13:47:01.040, 29-11-12
I want to homogenize them and split each of those columns into two: one with only the date and the other with only the time.
I have tried working with regular expressions and with CASE.
The regular expressions gave me NULLs, and CASE gave me "truncated incorrect value" errors.
I have looked for similar situations on the web. People have had similar issues, but with two date formats, not with as many different formats as I have:
Convert varchar column to date in mysql at database level
Converting multiple datetime formats to single mysql datetime format
Format date in SELECT * query.
I am really new at this and I do not know how to handle so many exceptions in MySQL.
Load the CSV into a temporary table; massage the values in that table; finally copy to the 'real' table.
Have 2 columns in that table for each date; one for the raw value coming from the CSV; the other being a DATETIME(3) (or whatever the sanitized version will be).
Do one of these for each distinctly different format:
UPDATE tmp SET inDate = ...
WHERE raw_inDate REGEXP '...';
The WHERE may need things like AND LENGTH(raw_inDate) = 8 and other tests besides REGEXP.
SUBSTRING_INDEX(inDate, '-', ...) may be a handy function for splitting up a date.
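As an illustration only (the tmp table and the raw_inDate/inDate columns follow the scheme above, and the yy-mm-dd versus dd-mm-yy ambiguity in the sample data still has to be settled by hand), STR_TO_DATE can do the actual parsing once the WHERE has pinned down a format:
-- 'dd-mm-yy', e.g. '19-01-10'
UPDATE tmp SET inDate = STR_TO_DATE(raw_inDate, '%d-%m-%y')
WHERE raw_inDate REGEXP '^[0-9]{2}-[0-9]{2}-[0-9]{2}$';

-- 'dd-mm-yy hh:mm', e.g. '04-01-11 12:02'
UPDATE tmp SET inDate = STR_TO_DATE(raw_inDate, '%d-%m-%y %H:%i')
WHERE raw_inDate REGEXP '^[0-9]{2}-[0-9]{2}-[0-9]{2} [0-9]{1,2}:[0-9]{2}$';

-- 'yyyy-mm-dd hh:mm:ss.fff', e.g. '2012-12-03 05:39:27.040'
UPDATE tmp SET inDate = STR_TO_DATE(raw_inDate, '%Y-%m-%d %H:%i:%s.%f')
WHERE raw_inDate REGEXP '^[0-9]{4}-';
Once inDate is a proper DATETIME, DATE(inDate) and TIME(inDate) give the separate date and time columns the question asks for.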
But, really, I would rather write the code in Perl or some other real programming language.

MySQL table with {"Twitter": 28, "Total": 28, "Facebook": 1}

There is a table with one column, named "info", whose content looks like {"Twitter": 28, "Total": 28, "Facebook": 1}. When I write SQL, I want to test whether "Total" is larger than 10. Could someone help me write the query? (The table name is landslides_7d.)
(this is what I have)
SELECT * FROM landslides_7d WHERE info.Total > 10;
Thanks.
The data format seems to be JSON. If you have MySQL 5.7 you can use JSON_EXTRACT or the short form ->. Those functions don't exist in older versions.
SELECT * FROM landslides_7d WHERE JSON_EXTRACT(info, '$.Total') > 10;
or
SELECT * FROM landslides_7d WHERE info->'$.Total' > 10;
(Note that JSON paths are case-sensitive, so the key is $.Total, not $.total.)
See http://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-extract
Mind that this is a full table scan. On a "larger" table you want to create an index.
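One hedged way to get such an index on MySQL 5.7+ is a generated column over the extracted value (the column and index names below are illustrative):
ALTER TABLE landslides_7d
  ADD COLUMN total INT GENERATED ALWAYS AS (JSON_EXTRACT(info, '$.Total')) STORED,
  ADD INDEX idx_total (total);

SELECT * FROM landslides_7d WHERE total > 10;
The SELECT can then use idx_total instead of scanning and re-parsing every JSON document.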
If you're on an older version of MySQL, you should add an extra column to your table and manually maintain the total value in it.
You are probably storing the JSON in a single blob or string column. This is very inefficient, since you can't make use of indexes and you need to parse the entire JSON structure on every query. I'm not sure how much flexibility you need, but if the JSON attributes are relatively fixed, I recommend running a script (Ruby, Python, etc.) over the table contents and storing "total" in a traditional columnar format. For example, you could add a new column "total" which contains the total attribute as an INT.
A side benefit of using a script is that you can catch any improperly formatted JSON - something you can't do in a single query.
You can also keep the "total" column maintained with a trigger (on insert/update of "info"), using the JSON_EXTRACT function referenced in Johannes' answer.
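A hedged sketch of that trigger approach (it assumes total is a plain INT column, not the generated column from the sketch above):
CREATE TRIGGER landslides_7d_total_ins BEFORE INSERT ON landslides_7d
FOR EACH ROW SET NEW.total = JSON_EXTRACT(NEW.info, '$.Total');

CREATE TRIGGER landslides_7d_total_upd BEFORE UPDATE ON landslides_7d
FOR EACH ROW SET NEW.total = JSON_EXTRACT(NEW.info, '$.Total');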

MySQL - Extracting numbers out of strings

In a MySQL database, I have a table which contains itemID, itemName and some other fields.
Sample records (respectively itemID and itemName):
vaX652bp_X987_foobar, FooBarItem
X34_bar, BarItem
tooX56, TOOX_What
I want to write a query which gives me an output like:
652, FooBarItem
34, BarItem
56, TOOX_What
In other words, I want to extract the number from the itemID column, with the condition that the extracted number is the one that occurs after the first occurrence of the character "X" in the itemID column.
I am currently trying out LOCATE() and SUBSTRING() but have not (yet) achieved what I want.
EDIT:
Unrelated to the question: can anyone see all the answers (currently two) to this question? I see only the first answer, by "soulmerge". Any ideas why? And the million dollar question: did I just find a bug?!
That's a horrible thing to do in MySQL, since it does not support extraction of regex matches. I would rather recommend pulling the data into your language of choice and processing it there. If you really must do this in MySQL, unreadable combinations of LOCATE and SUBSTRING with multiple CASEs are the only thing I can think of.
Why don't you add a third column where you store the number alone, filled at the moment the record is inserted (extracting the number in PHP or similar)? That way you use a little more space to save a lot of processing.
Table:
vaX652bp_X987_foobar, 652, FooBarItem
X34_bar, 34, BarItem
tooX56, 56, TOOX_What
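A hedged variant of the same idea on MySQL 5.7+: a generated column can maintain that extra number column inside the database instead of in PHP (the table name items is made up, and depending on the SQL mode the implicit string-to-number cast may raise truncation warnings):
ALTER TABLE items
  ADD COLUMN itemNumber INT
  GENERATED ALWAYS AS (0 + SUBSTRING(itemID, LOCATE('X', itemID) + 1)) STORED;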
This isn't so unreadable:
SELECT 0+SUBSTRING(itemID, LOCATE("X", itemID)+1), itemName FROM tableName
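As a hedged sanity check against the sample rows (the items table here is made up for the test):
CREATE TABLE items (itemID VARCHAR(64), itemName VARCHAR(64));
INSERT INTO items VALUES
  ('vaX652bp_X987_foobar', 'FooBarItem'),
  ('X34_bar',              'BarItem'),
  ('tooX56',               'TOOX_What');

SELECT 0+SUBSTRING(itemID, LOCATE('X', itemID)+1) AS itemNumber, itemName
FROM items;
-- returns 652 FooBarItem, 34 BarItem, 56 TOOX_What
One caveat: LOCATE is case-insensitive under the default collations, so an itemID containing a lowercase "x" before the first capital "X" would need LOCATE(BINARY 'X', itemID) instead.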