I have created an external Hive table from a tweets JSON file exported from MongoDB. Whenever I select more than one column from the Hive table, the retrieved results are not well formatted: some columns are empty or NULL (even if I filtered on specific values) and some data appears in the wrong columns.
I think this is happening because the tweet text contains commas. When I query the Hive table without selecting the text of the tweets, the results make sense. But I don't know how to fix that.
Does anyone have any idea how to fix this?
Best,
Why don't you try formatting the output? Something like this:
SELECT
CONCAT(COALESCE(COL1,''),
'|', COALESCE(COL2,''),
'|', COALESCE(COL3,''),
'|', COALESCE(COL4,''),
'|', COALESCE(COL5,''),
'|', COALESCE(COL6,''),
'|', COALESCE(COL7,'')) as tweetsout
FROM (
SELECT COL1, COL2, COL3, COL4, COL5, COL6, COL7
FROM TWEETS
) TOUT
This would give you the output delimited by a pipe instead of the standard tab-delimited output.
It is difficult to tell without knowing the exact CREATE TABLE command you used.
Usually the table is parsed incorrectly when the input data contains the table's delimiters. For example, some tweets in your input may contain \n, which might be the row separator in the Hive table you created.
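One possible fix is to stop relying on field delimiters altogether and let a JSON SerDe parse each record. A minimal sketch, assuming the hive-hcatalog-core jar is available on your cluster and using hypothetical column names (id_str, text, created_at) that you would replace with the actual fields in your tweet JSON:
ADD JAR /path/to/hive-hcatalog-core.jar;
CREATE EXTERNAL TABLE tweets_json (
  id_str STRING,
  text STRING,
  created_at STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/path/to/tweets';
Because the SerDe parses each line as a JSON document, commas or tabs inside the tweet text can no longer shift values into the wrong columns.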
I have a MySQL database; the table name is "post_data" and the field name is "content_data".
I want to delete all the random text between two known strings, i.e. everything between 'www.OldDomain.com/' and '/www.NewDomian.com'.
For example:
www.OldDomain.com/redirect?RandomTextUrl/www.NewDomian.com/txturl
Any suggestion will be appreciated.
Assuming that each of these strings appears exactly once in content_data and in the correct order:
select concat(substring_index(content_data, 'www.OldDomain.com', 1),
'www.OldDomain.com',
'www.NewDomian.com',
substring_index(content_data, 'www.NewDomian.com', -1)
)
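If the goal is to actually strip the text from the stored rows rather than just select a cleaned value, the same expression can be wrapped in an UPDATE. A minimal sketch against the post_data table and content_data column from the question, with the same assumption that each marker appears exactly once and in order (back up the table before running it):
UPDATE post_data
SET content_data = CONCAT(SUBSTRING_INDEX(content_data, 'www.OldDomain.com', 1),
                          'www.OldDomain.com',
                          'www.NewDomian.com',
                          SUBSTRING_INDEX(content_data, 'www.NewDomian.com', -1))
WHERE content_data LIKE '%www.OldDomain.com%'
  AND content_data LIKE '%www.NewDomian.com%';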
If you want to struggle through it in MySQL, you could use SUBSTRING_INDEX.
I would suggest using PHP/Java/C# or another language for this kind of string parsing, though.
You can use LOCATE and SUBSTRING functions to get the string between two desired strings.
SELECT SUBSTRING(
data,
LOCATE('www.OldDomain.com/', data) + LENGTH('www.OldDomain.com/'),
LOCATE('/www.NewDomain.com', data) - (LOCATE('www.OldDomain.com/', data) + LENGTH('www.OldDomain.com/'))
) as output
FROM TABLE_NAME;
NOTE: TABLE_NAME is your table name, data is your column name and output is the alias for the extracted data.
Sample Run
CREATE TABLE stack (data varchar(100));
INSERT INTO stack VALUES('www.OldDomain.com/redirect?RandomTextUrl/www.NewDomain.com/txturl');
INSERT INTO stack VALUES('BEFOREwww.OldDomain.com/redirect?RandomTextUrl/www.NewDomain.com/txturl');
INSERT INTO stack VALUES('www.OldDomain.com/redirect?RandomTextUrl/www.NewDomain.com/txturlAFTER');
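Running the same query against this sample table (with TABLE_NAME replaced by stack) should return redirect?RandomTextUrl for each of the three rows:
SELECT SUBSTRING(
         data,
         LOCATE('www.OldDomain.com/', data) + LENGTH('www.OldDomain.com/'),
         LOCATE('/www.NewDomain.com', data) - (LOCATE('www.OldDomain.com/', data) + LENGTH('www.OldDomain.com/'))
       ) as output
FROM stack;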
I have a nested JSON to upload into BigQuery.
{
"status":{
"sleep":"12333",
"wake":"3837"
}
}
After inserting it into BigQuery, I am getting the field names as:
status_sleep and status_wake
I require the field names to be separated by a delimiter such as '.' or any other delimiter:
status.sleep and status.wake
Please suggest how to add the field delimiter. I checked that there is a field delimiter option for uploading the data in CSV format.
After you insert data with the above schema, you have a record named status with two fields in it: status.sleep and status.wake.
When you query it as
SELECT * FROM yourtable
without providing aliases, you will get output columns named status_sleep and status_wake, because dot notation is reserved for referencing nested data.
But you can still reference your data with dots, as below:
SELECT status.sleep as sleep, status.wake as wake FROM yourtable
I'm trying to export a MySQL table into a CSV file using SequelPro, but the resulting file is not properly formatted. For instance, the first column should be the id; in some of the records there is a number, but in others there is text instead. This happens in pretty much every column.
One explanation might be that the fields are being exported vertically rather than horizontally (you have id 1 on the first row and id 10 on the 10th row).
Maybe I'm getting the options (fields) wrong, in which case I'd appreciate help.
What I need is for the file to have the titles as the first row and the actual data in the subsequent rows, with each data property as a column.
Appreciate the help.
PS: I've tried this but I'm getting the data vertically as well.
For convenience in executing and sharing across the team, consider adding commas as part of the SELECT statement itself, like this:
SELECT
col1 as 'Column 1',',',
col2 as 'Column 2',',',
col3 as 'Column 3',',',
col4 as 'Column 4',',',
col5 as 'Column 5',','
FROM
table1, table2
WHERE
clause1 and clause2
This way the data and headers automatically get the commas they need.
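An alternative worth mentioning, if you have file access on the MySQL server itself, is to let MySQL write the CSV directly with SELECT ... INTO OUTFILE, which handles the delimiters and enclosures for you. A sketch with a hypothetical output path and column/table names (note that INTO OUTFILE does not write a header row, so the titles would need to be added separately):
SELECT col1, col2, col3
INTO OUTFILE '/tmp/export.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM table1;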
In my mysql table, one of the fields holds data of the nature:
{"gateway":"somevalue","location":"http://www.somesite.org/en/someresource","ip":"100.0.0.9"}
I need to extract the value of the location attribute alone from this field, which is
http://www.somesite.org/en/someresource
in this case. How do I write a query to achieve this?
Apart from the fact that you are better off not storing delimited values of any form (including JSON) in the database, but rather normalizing your data, you can leverage the very handy SUBSTRING_INDEX() function in the following way:
SELECT TRIM(BOTH '"' FROM SUBSTRING_INDEX(SUBSTRING_INDEX(column_name, '"location":', -1), ",", 1)) location
FROM table_name
WHERE ...
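If you are on MySQL 5.7 or later and the column holds valid JSON, the built-in JSON functions are a more robust alternative to string splitting. A sketch using the same placeholder column_name and table_name:
SELECT JSON_UNQUOTE(JSON_EXTRACT(column_name, '$.location')) AS location
FROM table_name
WHERE ...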
I'm new to Kettle, but getting on well with it so far. However I can't figure out how to do this.
I have a csv which looks something like this
a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3
The first column starts with a key (a, b, c), and then the rest of the columns follow. What I want to do is read in the CSV (got that covered) and then split it based on the key, so I have three chunks/groups of data, and then convert each of those chunks into a separate JSON file, which I think I can manage.
What I can't get my head around is grouping the data and then performing a separate action (convert to JSON) on each of those separate groups. It's not creating the JSON that I have an issue with.
The data is from a sensor network of many environmental sensors, so there are many keys, hundreds, and new ones get added. I've used MapReduce to process this data before, as the concept of partitioning is what I'm trying to replicate here, but without the Hadoop elements of Kettle, since the deployment is different. Once I've partitioned the data it needs to be loaded into different places as separate records. The key is a unique ID (serial number) of a sensor.
Any ideas please?
Thanks
I would create a JavaScript step that outputs the fields of a row as a JSON-like string added to the row:
{"id":"a","col1":"1","col2":"2","col3":"3"}
Next you could use the Group By step, set the group field to the 'id' field, and aggregate the JavaScript value with the type 'Concatenate strings separated by ,':
{"id":"a","col1":"1","col2":"2","col3":"3"},{"id":"a","col1":"4","col2":"5","col3":"6"}, {"id":"a","col1":"7","col2":"8","col3":"9"}
Add brackets around it and you have valid JSON. Next you could assemble a file name using a JavaScript step:
var file_name="C:\\dir\\"+ id + ".txt";
Use the Text file output step and set the file name field to 'file_name'. Remove the separator/enclosure options so there is no extra formatting, and you are done.
If I have understood your question correctly, you can use the "Group By" step to group the rows on the key column (i.e. the first column in your data set) and then store these in memory.
Once this is done, use parameter looping to "get the variables" and dynamically generate multiple JSON outputs. Check the image below:
In the JSON output step, use variables like header1 to generate multiple files. Highlighted below are the changes I made in the JSON Output.
In case you find it confusing, I have uploaded a sample code here.
Hope it helps :)