Removing single quotes from a flat file when loading to Hive

I'm creating a Hive external table over my flat-file data.
The data in my flat file looks like this:
'abc',3,'xyz'
When I load it into the Hive table, the results still contain the single quotes.
I want the output to look like this:
abc,3,xyz
Is there any way to do this?

I can think of two ways to get the desired result.
1. Use the existing string functions available in Hive, SUBSTR and LENGTH.
select SUBSTR("\'abc\'",2,length("\'abc\'")-2) , SUBSTR("\'3\'",2,length("\'3\'")-2) , SUBSTR("\'xyz\'",2,length("\'xyz\'")-2)
Generalized query (your_table is a placeholder name):
select SUBSTR(col1,2,length(col1)-2) , SUBSTR(col2,2,length(col2)-2) , SUBSTR(col3,2,length(col3)-2) from your_table
NOTE: Hive's SUBSTR expects string indexes to start at 1, not 0. These expressions drop the first and last character unconditionally, so apply them only to columns whose values are actually wrapped in quotes (in the question's sample, the second field, 3, is unquoted).
2. Write your own UDF to chop the first and last character of every string.
How to convert a million rows?
Let's assume you have a table named "staging" with 3 columns and 1 million records.
Running the query below populates the table "final" (which must already exist) with values that have no single quotes at the start or end.
INSERT INTO TABLE final SELECT SUBSTR(col1,2,length(col1)-2) , SUBSTR(col2,2,length(col2)-2) , SUBSTR(col3,2,length(col3)-2) from staging
Once the query finishes, the desired result is in the "final" table.
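As an alternative to the SUBSTR queries above, the quotes can be stripped before Hive ever reads the file. The following is only a sketch, assuming the file is small enough to preprocess with a script and that single quotes appear only as field wrappers. The function name strip_single_quotes is made up for illustration; it is not part of Hive.

```python
import csv
import io


def strip_single_quotes(text):
    """Re-write comma-separated text, dropping single quotes that wrap fields.

    Hypothetical helper: it parses the input with ' as the quote character,
    so quoted and unquoted fields are both handled, then writes the rows
    back out with the csv module's default (double-quote) quoting rules.
    """
    reader = csv.reader(io.StringIO(text), quotechar="'")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(strip_single_quotes("'abc',3,'xyz'\n"), end="")
```

Unlike the unconditional SUBSTR approach, this leaves unquoted fields such as 3 untouched. A field that itself contains a comma would be written back wrapped in double quotes, which may or may not suit the table definition.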

Related

How to make MySQL MATCH...AGAINST use various word separators?

I have a table with 300K string values. These values contain all types of word separators so it looks like this:
id value
1 A B C
2 A B_C
3 A_B-C
4 A-B-C
Let's say I want to find all four rows containing A and B. This query
SELECT * FROM table WHERE MATCH(value) AGAINST('+A +B' IN BOOLEAN MODE);
will return only one row with space separated values:
1 A B C
Is there a way to make MATCH...AGAINST use other word separators? I tried to use LIKE and it was too slow.

You will probably want to alter your app and schema slightly to solve this problem. You have two tasks:
Task 1: Transform your existing data
Assuming you need to keep the source data unchanged:
Step 1: Add a field to your schema, "searchFriendly", with the same datatype as the source data.
Step 2: Write a script to transform the data you already have: read the whole data set and replace the other separators (such as _ and -) with spaces.
Step 3: Save the transformed data to the new searchFriendly field.
Task 2: Modify the app so that every future save or update of this data also performs the transformation and stores the result.
Step 1: Find the part of the app that saves these records.
Step 2: Before writing the data to the database, perform the transformation.
Step 3: Add the transformed data to your API call that saves or updates the record, under the searchFriendly field.
Then index searchFriendly for full-text search and run MATCH...AGAINST on it instead of the original column.
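The transformation in Step 2 of both tasks could look roughly like the sketch below. It assumes the only separators that matter are underscores, hyphens and whitespace, as in the question's sample rows; the name to_search_friendly is invented for illustration.

```python
import re

# Runs of underscores, hyphens or whitespace are treated as one separator.
# This set is an assumption based on the sample rows in the question.
SEPARATORS = re.compile(r"[_\-\s]+")


def to_search_friendly(value):
    """Return the value with every separator run replaced by a single space."""
    return SEPARATORS.sub(" ", value).strip()


if __name__ == "__main__":
    for raw in ["A B C", "A B_C", "A_B-C", "A-B-C"]:
        print(raw, "->", to_search_friendly(raw))
```

The same function can serve both the one-off migration script and the save path in the app, so the two never drift apart.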

Add value from every row in a table and output (Cast JSON string to int)

I'm querying an SQL database that I have read-only access to (I cannot edit tables, create columns, etc.).
My table contains a column of JSON strings with the following syntax (the actual strings are much larger; this is just an example):
{"value":"442","country":"usa"}
I would like to add together the values contained in the JSON string of each row and output the total in a readable form. Is this possible?
The value is at the same position in every JSON string, as shown above. The values also vary in length; most are 3 or 4 characters long.

Try the following (for MySQL v5.7+):
select sum(json_extract(jsonString, '$.value')) from mytable;
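If the MySQL version in use predates the JSON functions, the same total can be computed client-side after fetching the column. This is a rough sketch that assumes every row is valid JSON with a "value" key holding an integer written as a string; sum_values and the second sample row are made up for illustration.

```python
import json


def sum_values(json_rows, key="value"):
    """Parse each JSON string and add up the integer stored under `key`."""
    return sum(int(json.loads(row)[key]) for row in json_rows)


if __name__ == "__main__":
    rows = [
        '{"value":"442","country":"usa"}',  # sample from the question
        '{"value":"58","country":"uk"}',    # invented second row
    ]
    print(sum_values(rows))
```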

MySQL to CSV - separating multiple values

I have downloaded a MySQL table as CSV, which has over thousand entries of the following type:
id,gender,garment-color
1,male,white
2,"male,female",black
3,female,"red,pink"
Now, when I am trying to create a chart out of this data, it is taking "male" as one value, and "male,female" as a separate value.
So, for the above example, rather than counting 2 "male", and 3 "female", the chart is showing 3 separate categories ("male", "female", "male,female"), with one count each.
I want the output as follows, for chart to have the correct count:
id,gender,garment-color
1,male,white
2,male,black
2,female,black
3,female,red
3,female,pink
The only way I know is to copy the row in MS Excel and adjust the values manually, which is too tedious for 1000+ entries. Is there a better way?

From MySQL command line or whatever tool you are using to send queries to MySQL:
select * from the_table
into outfile '/tmp/out.txt' fields terminated by ',' enclosed by '"'
Then download /tmp/out.txt from the server and it should be good to go, assuming your data is good. If it is not, you might need to massage it with some SQL functions in the select.

The CSV likely came from a poorly designed/normalized database that stored both of those values in the same row. You could try using selects and updates, along with some built-in string functions, on such rows to spawn additional rows containing the additional values and update the original rows to remove those values; but you will have to repeat this until all commas are removed (if there is more than one in some field), and you will have to decide whether a row containing multiple fields with such comma-separated lists needs to be multiplied out (i.e. should 2 genders and 4 colors mean 8 rows in total).
More likely, you'll want to create additional tables, X_garmentcolors and X_genders, where X is whatever the original table is supposed to be describing. These tables would have an X_id field referencing the original row and a [garmentcolor|gender] value field holding one of the values from the original row's list. Ideally, they should reference [gender|garmentcolor] lookup tables instead of holding actual values; but you'd first have to do the grunt work of picking out all the unique colors and genders from your data. Once that is done, you can do something like:
INSERT INTO X_[garmentcolor|gender] (X_id, Y_id)
SELECT X.X_id, Y.Y_id
FROM originalTable AS X
INNER JOIN valueTable AS Y
ON X.Y_valuelist LIKE CONCAT('%,', Y.value) -- Value at end of list
OR X.Y_valuelist LIKE CONCAT('%,', Y.value, ',%') -- Value in middle of list
OR X.Y_valuelist LIKE CONCAT(Y.value, ',%') -- Value at start of list
OR X.Y_valuelist = Y.value -- Value is entire list
;
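If the goal is only to fix the exported CSV for charting, the "multiply out" step described above can also be done outside the database. The sketch below assumes that every comma inside a cell separates alternative values and that a row with several multi-valued cells should expand to every combination; explode_rows is a name invented for illustration.

```python
import csv
import io
from itertools import product


def explode_rows(csv_text):
    """Expand cells holding comma-separated lists into one row per combination.

    Assumes the first line is a header, which is copied through unchanged.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # header row
    for row in reader:
        # Split every cell on commas; single-valued cells become 1-item lists.
        choices = [[part.strip() for part in cell.split(",")] for cell in row]
        for combo in product(*choices):
            writer.writerow(combo)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "id,gender,garment-color\n"
        "1,male,white\n"
        '2,"male,female",black\n'
        '3,female,"red,pink"\n'
    )
    print(explode_rows(sample), end="")
```

With the question's three sample rows this produces the five rows requested. Note that a row with 2 genders and 4 colors would become 8 rows, which is exactly the design decision raised above.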

How to change the field name in BigQuery?

I have a nested JSON to upload to BigQuery.
{
"status":{
"sleep":"12333",
"wake":"3837"
}
}
After inserting it into BigQuery, I am getting the field names as:
status_sleep and status_wake
I require the field names to be separated by a delimiter like '.' or any other delimiter:
status.sleep and status.wake
Please suggest how to set the field delimiter. I checked, and there is a field delimiter key for uploading data in CSV format.

After you insert data with the above schema, you have a record named status with two fields in it: status.sleep and status.wake.
When you query with
SELECT * FROM yourtable
without providing aliases, the output columns are named status_sleep and status_wake, because dot notation is reserved for referencing nested data.
You can still reference your data with dots, as below:
SELECT status.sleep as sleep, status.wake as wake FROM yourtable

How to replace all occurrences of matching string in a database table using ColdFusion

I'm working with an MS Access database and one particular table. Scattered throughout the table, at varying positions in date columns (which can themselves be in varying orders as a result of the data import), is the text "Not known". I want to replace every occurrence of that string across the whole table.
The only way I can think of is to export to CSV, do a REReplace, and import the data again, but I would like to know if there is a 'slicker' way.
The columns come from a CSV import, so they are all text; they can contain a mix of date strings, text, numbers (as strings) and nulls.

You can use replace; it follows the basic T-SQL implementation:
http://msdn.microsoft.com/en-us/library/ms186862.aspx
Here is an example that updates the customers table of the Northwind sample database:
update customers set Customers.[Job Title] = replace( Customers.[Job Title], 'Purchasing', 'Manufacturing');
Distilled into a generic example:
update TABLENAME set FIELD =
replace( FIELD, 'STRING_TO_REPLACE', 'STRING_TO_REPLACE_WITH' )
That updates every row of the table in one statement. Be careful ;)

You can do this in Access itself by running the Edit > Replace command. If you need to do it in code, open a recordset, loop through the records, and for each field run:
rst.fields(i)=replace(rst.fields(i),"Not known","Something")
This is how it works in VBA; I believe you can do something similar in ColdFusion.

Why not just open the CSV file in Notepad++ (or similar) and do a Find/Replace?
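The export-and-replace route can also be scripted instead of done by hand. This is a minimal sketch, assuming the table has been exported to CSV and that only cells consisting entirely of "Not known" should change; replace_cells and its default replacement (an empty string) are illustrative choices, not something the answers above prescribe.

```python
import csv
import io


def replace_cells(csv_text, old="Not known", new=""):
    """Replace cells equal to `old` with `new`, leaving other cells as-is."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([new if cell == old else cell for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    print(replace_cells("a,Not known,3\nNot known,b,Not known here\n"), end="")
```

Matching whole cells rather than substrings avoids touching longer text that merely contains the phrase, which a plain editor Find/Replace would also rewrite.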
