Handling delimiters in Apache Pig

I have a comma-separated value file.
Data example:
1001,Laptop,beautify,laptop amazing price,<HTML>XYZ</HTML>,1345
1002,Camera,Best Mega Pixel,<HTML>ABC</HTML>,4567
1003,TV,Best Price,<HTML>DEF</HTML>,8791
We have only 5 columns: id, Device, Description, HTML Code, Identifier.
For a few of the records there is an extra comma in the Description column.
For example, the first record in the sample data above has the extra comma in [beautify,laptop amazing price], which I want to eliminate.
While loading the data into Pig:
INFILE1 = LOAD 'file1.csv' USING PigStorage(',') AS (id, Device, Description, HTMLCode, Identifier);
the extra comma shifts the fields and creates a data issue.
Could you please suggest how to handle this data issue in a Pig script?

If the file is a correct CSV, the field that contains the comma should be wrapped in double quotes at the beginning and the end. Then you just have to load your data using CSVLoader: https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/CSVLoader.html.
REGISTER 'piggybank.jar';
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
INFILE1 = LOAD 'file1.csv' USING CSVLoader() AS (id, Device, Description, HTMLCode, Identifier);
If you don't have any double quotes, you could try a regex instead, using the fact that the field after the description always starts with "<" (see the REGEX_EXTRACT_ALL function in Pig: https://pig.apache.org/docs/r0.11.1/func.html#regex-extract-all). Tell me if you need more info.
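If rewriting the file up front is an option, the same regex idea can also be applied as a preprocessing step before the LOAD. Below is a minimal sketch in Python (not Pig; the file names and the assumption that every row is id, device, free-text description, <HTML>...</HTML>, identifier are mine), which re-quotes the description so that PigStorage(',') or CSVLoader sees exactly five fields:

import csv
import re

# Hypothetical repair pass: capture the five logical fields, letting the
# description soak up any extra commas, then rewrite the row with proper
# CSV quoting so downstream loaders keep it as one field.
row_pattern = re.compile(r'^(\d+),([^,]+),(.*),(<HTML>.*</HTML>),(\d+)$')

with open('file1.csv') as src, open('file1_fixed.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:
        match = row_pattern.match(line.rstrip('\n'))
        if match:
            writer.writerow(match.groups())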

Related

Bigquery - Handle Double Quotes & Pipe Field Separator in CSV (Federated Table)

I am currently facing issues loading data into BigQuery, or even creating a federated table, because the incoming data is delimited by the | pipe symbol with escaped double quotes on fields inside the file.
Sample data (I also tried escaping the double-quote values with doubled double quotes, i.e. "", at the field level):
13|2|"\"Jeeps, Trucks & Off-Roa"|"JEEPSTRU"
Create DDL
CREATE OR REPLACE EXTERNAL TABLE `<project>.<dataset>.<table>`
WITH PARTITION COLUMNS (
dt DATE
)
OPTIONS (
allow_jagged_rows=true,
allow_quoted_newlines=true,
format="csv",
skip_leading_rows=1,
field_delimiter="|",
uris=["gs://path/to/bucket/table/*"],
hive_partition_uri_prefix='gs://path/to/bucket/table'
)
Query
SELECT
*
FROM
`<project>.<dataset>.<table>`
WHERE field_ like '%Jeep%'
Error
Error while reading table: <project>.<dataset>.<table>, error message: Error detected while parsing row starting at position: 70908. Error: Data between close double quote (") and field separator.
However, it works if I create the table with the quote option set to an empty character, quote="", which makes it hard to filter on in the SQL query.
I need the field_ data to be loaded as "Jeeps, Trucks & Off-Roa
I tried to find various documentation & StackOverflow questions, but since everything is old or not working (or I was unlucky), I am posting this question again.
I have a very basic question: what is the better way to escape double quotes in a column of a federated BigQuery table, to avoid this problem without preprocessing the raw CSV/PSV data?
This is not a problem with the external table or BigQuery, but rather a feature of CSV files. I ran into something similar once when I uploaded data to a table in the UI. I found some sources (which, by the way, I cannot find right now) saying that double quotes should be used twice ("") in a CSV file to get this behavior, like so with your example:
13|2|"""Jeeps, Trucks & Off-Roa"|"JEEPSTRU"
I tested it with your sample. When I loaded the data into a table from the CSV, I got the same error, and after using the above it worked as expected. The resulting field value is:
"Jeeps, Trucks & Off-Roa
I suppose it will work for you as well.
EDIT: I have found it in Basic Rules of CSV on Wikipedia:
Each of the embedded double-quote characters must be represented by a pair of double-quote characters.
1997,Ford,E350,"Super, ""luxurious"" truck"
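If the files come out of your own export step, the doubling can also be produced automatically rather than by hand. Here is a minimal Python sketch (the row values and output path are only illustrative); csv.writer with the default doublequote=True writes embedded quotes as "", which is what BigQuery's default quote character " expects:

import csv

# Illustrative row mirroring the sample data; the third field contains both
# an embedded double quote and a comma.
rows = [[13, 2, '"Jeeps, Trucks & Off-Roa', 'JEEPSTRU']]

with open('table_part.psv', 'w', newline='') as f:
    # delimiter='|' matches field_delimiter="|" in the DDL above; embedded
    # double quotes are written doubled ("") instead of backslash-escaped.
    writer = csv.writer(f, delimiter='|', quotechar='"', doublequote=True)
    writer.writerows(rows)
# Produces: 13|2|"""Jeeps, Trucks & Off-Roa"|JEEPSTRU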

Amazon Redshift error using COPY from CSV - line feed inside quotes

I'm using COPY to import MySQL data into my Redshift database. I've run into an issue where I have JSON data in a table and it fails to COPY, saying "Delimited value missing end quote".
So I start digging into this, and I experiment a little. I made a very basic table to test this out, called test, like so:
CREATE TABLE test (cola varchar(1000), colb varchar(1000));
I then use the COPY command to populate this table from a file called test.csv that I have in an S3 bucket. If the file looks like this, it works:
"{
""contactInfo"": [
""givenName"",
""familyName"",
""fullName"",
""middleNames"",
""suffixes"",
""prefixes"",
""chats"",
""websites""
]}", "a"
If it looks like this, it fails:
"a", "{
""contactInfo"": [
""givenName"",
""familyName"",
""fullName"",
""middleNames"",
""suffixes"",
""prefixes"",
""chats"",
""websites""
]}"
So, if my JSON data is in the first column, COPY ignores the line feed inside the QUOTE. If it is in the second column or later, it sees the line feed as the end of the line of data.
For the record, I am not setting QUOTE AS, I am letting it default to ", which is why I double up the " chars in the file.
Anyone have any idea why this is happening, and how I can fix it? I can't always move the data to the first column; I don't always know where it is, and there may be more than one column of JSON data.
Edit:
For the record, I have tried this with a simple linefeed inside a string, no JSON data, and am running into the same problem.
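One workaround to experiment with (a hedged sketch, not something from this thread) is to flatten the line feeds inside the quoted fields before running COPY; Python's csv.reader parses the embedded newlines correctly whichever column they are in, so the file can be rewritten with single-line values:

import csv

# Rewrite test.csv with the embedded line feeds replaced by a literal \n
# marker, so every record sits on one physical line for COPY.
# skipinitialspace handles the space after the comma in the sample rows.
with open('test.csv', newline='') as src, \
        open('test_flat.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src, skipinitialspace=True):
        writer.writerow([field.replace('\n', '\\n') for field in row])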

Unable to import 3.4GB CSV into Redshift because values contain free text with commas

And so we found a 3.6GB CSV that we have uploaded to S3 and now want to import into Redshift, then do the querying and analysis from IPython.
Problem 1:
This comma-delimited file contains free-text values that also contain commas, and this interferes with the delimiting, so we can't upload it to Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly puts the values into columns correctly.
Problem 2:
A column that is supposed to contain integers has some records containing letters to indicate some other scenario.
So the only way to get the import through is to declare this column as varchar. But then we can't do calculations later on.
Problem 3:
The datetime data type requires the date time value to be in the format YYYY-MM-DD HH:MM:SS, but the csv doesn’t contain the SS and the database is rejecting the import.
We can’t manipulate the data on a local machine because it is too big, and we can’t upload onto the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running iPython all the way up so that we can read the big csv directly from S3, but this approach doesn’t make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try using a different delimiter or use escape characters.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
For the second issue, if you want to extract only the numbers from the column after loading it into a char field, use regexp_replace or other functions.
For the third issue, you can also load it into a VARCHAR field and then use string functions, e.g.
cast(left(column_name, 10)||' '||right(column_name, 6)||':00' as timestamp)
to load it into the final table from the staging table.
For the first issue, you need to find a way to differentiate between the two types of commas: the delimiters and the commas in the text. Once you have done that, replace the delimiters with a different delimiter and use that as the delimiter in the Redshift COPY command (a quote-aware rewrite of the file is sketched below).
For the second issue, you need to first figure out whether this column needs to be available for numerical aggregations once loaded. If yes, you need to get the data cleaned up before loading. If no, you can directly load it as a char/varchar field. All your queries will still work, but you will not be able to do any aggregations (sum/avg and the like) on this field.
For problem 3, you can use the Text(date, "yyyy-mm-dd hh:mm:ss") function in Excel to do a mass replace on this field.
Let me know if this works out.
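Here is a minimal sketch of that re-delimiting pass in Python, assuming the file parses with standard CSV quoting (which the Excel behaviour above suggests); the column index for the datetime field is hypothetical, and the same pass pads the missing seconds from problem 3:

import csv

# Rewrite the comma-delimited file as pipe-delimited so free-text commas no
# longer collide with the field separator; then COPY ... DELIMITER '|'.
DATETIME_COL = 4  # hypothetical position of the 'YYYY-MM-DD HH:MM' column

with open('stack_overflow_train.csv', newline='') as src, \
        open('stack_overflow_train.psv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='|')
    for row in csv.reader(src):
        if len(row) > DATETIME_COL and len(row[DATETIME_COL]) == 16:
            row[DATETIME_COL] += ':00'  # append the missing seconds
        writer.writerow(row)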

Flat File to retain commas from SQL data

I'm importing a SQL view to SSIS using the Flat File Connection Manager. One of my columns in SQL has comma(s) in it (123 Main St, Boston, MA). When I import the data to SSIS, the commas within the column are being treated as delimiters, and my column is being broken into several columns. I have done a lot of research online and have followed some workarounds, which aren't working for me.
In SQL Server, I added double quotes around the values that have comma(s) in it.
' "'+CAST(a.Address as varchar(100))+'" '
So, 123 Main St, Boston, MA now reads “123 Main St, Boston, MA”
Then in my SSIS Flat File Connection Manager,
In the General tab:
Text Qualifier is set to “
Header Row Delimiter is set to {CR}{LF}
In the columns tab:
Row delimiter is set to {LF}
Column delimiter is set to Comma {,}
And in the advanced Tab, all of my columns have the Text Qualified set to True.
After all of this, my column with commas in it is still being separated into multiple columns. Am I missing a step? How can I get the SSIS package to treat my address column as one column and not break it out into several columns?
EDIT: Just to add more specifics: I am pulling from a SQL view that has double quotes around any field that has commas in it. I am then emailing that file and opening it in MS Excel. When I open the file, it reads as follows:
123 Main St Boston MA" " (In three cells)
And I need it to read as
123 Main St, Boston, MA (in one cell)
Have a look at this: Commas within CSV Data
If there is a comma in a column, then that column should be surrounded by a single quote or double quote. Then, if inside that column there is a single or double quote, it should have an escape character before it, usually a \.
Example format of CSV
ID - address - name
1, "Some Address, Some Street, 10452", 'David O\'Brian'
Replace the comma delimiter with another unique delimiter that does not appear anywhere inside the values, like a vertical bar (|).
Change the column delimiter to this new delimiter, and set the text qualifier to a double quote (").
You can automate the replacement using a Script Task before the Data Flow Task. You can use the replace script from here.
Also have a look at these resources:
Fixing comma problem in CSV file in SSIS
How to handle extra comma inside double quotes while processing a CSV file in SSIS
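As a quick sanity check on what the text-qualified file should contain, here is a small Python sketch (the file name and row values are only illustrative); the double quotes sit directly against the column delimiters, with no padding spaces, which is what the text qualifier setting above expects:

import csv

# Write one text-qualified row: every field wrapped in double quotes, with
# the qualifier immediately next to the delimiter and no extra spaces, so
# the address stays in a single column when read back.
with open('addresses.csv', 'w', newline='') as f:
    writer = csv.writer(f, quotechar='"', quoting=csv.QUOTE_ALL)
    writer.writerow(['1', '123 Main St, Boston, MA', 'Boston Office'])
# File contents: "1","123 Main St, Boston, MA","Boston Office"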
I ended up recreating the package, using the same parameters that are listed in my question. I also replaced this
' "'+CAST(a.Address as varchar(100))+'" '
with this in my SQL view
a.Address
And it now runs as desired. Not sure what was going on there. Thanks to everyone for their comments and suggestions.

How can I quickly reformat a CSV file into SQL format in Vim?

I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quotes) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes has commas in it, making it difficult to key in on the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file, with 131,246 lines, that I got off of data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
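To make the library approach concrete, here is a minimal Python sketch (the table name and output path are mine, and the numeric check is a simplification); the csv module handles the embedded commas in "DEL RIO, PARKS" and "412,007.74", so nothing has to be counted by hand:

import csv

def sql_literal(field):
    # Leave bare numbers unquoted; single-quote (and escape) everything else,
    # which covers dates and free text containing commas.
    try:
        float(field)
        return field
    except ValueError:
        return "'" + field.replace("'", "''") + "'"

# Hypothetical table name; adjust the column list/order to the real schema.
with open('fema.csv', newline='') as src, open('fema.sql', 'w') as dst:
    for row in csv.reader(src):
        dst.write('INSERT INTO fema_declarations VALUES ('
                  + ', '.join(sql_literal(f) for f in row) + ');\n')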
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: after a comma, look for either a double-quoted run of characters or (the \| alternation) a run containing no commas or double quotes, and wrap that run in single quotes, keeping the comma.
The second command does the same for the first column in each row, anchoring at the start of the line instead of at a comma.
Try the csv plugin. It allows you to convert the data into other formats. The help includes an example of how to convert the data for importing it into a database.
Just to bring this to a close, I ended up using @Eric Andres' idea, which was the MySQL LOAD DATA option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out not to be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all of the data has to be in place within the file before the load, or the load will not work.
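For reference, that prepend step can be as small as the following (a hedged reconstruction in Python, not the exact script that was used):

# Prepend a sequential ID to every line so LOAD DATA has a value for the
# added "ID" column, which is not filled in automatically during the load.
with open('file.csv') as src, open('file_with_ids.csv', 'w') as dst:
    for line_number, line in enumerate(src, start=1):
        dst.write(f'{line_number},{line}')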
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.