I have several CSV files to combine into one table (the files have the same structure), but the file structure is messed up enough to be problematic.
The first row is ordinary, just headers split by a comma:
Account,Description,Entity,Risk,...
but then the rows with actual data start and end with a double quote ("), columns are separated by commas, and people's full names have two double quotes at the beginning and end. I understand that it's an escape character to keep the name in one column, but one would be enough.
"1625110,To be Invoiced,587,Normal,""Doe, John"",..."
So what I need to do, and don't know how, is remove the " from the beginning and end of every data row and replace "" with " in every line with data.
I need to do it in Power Query because there will be more similar CSV files over time and I don't want to clean them manually.
Any ideas?
I was trying something simple:
= Table.AddColumn(#"Removed Other Columns", "Custom", each Csv.Document(
    [Content],
    [
        Delimiter = ",",
        QuoteStyle = QuoteStyle.Csv
    ]
))
Try loading to a single column first, replace values to remove extra quotes, and then split by ",".
Here's what that looks like for loading a single file:
let
    Source = Csv.Document(File.Contents("filepath\file.csv"),[Delimiter="#(tab)"]),
    ReplaceQuotes = Table.ReplaceValue(Source,"""""","""",Replacer.ReplaceText,{"Column1"}),
    SplitIntoColumns = Table.SplitColumn(ReplaceQuotes, "Column1", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv)),
    #"Promoted Headers" = Table.PromoteHeaders(SplitIntoColumns, [PromoteAllScalars=true])
in
    #"Promoted Headers"
I used the tab delimiter to keep it from splitting in the first step.
I have a directory full of CSVs. A script I use loads each CSV via a Loop and corrects commonly known errors in several columns prior to being imported into an SQL database. The corrections I want to apply are stored in a JSON file so that a user can freely add/remove any corrections on-the-fly without altering the main script.
My script works fine for one value correction per column per CSV. However, I have noticed that two or more columns per CSV now contain additional errors, and more than one correction per column is now required.
Here is the relevant code:
import json
import re
import glob as gl
import pandas as pd

with open('lookup.json') as f:
    translation_table = json.load(f)

for filename in gl.glob("(Compacted)_*.csv"):
    df = pd.read_csv(filename, dtype=object)
    # ... Some other enrichment...
    # Extract the file "key" with a regular expression (regex)
    filekey = re.match(r"^\(Compacted\)_([A-Z0-9-]+_[0-9A-z]+)_[0-9]{8}_[0-9]{6}.csv$", filename).group(1)
    # Use the translation table to apply any error fixes
    if filekey in translation_table["error_lookup"]:
        tablename = translation_table["error_lookup"][filekey]
        df[tablename[0]] = df[tablename[0]].replace({tablename[1]: tablename[2]})
    else:
        pass
And here is the lookup.json file:
{
    "error_lookup": {
        "T7000_08": ["MODCT", "C00", -5555],
        "T7000_17": ["MODCT", "C00", -5555],
        "T7000_20": ["CLLM5", "--", -5555],
        "T700_13": ["CODE", "100T", -5555]
    }
}
For example if a column (in a CSV that includes the key "T7000_20") has a new erroneous value of ";;" in column CLLM5, how can I ensure that values that contain "--" and ";;" are replaced with "-5555"? How do I account for another column in the same CSV too?
Can you change the JSON file? The example below would edit Column A (old1 → new1 and old2 → new2) and would make similar changes to Column B:
{"error_lookup": {"T7000_20": {"colA": ["old1", "new1", "old2", "new2"],
                               "colB": ["old3", "new3", "old4", "new4"]}}}
The JSON parsing gets a bit more complex, in order to handle both the current use case and the new requirements.
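For example, here is a minimal sketch of the lookup loop with the nested structure (the data frame and key below are placeholders; in the real script the table would still come from lookup.json):

import pandas as pd

# the reshaped lookup from above, inlined so the example is self-contained;
# in the real script it would still come from json.load(f)
translation_table = {
    "error_lookup": {
        "T7000_20": {"colA": ["old1", "new1", "old2", "new2"],
                     "colB": ["old3", "new3", "old4", "new4"]}
    }
}

# hypothetical frame standing in for one (Compacted)_*.csv file
df = pd.DataFrame({"colA": ["old1", "ok", "old2"], "colB": ["old3", "old4", "ok"]})
filekey = "T7000_20"

if filekey in translation_table["error_lookup"]:
    for column, pairs in translation_table["error_lookup"][filekey].items():
        # pairs is a flat [old, new, old, new, ...] list; zip it into a mapping
        mapping = dict(zip(pairs[0::2], pairs[1::2]))
        df[column] = df[column].replace(mapping)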
536381,22411,JUMBO SHOPPER VINTAGE RED PAISLEY,10,12/1/2010 9:41,1.95,15311,United Kingdom
"536381,82567,""AIRLINE LOUNGE,METAL SIGN"",2,12/1/2010 9:41,2.1,15311,United Kingdom"
536381,21672,WHITE SPOT RED CERAMIC DRAWER KNOB,6,12/1/2010 9:41,1.25,15311,United Kingdom
These lines are examples of rows in a CSV file.
I'm trying to read it in Databricks, using:
df = spark.read.csv ('file.csv', sep=',', inferSchema = 'true', quote = '"')
but the line in the middle, and other similar ones, is not getting split into the right columns because of the comma within the string. How can I work around it?
Set the quote to:
'""'
df = spark.read.csv('file.csv', sep=',', inferSchema = 'true', quote = '""')
It looks like your data has double quotes - so when it's being read it sees the double quotes as being the start and end of the string.
Edit: I'm also assuming the problem comes in with this part:
""AIRLINE LOUNGE,METAL SIGN""
This is not only related to Excel; I have the same issue when retrieving data from a source into Azure Synapse. The comma within one column causes the process to enclose the entire column's data in double quotes, and the double quotes already present get doubled, as shown above in the second line (see Retrieve CSV format over https).
I'm currently playing around with phpMyAdmin and I have encountered a problem. When importing my CSV into phpMyAdmin it's rounding the numbers. I have set the column to be a float and the column in Excel to be a Number (Also tried text/General) to no avail. Has anyone else encountered this issue and found a viable work-around?
A second question: is it possible for me to upload the CSV file so that the Excel column names are matched to the phpMyAdmin column names and the data is entered into the correct columns?
Your file should look like this (decimal fields are of General type):
[screenshot: xlssheet]
Save it as CSV. The file will probably be saved with ; as the separator.
This is for a new table:
Open phpMyAdmin, choose your database, click Import and select the file to upload.
Change the format to CSV if it is not already selected.
In the format-specific options, change "Columns separated with:" to ;
Be sure that the checkbox "The first line of the file contains the table column names (if this is unchecked, the first line will become part of the data)" is SELECTED.
Click Go.
A new table will be created with the structure according to the first line of the CSV.
This is for an existing table:
Open phpMyAdmin, choose your database, CHOOSE YOUR TABLE which matches the structure of the imported file, click Import and select the file to upload.
Change the format to CSV if it is not already selected.
In the format-specific options, change "Columns separated with:" to ;
Change the number of queries/rows to skip to 1 (this will skip the first line with the column names).
Click Go.
The selected table, which has the same structure as the CSV, will be updated and the rows from the CSV inserted.
// connecting dB
$mysqli = new mysqli('localhost','root','','testdB');
// opening csv
$fp = fopen('data.csv','r');
// read the first line and build a backtick-quoted column list for the query
$cols = fgetcsv($fp);
$col_ins = '`' . implode('`, `', $cols) . '`';
// reading the next lines and inserting each into the dB
while($data = fgetcsv($fp)){
    // escape every value and wrap it in quotes, then join with commas
    $values = array();
    for($field = 0; $field < count($data); $field++){
        $values[] = "'" . $mysqli->real_escape_string($data[$field]) . "'";
    }
    $data_ins = implode(', ', $values);
    $query = "INSERT INTO `table_name` (" . $col_ins . ") VALUES (" . $data_ins . ")";
    $mysqli->query($query);
}
echo 'Imported...';
I've had the same issue.
I solved it by changing the separator between the integer part and the decimal part from a comma to a point, i.e. 365,40 to 365.40.
That worked for me.
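If you would rather not edit the file by hand each time, here is a small pre-processing sketch (assuming the file is ;-separated as described above, pandas is available, and the file names are placeholders):

import pandas as pd

# parse 365,40-style decimals as floats, then write them back with a point
df = pd.read_csv('data.csv', sep=';', decimal=',')
df.to_csv('data_fixed.csv', sep=';', index=False)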
...I really thought this would be a well-traveled path.
I want to create the DDL statement in Hive (or SQL for that matter) by inspecting the first record in a CSV file that exposes (as is often the case) the column names.
I've seen a variety of near answers to this issue, but not too many that can be automated or replicated at scale.
I created the following code to handle the task, but I fear that it has some issues:
#!/usr/bin/python
import sys
import csv

# get file name (and hence table name) from command line
# exit with usage if no suitable argument
if len(sys.argv) < 2:
    sys.exit('Usage: ' + sys.argv[0] + ': input CSV filename')
ifile = sys.argv[1]

# emit the standard invocation
print 'CREATE EXTERNAL TABLE ' + ifile + ' ('

with open(ifile + '.csv') as inputfile:
    reader = csv.DictReader(inputfile)
    for row in reader:
        k = row.keys()
        sprung = len(k)
        latch = 0
        for item in k:
            latch += 1
            dtype = '` STRING' if latch == sprung else '` STRING,'
            print '`' + item.strip() + dtype
        break

print ')\n'
print "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
print "LOCATION 'replacethisstringwith HDFS or S3 location'"
The first is that it simply datatypes everything as a STRING. (I suppose that coming from CSV, that's a forgivable sin. And of course one could doctor the resulting output to set the datatypes more accurately.)
The second is that it does not sanitize the potential column names for characters not allowed in Hive table column names. (I easily broke it immediately by reading in a data set where the column names routinely had an apostrophe as data. This caused a mess.)
The third is that the data location is tokenized. I suppose with just a little more coding time, it could be passed on the command line as an argument.
My question is -- why would we need to do this? What easy approach to doing this am I missing?
(BTW: no bonus points for referencing the CSV Serde - I think that's only available in Hive 14. A lot of us are not that far along yet with our production systems.)
Regarding the first issue (all columns are typed as strings), this is actually the current behavior even if the table were being processed by something like the CSVSerde or RegexSerDe. Depending on whether the particulars of your use case can tolerate the additional runtime latency, one possible approach is to define a view based upon your external table that dynamically recasts the columns at query time, and direct queries against the view instead of the external table. Something like:
CREATE VIEW my_view AS
SELECT
    CAST(col1 AS INT)    AS col1,
    CAST(col2 AS STRING) AS col2,
    CAST(col3 AS INT)    AS col3,
    ...
FROM my_external_table;
For the second issue (sanitizing column names), I'm inferring your Hive installation is 0.12 or earlier (0.13 supports any unicode character in a column name). If you import the re regex module, you can perform that scrubbing in your Python with something like the following:
for item in k:
    ...
    print '`' + re.sub(r'\W', '', item.strip()) + dtype
That should get rid of any non-alphanumeric/underscore characters, which was the pre-0.13 expectation for Hive column names. By the way, I don't think you need the surrounding backticks anymore if you sanitize the column name this way.
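To illustrate on a few made-up header values:

import re

# hypothetical header values pulled from a CSV first row
raw_columns = ["Account No.", "Customer's Name", "Risk %"]

# strip anything that is not a letter, digit, or underscore
clean_columns = [re.sub(r'\W', '', name.strip()) for name in raw_columns]
# clean_columns is now ['AccountNo', 'CustomersName', 'Risk']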
As for the third issue (external table location), I think specifying the location as a command line parameter is a reasonable approach. One alternative may be to add another "metarow" to your data file that specifies the location somehow, but that would be a pain if you are already sitting on a ton of data files - personally I prefer the command line approach.
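For example, a minimal tweak to the question's script (the extra positional argument is just one way you might pass it):

import sys

# accept the HDFS/S3 location as a second command line argument
if len(sys.argv) < 3:
    sys.exit('Usage: ' + sys.argv[0] + ': input CSV filename, table location')
ifile = sys.argv[1]
location = sys.argv[2]

# ... emit the column list exactly as before, then:
print "LOCATION '" + location + "'"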
The Kite SDK has functionality to infer a CSV schema with the names from the header record and the types from the first few data records, and then create a Hive table from that schema. You can also use it to import CSV data into that table.
I was trying to load a .csv file into my database. It is a comma-delimited file, and for one of the columns there is a comma in the middle of the data, just like Texas,Houston. Can someone help me figure out how to get rid of the comma in between? The package I have created recognizes the value after the comma as a new column, but it should not be like that. Can any of you guys help me with this? I was getting an error in the Flat File Source itself. I thought of using a Derived Column, but the package is failing at the source point itself.
Well, some "comma" delimited files have ,"something or other", when there is a string and only use ,numeric_value, when it's a number. If your file is like this, then you can preprocess it: change ," to some (other) rare character, and similarly ", then replace any , that occurs between the two rare characters. Or you can count the commas in each line and, if the count is greater than the number of delimited columns, manually process the exceptions.
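A rough sketch of that preprocessing idea in Python (assuming text fields really are wrapped like ,"...", and that the marker character never appears in the data; the file names are placeholders):

import re

RARE = '\x01'  # any character that never appears in the data

with open('input.csv') as src, open('cleaned.csv', 'w') as dst:
    for line in src:
        # mark the opening and closing quote of each quoted text field
        line = line.replace(',"', ',' + RARE).replace('",', RARE + ',')
        # drop any comma that falls between the two markers
        line = re.sub(RARE + '[^' + RARE + ']*' + RARE,
                      lambda m: m.group(0).replace(',', ''), line)
        # remove the markers (and with them the original quotes)
        dst.write(line.replace(RARE, ''))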