Strip whitespace from a large pandas DataFrame - MySQL

I have what I am hoping is an easy thing ;)
I load a large dataset into a pandas DataFrame from an Excel file.
I then load that data into a MySQL table which has basically the same structure.
I have found that many of the cells in Excel have lots of trailing whitespace, which translates to whitespace in the columns of the table.
I am hoping there is a simple way to remove this whitespace before/while it is written to the database, without massive loops. (I know there is the 'strip' method, but I think I'd have to loop through all the fields/rows.)
Is there some simple switch that can be passed to read_excel or to_sql which would remove the whitespace automatically? (Likely daydreaming...)
I can also clean it up after it is loaded into the database, but I'm looking for any options before that.
Thanks!
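For reference, here is a minimal sketch of a vectorised approach that would avoid explicit row-by-row loops, assuming only the object (string) columns need cleaning; the file path, table name and connection string below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Read the Excel file (path is a placeholder)
df = pd.read_excel("source.xlsx")

# Strip leading/trailing whitespace from every string column
# in one vectorised pass, without an explicit loop over rows.
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

# Write the cleaned frame to MySQL (connection details are placeholders)
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
df.to_sql("my_table", engine, if_exists="append", index=False)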

Related

Combine multiple Excel CSV documents into one slightly differently formatted sheet

I have several CSV files which can be combined into one using the code below:
@echo off
rem Concatenate every CSV file in the current folder into Union.csv
copy *.csv Union.csv
I would like to combine the CSV files into one, but the columns are slightly different. The template I use has a slightly different column format. I want to combine the original CSVs into one file in that format, for import into Unleashed.
Does anyone know how to do this?
The usual approach for this problem is to write a transformer that normalizes the structures. You can easily accomplish this task using ... well, nearly any proper (note: not bash/zsh/powershell/bat) programming language.
The typical approach would be to read each file, translate the headers (and data if necessary) to a unified format, and write the data out to a normalized CSV file.
This is the bread and butter of any number of programming 101 classes.
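For instance, a minimal Python sketch of such a transformer, assuming each source file has a header row; the master column list and file names are hypothetical placeholders:

import csv
import glob

# Hypothetical target column order for the unified file.
MASTER_COLUMNS = ["sku", "description", "quantity", "price"]

with open("Union.csv", "w", newline="", encoding="utf-8") as out:
    # Missing columns are written as empty strings (the restval default),
    # extra source columns are ignored.
    writer = csv.DictWriter(out, fieldnames=MASTER_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for path in glob.glob("*.csv"):
        if path == "Union.csv":
            continue  # don't re-read the output file
        with open(path, newline="", encoding="utf-8") as src:
            for row in csv.DictReader(src):
                writer.writerow(row)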

In which case should one use header in the csv files to be imported to neo4j?

Are there some cases in which having headers in the CSV files has a significant advantage over not having them?
I am not sure about this, but it seems that using headers is advantageous for huge data sets: https://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets
Why is this the case?
I think you may be a little confused. There are multiple ways to import CSV files into neo4j.
The [neo4j-import](https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/) tool, which you specifically ask about, requires headers, since they enable the tool to do its job.
On the other hand, the LOAD CSV Cypher clause supports but does not require headers. With LOAD CSV, I know of use cases in which NOT using headers is better. For example, in the non-header case each row of data is provided as a string collection -- which can be very convenient if you want to iterate through all the columns, or store a collection of contiguous columns. Also, if you do not have a fixed number of columns, having headers may not even make sense.
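As an aside, the same trade-off shows up when parsing CSVs with Python's csv module, which may make the distinction easier to picture; this is only an analogy, not the neo4j API, and the sample data is made up:

import csv
import io

data = "Alice,34,Sydney\nBob,28,Perth\n"

# Without headers: every row arrives as a plain list of strings,
# convenient for iterating over all columns or slicing a range of them.
for row in csv.reader(io.StringIO(data)):
    print(row)                      # e.g. ['Alice', '34', 'Sydney']

# With headers: each row becomes a mapping keyed by column name.
with_header = "name,age,city\n" + data
for row in csv.DictReader(io.StringIO(with_header)):
    print(row["name"], row["age"])  # e.g. Alice 34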

With SSIS, the flat file connection manager messes up CSV files that have "," in values in the range of thousands

I had data in a CSV file which, opened in Notepad, appears something like this:
In SSIS, when I try to load this file in a Delimited format, the data which appears in the preview gets messed up due to the commas which occur in the numeric values, e.g. in thousands and millions. The data looks something like this:
Is there any way in which this problem can be taken care of in the connection manager itself?
Thanks!
Use a Text Qualifier, as shown here:
This will take care of the columns that are wrapped in quotes. Sometimes it gets really bad with CSV data, and I've had to resort to script components doing some cleanup, but that's really rare.
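To illustrate what the text qualifier does, here is a small Python sketch with made-up sample data: fields wrapped in the qualifier can contain the delimiter without being split into extra columns.

import csv
import io

# Numeric values contain thousands separators and are wrapped
# in the text qualifier (double quotes).
raw = 'Region,Sales\nNorth,"1,234,567"\nSouth,"89,012"\n'

for row in csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'):
    print(row)
# ['Region', 'Sales']
# ['North', '1,234,567']   <- embedded commas preserved
# ['South', '89,012']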

Appending CSV files with columns in different orders

I need to regularly merge data from multiple CSV files into a single spreadsheet by appending the rows from each source file. Only OpenOffice/LibreOffice is able to read these UTF-8 CSV files, which have quote-delimited fields containing newline characters.
Now, each CSV file has column headings, but the order of the columns varies from file to file. Some files also have missing columns, and some have extra columns.
I have my master list of column names, and the order in which I would like them all to go. What is the best way to tackle this? LibreOffice gets the CSV parsing right (Excel certainly does not). Ultimately the files will all go into a single merged spreadsheet. Every row from each source file must be kept intact, apart from the column ordering.
The steps also need to be handed over to a non-technical third party eventually, so I am looking for an approach that will not offer too many non-expert technical hurdles.
Okay, I'm approaching this problem a different way. I have instead gone back to the source application (WooCommerce) to fix the export, so the spreadsheets list all the same columns, all in the same order, on every export. This does have other consequences that I need to follow up, such as managing patches and trying to get the changes accepted by the source project. But it does avoid having to append CSV files with mismatched columns, which seems to be a common issue that no one has any real solutions for (yes, I have searched, a lot).
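For anyone stuck with the mismatched columns, here is a minimal pandas sketch of appending against a master column list; the column names and file paths are hypothetical placeholders, and pandas handles the quoted fields containing newlines:

import glob
import pandas as pd

# Hypothetical master column list, in the desired output order.
MASTER_COLUMNS = ["order_id", "sku", "quantity", "price", "customer"]

frames = []
for path in glob.glob("exports/*.csv"):
    df = pd.read_csv(path, encoding="utf-8", dtype=str)
    # Reorder to the master list; missing columns become empty (NaN),
    # extra columns are dropped.
    frames.append(df.reindex(columns=MASTER_COLUMNS))

merged = pd.concat(frames, ignore_index=True)
merged.to_csv("merged.csv", index=False, encoding="utf-8")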

SSIS CSV File load to table

I have a problem loading a .CSV file, as the connection manager editor settings are beyond my knowledge.
When I load the .CSV file, up to 18 rows there is no problem; they load into the table.
However, from the 19th column the data is not partitioning correctly.
The row delimiter is {CR}{LF}
The column delimiter is Comma {,}
How can I partition the data correctly?
Any help?
Here are some ideas I have with no details.
What happens when you try to import the same .CSV file into Excel? Anything interesting around row 19?
Does there appear to be anything different about row 19?
If you delete row 19, what happens?
See, I bet you've thought of these things as well, and probably more, since you have the details. If you want anything more than superficial bad guesses, you'll have to provide a little detail.
I've found the CSV import to be a bit limited with regards to bad data. If you're having trouble with the 19th column, I would suggest figuring out why that column is failing. You can try telling the import task's error conditions to Ignore Errors with data truncation, etc., but that may not fix the issue.
I have often switched complicated or error-prone CSV imports to simply use an SSIS Script Task, and then just write my own code to parse out the CSV and handle bad data.
If it's not partitioning correctly, it might be something as trivial as one of your field values on row 19 containing a comma, thus throwing out the import by making that row seem to have more columns. If this is the case, I hope you can get a revised version of the CSV file - this time with a text qualifier set. If possible, use something like | rather than " as the qualifier so that it's less likely to appear in the field values.
Put the file in a text editor such as Notepad++ or TextPad and change the view to show control characters. You will probably find your culprit there.
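If a text editor isn't handy, a small Python sketch along the same lines can flag suspect rows; the file name is a placeholder, and the reported row numbers are approximate if fields contain embedded newlines:

import csv

# Control characters other than tab/CR/LF are suspicious in CSV data.
SUSPECT = {chr(c) for c in range(32)} - {"\t", "\r", "\n"}

with open("input.csv", newline="", encoding="utf-8", errors="replace") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    header = next(reader)
    expected = len(header)
    for row_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"Row {row_no}: {len(row)} fields, expected {expected}")
        for field in row:
            bad = [c for c in field if c in SUSPECT]
            if bad:
                print(f"Row {row_no}: control characters {bad!r} in {field!r}")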
Nothing unusual. When I paste it into Excel as one column and convert text to columns, there is no problem. But I can see in the SSIS preview that the field value where the problem starts has two square boxes and the data of the next row.
If anyone wants to see the file, let me know and I will e-mail it to you.