I have several CSV files which can be combined into one using the code below:
@echo off
copy *.csv Union.csv
I would like to combine the CSV files into one, but their columns differ slightly from what I need. The template I use has a slightly different column format, so I want to combine the original CSVs into a single file in that format, for import into Unleashed.
Does anyone know how to do this?
The usual approach for this problem is to write a transformer that normalizes the structures. You can easily accomplish this task using ... well, nearly any proper (note: not bash/zsh/powershell/bat) programming language.
The typical approach would be to read each file, translate the headers (and data if necessary) to a unified format, and write the data out to a normalized CSV file.
This is the bread and butter of any number of programming 101 classes.
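As a rough illustration of the read, translate, write approach described above, here is a minimal Python sketch using the standard `csv` module. The header names in `HEADER_MAP` and `TARGET_COLUMNS` are made up for the example, since the question does not show the real source or template columns, so they would need to be replaced with the actual ones.

```python
import csv

# Hypothetical mapping from source header names to the template's names.
HEADER_MAP = {"Item Code": "Product Code", "Desc": "Product Description"}
# Hypothetical column order of the import template.
TARGET_COLUMNS = ["Product Code", "Product Description", "Quantity"]


def normalize_row(row, header_map, target_columns):
    """Rename the keys of one row and lay it out in the target column order."""
    renamed = {header_map.get(name, name): value for name, value in row.items()}
    # Columns the source file lacks become empty strings; unknown extras are dropped.
    return {column: renamed.get(column, "") for column in target_columns}


def combine(paths, out_path, header_map=HEADER_MAP, target_columns=TARGET_COLUMNS):
    """Read every source CSV and append its normalized rows to one output file."""
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=target_columns)
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                for row in csv.DictReader(in_file):
                    writer.writerow(normalize_row(row, header_map, target_columns))
```

Note that if the output is written into the same folder as the sources, it has to be excluded from the list of input paths, otherwise a second run would read the previous result as a source.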
Related
I have what I am hoping is an easy thing ;)
I load a large dataset into a pandas DataFrame from an Excel file.
I then load that data into a MySQL table which has basically the same structure.
I have found that many of the cells in Excel have lots of trailing whitespace, which ends up as whitespace in the columns of the table.
I am hoping there is a simple way to remove this whitespace before or while it is written to the database, without massive loops. (I know there is the 'strip' method, but I think I'd have to loop through all the fields and rows.)
Is there some simple switch that can be passed to read_excel or to_sql which would remove whitespace automatically? (Likely daydreaming...)
I can also clean it up after it is loaded into the database, but I'm looking for any options before that.
Thanks!
Are there cases in which having headers in the CSV files has a significant advantage over not having them?
I am not sure about this, but it seems that using headers is advantageous for huge data sets: https://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets
Why is this the case?
I think you may be a little confused. There are multiple ways to import CSV files into neo4j.
The [neo4j-import](https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/) tool, which you specifically ask about, requires headers, since they enable the tool to do its job.
On the other hand, the LOAD CSV Cypher clause supports but does not require headers. With LOAD CSV, I know of use cases in which NOT using headers is better. For example, in the non-header case each row of data would be provided in a string collection -- which can be very convenient if you wanted to iterate through all the columns, or store a collection of contiguous columns. Also, if you do not have a fixed number of columns, having headers may not even make sense.
I am working on a data warehouse project where a lot of source systems deliver flat files, and we use SSIS to load these into our staging tables. We currently use the Flat File Source component.
However, after a while we need an extra column in one of the files, and from a given date the file specification changes to include that column. This happens quite frequently and over time we accumulate quite a lot of versions.
According to the answers I can find here and on the rest of the internet, the agreed method for handling this scenario seems to be to set up a new flat file source in a new, separate data flow for each version, to keep the ETL process re-runnable for old files.
The method is outlined here, for example: SSIS pkg with flat-file connection with fewer columns will fail
In our specific setup, new columns are always additions (old columns are never removed), and for logical reasons the new columns cannot be mandatory if we keep re-runnability for the older files in their separate data flows.
I don't think that creating a duplicate data flow that handles the same set of columns over and over again is a good answer for a data warehouse project like ours. I would prefer a source component that takes the latest file version and can mark columns as "not mandatory", delivering nulls if they are missing.
Is anybody aware of an SSIS flat file component that is more flexible in handling old file versions, or of a better solution to this problem?
I assume that such a component would need to approach the files on a named-column basis rather than with the existing left-to-right approach?
Any thoughts or suggestions are welcome!
The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.
You can arrange your flat file connection to return lines rather than individual columns by specifying only the row delimiter. Connect this to a flat file source component, which will output a single column per row. We now have a single row that represents one of the many file types that you are aware of. The next step is to determine which file type you have.
Consume the output from the flat file source with a script component. Pass in the single column and pass out the superset of all possible columns. We have lost the metadata normally gleaned from a file source, so you will need to build up the column name / type / size within the script component output types.
Within the script component, parse the line and break it into its component columns. You will have to perform a pattern match (maybe using System.Text.RegularExpressions.Regex.Match) to identify when a new column starts. Hopefully the file is well formed, which will aid you. Beware of quotes and commas within text columns.
You can now determine the file type from the number of columns you have, and default the missing columns. Set the row's output columns to pass out the constituent parts. You may want to attach a new column to your output to record the file type.
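The parsing and defaulting logic from these steps can be sketched as follows. In SSIS the script component would be written in C# or VB.NET, so this Python version only illustrates the idea. It swaps the regex match for Python's `csv` module to deal with quoted commas, and the column names in `ALL_COLUMNS` are hypothetical.

```python
import csv

# Hypothetical superset of columns, in the order they were added to the file spec.
ALL_COLUMNS = ["customer_id", "name", "country", "segment"]


def parse_line(line, all_columns=ALL_COLUMNS):
    """Split one raw line and pad it out to the full column set.

    Returns (column_count, row). The column count stands in for the file
    version, and columns missing from older versions are set to None.
    """
    # csv.reader copes with quoted fields that contain commas.
    fields = next(csv.reader([line]))
    if len(fields) > len(all_columns):
        raise ValueError(
            f"expected at most {len(all_columns)} columns, got {len(fields)}"
        )
    padded = fields + [None] * (len(all_columns) - len(fields))
    return len(fields), dict(zip(all_columns, padded))
```

For example, `parse_line('1,"Smith, John",SE')` reports three columns and leaves `segment` as `None`. Because this works one line at a time, it assumes that quoted fields never contain line breaks.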
The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.
I would not recommend that you undertake the above lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns, types and so on. However, it will provide you with a single data flow that handles every file version and can be extended as new columns are added.
I need to regularly merge data from multiple CSV files into a single spreadsheet by appending the rows from each source file. Only OpenOffice/LibreOffice is able to read the UTF-8 CSV file, which has quote-delimited fields containing newline characters.
Now, each CSV file has column headings, but the order of the columns varies from file to file. Some files also have missing columns, and some have extra columns.
I have my master list of column names, and the order in which I would like them all to go. What is the best way to tackle this? LibreOffice gets the CSV parsing right (Excel certainly does not). Ultimately the files will all go into a single merged spreadsheet. Every row from each source file must be kept intact, apart from the column ordering.
The steps also need to be handed over to a non-technical third party eventually, so I am looking for an approach that will not offer too many non-expert technical hurdles.
Okay, I'm approaching this problem a different way. I have instead gone back to the source application (WooCommerce) to fix the export, so the spreadsheets list all the same columns and all in the same order, on every export. This does have other consequences that I need to follow up, such as managing patches and trying to get the changes accepted by the source project. But it does avoid having to append the CSV files with mis-matched columns, which seems to be a common issue that no-one has any real solutions for (yes, I have searched, a lot).
My basic task is to import parts of data from one single file, into several different tables as fast as possible.
I currently have a file per table, and I manage to import each file into the relevant table by using the LOAD DATA syntax.
Our product received new requirements from a client: he no longer wants to send us multiple files, and instead wants to send a single file which contains all the original records.
I thought of several suggestions:
1. I may require the client to write a single row before each batch of lines in the file, naming the table the batch should be loaded into and the number of following lines to import.
e.g.
Table2,500
...
Table3,400
Then I could try to apply LOAD DATA to each such block of lines, discarding the row with the table name and line count. Is this feasible?
2. I may require each record to contain the table name as an additional attribute. I would then need to iterate over the records and insert them one by one, although I am sure that is much slower than LOAD DATA.
3. I may also pre-process the file using, for example, Java, and execute LOAD DATA as a statement in a for loop.
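To make the first suggestion (a header row before each block) concrete, a pre-processing step could look roughly like the Python sketch below. It assumes each header is exactly `TableName,count` and that the count refers to the lines that follow. The function name `read_blocks` is made up for illustration. Each yielded block could then be written to its own file and passed to LOAD DATA.

```python
def read_blocks(lines):
    """Yield (table_name, rows) for each 'Table,count' header and the rows after it."""
    iterator = iter(lines)
    for header in iterator:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between blocks
        table, _, count_text = header.partition(",")
        count = int(count_text)  # raises ValueError if the header is malformed
        rows = []
        for _ in range(count):
            try:
                rows.append(next(iterator).rstrip("\r\n"))
            except StopIteration:
                raise ValueError(
                    f"block {table!r} ended before {count} rows were read"
                ) from None
        yield table, rows
```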
I may require almost any format change I like, but it has to be one single file and the import must be fast.
(To be clear, what I call the table description is actually the name of a feature; I have decided that all files relevant to a feature should be saved in a separate table. This is transparent to the client.)
Which sounds like the best solution? Is there any other suggestion?
It depends on your data file. We're doing something similar and made a small Perl script that reads the data file line by line. If the line has the content we need (for example, it starts with table1,) we know that it belongs in table 1, so we print that line.
Then you can save that output either to a file or to a named pipe and use that with LOAD DATA.
This will probably perform much better than loading the data into temporary tables and from there into the new tables.
The perl script (but you can do it in any language) can be very simple.
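The original Perl script is not shown, so the following is only a Python sketch of the same idea. It assumes every line starts with the table name followed by a comma, groups the remaining text by table, and writes one file per table that LOAD DATA could then read. The function names are invented for the example.

```python
import os
from collections import defaultdict


def split_by_table(lines):
    """Group raw lines by the table name in their first comma-separated field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        table, _, rest = line.partition(",")
        groups[table].append(rest)
    return dict(groups)


def write_groups(groups, directory):
    """Write one file per table and return a mapping of table name to path.

    Assumes table names are trusted and safe to use in file names.
    """
    paths = {}
    for table, rows in groups.items():
        path = os.path.join(directory, f"{table}.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            handle.write("\n".join(rows) + "\n")
        paths[table] = path
    return paths
```

Holding everything in memory keeps the sketch short. For very large files it would be better to keep one open output handle per table and write each line as it is read.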
Another option is to define a single table, load all your data into it, and then use select-insert-delete to transfer the data from this table to your target tables. Depending on the total number of columns this may or may not be possible. If it is, you don't need to write an external Java program and can rely entirely on the database for loading your data, which can also be a cleaner and better optimized way of doing the job. You will most likely need an additional marker column holding the name of the target table. If so, this can be considered a variant of option 2 above.