Why does PyAthena generate a CSV and a CSV metadata file in the S3 location while reading a Glue table?

I started pulling Glue tables with PyAthena last week. One annoying thing I noticed is that with the code shown below, sometimes it works and returns a pandas DataFrame, but other times it creates a CSV and a CSV metadata file in the S3 folder where the physical data (Parquet) is stored and registered in Glue.
I know that using the pandas cursor may end up producing these two files, but I wonder whether I can access the data without them, since every time these two files are generated in S3 my read process fails.
Thank you!
import os
import pandas as pd
from pyathena import connect

# Credentials are picked up from the environment by boto3
access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')

connect1 = connect(s3_staging_dir='s3://xxxxxxxxxxxxx')
df = pd.read_sql("select * from abc.table_name", connect1)
df.head()

1. Go to the Athena console.
2. Click Settings -> workgroup name -> Edit workgroup.
3. Update "Query result location".
4. Check "Override client-side settings".
Note: If you have not set up any other workgroups for your Athena environment, you should only find one workgroup, named "primary".
This should resolve your problem. For more information you can read:
https://docs.aws.amazon.com/athena/latest/ug/querying.html
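On the PyAthena side, the same idea can be applied by pointing the staging directory at a dedicated query-results prefix instead of the table's data location. A minimal sketch, where the results bucket and region names are placeholders:

import pandas as pd
from pyathena import connect

# Athena writes a results .csv and a .csv.metadata file for every query into the
# staging directory, so keep that prefix separate from the table's Parquet data.
conn = connect(
    s3_staging_dir='s3://my-athena-query-results/',  # placeholder results bucket
    region_name='us-east-1',                         # placeholder region
    work_group='primary',                            # workgroup whose result location was updated
)
df = pd.read_sql("select * from abc.table_name", conn)

With the result files going to their own prefix, the Parquet folder behind the Glue table should stay untouched and subsequent reads of the table no longer pick up stray CSVs.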


Found more columns than expected column count in Azure data factory while reading CSV stored in ADLS

I am exporting F&O D365 data to ADLS in CSV format. Now I am trying to read the CSVs stored in ADLS and copy them into an Azure Synapse dedicated SQL pool table using Azure Data Factory. I can create the pipeline and it works for a few tables without any issue, but it fails for one table (salesline) because of a mismatch in the number of columns.
Below is a sample of the CSV format. There are no column names (header) in the CSV because it is exported from the F&O system; the column names are stored in the salesline.CDM.json file.
5653064010,,,"2022-06-03T20:07:38.7122447Z",5653064010,"B775-92"
5653064011,,,"2022-06-03T20:07:38.7122447Z",5653064011,"Small Parcel"
5653064012,,,"2022-06-03T20:07:38.7122447Z",5653064012,"somedata"
5653064013,,,"2022-06-03T20:07:38.7122447Z",5653064013,"someotherdata",,,,test1, test2
5653064014,,,"2022-06-03T20:07:38.7122447Z",5653064014,"parcel"
5653064016,,,"2022-06-03T20:07:38.7122447Z",5653064016,"B775-92",,,,,,test3
I have created an ADF pipeline using the Copy data activity to copy the data from ADLS (CSV) to the Synapse SQL table, however I am getting the below error.
Operation on target Copy_hs1 failed: ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'SALESLINE_00001.csv' with row number 4: found more columns than expected column count 6.,Source=Microsoft.DataTransfer.Common,'
The column mapping looks like below. Because the first row of the CSV has 6 columns, only 6 columns appear when importing the schema.
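For reference, a quick check of the exported file with Python's csv module (the file name is taken from the error message) makes the mismatch visible: the first row has 6 fields, while some later rows carry extra trailing fields.

import csv

# Count the fields in each row; the copy activity expects 6 columns
# (taken from the first row), but some rows have extra trailing fields.
with open('SALESLINE_00001.csv', newline='') as f:
    for row_number, row in enumerate(csv.reader(f), start=1):
        if len(row) != 6:
            print(f"row {row_number}: {len(row)} fields -> {row}")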
I reproduced this with your sample data and got the same error while copying the file using the Copy data activity.
Alternatively, I tried to copy the file using a data flow and was able to load the data without any errors.
Source file:
Data flow:
Source dataset: only the first 6 columns are read, as the first row of the file contains only 6 columns.
Source transformation: connect the source dataset in the source transformation.
Source preview:
Sink transformation: connect the sink to the Synapse dataset.
Settings:
Mappings:
Sink output:
After running the data flow, the data is loaded into the sink Synapse table.
Changing my CSV to XLSX helped me solve this problem with the Copy activity in ADF.
1. In the Copy data settings, set "Fault tolerance" = "Skip incompatible rows".
2. In the dataset connection settings, set the Escape character to double quote (").

How to convert dBASE III files to MySQL?

Is it possible to convert .DBF files to any other format?
Does anybody know of a script that can be used to convert .DBF files to a MySQL query?
It would also be fine to convert the DBF files to CSV files.
I always have problems with the encoding of the DBF files.
Konstantin
https://www.dbase.com/Knowledgebase/faq/import_export_data.asp
Q: How do I export data from a dBASE table to a text file?
A: Exporting data from dBASE to a text file is handled through the COPY TO command.
Like the APPEND FROM command, there are a number of ways to use this command. Here we are only interested in its most basic use. Once you understand how to use this command, you can go to your on-line help for further details on what can be accomplished with the COPY TO command.
In order to export data you must first be using the table from which the data will be exported. As before, you will be employing the USE command in the command window.
USE <tablename>
For example:
USE Mytest.dbf
Once the table is in use, all you need to do is type the following command in the command window:
COPY TO <filename> TYPE DELIMITED
For example:
COPY TO Myexport.txt TYPE DELIMITED
This would result in a file being created in the current directory called Myexport.txt which would be in the DELIMITED or *.CSV format.
If we had wanted to export the data in the *.SDF format, we would have typed:
COPY TO Myexport.txt TYPE SDF
This would result in a file being created in the current directory called Myexport.txt which would be in the System Delimited or *.SDF format.
Those are the basics on how to import and export text data into a dBASE table. For further information consult the on-line help for the APPEND FROM and COPY TO commands.
I converted old (circa 1997) DBF files to CSV using Python and the dbfread module.
After installing Python, install the dbfread module from a command prompt:
pip install dbfread
The module has many methods to read DBF files and excellent documentation.
Then a Python script does the job, or you can type it directly into the interpreter:
import csv
from dbfread import DBF

# Read the DBF file
table = DBF('C:/my_dbf_file.dbf', encoding='1252')

outFileName = 'C:/my_export.csv'
with open(outFileName, 'w', newline='', encoding='1252') as file:
    writer = csv.writer(file)
    writer.writerow(table.field_names)          # first line: column names
    for record in table:
        writer.writerow(list(record.values()))
Note that each record in the database is read and saved one at a time, and that the first line of the CSV file contains the column names.
Encoding can be problematic; Python's documentation lists the standard encodings you can try. The dbfread.DBF() method tries to guess the encoding but is not perfect, which is why the code above specifies the encoding parameter in both DBF() and open().
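If the end goal is MySQL rather than a CSV file, the same dbfread records can be written straight into a table. A minimal sketch, assuming the pymysql package, placeholder connection details, and an existing target table my_table whose columns match the DBF fields:

import pymysql
from dbfread import DBF

table = DBF('C:/my_dbf_file.dbf', encoding='1252')

# Placeholder connection details; the target table is assumed to exist already.
conn = pymysql.connect(host='localhost', user='root', password='secret', database='mydb')
cur = conn.cursor()
columns = ', '.join(f'`{name}`' for name in table.field_names)
placeholders = ', '.join(['%s'] * len(table.field_names))
sql = f'INSERT INTO my_table ({columns}) VALUES ({placeholders})'
cur.executemany(sql, [list(record.values()) for record in table])
conn.commit()
conn.close()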

Move Data from Blob to SQL Database

I am currently in the process of moving my sensor reading data from Azure Blob storage into a SQL database. I have multiple .csv files, and in those files I have various columns that hold the date (in the format 25/4/2017), time, sensor_location and sensor_readings.
My question: if I want to store the data according to their respective columns using Logic App, what steps should I take? And how do I push the second file's data into the rows after the first file's data? Thanks
You will need to either write a script (any high-level language with support or extensions for MySQL will do: Python, PHP, Node.js, etc.) to import your data, or use a MySQL client like Sequel Pro (https://www.sequelpro.com/), which can import CSV files.
Here is a link on how to insert data into MySQL with PHP:
http://php.net/manual/en/pdo.prepared-statements.php
You can read the csv file with:
$contents = file_get_contents('filename.csv');
$lines = explode("\n", $contents);
foreach ($lines as $line) {
    $row = str_getcsv($line);   // split the line into its fields
    // insert each row into MySQL here, e.g. with a PDO prepared statement
}
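Using Python (one of the other languages mentioned above), here is a minimal sketch of the scripted approach; the connection details, file paths, table and column names are placeholders, and the rows of the second file are simply appended after those of the first:

import csv
import glob
import pymysql

# Placeholder connection details and table layout: the readings table is assumed
# to have date, time, sensor_location and sensor_reading columns that accept strings.
conn = pymysql.connect(host='localhost', user='root', password='secret', database='sensors')
cur = conn.cursor()
sql = ('INSERT INTO readings (reading_date, reading_time, sensor_location, sensor_reading) '
       'VALUES (%s, %s, %s, %s)')

for path in sorted(glob.glob('downloads/*.csv')):
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)                        # skip the header row
        cur.executemany(sql, list(reader))  # rows append after earlier files' rows

conn.commit()
conn.close()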

Import an edge list in CSV format and plot a graph for the complete edge list

I have a data set in the form of an edge list (sample data not shown).
I want to load this CSV into Neo4j, create a one-to-one relationship between A and P, A and Q, and so on, and represent it graphically.
How can I do that using Neo4j?
I am only able to import the file; I am not able to create any relationships.
There is very little information to go on here. If you are using LOAD CSV to import, then post what you have tried. If all you have is an edge list, you need to create the nodes that do not already exist in the database, so use MERGE on the node identifier to reuse a node if it exists or create it if it doesn't.

CSV LOAD and updating existing nodes / creating new ones

I might be on the wrong track, so I could use some helpful input. I receive data from other systems as CSV files, which I can import into my DB with LOAD CSV. So far so good.
I got stuck when I needed to reload the CSV to pick up updates. I cannot delete the former data, as it might already have additional user input attached, so I need a query that imports the CSV data, makes a match, and when it finds the node just uses SET to override the existing properties. That said, I am unsure how to catch the cases where there is no node in the DB (a new record) and we need to create one.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MATCH (c:Customer {code:"ABC"})
SET c.name = csvLine[0]
***OPTIONAL MATCH // Here I am unsure how to express when the node is not found***
MERGE (c:Customer { name: csvLine[0], code: csvLine[1]})
So ideally Cypher would check whether the node is there and make an update by SETting the new properties coming from the CSV, or, if the node cannot be found, create a new one with the CSV data.
And, as a side note: how would I find nodes that are in the DB but not in the CSV file, in order to mark them as obsolete? (This might not be possible during the import, but maybe someone has an idea how to keep the DB clean of deleted records, which can only be detected by comparing against the latest CSV import. Happy for every idea.)
Any idea or hint on how to write the query for updating the graph while importing?
You need to use MERGE's ON MATCH and/or ON CREATE handlers, see http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match. I assume the customer code in the second column is the identifier, so the name in column one might change on updates:
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code:csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]