I'm working on an application that imports data from a CSV file. I am told that the data in the CSV file comes from SAP, which I am totally unfamiliar with.
My client indicates that there is an issue. One column of data in the CSV file contains postal addresses. Sometimes, the system doesn't see a valid address. Here is a slightly fictionalized example:
1234 MAIN ST A&#C HOUSTON
As you can see, there is a street number, a street name, and a city, all in capital letters. There is no state or zip code specified. In the CSV file, all addresses are assumed to be in the same state.
Normally, where there is text between the street name and city, it is an apartment number or letter. In the above example, we get errors when we try to use the address with other services, such as Google geolocation. One suggested fix is to simply strip out these special characters, but I believe that there must be a better way.
I want to know what this A&#C means. It looks like some sort of escape sequence, but it isn't in a format I'm familiar with. Please tell me what this strange character sequence means.
I'm not totally sure, but I doubt there's a "canonical" escape sequence that looks like this. In the ABAP environment, # is used to replace non-printable characters. It might be that the data was improperly sanitized when importing into the SAP system in the first place, and when writing to the output file, some non-printable character was replaced by #. Another explanation might be that one of the fields contained a non-ASCII Unicode character (like, ) and the export program failed to convert that to the selected target codepage. It's hard to tell without examining the actual source dataset. Of course, it might also be some programming error or a weird custom field separator...
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON --autodetect ERIC_KOLOTYLUK_BQ_POC_DATASET.Test2 small_data_clean.jsonl
seems to work well if the JSON is very clean, but very fragile with abstruse diagnostics when the JSON is not very clean. Well, the feature is still experimental, so no point in complaining. For example, if JSON property names contain - characters, while these are valid JSON, they are not valid BigQuery Column Names.
My question is: are there existing tools/utilities for ingesting generic JSON into BigQuery that work better than --autodetect?
Presumably Google will improve --autodetect over time, but for now I am looking for any advice/experience people may have. I have already written some code to replace - with _ in property names, so I was wondering if other people have similarly created tools/utilities...
As you mentioned in your sample scenario, you have a JSON property with "-" in its name. However, per the Specifying a Schema documentation,
A column name must contain only letters (a-z, A-Z), numbers (0-9), or
underscores (_), and it must start with a letter or underscore.
Any character that does not comply with the above will result in an error when the columns are created during schema definition.
On the other hand, when using --autodetect, BigQuery's official documentation on Auto-Detection already has a disclaimer saying,
When BigQuery detects schemas, it might, on rare occasions, change a
field name to make it compatible with BigQuery SQL syntax.
Since there are no other tools available yet to auto-correct/format JSON data to fit the BigQuery requirements for schema definition, the best approach for this kind of scenario is to write code that replaces unwanted characters in the JSON data's property names, which you have already done.
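For illustration, here is a minimal Python sketch of that kind of clean-up, applied to newline-delimited JSON before running bq load. The function names (sanitize_name, sanitize_keys, sanitize_jsonl) are hypothetical, and it assumes that replacing every disallowed character with an underscore is acceptable for your data. It does not detect two different property names collapsing into the same column name.

```python
import json
import re

# Characters outside letters, digits and underscore are not allowed
# in a column name, per the rule quoted above.
_INVALID = re.compile(r"[^A-Za-z0-9_]")


def sanitize_name(name: str) -> str:
    """Return a name that satisfies the quoted column-name rule."""
    cleaned = _INVALID.sub("_", name)
    # The name must start with a letter or underscore.
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


def sanitize_keys(value):
    """Recursively rename the keys of nested dicts and lists."""
    if isinstance(value, dict):
        return {sanitize_name(k): sanitize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(item) for item in value]
    return value


def sanitize_jsonl(lines):
    """Yield cleaned JSON lines, skipping blank ones."""
    for line in lines:
        if line.strip():
            yield json.dumps(sanitize_keys(json.loads(line)))
```

You would feed it the lines of the original file and write the yielded lines to a new .jsonl file to load.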
I've been working a bit with some files from Minecraft Dungeons, which were extracted using QuickBMS and made available here: https://minecraft.fandom.com/wiki/Minecraft_Wiki:Minecraft_Dungeons_game_files
In the "data" folder, there are a bunch of JSON files, which I believe contain a list of textures associated with any given stage of the game. There is, however, a problem. When opened, each one reads like any JSON file, with a bunch of names and values, but some of the values are not human-readable. They instead show up as a string of seemingly unrelated characters. Here is an example:
"walkable-plane" : "eNpjYSEOMIMAOp+ZmQmND1fEjF2AiQldAJsWDEPRXUKkowkDAM/qA6o=",
Now, given that these are exclusively printable characters, and not error signs or something of the sort, I'm assuming this is an encoding issue. Of course, I don't know for sure, or I wouldn't be asking this in the first place, but the file as it appears in a text editor is UTF-8, and it obviously doesn't produce a usable result. So, if anyone knows what exactly this is, and how I could extract information from it, I'd be really thankful.
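One hypothesis that is cheap to test: the value uses only base64 characters and ends in =, and its first bytes after base64 decoding are 0x78 0xDA, which is a valid zlib stream header. The Python sketch below checks that header and attempts to inflate the data. The function names are hypothetical, and whether this particular value inflates cleanly, or what the resulting bytes mean, is not something the sketch establishes.

```python
import base64
import zlib


def looks_like_zlib(b64_text: str) -> bool:
    """Check whether base64-decoded data starts with a zlib header."""
    raw = base64.b64decode(b64_text)
    if len(raw) < 2:
        return False
    cmf, flg = raw[0], raw[1]
    # Low nibble 8 means "deflate"; the two header bytes, read as a
    # big-endian number, must be a multiple of 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def try_inflate(b64_text: str):
    """Return the decompressed bytes, or None if decoding fails."""
    try:
        return zlib.decompress(base64.b64decode(b64_text))
    except (ValueError, zlib.error):
        return None


sample = "eNpjYSEOMIMAOp+ZmQmND1fEjF2AiQldAJsWDEPRXUKkowkDAM/qA6o="
print(looks_like_zlib(sample))
print(try_inflate(sample))
```

If try_inflate returns bytes rather than None, the result is probably binary data (not text), so inspect it as raw bytes before trying to decode it as a string.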
I have a Python script which collects data and sends it to my MySQL table.
I noticed that the "Cost" is sometimes 0,95, which results in 0 in my table since my table uses "0.95" instead of "0,95".
I assume the best solution is to convert the , to . in my Python script by using:
variable.replace(",", ".")
However, couldn't one solution be to change format in my MySQL table? So that I store numbers in this format:
1100
0,95
0,1
150000
My Django Model
cost = models.DecimalField(max_digits=10, decimal_places=4, default=None)
Any feedback on how to best solve this issue?
Thanks
Your first instinct is correct: convert the "unusual" (comma-decimal) input into the standard format that MySQL uses by default (dot-decimal) at the first point where you receive it.
There are lots of ways to write numbers
Be careful, though, that you don't get stung by people using commas as thousands separators like "3,203,907.23", or the European form "3.203.907,23", the Swiss "3'203'907,23" or even this form, which is widely used in India: "32,03,907.71" (yes, I did mean to type only two digits there!)
To make your life easier, the rule for currencies is relatively simple:
where a dot or comma is followed by only two digits at the end of the string, that character is acting as the decimal separator.
Once you know which is the decimal separator, you can safely remove all other non-digits from the string, change the decimal separator you found to . then use any standard library string-to-number conversion.
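As a rough Python sketch of that procedure (parse_amount is a hypothetical name): it relaxes the rule slightly to accept one or two trailing digits, which is an assumption made so that a value like 0,1 from the question also parses. A leading minus sign is kept and everything else that is not a digit is dropped.

```python
import re
from decimal import Decimal

# Assumption: a '.' or ',' followed by one or two digits at the very
# end of the string is the decimal separator.
_DECIMAL_TAIL = re.compile(r"[.,](\d{1,2})$")


def parse_amount(text: str) -> Decimal:
    """Convert a loosely formatted amount into a Decimal."""
    text = text.strip()
    sign = "-" if text.startswith("-") else ""
    match = _DECIMAL_TAIL.search(text)
    if match:
        integer_part = text[:match.start()]
        fraction = match.group(1)
    else:
        integer_part, fraction = text, "0"
    # Everything left in the integer part that is not a digit is a
    # grouping character (comma, dot, apostrophe, space) and is dropped.
    digits = re.sub(r"\D", "", integer_part) or "0"
    return Decimal(f"{sign}{digits}.{fraction}")


for raw in ["0,95", "1100", "3,203,907.23", "3.203.907,23", "32,03,907.71"]:
    print(raw, "->", parse_amount(raw))
```

Using Decimal rather than float keeps the value exact, which matches the DecimalField in the model.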
Storage format isn't presentation format
Yes, you can tell MySQL to use comma as its decimal separator, but doing that will break so much of your code - including the parts of the framework that read from the database and expect dot-decimal numbers - that you'll regret doing it that way very quickly...
There's a general principle at work here: you should do your data storage and processing using a format that is easy to process, interchangeable with other systems, and understood by other software developers.
Consider what happens if you need to allow a different framework to access your MySQL database to generate reports... whoever develops that software (and it may be you) will be glad that the numbers are all stored the way numbers are "always" stored in databases.
Convert on the way in, re-convert on the way out
Where you need to accept input in a different format, convert that input into your standardised format as early as possible.
When you need to use an output format, do the conversion to that format as late as possible.
The idea is to keep as much of your system "unexceptional" as possible. A programmer who has to remember what numeric format will be in force at the time when a given method is called is not a happy programmer.
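The "on the way out" half can be just as small. A minimal Python sketch, assuming the value is held as a Decimal internally and that a comma-decimal string is only needed for display (format_comma_decimal is a hypothetical helper, not part of Django or MySQL):

```python
from decimal import Decimal


def format_comma_decimal(value: Decimal) -> str:
    """Render a Decimal with a comma as the decimal separator."""
    # format(..., "f") avoids scientific notation for very small or
    # very large values before the separator is swapped.
    return format(value, "f").replace(".", ",")


print(format_comma_decimal(Decimal("0.95")))  # 0,95
print(format_comma_decimal(Decimal("1100")))  # 1100
```

Everything between the input parser and this function only ever sees ordinary dot-decimal numbers.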
P.S.
The option you're talking about in MySQL is an example of this pattern: it doesn't change how numeric data is stored. All that changes is how you pass numbers to MySQL and how it presents them back to you.
I'm working on a database import/export process in VB.NET which writes data from a MySQL (5.5) database to a plain text file. The application reads the data to a DataTable, then goes through the rows/columns to actually write the data to the OutputFile (System.IO.StreamWriter object). The encoding on the tables in this database is Latin1. There is a MediumBlob field in one of the tables I've been using for testing which contains image files stored as a byte array.
In my attempts to validate the output from my application, I've exported the data directly from the database using the MySQL Workbench, then compared that with the results I get when I write the same data from my application. In the direct export from MySQL Workbench, I see some of these bytes are exported with the backslash. When I read the data through my application, however, this escape character does not appear. Viewed through Notepad++, it clearly shows some distinct differences between the two output results (see screenshot).
Obviously, while apparently very similar, the two are not completely identical. My application is not including the backslashes for escaped characters, and some characters such as NULL are coming out differently altogether. My code for writing this field to the file is:
OutputFile.Write("'" & System.Text.Encoding.GetEncoding(28591).GetString(CType(COPYRow(ColumnIndex), Byte())) & "'")
There doesn't appear to be an overload for the GetString method that allows me to specify an escape character, so I'm wondering if there's another way that, using this method, I can ensure the characters are correctly encoded, including escape characters.
I'm "assuming" that this method should also work in general when I start working with my PostgreSQL database, but with possibly a different encoding. I'm trying to build things as "generic" as possible, but I'll have to worry about specifying encodings at run-time instead of hard-coding them later.
EDIT
I just ran across another SO question, which might point me in the right direction: Convert a Unicode string to an escaped ASCII string. Obviously, it might take a bit more work to get it right, but this looks like the closest thing to what I'm trying to accomplish.
I'm following the demo code from the phpsqlgeocode.html article.
In the database, I inserted some Chinese addresses, which are UTF-8 encoded. I found that after calling urlencode() on a Chinese address, the address in the resulting request is wrong. Like this one:
http://maps.google.com.tw/maps/geo?output=csv&key=ABQIAAAAfG3KxFZXjEslq8VNxMBpKRR08snBovzCxLQZ9DWwpnzxH-ROPxSAS9Q36m-6OOy0qlwTL6Ht9qp87w&q=%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F
Then it outputs 200,5,59.3266963,18.2733433 (I can't query this through PHP, but through the browser instead).
This address is actually located in Taichung, Taiwan, but the result turns out to be in Sweden, Europe. But when I paste the Chinese address (such as 台中市西屯區智惠街131巷56號58號60號) directly into the URL, the result turns out to be fine!
How do I make sure it sends out the original Chinese address? How do I avoid urlencode()? I found that removing urlencode() doesn't change anything.
(I've changed the MAPS_HOST from maps.google.com to maps.google.com.tw.)
(I'm sure my key is right, and geocoding of English addresses works fine.)
q=%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F
decodes to:
?????????132?
so something has corrupted the string already before URL-encoding. This could happen if you try to convert the Chinese characters into an encoding that doesn't support Chinese characters, such as Latin-1.
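The effect is easy to reproduce outside PHP. The Python sketch below is only an illustration of the mechanism, not the code from the question: it decodes the query from the failing URL, shows what a lossy conversion to Latin-1 does to the sample address, and shows what a correct UTF-8 percent-encoding starts with.

```python
from urllib.parse import quote, unquote

# The q= parameter from the failing request contains only encoded '?'.
print(unquote("%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F"))  # ?????????132?

address = "台中市西屯區智惠街131巷56號58號60號"

# Forcing the text through Latin-1 replaces every Chinese character
# with '?', because Latin-1 has no code points for them.
mangled = address.encode("latin-1", errors="replace").decode("latin-1")
print(mangled)  # ?????????131?56?58?60?

# Percent-encoding the intact string keeps the UTF-8 bytes instead.
print(quote(address)[:9])  # %E5%8F%B0
```

So URL-encoding itself is not the problem: it faithfully encodes whatever string it is given, including one that has already been reduced to question marks.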
You need to ensure that you're using UTF-8 consistently throughout your application. In particular, you will need to ensure the tables in the database are stored using a UTF-8 character set; in MySQL terms, a UTF-8 character set and collation. The default for MySQL otherwise is Latin-1. You'll also want to ensure your connection to the database uses UTF-8 by calling `mysql_set_charset('utf8')` (MySQL spells the character set name utf8, without the hyphen).
(I am guessing from your question that you're using PHP.)