Handle bad addresses in location columns - socrata

I am updating a very large dataset from a legacy system. We want to map the data, but some of the rows' addresses aren't mappable: they are either badly entered or PO Boxes.
I could live with some rows not getting mapped, but bad addresses cause the entire row to be rejected.
Has anyone had to deal with this situation? What is the best way to handle it? If we want the rows with bad addresses to be included in the dataset, do we need to forgo geo-locating?
Fixing the addresses is out of the question.

Related

Changing database table based on one column

I am new to database design, but not to computers and terminology. I need some help with my database design. I am collecting data from a Global Navigation Satellite System (GNSS) receiver, and each packet differs in size depending on which constellation it sees (GPS, GALILEO, GLONASS, etc.). There are some common fields across them all, but the way I currently have it set up, every possible field is a column, and any field that doesn't apply to a given packet is just NULL. This is very inefficient, but I don't know how to design it better. Thoughts? The main point is that, as it stands, every time I run a query I either have to specify all the fields relevant to that specific packet type, or I get a bunch of useless data.
I was thinking of one option, where I have all the common fields in one table, and another table for the unique fields for each packet type. Then have a column that tells what type of packet it is, so when I do a SELECT query I can do a JOIN and only get the data that is relevant to said packet.
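For illustration, a minimal sketch of that supertype/subtype layout might look like the following, assuming MySQL; the table and column names (packet, packet_gps, constellation, and so on) are hypothetical stand-ins for the real fields.

    CREATE TABLE packet (
        packet_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        received_at   DATETIME NOT NULL,
        constellation ENUM('GPS','GALILEO','GLONASS') NOT NULL,  -- tells which subtype table to join
        -- ...fields common to every packet...
        INDEX (constellation, received_at)
    );

    CREATE TABLE packet_gps (
        packet_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        -- ...fields that only GPS packets carry...
        FOREIGN KEY (packet_id) REFERENCES packet (packet_id)
    );

    -- Querying one packet type then only returns the columns that apply to it:
    SELECT p.*, g.*
    FROM   packet p
    JOIN   packet_gps g USING (packet_id)
    WHERE  p.constellation = 'GPS';

The same pattern repeats with one subtype table per constellation (packet_glonass, packet_galileo, and so on).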
Thousands of rows of a hundred columns, many of which are NULL? Sounds fine.
It might help if you provided a CREATE TABLE; there could be tips on datatypes, etc.
It is usually a good idea to "cleanse" data that comes in from external sources, especially if, say, two constellations provide the same value in different formats. In that case, use one column and convert one (or both) incoming formats to the table's datatype. Sometimes that is 'automatic': for a FLOAT or DECIMAL(6,3) column, "123.456" and "1.23456e2" look different, but go into FLOAT the same. (OK, there could be a rounding difference.) You may choose to use DOUBLE.
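As a quick illustration of that conversion (a hypothetical table and values, assuming MySQL):

    CREATE TABLE reading (val DECIMAL(6,3) NOT NULL);

    -- Two textual formats for the same measurement converge once cast to the column type:
    INSERT INTO reading VALUES ('123.456'), ('1.23456e2');

    SELECT DISTINCT val FROM reading;   -- one row: 123.456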
NULLs don't cost much. Perhaps the main concern is your programming effort.
Your title "changing database based on column" -- that is messy; don't do it.

Is it possible to achieve non-linear storage of data in a MySQL table?

I've recently built a simple survey with a competition element. To enter the competition, the user was required to enter their email address, under the promise of anonymity.
In an effort to anonymize the data, the emails were stored in a separate table, with no foreign keys linking to survey data.
However, with a little knowledge, you can see it's quite easy to simply line up the two result sets and correlate the owner of the survey data. Everybody also opted in, which makes this task even easier. MySQL, even without timestamps or auto-increment columns, maintains the insertion order.
So it got me wondering: is there a clever way of preventing this? Some method of randomising the emails table on insert?
Obviously I know this could probably be done with an app-side callback, but I was looking for something more elegant on the MySQL side.

Managing Historical Data Dependencies

3 Tables: Device, SoftwareRevision, Message. All data entered is handled by PHP scripts on an Apache server.
A device can have one software revision. A software revision can have many devices. A device can have many messages. A message can have one device.
The issue is, the SoftwareRevision changes how the message is used in the front end application. This means that when the software is updated on the device, we need older messages to retain the information that they were received from a different software revision.
The TL;DR here is that the fully normalized way I see of doing this becomes a real pain. I've got about 5 of these situations in my current project and 3 of them are nested inside of each other.
I see three ways of doing this:
The first is the fully normalized way described above. In order to find out how to use the message in the front-end application, one must find the latest entry in Device_SoftwareRevision_Records that is before the datetime of the given message. This gets really fiddly when you have a more complex database and application. Just to get the current SoftwareRevision_ID for a device, you have to use a MAX ... GROUP BY type statement (I've ended up having to use views to simplify).
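For reference, that "current revision" lookup tends to end up looking something like the sketch below; the junction table's column names (Device_ID, AssignedAt, SoftwareRevision_ID) are assumptions, since the post doesn't show the schema.

    -- Latest software revision per device, via the MAX ... GROUP BY pattern:
    SELECT r.Device_ID, r.SoftwareRevision_ID
    FROM   Device_SoftwareRevision_Records r
    JOIN  (SELECT Device_ID, MAX(AssignedAt) AS AssignedAt
           FROM   Device_SoftwareRevision_Records
           GROUP BY Device_ID) latest
          USING (Device_ID, AssignedAt);

Extending this to find the revision that applied at a past message's TimeStamp means folding the message's datetime into that derived table, which is where it gets fiddly.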
The second is to directly link the Message to the SoftwareVersion. This means you don't have to go through the whole MAX GROUP BY WHERE blah blah. The SoftwareVersion_ID is retrieved by a PHP script and then the message is entered. Of course, this is denormalized so now there is potential for duplicate data.
And the third is the fully denormalized version. The Software_Revision_Records table is purely for bookkeeping purposes. It is easy to use for the front-end application but a pain to update on the back end. The back-end updating can actually be streamlined with triggers for entering rows into the Software_Revision_Records table, so the only thing that can really go wrong is the message getting the wrong software revision when it is entered.
Is there a better way of doing this that I have missed? Is it such a sin to denormalize the database in this situation? Will my decision here cause the business to erupt into flames (probably not)?
If the messages are tied to the software revision for that particular device, then it might make more sense to reflect that relationship in the data model. i.e. have a foreign key from Messages to Device_SoftwareRevision_Records rather than from Messages to Device. You still have the relationship from Messages to Device indirectly, it's normalised, and there's no messing around with dates trying to figure out which messages were created while a given software revision was in place.
In cases where you do need dates, it might also be worth considering having both a start and stop date, and filling in any null dates with something like 9999-12-31 (to indicate that a record has not yet been ended). You can easily find the latest record without needing to do a max. It will also make it a lot easier to query the table if you do need to compare it to other dates - you can just do a between on a single record. In this example, you'd just look for this:
where Message.TimeStamp between Device_SoftwareRevision_Records.StartDate and Device_SoftwareRevision_Records.EndDate
That said, I would still - if at all possible - change the model to relate Messages to the correct table rather than rely on dates. Being able to do simple joins will be quicker, more convenient, more obvious if anyone new needs to learn the structure, and is likely to perform better.
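To make the suggestion concrete, here is a minimal sketch of that shape, assuming MySQL; names not given in the post (such as Record_ID) are illustrative, and the 9999-12-31 default is the "not yet ended" convention mentioned above.

    CREATE TABLE Device_SoftwareRevision_Records (
        Record_ID           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        Device_ID           BIGINT UNSIGNED NOT NULL,
        SoftwareRevision_ID BIGINT UNSIGNED NOT NULL,
        StartDate           DATETIME NOT NULL,
        EndDate             DATETIME NOT NULL DEFAULT '9999-12-31'  -- record not yet ended
    );

    CREATE TABLE Message (
        Message_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        Record_ID  BIGINT UNSIGNED NOT NULL,    -- which device+revision produced this message
        TimeStamp  DATETIME NOT NULL,
        FOREIGN KEY (Record_ID) REFERENCES Device_SoftwareRevision_Records (Record_ID)
    );

    -- "Which revision does this message belong to?" becomes a plain join:
    SELECT m.Message_ID, r.Device_ID, r.SoftwareRevision_ID
    FROM   Message m
    JOIN   Device_SoftwareRevision_Records r USING (Record_ID);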

Cross Stream Data changes - EDW

I have a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B needs to be re-processed. So a common process is required to identify changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, if you need to show multiple versions of one fact or dimension value to users etc.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally, daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here)
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source
3. If rows are completely new (i.e. the PK value in the raw data is new) they are processed normally
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures; a rough sketch follows this list)
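Assuming a MySQL-style warehouse, one way to express the new-vs-updated split in steps 3-5 is INSERT ... ON DUPLICATE KEY UPDATE; the answer actually does this in stored procedures, and all table and column names here are illustrative only.

    -- New PKs are inserted as-is; an existing PK means updated data, so flag it:
    INSERT INTO loading_table (pk, payload, load_date, reprocess_flag)
    SELECT s.pk, s.payload, CURRENT_DATE, 0
    FROM   staging_raw s
    ON DUPLICATE KEY UPDATE
           payload        = VALUES(payload),
           load_date      = VALUES(load_date),
           reprocess_flag = 1;

    -- Downstream procedures then re-generate anything that depends on the flagged rows:
    SELECT pk FROM loading_table WHERE reprocess_flag = 1;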
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer address changes, we have to re-assign him to a new sales region.
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
1. Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
2. Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
3. Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems; I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
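In MySQL, for instance, the conversion is just UNHEX()/HEX() around a BINARY(20) column; the 40-hex-character digest below is the well-known SHA1 of an empty string, used purely as an example value.

    -- 40 hex characters collapse to 20 raw bytes; HEX() turns them back into text when needed.
    SELECT LENGTH(UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709')) AS bytes_stored,  -- 20
           HEX(UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709'))    AS back_to_hex;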
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order: (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
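Spelled out as DDL, a sketch of that table might look like this (assuming InnoDB, so the primary key also clusters the rows):

    CREATE TABLE saved_searches (
        search_id BIGINT UNSIGNED NOT NULL,
        item_id   CHAR(20)        NOT NULL,   -- or BINARY(20) for the raw SHA1 bytes
        PRIMARY KEY (search_id, item_id)
    ) ENGINE=InnoDB;

    -- Expiring one user's dataset is a single ranged delete on the clustered key:
    DELETE FROM saved_searches WHERE search_id = 42;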
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternatively, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
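A sketch of that zero-space variant, under the stated assumption that item ids only ever grow and nothing is deleted; all names here are illustrative, not taken from the post.

    CREATE TABLE saved_query (
        search_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        query_text  TEXT            NOT NULL,   -- the search as the user ran it
        max_item_id BIGINT UNSIGNED NOT NULL    -- highest item id that existed at save time
    );

    -- Re-running the saved search later: apply the stored query, capped at the id
    -- recorded when the dataset was saved, so later insertions don't leak in.
    -- SELECT ... FROM items WHERE <stored predicate> AND item_id <= :max_item_id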