I have multiple tables containing historical data, so there is no one-to-one relationship between ids.
I have to join on id and on the timestamps indicating when the data was active. TO_TIMESTMP can be null if the data is still active or if it was never set for old data.
My main table, after some grouping, outputs something like this:
TABLE_A
AID  USER_ID  AMOUNT  FROM_TIMESTMP        TO_TIMESTMP
1    1        2       11/21/2012 00:00:00  12/04/2012 11:59:00
1    2        3       11/24/2012 12:00:00  null
2    1        2       11/21/2012 01:00:00  null
Then I have another table that I use to link further:
TABLE_B
AID  CID  FROM_TIMESTMP        TO_TIMESTMP          HIST_ID
1    3    11/01/2012 00:00:00  null                 1
1    3    11/21/2012 00:00:00  12/04/2012 11:59:00  2
1    3    11/24/2012 12:00:00  null                 3
2    4    11/21/2012 00:59:59  null                 4
My third table looks something like this:
TABLE_C
CID  VALUE  FROM_TIMESTMP        TO_TIMESTMP          HIST_ID
3    A      11/01/2012 00:00:00  null                 1
3    B      11/21/2012 00:00:00  11/24/2012 11:59:00  2
3    C      11/24/2012 12:00:00  null                 3
4    D      11/21/2012 01:00:01  null                 4
My expected output, if I want to combine Table A with VALUE from Table C through Table B, is:
AID  USER_ID  AMOUNT  FROM_TIMESTMP        TO_TIMESTMP          VALUE
1    1        2       11/21/2012 00:00:00  12/04/2012 11:59:00  B
1    2        3       11/24/2012 12:00:00  null                 C
2    1        2       11/21/2012 01:00:00  null                 D
There are indexes on everything except AMOUNT in Table A and VALUE in Table C, and I use the following SQL to pull out the data:
-- 1/2880 of a day is the 30-second grace period; SUM(AMOUNT) is aliased so a.AMOUNT resolves
SELECT a.AID, a.USER_ID, a.AMOUNT, a.FROM_TIMESTMP, a.TO_TIMESTMP, c.VALUE
FROM (SELECT AID, USER_ID, SUM(AMOUNT) AS AMOUNT, FROM_TIMESTMP, TO_TIMESTMP FROM TABLE_A
      GROUP BY AID, USER_ID, FROM_TIMESTMP, TO_TIMESTMP) a
INNER JOIN TABLE_B b ON b.HIST_ID IN (
    SELECT MAX(HIST_ID) FROM TABLE_B
    WHERE AID = a.AID AND FROM_TIMESTMP <= a.FROM_TIMESTMP + 1/2880
      AND (TO_TIMESTMP >= a.FROM_TIMESTMP OR TO_TIMESTMP IS NULL))
INNER JOIN TABLE_C c ON c.HIST_ID IN (
    SELECT MAX(HIST_ID) FROM TABLE_C
    WHERE CID = b.CID AND FROM_TIMESTMP <= a.FROM_TIMESTMP + 1/2880
      AND (TO_TIMESTMP >= a.FROM_TIMESTMP OR TO_TIMESTMP IS NULL));
Due to some inconsistencies in when data is saved, I have added a 30-second grace period when comparing starting timestamps, in case they were created at around the same time. Is there a way to improve how I do this?
I select the row with MAX(HIST_ID) so that cases like AID=1 and USER_ID=2 in TABLE_A only get the newest row matching the id/timestamp from the other tables.
In my real data I inner join 4 tables like this (instead of just 2), and it works well on my local test data (pulling just over 42,000 rows in 11 seconds when asking for all data).
But when I run it on the test environment, where the data volume is closer to production, it runs too slowly, even when I limit the number of rows I query from the first table to about 6,000 by requiring FROM_TIMESTMP to be between two dates.
Is there a way to improve the performance of my joins by doing them another way?
One simple change that avoids the repeated max() subqueries is:
select a.aid, a.user_id, a.amount, a.from_timestmp, a.to_timestmp, a.value
  from (select a.aid, a.user_id, a.amount, a.from_timestmp, a.to_timestmp, c.value,
               -- one partition per grouped TABLE_A row; rn = 1 is the newest matching history row
               row_number() over (partition by a.aid, a.user_id, a.from_timestmp, a.to_timestmp
                                  order by b.hist_id desc, c.hist_id desc) rn
          from (select aid, user_id, sum(amount) amount, from_timestmp, to_timestmp
                  from table_a
                 group by aid, user_id, from_timestmp, to_timestmp) a
               inner join table_b b
                       on b.aid = a.aid
                      and b.from_timestmp <= a.from_timestmp + (1 / 2880)
                      and (b.to_timestmp >= a.from_timestmp or b.to_timestmp is null)
               inner join table_c c
                       on c.cid = b.cid
                      and c.from_timestmp <= a.from_timestmp + (1 / 2880)
                      and (c.to_timestmp >= a.from_timestmp or c.to_timestmp is null)) a
 where rn = 1
 order by a.aid, a.user_id;
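The 1 / 2880 used in the join conditions is the 30-second grace period written as a fraction of a day (86400 seconds / 2880 = 30 seconds). As a minimal sketch, assuming an Oracle database, the same offset can be spelled with an interval literal, which is easier to read. If the columns are TIMESTAMP rather than DATE, the interval form should also avoid the implicit conversion to DATE that adding a plain number causes. The date used here is only an illustrative value.

```sql
-- Two ways to express "30 seconds later" in Oracle.
-- DATE '2012-11-21' is an illustrative value, not taken from the real data.
SELECT DATE '2012-11-21' + 1 / 2880             AS plus_fraction_of_day,
       DATE '2012-11-21' + INTERVAL '30' SECOND AS plus_interval
FROM   dual;
```

In the join conditions this would read b.from_timestmp <= a.from_timestmp + INTERVAL '30' SECOND.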
There could be many reasons why your query runs faster in one environment and slower in another. Most probably the optimizer has chosen two different plans, one of which runs faster, probably because the statistics are slightly different.
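To check whether the two environments really use different plans, one option is to capture the plan in each and compare them, then refresh the optimizer statistics if they look stale. This is a rough sketch assuming an Oracle database and that the tables live in the current schema; the query inside EXPLAIN PLAN is a shortened stand-in for the real one.

```sql
-- Capture the plan the optimizer picks (run this in both environments).
EXPLAIN PLAN FOR
SELECT a.aid, b.cid
FROM   table_a a
JOIN   table_b b ON b.aid = a.aid;

-- Show the plan that was just explained.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Refresh table statistics; cascade => TRUE also gathers index statistics.
-- Assumption: the tables are owned by the current user.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_A', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_B', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_C', cascade => TRUE);
END;
/
```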
You can certainly optimize your query to use your indexes, but I think your main problem lies with the data and/or the data model. With bad data you'll run into these kinds of problems again and again.
It's pretty common to archive data into the same table; it can be useful for representing transient data that needs to be queried historically. However, having archived data should not make you forget the essential rules of database design.
In your case it seems you have three related tables: they would be linked in your entity-relationship model. However, somewhere in the design process they lost this link, so now you can't reliably identify which row is related to which.
I suggest the following:
If two tables are related in your ER model, add a foreign key. This ensures that you can always join them if you need to. Foreign keys add only a small cost to DML operations (and only to INSERT, DELETE, and updates to the key columns). If your data is inserted once and queried many times, the performance impact is negligible.
In your case, if (AID, FROM_TIMESTMP) is your primary key in TABLE_A, then have the same columns in TABLE_B reference TABLE_A's primary key columns. You may need FROM_TIMESTAMP_A and FROM_TIMESTAMP_C if A and C (which seem unrelated) have distinct updating schemes.
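As a rough sketch of that suggestion, assuming an Oracle database: the constraint names and the from_timestamp_a column below are hypothetical, and the new column's type has to match whatever type TABLE_A.FROM_TIMESTMP really has (TIMESTAMP is only a guess here). Existing rows in TABLE_B would also need that column backfilled before the equality join returns them.

```sql
-- Hypothetical DDL: table_a_pk, table_b_a_fk and from_timestamp_a are
-- illustrative names, not part of the original schema.
ALTER TABLE table_a
  ADD CONSTRAINT table_a_pk PRIMARY KEY (aid, from_timestmp);

-- Assumption: from_timestmp is a TIMESTAMP; use DATE here if it is a DATE.
ALTER TABLE table_b
  ADD (from_timestamp_a TIMESTAMP);

ALTER TABLE table_b
  ADD CONSTRAINT table_b_a_fk
      FOREIGN KEY (aid, from_timestamp_a)
      REFERENCES table_a (aid, from_timestmp);

-- With the reference stored, the join is a plain equality join
-- instead of a range comparison with a grace period.
SELECT a.aid, a.user_id, a.amount, b.cid
FROM   table_a a
JOIN   table_b b
  ON   b.aid = a.aid
 AND   b.from_timestamp_a = a.from_timestmp;
```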
If you don't follow this logic, you will have to build your queries differently. If A, B and C are each historically archived yet not fully referenced, you will only be able to answer questions with a single point-in-time reference, such as "What was the status of the DB at time TS?":
-- :TS is the single point in time being queried; a null end date means "still active"
SELECT *
FROM   table_a a
JOIN   table_b b ON a.aid = b.aid
JOIN   table_c c ON c.cid = b.cid
WHERE  a.from_timestmp <= :TS
AND    nvl(a.to_timestmp, DATE '9999-12-31') > :TS
AND    b.from_timestmp <= :TS
AND    nvl(b.to_timestmp, DATE '9999-12-31') > :TS
AND    c.from_timestmp <= :TS
AND    nvl(c.to_timestmp, DATE '9999-12-31') > :TS;