How can I delete the columns in DataFlow Task in SSIS?


I use SQL Server 2016 and I have a very busy DataFlow task. In that task, I use a Multicast component for certain reasons. After creating a new flow in my DataFlow, I need to delete some of the columns in the new flow because they are not needed there.


For more context: I have more than 200 columns in my flow and I need fewer than 10 of them.


How can I delete the columns in DataFlow Task in SSIS?


3 Answers

#1



You can add an extra component of some sort. However, this will not reduce complexity or improve performance. Logically, you are adding another interface that needs to be maintained. Performance-wise, anything that eliminates columns means copying rows from one buffer into a separate buffer. This is called an asynchronous transformation, and it is better described here and here. Copying rows is less efficient than updating them in place.
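To make the buffer distinction concrete, here is a minimal Python sketch. It is not SSIS code: it assumes a buffer can be modeled as a list of row dicts, and the function names `uppercase_in_place` and `project_columns` are hypothetical. The first function behaves like a synchronous transformation and reuses the buffer it receives; the second drops columns, which forces it to build a second, narrower buffer.

```python
# Conceptual model only: a "buffer" is a list of row dicts.
# Function names are hypothetical and not part of any SSIS API.

def uppercase_in_place(buffer, column):
    """Synchronous style: modify rows inside the existing buffer."""
    for row in buffer:
        row[column] = row[column].upper()
    return buffer  # same list object, no rows copied


def project_columns(buffer, keep):
    """Asynchronous style: copy the kept columns into a new buffer."""
    return [{name: row[name] for name in keep} for row in buffer]


source = [
    {"id": 1, "name": "ann", "notes": "x"},
    {"id": 2, "name": "bob", "notes": "y"},
]

same = uppercase_in_place(source, "name")         # reuses the source buffer
narrow = project_columns(source, ["id", "name"])  # allocates a new buffer
```

The narrow buffer is smaller, but every row had to be copied to produce it, which is the cost this answer is describing.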


Here are some recommendations for reducing complexity, which will, in turn, improve performance:


  • Reduce the columns at the source. If you are selecting columns that are not subsequently used in any way, remove them from the query or uncheck them in the source component. Columns removed this way never enter the buffer, so the buffer occupies less memory.
  • Reduce the number of components in the dataflow. Very long dataflows are easy to create, a pain to test and even harder to maintain. Dataflows are designed for a unit of work, i.e. a data stream from here to there with a few things in the middle. This is where dataflows shine; in fact, they protect themselves from complexity with memory limits and a maximum number of threads. It is better to divide the work into separate dataflows or stored procedures. For example, you could stage the data into a table and read it twice rather than use a Multicast.
  • Use the database. SSIS is as much an orchestration tool as it is a data-moving tool. I have often found that using simple dataflows to stage the data, followed by calls to stored procedures to process the data, outperforms an all-in-one dataflow.
  • Increase the number of times you write the data. This is completely counterintuitive, but if you process data in smaller sets of operations, it runs faster and is easier to test. Given a clean slate, I will often design an ETL to write data from the source to a staging table, perform a cleansing step from that staging table to another, optionally add a conforming step that combines data from different sources into yet another table and, finally, add a last step to load the target table. Note that each source is pushed to its own target table and combined later, leveraging the database. The first and last steps are set up to run fast and avoid locking or blocking at either end.
  • Bulk load. The prior step works well only when you ensure that bulk loading is actually happening. This can be tricky, but generally you can get there by using "fast load" in the OLE DB Destination and by never using the OLE DB Command. Dropping indexes and re-creating them after the load is faster than loading with them in place (with few exceptions).

These guidelines will get you headed in the general direction, but do post more questions for tuning specific performance problems.
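The staging recommendation above can be sketched as follows. This is a minimal illustration, assuming Python's built-in `sqlite3` module as a stand-in for SQL Server; the table and column names (`stage_orders`, `order_id`, and so on) are hypothetical. It writes the source rows to a staging table once, then reads that table twice, with the second read naming only the columns it needs, so no Multicast is required.

```python
import sqlite3

# sqlite3 stands in for SQL Server here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage_orders (order_id INTEGER, customer TEXT, amount REAL, notes TEXT)"
)

source_rows = [
    (1, "ann", 10.0, "gift"),
    (2, "bob", 25.5, "rush"),
]

# Step 1: one simple flow stages the data, loaded as a batch in one transaction.
with conn:
    conn.executemany("INSERT INTO stage_orders VALUES (?, ?, ?, ?)", source_rows)

# Step 2: the wide consumer reads every column.
wide = conn.execute("SELECT * FROM stage_orders ORDER BY order_id").fetchall()

# Step 3: the narrow consumer names only the columns it needs,
# so the unused columns never enter its flow.
narrow = conn.execute(
    "SELECT order_id, amount FROM stage_orders ORDER BY order_id"
).fetchall()

conn.close()
```

In SSIS terms, steps 2 and 3 would be two separate sources (or two dataflows) reading the same staging table, each with its own column list.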


#2



I believe that you can pass just one data flow path to a Union All transformation to remove columns from that single data flow.


Take the single data flow path that you would like to remove columns from and connect it to a Union All transformation. Then open the Union All editor, right-click the column(s) you would like to remove from that path, and select Delete.


Usually I think the source of the data should be altered so that it does not send out the unwanted columns, but your case is special: one path out of the Multicast needs all of the columns from the source, while the other path does not.


#3



First of all, I don't think that what you are asking for will improve performance, because the data is loaded from the source, then duplicated by the Multicast, and only then reaches the component that reduces the number of columns.


You can do this in multiple ways:


  1. If you can create another DataFlow Task with a reduced-column source (e.g. an OLE DB Source whose SQL command selects only the specific columns), that is the better option.


  2. You can add a Script Component with an asynchronous output (as shown in the image below), add the specific columns to the output, and map them using a VB.NET or C# script, something like this:


    ' VB.NET; runs once per input row (e.g. inside Input0_ProcessInputRow)
    Output0Buffer.AddRow()
    Output0Buffer.OutColumn = Row.inColumn
    

[Image: Script Component with an asynchronous output]

  3. Add a Union All component and select the columns you need.

Side note: it is worth testing the performance of each scenario and choosing the best one.


