Data processors for non-local operations and couplings between blocks

Chapter 16 of the Palabos user documentation

(#) Preface

Original text

Dynamics classes and data processors are the nuts and bolts of a Palabos program. Conceptually speaking, you can view dynamics classes as an implementation of a raw “lattice Boltzmann paradigm”, whereas data processors are rather a realization of a more general data-parallel multi-block paradigm. Dynamics classes are (relatively) easy, because lattice Boltzmann is easy, as long as there are no boundaries, refined grids, parallel programs, or any other advanced structural ingredients. Data processors can be a little bit harder to understand, because they have the difficult task of handling “everything else” which is not taken charge of by dynamics classes. They implement all non-local ingredients of a model, they execute operations on scalar-fields and tensor-fields, and they create couplings between all types of blocks. And, just like dynamics objects, they handle reduction operations, and they must be efficient and inherently parallelizable.
The importance of data processors is particularly obvious when you are working with scalar-fields and tensor-fields. Unlike block-lattices, these data fields do not have intelligent cells with a local reference to a dynamics object; data processors are therefore the only efficient way of performing collective operations on them. As an example, consider a field of type TensorField3D<T,3> that represents a velocity field (each element is a velocity vector of type Array<T,3>). Such a field is for example obtained from a LB simulation by calling the function computeVelocity. It might be interesting to compute space-derivatives of the velocity through finite differences.
The only morally right way of doing this is through a data processor, because it is the only approach which is parallelizable, scalable, efficient, and forward-compatible with future versions of Palabos. This is exactly what Palabos does when you call one of the functions like computeVorticity() or computeStrainRate() (see the appendix [tensor-field -> tensor-field]). Note, once more, that you could also evaluate the finite-difference scheme by writing a simple loop over the space indices of the tensor-field. This would produce the correct result in serial and in parallel, and it would even be pretty efficient in serial, but it does not parallelize efficiently (in parallel, performance is lost instead of gained).
All in all, it should have become clear that while data processors are powerful objects, they also need to address a certain amount of complexity and are therefore conceptually less simple than other components of Palabos. Consequently, data processors are most certainly the part of Palabos’ user interface which is toughest to understand, and this section is filled with ad-hoc rules which you just need to learn at some point. To alleviate this a bit, the section starts with two relatively simple topics which teach you how to solve many tasks in Palabos through the available helper functions, and thus to avoid explicitly writing data processors in many cases. The remainder of this section contains many links to existing implementations in Palabos, and you are strongly encouraged to actually have a look at these examples to assimilate the theoretical concepts.

Using helper functions to avoid explicitly writing data processors

Data processors can be (arbitrarily) split into three categories, according to the use which is made of them. The first category is about setting up a simulation: assigning dynamics objects to a sub-domain of the lattice, initializing the populations, and so on. These methods are explained in the sections Initial values of density and velocity and Defining boundary conditions, and the functions are listed in the appendix Mutable (in-place) operations for simulation setup and other purposes. The second category embraces data processors which are added to a lattice to implement a physical model. Many models are predefined, as presented in section Implemented fluid models. Finally, the last category is for the evaluation of the data, and for the execution of short post-processing tasks, as in the above example of the computation of a velocity gradient. Examples are found in the section Data evaluation, and the available functions are listed in the appendix Non-mutable operations for data analysis and other purposes.

Convenience wrappers for local operations

Imagine that you have to perform a local initialization task on a block-lattice, i.e. a task for which you don’t need to access the value of surrounding neighbors, and for which the available Palabos functions are insufficient. As a lightweight alternative to writing a classical data-processing functional, you can implement a class which inherits from OneCellFunctionalXD and implements the virtual method (here in 2D)

virtual void execute(Cell<T,Descriptor>& cell) const;

In the body of this method, simply perform the desired action on the argument cell. If the action depends on the space position, you can instead inherit from OneCellIndexedFunctionalXD and implement the method (again illustrated for the 2D case)

virtual void execute(plint iX, plint iY, Cell<T,Descriptor>& cell) const;

An instance of this one-cell functional is then applied to the lattice through a function call like

// Case of a plain one-cell functional.
applyIndexed(lattice, domain, new MyOneCellFunctional<T,Descriptor>);
// Case of an indexed one-cell functional.
applyIndexed(lattice, domain, new MyOneCellIndexedFunctional<T,Descriptor>);

This method is used in the example program located in examples/showCases/multiComponent2d. Here, a customized initialization process is required to get access to the external scalars of a lattice for a thermal simulation.
These one-cell functionals are less general than usual data-processing functionals, because they cannot be used for non-local operations. Furthermore, they tend to be numerically somewhat less efficient, because Palabos needs to perform a virtual function call to the method execute of the one-cell functional on each cell of the domain. However, this loss of efficiency is usually completely negligible during the initialization stage, where it is important to have a code which scales on a parallel machine, not to shave off every last microsecond. In such a case you should prefer the shorter and more readable code offered by the one-cell functionals.

Commentary

The above are very convenient ways of using data processors.
By writing the operation as a plain function and letting the built-in classes invoke it through a function pointer, or by operating through the ready-made data processors, you can avoid writing data processors by hand in many cases; many of the boundary-condition operations are implemented this way.
By writing a subclass of OneCellFunctionalXD and implementing the virtual function execute(), you can perform all kinds of operations on local cells through a call to applyIndexed(). Compared with the function-pointer approach this offers more flexibility, and many of the result-collection functions are written this way.

(##) Writing data processors

Original text

A common way to execute an operation on a matrix is to write some sort of loop over all elements of the matrix, or at least over a given sub-domain, and to execute the operation on each cell. If the memory of the matrix is subdivided into smaller components, as it is the case for Palabos’ multi-blocks, and these components are distributed over the nodes of a parallel machine, then your loop also needs to be subdivided into corresponding smaller loops. The purpose of the data processors in Palabos is to perform this subdivision automatically for you. Palabos then provides the coordinates of the subdivided domains and requires you to execute the operation on these domains, instead of the original full domain.
As a developer of a data processor, you’re almost always in touch with so-called data-processing functionals, which provide a simplified interface by performing a part of the repetitive tasks behind the scenes. There exist many different types of data-processing functionals, as listed in the next section. For the sake of illustration, we consider now the case of a data processor which acts on a single 2D block-lattice or multi-block-lattice, and which does not perform any data reduction (it returns no value). Let’s say that the aim of the operation is to exchange the values of f[1] and f[5] on a given set of lattice cells. This could (and, for the sake of code clarity, should) be done with the simple one-cell-functional introduced in Section Convenience wrappers for local operations, but we want to do the real thing now, and get to the core of data functionals.
The data-processing functional for this job must inherit from BoxProcessingFunctional2D_L (the L indicates that the data processor acts on a single lattice), and implement, among other methods described below, the virtual method process:

template<typename T, template<typename U> class Descriptor>
class Invert_1_5_Functional2D : public BoxProcessingFunctional2D_L<T,Descriptor> {
public:
    // ... implement other important methods here.
    void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
    {
        for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
            for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
                Cell<T,Descriptor>& cell = lattice.get(iX,iY);
                // Use the function swap from the C++ STL to swap the two values.
                std::swap(cell[1], cell[5]);
            }
        }
    }
};

The first argument of the function process corresponds to the domain computed by Palabos after sub-dividing the original domain to fit the small components of the original block. The second argument corresponds to the lattice on which the operation is to be performed. This argument is always an atomic-block (i.e. it is always a block-lattice and never a multi-block-lattice), because at the point of the call to process, Palabos has already subdivided the original block and is accessing its internal, atomic sub-blocks. If you compare this to the procedure of writing a one-cell-functional as shown in Section Convenience wrappers for local operations, you will see that the additional work you need to do in the present case is to write the loop over the space indices of the domain yourself. Having to write out these loops by hand all the time is tiring, especially when you write many data processors, and it is error-prone. But it is the price to pay for optimal efficiency, and in the field of computational physics efficiency counts just a bit more than in other domains of software engineering and must, unfortunately, be weighed against elegance from time to time.
Another advantage over one-cell-functionals is the possibility to implement non-local operations. In the following example, the value of f[1] is swapped with f[5] on the right neighboring cell:

void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
{
    for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
        for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
            Cell<T,Descriptor>& cell = lattice.get(iX,iY);
            Cell<T,Descriptor>& partner = lattice.get(iX+1,iY);
            std::swap(cell[1], partner[5]);
        }
    }
}

You can do this without the risk of accessing cells outside the range of the lattice if you respect two rules:

  1. On nearest-neighbor lattices (D2Q9, D3Q19, etc.), you can be non-local by one cell but no more (you may write lattice.get(iX+1,iY) but not lattice.get(iX+2,iY)). On a lattice with extended neighborhood you can also extend the distance at which you access neighboring cells in a data processor. The amount of allowed non-locality is determined by the constant Descriptor::vicinity.
  2. Non-local operations are not allowed in data processors which act on the communication envelope (Section The methods you need to override explains what this means).

To conclude this sub-section, let’s summarize the things you are allowed to do in a data processor, and the things you are not. You are allowed to access the cells in the provided range (plus the nearest neighbors, or a few more neighbors according to the lattice topology), to read them and to modify them. The operation which you perform can be space dependent, but this space dependency must be generic and cannot depend on the specific coordinates of the argument domain provided to the function process. This is an extremely important point which we shall state as the Rule 0 of data processors:

Rule 0 of data processors:
A data processor must always be written in such a way that executing the data processor on a given domain has the same effect as splitting the domain into two sub-domains, and then executing the data processor consecutively on each of these sub-domains.

In practice, this means that you are not allowed to make any logical decision based on the parameters x0, x1, y0, or y1 of the argument domain, or directly based on the indices iX or iY. Instead, these local indices must first be converted to global ones, independent of the sub-division of the data processor, as explained in Section Absolute and relative position.

Categories of data-processing functionals

Depending on the type of blocks on which a data processor is applied, there exist different types of processing functionals, as listed below:

Class BoxProcessingFunctionalXD<T>
void processGenericBlocks(BoxXD domain, std::vector<AtomicBlockXD<T>*> atomicBlocks);

This class is practically never used. It is the fall-back option when everything else fails. It handles an arbitrary number of blocks of arbitrary type, which were cast to the generic type AtomicBlockXD. Before use, you need to cast them back to their real type.

Class LatticeBoxProcessingFunctionalXD<T,Descriptor>
void process(BoxXD domain, std::vector<BlockLatticeXD<T,Descriptor>*> lattices);

Use this class to process an arbitrary number of block-lattices, and potentially create a coupling between them. This data-processing functional is for example used to define a coupling between an arbitrary number of lattices for the Shan/Chen multi-component model defined in the files src/multiPhysics/shanChenProcessorsXD.h and .hh. This type of data-processing functional is not very frequently used either, as the two-block versions listed below are more appropriate in most cases.

Class ScalarFieldBoxProcessingFunctionalXD<T>
void process(BoxXD domain, std::vector<ScalarFieldXD<T>*> fields);

Same as above, applied to scalar-fields.

Class TensorFieldBoxProcessingFunctionalXD<T,nDim>
void process(BoxXD domain, std::vector<TensorFieldXD<T,nDim>*> fields);

Same as above, applied to tensor-fields.

Class BoxProcessingFunctionalXD_L<T,Descriptor>
void process(BoxXD domain, BlockLatticeXD<T,Descriptor>& lattice);

Data processor acting on a single lattice.

Class BoxProcessingFunctionalXD_S<T>
void process(BoxXD domain, ScalarFieldXD<T>& field);

Data processor acting on a single scalar-field.

Class BoxProcessingFunctionalXD_T<T,nDim>
void process(BoxXD domain, TensorFieldXD<T,nDim>& field);

Data processor acting on a single tensor-field.

Class BoxProcessingFunctionalXD_LL<T,Descriptor1,Descriptor2>
void process(BoxXD domain, BlockLatticeXD<T,Descriptor1>& lattice1, BlockLatticeXD<T,Descriptor2>& lattice2);

Data processor for processing and/or coupling two lattices with potentially different descriptors. Similarly, there is an SS version for two scalar-fields, and a TT version for two tensor-fields with potentially different dimensionality nDim.

Class BoxProcessingFunctionalXD_LS<T,Descriptor>
void process(BoxXD domain, BlockLatticeXD<T,Descriptor>& lattice, ScalarFieldXD<T>& field);

Data processor for processing and/or coupling a lattice and a scalar-field. Similarly, there is an LT and an ST version for the lattice-tensor and the scalar-tensor case.
For each of these processing functionals, there exists a “reductive” version (e.g. ReductiveBoxProcessingFunctionalXD_L) for the case where the data processor performs a reduction operation and returns a value.

The methods you need to override

In addition to the method process, a data-processing functional must override three methods. The use of these three methods is now illustrated with the class Invert_1_5_Functional2D introduced at the beginning of this section:

BlockDomain::DomainT appliesTo() const
{
    return BlockDomain::bulk;
}

void getModificationPattern(std::vector<bool>& isWritten) const
{
    isWritten[0] = true;
}

Invert_1_5_Functional2D<T,Descriptor>* clone() const
{
    return new Invert_1_5_Functional2D<T,Descriptor>(*this);
}

To start with, you need to provide the method clone(), which is paradigmatic in Palabos (see Section Programming with Palabos). Next, you need to tell Palabos which of the blocks treated by the data processor are being modified. In the present case, there is only one block. In the general case, the size of the vector isWritten is equal to the number of involved blocks, and you must assign a flag true/false to each of them. Among other things, this information is exploited by Palabos to decide whether an inter-process communication for the block is needed after execution of the data processor.
The third method, appliesTo, is used to decide whether the data processor acts only on the actual domain of the simulation (BlockDomain::bulk) or whether it also includes the communication envelopes (BlockDomain::bulkAndEnvelope). Let’s remember that the atomic-blocks which are the components of a multi-block are extended by a single cell layer (or multiple cell layers for extended lattices) to incorporate communication between blocks. This envelope overlaps with the bulk of another atomic-block, and the information is duplicated between the corresponding bulk and envelope cells. It is this envelope which makes it possible to implement a non-local data processor without incurring the danger of accessing out-of-range data. Normally, it is sufficient to execute a data processor on the bulk of the atomic-blocks, and it is better to do so, in order to avoid the restrictions listed below when using the envelopes. This is sufficient because a communication is automatically initiated between the envelopes and the bulk of neighboring blocks to update the values in the envelope if needed. Including the envelope is only needed if (1) you assign a new dynamics object to some or all of the cells (as is done when you call the function defineDynamics), or (2) you modify the internal state of the dynamics object (as is done when you call the function defineVelocity to assign a new velocity value on Dirichlet boundary nodes). In these cases, including the envelope is necessary, because the nature and the content of dynamics objects are not transferred during the communication step between atomic-blocks. The only information which is transferred is the cell data (the particle populations and the external scalars).
If you decide to include the envelope into the application area of the data processor, you must however respect the two following rules. Otherwise, undefined behavior shall arise.

  1. The data processor must be entirely local, because there are no additional envelopes available to cope with non-local data access.
  2. The data processor can have write access to at most one of the involved blocks (the vector isWritten returned from the method getModificationPattern() can have the value true at most at one place).

Absolute and relative position

The coordinates iX and iY used in the space loop of a data processor are pretty useless for anything other than the execution of the loop, because they represent local variables of an atomic-block, which is itself situated at an arbitrary position inside the overall multi-block. To make decisions depending on a space position, the local coordinates must therefore first be converted to global ones:

// Access the position of the atomic-block inside the multi-block.
Dot2D relativePosition = lattice.getLocation();
// Convert local coordinates to global ones.
plint globalX = iX + relativePosition.x;
plint globalY = iY + relativePosition.y;

An example is provided in the directory examples/showCases/boussinesqThermal2d/. This conversion is a bit awkward, and this is again a good reason to use the one-cell functionals presented in Section Convenience wrappers for local operations, which do the job automatically for you.
Similarly, if you execute a data processor on more than just one block, the relative coordinates are not necessarily the same in all involved blocks. If you measure things in global coordinates, then the argument domain of the method process always overlaps with all of the involved blocks. This is something which is guaranteed by the algorithm implemented in Palabos. However, all multi-blocks on which the data processor is applied are not necessarily working with the same internal data distribution, and have potentially a different interpretation of local coordinates. The argument domain of the method process is always provided as local coordinates of the first atomic-block. To get at the coordinates of the other blocks, a corresponding conversion must be applied:

Dot2D offset_0_1 = computeRelativeDisplacement(lattice0, lattice1);
Dot2D offset_0_2 = computeRelativeDisplacement(lattice0, lattice2);
plint iX1 = iX + offset_0_1.x;
plint iY1 = iY + offset_0_1.y;
plint iX2 = iX + offset_0_2.x;
plint iY2 = iY + offset_0_2.y;

Again, this process is illustrated in the example in examples/showCases/boussinesqThermal2d/. This displacement needs to be computed if any of the following conditions is verified (if you are unsure, it is best to compute the displacement by default):

  1. The multi-blocks on which the data processor is applied don’t have the same data distribution, because they were constructed differently.
  2. The multi-blocks on which the data processor is applied don’t have the same data distribution, because they don’t have the same size. This is the case for all functions like computeVelocity, which computes the velocity on a sub-domain of the lattice. It uses a data-processor which acts on the original lattice (which is big) and the velocity field (which can be smaller because it has the size of the sub-domain).
  3. The data processor includes the envelope. In this case, a relative displacement stems from the fact that bulk nodes are coupled with envelope nodes from a different atomic-block. This is one more reason why it is generally better not to include the envelope in the application domain of a data processor.

Executing, integrating, and wrapping up data-processing functionals

There are basically two ways of using a data processor. In the first case, the processor is executed just once, on one or more blocks, through a call to the function executeDataProcessor. In the second case, the processor is added to a block through a call to the function addInternalProcessor, and then adopts the role of an internal data processor. An internal data processor is part of the block and can be executed as many times as wished by calling the method executeInternalProcessors of this block. This approach is typically chosen when the data processing step is part of the algorithm of the fluid solver. As examples, consider the non-local parts of boundary conditions, the coupling between components in a multi-component fluid, or the coupling between the fluid and the temperature field in a thermal code with Boussinesq approximation. In a block-lattice, internal processors have a special role, because the method executeInternalProcessors is automatically invoked at the end of the method collideAndStream() and of the method stream(). This behavior is based on the assumption that collideAndStream() represents a full lattice Boltzmann iteration cycle, and stream(), if used, stands at the end of such a cycle. The internal processors are therefore considered to be part of a lattice Boltzmann iteration and are executed at the very end, after the collision and the streaming step.
For convenience, the function call to executeDataProcessor and to addInternalProcessor was redefined for each type of data-processing functional introduced in Section Categories of data-processing functionals, and the new functions are called applyProcessingFunctional and integrateProcessingFunctional respectively. To execute for example a data-processing functional of type BoxProcessingFunctional2D_LS on the whole domain of a given lattice and scalar field (they can be either of type multi-block or atomic-block), the function call to use has the form

applyProcessingFunctional (
    new MyFunctional<T,Descriptor>, lattice.getBoundingBox(),
    lattice, scalarField );

All predefined data-processing functionals in Palabos are additionally wrapped in a convenience function, in order to simplify the syntax. For example, one of the three versions of the function computeVelocity for 2D fields is defined in the file src/multiBlock/multiDataAnalysis2D.hh as follows:

template<typename T, template<typename U> class Descriptor>
void computeVelocity( MultiBlockLattice2D<T,Descriptor>& lattice,
                      MultiTensorField2D<T,Descriptor<T>::d>& velocity,
                      Box2D domain )
{
    applyProcessingFunctional (
        new BoxVelocityFunctional2D<T,Descriptor>, domain, lattice, velocity );
}

Execution order of internal data processors

There are different ways to control the order in which internal data processors are executed in the function call executeInternalProcessors(). First of all, each data processor is attributed to a processor level, and these processor levels are traversed in increasing order, starting with level 0. By default, all internal processors are attributed to level 0, but you have the possibility to put them into any other level, specified as the last, optional parameter of the function addInternalProcessor or integrateProcessingFunctional. Inside a processor level, the data processors are executed in the order in which they were added to the block. In addition to imposing an order of execution, the attribution of data processors to a given level has an influence on the communication pattern inside multi-blocks. As a matter of fact, communication is not performed immediately after the execution of a data processor with write access, but only when switching from one level to the next. In this way, all MPI communication required by the data processors within one level is bundled and executed more efficiently. To clarify the situation, let us write down the details of one iteration cycle of a block-lattice which has data processors at level 0 and at level 1 and automatically executes them at the end of the function call collideAndStream:

  1. Execute the local collision, followed by a streaming step.
  2. Execute the data processors at level 0. No communication has been made so far. Therefore, the data processors at this level have only a restricted ability to perform non-local operations, because the cell data in the communication envelopes is erroneous.
  3. Execute a communication between the atomic-blocks of the block-lattice to update the envelopes. If any other, external blocks (lattice, scalar-field or tensor-field) were modified by any of the data processors at level 0, update the envelopes in these blocks as well.
  4. Execute the data processors at level 1.
  5. If the block-lattice or any other, external blocks were modified by any of the data processors at level 1, update the envelopes correspondingly.

Although this behavior may seem a bit complicated, it leads to an intuitive behavior of the program and offers a general way to control the execution of data processors. It should be especially emphasized that if a data processor B depends on data produced previously by another data processor A, you must make sure that a proper causality relation between A and B is implemented. In all cases, B must be executed after A. Additionally, if B is non-local (and therefore accesses data on the envelopes) and A is a bulk-only data processor, a communication step is required between the execution of A and B. Therefore, A and B must be defined on different processor levels.
If you execute data processors manually, you can choose to execute only the processors of a given level, by indicating the level as an optional parameter of the method executeInternalProcessors(plint level). It should also be mentioned that a processor level can have a negative value. The advantage of a negative processor level is that it is not executed automatically through the default function call executeInternalProcessors(). It can only be executed manually through the call executeInternalProcessors(plint level). It makes sense to exploit this behavior for data processors which are executed often, but not at every iteration step. Calling applyProcessingFunctional each time would be somewhat less efficient, because an overhead is incurred by the decomposition of the data processor over internal atomic-blocks.

文档翻译

在矩阵上执行操作的一种常用方法是在矩阵的所有元素上或至少在给定的子域上编写某种循环,并在每个单元上执行操作。如果将矩阵的内存细分为较小的组件(例如Palabos的多块),并且这些组件分布在并行计算机的节点上,那么您的循环也需要细分为相应的较小的循环。 Palabos中数据处理器的目的是自动为您执行此细分。然后,它提供细分的域的坐标,并要求您在这些域上执行操作,而不是在原始完整域上执行操作。
作为数据处理器的开发人员,您几乎总是与所谓的数据处理功能保持联系,这些功能通过在后台执行部分重复性任务来提供简化的界面。下一节列出了许多不同类型的数据处理功能。为了说明起见,我们现在考虑一种数据处理器的情况,该处理器作用于单个2D块格或多块格,并且不执行任何数据缩减(不返回任何值)。假设操作的目的是在晶格单元上交换给定数量的f [1]和f [5]的值。这可以(并且为了代码清晰起见)可以通过本节中便于本地操作的包装器(Convenience wrappers for local operations)中介绍的用于本地操作的简单的单单元功能来完成,但是我们现在想做一件真正的事情,并进入数据功能的核心。
此作业的数据处理功能必须继承自BoxProcessingFunctional2D_L(L表示数据处理器作用于单个晶格),并在下面描述的其他方法中实现虚拟方法过程:

template<typename T, template<typename U> class Descriptor>
class Invert_1_5_Functional2D : public BoxProcessingFunctional2D_L<T,Descriptor> {
public:
    // ... implement other important methods here ...
    void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
    {
        for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
            for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
                Cell<T,Descriptor>& cell = lattice.get(iX,iY);
                // Use the function swap from the C++ STL to swap the two values.
                std::swap(cell[1], cell[5]);
            }
        }
    }
};

The first argument of the function process corresponds to one of the smaller domains computed by Palabos when it subdivides the original domain to fit the components of the original block. The second argument corresponds to the lattice on which the operation is to be performed. This argument is always an atomic-block (i.e. it is always a block-lattice, never a multi-block-lattice), because at the time of the function call, Palabos has already subdivided the original block and accesses its internal, atomic sub-blocks. If you compare this to the procedure for writing single-cell functionals shown in the section Convenience wrappers for local operations, you will see that the additional work required in the present case is to write the loops over the space indices of the domain yourself. Having to write out these loops by hand all the time is tiring and error prone, especially when you write many data processors. But it is the price to pay for optimal efficiency; in the field of computational physics, efficiency counts a little more than in other domains of software engineering, and it must unfortunately sometimes be weighed against the elegance of the code.
Another advantage compared to single-cell functionals is the possibility to implement non-local operations. In the following example, the value of f[1] is swapped with f[5] on the right neighboring cell:

void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
{
    for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
        for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
            Cell<T,Descriptor>& cell    = lattice.get(iX,iY);
            Cell<T,Descriptor>& partner = lattice.get(iX+1,iY);
            std::swap(cell[1], partner[5]);
        }
    }
}

It is possible to do this without running the risk of accessing cells outside the range of the lattice, if you respect the following two rules:

  1. On nearest-neighbor lattices (D2Q9, D3Q19, etc.), you may perform non-local operations only on nearest-neighbor cells, but not on cells located any further away (you may write lattice.get(iX+1,iY), but not lattice.get(iX+2,iY)). On lattices with extended neighborhoods, you may also extend the distance at which neighboring cells are accessed in a data processor. The amount of allowed non-locality is determined by the constant Descriptor::vicinity.
  2. Non-local operations are not allowed in data processors which act on the communication envelope (the section The methods you need to override explains what this means).

To conclude this subsection, let us summarize what you may and may not do inside a data processor. You are allowed to access the cells in the provided range (plus, depending on the lattice topology, the nearest neighbors, or a few neighbors more), both to read and to modify them. The operations you execute may depend on space, but this space dependency must be generic and must not depend on the specific coordinates of the argument domain provided to the function process. This is an extremely important point, which we list as rule 0 for data processors:

Rule 0 for data processors:
A data processor must always be written in such a way that executing it on a given domain has the same effect as splitting the domain into two sub-domains, and executing the data processor consecutively on each of these sub-domains.

In practice, this means that you are not allowed to make any logical decision based on the parameters x0, x1, y0, or y1 of the argument domain, or directly on the indices iX or iY. Instead, these local indices must first be converted to global indices, which are independent of the subdivision of the data processor, as explained in the section Absolute and relative position.

Categories of data-processing functionals

Depending on the type of blocks on which a data processor is applied, there exist different types of processing functionals, as listed below:

Class BoxProcessingFunctionalXD<T> void processGenericBlocks(BoxXD domain, std::vector<AtomicBlockXD<T>*> atomicBlocks);

This class is practically never used. It is the fallback option when everything else fails. It handles an arbitrary number of blocks of arbitrary type, which are converted to the generic type AtomicBlockXD. You need to cast them back to their real type before use.

Class LatticeBoxProcessingFunctionalXD<T,Descriptor> void process(BoxXD domain, std::vector<BlockLatticeXD<T,Descriptor>*> lattices);

Use this class to process an arbitrary number of block-lattices and, potentially, create a coupling between them. This data-processing functional is used, for example, to define the coupling between an arbitrary number of lattices for the Shan/Chen multi-component model defined in the files src/multiPhysics/shanChenProcessorsXD.h and .hh. This type of data-processing functional is not very frequently used either, because the two-block versions listed below are more appropriate in most cases.

Class ScalarFieldBoxProcessingFunctionalXD<T> void process(BoxXD domain, std::vector<ScalarFieldXD<T>*> fields);

Same as above, applied to scalar-fields.

Class TensorFieldBoxProcessingFunctionalXD<T,nDim> void process(BoxXD domain, std::vector<TensorFieldXD<T,nDim>*> fields);

Same as above, applied to tensor-fields.

Class BoxProcessingFunctionalXD_L<T,Descriptor> void process(BoxXD domain, BlockLatticeXD<T,Descriptor>& lattice);

A data processor acting on a single lattice.

Class BoxProcessingFunctionalXD_S<T> void process(BoxXD domain, ScalarFieldXD<T>& field);

A data processor acting on a single scalar-field.

Class BoxProcessingFunctionalXD_T<T,nDim> void process(BoxXD domain, TensorFieldXD<T,nDim>& field);

A data processor acting on a single tensor-field.

Class BoxProcessingFunctionalXD_LL<T,Descriptor1,Descriptor2> void process(BoxXD domain, BlockLatticeXD<T,Descriptor1>& lattice1, BlockLatticeXD<T, Descriptor2>& lattice2);

A data processor for the processing and/or coupling of two lattices with potentially different descriptors. Similarly, there is an SS version for two scalar-fields, and a TT version for two tensor-fields with potentially different dimensions nDim.

Class BoxProcessingFunctionalXD_LS<T,Descriptor> void process(BoxXD domain, BlockLatticeXD<T,Descriptor>& lattice, ScalarFieldXD<T>& field);

A data processor for the processing and/or coupling of a lattice and a scalar-field. Similarly, there are LT and ST versions for the lattice-tensor and scalar-tensor cases.

For each of these processing functionals, there exists a "reductive" version (e.g. ReductiveBoxProcessingFunctionalXD_L) for the case where the data processor performs a reduction operation and returns a value.

The methods you need to override

Apart from the method process, a data-processing functional must override three more methods. The use of these three methods is now illustrated for the example of the class Invert_1_5_Functional2D introduced at the beginning of this section:

BlockDomain::DomainT appliesTo() const
{
    return BlockDomain::bulk;
}

void getModificationPattern(std::vector<bool>& isWritten) const
{
    isWritten[0] = true;
}

Invert_1_5_Functional2D<T,Descriptor>* clone() const
{
    return new Invert_1_5_Functional2D<T,Descriptor>(*this);
}

To start with, you need to provide the method clone(), which is paradigmatic in Palabos (see the section Programming with Palabos, Chapter 5). Next, you need to tell Palabos which of the blocks treated by the data processor are being modified. In the present case, there is only one block. In the general case, the vector isWritten has the same size as the number of involved blocks, and a true/false flag must be assigned to each of them. Among other things, Palabos uses this information to decide whether an inter-process communication for the block is needed after the execution of the data processor.
The third method, appliesTo, decides whether the data processor acts only on the actual domain of the simulation (BlockDomain::bulk), or whether it also includes the communication envelopes (BlockDomain::bulkAndEnvelope). Remember that the atomic-blocks which make up a multi-block are extended by a single cell layer (or several cell layers for extended lattices) to incorporate communication between the blocks. This envelope overlaps with the bulk of another atomic-block, and the information is duplicated between the corresponding bulk and envelope cells. It is this envelope which makes it possible to implement non-local data processors without running the danger of accessing out-of-range data. Normally, it is sufficient to execute a data processor on the bulk of the atomic-blocks, and it is preferable to do so in order to avoid the restrictions listed below which come with the use of the envelope. It is sufficient, because a communication is automatically initiated between the envelope and the bulk of neighboring blocks to update the values in the envelope when needed. Including the envelope is only required if (1) you assign a new dynamics object to some or all of the cells (as is done when calling the function defineDynamics), or (2) you modify the internal state of a dynamics object (this is done, for example, when calling the function defineVelocity to assign a new velocity value to a Dirichlet boundary node). In these cases, including the envelope is necessary because the nature and content of dynamics objects is not transferred during the communication step between atomic-blocks. The only information transferred is the cell data (the particle populations and the external scalars).
If, however, you decide to include the envelope in the application area of the data processor, the two following rules must be obeyed. Otherwise, undefined behavior results.

  1. The data processor must be entirely local, because no additional envelope is available to cope with non-local data access.
  2. The data processor may have write access to at most one of the involved blocks (the vector isWritten returned from the method getModificationPattern() may have the value true at no more than one position).

Absolute and relative position

The coordinates iX and iY used in the space loop of a data processor are useless for anything other than the execution of the loop, because they represent local variables of an atomic-block which itself is situated at an arbitrary position inside the overall multi-block. To make decisions depending on the spatial position, the local coordinates must therefore first be converted to global ones:

// Access the position of the atomic-block inside the multi-block.
Dot2D relativePosition = lattice.getLocation();
// Convert local coordinates to global ones.
plint globalX = iX + relativePosition.x;
plint globalY = iY + relativePosition.y;

An example is provided in the directory examples/showCases/boussinesqThermal2d/. This conversion is a bit awkward, which is yet another good reason to use the single-cell functionals for local operations presented in the section Convenience wrappers for local operations: they do the corresponding work automatically for you and avoid the conversion.
Similarly, if you execute a data processor on several blocks, the relative coordinates are not necessarily the same in all involved blocks. Measured in global coordinates, the argument domain of the method process always overlaps with all involved blocks; the algorithm implemented in Palabos guarantees this. However, the multi-blocks on which the data processor is applied do not necessarily all have the same internal data distribution, and may interpret their local coordinates differently. The argument domain of the method process is always provided in the local coordinates of the first atomic-block. To obtain the coordinates in the other blocks, the corresponding conversion must be applied:

Dot2D offset_0_1 = computeRelativeDisplacement(lattice0, lattice1);
Dot2D offset_0_2 = computeRelativeDisplacement(lattice0, lattice2);
plint iX1 = iX + offset_0_1.x;
plint iY1 = iY + offset_0_1.y;
plint iX2 = iX + offset_0_2.x;
plint iY2 = iY + offset_0_2.y;

Again, this procedure is illustrated by the example in examples/showCases/boussinesqThermal2d/. Computing this displacement is needed whenever any of the following conditions is met (when unsure, it is best to compute the displacement by default):
1. The multi-blocks on which the data processor is applied have a different data distribution because they were constructed differently.
2. The multi-blocks on which the data processor is applied have a different data distribution because they have different sizes. This is the case for all functions like computeVelocity, which computes the velocity on a sub-domain of the lattice. It uses a data processor acting on the original lattice (which is big) and on the velocity field (which is potentially smaller, because it has the size of the sub-domain).
3. The data processor includes the envelope. In this case, the relative displacement stems from the fact that bulk nodes are coupled with envelope nodes from a different atomic-block. This is yet another reason why it is better not to include the envelope in the application domain of a data processor.

Executing, integrating, and wrapping data-processing functionals

There are essentially two ways of using a data processor. In the first case, the processor is executed just once, on one or more blocks, through a call to the function executeDataProcessor. In the second case, it is added to a block through a call to the function addInternalProcessor and takes on the role of an internal data processor. An internal data processor is part of the block and can be executed as many times as desired, by calling the method executeInternalProcessors of the block. This approach is typically chosen when the data-processing step is part of the algorithm of a fluid solver. As examples, consider the non-local parts of a boundary condition, the coupling between components in a multi-component fluid, or the coupling between the fluid and the temperature field in a thermal code with Boussinesq approximation. Inside a block-lattice, internal processors have a special role, because the method executeInternalProcessors is automatically invoked at the end of the method collideAndStream() and of the method stream(). This behavior is based on the assumption that collideAndStream() represents a full lattice Boltzmann iteration cycle, and that stream(), if used, stands at the end of such a cycle. Internal processors are therefore considered part of a lattice Boltzmann iteration and are executed at the very end, after the collision and the streaming step.
For convenience, the function calls to executeDataProcessor and addInternalProcessor are redefined for each type of data-processing functional presented in the section Categories of data-processing functionals; the new functions are called applyProcessingFunctional and integrateProcessingFunctional, respectively. To execute, for example, a data-processing functional of type BoxProcessingFunctional2D_LS on the whole domain of a given lattice and scalar field (they may be of type multi-block or atomic-block), the function call has the form

applyProcessingFunctional (
    new MyFunctional<T,Descriptor>, lattice.getBoundingBox(),
    lattice, scalarField );

All predefined data-processing functionals in Palabos are additionally wrapped in convenience functions, to simplify the syntax. For example, one of the three versions of the function computeVelocity for 2D fields is defined in the file src/multiBlock/multiDataAnalysis2D.hh as follows:

template<typename T, template<typename U> class Descriptor>
void computeVelocity( MultiBlockLattice2D<T,Descriptor>& lattice,
                      MultiTensorField2D<T,Descriptor<T>::d>& velocity,
                      Box2D domain )
{
    applyProcessingFunctional (
        new BoxVelocityFunctional2D<T,Descriptor>, domain, lattice, velocity );
}

Execution order of internal data processors

There are various ways to control the order of execution of internal data processors in the function call executeInternalProcessors(). First of all, each data processor belongs to a processor level, and these processor levels are traversed in increasing order, starting at level 0. By default, all internal processors belong to level 0, but you can put them into any other level, specified as the last, optional parameter of the function addInternalProcessor or integrateProcessingFunctional. Within a processor level, the data processors are executed in the order in which they were added to the block. Apart from imposing an order of execution, assigning data processors to a given level also affects the communication pattern inside multi-blocks. As a matter of fact, communication is not performed immediately after the execution of a data processor with write access, but only when switching from one level to the next. In this way, all MPI communications required by the data processors of a given level are bundled and executed more efficiently. To clarify this behavior, let us write down the details of one iteration cycle of a block-lattice which has data processors at level 0 and at level 1, executed automatically at the end of the function call collideAndStream:

  1. Execute the local collision, followed by the streaming step.
  2. Execute the data processors at level 0. No communication has been performed so far; the data processors at this level therefore have only a limited ability to perform non-local operations, because the cell data in the communication envelopes is erroneous.
  3. Execute a communication between the atomic-blocks of the block-lattice to update the envelopes. If any data processor at level 0 modified any other, external block (lattice, scalar-field, or tensor-field), the envelopes of these blocks are updated as well.
  4. Execute the data processors at level 1.
  5. If any data processor at level 1 modified the block-lattice or any other, external block, update the envelopes accordingly.

Although this behavior may seem a bit complicated, it leads to an intuitive behavior of the program and offers a general way to control the execution of data processors. It should be specially emphasized that if a data processor B depends on data produced previously by another data processor A, you must make sure that a proper causality relation between A and B is implemented. In all cases, B must be executed after A. Additionally, if B is non-local (and therefore accesses data on the envelopes) and A is a bulk-only data processor, a communication step must be executed between the execution of A and B. Therefore, A and B must be defined on different processor levels.
If you execute data processors manually, you can choose to execute only the processors of a given level, by indicating the level as an optional parameter of the method executeInternalProcessors(plint level). It should also be mentioned that a processor level can have a negative value. The advantage of a negative processor level is that it is not executed automatically through the default function call executeInternalProcessors(); it can only be executed manually through the call executeInternalProcessors(plint level). It makes sense to exploit this behavior for data processors which are executed often, but not at every iteration step. Calling applyProcessingFunctional each time would be somewhat less efficient, because an overhead is incurred by the decomposition of the data processor over the internal atomic-blocks.

Translator's Notes

This part is not recommended for anyone with only a shallow understanding of Palabos. If you write data processors just to get something done, without understanding the internal code architecture, almost every one you write will be wrong; it is better to stick to the methods described in the preface section above. The background required here includes design patterns (composite, strategy, command, etc.) as well as some knowledge of multi-block partitioning and data communication in MPI. Once you can write programs at this level, you should be able to implement any LBM model, but the time cost is considerable. Either way, I do not recommend that beginners, or anyone in a hurry to publish, attempt this.
In fact, many of the data processors in the existing example code are written exactly in the style described in the manual above. Understanding some of this helps you make better use of the data-processor operations that already exist in the examples; you do not necessarily need to write your own.
You can also apply a data processor as a one-off operation directly in the collision-and-streaming part of the code, using the applyProcessingFunctional family of functions. I generally use this for hacks such as forcibly adding an external inlet that injects some fluid...
Each model's module also contains data-processor classes that have already been written but are not used in any example. If you want to use them, you need to dig into the source files of the model you are interested in, find them, and call them in the standard way; a lot can be done with them.
