% $Id: prof.tex,v 1.4 1994/03/05 00:25:02 otto Exp $
\chapter{Profiling Interface}
\label{sec:prof}
\label{chap:prof}
%\footnotetext[1]{Version as of March, 1994}

\section{Requirements}

To meet the requirements of the \MPI/ profiling interface, an
implementation of the \MPI/ functions {\em must}
\begin{enumerate}
\item provide a mechanism through which all of the \MPI/ defined
  functions may be accessed with a name shift. Thus all of the \MPI/
  functions (which normally start with the prefix ``{\tt MPI\_}'')
  should also be accessible with the prefix ``{\tt PMPI\_}''.
\item ensure that those \MPI/ functions which are not replaced may
  still be linked into an executable image without causing name
  clashes.
\item document the implementation of different language bindings of
  the \MPI/ interface if they are layered on top of each other, so
  that the profiler developer knows whether she must implement the
  profile interface for each binding, or can economise by
  implementing it only for the lowest level routines.
\item where the implementation of different language bindings is done
  through a layered approach (e.g. the Fortran binding is a set of
  ``wrapper'' functions which call the C implementation), ensure that
  these wrapper functions are separable from the rest of the library.

  This is necessary to allow a separate profiling library to be
  correctly implemented, since (at least with Unix linker semantics)
  the profiling library must contain these wrapper functions if it is
  to perform as expected. This requirement allows the person who
  builds the profiling library to extract these functions from the
  original \MPI/ library and add them into the profiling library
  without bringing along any other unnecessary code.
\item provide a no-op routine \func{MPI\_PCONTROL} in the \MPI/
  library.
\end{enumerate}

\section{Discussion}

The objective of the \MPI/ profiling interface is to ensure that it
is relatively easy for authors of profiling (and other similar) tools
to interface their codes to \MPI/ implementations on different
machines.

Since \MPI/ is a machine independent standard with many different
implementations, it is unreasonable to expect that the authors of
profiling tools for \MPI/ will have access to the source code which
implements \MPI/ on any particular machine. It is therefore necessary
to provide a mechanism by which the implementors of such tools can
collect whatever performance information they wish {\em without}
access to the underlying implementation.

We believe that having such an interface is important if \MPI/ is to
be attractive to end users, since the availability of many different
tools will be a significant factor in attracting users to the \MPI/
standard.

The profiling interface is just that, an interface. It says {\em
nothing} about the way in which it is used. There is therefore no
attempt to lay down what information is collected through the
interface, or how the collected information is saved, filtered, or
displayed.
While the initial impetus for the development of this interface arose
from the desire to permit the implementation of profiling tools, it
is clear that an interface like that specified may also prove useful
for other purposes, such as ``internetworking'' multiple \MPI/
implementations. Since all that is defined is an interface, there is
no objection to its being used wherever it is useful.

As the issues being addressed here are intimately tied up with the
way in which executable images are built, which may differ greatly on
different machines, the examples given below should be treated solely
as one way of implementing the objective of the \MPI/ profiling
interface. The actual requirements made of an implementation are
those detailed in the Requirements section above; the whole of the
rest of this chapter is only present as justification and discussion
of the logic for those requirements.

The examples below show one way in which an implementation could be
constructed to meet the requirements on a Unix system (there are
doubtless others which would be equally valid).

\section{Logic of the design}

Provided that an \MPI/ implementation meets the requirements above,
it is possible for the implementor of the profiling system to
intercept all of the \MPI/ calls which are made by the user program.
She can then collect whatever information she requires before calling
the underlying \MPI/ implementation (through its name shifted entry
points) to achieve the desired effects.

\subsection{Miscellaneous control of profiling}

There is a clear requirement for the user code to be able to control
the profiler dynamically at run time. This is normally used for (at
least) the purposes of
\begin{itemize}
\item Enabling and disabling profiling depending on the state of the
  calculation.
\item Flushing trace buffers at non-critical points in the
  calculation.
\item Adding user events to a trace file.
\end{itemize}
These requirements are met by use of \func{MPI\_PCONTROL}.
\begin{funcdef}{MPI\_PCONTROL(level, \ldots)}
\funcarg{\IN}{level}{Profiling level}
\end{funcdef}

\mpibind{MPI\_Pcontrol(const~int~level, \ldots)}

\mpifbind{MPI\_PCONTROL(level)\fargs INTEGER LEVEL, \ldots}

\MPI/ libraries themselves make no use of this routine, and simply
return immediately to the user code. However the presence of calls to
this routine allows a profiling package to be explicitly called by
the user.

Since \MPI/ has no control of the implementation of the profiling
code, we are unable to specify precisely the semantics which will be
provided by calls to \func{MPI\_PCONTROL}. This vagueness extends to
the number of arguments to the function, and their datatypes.

However, to provide some level of portability of user codes to
different profiling libraries, we request the following meanings for
certain values of {\tt level}.
\begin{itemize}
\item{{\tt level==0}} Profiling is disabled.
\item{{\tt level==1}} Profiling is enabled at a normal default level
  of detail.
\item{{\tt level==2}} Profile buffers are flushed. (This may be a
  no-op in some profilers).
\item{All other values of {\tt level}} have profile library defined
  effects and additional arguments.
\end{itemize}

We also request that the default state after \func{MPI\_INIT} has
been called is for profiling to be enabled at the normal default
level (i.e. as if \func{MPI\_PCONTROL} had just been called with the
argument 1). This allows users to link with a profiling library and
obtain profile output without having to modify their source code at
all.

The provision of \func{MPI\_PCONTROL} as a no-op in the standard
\MPI/ library allows them to modify their source code to obtain more
detailed profiling information, but still be able to link exactly the
same code against the standard \MPI/ library.
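As an illustration only (this is not part of the standard), a
profiling library might implement \func{MPI\_PCONTROL} so as to
honour the level meanings requested above. The {\tt profileLevel}
variable and the treatment of extra arguments here are hypothetical;
only the meanings of levels 0, 1 and 2 come from the text above.

\begin{verbatim}
/* Hypothetical profiling-library state; enabled at the default
   level, as requested for the state after MPI_INIT. */
static int profileLevel = 1;

int MPI_Pcontrol(const int level, ...)
{
    switch (level) {
    case 0:            /* profiling disabled */
    case 1:            /* profiling enabled at normal default detail */
        profileLevel = level;
        break;
    case 2:            /* flush profile buffers (may be a no-op) */
        break;
    default:           /* profile library defined effects; extra
                          arguments would be read with <stdarg.h> */
        break;
    }
    return 0;          /* i.e. MPI_SUCCESS */
}
\end{verbatim}

A user code linked against the standard \MPI/ library instead reaches
the no-op version, so the same source links unchanged in both cases.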
\section{Examples}

\subsection{Profiler implementation}

Suppose that the profiler wishes to accumulate the total amount of
data sent by the \func{MPI\_SEND} function, along with the total
elapsed time spent in the function. This could trivially be achieved
thus
\snir
\begin{verbatim}
static int totalBytes;
static double totalTime;

int MPI_Send(void * buffer, const int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double tstart = MPI_Wtime();        /* Pass on all the arguments */
    int extent;
    int result = PMPI_Send(buffer,count,datatype,dest,tag,comm);

    MPI_Type_size(datatype, &extent);   /* Compute size */
    totalBytes += count*extent;

    totalTime += MPI_Wtime() - tstart;  /* and time */

    return result;
}
\end{verbatim}
\rins

\subsection{MPI library implementation}

On a Unix system, in which the \MPI/ library is implemented in C,
there are various possible options, of which two of the most obvious
are presented here. Which is better depends on whether the linker and
compiler support weak symbols.

\subsubsection{Systems with weak symbols}

If the compiler and linker support weak external symbols
(e.g. Solaris 2.x, other System V.4 machines), then only a single
library is required through the use of {\tt \#pragma weak} thus
\begin{verbatim}
#pragma weak MPI_Example = PMPI_Example

int PMPI_Example(/* appropriate args */)
{
    /* Useful content */
}
\end{verbatim}
The effect of this {\tt \#pragma} is to define the external symbol
{\tt MPI\_Example} as a weak definition. This means that the linker
will not complain if there is another definition of the symbol (for
instance in the profiling library); however if no other definition
exists, then the linker will use the weak definition.
\subsubsection{Systems without weak symbols}

In the absence of weak symbols, one possible solution would be to use
the C macro pre-processor thus
\begin{verbatim}
#ifdef PROFILELIB
#  ifdef __STDC__
#    define FUNCTION(name) P##name
#  else
#    define FUNCTION(name) P/**/name
#  endif
#else
#  define FUNCTION(name) name
#endif
\end{verbatim}
Each of the user visible functions in the library would then be
declared thus
\begin{verbatim}
int FUNCTION(MPI_Example)(/* appropriate args */)
{
    /* Useful content */
}
\end{verbatim}
The same source file can then be compiled to produce both versions of
the library, depending on the state of the {\tt PROFILELIB} macro
symbol.

It is required that the standard \MPI/ library be built in such a way
that the inclusion of \MPI/ functions can be achieved one at a time.
This is a somewhat unpleasant requirement, since it may mean that
each external function has to be compiled from a separate file.
However this is necessary so that the author of the profiling library
need only define those \MPI/ functions which she wishes to intercept,
references to any others being fulfilled by the normal \MPI/ library.
Therefore the link step can look something like this
\begin{verbatim}
% cc ... -lmyprof -lpmpi -lmpi
\end{verbatim}
Here {\tt libmyprof.a} contains the profiler functions which
intercept some of the \MPI/ functions, {\tt libpmpi.a} contains the
``name shifted'' \MPI/ functions, and {\tt libmpi.a} contains the
normal definitions of the \MPI/ functions.

\subsection{Complications}

\subsubsection{Multiple counting}

Since parts of the \MPI/ library may themselves be implemented using
more basic \MPI/ functions (e.g. a portable implementation of the
collective operations implemented using point to point
communications), there is potential for profiling functions to be
called from within an \MPI/ function which was called from a
profiling function. This could lead to ``double counting'' of the
time spent in the inner routine.
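One way for a profiling library to avoid such double counting in a
single threaded program is a static flag which records that a wrapper
is already active, so that nested calls are passed straight through
to the name shifted routine. The following self-contained sketch is
illustrative only; the stub type definitions and the toy
{\tt PMPI\_Send} stand in for {\tt <mpi.h>}, and the
{\tt profiledCalls} counter is hypothetical.

\begin{verbatim}
/* Stub declarations standing in for <mpi.h>, for illustration. */
typedef int MPI_Datatype;
typedef int MPI_Comm;
static int PMPI_Send(void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm)
{
    (void)buf; (void)count; (void)datatype;
    (void)dest; (void)tag; (void)comm;
    return 0;
}

static int inWrapper = 0;      /* nonzero while a wrapper is active */
static int profiledCalls = 0;  /* calls actually counted */

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int result;
    if (inWrapper)   /* nested call from inside the MPI library: */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);

    inWrapper = 1;   /* outermost call: collect statistics once */
    profiledCalls++;
    result = PMPI_Send(buf, count, datatype, dest, tag, comm);
    inWrapper = 0;
    return result;
}
\end{verbatim}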
Since this effect could actually be useful under some circumstances
(e.g. it might allow one to answer the question ``How much time is
spent in the point to point routines when they're called from
collective functions?''), we have decided not to enforce any
restrictions on the author of the \MPI/ library which would overcome
this. Therefore the author of the profiling library should be aware
of this problem, and guard against it herself. In a single threaded
world this is easily achieved through use of a static variable in the
profiling code which remembers if you are already inside a profiling
routine. It becomes more complex in a multi-threaded environment (as
does the meaning of the times recorded!).

\subsubsection{Linker oddities}

The Unix linker traditionally operates in one pass: the effect of
this is that functions from libraries are only included in the image
if they are needed at the time the library is scanned. When combined
with weak symbols, or multiple definitions of the same function, this
can cause odd (and unexpected) effects.

Consider, for instance, an implementation of \MPI/ in which the
Fortran binding is achieved by using wrapper functions on top of the
C implementation. The author of the profile library then assumes that
it is reasonable only to provide profile functions for the C binding,
since Fortran will eventually call these, and the cost of the
wrappers is assumed to be small. However, if the wrapper functions
are not in the profiling library, then none of the profiled entry
points will be undefined when the profiling library is scanned.
Therefore none of the profiling code will be included in the image.
When the standard \MPI/ library is scanned, the Fortran wrappers will
be resolved, and will also pull in the base versions of the \MPI/
functions. The overall effect is that the code will link
successfully, but will not be profiled.
To overcome this we must ensure that the Fortran wrapper functions
are included in the profiling version of the library. We ensure that
this is possible by requiring that these be separable from the rest
of the base \MPI/ library. This allows them to be {\tt ar}ed out of
the base library and into the profiling one.

\section{Multiple levels of interception}

The scheme given here does not directly support the nesting of
profiling functions, since it provides only a single alternative name
for each \MPI/ function. Consideration was given to an implementation
which would allow multiple levels of call interception; however we
were unable to construct an implementation of this which did not have
the following disadvantages
\begin{itemize}
\item assuming a particular implementation language.
\item imposing a run time cost even when no profiling was taking
  place.
\end{itemize}
Since one of the objectives of \MPI/ is to permit efficient, low
latency implementations, and it is not the business of a standard to
require a particular implementation language, we decided to accept
the scheme outlined above.

Note, however, that it is possible to use the scheme above to
implement a multi-level system, since the function called by the user
may call many different profiling functions before calling the
underlying \MPI/ function. Unfortunately such an implementation may
require more cooperation between the different profiling libraries
than is required for the single level implementation detailed above.
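As a toy illustration of such a multi-level scheme (all names here
are hypothetical, and the routine is reduced to a trivial function),
the user-visible entry point can simply chain through the cooperating
profiling layers before reaching the name shifted routine:

\begin{verbatim}
static int layer1Calls = 0;    /* interceptions seen by each layer */
static int layer2Calls = 0;

/* The "real" name shifted implementation (a toy stand-in). */
static int PMPI_Example(int x) { return x + 1; }

/* Two cooperating profiling layers, innermost last. */
static int Prof2_Example(int x) { layer2Calls++; return PMPI_Example(x); }
static int Prof1_Example(int x) { layer1Calls++; return Prof2_Example(x); }

/* The user-visible function calls each profiling layer in turn. */
int MPI_Example(int x) { return Prof1_Example(x); }
\end{verbatim}

The cooperation required is visible even in this sketch: each layer
must know the name of the next layer down, which is exactly the extra
coordination between profiling libraries noted above.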