Friday, July 01, 2005

The Problem of Data Polymorphism

Parametric polymorphism refers to the ability of functions to be called with different types of arguments in different contexts. The cartesian product algorithm, or CPA, efficiently separates these contexts from each other by creating templates for each function, based on the cartesian product of the actual argument types provided at each call site. Because each context is connected to its own templates during analysis, templates do not pollute each other with bogus types. The idea behind CPA is that when different call sites provide the same argument types, templates can be shared. This makes for a very efficient analysis, and also a very precise one, as it is not limited to a certain call-chain length, the way earlier analyses such as nCFA are.
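
To make this concrete, here is a minimal sketch of CPA-style template sharing for simple types (the names 'templates' and 'get_templates' are made up for illustration; this is not how a real implementation would be structured):

from itertools import product

templates = {}

def get_templates(func, arg_type_sets):
    # one template per element of the cartesian product of the
    # possible types of each actual argument
    result = []
    for combo in product(*arg_type_sets):
        key = (func, combo)
        if key not in templates:
            templates[key] = 'template for %s%s' % (func, combo)
        result.append(templates[key])
    return result

# f(1) and f(1.0) get separate templates, but a third call site
# that passes an integer reuses the existing 'int' template
print(get_templates('f', [{'int'}]))
print(get_templates('f', [{'float'}]))
print(get_templates('f', [{'int'}]))

Because the cache is keyed on type combinations rather than on call sites, call sites providing the same argument types automatically share templates.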

However, CPA only considers 'simple' types, such as integers and floats. Because integers and floats always have the same 'behaviour', these types may be merged during dataflow analysis (i.e. when one integer meets another integer). CPA templates can be created eagerly for such types, as everything is known about them at each call site where they show up: all integers are the same, so a template made for one integer will be valid and precise for all integers that pass by. For polymorphic containers such as 'list', the story becomes more complicated. Data polymorphism refers to the ability of instance variables to contain different types, and different such types may require different templates to be made. For example, the return type of some function may depend on an instance variable of one of its arguments. Unfortunately, instance variable types cannot be fully known at their allocation site! Consider the following simple example:
def ident(x):
    return x

a = []
b = []
ident(a).append(1)
ident(b).append(1.0)

Do we create one or two templates for the 'ident' function? Since the list contents may not yet be known by the time these lists reach the function calls, we cannot eagerly create multiple templates. Nor can we merge lists during dataflow analysis and split them up later, when we find out more about their contents, as this could leave the dataflow network in an inconsistent state.

A naive solution, called 1CFA, is to create a separate 'class template' for each allocation site, in effect giving each allocation site its own type. This slows down the analysis, and it is still not precise, because allocation sites may be duplicated as part of templates, again giving rise to possibly differently used polymorphic containers. Splitting these duplicated allocation sites would quickly make for an intractable computation.
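
As a rough illustration, a 1CFA-style analysis might key container types on their allocation site, along these lines (the ListType class and the site tuples are invented for this sketch):

class ListType:
    # one distinct list type per allocation site, keyed by the
    # (filename, line) of the allocation
    def __init__(self, alloc_site):
        self.alloc_site = alloc_site
        self.contents = set()    # element types observed so far

    def __repr__(self):
        return 'list@%s:%d%s' % (self.alloc_site[0], self.alloc_site[1], sorted(self.contents))

# the two lists in the example get distinct types up front, even
# before anything is known about their contents
a_type = ListType(('example.py', 4))
b_type = ListType(('example.py', 5))
a_type.contents.add('int')
b_type.contents.add('float')
print(a_type)    # list@example.py:4['int']
print(b_type)    # list@example.py:5['float']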

A better solution adapts itself to the actual data polymorphism used in the program (the way CPA adapts itself to the actual parametric polymorphism), so that as many templates as possible can be shared. One solution, already alluded to by Olin Shivers (of nCFA fame), is to iterate the analysis, starting from a very imprecise situation and trying to improve type information after each iteration. John Plevyak found a practical method to implement this idea. Its first iteration is just 0CFA, which means that all similar types are merged, e.g. all lists. By looking at the dataflow graph in a backward fashion, starting at conflicting assignments to a single instance variable, it is possible to locate so-called 'imprecision points'. In the above example, if we start inside the two append templates (CPA creates one for the integer case and one for the float case), and we look backwards, we find that types from different allocation sites flow into the formal argument of 'ident'. By splitting the list type in two, one for each incoming dataflow edge at this imprecision point, the imprecision can be removed, because CPA will now create different templates for 'ident' during the next iteration! By repeatedly splitting types based on imprecision points, Plevyak's analysis 'adapts' itself to the data polymorphism in the program. Heuristics are probably needed for splitting decisions, because splitting too little means many iterations, and splitting too much means CPA might blow up.
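
A toy rendering of the splitting step on the example above could look as follows; the dictionaries here are hand-built stand-ins for a real dataflow graph, so this shows only the overall shape of the idea, not Plevyak's actual algorithm:

# 0CFA-style starting point: both allocation sites share one 'list' type
alloc_types = {'site_a': 'list', 'site_b': 'list'}

# element types assigned to each list type, via the append templates
assignments = {'list': {'int', 'float'}}

def find_imprecision_points(assignments):
    # a type that receives conflicting element assignments is imprecise
    return [t for t, elems in assignments.items() if len(elems) > 1]

def split(alloc_types, imprecise):
    # give each allocation site flowing into the imprecision point its
    # own fresh type, so CPA builds separate templates next iteration
    sites = [s for s, t in alloc_types.items() if t == imprecise]
    for i, site in enumerate(sites):
        alloc_types[site] = '%s#%d' % (imprecise, i)

for point in find_imprecision_points(assignments):
    split(alloc_types, point)

print(alloc_types)    # {'site_a': 'list#0', 'site_b': 'list#1'}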

Actually, when Plevyak developed the above method, CPA did not yet exist. His approach also encompassed means of handling parametric polymorphism precisely, but reportedly it was not very efficient. As the developer of CPA alluded to in his PhD thesis, it would be nice to try to combine Plevyak's data-iterative approach with CPA. Over the following weeks, I will try to implement such a combination, although I am a bit wary of potential problems I might encounter. As the first iteration will be just 0CFA, there will be almost no precision - how do we select imprecision points that, when split, will actually help? What if assignment types are not singleton sets, or what if there are many different combinations of assignment types? What about inheritance? How do we maintain information about duplicated allocation sites between iterations? How do we terminate efficiently in the face of many polymorphic containers? I guess I will have to find out...
