# Research

# Current

I am a research software engineer at the Topos Institute. I work with a team developing applied category theory software in Julia. The primary application here is building tools that make scientific computing and scientific knowledge representation easier, more transparent, and more robust to updating one’s model of the world. For an overview, see my summary talks *Scientific and Software Engineering Applications of CT* and *Combinatorial Representations of Scientific Knowledge*.

Writing code is more painful than it has to be, whether you’re a computational scientist or a software engineer. There are lots of high-level things we would like to do with code (e.g. combine functionalities of different codes, change one component or assumption without breaking everything else) that are not possible to do in practice. Part of the reason for this is that the syntax of a general purpose programming language is too powerful: one can do anything in it, which makes it hard to reason about! The abstractions we create to make our lives easier when programming (datatypes, classes, interfaces, scripts) are informed by decades of engineering know-how, yet they are also pretty ad hoc and not understood mathematically at a deep level, which makes them very fragile.

Our general value proposition is that category theory can provide a foundation for computing by allowing us to work in syntaxes which are *not* arbitrary code. These may be less expressive, but CT provides a formalism for giving them a computational *semantics*, along with tools for performing high-level operations in the simple syntax with guarantees that the corresponding operations happen in the computational semantics. We also understand these simpler syntaxes well (e.g. directed graphs) and can relate them to other syntaxes (e.g. Petri nets), giving us a principled way of extending and modifying our abstractions that was previously unavailable.

Some recent projects in this vein:

Model Exploration: a language for composing primitive “model spaces” (for scientific models of a very wide generality) into larger ones and a computational implementation, as described in our ACT 2022 paper (2022). [code]

Rewriting: an implementation of a general theory of patterns and replacements, allowing one to declare very general types of knowledge in a pictorial, transparent form, rather than code. (2021)

Agent-based models: an extension of the rewriting above to combine sequences of rewrite rules into a scientific model, again with the virtues of being more transparent and amenable to high-level refactoring than general code. See this blog post or (2023).

Symmetries: As described in this blog post, this project concerns extending McKay’s canonical graph labeling algorithm (implemented in nauty) to C-sets, so that isomorphic instances can be immediately seen to be equal. [code]

Knowledge representation: see blog post on how a database can be equipped with equations (stated in a graphical form) and how these equations are enforced.
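To make the Symmetries item above concrete, here is a minimal sketch of the idea of a canonical form, restricted to small directed graphs (one simple kind of C-set). It is a brute-force search over all vertex relabelings, not McKay’s algorithm and not the project’s code; the function name `canonical_form` is hypothetical.

```python
from itertools import permutations


def canonical_form(num_vertices, edges):
    """Return the smallest relabeled edge list over all vertex permutations.

    Brute force (factorial time): only meant to illustrate what a canonical
    form is, not how to compute one efficiently.
    """
    best = None
    for perm in permutations(range(num_vertices)):
        # Relabel every edge and sort, so edge order does not matter.
        relabeled = tuple(sorted((perm[s], perm[t]) for s, t in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best


# Two presentations of a directed 3-cycle with different vertex labels.
g1 = [(0, 1), (1, 2), (2, 0)]
g2 = [(0, 2), (2, 1), (1, 0)]
# A directed path, which is not isomorphic to the cycle.
g3 = [(0, 1), (1, 2)]

same = canonical_form(3, g1) == canonical_form(3, g2)
different = canonical_form(3, g1) != canonical_form(3, g3)
```

Once every instance is stored in canonical form, testing isomorphism reduces to testing equality, which is what makes the expensive step worth doing once up front.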

## Knowledge representation and data integration

### Declarative specification of databases

Scientists modeling complicated phenomena typically don’t use explicit (formally specified) models. This is pragmatic, given the open-source tools currently available, but informal reasoning ultimately leads to serious challenges in communicating scientific results clearly and sharing data. A relational database backend is important for a scalable modeling tool, but a SQL-less interface is also crucial: the complexity of managing database implementation details quickly becomes unmanageable and hard to extend as model complexity increases. I helped develop a Python EDSL that lets scientists generate relational databases from a natural declaration of scientific facts, and query and publicly communicate their knowledge base just as naturally. Our strategy is published here (2021) and here (2021), and it is being commercialized by the startup Modelyst by Michael Statt and Brian Rohr.
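As a rough illustration of what “declaring facts and generating the database” can look like, here is a minimal sketch using only the Python standard library. It is not the API of the EDSL described above: the `Entity` class, the `to_ddl` function, the table names, and the energy value are all hypothetical placeholders.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A kind of scientific object with typed attributes and references."""
    name: str
    attrs: dict = field(default_factory=dict)  # attribute name -> SQL type
    refs: dict = field(default_factory=dict)   # foreign key name -> target entity


def to_ddl(entity):
    """Generate a CREATE TABLE statement from a declaration."""
    cols = ["id INTEGER PRIMARY KEY"]
    cols += [f"{a} {t}" for a, t in entity.attrs.items()]
    cols += [
        f"{r} INTEGER NOT NULL REFERENCES {target}(id)"
        for r, target in entity.refs.items()
    ]
    return f"CREATE TABLE {entity.name} ({', '.join(cols)});"


# The user only writes declarations; no SQL appears at this level.
schema = [
    Entity("material", attrs={"formula": "TEXT NOT NULL"}),
    Entity("calculation", attrs={"energy_ev": "REAL"},
           refs={"material_id": "material"}),
]

conn = sqlite3.connect(":memory:")
for e in schema:
    conn.execute(to_ddl(e))

# Placeholder data, purely for illustration.
conn.execute("INSERT INTO material (formula) VALUES ('H2O')")
conn.execute("INSERT INTO calculation (energy_ev, material_id) VALUES (-14.2, 1)")
rows = conn.execute(
    "SELECT m.formula, c.energy_ev "
    "FROM calculation c JOIN material m ON c.material_id = m.id"
).fetchall()
```

The design point is that the declarations are the single source of truth: changing the model means editing an `Entity`, and the SQL is regenerated rather than maintained by hand.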

### Heterogeneous data integration, using Category Theory

There is little standardization in how data is to be represented and stored in many scientific fields. However, the varying schemas of different researchers contain significant overlap in information, and for data-driven fields it is especially beneficial to be able to freely switch from one frame of reference to another.

Furthermore, when we update our view of the world, it’s important to be able to migrate our old data, algorithms, and analysis tools into the new framework. When these tools are expressed in the language of C-sets, this migration can be automated in a verifiable way, described in Categorical data integration for computational science (2019) with computational chemistry as a case study. As stressed in the paper, these migration tools are of importance to scientists who wish to communicate and share data with lower risk of data misinterpretation. This was implemented using Categorical Query Language, a tool developed by the startup Conexus.
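One of the simplest migration operations in this setting pulls an instance back along a functor between schemas (often written Δ). The sketch below assumes a schema is a set of tables with foreign-key arrows, an instance assigns rows to tables and lookup maps to arrows, and a functor sends each arrow of one schema to a path of arrows in the other. It is plain Python, not Categorical Query Language and not the paper’s code; all names and data are hypothetical.

```python
def delta_migrate(functor_obs, functor_arrows, instance_sets, instance_maps):
    """Pull an instance on schema D back along a functor F: C -> D.

    functor_obs:    C-table name -> D-table name
    functor_arrows: C-arrow name -> (C source table, list of D-arrows to compose)
    instance_sets:  D-table name -> list of row ids
    instance_maps:  D-arrow name -> dict from row id to row id
    """
    new_sets = {c: list(instance_sets[d]) for c, d in functor_obs.items()}
    new_maps = {}
    for f, (source, path) in functor_arrows.items():
        mapping = {}
        for x in new_sets[source]:
            y = x
            for arrow in path:  # follow the path of foreign keys in D
                y = instance_maps[arrow][y]
            mapping[x] = y
        new_maps[f] = mapping
    return new_sets, new_maps


# Schema D: Calculation --structure--> Structure --molecule--> Molecule
instance_sets = {
    "Calculation": ["c1", "c2", "c3"],
    "Structure": ["s1", "s2"],
    "Molecule": ["m1", "m2"],
}
instance_maps = {
    "structure": {"c1": "s1", "c2": "s1", "c3": "s2"},
    "molecule": {"s1": "m1", "s2": "m2"},
}

# Schema C: Run --sample--> Sample, a coarser view of the same data.
functor_obs = {"Run": "Calculation", "Sample": "Molecule"}
functor_arrows = {"sample": ("Run", ["structure", "molecule"])}

sets_c, maps_c = delta_migrate(
    functor_obs, functor_arrows, instance_sets, instance_maps
)
```

Here the coarser schema forgets the intermediate `Structure` table, and the migrated `sample` map is computed by composing the two foreign keys, so the result is determined entirely by the functor rather than by hand-written conversion code.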

## Development of functionals for Density Functional Theory

The simulation of chemical reactions using first-principles techniques requires a theoretical framework that is able to describe a wide range of electronic interactions. Under the direction of Johannes Voss and with Yasheng Maimaiti and Kai Trepke, I developed MCML (2021), a new meta exchange-correlation functional, with a semi-empirical approach, fitting the functional form against higher-level theory and experimental benchmark data. By using Bayesian statistics, we enabled uncertainty estimation of the computed reaction energies. This complements my earlier research applying DFT to discover catalysts for sustainable energy applications ((2020), (2019), (2019), and (2018)) as well as my earlier experimental chemistry research ((2017) and (2017)).

## Formal methods

I briefly rotated with the Barrett group at Stanford to learn about formal methods and made small contributions: (2021) and (2021).

I also interned at Google where we applied the Lean theorem prover. My final presentation was recorded here.

# Talks

Title | Event | Links |
---|---|---|
Practical abstract algebra for chemical engineers | UC Berkeley 2023 | slides |
A graphical language for rewriting-based programs and agent-based models | ACT 2023 | slides |
On extending mathematical attitudes to natural languages | Topos 2023 | slides |
Scientific and software engineering examples of applied category theory | Topos 2023 | video slides |
Compositional Exploration of Scientific Models | Glasgow, ACT 2022 | video slides notes |
Rewriting individual-based models for epidemiology | Glasgow, ACT 2022 | video notes |
Computational Category Theoretic Graph Rewriting | Nantes, ICGT 2022 | slides |
AlgebraicRewriting.jl - Declarative Data Transformation | JuliaCon 2022 | video slides |
Extending McKay’s Canonical Isomorph Algorithm to C-Sets | SIAM DM 2022 | slides |
Combinatorial Representations of Scientific Knowledge | Topos 2022 | video slides |
Leveraging Data to Improve the Accuracy of Chemistry Simulations | Thesis 2021 | video |
Formal Verification of Android Build Code | Google 2020 | video |

## Posters

Title | Event |
---|---|
Applied Category Theory for Scientists | Denmark, Catalysis and Modeling Symposium 2022 |

# CV

## References

*The Journal of Physical Chemistry C* 122 (12): 6713–20.

*Computational Materials Science* 164: 127–32. https://arxiv.org/pdf/1903.10579.pdf.

*Journal of Computational Chemistry* 42 (28): 2004–13.

*CoRR* abs/2111.03784. https://arxiv.org/abs/2111.03784.

*arXiv Preprint arXiv:2304.14950*.

*Applied Physics Letters* 110 (15): 153902.

*Applied Catalysis B: Environmental* 218: 643–49.

*Nature Communications* 10 (1): 1–10.

*The Journal of Physical Chemistry C* 123 (10): 5999–6009.

*The Journal of Physical Chemistry C* 124 (45): 24765–75.

*International Conference on Computer Aided Verification*, 461–74. Springer.

*International Conference on Theory and Applications of Satisfiability Testing*, 377–86. Springer.