Math-to-Industry Boot Camp III

Advisory: Application deadline is February 28, 2018

Organizers:

Benjamin Brubaker, University of Minnesota, Twin Cities
Fadil Santosa, University of Minnesota, Twin Cities
Daniel Spirn, University of Minnesota, Twin Cities

The Math-to-Industry Boot Camp is an intense six-week session designed to provide graduate students with training and experience that are valuable for employment outside of academia. The program is targeted at Ph.D. students in pure and applied mathematics. The boot camp consists of courses in the basics of programming, data analysis, and mathematical modeling. Students work in teams on projects and are provided with training in resume and interview preparation as well as teamwork.

There are two group projects during the session: a small-scale project designed to introduce the concept of solving open-ended problems and working in teams, and a "capstone project" that is posed by industrial scientists.

Weekly seminars by industrial scientists provide the students with opportunities to learn about a variety of possible future careers.

Eligibility

Applicants must be current graduate students in a Ph.D. program at a U.S. institution during the period of the boot camp.

Logistics

The program will take place at the IMA on the campus of the University of Minnesota. Students will be housed in a residence hall on campus and will receive a per diem and a travel budget, as well as an $800 stipend.

Applications

To apply, please supply the following materials through the link at the top of the page:

Statement of reason for participation, career goals, and relevant experience
Unofficial transcript and evidence of good standing and full-time status
Letter of support from advisor, director of graduate studies, or department chair

Selection will be based on background and statement of interest, as well as geographic and institutional diversity. Women and minorities are especially encouraged to apply. Selected participants will be contacted in April.

Participants

Name	Department	Affiliation
Muhammad Afridi		3M
Nicholas Asendorf		3M
Christopher Bemis		Whitebox Advisors
Nitsan Ben-Gal	Software, Electronics and Mechanical Systems Laboratory	3M
Jesse Berwald		D-Wave Systems
Ariel Bowman	Department of Mathematics	University of Texas at Arlington
Chris Browne	Center for Applied Mathematics	Cornell University
Benjamin Brubaker	School of Mathematics	University of Minnesota, Twin Cities
Kate Brubaker	Department of Mathematics	Purdue University
Irfan Bulu	Department of Math and Modeling	Schlumberger-Doll Research
Shawn Burkett	Mathematics	University of Colorado
Olivia Cannon	Department of Mathematics	University of Minnesota, Twin Cities
Jared Catenacci	Diagnostic Research and Material Studies	National Security Technologies, LLC
Chirasree Chatterjee	Department of Mathematics and Statistics	Saint Louis University
Hua Chen	Department of Mathematical Sciences	University of Delaware
Aaron Cohen	Department of Mathematics	Indiana University
Paula Dassbach		Medtronic
Mingchang Ding	Department of Mathematical Sciences	University of Delaware
Jasmine Foo	School of Mathematics	University of Minnesota, Twin Cities
Zhen Gao	Department of Mathematics	Vanderbilt University
Maria Gommel	Department of Mathematics	The University of Iowa
Hayley Guy	School of Mathematics	North Carolina State University
Qie He	Department of Industrial and Systems Engineering	University of Minnesota, Twin Cities
Thomas Hoft	Department of Mathematics	University of St. Thomas
Ruihao Huang	Department of Mathematical Sciences	Michigan Technological University
Jeffrey Humpherys		UnitedHealth Group
Laura Iosip	Department of Mathematics	University of Maryland
Melanie Jensen	Department of Mathematics	Tulane University
Alicia Johnson		Macalester College
Ekaterina Kryuchkova	Center for Applied Mathematics	Cornell University
Kevin Leder	Department of Industrial and Systems Engineering	University of Minnesota, Twin Cities
Philku Lee	Department of Mathematics and Statistics	Mississippi State University
SangJoon Lee	Department of Mathematics	University of Connecticut
Hengguang Li	Department of Mathematics	Wayne State University
Aaron Luttman	Diagnostic Research and Material Studies	National Security Technologies, LLC
Christopher Miller	School of Mathematics	University of California, Berkeley
Cristian Minoccheri	Department of Mathematics	State University of New York, Stony Brook (SUNY)
Sarah Miracle	Department of Computer and Information Sciences	University of St. Thomas
Shannon Negaard-Paper		University of Minnesota, Twin Cities
Elpiniki Nikolopoulou	Department of Applied Mathematics and Statistics	Arizona State University
Michelle Pinharry	School of Mathematics	University of Minnesota, Twin Cities
Iurii Posukhovskyi	Department of Mathematics	University of Kansas
Mrinal Raghupathi	USAA Asset Management Company	USAA Asset Management Company
Michael Ramsey	Department of Applied Mathematics	University of Colorado
Eric Roberts	Department of Applied Mathematics	University of California, Merced
Tanushree Roy	School of Mathematics	University of Central Florida
Keith Rush	Department of Strategy and Analytics	Milwaukee Brewers
Fadil Santosa	School of Mathematics	University of Minnesota, Twin Cities
Chang Shu	Department of Applied Mathematics	University of California, Davis
Dallas Smith	School of Mathematics	Brigham Young University
Alberto Speranzon	Aerospace	Honeywell
Daniel Spirn	University of Minnesota	University of Minnesota, Twin Cities
Binh Tang	Department of Statistical Science	Cornell University
Elizabeth Wicks	School of Mathematics	University of Washington
Shiqiang Xia		University of Minnesota, Twin Cities
Di Ye		Zhennovate
Yufei Yu	Department of Mathematics	University of Kansas
Sheng Zhang	Department of Mathematics	Purdue University

Projects and teams

Team 1: Mathematical Models for Adaptive Multi-modal Sensing

Mentor Aaron Luttman, National Security Technologies, LLC
Mentor Jared Catenacci, National Security Technologies, LLC
Ariel Bowman, University of Texas at Arlington
Shawn Burkett, University of Colorado
Hayley Guy, North Carolina State University
Laura Iosip, University of Maryland
Yufei Yu, University of Kansas
Sheng Zhang, Purdue University

Scientific experiments are a natural source of data – which usually means diagnostic systems fielded to collect information within the experiments themselves – but there has been a recent trend towards collecting data around big science experiments to understand whether we can detect and characterize the behaviors associated with the experiments. The question is whether it is possible to determine what experiments are being conducted by analyzing human patterns, so-called “patterns of life,” around and in the experimental facilities. In order to measure patterns of life, we analyze many different types of data, from power grid load profiles to internet activity to sound and pressure signals from cars.

There are two primary challenges that must be addressed:

Mathematical Models for Adaptive Sensing – When should a sensor system turn on its sensors and transmit its data, given that these two activities take a lot of power?

Physics-based Multi-modal Feature Selection and Detection – How can one incorporate physics models for sensing into machine learning approaches to data analysis?

Real multi-sensor data will be provided for testing and validation.

Team 2: Quantum Computation and QUBO Slicing

Mentor Jesse Berwald, D-Wave Systems
Olivia Cannon, University of Minnesota, Twin Cities
Tanushree Roy, University of Central Florida
Chang Shu, University of California, Davis
Dallas Smith, Brigham Young University
Elizabeth Wicks, University of Washington

Background

Quantum annealing computers have begun to enter the business and academic worlds. Over the past five years they have been used for a wide variety of (prototypical) applications, with evidence of differentiated performance in some cases.

A first step in utilizing these computers is to reformulate the problem in an energy minimization framework. This is typically cast as a Hamiltonian, or alternatively as a quadratic unconstrained binary optimization (QUBO), which can be represented as a matrix. These formulations are translated to the physical qubits on the quantum processing unit (QPU) through a process termed “embedding”. Embedding a given problem onto the QPU is handled through a number of different heuristics and is an active area of research in itself. One proposed approach is described below.
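As a small illustration of the matrix representation of a QUBO, the sketch below evaluates the objective x^T Q x for binary vectors and finds the minimizer by exhaustive search. The matrix `Q` and the function names `qubo_energy` and `brute_force_minimum` are invented for this example. Exhaustive search is only feasible for a handful of variables and says nothing about how an annealer actually samples solutions.

```python
from itertools import product

def qubo_energy(Q, x):
    """Return x^T Q x for a binary vector x and a square matrix Q (list of lists)."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def brute_force_minimum(Q):
    """Try all 2^n binary assignments and return (best_x, best_energy)."""
    n = len(Q)
    best = min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))
    return best, qubo_energy(Q, best)

# Hypothetical 3-variable QUBO in upper-triangular form:
# diagonal entries act as linear biases, off-diagonal entries as couplings.
Q = [
    [-1.0,  2.0,  0.0],
    [ 0.0, -1.0,  2.0],
    [ 0.0,  0.0, -1.0],
]

best_x, best_e = brute_force_minimum(Q)
print(best_x, best_e)
```

Here the couplings penalize switching on neighboring variables together, so the minimizer turns on the first and third variables only.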

Problem statement

In this project we will investigate one proposed solution to the embedding problem:

The goal is to make the most efficient use of the qubit hardware by developing a parameterized transformation from the space spanned by physical qubits, “qubit space”, to the space spanned by problem variables, the “problem search space”. Our goal will be to define a linear transformation from qubit space to problem search space that allows for a more efficient use of available hardware.

Since the problem space is (in general) much larger than the qubit space, a fixed parameterization will succeed in mapping the qubit space into a proper subspace of the problem space. We term these subspaces “slices”. This reduced problem can then be solved with an optimal use of the available hardware. Using different parameterizations, we can define a series of linear transformations onto orthogonal subspaces of the problem space.
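One very simple instance of such a linear map, offered only as an illustration and not as the parameterization the project has in mind, is a coordinate-selection matrix T that places k qubit variables into k chosen coordinates of the n-dimensional problem space. The sketch below assumes NumPy, a random hypothetical QUBO matrix, and invented helper names. It checks that the reduced matrix T^T Q T reproduces the full energy on the slice and that slices built from disjoint index sets are orthogonal.

```python
import numpy as np

def selection_matrix(n, indices):
    """Build an n x k 0/1 matrix whose columns are standard basis vectors e_i."""
    T = np.zeros((n, len(indices)))
    for col, i in enumerate(indices):
        T[i, col] = 1.0
    return T

def reduced_qubo(Q, T):
    """Restrict the QUBO to the slice: y^T (T^T Q T) y equals (T y)^T Q (T y)."""
    return T.T @ Q @ T

rng = np.random.default_rng(0)
n = 6
Q = np.triu(rng.normal(size=(n, n)))    # hypothetical upper-triangular QUBO

T_a = selection_matrix(n, [0, 2, 4])    # slice A
T_b = selection_matrix(n, [1, 3, 5])    # slice B, disjoint from slice A

y = np.array([1.0, 0.0, 1.0])           # an assignment of the 3 qubit variables
x = T_a @ y                             # the same assignment in problem space
energy_full = x @ Q @ x
energy_slice = y @ reduced_qubo(Q, T_a) @ y
slices_orthogonal = np.allclose(T_a.T @ T_b, 0.0)
```

Problem variables outside the slice are held at zero in this construction. Richer parameterizations, such as ones that clamp them to other values or mix coordinates, would need additional linear terms.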

There are many parameterizations to choose from, each of which raises a number of research questions. We will prioritize our investigation roughly as follows:

Given a QUBO matrix defining the problem search space, is there an algorithm that produces the most efficient set of transformations (parameterizations) from qubit space to problem space?
Is there a greedy algorithm that is best in practice, i.e., one that chooses a slice that maximizes the use of the chip and then chooses successively smaller slices to query the entire search space?
What is the role of sparsity in the choice of transformations?
The QPU itself has a unique architecture. How does this architecture affect the choice of transformations?

References

Traffic flow optimization using a quantum annealer: https://arxiv.org/pdf/1708.01625.pdf
A NASA Perspective on Quantum Computing: Opportunities and Challenges: https://arxiv.org/pdf/1704.04836.pdf

Team 3: Time Series Analysis of Gas Mixture Data

Mentor Nicholas Asendorf, 3M
Kate Brubaker, Purdue University
Ruihao Huang, Michigan Technological University
Philku Lee, Mississippi State University
Elpiniki Nikolopoulou, Arizona State University
Michelle Pinharry, University of Minnesota, Twin Cities

Motivation

Sensor networks are ubiquitous in today’s Internet of Things, capable of collecting high-frequency data in a cost-efficient way. This results in mountains of time-series data that hopefully contain signals of interest buried in noise. As the number of deployed sensors grows, so does the dimensionality of the observed data, further increasing the complexity of the problem. 3M is interested in such large-scale time series analyses because many of our datasets can be framed in this way: manufacturing, sales, and chemical experiments, to name a few.

Dataset

This publicly available dataset contains time series sensor readings from chemical sensors over the duration of 12 hours. The inputs to these sensors are known concentrations of various gases. The dataset contains timestamped measurements from 16 gas sensors and the input concentrations of the gases. This is a labeled time series dataset. There are two different gas mixture measurement files, one for Ethylene and CO, and one for Ethylene and Methane. At 3M, we may have similar types of experimental data (perhaps using different sensors) where we would like to determine the interactions between materials or understand fundamental properties of materials. Being able to intelligently and efficiently mine these rich datasets for insights about material characteristics is critical.

The Challenge

Some interesting problems to consider:

Develop an algorithm to estimate the concentration of each gas given sensor measurements. You might approach this problem using classical machine learning, splitting data into training, validation, and testing, while treating time series measurements as independent points.
Develop algorithms to estimate the concentrations of each gas using time series based methods like windowing, tsfresh, or RNNs. In this approach, we don’t want to treat each measurement as independent. How do these algorithms compare to classical machine learning techniques?
Can you use the fact that we have 4 replicates of each sensor at each time point to improve your algorithms? Can you use any clever data fusion techniques or outlier detection strategies?
What can you tell about the importance or accuracy of the 4 types of sensors used?
What happens when we purposely introduce missing data? Can we use the replicates of each sensor to overcome this? How robust are your algorithms to missing data?
Since each dataset has measurements for Ethylene, can we use both datasets to develop a more robust estimation scheme for that gas?
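To make the contrast between the first two problems concrete, here is a minimal sketch that compares a pointwise linear regression with a windowed one. It runs on synthetic data, not on the real dataset: the lagged sensor response, the mixing matrix, the noise level, and the helper names `make_windows` and `fit_and_score` are all assumptions made for illustration. The split is chronological, so the model is scored on later measurements than it was fit on.

```python
import numpy as np

def make_windows(readings, width):
    """Stack each row with its width-1 predecessors: (T, d) -> (T-width+1, width*d)."""
    T = readings.shape[0]
    return np.hstack([readings[k:T - width + 1 + k] for k in range(width)])

def fit_and_score(X, y, train_fraction=0.7):
    """Least-squares fit on the earliest rows, RMSE on the remaining rows."""
    split = int(train_fraction * len(X))
    X1 = np.hstack([X, np.ones((len(X), 1))])          # add an intercept column
    coef, *_ = np.linalg.lstsq(X1[:split], y[:split], rcond=None)
    resid = X1[split:] @ coef - y[split:]
    return float(np.sqrt(np.mean(resid ** 2)))

# Synthetic stand-in for the gas data (assumed, not the real measurements).
rng = np.random.default_rng(0)
T, n_gases, n_sensors = 2000, 2, 16
conc = np.repeat(rng.uniform(0, 10, size=(T // 100, n_gases)), 100, axis=0)
mixing = rng.uniform(0.5, 1.5, size=(n_gases, n_sensors))
ideal = conc @ mixing
readings = np.empty_like(ideal)
readings[0] = ideal[0]
for t in range(1, T):
    readings[t] = 0.9 * readings[t - 1] + 0.1 * ideal[t]   # slow sensor response
readings += rng.normal(scale=0.05, size=readings.shape)

width = 5
rmse_pointwise = fit_and_score(readings, conc)
# Each window is paired with the concentration at its last time step.
rmse_windowed = fit_and_score(make_windows(readings, width), conc[width - 1:])
print(rmse_pointwise, rmse_windowed)
```

The same windowed feature matrix could be passed to other regressors, and the comparison on real data may look quite different from the synthetic case.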

Team 4: Structured Variational Auto Encoders

Mentor Irfan Bulu, Schlumberger-Doll Research
Hua Chen, University of Delaware
Aaron Cohen, Indiana University
Mingchang Ding, University of Delaware
Melanie Jensen, Tulane University
Christopher Miller, University of California, Berkeley
Michael Ramsey, University of Colorado

Generative models such as Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN) have been very successful in unsupervised learning settings. In a VAE setting, we would like to learn a set of latent variables that explain our data. Although this approach has been very successful as a generative model, the interpretation of the latent variables is still a challenge. Ideally, we would like to do unsupervised learning through which we identify a number of classes (not specified yet). Once a set of classes has been identified, we can then label each class once instead of having to label the entire data set. Imagine you have a sample of handwritten digits without labels. If we can structure a VAE so that it identifies 10 classes, we can then label these classes as the relevant digits. This would be very helpful, as most of our data is unlabeled or poorly labeled.

Concepts that may be helpful to know: neural networks, generative models, graphical models, stochastic variational inference.
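For readers new to VAEs, the sketch below shows the two ingredients of the standard (unstructured) VAE objective in NumPy: the reparameterization trick and the closed-form KL divergence from a diagonal Gaussian to a standard normal prior. It assumes a squared-error reconstruction term, which corresponds to a unit-variance Gaussian likelihood up to a constant. The encoder and decoder networks are omitted, and the function names are invented. The structured variant that the project is after would replace the standard normal prior with something richer, which is not shown here.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)

def negative_elbo(x, x_recon, mu, log_var):
    """Squared-error reconstruction term plus KL term, averaged over the batch."""
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=1)
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))

# Placeholder encoder outputs for a batch of 2 points with 2 latent dimensions.
rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0], [0.0, 0.0]])
log_var = np.zeros((2, 2))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The KL term vanishes exactly when the encoder output matches the prior, which is why it acts as a regularizer on the latent code.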

Team 5: Tailored Discovery in Stock Portfolios

Mentor Christopher Bemis, Whitebox Advisors
Chirasree Chatterjee, Saint Louis University
Zhen Gao, Vanderbilt University
Cristian Minoccheri, State University of New York, Stony Brook (SUNY)
Shannon Negaard-Paper, University of Minnesota, Twin Cities
Shiqiang Xia, University of Minnesota, Twin Cities

Modern portfolio theory has provided tools to identify systemic and idiosyncratic risks via models like Markowitz' Mean-Variance Optimization. In addition, a taxonomy of equities has emerged through feature identification, with one of the earliest and most impactful being Fama and French's three-factor model.

In this project, we will leverage technical and fundamental data like return series and earnings information along with well understood equity features like exposure to so-called size, value, and market portfolios to develop tools for suggesting supplements (e.g., technology stocks when looking at Apple) and complements (e.g., energy stocks when looking at Delta Airlines) for individual equities and portfolios. These tools may be used in tailored discovery and research by analysts looking to either construct a portfolio based on a theme or to diversify. The work will ideally evolve from point estimates using simple norms in a predetermined feature space to applying machine learning techniques.
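A point estimate of the kind described above could look like the following sketch. It estimates each equity's exposure to three factors by least squares and then ranks the other equities by Euclidean distance in that exposure space, which addresses only the supplement side. All data here is synthetic, the three factor series merely stand in for market, size, and value portfolios, and the function names are hypothetical. No Quandl data or API is used.

```python
import numpy as np

def factor_exposures(returns, factors):
    """Regress each equity's returns on the factors; return (n_equities, n_factors) betas."""
    X = np.hstack([np.ones((factors.shape[0], 1)), factors])   # intercept + factors
    coef, *_ = np.linalg.lstsq(X, returns, rcond=None)         # (1 + n_factors, n_equities)
    return coef[1:].T

def nearest_in_feature_space(features, target, k):
    """Indices of the k rows closest to row `target` in Euclidean norm, excluding itself."""
    dist = np.linalg.norm(features - features[target], axis=1)
    dist[target] = np.inf
    return np.argsort(dist)[:k]

# Synthetic daily returns (assumed): 8 equities driven by 3 factors plus noise.
rng = np.random.default_rng(0)
n_days, n_equities = 500, 8
factors = rng.normal(scale=0.01, size=(n_days, 3))
true_betas = rng.normal(size=(n_equities, 3))
true_betas[1] = true_betas[0] + 0.01        # make equity 1 a near-twin of equity 0
returns = factors @ true_betas.T + rng.normal(scale=0.001, size=(n_days, n_equities))

betas = factor_exposures(returns, factors)
suggestions = nearest_in_feature_space(betas, target=0, k=3)
print(suggestions)
```

Because equity 1 was constructed with nearly the same exposures as equity 0, it should come back as the closest suggestion. Swapping the Euclidean norm for another distance is a one-line change in `nearest_in_feature_space`.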

Data will be supplied from Quandl, and the preferred language for development will be Python.

Team 6: Sequence-to-sequence modeling for the business of baseball

Mentor Keith Rush, Milwaukee Brewers
Maria Gommel, The University of Iowa
Ekaterina Kryuchkova, Cornell University
SangJoon Lee, University of Connecticut
Iurii Posukhovskyi, University of Kansas
Eric Roberts, University of California, Merced

Each fan has a unique relationship to his or her favorite sports teams, and each has a different ideal every time they step into the stadium. When a team makes a big free-agent signing in February, the fan who follows the competition closely will be ecstatic; the fan who primarily enjoys the communal aspects will only see this effect in the buzz generated in his or her social circles. In order to cherish their fans to the utmost, teams must have a global view of their business and be able to structure data from all sources and across all levels of granularity, creating one universe into which all inputs feed and from which all outputs flow.

This project is fundamentally a first step in that direction. The problem we are focusing on is roughly the following: conditioned on a vector representing a fan's history with the Club and the attributes of a particular game, how well can we ingest information in time and map it forward one time step? For this purpose, we will test the standard recurrent and convolutional network architectures, experiment with variants, and discuss the reasons for applying each and their limitations. Data will be provided by the Brewers, and development will take place in Python, using cloud infrastructure for computing power.
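The one-step-ahead setup can be sketched with a plain tanh recurrent network whose hidden update also receives a fixed conditioning vector at every step. This is a bare forward pass in NumPy with random, untrained weights. The dimensions, parameter names, and the way the context vector enters the update are assumptions for illustration, not the architecture used in the project.

```python
import numpy as np

def init_params(n_in, n_ctx, n_hidden, rng):
    """Small random weights for an untrained, conditioned tanh RNN."""
    scale = 0.1
    return {
        "W_x": rng.normal(scale=scale, size=(n_hidden, n_in)),
        "W_h": rng.normal(scale=scale, size=(n_hidden, n_hidden)),
        "W_c": rng.normal(scale=scale, size=(n_hidden, n_ctx)),
        "b": np.zeros(n_hidden),
        "W_out": rng.normal(scale=scale, size=(n_in, n_hidden)),
    }

def predict_next(params, sequence, context):
    """Run the RNN over `sequence` (T, n_in) and return a prediction for step T+1."""
    h = np.zeros(params["b"].shape)
    for x_t in sequence:
        h = np.tanh(
            params["W_x"] @ x_t
            + params["W_h"] @ h
            + params["W_c"] @ context      # same conditioning vector at every step
            + params["b"]
        )
    return params["W_out"] @ h

rng = np.random.default_rng(0)
n_in, n_ctx, n_hidden = 4, 3, 8
params = init_params(n_in, n_ctx, n_hidden, rng)
sequence = rng.normal(size=(10, n_in))     # hypothetical observed history
context = rng.normal(size=n_ctx)           # hypothetical fan and game attributes
next_step = predict_next(params, sequence, context)
```

In practice the weights would be trained on the squared error between `next_step` and the observed next observation, typically in a deep learning framework instead of by hand.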