About Me

I obtained my PhD from University of Michigan CSE co-advised by Prof. Mike Cafarella and Prof. H. V. Jagadish (jag). I am particularly interested in building interactive data preparation systems (like data cleaning, data integration, etc.) to improve productivity of data scientists/analysts, programmers and non-expert data users. In Spring 2019, I interned in Microsoft Research DMX lab mented by Yeye He. In Summer 2017, I interned in Trifacta working on string data pattern normalization mentored by Sean Kandel, Mike Minar, and Prof. Joe Hellerstein. The work was integrated into Trifacta Cloud Wrangler launched in 08/2018. I received B.S. degrees in computer science and mathematics from Purdue University. Before transfering to Purdue, I studied EE for two years in Tianjin University in China. Please check my CV for more details.

NEW! Our Query Engine team at TikTok US is actively hiring software/data engineers at all levels. Feel free to DM me for more info.

Publications

Recent News

Projects

Foofah (video)

SIGMOD'17 · SIGMOD'17 Demo

Foofah performs data transformation/cleaning through programming by examples (PBE), which requires little domain knowledge from non-expert users. It efficiently discovers a sequence of parameterized data wrangling actions which guarantee to transform the raw data into the example form provided by the end user using a combinatorial search algorithm guided by the proposed distance metric customized for spreadsheets. The user interaction time is reduced by ~60% compared to the seminal Wrangler system. The system is open-sourced at https://github.com/umich-dbgroup/foofah/.

CLX (Click-label-and-transform; pronouced as "Clicks")

EDBT'19

Designed and implemented CLX, an interactive data cleaning system. CLX 1) automatically identifies regular-expression-like data patterns for a given set of string data with heterogeneous data patterns for non-expert users to understand, and 2) suggests pattern-based transformation programs to unify various data patterns. The work was integrated to Trifacta Cloud Wrangler as a main feature in Aug 2018 and available at https://cloud.trifacta.com/.

MithraCoverage (video)

ICDE'19 · SIGMOD'20 Demo

The system efficiently discovers under-represented/under-covered intersectional subgroups in a given dataset (e.g., a medical dataset may lack data records from a subgroup of "Hispanic women"), which may cause the problem of population bias. MithraCoverage also suggests a ranked list of subgroups in which the user could collect more data entries to remedy the above issue and ensure data fairness.

PRISM (video)

HILDA@SIGMOD'18 · CIDR'19

The system infers SQL queries using imprecise and/or incomplete user examples from the target table the user desires from a relational database. The query discovery uses a bottom-up search-based algorithm and a filter-based validation process driven by a Bayesian network which reduces the overall number of query executions on the source database by ~70%.

DuoQuest

SIGMOD'20 · CIDR'20 · CAST@VLDB 2019

Duoquest is a dual-specification query synthesis system, which consumes both a Natural Language Query and an optional PBE-like table sketch query that enables users to express varied levels of knowledge about the desired SQL and schema.