Section: 20 Boundless Analytics - Pre-discovery Tool

In this section we demonstrate the application of Boundless Analytics, a tool developed by Tomasz Imielinski and his team at Rutgers (and supported by an NSF subcontract of the Center for Science of Information at Purdue University). Boundless Analytics calculates all significant bar graphs from a data set and finds the data subsets (slices) that deviate the most from the whole data set with respect to the frequency distribution of an attribute. Boundless performs the otherwise very tedious task of looking at all combinations of attribute-value pairs to identify the “significant ones”, saving an enormous amount of work in the preliminary exploration of data.
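To make "deviates from the whole data set" concrete, here is a minimal sketch that compares the frequency distribution of one attribute in the whole data set with its distribution inside a slice. The data frame `toy` and its values are invented for illustration; this is not the Minimarket data, and it is not how Boundless Analytics computes its bar graphs internally.

```r
# Toy data, invented for illustration only (not the real Minimarket data).
toy <- data.frame(
  Location = c("Princeton", "Princeton", "Newark", "Newark",
               "Princeton", "Newark", "Princeton", "Newark"),
  Day      = c("Weekend", "Weekend", "Weekday", "Weekday",
               "Weekend", "Weekend", "Weekday", "Weekday"),
  stringsAsFactors = TRUE
)

# Frequency distribution of the attribute over the whole data set.
whole <- prop.table(table(toy$Location))

# Frequency distribution of the same attribute inside the slice Day == "Weekend".
in_slice <- toy$Day == "Weekend"
slice <- prop.table(table(toy$Location[in_slice]))

# Put both distributions side by side; large gaps suggest an interesting slice.
comparison <- data.frame(
  Location = names(whole),
  whole    = as.numeric(whole),
  slice    = as.numeric(slice)
)
print(comparison)
```

In this toy example Princeton accounts for half of all transactions but three quarters of the weekend ones, which is the kind of gap the tool searches for across every slice.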

We have provided a synthetic data set describing customer transactions in a small chain of minimarkets in NJ. Data 101 students used Boundless Analytics to discover the most interesting subsets of this data set.

20.1 Minimarket Data Set description

Zoom recording

20.2 Demo of Boundless Analytics

Zoom Recording

20.3 The Boundless Analytics web application

Boundless Analytics Interface: http://209.97.156.178:8082/

(It is a soft login; abc/abc will do.)

Objective: Nominate the most interesting subset of the Minimarket2022 data set

Seems open-ended, no? What is the “most interesting”?

  • The chi-square value is a good measure. The higher it is, the more interesting the data subset.

  • By swiping through the possible plots (using Next), one can identify good candidates for the “interesting data subsets”.

  • These are the plots where the red and blue bars differ the most.

  • Then run the chi-square test over the candidates and nominate the plot with the highest chi-square value.

  • Therefore this task can be seen as a hunt for the highest chi-square value (use the code of Snippet 1 in Section 20.4 after plugging in the definition of a slice and the anchor attribute).
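The steps above can be sketched as a small loop: define each candidate slice, compute its chi-square statistic against the anchor attribute, and nominate the maximum. The data frame `toy`, the two candidate slices, and the helper `chisq_for_slice` are hypothetical and exist only for this illustration; with the real data, the candidates would come from the plots picked in the web application.

```r
# Toy data, invented for illustration only (not the real Minimarket data).
toy <- data.frame(
  Location = rep(c("Princeton", "Newark"), each = 100),
  Beer     = c(rep(c("Lager", "Ale"), times = c(70, 30)),
               rep(c("Lager", "Ale"), times = c(30, 70))),
  Day      = rep(c("Weekend", "Weekday"), times = 100)
)

# Candidate slices, each stored as a logical vector marking the rows in the slice.
candidates <- list(
  "Beer == 'Lager'"  = toy$Beer == "Lager",
  "Day == 'Weekend'" = toy$Day == "Weekend"
)

# Hypothetical helper: chi-square statistic of slice membership vs. the anchor.
chisq_for_slice <- function(in_slice, anchor) {
  test <- chisq.test(table(anchor, in_slice))
  unname(test$statistic)
}

# Score every candidate against the anchor attribute Location.
scores <- sapply(candidates, chisq_for_slice, anchor = toy$Location)
print(round(scores, 2))

# Nominate the candidate with the highest chi-square value.
best <- names(scores)[which.max(scores)]
print(best)
```

Storing slices as logical vectors keeps the scoring step identical for every candidate, so adding another slice only means adding one more entry to the list.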

20.4 Snippet 1: Chi square hunt

# Say Boundless Analytics provides us with the slice Beer == 'Lager' & Day == 'Weekend' & Snacks == 'Crackers',
# and the anchor attribute is Location. You can calculate the chi-square statistic for this slice and the
# Location attribute to test whether the distribution of locations is affected when we limit ourselves
# to transactions selling Lager and Crackers on Weekends.

# The most interesting slice-anchor attribute combinations are the ones with the largest chi-square
# statistic and the lowest p-value. Nevertheless, do not forget about multiple hypothesis correction,
# since we are on a chi-square hunt here!

Minimarket <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/HomeworkMarket2022.csv")

# Label every transaction as outside the slice, then relabel the rows that match the slice.
Minimarket$IN <- 'Out_Slice'
Minimarket[Minimarket$Beer == 'Lager' & Minimarket$Day == 'Weekend' & Minimarket$Snacks == 'Crackers', ]$IN <- 'In_Slice'

# Contingency table of the anchor attribute against slice membership, followed by the test.
d <- table(Minimarket$Location, Minimarket$IN)
chisq.test(d)
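Because a chi-square hunt tests many slices and reports only the best one, the raw p-value of the winner is too optimistic and calls for a multiple hypothesis correction. The sketch below applies the Bonferroni correction with base R's `p.adjust`. The p-values in `p_raw` are made-up numbers for illustration, not results from the Minimarket data, and Bonferroni is only one possible choice of correction.

```r
# Hypothetical raw p-values from five candidate slices (made-up numbers).
p_raw <- c(0.001, 0.012, 0.030, 0.040, 0.200)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Which candidates are still significant at the 0.05 level after correction?
survivors <- which(p_bonf < 0.05)

print(data.frame(p_raw, p_bonf))
print(survivors)
```

Four of the five made-up candidates look significant before the correction, but only the first one survives it.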


ATTACHED - the data set (same as on the Boundless Analytics interface) HomeworkMarket2022-2.csv

RESULTS:

Here are two out of 250+ submissions. The one with the highest chi-square value, 600.15, is the slice showing that weekend buyers of lager in New Brunswick buy disproportionately more snacks (in particular Crackers). This was identified by nearly 20 students.


Here is another find, by Eva Zhang, showing disproportionately frequent sales of Coca Cola on Weekdays in Princeton among transactions that included Popcorn. The chi-square value of this find is 205.31, with df=3.
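To put the reported value in perspective, the p-value for a chi-square statistic of 205.31 with df=3 can be computed directly from the chi-square distribution. This is a minimal sketch using only the two numbers quoted above; the 5% significance level used for the critical value is an assumption made here for comparison. For an anchor-by-slice table with two columns (in slice, out of slice), df = (rows - 1) * (columns - 1), so df=3 corresponds to an anchor attribute with four distinct values.

```r
# Values reported for the find above.
chi_sq <- 205.31
df <- 3

# Upper-tail probability: chance of a statistic at least this large under independence.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Critical value at the (assumed) 5% significance level, for comparison.
critical <- qchisq(0.95, df = df)

print(p_value)
print(critical)
```

The statistic is far above the critical value of about 7.81, so the find stays significant even after a strict correction for the number of slices examined.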