How Do You Classify Items in a Transaction Receipt for Analysis? A Machine Learning Approach to a FinTech Problem

Suraj Venkat
Jan 12, 2020

Problem

When building FinTech applications, we often run into an interesting problem: how do you classify the line items in a transaction receipt?

Whether you are building a cash flow analyzer or a tool that helps traders calculate the taxes they owe from their transaction records, you have to classify the data first. Only then can you run analytics on it and give users insights and answers.

To explain our approach, we will use Ace Analytics as an example. It is a product we developed that gives small business owners analytical predictions about their cash flow, based on the transaction receipts they upload to the application.

Broadly, these are the major categories into which we have to divide the data:

1. Income

2. Expenses, further classified into categories like Grocery, Tech, etc.

3. Financial Transactions, further divided into Investment, Transfers, and Financial Fees

Across all of these classes, each transaction is also tagged as either Periodic or One-time.
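To make the taxonomy concrete, here is a minimal sketch of how it could be represented in code. The names `Category`, `Recurrence`, `SUBCATEGORIES`, and `Label` are illustrative assumptions and not part of Ace Analytics; the expense subcategory list is deliberately incomplete because the list above is open-ended.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    INCOME = "income"
    EXPENSE = "expense"
    FINANCIAL = "financial_transaction"


class Recurrence(Enum):
    PERIODIC = "periodic"
    ONE_TIME = "one_time"


# Subcategories named in the text; the expense set is open-ended ("etc.").
SUBCATEGORIES = {
    Category.INCOME: set(),
    Category.EXPENSE: {"grocery", "tech"},
    Category.FINANCIAL: {"investment", "transfer", "financial_fee"},
}


@dataclass(frozen=True)
class Label:
    """One classification result: a class, a recurrence tag, and an optional subcategory."""
    category: Category
    recurrence: Recurrence
    subcategory: Optional[str] = None

    def __post_init__(self):
        # Reject subcategories that do not belong to the chosen class.
        allowed = SUBCATEGORIES[self.category]
        if self.subcategory is not None and self.subcategory not in allowed:
            raise ValueError(
                f"{self.subcategory!r} is not a known {self.category.value} subcategory"
            )


bank_fee = Label(Category.FINANCIAL, Recurrence.PERIODIC, "financial_fee")
```

Keeping the recurrence tag separate from the class mirrors the idea that Periodic and One-time cut across all three classes.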

Solution

The description string on a transaction receipt often does not make clear what the transaction was, so it helps to combine multiple approaches to increase classification accuracy. Here is a high-level overview of the approach:

1. Preprocess the description string from the statement

2. Classify using multiple approaches

a. Search the database (description strings vs. classifications)

b. Search the web

c. Search/correlate using contextual data within the statement itself (learn clusters of spending/income patterns)

d. Search/correlate using external contextual data (like weather, inflation, etc.)

3. Rank and collate the search results from step 2
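Step 3 can be pictured as a weighted vote across the approaches. The sketch below is one possible way to do it, not necessarily how Ace Analytics does it: the approach names, the `WEIGHTS` values, and the `collate` function are all assumptions made for illustration, and real weights would need tuning.

```python
from collections import defaultdict

# Hypothetical trust weights per approach; real values would need tuning.
WEIGHTS = {
    "db": 1.0,
    "web": 0.6,
    "statement_context": 0.5,
    "external_context": 0.3,
}


def collate(candidates):
    """Combine (approach, category, confidence) votes into a ranked list.

    Confidence is expected in [0, 1]. Returns (category, score) pairs,
    best first.
    """
    scores = defaultdict(float)
    for approach, category, confidence in candidates:
        scores[category] += WEIGHTS.get(approach, 0.0) * confidence
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = collate([
    ("db", "Grocery", 0.9),
    ("web", "Grocery", 0.5),
    ("web", "Tech", 0.7),
    ("statement_context", "Tech", 0.4),
])
best_category = ranked[0][0]  # "Grocery" for the votes above
```

A database hit counts for more than a web hit here simply because the sample weights say so; that ordering is a design choice to revisit once real accuracy data exists.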

Phase I: Preprocessing

In this phase we streamline the description string so that it can be used for classification. We split the string into parts (splits) on separators such as white space and underscores. We then find and discard the parts that do not help classify the item, such as transaction IDs and vendor codes. We use spelling correction, applied in conjunction with a database, to fill in missing letters. After several such corrections and estimates, we recombine the splits into corrected description strings.
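A minimal sketch of this phase, using only the Python standard library, might look like the following. The `KNOWN_TERMS` list stands in for the database, `STOPWORDS` is a made-up list of generic banking tokens, and treating every part that contains a digit as noise is a crude assumption; none of these come from the actual product.

```python
import difflib
import re

# Hypothetical vocabulary standing in for the database of known terms.
KNOWN_TERMS = ["walmart", "amazon", "starbucks", "grocery", "payment"]

# Hypothetical generic banking tokens that carry no vendor information.
STOPWORDS = {"pos", "txn"}


def split_description(description):
    """Split on white space, underscores, and other common separators."""
    parts = re.split(r"[\s_\-*/#.,:]+", description.lower())
    return [part for part in parts if part]


def is_noise(part):
    """Crude assumption: parts with digits are IDs or vendor codes."""
    return part in STOPWORDS or any(ch.isdigit() for ch in part)


def correct(part, cutoff=0.75):
    """Snap a part to the closest known term, if one is similar enough."""
    matches = difflib.get_close_matches(part, KNOWN_TERMS, n=1, cutoff=cutoff)
    return matches[0] if matches else part


def preprocess(description):
    parts = [p for p in split_description(description) if not is_noise(p)]
    return [correct(p) for p in parts]


tokens = preprocess("POS_TXN 483920 WALMRT GROCRY #0042")
# tokens == ["walmart", "grocery"]
```

The corrected parts can then be recombined in different ways to form the candidate description strings used in Phase II.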

Phase II: Classifying using Multiple Approaches

Use web search to classify:

We can use ‘search APIs’ to search the preprocessed description strings.

Inputs and steps for the search API:

a. Search the original description string

b. Search the preprocessed description string

c. Search individual parts of the description string

d. Search various combinations of parts of the description string

e. Analyze the search results

f. Look in the results for relevant category words from our category list

g. Compile synonym lists from the search results for comparison

h. Rank the search results

i. Read and analyze the knowledge graph results. Knowledge graph results can be taken at face value, whereas ordinary search results must be analyzed further.
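The query-building and keyword-scoring steps above can be sketched as follows. No real search API is called: `search` is any function you supply that maps a query to a list of text snippets, and `fake_search` is a stub so the example runs offline. `CATEGORY_KEYWORDS` and all function names are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical category keyword list; a real one would include synonyms.
CATEGORY_KEYWORDS = {
    "Grocery": {"grocery", "supermarket", "food"},
    "Tech": {"software", "electronics", "cloud"},
}


def build_queries(original, parts):
    """Original string, preprocessed string, single parts, and combinations."""
    queries = [original, " ".join(parts)]
    queries += parts
    for size in range(2, len(parts)):
        queries += [" ".join(combo) for combo in combinations(parts, size)]
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(queries))


def score_snippets(snippets):
    """Count category keyword hits across all result snippets."""
    scores = {category: 0 for category in CATEGORY_KEYWORDS}
    for snippet in snippets:
        words = set(snippet.lower().split())
        for category, keywords in CATEGORY_KEYWORDS.items():
            scores[category] += len(words & keywords)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def classify_by_search(original, parts, search):
    """`search` is any callable mapping a query string to text snippets."""
    snippets = []
    for query in build_queries(original, parts):
        snippets.extend(search(query))
    return score_snippets(snippets)


def fake_search(query):
    """Stand-in for a real search API so the sketch runs offline."""
    if "walmart" in query.lower():
        return ["Walmart supermarket and grocery store"]
    return []


ranked = classify_by_search(
    "POS WALMRT 0042", ["walmart", "store", "city"], fake_search
)
```

Plain keyword counting is the simplest possible analysis of the results; knowledge graph results, which the list above treats as more trustworthy, would deserve a separate and heavier weight.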

Search DB to classify:

Search the DB for the strings that best match the preprocessed description string. Since preprocessing produces multiple processed description strings, we have to search for all of the strings and their combinations, then collate the results.

Match individual sections of the description string against the DB.

Match combinations of parts against the DB: run searches for multiple combinations to ascertain the importance of each part. If searches for several combinations return the same result, mark the part common to those combinations as an important search criterion and store it in the DB.

Handling partial matches:

a. Single partial match

b. Multiple partial matches

c. Rank and correlate the results from the steps above
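As a rough illustration of exact and partial matching, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure. The in-memory `VENDOR_DB`, the 0.6 cutoff, and the function names are assumptions; a production system would more likely rely on the database's own full-text or fuzzy search.

```python
import difflib

# Hypothetical in-memory stand-in for the description-to-classification table.
VENDOR_DB = {
    "walmart": "Grocery",
    "whole foods market": "Grocery",
    "amazon web services": "Tech",
    "vanguard": "Investment",
}


def lookup(candidate, cutoff=0.6):
    """Return (category, score) pairs for one candidate string.

    An exact match scores 1.0; otherwise every DB key whose similarity
    ratio reaches the cutoff counts as a partial match.
    """
    if candidate in VENDOR_DB:
        return [(VENDOR_DB[candidate], 1.0)]
    results = []
    for key, category in VENDOR_DB.items():
        score = difflib.SequenceMatcher(None, candidate, key).ratio()
        if score >= cutoff:
            results.append((category, score))
    return results


def classify_from_db(candidates):
    """Search every candidate string and keep the best score per category."""
    best = {}
    for candidate in candidates:
        for category, score in lookup(candidate):
            best[category] = max(best.get(category, 0.0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


ranked = classify_from_db(["walmart", "amazon web", "walmart amazon web"])
```

Here "walmart" is an exact match, "amazon web" is a single partial match, and the longer combined string matches nothing above the cutoff, which shows why searching the individual parts as well as their combinations matters.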

Use Correlation/Context to classify:

Correlating expenses with other historical spending data (from the same user and from similar users) will increase classification accuracy. However, whether that data is available is an open question, and the correlation strategy still needs to be worked out.

Available data:

a. Data available in the statement itself

b. Timing of the spend

c. Season

d. Part of the week

e. Part of the month

f. Part of the day

g. Part of the year

h. Sporadic
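The timing items above can be turned into simple features derived from a transaction timestamp. This is only a sketch: the bucket boundaries, the northern-hemisphere season mapping, and the use of calendar quarters for "part of the year" are all assumptions.

```python
from datetime import datetime

# Northern-hemisphere meteorological seasons; adjust for other regions.
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def part_of_day(hour):
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def part_of_month(day):
    if day <= 10:
        return "start"
    if day <= 20:
        return "middle"
    return "end"


def timing_features(timestamp):
    """Derive coarse timing features from a transaction timestamp."""
    return {
        "season": SEASONS[timestamp.month],
        "part_of_week": "weekend" if timestamp.weekday() >= 5 else "weekday",
        "part_of_month": part_of_month(timestamp.day),
        "part_of_day": part_of_day(timestamp.hour),
        "quarter": (timestamp.month - 1) // 3 + 1,
    }


features = timing_features(datetime(2020, 1, 12, 9, 30))
```

Many statements carry only a date and no time of day, in which case the `part_of_day` feature would have to be dropped.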

Related spends:

a. What are the nearest spend transactions?

b. Time: spend transactions immediately preceding and immediately following the current transaction, within multiple time windows

c. Amount: spend transactions with similar amounts

d. What are the nearest income transactions? Incomes immediately preceding and immediately following the current transaction

e. Periodicity

f. Periodicity of spending: daily, monthly, quarterly, yearly, once

g. Periodicity of income: daily, monthly, quarterly, yearly, once

h. Cluster transactions

i. Spending

j. Spending that happens in a cluster

k. Spending that is isolated

l. Real-estate rentals
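Of the items above, periodicity is the easiest to illustrate. The sketch below labels a series of transaction dates (for example, all transactions from one vendor) by looking at the gaps between them. The nominal period lengths, the 15% tolerance, and the "unknown" label for two-transaction series are assumptions chosen for the example.

```python
from datetime import date
from statistics import mean, pstdev

# Nominal gap in days for each period named in the text.
PERIODS = {"daily": 1, "monthly": 30, "quarterly": 91, "yearly": 365}


def detect_periodicity(dates, tolerance=0.15):
    """Label a series of transaction dates with a period.

    Returns "once" for a single transaction, "unknown" when two dates
    are too few to judge, "sporadic" when the gaps are irregular, or
    one of the PERIODS names.
    """
    if len(dates) < 3:
        return "once" if len(dates) == 1 else "unknown"
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    average = mean(gaps)
    # Irregular gaps (high spread relative to the mean) are not periodic.
    if average == 0 or pstdev(gaps) / average > tolerance:
        return "sporadic"
    for name, days in PERIODS.items():
        if abs(average - days) / days <= tolerance:
            return name
    return "sporadic"


rent = [date(2019, 10, 1), date(2019, 11, 1), date(2019, 12, 1), date(2020, 1, 1)]
repairs = [date(2019, 1, 3), date(2019, 1, 5), date(2019, 4, 20), date(2019, 12, 1)]

rent_period = detect_periodicity(rent)        # "monthly"
repairs_period = detect_periodicity(repairs)  # "sporadic"
```

A regular monthly debit of a similar amount is a strong hint for categories such as real-estate rentals, which is where combining periodicity with the amount comparison above becomes useful.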

Misclassification mitigation strategies:

We need to identify misclassified data and set up a feedback mechanism that incorporates corrections into all relevant DBs, so that classification improves over time.

How do we identify misclassifications?

Is a training dataset available?

Incorporate misclassification feedback into the various DBs, including the search and contextual DBs.

General maintenance: data that requires regular updates

Build vendor DB based on inference: Once classification is done, the description string points to a classification. The description string will contain both the relevant and irrelevant parts. It may be helpful to store both the relevant and irrelevant parts in the DB to be processed later in case they repeat themselves.

Periodic update of vendor DB based on sources like yellow pages APIs

Update of DBs based on partial match queries

Periodic update of banking/financial transaction descriptions

Update of user profiles used in contextual search classification

Update of general population databases used in contextual search classification
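To show how a vendor DB built by inference and the misclassification feedback could fit together, here is a small in-memory sketch. The `VendorDB` class and its methods are hypothetical; a real implementation would sit on a persistent database and would need a policy for conflicting feedback.

```python
from collections import Counter, defaultdict


class VendorDB:
    """Hypothetical in-memory vendor store learned from classified descriptions."""

    def __init__(self):
        self.relevant = defaultdict(Counter)  # part -> votes per category
        self.irrelevant = Counter()           # part -> times seen as noise

    def record(self, relevant_parts, irrelevant_parts, category):
        """Store both kinds of parts after a description has been classified."""
        for part in relevant_parts:
            self.relevant[part][category] += 1
        self.irrelevant.update(irrelevant_parts)

    def correct(self, relevant_parts, wrong_category, right_category):
        """Apply misclassification feedback by moving one vote per part."""
        for part in relevant_parts:
            votes = self.relevant[part]
            if votes[wrong_category] > 0:
                votes[wrong_category] -= 1
            votes[right_category] += 1

    def best_category(self, part):
        """Return the category with the most votes, or None if unknown."""
        votes = self.relevant.get(part)
        if not votes:
            return None
        category, count = votes.most_common(1)[0]
        return category if count > 0 else None


db = VendorDB()
db.record(["aws"], ["0042"], "Grocery")   # an initial, wrong classification
db.correct(["aws"], "Grocery", "Tech")    # user feedback fixes it
learned = db.best_category("aws")         # "Tech"
```

Counting how often a part was discarded as irrelevant lets later preprocessing skip it quickly if it shows up again.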

Possible additional data mining opportunities

If the classification above is accurate enough, it can also yield additional data. Some of the additional data that can be mined:

a. Spending profile

b. Purchase patterns: when the user makes purchases

c. Online vs. offline spending

d. Profile of spending locations

e. Profile of preferred vendors

f. Profile of essential vs. non-essential spending

g. Projected spending patterns

h. Projected income patterns

i. Suggestions for alternate vendors (popular, cheaper)

j. Suggestions for savings/investment plans

k. Data insights generated for vendors

l. Spending cycle prediction for vendor inventory management

m. Location-specific spending cycle prediction

n. Vendor-specific competitive data

o. Spending trends/patterns for a specific vendor over a time period, which can be mapped to various business operations and used to gauge the effectiveness of measures the vendor has taken

p. Online/offline split of business

