Data Management Challenges in Analytics: The Case of Portfolio Management

8/13/2014 | By Dessislava A. Pachamanova and Frank J. Fabozzi

Key Takeaways

  1. A well-executed approach to analytics-based portfolio management has numerous benefits, such as rapid information filtering and identification of statistical arbitrage opportunities, but it is not always easy or straightforward.
  2. Four distinctive challenges with using data for portfolio management decision making include (1) merging and aligning data from different sources, (2) survival bias (using only data on companies that still exist), (3) look-ahead bias (using “not-yet-available” data for predictive modeling), and (4) data snooping (looking for specific data to verify a hypothesis).
  3. Analytics-inclined managers in a wide range of industries can benefit by identifying the analogous challenges in their particular fields.

This article is adapted from the authors’ research published in the Journal of Portfolio Management.
Download the full article: Recent Trends in Equity Portfolio Construction Analytics.

In the context of investment management, the term analytics refers to all of the ways in which investments are modeled, tracked, and reported. Investment analytics can include:

  • Market analytics: Analysis of real-time market data, prices, financials, earnings estimates, and market research reports
  • Financial screening: Selecting the candidates of interest from the universe of investments, based on pre-specified financial and nonfinancial criteria
  • Quantitative modeling: Asset allocation, portfolio construction within an asset class, and trading models
  • Financial analytics: Performance evaluation, return attribution analysis, risk measurement, and asset allocation/asset liability analysis

Qualitative and quantitative asset managers rely on quantitative investment models to different extents. However, there is little doubt that some fundamental level of portfolio analytics is critical for identifying investment opportunities, keeping portfolios aligned with investment objectives and within specified risk guidelines, and monitoring portfolio risk and performance.

Analytics-based portfolio management lets asset managers filter information quickly, take advantage of statistical arbitrage opportunities and deal with inefficiencies such as transaction costs incurred during trading and tax consequences of investment decisions. In short: the upside of a well-executed approach to analytics can be significant.

Creating those benefits, however, is not always easy or straightforward. Thus, in this article, we review four distinctive challenges with using data for decision making. We discuss each within the context of portfolio management, though analytically inclined managers in other industries are likely to recognize similar issues in their fields.

Data Alignment

Data are often stored in multiple databases, and must be merged to run the analysis. For example, price and return data used in predictive models may be collected from the University of Chicago’s Center for Research in Security Prices (CRSP) database, fundamental data may be from Standard and Poor’s Capital IQ Compustat database, macroeconomic data may be from government sites or Bloomberg, analyst data may be from the Institutional Broker Estimates System (IBES) database, and social issues data may be from KLD Research and Analytics Inc.

Merging data from different databases can be challenging. For example, sometimes different databases use different identifiers for the same company, making it difficult to align records. Even when databases use common identifiers, such as CUSIPs or ticker symbols, these identifiers may change over time and are sometimes reused when a company is no longer in the database, so it may be difficult to link all companies correctly across databases.
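
One common way to handle changing and reused identifiers is to resolve each ticker as of a specific date, using a table of validity periods. The following is a minimal sketch of that idea, not a description of how any particular vendor works; the tickers, dates, and the permanent company_id field are all made up for illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class TickerSpan(NamedTuple):
    ticker: str
    start: date      # first day the ticker referred to this company
    end: date        # last day (inclusive) the ticker referred to this company
    company_id: str  # hypothetical permanent identifier, not a real vendor field


FAR_FUTURE = date(9999, 12, 31)

# Made-up history: "ABC" is reused by a second company after the first one
# leaves the database, and "XYZ" is renamed to "XYZN".
TICKER_HISTORY = [
    TickerSpan("ABC", date(1995, 1, 1), date(2003, 6, 30), "C001"),
    TickerSpan("ABC", date(2006, 1, 1), FAR_FUTURE, "C002"),
    TickerSpan("XYZ", date(1990, 1, 1), date(2008, 3, 31), "C003"),
    TickerSpan("XYZN", date(2008, 4, 1), FAR_FUTURE, "C003"),
]


def resolve_company(ticker: str, as_of: date) -> Optional[str]:
    """Return the company a ticker referred to on a given date, or None."""
    for span in TICKER_HISTORY:
        if span.ticker == ticker and span.start <= as_of <= span.end:
            return span.company_id
    return None


# The same ticker maps to different companies depending on the date.
print(resolve_company("ABC", date(2000, 1, 1)))
print(resolve_company("ABC", date(2010, 1, 1)))
```

Records from two databases can then be joined on the resolved company_id rather than on the raw ticker, so a reused symbol does not silently merge two unrelated companies.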

A more nuanced problem with data alignment is that alternative ways to calculate model inputs based on data records of different variables may not lead to consistent estimates because of data discrepancies. Fabozzi, Focardi, and Kolm [1] list multiple reasons for such data inconsistencies:

  • First, there may be a problem with rounding and minor inaccuracies.
  • Second, there can be errors in the records.
  • Third, different data items are sometimes combined. For example, sometimes depreciation and amortization expenses are not a separate line item on an income statement; instead, they are included in cost of goods sold.
  • Fourth, data items may be inconsistently reported across different companies, sectors, or industries. This happens also when the financial data provider incorrectly maps financial measures from company reports to the specific database items.

Considering these possible issues, two mathematically equivalent approaches for calculating a financial ratio such as EBITDA/EV (earnings before interest, taxes, depreciation, and amortization divided by enterprise value) may not deliver the same empirical results. Thus, two asset managers using the two different approaches for calculating the ratio may rank stocks differently and decide on very different portfolio allocations. Fabozzi, Focardi, and Kolm [2] illustrate that the percentage of companies in the Russell 1000 index that were ranked differently on the EBITDA/EV factor, depending on the calculation used, was as high as 30% from 1989 to 2008.
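
To make the point concrete, here is a small sketch with made-up figures. The two formulas below are our own illustrative choice of "mathematically equivalent" definitions and are not necessarily the ones used in the cited study. One hypothetical company ("Beta") has depreciation and amortization folded into cost of goods sold, which is the kind of line-item combination described above.

```python
# Made-up figures (in millions). For "Beta", depreciation and amortization (D&A)
# is already included in cost of goods sold and is also reported on its own line.
companies = {
    "Alpha": {"revenue": 1000, "cogs": 600, "sga": 200, "da": 50,
              "operating_income": 150, "ev": 2000},
    "Beta": {"revenue": 1000, "cogs": 650, "sga": 150, "da": 80,
             "operating_income": 200, "ev": 2500},
    "Gamma": {"revenue": 500, "cogs": 300, "sga": 100, "da": 20,
              "operating_income": 80, "ev": 1100},
}


def ratio_from_operating_income(c):
    # EBITDA estimated as operating income plus D&A.
    return (c["operating_income"] + c["da"]) / c["ev"]


def ratio_from_sales(c):
    # EBITDA estimated as revenue minus COGS minus SG&A.
    # This understates EBITDA whenever COGS already contains D&A.
    return (c["revenue"] - c["cogs"] - c["sga"]) / c["ev"]


def rank(ratio_fn):
    # Company names ordered from highest to lowest EBITDA/EV.
    return sorted(companies, key=lambda name: ratio_fn(companies[name]), reverse=True)


rank_a = rank(ratio_from_operating_income)
rank_b = rank(ratio_from_sales)
print(rank_a)
print(rank_b)
```

In this toy data the two calculations agree for Alpha and Gamma but move Beta from the top of the ranking to the bottom, which is enough to change which stock a ratio-based screen would favor.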

Survival Bias

If companies are removed from the database when they no longer exist, and only data on surviving companies are used for analysis, survival bias occurs. (Companies may stop existing for different reasons, such as a bankruptcy or a merger.)

This problem can be understood by considering the following example. Suppose that one looks at the 3,000 largest (highest market capitalization) companies in 2005, and then again at the 3,000 largest companies in 2014, and finds that the value of these companies went up by 40%. One might be tempted to conclude that the percentage return on a portfolio including these companies would be 40%. However, if some of the companies that existed in 2005 no longer exist in 2014, or are no longer among the 3,000 largest, the two lists do not contain the same companies, and one could not actually have constructed a portfolio in 2005 that realized the 40% return.

To avoid survival bias and obtain a more accurate picture of the factors that determined performance, one would need to track the original 3,000 companies from 2005, regardless of whether they were on the list in 2014.
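
The gap between the two calculations can be sketched with a toy universe of four companies. All values are made up, and treating a vanished company's ending value as zero is a simplifying assumption (it fits a bankruptcy better than a merger, where shareholders typically receive something).

```python
# Made-up market values for a four-company starting universe.
# None means the company no longer exists at the end date.
start_values = {"A": 100.0, "B": 100.0, "C": 100.0, "D": 100.0}
end_values = {"A": 150.0, "B": 130.0, "C": None, "D": 140.0}


def survivors_only_return(start, end):
    # Biased: looks only at companies that still exist at the end date.
    survivors = [name for name, value in end.items() if value is not None]
    return sum(end[n] for n in survivors) / sum(start[n] for n in survivors) - 1.0


def original_cohort_return(start, end):
    # Tracks every company in the starting universe.
    # Assumption: a company that disappeared is worth zero at the end.
    ending = sum(end[n] or 0.0 for n in start)
    return ending / sum(start.values()) - 1.0


print(f"survivors only:  {survivors_only_return(start_values, end_values):.1%}")
print(f"original cohort: {original_cohort_return(start_values, end_values):.1%}")
```

With these numbers the survivors show a 40% gain, while a portfolio that actually held all four companies from the start would have gained only 5%.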

Look-ahead Bias

Look-ahead bias occurs when predictive models use data that would not be available at the time at which the prediction occurs. For example, suppose that end-of-year earnings are used as a factor in a predictive model for returns in January. Those earnings are not reported until several days or weeks after the end of the year. Thus, they cannot be used to forecast January returns, because they would not be available at the beginning of January.
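
One simple guard is to select model inputs by the date they were reported, not by the fiscal period they describe. The sketch below assumes each record carries both dates; the record layout, dates, and earnings figures are hypothetical.

```python
from datetime import date
from typing import NamedTuple, Optional


class EarningsRecord(NamedTuple):
    period_end: date   # end of the fiscal period the number describes
    report_date: date  # day the number became publicly available
    eps: float         # made-up earnings per share


RECORDS = [
    EarningsRecord(date(2013, 9, 30), date(2013, 10, 25), 1.10),
    EarningsRecord(date(2013, 12, 31), date(2014, 2, 10), 1.45),
]


def latest_available(records, as_of: date) -> Optional[EarningsRecord]:
    """Most recent fiscal period whose earnings were already reported by as_of."""
    known = [r for r in records if r.report_date <= as_of]
    return max(known, key=lambda r: r.period_end) if known else None


# A forecast made on January 2, 2014 can see only the September quarter,
# because the year-end figure is not reported until February.
print(latest_available(RECORDS, date(2014, 1, 2)))
```

Filtering on period_end instead of report_date would hand the January model the year-end figure, which is exactly the look-ahead bias described above.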

Look-ahead bias can be made worse by backfilling and restatements. Backfilling occurs when previously missing data are entered into the database upon receipt, which may happen much later than the time at which the information would have been needed for prediction. Restatements can happen when, for example, a company revises its initial earnings release. Many database companies (for example, Capital IQ) overwrite the number that was originally recorded. However, at the time the prediction would have been made, only the original number was available. Using the updated information for building predictive models introduces bias, because it does not correctly reflect the information available at the time when the investment decision needs to be made.
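
A point-in-time approach avoids this by appending each new value alongside the date it was recorded instead of overwriting the old one. The following is a minimal sketch under that assumption; the period label, dates, and values are invented, and real point-in-time databases are considerably richer.

```python
from datetime import date

# Each entry: (fiscal period, date the value was recorded, value).
# Nothing is overwritten; a restatement is stored as a new entry.
VINTAGES = [
    ("2013Q4", date(2014, 2, 10), 1.45),  # initial release (made-up)
    ("2013Q4", date(2014, 5, 15), 1.20),  # later restatement (made-up)
]


def value_as_known(period: str, as_of: date):
    """Value for a period as it was known on as_of, or None if not yet recorded."""
    candidates = [
        (recorded, value)
        for p, recorded, value in VINTAGES
        if p == period and recorded <= as_of
    ]
    if not candidates:
        return None
    # Pick the most recently recorded value that existed on as_of.
    return max(candidates, key=lambda c: c[0])[1]


print(value_as_known("2013Q4", date(2014, 3, 1)))  # sees the initial release
print(value_as_known("2013Q4", date(2014, 6, 1)))  # sees the restatement
```

A backtest dated March 2014 that used the restated figure would be relying on information that did not exist until May.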

Data Snooping

With sufficient time and enough data, one can find any data pattern one sets out to find. It is therefore important to make sure that any analytical model a manager uses is grounded in sound economic theory and common sense, rather than just empirical evidence. When using historical data, one also should use caution when determining the time period over which the data are collected. For example, it may not be wise to use data over 30 years in a regression to estimate the beta of a stock if one (with reason) believes that the beta changes more frequently than every 30 years.
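
A quick simulation shows how easily a search produces a pattern. In the sketch below, both the "returns" and the candidate "signals" are pure random noise, so any correlation found is spurious by construction. The number of signals, the number of periods, and the seed are arbitrary choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    # Sample Pearson correlation of two equal-length sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_correlation(n_signals=200, n_periods=24, seed=0):
    """Largest absolute correlation between random returns and random signals."""
    rng = random.Random(seed)
    returns = [rng.gauss(0.0, 1.0) for _ in range(n_periods)]
    best = 0.0
    for _ in range(n_signals):
        signal = [rng.gauss(0.0, 1.0) for _ in range(n_periods)]
        best = max(best, abs(pearson(signal, returns)))
    return best


best = best_spurious_correlation()
print(f"best |correlation| among 200 noise signals: {best:.2f}")
```

With only 24 observations, the best of a few hundred unrelated series will usually look like a respectable predictor. This is why a factor found by search alone deserves skepticism unless there is an economic rationale behind it.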

Understanding and managing the data used for building predictive models is critical to the success of a quantitative portfolio management strategy. Analytics-inclined managers in a wide range of industries will benefit by identifying the analogous challenges in their particular fields. From a practical standpoint, it is best to manage the four challenges described above proactively rather than reactively.


  1. Fabozzi, F., S. Focardi, and P. Kolm. Quantitative Equity Investing. Hoboken, NJ: John Wiley & Sons, 2010.
  2. Ibid.