For Data Cleanliness

Big data offers opportunities to uncover hidden or subtle connections, but it also takes some 'data wrangling,' the New York Times reports.

"It's an absolute myth that you can send an algorithm over raw data and have insights pop up," Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of the startup Trifacta, says.

That data must first be cleaned and organized before it can be put to good use. For instance, the Times notes that it may be obvious to people that "drowsiness," "somnolence," and "sleepiness" mean pretty much the same thing, but an algorithm must be told that. Other times there are data format conflicts that need to be resolved. It's a painstaking process that takes an estimated 50 percent to 80 percent of data scientists' time.
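As a rough illustration of the two chores described above, the Python sketch below maps synonymous terms to a single label and reconciles conflicting date formats. The synonym table, the list of date layouts, and the function names are all assumptions made for this example; they are not drawn from the Times report or from any of the startups' tools.

```python
from datetime import datetime

# Hypothetical synonym table: each variant maps to one canonical label.
SYNONYMS = {
    "drowsiness": "sleepiness",
    "somnolence": "sleepiness",
    "sleepiness": "sleepiness",
}

# Hypothetical date layouts that different sources might use.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_term(term):
    """Return the canonical label for a term, or the cleaned term if unknown."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def normalize_date(text):
    """Try each known layout in order; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # layout did not match, try the next one
    return None


if __name__ == "__main__":
    print(normalize_term(" Somnolence "))
    print(normalize_date("03/15/2014"))
```

Even in a toy like this, someone has to write the lookup table and the list of formats by hand, which hints at why the work eats up so much of a data scientist's time.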

A number of new startups are trying to address the process, the Times notes. Paxata, for example, is working to automate the finding, cleaning, and reconciling of data so it can be analyzed.

"We really need better tools so we can spend less time on data wrangling and get to the sexy stuff," Michael Cavaretta, a data scientist at Ford Motor, tells the Times.