“Big data refers to any domain that is generating large datasets that have become difficult to analyze using conventional means.”
—Yisong Yue, professor of computing and mathematical sciences
Big data isn't just big. It is vast, mushrooming, and messy. Too complex for humans or traditional software to analyze, big data is characterized by "three Vs," as identified by IT industry strategist Douglas Laney in 2001: the huge volume of information, the velocity with which it continually amasses, and the variety of formats involved. Faster processors, improved computer memory, cloud architectures, and the web of internet connections and online devices have all fueled the growth of data.
Computer scientists recognized in the 1990s that some form of machine learning would be needed to analyze and use the huge amounts of data we collect. Today, machine-learning algorithms train themselves, often with human feedback, until they can recognize certain types of information in datasets, isolate signal from noise, detect patterns, and provide insights. The bigger the dataset, the more effective the algorithms become, so big data both necessitates and powers machine learning.
Where is big data?
Big data is especially associated with information kept on shared computer storage systems (the so-called cloud). Data streams in from computers and other internet-connected devices and sensors. It may also have originated from magnetic or optical media and drives, or perhaps it was digitized from paper and film.
Where does it come from and where does it go?
Every recorded activity generates data: online browsing and purchase records, the text in free email and social media accounts, electricity and gas meter readouts, light from millions of stars recorded by survey telescopes, geolocations of birds banded with tracking devices, metrics on driving and braking speeds for usage-based insurance, phone records, medical records, and logs of the geological formations encountered when oil wells are drilled. The list is effectively endless.
Organizations regularly sell or share data and use it for marketing and other purposes unrelated to the reasons it was originally generated. This simultaneously raises concerns over privacy and other ethical considerations and creates new opportunities and pathways to discovery.