Data and Its Analysis
The proliferation of computing has created an enormous amount of data. There is data about everything from sensors that track whales in the ocean to data about visitors to web sites. Below is a picture of whale tracking.
Computers are used in an iterative and interactive way when processing digital information to gain insight and knowledge. Iterative means that computers can go through all data in large data sets to filter and clean it. Combining data sources, clustering data and data classification are part of the process of using computers to process information. Interaction means that people can gain insight and knowledge from translating and transforming digitally represented information. Patterns can emerge when data is transformed using computational tools.
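As a minimal sketch of the filtering and cleaning step described above, the following Python loop visits every record in a small data set, discards unusable entries, and normalizes the rest. The field names (whale_id, depth_m) and the sample values are hypothetical, chosen only for illustration.

```python
def clean_readings(readings):
    """Return only usable readings, with IDs normalized and depths as floats."""
    cleaned = []
    for record in readings:
        depth = record.get("depth_m")
        if depth is None:
            continue  # filter: missing measurement
        try:
            depth = float(depth)
        except (TypeError, ValueError):
            continue  # filter: not a number
        if depth < 0:
            continue  # filter: physically impossible depth
        cleaned.append({
            "whale_id": record["whale_id"].strip().upper(),  # clean: normalize the ID
            "depth_m": depth,
        })
    return cleaned


# Hypothetical sensor readings, some of them unusable.
raw_readings = [
    {"whale_id": " w1 ", "depth_m": "120.5"},
    {"whale_id": "W2", "depth_m": None},
    {"whale_id": "w3", "depth_m": "n/a"},
    {"whale_id": "W1", "depth_m": -4},
    {"whale_id": "W2", "depth_m": 310},
]

cleaned = clean_readings(raw_readings)
print(cleaned)
```

Only the first and last records survive; the other three are dropped because the depth is missing, non-numeric, or negative.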
Computing allows people to share data to collaborate, such as by shared Internet access to large databases, or by using a shared Google Sheets spreadsheet. Collaboration is an important part of solving data-driven problems. Collaboration facilitates solving computational problems through multiple perspectives, experiences, and skill sets. Communication between participants working on data-driven problems gives rise to enhanced insights and knowledge. Collaboration in developing hypotheses and questions, and in testing hypotheses and answering questions about data, helps people gain insight and knowledge. Collaborating face-to-face and using online collaborative tools can facilitate processing information to gain insight and knowledge. Investigating large data sets collaboratively can lead to insight and knowledge not obtained when working alone.
Visualization tools and software can communicate information about data. Tables in a document, diagrams generated from a spreadsheet, and textual displays in a presentation can be used in communicating insight and knowledge gained from data. Summaries of data analyzed computationally, as opposed to showing all of the vast data, can be effective in communicating insight and knowledge gained from digitally represented information. Transforming information can be effective in communicating knowledge gained from data. Interactivity with data, such as showing a colleague the effects of changing one cell in a spreadsheet and its impact on related cells, is an aspect of communicating with computing. Below is a picture of formulas which are impacted if one cell is changed.
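To illustrate why a summary can communicate more than the full data, here is a small sketch that reduces a list of numbers to a handful of descriptive statistics. The daily visit counts are made-up values, and the summarize function is a hypothetical helper, not part of any particular tool.

```python
from statistics import mean, median


def summarize(values):
    """Reduce a list of numbers to a few descriptive statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


# Hypothetical visit counts for one week; one day is unusually busy.
daily_visits = [120, 135, 128, 410, 131, 126, 140]

summary = summarize(daily_visits)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting both the mean and the median is a deliberate choice here: the single busy day pulls the mean well above the median, which is itself a useful thing to communicate.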
Metadata is data about data. Metadata can be descriptive data about an image, a word-processed document, or another complex object. Metadata can increase the effective use of data or data sets by providing additional information about various aspects of that data.
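One way to picture metadata is as a set of descriptive fields stored alongside the data itself. The sketch below is only an illustration: the Photo class, its field names, and the sample values are all hypothetical. It shows how metadata lets a program find images without inspecting their pixels.

```python
from dataclasses import dataclass, field


@dataclass
class Photo:
    pixels: bytes                                  # the data itself
    metadata: dict = field(default_factory=dict)   # data about the data


def find_by_location(photos, location):
    """Select photos using only their metadata, never the pixel data."""
    return [p for p in photos if p.metadata.get("location") == location]


# Hypothetical photos; the pixel bytes are placeholders.
photos = [
    Photo(b"\x00\x01", {"taken": "2021-06-01", "location": "Monterey Bay"}),
    Photo(b"\x02\x03", {"taken": "2021-06-03", "location": "Puget Sound"}),
    Photo(b"\x04\x05", {"taken": "2021-06-07", "location": "Monterey Bay"}),
]

matches = find_by_location(photos, "Monterey Bay")
print([p.metadata["taken"] for p in matches])
```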
Computing facilitates exploration and the discovery of connections in information. The use of large data sets, such as the logs of all visitors to a web site, provides opportunities and challenges for extracting information and knowledge. Below is a visualization of data collected by Google Analytics about visitors to a web site. Large data sets, like all of the Google searches done in a two-day period, provide opportunities for identifying trends, making connections in data, and solving problems. Computing tools facilitate the discovery of connections in information within large data sets. Search tools, such as the Google search engine, are essential for efficiently finding information. Information filtering systems, which take large data sets and eliminate data that is not of interest, are important tools for finding information and recognizing patterns in the information. Software tools, including spreadsheets and databases, help to efficiently organize and find trends in information.
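As a small sketch of finding a trend in a visitor log, the following code counts how often each page appears and reports the most visited ones. The log format (a date followed by a page path) and the entries are assumptions made for this example; real web server logs carry many more fields.

```python
from collections import Counter

# Hypothetical log lines in the form "<date> <page>".
log_lines = [
    "2024-01-01 /home",
    "2024-01-01 /about",
    "2024-01-02 /home",
    "2024-01-02 /products",
    "2024-01-02 /home",
    "2024-01-03 /products",
]


def top_pages(lines, n=2):
    """Count visits per page and return the n most visited pages."""
    counts = Counter(line.split()[1] for line in lines)
    return counts.most_common(n)


top = top_pages(log_lines)
print(top)
```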
Large data sets include data such as transactions, measurements, text, sound, images, and video. The storing, processing, and curating of large data sets is challenging simply because of the amount of data it is now possible to obtain. For instance, NASA obtains incredibly vast amounts of data from its satellites, but much of that data is redundant among satellites and/or not of use; NASA's information filtering systems seek to eliminate redundant and useless data to help manage the size and complexity.
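The idea of filtering out redundant data can be sketched in a few lines. This is not how NASA's systems work; it is a simplified illustration that assumes two measurements are redundant when they share the same timestamp and region, and every field name and value below is hypothetical.

```python
def drop_redundant(measurements):
    """Keep the first measurement for each (timestamp, region) pair."""
    seen = set()
    unique = []
    for m in measurements:
        key = (m["timestamp"], m["region"])
        if key in seen:
            continue  # another satellite already reported this observation
        seen.add(key)
        unique.append(m)
    return unique


# Hypothetical measurements from two satellites with overlapping coverage.
measurements = [
    {"satellite": "A", "timestamp": "12:00", "region": "R1", "temp_c": 14.2},
    {"satellite": "B", "timestamp": "12:00", "region": "R1", "temp_c": 14.2},
    {"satellite": "B", "timestamp": "12:00", "region": "R2", "temp_c": 9.8},
    {"satellite": "A", "timestamp": "12:05", "region": "R1", "temp_c": 14.3},
]

unique = drop_redundant(measurements)
print(len(measurements), "->", len(unique))
```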
Structuring large data sets for analysis can be challenging. Maintaining privacy and cyber security of large data sets containing personal information can be challenging. Scalability of systems is an important consideration when data sets are large; techniques that worked on a smaller data set may not work when the size of the data set increases. The techniques used to store, manage, transmit, and process data sets change as the size of the data sets scales. The size or scale of a system that stores data affects how that data set is used. The effective use of large data sets requires computational solutions.
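One common way a technique has to change as data grows is moving from loading everything into memory to processing one item at a time. The sketch below computes an average over a simulated stream without ever storing the whole data set. The simulated_log_sizes generator is a hypothetical stand-in for reading records from a very large file.

```python
def streaming_mean(values):
    """Compute a mean in one pass, keeping only a count and a running total."""
    count = 0
    total = 0.0
    for v in values:
        count += 1
        total += v
    return total / count if count else None


def simulated_log_sizes(n):
    """Yield n made-up values one at a time, cycling through 1..10."""
    for i in range(n):
        yield (i % 10) + 1


result = streaming_mean(simulated_log_sizes(1_000_000))
print(result)
```

Because the generator yields one value at a time, memory use stays constant no matter how many values are processed, whereas building a list first would grow with the size of the data.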
Trade-offs and Concerns
Digital data representations involve trade-offs related to storage, security, and privacy concerns. Security and privacy concerns, as described in the chapter on Cyber Security, arise with data containing personal or otherwise sensitive information and engender trade-offs in storing and transmitting it. For instance, storing and transmitting encrypted data is more secure, but makes the data slower to access. There are other trade-offs such as using lossy and lossless compression techniques, as described in the chapter on Data In The Computer, for storing and transmitting data. Lossless data compression reduces the number of bits stored or transmitted, but allows complete reconstruction of the original data. Lossy data compression can significantly reduce the number of bits stored or transmitted at the cost of being able to reconstruct only an approximation of the original data. Data is stored in many formats depending on its characteristics, such as size and intended use. The choice of storage media affects both the methods of and costs of manipulating the data it contains. Reading data, which multiple users can do concurrently, and updating data, which typically only one user at a time can do, have different storage requirements.
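The difference between lossless and lossy compression can be seen in a short sketch. The lossless half uses Python's standard zlib module. The lossy half uses rounding as a deliberately simple stand-in for real lossy techniques such as those used for images and audio. The sample data is made up for the example.

```python
import zlib

# Lossless: fewer bytes are stored, and the original is recovered exactly.
original = b"ABABABABAB" * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes ->", len(compressed), "bytes")
print("exact reconstruction:", restored == original)

# Lossy (simplified): rounding discards detail that cannot be recovered.
samples = [12.34, 12.31, 98.76, 45.67]
approximated = [round(s) for s in samples]

print(samples, "->", approximated)
```

Highly repetitive data such as the byte string above compresses very well. The rounded samples take less space to represent, but the digits after the decimal point are gone for good.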
A database is a system for storing and taking care of data (any kind of information).
A database engine can sort, change, or serve the information in the database. The information itself can be stored in many different ways. Before digital computers, card files, printed books, and other methods were used. Now most data is kept in computer files.
A database system is a computer program for managing electronic databases. A very simple example of a database system would be an electronic address book.
The data in a database is organized in some way. Before there were computers, employee data was often kept in file cabinets. There was usually one card for each employee. On the card, information such as the date of birth or the name of the employee could be found. A database also has such "cards". To the user, the card will look the same as it did in old times, only this time it will be on the screen. To the computer, the information on the card can be stored in different ways. Each of these ways is known as a database model. The most commonly used database model is called the relational database model; it uses relations and sets to store the data. Ordinary users discussing the database model will not talk about relations; they will talk about database tables.
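The employee "cards" described above map naturally onto rows in a table. The following sketch uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and employee records are hypothetical.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each row of this table plays the role of one employee card.
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, date_of_birth TEXT)"
)
conn.executemany(
    "INSERT INTO employees (name, date_of_birth) VALUES (?, ?)",
    [("Grace Sample", "1985-11-02"), ("Ada Example", "1990-04-12")],
)

# The database engine sorts and serves the data on request.
rows = conn.execute(
    "SELECT name, date_of_birth FROM employees ORDER BY name"
).fetchall()

for name, date_of_birth in rows:
    print(name, date_of_birth)

conn.close()
```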
Uses for database systems include:
- Storing data
- Storing special information used to manage the data. This information is called metadata and it is not shown to all the people looking at the data.
- Solving cases where many users want to access (and possibly change) the same entries of data.
- Managing access rights (who is allowed to see the data, who can change it).
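The last item in the list, managing access rights, can be sketched as a lookup from roles to permitted actions. Real database systems use their own, more elaborate mechanisms; the role names and the is_allowed helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical roles and the actions each role may perform.
PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())


print(is_allowed("viewer", "read"))
print(is_allowed("viewer", "write"))
print(is_allowed("editor", "write"))
```

An unknown role is given no permissions at all, which is a deliberately cautious default.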
Parts of this page are based on information from: Wikipedia: The Free Encyclopedia