Data Science - Data Types

Data types are an important concept for performing Exploratory Data Analysis (EDA) properly, and EDA is one of the most underrated parts of Machine Learning. You need a good understanding of the variety of data in order to apply statistical measures accurately and, in turn, to draw correct conclusions.


Table of Contents:

  • Variety of Data
  • General Data Types
  • Likert Scale

Variety of Data

In the digital world, we see exponential growth in data collection. Visualization and analysis of this data can give us insights that can be used for business benefits.

The following are the different types of data that get collected from various sources:

  • Temporal Data: Data with a time component attached to it. Example: opening and closing values of stocks in a year. Here we need a plot that can represent the sequence in data and how the patterns change over time.
  • Geospatial Data: Data that has a physical location as an attribute. Example: location of volcanoes around the world. Here we need a plot that can represent the data on the geographical map.
  • Topical Data: Data concerned with topics. Example: feedback from customers. Here we need a plot that can represent the relationships in the given data.
  • Network Data: Data in the form of nodes and links between nodes. Example: social networking data. Here we need a plot that can represent the relationships between nodes.
  • Tree Data: Network data that also contains a hierarchy. Example: organizational structure. Here we need a plot that can represent the tree structure.

General Data Types


Qualitative/Categorical data

Data that describes characteristics and qualities. Categorical data can also be coded as numbers (Example: 1 for female and 0 for male), but those numbers don’t have mathematical meaning. Qualitative data can be further categorized as:

  • Binary: Exactly two possible values. E.g.: True/False, Yes/No, 1/0.
  • Nominal: Pure labels without inherent order (no label is intrinsically greater or less than any other). E.g.: colors, blood groups, nationality.
  • Ordinal: Labels with an intrinsic order or ranking. Values can be compared, but the magnitude of the difference between them is not well defined; for example, the difference between Elementary and High School is not the same as the difference between High School and College. This is the main limitation of ordinal data. E.g.: education level (Elementary, High School, College), height (short, medium, tall).
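To make the nominal/ordinal distinction concrete, here is a minimal sketch using only the Python standard library. The sample values and the names `EDUCATION_ORDER` and `rank` are made up for illustration; the point is that counting works for any categorical variable, while comparisons and medians need an explicitly declared order.

```python
from collections import Counter
from statistics import median_low

# Hypothetical survey answers (illustrative values only).
blood_groups = ["A", "O", "B", "O", "AB", "O"]            # nominal
education = ["College", "Elementary", "High School",
             "High School", "College", "High School"]     # ordinal

# Nominal: counting and the mode are meaningful; ranking the labels is not.
nominal_mode = Counter(blood_groups).most_common(1)[0][0]

# Ordinal: declare the order explicitly instead of relying on string sorting.
EDUCATION_ORDER = ["Elementary", "High School", "College"]
rank = {label: i for i, label in enumerate(EDUCATION_ORDER)}

# Comparisons and the median are defined on the ranks...
ranks = [rank[label] for label in education]
ordinal_median = EDUCATION_ORDER[median_low(ranks)]
highest = max(education, key=rank.__getitem__)

# ...but differences between ranks are NOT meaningful magnitudes:
# rank["College"] - rank["High School"] == 1 says nothing about how
# large the gap between the two levels really is.

print(nominal_mode)    # O
print(ordinal_median)  # High School
print(highest)         # College
```

Using `median_low` rather than `median` keeps the result on an actual category instead of averaging two ranks, which would quietly treat the data as interval.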

Quantitative/Numerical data

Data that is numerical in nature and can be measured or counted.

  • Interval data: numeric values that represent ordered units where absolute differences are meaningful (addition and subtraction operations can be made). E.g.: temperature in degrees Celsius, calendar years.

The limitation of interval data is that it doesn’t have a “true zero”: 0 °C, for example, does not mean that there is no temperature. Because there is no true zero, ratios are not meaningful (20 °C is not “twice as hot” as 10 °C), so statistics that rely on ratios, such as the coefficient of variation or the geometric mean, can’t be applied.

  • Ratio data: numeric values that represent ordered units where relative differences are meaningful (multiplication and division operations can be made).

Ratio values are the same as interval values, except that they do have an absolute zero, so a statement such as “twice as long” makes sense. Good examples are height, weight, length, etc.
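The effect of a true zero can be checked with a little arithmetic. The sketch below compares two made-up temperature readings in Celsius (interval scale) and in Kelvin (ratio scale); the variable names are illustrative. Differences agree across the two scales, but the ratio only carries meaning on the scale with an absolute zero.

```python
import math

def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvin (0 K is absolute zero)."""
    return c + 273.15

# Two hypothetical readings.
morning_c, noon_c = 10.0, 20.0
morning_k, noon_k = celsius_to_kelvin(morning_c), celsius_to_kelvin(noon_c)

# Differences are meaningful on both scales and agree (up to float rounding).
diff_c = noon_c - morning_c
diff_k = noon_k - morning_k

# Ratios depend on where zero sits, so only the Kelvin ratio is meaningful.
ratio_c = noon_c / morning_c   # 2.0, an artifact of the arbitrary zero
ratio_k = noon_k / morning_k   # about 1.035

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

So 20 °C is about 3.5% warmer than 10 °C in absolute terms, not 100% warmer.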


All quantitative-type variables also come in one of two varieties: discrete and continuous.

Discrete: This data can’t be measured but it can be counted (whole numbers). E.g.: number of floors in a building.

Continuous: This data can’t be counted but it can be measured (it can take any value within a range). E.g.: weight, mileage of a car.

Difference between discrete and continuous

Distinguishing between continuous and discrete can be a little tricky. A rule of thumb: if there are few levels and values can’t be subdivided into further units, the variable is discrete; otherwise, it’s continuous.

If you have a scale that can only take natural number values between 1 and 5, that’s discrete. A quantity that can be measured to two digits, e.g. 2.72, is best characterized as continuous, since we might hypothetically be able to measure to even more digits, e.g. 2.718.

A tricky case like test scores measured between 0 and 100 can only be divided down to single integers, making it initially seem discrete. But since there are so many values, such a feature is usually considered as continuous.
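The rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `guess_variety` and the cutoff `max_levels=10` are assumptions chosen for the example, not an established standard, and a real decision should also consider what the variable represents.

```python
def guess_variety(values, max_levels=10):
    """Heuristic: few distinct whole-number levels -> discrete, else continuous.

    max_levels is an arbitrary, hypothetical threshold for "few levels".
    """
    distinct = set(values)
    all_whole = all(float(v).is_integer() for v in distinct)
    if all_whole and len(distinct) <= max_levels:
        return "discrete"
    return "continuous"

ratings = [1, 3, 5, 4, 2, 3, 3]            # a 1-to-5 scale
weights = [2.72, 3.14, 1.41]               # measured quantities
test_scores = list(range(0, 101, 3))       # whole numbers, but many levels

print(guess_variety(ratings))       # discrete
print(guess_variety(weights))       # continuous
print(guess_variety(test_scores))   # continuous
```

The test scores are whole numbers, yet they land in the continuous bucket because they take many distinct values, which matches the tricky case described above.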

Likert Scale

One form of data you might encounter is response data to a Likert scale like the ones below.

[Figure: a graphical Likert scale with five points, allowing for neutrality]

[Figure: a Likert scale with six points, not allowing for neutrality]

What level of measurement should you consider for this kind of data?

Technically, responses on these kinds of questions should be considered ordinal in nature. There is a clear order in response values, but it may not be the case that the differences between consecutive levels are consistent in size. The criteria to move between Strongly Disagree and Disagree might be different from the criteria between Agree and Strongly Agree. However, Likert data is often treated as interval to simplify analyses. If you have data like this, make sure you use exploratory data visualizations to make a good judgment on how your data should be treated later on in the analysis process.
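As a rough illustration of the two treatments, the sketch below summarizes a handful of made-up five-point responses both ways. The labels, the 1-to-5 coding, and the sample answers are assumptions for the example. The counts and median respect the ordinal nature of the data; the mean is only valid if you are willing to assume equal spacing between levels.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical five-point scale, coded 1..5 in order.
LEVELS = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
code = {label: i + 1 for i, label in enumerate(LEVELS)}

# Made-up responses for illustration.
responses = ["Agree", "Neutral", "Strongly Agree", "Agree",
             "Disagree", "Agree", "Strongly Disagree"]
codes = [code[r] for r in responses]

# Ordinal treatment: frequency counts and the median level.
counts = Counter(responses)
median_label = LEVELS[int(median(codes)) - 1]

# Interval treatment: the mean assumes equally spaced levels.
mean_code = mean(codes)

# A quick text bar chart as a first exploratory look at the distribution.
for label in LEVELS:
    print(f"{label:<18} {'#' * counts[label]}")

print(median_label)          # Agree
print(round(mean_code, 2))   # 3.29
```

If the bar chart looks heavily skewed or bimodal, the mean can be misleading, and the ordinal summaries are usually the safer choice.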
