Locating Datasets: Exploring the Information Landscape
Data is everywhere, but finding a reputable, accessible source of information that answers your particular questions can be much harder than it seems.
Whether you are building a new data project for your portfolio or starting your first project as a paid junior data analyst, your first task will be to locate a suitable dataset. I’ll briefly go over the types of datasets out there, the factors to consider when choosing one, and where to find them.
Let’s learn what a dataset is.
What is a dataset?
A dataset is simply a collection of information. Oftentimes, this information is arranged in a specific format, but it may not be structured in a way that is immediately useful to you, and it will need a bit of work on your part to make it usable. Common types of datasets include:
Tabular datasets – Data arranged in rows and columns, like a spreadsheet.
Time-series data – Data that is ordered chronologically.
Numerical data – Quantitative information that can be measured and expressed numerically, such as sales figures, temperatures, or ages.
Other datasets may include collections of images, text documents, or audio or video recordings.
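To make the first two types concrete, here is a minimal sketch using only Python’s standard library. The sample values, the column names (`date`, `store`, `sales`), and the helper names `load_rows` and `to_time_series` are all made up for illustration; a real dataset will look different, and in practice a library such as pandas is a common choice for this kind of work.

```python
import csv
import io
from datetime import date

# Hypothetical sample data, invented purely for illustration.
SAMPLE_CSV = """date,store,sales
2024-01-03,North,120.5
2024-01-01,North,98.0
2024-01-02,North,101.25
"""


def load_rows(csv_text):
    """Parse CSV text into a tabular structure: one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_time_series(rows):
    """Turn tabular rows into (date, sales) pairs ordered chronologically."""
    series = [(date.fromisoformat(r["date"]), float(r["sales"])) for r in rows]
    return sorted(series, key=lambda pair: pair[0])


rows = load_rows(SAMPLE_CSV)      # tabular: rows and columns
series = to_time_series(rows)     # time series: same data, ordered by date

for day, sales in series:
    print(day.isoformat(), sales)
```

The same records are tabular when read as rows and columns, and become a time series once they are ordered by their date field. The `sales` column is also an example of numerical data.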
Factors to Consider When Choosing a Dataset
- Relevance: It’s crucial for the dataset to align with the topic or goals of your project.
- Size and Format: Large projects may require sizable datasets, and the data should come in a format that is compatible with your analysis tools (e.g., CSV, JSON).
- Data Quality: Another factor to consider is how much cleaning and wrangling is necessary to get the data into a usable format. Choosing a more curated dataset may save you time. However, messier data is often unavoidable, and it takes significant effort to ensure fields are in the same format, missing values are addressed, and duplicate records are removed.
- Reliability: One of the crucial steps when choosing a data source is assessing its reliability. You’ll want to verify that the source is reputable. Reliability depends on how the data was collected, which populations are represented in the data, and whether there were any biases in the collection process.
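A quick way to judge how much cleaning a candidate dataset needs is to count its missing values and duplicate rows before committing to it. The sketch below does that with Python’s standard library. The sample data and the function names `quality_report` and `clean_rows` are hypothetical, and the code assumes every row has the same number of fields as the header.

```python
import csv
import io

# Hypothetical messy sample, invented for illustration only.
MESSY_CSV = """name,age,city
Ada,36,Lagos
Ben,,Accra
Ada,36,Lagos
 chi ,29,nairobi
"""


def quality_report(csv_text):
    """Count rows, missing values per column, and exact duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = {}
    seen = set()
    duplicates = 0
    for row in rows:
        # An empty or whitespace-only field counts as missing.
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] = missing.get(column, 0) + 1
        # A row identical to an earlier one counts as a duplicate.
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


def clean_rows(csv_text):
    """Trim whitespace, mark empty fields as None, drop duplicate rows."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        normalized = {col: (val or "").strip() or None for col, val in row.items()}
        key = tuple(normalized.items())
        if key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return cleaned


report = quality_report(MESSY_CSV)
cleaned = clean_rows(MESSY_CSV)
print(report)
print(len(cleaned), "rows after cleaning")
```

This only covers the simplest checks. Deciding how to fill or drop missing values, or how to standardize things like capitalization and date formats, depends on the dataset and the question you are asking.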
Categories of Datasets and Where to Find Them
Searching for reliable datasets to work with can be a time-consuming task. There are many free datasets available, although many others are paid or even proprietary.
Here are some places to find free datasets on the internet:
Type of Dataset | Where to Find Them | Application |
Open Government Datasets | Data.gov (U.S.), UK Government Data, European Union Open Data Portal | Useful for environmental and health research |
Academic and Research Datasets | Harvard Dataverse, UCI Machine Learning Repository, IEEE DataPort | Useful for research-based projects, machine learning, and academic studies |
Social Media and Web Scraping | Twitter API, Reddit API, GitHub repositories, DataCamp | Useful for research on how people interact with particular posts and content |
Health and Medical Datasets | National Institutes of Health (NIH), World Health Organization (WHO), Kaggle | Ideal for healthcare studies and medical research projects |
E-commerce and Business Datasets | Amazon Web Services Public Data Sets, Google Dataset Search, Datafrik | Useful for business analysis, customer behavior studies, and financial market research |
Environmental and Geographic Datasets | NASA, NOAA, Global Biodiversity Information Facility | Relevant for climate science, ecology, geospatial analysis, and environmental studies |
Not sure what to do after you’ve acquired the dataset? Here’s a guide on how to clean data with Python.
Conclusion
Choosing the right dataset is foundational to the success of any data-driven project. Datasets collect information in one place, making it possible to identify trends and make predictions. Finding datasets to analyze may seem daunting at first, but knowing a few places to start looking can make all the difference. Check out Datafrik’s selection of interesting datasets.