The very first step of data analysis is understanding how the data we get is organized, in other words, checking the schema of our data.
The following are three basic things you can do in Python to get this understanding.
1. Get the list of files
Let’s say we have stored our data in 5 files in a folder called input in the same folder where our code resides.
path = './input/'
The following function get_files() will create a list of all filenames.
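from os import listdir
from os.path import isfile, join

def get_files(path):
    # Keep only regular files, skipping any sub-folders
    return [name for name in listdir(path) if isfile(join(path, name))]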
Now all we need to do is call get_files(path) and store the result in a variable called filenames, which the later steps use.
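As a side note, the same listing can be sketched with pathlib from the standard library. The function name list_files() is our own, hypothetical choice, and sorting the names is an extra step that just makes the output predictable.

```python
from pathlib import Path

def list_files(folder):
    """Return the sorted names of regular files directly inside `folder`."""
    # iterdir() yields both files and sub-folders, so filter with is_file()
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())
```

With the folder from this post, the call would look like filenames = list_files('./input/').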
2. Assess the size of each file
Now that we know how many files are in our folder, we can take a brief look at the size of each one to check whether our system has enough memory to process them.
This is useful because it lets us decide how much data to load given our processing capacity, or whether we should increase that capacity so that the entire dataset can be loaded.
2.1. Find the size of the files
Given the list of filenames in the filenames variable from the previous piece of code, the file_size() function prints the size of each file in the folder given by path.
import os

def file_size(filenames):
    for name in filenames:
        size = os.path.getsize(os.path.join(path, name))  # size in bytes
        if size < 1000000:
            print(name, ':', round(size / 1000, 3), 'kB')
        elif size < 1000000000:
            print(name, ':', round(size / 1000000, 3), 'MB')
        else:
            print(name, ':', round(size / 1000000000, 3), 'GB')
Now, call the file_size() function to get the size of each file in the folder.
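If we want a single number to compare against our memory, a small variation can return the sizes instead of printing them. This is only a sketch: format_size() and total_size() are hypothetical helper names, and decimal units (1 kB = 1000 bytes) are assumed, as in the function above.

```python
import os

def format_size(num_bytes):
    """Format a byte count using decimal units (1 kB = 1000 bytes)."""
    for unit, factor in (('GB', 10**9), ('MB', 10**6), ('kB', 10**3)):
        if num_bytes >= factor:
            return f'{round(num_bytes / factor, 3)} {unit}'
    return f'{num_bytes} B'

def total_size(folder, filenames):
    """Sum the on-disk size, in bytes, of the named files inside `folder`."""
    return sum(os.path.getsize(os.path.join(folder, name)) for name in filenames)
```

For example, format_size(total_size(path, filenames)) would give the combined size of all files in the folder as a readable string.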
2.2. Find the number of rows in each file
Given the list of filenames in the filenames variable from get_files(), the row_count() function prints the number of rows in each file in the folder given by the path variable.
def row_count(filenames):
    for name in filenames:
        with open(path + name) as f:
            # Count lines one at a time so the file is never fully loaded
            rows = sum(1 for _ in f)
        print('File:', name, ', Row count:', rows)
Now, as we did before, just call the row_count() function to get the number of rows in each file in the folder.
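One caveat: counting lines matches the number of rows only if no field contains a line break inside quotes. For CSV files where that may happen, a rough alternative is to count records with the csv module. The name csv_row_count() and the has_header flag are assumptions made for this sketch.

```python
import csv

def csv_row_count(file_path, has_header=True):
    """Count data records in a CSV file, treating quoted newlines as part of a field."""
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(file_path, newline='') as f:
        rows = sum(1 for _ in csv.reader(f))
    # Leave out the header line when the file has one
    return max(rows - 1, 0) if has_header else rows
```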
Once you have an idea of the amount of data you are dealing with, if you are working on Linux, open a terminal and run free (or free -h for human-readable units) to check the physical memory of your system.
If it’s not enough, then you can either increase the memory or use a subset of rows to train your model.
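The same check can be roughly sketched from Python. This assumes a Linux (or similar Unix) system where os.sysconf exposes the page size and page count. Unlike free, it reports only total memory, not what is currently available. The helper names and the 0.5 fraction are hypothetical, and on-disk size is only a loose guide to how much memory the loaded data will need.

```python
import os

def physical_memory_bytes():
    """Total physical RAM in bytes (Linux and most Unix systems)."""
    return os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')

def fits_in_memory(data_bytes, fraction=0.5):
    """Rough heuristic: does the data fit within a fraction of total RAM?"""
    return data_bytes <= fraction * physical_memory_bytes()
```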
3. Understanding the schema
We can check the content of each file and identify the Primary Key of each file and the Foreign Keys that help us see how files are connected to each other.
To do this, we can load the first few rows of each file to separate data frames and check the columns of each file.
Let’s say we have a file called train.csv which has the training data and store.csv which has the store-level data.
We can find the columns of each by calling the info() method.
import pandas as pd

# Load the first 1000 rows of data from each file
train = pd.read_csv(path + 'train.csv', nrows=1000, low_memory=True)
store = pd.read_csv(path + 'store.csv', nrows=1000, low_memory=True)
Calling train.info() and store.info() shows the columns and data types of each dataframe. Using the column names, we can identify how these tables are related to each other.
For example, train.csv and store.csv are connected by the column store_id.
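When there are many files, spotting shared column names by eye gets tedious. A small sketch using only the standard library can read just the header line of two CSV files and list the columns they have in common. The function names here are hypothetical, and a shared name is only a hint of a key relationship, not proof of one.

```python
import csv

def read_header(file_path):
    """Return the column names from the first line of a CSV file."""
    with open(file_path, newline='') as f:
        return next(csv.reader(f), [])

def shared_columns(file_a, file_b):
    """Column names present in both files, in the order they appear in file_a."""
    cols_b = set(read_header(file_b))
    return [c for c in read_header(file_a) if c in cols_b]
```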
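We can join these two tables using the join() function and create a new dataframe training_data. Because join() matches the on column of the calling dataframe against the index of the other dataframe, we first set store_id as the index of store.
training_data = train.join(store.set_index('store_id'), on='store_id', rsuffix='_')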
If there are more files in the folder, they can be joined to the training_data dataframe using the join() function in a similar manner.
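As an alternative to join(), pandas also offers merge(), which matches on a column present in both dataframes. The sketch below uses tiny made-up frames in place of train.csv and store.csv. The column names sales and city are invented for illustration, and the validate argument is an optional safety check, not something the steps above require.

```python
import pandas as pd

# Tiny made-up frames standing in for train.csv and store.csv
train = pd.DataFrame({'store_id': [1, 1, 2], 'sales': [10, 12, 7]})
store = pd.DataFrame({'store_id': [1, 2], 'city': ['Pune', 'Delhi']})

# If store_id is the primary key of store, its values should be unique
print(store['store_id'].is_unique)

# A left join keeps every training row; validate raises an error if
# store_id turns out not to be unique on the right-hand side
training_data = train.merge(store, on='store_id', how='left', validate='many_to_one')
print(training_data)
```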