@Cdaprod
Forked from isaacarnault/OUTPUT.md
Created February 17, 2024 18:42
Data collection using Python

Data collection using Python and R - Using one dataframe

Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.

Scripting in Python and R

This gist focuses on data collection, one of the stages* of the Data Science methodology.

Versioning

I used no versioning system for this gist. My repository's status is flagged as active because it has reached a stable, usable state. The original gist related to this repository is pending as a concept.

Author

Licence

All public gists https://gist.github.com/aiPhD
Copyright 2018, Isaac Arnault
MIT License, http://www.opensource.org/licenses/mit-license.php

Sources

Exercise

  • Perform data collection in Python and R using Jupyter
  • Use the following dataframe from Spatialkey.com.
  • How many observations and variables does the dataframe contain? Base your assessment on your script outputs.

(*) The Data Science methodology comprises ten crucial stages, one of which is data collection. See figures.md.

Vertices of Data Science methodology

isaac-arnault-data-science-methodology.png

See answer

There are 10 variables and 1461 observations in the dataframe.
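
The answer above is read off the dataframe's dimensions. As a minimal sketch, pandas reports `DataFrame.shape` as a `(rows, columns)` tuple, so the observation count comes first and the variable count second. The helper name `summarize_shape` and the toy tuple below are illustrative assumptions, not part of the original gist.

```python
def summarize_shape(shape):
    """Describe a (rows, columns) tuple, such as pandas' DataFrame.shape."""
    n_observations, n_variables = shape  # rows first, columns second
    return f"{n_variables} variables and {n_observations} observations"


# Toy dimensions for illustration only (not the gist's dataframe).
print(summarize_shape((3, 2)))
```

In a notebook where `MyData` is already loaded, `summarize_shape(MyData.shape)` would produce the same kind of sentence for the real dataframe.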

Data collection using Python

See notebook

isaac-arnault-data-collection-P.png

Data collection using R

See notebook

isaac-arnault-data-collection-R.png

# 1. Check the Python version (the leading "!" runs a shell command from Jupyter)
!python -V
# 2. Import pandas and read the dataframe
import pandas as pd
pd.set_option('display.max_columns', None)  # show every column when printing
MyData = pd.read_csv("http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv")
# 3. Show the first rows of the dataframe
MyData.head()
# 4. Get the dimensions of the dataframe as (rows, columns)
MyData.shape
# Full code
!python -V
import pandas as pd
pd.set_option('display.max_columns', None)
MyData = pd.read_csv("http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv")
MyData.head()
MyData.shape
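
As a cross-check on the dimensions that does not depend on pandas or on the remote file being reachable, the same counts can be derived with Python's standard `csv` module. This is only a sketch: it assumes the first row is a header, and the sample data, its column names, and the function name `csv_dimensions` are hypothetical stand-ins, not the contents of SalesJan2009.csv.

```python
import csv
import io

# Hypothetical three-row sample standing in for a downloaded CSV file.
SAMPLE_CSV = """Product,Price,Country
Product1,1200,United Kingdom
Product2,3600,United States
Product1,1200,Australia
"""


def csv_dimensions(text):
    """Return (observations, variables) for CSV text whose first row is a header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return (0, 0)
    header, records = rows[0], rows[1:]
    return (len(records), len(header))


print(csv_dimensions(SAMPLE_CSV))
```

The tuple order mirrors `MyData.shape`, so the two results can be compared directly once the real file has been read.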
# 1. Check the R version
R.Version()$version.string
# 2. Download the dataframe from a remote server (the destination directory must already exist)
download.file("http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv",
              destfile = "/resources/data/SalesJan2009.csv", quiet = TRUE)
# 3. Read the dataframe, then print the first 5 observations
MyData <- read.csv("/resources/data/SalesJan2009.csv")
head(MyData, 5)
# 4. Get the dimensions of the dataframe: number of variables (columns), number of observations (rows)
ncol(MyData)
nrow(MyData)
# Full code
R.Version()$version.string
download.file("http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv",
              destfile = "/resources/data/SalesJan2009.csv", quiet = TRUE)
MyData <- read.csv("/resources/data/SalesJan2009.csv")
head(MyData, 5)
ncol(MyData)
nrow(MyData)