Project 1: SQL, Tableau, Different Usage Patterns, Bike Share Service in Chicago (Google Capstone Project)


Introduction


In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders Based on that, the stakeholder believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, she believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

The stake holder’s goal is clear: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.

Among the three questions, I have been assigned to answer the first question:

 

How do annual members and casual riders use Cyclistic bikes differently?


1. Dataset : the data containing the previous 12 months of Cyclistic trip data from 2020-04-01 to 2021-04-30

                Link : Bike Share Dataset from 2020-04-01 to 2021 04-30 


2. Files :


202004-divvy-tripdata.csv, 202005-divvy-tripdata.csv,  202006-divvy-tripdata.csv,  202007-divvy-tripdata.csv,  202008-divvy-tripdata.csv,  202009-divvy-tripdata.csv,  202010-divvy-tripdata.csv,  202011-divvy-tripdata.csv,  202012-divvy-tripdata.csv,  202101-divvy-tripdata.csv,  202102-divvy-tripdata.csv,  202103-divvy-tripdata.csv,  202104-divvy-tripdata.csv

    3. Size : 656 MB


    4. Column :


ride_id (TEXT) : an id for each usage of Cyclistic bike share service

rideable_type (TEXT) : Type of bike rented : ‘classic_bike’, ‘docked_bike’, ‘electric_bike’

started_at (TIMESTAMP) : Starting service timestamp recorded as users undock or unlock   (electric bike only) a bike

ended_at (TIMESTAMP) : End service timestamp recorded as users dock  or lock (electric bike only) a bike

start_station_name (TEXT) : Name of the station where users started using bike share service by undocking or unlocking(electric bike only)

start_station_id (INTEGER) : Unique id for each start station

end_station_name (TEXT) : Name of the station where users ended using the service by docking or locking(electric bike only)

end_station_id (INTEGER) : Unique id for each end station

start_lat (FLOAT) : Latitude of service start point

start_lng (FLOAT) : Longitude of service start point

end_lat (FLOAT) : Latitude of service end point

end_lng (FLOAT) : Latitude of service end point

member_casual (TEXT) : User type : ‘member’ , ‘casual’

 

5. Data Credibility and Source :


The data has been made available by Motivate International Inc. under this license

Data-privacy issues prohibit me from using riders’ personally identifiable information.

This means that I won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.


    6. Tools used in this analysis:


1.       SQL (POSTGRESQL) and Tableau (for visualization)

2.       Why : The size of dataset is over 650MB which is too large to be analyzed in spreadsheets.

            The dataset contains geo-tracked GPS coordinates that can be visualized in tableau.

 


PROCESING AND CLEANING:

 

1Importing the dataset to local POSTGRESQL database:

 

        1) Importing 13 datasets to local POSTGRESQL database:


CREATE TABLE bike_20_04 (
	ride_id TEXT,
	rideable_type TEXT,
	started_at TIMESTAMP,
	ended_at TIMESTAMP,
	start_station_name TEXT,
	start_station_id TEXT,
	end_station_name TEXT,
	end_station_id TEXT,
	start_lat NUMERIC,
	start_lng NUMERIC,
	end_lat NUMERIC,
	end_lng NUMERIC,
	member_casual TEXT
)

COPY bike_20_04
FROM 'C:\data\bike\202004-divvy-tripdata.csv' 
DELIMITER ',' 
CSV HEADER;

-- repeated the same process for rest of 12 csv fiies


2) Uniting tables, making backup files and import united table: 


--UNITE the 13 datasets in the database
Copy (
SELECT *
FROM bike_20_04 
UNION ALL
SELECT *
FROM bike_20_05
UNION ALL
SELECT *
FROM bike_20_06
UNION ALL
SELECT *
FROM bike_20_07
UNION ALL
SELECT *
FROM bike_20_08
UNION ALL
SELECT *
FROM bike_20_09
UNION ALL
SELECT *
FROM bike_20_10
UNION ALL
SELECT *
FROM bike_20_11
UNION ALL
SELECT *
FROM bike_20_12
UNION ALL
SELECT *
FROM bike_21_01
UNION ALL
SELECT *
FROM bike_21_02
UNION ALL
SELECT *
FROM bike_21_03
UNION ALL
SELECT *
FROM bike_21_04
	)
	To 'C:\data\bike2021.csv' 
DELIMITER ','
 CSV HEADER;

-- Create a new table in local POSTGRES database for the united csv file:

CREATE TABLE bike2021 (
	ride_id TEXT,
	rideable_type TEXT,
	started_at TIMESTAMP,
	ended_at TIMESTAMP,
	start_station_name TEXT,
	start_station_id TEXT,
	end_station_name TEXT,
	end_station_id TEXT,
	start_lat NUMERIC,
	start_lng NUMERIC,
	end_lat NUMERIC,
	end_lng NUMERIC,
	member_casual TEXT
					  );

-- Importing the united csv to the table in local POSTGRESQL database.  

COPY bike2021
FROM 'c:\data\bike2021.csv'
DELIMITER ','
CSV HEADER;


    2. Checking total number of rows and rows with null values:


SELECT COUNT(*) AS total,
	   COUNT(*) FILTER (WHERE ride_id IS NULL) AS ride_id,
	   COUNT(*) FILTER (WHERE rideable_type IS NULL) AS rideable_type,
	   COUNT(*) FILTER (WHERE started_at IS NULL) AS started_at,
	   COUNT(*) FILTER (WHERE ended_at IS NULL) AS ended_at,
	   COUNT(*) FILTER (WHERE start_station_name IS NULL) AS start_station_name,
	   COUNT(*) FILTER (WHERE start_station_id IS NULL) AS start_station_id,
	   COUNT(*) FILTER (WHERE end_station_name IS NULL) AS end_station_name,
	   COUNT(*) FILTER (WHERE end_station_id IS NULL) AS end_station_id,
	   COUNT(*) FILTER (WHERE start_lat IS NULL) AS start_lat,
	   COUNT(*) FILTER (WHERE start_lng IS NULL) AS start_lng,
	   COUNT(*) FILTER (WHERE end_lat IS NULL) AS end_lat,
	   COUNT(*) FILTER (WHERE end_lng IS NULL) AS end_lng,
	   COUNT(*) FILTER (WHERE member_casual IS NULL) AS member_casual
FROM bike2021; 

The dataset has total 3826978 rows, and null values in 6 columns, start_station_name (148231) start_station_id (148857), end_station_name (171416), end_station_id (171877), end_lat (5005), end_lng (5005).

Before deleting rows with null values, I will see if there is inconsistency, wrong values, white spaces and duplicates first so that I can figure out if can fill in null values referring to other rows or columns


3. Whitespace, duplicates, consistency, wrong value check, column by column:


        1) Whitespace check:


--white space check on all columns
SELECT *
FROM bike2021
WHERE ride_id SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR rideable_type SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR started_at::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR ended_at::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR start_station_name SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR start_station_id SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR end_station_name SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR end_station_id SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR start_lat::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR start_lng::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR end_lat::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR end_lng::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' 
	  OR member_casual SIMILAR TO '[ \t\v\b\r\n\u00a0]';

Whitespace does not exist.


        2) column by column check - ride_id


--checking whether there is any duplicate
SELECT ride_id, COUNT(*)
FROM bike2021
GROUP BY ride_id
HAVING COUNT(*) > 1

--entire rows with ride_id duplicates 
SELECT *
FROM bike2021
INNER JOIN (
            SELECT ride_id, COUNT(*)
            FROM bike2021
            GROUP BY ride_id
            HAVING COUNT(*) > 1
             ) AS ll
USING (ride_id)
ORDER BY ride_id

--counting rows with non-sense timestamp values (start date > end date)

SELECT COUNT(*)
FROM bike2021
WHERE started_at > ended_at;

--delete
DELETE
FROM bike2021
WHERE started_at > ended_at;

The column, ride_id had 209 rows with duplicates. They contained non-sense start timestamp and end timestamp value, end timestamp being on earlier date than start one. I deleted 10557 rows with those timestamp values.


        3) rideable_type:


SELECT DISTINCT rideable_type
FROM bike2021;

There is no problem in this column.


        4) started_at and end_at:


--rows with same start and end time but different start and end stations, and different latitudes and longitudes 
SELECT *
FROM bike2021
WHERE started_at = ended_at AND start_station_name != end_station_name;

--deleting the rows with start and end time which make no sense
DELETE
FROM bike2021
WHERE started_at = ended_at AND start_station_name != end_station_name;

I have deleted 176 rows with same start and end time but different start and end stations and different latitudes and longitudes, which make no sense.


        5) member_casual:


SELECT DISTINCT member_casual
FROM bike2021;

There is no problem in this column.


        6) start_station_name:



SELECT DISTINCT start_station_name
FROM bike2021;

/* Finding out whether the ‘(*)’, ‘(Temp)’, ‘(Halsted)’ 
marked names refer to a different location 
than the actual one with non-marked station names */
 
SELECT start_station_id, b.start_station_name, flt.start_station_name,
	   b.start_lat, b.start_lng, flt.start_lat, flt.start_lng
FROM bike2021 AS b
INNER JOIN (
	     SELECT start_station_id, start_station_name, 
	             start_lat, start_lng,
	             COUNT(*) OVER (PARTITION BY start_station_id) AS freq
	     FROM bike2021
		  --filtering rows with inconsistent station names only
             WHERE start_station_name SIMILAR TO '%[\(\)|\*\|]%'
	     GROUP BY start_station_id, start_station_name, start_lat, start_lng
	     ) AS flt
USING (start_station_id)
WHERE b.start_station_name NOT SIMILAR TO '%[\(\)|\*\|]%' 
GROUP BY start_station_id, b.start_station_name, flt.start_station_name,
	   	 b.start_lat, b.start_lng, flt.start_lat, flt.start_lng
/* filtering joined rows with inconsistent names which have more than 100 meter 
   line distance difference from the actual GPS coordinates of the station. */
HAVING (flt.start_lat > MODE() WITHIN GROUP (ORDER BY b.start_lat) + 0.001
	   OR flt.start_lat < MODE() WITHIN GROUP (ORDER BY b.start_lat) - 0.001)
	   OR (flt.start_lng > MODE() WITHIN GROUP (ORDER BY b.start_lng) + 0.001
	   OR flt.start_lng < MODE() WITHIN GROUP (ORDER BY b.start_lng) - 0.001);

It turned out that same station names with ‘(*)’, ‘(Temp)’, ‘(Halsted)’ do have around 100 meter distance difference from the actual station although they share the same id with the other consistent station names. Since it's due to a distance difference, I will leave them as they are.


        7) end_station_name:



SELECT DISTINCT end_station_name
FROM bike2021;

/* Finding out whether the ‘(*)’, ‘(Temp)’, ‘(Halsted)’ 
marked names refer to a different location than 
the actual one with non-marked station names */
 
SELECT end_station_id, b.end_station_name, flt.end_station_name,
	   b.end_lat, b.end_lng, flt.end_lat, flt.end_lng
FROM bike2021 AS b
INNER JOIN	(
	         SELECT end_station_id, end_station_name, 
		         end_lat, end_lng,
	   	         COUNT(*) OVER (PARTITION BY end_station_id) AS freq
		  FROM bike2021
		 --filtering rows with inconsistent station names only
		  WHERE end_station_name SIMILAR TO '%[\(\)|\*\|]%'
		  GROUP BY end_station_id, end_station_name, end_lat, end_lng
		  ) AS flt
USING (end_station_id)
WHERE b.end_station_name NOT SIMILAR TO '%[\(\)|\*\|]%' 
GROUP BY end_station_id, b.end_station_name, flt.end_station_name,
	   	 b.end_lat, b.end_lng, flt.end_lat, flt.end_lng
/* filtering joined rows with inconsistent names which have more than 100 meter 
   line distance difference from the actual GPS coordinates of the station. */
HAVING (flt.end_lat > MODE() WITHIN GROUP (ORDER BY b.end_lat) + 0.001
	   OR flt.end_lat < MODE() WITHIN GROUP (ORDER BY b.end_lat) - 0.001)
	   OR (flt.end_lng > MODE() WITHIN GROUP (ORDER BY b.end_lng) + 0.001
	   OR flt.end_lng < MODE() WITHIN GROUP (ORDER BY b.end_lng) - 0.001);

Similar to start_station_name, there are several ‘(*)’, ‘(Temp)’, ‘(Halsted)’ marked names referring to a different location. Again, I will leave them as they are.


        8) start_id, and end_station_id:


-Checking the inconsistent values in the start_station_id column:

SELECT DISTINCT bike2021.start_station_id, start_station_name, l.start_station_id
FROM bike2021
LEFT JOIN (
           SELECT start_station_id, start_station_name
           FROM bike2021
             --rows not with 3 digit number
           WHERE start_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'
           GROUP BY start_station_id, start_station_name
            ) AS l
USING (start_station_name)
WHERE bike2021.start_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}';

--Checking the inconsistent values in the end_station_id column:
SELECT DISTINCT bike2021.end_station_id, end_station_name, l.end_station_id
FROM bike2021
LEFT JOIN (
           SELECT end_station_id, end_station_name
           FROM bike2021
             --rows not with 3 digit number
           WHERE end_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'
           GROUP BY end_station_id, end_station_name
            ) AS l
USING (end_station_name)
WHERE bike2021.end_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}';


There is an inconsistency in station id values. The same station has 5 digit id or one with letters besides a correct 3 digit id. Confirming that those inconsistent ids don’t refer to a different location, I will replace them with a correct 3 digit one:



--For start_station_id
UPDATE bike2021
SET start_station_id = refer.start_station_id
FROM (
	SELECT DISTINCT start_station_name, start_station_id
	FROM bike2021
	WHERE start_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'
	) AS refer
WHERE bike2021.start_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'
AND bike2021.start_station_name = refer.start_station_name;

--For end_station_id
UPDATE bike2021
SET end_station_id = refer.end_station_id
FROM (
	SELECT DISTINCT end_station_name, end_station_id
	FROM bike2021
	WHERE end_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'
	) AS refer
WHERE bike2021.end_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'
AND bike2021.end_station_name = refer.end_station_name;



I also have to check if ids are correctly assigned to station names:



--Checking 3 digit ids wrongly assigned to station names 
SELECT *
FROM (
SELECT start_station_name, 
	   start_station_id, 
	   COUNT(start_station_id) OVER (PARTITION BY start_station_name) AS count_ids,
	   AVG(start_lat) AS slat, 
	   AVG(start_lng) AS slng,
	   --To check if inconsistent ids on one same station refer to different locations (0.001 latitude difference = around 111 meters)
	   ABS(AVG(start_lat) - LAG(AVG(start_lat)) OVER (PARTITION BY start_station_name)) > 0.001,
	   --To check if inconsistent ids on one same station refer to different locations (0.001 longitude difference = around 111 meters)
	   ABS(ABS(AVG(start_lng)) - ABS(LAG(AVG(start_lng)) OVER (PARTITION BY start_station_name))) > 0.001
FROM bike2021
GROUP BY start_station_name, start_station_id
ORDER BY start_station_name
	) AS grouped
WHERE count_ids > 1;



--For end stations
SELECT *
FROM (
SELECT end_station_name, 
	   end_station_id, 
	   COUNT(end_station_id) OVER (PARTITION BY end_station_name) AS count_ids,
	   AVG(end_lat) AS elat, 
	   AVG(end_lng) AS elng,
	   --To check if inconsistent ids on one same station refer to different locations (0.001 latitude difference = 111 meters)
	   ABS(AVG(end_lat) - LAG(AVG(end_lat)) OVER (PARTITION BY end_station_name)) > 0.001,
	   --To check if inconsistent ids on one same station refer to different locations (0.001 longitude difference = 111 meters)
	   ABS(ABS(AVG(end_lng)) - ABS(LAG(AVG(end_lng)) OVER (PARTITION BY end_station_name))) > 0.001
FROM bike2021
GROUP BY end_station_name, end_station_id
ORDER BY end_station_name
	) AS grouped
WHERE count_ids > 1;


It turned out that one station name is assigned to a wrong id resulting in one id being used for two different stations. I will split them into two different ids:



UPDATE bike2021
SET start_station_id = '651'
WHERE start_station_name = 'Michigan Ave & 71st St';

UPDATE bike2021
SET start_station_id = '800'
WHERE start_station_name = 'Halsted St & 104th St';

UPDATE bike2021
SET end_station_id = '651'
WHERE end_station_name = 'Michigan Ave & 71st St';

UPDATE bike2021
SET end_station_id = '800'
WHERE end_station_name = 'Halsted St & 104th St';


        9) start_lat, start_lng, end_lat, end_lng:

Note: Since values in the columns start_lat, start_lng, end_lat, end_lng are GPS coordinates of each ride, there is a slight difference among values within same stations, millimeter in extreme case. Also, electric bikes record start or end of service when they are not only undocked or ducked but also unlocked or locked. This generates many different latitude and longitude values.


SELECT bike2021.start_station_name, *
FROM bike2021
LEFT JOIN (
         SELECT start_station_name,
	          MODE() WITHIN GROUP (ORDER BY start_lat) AS mod_slat,
	          MODE() WITHIN GROUP (ORDER BY start_lng) AS mod_slng
         FROM bike2021
         GROUP BY start_station_name
	   ) AS with_modes
USING(start_station_name)
-- filtering rows with more than around +-150 meters in line distance away from a mode station location.
WHERE (start_lat > mod_slat + 0.0015 
	  OR start_lat < mod_slat - 0.0015) 
	  AND (start_lng > mod_slng + 0.0015 
	  OR start_lng < mod_slng - 0.0015);

--For end_lat and end_lng
SELECT bike2021.end_station_name, *
FROM bike2021
LEFT JOIN (
         SELECT end_station_name,
	          MODE() WITHIN GROUP (ORDER BY end_lat) AS mod_elat,
	          MODE() WITHIN GROUP (ORDER BY end_lng) AS mod_elng
         FROM bike2021
         GROUP BY end_station_name
	   ) AS with_modes
USING(end_station_name)
-- filtering rows with more than around +-150 meters in line distance away from a mode station location.
WHERE (end_lat > mod_elat + 0.0015 
	  OR end_lat < mod_elat - 0.0015) 
	  AND (end_lng > mod_elng + 0.0015 
	  OR end_lng < mod_elng - 0.0015);


There are 23 and 3586 values in the start and end coordinates columns respectively, which indicate completely different locations than the actual station. 

Those are all recorded by electric bikes. It is perhaps due to errors caused by electric bike that records start and end of service not only when undocked and docked but also when unlocked and locked.

I will delete them:


--for start lat and lng
DELETE
FROM bike2021
USING (
	SELECT bike2021.start_station_name, *
	FROM bike2021
	LEFT JOIN (
	           SELECT start_station_name,
		           MODE() WITHIN GROUP (ORDER BY start_lat) AS mod_slat,
		           MODE() WITHIN GROUP (ORDER BY start_lng) AS mod_slng
	           FROM bike2021
	           GROUP BY start_station_name
		 ) AS with_modes
	USING(start_station_name)
	-- filtering rows with more than around +-150 meters in line distance away from a mode station location.
	WHERE (start_lat > mod_slat + 0.0015 
		  OR start_lat < mod_slat - 0.0015) 
		  AND (start_lng > mod_slng + 0.0015 
		  OR start_lng < mod_slng - 0.0015)
	             ) AS refer
WHERE bike2021.start_lat = refer.start_lat 
	  AND bike2021.start_lng = refer.start_lng;

--for end lat and lng
DELETE
FROM bike2021
USING (
	SELECT bike2021.end_station_name, *
	FROM bike2021
	LEFT JOIN (
	           SELECT end_station_name,
		           MODE() WITHIN GROUP (ORDER BY end_lat) AS mod_elat,
		           MODE() WITHIN GROUP (ORDER BY end_lng) AS mod_elng
	           FROM bike2021
	           GROUP BY end_station_name
		    ) AS with_modes
	USING(end_station_name)
	-- filtering rows with more than around +-150 meters in line distance away from a mode station location.
	WHERE (end_lat > mod_elat + 0.0015 
		  OR end_lat < mod_elat - 0.0015) 
		  AND (end_lng > mod_elng + 0.0015 
		  OR end_lng < mod_elng - 0.0015)
	  ) AS refer
WHERE bike2021.end_lat = refer.end_lat 
	  AND bike2021.end_lng = refer.end_lng;

There are latitude and longitude values with only 1 or 2 decimal digits, which means those coordinates can indicate wrong start station names. 

I need to check them all:
 
--start station lat, lon values with 1 or 2 digits only (non-null rows)
SELECT start_station_name, start_lat, start_lng, rideable_type, 
	   COUNT(DISTINCT start_lat) AS start_lat, 
	   COUNT(DISTINCT start_lng) AS start_lng
FROM bike2021
--filter values with decimal digits less than 3 
WHERE (start_lat::TEXT NOT LIKE '%.___%'  
	  AND start_lng::TEXT NOT LIKE '%.___%')
	  AND start_station_name IS NOT NULL
GROUP BY start_station_name, start_lat, start_lng, rideable_type
ORDER BY start_station_name;

--end station lat, lon values with 1 or 2 digits only (non-null rows)
SELECT end_station_name, end_lat, end_lng, rideable_type, 
	   COUNT(DISTINCT end_lat) AS end_lat, 
	   COUNT(DISTINCT end_lng) AS end_lng
FROM bike2021
--filter values with decimal digits less than 3 
WHERE (end_lat::TEXT NOT LIKE '%.___%'  
	  AND end_lng::TEXT NOT LIKE '%.___%')
	  AND end_station_name IS NOT NULL
GROUP BY end_station_name, end_lat, end_lng, rideable_type
ORDER BY end_station_name;

There are 17 latitude(lat)-longitude(lon) values which have only 1 or 2 decimal digits. On google map, I compared all of 17 coordinate locations to the location which their station names indicate. 

It turned out that only the value with ‘S Michigan Ave & E 118th St’ as a start station, and ‘Woodlawn & 103rd - Olive Harvey Vaccination Site’ as an end station indicate the same location. 

I will delete all except those ones:
 
--for lat,lon start stations
DELETE
FROM bike2021
WHERE (start_lat::TEXT NOT LIKE '%.___%'  
	  AND start_lng::TEXT NOT LIKE '%.___%')
	  AND start_station_name IS NOT NULL
	  AND start_station_name != 'S Michigan Ave & E 118th St';

--for lat,lon end stations
DELETE
FROM bike2021
WHERE (end_lat::TEXT NOT LIKE '%.___%'  
	  AND end_lng::TEXT NOT LIKE '%.___%')
	  AND end_station_name IS NOT NULL
	  AND end_station_name != 'Woodlawn & 103rd - Olive Harvey Vaccination Site';

1235 rows with 1 or 2 digit lat-lon and 848 rows with 1 or 2 digit lat-lon have been deleted.



4. Looking into null values:


        1) nulls related to electric bikes:


--counting rows with nulls when filtered by ‘electric bike’ 
SELECT COUNT(*)
FROM bike2021
WHERE rideable_type = 'electric_bike'
	  AND start_station_name IS NULL

--check if the start station null values can be replaced by matching values in other rows
SELECT DISTINCT start_station_name, 
	   start_station_id, 
	   l.start_lat, l.start_lng,
	   COUNT(*) OVER () AS total_freq
FROM bike2021
INNER JOIN (
            SELECT DISTINCT start_lat, start_lng
            FROM bike2021
            WHERE rideable_type = 'electric_bike'
	              AND start_station_name IS NULL
             ) AS l
USING (start_lat, start_lng)
WHERE bike2021.start_station_name IS NOT NULL
	  OR bike2021.start_station_id IS NOT NULL
ORDER BY start_station_name;

--end station name
SELECT DISTINCT l.end_lat, l.end_lng,
	   end_station_name, end_station_id
FROM bike2021
INNER JOIN (
            SELECT DISTINCT end_lat, end_lng
            FROM bike2021
            WHERE rideable_type = 'electric_bike'
	              AND end_station_name IS NULL
	     ) AS l
USING (end_lat, end_lng)
WHERE (bike2021.end_station_name IS NOT NULL
	  OR bike2021.end_station_id IS NOT NULL)
ORDER BY end_lat, end_lng;

Total 147636 rows have nulls in either start_station_name or start_station_id when filtered by ‘electric_bike’. 

There are rows with latitude and longitude but have no station name. 

I will replace those nulls in the start and end station name with the values from other rows having the same latitudes and longitudes:

 
--for start station name
UPDATE bike2021
SET start_station_name = rf_station
FROM (
	  SELECT start_lat, start_lng, MAX(start_station_name) AS rf_station
	  FROM bike2021
	  WHERE rideable_type = 'electric_bike'
			AND start_station_name IS NOT NULL
	  GROUP BY start_lat, start_lng
	 ) AS rf
WHERE (rf.start_lat = bike2021.start_lat 
	  AND rf.start_lng = bike2021.start_lng )
	  AND bike2021.rideable_type = 'electric_bike'
	  AND bike2021.start_station_name IS NULL;

--for end station name
UPDATE bike2021
SET end_station_name = rf_station
FROM (
	  SELECT end_lat, end_lng, MAX(end_station_name) AS rf_station
	  FROM bike2021
	  WHERE rideable_type = 'electric_bike'
			AND end_station_name IS NOT NULL
	  GROUP BY end_lat, end_lng
	 ) AS rf
WHERE (rf.end_lat = bike2021.end_lat 
	  AND rf.end_lng = bike2021.end_lng )
	  AND bike2021.rideable_type = 'electric_bike'
	  AND bike2021.end_station_name IS NULL;


        2) Checking if start and end station nulls can be replaced based on latitude and longitude:


--start station names
SELECT *
FROM bike2021
INNER JOIN (
            SELECT start_lat, start_lng, MAX(start_station_name)
	      FROM bike2021
	      WHERE start_station_name IS NOT NULL
	      GROUP BY start_lat, start_lng
             ) AS l
USING (start_lat, start_lng)
WHERE start_station_name IS NULL;

--end station names
SELECT *
FROM bike2021
INNER JOIN (
	      SELECT end_lat, end_lng, MAX(end_station_name)
	      FROM bike2021
	      WHERE end_station_name IS NOT NULL
	      GROUP BY end_lat, end_lng
             ) AS l
USING (end_lat, end_lng)
WHERE end_station_name IS NULL;

There is none to be replaced.
 
 
        3) Checking if nulls in start and end station ids are replaceable based on latitude and longitude:


--start station ids
SELECT *
FROM bike2021
INNER JOIN (
	     SELECT start_lat, start_lng, MAX(start_station_id)
	     FROM bike2021
	     WHERE start_station_id IS NOT NULL
	     GROUP BY start_lat, start_lng, start_station_id
) AS l
USING (start_lat, start_lng)
WHERE bike2021.start_station_id IS NULL;

--end station ids
SELECT *
FROM bike2021
INNER JOIN (
	     SELECT end_lat, end_lng, MAX(end_station_id) 
	     FROM bike2021
	     WHERE end_station_id IS NOT NULL
	     GROUP BY end_lat, end_lng
) AS l
USING (end_lat, end_lng)
WHERE bike2021.end_station_id IS NULL;

There is one case in end station that can be replaced:
 
UPDATE bike2021
SET end_station_id = '20133'
WHERE end_station_name = 'Woodlawn & 103rd - Olive Harvey Vaccination Site'
	  AND end_station_id IS NULL


        4) Checking if nulls in station names can be replaced based on id, or vice versa:


SELECT *
FROM bike2021
WHERE (end_station_name IS NULL
	          AND end_station_id IS NOT NULL)
	  OR (start_station_name IS NULL
		  AND start_station_id IS NOT NULL)
	  OR (end_station_name IS NOT NULL
		  AND end_station_id IS NULL)
	  OR (start_station_name IS NOT NULL
		  AND start_station_id IS NULL);

There is none to be replaced.


 
5. Deleting nulls


Final check on nulls:

SELECT *
FROM bike2021
WHERE (start_station_name IS NULL
	  OR start_station_id IS NULL
	  OR end_station_name IS NULL
	  OR end_station_id IS NULL)
	  AND end_lat IS NOT NULL
	  AND end_lng IS NOT NULL
	  AND start_lat::TEXT LIKE '%.___%'
	  AND start_lng::TEXT LIKE '%.___%'
	  AND end_lat::TEXT LIKE '%.___%'
	  AND end_lng::TEXT LIKE '%.___%';


All nulls in the dataset can be divided into two groups:

1. Not replaceable group: rows with all of end_lat, end_lng, end_station_name, end_station_name are nulls. 

2. Nulls with only 1 or 2 decimal digits in latitude and longitude recorded by electric bikes: replacing nulls with station name and id values may be possible but cause problems in analysis.

As for the group 2 type, I can cluster those 1 or 2 decimal digit latitude and longitude values with closest station center points, and fill in the missing station name and id.

However, maximum 0.009 degree difference in latitude and longitude causes approximately 1 kilometer distance in line difference, which is not acceptable for my analysis.
 
Also, I cannot be sure if the 1 or 2 decimal digit values are not errors because the majority of records from electric bikes have the correct form of GPS coordinates with at least 4 decimal digits.
 

Therefore, I will delete all nulls: 

DELETE
FROM bike2021
WHERE start_station_name IS NULL
	  OR start_station_id IS NULL
	  OR end_station_name IS NULL
	  OR end_station_id IS NULL
	  OR end_lat IS NULL
	  OR end_lng IS NULL;  

SELECT COUNT(*) AS total,
	   COUNT(*) FILTER (WHERE ride_id IS NULL) AS ride_id,
	   COUNT(*) FILTER (WHERE rideable_type IS NULL) AS rideable_type,
	   COUNT(*) FILTER (WHERE started_at IS NULL) AS started_at,
	   COUNT(*) FILTER (WHERE ended_at IS NULL) AS ended_at,
	   COUNT(*) FILTER (WHERE start_station_name IS NULL) AS start_station_name,
	   COUNT(*) FILTER (WHERE start_station_id IS NULL) AS start_station_id,
	   COUNT(*) FILTER (WHERE end_station_name IS NULL) AS end_station_name,
	   COUNT(*) FILTER (WHERE end_station_id IS NULL) AS end_station_id,
	   COUNT(*) FILTER (WHERE start_lat IS NULL) AS start_lat,
	   COUNT(*) FILTER (WHERE start_lng IS NULL) AS start_lng,
	   COUNT(*) FILTER (WHERE end_lat IS NULL) AS end_lat,
	   COUNT(*) FILTER (WHERE end_lng IS NULL) AS end_lng,
	   COUNT(*) FILTER (WHERE member_casual IS NULL) AS member_casual
FROM bike2021;

Total 232291 rows with null values are removed. The cleaned dataset has total 3578263 rows.
 
 
Creating a copy of cleaned dataset in database for analysis and a backup:
 
CREATE TABLE bike AS
TABLE bike2021;




ANALYZING


1. Organizing data


        1) Adding a new column, ride_duration, calculated from the column started_at and ended_at:


ALTER TABLE bike
ADD COLUMN ride_duration INTERVAL;

UPDATE bike
SET ride_duration = ended_at::TIMESTAMP - started_at::TIMESTAMP;


        2) Cleaning the dataset with the new column:


The ride duration 0 and the one over 24 hours are likely errored values because there are only single ride pass (usually 30 to 45 minute ride) and full day pass (24 hours with the rented bike docked every 30 to 45 minutes). 

I will delete those 1240 rows:
 
DELETE
FROM bike
WHERE ride_duration <= '00:00:00'
	  OR ride_duration > '24:00:00'


        3) Adding new columns with ride duration in minute, ride distance, ride speed (ride distance / ride duration in minute).
 
 
To calculate ride distance from start station latitudes(lat) and longitudes(lon) to end station lat and lon, I will use the ST_DistanceSphere function from POSTGIS extension, which is believed to be with less than 7 meter error:

 
ALTER TABLE bike
ADD COLUMN distance_m NUMERIC, --ride distance in meter
ADD COLUMN ridetime_min NUMERIC, --ride duration in minute
ADD COLUMN speed_mm NUMERIC; --ride distance in meter / ride duration in minute

UPDATE bike
SET ridetime_min = (EXTRACT(EPOCH FROM ride_duration)/60)::NUMERIC, 
	distance_m = ST_DistanceSphere(
					ST_MakePoint(start_lng,start_lat),
					ST_MakePoint(end_lng,end_lat))::NUMERIC,
	
	speed_mm = (ST_DistanceSphere(
					ST_MakePoint(start_lng,start_lat),
					ST_MakePoint(end_lng,end_lat)) / 
                                        ridetime_min)::NUMERIC;


        4) Cleaning dataset with the new columns:


SELECT MAX(distance_m), MIN(distance_m), 
	   MAX(speed_mm), MIN(speed_mm)
FROM bike;

Distance 0 and speed 0 indicate either test centers or round trips.

The top ride speed of bike-share bikes is considered to be around 20 mph, which is 536 meter / minute. 

Electric bikes also have 15 mph cap system, which is around 24 km/h and 402 meter / minute. 

Thus, speed over 600 meter / minute, 22 mph and 36 km/h, is likely to be errored values.

I will delete 1602 rows with over 600 m/m :
 
DELETE
FROM bike
WHERE speed_mm > 600   


Values from the company’s bike test centers are not needed for my analysis. I will delete the 3403 rows with bike test centers:
 
DELETE
FROM bike
WHERE start_station_id = 'Hubbard Bike-checking (LBS-WH-TEST)'
      OR start_station_name IN ('HUBBARD ST BIKE CHECKING (LBS-WH-TEST)'
								, 'WATSON TESTING - DIVVY');

DELETE
FROM bike
WHERE end_station_name IN ('HUBBARD ST BIKE CHECKING (LBS-WH-TEST)'
, 'WATSON TESTING - DIVVY') OR end_station_id = 'Hubbard Bike-checking (LBS-WH-TEST)';

--checking the cleaned dataset
SELECT COUNT(*) AS row_n 
FROM bike;



2. General pattern in number of usages


        1) Number of total usages, usages of each bike type:


SELECT COUNT(*) AS usages, 
	   SUM(COUNT(*)) OVER () AS total,
	   ROUND(COUNT(*) * 100 / 
               SUM(COUNT(*)) OVER ()) AS ratio
FROM bike
GROUP BY member_casual;

--usages of a bike type, grouped by user type 
SELECT member_casual, rideable_type, 
	   COUNT(*) AS usages,
	   ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER 
                                  (PARTITION BY member_casual), 2) AS ratio 
FROM bike
GROUP BY member_casual, rideable_type
ORDER BY member_casual, rideable_type;





Throughout a year (2020-04 to 2021-04), the total usages of Cyclistic bike share service were 3,570,848. the member usages make up about 59% of the total which is 18% more than casual usages. Both user types preferred riding non-electric bikes.



        2) Monthly usages by user type:


SELECT member_casual, 
	   TO_CHAR(started_at, 'yyyy-mm') as year_month,
	   COUNT(*),
	   ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER 
			               (PARTITION BY member_casual), 2) 
					AS percentage_type
FROM bike
GROUP BY member_casual, year_month
ORDER BY member_casual, year_month;


Both casual and member usages increase vastly as it reaches to the outdoor season and start decreasing when it gets cold. The pattern is almost identical in both user types.


        3) Usages by day of week:


-- 0 = Sunday, 6 = Saturday
SELECT EXTRACT(DOW from started_at) AS day_of_week, 
	   member_casual, 
	   COUNT(*) AS usages,
	   ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER 
			 	       (PARTITION BY member_casual),2) 
						      AS percentage_type,
	   ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2)
	   				              AS percentage_total 
FROM bike
GROUP BY day_of_week, member_casual
ORDER BY member_casual, day_of_week



While members’ usages do not vary much throughout a week, casual usages increase vastly on weekends, nearly twice more than a weekday, accounting for approximately 42% of total casual usages. This means that around 42% of total casual usages are unlikely for work-related purposes such as commuting to work or work errands while about 28% of total members are so.


        4) Hourly usages:


SELECT member_casual,
       CASE WHEN EXTRACT(DOW FROM started_at) IN (0,6)
	     THEN 'weekends'
	     ELSE 'weekdays' END AS day_of_week,
       EXTRACT(HOUR FROM started_at::TIME) AS hour, 
       COUNT(*) AS freq
FROM bike
GROUP BY member_casual, day_of_week, hour
ORDER BY member_casual, day_of_week, hour;

--counting values only in commuting hour in the morning, 05:00 -08:59 weekdays
SELECT member_casual,
	   COUNT(CASE WHEN (EXTRACT(HOUR FROM started_at::TIME) 
			         BETWEEN 5 AND 8) 
			     AND (EXTRACT(DOW FROM started_at) NOT IN (0,6)) 
		       THEN 1 END) AS commute_usages, 
	   COUNT(CASE WHEN (EXTRACT(HOUR FROM started_at::TIME) 
			         BETWEEN 5 AND 8) 
			     AND (EXTRACT(DOW FROM started_at) NOT IN (0,6)) 
		       THEN 1 END) * 100 / COUNT(*) AS percentage
FROM bike
GROUP BY member_casual;




The number of usages in commuting hour also shows that only a small number of casual customers are likely to use the bike share service for commuting to work, making up 3% of the total casual usages. In contrast, 11% of the total member usages shows a clear commuting pattern, starting to increase at 5:00 and decrease vastly after 8:59 AM.


        5) One way and round trip usages:


SELECT member_casual,
	   CASE WHEN start_station_name = end_station_name 
	   		THEN 'round'
	   		ELSE 'one way'
			END AS trip_type,
	   COUNT(*) AS usages,
	   ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER 
			 		(PARTITION BY member_casual),2) 
					           AS percentage_type,
	   ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2)
	   				           AS percentage_total
FROM bike
GROUP BY member_casual, trip_type;


There is a gap between round trip usages of casual users and that of members. The percentage of round trip casual usages is about 12% higher than that of member usages.

This is likely to mean that around 18% of total casual usages is not for errands or commuting while only about 6% of total member usages is not so. Since users are not supposed to park a bike at any places until they end the service, round trips are likely for leisure usages.


        6) Number of usages not related to work (round trips + weekend usages):


SELECT member_casual,
	   --the number of weekend usages
	   (COUNT(*) FILTER (WHERE EXTRACT(DOW FROM started_at) IN (0, 6)) 
			  --the number of round trip only during weekdays
			+ COUNT(*) FILTER (WHERE start_station_name = end_station_name
	  					  AND EXTRACT(DOW FROM started_at) 
						 	 NOT IN (0, 6))) 
			                               * 100 / COUNT(*) 
                                                AS no_work_usages
FROM bike
GROUP BY member_casual;


It turned out that at least more than a half of the total casual usages, 52%, is highly unlikely for work-related purposes such as commuting to work and work-related errands.

This is 20% more than the percentage of member usages that are possibly unrelated to work. The numbers show a clue that casual usages are more leisure centered than member ones are.



3. Usages in different locations


        1) Top 10 start stations with most usages of casual and member users:


SELECT *
FROM (
	 SELECT member_casual,
	         start_station_name,
	   	 COUNT(*) AS usages,
	   	 ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio ,
	   	 ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking
	 FROM bike
	 WHERE member_casual = 'casual'
	 GROUP BY member_casual, start_station_name
	 LIMIT 10
       ) AS ca
LEFT JOIN (
	    SELECT member_casual,
	   	    start_station_name,
	            COUNT(*) AS usages,
	   	    ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio,
	   	    ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking
	    FROM bike
	    WHERE member_casual = 'member'
	    GROUP BY member_casual, start_station_name
	    LIMIT 10
	    ) AS me
USING(ranking);


A clear difference between casual and member users is found in the top 10 mostly used start stations. 

Except the Clark St & Elm St station, all top 10 start stations of casual usages are located in or right beside famous city attractions, parks, museums, theaters and beaches. 

In contrast, except 3 stations such as Wells St & Concord Ln, Theater on the Lake, Clark St & Armitage Ave, most top 10 start stations of member usages are in retail store, hospital, commercial building and restaurant areas.

The locational usage difference indicates that casual usages are likely to be more leisure-oriented than member ones are. Also, it shows that errand and work-related usages are likely to be made more by member than casual customers.


        2) Top 10 end stations with most usages of casual and member users:

  
SELECT *
FROM (
	 SELECT member_casual,
	         end_station_name,
	   	 COUNT(*) AS usages,
	   	 ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio ,
	   	 ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking
	 FROM bike
	 WHERE member_casual = 'casual'
	 GROUP BY member_casual, end_station_name
	 LIMIT 10
      ) AS ca
LEFT JOIN (
		  SELECT member_casual,
	   		  end_station_name,
	      		  COUNT(*) AS usages,
	   		  ROUND(COUNT(*) * 100 / 
                                SUM(COUNT(*)) OVER (), 2) AS ratio,
	   		  ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking
		  FROM bike
		  WHERE member_casual = 'member'
		  GROUP BY member_casual, end_station_name
		  LIMIT 10
	    ) AS me
USING(ranking);

All top 10 end stations of both casual and member users are identical with their top 10 start stations except one in members' which also has many retail stores are around.


        3) Usages in all start and end stations, Tableau map chart:



SELECT member_casual,
        start_station_name,
	COUNT(*) AS usages,		
        AVG(start_lat) AS slat,
	AVG(start_lng) AS slng
FROM bike
GROUP BY member_casual, start_station_name;

--for end stations
SELECT member_casual,
	end_station_name,
	COUNT(*) AS usages,		
        AVG(end_lat) AS elat,
        AVG(end_lng) AS elng
FROM bike
GROUP BY member_casual, end_station_name;


(Hoover on a geo-point to see the name and number / Click on a geo-point and click on 'Link to Google Map' to see the location in google map / Scroll up and down for zoom in and out)

While casual usages are concentrated in the stations near the shore where most tourist attractions and grand parks are located, member usages are scattered all over Chicago city, from the shore to commercial and residential areas in the west.

Those famous leisure locations near the shore are not likely for someone who does running errands. Rather, they are where people explore sites, do outdoor activities and enjoy spending time.  

If looked closely by zooming in, even many stations far from the shore and showing a significant number of casual usages are also near local parks.

All these support that casual customers use the Cyclistic bike share service for leisure more than members do.


        4) Usages in four highly leisure centered locations exclusively:


WITH latlng AS (  -- to merge the scattered geo points within a same station into one geo point 
SELECT o.start_station_name,
	   o.end_station_name,
	   o.started_at,
	   o.ended_at,
	   o.member_casual,
	   o.ride_duration,
	   o.ridetime_min,
	   o.distance_m,
	   o.speed_mm,
	   l.start_lat,
	   l.start_lng,
	   d.end_lat,
	   d.end_lng
FROM bike AS o
LEFT JOIN (
	    SELECT start_station_name,
	            AVG(start_lat) AS start_lat,
	            AVG(start_lng) AS start_lng
	    FROM bike
	    GROUP BY start_station_name
	    ) AS l
ON o.start_station_name = l.start_station_name
LEFT JOIN (
	    SELECT end_station_name,
	            AVG(end_lat) AS end_lat,
	            AVG(end_lng) AS end_lng
	    FROM bike
	    GROUP BY end_station_name
	    ) AS d
ON o.end_station_name = d.end_station_name
)

SELECT member_casual,
	   start_station_name,
	   end_station_name,
	   COUNT(*) AS usages,		
	   AVG(start_lat) AS slat,
	   AVG(start_lng) AS slng,
	   AVG(end_lat) AS elat,
	   AVG(end_lng) AS elng
FROM latlng
GROUP BY member_casual, start_station_name, end_station_name
--to filter four major leisure park areas in the shore exclusively
HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach
	  OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park
	  OR (AVG(start_lng) >= -87.62573  AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium
	  OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park
	  OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park
	  OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St');

--ratio of each one’s usages in 4 locations to each total
SELECT member_casual, ratio_each_total
FROM (
SELECT member_casual,
	   SUM(COUNT(*)) OVER (PARTITION BY member_casual) AS usages		
FROM bike
GROUP BY member_casual, start_station_name
HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach
	  OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park
	  OR (AVG(start_lng) >= -87.62573  AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium
	  OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park
	  OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park
	  OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St')
) AS l
LEFT JOIN LATERAL (
                  SELECT member_casual,
                             ROUND(usages * 100 / COUNT(*), 2) AS ratio_each_total
                  FROM bike
                  GROUP BY member_casual
                    ) AS o
USING (member_casual)
GROUP BY member_casual, ratio_each_total;

(Click geo-point on the left to see the path on the right / Select member or casual on the right-top container / Hoover on a geo-point to see the name and number / Click on a geo-point and click on 'Link to Google Map' to see the location in google map / Scroll up and down for zoom in and out)


Casual usages in the four leisure locations exclusively make up 33.09% of the total casual usages. Also, looking into the riding path made from start stations to end stations, most destinations in casual usages are either the same stations (round trip) or within the leisure locations.

In contrast, although member usages account for 22.19% of the total, their destinations are more scattered far outside leisure locations. Moreover, they do not show a particularly high number of usages in end stations in the leisure locations. This probably suggests that, unlike casual usages, many usages of member started from there are not likely for leisure but to go to destinations outside the leisure locations.


        5) ALL usages from outside of the four leisure locations to inside of the locations:


SELECT member_casual, 
	   ratio_each_total
FROM (
	SELECT bike.member_casual,
		   bike.start_station_name,
		   bike.end_station_name,
		   SUM(COUNT(*)) OVER (PARTITION BY bike.member_casual) AS ratio
	FROM bike
	LEFT JOIN (  --filtering out start stations in the four major leisure locations
		    SELECT member_casual,
			    start_station_name
		    FROM bike
		    GROUP BY member_casual, start_station_name
		    HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach
			    OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park
			    OR (AVG(start_lng) >= -87.62573  AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium
			    OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park
			    OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park
			    OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St')
	            ) AS l
	USING (start_station_name)
	WHERE l.start_station_name IS NULL
	GROUP BY bike.member_casual, bike.start_station_name, bike.end_station_name
	--filtering rows with end stations in the four major leisure locations	
	HAVING (AVG(end_lng) >= -87.6523 AND AVG(end_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach
		  OR (AVG(end_lng) >= -87.6392 AND AVG(end_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park
		  OR (AVG(end_lng) >= -87.62573  AND AVG(end_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium
		  OR (AVG(end_lng) >= -87.61351 AND AVG(end_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park
		  OR (AVG(end_lng) >= -87.5869 AND AVG(end_lat) BETWEEN 41.7736 AND 41.8024) -- jason par
		  OR end_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St')
	  ) AS ll
LEFT JOIN LATERAL (
		    SELECT member_casual,
			    ROUND(ratio * 100 / COUNT(*), 2) AS ratio_each_total
		    FROM bike
		    GROUP BY member_casual
		    ) AS o
USING (member_casual)
GROUP BY member_casual, ratio_each_total;  


Approximately 13.13% of the total casual usages that started from outside the four leisure locations ends at stations inside the leisure locations while about 12.16% of the total member usages does so.

If combined with the start station usages in those locations, casual usages in the four major leisure areas make up around 46% of the total casual usages whereas members usages do about 34% of the total member usages.


        6) Summing highly likely leisure usages (round trips + leisure location usages):


SELECT member_casual, 
	   ratio_each_total
FROM (
	SELECT bike.member_casual,
		   bike.start_station_name,
		   bike.end_station_name,
		   SUM(COUNT(*)) OVER (PARTITION BY bike.member_casual) AS ratio
	FROM bike
	LEFT JOIN (  --filtering out start stations in the four major leisure locations
		    SELECT member_casual,
			    start_station_name
		    FROM bike
		    GROUP BY member_casual, start_station_name
		    HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach
			    OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park
		            OR (AVG(start_lng) >= -87.62573  AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium
			    OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park
			    OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park
			    OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St')
		     ) AS l
	USING (start_station_name)
	WHERE l.start_station_name IS NULL AND start_station_name = end_station_name -- round trip
	GROUP BY bike.member_casual, bike.start_station_name, bike.end_station_name
       ) AS ll
LEFT JOIN LATERAL (
		    SELECT member_casual,
			    ROUND(ratio * 100 / COUNT(*), 2) AS ratio_each_total
		    FROM bike
		    GROUP BY member_casual
		    ) AS o
USING (member_casual)
GROUP BY member_casual, ratio_each_total;


The percentage of round trip casual usages in outside of the four leisure locations is 10.23%. 

If combined with the percentage of casual usages made in start and end stations in the four leisure locations, it can be said that around 56% of the total were likely for leisure. With usages only in the four leisure locations and round trips considered, it is more than a half of the total usages.

As for members, about 38% of the total usages are likely to be for leisure.



4. Summary of patterns found in usage number


I’ve found that around 52% of the total casual usages is likely for non-work-related purposes whereas 32% of the total member usages is likely so.

In addition, considering only the usages in four leisure locations near the shore and round trips, approximately 56% of the total casual usages can be said to be for leisure while around 38% of the total member usages is likely so.

A number of casual usages that are not made in the four leisure locations are also found near local parks.

In sum, it is likely that the usages of casual customers are more leisure oriented than that of member. Also, leisure is likely to be the main attraction for casual customers to use the bike share service whereas members are likely to use the service for various purposes including commuting to work, errands and leisure.



5. General pattern in ride duration


        1) Average and median ride duration:


SELECT member_casual,
	   AVG(ride_duration) AS avg_ride_duration,
	   PERCENTILE_CONT(0.5) WITHIN GROUP 
	   		 (ORDER BY ride_duration) AS med_ride_duration,
	   ROUND(AVG(distance_m), 2) AS avg_distance_m,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
			 (ORDER BY distance_m)::NUMERIC, 2) AS med_distance,
	   ROUND(AVG(speed_mm), 2) AS avg_speed_mm,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
			 (ORDER BY speed_mm)::NUMERIC, 2) AS med_speed_mm
FROM bike
GROUP BY member_casual;


The median ride duration of casual usages is almost twice longer than that of member. Interestingly, at the same time, the median distance is shorter. As a result, the median speed is much slower than member usages.

 
        2) Average and median ride duration in one-way trip and round trip:


--one-way trip
SELECT member_casual,
	   AVG(ride_duration) AS avg_ride_duration,
	   PERCENTILE_CONT(0.5) WITHIN GROUP 
	   		 (ORDER BY ride_duration) AS med_ride_duration,
	   ROUND(AVG(distance_m), 2) AS avg_distance_m,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
			 (ORDER BY distance_m)::NUMERIC, 2) AS med_distance,
	   ROUND(AVG(speed_mm), 2) AS avg_speed_mm,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
			 (ORDER BY speed_mm)::NUMERIC, 2) AS med_speed_mm
FROM bike
WHERE speed_mm > 0 
	  --filtering out scattered coordinates within same station as well.
	  AND start_station_name != end_station_name
GROUP BY member_casual;

--round trip
SELECT member_casual,
	   AVG(ride_duration) AS avg_ride_duration,
	   PERCENTILE_CONT(0.5) WITHIN GROUP 
	   		 (ORDER BY ride_duration) AS med_ride_duration,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
			 (ORDER BY distance_m)::NUMERIC, 2) AS med_distance,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
			 (ORDER BY speed_mm)::NUMERIC, 2) AS med_speed_mm
FROM bike
WHERE speed_mm >= 0 
	  --including scattered coordinates with in same station.
	  AND start_station_name = end_station_name
GROUP BY member_casual

<one-way trip>

<round trip>


In both one-way and round trips, the ride duration of casual usages is much longer than that of member.

The degree of gap is different in one-way and round trips. In round trips, the ride duration gap is extremely big, showing about 3 times longer than that of member usages. On the other hand, in one-way trip, the median ride duration of casual usages is longer by around 1.6 times.

It seems that the long ride tendency of casual usages is an interesting pattern. When looking at the ride speed gap between casual and member usages, it can be noticed that distance is not the reason why there is the tendency. casual usages show approximately 43 meter per minutes slower speed than member ones. This indicates that casual users ride longer than members even for a same journey distance.



6. Ride duration and speed by different distance, round trip excluded   


SELECT CASE WHEN distance_m < 500 THEN '<500m'
	      WHEN distance_m BETWEEN 500 AND 1000 THEN '500-1000m'
	      WHEN distance_m BETWEEN 1000 AND 1500 THEN '1000-1500m'
	      WHEN distance_m BETWEEN 1500 AND 2000 THEN '1500-2000m'
	      WHEN distance_m BETWEEN 2000 AND 2500 THEN '2000-2500m'
	      WHEN distance_m BETWEEN 2500 AND 3000 THEN '2500-3000m'
	      WHEN distance_m BETWEEN 3000 AND 3500 THEN '3000-3500m'
	      WHEN distance_m BETWEEN 3500 AND 4000 THEN '3500-4000m'
	      WHEN distance_m BETWEEN 4000 AND 4500 THEN '4000-4500m'
	      WHEN distance_m BETWEEN 4500 AND 5000 THEN '4500-5000m'
	      WHEN distance_m BETWEEN 5000 AND 5500 THEN '5000-5500m'
	      WHEN distance_m BETWEEN 5500 AND 6000 THEN '5500-6000m'
	      WHEN distance_m BETWEEN 6000 AND 6500 THEN '6000-6500m'
	      WHEN distance_m BETWEEN 6500 AND 7000 THEN '6500-7000m'
	      WHEN distance_m BETWEEN 7000 AND 7500 THEN '7000-7500m'
	      WHEN distance_m BETWEEN 7500 AND 8000 THEN '7500-8000m'
	      WHEN distance_m >= 8000 THEN '>8000m' END AS distance,
        COUNT(*) AS usages,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ridetime_min) AS duration,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed,
	member_casual,
	PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) -
	LAG(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)) OVER () AS diff
			
FROM bike
WHERE start_station_name != end_station_name
	  AND speed_mm != 0
GROUP BY distance, member_casual
ORDER BY distance, member_casual;



The ride duration of casual usages is constantly longer than that of members no matter how far the distance is. 

Usages of both types show almost same trend. Most usages are made in the distance less than 3.5 Kilometers. 

It is not likely that distance makes casual usages in longer duration. 



7. Hourly riding duration and speed during weekdays and weekends




SELECT member_casual,
	   CASE WHEN EXTRACT(DOW FROM started_at) IN (0, 6)
	        THEN 'weekends'
		ELSE 'weekdays' END AS day,
	   EXTRACT(HOUR FROM started_at) AS hour,
	   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ridetime_min) AS duration,
	   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed,
	   COUNT(*) AS usages
FROM bike
GROUP BY member_casual, day, hour;



The median ride speed of causal users becomes faster and almost converges to that of members at from 5 to 9 during the weekdays, which is likely commuting hours.

In addition, the median speed of both users increases slightly at 5 PM to 6 PM, which is usually quitting time and dinner time for people. 

The pattern coincides with how people manage time generally. When people have things to be done at their destination, they tend to go faster. Based on these, it can be said that usages with the high median riding speed is more likely for errands or commuting.  

According to the chart, 4% of the total casual usages is likely to be for commuting to work in the morning whereas 12% of the total member usages is likely so.
 
As for the long ride tendency of casual users, time is also not likely the reason since the ride duration of casual usages is constantly longer than members regardless of hourly changes.



8. Ride duration difference in locations


To analyze ride duration difference of casual and member in all locations, I will look into speed (meter / minute) instead of ride duration. It is because ride duration varies by different distances. In order to see how much longer or shorter users ride a bike regardless of different distances, speed (distance / time) should be considered.



        1) Top 10 start stations with biggest speed gap and significant usage number:   


SELECT bike.member_casual,
	   start_station_name,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
	   			(ORDER BY speed_mm)::NUMERIC, 2) AS c_speed,
	   PERCENTILE_CONT(0.5) WITHIN GROUP 
	   			(ORDER BY ride_duration) AS c_duration,
	   COUNT(*) AS c_usages,
	   COUNT(*) FILTER (WHERE start_station_name = end_station_name
					    AND speed_mm = 0) AS roundtrip,
	   l.member_casual,
	   ROUND(l.speed::NUMERIC, 2) AS speed,
	   l.duration,
	   l.usages
FROM bike
LEFT JOIN (
           SELECT member_casual,
	   	   start_station_name,
	   	   PERCENTILE_CONT(0.5) WITHIN GROUP 
				(ORDER BY speed_mm) AS speed,
	   	   PERCENTILE_CONT(0.5) WITHIN GROUP 
				(ORDER BY ride_duration) duration,
		   COUNT(*) AS usages
	   FROM bike
	   WHERE member_casual = 'member'
	   GROUP BY member_casual, start_station_name
	    ) AS l
USING (start_station_name)
WHERE bike.member_casual = 'casual'
GROUP BY bike.start_station_name, bike.member_casual,
		l.member_casual, l.speed, l.duration, l.usages
HAVING COUNT(*) > 10000
--order by the greatest speed difference between casual and member 
ORDER BY l.speed - PERCENTILE_CONT(0.5) WITHIN GROUP 
	   			(ORDER BY speed_mm) DESC
LIMIT 10;


All the top 10 greatest speed gap start stations are located in the tourist attractions near the shore. The extreme case is Buckingham Fountain, casual usages showing only 4 meter per minute speed.
 
It is hard to imagine a person riding a bike at a speed that goes only 4 meter per minute constantly. Rather, it is more likely that casual users stop more often, explore sites in further distance or take a half round trip. 

The ride speed of members in those locations is also relatively slower than their usual speed. This may indicate that location affects users' speed. 

However, comparing to casual users, the ride speed of members is 4 to 5 times faster. Since those locations are highly likely for leisure, there is a possibility that how members and casual users enjoy their time in leisure places is different. Members may ride a bike to exercise or just to get to a destination in those places whereas casual users do it for enjoying riding a bike and sightseeing at the same time. 
 
This brings up a new problem. Just like the faster speed of member usages have shown so far, local people do not usually spend much time looking around tourist attractions. That is, it is also possible that most low speed casual usages in the tourist attractions, which make up significant percentage of the total, are made not by locals but by tourists.
 
 
        2) The top 10 least gap start stations with a significant number of usages:


SELECT bike.member_casual,
	   start_station_name,
	   ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP 
	   			(ORDER BY speed_mm)::NUMERIC, 2) AS c_speed,
	   PERCENTILE_CONT(0.5) WITHIN GROUP 
	   			(ORDER BY ride_duration) AS c_duration,
	   COUNT(*) AS c_usages,
	   COUNT(*) FILTER (WHERE start_station_name = end_station_name
						AND speed_mm = 0) AS roundtrip,
	   l.member_casual,
	   ROUND(l.speed::NUMERIC, 2) AS speed,
	   l.duration,
	   l.usages
FROM bike
LEFT JOIN (
           SELECT member_casual,
	   	   start_station_name,
	   	   PERCENTILE_CONT(0.5) WITHIN GROUP 
				    (ORDER BY speed_mm) AS speed,
	   	   PERCENTILE_CONT(0.5) WITHIN GROUP 
				    (ORDER BY ride_duration) duration,
		   COUNT(*) AS usages
	   FROM bike
          WHERE member_casual = 'member'
          GROUP BY member_casual, start_station_name
			) AS l
USING (start_station_name)
WHERE bike.member_casual = 'casual'
GROUP BY bike.start_station_name, bike.member_casual,
		l.member_casual, l.speed, l.duration, l.usages
HAVING COUNT(*) > 4000
--order by the least speed difference between casual and member 
ORDER BY l.speed - PERCENTILE_CONT(0.5) WITHIN GROUP 
	   			(ORDER BY speed_mm)
LIMIT 10;


Most top 10 end stations with the least speed gap stations are located either in commercial or commercial-residential areas where markets, banks and shops are around. Those are more likely running errand and work oriented locations.
 
The ride speed of casual usages in those locations gets much faster than the tourist attraction locations, showing about 100 to 130 meter / minute speed difference. In contrast, that of members in the same locations also increases as well, but the difference from their speed in leisure locations is much smaller than casual usages, having about 30 to 40 meter / minutes gap.

It is interesting to see that casual users ride a bike slower than members even in places where running errands are likely done.


        3) Destination usages from start stations with greatest and least speed gap (round trip excluded):


  
--from start stations with greatest gap 
SELECT member_casual,
	   end_station_name,
	   COUNT(*) AS usages,
	   MODE() WITHIN GROUP (ORDER BY speed_mm) AS mode_speed,
	   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS med_speed,
	   AVG(end_lat) AS end_lat,
	   AVG(end_lng) AS end_lng
FROM bike
WHERE start_station_name IN ('Michigan Ave & Washington St',
							'Michigan Ave & Lake St',
							'Millennium Park',
							'Columbus Dr & Randolph St',
							'Michigan Ave & 8th St',
							'Indiana Ave & Roosevelt Rd',
							'Lake Shore Dr & Monroe St',
							'Buckingham Fountain',
							'Michigan Ave & Oak St',
							'Streeter Dr & Grand Ave')
	 AND speed_mm != 0
	 AND start_station_name != end_station_name
GROUP BY end_station_name, member_casual
ORDER BY usages DESC;

--from start stations with least gap
SELECT member_casual,
	   end_station_name,
	   COUNT(*) AS usages,
	   MODE() WITHIN GROUP (ORDER BY speed_mm) AS mode_speed,
	   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS med_speed,
	   AVG(end_lat) AS end_lat,
	   AVG(end_lng) AS end_lng
FROM bike
WHERE start_station_name IN ('Clark St & Drummond Pl',
							'Halsted St & Wrightwood Ave',
							'Clark St & Schiller St',
							'Wells St & Concord Ln',
							'Bissell St & Armitage Ave',
							'Sedgwick St & North Ave',
							'Morgan St & Lake St',
							'Damen Ave & Pierce Ave',
							'Clark St & Elm St',
							'Wells St & Evergreen Ave')
	 AND speed_mm != 0
	 AND start_station_name != end_station_name
GROUP BY end_station_name, member_casual
ORDER BY usages DESC;


In the casual usage case, most journeys started from the greatest speed gap start stations end at end stations in the tourist attractions and parks in the shore. On the other hand, the end points of member usages are more scattered over the west of the shore where running errands are more likely done. 

The journeys of casual users starting from the least speed gap start stations mostly end in the northwest and west of the tourist attractions that are commercial and residential areas.

It seems that location types and ride speed are related to each other. When locations are more likely for running errand or work, both casual and member users ride faster, and when it is more likely for leisure, they ride slower.

However, there is rather a mysterious problem. As seen on the chart, the ride speed of both casual and member increases and decreases together simultaneously in almost all locations. 

That is, the ride duration of casual usages is constantly slower than member usages no matter what type of locations they are at and what purposes are to be attained by using the bike share service.    

Thus, the long ride tendency is not likely due to more usage numbers in leisure locations because the speed of casual usages is slower in running errand locations as well.


        4) Correlation between the median speed of casual and member usages in each stations:


  
WITH aa AS (
SELECT bike.member_casual,
	   start_station_name,
	   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY bike.speed_mm) AS speed_casual,
	   COUNT(*) AS usages_casual,
	   l.member_casual AS member,
	   l.speed AS speed_member,
	   l.usages AS usages_member
FROM bike
LEFT JOIN (
	    SELECT member_casual,
	   	    start_station_name,
	            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY bike.speed_mm) AS speed,
		    COUNT(*) AS usages
	    FROM bike
	    WHERE member_casual = 'member'
            GROUP BY member_casual, start_station_name
	    ) AS l
USING (start_station_name)
WHERE bike.member_casual = 'casual'
GROUP BY bike.member_casual, bike.start_station_name, l.member_casual,
	  	 l.speed, l.usages
ORDER BY COUNT(bike.*) DESC 
            )

SELECT CORR(speed_casual, speed_member),
	   SUM(usages_casual) AS cas_total,
	   SUM(usages_member) AS mem_total,
	   COUNT(*) AS num_station
FROM aa
WHERE usages_casual > 400 AND usages_member > 400;

--for end stations
WITH aa AS (
SELECT b.member_casual,
	   end_station_name,
	   COUNT(*) AS usages,
	   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed,
	   m_usages,
	   m_speed
FROM bike AS b
LEFT JOIN (
	    SELECT member_casual,
		    end_station_name,
		    COUNT(*) AS m_usages,
		    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS m_speed
	    FROM bike
	    WHERE member_casual = 'member'
	    GROUP BY  end_station_name, member_casual
	    ) AS l
USING (end_station_name)
WHERE b.member_casual = 'casual'
GROUP BY  b.end_station_name, b.member_casual, m_usages, m_speed
)

SELECT CORR(speed, m_speed), SUM(usages) AS cas_total, 
	   SUM(m_usages) AS mem_total, COUNT(*) AS num_station
FROM aa
WHERE usages > 400 AND m_usages > 400;

<start stations>

<end stations>


The calculation is done by including around total 3,490,509 usages grouped by around  446 stations with at least more than 400 usages over the past year.

It turned out that the ride speed of casual and members in both each start station and each end station have a strong positive correlation. That is, location is likely to affect the ride speed of both users to be faster or slower.

This also means that the long ride tendency of casual usages is not derived from locational reasons. It is that no matter how fast or low the ride speed of both casual and member usages is in a particular location, the speed of casual usages is constantly slower than that of members. As shown in the least and greatest speed gap tables, regardless of which location users ride bike at, casual users are slower in a significant degree.

The long ride tendency is the pattern almost all casual users share. It may be related to problems only casual users have, users' age or a unique psychology regarding using the service with a single ride pass. These are all key aspects to tackle with when making a successful casual customer targeted marketing strategy that finally converts them to annual members.

The pattern that casual usages are more leisure oriented alone does not imply much for making the strategy. Without knowing why there is the long ride tendency, there too many possible variables that eventually generate completely different marketing strategies.

For example, the ride speed of casual usages in leisure areas is extremely slow. one may think it is because they are mostly tourists, and it results in focusing more on casual customers in running errand areas. On the other hand, the other believes that it is because they want most out of a single ride pass by not docking the rented bike until the nearest minute of end-service time. In this case, it will be more effective if the fact that members can dock and undock anytime without being charged gets highlighted.     


        5) Ride speed and usages in all stations (start and end):  


SELECT member_casual,
	   start_station_name,
	   end_station_name,
	   speed_mm AS speed,
	   distance_m AS distance,
	   ridetime_min AS duration,
	   l.start_lat AS slat,
	   l.start_lng AS slng,
	   ll.end_lat AS elat,
	   ll.end_lng AS elng
FROM bike
--merging the scattered geo points in same start station into one
LEFT JOIN (
  		   SELECT start_station_name,
			   AVG(start_lat) AS start_lat,
			   AVG(start_lng) AS start_lng
		   FROM bike
		   GROUP BY start_station_name
		  ) AS l
USING (start_station_name)
--merging the scattered geo points in same end station into one
LEFT JOIN (
		   SELECT end_station_name,
		           AVG(end_lat) AS end_lat,
			   AVG(end_lng) AS end_lng
		   FROM bike
		   GROUP BY end_station_name
		  ) AS ll
USING (end_station_name);

<Casual Usages>

<Member Usages>
(I included a captured version of map chart because its file size is 53 MB. To see the original map chart with paths of individual origin and destination stations, click the follwoing: Map Chart with Individual Path of Origin and Destination)

Similar to what has been seen in the usage number analysis, the ride speed of casual usages varies by locations in a huge degree whereas that of members do not.

Coinciding with the finding in the top least speed gap stations, the locations in the northwest of the shore show the highest speed of users. 

Although location affects users' speed, why casual usages show extremely low speed in particular locations cannot be known unless the reason why the long ride tendency of casual exists is answered. It's because answering that problem, in the end, gets down to the question why members ride a bike much shorter time than casual users in those same locations, which is the same question arisen in all locations as well.

Additional data should be given for a further analysis.



9. Conclusion 


The given task was to find out how casual customers and members use the Cyclistic bike share service differently by looking into the dataset having total 1,461,192 and 2,109,656 casual and member usages from 2020-04 to 2021-04 respectively.
 
I first looked into how the number of usages is different. Second, the ride duration, distance and speed of casual and member usages were compared.

Major findings are as follows:


1) Casual usages are likely to be more leisure-oriented than commuting to work and errands while members are likely to use the bike share service for various purposes including commuting, errands and leisure.

2) Casual usages are relatively concentrated on weekends and in famous leisure locations in the shore whereas member usages are evenly made throughout a week and in most locations.     

3) Casual customers have a strong tendency to ride a bike for longer duration and at slower speed than members. The reason could not be found due to the limitation of current dataset.

4) There is a big speed difference within casual usages while the ride speed of member usages does not show much variance overall.


Based on the notions found in trip types, hourly usages and locations, I could see that  around 56% of casual usages are highly likely for leisure purposes while approximately 38% of member usages are likely so.
 
In addition, looking into the ride duration and speed of both casual and member usages, I could find a huge gap between casual and member usages. 

The ride duration of casual usages is constantly longer than that of member regardless of distance, time and location.

Possible reasons may be as follows:

1) Members can dock and undock their bike anytime they want without being charged whereas casual customers cannot unless they pay again. Due to this reason, they take a half round trip or stop more often having their bikes beside.

2) Many casual users are not local people and that’s why they spend much time exploring tourist attractions.

3) Casual users tend to illegally park their bike on the street and do errands.

4) The company bikes are difficult to ride at first and require some time to get adjusted.

5) They don’t know which ways are short cuts because they don’t use the service in a regular basis.

6) Subscribing a daily pass on mobile phone takes a lot of time.

7) Stations get full often and casual users spend much time finding a docking spot because they don’t know what to do. 

Tackling with these possible scenarios are important for a successful casual customer targeted marketing strategy if any of them is true.

However, the long ride tendency problem could not be answered because of the limitation of current dataset. After various attempts, I concluded that additional data is needed in order to answer the question.



Recommendations




1) I recommend a further research to be done on the long ride tendency of casual usages before designing a marketing strategy. Since knowing the reason may produce completely different approaches in targeting customers and designing strategies, it has to be done.

2) I recommend additional data to be collected. It includes customer information that shows repeated usages by a same casual customer, bike stop records that shows how many times a bike is stopped during a usage, complaint records, age information, the number of docking spots and in-and-out in each station and customer satisfaction survey data.

3) The current dataset indicates the leisure-oriented tendency of casual usages and concentrated usages on weekends and near the tourist attractions. If what implication the long ride tendency has upon those patterns is discerned clearly, there will be more accurate and detailed marketing strategies that target the right customers and convert them to annual members eventually.

Comments

Popular posts from this blog

Nasdaq IXIC and Bitcoin Price Movement Patterns EDA, Cross Correlation