Project 1: SQL, Tableau, Different Usage Patterns, Bike Share Service in Chicago (Google Capstone Project)
Introduction
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building
general awareness and appealing to broad consumer segments. One approach that helped make
these things possible was the flexibility of its pricing plans: single-ride
passes, full-day passes, and annual memberships. Customers who purchase
single-ride or full-day passes are referred to as casual riders. Customers who
purchase annual memberships are Cyclistic members.
Cyclistic’s
finance analysts have concluded that annual members are much more profitable
than casual riders Based on that, the stakeholder believes that maximizing the
number of annual members will be key to future growth. Rather than creating a
marketing campaign that targets all-new customers, she believes there is a very
good chance to convert casual riders into members. She notes that casual riders
are already aware of the Cyclistic program and have chosen Cyclistic for their
mobility needs.
The
stake holder’s goal is clear: Design marketing strategies aimed at converting
casual riders into annual members. In order to do that, however, the marketing
analyst team needs to better understand how annual members and casual riders
differ, why casual riders would buy a membership, and how digital media could
affect their marketing tactics.
Among
the three questions, I have been assigned to answer the first question:
How do annual members and casual riders use Cyclistic bikes differently?
1. Dataset : the data containing the previous 12 months of Cyclistic trip data from 2020-04-01 to 2021-04-30
Link : Bike Share Dataset from 2020-04-01 to 2021 04-30
2. Files :
202004-divvy-tripdata.csv, 202005-divvy-tripdata.csv, 202006-divvy-tripdata.csv, 202007-divvy-tripdata.csv, 202008-divvy-tripdata.csv, 202009-divvy-tripdata.csv, 202010-divvy-tripdata.csv, 202011-divvy-tripdata.csv, 202012-divvy-tripdata.csv, 202101-divvy-tripdata.csv, 202102-divvy-tripdata.csv, 202103-divvy-tripdata.csv, 202104-divvy-tripdata.csv
3. Size : 656 MB
4. Column :
ride_id (TEXT) : an id for each usage of Cyclistic bike share service
rideable_type (TEXT) : Type of bike rented :
‘classic_bike’, ‘docked_bike’, ‘electric_bike’
started_at (TIMESTAMP) : Starting service
timestamp recorded as users undock or unlock
(electric bike only) a bike
ended_at (TIMESTAMP) : End service
timestamp recorded as users dock or lock
(electric bike only) a bike
start_station_name (TEXT) : Name of the
station where users started using bike share service by undocking or
unlocking(electric bike only)
start_station_id (INTEGER) : Unique id for each
start station
end_station_name (TEXT) : Name of the
station where users ended using the service by docking or locking(electric bike
only)
end_station_id (INTEGER) : Unique id for
each end station
start_lat (FLOAT) : Latitude of service
start point
start_lng (FLOAT) : Longitude of service
start point
end_lat (FLOAT) : Latitude of service end
point
end_lng (FLOAT) : Latitude of service end
point
member_casual (TEXT) : User type : ‘member’ ,
‘casual’
5. Data Credibility and Source :
The data has been
made available by Motivate International Inc. under this license.
Data-privacy issues prohibit me from using riders’ personally identifiable
information.
This means that I won’t be able to connect pass purchases to credit card
numbers to determine if casual riders live in the Cyclistic service area or if
they have purchased multiple single passes.
6. Tools used in this analysis:
1.
SQL (POSTGRESQL) and Tableau (for visualization)
2.
Why : The size of dataset is over 650MB which is
too large to be analyzed in spreadsheets.
The dataset contains geo-tracked GPS
coordinates that can be visualized in tableau.
PROCESING AND CLEANING:
1. Importing the dataset to local POSTGRESQL database:
1) Importing 13 datasets to local POSTGRESQL database:
CREATE TABLE bike_20_04 ( ride_id TEXT, rideable_type TEXT, started_at TIMESTAMP, ended_at TIMESTAMP, start_station_name TEXT, start_station_id TEXT, end_station_name TEXT, end_station_id TEXT, start_lat NUMERIC, start_lng NUMERIC, end_lat NUMERIC, end_lng NUMERIC, member_casual TEXT ) COPY bike_20_04 FROM 'C:\data\bike\202004-divvy-tripdata.csv' DELIMITER ',' CSV HEADER; -- repeated the same process for rest of 12 csv fiies
2) Uniting tables, making backup files and import united table:
--UNITE the 13 datasets in the database Copy ( SELECT * FROM bike_20_04 UNION ALL SELECT * FROM bike_20_05 UNION ALL SELECT * FROM bike_20_06 UNION ALL SELECT * FROM bike_20_07 UNION ALL SELECT * FROM bike_20_08 UNION ALL SELECT * FROM bike_20_09 UNION ALL SELECT * FROM bike_20_10 UNION ALL SELECT * FROM bike_20_11 UNION ALL SELECT * FROM bike_20_12 UNION ALL SELECT * FROM bike_21_01 UNION ALL SELECT * FROM bike_21_02 UNION ALL SELECT * FROM bike_21_03 UNION ALL SELECT * FROM bike_21_04 ) To 'C:\data\bike2021.csv' DELIMITER ',' CSV HEADER; -- Create a new table in local POSTGRES database for the united csv file: CREATE TABLE bike2021 ( ride_id TEXT, rideable_type TEXT, started_at TIMESTAMP, ended_at TIMESTAMP, start_station_name TEXT, start_station_id TEXT, end_station_name TEXT, end_station_id TEXT, start_lat NUMERIC, start_lng NUMERIC, end_lat NUMERIC, end_lng NUMERIC, member_casual TEXT ); -- Importing the united csv to the table in local POSTGRESQL database. COPY bike2021 FROM 'c:\data\bike2021.csv' DELIMITER ',' CSV HEADER;
2. Checking total number of rows and rows with null values:
SELECT COUNT(*) AS total, COUNT(*) FILTER (WHERE ride_id IS NULL) AS ride_id, COUNT(*) FILTER (WHERE rideable_type IS NULL) AS rideable_type, COUNT(*) FILTER (WHERE started_at IS NULL) AS started_at, COUNT(*) FILTER (WHERE ended_at IS NULL) AS ended_at, COUNT(*) FILTER (WHERE start_station_name IS NULL) AS start_station_name, COUNT(*) FILTER (WHERE start_station_id IS NULL) AS start_station_id, COUNT(*) FILTER (WHERE end_station_name IS NULL) AS end_station_name, COUNT(*) FILTER (WHERE end_station_id IS NULL) AS end_station_id, COUNT(*) FILTER (WHERE start_lat IS NULL) AS start_lat, COUNT(*) FILTER (WHERE start_lng IS NULL) AS start_lng, COUNT(*) FILTER (WHERE end_lat IS NULL) AS end_lat, COUNT(*) FILTER (WHERE end_lng IS NULL) AS end_lng, COUNT(*) FILTER (WHERE member_casual IS NULL) AS member_casual FROM bike2021;
The dataset has total 3826978 rows, and null values in 6
columns, start_station_name (148231) start_station_id (148857), end_station_name
(171416), end_station_id (171877), end_lat (5005), end_lng (5005).
Before deleting rows with null values, I will see if there is inconsistency, wrong values, white spaces and duplicates first so that I can figure out if can fill in null values referring to other rows or columns
3. Whitespace, duplicates, consistency, wrong value check, column by column:
1) Whitespace check:
--white space check on all columns SELECT * FROM bike2021 WHERE ride_id SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR rideable_type SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR started_at::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR ended_at::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR start_station_name SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR start_station_id SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR end_station_name SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR end_station_id SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR start_lat::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR start_lng::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR end_lat::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR end_lng::TEXT SIMILAR TO '[ \t\v\b\r\n\u00a0]' OR member_casual SIMILAR TO '[ \t\v\b\r\n\u00a0]';
Whitespace does not exist.
2) column by column check - ride_id
--checking whether there is any duplicate SELECT ride_id, COUNT(*) FROM bike2021 GROUP BY ride_id HAVING COUNT(*) > 1 --entire rows with ride_id duplicates SELECT * FROM bike2021 INNER JOIN ( SELECT ride_id, COUNT(*) FROM bike2021 GROUP BY ride_id HAVING COUNT(*) > 1 ) AS ll USING (ride_id) ORDER BY ride_id --counting rows with non-sense timestamp values (start date > end date) SELECT COUNT(*) FROM bike2021 WHERE started_at > ended_at; --delete DELETE FROM bike2021 WHERE started_at > ended_at;
The column, ride_id had 209 rows with duplicates. They contained non-sense start timestamp and end timestamp value, end timestamp being on earlier date than start one. I deleted 10557 rows with those timestamp values.
3) rideable_type:
SELECT DISTINCT rideable_type FROM bike2021;
--rows with same start and end time but different start and end stations, and different latitudes and longitudes SELECT * FROM bike2021 WHERE started_at = ended_at AND start_station_name != end_station_name; --deleting the rows with start and end time which make no sense DELETE FROM bike2021 WHERE started_at = ended_at AND start_station_name != end_station_name;
I have deleted 176 rows with same start and end time but different start and end stations and different latitudes and longitudes, which make no sense.
5) member_casual:
SELECT DISTINCT member_casual FROM bike2021;
SELECT DISTINCT start_station_name FROM bike2021; /* Finding out whether the ‘(*)’, ‘(Temp)’, ‘(Halsted)’ marked names refer to a different location than the actual one with non-marked station names */ SELECT start_station_id, b.start_station_name, flt.start_station_name, b.start_lat, b.start_lng, flt.start_lat, flt.start_lng FROM bike2021 AS b INNER JOIN ( SELECT start_station_id, start_station_name, start_lat, start_lng, COUNT(*) OVER (PARTITION BY start_station_id) AS freq FROM bike2021 --filtering rows with inconsistent station names only WHERE start_station_name SIMILAR TO '%[\(\)|\*\|]%' GROUP BY start_station_id, start_station_name, start_lat, start_lng ) AS flt USING (start_station_id) WHERE b.start_station_name NOT SIMILAR TO '%[\(\)|\*\|]%' GROUP BY start_station_id, b.start_station_name, flt.start_station_name, b.start_lat, b.start_lng, flt.start_lat, flt.start_lng /* filtering joined rows with inconsistent names which have more than 100 meter line distance difference from the actual GPS coordinates of the station. */ HAVING (flt.start_lat > MODE() WITHIN GROUP (ORDER BY b.start_lat) + 0.001 OR flt.start_lat < MODE() WITHIN GROUP (ORDER BY b.start_lat) - 0.001) OR (flt.start_lng > MODE() WITHIN GROUP (ORDER BY b.start_lng) + 0.001 OR flt.start_lng < MODE() WITHIN GROUP (ORDER BY b.start_lng) - 0.001);
It turned out that same station names with ‘(*)’, ‘(Temp)’, ‘(Halsted)’ do have around 100 meter distance difference from the actual station although they share the same id with the other consistent station names. Since it's due to a distance difference, I will leave them as they are.
7) end_station_name:
SELECT DISTINCT end_station_name FROM bike2021; /* Finding out whether the ‘(*)’, ‘(Temp)’, ‘(Halsted)’ marked names refer to a different location than the actual one with non-marked station names */ SELECT end_station_id, b.end_station_name, flt.end_station_name, b.end_lat, b.end_lng, flt.end_lat, flt.end_lng FROM bike2021 AS b INNER JOIN ( SELECT end_station_id, end_station_name, end_lat, end_lng, COUNT(*) OVER (PARTITION BY end_station_id) AS freq FROM bike2021 --filtering rows with inconsistent station names only WHERE end_station_name SIMILAR TO '%[\(\)|\*\|]%' GROUP BY end_station_id, end_station_name, end_lat, end_lng ) AS flt USING (end_station_id) WHERE b.end_station_name NOT SIMILAR TO '%[\(\)|\*\|]%' GROUP BY end_station_id, b.end_station_name, flt.end_station_name, b.end_lat, b.end_lng, flt.end_lat, flt.end_lng /* filtering joined rows with inconsistent names which have more than 100 meter line distance difference from the actual GPS coordinates of the station. */ HAVING (flt.end_lat > MODE() WITHIN GROUP (ORDER BY b.end_lat) + 0.001 OR flt.end_lat < MODE() WITHIN GROUP (ORDER BY b.end_lat) - 0.001) OR (flt.end_lng > MODE() WITHIN GROUP (ORDER BY b.end_lng) + 0.001 OR flt.end_lng < MODE() WITHIN GROUP (ORDER BY b.end_lng) - 0.001);
Similar to start_station_name, there are several ‘(*)’,
‘(Temp)’, ‘(Halsted)’ marked names referring to a different location. Again, I
will leave them as they are.
-Checking the inconsistent values in the start_station_id column: SELECT DISTINCT bike2021.start_station_id, start_station_name, l.start_station_id FROM bike2021 LEFT JOIN ( SELECT start_station_id, start_station_name FROM bike2021 --rows not with 3 digit number WHERE start_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}' GROUP BY start_station_id, start_station_name ) AS l USING (start_station_name) WHERE bike2021.start_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}'; --Checking the inconsistent values in the end_station_id column: SELECT DISTINCT bike2021.end_station_id, end_station_name, l.end_station_id FROM bike2021 LEFT JOIN ( SELECT end_station_id, end_station_name FROM bike2021 --rows not with 3 digit number WHERE end_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}' GROUP BY end_station_id, end_station_name ) AS l USING (end_station_name) WHERE bike2021.end_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}';
There is an inconsistency in station id values. The same station has 5 digit id or one with letters besides a correct 3 digit id. Confirming that those inconsistent ids don’t refer to a different location, I will replace them with a correct 3 digit one:
--For start_station_id UPDATE bike2021 SET start_station_id = refer.start_station_id FROM ( SELECT DISTINCT start_station_name, start_station_id FROM bike2021 WHERE start_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}' ) AS refer WHERE bike2021.start_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}' AND bike2021.start_station_name = refer.start_station_name; --For end_station_id UPDATE bike2021 SET end_station_id = refer.end_station_id FROM ( SELECT DISTINCT end_station_name, end_station_id FROM bike2021 WHERE end_station_id SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}' ) AS refer WHERE bike2021.end_station_id NOT SIMILAR TO '[0-9]{3}|[0-9]{2}|[0-9]{1}' AND bike2021.end_station_name = refer.end_station_name;
--Checking 3 digit ids wrongly assigned to station names SELECT * FROM ( SELECT start_station_name, start_station_id, COUNT(start_station_id) OVER (PARTITION BY start_station_name) AS count_ids, AVG(start_lat) AS slat, AVG(start_lng) AS slng, --To check if inconsistent ids on one same station refer to different locations (0.001 latitude difference = around 111 meters) ABS(AVG(start_lat) - LAG(AVG(start_lat)) OVER (PARTITION BY start_station_name)) > 0.001, --To check if inconsistent ids on one same station refer to different locations (0.001 longitude difference = around 111 meters) ABS(ABS(AVG(start_lng)) - ABS(LAG(AVG(start_lng)) OVER (PARTITION BY start_station_name))) > 0.001 FROM bike2021 GROUP BY start_station_name, start_station_id ORDER BY start_station_name ) AS grouped WHERE count_ids > 1; --For end stations SELECT * FROM ( SELECT end_station_name, end_station_id, COUNT(end_station_id) OVER (PARTITION BY end_station_name) AS count_ids, AVG(end_lat) AS elat, AVG(end_lng) AS elng, --To check if inconsistent ids on one same station refer to different locations (0.001 latitude difference = 111 meters) ABS(AVG(end_lat) - LAG(AVG(end_lat)) OVER (PARTITION BY end_station_name)) > 0.001, --To check if inconsistent ids on one same station refer to different locations (0.001 longitude difference = 111 meters) ABS(ABS(AVG(end_lng)) - ABS(LAG(AVG(end_lng)) OVER (PARTITION BY end_station_name))) > 0.001 FROM bike2021 GROUP BY end_station_name, end_station_id ORDER BY end_station_name ) AS grouped WHERE count_ids > 1;
It turned out that one station name is assigned to a wrong
id resulting in one id being used for two different stations. I will split them into
two different ids:
UPDATE bike2021 SET start_station_id = '651' WHERE start_station_name = 'Michigan Ave & 71st St'; UPDATE bike2021 SET start_station_id = '800' WHERE start_station_name = 'Halsted St & 104th St'; UPDATE bike2021 SET end_station_id = '651' WHERE end_station_name = 'Michigan Ave & 71st St'; UPDATE bike2021 SET end_station_id = '800' WHERE end_station_name = 'Halsted St & 104th St';
Note: Since values in the columns start_lat, start_lng,
end_lat, end_lng are GPS coordinates of each ride, there is a slight
difference among values within same stations, millimeter in extreme case. Also, electric bikes record start or end of service when they
are not only undocked or ducked but also unlocked or locked. This generates
many different latitude and longitude values.
SELECT bike2021.start_station_name, * FROM bike2021 LEFT JOIN ( SELECT start_station_name, MODE() WITHIN GROUP (ORDER BY start_lat) AS mod_slat, MODE() WITHIN GROUP (ORDER BY start_lng) AS mod_slng FROM bike2021 GROUP BY start_station_name ) AS with_modes USING(start_station_name) -- filtering rows with more than around +-150 meters in line distance away from a mode station location. WHERE (start_lat > mod_slat + 0.0015 OR start_lat < mod_slat - 0.0015) AND (start_lng > mod_slng + 0.0015 OR start_lng < mod_slng - 0.0015); --For end_lat and end_lng SELECT bike2021.end_station_name, * FROM bike2021 LEFT JOIN ( SELECT end_station_name, MODE() WITHIN GROUP (ORDER BY end_lat) AS mod_elat, MODE() WITHIN GROUP (ORDER BY end_lng) AS mod_elng FROM bike2021 GROUP BY end_station_name ) AS with_modes USING(end_station_name) -- filtering rows with more than around +-150 meters in line distance away from a mode station location. WHERE (end_lat > mod_elat + 0.0015 OR end_lat < mod_elat - 0.0015) AND (end_lng > mod_elng + 0.0015 OR end_lng < mod_elng - 0.0015);
--for start lat and lng DELETE FROM bike2021 USING ( SELECT bike2021.start_station_name, * FROM bike2021 LEFT JOIN ( SELECT start_station_name, MODE() WITHIN GROUP (ORDER BY start_lat) AS mod_slat, MODE() WITHIN GROUP (ORDER BY start_lng) AS mod_slng FROM bike2021 GROUP BY start_station_name ) AS with_modes USING(start_station_name) -- filtering rows with more than around +-150 meters in line distance away from a mode station location. WHERE (start_lat > mod_slat + 0.0015 OR start_lat < mod_slat - 0.0015) AND (start_lng > mod_slng + 0.0015 OR start_lng < mod_slng - 0.0015) ) AS refer WHERE bike2021.start_lat = refer.start_lat AND bike2021.start_lng = refer.start_lng; --for end lat and lng DELETE FROM bike2021 USING ( SELECT bike2021.end_station_name, * FROM bike2021 LEFT JOIN ( SELECT end_station_name, MODE() WITHIN GROUP (ORDER BY end_lat) AS mod_elat, MODE() WITHIN GROUP (ORDER BY end_lng) AS mod_elng FROM bike2021 GROUP BY end_station_name ) AS with_modes USING(end_station_name) -- filtering rows with more than around +-150 meters in line distance away from a mode station location. WHERE (end_lat > mod_elat + 0.0015 OR end_lat < mod_elat - 0.0015) AND (end_lng > mod_elng + 0.0015 OR end_lng < mod_elng - 0.0015) ) AS refer WHERE bike2021.end_lat = refer.end_lat AND bike2021.end_lng = refer.end_lng;
There are latitude and longitude values with only 1 or 2 decimal digits, which means those coordinates can indicate wrong start station names.
--start station lat, lon values with 1 or 2 digits only (non-null rows) SELECT start_station_name, start_lat, start_lng, rideable_type, COUNT(DISTINCT start_lat) AS start_lat, COUNT(DISTINCT start_lng) AS start_lng FROM bike2021 --filter values with decimal digits less than 3 WHERE (start_lat::TEXT NOT LIKE '%.___%' AND start_lng::TEXT NOT LIKE '%.___%') AND start_station_name IS NOT NULL GROUP BY start_station_name, start_lat, start_lng, rideable_type ORDER BY start_station_name; --end station lat, lon values with 1 or 2 digits only (non-null rows) SELECT end_station_name, end_lat, end_lng, rideable_type, COUNT(DISTINCT end_lat) AS end_lat, COUNT(DISTINCT end_lng) AS end_lng FROM bike2021 --filter values with decimal digits less than 3 WHERE (end_lat::TEXT NOT LIKE '%.___%' AND end_lng::TEXT NOT LIKE '%.___%') AND end_station_name IS NOT NULL GROUP BY end_station_name, end_lat, end_lng, rideable_type ORDER BY end_station_name;
There are 17 latitude(lat)-longitude(lon) values which have only 1 or 2 decimal digits. On google map, I compared all of 17 coordinate locations to the location which their station names indicate.
--for lat,lon start stations DELETE FROM bike2021 WHERE (start_lat::TEXT NOT LIKE '%.___%' AND start_lng::TEXT NOT LIKE '%.___%') AND start_station_name IS NOT NULL AND start_station_name != 'S Michigan Ave & E 118th St'; --for lat,lon end stations DELETE FROM bike2021 WHERE (end_lat::TEXT NOT LIKE '%.___%' AND end_lng::TEXT NOT LIKE '%.___%') AND end_station_name IS NOT NULL AND end_station_name != 'Woodlawn & 103rd - Olive Harvey Vaccination Site';
1235 rows with 1 or 2 digit lat-lon and 848 rows with 1 or 2 digit lat-lon have been deleted.
--counting rows with nulls when filtered by ‘electric bike’ SELECT COUNT(*) FROM bike2021 WHERE rideable_type = 'electric_bike' AND start_station_name IS NULL --check if the start station null values can be replaced by matching values in other rows SELECT DISTINCT start_station_name, start_station_id, l.start_lat, l.start_lng, COUNT(*) OVER () AS total_freq FROM bike2021 INNER JOIN ( SELECT DISTINCT start_lat, start_lng FROM bike2021 WHERE rideable_type = 'electric_bike' AND start_station_name IS NULL ) AS l USING (start_lat, start_lng) WHERE bike2021.start_station_name IS NOT NULL OR bike2021.start_station_id IS NOT NULL ORDER BY start_station_name; --end station name SELECT DISTINCT l.end_lat, l.end_lng, end_station_name, end_station_id FROM bike2021 INNER JOIN ( SELECT DISTINCT end_lat, end_lng FROM bike2021 WHERE rideable_type = 'electric_bike' AND end_station_name IS NULL ) AS l USING (end_lat, end_lng) WHERE (bike2021.end_station_name IS NOT NULL OR bike2021.end_station_id IS NOT NULL) ORDER BY end_lat, end_lng;
Total 147636 rows have nulls in either start_station_name or start_station_id when filtered by ‘electric_bike’.
--for start station name UPDATE bike2021 SET start_station_name = rf_station FROM ( SELECT start_lat, start_lng, MAX(start_station_name) AS rf_station FROM bike2021 WHERE rideable_type = 'electric_bike' AND start_station_name IS NOT NULL GROUP BY start_lat, start_lng ) AS rf WHERE (rf.start_lat = bike2021.start_lat AND rf.start_lng = bike2021.start_lng ) AND bike2021.rideable_type = 'electric_bike' AND bike2021.start_station_name IS NULL; --for end station name UPDATE bike2021 SET end_station_name = rf_station FROM ( SELECT end_lat, end_lng, MAX(end_station_name) AS rf_station FROM bike2021 WHERE rideable_type = 'electric_bike' AND end_station_name IS NOT NULL GROUP BY end_lat, end_lng ) AS rf WHERE (rf.end_lat = bike2021.end_lat AND rf.end_lng = bike2021.end_lng ) AND bike2021.rideable_type = 'electric_bike' AND bike2021.end_station_name IS NULL;
--start station names SELECT * FROM bike2021 INNER JOIN ( SELECT start_lat, start_lng, MAX(start_station_name) FROM bike2021 WHERE start_station_name IS NOT NULL GROUP BY start_lat, start_lng ) AS l USING (start_lat, start_lng) WHERE start_station_name IS NULL; --end station names SELECT * FROM bike2021 INNER JOIN ( SELECT end_lat, end_lng, MAX(end_station_name) FROM bike2021 WHERE end_station_name IS NOT NULL GROUP BY end_lat, end_lng ) AS l USING (end_lat, end_lng) WHERE end_station_name IS NULL;
There is none to be replaced.
3) Checking if nulls in start and end station ids are replaceable based on latitude and longitude:
--start station ids SELECT * FROM bike2021 INNER JOIN ( SELECT start_lat, start_lng, MAX(start_station_id) FROM bike2021 WHERE start_station_id IS NOT NULL GROUP BY start_lat, start_lng, start_station_id ) AS l USING (start_lat, start_lng) WHERE bike2021.start_station_id IS NULL; --end station ids SELECT * FROM bike2021 INNER JOIN ( SELECT end_lat, end_lng, MAX(end_station_id) FROM bike2021 WHERE end_station_id IS NOT NULL GROUP BY end_lat, end_lng ) AS l USING (end_lat, end_lng) WHERE bike2021.end_station_id IS NULL;
There is one case in end station that can be replaced:
UPDATE bike2021 SET end_station_id = '20133' WHERE end_station_name = 'Woodlawn & 103rd - Olive Harvey Vaccination Site' AND end_station_id IS NULL
4) Checking if nulls in station names can be replaced based on id, or vice versa:
SELECT * FROM bike2021 WHERE (end_station_name IS NULL AND end_station_id IS NOT NULL) OR (start_station_name IS NULL AND start_station_id IS NOT NULL) OR (end_station_name IS NOT NULL AND end_station_id IS NULL) OR (start_station_name IS NOT NULL AND start_station_id IS NULL);
There is none to be replaced.
5. Deleting nulls
Final check on nulls:
SELECT * FROM bike2021 WHERE (start_station_name IS NULL OR start_station_id IS NULL OR end_station_name IS NULL OR end_station_id IS NULL) AND end_lat IS NOT NULL AND end_lng IS NOT NULL AND start_lat::TEXT LIKE '%.___%' AND start_lng::TEXT LIKE '%.___%' AND end_lat::TEXT LIKE '%.___%' AND end_lng::TEXT LIKE '%.___%';
Also, I cannot be sure if the 1 or 2 decimal digit values are not errors because the majority of records from electric bikes have the correct form of GPS coordinates with at least 4 decimal digits.
DELETE FROM bike2021 WHERE start_station_name IS NULL OR start_station_id IS NULL OR end_station_name IS NULL OR end_station_id IS NULL OR end_lat IS NULL OR end_lng IS NULL; SELECT COUNT(*) AS total, COUNT(*) FILTER (WHERE ride_id IS NULL) AS ride_id, COUNT(*) FILTER (WHERE rideable_type IS NULL) AS rideable_type, COUNT(*) FILTER (WHERE started_at IS NULL) AS started_at, COUNT(*) FILTER (WHERE ended_at IS NULL) AS ended_at, COUNT(*) FILTER (WHERE start_station_name IS NULL) AS start_station_name, COUNT(*) FILTER (WHERE start_station_id IS NULL) AS start_station_id, COUNT(*) FILTER (WHERE end_station_name IS NULL) AS end_station_name, COUNT(*) FILTER (WHERE end_station_id IS NULL) AS end_station_id, COUNT(*) FILTER (WHERE start_lat IS NULL) AS start_lat, COUNT(*) FILTER (WHERE start_lng IS NULL) AS start_lng, COUNT(*) FILTER (WHERE end_lat IS NULL) AS end_lat, COUNT(*) FILTER (WHERE end_lng IS NULL) AS end_lng, COUNT(*) FILTER (WHERE member_casual IS NULL) AS member_casual FROM bike2021;
Total 232291 rows with null values are removed. The cleaned dataset has total 3578263 rows.
Creating a copy of cleaned dataset in database for analysis and a backup:
CREATE TABLE bike AS TABLE bike2021;
ANALYZING
ALTER TABLE bike ADD COLUMN ride_duration INTERVAL; UPDATE bike SET ride_duration = ended_at::TIMESTAMP - started_at::TIMESTAMP;
2) Cleaning the dataset with the new column:
The ride duration 0 and the one over 24 hours are likely errored values because there are only single ride pass (usually 30 to 45 minute ride) and full day pass (24 hours with the rented bike docked every 30 to 45 minutes).
DELETE FROM bike WHERE ride_duration <= '00:00:00' OR ride_duration > '24:00:00'
3) Adding new columns with ride duration in minute, ride distance, ride speed (ride distance / ride duration in minute).
To calculate ride distance from start station latitudes(lat) and longitudes(lon) to end station lat and lon, I will use the ST_DistanceSphere function from POSTGIS extension, which is believed to be with less than 7 meter error:
ALTER TABLE bike ADD COLUMN distance_m NUMERIC, --ride distance in meter ADD COLUMN ridetime_min NUMERIC, --ride duration in minute ADD COLUMN speed_mm NUMERIC; --ride distance in meter / ride duration in minute UPDATE bike SET ridetime_min = (EXTRACT(EPOCH FROM ride_duration)/60)::NUMERIC, distance_m = ST_DistanceSphere( ST_MakePoint(start_lng,start_lat), ST_MakePoint(end_lng,end_lat))::NUMERIC, speed_mm = (ST_DistanceSphere( ST_MakePoint(start_lng,start_lat), ST_MakePoint(end_lng,end_lat)) / ridetime_min)::NUMERIC;
SELECT MAX(distance_m), MIN(distance_m), MAX(speed_mm), MIN(speed_mm) FROM bike;
Distance 0 and speed 0 indicate either test centers or round trips.
DELETE FROM bike WHERE speed_mm > 600
Values from the company’s bike test centers are not needed for my analysis. I will delete the 3403 rows with bike test centers:
DELETE FROM bike WHERE start_station_id = 'Hubbard Bike-checking (LBS-WH-TEST)' OR start_station_name IN ('HUBBARD ST BIKE CHECKING (LBS-WH-TEST)' , 'WATSON TESTING - DIVVY'); DELETE FROM bike WHERE end_station_name IN ('HUBBARD ST BIKE CHECKING (LBS-WH-TEST)' , 'WATSON TESTING - DIVVY') OR end_station_id = 'Hubbard Bike-checking (LBS-WH-TEST)'; --checking the cleaned dataset SELECT COUNT(*) AS row_n FROM bike;
1) Number of total usages, usages of each bike type:
SELECT COUNT(*) AS usages, SUM(COUNT(*)) OVER () AS total, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER ()) AS ratio FROM bike GROUP BY member_casual; --usages of a bike type, grouped by user type SELECT member_casual, rideable_type, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (PARTITION BY member_casual), 2) AS ratio FROM bike GROUP BY member_casual, rideable_type ORDER BY member_casual, rideable_type;
Throughout a year (2020-04 to 2021-04), the total usages
of Cyclistic bike share service were 3,570,848. the member usages make up about 59%
of the total which is 18% more than casual usages. Both user types preferred riding non-electric
bikes.
SELECT member_casual, TO_CHAR(started_at, 'yyyy-mm') as year_month, COUNT(*), ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (PARTITION BY member_casual), 2) AS percentage_type FROM bike GROUP BY member_casual, year_month ORDER BY member_casual, year_month;
-- 0 = Sunday, 6 = Saturday SELECT EXTRACT(DOW from started_at) AS day_of_week, member_casual, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (PARTITION BY member_casual),2) AS percentage_type, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS percentage_total FROM bike GROUP BY day_of_week, member_casual ORDER BY member_casual, day_of_week
SELECT member_casual, CASE WHEN EXTRACT(DOW FROM started_at) IN (0,6) THEN 'weekends' ELSE 'weekdays' END AS day_of_week, EXTRACT(HOUR FROM started_at::TIME) AS hour, COUNT(*) AS freq FROM bike GROUP BY member_casual, day_of_week, hour ORDER BY member_casual, day_of_week, hour; --counting values only in commuting hour in the morning, 05:00 -08:59 weekdays SELECT member_casual, COUNT(CASE WHEN (EXTRACT(HOUR FROM started_at::TIME) BETWEEN 5 AND 8) AND (EXTRACT(DOW FROM started_at) NOT IN (0,6)) THEN 1 END) AS commute_usages, COUNT(CASE WHEN (EXTRACT(HOUR FROM started_at::TIME) BETWEEN 5 AND 8) AND (EXTRACT(DOW FROM started_at) NOT IN (0,6)) THEN 1 END) * 100 / COUNT(*) AS percentage FROM bike GROUP BY member_casual;
SELECT member_casual, CASE WHEN start_station_name = end_station_name THEN 'round' ELSE 'one way' END AS trip_type, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (PARTITION BY member_casual),2) AS percentage_type, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS percentage_total FROM bike GROUP BY member_casual, trip_type;
There is a gap between round trip usages of casual users and that of members. The percentage of round trip casual usages is about 12% higher than that of member usages.
This is likely to mean that around 18% of total casual usages is not for errands or commuting while only about 6% of total member
usages is not so. Since users are not supposed to park a bike at any places until they end the service, round trips are likely for leisure usages.
6) Number of usages not related to work (round trips + weekend usages):
SELECT member_casual, --the number of weekend usages (COUNT(*) FILTER (WHERE EXTRACT(DOW FROM started_at) IN (0, 6)) --the number of round trip only during weekdays + COUNT(*) FILTER (WHERE start_station_name = end_station_name AND EXTRACT(DOW FROM started_at) NOT IN (0, 6))) * 100 / COUNT(*) AS no_work_usages FROM bike GROUP BY member_casual;
SELECT * FROM ( SELECT member_casual, start_station_name, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio , ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking FROM bike WHERE member_casual = 'casual' GROUP BY member_casual, start_station_name LIMIT 10 ) AS ca LEFT JOIN ( SELECT member_casual, start_station_name, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio, ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking FROM bike WHERE member_casual = 'member' GROUP BY member_casual, start_station_name LIMIT 10 ) AS me USING(ranking);
SELECT * FROM ( SELECT member_casual, end_station_name, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio , ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking FROM bike WHERE member_casual = 'casual' GROUP BY member_casual, end_station_name LIMIT 10 ) AS ca LEFT JOIN ( SELECT member_casual, end_station_name, COUNT(*) AS usages, ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 2) AS ratio, ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ranking FROM bike WHERE member_casual = 'member' GROUP BY member_casual, end_station_name LIMIT 10 ) AS me USING(ranking);
All top 10 end stations of both casual and member users are identical with their top 10 start stations except one in members' which also has many retail stores are around.
3) Usages in all start and end stations, Tableau map chart:
SELECT member_casual, start_station_name, COUNT(*) AS usages, AVG(start_lat) AS slat, AVG(start_lng) AS slng FROM bike GROUP BY member_casual, start_station_name; --for end stations SELECT member_casual, end_station_name, COUNT(*) AS usages, AVG(end_lat) AS elat, AVG(end_lng) AS elng FROM bike GROUP BY member_casual, end_station_name;
(Hoover on a geo-point to see the name and number / Click on a geo-point and click on 'Link to Google Map' to see the location in google map / Scroll up and down for zoom in and out)
Those famous leisure locations near the shore are not likely for someone who does running errands. Rather, they are where people explore sites, do outdoor activities and enjoy spending time.
If looked closely by zooming in, even many stations far from the shore and showing a significant number of casual usages are also near local parks.
All these support that casual customers use the Cyclistic bike share service for leisure more than members do.
4) Usages in four highly leisure centered locations exclusively:
WITH latlng AS ( -- to merge the scattered geo points within a same station into one geo point SELECT o.start_station_name, o.end_station_name, o.started_at, o.ended_at, o.member_casual, o.ride_duration, o.ridetime_min, o.distance_m, o.speed_mm, l.start_lat, l.start_lng, d.end_lat, d.end_lng FROM bike AS o LEFT JOIN ( SELECT start_station_name, AVG(start_lat) AS start_lat, AVG(start_lng) AS start_lng FROM bike GROUP BY start_station_name ) AS l ON o.start_station_name = l.start_station_name LEFT JOIN ( SELECT end_station_name, AVG(end_lat) AS end_lat, AVG(end_lng) AS end_lng FROM bike GROUP BY end_station_name ) AS d ON o.end_station_name = d.end_station_name ) SELECT member_casual, start_station_name, end_station_name, COUNT(*) AS usages, AVG(start_lat) AS slat, AVG(start_lng) AS slng, AVG(end_lat) AS elat, AVG(end_lng) AS elng FROM latlng GROUP BY member_casual, start_station_name, end_station_name --to filter four major leisure park areas in the shore exclusively HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park OR (AVG(start_lng) >= -87.62573 AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St'); --ratio of each one’s usages in 4 locations to each total SELECT member_casual, ratio_each_total FROM ( SELECT member_casual, SUM(COUNT(*)) OVER (PARTITION BY member_casual) AS usages FROM bike GROUP BY member_casual, start_station_name HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park OR (AVG(start_lng) >= -87.62573 AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St') ) AS l LEFT JOIN LATERAL ( SELECT member_casual, ROUND(usages * 100 / COUNT(*), 2) AS ratio_each_total FROM bike GROUP BY member_casual ) AS o USING (member_casual) GROUP BY member_casual, ratio_each_total;
Casual usages in the four leisure locations exclusively make up 33.09% of the total casual usages. Also, looking into the riding path made from start stations to end stations, most destinations in casual usages are either the same stations (round trip) or within the leisure locations.
SELECT member_casual, ratio_each_total FROM ( SELECT bike.member_casual, bike.start_station_name, bike.end_station_name, SUM(COUNT(*)) OVER (PARTITION BY bike.member_casual) AS ratio FROM bike LEFT JOIN ( --filtering out start stations in the four major leisure locations SELECT member_casual, start_station_name FROM bike GROUP BY member_casual, start_station_name HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park OR (AVG(start_lng) >= -87.62573 AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St') ) AS l USING (start_station_name) WHERE l.start_station_name IS NULL GROUP BY bike.member_casual, bike.start_station_name, bike.end_station_name --filtering rows with end stations in the four major leisure locations HAVING (AVG(end_lng) >= -87.6523 AND AVG(end_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach OR (AVG(end_lng) >= -87.6392 AND AVG(end_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park OR (AVG(end_lng) >= -87.62573 AND AVG(end_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium OR (AVG(end_lng) >= -87.61351 AND AVG(end_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park OR (AVG(end_lng) >= -87.5869 AND AVG(end_lat) BETWEEN 41.7736 AND 41.8024) -- jason par OR end_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St') ) AS ll LEFT JOIN LATERAL ( SELECT member_casual, ROUND(ratio * 100 / COUNT(*), 2) AS ratio_each_total FROM bike GROUP BY member_casual ) AS o USING (member_casual) GROUP BY member_casual, ratio_each_total;
Approximately 13.13% of the total casual usages that started from outside the four leisure locations ends at stations inside the leisure locations while about 12.16% of the total member usages does so.
SELECT member_casual, ratio_each_total FROM ( SELECT bike.member_casual, bike.start_station_name, bike.end_station_name, SUM(COUNT(*)) OVER (PARTITION BY bike.member_casual) AS ratio FROM bike LEFT JOIN ( --filtering out start stations in the four major leisure locations SELECT member_casual, start_station_name FROM bike GROUP BY member_casual, start_station_name HAVING (AVG(start_lng) >= -87.6523 AND AVG(start_lat) BETWEEN 41.9455 AND 41.9897) --montrose beach OR (AVG(start_lng) >= -87.6392 AND AVG(start_lat) BETWEEN 41.9069 AND 41.9408) -- lincon park OR (AVG(start_lng) >= -87.62573 AND AVG(start_lat) BETWEEN 41.8542 AND 41.8926) -- habour millenium OR (AVG(start_lng) >= -87.61351 AND AVG(start_lat) BETWEEN 41.8162 AND 41.8386) -- burnham park OR (AVG(start_lng) >= -87.5869 AND AVG(start_lat) BETWEEN 41.7736 AND 41.8024) -- jason park OR start_station_name IN ('Miles van der Rohe Way & Chicago Ave', 'Michigan Ave & Oak St') ) AS l USING (start_station_name) WHERE l.start_station_name IS NULL AND start_station_name = end_station_name -- round trip GROUP BY bike.member_casual, bike.start_station_name, bike.end_station_name ) AS ll LEFT JOIN LATERAL ( SELECT member_casual, ROUND(ratio * 100 / COUNT(*), 2) AS ratio_each_total FROM bike GROUP BY member_casual ) AS o USING (member_casual) GROUP BY member_casual, ratio_each_total;
As for members, about 38% of the total usages are likely to be for leisure.
4. Summary of patterns found in usage number
SELECT member_casual, AVG(ride_duration) AS avg_ride_duration, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) AS med_ride_duration, ROUND(AVG(distance_m), 2) AS avg_distance_m, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY distance_m)::NUMERIC, 2) AS med_distance, ROUND(AVG(speed_mm), 2) AS avg_speed_mm, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)::NUMERIC, 2) AS med_speed_mm FROM bike GROUP BY member_casual;
2) Average and median ride duration in one-way trip and round trip:
--one-way trip SELECT member_casual, AVG(ride_duration) AS avg_ride_duration, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) AS med_ride_duration, ROUND(AVG(distance_m), 2) AS avg_distance_m, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY distance_m)::NUMERIC, 2) AS med_distance, ROUND(AVG(speed_mm), 2) AS avg_speed_mm, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)::NUMERIC, 2) AS med_speed_mm FROM bike WHERE speed_mm > 0 --filtering out scattered coordinates within same station as well. AND start_station_name != end_station_name GROUP BY member_casual; --round trip SELECT member_casual, AVG(ride_duration) AS avg_ride_duration, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) AS med_ride_duration, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY distance_m)::NUMERIC, 2) AS med_distance, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)::NUMERIC, 2) AS med_speed_mm FROM bike WHERE speed_mm >= 0 --including scattered coordinates with in same station. AND start_station_name = end_station_name GROUP BY member_casual
<one-way trip>
In both one-way and round trips, the ride duration of casual usages is much longer than that of member.
The degree of gap is different in one-way and round trips. In round trips, the ride duration gap is extremely big, showing about 3 times longer than that of member usages. On the other hand, in one-way trip, the median ride duration of casual usages is longer by around 1.6 times.SELECT CASE WHEN distance_m < 500 THEN '<500m' WHEN distance_m BETWEEN 500 AND 1000 THEN '500-1000m' WHEN distance_m BETWEEN 1000 AND 1500 THEN '1000-1500m' WHEN distance_m BETWEEN 1500 AND 2000 THEN '1500-2000m' WHEN distance_m BETWEEN 2000 AND 2500 THEN '2000-2500m' WHEN distance_m BETWEEN 2500 AND 3000 THEN '2500-3000m' WHEN distance_m BETWEEN 3000 AND 3500 THEN '3000-3500m' WHEN distance_m BETWEEN 3500 AND 4000 THEN '3500-4000m' WHEN distance_m BETWEEN 4000 AND 4500 THEN '4000-4500m' WHEN distance_m BETWEEN 4500 AND 5000 THEN '4500-5000m' WHEN distance_m BETWEEN 5000 AND 5500 THEN '5000-5500m' WHEN distance_m BETWEEN 5500 AND 6000 THEN '5500-6000m' WHEN distance_m BETWEEN 6000 AND 6500 THEN '6000-6500m' WHEN distance_m BETWEEN 6500 AND 7000 THEN '6500-7000m' WHEN distance_m BETWEEN 7000 AND 7500 THEN '7000-7500m' WHEN distance_m BETWEEN 7500 AND 8000 THEN '7500-8000m' WHEN distance_m >= 8000 THEN '>8000m' END AS distance, COUNT(*) AS usages, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ridetime_min) AS duration, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed, member_casual, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) - LAG(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)) OVER () AS diff FROM bike WHERE start_station_name != end_station_name AND speed_mm != 0 GROUP BY distance, member_casual ORDER BY distance, member_casual;
SELECT member_casual, CASE WHEN EXTRACT(DOW FROM started_at) IN (0, 6) THEN 'weekends' ELSE 'weekdays' END AS day, EXTRACT(HOUR FROM started_at) AS hour, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ridetime_min) AS duration, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed, COUNT(*) AS usages FROM bike GROUP BY member_casual, day, hour;
As for the long ride tendency of casual users, time is also not likely the reason since the ride duration of casual usages is constantly longer than members regardless of hourly changes.
SELECT bike.member_casual, start_station_name, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)::NUMERIC, 2) AS c_speed, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) AS c_duration, COUNT(*) AS c_usages, COUNT(*) FILTER (WHERE start_station_name = end_station_name AND speed_mm = 0) AS roundtrip, l.member_casual, ROUND(l.speed::NUMERIC, 2) AS speed, l.duration, l.usages FROM bike LEFT JOIN ( SELECT member_casual, start_station_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) duration, COUNT(*) AS usages FROM bike WHERE member_casual = 'member' GROUP BY member_casual, start_station_name ) AS l USING (start_station_name) WHERE bike.member_casual = 'casual' GROUP BY bike.start_station_name, bike.member_casual, l.member_casual, l.speed, l.duration, l.usages HAVING COUNT(*) > 10000 --order by the greatest speed difference between casual and member ORDER BY l.speed - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) DESC LIMIT 10;
It is hard to imagine a person riding a bike at a speed that goes only 4 meter per minute constantly. Rather, it is more likely that casual users stop more often, explore sites in further distance or take a half round trip.
This brings up a new problem. Just like the faster speed of member usages have shown so far, local people do not usually spend much time looking around tourist attractions. That is, it is also possible that most low speed casual usages in the tourist attractions, which make up significant percentage of the total, are made not by locals but by tourists.
2) The top 10 least gap start stations with a significant number of usages:
SELECT bike.member_casual, start_station_name, ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm)::NUMERIC, 2) AS c_speed, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) AS c_duration, COUNT(*) AS c_usages, COUNT(*) FILTER (WHERE start_station_name = end_station_name AND speed_mm = 0) AS roundtrip, l.member_casual, ROUND(l.speed::NUMERIC, 2) AS speed, l.duration, l.usages FROM bike LEFT JOIN ( SELECT member_casual, start_station_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ride_duration) duration, COUNT(*) AS usages FROM bike WHERE member_casual = 'member' GROUP BY member_casual, start_station_name ) AS l USING (start_station_name) WHERE bike.member_casual = 'casual' GROUP BY bike.start_station_name, bike.member_casual, l.member_casual, l.speed, l.duration, l.usages HAVING COUNT(*) > 4000 --order by the least speed difference between casual and member ORDER BY l.speed - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) LIMIT 10;
Most top 10 end stations with the least speed gap stations are located either in commercial or commercial-residential areas where markets, banks and shops are around. Those are more likely running errand and work oriented locations.
--from start stations with greatest gap SELECT member_casual, end_station_name, COUNT(*) AS usages, MODE() WITHIN GROUP (ORDER BY speed_mm) AS mode_speed, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS med_speed, AVG(end_lat) AS end_lat, AVG(end_lng) AS end_lng FROM bike WHERE start_station_name IN ('Michigan Ave & Washington St', 'Michigan Ave & Lake St', 'Millennium Park', 'Columbus Dr & Randolph St', 'Michigan Ave & 8th St', 'Indiana Ave & Roosevelt Rd', 'Lake Shore Dr & Monroe St', 'Buckingham Fountain', 'Michigan Ave & Oak St', 'Streeter Dr & Grand Ave') AND speed_mm != 0 AND start_station_name != end_station_name GROUP BY end_station_name, member_casual ORDER BY usages DESC; --from start stations with least gap SELECT member_casual, end_station_name, COUNT(*) AS usages, MODE() WITHIN GROUP (ORDER BY speed_mm) AS mode_speed, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS med_speed, AVG(end_lat) AS end_lat, AVG(end_lng) AS end_lng FROM bike WHERE start_station_name IN ('Clark St & Drummond Pl', 'Halsted St & Wrightwood Ave', 'Clark St & Schiller St', 'Wells St & Concord Ln', 'Bissell St & Armitage Ave', 'Sedgwick St & North Ave', 'Morgan St & Lake St', 'Damen Ave & Pierce Ave', 'Clark St & Elm St', 'Wells St & Evergreen Ave') AND speed_mm != 0 AND start_station_name != end_station_name GROUP BY end_station_name, member_casual ORDER BY usages DESC;
WITH aa AS ( SELECT bike.member_casual, start_station_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY bike.speed_mm) AS speed_casual, COUNT(*) AS usages_casual, l.member_casual AS member, l.speed AS speed_member, l.usages AS usages_member FROM bike LEFT JOIN ( SELECT member_casual, start_station_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY bike.speed_mm) AS speed, COUNT(*) AS usages FROM bike WHERE member_casual = 'member' GROUP BY member_casual, start_station_name ) AS l USING (start_station_name) WHERE bike.member_casual = 'casual' GROUP BY bike.member_casual, bike.start_station_name, l.member_casual, l.speed, l.usages ORDER BY COUNT(bike.*) DESC ) SELECT CORR(speed_casual, speed_member), SUM(usages_casual) AS cas_total, SUM(usages_member) AS mem_total, COUNT(*) AS num_station FROM aa WHERE usages_casual > 400 AND usages_member > 400; --for end stations WITH aa AS ( SELECT b.member_casual, end_station_name, COUNT(*) AS usages, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS speed, m_usages, m_speed FROM bike AS b LEFT JOIN ( SELECT member_casual, end_station_name, COUNT(*) AS m_usages, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY speed_mm) AS m_speed FROM bike WHERE member_casual = 'member' GROUP BY end_station_name, member_casual ) AS l USING (end_station_name) WHERE b.member_casual = 'casual' GROUP BY b.end_station_name, b.member_casual, m_usages, m_speed ) SELECT CORR(speed, m_speed), SUM(usages) AS cas_total, SUM(m_usages) AS mem_total, COUNT(*) AS num_station FROM aa WHERE usages > 400 AND m_usages > 400;
The calculation is done by including around total 3,490,509 usages grouped by around 446 stations with at least more than 400 usages over the past year.
It turned out that the ride speed of casual and members in both each start station and each end station have a strong positive correlation. That is, location is likely to affect the ride speed of both users to be faster or slower.
This also means that the long ride tendency of casual usages is not derived from locational reasons. It is that no matter how fast or low the ride speed of both casual and member usages is in a particular location, the speed of casual usages is constantly slower than that of members. As shown in the least and greatest speed gap tables, regardless of which location users ride bike at, casual users are slower in a significant degree.
The long ride tendency is the pattern almost all casual users share. It may be related to problems only casual users have, users' age or a unique psychology regarding using the service with a single ride pass. These are all key aspects to tackle with when making a successful casual customer targeted marketing strategy that finally converts them to annual members.
The pattern that casual usages are more leisure oriented alone does not imply much for making the strategy. Without knowing why there is the long ride tendency, there too many possible variables that eventually generate completely different marketing strategies.
For example, the ride speed of casual usages in leisure areas is extremely slow. one may think it is because they are mostly tourists, and it results in focusing more on casual customers in running errand areas. On the other hand, the other believes that it is because they want most out of a single ride pass by not docking the rented bike until the nearest minute of end-service time. In this case, it will be more effective if the fact that members can dock and undock anytime without being charged gets highlighted.
5) Ride speed and usages in all stations (start and end):
SELECT member_casual, start_station_name, end_station_name, speed_mm AS speed, distance_m AS distance, ridetime_min AS duration, l.start_lat AS slat, l.start_lng AS slng, ll.end_lat AS elat, ll.end_lng AS elng FROM bike --merging the scattered geo points in same start station into one LEFT JOIN ( SELECT start_station_name, AVG(start_lat) AS start_lat, AVG(start_lng) AS start_lng FROM bike GROUP BY start_station_name ) AS l USING (start_station_name) --merging the scattered geo points in same end station into one LEFT JOIN ( SELECT end_station_name, AVG(end_lat) AS end_lat, AVG(end_lng) AS end_lng FROM bike GROUP BY end_station_name ) AS ll USING (end_station_name);
Similar to what has been seen in the usage number analysis, the ride speed of casual usages varies by locations in a huge degree whereas that of members do not.
Coinciding with the finding in the top least speed gap stations, the locations in the northwest of the shore show the highest speed of users.
Although location affects users' speed, why casual usages show extremely low speed in particular locations cannot be known unless the reason why the long ride tendency of casual exists is answered. It's because answering that problem, in the end, gets down to the question why members ride a bike much shorter time than casual users in those same locations, which is the same question arisen in all locations as well.
Additional data should be given for a further analysis.
9. Conclusion
I first looked into how the number of usages is different. Second, the ride duration, distance and speed of casual and member usages were compared.
In addition, looking into the ride duration and speed of both casual and member usages, I could find a huge gap between casual and member usages.




















Comments
Post a Comment