Which Cycle Frontier Jobs Are Worth Doing? - Part Four.
Time To Deal With the Missing Values
game
python
data
science
exploration
cycle
frontier
Published
November 4, 2022
Introduction
Welcome back to the fourth post in the series. Last time, we ran into missing values in the data, including the Job that prompted the initial investigation. We’ll jump right in, picking up with the naming mismatches between tables.
While working through the list of problems I found reviewing the data, it became clear that there is an error in the Regex built before. The symptom was the Autoloader, which was left with a hanging number when the group is pulled out.
This didn’t end up being the only correction to the Regex; the Co-Tool Multitool was also being broken apart since it has a - in the name. So, we had to update the pattern to check for and include the - on the split.
newRegex = r"((\d+\s[\w\-]+\s?[a-zA-Z]+))"
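As a quick check of the updated pattern, here is a minimal sketch that runs it against a made-up loot string. The `sample` text is an assumption for illustration; the real strings come from the scraped tasks table.

```python
import re

newRegex = r"((\d+\s[\w\-]+\s?[a-zA-Z]+))"

# Hypothetical loot string: a count followed by an item name, repeated
sample = "2 Co-Tool Multitool 1 Autoloader"

# finditer + group(0) returns each whole match, which avoids the tuples
# that findall produces when a pattern contains capture groups
matches = [m.group(0) for m in re.finditer(newRegex, sample)]
print(matches)
# ['2 Co-Tool Multitool', '1 Autoloader']
```

The hyphenated name stays in one piece and the Autoloader no longer picks up a trailing number.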
Now we’ll have our updated version of the breakLoot function we wrote before.
Now we can continue and move to fixing the name mismatches like intended. There were quite a few of these so I’ll only show the process for a few of them - and then the fix for all of them. We’ll pull the code from the previous post so we can start by using the 0 values and work backwards.
There are 31 rows with a Cost of 0 which we’ll need to investigate. The first one, which I had noticed and corrected in the previous post, was the CPUs, so we’ll address this one first. We’ll need to find the values in the tasks table and the lootKMarks table so we can figure out where the disconnect is.
The tasks table has a shortened version of its name so the easiest way to fix this would be to append CPU to the loot name in the tasks table.
tasks.loc[tasks.loot == "Master Unit", 'loot'] = 'Master Unit CPU'
tasks.loc[tasks.loot == "Master Unit CPU", 'loot']
6 Master Unit CPU
16 Master Unit CPU
47 Master Unit CPU
11 Master Unit CPU
Name: loot, dtype: object
Another was the Pure Focus Crystals, which have the same problem of a truncated name; this is a trend among the affected higher-tier materials. Those will all be included in the final correction, and I’ll skip showing the process since it’s the same tedious steps each time.
# Missing the Crystal part
tasks.loc[tasks.loot.str.contains("Pure Focus")]
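Since the truncated names all follow the same pattern, one way to apply them together is a lookup from short name to full name. This is only a sketch: the small `tasks` frame and the `fixes` mapping are stand-ins, and the full name "Pure Focus Crystal" is an assumption based on the comment above rather than a value checked against lootKMarks.

```python
import pandas as pd

# Stand-in for the real tasks table
tasks = pd.DataFrame({"loot": ["Master Unit", "Pure Focus", "Ball Bearings"]})

# Hypothetical mapping from truncated name to the full loot name
fixes = {
    "Master Unit": "Master Unit CPU",
    "Pure Focus": "Pure Focus Crystal",
}

# Series.replace with a dict swaps exact matches and leaves the rest alone
tasks["loot"] = tasks["loot"].replace(fixes)
print(tasks)
```

Keeping the corrections in one dictionary means a new mismatch is a one-line addition instead of another `.loc` assignment.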
We’ve made some good progress so far but we’ve still got quite a few jobs to update. Let’s check the list of missing values once more and see if there is a new pattern here.
The biggest standout problem now is all those data drive quests, so let’s resolve those next.
Data Drives:
We have another problem though since none of the tables we have actually contains the data we’re after. So, we’re back to the wiki to do some more scraping work. If we also look at the names of the data drives we’re going to have another problem soon - which you’ll see when we download the data.
# Missing the Data Drive
tasks.loc[tasks.loot.str.contains("Data")]
    name            count  loot            description
19  Data Drive II   1      Rare Data       The Data you brought us was helpful, but we ne...
20  Data Drive III  2      Rare Data       Good work last time, Prospector. The Data was ...
21  Data Drive IV   1      Epic Data       In order to be able to predict Storm Behaviour...
22  Data Drive V    1      Legendary Data  Yes! The more precise Data you brought us was ...
23  Data Drive VI   3      Legendary Data  We're finding more than just Storm data now......
32  Data Drop       2      Rare Data       Prospector. One of our Scientists is convinced...
Like the previous posts, scraping is a tedious process of matching keywords and pulling the right tables, so that’s getting skipped; it’s mostly trial and error.
# game taken down
# urlDataDrives = 'https://thecyclefrontier.wiki/wiki/Utilities#Data_Drives-0'
# urlDataDrives = 'https://thecyclefrontier.wiki/wiki/Data_Drive'
urlDataDrives = 'https://archive.ph/zG8Q2'
siteDrive = pd.read_html(urlDataDrives, attrs={"class": "zebra"})[2]
We’ll use a modified version of the code we wrote before for parsing the loot table for this.
Since this is the second time that we’ve needed this - and we’re going to need this again - we should write a function to wrap this whole process.
# this is the function, where:
##  siteData: the table from the scraped site
##  columns: the columns you want to keep from the scraped data
##  adjust: the count from the bottom containing an index
##  step: how many rows between values we care about
##  offset: how many rewards are there?
def extractSite(siteData, columns, adjust, step, offset):
    if not isinstance(columns, list):
        print("Columns argument must be a list.")
        return None

    siteSubset = siteData[columns].copy()
    # np.nan rather than np.NaN, which newer NumPy releases no longer provide
    siteSubset = siteSubset.assign(Loot=np.nan)

    # Some extra error handling
    if not isinstance(adjust, int):
        print("adjust argument must be an int.")
        return None
    if not isinstance(step, int):
        print("step argument must be an int.")
        return None
    if not isinstance(offset, list):
        print("offset argument must be a list.")
        return None

    index = range(0, len(siteSubset) - adjust, step)
    offset = np.array(offset)

    # Copy each loot name onto the rows holding its rewards
    for i in index:
        aLoot = siteSubset.iloc[i, 1]
        indexes = i + offset
        siteSubset.iloc[indexes, len(siteSubset.columns) - 1] = aLoot

    # Forward fill the gaps; same result as fillna(method="ffill")
    tmp = siteSubset.iloc[:, 1:len(siteSubset.columns)]
    tmp = tmp.ffill()
    siteSubset.iloc[:, 1:len(siteSubset.columns) - 1] = tmp

    # Drop the rows which never received a loot name
    cutNA = siteSubset.Loot.isna()
    returnData = siteSubset[~cutNA]
    returnData = returnData.rename(columns={'Image': 'Unit', 'Name': 'Reward'})
    return returnData
Now we’ll sanity check this to make sure it works.
Perfect! Now we just add this to our list of adjustments, right? Sadly no. If you look at the names of the drives you’ll find that we’re not quite there. The drives were renamed in Season 2 but the values in our tasks table still use the old names. This is not too hard to fix since the old drive names were just the Rarity + Data, and we have the Rarity, so we’ll just need a new column.
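A minimal sketch of that new column is below. The `drives` frame is a stand-in for the scraped table, and both its drive names and the `OldName` column name are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the scraped drive table; the Loot values are placeholders
drives = pd.DataFrame({
    "Loot": ["Drive A", "Drive B", "Drive C"],
    "Rarity": ["Rare", "Epic", "Legendary"],
})

# Rebuild the pre-Season 2 name as Rarity + " Data"
drives = drives.assign(OldName=drives["Rarity"] + " Data")
print(drives)
```

The rebuilt `OldName` values line up with what the tasks table still uses (Rare Data, Epic Data, Legendary Data), so that column can serve as the join key.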
We’re going to move on to getting the gun data included since we actually already have it. This was part of another post, which is not included in this series.
# game taken down
# url = "https://thecyclefrontier.wiki/wiki/Weapons"
url = 'https://archive.ph/pM11n'
siteGun = pd.read_html(url, attrs={"class": "zebra"})[0]
# copy() so the assignment below does not write into a view of siteGun
gunData = siteGun[~siteGun.Type.isna()].copy()
indx = gunData['Proj. Speed'] == 'Hitscan'
gunData.loc[indx, 'Proj. Speed'] = np.nan
gunData = gunData.assign(
    Unit=gunData['Sell Value'].str.replace(' K-Marks', '').astype('float'),
    Reward="K-Marks",
    Loot=gunData['Name'])
# # This removes the legendary weapons
# data = data.query('Faction != "Printing"')
guns = gunData
guns[['Unit', 'Reward', 'Rarity', 'Loot']].head(15)
    Unit      Reward   Rarity     Loot
0   17429.0   K-Marks  Epic       Advocate
3   524.0     K-Marks  Common     AR-55 Autorifle
6   12341.0   K-Marks  Epic       Asp Flechette Gun
9   371.0     K-Marks  Common     B9 Trenchgun
12  63080.0   K-Marks  Exotic     Basilisk
15  1918.0    K-Marks  Uncommon   Bulldog
18  2052.0    K-Marks  Common     C-32 Bolt Action
21  17429.0   K-Marks  Epic       Gorgon
24  16805.0   K-Marks  Exotic     Hammer
27  7143.0    K-Marks  Rare       ICA Guarantee
30  228.0     K-Marks  Common     K-28
33  325513.0  K-Marks  Legendary  KARMA-1
36  22781.0   K-Marks  Epic       KBR Longshot
39  94459.0   K-Marks  Exotic     Kinetic Arbiter
42  1918.0    K-Marks  Uncommon   KM-9 'Scrapper'
Now we have the gun data to be added to the loot table but if you were paying attention you may have noticed something is wrong with one of our values.
# Not sure what this is from.
tasks.loc[tasks.loot.str.contains(' at')]
    name                 count  loot         description
35  Provide an Advocate  1      Advocate at  Our field agent requested better gear to take ...
The Advocate in this row is labeled as Advocate at which is another problem. In this instance, instead of trying to deal with the regex we’re just going to update that single value.
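A minimal sketch of that single-value update follows, using a one-row stand-in for the tasks table. It reuses the same `.loc` pattern as the Master Unit CPU fix.

```python
import pandas as pd

# One-row stand-in for the affected entry in the tasks table
tasks = pd.DataFrame({
    "name": ["Provide an Advocate"],
    "count": [1],
    "loot": ["Advocate at"],
})

# Overwrite only the rows whose loot is exactly the broken value
tasks.loc[tasks.loot == "Advocate at", "loot"] = "Advocate"
print(tasks)
```

Matching on the exact string rather than `str.contains(' at')` keeps the fix from touching any other loot name that happens to include those characters.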
We’re going to try to fix the Backpacks, the Shields and the Helmets together since they’re all on the same page of the wiki. Again, this data was not part of the previous tables that we had so we’ll need to go get it.
# game taken down:
# gearUrl = 'https://thecyclefrontier.wiki/wiki/Gear'
gearUrl = 'https://archive.ph/qeeTm'
site = pd.read_html(gearUrl)
This was tricky to collect - which is the only reason I’m pointing out how it was done. While working out how to collect the data from the website, there was no good strategy to collect just the data that I wanted. What you will see is that I use hard coded values to pull out where each table is.
What I did was enumerate through all the tables collected to find which results had outlier sizes. I then pulled those to make sure they were what I was after like this:
It is not pretty but web scraping rarely is. And, it works. Next we’ll start with the backpacks; with some tweaking of the values passed to the extractSite function which was defined earlier, this becomes really easy.
We have the same problem which we had for the Data Drives: the old name Rare Backpack is in the tasks table but is not how they are named here. Again, we’re just going to steal and modify the solution we had before and apply it here.
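The original cell is not reproduced here, so the following is only a sketch of the idea. The `site` list is a stand-in for what `pd.read_html(gearUrl)` returns, and the row threshold of 10 is an arbitrary assumption.

```python
import pandas as pd

# Stand-in for the list of tables returned by pd.read_html(gearUrl)
site = [
    pd.DataFrame({"a": [1]}),
    pd.DataFrame({"a": range(30), "b": range(30), "c": range(30)}),
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": range(24), "b": range(24), "c": range(24)}),
]

# Print the position and shape of every table to spot the large ones
for i, table in enumerate(site):
    print(i, table.shape)

# Keep the positions of tables above a hypothetical row threshold
candidates = [i for i, table in enumerate(site) if len(table) > 10]
print(candidates)
```

Each candidate still needs a manual look, for example `site[1].head()`, before its position is hard coded.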
This is much better. So, how many are missing now?
len(results.query("Cost == 0"))
4
Much better. But, there is actually a new problem here: masked missing values. If we look back at our data from before we’ll see that there was Helmet and Shield information missing and now it’s gone! Since we’ve fixed some of the values for those jobs the Balance is no longer 0 and therefore we’ve lost track of it! Let’s step back and look at the merged result for mistakes. Upon doing this, we run into our first masked problem.
There are all these NaN values at the bottom which were attached because the outer join includes all the rows from both tables. Can we simply change this over to a left join?
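To make the difference concrete, here is a small sketch with two stand-in tables. The job names and the join keys (`loot` on the left, `Loot` on the right) are assumptions about how the real merge is set up.

```python
import pandas as pd

# Stand-ins for the tasks table and the combined loot value table
tasks = pd.DataFrame({
    "name": ["Job A", "Job B"],
    "loot": ["Ball Bearings", "Shotgun Ammo"],
})
loot = pd.DataFrame({
    "Loot": ["Ball Bearings", "Resin Gun"],
    "Unit": [338.0, 759.0],
})

# outer keeps unmatched rows from both sides
outer = tasks.merge(loot, how="outer", left_on="loot", right_on="Loot")
# left keeps every task and only the loot rows that match one
left = tasks.merge(loot, how="left", left_on="loot", right_on="Loot")

print(len(outer), len(left))
```

With `how="left"` the Resin Gun row, which belongs to no task here, disappears, while the task with no known value stays and keeps its NaN so it can still be found.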
This is more in line with what I’d have expected. We’ll temporarily keep this - but mostly for discussion purposes. The problem here is that when you look through the wiki there are no sell values for the Shields and Helmets. We cannot get these from the wiki which means we cannot automate it. We could do this manually but then I’d have to constantly update this value when it changes - which I’m trying to avoid. This looks to be a case where we’re going to need to remove this job for now.
# Adding this to the pipeline:
idx = tmp.name.str.contains("Loadout Drop")
# ~ is the idiomatic way to invert a boolean mask
tmp = tmp.loc[~idx]
tmp
     name                            count  loot                 description                                         Loot                 Unit
0    New Mining Tools                2      Hydraulic Piston     We are producing new Mining Tools for new Pros...  Hydraulic Piston     338.0
1    New Mining Tools                10     Hardened Metals      We are producing new Mining Tools for new Pros...  Hardened Metals      150.0
2    Explosive Excavation            4      Derelict Explosives  One of our mines collapsed with valuable equip...  Derelict Explosives  1709.0
3    Mining Bot                      2      Zero Systems CPU     Our engineers have designed an autonomous mini...  Zero Systems CPU     506.0
4    Mining Bot                      3      Ball Bearings        Our engineers have designed an autonomous mini...  Ball Bearings        338.0
...  ...                             ...    ...                  ...                                                 ...                  ...
136  NEW-Hard-Osiris-EliteCrusher-1  1      Alpha Crusher        DESCRIPTION MISSING                                 NaN                  NaN
137  Flexible Sealant                8      Resin Gun            A number of our weather balloons took damage f...  Resin Gun            759.0
138  Indigenous Fruit                4      Indigenous Fruit     Ah, Prospector, have you come across any Indig...  Indigenous Fruit     759.0
139  Indigenous Fruit                4      Biological Sampler   Ah, Prospector, have you come across any Indig...  Biological Sampler   1139.0
140  Don't get crushed               3      Crusher Hide         Gear up, Prospector! We need Crusher Skins for...  Crusher Hide         11533.0
137 rows × 6 columns
Ok, so it looks like next will be solving the ammo listings.
tmp.loc[tmp.Loot.isna()]
     name                            count  loot               description                                         Loot  Unit
39   Excavation Gear                 1      Heavy Mining Tool  For excavations, we need you to stash a Heavy ...  NaN   NaN
46   And two smoking Barrels         200    Shotgun Ammo       Prospector. Get down there and stash a PKR Mae...  NaN   NaN
79   Grenadier                       1      Frag Grenade       Prospector. You have heard of Badum's Dead Dro...  NaN   NaN
83   Ammo Supplies                   1000   Medium Ammo        Our Field Agents need more Ammo if they are to...  NaN   NaN
136  NEW-Hard-Osiris-EliteCrusher-1  1      Alpha Crusher      DESCRIPTION MISSING                                 NaN   NaN
Deal with Ammo
# game taken down:
# ammoUrl = "https://thecyclefrontier.wiki/wiki/Ammo"
ammoUrl = 'https://archive.ph/Xacnz'
ammo = pd.read_html(ammoUrl)[0]
We’ve already done this before so this is simply here to show it was done. And, you add it to the pipeline just the same.
Sadly, the Heavy Mining Tool is on its own page so we’ll need something custom again. There is a table here we can pull but it’s oriented incorrectly for our use.
# game taken down:
# site = pd.read_html("https://thecyclefrontier.wiki/wiki/Heavy_Mining_Tool")
site = pd.read_html("https://archive.ph/c6QCk")
minerData = site[0]
minerData
   0               1
0  Description     Allows faster mining of materials.
1  Rarity          Common
2  Weight          30
3  Buy Value       600
4  Sell Value      180
5  Faction Points  2
Thankfully, a data frame has the same Transpose function that a matrix does. A simple explanation of this function is that it swaps the rows to columns and the columns to rows.
minerData[0].T
0 Description
1 Rarity
2 Weight
3 Buy Value
4 Sell Value
5 Faction Points
Name: 0, dtype: object
We’re going to extract the row values to create a new dataframe object. Since the Heavy Mining Tool designation is missing from the data, we’ll need to add it ourselves.
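One way this could look is sketched below. The `minerData` frame is rebuilt by hand from the output shown above, with the values held as strings (an assumption about what `read_html` returns), and the `Name` column is a hypothetical choice for where the designation goes.

```python
import pandas as pd

# Rebuilt by hand from the scraped output shown above
minerData = pd.DataFrame({
    0: ["Description", "Rarity", "Weight", "Buy Value",
        "Sell Value", "Faction Points"],
    1: ["Allows faster mining of materials.", "Common", "30",
        "600", "180", "2"],
})

# Use the labels in column 0 as the index, then transpose so each
# label becomes a column and the values form a single row
miner = minerData.set_index(0).T

# The item name is not in the table, so add it by hand
miner = miner.assign(Name="Heavy Mining Tool")
print(miner)
```

From here the `Sell Value` column can be converted to a number and renamed to match the other loot tables before it joins the pipeline.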