# install.packages("dplyr")
# install.packages("palaeoverse")
# install.packages("ggplot2")
# install.packages("rnaturalearth")
# install.packages("rnaturalearthdata")
# install.packages("deeptime")
# install.packages("rgplates")
# install.packages("fossilbrush")
library(dplyr)
library(palaeoverse)
library(ggplot2)
library(rnaturalearth)
library(rnaturalearthdata)
library(deeptime)
library(rgplates)
library(fossilbrush)
# Load data
fossils <- read.csv("cenozoic_crocs.csv")Data Processing I
Data Processing I: Data Exploration and Cleaning
Learning Objectives
In this session, the objectives are to (1) understand why data exploration and cleaning is key for data analyses and (2) develop the skills and knowledge needed to explore and clean data. We will cover:
- Exploratory data analyses
- Identifying and handling incomplete records
- Identifying and handling outliers
- Identifying and handling inconsistencies
- Identifying and handling duplicate records
Schedule
11:15–12:30
Data exploration
Why do we explore our data?
After acquiring the raw data to address your research question, a practical next step is to explore your data. Exploratory data analysis involves using graphical tools and basic statistical techniques to better understand the characteristics of your dataset, identify anomalies, and uncover patterns. This step is important for a variety of reasons:
- Reveal the structure and attributes of your dataset, such as variable types and distributions, numbers of observations, and spatial or temporal dependencies between observations.
- Highlight relationships between variables to guide future analyses and maximise statistical insights.
- Help you select appropriate statistical tools and verify their assumptions to avoid type I (false positive) and II (false negative) errors that might lead to incorrect conclusions.
- Flag systematic biases (e.g. taphonomic or sampling biases) that warrant careful consideration when interpreting your results.
- Reveal missing values, outliers, inconsistencies, duplication, and other unusual or erroneous values that require cleaning.
Together, exploratory data analysis is used to assess the quality and completeness of your dataset and gauge whether it can provide a meaningful and representative sample to address your research question. Without this step, you run the risk of applying inappropriate statistical techniques or making faulty inferences.
How do we explore our data?
Load packages and data
Before we start, we will load the R packages and data we need:
The first thing we want to do with our data is generate summary statistics and plots to help us understand the data and its various characteristics.
For example, we can look at the distribution of identification levels for our fossils.
# Count the frequency of taxonomic ranks
table(fossils$accepted_rank)
family genus species subfamily subgenus
15 624 850 3 1
subspecies superfamily unranked clade
2 30 718
# Calculate as percentages
(table(fossils$accepted_rank) / nrow(fossils)) * 100
family genus species subfamily subgenus
0.66874721 27.81988408 37.89567543 0.13374944 0.04458315
subspecies superfamily unranked clade
0.08916630 1.33749443 32.01069996
We can see that of our 2243 occurrences, 850 (~38%) are identified to species level. A further 624 (~28%) are identified to genus level. The remaining fossils are more coarsely identified, including 718 (~32%) which are identified to the mysterious level of “unranked clade”.
Next, let’s look at the distribution of fossils across localities. In the PBDB, fossils are placed within collections, each of which can roughly be considered a separate locality (they can also represent different sampling horizons at the same locality; more on this later). First, we can count the number of unique collection_no values to find out how many unique collections are in the dataset.
# What is the length of a vector of unique collection numbers?
length(unique(fossils$collection_no))[1] 1692
Our dataset contains 1692 unique collections.
We can also create a plot showing us the distribution of occurrences across these collections. First let’s tally up the number of occurrences in each collection.
# Count the number of times each collection number appears in the dataset
coll_no_freq <- as.data.frame(table(fossils$collection_no))
coll_no_freq Var1 Freq
1 3113 4
2 5241 1
3 11601 1
4 11803 1
5 12842 1
6 12847 1
7 12970 1
8 13063 1
9 13065 7
10 13079 2
11 13096 1
12 13127 1
13 13221 1
14 13265 1
15 13293 1
16 13318 2
17 13319 1
18 13322 6
19 13346 1
20 13456 1
21 13626 6
22 13686 1
23 13739 1
24 13747 1
25 13749 1
26 13758 1
27 13782 1
28 14658 3
29 14662 1
30 14670 2
31 14692 2
32 14713 1
33 14729 1
34 14730 1
35 14735 1
36 14736 2
37 14738 1
38 14739 2
39 14748 1
40 14762 1
41 14764 1
42 14766 2
43 14774 2
44 14790 2
45 14805 3
46 14830 2
47 14872 1
48 14873 2
49 14874 2
50 14883 1
51 14928 1
52 14956 1
53 14962 1
54 14963 1
55 14970 1
56 14975 1
57 14992 1
58 15000 2
59 15034 1
60 15049 1
61 15074 2
62 15091 1
63 15098 1
64 15108 1
65 15114 1
66 15119 1
67 15132 1
68 15147 1
69 15152 1
70 15155 1
71 15157 1
72 15159 1
73 15171 1
74 15173 2
75 15174 1
76 15176 1
77 15184 1
78 15190 1
79 15192 1
80 15208 2
81 15211 1
82 15217 2
83 15252 1
84 15454 1
85 15458 1
86 15583 1
87 15586 2
88 15587 1
89 15589 1
90 15590 1
91 15592 1
92 15593 1
93 15594 1
94 15595 3
95 15665 2
96 15668 1
97 15669 1
98 15682 1
99 15683 1
100 15687 1
101 15688 1
102 15692 1
103 15694 2
104 15695 1
105 15696 1
106 15697 1
107 15698 1
108 15699 1
109 15701 1
110 15702 1
111 15703 1
112 15704 1
113 15705 1
114 15706 1
115 15707 1
116 15718 1
117 15759 3
118 15760 1
119 15761 1
120 15771 1
121 15780 2
122 15793 1
123 15830 2
124 15832 1
125 15833 1
126 15837 4
127 15895 3
128 15914 1
129 15917 1
130 15997 1
131 16000 1
132 16134 1
133 16196 1
134 16202 1
135 16228 7
136 16235 1
137 16236 1
138 16244 1
139 16253 1
140 16261 1
141 16262 1
142 16264 2
143 16265 1
144 16266 1
145 16267 1
146 16268 2
147 16270 3
148 16272 3
149 16273 3
150 16274 1
151 16279 1
152 16280 1
153 16282 1
154 16296 2
155 16396 2
156 16405 1
157 16414 1
158 16438 1
159 16528 1
160 16542 3
161 16545 1
162 16549 1
163 16550 1
164 16586 1
165 16590 1
166 16607 1
167 16614 1
168 16623 2
169 16650 2
170 16651 1
171 16656 1
172 16662 1
173 16666 1
174 16669 2
175 16674 1
176 16675 1
177 16676 1
178 16687 1
179 16695 1
180 16712 1
181 16823 1
182 16842 1
183 16877 1
184 16885 1
185 16919 1
186 16938 1
187 17341 1
188 17476 1
189 17837 1
190 17865 2
191 17904 1
192 18004 1
193 18028 1
194 18031 1
195 18046 1
196 18284 1
197 18327 1
198 18357 1
199 18442 1
200 18443 1
201 18537 1
202 18539 2
203 18550 1
204 18554 2
205 18556 2
206 18560 2
207 18561 1
208 18564 1
209 18579 1
210 18586 2
211 18587 1
212 18597 1
213 18601 1
214 18610 1
215 18611 1
216 18613 1
217 18619 2
218 18623 1
219 18626 1
220 18627 1
221 18628 1
222 18630 1
223 18640 1
224 18645 1
225 18646 1
226 18663 2
227 18665 1
228 18759 1
229 19673 1
230 19750 1
231 19753 1
232 19764 1
233 20079 1
234 20302 1
235 20400 1
236 20402 1
237 20440 1
238 20491 1
239 20495 1
240 20530 1
241 20604 1
242 20613 1
243 20650 1
244 21282 1
245 21287 1
246 21294 1
247 21300 2
248 21332 2
249 21334 2
250 21335 1
251 21336 3
252 21337 1
253 21338 1
254 21339 1
255 21340 1
256 21341 2
257 21369 3
258 21377 3
259 21396 1
260 21414 3
261 21442 3
262 21446 3
263 21451 3
264 21453 3
265 21456 3
266 21466 3
267 21490 3
268 21495 1
269 21497 3
270 21499 3
271 21500 3
272 21513 3
273 21522 3
274 21524 3
275 21526 3
276 21531 3
277 21534 3
278 21538 3
279 21626 1
280 21627 1
281 21654 1
282 21688 2
283 21723 1
284 21724 1
285 21835 1
286 21836 1
287 21837 1
288 21839 1
289 21843 1
290 21848 1
291 21855 1
292 21962 1
293 21970 1
294 21973 1
295 21974 1
296 22041 1
297 22064 1
298 22069 1
299 22071 1
300 22077 1
301 22078 2
302 22081 1
303 22082 1
304 22089 1
305 22134 1
306 22138 1
307 22160 1
308 22174 1
309 22178 1
310 22196 1
311 22234 2
312 22235 3
313 22236 2
314 22237 2
315 22238 1
316 22239 2
317 22241 5
318 22242 3
319 22248 1
320 22265 1
321 22268 1
322 22269 1
323 22270 1
324 22271 1
325 22272 1
326 22273 1
327 22275 1
328 22276 1
329 22277 1
330 22278 1
331 22290 1
332 22300 1
333 22301 1
334 22302 1
335 22303 1
336 22307 1
337 22316 1
338 22329 1
339 22336 3
340 22338 2
341 22345 2
342 22346 1
343 22368 1
344 22376 2
345 22377 1
346 22423 1
347 22424 1
348 22444 1
349 22456 1
350 22486 1
351 22487 1
352 22491 1
353 22494 1
354 22495 3
355 22505 1
356 22506 2
357 22548 1
358 22552 1
359 22553 1
360 22554 1
361 22577 1
362 22578 1
363 22579 1
364 22593 1
365 22596 2
366 22614 1
367 22621 2
368 22626 2
369 22628 1
370 22664 1
371 22701 3
372 22924 1
373 23115 1
374 24181 3
375 24237 1
376 25654 1
377 26129 2
378 26550 1
379 26574 1
380 27077 1
381 27268 4
382 27269 4
383 27353 1
384 27508 1
385 28180 1
386 28194 1
387 28230 1
388 28292 1
389 28293 1
390 28368 1
391 28422 1
392 28747 2
393 31108 1
394 31173 1
395 31183 1
396 31316 1
397 31605 1
398 31606 1
399 31717 1
400 31735 1
401 31743 1
402 31748 1
403 31762 1
404 32051 2
405 32052 1
406 32053 2
407 32054 1
408 32055 2
409 32058 2
410 32060 1
411 32079 1
412 32085 2
413 32111 1
414 32116 2
415 32931 1
416 32948 1
417 34269 1
418 34615 2
419 34805 1
420 34845 1
421 34846 1
422 34847 1
423 35132 1
424 35275 1
425 35277 3
426 35279 1
427 35280 1
428 35281 2
429 35531 5
430 35986 1
431 36113 1
432 36114 1
433 36266 2
434 36628 1
435 36631 2
436 36633 1
437 36713 1
438 37239 1
439 37559 1
440 37703 2
441 38060 1
442 38076 2
443 38077 1
444 38172 1
445 38793 1
446 38814 1
447 38817 1
448 39014 1
449 39015 1
450 39209 3
451 39210 1
452 39643 2
453 39662 1
454 39665 1
455 39667 1
456 39892 1
457 40030 1
458 40358 1
459 40516 1
460 40758 1
461 40883 1
462 41073 2
463 41078 2
464 41308 1
465 41735 1
466 41810 1
467 41812 1
468 41813 1
469 41818 1
470 41992 1
471 41998 1
472 42067 1
473 42119 1
474 42130 1
475 43026 1
476 43030 1
477 43063 2
478 43400 1
479 44977 1
480 44999 1
481 45000 1
482 45002 3
483 45097 1
484 45235 1
485 45237 2
486 45239 1
487 45317 1
488 45326 1
489 45327 1
490 45452 1
491 45469 1
492 45485 3
493 45590 1
494 45591 1
495 45594 1
496 45597 1
497 45598 1
498 45640 1
499 45652 1
500 45777 1
501 45801 1
502 46084 1
503 46097 1
504 46197 1
505 46636 1
506 46637 1
507 46639 1
508 46641 1
509 46642 1
510 46644 1
511 46645 1
512 46646 1
513 46649 1
514 46650 1
515 46651 1
516 46652 1
517 46653 1
518 46654 1
519 46659 1
520 46661 1
521 46663 1
522 46672 1
523 46676 1
524 46677 1
525 46678 1
526 46679 1
527 46682 1
528 46683 1
529 46685 1
530 46686 1
531 46687 1
532 46688 1
533 46689 1
534 46690 1
535 46693 1
536 46694 1
537 46697 1
538 46698 1
539 46699 1
540 46700 2
541 46704 1
542 46705 1
543 46707 1
544 46708 1
545 46711 1
546 46720 1
547 46721 1
548 46726 1
549 46727 1
550 46728 1
551 46730 1
552 46731 1
553 46732 1
554 46733 1
555 46734 1
556 46735 1
557 46736 1
558 46738 1
559 46739 1
560 46740 1
561 46743 1
562 46748 1
563 46749 1
564 46753 1
565 46754 1
566 46757 1
567 46758 1
568 46759 1
569 46760 1
570 46761 1
571 46763 1
572 46764 1
573 46766 1
574 46767 1
575 46768 1
576 46769 1
577 46771 1
578 46772 1
579 46773 1
580 46776 1
581 46777 1
582 46778 1
583 46780 1
584 46782 1
585 46783 1
586 46784 1
587 46785 1
588 46786 1
589 46787 1
590 46788 1
591 46789 1
592 46790 1
593 46791 1
594 46792 1
595 46793 1
596 46794 1
597 46795 1
598 46797 1
599 46798 1
600 46800 1
601 46803 1
602 46804 1
603 46807 1
604 46808 1
605 46809 1
606 46810 1
607 46812 1
608 46813 1
609 46814 1
610 46816 1
611 46817 1
612 46819 1
613 46821 1
614 46824 1
615 46825 1
616 46833 1
617 46834 1
618 46835 1
619 46836 1
620 46837 1
621 46840 1
622 46841 1
623 46845 1
624 46846 1
625 46849 1
626 46852 1
627 46955 1
628 47021 1
629 47024 1
630 47026 1
631 47069 1
632 47074 1
633 47086 1
634 47091 1
635 47095 1
636 48029 1
637 48074 1
638 48077 2
639 48159 1
640 48173 1
641 48325 1
642 48638 1
643 48656 1
644 48657 1
645 48667 1
646 48676 1
647 48681 2
648 48696 1
649 48706 1
650 48712 1
651 49078 5
652 49083 1
653 49102 1
654 49403 1
655 49882 1
656 49942 1
657 51104 1
658 51106 1
659 51266 2
660 51412 2
661 51413 1
662 51586 1
663 53001 1
664 53970 3
665 55250 1
666 55396 2
667 55449 2
668 55530 1
669 55600 9
670 55602 15
671 56976 1
672 57007 2
673 57700 2
674 57782 1
675 57989 1
676 58089 1
677 58454 2
678 58455 1
679 59088 2
680 59129 1
681 59839 3
682 59904 1
683 59906 1
684 60443 1
685 63515 1
686 63519 1
687 63520 7
688 64376 1
689 64377 1
690 64378 1
691 64382 1
692 65037 1
693 65140 1
694 65143 2
695 65149 1
696 65161 1
697 65162 1
698 65163 1
699 65170 2
700 65181 1
701 65205 1
702 65367 1
703 65405 2
704 65798 1
705 65912 1
706 65943 1
707 65960 3
708 67172 1
709 67384 1
710 67385 4
711 67386 3
712 67634 2
713 67706 1
714 68032 1
715 68069 1
716 68174 4
717 68428 2
718 68437 4
719 68440 3
720 68795 1
721 69098 1
722 69099 1
723 69539 1
724 69540 1
725 69920 1
726 70118 1
727 70129 1
728 70257 1
729 70266 2
730 70338 1
731 70359 1
732 70360 1
733 70362 1
734 70363 1
735 70364 1
736 70808 2
737 70814 1
738 70827 1
739 70828 1
740 70833 1
741 71275 1
742 71302 1
743 71332 1
744 71651 1
745 71801 1
746 71817 1
747 71819 1
748 71942 2
749 72045 1
750 72184 1
751 73686 1
752 73965 1
753 74097 1
754 74098 1
755 74361 1
756 74470 1
757 74505 1
758 74555 1
759 74556 1
760 74643 2
761 74737 1
762 75078 1
763 75282 1
764 75427 1
765 75638 1
766 75945 1
767 75974 1
768 76029 1
769 76063 1
770 76064 1
771 76066 1
772 76067 1
773 76129 1
774 76844 2
775 76981 1
776 77596 1
777 77776 1
778 77777 2
779 77778 1
780 77779 3
781 77780 1
782 77784 1
783 78233 2
784 78249 1
785 78300 1
786 78518 1
787 79200 1
788 79323 1
789 79700 1
790 79701 1
791 79704 1
792 79757 1
793 79793 1
794 81455 1
795 81498 1
796 81604 1
797 83303 1
798 83304 1
799 83305 1
800 83629 1
801 83938 1
802 83984 1
803 84022 1
804 84045 2
805 84048 1
806 84364 1
807 84511 1
808 84548 1
809 84549 1
810 84606 1
811 84793 3
812 85142 1
813 87489 2
814 87925 1
815 87947 1
816 89015 1
817 89444 2
818 90354 2
819 90576 1
820 90577 2
821 91316 1
822 91467 1
823 91575 1
824 92070 4
825 92641 4
826 92702 1
827 92703 2
828 92732 1
829 92733 1
830 92751 1
831 92839 2
832 92904 1
833 93103 1
834 93265 1
835 93379 1
836 93521 2
837 93522 2
838 93525 1
839 93543 1
840 93546 1
841 93548 1
842 93567 1
843 93587 1
844 93623 1
845 93668 1
846 93670 1
847 93697 1
848 93752 1
849 94595 1
850 95826 1
851 95863 1
852 95914 4
853 96702 3
854 96800 1
855 96899 1
856 96952 1
857 97257 1
858 97550 1
859 98313 1
860 99403 1
861 99707 1
862 99708 1
863 99709 1
864 99711 1
865 99821 1
866 99855 1
867 99856 1
868 100330 1
869 105889 3
870 106003 1
871 106061 1
872 107992 5
873 110177 1
874 110965 1
875 112590 1
876 113118 1
877 113275 1
878 113430 1
879 113685 1
880 113769 1
881 113770 1
882 114013 1
883 114014 1
884 114104 1
885 114105 1
886 114149 1
887 114685 1
888 115113 1
889 115148 4
890 115152 2
891 118015 1
892 118016 1
893 118037 1
894 118038 1
895 118314 1
896 118406 2
897 118482 1
898 120036 4
899 120798 1
900 120885 3
901 120887 1
902 122033 2
903 122454 2
904 122852 1
905 123270 1
906 123296 1
907 123371 1
908 123372 1
909 123373 1
910 123374 2
911 123375 1
912 123376 1
913 123483 1
914 123511 1
915 123512 1
916 123513 1
917 123514 1
918 123515 1
919 123518 1
920 123800 1
921 123919 1
922 124191 1
923 124717 1
924 124905 1
925 124963 1
926 124964 1
927 125091 2
928 126289 1
929 126605 1
930 126866 1
931 131859 2
932 132580 2
933 132938 1
934 133440 2
935 133648 1
936 133692 2
937 133693 1
938 133780 1
939 134806 2
940 134945 2
941 134954 1
942 135038 1
943 135348 2
944 135373 1
945 135688 1
946 135706 6
947 135707 1
948 135911 1
949 135917 1
950 136148 1
951 136269 1
952 136334 1
953 136511 2
954 136521 1
955 136714 5
956 136717 8
957 136720 1
958 136776 6
959 136777 6
960 138404 2
961 138825 1
962 140051 1
963 140202 1
964 140377 1
965 141848 1
966 142030 1
967 142340 1
968 142889 1
969 142891 1
970 142893 1
971 143057 2
972 143060 1
973 143207 1
974 143208 1
975 143209 1
976 143319 1
977 143321 3
978 143324 1
979 143343 1
980 143358 1
981 143359 1
982 143377 1
983 143459 1
984 143461 3
985 143464 1
986 143470 3
987 143472 1
988 143473 1
989 143475 1
990 143476 1
991 143477 1
992 143478 1
993 143479 1
994 143480 1
995 143482 1
996 143483 1
997 143484 1
998 143487 1
999 143488 1
1000 143490 1
1001 143515 1
1002 143516 1
1003 143517 1
1004 143518 3
1005 143524 1
1006 143531 1
1007 143552 1
1008 143586 1
1009 143603 1
1010 143727 1
1011 143748 2
1012 143754 1
1013 143780 2
1014 143781 1
1015 143782 2
1016 143783 1
1017 143784 2
1018 143786 3
1019 143788 1
1020 143792 1
1021 143795 1
1022 143796 1
1023 143884 1
1024 143885 2
1025 144046 2
1026 144047 1
1027 144065 1
1028 144099 1
1029 144102 1
1030 144103 1
1031 144104 1
1032 144105 3
1033 144106 1
1034 144107 1
1035 144108 1
1036 144109 2
1037 144110 1
1038 144111 1
1039 144113 1
1040 144115 1
1041 144116 1
1042 144117 1
1043 144118 1
1044 144119 2
1045 144124 1
1046 144125 1
1047 144127 1
1048 144130 1
1049 144131 1
1050 144132 1
1051 144133 1
1052 144134 1
1053 144135 1
1054 144173 1
1055 144174 1
1056 144175 1
1057 144176 4
1058 144177 1
1059 144178 1
1060 144179 1
1061 144180 1
1062 144181 1
1063 144182 1
1064 144183 1
1065 144184 1
1066 144185 1
1067 144187 2
1068 144195 1
1069 144196 1
1070 144200 1
1071 144202 1
1072 144203 1
1073 144260 1
1074 144261 1
1075 144262 1
1076 144263 1
1077 144444 1
1078 144474 3
1079 144511 1
1080 144512 2
1081 144514 1
1082 144516 1
1083 144517 1
1084 144536 1
1085 144537 2
1086 144538 1
1087 144540 1
1088 144541 1
1089 144542 1
1090 144543 1
1091 144544 1
1092 144545 1
1093 144546 1
1094 144547 1
1095 144548 1
1096 144549 1
1097 144550 1
1098 144552 1
1099 144553 1
1100 144554 1
1101 144555 1
1102 144556 1
1103 144557 1
1104 144558 2
1105 144559 1
1106 144560 1
1107 144562 1
1108 144563 1
1109 144564 1
1110 144565 1
1111 144566 2
1112 144584 1
1113 144585 1
1114 144586 2
1115 144588 1
1116 144593 1
1117 144595 3
1118 144596 1
1119 144638 1
1120 144643 1
1121 144645 2
1122 144646 1
1123 144649 1
1124 144662 1
1125 144663 1
1126 144739 2
1127 144788 1
1128 144977 4
1129 145618 1
1130 145619 1
1131 145620 1
1132 145621 1
1133 145622 1
1134 145811 1
1135 146600 1
1136 146988 1
1137 147191 1
1138 147461 2
1139 147462 4
1140 147463 6
1141 147464 3
1142 147465 5
1143 147466 5
1144 147467 1
1145 148381 1
1146 148384 1
1147 148387 1
1148 148388 1
1149 148390 1
1150 148393 1
1151 149168 4
1152 149173 1
1153 151620 1
1154 151737 1
1155 152082 1
1156 152113 1
1157 152870 1
1158 153738 1
1159 153741 1
1160 153743 1
1161 153745 1
1162 153746 1
1163 153747 1
1164 153748 1
1165 153749 1
1166 153750 1
1167 153751 1
1168 153752 1
1169 153753 1
1170 153754 1
1171 153756 1
1172 153757 1
1173 153758 1
1174 153759 1
1175 153760 1
1176 153761 1
1177 153763 1
1178 153775 1
1179 153776 1
1180 153777 1
1181 153779 1
1182 153818 1
1183 153820 1
1184 153821 1
1185 153822 1
1186 153823 1
1187 153824 1
1188 154035 1
1189 154221 1
1190 154222 1
1191 154724 1
1192 154804 1
1193 154805 1
1194 154921 1
1195 154960 1
1196 155253 1
1197 155752 1
1198 156789 1
1199 158991 1
1200 160726 1
1201 161374 1
1202 161377 1
1203 162464 1
1204 162465 1
1205 163851 1
1206 163852 1
1207 164388 1
1208 164395 1
1209 165176 1
1210 166506 3
1211 166507 2
1212 166508 3
1213 166598 1
1214 166767 8
1215 166768 7
1216 166769 4
1217 166770 4
1218 166975 1
1219 167230 1
1220 167354 1
1221 167729 1
1222 167953 1
1223 169420 2
1224 169436 1
1225 170185 1
1226 172914 1
1227 172927 1
1228 174065 1
1229 174066 1
1230 174226 1
1231 174672 1
1232 174673 1
1233 174730 1
1234 175334 1
1235 175727 1
1236 175734 1
1237 176127 1
1238 176128 1
1239 176129 3
1240 176130 3
1241 176132 2
1242 176133 2
1243 176135 1
1244 176140 1
1245 176143 2
1246 176145 1
1247 176147 3
1248 176148 1
1249 176149 3
1250 176156 2
1251 176162 3
1252 176208 1
1253 176226 3
1254 176664 1
1255 176750 1
1256 176824 1
1257 176825 1
1258 176826 3
1259 176827 1
1260 176828 2
1261 176829 1
1262 176868 1
1263 176936 2
1264 176937 1
1265 177036 1
1266 177494 1
1267 179000 1
1268 179001 1
1269 179481 1
1270 179692 3
1271 179707 1
1272 179708 1
1273 179709 1
1274 179710 1
1275 179732 1
1276 179811 2
1277 179812 1
1278 179933 1
1279 180111 1
1280 180192 1
1281 180576 1
1282 180578 1
1283 180579 1
1284 180580 1
1285 180582 1
1286 180583 1
1287 180591 1
1288 180592 1
1289 180594 1
1290 180599 1
1291 180603 1
1292 180611 1
1293 180613 1
1294 180614 1
1295 180616 1
1296 180617 1
1297 180624 1
1298 180626 1
1299 180629 1
1300 180630 1
1301 180691 2
1302 180692 4
1303 180791 1
1304 181304 2
1305 181308 1
1306 181309 1
1307 181310 1
1308 181515 1
1309 181845 1
1310 182159 1
1311 182505 1
1312 182519 1
1313 182565 1
1314 182619 1
1315 182641 2
1316 182667 1
1317 182671 1
1318 182805 1
1319 182916 1
1320 182999 1
1321 183008 2
1322 183009 3
1323 183010 3
1324 183080 1
1325 183151 1
1326 183364 1
1327 183674 1
1328 183918 1
1329 183976 1
1330 183979 1
1331 184227 1
1332 184275 2
1333 184468 1
1334 185323 1
1335 185615 1
1336 185616 1
1337 185681 1
1338 186359 1
1339 186362 1
1340 186558 1
1341 186636 1
1342 186845 1
1343 186852 1
1344 186888 1
1345 186894 1
1346 187320 1
1347 187324 2
1348 187580 3
1349 187583 1
1350 187585 1
1351 187873 1
1352 187874 1
1353 188304 1
1354 188748 1
1355 188761 1
1356 188992 1
1357 189440 3
1358 189577 1
1359 189788 1
1360 189790 1
1361 190038 1
1362 190041 1
1363 190086 1
1364 190120 1
1365 190535 1
1366 190744 1
1367 190965 1
1368 191113 1
1369 191222 1
1370 191365 1
1371 191932 2
1372 192070 1
1373 193126 1
1374 193640 1
1375 193641 1
1376 193642 1
1377 193643 1
1378 193644 1
1379 193645 1
1380 193648 1
1381 193650 1
1382 193652 1
1383 193654 1
1384 193655 1
1385 193656 1
1386 193657 1
1387 193659 1
1388 193662 1
1389 193664 1
1390 193667 1
1391 193670 1
1392 193988 1
1393 193989 1
1394 193990 1
1395 193998 1
1396 193999 5
1397 194000 2
1398 194001 1
1399 194002 1
1400 194073 1
1401 194574 1
1402 194579 1
1403 194580 1
1404 194726 1
1405 194846 1
1406 195083 1
1407 195288 1
1408 195449 1
1409 195457 2
1410 195517 1
1411 195859 1
1412 196739 1
1413 196740 1
1414 196742 1
1415 198248 1
1416 198251 1
1417 198638 1
1418 198647 1
1419 198648 1
1420 198649 1
1421 198650 1
1422 198651 1
1423 198652 1
1424 198653 1
1425 198654 1
1426 198655 1
1427 198656 1
1428 198657 1
1429 198658 1
1430 198659 1
1431 198660 1
1432 198661 1
1433 198662 1
1434 198812 3
1435 198813 1
1436 198814 2
1437 198815 1
1438 198816 2
1439 198892 1
1440 198893 1
1441 199213 1
1442 199559 2
1443 199560 3
1444 199562 3
1445 200044 1
1446 200110 1
1447 200111 4
1448 201834 1
1449 201840 1
1450 201849 1
1451 201850 1
1452 201975 1
1453 201976 1
1454 202068 1
1455 202336 1
1456 202337 1
1457 203089 1
1458 203090 1
1459 203092 1
1460 203093 1
1461 203112 5
1462 203629 1
1463 204596 1
1464 204645 1
1465 204646 1
1466 205187 1
1467 205815 1
1468 205883 1
1469 205884 1
1470 205897 1
1471 205898 1
1472 205912 1
1473 205916 1
1474 205919 1
1475 205920 1
1476 205927 1
1477 205929 1
1478 205943 1
1479 206017 1
1480 206042 1
1481 206081 1
1482 206083 1
1483 206184 1
1484 206185 1
1485 206532 1
1486 206776 1
1487 207049 4
1488 207452 1
1489 207484 1
1490 207489 1
1491 207493 1
1492 207764 2
1493 207765 1
1494 208786 1
1495 208787 1
1496 209089 4
1497 209370 1
1498 209371 1
1499 209873 1
1500 209877 1
1501 209878 1
1502 209898 1
1503 209904 1
1504 209906 1
1505 209913 1
1506 209918 1
1507 209922 1
1508 209923 1
1509 209932 1
1510 209936 1
1511 209938 1
1512 209941 1
1513 209947 1
1514 209948 1
1515 209965 1
1516 209971 1
1517 209975 1
1518 209976 1
1519 209979 1
1520 209981 1
1521 209987 1
1522 209988 1
1523 209990 1
1524 209992 1
1525 210009 1
1526 210023 1
1527 210027 1
1528 210032 1
1529 210038 1
1530 210039 1
1531 210043 1
1532 210044 1
1533 210045 1
1534 210193 1
1535 210194 1
1536 210195 3
1537 210197 4
1538 210198 3
1539 210199 4
1540 210951 2
1541 210952 1
1542 210953 1
1543 210954 1
1544 210955 1
1545 210956 1
1546 210957 1
1547 211299 1
1548 211306 1
1549 211787 1
1550 211915 1
1551 212251 1
1552 212253 1
1553 213137 1
1554 213296 1
1555 215569 2
1556 217014 1
1557 217616 1
1558 217618 1
1559 217775 1
1560 217776 2
1561 217781 2
1562 217782 1
1563 217788 1
1564 217817 1
1565 217818 1
1566 217824 1
1567 217876 1
1568 217877 2
1569 219022 1
1570 219023 2
1571 219024 2
1572 219025 1
1573 219026 1
1574 219027 2
1575 219762 1
1576 219885 3
1577 219967 1
1578 220175 2
1579 220590 1
1580 220617 1
1581 221108 1
1582 221117 3
1583 221118 1
1584 221119 1
1585 221120 1
1586 222064 1
1587 224007 1
1588 224009 4
1589 224010 5
1590 224012 1
1591 224013 3
1592 224112 1
1593 224113 1
1594 224114 1
1595 224115 1
1596 224116 1
1597 224117 1
1598 224118 1
1599 224119 1
1600 224120 1
1601 224121 1
1602 224122 1
1603 224123 1
1604 224124 1
1605 224125 1
1606 224126 1
1607 224127 1
1608 224128 1
1609 224129 1
1610 224130 1
1611 224131 1
1612 224132 1
1613 224301 2
1614 224304 1
1615 224305 1
1616 224306 1
1617 224308 1
1618 224309 1
1619 224310 1
1620 224311 2
1621 224312 1
1622 224313 1
1623 224326 1
1624 224327 1
1625 224328 1
1626 224332 1
1627 224333 1
1628 224418 1
1629 224505 1
1630 224506 1
1631 224507 1
1632 224509 1
1633 224510 1
1634 224511 1
1635 224594 1
1636 224597 2
1637 224598 1
1638 224654 1
1639 224655 2
1640 224657 2
1641 225490 1
1642 225491 1
1643 225492 1
1644 225493 1
1645 225494 1
1646 225495 1
1647 225589 1
1648 225590 2
1649 225891 1
1650 225946 1
1651 225947 1
1652 225948 1
1653 226046 2
1654 226065 1
1655 226738 1
1656 227021 1
1657 227328 1
1658 227368 1
1659 227371 1
1660 227934 1
1661 227935 1
1662 228148 1
1663 230003 1
1664 230004 1
1665 230115 1
1666 230116 2
1667 230675 1
1668 230753 1
1669 230754 1
1670 230759 1
1671 231878 1
1672 232040 1
1673 232370 1
1674 232486 3
1675 234561 1
1676 234666 1
1677 235737 1
1678 235747 3
1679 236300 1
1680 236310 1
1681 238106 1
1682 238108 1
1683 238109 1
1684 238110 1
1685 239734 1
1686 239764 4
1687 239770 1
1688 239799 1
1689 239809 1
1690 239845 1
1691 239955 1
1692 241377 1
Next, we’ll use the ggplot2 package, the go-to for professional-looking data visualizations in R, to visualize the frequency of collections with various numbers of occurrences.
# Plot the distribution of number of occurrences per collection
ggplot(coll_no_freq, aes(x = Freq)) +
geom_bar() +
labs(x = "Number of occurrences",
y = "Frequency")
We can see that the collection containing the most occurrences has 15, while the vast majority only contain a single occurrence.
What about the countries in which these fossils were found? We can investigate this using the “cc”, or “country code” column.
# List unique country codes, and count them
unique(fossils$cc) [1] "US" "NC" "CN" "IN" "CA" "KE" "AU" NA "TD" "TZ" "CD" "ET" "UG" "MW" "DJ"
[16] "ZA" "PA" "FJ" "PE" "FR" "MA" "IT" "TN" "PK" "PG" "BE" "PT" "RU" "AR" "ES"
[31] "UK" "IL" "DE" "IQ" "SA" "LY" "VE" "KZ" "NP" "BR" "MG" "PR" "AT" "JM" "EG"
[46] "TH" "MX" "ID" "AQ" "CH" "CR" "SV" "TW" "NE" "TR" "CZ" "MM" "DK" "SE" "UA"
[61] "PL" "CO" "SK" "GT" "VU" "SC" "JP" "KY" "AE" "CU" "MT" "BS" "VN" "NZ" "OM"
[76] "GR" "ER" "PY" "EH" "DO" "RO" "SD" "ML" "BA" "SN" "MN" "BG" "HU" "LK"
length(unique(fossils$cc))[1] 89
Here we can see that Cenozoic crocodiles have been found in 89 different countries. Let’s sort those values alphabetically to help us find specific countries.
# List and sort unique country codes, and count them
sort(unique(fossils$cc)) [1] "AE" "AQ" "AR" "AT" "AU" "BA" "BE" "BG" "BR" "BS" "CA" "CD" "CH" "CN" "CO"
[16] "CR" "CU" "CZ" "DE" "DJ" "DK" "DO" "EG" "EH" "ER" "ES" "ET" "FJ" "FR" "GR"
[31] "GT" "HU" "ID" "IL" "IN" "IQ" "IT" "JM" "JP" "KE" "KY" "KZ" "LK" "LY" "MA"
[46] "MG" "ML" "MM" "MN" "MT" "MW" "MX" "NC" "NE" "NP" "NZ" "OM" "PA" "PE" "PG"
[61] "PK" "PL" "PR" "PT" "PY" "RO" "RU" "SA" "SC" "SD" "SE" "SK" "SN" "SV" "TD"
[76] "TH" "TN" "TR" "TW" "TZ" "UA" "UG" "UK" "US" "VE" "VN" "VU" "ZA"
length(sort(unique(fossils$cc)))[1] 88
Something weird has happened here: we can see that once the countries have been sorted, one of them has disappeared. Why? We will come back to this during our data cleaning.
Practical
Now it’s your turn! Explore the data yourself:
What is the geographic scale of our data? (hint: geogscale column)
What is the stratigraphic scale of our data? (hint: stratscale column)
What proportion of our occurrences are marine crocodiles? (hint: taxon_environment column)
Data cleaning
Incomplete data records
Datasets are rarely perfect. A common issue you may encounter when exploring your data is ambiguous, incomplete, or missing data entries. These incomplete or missing data records can occur due to various reasons. In some cases, the data truly do not exist or cannot be estimated due to issues relating to taphonomy, collection approaches, or biases in the fossil record. In other cases, discrepancies may arise because data were collected when definitions or contexts differed, such as shifts in geopolitical boundaries and country names over time. Additionally, data may be incomplete for some records, but can be inferred through other available data.
Why is it important?
Missing information can bias the results of palaeobiological studies. Occurrence data are inherently based on the existence of a particular fossil, but missing data associated with that fossil occurrence can also affect analyses that rely on that associated data. For instance, missing temporal or spatial data may prevent you from including occurrences in your temporal or geographic range analyses.
What should we do with incomplete data records?
Depending on your research goals, incomplete entries may either be removed through filtering or addressed through imputation techniques. Data imputation approaches can be used to replace missing data with values modelled on the observed data using various methods. These can range from simple approaches, like replacing missing values with the mean for continuous variables, to more advanced statistical or machine learning techniques. If you do decide to impute missing data, it is essential that this process and its effects on the dataset are clearly justified and documented so that future users of the dataset or analytical results are aware of these decisions. Although missing data can reduce the statistical power of analyses and bias the results, imputing missing values can introduce new biases, potentially also skewing results and interpretations of the examined data.
To decide how to handle missing data, start by identifying the gaps in your dataset, which are often represented by empty entries or ‘NA’. For imputing missing values, numerous methods and tools are available in your coding language of choice, such as missForest, mice, and kNN. Removing missing data can be straightforward when working with small datasets. For manual removal, tools such as spreadsheet software can be sufficient. In R, built-in functions such as complete.cases() and na.omit() quickly identify and remove missing values (caution: this will remove whole rows of data). The tidyr package also provides the drop_na() function for this purpose.
Identify and handle incomplete data records
By default, when we read data tables into R, it recognises empty cells and takes some course of action to manage them. When we use base R functions, such as read.csv(), empty cells are given an NA value (‘not available’) only when the column is considered to contain numerical data. When we use Tidyverse functions, such as readr::read_csv(), all empty cells are given NA values. This is important to bear in mind when we want to find those missing values: here, we have done the latter, so all empty cells are NA.
The extent of incompleteness of the different columns in our dataset is highly variable. For example, the number of NA values for the collection_no is 0.
# Count the number of collection number values for which `is.na()` is TRUE
sum(is.na(fossils$collection_no))[1] 0
This is because it is impossible to add an occurrence to the PBDB without putting it in a collection, which must in turn have an identification number.
However, what about genus?
# Count the number of genus IDs for which `is.na()` is TRUE
sum(is.na(fossils$genus))[1] 766
What other columns might we want to check?
# Latitude
sum(is.na(fossils$lat))[1] 0
# Palaeolatitude
sum(is.na(fossils$paleolat))[1] 234
# Geological formations
sum(is.na(fossils$formation))[1] 571
# Country code
sum(is.na(fossils$cc))[1] 5
OK, so we’ve identified some incomplete data records, what do we do now? We have three options:
- Filter (i.e. remove records)
- Impute (i.e. complete records with substituted values)
- Complete (i.e. complete records with ‘true’ values)
Filter
While all occurrences have present-day coordinates, some are missing palaeocoordinates. We could easily remove these occurrences from the dataset.
# Remove occurrences which are missing palaeocoordinates
fossils <- filter(fossils, !is.na(fossils$paleolat))
# Check whether this has worked
sum(is.na(fossils$paleolat))[1] 0
A further option applicable in some cases would be to fill in our missing data. We may be able to interpolate values from the rest of our data, or use additional data sources. For our palaeogeography example above, we could generate our own palaeocoordinates, for example using palaeoverse::palaeorotate().
Impute
Data imputation is the process of replacing missing values in a dataset with substituted values. How might we do this for our formation names?
- We could estimate potential formations by using geographic coordinates to extract formations from a geological map.
- We could evaluate whether any nearby collections of the same age have associated formation names.
However, while a useful technique, data imputation does carry a level of uncertainty and can also bias our analyses. In this example, it might be preferable to trace back to the original literature and try to resolve this issue more robustly if the source material allows.
Complete
For example, the formation data for collection 18539 are missing, so we could go back to the original desciptive literature to complete the data for this collection. In doing so, we’ve discovered that occurrences from collection 18539 are from the Bone Valley Formation. We can now programmatically update our data. We could also do this manually in spreadsheet software, but through coding, we can track and document all the changes we’ve made to the dataset with ease!
# Add formation name
fossils[which(fossils$collection_no == "18539"), "formation"][1] NA NA
fossils[which(fossils$collection_no == "18539"), "formation"] <- "Bone Valley Formation"
fossils[which(fossils$collection_no == "18539"), "formation"][1] "Bone Valley Formation" "Bone Valley Formation"
We identified several data records without country codes. We could quickly filter this data, it’s not that much data after all. But you’ve just remembered something! The country where the collection is located is a compulsory data entry field in the PBDB! What on Earth has gone wrong?
Any guesses on what the country code for NAmibia is?
R has interpreted Namibia’s country code as a ‘NA’ value.
This is an important illustration of why we should conduct further investigation when any apparent errors arise in the dataset, rather than immediately removing these data points.
Outlier data records
Why is it important?
Outliers are data points that significantly deviate from other values in a dataset. Similar to missing information, outliers can bias the results of palaeobiological studies and can occur due to various reasons, including errors in data collection, measurement, processing, or even just natural variations within the data. For instance, when considering the temporal range of a taxonomic group based on occurrence data, an outlier could represent an issue with data entry (e.g. wrong taxonomic name or age entered) or a hiatus in favourable preservation conditions.
What should we do with outliers?
Identifying and handling outliers is an important part of data preparation and cleaning, and they typically become apparent when conducting exploratory data analysis. For numerical data, a simple box plot can often be useful for identifying outliers where typically the ‘whiskers’ are quantified based on some range of values describing the data, and any points lying outside of this range are plotted as individual outliers. In general, when in doubt, visualise and summarise your data.
But what should we do with outliers once they have been identified? Depends.
- How extreme is the outlier?
- Do we suspect it is an error? Can it be corrected (e.g. going to the source material) or removed?
- Do we have a good reason for retaining the data record for our analyses?
- How does it impact our results?
Identify and handle outliers
To provide an example on identifying and handling outliers, we we will focus in on the specific variables which relate to our scientific question, i.e. the geography of our fossil occurrences. First we’ll plot where the crocodile fossils have been found across the globe: how does this match what we already know from the country codes?
# Load in a world map
world <- ne_countries(scale = "medium", returnclass = "sf")
# Plot the geographic coordinates of each locality over the world map
ggplot(fossils) +
geom_sf(data = world) +
geom_point(aes(x = lng, y = lat),
shape = 21, size = 0.75, colour = "black", fill = "purple3") +
labs(x = "Longitude (º)",
y = "Latitude (º)")
We have a large density of crocodile occurrences in Europe and the western interior of the United States, along with a smattering of occurrences across the other continents. This distribution seems to fit our previous knowledge, that the occurrences are spread across 89 countries. However, the crocodile occurrences in Antarctica seem particularly suspicious: crocodiles need a warm climate, and modern-day Antarctica certainly doesn’t fit this description. Let’s investigate further. We’ll do this by plotting the latitude of the occurrences through time.
# Add a column to the data frame with the midpoint of the fossil ages
fossils <- mutate(fossils, mid_ma = (min_ma + max_ma) / 2)
# Create dataset containing only Antarctic fossils
antarctic <- filter(fossils, cc == "AQ")
# Plot the age of each occurrence against its latitude
ggplot(fossils, aes(x = mid_ma, y = lat)) +
geom_point(colour = "black") +
geom_point(data = antarctic, colour = "red") +
labs(x = "Age (Ma)",
y = "Latitude (º)") +
scale_x_reverse() +
geom_hline(yintercept = 0) +
coord_geo(dat = "stages", expand = TRUE, size = "auto")
Here we can see the latitude of each occurrence, plotted against the temporal midpoint of the collection. We have highlighted our Antarctic occurrences in red - these points are still looking pretty anomalous.
But, wait, we should actually be looking at palaeolatitude instead. Let’s plot that against time.
# Plot the age of each occurrence against its palaeolatitude
ggplot(fossils, aes(x = mid_ma, y = paleolat)) +
geom_point(colour = "black") +
geom_point(data = antarctic, colour = "red") +
labs(x = "Age (Ma)",
y = "Palaeolatitude (º)") +
scale_x_reverse() +
geom_hline(yintercept = 0) +
coord_geo(dat = "stages", expand = TRUE, size = "auto")
Hmm… when we look at palaeolatitude the Antarctic occurrences are even further south. Time to really check out these occurrences. Which collections are they within?
# Find Antarctic collection numbers
unique(antarctic$collection_no)[1] 43030 120887 31173
Well, upon further visual inspection using the PBDB website, all appear to be fairly legitimate. However, all three occurrences still appear to be outliers, especially as in the late Eocene temperatures were dropping. What about the taxonomic certainty of these occurrences?
# List taxonomic names associated with Antarctic occurrences
antarctic$identified_name[1] "Crocodilia indet." "Crocodylia indet." "Crocodylia indet."
Since all three occurrences are listed as “Crocodylia indet.”, it may make sense to remove them from further analyses anyway.
Let’s investigate if there are any other anomalies or outliers in our data. We’ll bin the occurrences by stage to look for stage-level outliers, using boxplots to show us any anomalous data points.
# Put occurrences into stage bins
bins <- time_bins(scale = "international ages")
fossils <- bin_time(occdf = fossils, bins = bins,
min_ma = "min_ma", max_ma = "max_ma", method = "majority")
# Add interval name labels to occurrences
bins <- select(bins, bin, interval_name)
fossils <- left_join(fossils, bins, by = c("bin_assignment" = "bin"))
# Plot occurrences
ggplot(fossils, aes(x = bin_midpoint, y = paleolat, fill = interval_name)) +
geom_boxplot(show.legend = FALSE) +
labs(x = "Age (Ma)",
y = "Palaeolatitude (º)") +
scale_x_reverse() +
scale_fill_geo("stages") +
coord_geo(dat = "stages", expand = TRUE, size = "auto")
Box plots are a great way to look for outliers, because their calculation automatically includes outlier determination, and any such points can clearly be seen in the graph. At time of writing, the guidance for geom_boxplot() states that “The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called ‘outlying’ points and are plotted individually.” 1.5 times the interquartile range seems a reasonable cut-off for determining outliers, so we will use these plots at face value to identify data points to check.
Here, the Ypresian (“Y”) is looking pretty suspicious - it seems to have a lot of outliers. Let’s plot the Ypresian occurrences on a palaeogeographic map to investigate further.
# Load map of the Ypresian, and identify Ypresian fossils
fossils_y <- fossils %>%
filter(interval_name == "Ypresian")
world_y <- reconstruct("coastlines", model = "PALEOMAP", age = 51.9)
# Plot localities on the Ypresian map
ggplot(fossils_y) +
geom_sf(data = world_y) +
geom_point(aes(x = paleolng, y = paleolat)) +
labs(x = "Palaeolongitude (º)",
y = "Palaeolatitude (º)")
Aha! There is a concentrated cluster of occurrences in the western interior of North America. This high number of occurrences is increasing the weight of data at this palaeolatitude, and narrowing the boundaries at which other points are considered outliers. We can check the effect this is having on our outlier identification by removing the US occurrences from the dataset and checking the distribution again.
# Remove US fossils from the Ypresian dataset
fossils_y <- fossils_y %>%
filter(cc != "US")
# Plot boxplot of non-US Ypresian fossil palaeolatitudes
ggplot(fossils_y) +
geom_boxplot(aes(y = paleolat)) +
labs(y = "Palaeolatitude (º)") +
scale_x_continuous(breaks = NULL)
We can now see that none of our occurrences are being flagged as outliers. Without this strong geographic bias towards the US, all of the occurrences in the Ypresian appear to be reasonable. This fits our prior knowledge, as elevated global temperatures during this time likely helped crocodiles to live at higher latitudes than was possible earlier in the Paleogene.
So to sum up, it seems that our outliers are not concerning, so we will leave them in our dataset and continue with our analytical pipeline.
Identify and handle inconsistencies
We’re now going to look for inconsistencies in our dataset. Let’s start by revisiting its structure, focusing on whether the class types of the variables make sense.
# Check the data class of each field in our dataset
str(fossils)'data.frame': 2009 obs. of 142 variables:
$ occurrence_no : int 40163 40167 40168 40169 150323 168759 203975 205062 206351 211735 ...
$ record_type : chr "occ" "occ" "occ" "occ" ...
$ reid_no : int 18506 NA NA NA NA NA 20034 NA 13474 NA ...
$ flags : chr NA NA NA NA ...
$ collection_no : int 3113 3113 3113 3113 13346 15458 14764 22924 14830 15895 ...
$ identified_name : chr "Crocodylia indet." "Thoracosaurus basifissus" "Thoracosaurus basitruncatus" "Thoracosaurus neocesariensis" ...
$ identified_rank : chr "unranked clade" "species" "species" "species" ...
$ identified_no : int 38309 216615 216614 184628 38435 110899 38309 110902 38424 274001 ...
$ difference : chr NA "species not entered" "species not entered" NA ...
$ accepted_name : chr "Crocodylia" "Gavialoidea" "Gavialoidea" "Thoracosaurus neocesariensis" ...
$ accepted_attr : logi NA NA NA NA NA NA ...
$ accepted_rank : chr "unranked clade" "superfamily" "superfamily" "species" ...
$ accepted_no : int 36582 96627 96627 184627 38435 110899 36582 110902 38424 274001 ...
$ early_interval : chr "Thanetian" "Thanetian" "Thanetian" "Thanetian" ...
$ late_interval : chr NA NA NA NA ...
$ max_ma : num 59.2 59.2 59.2 59.2 48.1 ...
$ min_ma : num 56 56 56 56 41 ...
$ ref_author : chr "Alroy 2006" "Cook and Ramsdell 1991" "Cook and Ramsdell 1991" "Cook and Ramsdell 1991" ...
$ ref_pubyr : int 2006 1991 1991 1991 1988 2001 2007 1932 1986 1988 ...
$ reference_no : int 18120 140 140 140 688 7530 19636 34368 2930 766 ...
$ phylum : chr "Chordata" "Chordata" "Chordata" "Chordata" ...
$ class : chr "Reptilia" "Reptilia" "Reptilia" "Reptilia" ...
$ order : chr "Crocodylia" "Crocodylia" "Crocodylia" "Crocodylia" ...
$ family : chr NA NA NA "Gavialidae" ...
$ genus : chr NA NA NA "Thoracosaurus" ...
$ plant_organ : logi NA NA NA NA NA NA ...
$ abund_value : int NA NA NA NA 62 NA NA NA NA NA ...
$ abund_unit : chr NA NA NA NA ...
$ lng : num -74.7 -74.7 -74.7 -74.7 -86.5 ...
$ lat : num 40 40 40 40 31.4 ...
$ occurrence_comments : chr "originally entered as \"Crocodylus? sp.\"" NA NA NA ...
$ collection_name : chr "Vincentown Formation, NJ" "Vincentown Formation, NJ" "Vincentown Formation, NJ" "Vincentown Formation, NJ" ...
$ collection_subset : int NA NA NA NA NA NA NA NA NA NA ...
$ collection_aka : chr NA NA NA NA ...
$ cc : chr "US" "US" "US" "US" ...
$ state : chr "New Jersey" "New Jersey" "New Jersey" "New Jersey" ...
$ county : chr NA NA NA NA ...
$ latlng_basis : chr "estimated from map" "estimated from map" "estimated from map" "estimated from map" ...
$ latlng_precision : chr "seconds" "seconds" "seconds" "seconds" ...
$ altitude_value : int NA NA NA NA NA NA NA NA NA NA ...
$ altitude_unit : chr NA NA NA NA ...
$ geogscale : chr "local area" "local area" "local area" "local area" ...
$ geogcomments : chr "\"The Vincentown Fm. occurs in an irregular, narrow belt extending diagonally [NE-SW] across NJ through portion"| __truncated__ "\"The Vincentown Fm. occurs in an irregular, narrow belt extending diagonally [NE-SW] across NJ through portion"| __truncated__ "\"The Vincentown Fm. occurs in an irregular, narrow belt extending diagonally [NE-SW] across NJ through portion"| __truncated__ "\"The Vincentown Fm. occurs in an irregular, narrow belt extending diagonally [NE-SW] across NJ through portion"| __truncated__ ...
$ paleomodel : chr "gplates" "gplates" "gplates" "gplates" ...
$ geoplate : chr "109" "109" "109" "109" ...
$ paleoage : chr "mid" "mid" "mid" "mid" ...
$ paleolng : num -44.5 -44.5 -44.5 -44.5 -66.8 ...
$ paleolat : num 40.1 40.1 40.1 40.1 34.7 ...
$ protected : chr NA NA NA NA ...
$ direct_ma_value : num NA NA NA NA NA NA NA NA NA NA ...
$ direct_ma_error : num NA NA NA NA NA NA NA NA NA NA ...
$ direct_ma_unit : chr NA NA NA NA ...
$ direct_ma_method : chr NA NA NA NA ...
$ max_ma_value : num NA NA NA NA NA NA NA NA NA NA ...
$ max_ma_error : num NA NA NA NA NA NA NA NA NA NA ...
$ max_ma_unit : chr NA NA NA NA ...
$ max_ma_method : chr NA NA NA NA ...
$ min_ma_value : num NA NA NA NA NA NA NA NA NA NA ...
$ min_ma_error : num NA NA NA NA NA NA NA NA NA NA ...
$ min_ma_unit : chr NA NA NA NA ...
$ min_ma_method : chr NA NA NA NA ...
$ formation : chr "Vincentown" "Vincentown" "Vincentown" "Vincentown" ...
$ geological_group : chr NA NA NA NA ...
$ member : chr NA NA NA NA ...
$ stratscale : chr "formation" "formation" "formation" "formation" ...
$ zone : chr NA NA NA NA ...
$ zone_type : chr NA NA NA NA ...
$ localsection : chr "New Jersey" "New Jersey" "New Jersey" "New Jersey" ...
$ localbed : chr NA NA NA NA ...
$ localbedunit : chr NA NA NA NA ...
$ localorder : chr NA NA NA NA ...
$ regionalsection : chr NA NA NA NA ...
$ regionalbed : chr NA NA NA NA ...
$ regionalbedunit : chr NA NA NA NA ...
$ regionalorder : chr NA NA NA NA ...
$ stratcomments : chr NA NA NA NA ...
$ lithdescript : chr NA NA NA NA ...
$ lithology1 : chr "sandstone" "sandstone" "sandstone" "sandstone" ...
$ lithadj1 : chr "glauconitic" "glauconitic" "glauconitic" "glauconitic" ...
$ lithification1 : chr NA NA NA NA ...
$ minor_lithology1 : chr "sandy,calcareous" "sandy,calcareous" "sandy,calcareous" "sandy,calcareous" ...
$ fossilsfrom1 : chr NA NA NA NA ...
$ lithology2 : chr NA NA NA NA ...
$ lithadj2 : chr NA NA NA NA ...
$ lithification2 : chr NA NA NA NA ...
$ minor_lithology2 : chr NA NA NA NA ...
$ fossilsfrom2 : chr NA NA NA NA ...
$ environment : chr NA NA NA NA ...
$ tectonic_setting : chr NA NA NA NA ...
$ geology_comments : chr "lithology described as a calcareous \"lime sand\" interbedded with a quartz or \"yellow sand\"" "lithology described as a calcareous \"lime sand\" interbedded with a quartz or \"yellow sand\"" "lithology described as a calcareous \"lime sand\" interbedded with a quartz or \"yellow sand\"" "lithology described as a calcareous \"lime sand\" interbedded with a quartz or \"yellow sand\"" ...
$ size_classes : chr NA NA NA NA ...
$ articulated_parts : chr NA NA NA NA ...
$ associated_parts : chr NA NA NA NA ...
$ common_body_parts : chr NA NA NA NA ...
$ rare_body_parts : chr NA NA NA NA ...
$ feed_pred_traces : chr NA NA NA NA ...
$ artifacts : chr NA NA NA NA ...
$ component_comments : chr NA NA NA NA ...
$ pres_mode : chr NA NA NA NA ...
[list output truncated]
This looks reasonable. For example, we can see that our collection IDs are numerical, and our identified_name column contains character strings.
Now let’s dive in further to look for inconsistencies in spelling, which could cause taxonomic names or geological units to be grouped separately when they are really the same thing. We’ll start by checking for potential taxonomic misspellings.
We can use the table() function to look at the frequencies of various taxonomic names in the dataset. Here, inconsistencies like misspellings or antiquated taxonomic names might be recognised. We will check the columns family, genus, and accepted_name, the latter of which gives the name of the identification regardless of taxonomic level, and is the only column to give species binomials.
# Tabulate the frequency of values in the "family" and "genus" columns
table(fossils$family)
Alligatoridae Crocodylidae Gavialidae NO_FAMILY_SPECIFIED
466 422 210 357
Planocraniidae
24
table(fossils$genus)
Acresuchus Ahdeskatanka
7 1
Akanthosuchus Aktiogavialis
3 3
Alligator Allognathosuchus
74 128
Antecrocodylus Argochampsa
2 4
Asiatosuchus Asifcroco
32 1
Astorgosuchus Australosuchus
2 4
Baru Borealosuchus
14 48
Bottosaurus Boverisuchus
5 21
Brachychampsa Brachygnathosuchus
1 1
Brachyuranochampsa Brasilosuchus
1 1
Brochuchus Caiman
8 31
Ceratosuchus Charactosuchus
5 7
Chinatichampsus Chrysochampsa
1 1
Crocodylus Crocodylus (Leptorhynchus)
270 1
Dinosuchus Diplocynodon
1 127
Dollosuchoides Dongnanosuchus
1 1
Duerosuchus Dzungarisuchus
1 1
Eoalligator Eocaiman
3 7
Eogavialis Eosuchus
4 6
Euthecodon Gavialis
57 40
Gavialosuchus Globidentosuchus
9 9
Gnatusuchus Gryposuchus
6 31
Gunggamarandu Harpacochampsa
1 1
Hassiacosuchus Hesperogavialis
2 6
Ikanogavialis Kalthifrons
5 1
Kambara Kentisuchus
4 3
Kinyang Krabisuchus
5 3
Kuttanacaiman Leidyosuchus
3 1
Leptorramphus Lianghusuchus
1 2
Listrognathosuchus Maomingosuchus
1 4
Maroccosuchus Mecistops
3 11
Megadontosuchus Mekosuchus
1 5
Melanosuchus Menatalligator
4 1
Mourasuchus Navajosuchus
38 2
Necrosuchus Nihilichnus
1 1
Orientalosuchus Orthogenysuchus
1 1
Osteolaemus Paleosuchus
4 3
Paludirex Paranacaiman
7 1
Paranasuchus Paratomistoma
2 1
Penghusuchus Piscogavialis
1 8
Planocrania Procaimanoidea
2 5
Protoalligator Protocaiman
1 1
Purussaurus Qianshanosuchus
60 1
Quinkana Rhamphostomopsis
8 4
Rhamphosuchus Rimasuchus
3 8
Sacacosuchus Sakhibaghoon
8 1
Siquisiquesuchus Sutekhsuchus
6 5
Thecachampsa Thoracosaurus
38 11
Tienosuchus Tomistoma
1 26
Toyotamaphimeia Trilophosuchus
3 2
Tsoabichi Tzaganosuchus
2 1
Ultrastenos Wannaganosuchus
1 1
# Filter occurrences to those identified at species level, then tabulate species
# names
fossils_sp <- filter(fossils, accepted_rank == "species")
table(fossils_sp$accepted_name)
Acresuchus pachytemporalis Ahdeskatanka russlanddeutsche
7 1
Akanthosuchus langstoni Aktiogavialis caribesi
3 1
Aktiogavialis puertoricensis Alligator darwini
2 7
Alligator gaudryi Alligator hailensis
1 2
Alligator hantoniensis Alligator luicus
2 1
Alligator mcgrewi Alligator mefferdi
1 2
Alligator mississippiensis Alligator munensis
12 1
Alligator olseni Alligator prenasalis
4 8
Alligator sinensis Alligator thomsoni
2 1
Allognathosuchus heterodon Allognathosuchus mlynarskii
2 1
Allognathosuchus polyodon Allognathosuchus wartheni
2 4
Allognathosuchus woutersi Antecrocodylus chiangmuanensis
1 2
Argochampsa krebsi Asiatosuchus depressifrons
4 11
Asiatosuchus germanicus Asiatosuchus grangeri
3 1
Asiatosuchus nanlingensis Asiatosuchus oenotriensis
4 1
Asifcroco retrai Astorgosuchus bugtiensis
1 2
Australosuchus clarkae Baru darrowi
4 4
Baru huberi Baru iylwenpeny
1 2
Baru wickeni Borealosuchus acutidentatus
7 1
Borealosuchus formidabilis Borealosuchus griffithi
17 2
Borealosuchus sternbergii Borealosuchus wilsoni
12 2
Bottosaurus fustidens Boverisuchus magnifrons
2 2
Boverisuchus vorax Brachyuranochampsa eversolei
17 1
Brasilosuchus mendesi Brochuchus parvidens
1 1
Brochuchus pigotti Caiman australis
4 2
Caiman brevirostris Caiman crocodilus
3 2
Caiman latirostris Caiman paranensis
5 1
Caiman praecursor Caiman wannlangstoni
1 4
Caiman yacare Ceratosuchus burdoshi
3 4
Charactosuchus fieldsi Charactosuchus sansoai
3 1
Chinatichampsus wilsonorum Crocodilus antiquus
1 1
Crocodilus ebertsi Crocodilus ziphodon
1 2
Crocodylus acer Crocodylus acutus
1 1
Crocodylus affinis Crocodylus anthropophagus
23 6
Crocodylus aptus Crocodylus checchiai
2 5
Crocodylus elliotti Crocodylus falconensis
1 1
Crocodylus gariepensis Crocodylus megarhinus
1 3
Crocodylus niloticus Crocodylus palaeindicus
38 5
Crocodylus palustris Crocodylus porosus
5 5
Crocodylus rhombifer Crocodylus siamensis
5 10
Crocodylus thorbjarnarsoni Diplocynodon buetikonensis
7 1
Diplocynodon darwini Diplocynodon deponiae
1 3
Diplocynodon elavericus Diplocynodon hantoniensis
1 1
Diplocynodon kochi Diplocynodon levantinicum
4 2
Diplocynodon muelleri Diplocynodon plenidens
6 2
Diplocynodon ratelii Diplocynodon remensis
8 2
Diplocynodon tormis Diplocynodon ungeri
4 16
Dollosuchoides densmorei Dongnanosuchus hsui
1 1
Duerosuchus piscator Dzungarisuchus manacensis
1 1
Eoalligator chunyii Eocaiman cavernensis
3 1
Eocaiman itaboraiensis Eocaiman palaeocenicus
1 3
Eogavialis africanum Eogavialis andrewsi
1 2
Eogavialis gavialoides Eosuchus lerichei
1 1
Eosuchus minor Euthecodon arambourgii
5 1
Euthecodon brumpti Euthecodon nitriae
33 3
Gavialis bengawanicus Gavialis browni
7 5
Gavialis gangeticus Gavialis lewisi
10 3
Gavialosuchus antiquus Gavialosuchus eggenburgensis
1 1
Globidentosuchus brachyrostris Gnatusuchus pebasensis
9 6
Gryposuchus colombianus Gryposuchus croizati
8 5
Gryposuchus jessei Gryposuchus neogaeus
4 1
Gryposuchus pachakamue Gunggamarandu maunala
7 1
Harpacochampsa camfieldensis Hassiacosuchus haupti
1 1
Hesperogavialis cruxenti Ikanogavialis gameroi
3 3
Kalthifrons aurivellensis Kambara implexidens
1 1
Kambara molnari Kambara murgonensis
1 1
Kambara taraina Kentisuchus astrei
1 1
Kentisuchus spenceri Kinyang mabokoensis
2 1
Kinyang tchernovi Krabisuchus siamogallicus
2 3
Kuttanacaiman iquitosensis Leptorramphus entrerrianus
3 1
Lianghusuchus hengyangensis Listrognathosuchus multidentatus
1 1
Maomingosuchus acutirostris Maomingosuchus petrolica
1 2
Maroccosuchus zennaroi Mecistops cataphractus
3 2
Mecistops nkondoensis Megadontosuchus arduini
6 1
Mekosuchus sanderi Mekosuchus whitehunterensis
1 4
Melanosuchus fisheri Melanosuchus latrubessei
1 1
Melanosuchus niger Menatalligator bergouniouxi
1 1
Mourasuchus amazonensis Mourasuchus arendsi
4 9
Mourasuchus atopus Mourasuchus pattersoni
8 1
Navajosuchus mooki Necrosuchus ionensis
2 1
Nihilichnus nihilicus Orientalosuchus naduongensis
1 1
Orthogenysuchus olseni Osteolaemus osborni
1 1
Osteolaemus tetraspes Paludirex gracilis
3 3
Paludirex vincenti Paranacaiman bravardi
3 1
Paranasuchus gasparinae Paratomistoma courtii
2 1
Penghusuchus pani Piscogavialis jugaliperforatus
1 3
Planocrania datangensis Planocrania hengdongensis
1 1
Procaimanoidea kayi Procaimanoidea utahensis
2 1
Protoalligator huiningensis Protocaiman peligrensis
1 1
Purussaurus brasiliensis Purussaurus mirandai
4 9
Purussaurus neivensis Qianshanosuchus youngi
9 1
Quinkana babarra Quinkana fortirostrum
1 1
Quinkana meboldi Quinkana timara
1 2
Rhamphostomopsis neogaeus Rhamphosuchus crassidens
2 3
Rimasuchus lloydi Sacacosuchus cordovai
8 3
Sakhibaghoon khizari Siquisiquesuchus venezuelensis
1 2
Sutekhsuchus dowsoni Thecachampsa antiquus
5 8
Thecachampsa carolinensis Thecachampsa marylandica
7 2
Thecachampsa sericodon Thoracosaurus isorhynchus
16 1
Thoracosaurus neocesariensis Tienosuchus hsiangi
5 1
Tomistoma brumpti Tomistoma cairense
1 1
Tomistoma calaritanum Tomistoma coppensi
1 8
Tomistoma kerunense Tomistoma lusitanica
1 2
Tomistoma schlegelii Tomistoma tandoni
1 1
Tomistoma tenuirostre Toyotamaphimeia taiwanicus
1 2
Trilophosuchus rackhami Tsoabichi greenriverensis
1 2
Tzaganosuchus infansis Ultrastenos willisi
1 1
Wannaganosuchus brachymanus
1
Alternatively, we can use the tax_check() function in the palaeoverse package, which systematically searches for and flags potential spelling variation using a defined dissimilarity threshold.
# Check for close spellings in the "genus" column
tax_check(taxdf = fossils, name = "genus", dis = 0.1)Warning in tax_check(taxdf = fossils, name = "genus", dis = 0.1): Non-letter
characters present in the taxon names
$synonyms
NULL
$non_letter_name
[1] "Crocodylus (Leptorhynchus)"
$non_letter_group
NULL
# Check for close spellings in the "accepted_name" column
tax_check(taxdf = fossils_sp, name = "accepted_name" , dis = 0.1)$synonyms
group greater lesser count_greater count_lesser
1 C Crocodylus aptus Crocodylus acutus 2 1
2 D Diplocynodon ungeri Diplocynodon muelleri 16 6
$non_letter_name
NULL
$non_letter_group
NULL
Two names are flagged here for our dissimilarity theshold. However, on further inspection from the literature, these are two distinct species and therefore not a spelling mistake.
We can also check formatting and spelling using the fossilbrush package.
# Create a list of taxonomic ranks to check
fossil_ranks <- c("phylum", "class", "order", "family", "genus")
# Run checks
check_taxonomy(as.data.frame(fossils), ranks = fossil_ranks)Checking formatting [1/4]
- formatting errors detected (see $formatting in output)
Checking spelling [2/4]
- no potential synonyms detected
Checking ranks [3/4]
- no cross-rank names detected
Checking taxonomy [4/4]
- conflicting classifications detected (see $duplicates in output)
$formatting
$formatting$`non-letter`
$formatting$`non-letter`$phylum
integer(0)
$formatting$`non-letter`$class
integer(0)
$formatting$`non-letter`$order
integer(0)
$formatting$`non-letter`$family
[1] 6 8 179 183 184 187 188 191 208 214 218 232 270 281 282
[16] 288 298 299 314 315 328 329 331 332 335 336 367 368 369 370
[31] 504 534 538 542 562 563 565 567 568 569 570 571 572 573 578
[46] 579 580 581 582 583 584 588 589 590 601 607 608 614 615 616
[61] 619 620 629 631 663 665 666 679 703 704 705 706 707 708 709
[76] 710 711 713 714 715 720 721 722 723 727 735 749 750 752 753
[91] 757 760 784 794 795 813 822 825 826 827 828 838 839 840 844
[106] 860 862 863 864 865 866 867 868 874 876 877 878 879 880 890
[121] 891 892 893 894 896 897 899 900 902 903 904 905 907 921 922
[136] 923 924 925 926 927 928 929 935 936 937 938 939 940 941 942
[151] 943 944 945 946 956 957 958 959 960 962 963 964 976 977 978
[166] 982 987 999 1011 1027 1028 1029 1034 1035 1036 1037 1038 1039 1074 1075
[181] 1076 1077 1082 1086 1098 1099 1100 1101 1102 1103 1104 1105 1107 1110 1129
[196] 1130 1136 1137 1138 1152 1155 1156 1157 1158 1159 1161 1166 1210 1222 1226
[211] 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247
[226] 1248 1249 1250 1251 1252 1253 1268 1270 1271 1274 1281 1283 1284 1293 1294
[241] 1295 1331 1334 1335 1337 1342 1388 1392 1399 1400 1401 1403 1404 1412 1418
[256] 1451 1452 1453 1454 1455 1456 1457 1458 1462 1463 1465 1467 1492 1494 1496
[271] 1497 1498 1500 1501 1502 1506 1508 1509 1510 1517 1518 1519 1520 1523 1531
[286] 1579 1587 1594 1611 1612 1613 1618 1621 1628 1629 1641 1646 1657 1658 1660
[301] 1678 1679 1701 1702 1724 1732 1735 1741 1776 1778 1779 1810 1811 1814 1824
[316] 1826 1830 1831 1832 1833 1835 1836 1838 1879 1880 1881 1882 1883 1884 1885
[331] 1886 1887 1888 1889 1890 1891 1931 1937 1938 1939 1940 1945 1946 1951 1952
[346] 1956 1958 1959 1964 1980 1982 1983 1984 1985 1986 1987 1988
$formatting$`non-letter`$genus
[1] 1773
$formatting$`word-count`
$formatting$`word-count`$phylum
integer(0)
$formatting$`word-count`$class
integer(0)
$formatting$`word-count`$order
integer(0)
$formatting$`word-count`$family
integer(0)
$formatting$`word-count`$genus
[1] 1773
$ranks
$ranks$crossed_adj
$ranks$crossed_adj$`genus--family`
character(0)
$ranks$crossed_adj$`family--order`
character(0)
$ranks$crossed_adj$`order--class`
character(0)
$ranks$crossed_adj$`class--phylum`
character(0)
$ranks$crossed_all
$ranks$crossed_all$genus
character(0)
$ranks$crossed_all$family
character(0)
$ranks$crossed_all$order
character(0)
$ranks$crossed_all$class
character(0)
$duplicates
[1] taxon rank
<0 rows> (or 0-length row.names)
As before, no major inconsistencies or potential spelling errors were flagged.
The PBDB has an integrated taxonomy system which limits the extent to which taxon name inconsistencies can arise. However, this is not the case for some other data fields. Therefore, we should certainly check for inconsistencies in other of these fields.
For now, let’s proceed to the next step of the analytical pipeline, but be sure to further explore the data looking for inconsistencies during the practical (below).
Identify and handle duplicates
Our next step is to remove duplicates. This is an important step for count data, as duplicated values will artificially inflate our counts. Here, the function dplyr::distinct() is incredibly useful, as we can provide it with the columns we want it to check, and it removes rows for which data within those columns is identical.
First, we will remove absolute duplicates: by this, we mean occurrences within a single collection which have identical taxonomic names. This can occur when, for example, two species are named within a collection, one of which is later synonymised with the other.
# Show number of rows in dataset before duplicates are removed
nrow(fossils)[1] 2009
# Remove occurrences with the same collection number and `accepted_name`
fossils <- distinct(fossils, collection_no, accepted_name, .keep_all = TRUE)
# Show number of rows in dataset after duplicates are removed
nrow(fossils)[1] 1956
The number of rows dropped, which means that some of our occurrences were absolute duplicates and have now been removed.
Next, we can look at geographic duplicates. We mentioned earlier that sometimes PBDB collections are entered separately for different beds from the same locality, and this means that the number of collections can be higher than the number of geographic sampling localities. Let’s check whether this is the case in our dataset.
# Remove duplicates based on geographic coordinates
fossils_localities <- distinct(fossils, lng, lat, .keep_all = TRUE)
# Compare length of vector of unique collection numbers with and without this
# filter
length(unique(fossils$collection_no))[1] 1484
length(unique(fossils_localities$collection_no))[1] 1085
Here we can see that the number collections of our original dataset dropped after we removed latitude-longitude repeats. This means that, in some cases, more than one fossil sampling events have taken place at the same locality. In other words, we have more collections than geographically distinct localities in the dataset.
If we are interested in taxonomic diversity, we can also look at repeated names in our dataset. For example, we might want to identify taxa which are represented multiple times in order to then return to the literature and check that they definitely represent the same taxon. We can do this by flagging species names which are represented more than once in the dataset.
# Update dataset of occurrences identified to species level
fossils_sp <- filter(fossils, accepted_rank == "species")
# Identify and flag taxonomic duplicates
fossils_sp <- fossils_sp %>%
group_by(accepted_name) %>%
mutate(duplicate_flag = n() > 1)
# Show counts of flagged occurrences
table(fossils_sp$duplicate_flag)
FALSE TRUE
100 604
Some FALSE values are shown, indicating that some species are represented by a single occurrence. We also have TRUE values, for which the species are represented two or more times. We can then filter our dataset to those flagged, and sort them by their name, enabling easier checking.
# Filter table to flagged occurrences
fossils_sp <- filter(fossils_sp, duplicate_flag == TRUE)
# Sort table by genus name
fossils_sp <- arrange(fossils_sp, accepted_name)fossils_sp# A tibble: 604 × 143
# Groups: accepted_name [115]
occurrence_no record_type reid_no flags collection_no identified_name
<int> <chr> <int> <chr> <int> <chr>
1 624984 occ 35409 <NA> 55602 Acresuchus pachytempor…
2 1094079 occ 35408 <NA> 136717 Acresuchus pachytempor…
3 1430835 occ NA <NA> 144739 Acresuchus pachytempor…
4 1430836 occ NA <NA> 191932 Acresuchus pachytempor…
5 1430837 occ NA <NA> 136720 Acresuchus pachytempor…
6 1430838 occ NA <NA> 67386 Acresuchus pachytempor…
7 1557989 occ NA <NA> 219762 Acresuchus pachytempor…
8 691946 occ NA <NA> 74555 Akanthosuchus langstoni
9 710089 occ NA <NA> 76063 Akanthosuchus langston…
10 710090 occ NA <NA> 76064 Akanthosuchus langstoni
# ℹ 594 more rows
# ℹ 137 more variables: identified_rank <chr>, identified_no <int>,
# difference <chr>, accepted_name <chr>, accepted_attr <lgl>,
# accepted_rank <chr>, accepted_no <int>, early_interval <chr>,
# late_interval <chr>, max_ma <dbl>, min_ma <dbl>, ref_author <chr>,
# ref_pubyr <int>, reference_no <int>, phylum <chr>, class <chr>,
# order <chr>, family <chr>, genus <chr>, plant_organ <lgl>, …
If data are altered or filtered at any point, this can change the overall summary statistics, and affect how we perceive the data. We recommend double-checking the data before proceeding to analytical processes relating to your research question.
Practical (if you so desire)
Now it’s time for you to explore that data yourself. First, using the code chunks below, add your own additional lines of code addressing each of the posed questions. You could modify some of the code above to help you, or write your own!
Can you find any additional missing data? What will you do with them?
Can you find any additional data outliers? What will you do with them?
Can you find any additional data inconsistencies? What will you do with them?
Can you find any additional data duplicates? What will you do with them?
Let’s save our data for the next unit!
# Save data
write.csv(x = fossils, file = "./cenozoic_crocs_clean.csv", row.names = FALSE)Resources
- AGGARWAL, C. C. 2017. Outlier Analysis. Springer.
- CHAPMAN, A. D. 2005. Principles and methods of data cleaning. Global Biodiversity Information Facility.
- HAMMER, Ø. and HARPER, D. A. 2024. Paleontological data analysis. John Wiley & Sons.
- NEWMAN, D. A. 2014. Missing data: Five practical guidelines. Organizational research methods, 17, 372–411.
- RIBEIRO, B. R., VELAZCO, S. J. E., GUIDONI-MARTINS, K., TESSAROLO, G., JARDIM, L., BACHMAN, S. P. and LOYOLA, R. 2022. bdc: A toolkit for standardizing, integrating and cleaning biodiversity data. Methods in Ecology and Evolution, 13, 1421–1428.
- TUKEY, J. W. 1977. Exploratory data analysis. Vol. 1. Springer.
- VAN BUUREN, S. 2018. Flexible imputation of missing data. Chapman & Hall/CRC, Boca Raton,.